
Redis

Implementation: Building the Socket Foundation

To build a Redis-like server, you must transition from treating networking as a "black box" to managing low-level TCP byte streams. Since TCP does not provide message boundaries, the application protocol must define how the byte stream is split into messages (framing) and how data within them is encoded (serialization).

Key Concepts

  • TCP vs. Messages: TCP provides a continuous stream of bytes. You must implement a protocol to handle message framing (e.g., length-prefixing or delimiters).
  • The Socket Handle: On Linux, a socket is represented as a file descriptor (fd). This integer serves as an opaque handle to refer to a connection or a listening port.

Server-Side Setup (Listening Socket)

Establishing a server requires a specific sequence of system calls to prepare the OS for incoming traffic:

  1. socket(): Allocates a new file descriptor for networking.
  2. bind(): Associates the socket with a specific local IP address and port number.
  3. listen(): Configures the socket to wait for incoming connection requests.
  4. accept(): Retrieves a pending connection from the queue, returning a new file descriptor specifically for that individual client connection.

Basic Reference Flow

// 1. Initialize the listener
fd = socket();
bind(fd, address);
listen(fd);

// 2. The event loop
while (true) {
    // Wait for a client to connect
    conn_fd = accept(fd);

    // 3. Data Transfer
    // Use read() to receive requests and write() to send responses
    process_request(conn_fd);

    // 4. Cleanup
    close(conn_fd);
}

Client-Side Connection

For a client to interact with the server, it follows a simpler two-step process:

  1. socket(): Create the handle.
  2. connect(): Initiate the handshake with the server's address and port.

Tutorial: Implementing Basic Client-Server Interaction

Building a functional Redis-like interaction requires moving from theory to implementing a practical "Hello World" exchange. This involves setting specific socket options, handling network byte order (endianness), and managing the request-response loop.

Core Implementation Steps

  • Socket Configuration: Use setsockopt with SO_REUSEADDR to allow the server to restart and bind to the same port immediately, bypassing the kernel's TIME_WAIT delay.
  • Endianness Management: Network protocols conventionally use big-endian values ("network byte order"). Use htons() and htonl() to convert ports and IP addresses from host order to network order.
  • Wildcard Binding: Binding to 0.0.0.0 (wildcard) allows the server to listen on all available network interfaces.

Server-Side Logic

// 1. Setup
int fd = socket(AF_INET, SOCK_STREAM, 0);
int val = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));

// 2. Bind & Listen
struct sockaddr_in addr = {};
addr.sin_family = AF_INET;
addr.sin_port = htons(1234); // Port 1234
addr.sin_addr.s_addr = htonl(INADDR_ANY); // 0.0.0.0, all interfaces
bind(fd, (struct sockaddr *)&addr, sizeof(addr));
listen(fd, SOMAXCONN);

// 3. Request Loop
while (true) {
    int connfd = accept(fd, NULL, NULL);
    if (connfd < 0) continue; // transient accept() error
    // Read request
    char rbuf[64] = {};
    ssize_t rv = read(connfd, rbuf, sizeof(rbuf) - 1);
    (void)rv; // a robust server checks rv for errors and EOF
    // Send response
    write(connfd, "world", 5);
    close(connfd);
}

Client-Side Logic

// 1. Connect
int fd = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in addr = {};
addr.sin_family = AF_INET;
addr.sin_port = htons(1234); // host-to-network, same as on the server
addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); // 127.0.0.1
connect(fd, (struct sockaddr *)&addr, sizeof(addr));

// 2. Exchange
write(fd, "hello", 5);
char rbuf[64] = {};
read(fd, rbuf, sizeof(rbuf) - 1);
close(fd);

Key API Reference

Name                Purpose
AF_INET             Address family for IPv4.
SOCK_STREAM         Stream socket; TCP when used with AF_INET.
htons() / htonl()   Host-to-network byte order (short / long).
INADDR_LOOPBACK     127.0.0.1 (local loopback).

Tutorial: Designing the Request-Response Protocol

To handle multiple requests over a single connection, you must implement an application-level protocol to frame messages. Since TCP is a continuous byte stream with no internal boundaries, the server needs a reliable way to determine where one message ends and the next begins.

Message Framing: Length-Prefixing

The most robust way to split a byte stream into messages is to use a fixed-size header that specifies the length of the following payload.

  • Structure: Each message starts with a 4-byte little-endian integer (len), followed by len bytes of data.
  • Benefit: Unlike delimiter-based protocols (like those using \n), length-prefixing avoids complex escaping rules if the payload contains the delimiter.
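
This framing scheme can be sketched as two small helpers; a minimal sketch assuming a little-endian host and `std::vector` buffers (the helper names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append one framed message: 4-byte little-endian length, then the payload.
static void append_msg(std::vector<uint8_t> &out, const std::string &payload) {
    uint32_t len = (uint32_t)payload.size();
    // Copying the raw bytes assumes a little-endian host (e.g. x86, ARM)
    out.insert(out.end(), (const uint8_t *)&len, (const uint8_t *)&len + 4);
    out.insert(out.end(), payload.begin(), payload.end());
}

// Try to parse one message from the front of `buf`.
// Returns bytes consumed, or 0 if the message is not complete yet.
static size_t parse_msg(const std::vector<uint8_t> &buf, std::string &payload) {
    if (buf.size() < 4) return 0;               // header not fully received
    uint32_t len = 0;
    memcpy(&len, buf.data(), 4);
    if (buf.size() < 4 + (size_t)len) return 0; // payload not fully received
    payload.assign((const char *)buf.data() + 4, len);
    return 4 + (size_t)len;
}
```

Because an incomplete message returns 0 rather than an error, the caller can keep the bytes buffered and simply retry after the next read.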

Handling Partial Reads and Writes

A common pitfall in network programming is assuming read() or write() will process the requested number of bytes in one go. In reality, these calls can return early due to kernel buffer limits or signal interruptions. To solve this, you must implement helper functions that loop until the operation is complete:

// Ensures exactly 'n' bytes are read from the socket
static int32_t read_full(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = read(fd, buf, n);
        if (rv <= 0) return -1; // Error or unexpected EOF
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}

// Ensures exactly 'n' bytes are written to the socket
static int32_t write_all(int fd, const char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = write(fd, buf, n);
        if (rv <= 0) return -1; // Error
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}

Implementation Logic

  1. Header First: Read exactly 4 bytes to determine the payload size.
  2. Safety Check: Verify that the length does not exceed a predefined maximum (e.g., 4096 bytes) to prevent memory exhaustion attacks.
  3. Payload Second: Read the remaining len bytes into a buffer.
  4. Process & Reply: Construct a response using the same length-prefixed format and send it using write_all().
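
Steps 1-4 can be combined into a single request handler. A sketch under the protocol above (the function name `one_request`, the `k_max_msg` constant, and the echo-style reply are assumptions for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <sys/socket.h>
#include <unistd.h>

const size_t k_max_msg = 4096; // safety cap on the payload size

// Loop until exactly `n` bytes are read (same logic as read_full above).
static int32_t read_full(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = read(fd, buf, n);
        if (rv <= 0) return -1; // error or unexpected EOF
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}

// Loop until exactly `n` bytes are written (same logic as write_all above).
static int32_t write_all(int fd, const char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = write(fd, buf, n);
        if (rv <= 0) return -1; // error
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}

// Serve one length-prefixed request and echo the payload back.
static int32_t one_request(int connfd) {
    char rbuf[4 + k_max_msg];
    if (read_full(connfd, rbuf, 4)) return -1;       // 1. header first
    uint32_t len = 0;
    memcpy(&len, rbuf, 4);                           // assumes little-endian
    if (len > k_max_msg) return -1;                  // 2. safety check
    if (read_full(connfd, &rbuf[4], len)) return -1; // 3. payload second
    char wbuf[4 + k_max_msg];                        // 4. reply, same format
    memcpy(wbuf, &len, 4);
    memcpy(&wbuf[4], &rbuf[4], len);
    return write_all(connfd, wbuf, 4 + len);
}
```

A socketpair() can stand in for a real client when exercising this handler locally.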

Protocol Comparison

Feature      Length-Prefixed (Binary)   Delimiter-Based (Text)
Parsing      Fast; a simple memcpy      Slower; must scan for delimiters
Complexity   Low; no escaping needed    Higher; requires escaping/encoding
Example      [4-byte len][data]         data\n

Implementation: Managing Concurrent I/O with Event Loops

To scale a Redis-like server to handle thousands of simultaneous connections, you must move beyond thread-per-connection models. Creating a new thread for every client is memory-intensive and incurs high context-switching overhead. Instead, an Event Loop allows a single thread to monitor multiple sockets and react only when they are "ready" for I/O.

The Core Mechanism

An event loop relies on three primary operating system capabilities:

  1. Non-blocking I/O: Configuring sockets so that read() or write() calls return immediately with an error (like EAGAIN) if no data is available or the buffer is full, rather than suspending the thread.
  2. Readiness Notification: Using a system call to wait for multiple file descriptors (fds) at once. The call blocks until at least one socket is ready for reading or writing.
  3. The Loop: An infinite cycle that queries for ready sockets, processes their data without blocking, and repeats.

Setting Up Non-blocking Mode

On Linux, you must explicitly set the O_NONBLOCK flag using the fcntl system call:

static void fd_set_nonblock(int fd) {
    int flags = fcntl(fd, F_GETFL, 0); // fetch the current flags
    flags |= O_NONBLOCK;               // add the non-blocking bit
    (void)fcntl(fd, F_SETFL, flags);   // production code should check errno
}

The Readiness API: poll()

While epoll is the standard for high-performance Linux applications, poll() is simpler for initial implementations. It takes an array of pollfd structures, each defining a file descriptor and the events you wish to monitor (e.g., POLLIN for data to read).
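
The readiness semantics can be observed with a tiny self-test. A sketch using a pipe in place of a socket, since poll() treats both as pollable file descriptors (`demo_poll_readiness` is an illustrative name):

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>

// Returns true if poll() reports no events before a write
// and POLLIN after one, as expected of a readiness API.
static bool demo_poll_readiness() {
    int pfds[2];
    if (pipe(pfds) != 0) return false;

    struct pollfd p = {};
    p.fd = pfds[0];    // monitor the read end
    p.events = POLLIN; // ask to be notified when data is readable

    // Nothing written yet: a 0 ms poll reports zero ready fds.
    bool not_ready = (poll(&p, 1, 0) == 0);

    // After a write, the read end becomes ready and POLLIN is set.
    (void)write(pfds[1], "x", 1);
    bool ready = (poll(&p, 1, 0) == 1) && (p.revents & POLLIN);

    close(pfds[0]);
    close(pfds[1]);
    return not_ready && ready;
}
```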

Comparison of I/O Strategies

Method                    Scalability   Key API
Thread-per-connection     Low           pthread_create
Process-per-connection    Low           fork()
Event loop (basic)        High          poll()
Event loop (scalable)     Very high     epoll (Linux), kqueue (BSD)

Disk Files

Readiness APIs like poll() and epoll do not work with disk files. Because disk files are always considered "ready" by the kernel even if the physical I/O would block the thread, they must be handled via thread pools or specialized APIs like io_uring.


Implementation: Coding the Event Loop with poll()

To build a production-grade Redis-like server, you must implement a robust event loop that manages per-connection state and handles asynchronous I/O across multiple loop iterations.

Per-Connection State Management

Since a single request may span multiple loop cycles, you must explicitly store the state of each connection. A Conn structure typically includes:

  • File Descriptor (fd): The unique handle for the socket.
  • Intention Flags (want_read, want_write): Tells the event loop which operations to monitor for this socket.
  • Buffers:
      • incoming: Accumulates raw data from read() until a full message is parsed.
      • outgoing: Stores generated responses until the socket is ready for a write().

The 5-Step Loop Execution

  1. Construct pollfd List: Map the application's intent (e.g., want_read) to the poll() flags (POLLIN).
  2. Wait for Readiness: Call poll(). This is the only blocking call in the system. Always handle EINTR to retry if a signal interrupts the wait.
  3. Accept Connections: If the listening socket (usually at index 0) shows POLLIN, call accept(), set the new socket to non-blocking mode, and initialize a new Conn object.
  4. Process Callbacks: Iterate through the connection sockets. If POLLIN is set, trigger handle_read(); if POLLOUT is set, trigger handle_write().
  5. Terminate & Cleanup: Close sockets and free memory if POLLERR is returned or if the application sets a want_close flag (due to EOF or protocol errors).
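
Step 1, mapping intent flags to poll() events, might look like the following. The Conn field names follow the text; the struct layout and the helper name `make_pollfd` are assumptions:

```cpp
#include <cassert>
#include <poll.h>

// Minimal per-connection state; the event loop owns one per socket.
struct Conn {
    int fd = -1;
    bool want_read = false;   // monitor for POLLIN
    bool want_write = false;  // monitor for POLLOUT
    bool want_close = false;  // destroy the connection on the next pass
};

// Translate the application's intent into the events poll() should watch.
// POLLERR need not be requested; it is always reported via revents.
static struct pollfd make_pollfd(const Conn &c) {
    struct pollfd p = {};
    p.fd = c.fd;
    if (c.want_read)  p.events |= POLLIN;
    if (c.want_write) p.events |= POLLOUT;
    return p;
}
```

The loop rebuilds this array on every iteration, so changing a connection's intent flags takes effect on the very next poll() call.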

Non-Blocking Logic Flow

  • Reading: Perform a single read(), append the result to incoming, and then attempt to parse a message. If the buffer does not yet contain a complete message, wait for the next loop iteration.
  • Writing: Write as much data as possible from the outgoing buffer. If a partial write occurs, the remaining data stays in the buffer for the next POLLOUT event.
  • State Transition: After processing a request, toggle the connection's state from want_read to want_write to signal that a response is ready to be sent.

Tutorial: Advanced Event Loop & Pipelining

Once a basic event loop is functional, you must optimize it to handle high-throughput scenarios and edge cases where single requests span multiple I/O operations. This phase transitions the server from a "one-at-a-time" processor to a system capable of handling pipelined requests and large data payloads.

Handling Pipelined Requests

Pipelining occurs when a client sends multiple requests over a single connection without waiting for individual responses. A naive event loop might process one request and then immediately switch to a "wait for write" state, leaving unread requests sitting in the kernel buffer.

  • The Fix: Instead of processing a single request per handle_read() call, use a while loop to consume all complete messages currently available in the incoming buffer.
  • Buffer Management: Use buf_consume() to remove only the processed bytes from the front of the buffer, rather than clearing the entire buffer, to preserve any subsequent pipelined requests.
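
The fix can be sketched as a drain loop over the incoming buffer; `buf_consume`, `try_one_message`, and `drain_messages` are illustrative helpers for the length-prefixed protocol described earlier:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Remove `n` processed bytes from the front, keeping the rest.
static void buf_consume(std::vector<uint8_t> &buf, size_t n) {
    buf.erase(buf.begin(), buf.begin() + n);
}

// Try to parse one length-prefixed message; returns bytes consumed or 0.
static size_t try_one_message(const std::vector<uint8_t> &buf, std::string &out) {
    if (buf.size() < 4) return 0;
    uint32_t len = 0;
    memcpy(&len, buf.data(), 4);
    if (buf.size() < 4 + (size_t)len) return 0;
    out.assign((const char *)buf.data() + 4, len);
    return 4 + (size_t)len;
}

// The pipelining fix: drain every complete message, not just the first.
// Any trailing partial message stays in `incoming` for the next read.
static size_t drain_messages(std::vector<uint8_t> &incoming,
                             std::vector<std::string> &msgs) {
    std::string msg;
    size_t n;
    while ((n = try_one_message(incoming, msg)) > 0) {
        msgs.push_back(msg);
        buf_consume(incoming, n);
    }
    return msgs.size();
}
```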

Optimizing with "Optimistic" Writes

In a standard request-response flow, a socket is almost always ready to write immediately after a request is read. You can save a system call by attempting an "optimistic" write() immediately inside the handle_read() function.

  • Logic: After parsing a request and generating a response, call handle_write() directly.
  • Safety: The handle_write() function must be updated to handle EAGAIN or EWOULDBLOCK. If the write buffer is actually full, the function should return gracefully and let the event loop handle the remaining data when the socket becomes truly ready.
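
The EAGAIN path can be exercised without a network at all. A sketch that fills a non-blocking pipe until the kernel buffer is full (`write_some` and `demo_eagain` are illustrative names):

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Write as much as the kernel accepts: returns bytes written,
// 0 if the buffer is full (EAGAIN/EWOULDBLOCK), or -1 on a real error.
static ssize_t write_some(int fd, const char *buf, size_t n) {
    ssize_t rv = write(fd, buf, n);
    if (rv < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return 0;
    return rv;
}

// Fill a non-blocking pipe until write_some() reports "buffer full".
static bool demo_eagain() {
    int pfds[2];
    if (pipe(pfds) != 0) return false;
    fcntl(pfds[1], F_SETFL, fcntl(pfds[1], F_GETFL, 0) | O_NONBLOCK);

    char chunk[4096] = {};
    ssize_t rv = 1;
    while (rv > 0) {
        rv = write_some(pfds[1], chunk, sizeof(chunk)); // eventually 0
    }
    close(pfds[0]);
    close(pfds[1]);
    return rv == 0; // in a server, the unsent data stays in `outgoing`
}
```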

Efficient Buffer Structures

Using a standard dynamic array (like std::vector) for a FIFO buffer is inefficient because removing data from the front requires shifting all subsequent elements (\(O(N)\)).

  • A Better Approach: Implement a buffer with separate data_begin and data_end pointers.
  • Operation: "Consuming" data from the front becomes a simple pointer increment (\(O(1)\)). Only when the "unused" space at the front becomes too large do you perform a single memory move or reallocation to reset the buffer.

Debugging with strace

To verify that your event loop is behaving correctly, especially regarding non-blocking behavior and pipelining, use strace to monitor system calls in real-time.

  • What to look for:
      • poll() returning with multiple ready events.
      • read() returning a large chunk of data containing multiple prefixed messages.
      • write() handling large payloads across multiple iterations without blocking the entire server.

Implementation: Building a Key-Value Server (GET, SET, DEL)

In this phase, the server evolves from a simple echo service to a functional key-value store. This requires extending the protocol to handle multiple arguments and implementing a mechanism to store and retrieve data.

Request Protocol: List of Strings

To support commands like SET key value, the request format is structured as an array of strings. Each message contains:

  1. Header (4 bytes): An integer specifying the number of strings in the request.
  2. String Data: For each string, a 4-byte length prefix followed by the actual string bytes.

Example of a SET k v request: [3 (strings)] [3 (len)] ["SET"] [1 (len)] ["k"] [1 (len)] ["v"]
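
This encoding can be sketched as a small serializer; `make_request` is an illustrative name, and a little-endian host is assumed:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Serialize a command as [nstr][len1][str1][len2][str2]...
// with every integer a 4-byte little-endian value.
static std::vector<uint8_t> make_request(const std::vector<std::string> &cmd) {
    std::vector<uint8_t> out;
    auto put_u32 = [&](uint32_t v) {
        out.insert(out.end(), (const uint8_t *)&v, (const uint8_t *)&v + 4);
    };
    put_u32((uint32_t)cmd.size());   // header: number of strings
    for (const std::string &s : cmd) {
        put_u32((uint32_t)s.size()); // per-string length prefix
        out.insert(out.end(), s.begin(), s.end());
    }
    return out;
}
```

For SET k v this yields 21 bytes: a 4-byte count of 3, then three length-prefixed strings of 3, 1, and 1 bytes.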

Data Storage and Command Logic

A global std::map<std::string, std::string> serves as the primary storage. The server parses the first string of any request as the command and dispatches it to the appropriate handler:

  • SET: Expects two additional arguments (key and value). It updates the map and typically returns a success indicator.
  • GET: Expects one additional argument (key). If found, it returns the value; otherwise, it returns a "not found" status.
  • DEL: Expects one additional argument (key). It removes the entry from the map and returns a status indicating whether the key was present.
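
The dispatch logic can be sketched directly over the map. The response here is simplified to a plain status string rather than the framed wire format, and the handler name is an assumption:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

static std::map<std::string, std::string> g_data; // the key-value store

// Dispatch a parsed command; returns a status or the requested value.
static std::string do_request(const std::vector<std::string> &cmd) {
    if (cmd.size() == 2 && cmd[0] == "GET") {
        auto it = g_data.find(cmd[1]);
        return it == g_data.end() ? "(nil)" : it->second;
    }
    if (cmd.size() == 3 && cmd[0] == "SET") {
        g_data[cmd[1]] = cmd[2];
        return "OK";
    }
    if (cmd.size() == 2 && cmd[0] == "DEL") {
        return g_data.erase(cmd[1]) ? "1" : "0"; // "was the key present?"
    }
    return "ERR unknown command";
}
```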

Processing Flow

  1. Parse Arguments: After the framing layer provides the full payload, iterate through the byte stream to extract each string based on its length prefix.
  2. Dispatch: Use the first argument to determine the operation.
  3. Execute & Reply: Perform the map operation and construct a response. Responses should also follow the length-prefixed protocol (often a single status string or the requested data) so the client can reliably parse the result.

Efficiency

While std::map is convenient for this initial step, it is not ideal for a high-performance database due to its \(O(\log N)\) complexity and memory overhead. Subsequent iterations replace this with a custom-built, \(O(1)\) hash table.