Redis
Implementation: Building the Socket Foundation¶
To build a Redis-like server, you must transition from treating networking as a "black box" to managing low-level TCP byte streams. Since TCP does not provide message boundaries, the application-level protocol is responsible for splitting the byte stream back into individual messages (message framing).
Key Concepts
- TCP vs. Messages: TCP provides a continuous stream of bytes. You must implement a protocol to handle message framing (e.g., length-prefixing or delimiters).
- The Socket Handle: On Linux, a socket is represented as a file descriptor (fd). This integer serves as an opaque handle to refer to a connection or a listening port.
Server-Side Setup (Listening Socket)
Establishing a server requires a specific sequence of system calls to prepare the OS for incoming traffic:
- `socket()`: Allocates a new file descriptor for networking.
- `bind()`: Associates the socket with a specific local IP address and port number.
- `listen()`: Configures the socket to wait for incoming connection requests.
- `accept()`: Retrieves a pending connection from the queue, returning a new file descriptor specifically for that individual client connection.
Basic Reference Flow
```
// 1. Initialize the listener
fd = socket();
bind(fd, address);
listen(fd);

// 2. The event loop
while (true) {
    // Wait for a client to connect
    conn_fd = accept(fd);

    // 3. Data transfer:
    // use read() to receive requests and write() to send responses
    process_request(conn_fd);

    // 4. Cleanup
    close(conn_fd);
}
```
Client-Side Connection
For a client to interact with the server, it follows a simpler two-step process:
- `socket()`: Create the handle.
- `connect()`: Initiate the handshake with the server's address and port.
Tutorial: Implementing Basic Client-Server Interaction¶
Building a functional Redis-like interaction requires moving from theory to implementing a practical "Hello World" exchange. This involves setting specific socket options, handling network byte order (endianness), and managing the request-response loop.
Core Implementation Steps
- Socket Configuration: Use `setsockopt` with `SO_REUSEADDR` to allow the server to restart and bind to the same port immediately, bypassing the kernel's `TIME_WAIT` delay.
- Endianness Management: Network protocols use big-endian (network byte order). Use `htons()` and `htonl()` to convert host integers to network order for ports and IP addresses.
- Wildcard Binding: Binding to `0.0.0.0` (the wildcard address) allows the server to listen on all available network interfaces.
Server-Side Logic
```c
// 1. Setup
int fd = socket(AF_INET, SOCK_STREAM, 0);
int val = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));

// 2. Bind & listen
struct sockaddr_in addr = {};
addr.sin_family = AF_INET;
addr.sin_port = htons(1234);        // port 1234
addr.sin_addr.s_addr = htonl(0);    // wildcard address 0.0.0.0
bind(fd, (struct sockaddr *)&addr, sizeof(addr));
listen(fd, SOMAXCONN);

// 3. Request loop
while (true) {
    int connfd = accept(fd, NULL, NULL);
    // Read the request
    char rbuf[64] = {};
    read(connfd, rbuf, sizeof(rbuf) - 1);
    // Send the response
    write(connfd, "world", 5);
    close(connfd);
}
```
Client-Side Logic
```c
// 1. Connect
int fd = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in addr = {};
addr.sin_family = AF_INET;
addr.sin_port = htons(1234);                    // host-to-network, not ntohs()
addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  // 127.0.0.1
connect(fd, (struct sockaddr *)&addr, sizeof(addr));

// 2. Exchange
write(fd, "hello", 5);
char rbuf[64] = {};
read(fd, rbuf, sizeof(rbuf) - 1);
close(fd);
```
Key API Reference
| Identifier | Purpose |
|---|---|
| `AF_INET` | Address family for IPv4. |
| `SOCK_STREAM` | Specifies TCP (byte stream) semantics. |
| `htons()` / `htonl()` | Host to network byte order (short / long). |
| `INADDR_LOOPBACK` | Refers to 127.0.0.1 (the local loopback address). |
Tutorial: Designing the Request-Response Protocol¶
To handle multiple requests over a single connection, you must implement an application-level protocol to frame messages. Since TCP is a continuous byte stream with no internal boundaries, the server needs a reliable way to determine where one message ends and the next begins.
Message Framing: Length-Prefixing
The most robust way to split a byte stream into messages is to use a fixed-size header that specifies the length of the following payload.
- Structure: Each message starts with a 4-byte little-endian integer (`len`), followed by `len` bytes of data.
- Benefit: Unlike delimiter-based protocols (such as those using `\n`), length-prefixing avoids complex escaping rules when the payload contains the delimiter.
Handling Partial Reads and Writes
A common pitfall in network programming is assuming `read()` or `write()` will process the requested number of bytes in one go. In reality, these calls can return early due to kernel buffer limits or signal interruptions. To solve this, you must implement helper functions that loop until the operation is complete:
```c
// Ensures exactly 'n' bytes are read from the socket
static int32_t read_full(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = read(fd, buf, n);
        if (rv <= 0) {
            return -1;  // error, or unexpected EOF
        }
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}

// Ensures exactly 'n' bytes are written to the socket
static int32_t write_all(int fd, const char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = write(fd, buf, n);
        if (rv <= 0) {
            return -1;  // error
        }
        n -= (size_t)rv;
        buf += rv;
    }
    return 0;
}
```
Implementation Logic
- Header First: Read exactly 4 bytes to determine the payload size.
- Safety Check: Verify that the length does not exceed a predefined maximum (e.g., 4096 bytes) to prevent memory-exhaustion attacks.
- Payload Second: Read the remaining `len` bytes into a buffer.
- Process & Reply: Construct a response using the same length-prefixed format and send it with `write_all()`.
Protocol Comparison
| Feature | Length-Prefixed (Binary) | Delimiter-Based (Text) |
|---|---|---|
| Parsing | Fast; a simple `memcpy` | Slower; must scan for delimiters |
| Complexity | Low; no escaping needed | High; requires escaping/encoding |
| Example | `[4-byte len][data]` | `data\n` |
Implementation: Managing Concurrent I/O with Event Loops¶
To scale a Redis-like server to handle thousands of simultaneous connections, you must move beyond thread-per-connection models. Creating a new thread for every client is memory-intensive and incurs high context-switching overhead. Instead, an Event Loop allows a single thread to monitor multiple sockets and react only when they are "ready" for I/O.
The Core Mechanism
An event loop relies on three primary operating system capabilities:
- Non-blocking I/O: Configuring sockets so that `read()` or `write()` calls return immediately with an error (such as `EAGAIN`) if no data is available or the buffer is full, rather than suspending the thread.
- Readiness Notification: Using a system call to wait on multiple file descriptors (fds) at once. The call blocks until at least one socket is ready for reading or writing.
- The Loop: An infinite cycle that queries for ready sockets, processes their data without blocking, and repeats.
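The non-blocking behavior can be demonstrated in isolation. The helper name `read_would_block` is illustrative, not a standard API; it simply shows that with `O_NONBLOCK` set, an empty descriptor fails with `EAGAIN` instead of suspending the thread:

```cpp
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Attempt a 1-byte non-blocking read; report whether it would have blocked.
static bool read_would_block(int fd) {
    char c;
    ssize_t rv = read(fd, &c, 1);
    // On a non-blocking fd with no data, read() returns -1 with EAGAIN/EWOULDBLOCK
    return rv < 0 && (errno == EAGAIN || errno == EWOULDBLOCK);
}
```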
Setting Up Non-blocking Mode
On Linux, you must explicitly set the O_NONBLOCK flag using the fcntl system call:
```c
static void fd_set_nonblock(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    flags |= O_NONBLOCK;
    fcntl(fd, F_SETFL, flags);
}
```
The Readiness API: poll()
While `epoll` is the standard for high-performance Linux applications, `poll()` is simpler for an initial implementation. It takes an array of `pollfd` structures, each specifying a file descriptor and the events you wish to monitor (e.g., `POLLIN` for data to read).
Comparison of I/O Strategies
| Method | Scalability | Key API |
|---|---|---|
| Thread-per-connection | Low | pthread_create |
| Process-per-connection | Low | fork() |
| Event Loop (Basic) | High | poll() |
| Event Loop (Scalable) | Very High | epoll (Linux), kqueue (BSD) |
Disk Files
Readiness APIs like poll() and epoll do not work with disk files. Because disk files are always considered "ready" by the kernel even if the physical I/O would block the thread, they must be handled via thread pools or specialized APIs like io_uring.
Implementation: Coding the Event Loop with poll()¶
To build a production-grade Redis-like server, you must implement a robust event loop that manages per-connection state and handles asynchronous I/O across multiple loop iterations.
Per-Connection State Management
Since a single request may span multiple loop cycles, you must explicitly store the state of each connection. A `Conn` structure typically includes:
- File Descriptor (`fd`): The unique handle for the socket.
- Intention Flags (`want_read`, `want_write`): Tell the event loop which operations to monitor for this socket.
- Buffers:
  - `incoming`: Accumulates raw data from `read()` until a full message can be parsed.
  - `outgoing`: Stores generated responses until the socket is ready for a `write()`.
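A minimal sketch of such a structure, with the field names from the list above and `std::vector<uint8_t>` as one reasonable buffer choice:

```cpp
#include <cstdint>
#include <vector>

// Per-connection state, carried across event-loop iterations
struct Conn {
    int fd = -1;
    // intention flags: what the event loop should poll for
    bool want_read = false;
    bool want_write = false;
    bool want_close = false;
    // buffered I/O that may span multiple loop cycles
    std::vector<uint8_t> incoming;  // raw bytes read, awaiting parsing
    std::vector<uint8_t> outgoing;  // generated responses, awaiting write()
};
```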
The 5-Step Loop Execution
1. Construct the `pollfd` List: Map the application's intent (e.g., `want_read`) to the `poll()` flags (`POLLIN`).
2. Wait for Readiness: Call `poll()`. This is the only blocking call in the system. Always handle `EINTR` by retrying if a signal interrupts the wait.
3. Accept Connections: If the listening socket (usually at index 0) shows `POLLIN`, call `accept()`, set the new socket to non-blocking mode, and initialize a new `Conn` object.
4. Process Callbacks: Iterate through the connection sockets. If `POLLIN` is set, trigger `handle_read()`; if `POLLOUT` is set, trigger `handle_write()`.
5. Terminate & Cleanup: Close sockets and free memory if `POLLERR` is returned or if the application sets a `want_close` flag (due to EOF or protocol errors).
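Step 1 can be sketched as follows. The helper name `build_poll_args` and the trimmed-down `Conn` are illustrative; the point is the mapping from intent flags to `poll()` event bits:

```cpp
#include <poll.h>
#include <vector>

// minimal stand-in for the per-connection state described above
struct Conn {
    int fd = -1;
    bool want_read = false;
    bool want_write = false;
};

// Step 1 of the loop: translate each connection's intent into poll() flags
static std::vector<struct pollfd> build_poll_args(int listen_fd,
                                                  const std::vector<Conn *> &conns) {
    std::vector<struct pollfd> args;
    args.push_back({listen_fd, POLLIN, 0});       // listening socket at index 0
    for (const Conn *c : conns) {
        if (!c) continue;
        struct pollfd pfd = {c->fd, POLLERR, 0};  // always watch for errors
        if (c->want_read)  pfd.events |= POLLIN;
        if (c->want_write) pfd.events |= POLLOUT;
        args.push_back(pfd);
    }
    return args;
}
```

The event loop would then pass `args.data()` and `args.size()` to `poll()` (step 2) and walk the `revents` fields afterward.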
Non-Blocking Logic Flow
- Reading: Perform a single `read()`, append the result to `incoming`, then attempt to parse a message. If the buffer does not yet contain a complete message, wait for the next loop iteration.
- Writing: Write as much data as possible from the `outgoing` buffer. If a partial write occurs, the remaining data stays in the buffer for the next `POLLOUT` event.
- State Transition: After processing a request, toggle the connection's state from `want_read` to `want_write` to signal that a response is ready to be sent.
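The writing rule can be sketched as a flush helper. The name `flush_outgoing` is illustrative; the `EAGAIN`/`EWOULDBLOCK` branch is what lets a partial write leave the remainder buffered for the next `POLLOUT` event:

```cpp
#include <cerrno>
#include <cstdint>
#include <unistd.h>
#include <vector>

// Flush as much of `outgoing` as the kernel will accept right now.
// Returns false on a hard error (caller should mark the connection for closing).
static bool flush_outgoing(int fd, std::vector<uint8_t> &outgoing) {
    while (!outgoing.empty()) {
        ssize_t rv = write(fd, outgoing.data(), outgoing.size());
        if (rv < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return true;   // socket buffer full; retry on the next POLLOUT
        }
        if (rv <= 0) {
            return false;  // real error
        }
        // drop only the bytes that were actually written
        outgoing.erase(outgoing.begin(), outgoing.begin() + rv);
    }
    return true;
}
```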
Tutorial: Advanced Event Loop & Pipelining¶
Once a basic event loop is functional, you must optimize it to handle high-throughput scenarios and edge cases where single requests span multiple I/O operations. This phase transitions the server from a "one-at-a-time" processor to a system capable of handling pipelined requests and large data payloads.
Handling Pipelined Requests
Pipelining occurs when a client sends multiple requests over a single connection without waiting for individual responses. A naive event loop might process one request and then immediately switch to a "wait for write" state, leaving unread requests sitting in the kernel buffer.
- The Fix: Instead of processing a single request per `handle_read()` call, use a `while` loop to consume all complete messages currently available in the `incoming` buffer.
- Buffer Management: Use `buf_consume()` to remove only the processed bytes from the front of the buffer, rather than clearing the entire buffer, so that any subsequent pipelined requests are preserved.
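That loop can be sketched over an in-memory `incoming` buffer. The function name `consume_requests` is ours; the 4-byte little-endian length header follows the framing protocol from earlier, and any incomplete trailing message is left in place for the next iteration:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Consume every *complete* length-prefixed message in `incoming`,
// returning the payloads; a partial trailing message stays buffered.
static std::vector<std::string> consume_requests(std::vector<uint8_t> &incoming) {
    std::vector<std::string> msgs;
    while (true) {
        if (incoming.size() < 4) break;                    // header incomplete
        uint32_t len = 0;
        memcpy(&len, incoming.data(), 4);                  // little-endian header
        if (incoming.size() < 4 + (size_t)len) break;      // payload incomplete
        msgs.emplace_back((const char *)&incoming[4], len);
        // buf_consume(): drop only the parsed bytes, keep the rest
        incoming.erase(incoming.begin(), incoming.begin() + 4 + len);
    }
    return msgs;
}
```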
Optimizing with "Optimistic" Writes
In a standard request-response flow, a socket is almost always ready to write immediately after a request is read. You can save a system call by attempting an "optimistic" `write()` directly inside the `handle_read()` function.
- Logic: After parsing a request and generating a response, call `handle_write()` directly.
- Safety: The `handle_write()` function must be updated to handle `EAGAIN` or `EWOULDBLOCK`. If the kernel write buffer really is full, the function should return gracefully and let the event loop send the remaining data when the socket becomes truly ready.
Efficient Buffer Structures
Using a standard dynamic array (like `std::vector`) for a FIFO buffer is inefficient because removing data from the front requires shifting all subsequent elements (\(O(N)\)).
- A Better Approach: Implement a buffer with separate `data_begin` and `data_end` pointers.
- Operation: "Consuming" data from the front becomes a simple pointer increment (\(O(1)\)). Only when the unused space at the front grows too large do you perform a single memory move or reallocation to reset the buffer.
**Debugging with strace**
To verify that your event loop is behaving correctly—especially regarding non-blocking behavior and pipelining—use `strace` to monitor system calls in real time.
- What to look for:
  - `poll()` returning with multiple ready events.
  - `read()` returning a large chunk of data containing multiple prefixed messages.
  - `write()` handling large payloads across multiple iterations without blocking the entire server.
Implementation: Building a Key-Value Server (GET, SET, DEL)¶
In this phase, the server evolves from a simple echo service to a functional key-value store. This requires extending the protocol to handle multiple arguments and implementing a mechanism to store and retrieve data.
Request Protocol: List of Strings
To support commands like SET key value, the request format is structured as an array of strings. Each message contains:
- Header (4 bytes): An integer specifying the number of strings in the request.
- String Data: For each string, a 4-byte length prefix followed by the actual string bytes.
Example of a SET k v request:
`[3 (strings)] [3 (len)] ["SET"] [1 (len)] ["k"] [1 (len)] ["v"]`
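That wire format can be produced by a small serializer. The helper name `make_request` is hypothetical, and the code assumes a little-endian host so that a raw `memcpy` of each integer matches the protocol's little-endian prefixes:

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Serialize a command as [nstr][len][str]...[len][str],
// all integers 4-byte little-endian
static std::vector<uint8_t> make_request(const std::vector<std::string> &cmd) {
    std::vector<uint8_t> out;
    auto put_u32 = [&out](uint32_t v) {
        uint8_t b[4];
        memcpy(b, &v, 4);  // little-endian host assumed
        out.insert(out.end(), b, b + 4);
    };
    put_u32((uint32_t)cmd.size());        // header: number of strings
    for (const std::string &s : cmd) {
        put_u32((uint32_t)s.size());      // per-string length prefix
        out.insert(out.end(), s.begin(), s.end());
    }
    return out;
}
```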
Data Storage and Command Logic
A global `std::map<std::string, std::string>` serves as the primary storage. The server parses the first string of any request as the command and dispatches it to the appropriate handler:
- `SET`: Expects two additional arguments (key and value). It updates the map and typically returns a success indicator.
- `GET`: Expects one additional argument (key). If found, it returns the value; otherwise, it returns a "not found" status.
- `DEL`: Expects one additional argument (key). It removes the entry from the map and returns a status indicating whether the key was present.
Processing Flow
- Parse Arguments: After the framing layer provides the full payload, iterate through the byte stream to extract each string based on its length prefix.
- Dispatch: Use the first argument to determine the operation.
- Execute & Reply: Perform the map operation and construct a response. Responses should also follow the length-prefixed protocol (often a single status string or the requested data) so the client can reliably parse the result.
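The dispatch step can be sketched as follows; the function name `do_request` and the exact reply strings ("OK", "(nil)", "1"/"0") are illustrative choices, not the book's protocol:

```cpp
#include <map>
#include <string>
#include <vector>

// global key-value storage, as described above
static std::map<std::string, std::string> g_data;

// Dispatch on the first argument; return a status/value string to be
// wrapped in the length-prefixed response framing by the caller.
static std::string do_request(const std::vector<std::string> &cmd) {
    if (cmd.size() == 3 && cmd[0] == "SET") {
        g_data[cmd[1]] = cmd[2];
        return "OK";
    } else if (cmd.size() == 2 && cmd[0] == "GET") {
        auto it = g_data.find(cmd[1]);
        return it != g_data.end() ? it->second : "(nil)";
    } else if (cmd.size() == 2 && cmd[0] == "DEL") {
        // map::erase returns the number of entries removed (0 or 1)
        return g_data.erase(cmd[1]) ? "1" : "0";
    }
    return "ERR unknown command";
}
```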
Efficiency
While `std::map` is convenient for this initial step, it is not ideal for a high-performance database due to its \(O(\log N)\) complexity and per-node memory overhead. Subsequent iterations replace it with a custom-built, \(O(1)\) hash table.