System Design, Made Clear
Part 4 · Real Interview Systems · Chapter 15 · Chat System

The chat system.

Delivering messages quickly enough, in a way users can trust.

Learning objective
Design a chat system by separating connection handling, message storage, delivery, offline behavior, and per-conversation ordering, and explain the trade-offs clearly in an interview.
Before you read

Make a prediction first.

Predict

Answer before the explanation.

What must a chat system promise besides “message appears quickly”?

Commit

Write a rough answer.

Before reading, list one delivery guarantee, one ordering rule, and one offline behavior.

Connect

Notice where it returns.

Chat combines realtime connection management, message storage, ordering, fan-out, and recovery.

Plain English

Realtime is only half the story. Durable message flow is the real system.

To the user, chat feels simple: type a message, hit send, see it appear. Underneath, the sender might be online or offline, the receiver might be online or offline, the message has to be stored durably, and delivery still has to make sense inside each conversation.

A clean first version is one-to-one text chat with stored history, live delivery when possible, and push notification when the receiver is offline. That scope is enough to surface the real design questions without drowning in group fan-out, media, or presence.

Reasonable v1 scope
  • One-to-one chat.
  • Text messages.
  • Stored message history.
  • Online delivery when possible.
  • Push notification when the receiver is offline.
Layer on later
  • Group chat.
  • Media messages.
  • Read receipts.
  • Typing indicators.
  • Presence.

The most important simplification is this: you usually care about order within a conversation, not global order across the entire system.

Why it matters in interviews

This problem forces you to balance latency, durability, and offline behavior.

Interviewers like chat because it reveals whether the candidate can reason about persistent connections, durable storage, per-conversation ordering, live delivery, store-and-forward behavior, and the difference between sending a message and notifying someone about it.

Weak opener
We use WebSockets and store messages in a database.
Strong opener
I want a persistent connection for low-latency delivery, but the message should still be durably stored before I treat it as accepted. Ordering matters within each conversation, not globally. If the receiver is offline, I store-and-forward and trigger a push notification.

The stronger answer explains user-visible behavior, durable acceptance, and delivery fallback in one pass.

Mental model

Each conversation is its own lane.

Do not imagine one global highway. Imagine many smaller conversation lanes.
Conversation A: 1 2 3
Conversation B: 1 2
Conversation C: 1 2 3 4
Ordering should make sense within a conversation lane. Requiring global ordering across all lanes adds coordination you usually do not need.
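The lane idea can be sketched as one counter per conversation. This is a minimal in-memory illustration, not a production design: the `ConversationSequencer` class and `next_seq` method are hypothetical names, and a real system would presumably keep the counter in durable storage alongside the messages.

```python
from collections import defaultdict
from itertools import count


class ConversationSequencer:
    """Hands out 1, 2, 3, ... independently for each conversation (hypothetical helper)."""

    def __init__(self) -> None:
        # One independent counter per conversation; there is no shared global counter.
        self._counters = defaultdict(lambda: count(1))

    def next_seq(self, conversation_id: str) -> int:
        # Only messages in the same conversation ever compete for a number.
        return next(self._counters[conversation_id])


sequencer = ConversationSequencer()
lanes = [(cid, sequencer.next_seq(cid)) for cid in ["A", "B", "A", "C", "A", "B"]]
print(lanes)  # [('A', 1), ('B', 1), ('A', 2), ('C', 1), ('A', 3), ('B', 2)]
```

Because the counters never interact, conversations A, B, and C can be sequenced on different machines without coordinating with each other.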
Key ideas

Seven anchors.

Core diagram

Separate acceptance, live delivery, and offline fallback.

[Diagram]
Send path: Sender → Chat gateway (persistent connection) → Message service (assign sequence in conversation) → Message store (durable history; conversation_id + seq)
Live delivery path: Receiver gateway (deliver if online) → Receiver
Offline / history path: Push notification; history fetch on reconnect → Chat service (read conversation history)
A message is accepted after durable storage. Live delivery is one path. Offline retrieval is another. Push notification is only the alert side path.
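The three paths can be sketched in a few lines. This is a toy model under stated assumptions: `ChatBackend`, its fields, and the `send` method are hypothetical names, a dict stands in for the durable message store, and a plain list stands in for a live connection.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    conversation_id: str
    seq: int
    sender: str
    receiver: str
    text: str


@dataclass
class ChatBackend:
    store: dict = field(default_factory=dict)     # (conversation_id, seq) -> Message
    online: dict = field(default_factory=dict)    # user -> list standing in for a live connection
    pushes: list = field(default_factory=list)    # users alerted via push notification
    last_seq: dict = field(default_factory=dict)  # conversation_id -> last sequence used

    def send(self, conversation_id: str, sender: str, receiver: str, text: str) -> int:
        # Assign the next sequence number inside this conversation only.
        seq = self.last_seq.get(conversation_id, 0) + 1
        self.last_seq[conversation_id] = seq
        msg = Message(conversation_id, seq, sender, receiver, text)

        # Acceptance: the message counts as accepted once it is stored.
        self.store[(conversation_id, seq)] = msg

        # Live delivery is conditional on the receiver being connected.
        connection = self.online.get(receiver)
        if connection is not None:
            connection.append(msg)
        else:
            # Offline: the message stays in the store; push is only the alert.
            self.pushes.append(receiver)

        # The sender's acknowledgement does not depend on the branch above.
        return seq


backend = ChatBackend()
backend.online["bob"] = []
backend.send("alice-bob", "alice", "bob", "hi")       # stored, then delivered live
backend.send("alice-carol", "alice", "carol", "hey")  # stored, then push alert
```

In both calls the sender gets an acknowledgement after the store step, which is why an offline receiver never blocks the send path.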
Speaking script

Lines you can actually say out loud.

Opening
I will start with one-to-one text chat and treat group chat as a later extension.
Sketching
I want persistent connections for low-latency delivery, but I still need durable message storage before I treat a send as accepted.
Deep dive
I care about ordering within each conversation, not global ordering across the whole system.
Deep dive
If the receiver is offline, the message stays in durable storage and the system can trigger a push notification.
Extending
Push notifications are a side channel for offline users, not the main chat transport.
Defending
The trade-off is lower latency and better user experience versus more connection management and delivery complexity.
Common mistakes

Predictable ways this answer goes wrong.

Misconception check

Correct the wrong model before it sticks.

Wrong intuition

What feels tempting

Chat is solved once you add WebSockets.

Better model

What to replace it with

WebSockets are only the connection. The system also needs message durability, ordering, delivery state, offline handling, and group fan-out.

Interview move

What to do in the room

Describe the message lifecycle: send, persist, fan out, deliver, acknowledge, and recover.

Trade-offs

The decisions that come up every time.

Each chat design choice below lists when it is good, when it is weak, and an interview line.

Polling for new messages
  • Good when: Scale is small and realtime expectations are modest.
  • Weak when: Users expect fast interactive delivery.
  • Interview line: Polling is the simplest start, but persistent connections usually fit realtime chat better.
Persistent connection delivery (default)
  • Good when: Low-latency message delivery matters.
  • Weak when: Connection management complexity is not justified for a very small system.
  • Interview line: A persistent connection gives me fast delivery without repeated polling overhead.
Per-conversation ordering (default)
  • Good when: Users need messages to make sense within each chat.
  • Weak when: The design is forced into unnecessary global coordination.
  • Interview line: Conversation-level ordering is the guarantee I actually need here.
Acknowledge only after durable write
  • Good when: Message loss is unacceptable after the UI says sent.
  • Weak when: The system is over-optimized for latency at the cost of user trust.
  • Interview line: I would not treat the message as accepted until it is durably stored.
Push notifications for offline users
  • Good when: Users may be disconnected but still need awareness of new messages.
  • Weak when: Push delivery is treated as the primary transport rather than a fallback.
  • Interview line: Push is my offline alert path, not my main message channel.
Deep dive

Online receiver, offline receiver, and message acceptance are different states.

Many weak answers blur these together. A cleaner answer separates three questions: was the message accepted, was it delivered live, and was the user notified?

Accepted (durably stored): the UI can show sent because the message is in durable storage.
Delivered live (receiver is connected): the message appears now; the live delivery path succeeded.
Notified (receiver may still be offline): push sent; this is the alert path, not delivery proof.
Accepted, delivered, and notified are not the same thing. That distinction makes the answer sound much more precise.
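One way to keep the three states from blurring is to track them as independent flags. The sketch below is illustrative only: `MessageStatus`, `sender_label`, and the label strings are hypothetical, not part of any real chat API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageStatus:
    accepted: bool = False        # durably stored
    delivered_live: bool = False  # reached a connected receiver over the live path
    notified: bool = False        # push notification sent (an alert, not delivery proof)


def sender_label(status: MessageStatus) -> str:
    """Pick what the sender's UI shows (hypothetical labels)."""
    if not status.accepted:
        return "sending"
    if status.delivered_live:
        return "delivered"
    # `notified` is deliberately ignored: a push never upgrades "sent" to "delivered".
    return "sent"


print(sender_label(MessageStatus(accepted=True, notified=True)))  # sent
```

The design choice worth saying out loud is that `notified` never feeds into the delivered label.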
Mini case study

The sender is online, but the receiver is offline.

This is the clean store-and-forward test. If the design only works when both users are connected, it is incomplete.

What should happen

  • Sender sends the message through the chat gateway.
  • Message is durably stored.
  • Sender can see the message accepted.
  • System triggers a push notification.

What happens later

  • Receiver reconnects.
  • Conversation history is fetched from storage.
  • Order still makes sense inside the conversation.

What should not happen

  • Treating push delivery as proof the chat message was delivered.
  • Losing the message because the receiver was offline at send time.

The lesson

  • Chat systems need durable store-and-forward behavior, not just sockets.
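The reconnect step in this case study can be sketched as a history fetch ordered by sequence number. The assumptions are loud ones: the client remembering a `last_seen_seq` cursor is an illustrative choice not stated above, `fetch_missed` is a hypothetical helper, and a list of dicts stands in for the message store.

```python
def fetch_missed(store: list, conversation_id: str, last_seen_seq: int) -> list:
    """Return messages the receiver has not seen yet, in conversation order."""
    missed = [
        m for m in store
        if m["conversation_id"] == conversation_id and m["seq"] > last_seen_seq
    ]
    # Sorting by seq restores per-conversation order regardless of storage order.
    return sorted(missed, key=lambda m: m["seq"])


# Messages stored while the receiver was offline (storage order is not guaranteed).
store = [
    {"conversation_id": "alice-bob", "seq": 3, "text": "still there?"},
    {"conversation_id": "alice-bob", "seq": 1, "text": "hi"},
    {"conversation_id": "alice-carol", "seq": 1, "text": "other lane"},
    {"conversation_id": "alice-bob", "seq": 2, "text": "are you free?"},
]

# The receiver reconnects having last seen seq 1 in this conversation.
for m in fetch_missed(store, "alice-bob", last_seen_seq=1):
    print(m["seq"], m["text"])
```

Nothing here depends on whether a push notification arrived; the store is the source of truth.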
Demo conversation

How a strong exchange sounds.

Interviewer
What are the states you care about when someone sends a message?
Candidate
I separate accepted (durably stored), delivered live, notified, and maybe read later. Those are different guarantees, and keeping them separate avoids vague correctness claims.
Interviewer
What happens if the receiver is offline?
Candidate
The sender-facing path should still end after durable storage. Live delivery is conditional. If the receiver is offline, I fall back to later sync or notification rather than blocking the send path.
Interviewer
Where does scale usually hurt first?
Candidate
Usually on conversation fan-out, active connection management, or message ordering within busy conversations. That is why I like treating each conversation as its own lane.
Worked example to solo answer

Fade the support before the real practice.

Do not jump straight from reading to a full answer. First see the shape, then complete part of it, then answer alone.

I do

Study the model move.

I would say: "I will store the message before fan-out so reconnecting clients can recover missed messages."

We do

Complete the missing piece.

For large groups, compare direct fan-out with topic or stream-based delivery.

You do

Answer without notes.

Answer the practice prompt using the message lifecycle, not only the websocket box.

Practice

Try it before you read the model answer.

Prompt
Extend the chat system to support large group chats.
  • What gets harder?
  • What guarantee would you keep?
  • What new bottleneck appears?
Strong model answer
Large group chat makes fan-out much harder because one message may need to reach many recipients. I would still keep the main guarantee as durable storage plus conversation-level ordering rather than global ordering. The new bottleneck is delivery fan-out and possibly notification fan-out, so I would likely separate durable message acceptance from downstream delivery work and scale that delivery path independently.
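Separating acceptance from delivery work might look like the sketch below. It is a toy model with hypothetical names (`GroupChat`, `accept`, `drain`): a list stands in for durable storage and an in-process deque stands in for whatever queue or stream would carry delivery work in a real system.

```python
from collections import deque


class GroupChat:
    def __init__(self, members: dict) -> None:
        self.members = members  # group_id -> list of user ids
        self.store = []         # stands in for durable storage
        self.queue = deque()    # stands in for a delivery queue or stream
        self.last_seq = {}      # group_id -> last sequence used

    def accept(self, group_id: str, sender: str, text: str) -> int:
        """Sender-facing path: sequence, store, enqueue, acknowledge."""
        seq = self.last_seq.get(group_id, 0) + 1
        self.last_seq[group_id] = seq
        msg = {"group_id": group_id, "seq": seq, "sender": sender, "text": text}
        self.store.append(msg)
        self.queue.append(msg)  # fan-out happens later, off the send path
        return seq

    def drain(self, online: set) -> tuple:
        """Fan-out worker: one stored message becomes many delivery attempts."""
        delivered, pushed = [], []
        while self.queue:
            msg = self.queue.popleft()
            for user in self.members[msg["group_id"]]:
                if user == msg["sender"]:
                    continue
                target = delivered if user in online else pushed
                target.append((user, msg["seq"]))
        return delivered, pushed


chat = GroupChat({"team": ["alice", "bob", "carol"]})
chat.accept("team", "alice", "standup in 5")
print(chat.drain(online={"bob"}))  # ([('bob', 1)], [('carol', 1)])
```

The cost of `accept` stays constant as the group grows, while the cost of `drain` grows with membership, which is the part that would be scaled independently.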
Training loop

Make this chapter stick.

Before moving on, turn recognition into production. Close the model answer, answer from memory, then retry one small slice.

Recall

Say the chapter's core idea without looking. Then name one related idea from an earlier chapter.

Vary

Change one constraint in the practice prompt and answer again in half the time.

Score

Use the rubric to pick one dimension below 3, then retry only that dimension.

Memory hook
Store it. Then deliver it.
Recap

Three things to take into the room.

1

Persistent connections are only part of the system.

Durable message flow is the real backbone.

2

Conversation-level ordering is usually enough.

Global ordering usually adds cost without improving the product.

3

Offline behavior is core, not an edge case.

Store-and-forward behavior is part of the main design.

Reusable interview line
"I want low-latency delivery, but I would not confuse delivery with durability: the message is accepted after durable storage, then delivered live if possible or fetched later if the receiver is offline."