System Design, Made Clear
Part 4 · Real Interview Systems · Chapter 18 · Notification System

The notification system.

Sending the right message, through the right channel, without losing user trust.

Learning objective
Design a notification system that turns product events into channel-specific deliveries while respecting preferences, retries, deduplication, and provider failures, and explain the trade-offs clearly in an interview.
Before you read

Make a prediction first.

Predict

Answer before the explanation.

What can go wrong if a notification system simply calls email, SMS, and push APIs directly?

Commit

Write a rough answer.

Before reading, name one user preference, one retry risk, and one duplicate risk.

Connect

Notice where it returns.

Notifications combine events, preferences, queues, provider failures, idempotency, and user trust.

Plain English

This is a routing and policy system, not just a provider call.

Notification systems look small from the outside: something happens, then a user gets notified. In reality, the system has to decide which events matter, who should receive them, which channel should be used, what the user preferences allow, and what should happen if a provider is down.

A clean first version is enough to expose the real design: receive an event, decide recipients, check preferences, create channel-specific jobs, then let channel workers send through providers. That is already much better than treating the problem like one direct email API call.

Reasonable v1 scope
  • One event enters the system.
  • The system decides who should receive it.
  • User preferences are checked.
  • Channel-specific jobs are created.
  • Channel workers send through providers.
Layer on later
  • Batching or digests.
  • Scheduled sends.
  • Quiet hours.
  • Templates.
  • Per-channel analytics.

The key simplification is separating intent from delivery. Once you do that, the rest of the architecture gets much clearer.
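
To make the intent/delivery split concrete, here is a minimal sketch, assuming an in-memory preference store. Every name in it (`NotificationIntent`, `DeliveryJob`, `PREFERENCES`, `EVENT_CHANNELS`, `plan_deliveries`) is hypothetical and chosen only for illustration. It shows one intent becoming zero or more channel-specific jobs after a preference check.

```python
from dataclasses import dataclass

# All names below are hypothetical, chosen for illustration only.

@dataclass(frozen=True)
class NotificationIntent:
    intent_id: str
    event_type: str   # e.g. "order_shipped"
    user_id: str

@dataclass(frozen=True)
class DeliveryJob:
    intent_id: str
    user_id: str
    channel: str      # "push", "email", or "sms"

# Assumed preference store: user id -> channels the user has opted into.
PREFERENCES = {
    "u1": {"email", "push"},
    "u2": set(),      # opted out of everything
}

# Assumed policy: which channels an event type may use at all.
EVENT_CHANNELS = {"order_shipped": ["push", "email", "sms"]}

def plan_deliveries(intent: NotificationIntent) -> list[DeliveryJob]:
    """Turn one intent into zero or more channel-specific jobs."""
    allowed = PREFERENCES.get(intent.user_id, set())
    candidates = EVENT_CHANNELS.get(intent.event_type, [])
    return [
        DeliveryJob(intent.intent_id, intent.user_id, channel)
        for channel in candidates
        if channel in allowed
    ]

jobs = plan_deliveries(NotificationIntent("i-1", "order_shipped", "u1"))
print([job.channel for job in jobs])  # ['push', 'email']
```

Nothing in `plan_deliveries` talks to a provider. That is the point of the split: policy decisions finish before any delivery attempt starts.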

Why it matters in interviews

This problem shows whether you can separate policy, routing, and retries.

Interviewers like notification systems because they reveal whether the candidate distinguishes event intake from delivery, respects user preferences, isolates channels with queues, handles provider failures, and retries safely without duplicates.

Weak opener
When an order ships, call the email API.
Strong opener
When an event arrives, the notification system decides eligible recipients, checks preferences, and creates channel-specific delivery jobs. Email, push, and SMS workers can retry independently, and the system should deduplicate so temporary provider failures do not spam the user.

The stronger answer separates intent, policy, and delivery. That makes the system sound deliberate instead of ad hoc.

Mental model

A postal sorting center.

Receive the delivery intent, sort it, then route each piece to the right carrier.
[Figure: intent in (order shipped: what should happen) → sort and check prefs (eligibility: who, whether, how) → channels out (push, email, SMS: separate carriers, separate failure modes).]
The system is a sorting center: intake, policy check, routing, and independent carriers for each channel.
Key ideas

Seven anchors.

Core diagram

Keep the whole pipeline visible at once.

[Figure: event source → notification service (intent intake + recipient decision) → preference check (quiet hours, channel opts) → channel router (create delivery jobs; templates / log) → push queue + worker → push provider; email queue + worker → email provider; SMS worker → SMS provider.]
Intent comes in once. Channel-specific work fans out later. That is what makes retries, preferences, and failure isolation manageable.
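
The failure isolation described above can be sketched as follows. This is an illustration under stated assumptions, not a production design: the queues are in-memory `deque` objects standing in for a durable queue, and the `send_*` functions and `drain` helper are hypothetical. The email provider is hard-coded to time out so the effect on the other channels is visible.

```python
from collections import deque

# Hypothetical in-memory stand-ins; a real system would use a durable queue.
queues = {"push": deque(), "email": deque(), "sms": deque()}

def send_push(job):
    return True

def send_email(job):
    raise TimeoutError("email provider timed out")  # simulated outage

def send_sms(job):
    return True

PROVIDERS = {"push": send_push, "email": send_email, "sms": send_sms}

def drain(channel: str) -> dict:
    """Process one channel's queue; failures stay inside that channel."""
    sent, failed = [], []
    queue = queues[channel]
    while queue:
        job = queue.popleft()
        try:
            PROVIDERS[channel](job)
            sent.append(job)
        except Exception:
            failed.append(job)  # would be rescheduled on this channel only
    return {"sent": sent, "failed": failed}

# One intent fans out to one job per channel.
for channel in queues:
    queues[channel].append({"intent_id": "i-1", "channel": channel})

results = {channel: drain(channel) for channel in queues}
print({channel: len(r["sent"]) for channel, r in results.items()})
# {'push': 1, 'email': 0, 'sms': 1}
```

The email failure is recorded for retry, while push and SMS complete normally.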
Speaking script

Lines you can actually say out loud.

Opening
I want to separate notification intent from actual channel delivery.
Sketching
The first step after receiving an event is deciding who is eligible and what the user's preferences allow.
Deep dive
I would route each channel into its own queue so retries and failures are isolated.
Deep dive
If one provider fails, I do not want the whole notification system to stop.
Defending
Retries are helpful, but I need deduplication or idempotency so users are not spammed.
Defending
The trade-off is richer policy control and safer delivery versus more queues, more state, and more operational complexity.
Common mistakes

Predictable ways this answer goes wrong.

Misconception check

Correct the wrong model before it sticks.

Wrong intuition

What feels tempting

Notifications are just API calls to external providers.

Better model

What to replace it with

Notifications are an orchestration pipeline with preferences, channel choice, retries, provider failures, and duplicate control.

Interview move

What to do in the room

Model event intake, preference resolution, channel queues, provider adapters, and delivery state.
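
One way to picture provider adapters and delivery state is the sketch below. It is an assumption-laden illustration: `ProviderAdapter`, `DeliveryState`, `ConsoleEmailAdapter`, `BrokenSmsAdapter`, and `attempt` are hypothetical names, and the adapters record or fail locally instead of calling any real vendor API.

```python
from enum import Enum
from typing import Protocol

class DeliveryState(Enum):
    PENDING = "pending"
    SENT = "sent"
    FAILED = "failed"

class ProviderAdapter(Protocol):
    """Hypothetical common interface that every channel provider is wrapped in."""
    def send(self, to: str, body: str) -> None: ...

class ConsoleEmailAdapter:
    """Stand-in adapter that records messages instead of calling a vendor."""
    def __init__(self) -> None:
        self.outbox: list[tuple[str, str]] = []

    def send(self, to: str, body: str) -> None:
        self.outbox.append((to, body))

class BrokenSmsAdapter:
    """Stand-in adapter that always fails, to simulate an outage."""
    def send(self, to: str, body: str) -> None:
        raise ConnectionError("simulated SMS outage")

def attempt(adapter: ProviderAdapter, to: str, body: str) -> DeliveryState:
    """Make one delivery attempt and report the resulting state."""
    try:
        adapter.send(to, body)
    except Exception:
        return DeliveryState.FAILED
    return DeliveryState.SENT

email = ConsoleEmailAdapter()
print(attempt(email, "a@example.com", "Your order shipped").value)     # sent
print(attempt(BrokenSmsAdapter(), "+15550100", "Your order shipped").value)  # failed
```

Because workers depend only on the adapter interface, swapping a provider does not change the routing or retry code.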

Trade-offs

The decisions that come up every time.

| Notification choice | Good when | Weak when | Interview line |
|---|---|---|---|
| Single direct send path | The system is tiny and there is one simple channel. | Multiple channels, retries, and provider failures matter. | A direct send path is fine only for the smallest first version. |
| One queue for all notifications | Traffic is low and channel behavior is similar. | Channels have very different retry patterns, priorities, or providers. | One queue is simpler, but separate channel queues give better isolation and control. |
| Per-channel queues and workers (default) | Channels fail differently and should scale or retry independently. | The system is so small the extra queues add unnecessary overhead. | Per-channel queues help me isolate failures and tune retries independently. |
| Immediate send for every event | The event is transactional and user-facing timing matters. | Many notifications could be digested, batched, or deprioritized. | I would send urgent transactional notifications immediately, but batch or digest lower-priority ones. |
| Retry with dedupe (default) | Provider failures are transient and eventual delivery matters. | Retries are unsafe and duplicate protection is weak. | Retries improve delivery, but only if I can prevent duplicates cleanly. |
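
The immediate-versus-digest decision can be sketched as a small routing step. This is a minimal illustration, assuming a fixed set of urgent event types. `URGENT_EVENTS`, `route`, and `flush_digest` are hypothetical names, and the event types other than order shipped are invented for the example.

```python
from collections import defaultdict

# Hypothetical classification; a real product would define its own event types.
URGENT_EVENTS = {"order_shipped", "password_reset"}

send_now = []                        # jobs handed to channel queues immediately
digest_buffer = defaultdict(list)    # user_id -> pending low-priority events

def route(event_type: str, user_id: str) -> None:
    """Send urgent events right away; buffer everything else for a digest."""
    if event_type in URGENT_EVENTS:
        send_now.append((user_id, event_type))
    else:
        digest_buffer[user_id].append(event_type)

def flush_digest(user_id: str):
    """Collapse buffered events into one message; None if nothing is pending."""
    events = digest_buffer.pop(user_id, [])
    if not events:
        return None
    return f"You have {len(events)} updates: " + ", ".join(events)

route("order_shipped", "u1")
route("new_follower", "u1")
route("weekly_tip", "u1")

print(send_now)            # [('u1', 'order_shipped')]
print(flush_digest("u1"))  # You have 2 updates: new_follower, weekly_tip
```

In this sketch the digest is flushed on demand. When to flush (hourly, daily, on a size threshold) is a separate policy decision.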
Deep dive

Retries are useful only if the user does not get spammed.

This is the central reliability question in this chapter. Temporary provider failures are common and recoverable. Duplicate user-visible sends are what destroy trust.

[Figure: two panels. Retry without dedupe: the email worker times out, retries, and may send twice if the first attempt actually succeeded late, so delivery may repeat visibly. Retry with dedupe: the email worker first checks a dedupe record keyed on intent id + channel (notification_intent + email), so a safe retry checks whether this delivery already happened.]
Retries improve delivery reliability only if the system can recognize that the same notification intent on the same channel should not be sent twice.
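
A sketch of the dedupe idea, under two loudly stated assumptions: the dedupe key is `intent_id:channel`, and the provider honors an idempotency key passed with each send (some real providers support this, others do not). `FakeEmailProvider`, `dedupe_records`, and `deliver` are hypothetical. The fake provider delivers on the first call but raises a timeout anyway, which is exactly the "succeeded late" case.

```python
class FakeEmailProvider:
    """Hypothetical provider: delivers, but the first response is lost."""
    def __init__(self):
        self.delivered = []      # what the user actually received
        self.seen_keys = set()   # provider-side idempotency memory (assumed)
        self.calls = 0

    def send(self, to, body, idempotency_key):
        self.calls += 1
        if idempotency_key not in self.seen_keys:
            self.seen_keys.add(idempotency_key)
            self.delivered.append((to, body))
        if self.calls == 1:
            raise TimeoutError("response lost after the send succeeded")
        return "ok"

dedupe_records = set()  # assumed store of completed (intent, channel) deliveries

def deliver(provider, intent_id, channel, to, body, max_attempts=3):
    key = f"{intent_id}:{channel}"
    if key in dedupe_records:        # this delivery already happened
        return "skipped"
    for _ in range(max_attempts):
        try:
            provider.send(to, body, idempotency_key=key)
        except TimeoutError:
            continue                 # retry with the same key
        dedupe_records.add(key)      # record success for later checks
        return "sent"
    return "failed"

provider = FakeEmailProvider()
first = deliver(provider, "intent-42", "email", "a@example.com", "Your order shipped")
again = deliver(provider, "intent-42", "email", "a@example.com", "Your order shipped")
print(first, again, len(provider.delivered))  # sent skipped 1
```

The worker called the provider twice, yet the user received one email. Without the shared key, the retry after the lost response would have produced a second visible send.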
Mini case study

Order shipped event.

This is the clean routing test. One product event may produce several channel-specific attempts, each with its own failure behavior.

What should happen

  • Order service emits order shipped.
  • Notification system determines the recipient.
  • Preferences are checked for email, push, SMS, or in-app.
  • Channel-specific jobs are created.

What can go wrong

  • Email provider times out.
  • Push succeeds but email fails.
  • A retry sends duplicate email.

What helps

  • Per-channel queues.
  • Retry policy per channel.
  • Dedupe keys per notification intent and channel.
  • Delivery logs.

The lesson

  • One notification event is a routing problem, not one direct API call.
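
"Retry policy per channel" can be sketched as plain configuration. The numbers below are illustrative only, not recommendations, and `RetryPolicy` and `POLICIES` are hypothetical names. The sketch uses capped exponential backoff without jitter so the output is deterministic. Real systems often add random jitter to spread retries out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    base_delay_s: float
    max_delay_s: float

    def delays(self) -> list:
        """Wait before each retry: exponential backoff, capped at max_delay_s."""
        return [
            min(self.base_delay_s * 2 ** i, self.max_delay_s)
            for i in range(self.max_attempts - 1)
        ]

# Illustrative numbers only; each channel gets its own policy.
POLICIES = {
    "push": RetryPolicy(max_attempts=3, base_delay_s=1, max_delay_s=10),
    "email": RetryPolicy(max_attempts=5, base_delay_s=30, max_delay_s=600),
    "sms": RetryPolicy(max_attempts=2, base_delay_s=5, max_delay_s=5),
}

for channel, policy in POLICIES.items():
    print(channel, policy.delays())
# push [1, 2]
# email [30, 60, 120, 240]
# sms [5]
```

Keeping the policy per channel lets a slow email provider back off for minutes while push gives up quickly, which matches the different urgency of each channel.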
Demo conversation

How a strong exchange sounds.

Interviewer
What is the biggest framing mistake people make here?
Candidate
Treating notification delivery like one direct provider call. The real system is intake, preference checks, routing, and per-channel delivery with independent failure handling.
Interviewer
Why separate channels so early?
Candidate
Because push, email, SMS, and in-app differ in urgency, throughput, provider behavior, and retry rules. Mixing them too early hides both failures and trade-offs.
Interviewer
What is the main reliability risk?
Candidate
Retries without dedupe. Retries improve reliability only if we can repeat an attempt safely, without spamming the user or delivering the same notification intent twice.
Worked example to solo answer

Fade the support before the real practice.

Do not jump straight from reading to a full answer. First see the shape, then complete part of it, then answer alone.

I do

Study the model move.

I would say: "The order event creates a notification job, preferences choose channels, and each channel has its own retry policy."

We do

Complete the missing piece.

For ride sharing, separate urgent trip notifications from marketing or summary messages.

You do

Answer without notes.

Answer the practice prompt with one provider failure and one duplicate-prevention strategy.

Practice

Try it before you read the model answer.

Prompt
Design a notification system for a ride-sharing app.
  • What should be immediate?
  • What channels would you separate?
  • What failure would you design around first?
Strong model answer
I would treat ride status updates as immediate notifications because timing matters directly to the user experience. I would separate push, SMS, and email into different delivery paths because their urgency, provider behavior, and retry logic differ. The first failure I would design around is provider failure or timeout, because the system should retry safely and possibly fall back to another channel without sending duplicates or blocking the whole pipeline.
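
The fallback mentioned in the model answer can be sketched as below. This is one possible policy, not the only one. `deliver_with_fallback`, the `senders` mapping, and the `delivered` record are hypothetical. Channels are tried in order, the first success is recorded against the intent, and a repeated call for the same intent sends nothing.

```python
def deliver_with_fallback(intent_id, channels, senders, delivered):
    """Try channels in order and stop at the first success.

    `delivered` maps intent_id -> the channel that already succeeded,
    so a replayed intent does not produce a second message.
    """
    if intent_id in delivered:
        return delivered[intent_id]
    for channel in channels:
        try:
            senders[channel](intent_id)
        except Exception:
            continue             # this channel failed; try the next one
        delivered[intent_id] = channel
        return channel
    return None                  # every channel failed; leave for a later retry

sent_log = []

def push(intent_id):
    raise ConnectionError("simulated push provider outage")

def sms(intent_id):
    sent_log.append(("sms", intent_id))

senders = {"push": push, "sms": sms}
delivered = {}

first = deliver_with_fallback("trip-7", ["push", "sms"], senders, delivered)
second = deliver_with_fallback("trip-7", ["push", "sms"], senders, delivered)
print(first, second, sent_log)  # sms sms [('sms', 'trip-7')]
```

Here dedupe is tracked per intent rather than per intent and channel, because the goal of a fallback is exactly one message on whichever channel works.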
Training loop

Make this chapter stick.

Before moving on, turn recognition into production. Close the model answer, answer from memory, then retry one small slice.

Recall

Say the chapter's core idea without looking. Then name one related idea from an earlier chapter.

Vary

Change one constraint in the practice prompt and answer again in half the time.

Score

Use the rubric to pick one dimension below 3, then retry only that dimension.

Memory hook
Intent in. Channels out. Retries safe.
Recap

Three things to take into the room.

1

Intent first.

Do not jump straight to provider calls.

2

Isolate channels.

Different channels fail differently and deserve independent control.

3

Retry safely.

Reliability without dedupe becomes spam.

Reusable interview line
"I would separate notification intent from delivery, check user preferences before channel fan-out, and keep retries isolated and deduplicated so one provider failure does not spam the user or stall the whole system."