Part 4 · Real Interview Systems · Chapter 18

The notification system.

Sending the right message, through the right channel, without losing user trust.

Learning objective

Design a notification system that turns product events into channel-specific deliveries while respecting preferences, retries, deduplication, and provider failures, and explain the trade-offs clearly in an interview.

Before you read

Make a prediction first.

Predict

Answer before the explanation.

What can go wrong if a notification system simply calls email, SMS, and push APIs directly?

Commit

Write a rough answer.

Before reading, name one user preference, one retry risk, and one duplicate risk.

Connect

Notice where it returns.

Notifications combine events, preferences, queues, provider failures, idempotency, and user trust.

Plain English

This is a routing and policy system, not just a provider call.

Notification systems look small from the outside: something happens, then a user gets notified. In reality, the system has to decide which events matter, who should receive them, which channel should be used, what the user preferences allow, and what should happen if a provider is down.

A clean first version is enough to expose the real design: receive an event, decide recipients, check preferences, create channel-specific jobs, then let channel workers send through providers. That is already much better than treating the problem like one direct email API call.

Reasonable v1 scope

One event enters the system.
The system decides who should receive it.
User preferences are checked.
Channel-specific jobs are created.
Channel workers send through providers.
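
The v1 steps above can be sketched in a few lines. This is a minimal illustration, not a reference design: the names `Event`, `ChannelJob`, `PREFERENCES`, `resolve_recipients`, and `plan_jobs` are all hypothetical, and the preference store is an in-memory dict standing in for a real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str      # e.g. "order_shipped"
    user_id: str


@dataclass(frozen=True)
class ChannelJob:
    event_id: str
    user_id: str
    channel: str   # "email", "push", or "sms"


# Hypothetical preference store: user_id -> channels the user has opted into.
PREFERENCES = {
    "u1": {"email", "push"},
    "u2": set(),  # opted out of everything
}


def resolve_recipients(event: Event) -> list[str]:
    """Decide who should receive the event (here: only the event's own user)."""
    return [event.user_id]


def plan_jobs(event: Event, candidate_channels=("email", "push", "sms")) -> list[ChannelJob]:
    """Turn one event into channel-specific jobs, honoring preferences."""
    jobs = []
    for user_id in resolve_recipients(event):
        # Unknown users default to no channels (an assumption of this sketch).
        allowed = PREFERENCES.get(user_id, set())
        for channel in candidate_channels:
            if channel in allowed:
                jobs.append(ChannelJob(event.event_id, user_id, channel))
    return jobs
```

Nothing here calls a provider. `plan_jobs` only records what should be delivered, and the channel workers decide how and when to deliver it.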

Layer on later

Batching or digests.
Scheduled sends.
Quiet hours.
Templates.
Per-channel analytics.

The key simplification is separating intent from delivery. Once you do that, the rest of the architecture gets much clearer.

Why it matters in interviews

This problem shows whether you can separate policy, routing, and retries.

Interviewers like notification systems because they reveal whether the candidate distinguishes event intake from delivery, respects user preferences, isolates channels with queues, handles provider failures, and retries safely without duplicates.

Weak opener

When an order ships, call the email API.

Strong opener

When an event arrives, the notification system decides eligible recipients, checks preferences, and creates channel-specific delivery jobs. Email, push, and SMS workers can retry independently, and the system should deduplicate so that retries after temporary provider failures do not spam the user.

The stronger answer separates intent, policy, and delivery. That makes the system sound deliberate instead of ad hoc.

Key ideas

Seven anchors.

Notification intent should be separated from actual channel delivery.
User preferences and eligibility checks should happen before channel fan-out.
Different channels usually deserve separate queues and workers.
Retries are useful, but deduplication or idempotency is needed to avoid duplicate sends.
Provider failures should not take down the whole notification pipeline.
Some notifications are immediate; others can be batched or digested.
Delivery status and audit trails matter for support and debugging.
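
The channel-isolation anchor can be made concrete with a small sketch. It assumes one in-memory queue per channel and two stand-in senders (`send_email`, `send_push`); a real system would use a message broker and real provider adapters, and every name below is hypothetical.

```python
from collections import deque


class ProviderDown(Exception):
    """Raised by a (hypothetical) provider adapter when the provider is unavailable."""


def send_email(job):
    # Stand-in for an email provider that is currently failing.
    raise ProviderDown("email provider timed out")


def send_push(job):
    # Stand-in for a healthy push provider.
    return f"push delivered: {job}"


QUEUES = {"email": deque(), "push": deque()}
SENDERS = {"email": send_email, "push": send_push}


def enqueue(channel, job):
    QUEUES[channel].append(job)


def drain(channel):
    """Process one channel's queue; failures stay inside that channel."""
    delivered, failed = [], []
    queue = QUEUES[channel]
    while queue:
        job = queue.popleft()
        try:
            delivered.append(SENDERS[channel](job))
        except ProviderDown:
            failed.append(job)  # a real worker would schedule a retry here
    return delivered, failed
```

Because each channel drains its own queue, the failing email sender never delays the push worker.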

Speaking script

Lines you can actually say out loud.

Opening

I want to separate notification intent from actual channel delivery.

Sketching

The first step after receiving an event is deciding who is eligible and what the user's preferences allow.

Deep dive

I would route each channel into its own queue so retries and failures are isolated.

Deep dive

If one provider fails, I do not want the whole notification system to stop.

Defending

Retries are helpful, but I need deduplication or idempotency so users are not spammed.

Defending

The trade-off is richer policy control and safer delivery versus more queues, more state, and more operational complexity.

Common mistakes

Predictable ways this answer goes wrong.

Treating notifications like a direct function call to one provider.
Ignoring user preferences or quiet hours.
Using one queue for all channels even when their failure modes differ.
Retrying blindly and sending duplicates.
Letting one failing provider block all notification processing.
Ignoring delivery logs or debugging visibility.
Mixing truly transactional notifications with low-priority promotional traffic without saying how they differ.
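
As one illustration of the quiet-hours and transactional-versus-promotional points above, here is a minimal sketch. The 22:00 to 07:00 window and the rule that transactional messages bypass quiet hours are assumptions made for the example, not requirements from this chapter, and the function names are hypothetical.

```python
from datetime import time


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return now >= start or now < end


def decide(now: time, transactional: bool,
           start: time = time(22, 0), end: time = time(7, 0)) -> str:
    """Return "send_now" or "defer" for a single notification."""
    # Assumption: transactional messages are allowed through quiet hours.
    if transactional or not in_quiet_hours(now, start, end):
        return "send_now"
    return "defer"
```

The wrap-around branch is the part that is easy to get wrong: a window that starts in the evening and ends in the morning cannot be checked with a single `start <= now < end` comparison.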

Misconception check

Correct the wrong model before it sticks.

Wrong intuition

What feels tempting

Notifications are just API calls to external providers.

Better model

What to replace it with

Notifications are an orchestration pipeline with preferences, channel choice, retries, provider failures, and duplicate control.

Interview move

What to do in the room

Model event intake, preference resolution, channel queues, provider adapters, and delivery state.
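
Provider adapters and delivery state can be sketched as a small interface plus a state record. Everything below is hypothetical: `ProviderAdapter`, `FakeEmailProvider`, `Delivery`, and the three states are illustrative choices, and a real system would likely track more states and persist them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Protocol


class DeliveryState(Enum):
    PENDING = "pending"
    SENT = "sent"
    FAILED = "failed"


class ProviderAdapter(Protocol):
    """Common shape every provider wrapper is assumed to expose."""
    def send(self, recipient: str, body: str) -> None: ...


class FakeEmailProvider:
    """In-memory stand-in for a real email provider client."""
    def __init__(self, healthy: bool = True):
        self.healthy = healthy
        self.outbox = []

    def send(self, recipient: str, body: str) -> None:
        if not self.healthy:
            raise ConnectionError("provider unavailable")
        self.outbox.append((recipient, body))


@dataclass
class Delivery:
    recipient: str
    body: str
    state: DeliveryState = DeliveryState.PENDING
    history: list = field(default_factory=list)  # audit trail of attempts


def attempt(delivery: Delivery, provider: ProviderAdapter) -> Delivery:
    """Make one delivery attempt and record the outcome on the delivery."""
    try:
        provider.send(delivery.recipient, delivery.body)
    except ConnectionError as exc:
        delivery.state = DeliveryState.FAILED
        delivery.history.append(str(exc))
    else:
        delivery.state = DeliveryState.SENT
        delivery.history.append("sent")
    return delivery
```

Keeping the adapter behind one interface means the pipeline can swap or add providers without touching routing logic, and the `history` list gives support a place to look when a user asks why a message never arrived.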

Trade-offs

The decisions that come up every time.

Notification choice	Good when	Weak when	Interview line
Single direct send path	The system is tiny and there is one simple channel.	Multiple channels, retries, and provider failures matter.	A direct send path is fine only for the smallest first version.
One queue for all notifications	Traffic is low and channel behavior is similar.	Channels have very different retry patterns, priorities, or providers.	One queue is simpler, but separate channel queues give better isolation and control.
Per-channel queues and workers (default)	Channels fail differently and should scale or retry independently.	The system is so small the extra queues add unnecessary overhead.	Per-channel queues help me isolate failures and tune retries independently.
Immediate send for every event	The event is transactional and user-facing timing matters.	Many notifications could be digested, batched, or deprioritized.	I would send urgent transactional notifications immediately, but batch or digest lower-priority ones.
Retry with dedupe (default)	Provider failures are transient and eventual delivery matters.	Retries are unsafe and duplicate protection is weak.	Retries improve delivery, but only if I can prevent duplicates cleanly.
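
The immediate-versus-digest row in the table can be illustrated with a small router. This is a sketch under assumptions: the set of transactional event kinds (including `password_reset`, which does not come from this chapter) and the digest format are invented for the example.

```python
from collections import defaultdict

# Hypothetical classification of event kinds; a real system would configure this.
TRANSACTIONAL = {"order_shipped", "password_reset"}


class Router:
    def __init__(self):
        self.immediate = []               # (user_id, body) pairs to send right away
        self.digest = defaultdict(list)   # user_id -> pending low-priority bodies

    def route(self, kind: str, user_id: str, body: str) -> None:
        """Send transactional events immediately; hold everything else for a digest."""
        if kind in TRANSACTIONAL:
            self.immediate.append((user_id, body))
        else:
            self.digest[user_id].append(body)

    def flush_digest(self, user_id: str):
        """Collapse a user's pending low-priority items into one message."""
        items = self.digest.pop(user_id, [])
        if not items:
            return None
        return f"{len(items)} updates: " + "; ".join(items)
```

A scheduler would call `flush_digest` on whatever cadence the product chooses, so the user receives one summary instead of several low-priority pings.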

Deep dive

Retries are useful only if the user does not get spammed.

This is the central reliability question in the chapter. Temporary provider failures are common; duplicate user-visible sends are what destroy trust.

Retries improve delivery reliability only if the system can recognize that the same notification intent on the same channel should not be sent twice.
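
One way to recognize "the same notification intent on the same channel" is a dedupe key derived from the event, the recipient, and the channel. The sketch below assumes an in-memory `DedupeStore`; in production this would more plausibly be a durable shared store, and all names here are hypothetical.

```python
import hashlib


def dedupe_key(event_id: str, user_id: str, channel: str) -> str:
    """Same intent on the same channel always maps to the same key."""
    raw = f"{event_id}:{user_id}:{channel}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DedupeStore:
    """In-memory stand-in for a durable store of already-sent keys."""
    def __init__(self):
        self._sent = set()

    def already_sent(self, key: str) -> bool:
        return key in self._sent

    def record_sent(self, key: str) -> None:
        self._sent.add(key)


def send_once(store: DedupeStore, key: str, send) -> str:
    """Call `send` at most once per key, as long as the record step succeeds."""
    if store.already_sent(key):
        return "skipped_duplicate"
    send()                  # may raise; nothing is recorded, so a retry can try again
    store.record_sent(key)
    return "sent"
```

This ordering favors delivery: a failed send leaves no record, so the retry goes through. The cost is a small window in which the send succeeds but the record is lost, which could still produce a duplicate. Passing the same key to the provider as an idempotency key, where the provider supports one, is a common way to narrow that window.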

Mini case study

Order shipped event.

This is the clean routing test. One product event may produce several channel-specific attempts, each with its own failure behavior.

What should happen

Order service emits order shipped.
Notification system determines the recipient.
Preferences are checked for email, push, SMS, or in-app.
Channel-specific jobs are created.

What can go wrong

Email provider times out.
Push succeeds but email fails.
A retry sends duplicate email.

What helps

Per-channel queues.
Retry policy per channel.
Dedupe keys per notification intent and channel.
Delivery logs.

The lesson

One notification event is a routing problem, not one direct API call.
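
The "retry policy per channel" and "delivery logs" items from this case study can be sketched together. The attempt counts and delays below are made-up numbers, `TimeoutError` stands in for whatever a real provider client raises, and the `sleep` parameter is injected so the example runs without waiting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    base_delay_s: float
    max_delay_s: float

    def delay_before(self, attempt: int) -> float:
        """Exponential backoff; attempt 2 is the first retry."""
        return min(self.max_delay_s, self.base_delay_s * 2 ** (attempt - 2))


# Hypothetical per-channel settings: email is patient, push gives up quickly.
POLICIES = {
    "email": RetryPolicy(max_attempts=5, base_delay_s=30.0, max_delay_s=600.0),
    "push": RetryPolicy(max_attempts=3, base_delay_s=1.0, max_delay_s=10.0),
}


def deliver(channel: str, send, sleep=lambda seconds: None) -> list:
    """Try to send under the channel's policy and return a delivery log."""
    policy = POLICIES[channel]
    log = []
    for attempt in range(1, policy.max_attempts + 1):
        if attempt > 1:
            sleep(policy.delay_before(attempt))
        try:
            send()
        except TimeoutError:
            log.append((attempt, "timeout"))
        else:
            log.append((attempt, "sent"))
            return log
    log.append((policy.max_attempts, "gave_up"))
    return log
```

The returned log is the kind of record that answers "push succeeded but email failed" questions later. Production retry loops usually also add random jitter to the delay, which this sketch leaves out for clarity.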

Demo conversation

How a strong exchange sounds.

Interviewer

What is the biggest framing mistake people make here?

Candidate

Treating notification delivery like one direct provider call. The real system is intake, preference checks, routing, and per-channel delivery with independent failure handling.

Interviewer

Why separate channels so early?

Candidate

Because push, email, SMS, and in-app differ in urgency, throughput, provider behavior, and retry rules. Mixing them too early hides both failures and trade-offs.

Interviewer

What is the main reliability risk?

Candidate

Retries without dedupe. Reliability is good only if we can retry safely without spamming the user or duplicating the same notification intent.

Worked example to solo answer

Fade the support before the real practice.

Do not jump straight from reading to a full answer. First see the shape, then complete part of it, then answer alone.

I do

Study the model move.

I would say: "The order event creates a notification job, preferences choose channels, and each channel has its own retry policy."

We do

Complete the missing piece.

For ride sharing, separate urgent trip notifications from marketing or summary messages.

You do

Answer without notes.

Answer the practice prompt with one provider failure and one duplicate-prevention strategy.

Practice

Try it before you read the model answer.

Prompt

Design a notification system for a ride-sharing app.

What should be immediate?
What channels would you separate?
What failure would you design around first?

Strong model answer

I would treat ride status updates as immediate notifications because timing matters directly to the user experience. I would separate push, SMS, and email into different delivery paths because their urgency, provider behavior, and retry logic differ. The first failure I would design around is provider failure or timeout, because the system should retry safely and possibly fall back to another channel without sending duplicates or blocking the whole pipeline.
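
The "fall back to another channel" idea in the model answer can be sketched as an ordered list of channels that stops at the first success. `ProviderError` and the sender callables are hypothetical stand-ins for real provider adapters.

```python
class ProviderError(Exception):
    """Raised by a (hypothetical) provider adapter on failure or timeout."""


def notify_with_fallback(message: str, channels: list, senders: dict) -> list:
    """Try channels in order and stop at the first success."""
    attempts = []
    for channel in channels:
        try:
            senders[channel](message)
        except ProviderError:
            attempts.append((channel, "failed"))
        else:
            attempts.append((channel, "sent"))
            break  # one successful channel is enough; do not send again
    return attempts
```

For a ride update, calling this with `["push", "sms"]` tries push first and uses SMS only if push fails. One caveat worth saying out loud in the interview: a timeout is ambiguous, because the first provider may have delivered the message after all, so fallback trades a small duplicate risk across channels for better timeliness.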

Training loop

Make this chapter stick.

Before moving on, turn recognition into production. Close the model answer, answer from memory, then retry one small slice.

Recall

Say the chapter's core idea without looking. Then name one related idea from an earlier chapter.

Vary

Change one constraint in the practice prompt and answer again in half the time.

Score

Use the rubric to pick one dimension below 3, then retry only that dimension.

Recap

Three things to take into the room.

Intent first.

Do not jump straight to provider calls.

Isolate channels.

Different channels fail differently and deserve independent control.

Retry safely.

Reliability without dedupe becomes spam.

Reusable interview line

"I would separate notification intent from delivery, check user preferences before channel fan-out, and keep retries isolated and deduplicated so one provider failure does not spam the user or stall the whole system."
