Design a Notification Service
Sending user notifications is a common requirement in system design. Design a notification service for an organization. The system will use shared services for the underlying messaging channels (email, SMS, push notifications, and so on), so the actual messaging implementation does not need to be designed. The system should support a user publishing a notification to a single user or to groups of users. Notifications can be triggered manually via a web UI or programmatically via an API. Users should be able to view the notifications they have published in the past. If a user is unable to receive a notification when it is sent, they should still receive it at the next opportunity and not miss the message. The notification service should scale to billions of notifications per day, deliver messages within a few seconds, and provide five 9s (99.999%) uptime.
Designing a scalable notification service involves building a robust architecture that can handle high volumes, support multiple channels (email, SMS, push notifications), and ensure messages are delivered reliably. Given the requirements, a pub/sub model with message queues, such as Apache Kafka topics, is a central component for decoupling message production (publishing notifications) from message consumption (delivering notifications to users). This design allows for horizontal scaling, fault tolerance, and efficient processing of large volumes of notifications. Here’s a detailed breakdown of the design:
Pub/Sub with Kafka Topics
In this architecture, Kafka can be used as the backbone for the pub/sub system. When a user publishes a notification, it is sent to a Kafka topic. Each type of notification (email, SMS, push) can have its own topic, or you could have a general topic and filter messages based on their type. Kafka’s partitioning allows you to scale out consumers across multiple servers, ensuring that high throughput can be managed. For example:
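Notifications are published either to a per-channel topic (one each for email, SMS, and push) or to a single notifications topic whose messages carry a channel field. Partitions are better keyed by recipient than by channel, so that load spreads evenly across partitions and each user's notifications stay in order.
A consumer group for each channel reads its own topic (or reads the shared topic and skips messages meant for other channels) and hands each message to the appropriate external delivery service.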
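The routing step can be sketched in a few lines. This is a minimal illustration, assuming one topic per channel (the first option mentioned above) and partitioning by recipient. The topic names, the `route` helper, and the partition count of 12 are hypothetical, and the SHA-256 hash is only a stand-in for whatever partitioner the Kafka client actually uses.

```python
import hashlib

# Hypothetical topic names, one per delivery channel.
CHANNEL_TOPICS = {
    "email": "notifications.email",
    "sms": "notifications.sms",
    "push": "notifications.push",
}


def topic_for(channel):
    """Return the topic that carries notifications for this channel."""
    if channel not in CHANNEL_TOPICS:
        raise ValueError(f"unsupported channel: {channel!r}")
    return CHANNEL_TOPICS[channel]


def partition_for(recipient_id, num_partitions):
    """Map a recipient to a partition with a stable hash.

    The same recipient always lands on the same partition, which keeps
    that user's notifications in order.
    """
    digest = hashlib.sha256(recipient_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def route(notification, num_partitions=12):
    """Pick the (topic, partition) pair for one notification."""
    topic = topic_for(notification["channel"])
    partition = partition_for(notification["recipient_id"], num_partitions)
    return topic, partition


if __name__ == "__main__":
    print(route({"channel": "sms", "recipient_id": "user-42"}))
```

Keying by recipient rather than by channel means that adding partitions increases parallelism for every channel, instead of capping each channel at one partition.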
Managing User Notifications
To handle both individual and group notifications, a metadata database (a relational database such as PostgreSQL or a NoSQL store such as MongoDB) can hold user information and group memberships. When a notification is published to a group, a lookup expands the group into its member users, and one message per recipient is then published to the corresponding Kafka topic for delivery. This design allows flexible targeting of notifications, and user-group relationships can change without any change to the messaging pipeline.
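The group lookup amounts to a fan-out step that expands targets into a de-duplicated list of recipients. The sketch below assumes the memberships have already been loaded into a dictionary; the `Target` type, the `resolve_recipients` function, and the sample groups are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    kind: str  # "user" or "group"
    id: str


# Stand-in for rows loaded from the metadata database.
GROUP_MEMBERS = {
    "oncall": ["alice", "bob"],
    "eng": ["bob", "carol"],
}


def resolve_recipients(targets, group_members):
    """Expand user and group targets into unique user IDs.

    Order of first appearance is preserved, and a user who is reachable
    through several targets is only listed once.
    """
    seen = set()
    recipients = []
    for target in targets:
        if target.kind == "user":
            candidates = [target.id]
        elif target.kind == "group":
            candidates = group_members.get(target.id, [])
        else:
            raise ValueError(f"unknown target kind: {target.kind!r}")
        for user_id in candidates:
            if user_id not in seen:
                seen.add(user_id)
                recipients.append(user_id)
    return recipients


if __name__ == "__main__":
    targets = [Target("group", "oncall"), Target("user", "bob"), Target("group", "eng")]
    print(resolve_recipients(targets, GROUP_MEMBERS))
```

De-duplicating at this stage keeps a user who belongs to several targeted groups from receiving the same notification more than once.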
Ensuring Message Delivery & Reliability
Since users should receive notifications even if they are temporarily unreachable (e.g., their phone is off), the service should ensure retries and message persistence:
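Kafka retains messages in a topic for a configured retention period, whether or not they have been read. At-least-once delivery comes from consumers committing their offsets only after a message has been processed successfully: if a consumer crashes mid-delivery, the message is read again after restart rather than lost. Because this can produce duplicates, each notification should carry a unique ID so that repeated deliveries can be detected and dropped.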
For further reliability, dead-letter queues can be used to store messages that have failed to deliver after a certain number of retries. These can then be processed separately or retried later.
Additionally, if real-time notification delivery is crucial, Redis can be used as a temporary cache to store recent undelivered notifications for quick lookup, allowing users to retrieve pending notifications as soon as they become reachable.
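The retry and dead-letter behaviour can be sketched without a real broker. In this minimal simulation a Python list stands in for a topic partition, and the offset is advanced only after a message has either been delivered or moved to the dead-letter list. The names `consume`, `backoff_seconds`, and `TransientDeliveryError`, as well as the retry limit and delay values, are assumptions for illustration and not part of any Kafka client API.

```python
import time


class TransientDeliveryError(Exception):
    """Raised by the sender when a delivery attempt may succeed later."""


def backoff_seconds(attempt, base=0.5, cap=30.0):
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap`."""
    return min(cap, base * (2 ** (attempt - 1)))


def consume(log, start_offset, send, max_attempts=3, sleep=time.sleep):
    """Process messages from `log`, retrying failed deliveries.

    Returns (committed_offset, dead_letters). The offset moves forward
    only once a message is delivered or parked in the dead-letter list,
    so a crash before that point would cause a re-read, not a loss.
    """
    committed = start_offset
    dead_letters = []
    for offset in range(start_offset, len(log)):
        message = log[offset]
        for attempt in range(1, max_attempts + 1):
            try:
                send(message)
                break  # delivered
            except TransientDeliveryError:
                if attempt == max_attempts:
                    dead_letters.append(message)  # give up for now
                else:
                    sleep(backoff_seconds(attempt))
        committed = offset + 1  # "commit" only after handling the message
    return committed, dead_letters


if __name__ == "__main__":
    def always_down(message):
        raise TransientDeliveryError(message)

    print(consume(["hello"], 0, always_down, sleep=lambda seconds: None))
```

Messages in the dead-letter list are not discarded; a separate job can replay them once the recipient or the downstream channel is reachable again.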
Viewing Past Notifications
To support the requirement for users to view their past published notifications, a database is needed to store a record of all notifications sent. This could be implemented with a relational database like PostgreSQL if complex queries and relationships are needed (e.g., users, groups, and notification history). Alternatively, Elasticsearch could be used if fast searches and filtering of notification records are critical, as it can provide full-text search capabilities.
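A possible shape for the history store is sketched below. It uses Python's built-in `sqlite3` module purely so the example runs without a server; the text above suggests PostgreSQL for a real deployment. The table layout, column names, and the `record` and `history` helpers are hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per published notification.
SCHEMA = """
CREATE TABLE notifications (
    id         INTEGER PRIMARY KEY,
    sender_id  TEXT NOT NULL,
    channel    TEXT NOT NULL,
    body       TEXT NOT NULL,
    created_at TEXT NOT NULL  -- ISO-8601 UTC, so string order is time order
);
CREATE INDEX idx_notifications_sender_time
    ON notifications (sender_id, created_at DESC);
"""


def open_store():
    """Create an in-memory store with the schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record(conn, sender_id, channel, body, created_at):
    """Persist one published notification and return its row ID."""
    cursor = conn.execute(
        "INSERT INTO notifications (sender_id, channel, body, created_at) "
        "VALUES (?, ?, ?, ?)",
        (sender_id, channel, body, created_at),
    )
    conn.commit()
    return cursor.lastrowid


def history(conn, sender_id, limit=20, before=None):
    """Return a sender's notifications, newest first.

    `before` is an optional timestamp used for keyset pagination: pass the
    `created_at` of the last row on the previous page.
    """
    sql = "SELECT id, channel, body, created_at FROM notifications WHERE sender_id = ?"
    params = [sender_id]
    if before is not None:
        sql += " AND created_at < ?"
        params.append(before)
    sql += " ORDER BY created_at DESC LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = open_store()
    record(conn, "alice", "email", "Deploy finished", "2024-01-01T10:00:00Z")
    print(history(conn, "alice"))
```

The composite index on sender and time matches the only query the requirement needs, which is listing one user's published notifications in reverse chronological order.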
High Scalability and Uptime
Kafka’s distributed nature ensures that the system can scale horizontally to handle billions of notifications per day. By adding more partitions and consumer instances, you can increase throughput.
To achieve five 9s uptime (99.999%, roughly five minutes of downtime per year), deploy Kafka with replication across brokers so that a single broker failure does not lose data. Deploying the brokers, the database, and the notification consumers across multiple availability zones protects the service against the loss of a zone; surviving the outage of an entire region additionally requires a multi-region deployment.
Load balancers are critical for distributing requests across different servers, preventing any single instance from being overwhelmed.
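The scale and uptime targets translate into concrete numbers with some back-of-envelope arithmetic. The sketch below converts a daily volume into an average per-second rate, a partition count, and a yearly downtime budget. The per-partition throughput of 1,000 messages per second and the 2x headroom for peaks are assumed figures for illustration, not measurements.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(notifications_per_day):
    """Average throughput implied by a daily volume."""
    return notifications_per_day / SECONDS_PER_DAY


def partitions_needed(rate_per_second, per_partition_rate, headroom=2.0):
    """Partitions required to carry the rate with spare capacity for peaks."""
    return math.ceil(rate_per_second * headroom / per_partition_rate)


def allowed_downtime_minutes_per_year(availability, days_per_year=365):
    """Downtime budget for an availability target such as 0.99999."""
    return (1 - availability) * days_per_year * 24 * 60


if __name__ == "__main__":
    rate = average_rate_per_second(1_000_000_000)
    print(f"average rate: {rate:,.0f} notifications/s")
    print(f"partitions:   {partitions_needed(rate, per_partition_rate=1_000)}")
    print(f"downtime:     {allowed_downtime_minutes_per_year(0.99999):.2f} min/year")
```

One billion notifications per day averages out to about 11,574 per second, and five 9s leaves about 5.26 minutes of downtime per year, which is why failover has to be automatic rather than manual.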
Summary
In this design, Kafka handles message ingestion and queuing, providing scalability and resilience. Consumer services read from Kafka and make API calls to send notifications via the appropriate channel (email, SMS, push). A database stores metadata about users and notifications for tracking, while retry mechanisms ensure messages are eventually delivered, even if initial attempts fail. This architecture balances reliability, scalability, and speed, making it suitable for a large-scale notification system with high uptime and quick delivery requirements.