AWS Notification Platform

A large-scale, highly available, and extensible multi-channel notification solution.

Figure 1: AWS Notification Platform Architecture

Note: AWS Certificate Manager (ACM) is a regional service, meaning certificates are issued and managed per AWS region. When using Amazon CloudFront (a global service), the TLS certificate must be provisioned specifically in the us-east-1 region, regardless of where the origin AWS resources are hosted. This allows a single ACM certificate in us-east-1 to secure traffic for both the primary and secondary (DR) regions.

However, services such as Amazon Cognito require the ACM certificate to be in the same region as the User Pool when using a custom domain (e.g., www.13techs.com). Because Cognito is deployed in both the primary and secondary regions in our architecture, we must provision separate ACM certificates for the same custom domain in each of those regions specifically for Cognito.

Introduction

Purpose

The AWS Notification Platform is designed to empower businesses and services to deliver seamless, real-time notifications to their users across multiple channels. Whether it's an e-commerce platform handling millions of order updates, a financial institution providing transaction alerts, or a social media app sending personalized notifications, this platform ensures reliability, scalability, and user-centric delivery of messages.

Audience

This document is intended for the following audiences who are responsible for building, deploying, and maintaining the Notification Platform:

  • Architects
  • Developers
  • DevOps engineers
  • System administrators
  • Business stakeholders
  • Compliance teams
  • Operational teams

Pre-requisites

| Category | Details |
| --- | --- |
| AWS Account Setup | An active AWS account with appropriate permissions for resource provisioning. |
| IAM Configuration | IAM roles for GitHub Actions with OIDC integration; roles for Lambda functions, CodePipeline, and other services with least privilege. |
| Source Control | A GitHub repository containing the application code, configuration files, and CloudFormation templates. |
| CI/CD Configuration | GitHub Actions workflow files for build, test, and deployment steps; pre-configured AWS CodePipeline and CodeBuild projects. |
| Networking Requirements | Route 53 hosted zones for DNS routing. |
| Monitoring and Logging Setup | CloudWatch logs and metrics for Lambda, API Gateway, and DynamoDB; CloudTrail enabled for API call tracking. |
| Security Measures | WAF and Shield Standard for DDoS protection; SSL/TLS certificates provisioned via AWS ACM. |
| Disaster Recovery | Secondary region infrastructure templates for failover; Route 53 health checks configured. |
| Budgeting and Cost Monitoring | AWS Budgets and CloudWatch alarms for cost control; Systems Manager Runbooks for automated remediation. |
| Development Environment | Tools like AWS CLI, AWS SAM, and IDEs configured for local development. |

Business Objectives

Organizations today require an efficient, scalable, and cost-effective way to communicate with customers, employees, and stakeholders across multiple channels. The Multi-Channel Notification Platform ensures seamless, reliable, and targeted message delivery through various communication methods. Below are the key business objectives:

  • Unified Customer Engagement: Businesses interact with customers via multiple touchpoints. A unified notification platform ensures that transactional alerts, promotional messages, and service updates reach users via Email, SMS, and Push Notifications, maximizing engagement and response rates.

    Example: An e-commerce company sending order confirmations via email, delivery updates via SMS, and promotional offers via push notifications.

  • Improved Operational Efficiency: Managing different notification channels individually creates complexity, overhead, and maintenance challenges. A centralized platform eliminates these inefficiencies by providing a single integration point for multiple messaging services.

    Example: A financial institution automating fraud alerts, reducing the need for manual intervention in customer communications.

  • Scalability & High Availability: Organizations with large user bases require a notification system that can handle millions of messages while maintaining high availability. This architecture is designed with AWS auto-scaling, multi-region failover, and event-driven processing to ensure uninterrupted service.

    Example: A global ride-sharing app notifying drivers and riders of trip updates in real-time, ensuring seamless coordination.

  • Personalization & Context-Aware Messaging: Effective communication depends on sending the right message to the right person at the right time. The platform enables dynamic message routing based on customer preferences, behavior, and geographical location.

    Example: A healthcare provider sending appointment reminders via SMS and follow-up instructions via email.

  • Regulatory Compliance & Security: Businesses handling sensitive customer data must ensure compliance with industry regulations like GDPR, HIPAA, and PCI-DSS. The platform integrates security best practices, including authentication, encryption, and logging.

    Example: A banking application sending transaction alerts while ensuring end-to-end encryption and audit logging.

  • Cost Optimization: Adopting a serverless, event-driven architecture minimizes idle infrastructure costs, optimizes compute usage, and leverages AWS cost-saving mechanisms such as Savings Plans and Reserved Capacity.

    Example: A SaaS company reducing operational costs by switching from self-managed notification servers to AWS serverless solutions.

By addressing these business objectives, the Multi-Channel Notification Platform provides an enterprise-grade messaging solution that enhances user engagement, streamlines operations, and ensures cost-effectiveness.

Design Overview

System Users and Customers

The Notification Platform serves two distinct categories of users:

Ingestion Users (Producers of Notifications)

These are the services, applications, or teams that generate and send notifications through the platform.

| User Type | Description | Examples |
| --- | --- | --- |
| Internal Applications | Applications within the organization that trigger notifications. | CRM, ERP, E-commerce platforms |
| Third-Party Services | External businesses integrating with the notification API. | SaaS applications, partner integrations |
| DevOps & Security Teams | Infrastructure monitoring and security alerts. | CloudWatch, GuardDuty alerts |
| Automated Workflows | Event-driven triggers from other AWS services. | AWS Step Functions, Lambda functions |

Egression Users (Recipients of Notifications)

These are the end-users or customer-facing entities that receive notifications via various channels.

| User Type | Description | Examples |
| --- | --- | --- |
| End Customers | The final recipients of messages via Email, SMS, or Push Notifications. | Online shoppers, registered app users |
| IT & Operations Teams | Internal personnel receiving operational alerts. | DevOps teams, system admins |
| Business Users | Employees receiving business communications. | HR, Sales, Customer Support teams |
| IoT & Devices | Smart devices and applications receiving push notifications. | iPhone (Apple Push Notification service), Galaxy (Firebase Cloud Messaging), Kindle (Amazon Device Messaging) |

Architectural Guidelines

The table below summarizes the key architectural best practices to ensure a scalable, cost-efficient, and reliable system. These guidelines are intended for Developers, DevOps Engineers, Security Teams, and Infrastructure Architects.

| Category | Guideline |
| --- | --- |
| Provisioning Resources | Always consider AWS service quotas, such as Lambda execution time (15 minutes max), API Gateway request limits (10,000 RPS), and SNS subscription limits (12.5 million per topic). |
| Performance & Latency | Avoid cold start penalties by keeping Lambda package sizes small and leveraging Provisioned Concurrency where necessary. |
| High Availability | Route 53 failover and multi-region support should be validated periodically to ensure seamless disaster recovery. |
| Cost Optimization | Use AWS Savings Plans, optimized CloudWatch logging, and caching strategies to reduce unnecessary compute and data transfer costs. |
| Security & Compliance | Adhere to AWS IAM policies, secure sensitive data with encryption, and monitor activity using AWS CloudTrail. |
| Event-Driven Processing | Leverage SQS + Lambda for long-running tasks instead of direct API calls. Use EventBridge rule-based routing to scale event handling efficiently. |
| Logging & Observability | Enable structured logging (JSON format) for easier queries in CloudWatch Logs. Use AWS X-Ray to trace request execution across microservices. |
| RESTful API Design Best Practices | Implement API versioning, ensure proper RESTful principles (at least Level-2 REST), return appropriate HTTP status codes, enable pagination, enforce authentication, and use OpenAPI documentation (Swagger). |
| CI/CD & Deployment | Secure GitHub authentication using OpenID Connect with AWS. Use automated tests before deployment and maintain CloudFormation templates for version-controlled infrastructure. |
| Infrastructure as Code (IaC) | Use AWS CloudFormation to define infrastructure, store environment variables in AWS Parameter Store, and automate deployments using AWS CodePipeline or GitHub Actions. |
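
The GitHub-to-AWS OpenID Connect guideline can be illustrated with the trust policy an IAM role would carry. The following is a minimal sketch, not the platform's actual policy: the account ID, repository name, and branch are placeholder assumptions, while the provider host, audience, and `sub` claim format follow the standard GitHub Actions OIDC setup.

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Build an IAM trust policy letting one GitHub repo/branch assume a role via OIDC."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Only tokens minted for AWS STS are accepted.
                    "StringEquals": {f"{provider}:aud": "sts.amazonaws.com"},
                    # Restrict the role to a single repository and branch.
                    "StringLike": {
                        f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}"
                    },
                },
            }
        ],
    }


# Placeholder account ID and repository name (assumptions for illustration).
policy = github_oidc_trust_policy("123456789012", "example-org/notification-platform")
print(json.dumps(policy, indent=2))
```

Scoping the `sub` condition to one repository and branch keeps the role aligned with the least-privilege requirement listed in the pre-requisites.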

API Endpoints

The following RESTful API endpoints are available for sending notifications through the platform:

  • Login: https://notifications.example.com/api/v1/login
  • Email Notifications: https://notifications.example.com/api/v1/notifications/email
  • SMS Notifications: https://notifications.example.com/api/v1/notifications/sms
  • Push Notifications: https://notifications.example.com/api/v1/notifications/push
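
As an illustration of how a client might call the email endpoint, the sketch below builds (but does not send) an HTTP request with the Python standard library. The payload field names (`to`, `subject`, `body`) and the use of a bearer token in the `Authorization` header are assumptions; the document does not specify the request schema.

```python
import json
import urllib.request

BASE_URL = "https://notifications.example.com/api/v1"


def build_email_request(token: str, recipient: str, subject: str, body: str) -> urllib.request.Request:
    """Prepare a POST request for the email notification endpoint."""
    # Hypothetical payload shape; adjust to the real API contract.
    payload = {"to": recipient, "subject": subject, "body": body}
    return urllib.request.Request(
        url=f"{BASE_URL}/notifications/email",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_email_request(
    token="<token obtained from the login endpoint>",
    recipient="user@example.com",
    subject="Order shipped",
    body="Your order is on its way.",
)
# urllib.request.urlopen(request) would send it; omitted so the sketch has no side effects.
print(request.get_method(), request.full_url)
```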

Workflow Steps

This section outlines the step-by-step workflow of the notification platform. For a detailed depiction, refer to Figure 1 (AWS Notification Platform Architecture):

1. Request Ingestion

  1. A client sends a notification request to the platform via the CloudFront URL, which serves as the global entry point for the website.
  2. CloudFront routes the request to the API Gateway, ensuring low latency and global availability.

2. Authentication and Authorization

  1. API Gateway validates the request using Amazon Cognito (OAuth 2.0/OIDC) to authenticate external users.
  2. If authentication is successful, the JWT issued by Cognito is passed to the next stage.
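
To make the JWT step more concrete, the sketch below inspects the claims a Cognito-issued token typically carries (`iss`, `token_use`, `exp`). It is illustrative only: it does not verify the signature, which API Gateway's Cognito authorizer or a JWT library must do, and the user pool ID, the demo token, and the choice of `access` as the expected `token_use` are assumptions.

```python
import base64
import json
import time
from typing import Optional


def decode_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def claims_are_acceptable(
    claims: dict,
    expected_issuer: str,
    expected_use: str = "access",
    now: Optional[float] = None,
) -> bool:
    """Check issuer, token type, and expiry of already-verified claims."""
    now = time.time() if now is None else now
    return (
        claims.get("iss") == expected_issuer
        and claims.get("token_use") == expected_use
        and claims.get("exp", 0) > now
    )


def _segment(obj: dict) -> str:
    """Encode a dict as an unpadded base64url JWT segment (demo helper)."""
    raw = json.dumps(obj).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hypothetical user pool and an unsigned demo token, for illustration only.
ISSUER = "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_EXAMPLE"
demo_token = ".".join([
    _segment({"alg": "RS256"}),
    _segment({"iss": ISSUER, "token_use": "access", "exp": 2_000_000_000}),
    "fake-signature",
])

claims = decode_claims(demo_token)
print(claims_are_acceptable(claims, ISSUER, now=1_700_000_000))  # True
```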

3. Event Processing

  1. The validated request is forwarded to Amazon EventBridge (Primary Event Bus) for event routing.
  2. EventBridge evaluates the event and routes it based on pre-defined rules (e.g., by channel type such as Email, SMS, or Push Notification).
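
Rule-based routing by channel can be sketched with EventBridge-style event patterns. The matcher below is a deliberately simplified stand-in for EventBridge's own evaluation (exact-value lists and nested objects only), and the rule names, the `notifications.api` source, and the `detail` fields are assumptions for illustration.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified EventBridge-style matching: every pattern key must match.

    Supports only exact-value lists and nested objects, not the full
    pattern language (prefix, numeric, anything-but, and so on).
    """
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical rules: one per notification channel.
RULES = {
    "email-rule": {"source": ["notifications.api"], "detail": {"channel": ["email"]}},
    "sms-rule": {"source": ["notifications.api"], "detail": {"channel": ["sms"]}},
    "push-rule": {"source": ["notifications.api"], "detail": {"channel": ["push"]}},
}


def matching_rules(event: dict) -> list:
    """Return the names of all rules whose pattern matches the event."""
    return [name for name, pattern in RULES.items() if matches(pattern, event)]


event = {
    "source": "notifications.api",
    "detail-type": "NotificationRequested",
    "detail": {"channel": "sms", "userId": "u-123"},
}
print(matching_rules(event))  # ['sms-rule']
```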

4. Message Preparation

  1. If user preferences are required (e.g., language or time zone), a Lambda function retrieves these preferences from DynamoDB Global Tables.
  2. The Lambda function processes the event, formats the message, and sends it to the appropriate SNS Topic for routing.
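
The message preparation step might look roughly like the following. This is a sketch under stated assumptions: the templates, the event fields (`userId`, `name`, `orderId`, `channel`), and the topic ARN are hypothetical, and the preferences dict stands in for an item read from DynamoDB. The returned dict uses the parameter names of the SNS `Publish` API (`TopicArn`, `Message`, `MessageAttributes`).

```python
import json

# Hypothetical per-language templates.
TEMPLATES = {
    "en": "Hello {name}, your order {order_id} has shipped.",
    "es": "Hola {name}, tu pedido {order_id} ha sido enviado.",
}


def prepare_publish_args(detail: dict, preferences: dict, topic_arn: str) -> dict:
    """Format the message in the user's language and build SNS Publish arguments."""
    language = preferences.get("language", "en")
    template = TEMPLATES.get(language, TEMPLATES["en"])  # fall back to English
    text = template.format(name=detail["name"], order_id=detail["orderId"])
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps({"userId": detail["userId"], "text": text}),
        # The channel attribute lets SNS subscription filters route the message.
        "MessageAttributes": {
            "channel": {"DataType": "String", "StringValue": detail["channel"]}
        },
    }


detail = {"userId": "u-123", "name": "Ana", "orderId": "A-42", "channel": "email"}
preferences = {"language": "es"}  # stand-in for a DynamoDB lookup result
args = prepare_publish_args(
    detail, preferences, "arn:aws:sns:us-east-1:123456789012:notifications"
)
# In a Lambda function this would be sent with boto3.client("sns").publish(**args).
print(args["Message"])
```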

5. Channel Routing

  1. Amazon SNS fans out the message to the corresponding SQS queues (e.g., EmailQueue, SMSQueue, PushQueue) based on the notification type.

6. Channel-Specific Handling

  1. Email Notifications: Messages in the EmailQueue are consumed by a Lambda function that uses Amazon SES to send emails.
  2. SMS Notifications: Messages in the SMSQueue are consumed by another Lambda function that interacts with SNS SMS APIs to send text messages.
  3. Push Notifications: Messages in the PushQueue are processed by a Lambda function that interfaces with APNS (Apple Push Notification Service) or FCM (Firebase Cloud Messaging) for delivery to mobile devices.
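
A channel consumer can be sketched as an SQS-triggered Lambda handler. The example assumes that raw message delivery is disabled on the SNS subscription (so each SQS body is an SNS envelope with a `Message` field) and that `ReportBatchItemFailures` is enabled on the event source mapping, so only failed messages are retried. The `send` callable is a hypothetical stand-in for the SES, SNS SMS, or push provider call.

```python
import json
from typing import Callable


def make_handler(send: Callable[[dict], None]):
    """Create a Lambda handler that delivers each queued notification via `send`."""

    def handler(event: dict, context: object = None) -> dict:
        failures = []
        for record in event.get("Records", []):
            try:
                envelope = json.loads(record["body"])           # SNS envelope in the SQS body
                notification = json.loads(envelope["Message"])  # payload published to the topic
                send(notification)
            except Exception:
                # Report only this message as failed so the rest of the batch is not retried.
                failures.append({"itemIdentifier": record["messageId"]})
        return {"batchItemFailures": failures}

    return handler


# Demo with a fake sender that rejects notifications lacking a recipient.
sent = []


def fake_send(notification: dict) -> None:
    if "to" not in notification:
        raise ValueError("missing recipient")
    sent.append(notification)


def sqs_record(message_id: str, notification: dict) -> dict:
    """Build a minimal SQS record wrapping an SNS envelope (demo helper)."""
    return {"messageId": message_id, "body": json.dumps({"Message": json.dumps(notification)})}


handler = make_handler(fake_send)
result = handler({"Records": [
    sqs_record("m-1", {"to": "user@example.com", "text": "hi"}),
    sqs_record("m-2", {"text": "no recipient"}),
]})
print(result)  # {'batchItemFailures': [{'itemIdentifier': 'm-2'}]}
```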

7. Delivery Confirmation

  1. Each channel (SES, SMS, Push) provides delivery status updates back to CloudWatch and logs the activity for monitoring.

8. Monitoring and Logging

  1. Amazon CloudWatch captures logs, metrics, and alarms for all services involved in the workflow.
  2. AWS CloudTrail tracks API calls and user activity for auditing purposes.
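
The structured (JSON) logging recommended in the architectural guidelines can be sketched with the standard `logging` module. Lambda forwards stdout and stderr to CloudWatch Logs, so one JSON object per line is enough for field-based queries. The logger name and the `context` convention for extra fields are assumptions of this sketch.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} become top-level keys.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("notifications")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "notification sent",
    extra={"context": {"channel": "email", "messageId": "m-1"}},
)
```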

9. Cost and Security Management

  1. AWS Budgets monitors platform costs, triggering Systems Manager Runbooks for automated remediation if thresholds are exceeded.
  2. AWS Shield Standard and AWS WAF protect the platform from DDoS attacks and malicious traffic.

10. Disaster Recovery

  1. In case of primary region failure, Route 53 triggers failover to the secondary region.
  2. Secondary region resources are activated using Systems Manager Runbooks, ensuring uninterrupted service.

11. CI/CD Steps

This section outlines the step-by-step CI/CD workflow of the notification platform. For a detailed depiction, refer to Figure 2 below:

Figure 2: AWS Notification Platform CI/CD Architecture
  1. Source Control:
    1. Developers push code changes to the GitHub repository.
    2. GitHub Actions triggers build, test, and artifact upload workflows.
  2. Artifact Management: Built Lambda packages and CloudFormation templates are stored in Amazon S3.

    Note: Lambda function code is uploaded to S3 instead of embedding it in CloudFormation to enable versioning, multi-region deployment, faster updates, and to avoid CloudFormation size limits for inline resources.

  3. Deployment Orchestration:
    1. AWS CodePipeline fetches artifacts from S3 and orchestrates deployments across primary and secondary regions.
    2. CloudFormation provisions resources in both regions, ensuring readiness for disaster recovery.
  4. Testing and Rollout: AWS CodeDeploy implements targeted canary or rolling deployment strategies for Lambda functions.

System Design Constraints

| Constraint with Limits | Quantification | Design Justification | Alternative Approach and Trade-offs |
| --- | --- | --- | --- |
| Lambda Cold Starts (100-500 ms latency) | ~150 ms for a 10 MB package, ~200 ms increase for every 50 MB increase in package size. | Functions are pre-warmed or use Provisioned Concurrency for critical workflows, ensuring minimal latency. | Fargate: Eliminates cold starts but adds cost and complexity for short-lived workloads. |
| Lambda Execution Time (Max 15 mins) | Tasks requiring >15 minutes must be broken into multiple asynchronous steps. Requires Design Change Approval. | Workflows are segmented using SQS and Step Functions to ensure execution remains within limits. | Step Functions Alone: Adds orchestration but may increase execution complexity for smaller tasks. |
| Lambda Deployment Size (Max 250 MB unzipped) | Cold start latency increases by ~200 ms for every 50 MB increase in package size. | Lambda layers ensure dependencies are optimized, reducing deployment package size and improving performance. | ECS with Fargate: No size limits but adds cost and management overhead. |
| SQS - Max Message Size, Retention, Throughput | 256 KB per message, 4-day retention (default), nearly unlimited throughput (standard queue), up to 3,000 messages/sec with batching (FIFO queue). | Messages are optimized, large payloads are stored in S3, and FIFO queues are used only where strict ordering is required. | EventBridge: Can be used instead of SQS for routing but lacks message retention and deduplication. |
| SES - Sending Rate, Recipient Limit | 1 email/sec and 200 emails per 24 hours (sandbox), account-specific quotas that can be raised on request (production), max 50 recipients per email. | Production access lifts the sandbox restrictions and allows quota increases, and batching strategies are used to split large recipient lists. | Third-party providers (e.g., SendGrid): More flexibility in email handling but higher costs. |
| SNS - Max Throughput, Subscription Scaling | 100 messages/sec per AWS account, 12.5M subscriptions per topic. | Region-based scaling distributes SMS load, and a topic hierarchy splits channels effectively. | Twilio: Can provide more control over SMS delivery but is more expensive. |
| API Gateway Requests (Max 10,000 RPS) | Exceeding 10,000 RPS may cause throttling. | API Gateway usage is monitored, with throttling and quotas adjusted via AWS Support where necessary. | Custom NGINX Proxy: Provides higher limits but requires additional infrastructure management. |
| Route 53 DNS Failover (~30 s latency) | Failover activation delayed by DNS propagation time. | Low TTL settings ensure faster DNS propagation, and automated Runbooks minimize failover downtime. | Global Accelerator: Reduces latency further but adds cost for all traffic routing. |
| CloudFront Request Limits (10 TB/month) | Traffic beyond 10 TB/month results in additional data transfer costs. | Cache optimization minimizes data transfer, ensuring costs are controlled for high-traffic use cases. | Akamai: Provides advanced caching but adds significant costs for integration. |
| Cognito MAU Limits (50K free MAUs) | Additional users incur incremental costs (~$0.0055/user beyond free tier). | User segmentation and cost tracking ensure cost-effective scaling of user authentication. | Custom Identity Provider: Offers flexibility but requires extensive management and integration efforts. |
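
The SQS row above mentions storing large payloads in S3. A minimal sketch of that claim-check approach follows; the in-memory `store` dict is a stand-in for an S3 bucket, and the `s3Key` pointer field and key naming are assumptions for illustration.

```python
import json
import uuid

SQS_MAX_BYTES = 256 * 1024  # message size limit quoted in the table above


def to_queue_message(payload: dict, store: dict) -> str:
    """Return the message body, or a small pointer if the payload is too large."""
    body = json.dumps(payload)
    if len(body.encode("utf-8")) <= SQS_MAX_BYTES:
        return body
    key = f"payloads/{uuid.uuid4()}.json"
    store[key] = body  # stand-in for uploading the object to S3
    return json.dumps({"s3Key": key})


def from_queue_message(message: str, store: dict) -> dict:
    """Resolve a queue message back to the original payload."""
    data = json.loads(message)
    if isinstance(data, dict) and set(data) == {"s3Key"}:
        return json.loads(store[data["s3Key"]])  # stand-in for downloading from S3
    return data


store = {}
small = {"text": "hello"}
large = {"text": "x" * (SQS_MAX_BYTES + 1)}

small_message = to_queue_message(small, store)
large_message = to_queue_message(large, store)
print(len(small_message), len(large_message), len(store))
```

A real payload that happens to consist of a single `s3Key` field would be mistaken for a pointer, so a production version would need a less ambiguous marker.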

Disaster Recovery Analysis

A quantitative disaster recovery (DR) analysis is essential to ensure business continuity and minimize downtime. The following metrics define the DR strategy:

| Metric | Definition | Target Value | Estimation Basis |
| --- | --- | --- | --- |
| Recovery Time Objective (RTO) | Maximum allowable downtime after failure before services must be restored. Estimated based on AWS Route 53 DNS failover time and Lambda cold start latency for reinitialization of services. | ≤ 15 minutes | Route 53 DNS failover time = 10 mins, Lambda cold start and reinitialization = 5 mins. |
| Recovery Point Objective (RPO) | Maximum acceptable data loss measured in time before the failure occurred. Determined based on DynamoDB Global Tables synchronization latency and replication lag across AWS regions. | ≤ 5 minutes | DynamoDB Global Table replication time = ~1-3 seconds, replication consistency settings buffer = 2 mins. |
| Point-in-Time Recovery (PITR) | The ability to recover data at a specific point in time to prevent corruption. Based on the AWS DynamoDB PITR feature, ensuring restoration at a fine granularity level. | Enabled for DynamoDB | DynamoDB PITR allows restoring up to 35 days back with second-level granularity. |
| Failover Automation | Automated redirection of traffic to the secondary region using Route 53. Evaluated based on AWS Route 53 health checks and latency-based routing configurations. | Fully automated | Route 53 health check interval = 30 s, TTL settings = 1 min, overall failover execution time = ~2 mins. |
| Multi-Region Data Replication | Ensuring real-time data synchronization between primary and secondary regions. Estimated using DynamoDB Global Tables replication speed (~1 second) and consistency settings. | DynamoDB Global Tables | Cross-region replication time = ~1 second, consistency guarantees = Strong/Eventual. |
| Backup Frequency | Frequency at which full system snapshots and backups are taken. Based on AWS Backup policy settings for cost-effective redundancy while minimizing data loss risks. | Every 4 hours | AWS Backup scheduled snapshot retention = 7 days, backup completion time = ~10 mins per 100 GB. |
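
The failover estimate in the table can be approximated as detection time plus DNS cache expiry. The sketch below is a rough model, not an AWS formula: it uses the 30-second health check interval and 1-minute TTL from the table and assumes a failure threshold of 3 consecutive failed checks (the Route 53 default), which is not stated in this document.

```python
def failover_seconds(check_interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Worst-case estimate: time to detect the failure plus time for cached DNS to expire."""
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s


# 30 s interval and 60 s TTL come from the table; threshold 3 is an assumption.
estimate = failover_seconds(check_interval_s=30, failure_threshold=3, dns_ttl_s=60)
print(f"{estimate} s (~{estimate / 60:.1f} min)")  # 150 s (~2.5 min)
```

Under these assumptions the estimate lands close to the ~2 minute failover execution time quoted above; lowering the failure threshold or the TTL shortens it further.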

Disaster Recovery Strategy

  • Active-Passive Model: The primary region is always active, while the secondary region remains on standby.
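  • Automated Failover: Route 53 DNS failover redirects traffic to the secondary region once health checks fail, with an estimated failover time of about 2 minutes (30-second health check interval, 1-minute TTL).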
  • Data Replication: DynamoDB Global Tables keep user preferences and configurations synchronized across regions.
  • Lambda Triggers for Synchronization: Ensures that Cognito user pools and other stateful services remain up to date.
  • AWS Backup for Redundancy: S3 and RDS backups are taken periodically and stored cross-region.
  • Budget-Controlled DR Activation: AWS Budgets and Runbooks trigger additional resources in the secondary region only when failover occurs, optimizing cost.

Cost Analysis

The following analysis presents costs for both the Primary Region (Normal Traffic) and Secondary Region (Disaster Recovery) scenarios across all design options.

Projected Cost Reductions

| Approach | Before Optimization (Monthly) | After Optimization (Monthly) | 1-Year Savings | 3-Year Savings |
| --- | --- | --- | --- | --- |
| Serverless | $5,050 | $3,700 | $16,200 | $48,600 |
| Pinpoint | $3,850 | $3,850 | - | - |
| Containerized | $4,200 | $3,500 | $8,400 | $25,200 |
| Hybrid | $4,900 | $3,900 | $12,000 | $36,000 |
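
The savings columns follow directly from the monthly figures: savings = (before - after) × 12 × years. A short sketch of that arithmetic, using only the numbers from the table above:

```python
# (before, after) monthly cost in USD, taken from the table above.
MONTHLY_COST = {
    "Serverless": (5050, 3700),
    "Pinpoint": (3850, 3850),
    "Containerized": (4200, 3500),
    "Hybrid": (4900, 3900),
}


def savings(before: int, after: int, years: int) -> int:
    """Total savings over the given number of years."""
    return (before - after) * 12 * years


for approach, (before, after) in MONTHLY_COST.items():
    one_year = savings(before, after, 1)
    three_year = savings(before, after, 3)
    print(f"{approach}: 1-year ${one_year:,}, 3-year ${three_year:,}")
```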

Comparative Cost Analysis

| Factor | Serverless (Updated) | Containerized | Hybrid | Pinpoint |
| --- | --- | --- | --- | --- |
| Infrastructure Cost | $3,700/month | $5,500/month | $4,400/month | $3,600/month |
| Operational Cost | Low: Managed services reduce manual intervention. ($400/month). | High: Kubernetes expertise required. ($1,600/month). | Moderate: Mixed overhead. ($1,000/month). | Very Low: Pinpoint minimizes staff effort. ($200/month). |
| Scaling Cost | Minimal incremental costs. Fully serverless. ($600/month). | Higher: Kubernetes scaling requires tuning. ($1,400/month). | Moderate: Scaling cost split. ($1,000/month). | Moderate: Simple campaign-based scaling. ($800/month). |
| Maintenance Cost | Low: Serverless reduces maintenance overhead. ($200/month). | High: Requires ongoing updates. ($1,200/month). | Moderate: Mixed maintenance. ($800/month). | Very Low: Fully managed Pinpoint. ($100/month). |
| Extensibility Cost | Low: Modular and extensible. ($300/month). | Moderate: Container updates required. ($600/month). | Moderate: Mixed extensibility challenges. ($400/month). | High: Limited customizability. ($600/month). |
| Disaster Recovery | Automated multi-region failover. ($1,000/month). | Complex multi-region setup. ($1,800/month). | Moderate DR complexity. ($1,400/month). | Built-in but limited customization. ($800/month). |
| Total | $5,050/month | $12,100/month | $9,000/month | $6,100/month |

Chosen Decision

The Serverless Approach was chosen as the preferred solution for the multi-channel notification platform for the reasons presented in the table below.

Comparative (GAP) Analysis of Approaches

| Factor | Serverless | Containerized | Hybrid | Pinpoint |
| --- | --- | --- | --- | --- |
| Scalability | Auto-scaling with Lambda and SQS ensures seamless handling of spikes. (+4) | Kubernetes-based scaling is robust but requires manual tuning. (+3) | Combines serverless auto-scaling and containerized resources. (+3) | Limited flexibility; scaling is tied to campaign volume. (+2) |
| High Availability | Multi-region setup with EventBridge and DynamoDB ensures resilience. (+4) | Achieves high availability with Kubernetes, but setup is complex. (+3) | Mixed components add complexity but ensure failover. (+3) | Limited support for multi-region high availability. (+2) |
| Cost Efficiency | Pay-as-you-go eliminates idle costs but increases with complex workflows. (+3) | High infrastructure costs, especially for Kubernetes. (+1) | Combines serverless savings with container overhead. (+2) | Predictable pricing but higher costs due to built-in features. (+2) |
| Operational Overhead | Managed services reduce maintenance to near-zero. (+4) | High overhead due to Kubernetes management. (-2) | Moderate due to mixed infrastructure. (+1) | Minimal overhead due to managed nature. (+3) |
| Extensibility | Modular architecture makes adding channels simple. (+4) | Extending Kubernetes-based systems requires custom integrations. (+2) | Requires effort to ensure smooth integration of serverless and containerized parts. (+2) | Limited extensibility for highly custom and dynamic workflows. (+2) |
| Security | Built-in IAM, WAF, and Shield simplify robust security. (+4) | Security is configurable but requires manual setup. (+2) | Mixed architecture introduces additional security concerns. (+2) | Integrated security features are easy to configure but limited. (+3) |
| Setup Time | Fast deployment using CloudFormation templates. (+4) | Long setup time due to Kubernetes orchestration. (-1) | Longest setup due to integration of multiple components. (-2) | Quick initial setup but limited customizability. (+3) |
| Multi-Channel Support | Seamless support for Email, SMS, and Push Notifications. (+4) | Requires complex integrations for multiple channels. (+2) | Combines serverless multi-channel support with container-based logic. (+3) | Excellent for campaigns but less flexible for custom needs. (+3) |
| Monitoring | Unified monitoring with CloudWatch and CloudTrail. (+4) | Requires integrating third-party tools for Kubernetes monitoring. (+2) | Complex monitoring due to mixed components. (+2) | Built-in tracking but limited operational monitoring. (+3) |
| Why Not Chosen | N/A | High complexity and overhead for the problem statement. | Complexity with marginal benefits compared to serverless. | Lacks API-driven flexibility and extensibility for multi-channel dynamic workflows. |
| Total Score | +35 | +12 | +16 | +23 |
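
Re-adding the per-factor scores is a quick way to check the totals row and the resulting ranking. The sketch below copies the scores from the table in factor order (Scalability through Monitoring) and sums them per approach.

```python
# Scores per approach, in the table's factor order:
# Scalability, High Availability, Cost Efficiency, Operational Overhead,
# Extensibility, Security, Setup Time, Multi-Channel Support, Monitoring.
SCORES = {
    "Serverless": [4, 4, 3, 4, 4, 4, 4, 4, 4],
    "Containerized": [3, 3, 1, -2, 2, 2, -1, 2, 2],
    "Hybrid": [3, 3, 2, 1, 2, 2, -2, 3, 2],
    "Pinpoint": [2, 2, 2, 3, 2, 3, 3, 3, 3],
}

# Sum the nine factor scores for each approach.
totals = {approach: sum(values) for approach, values in SCORES.items()}

# Rank approaches from highest to lowest total score.
ranking = sorted(totals, key=totals.get, reverse=True)

for approach in ranking:
    print(f"{approach}: {totals[approach]:+d}")
```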

AWS Well-Architected Framework Evaluation

| Pillar | Evaluation | Recommendation |
| --- | --- | --- |
| Operational Excellence | Mostly compliant. Automated deployment pipelines and CloudWatch monitoring are implemented. | Add detailed dashboards and incident runbooks for end-to-end operational visibility. |
| Security | Strong. AWS WAF, Shield, and ACM are in place. Default encryption is used. | Adopt AWS KMS for fine-grained encryption control and consider Secrets Manager for better secret rotation. |
| Reliability | Well-designed with multi-region failover using Route 53 and EventBridge. | Use Step Functions for more reliable and visible failover orchestration. |
| Performance Efficiency | Serverless architecture ensures scalability. Multi-region SQS and Lambda minimize latency. | Evaluate caching mechanisms (e.g., DynamoDB Accelerator) for faster reads. |
| Cost Optimization | Optimized using Savings Plans and Reserved Capacity. CloudWatch and Budgets are used for tracking costs. | Consolidate S3 buckets and optimize storage classes for infrequently accessed logs. |

Conclusion

The Notification Platform design provides a highly scalable, cost-effective, and secure solution for handling multi-channel notifications using AWS services. By leveraging a serverless architecture, it ensures auto-scaling, low operational overhead, and event-driven processing. The integration of multi-region disaster recovery, AWS Savings Plans, and optimized monitoring further strengthens the reliability and cost-effectiveness of the platform.

Key Takeaways:

  • The serverless model minimizes infrastructure costs while maintaining high availability.
  • The event-driven approach using EventBridge, SNS, and SQS ensures efficient message delivery.
  • Disaster recovery mechanisms such as DynamoDB Global Tables and Route 53 failover provide robust fault tolerance.
  • The design meets AWS Well-Architected Framework principles, ensuring best practices in security, performance, and operational excellence.

Future Enhancements:

  • Integration of Step Functions for advanced workflow orchestration.
  • Further cost optimizations using advanced reserved capacity strategies.
  • Improving machine learning-driven notification personalization for enhanced user engagement.

This system architecture solution serves as a comprehensive blueprint for deploying and maintaining a highly available, scalable, and secure notification platform on AWS.