A production-grade platform for collecting, storing, visualizing, and alerting on infrastructure, container, and application metrics using Prometheus, Grafana, Amazon EKS, and AWS managed services.

Six functional categories covering the full observability spectrum — from raw metric collection to distributed tracing and centralized logging.
Pull-based metrics collection from all layers of the stack. Prometheus scrapes annotated pods every 15 seconds and evaluates alerting rules continuously.
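The annotation-driven discovery described above can be sketched in Python. The `prometheus.io/scrape`, `prometheus.io/port`, and `prometheus.io/path` annotation names follow a widely used community convention and are an assumption here, as are the default port and the helper name `scrape_target`. Real deployments express this logic as relabeling rules in the Prometheus scrape configuration rather than in application code.

```python
from __future__ import annotations

from typing import Optional

# Assumed annotation keys (common convention, not confirmed by this page).
SCRAPE_KEY = "prometheus.io/scrape"
PORT_KEY = "prometheus.io/port"
PATH_KEY = "prometheus.io/path"

SCRAPE_INTERVAL_SECONDS = 15  # interval stated in the text above


def scrape_target(pod_ip: str, annotations: dict[str, str]) -> Optional[str]:
    """Return the URL to scrape for a pod, or None if the pod is not opted in."""
    if annotations.get(SCRAPE_KEY) != "true":
        return None  # pods without the opt-in annotation are skipped
    port = annotations.get(PORT_KEY, "9090")  # hypothetical default port
    path = annotations.get(PATH_KEY, "/metrics")
    return f"http://{pod_ip}:{port}{path}"


if __name__ == "__main__":
    print(scrape_target("10.0.1.12", {SCRAPE_KEY: "true", PORT_KEY: "8080"}))
    print(scrape_target("10.0.1.13", {}))
```

A pod opts in by setting the scrape annotation to the string "true"; any other value, or no annotation at all, leaves it out of the target list.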
Rich, interactive dashboards for infrastructure, Kubernetes, and application metrics. AMG provides enterprise SSO and team collaboration features.
Multi-channel alert routing with deduplication, grouping, and silencing. Alerts flow from Prometheus → Alertmanager → SNS → Slack/Email.
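To make deduplication and grouping concrete, here is a minimal Python sketch of the idea, not Alertmanager's actual implementation. Alerts are modeled as plain label dictionaries, `group_alerts` is a hypothetical helper, and the `cluster` and `instance` label names are assumptions chosen for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "cluster")):
    """Drop alerts with identical label sets, then bucket the rest by group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        fingerprint = frozenset(labels.items())
        if fingerprint in seen:
            continue  # identical label set: treat as a duplicate
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    incoming = [
        {"alertname": "HighCPU", "cluster": "prod", "instance": "node-1"},
        {"alertname": "HighCPU", "cluster": "prod", "instance": "node-1"},  # duplicate
        {"alertname": "HighCPU", "cluster": "prod", "instance": "node-2"},
        {"alertname": "DiskFull", "cluster": "prod", "instance": "node-1"},
    ]
    for key, members in group_alerts(incoming).items():
        print(key, len(members))
```

Each group would then produce one notification to the downstream channel instead of one message per alert, which is the point of grouping.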
Serverless, scalable metric and log storage. AMP is PromQL-compatible and retains metrics for 150 days by default. CloudWatch handles logs and RDS metrics.
Distributed tracing for end-to-end request visibility. The OTel Collector receives OTLP traces and exports them to X-Ray for service map generation.
Centralized log aggregation from all Kubernetes pods. Fluentd enriches logs with pod metadata and ships them to CloudWatch for search and analysis.
Four distinct monitoring layers provide observability coverage from node-level hardware and OS metrics up through application-level business KPIs.

Node-level hardware and OS metrics
100 - (avg by(instance)(irate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)
Percentage of CPU time not spent idle, averaged across all cores per node.
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
Percentage of total memory currently in use on each node.
rate(node_disk_io_time_seconds_total[5m])
Fraction of each second the disk spent busy with I/O operations, a proxy for disk utilization.
rate(node_network_receive_bytes_total[5m])
Bytes received per second on each network interface of a node.
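The arithmetic behind the CPU and memory expressions above can be checked by hand. The Python sketch below reproduces it on made-up sample values (a two-core node and a 16 GiB machine, both assumptions); the function names are hypothetical and nothing here queries Prometheus.

```python
def cpu_utilization_percent(idle_start, idle_end, elapsed_seconds):
    """CPU utilization for one node from per-core idle counters.

    idle_start and idle_end hold node_cpu_seconds_total{mode="idle"} readings
    for each core, taken elapsed_seconds apart.
    """
    idle_rates = [
        (end - start) / elapsed_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    avg_idle = sum(idle_rates) / len(idle_rates)  # avg by(instance)
    return 100 - avg_idle * 100


def memory_used_percent(available_bytes, total_bytes):
    """Share of memory in use, from MemAvailable and MemTotal."""
    return (1 - available_bytes / total_bytes) * 100


if __name__ == "__main__":
    # Core 0 idled 45 s and core 1 idled 30 s during a 60 s window.
    print(cpu_utilization_percent([1000.0, 2000.0], [1045.0, 2030.0], 60))
    # 4 GiB available out of 16 GiB total.
    print(memory_used_percent(4 * 2**30, 16 * 2**30))
```

With these inputs the per-core idle rates are 0.75 and 0.5, so the node is 37.5% busy, and the memory expression yields 75% used.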
Production-grade Prometheus alerting rules with severity classification, PromQL expressions, and multi-channel routing via Alertmanager.
A multi-layered security approach following AWS Well-Architected Framework principles — least privilege, defense in depth, and encryption everywhere.
IRSA binds a Kubernetes ServiceAccount to an AWS IAM Role via the EKS OIDC provider. Each workload — Prometheus, Fluentd, OTel Collector — receives only the minimum IAM permissions it needs. This eliminates the need for long-lived credentials and enforces the principle of least privilege at the pod level.
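As an illustration of that binding, the Python sketch below builds the kind of IAM trust policy an IRSA role typically carries. The account ID, OIDC provider ID, namespace, and ServiceAccount name are placeholders, and `irsa_trust_policy` is a hypothetical helper; in a Terraform-based setup the same document would normally be declared in HCL.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build a trust policy that lets exactly one ServiceAccount assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Restrict the role to one namespace/ServiceAccount pair.
                        f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = irsa_trust_policy(
        account_id="111122223333",  # placeholder account
        oidc_provider="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789",  # placeholder
        namespace="monitoring",  # assumed namespace
        service_account="prometheus",  # assumed ServiceAccount name
    )
    print(json.dumps(policy, indent=2))
```

The ServiceAccount side of the binding is the `eks.amazonaws.com/role-arn` annotation, which points at the role that carries this trust policy.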
All critical workloads — EKS nodes, RDS, Prometheus, Grafana — are deployed exclusively in private subnets with no direct internet-facing exposure. Outbound internet access is provided through NAT Gateways in the public subnets. The Application Load Balancer is the only public-facing entry point.
Security groups enforce strict ingress and egress rules at the resource level. The RDS security group only allows traffic on port 3306 from the EKS node security group. The ALB security group only allows HTTPS (443) from the internet. Internal service communication is restricted to known port ranges.
All data in transit between components is encrypted using TLS. The ALB terminates TLS for external traffic. Requests from Prometheus to Amazon Managed Service for Prometheus (AMP) are authenticated with SigV4 signing and sent over HTTPS. Grafana connects to Prometheus via HTTPS with certificate validation enabled.
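SigV4 signs each request with a key derived from the secret access key, the date, the region, and the service name. The Python sketch below shows only that key-derivation step using the standard library; the secret is a placeholder, and `aps` is the service name used for AMP. In practice this is handled by the `sigv4` block of the Prometheus remote_write configuration or by an AWS SDK, not by hand-written code.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """HMAC-SHA256 of a text message under a raw byte key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key by chaining HMACs over date, region, and service."""
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


if __name__ == "__main__":
    key = sigv4_signing_key(
        secret_key="EXAMPLE-SECRET-KEY",  # placeholder, never hard-code real secrets
        date_stamp="20240101",  # YYYYMMDD in UTC
        region="us-east-1",
        service="aps",
    )
    print(key.hex())
```

Because the date and region are folded into the key, a signature captured on one day or for one region cannot be replayed on another.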
Two deployment configurations — a lean lab setup for learning and a production-grade HA architecture — with estimated monthly AWS costs.
Self-managed on minimal EKS — ideal for learning and portfolio demos
Use Spot Instances for EKS node groups to reduce EC2 costs by up to 70%.
Drop unneeded series with relabeling rules or lengthen scrape intervals to reduce AMP ingestion costs; Prometheus itself has no built-in downsampling.
Set CloudWatch log retention to 30 days to avoid unbounded storage costs.
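The log retention tip maps to a single AWS CLI call. The Python sketch below only assembles that command so it can be reviewed or scripted across several log groups; the log group name is hypothetical and `retention_command` is an illustrative helper.

```python
from __future__ import annotations

import shlex


def retention_command(log_group: str, days: int = 30) -> list[str]:
    """Build the AWS CLI call that sets a CloudWatch Logs retention period."""
    return [
        "aws", "logs", "put-retention-policy",
        "--log-group-name", log_group,
        "--retention-in-days", str(days),
    ]


if __name__ == "__main__":
    # Hypothetical log group name; substitute the one Fluentd writes to.
    command = retention_command("/eks/observability/application")
    print(shlex.join(command))
```

Passing the resulting list to `subprocess.run` would apply the policy, provided the caller's credentials allow `logs:PutRetentionPolicy`.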
Step-by-step instructions for deploying the full observability stack on AWS from scratch using Terraform and kubectl.
Ensure the following tools are installed and configured on your local machine before proceeding with the deployment.
# Verify required tools
aws --version        # AWS CLI v2+
terraform --version  # Terraform v1.5+
kubectl version      # kubectl v1.27+
helm version         # Helm v3.12+
docker --version     # Docker 24+

# Configure AWS credentials
aws configure
# AWS Access Key ID: <your-key>
# AWS Secret Access Key: <your-secret>
# Default region name: us-east-1
Ensure your AWS IAM user has AdministratorAccess or the specific permissions required for EKS, VPC, RDS, AMP, and AMG.
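The tool checks above can also be automated. The Python sketch below looks for each tool on the PATH and compares a version string against the minimums listed in the prerequisites; the sample version strings in the usage section are illustrative, not captured output, and the helper names are hypothetical.

```python
import re
import shutil

# Minimum (major, minor) versions taken from the prerequisites above.
MINIMUM_VERSIONS = {
    "aws": (2, 0),
    "terraform": (1, 5),
    "kubectl": (1, 27),
    "helm": (3, 12),
    "docker": (24, 0),
}


def parse_version(output):
    """Extract the first major.minor pair from a tool's version output."""
    match = re.search(r"(\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(tool, output):
    """True if the version found in output satisfies the tool's minimum."""
    found = parse_version(output)
    return found is not None and found >= MINIMUM_VERSIONS[tool]


def missing_tools():
    """Names of required tools that are not on the PATH."""
    return [tool for tool in MINIMUM_VERSIONS if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools())
    print(meets_minimum("terraform", "Terraform v1.5.7"))  # illustrative string
    print(meets_minimum("kubectl", "Client Version: v1.26.3"))  # illustrative string
```

Comparing tuples rather than raw strings avoids the classic mistake of treating version 1.9 as newer than 1.27.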