How Duolingo Cut Their AWS Bill by 20%

Duolingo embarked on a significant cost-cutting journey to reduce their AWS cloud bills by 20%. They achieved this through a series of optimizations addressing various aspects of their cloud infrastructure.

Below are the key problems faced, optimizations done, and final approaches adopted:

Initial Visibility Crisis

Problem Faced:
Duolingo initially realized they were “flying blind,” lacking understanding of their cloud spending. A staging environment was costing more than production due to being scaled up for testing and not scaled back down.
Optimization Done:
Enlisted CloudZero, a third-party service, to gain insight into cloud costs.
Final Approach Adopted:
CloudZero converted billing data into queryable line items, helping engineers drill down into specific cost categories.

S3 Buckets – Indefinite Revision History

Problem Faced:
S3 buckets were storing entire revision history indefinitely, acting like a “digital hoarder.”
Optimization Done:
Implemented lifecycle rules on the largest buckets.
Final Approach Adopted:
Used lifecycle policies to transition data to cheaper tiers (e.g., S3 Standard-IA, Glacier) and delete older versions, controlling storage growth.

DynamoDB Tables – Stale Data Accumulation

Problem Faced:
Large volumes of stale data were accumulating, incurring costs without business value.
Optimization Done:
Introduced Time to Live (TTL) rules.
Final Approach Adopted:
TTL automatically removed outdated records. However, legacy data required manual updates.
Lesson: Design for TTL early in schema planning.

CloudWatch – Extremely Verbose Logs

Problem Faced:
Every user action, API call, and system event was being logged in maximum detail, including full stack traces — drastically increasing ingestion and storage costs.
Optimization Done:
Removed stack traces from logs.
Final Approach Adopted:
Best practices (implied):
- Log only what’s necessary
- Set log retention policies
- Reduce verbosity/log levels

Overprovisioned Resources

Problem Faced:
Services were overprovisioned — excess memory, unused CPU, dormant databases, and legacy microservices bloated the cost.
Optimization Done:
Focused on right-sizing and implementing autoscaling.
Final Approach Adopted:
Right-sizing just one service led to hundreds of thousands in annual savings.

Microservice Communication – Unnecessary API Calls

Problem Faced:
A single user request triggered a chain reaction of internal API calls. One legacy service caused 2.1 billion unnecessary API calls/day. Another frequently polled unchanged data due to a 1-minute cache TTL.
Optimization Done:
Audited services and analyzed data change patterns.
Final Approach Adopted:
- Disabled legacy service triggering excessive calls
- Increased cache TTL from 1 minute to 1 hour, reducing traffic by 60%
- Key lesson: Optimize service communication for cascading savings

Database Optimization – Unused Cost-Saving Features

Problem Faced:
Built-in AWS cost-saving features were never activated.
Optimization Done:
Applied pricing optimizations for DynamoDB and RDS.
Final Approach Adopted:
Switching one DB to Aurora IO-optimized config saved several hundred thousand dollars/year.
This pricing model favors high read/write workloads.

Reserved Instances – Inefficient Allocation

Problem Faced:
Reserved instance purchases were not aligned with usage patterns.
Optimization Done:
Improved analysis and planning for reserved instance purchases.
Final Approach Adopted:
Identified baseline usage across EC2, RDS, and ElastiCache.
Made bulk reserve purchases to secure long-term discounts.

Ongoing Monitoring & Reporting

Problem Faced:
A broader visibility crisis across services.
Optimization Done:
Expanded cloud monitoring.
Final Approach Adopted:
- Integrated cloud spending metrics (including OpenAI)
- Added weekly cost reports
- Enabled passive team-wide tracking

Key Takeaways

Visibility first – You can’t optimize what you can’t see
Right-sizing works – Overprovisioning adds up fast
Design for cost – Make smart decisions early (TTL, cache, logs)
Microservice traffic matters – Optimize communication and reduce internal API chatter
Small changes scale – Lifecycle rules, instance reservations, and config tweaks add up

Duolingo’s journey highlights how thoughtful engineering and cost awareness can lead to substantial savings—even at scale.