How Duolingo Cut Their AWS Bill by 20%

Duolingo set out to cut costs and reduced their AWS bill by 20%. They achieved this through a series of optimizations across storage, logging, compute, databases, and service-to-service traffic.

Below are the key problems faced, optimizations done, and final approaches adopted:


Initial Visibility Crisis

  • Problem Faced:
    Duolingo realized they were “flying blind,” with little understanding of where their cloud spending went. One staging environment cost more than production because it had been scaled up for testing and never scaled back down.

  • Optimization Done:
    Enlisted CloudZero, a third-party service, to gain insight into cloud costs.

  • Final Approach Adopted:
    CloudZero converted billing data into queryable line items, helping engineers drill down into specific cost categories.
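
    To illustrate what queryable line items make possible, here is a minimal sketch that groups billing rows by environment and flags any environment that costs more than production. The field names (`env`, `service`, `cost_usd`) and the sample amounts are hypothetical; they are not CloudZero's schema or Duolingo's data.

```python
from collections import defaultdict

# Hypothetical billing line items; field names and amounts are illustrative only.
LINE_ITEMS = [
    {"env": "production", "service": "EC2", "cost_usd": 400.0},
    {"env": "production", "service": "S3", "cost_usd": 100.0},
    {"env": "staging", "service": "EC2", "cost_usd": 650.0},
    {"env": "staging", "service": "S3", "cost_usd": 50.0},
]


def total_by(items, *keys):
    """Sum cost_usd grouped by the given fields, e.g. total_by(items, "env")."""
    totals = defaultdict(float)
    for item in items:
        group = tuple(item[k] for k in keys)
        totals[group] += item["cost_usd"]
    return dict(totals)


def environments_costlier_than(items, reference_env="production"):
    """Return environments whose total spend exceeds the reference environment's."""
    per_env = {group[0]: cost for group, cost in total_by(items, "env").items()}
    reference = per_env.get(reference_env, 0.0)
    return sorted(
        env for env, cost in per_env.items()
        if env != reference_env and cost > reference
    )


if __name__ == "__main__":
    print(total_by(LINE_ITEMS, "env", "service"))
    print(environments_costlier_than(LINE_ITEMS))
```

    With the sample data, the check flags staging, which is the kind of anomaly described in the problem statement above.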


S3 Buckets – Indefinite Revision History

  • Problem Faced:
    S3 buckets were retaining the entire revision history of their objects indefinitely, acting like a “digital hoarder.”

  • Optimization Done:
    Implemented lifecycle rules on the largest buckets.

  • Final Approach Adopted:
    Used lifecycle policies to transition data to cheaper storage classes (e.g., S3 Standard-IA, S3 Glacier) and to expire older object versions, keeping storage growth under control.
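
    A lifecycle policy of this kind can be sketched as a plain configuration dictionary passed to boto3's `put_bucket_lifecycle_configuration`. The day thresholds, the rule ID, and the bucket name are assumptions for illustration; the source does not state which values Duolingo used.

```python
def build_lifecycle_config(ia_after_days=30, glacier_after_days=90,
                           expire_noncurrent_after_days=30):
    """Build an S3 lifecycle configuration dict (all day counts are assumptions)."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Standard-IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-down-and-expire-old-versions",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix applies to the whole bucket
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                # Delete old revisions once they have been noncurrent this long.
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": expire_noncurrent_after_days
                },
            }
        ]
    }


def apply_lifecycle_config(bucket_name, config):
    """Apply the configuration with boto3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    print(build_lifecycle_config())
    # apply_lifecycle_config("example-bucket", build_lifecycle_config())  # hypothetical bucket
```

    Keeping the builder separate from the AWS call makes it easy to review or unit-test the rule before it touches a real bucket.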


DynamoDB Tables – Stale Data Accumulation

  • Problem Faced:
    Large volumes of stale data were accumulating, incurring costs without business value.

  • Optimization Done:
    Introduced Time to Live (TTL) rules.

  • Final Approach Adopted:
    TTL automatically removed outdated records. However, TTL only applies to items that carry the expiry attribute, so legacy data had to be updated manually.
    Lesson: Design for TTL early in schema planning.
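
    A minimal sketch of designing for TTL: every new item is stamped with an expiry time in epoch seconds (the format DynamoDB TTL expects), and items missing the attribute are the ones that need a backfill. The attribute name `expires_at` and the retention period are hypothetical.

```python
import time

TTL_ATTRIBUTE = "expires_at"  # hypothetical attribute name
SECONDS_PER_DAY = 86400


def with_ttl(item, retention_days, now=None):
    """Return a copy of item with an expiry timestamp in epoch seconds."""
    now = int(time.time()) if now is None else int(now)
    stamped = dict(item)
    stamped[TTL_ATTRIBUTE] = now + retention_days * SECONDS_PER_DAY
    return stamped


def needs_backfill(item):
    """Items written before TTL was introduced lack the attribute and never expire."""
    return TTL_ATTRIBUTE not in item


def enable_ttl(table_name):
    """Turn on TTL for a table with boto3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above run without boto3

    boto3.client("dynamodb").update_time_to_live(
        TableName=table_name,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": TTL_ATTRIBUTE},
    )


if __name__ == "__main__":
    new_item = with_ttl({"id": "session-1"}, retention_days=30)
    legacy_item = {"id": "session-0"}
    print(new_item, needs_backfill(new_item), needs_backfill(legacy_item))
```

    Stamping the attribute at write time is what avoids the manual backfill later.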


CloudWatch – Extremely Verbose Logs

  • Problem Faced:
    Every user action, API call, and system event was being logged in maximum detail, including full stack traces, which drastically increased ingestion and storage costs.

  • Optimization Done:
    Removed stack traces from logs.

  • Final Approach Adopted:
    Best practices (implied):

    • Log only what’s necessary
    • Set log retention policies
    • Reduce verbosity/log levels
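
    The practices above can be sketched with Python's standard `logging` module: a higher log level drops routine events, and a custom formatter replaces full stack traces with a one-line exception summary. The logger name, format string, and 30-day retention value are assumptions, not Duolingo's settings.

```python
import copy
import logging


class NoTracebackFormatter(logging.Formatter):
    """Format records without stack traces, keeping only 'ExcType: message'."""

    def format(self, record):
        record = copy.copy(record)  # leave the original record untouched
        exc = record.exc_info[1] if record.exc_info else None
        record.exc_info = None
        record.exc_text = None
        record.stack_info = None
        line = super().format(record)
        if exc is not None:
            line += f" [{type(exc).__name__}: {exc}]"
        return line


def build_logger(name="app", level=logging.WARNING, stream=None):
    """Create a logger that drops low-severity records and stack traces."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(NoTracebackFormatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger


def set_retention(log_group_name, days=30):
    """Set a CloudWatch Logs retention policy (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the logger above runs without boto3

    boto3.client("logs").put_retention_policy(
        logGroupName=log_group_name, retentionInDays=days
    )


if __name__ == "__main__":
    log = build_logger()
    log.info("user opened lesson")  # dropped: below WARNING
    try:
        1 / 0
    except ZeroDivisionError:
        log.exception("lesson scoring failed")  # one line, no traceback
```

    The exception type and message are kept so that errors stay searchable even though the traceback is no longer ingested.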

Overprovisioned Resources

  • Problem Faced:
    Services were overprovisioned: excess memory, unused CPU, dormant databases, and legacy microservices all inflated costs.

  • Optimization Done:
    Focused on right-sizing and implementing autoscaling.

  • Final Approach Adopted:
    Right-sizing just one service led to hundreds of thousands of dollars in annual savings.
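
    One common way to approach right-sizing, sketched here under stated assumptions, is to pick the smallest available size that covers a high percentile of observed usage plus some headroom. The 95th percentile, the 20% headroom, the size list, and the usage samples are all hypothetical; the source does not describe the method Duolingo used.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_memory_mib(usage_samples_mib, available_sizes_mib,
                         headroom=0.2, pct=95):
    """Smallest size covering the pct-th percentile of usage plus headroom."""
    target = percentile(usage_samples_mib, pct) * (1 + headroom)
    for size in sorted(available_sizes_mib):
        if size >= target:
            return size
    return max(available_sizes_mib)  # nothing is large enough; use the biggest


if __name__ == "__main__":
    # Hypothetical memory usage samples for a service provisioned with 8192 MiB.
    usage = [900, 950, 1000, 1100, 1200, 1000, 980, 1050, 990, 1010]
    sizes = [512, 1024, 2048, 4096, 8192]
    print(recommend_memory_mib(usage, sizes))
```

    In this made-up example the service peaks around 1200 MiB, so a 2048 MiB allocation would cover it with headroom.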


Microservice Communication – Unnecessary API Calls

  • Problem Faced:
    A single user request triggered a chain reaction of internal API calls. One legacy service caused 2.1 billion unnecessary API calls per day. Another service kept re-fetching unchanged data because its cache TTL was only 1 minute.

  • Optimization Done:
    Audited services and analyzed data change patterns.

  • Final Approach Adopted:

    • Disabled legacy service triggering excessive calls
    • Increased cache TTL from 1 minute to 1 hour, reducing traffic by 60%
    • Key lesson: Optimize service communication for cascading savings
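
    To show why a longer cache TTL reduces internal traffic, here is a small simulation of a TTL cache in front of an upstream service. The request rate, cache key, and fetch function are hypothetical, and the simulation does not attempt to reproduce the 60% figure, which depends on Duolingo's real traffic mix.

```python
class TTLCache:
    """Cache that re-fetches a key only after its entry is older than the TTL."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._entries = {}  # key -> (value, fetched_at)
        self.upstream_calls = 0

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # still fresh: no upstream call
        value = self._fetch(key)
        self.upstream_calls += 1
        self._entries[key] = (value, now)
        return value


def simulate(ttl_seconds, duration_seconds=3600, request_every_seconds=10):
    """Count upstream calls for one key requested at a fixed interval."""
    cache = TTLCache(fetch=lambda key: f"data-for-{key}", ttl_seconds=ttl_seconds)
    for now in range(0, duration_seconds, request_every_seconds):
        cache.get("course-metadata", now)  # hypothetical cache key
    return cache.upstream_calls


if __name__ == "__main__":
    print("1-minute TTL:", simulate(60), "upstream calls per hour")
    print("1-hour TTL:", simulate(3600), "upstream calls per hour")
```

    A longer TTL only makes sense when the data changes rarely, which is why the audit of data change patterns came first.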

Database Optimization – Unused Cost-Saving Features

  • Problem Faced:
    Built-in AWS cost-saving features were never activated.

  • Optimization Done:
    Applied pricing optimizations for DynamoDB and RDS.

  • Final Approach Adopted:
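    Switching one database to the Aurora I/O-Optimized configuration saved several hundred thousand dollars per year.
    This pricing model removes per-request I/O charges in exchange for higher instance and storage prices, so it favors workloads with heavy read/write activity.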
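
    The trade-off can be sketched as a simple break-even comparison between the two pricing models. The uplift factors below are placeholder assumptions, as are the sample monthly figures; check the current AWS pricing page for your region before relying on any number here.

```python
def monthly_cost_standard(compute_usd, storage_usd, io_usd):
    """Standard configuration: pay for compute, storage, and I/O requests."""
    return compute_usd + storage_usd + io_usd


def monthly_cost_io_optimized(compute_usd, storage_usd,
                              compute_uplift=0.30, storage_uplift=1.25):
    """I/O-Optimized: no I/O charge, but compute and storage cost more.

    The default uplifts are assumptions for illustration, not quoted prices.
    """
    return compute_usd * (1 + compute_uplift) + storage_usd * (1 + storage_uplift)


def io_optimized_is_cheaper(compute_usd, storage_usd, io_usd, **uplifts):
    """True when the I/O-Optimized estimate is below the standard estimate."""
    optimized = monthly_cost_io_optimized(compute_usd, storage_usd, **uplifts)
    return optimized < monthly_cost_standard(compute_usd, storage_usd, io_usd)


if __name__ == "__main__":
    # Hypothetical monthly bills in USD under the standard configuration.
    print(io_optimized_is_cheaper(compute_usd=1000, storage_usd=200, io_usd=800))
    print(io_optimized_is_cheaper(compute_usd=1000, storage_usd=200, io_usd=100))
```

    Under these assumed uplifts, the I/O-heavy database comes out cheaper on I/O-Optimized while the I/O-light one does not, which matches the intuition in the text.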


Reserved Instances – Inefficient Allocation

  • Problem Faced:
    Reserved instance purchases were not aligned with usage patterns.

  • Optimization Done:
    Improved analysis and planning for reserved instance purchases.

  • Final Approach Adopted:
    Identified baseline usage across EC2, RDS, and ElastiCache.
    Made bulk reserved-instance purchases against that baseline to secure long-term discounts.
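
    Finding a baseline can be sketched as follows: given hourly instance counts, reserve only the number of instances that are in use for at least a chosen fraction of hours. The 90% threshold and the sample usage pattern are assumptions, and the sketch ignores pricing, instance families, and commitment terms.

```python
def baseline_instances(hourly_counts, min_utilization=0.9):
    """Largest n such that at least n instances ran in min_utilization of hours."""
    if not hourly_counts:
        return 0
    total_hours = len(hourly_counts)
    best = 0
    for n in range(1, max(hourly_counts) + 1):
        hours_in_use = sum(1 for count in hourly_counts if count >= n)
        if hours_in_use / total_hours >= min_utilization:
            best = n
        else:
            break  # utilization only falls as n grows
    return best


if __name__ == "__main__":
    # Hypothetical day: 10 instances off-peak for 16 hours, 25 at peak for 8 hours.
    day = [10] * 16 + [25] * 8
    print(baseline_instances(day))  # steady-state floor worth reserving
```

    Capacity above the baseline is bursty, so it is usually better served by on-demand capacity or autoscaling than by a long-term commitment.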


Ongoing Monitoring & Reporting

  • Problem Faced:
    Cost visibility was still lacking across services more broadly.

  • Optimization Done:
    Expanded cloud monitoring.

  • Final Approach Adopted:

    • Integrated cloud spending metrics (including OpenAI)
    • Added weekly cost reports
    • Enabled passive team-wide tracking

Key Takeaways

  • Visibility first – You can’t optimize what you can’t see
  • Right-sizing works – Overprovisioning adds up fast
  • Design for cost – Make smart decisions early (TTL, cache, logs)
  • Microservice traffic matters – Optimize communication and reduce internal API chatter
  • Small changes scale – Lifecycle rules, instance reservations, and config tweaks add up

Duolingo’s journey highlights how thoughtful engineering and cost awareness can lead to substantial savings, even at scale.