
Latency vs. Throughput

Published on January 23, 2026


When we talk about system performance, two metrics come up constantly: latency and throughput. Many developers use them as if they were the same, but they’re not. And confusing them can lead you to optimize the wrong thing.

A system can be extremely fast for a single user (low latency) but collapse when a thousand simultaneous users arrive (insufficient throughput). Or it can handle thousands of users (high throughput) but each one experiences slow responses (high latency).

In this article I’ll explain the difference between the two metrics, how to measure each one, and which strategies to apply depending on the problem you actually have.

What is latency?

Latency is the time it takes for an operation to complete from the perspective of a single user or request. It’s the delay between “I send a request” and “I get the response.”

Examples:

  • Time for a page to load (from click to visible content)
  • Time for an API to respond (from request to response)
  • Time for a transaction to confirm

Typical metrics:

  • P50 (median): Half of requests are faster than this value
  • P95: 95% of requests are faster than this value
  • P99: 99% of requests are faster than this value

Latency is measured in milliseconds (ms) or seconds. A system with “low latency” responds quickly. A system with “high latency” feels slow to the user.
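
These percentiles are easy to compute yourself. A minimal sketch in Python, using the nearest-rank method and made-up sample values:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical latency samples in milliseconds.
latencies_ms = [12, 15, 18, 20, 22, 25, 30, 45, 80, 250]

p50 = percentile(latencies_ms, 50)  # -> 22 (median)
p95 = percentile(latencies_ms, 95)  # -> 250
p99 = percentile(latencies_ms, 99)  # -> 250
```

Note how a single slow outlier (250 ms) dominates P95 and P99 while leaving the median untouched. That’s exactly why high percentiles matter: they expose the slow tail that averages hide.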

What is throughput?

Throughput is the amount of work a system can process in a period of time. It’s how many complete operations per second (or per minute) the system can handle when there are multiple users or concurrent load.

Examples:

  • Requests per second an API can handle
  • Transactions per minute a database processes
  • Concurrent users an application can support

Typical metrics:

  • Requests per second (RPS)
  • Transactions per second (TPS)
  • Supported concurrent users

Throughput is measured in operations per unit of time. A system with “high throughput” can handle a lot of load. A system with “low throughput” saturates when many users arrive.
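
The definition “completed operations per unit of time” translates directly into code. A sketch: run an operation in a loop for a fixed window and count completions (the lambda here is a stand-in for a real request or query):

```python
import time

def measure_throughput(operation, duration_s=1.0):
    """Run `operation` repeatedly for duration_s seconds; return ops/sec."""
    completed = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        operation()
        completed += 1
    elapsed = time.perf_counter() - start
    return completed / elapsed

# Example with a trivial stand-in operation.
rps = measure_throughput(lambda: sum(range(100)), duration_s=0.2)
```

A single-threaded loop like this measures sequential throughput; under real concurrent load the number can be higher (parallelism) or lower (contention), which is why load tests matter.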

The difference in practice

Imagine a highway:

Latency: How long it takes one car to go from point A to point B. If the road is empty, it can be very fast (low latency). If there’s an accident or a bottleneck, it can take a long time (high latency).

Throughput: How many cars can pass through the highway in an hour. A single-lane road has lower throughput than a four-lane one. When there are too many cars, traffic jams form (the system saturates).

You can have:

  • Low latency and low throughput: An empty single-lane road. Each car goes fast, but few pass per hour.
  • High latency and high throughput: A very busy multi-lane road. Each car moves slowly, but many pass per hour.
  • Low latency and high throughput: The ideal. Wide, flowing road.
  • High latency and low throughput: The worst. Narrow, congested road.

What is your real problem?

Before optimizing, you need to identify whether your problem is latency or throughput.

Latency problem

Symptoms:

  • A single user reports the application is slow
  • Responses take a long time even with light load
  • Response time (P95, P99) is high
  • The application “feels slow” in normal use

Typical causes:

  • Slow database queries (missing indexes, N+1 queries)
  • Heavy business logic in the request
  • Slow external service calls
  • Unoptimized code (inefficient algorithms)
  • Blocking network or I/O

Strategies:

  • Optimize queries (indexes, avoid N+1)
  • Cache frequently accessed data
  • Asynchronous processing (don’t block the request)
  • Reduce round-trips (fewer calls, aggregated data)
  • Profiling to find real bottlenecks
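
To make the caching strategy concrete, here’s a minimal sketch: `fetch_user_from_db` is a hypothetical slow query, and memoizing it with Python’s `functools.lru_cache` turns repeated lookups into in-memory hits. In a real system you’d likely use a shared cache (Redis, Memcached) and think about invalidation:

```python
import functools
import time

# Hypothetical slow lookup; in a real system this would hit the database.
def fetch_user_from_db(user_id):
    time.sleep(0.05)  # simulate a 50 ms query
    return {"id": user_id, "name": f"user-{user_id}"}

@functools.lru_cache(maxsize=1024)
def fetch_user_cached(user_id):
    return fetch_user_from_db(user_id)

t0 = time.perf_counter()
fetch_user_cached(42)            # cold: pays the full query cost
cold = time.perf_counter() - t0

t0 = time.perf_counter()
fetch_user_cached(42)            # warm: served from memory
warm = time.perf_counter() - t0
```

The warm call skips the 50 ms query entirely, which is the whole point: latency drops because the slow step is no longer on the request path.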

Throughput problem

Symptoms:

  • The application works fine with few users
  • With many simultaneous users, everything slows down or fails
  • Timeouts under load
  • 503 or “service unavailable” errors at peak
  • CPU or memory saturates

Typical causes:

  • A single server or process can’t handle the load
  • A database that doesn’t scale
  • Limited connections (pool exhausted)
  • Shared resources that become a bottleneck
  • Lack of horizontal scalability

Strategies:

  • Scale horizontally (more instances)
  • Load balancing
  • Queues to decouple and smooth peaks
  • Scale the database (read replicas, sharding)
  • Optimize shared resources (connection pooling, limits)
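
The “queues to decouple and smooth peaks” idea can be sketched with Python’s standard `queue` and `threading` modules: a burst of incoming jobs is buffered immediately and drained by a fixed pool of workers, so the backend never sees more than a bounded number of operations in flight. All names here are illustrative:

```python
import queue
import threading

jobs = queue.Queue()          # buffers the burst
results = []
lock = threading.Lock()

def worker():
    while True:
        item = jobs.get()
        if item is None:      # sentinel: shut this worker down
            jobs.task_done()
            return
        with lock:
            results.append(item * 2)  # stand-in for real work
        jobs.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()

for i in range(100):          # a burst of 100 "requests" arrives at once
    jobs.put(i)
for _ in pool:                # one sentinel per worker
    jobs.put(None)

jobs.join()                   # wait until every job is processed
for t in pool:
    t.join()
```

The burst is accepted instantly (low enqueue latency for the caller), while actual processing proceeds at whatever rate four workers sustain. That trade, accepting work fast and processing it at a controlled pace, is how queues protect throughput under peaks.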

They aren’t mutually exclusive

A system can have both problems. Or optimizing one can make the other worse.

Example: Adding a cache can reduce latency (fewer DB queries), but if the cache isn’t well sized or there are many writes, throughput may not improve or may even get worse.

Example: Adding more servers (scaling horizontally) can improve throughput, but if each request is still slow due to a heavy query, latency stays high.

That’s why it’s crucial to measure first and define goals. What’s unacceptable today? High latency for users? Saturation under load? That defines what to optimize.

How to measure

Latency:

  • APM (Application Performance Monitoring): Datadog, New Relic, etc.
  • Logs with timestamps at request entry and exit
  • Framework metrics (middleware that measures response time)
  • Load tools that report percentiles (k6, Artillery, wrk)
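
A middleware that measures response time can be as simple as a decorator that timestamps each call and keeps the samples. This Python sketch uses a hypothetical `get_user` handler:

```python
import functools
import time

def timed(handler):
    """Decorator that records the handler's response time in milliseconds."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            wrapper.samples.append(elapsed_ms)
    wrapper.samples = []
    return wrapper

@timed
def get_user(user_id):
    time.sleep(0.01)  # stand-in for real work (~10 ms)
    return {"id": user_id}

get_user(1)
get_user(2)
# get_user.samples now holds one latency measurement per request,
# ready to feed into a percentile calculation.
```

In a real service you’d hook the same logic into your framework’s middleware layer and export the samples to your metrics system instead of keeping them in memory.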

Throughput:

  • Load tests: how many RPS the system sustains before degrading
  • Infrastructure metrics: CPU, memory, connections under load
  • Limits observed in production (real traffic peaks)
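
A very rough load test needs nothing beyond the standard library: several threads hammer a handler concurrently, and we report aggregate RPS plus the individual latencies (from which you could compute the percentiles above). Dedicated tools like k6 or wrk do this far better; this is only a sketch of the idea:

```python
import threading
import time

def load_test(handler, concurrency=8, requests_per_worker=25):
    """Fire requests from several threads; return (aggregate RPS, latencies)."""
    latencies = []
    lock = threading.Lock()

    def worker():
        for _ in range(requests_per_worker):
            t0 = time.perf_counter()
            handler()
            dt = time.perf_counter() - t0
            with lock:
                latencies.append(dt)

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start

    total = concurrency * requests_per_worker
    return total / elapsed, latencies

# The handler here simulates ~1 ms of I/O; point it at a real endpoint instead.
rps, latencies = load_test(lambda: time.sleep(0.001))
```

Ramp `concurrency` up across runs and watch where RPS plateaus and latencies climb: that knee is your system’s practical throughput limit.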

Typical targets:

  • Latency P95 < 200 ms for APIs
  • Latency P99 < 500 ms
  • Throughput: support X RPS according to your business (expected peak + margin)

My personal perspective

I’ve seen teams optimize throughput (add servers, more replicas) when the real problem was latency: a badly written query that took 2 seconds. Users kept complaining about slowness.

And I’ve seen teams obsess over reducing latency (cache everywhere, premature optimizations) when the real problem was throughput: on Black Friday the system collapsed because there weren’t enough instances or a database prepared for that peak.

The key is to measure, define the problem, and then choose the strategy. Latency and throughput are different concepts, with different solutions. Understanding the difference lets you not only optimize better but also communicate better with business and operations: “We have a throughput problem” or “We have a latency problem” leads to very different actions.

In well-designed systems, both are addressed: acceptable latency per request and enough capacity (throughput) for the expected number of users. But when something fails, knowing whether it fails due to individual slowness or saturation gives you the map to fix it.