Latency and Throughput

Latency of an operation is the time from start to finish of that one operation. The latency of the same operation can vary from run to run due to a number of factors, so latency is usually measured over a distribution and reported as percentiles (e.g., p50, p99). The latency of a task is usually the sum of the latencies of its sub-tasks. In some contexts it is also called execution time or response time.
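
As a minimal sketch of the percentile idea, here is how you might summarize a set of latency samples in Python (the samples below are synthetic, purely for illustration):

```python
import random
import statistics

# Synthetic latency samples in milliseconds (hypothetical data).
samples = [random.gauss(20, 5) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 99 percentile cut points.
percentiles = statistics.quantiles(samples, n=100)
p50, p99 = percentiles[49], percentiles[98]
print(f"p50 = {p50:.1f} ms, p99 = {p99:.1f} ms")
```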

Throughput of a system is the number of operations it can process/complete in some unit of time. Requests completed per second, total tasks executed per minute etc. are all ways of expressing the throughput of a system.
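
A rough way to measure throughput is to time a batch of operations and divide; a small sketch, assuming operation() stands in for whatever work you want to benchmark:

```python
import time

def operation():
    # Stand-in for the real work being measured (hypothetical workload).
    sum(range(1000))

n = 10_000
start = time.perf_counter()
for _ in range(n):
    operation()
elapsed = time.perf_counter() - start
print(f"throughput: {n / elapsed:.0f} ops/sec")
```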

It is wrong to imagine:

Throughput = 1 / Latency

This is because throughput can exploit parallelism/pipelining whereas latency cannot. Imagine you have 5 servers servicing clients and each request takes 2 secs. The latency of each request is 2 secs, and the throughput of a single server is 30 reqs/minute. But since we have 5 servers, the total throughput of the system is 5 * 30 = 150 reqs/minute. Similarly, consider a car manufacturing assembly line where the car needs to go through a sequence of steps before coming out fully built: one step might build the engine, another the body, and so on. Here, we can pipeline the manufacturing process to improve throughput. Say there are 10 steps in total and each step takes 1 minute. The latency of getting a new car is 10 mins, but the throughput is not just 6 cars/hour: once the pipeline is full, a new car rolls out every minute, so the throughput approaches 60 cars/hour even though each individual car still takes 10 minutes.
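
Here is a toy simulation of the 5-server example, scaled down to 0.2 s requests so it runs quickly, with threads standing in for servers:

```python
import time
from concurrent.futures import ThreadPoolExecutor

REQUEST_LATENCY = 0.2  # seconds per request (scaled-down stand-in for the 2 s example)

def handle_request(_):
    start = time.perf_counter()
    time.sleep(REQUEST_LATENCY)  # simulate the work of serving one request
    return time.perf_counter() - start

n_requests = 25
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:  # 5 "servers"
    latencies = list(pool.map(handle_request, range(n_requests)))
elapsed = time.perf_counter() - start

# Per-request latency is unchanged, but 5 workers give ~5x the throughput.
print(f"avg latency: {sum(latencies) / n_requests:.2f} s")
print(f"throughput : {n_requests / elapsed:.1f} reqs/sec")
```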

Measuring Performance

Let us try to measure performance in terms of latency and throughput. We define the Speedup between two systems X and Y to tell whether X is faster or slower than Y, and by how much. When using latency as the measure, the speedup of X over Y is defined as:

Speedup(X over Y) = Latency(Y) / Latency(X)

and when using throughput as the measure, the speedup is defined as:

Speedup(X over Y) = Throughput(X) / Throughput(Y)

The formulas take this form because performance is proportional to throughput and inversely proportional to latency: as the latency of X increases, its speedup over Y decreases. A speedup greater than 1 means improved performance and less than 1 means decreased performance. Also, as explained earlier, throughput can exploit parallelism/pipelining whereas latency cannot.
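
A quick worked example with made-up numbers for two systems X and Y:

```python
# Hypothetical measurements for systems X and Y.
latency_x, latency_y = 2.0, 6.0           # seconds per operation
throughput_x, throughput_y = 150.0, 50.0  # operations per minute

speedup_by_latency = latency_y / latency_x           # 3.0: X is 3x faster
speedup_by_throughput = throughput_x / throughput_y  # 3.0: X is 3x faster
print(speedup_by_latency, speedup_by_throughput)
```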

A small primer on performance in computer systems: a computing unit is the one responsible for executing instructions. The speed of the computing unit is determined by the number of instructions it can execute in one cycle - Instructions Per Cycle (IPC) - and the number of cycles it can execute in a second - the Clock Speed. The two sit at opposite ends of a design spectrum: an increase in IPC can decrease the achievable clock speed and vice versa. GPUs get the best of both worlds thanks to the vectorization possible in them. Processor designers balance the two when designing new processors for the market. These terms will be useful in the next section. Performance also involves memory and network units, but we will not be looking at them in this post.

Improving Performance

Performance can mainly be improved through Parallelism. Parallelism here means multiple CPUs, multiple storage units, multiple network interfaces, pipelining etc.

We will look at improving performance through various CPU-specific measurements.

The Iron Law gives the total time spent by the CPU on a program. It is defined as:

CPU Time = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)

Cycles Per Instruction (CPI) can also be defined as 1 / Instructions Per Cycle (IPC), and Clock Cycle Time can be defined as 1 / Clock Frequency. In the real world, not all instructions have the same Cycles Per Instruction. In that case, the cycle cost of each instruction class (its count times its CPI) needs to be summed up before multiplying by the Time Per Cycle.
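
A small sketch of that calculation with an assumed instruction mix and a 2 GHz clock (all numbers hypothetical):

```python
CLOCK_HZ = 2e9             # 2 GHz clock frequency (assumed)
cycle_time = 1 / CLOCK_HZ  # Time per Cycle = 1 / Clock Frequency

# Hypothetical instruction mix: class -> (count, cycles per instruction)
instruction_mix = {
    "alu":    (5_000_000, 1),
    "load":   (2_000_000, 3),
    "branch": (1_000_000, 2),
}

# Iron Law with a mixed CPI: sum each class's cycle cost, then
# multiply by the time per cycle.
total_cycles = sum(count * cpi for count, cpi in instruction_mix.values())
cpu_time = total_cycles * cycle_time
print(f"CPU time: {cpu_time * 1e3:.2f} ms")  # 13M cycles / 2 GHz = 6.50 ms
```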

Another law that can be used to measure performance improvements when using multiple CPUs is Amdahl’s law. Simply put, it states the improvement of using multiple processors over using only a single CPU. A program is divided into its serial and parallel portions, and if F is the fraction of the program that must be executed serially, then when using N processors we can achieve a speedup of at most:

1 / (F + (1 - F) / N)

And as N goes up to infinity, the speedup approaches 1 / F, which means that the serial portion limits the overall speedup that can be obtained by using multiple CPUs. It is quite hard to calculate the serial and parallel portions of a program, and you need abstractions to help you with this.
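
A minimal sketch of the bound, showing how the speedup saturates as N grows (F = 0.1 is an assumed serial fraction):

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup with n processors, given serial fraction f."""
    return 1 / (f + (1 - f) / n)

# With 10% serial work the speedup can never exceed 1 / 0.1 = 10,
# no matter how many processors we add.
for n in (1, 2, 8, 64, 1024):
    print(f"N={n:5d}  speedup <= {amdahl_speedup(0.1, n):.2f}")
```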

Pipelining can greatly improve performance, as seen above with the car assembly plant. The CPU applies the same idea: instead of executing each instruction to completion before starting the next, it overlaps the stages (fetch, decode, execute, and so on) of successive instructions, thereby improving throughput.
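
To make the assembly-line numbers concrete, here is a small sketch comparing when each car finishes with and without pipelining (10 stages, 1 minute each, as in the example above):

```python
STAGES = 10      # assembly steps
STAGE_TIME = 1   # minutes per step

def completion_minute(car: int, pipelined: bool) -> int:
    """Minute at which the car-th car (1-based) rolls off the line."""
    if pipelined:
        # The first car takes the full 10 minutes; every later car
        # finishes one stage-time after the previous one.
        return STAGES * STAGE_TIME + (car - 1) * STAGE_TIME
    # Without pipelining, each car occupies the whole line for 10 minutes.
    return car * STAGES * STAGE_TIME

for car in (1, 2, 6):
    print(car, completion_minute(car, pipelined=False),
          completion_minute(car, pipelined=True))
# Car 6 finishes at minute 60 without pipelining vs minute 15 with it.
```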

That’s it. Hope this post helps in understanding the difference between latency and throughput. We also looked at various laws governing performance and how to go about calculating them.