
3. Improving throughput by pipelining

A multi-threaded server process, as an improvement over forking servers, can utilize multiple CPUs or cores to process multiple request streams in parallel. However, it cannot speed up the processing of a single request stream.
Pipelining, a technique long used in other areas of computer science, can improve the throughput of a single request stream. To achieve that, we have to split the processing of a single request into stages, for example READ, PROCESS, and WRITE.
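The stage split can be sketched as three small functions, one per stage. The function names and the JSON request format are illustrative assumptions, not something the text prescribes:

```python
import json

def read_stage(raw: bytes) -> dict:
    # READ: parse the raw bytes of one request off the wire.
    return json.loads(raw)

def process_stage(request: dict) -> dict:
    # PROCESS: do the actual, typically CPU-bound, work.
    return {"result": request["x"] * 2}

def write_stage(response: dict) -> bytes:
    # WRITE: serialize the response for sending back to the client.
    return json.dumps(response).encode()

# Run sequentially, the stages simply compose; pipelining means running
# them concurrently on different requests.
print(write_stage(process_stage(read_stage(b'{"x": 21}'))))
```

The point of the decomposition is that, once the stages are separate functions, different requests can occupy different stages at the same time.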

3.1. Threads for pipeline stages

A straightforward approach is to implement the server with a pool of threads, preferably matching the number of available cores. The first thread would use the select/poll based state machine discussed earlier to read in requests. A pool of threads would then do the necessary processing. Finally, a dedicated thread would pick up the finished results and send them back to the clients in order.
Figure 1. Threads for pipeline stages timing
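A minimal sketch of this architecture, with queues connecting the stages: one reader thread, a pool of worker threads, and a writer thread that restores request order. The worker count, the sequence-number tagging, and the use of plain integers as "requests" are assumptions for illustration; a real server would run the select/poll loop in the reader and write to sockets in the writer.

```python
import queue
import threading

NUM_WORKERS = 4                    # assumed pool size, ideally ~ core count
read_q: queue.Queue = queue.Queue()
write_q: queue.Queue = queue.Queue()

def reader(requests):
    # READ stage: tag each request with a sequence number so the
    # writer can restore order later, then signal end-of-input with
    # one sentinel per worker.
    for seq, req in enumerate(requests):
        read_q.put((seq, req))
    for _ in range(NUM_WORKERS):
        read_q.put(None)

def worker():
    # PROCESS stage: several of these run in parallel; here the "work"
    # is just doubling the request payload.
    while (item := read_q.get()) is not None:
        seq, req = item
        write_q.put((seq, req * 2))
    write_q.put(None)              # propagate the sentinel

def writer(out):
    # WRITE stage: buffer out-of-order results and emit them in order.
    pending, next_seq, done = {}, 0, 0
    while done < NUM_WORKERS:
        item = write_q.get()
        if item is None:
            done += 1
            continue
        pending[item[0]] = item[1]
        while next_seq in pending:
            out.append(pending.pop(next_seq))
            next_seq += 1

results: list = []
threads = [threading.Thread(target=reader, args=(range(8),)),
           threading.Thread(target=writer, args=(results,))]
threads += [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                     # doubled inputs, back in input order
```

The queues are what make the pipeline: while a worker processes request n, the reader can already be parsing request n+1 and the writer sending out request n-1.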

Besides the main goal of achieving pipelining, this approach handles multiple client connections out of the box, potentially many thousands, without scalability concerns. Unfortunately, this architecture also has some downsides.
One is arguably code maintainability. Just as with the select/poll architecture, the programmer has to carefully split the reading and the processing, as well as the sending back of results.
CPU scalability is another major issue of this architecture. With a low number of CPUs the processing threads might be the bottleneck; with a higher number of CPUs the I/O threads might be. It boils down to the fact that this architecture does not balance the load between threads (and thus CPUs) reasonably.
Finally, I should point out that this approach is especially hostile to the locality requirements of CPU caches: each time a request advances in the pipeline, its data has to be handed from one CPU's cache to another's.