
3.2. Threads per CPU

The idea behind addressing the downsides of the previous architecture is to have one thread per CPU execute all stages of a request. The first thread starts reading in a request. As soon as it has read the whole request, it signals the next waiting worker thread that it can start reading in the next request.
Figure 2. CPU cache friendly program flow

This approach overcomes all the downsides of the one-thread-per-pipeline-stage approach, but introduces a different issue: it does not yet support multiple client connections, let alone massive client scalability.
One obvious way to support multiple clients is to combine this design with the threaded server architecture: for every client connection, a thread is created on each CPU to achieve the best possible pipelining. However, this may lead to a scalability problem with high numbers of CPUs and client connections, since the number of resulting threads can become huge. Additionally, only one thread can run on a CPU at a time, so creating such a high number of threads is unnecessary. Due to these two downsides, I have chosen not to investigate this design option further.
A different approach to supporting multiple clients is to combine the threads-per-CPU architecture with the select/poll architecture. The advantage over the previous model is that only a single thread is created per CPU core in the system. To overcome the code complexity and maintenance issues of the select/poll architecture, user-space thread switching can be used. This effectively creates a cooperative multi-threading environment, based on makecontext(3) and swapcontext(3), inside the system-level worker threads.

3.2.1. Input coordination

When an FD becomes ready for reading, it is not guaranteed that a whole request is already available. Since a single thread has to read in a whole request, coordination among the threads is necessary. This can be achieved with a mutex.
The worker thread that acquired the mutex reads until it has a whole request. It then releases the mutex, allowing another thread to grab the mutex and enter the READ stage on this FD.