http://blog.csdn.net/win_lin/article/details/8242653
State Threads is an application library which provides a foundation for writing fast and highly scalable Internet Applications on UNIX-like platforms. It combines the simplicity of the multithreaded programming paradigm, in which one thread supports each simultaneous connection, with the performance and scalability of an event-driven state machine architecture.
An Internet Application (IA) is either a server or client network application that accepts connections from clients and may or may not connect to servers. In an IA the arrival or departure of network data often controls processing (that is, IA is a data-driven application). For each connection, an IA does some finite amount of work involving data exchange with its peer, where its peer may be either a client or a server. The typical transaction steps of an IA are to accept a connection, read a request, do some finite and predictable amount of work to process the request, then write a response to the peer that sent the request. One example of an IA is a Web server; the most general example of an IA is a proxy server, because it both accepts connections from clients and connects to other servers.
We assume that the performance of an IA is constrained by available CPU cycles rather than network bandwidth or disk I/O (that is, CPU is a bottleneck resource).
The performance of an IA is usually evaluated as its throughput measured in transactions per second or bytes per second (one can be converted to the other, given the average transaction size: for example, 100 transactions per second at an average size of 10 KB is roughly 1 MB per second). There are several benchmarks that can be used to measure throughput of Web serving applications for specific workloads (such as SPECweb96, WebStone, WebBench). Although there is no common definition for scalability, in general it expresses the ability of an application to sustain its performance when some external condition changes. For IAs this external condition is either the number of clients (also known as "users," "simultaneous connections," or "load generators") or the underlying hardware system size (number of CPUs, memory size, and so on). Thus there are two types of scalability: load scalability and system scalability, respectively.
The figure below shows how the throughput of an idealized IA changes with the increasing number of clients (solid blue line). Initially the throughput grows linearly (the slope represents the maximal throughput that one client can provide). Within this initial range, the IA is underutilized and CPUs are partially idle. Further increase in the number of clients leads to a system saturation, and the throughput gradually stops growing as all CPUs become fully utilized. After that point, the throughput stays flat because there are no more CPU cycles available. In the real world, however, each simultaneous connection consumes some computational and memory resources, even when idle, and this overhead grows with the number of clients. Therefore, the throughput of the real world IA starts dropping after some point (dashed blue line in the figure below). The rate at which the throughput drops depends, among other things, on application design.
We say that an application has good load scalability if it can sustain its throughput over a wide range of loads. Interestingly, the SPECweb99 benchmark somewhat reflects the Web server's load scalability because it measures the number of clients (load generators) given a mandatory minimal throughput per client (that is, it measures the server's capacity). This is unlike SPECweb96 and other benchmarks that use the throughput as their main metric (see the figure below).
System scalability is the ability of an application to sustain its performance per hardware unit (such as a CPU) with the increasing number of these units. In other words, good system scalability means that doubling the number of processors will roughly double the application's throughput (dashed green line). We assume here that the underlying operating system also scales well. Good system scalability allows you to initially run an application on the smallest system possible, while retaining the ability to move that application to a larger system if necessary, without excessive effort or expense. That is, an application need not be rewritten or even undergo a major porting effort when changing system size.
Although scalability and performance are more important in the case of server IAs, they should also be considered for some client applications (such as benchmark load generators).
Concurrency reflects the parallelism in a system. The two unrelated types are virtual concurrency and real concurrency.
Virtual (or apparent) concurrency is the number of simultaneous connections that a system supports.
Real concurrency is the number of hardware devices, including CPUs, network cards, and disks, that actually allow a system to perform tasks in parallel.
An IA must provide virtual concurrency in order to serve many users simultaneously. To achieve maximum performance and scalability in doing so, the number of programming entities that an IA creates to be scheduled by the OS kernel should be kept close to (within an order of magnitude of) the real concurrency found on the system. These programming entities scheduled by the kernel are known as kernel execution vehicles. Examples of kernel execution vehicles include Solaris lightweight processes and IRIX kernel threads. In other words, the number of kernel execution vehicles should be dictated by the system size and not by the number of simultaneous connections.
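One way to follow this sizing rule, sketched below in C, is to derive the worker count from the number of online CPUs at startup. The sysconf name _SC_NPROCESSORS_ONLN is widely available on UNIX-like platforms, though not universally mandated, and the one-worker-per-CPU factor is an illustrative default, not a rule:

```c
#include <unistd.h>

/* Pick the number of kernel execution vehicles (worker processes) from
 * the hardware size, not from the expected number of clients. */
static int choose_worker_count(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpus < 1)
        ncpus = 1;          /* fall back if the value is unavailable */
    return (int)ncpus;      /* one worker per CPU, within an order of magnitude */
}
```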
There are a few different architectures that are commonly used by IAs. These include the Multi-Process, Multi-Threaded, and Event-Driven State Machine architectures.
In the Multi-Process (MP) architecture, an individual process is dedicated to each simultaneous connection. A process performs all of a transaction's initialization steps and services a connection completely before moving on to service a new connection.
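A minimal sketch of this process-per-connection loop, assuming a bound listening socket and a hypothetical serve_connection() handler (error handling trimmed):

```c
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

void serve_connection(int fd);           /* hypothetical request handler */

/* MP architecture: fork a dedicated process for each accepted connection. */
void mp_accept_loop(int listen_fd)
{
    for (;;) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0)
            continue;
        if (fork() == 0) {               /* child: owns this connection */
            close(listen_fd);
            serve_connection(conn_fd);   /* blocking I/O is fine here */
            close(conn_fd);
            _exit(0);
        }
        close(conn_fd);                  /* parent: keep accepting */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;                            /* reap any finished children */
    }
}
```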
User sessions in IAs are relatively independent; therefore, no synchronization between processes handling different connections is necessary. Because each process has its own private address space, this architecture is very robust. If a process serving one of the connections crashes, the other sessions will not be affected. However, to serve many concurrent connections, an equal number of processes must be employed. Because processes are kernel entities (and are in fact the heaviest ones), the number of kernel entities will be at least as large as the number of concurrent sessions. On most systems, good performance will not be achieved when more than a few hundred processes are created because of the high context-switching overhead. In other words, MP applications have poor load scalability.
On the other hand, MP applications have very good system scalability, because no resources are shared among different processes and there is no synchronization overhead.
The Apache Web Server 1.x ([Reference 1]) uses the MP architecture on UNIX systems.
In the Multi-Threaded (MT) architecture, multiple independent threads of control are employed within a single shared address space. Like a process in the MP architecture, each thread performs all of a transaction's initialization steps and services a connection completely before moving on to service a new connection.
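The same loop in thread-per-connection form might look like the following sketch, again assuming a hypothetical serve_connection() handler:

```c
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

void serve_connection(int fd);           /* hypothetical request handler */

static void *connection_thread(void *arg)
{
    int conn_fd = (int)(long)arg;
    serve_connection(conn_fd);           /* blocking calls park only this thread */
    close(conn_fd);
    return NULL;
}

/* MT architecture: spawn one detached thread per accepted connection. */
void mt_accept_loop(int listen_fd)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    for (;;) {
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0)
            continue;
        pthread_t tid;
        if (pthread_create(&tid, &attr, connection_thread,
                           (void *)(long)conn_fd) != 0)
            close(conn_fd);              /* could not spawn; drop the connection */
    }
}
```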
Many modern UNIX operating systems implement a many-to-few model when mapping user-level threads to kernel entities. In this model, an arbitrarily large number of user-level threads is multiplexed onto a lesser number of kernel execution vehicles. Kernel execution vehicles are also known as virtual processors. Whenever a user-level thread makes a blocking system call, the kernel execution vehicle it is using will become blocked in the kernel. If there are no other non-blocked kernel execution vehicles and there are other runnable user-level threads, a new kernel execution vehicle will be created automatically. This prevents the application from blocking when it can continue to make useful forward progress.
Because IAs are by nature network I/O driven, all concurrent sessions block on network I/O at various points. As a result, the number of virtual processors created in the kernel grows close to the number of user-level threads (or simultaneous connections). When this occurs, the many-to-few model effectively degenerates to a one-to-one model. Again, like in the MP architecture, the number of kernel execution vehicles is dictated by the number of simultaneous connections rather than by number of CPUs. This reduces an application's load scalability. However, because kernel threads (lightweight processes) use fewer resources and are more light-weight than traditional UNIX processes, an MT application should scale better with load than an MP application.
Unfortunately, having multiple virtual processors share the same address space in the MT architecture undermines an application's system scalability because of contention among the threads on various locks. Even if an application itself is carefully optimized to avoid lock contention around its own global data (a non-trivial task), there are still standard library functions and system calls that use common resources hidden from the application. For example, on many platforms thread safety of memory allocation routines (malloc(3), free(3), and so on) is achieved by using a single global lock. Another example is a per-process file descriptor table. This common resource table is shared by all kernel execution vehicles within the same process and must be protected when one modifies it via certain system calls (such as open(2), close(2), and so on). In addition, keeping the caches coherent among CPUs on multiprocessor systems hurts performance when different threads running on different CPUs modify data items on the same cache line.
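The cache-line effect can be made concrete with a small layout sketch; the 64-byte line size and the eight-thread counter array are illustrative assumptions, not universal values:

```c
/* Two layouts of per-thread counters.  In the packed layout, counters for
 * different threads sit on the same cache line, so an increment on one CPU
 * invalidates the line on every other CPU ("false sharing").  Padding each
 * counter out to its own line avoids that. */
#define CACHE_LINE 64                    /* common, but not universal */

struct packed_counters {                 /* false sharing likely */
    unsigned long hits[8];
};

struct padded_counter {                  /* one counter per cache line */
    unsigned long hits;
    char pad[CACHE_LINE - sizeof(unsigned long)];
};
```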
In order to improve load scalability, some applications employ a different type of MT architecture: they create one or more thread(s) per task rather than one thread per connection. For example, one small group of threads may be responsible for accepting client connections, another for request processing, and yet another for serving responses. The main advantage of this architecture is that it eliminates the tight coupling between the number of threads and number of simultaneous connections. However, in this architecture, different task-specific thread groups must share common work queues that must be protected by mutual exclusion locks (a typical producer-consumer problem). This adds synchronization overhead that causes an application to perform badly on multiprocessor systems. In other words, in this architecture, the application's system scalability is sacrificed for the sake of load scalability.
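A minimal sketch of the shared work queue such thread groups need; the fixed capacity and struct layout are illustrative, and every push and pop has to take the mutex, which is exactly the synchronization cost described above:

```c
#include <pthread.h>

#define QSIZE 128                        /* illustrative fixed capacity */

/* Initialize lock and not_empty with pthread_mutex_init() and
 * pthread_cond_init() before use. */
struct work_queue {
    void           *items[QSIZE];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
};

void queue_push(struct work_queue *q, void *item)   /* producer side */
{
    pthread_mutex_lock(&q->lock);
    if (q->count < QSIZE) {              /* a real queue would block when full */
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % QSIZE;
        q->count++;
        pthread_cond_signal(&q->not_empty);
    }
    pthread_mutex_unlock(&q->lock);
}

void *queue_pop(struct work_queue *q)               /* consumer side */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return item;
}
```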
Of course, the usual nightmares of threaded programming, including data corruption, deadlocks, and race conditions, also make the MT architecture (in any form) difficult to use.
In the Event-Driven State Machine (EDSM) architecture, a single process is employed to concurrently process multiple connections. The basics of this architecture are described in Comer and Stevens [Reference 2]. The EDSM architecture performs one basic data-driven step associated with a particular connection at a time, thus multiplexing many concurrent connections. The process operates as a state machine that receives an event and then reacts to it.
In the idle state the EDSM calls select(2) or poll(2) to wait for network I/O events. When a particular file descriptor is ready for I/O, the EDSM completes the corresponding basic step (usually by invoking a handler function) and starts the next one. This architecture uses non-blocking system calls to perform asynchronous network I/O operations. For more details on non-blocking I/O see Stevens [Reference 3].
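A skeleton of that idle loop might look like the sketch below; handle_step() stands in for the application's per-connection handler and is an assumption of this sketch, not part of any library:

```c
#include <poll.h>

void handle_step(struct pollfd *pfd);    /* hypothetical per-connection step */

/* EDSM idle loop: block in poll(2), then perform one basic, non-blocking
 * step for every descriptor that became ready. */
void edsm_loop(struct pollfd *fds, int nfds)
{
    for (;;) {
        int n = poll(fds, nfds, -1);     /* idle state: wait for I/O events */
        if (n <= 0)
            continue;
        for (int i = 0; i < nfds; i++)
            if (fds[i].revents & (POLLIN | POLLOUT))
                handle_step(&fds[i]);    /* advance this connection one state */
    }
}
```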
To take advantage of hardware parallelism (real concurrency), multiple identical processes may be created. This is called Symmetric Multi-Process EDSM and is used, for example, in the Zeus Web Server ([Reference 4]). To more efficiently multiplex disk I/O, special "helper" processes may be created. This is called Asymmetric Multi-Process EDSM and was proposed for Web servers by Druschel and others [Reference 5].
EDSM is probably the most scalable architecture for IAs. Because the number of simultaneous connections (virtual concurrency) is completely decoupled from the number of kernel execution vehicles (processes), this architecture has very good load scalability. It requires only minimal user-level resources to create and maintain an additional connection.
Like MP applications, Multi-Process EDSM has very good system scalability because no resources are shared among different processes and there is no synchronization overhead.
Unfortunately, the EDSM architecture is monolithic rather than based on the concept of threads, so new applications generally need to be implemented from the ground up. In effect, the EDSM architecture simulates threads and their stacks the hard way.
The State Threads library combines the advantages of all of the above architectures. The interface preserves the programming simplicity of thread abstraction, allowing each simultaneous connection to be treated as a separate thread of execution within a single process. The underlying implementation is close to the EDSM architecture as the state of each particular concurrent session is saved in a separate memory segment.
The state of each concurrent session includes its stack environment (stack pointer, program counter, CPU registers) and its stack. Conceptually, a thread context switch can be viewed as a process changing its state. There are no kernel entities involved other than processes. Unlike other general-purpose threading libraries, the State Threads library is fully deterministic. The thread context switch (process state change) can only happen in a well-known set of functions (at I/O points or at explicit synchronization points). As a result, process-specific global data does not have to be protected by mutual exclusion locks in most cases. The entire application is free to use all the static variables and non-reentrant library functions it wants, greatly simplifying programming and debugging while increasing performance. This is somewhat similar to a co-routine model (co-operatively multitasked threads), except that no explicit yield is needed -- sooner or later, a thread performs a blocking I/O operation and thus surrenders control. All threads of execution (simultaneous connections) have the same priority, so scheduling is non-preemptive, like in the EDSM architecture. Because IAs are data-driven (processing is limited by the size of network buffers and data arrival rates), scheduling is non-time-slicing.
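A sketch of what this looks like in practice, using the st_* calls from the library's public interface as documented (st_init, st_accept, st_read, st_write, st_thread_create); treat the exact signatures as assumptions, and note that the echo logic and socket setup are purely illustrative:

```c
#include <sys/types.h>
#include <st.h>

/* One state thread per connection: the code reads like ordinary blocking
 * I/O, but every st_read/st_write is a scheduling point where the thread
 * yields to other sessions.  Setup of listen_fd is omitted. */
static void *echo_session(void *arg)
{
    st_netfd_t cli = (st_netfd_t)arg;
    char buf[4096];
    ssize_t n;

    while ((n = st_read(cli, buf, sizeof(buf), ST_UTIME_NO_TIMEOUT)) > 0)
        if (st_write(cli, buf, n, ST_UTIME_NO_TIMEOUT) != n)
            break;
    st_netfd_close(cli);
    return NULL;
}

int serve(int listen_fd)                 /* listen_fd: a bound, listening socket */
{
    if (st_init() < 0)
        return -1;
    st_netfd_t srv = st_netfd_open_socket(listen_fd);
    for (;;) {
        st_netfd_t cli = st_accept(srv, NULL, NULL, ST_UTIME_NO_TIMEOUT);
        if (cli == NULL)
            continue;
        st_thread_create(echo_session, cli, 0, 0);  /* not joinable, default stack */
    }
}
```

Note that the library's own I/O wrappers, not read(2)/write(2), are used on the sockets; this connects directly to the limitation listed at the end of this document.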
Only two types of external events are handled by the library's scheduler, because only these events can be detected by select(2) or poll(2): I/O events (a file descriptor is ready for I/O) and time events (some timeout has expired). However, other types of events (such as a signal sent to a process) can also be handled by converting them to I/O events. For example, a signal handling function can perform a write to a pipe (write(2) is async-signal-safe), thus converting a signal event to an I/O event.
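A hedged sketch of that conversion; the pipe setup and the choice of SIGHUP are illustrative:

```c
#include <signal.h>
#include <unistd.h>

static int sig_pipe[2];                  /* filled in by pipe(2) below */

/* The handler does only an async-signal-safe write(2); the read end of the
 * pipe then becomes readable, so the scheduler sees the signal as an
 * ordinary I/O event on sig_pipe[0]. */
static void on_signal(int signo)
{
    char c = (char)signo;
    (void)write(sig_pipe[1], &c, 1);
}

void install_signal_pipe(void)
{
    struct sigaction sa;

    (void)pipe(sig_pipe);
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = on_signal;
    sigaction(SIGHUP, &sa, NULL);        /* SIGHUP chosen only for illustration */
    /* a dedicated thread can now read sig_pipe[0] and dispatch the signals */
}
```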
To take advantage of hardware parallelism, as in the EDSM architecture, multiple processes can be created in either a symmetric or asymmetric manner. Process management is not in the library's scope but instead is left up to the application.
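A minimal sketch of the symmetric variant under these assumptions: the parent forks one worker per CPU before any State Threads initialization, and each worker inherits the listening socket and runs its own scheduler (serve() refers to the accept-loop sketch above):

```c
#include <sys/types.h>
#include <unistd.h>

int serve(int listen_fd);                /* the per-process loop sketched above */

/* Symmetric multi-process setup, which the library leaves to the
 * application: each process runs its own scheduler and accept loop. */
void start_workers(int listen_fd, int nworkers)
{
    for (int i = 1; i < nworkers; i++)   /* the parent counts as one worker */
        if (fork() == 0)
            break;                       /* child stops forking and serves */
    serve(listen_fd);                    /* never returns in this sketch */
}
```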
There are several general-purpose threading libraries that implement a many-to-one model (many user-level threads to one kernel execution vehicle), using the same basic techniques as the State Threads library (non-blocking I/O, event-driven scheduler, and so on). For an example, see GNU Portable Threads ([Reference 6]). Because they are general-purpose, these libraries have different objectives than the State Threads library. The State Threads library is not a general-purpose threading library, but rather an application library that targets only certain types of applications (IAs) in order to achieve the highest possible performance and scalability for those applications.
State threads are very lightweight user-level entities, and therefore creating and maintaining user connections requires minimal resources. An application using the State Threads library scales very well with the increasing number of connections.
On multiprocessor systems an application should create multiple processes to take advantage of hardware parallelism. Using multiple separate processes is the only way to achieve the highest possible system scalability. This is because duplicating per-process resources is the only way to avoid significant synchronization overhead on multiprocessor systems. Creating separate UNIX processes naturally offers resource duplication. Again, as in the EDSM architecture, there is no connection between the number of simultaneous connections (which may be very large and changes within a wide range) and the number of kernel entities (which is usually small and constant). In other words, the State Threads library makes it possible to multiplex a large number of simultaneous connections onto a much smaller number of separate processes, thus allowing an application to scale well with both the load and system size.
Performance is one of the library's main objectives. The State Threads library is implemented to minimize the number of system calls and to make thread creation and context switching as fast as possible. For example, there is no per-thread signal mask (unlike POSIX threads), so there is no need to save and restore a process's signal mask on every thread context switch. This eliminates two system calls per context switch. Signal events can be handled much more efficiently by converting them to I/O events (see above).
The library uses the same general, underlying concepts as the EDSM architecture, including non-blocking I/O, file descriptors, and I/O multiplexing. These concepts are available in some form on most UNIX platforms, making the library very portable across many flavors of UNIX. There are only a few platform-dependent sections in the source.
The State Threads library is a derivative of the Netscape Portable Runtime library (NSPR) [Reference 7]. The primary goal of NSPR is to provide a platform-independent layer for system facilities, where system facilities include threads, thread synchronization, and I/O. Performance and scalability are not the main concern of NSPR. The State Threads library addresses performance and scalability while remaining much smaller than NSPR. It is contained in 8 source files as opposed to more than 400, but provides all the functionality that is needed to write efficient IAs on UNIX-like platforms.
|                                       | NSPR     | State Threads |
|---------------------------------------|----------|---------------|
| Lines of code                         | ~150,000 | ~3000         |
| Dynamic library size (debug version): |          |               |
| IRIX                                  | ~700 KB  | ~60 KB        |
| Linux                                 | ~900 KB  | ~70 KB        |
State Threads is an application library which provides a foundation for writing Internet Applications. To summarize, it has the following advantages:
It allows the design of fast and highly scalable applications. An application will scale well with both load and number of CPUs.
It greatly simplifies application programming and debugging because, as a rule, no mutual exclusion locking is necessary and the entire application is free to use static variables and non-reentrant library functions.
The library's main limitation:
All I/O operations on sockets must use the State Thread library's I/O functions because only those functions perform thread scheduling and prevent the application's processes from blocking.
Portions created by SGI are Copyright © 2000 Silicon Graphics, Inc. All rights reserved.
State Threads is a C library for network programming: it provides a foundation for writing fast, highly concurrent, and readable network programs on UNIX-like platforms. It combines the simplicity of multithreaded programming, in which one thread serves one concurrent connection, with the performance and concurrency of an event-driven state machine architecture.
(Translator's note: it delivers the performance, concurrency, and stability of EDSM together with the simple, "multithreaded"-style programming model; a single thread simulates multiple threads using setjmp and longjmp, i.e., user-space threads, similar to today's coroutines and fibers.)
An Internet Application (IA) is a network client or server program: it accepts connections from clients and may also connect to other servers. In an IA, the arrival or completion of network data often drives the control flow; that is, an IA is data-driven. For each connection, an IA does a finite amount of work, including exchanging data with its peer, which may be a client or a server. The typical transaction steps of an IA are: accept a connection, read a request, do a finite amount of work to process it, and write the response back to the peer. One example of an IA is a Web server; a more general example is a proxy server, because it both accepts client connections and connects to other servers.
We assume that an IA's performance is constrained by the CPU rather than by network bandwidth or disk I/O; that is, the CPU is the system's bottleneck.
An IA's performance is usually evaluated by its throughput: transactions per second or bytes per second (the two are interconvertible given the average transaction size). There are many benchmarks for measuring Web applications under specific workloads, such as SPECweb96, WebStone, and WebBench. Although there is no universal definition of scalability, in general it means a system's ability to sustain its performance when external conditions change. For IAs, the external condition is either the number of connections (concurrency) or the underlying hardware (number of CPUs, memory, and so on). There are thus two kinds of scalability: load scalability and system scalability.
(Translator's note: scalability means whether the system still runs efficiently when conditions change. Load scalability asks whether the system can carry the load when concurrency grows; system scalability asks whether adding CPUs and other hardware efficiently yields more capacity.)
The figure below shows how throughput changes as the number of clients grows; the solid blue line is the idealized case. At first, throughput grows linearly; in this range the system and its CPUs are partly idle. A further increase in connections saturates the system, and throughput hits the ceiling the fully utilized CPUs can sustain; beyond that point it stays flat because no CPU cycles are left. In practice, every connection consumes computational and memory resources even when idle, and this overhead grows with the number of connections; therefore the throughput of a real IA starts dropping after some point (the dashed blue line). Where that drop begins is determined by nothing other than the system's architecture.
We say a system has good load scalability if it still works well under high load. The SPECweb99 benchmark reflects load scalability fairly well, because it measures the maximal number of clients the system can support at a mandatory minimal per-client throughput (translator's note: the Capacity point in the figure, where the gray slope crosses the blue line). This is unlike SPECweb96 and other benchmarks, which use throughput as the main metric (translator's note: the Max throughput in the figure, i.e., the blue line's ceiling).
System scalability is the program's performance as hardware units such as CPUs are added; in other words, good system scalability means that doubling the CPUs roughly doubles the throughput (the dashed green line in the figure). We assume the underlying operating system also scales well. Good system scalability means that a program running well on a small machine can also achieve high performance when moved to a large server; that is, changing the system size does not require the application to be rewritten or ported with great effort.
(Translator's note:
The vertical axis is throughput; the horizontal axis is the number of clients.
The gray line (minimal acceptable throughput per client) is the throughput each client needs for the service to feel smooth.
The solid blue line is the idealized server: system capacity is never a problem, and throughput reaches the maximum the fully loaded CPUs can sustain.
The dashed blue line is a real server: every connection consumes CPU and memory, so throughput starts to drop after some critical point, and that point is determined by the system's architecture. A good architecture pushes the point further out and supports higher concurrency stably; a poor one may grind to a halt as concurrency grows.
The dashed gray lines mark the two benchmarks: SPECweb96 measures the system's maximal throughput, while SPECweb99 measures the maximal number of clients at a required minimal per-client throughput; the latter better reflects load scalability, because it probes the system under varying numbers of connections.
Load scalability refers to the maximal load the system can carry: on the horizontal axis, the point where the blue line crosses the gray line, or where the blue line starts to fall.
System scalability refers to whether throughput grows when server capacity, such as the number of CPUs, is increased (the green line). With good system scalability, more CPUs yield more performance; with poor system scalability, adding CPUs does not help.)
Although performance and scalability matter more for servers, client applications such as performance-testing load generators must consider them too.
Concurrency reflects a system's parallelism. It comes in two kinds, virtual concurrency and real concurrency:
Virtual concurrency is the number of simultaneous connections the operating system supports.
Real concurrency is the hardware devices, such as CPUs, network cards, and disks, that let the system actually execute tasks in parallel.
An IA must provide virtual concurrency to serve many users simultaneously. To achieve maximal performance, the number of programming entities an IA creates to be scheduled by the kernel should stay close to (within an order of magnitude of) the real concurrency (translator's note: roughly, use as many processes as there are CPUs). These kernel-scheduled programming entities are kernel execution vehicles, such as Solaris lightweight processes and IRIX kernel threads. In other words, the number of kernel execution vehicles should be dictated by the hardware, not by the concurrency (translator's note: the number of processes should be decided by the CPUs, not by the number of connections).
IAs (Internet Applications) commonly use a few architectures: the Multi-Process architecture, the Multi-Threaded architecture, and the Event-Driven State Machine architecture.
(Translator's note: "Multi-Process" literally means multiple processes, but event-driven state machines also often use multiple processes; to distinguish the two, read it as the process-per-connection architecture.)
In the Multi-Process (MP) architecture, a dedicated process serves each connection. A process handles a connection from initialization to completion, and only then moves on to serve another connection.
User sessions are entirely independent, so no synchronization is needed between the processes handling different connections. Because each process has its own address space, this architecture is very robust: if the process serving one connection crashes, the other sessions are unaffected. However, to serve many concurrent connections, an equal number of processes must be created. Since processes are kernel objects, and in fact the heaviest kind, the kernel ends up with at least as many processes as there are connections. On most systems, performance degrades badly once more than a few hundred processes exist, because of the heavy context-switching overhead. In other words, the MP architecture has poor load scalability and cannot sustain high concurrency.
On the other hand, the MP architecture has very good system scalability (use of system resources, stability, simplicity), because the processes share no resources and therefore carry no synchronization overhead.
The Apache Web Server uses the MP architecture.
(Translator's note: "Multi-Threaded" literally means multiple threads, but the emphasis is on one thread serving one connection, so read it as the thread-per-connection architecture.)
In the Multi-Threaded (MT) architecture, multiple independent threads share one address space. Like a process in the MP architecture, each thread serves one connection from start to finish, and only then serves another connection.
Many modern UNIX operating systems implement a many-to-few model for mapping user-space threads onto kernel entities. In this model, an arbitrary number of user-space threads is multiplexed onto a smaller number of kernel execution vehicles, also called virtual processors. When a user-space thread makes a blocking system call, the kernel execution vehicle it is using blocks in the kernel. If no other kernel execution vehicle is unblocked and other user-space threads are runnable, a new kernel execution vehicle is created automatically, which keeps one blocked thread from blocking all the others.
Because IAs are driven by network I/O, all concurrent sessions block on I/O at various points. As a result, the number of kernel execution vehicles approaches the number of user-space threads, i.e., the number of connections. At that point, the many-to-few model degenerates into a one-to-one model; as in the MP architecture, the number of kernel execution vehicles is dictated by the concurrency rather than by the number of CPUs. As with MP, this reduces load scalability. Still, because kernel threads (lightweight processes) use fewer resources and are lighter than processes, the MT architecture scales somewhat better with load than the MP architecture.
In the MT architecture, the kernel threads share one address space, and the various synchronization locks destroy system scalability. Even if the application itself carefully avoids locks around its own global data (a non-trivial task), standard library functions and system calls still lock shared resources. For example, the platform's thread-safe memory allocation functions (malloc, free, and so on) often use a single global lock. Another example is the per-process file descriptor table, which is shared by the kernel threads and must be protected during system calls such as open and close. Beyond that, keeping the caches coherent across CPUs on multiprocessor systems severely degrades performance when threads on different CPUs modify data on the same cache line.
To improve load scalability, a different kind of MT architecture appeared: create one or more groups of threads per task, rather than one thread per connection. For example, one small group of threads accepts client connections, another processes requests, and another sends responses. The main advantage is that it decouples the number of threads from the concurrency; the same number of threads is no longer needed to serve the connections. However, the thread groups must share work queues protected by locks (the classic producer-consumer problem). The extra synchronization overhead makes the application perform poorly on multiprocessor systems. In other words, this architecture trades system scalability for load scalability (performance for concurrency).
Of course, the nightmares of threaded programming, including data corruption, deadlocks, and race conditions, also make any form of the MT architecture hard to put into practice.
In the Event-Driven State Machine (EDSM) architecture, one process handles many concurrent connections. Comer and Stevens [Reference 2] describe the basics of this architecture. In EDSM, each connection is driven by data one step at a time (translator's note: e.g., receive one packet, act once), so one process multiplexes many concurrent connections; the process is designed as a state machine that handles each incoming event and then transitions to the next state.
In the idle state, the EDSM calls select/poll/epoll to wait for network events. When a particular connection becomes readable or writable, the EDSM invokes the corresponding handler and then moves on to the next connection. The EDSM architecture performs asynchronous network I/O using non-blocking system calls. For details on non-blocking I/O, see Stevens [Reference 3].
To exploit hardware parallelism, multiple identical processes can be created; this is Symmetric Multi-Process EDSM, used for example in the Zeus Web Server [Reference 4] (translator's note: a commercial high-performance server). To better multiplex disk I/O, special helper processes can be created; this is Asymmetric Multi-Process EDSM, proposed for Web servers by Druschel and others [Reference 5].
EDSM is probably the best architecture for IAs: because the concurrent connections are fully decoupled from the kernel processes, it has very high load scalability and needs only a small amount of user-space resources to manage the connections.
Like the MP architecture, multi-process EDSM also has very good system scalability (multi-core performance, stability, and so on), because the processes share no resources and carry no synchronization overhead.
Unfortunately, the EDSM architecture is monolithic rather than built on the concept of threads, so a new EDSM system has to implement its state machine from scratch (translator's note: what the state machine saves is essentially a thread's stack: where the last call left off and where the next step resumes, just as a thread would). In effect, EDSM simulates multithreading in a very complicated way.
The State Threads library combines the advantages of all the architectures above. Its API preserves a thread-like programming style, allowing each concurrent connection to be handled in its own "thread" of execution, with all these threads inside one process. The underlying implementation is close to the EDSM architecture: the state of each concurrent session is kept in its own memory segment.
(Translator's note: State Threads provides exactly the EDSM machinery, but replaces the hand-written state machine with its "threads" (coroutines or fibers), which are implemented in a single thread per process yet behave like multiple threads. So the State Threads model keeps EDSM's performance and concurrency while offering MT's programmability and simple interface, doing away with EDSM's state machine part.)
Each concurrent session contains its own stack environment (stack pointer, program counter, CPU registers) and its own stack. Conceptually, a thread context switch is just the process changing its state; no kernel entities other than the process are involved (translator's note: it simulates multiple threads in a single thread). Unlike other general-purpose threading libraries, the State Threads library is fully deterministic: a thread context switch (a process state change) can only occur inside a well-known set of functions, at I/O points or explicit synchronization points. Consequently, process-level global data normally needs no mutual exclusion locks, since execution is single-threaded. The whole program is free to use static variables and non-reentrant functions, which greatly simplifies programming and debugging while improving performance. This is similar to coroutines, except that no explicit yield is required: sooner or later a thread calls a blocking I/O function and surrenders control. All threads (concurrent connections) have the same priority, so scheduling is non-preemptive, as in the EDSM architecture. Since IAs are data-driven (processing is bounded by network buffer sizes and data arrival), scheduling is not time-sliced.
The library's scheduler handles only two kinds of external events, because only these can be detected by select/poll:
1. I/O events: a file descriptor becomes ready for reading or writing.
2. Time events: a specified timeout expires.
Even so, other kinds of events (such as a signal sent to the process) can be handled by converting them into I/O events: for example, a signal handler can write to a pipe when a signal arrives, thereby turning the signal into an I/O event.
To better exploit hardware parallelism, as in the EDSM architecture, symmetric or asymmetric processes can be created. Process management is not part of the library; it is left to the application.
Some general-purpose threading libraries implement a many-to-one model (many user-space threads on one kernel execution vehicle) using the same basic techniques as State Threads (non-blocking I/O, an event-driven scheduler, and so on); see, for example, GNU Portable Threads [Reference 6]. Being general-purpose, these libraries have goals different from State Threads. State Threads is not a general-purpose threading library; it is designed only for IA systems that need the highest possible performance, concurrency, scalability, and readability.
State threads are very lightweight user-space entities, so creating and maintaining user connections takes very few resources. A system built on State Threads performs very well under high concurrency.
On multi-CPU systems, the program must create multiple processes to exploit hardware parallelism. Using separate processes is the only way to obtain the highest system scalability, because duplicating per-process resources is the only way to avoid the overhead of locks and synchronization on multiprocessor systems, and creating separate UNIX processes naturally duplicates those resources. Again, as in the EDSM architecture, there is no tie between the number of concurrent connections (which may be very large and vary widely) and the number of kernel entities (typically small and constant). In other words, the State Threads library multiplexes a large number of concurrent connections onto a much smaller number of separate processes, achieving both high system scalability and high load scalability.
High performance is one of the library's main goals. It is implemented to minimize the number of system calls and to make thread creation and context switching as fast as possible. For example, there is no per-thread signal mask (unlike POSIX threads), so a context switch does not save and restore the process's signal mask, eliminating two system calls per switch. Signal events can be handled efficiently by converting them into I/O events (as described above).
The State Threads library builds on the same underlying concepts as the EDSM architecture: non-blocking I/O, file descriptors, and I/O multiplexing. These concepts are available in some form on most UNIX platforms, so the library is very portable across UNIX flavors; only a few parts of the source are platform-specific.
The State Threads library is derived from the Netscape Portable Runtime library (NSPR) [Reference 7]. NSPR's primary goal is to provide a platform-independent layer over system facilities, including threads, thread synchronization, and I/O; performance and scalability are not NSPR's main concerns. State Threads addresses performance and scalability while being much smaller than NSPR: it consists of only 8 source files yet provides everything needed to write efficient IA systems on UNIX:
|                                       | NSPR     | State Threads |
|---------------------------------------|----------|---------------|
| Lines of code                         | ~150,000 | ~3000         |
| Dynamic library size (debug version): |          |               |
| IRIX                                  | ~700 KB  | ~60 KB        |
| Linux                                 | ~900 KB  | ~70 KB        |
State Threads is a library that provides a foundation for writing IAs. To summarize, its advantages are:
1. It enables the design of fast and highly scalable IA systems, with both high load scalability and high system scalability.
2. It greatly simplifies programming and debugging: as a rule, no mutual exclusion locks are needed, and the whole application can freely use static variables and non-reentrant library functions.
Its main limitation:
1. All socket I/O must go through the library's I/O functions, because only they perform thread scheduling and keep the application's processes from blocking (translator's note: the scheduler has no control over the operating system's own socket I/O calls).