Scheduler Message Queue Levels

Each server has an associated scheduler thread. This thread is in charge of creating server threads for each new client transaction. It keeps a table of active server threads, and forwards client requests to the corresponding server threads. The scheduling is based on the following property: in each replica at a given time only a single thread is running, either the scheduler thread or a server thread. Each time a server thread returns control to the scheduler thread, the scheduler thread chooses the next server thread to be executed (if any). A server thread returns control when either it finishes its execution, or it blocks on a lock when accessing a datum, or when it is about to accept a request. An important problem to be solved has to do with the fact that a totally ordered multicast only guarantees that messages are delivered in the same order in all the members of the group. But this kind of multicast does not ensure that the number of messages queued at a given instant is the same in all the replicas. A message will be queued first in some replicas and later in other ones. The scheduling algorithm must take this fact into account to guarantee replica determinism so the scheduler behavior does not depend on the number of messages already queued. To illustrate the problem let's study the following example. At a given instant the message queue of a replica can contain the message m₁ while the message queue of another replica can contain the messages m₁ and m₂. Observe that the example does not break the total ordering of the messages (m₂ can be queued later on the former replica). If the code of the replicas contains a selective reception (like the select statement of Ada [7]) accepting both messages, the former replica will choose the message m₁, as it is its only option, but the latter replica can choose either the m₁ or m₂, and thus determinism can be broken. What it is more, in some languages as Ada, it is also possible to use guards referring to the number of messages queued in some entry that could force to choose m₂ in the latter replica.

**Figure 2:** Queue levels in the scheduler

In our proposal we have used for each replica three message queue levels (fig. 2) to deal with this problem. The most external level corresponds to the group communication layer and we call it communication level. In this queue the total ordering of messages is guaranteed. As it has been noted, the length of this queue can differ in different replicas thus breaking the determinism. The problem is solved if each replica uses the minimum prefix of the communication queue that it needs to progress. This minimum prefix constitutes the third queue level or entry queue level. The number of messages in this level at a given step will be always the same in all the replicas of a server. In the entry level there is a queue per server thread (transaction) and entry (service), this is why we have called it entry level. Our scheduling algorithm was aimed to a system implemented in Ada 95 where a task can either accept a request from a particular entry (by means of the accept statement), or accept a request from a set of entries (by means of a select statement). The entry level corresponds to the Ada95 task entry queues and thus any request on this level can be directly accepted by a server thread by means of an accept or a select statement. An additional medium level (transaction level) has been used to ensure that each server thread sees the minimum set of messages that it needs to progress. Note that the entry queue level ensured that the whole replica sees the minimum set of messages while this one also ensures the same property all the server threads of a replica (what is stricter). Without this queue level, a server thread can contain in its entry level queues more messages than it needs to progress. This level is called transaction level because when messages are moved from the communication level to this level they are classified by the transaction that issued them. When the scheduler gains control, if all the server threads are blocked, it must take messages from the communication level queue and put them on the corresponding transaction level queue until a server thread can progress or the communication queue becomes empty. Before activating a server thread, the scheduler ensures that it can progress by moving as many messages as needed by the server thread to progress from its transaction queue to its corresponding entry queues. That is, this process will end when an awaited message is moved to an entry level queue. Despite of the queue levels when a selective reception takes place more than one message can be selectable. In this case it is necessary to take a deterministic decision so all the replicas choose the same message and so consistency is mantained. The queue levels ease this task, as all the replicas will consider the same set of messages at a given step. A straightforward decision is to take the oldest awaited message, although more involved decisions are allowed in our approach. It must be observed that context changes take place when a server thread makes a reception (selective or not), or when it blocks on an incompatible lock. Another kind of messages not discussed yet are internal messages, that is, the ones reporting about client termination or transaction abortion or commitment. They are handled in the same way than explicit requests, that is, the multicast used to send them has the same properties. This ensures that transactions will be serialized in the same way in all the replicas because locks will be freed in the same order in all the replicas. Additionally, transaction ends (and client terminations) will cause the removal of server threads and this must be accomplished in the same order in all the replicas. Although the scheduling algorithm is aimed to accept as few messages as possible from the communication level queue, the end of a transaction can unblock more than one server thread. This means that sometimes there will be more than one thread ready to run. These threads will be kept in a queue (ready threads queue), for the sake of fairness. The determinism is not compromised due to the insertions and removals from this queue are done deterministically. Insertions will take place when:

A server thread transfers the control to the scheduler before accepting a message that it is already queued on its transaction or entry level queues.
A message extracted from the communication level queue unblocks a server thread.
When the release of locks (due to a transaction end) unblocks one or more server threads.

Removals will only take place to transfer the control to a server thread.

1999-11-17