A FORMAL MODEL FOR THE PARALLEL SEMANTICS OF P3L

A. Cavarra
Dip. di Matematica e Informatica
Univerisità di Catania
Italy

E. Riccobene
Dip. di Matematica e Informatica
Univerisità di Catania
Italy

A. Zavanella
Dip. di Informatica
Univerisità di Pisa
Italy

Abstract

    Some formalizations have been developed in order to give a complete description of the parallel semantics of P3L, a programming language which ensures both task and data parallelism. However, the description level provided by the employed formalisms is either too much `abstract' - unable to grasp the operational behavior of the skeletons of the language - or too much `concrete' - close to implementation details and therefore related to a particular machine model.
    This paper shows how the Gurevich's Abstract State Machines (ASMs) have been successfully employed to describe the parallel semantics of skeletons of P3L completely abstracting from any implementation detail.
    The proposed model describes both the operation of the compiler which leads to define a network of processes starting from a given program, and the computation of the running processes.
    The model has been also taken as a theoretical basis to prove some of the most interesting rewriting rules used by the compiler to get a better program performance and efficiency.

1 Introduction

P3L - Pisa Parallel Programming Language - is a parallel language based on the skeleton idea. A skeleton is a build-in higher-order function representing a parallel construct which may have a sequential code as parameter. The sequential part of the program can be written by using a host (sequential) language.
It is widely recognized the validity of using formal methods in software development, especially in defining semantics of programming languages. Many are the advantages of formal semantics: a precise and unambiguous meaning, the possibility of applying mathematical reasoning on the formal model to prove - whenever available even by machine support - rewriting rules and other properties, the opportunity of using tools for language development and design, and for automatically generate compilers and interpreters.
Several attempts [1,8,4,13] as

farm(f(x₁,x₂,...,x_n)) = = (f(x₁),f(x₂),...,f(x_n))

i.e. computing f on a stream means computing f on each element of the stream. Clearly, this notation is unable to capture the ``parallel meaning'' of the construct. On the other side, approaches as in [1,8,4] describe the parallel behavior of a P3L construct in ``human language'' and give examples of the corresponding implementations on particular architectures (as Transputers, those using PVM, etc.).
    Indeed, what we need is a formalism powerful enough to be able to abstract from: (a) the creation of parallel activities, (b) the communication system, (c) the effect of the environment during the compiling process. Moreover, we need a framework rigorous enough to show properties of the compiler and eventually to drive the porting of the framework on different platforms.
    In this paper we develop a complete operational semantics of P3L programs using the Gurevich's Abstract State Machines (ASMs) (previously called evolving algebras [12]). This formalism describes the operational behavior of real dynamic systems in terms of transition rules, and it has been applied successfully to specify real programming languages and architectures, to validate standard language implementations, to verify real-time protocols, etc. (see available materials and ASM bibliography in [3]).
By using ASMs we are able to tune the ``abstraction level'' and to express the behavior of P3L independently from any particular implementation.
    According to the skeleton based language approach [4,10,9], a skeleton on a specified architecture is implemented by a template, i.e. a parametric process network, strictly depending on the target architecture. In order to give a machine-independent semantics of the P3L constructs, we describe their operational behavior in terms of a P3L virtual machine M based on an abstract architecture A which includes an unbounded number of processing nodes interacting by means of a complete interconnection topology. The implementation template of a P3L construct in A is unique and corresponds to a well-defined logical process graph [13,1]. The semantics of each construct is caught by the computational behavior of its corresponding (logical) network of processes.
    In terms of operations of the virtual machine M we describe (a) the compiler actions to build the logical process graph from the given program, and (b) the subsequent computation of the process network yielding the semantics of the program. Programs for M are ASM transition rules. We provide transition rules both for the logical graph construction phase and for the process network running phase. In such a way we compile ``P3L programs'' in ``ASM programs'' executable by the P3L virtual machine M. The advantages are the following. (1) We give the meaning of P3L programs in terms of ASM programs which have a precise semantics. (2) We describe the operational behavior of P3L programs independently by any machine and abstracting from any choice of specific architecture. (3) We take the formal model as a mathematical basis to prove some interesting rewriting rules currently used by the compiler to optimize the source code. All other attempts [1,2] to prove such properties are informal. Our proofs are mechanizable through the use of theorem provers or model checkers, exploiting existing encoding of ASMs in PVS [11] and SMV [15]. (4) Applying the technique of abstract machine models refinement, it is possible to drive, in a controlled and reliable way, the implementation of the P3L parallel constructs on a specific target architecture. Such technique of stepwise refinements to executable code has already successfully employed in real case studies of software system development [6,16,5].
    The paper is organized as follows. Section 3 briefly introduces the reader to the Gurevich's Abstract State Machine. An overview of P3L is given in Section 3. The compiler operation resulting in the generation of the logical process graph is presented in Section 4; whereas the computation of the running processes on the network is described in Section 5. The most relevant rewriting rules used by the P3L compiler to get a better program performance and efficiency are stated in Section 6 where we also report proof results. Related works and conclusions are discussed in Section 7.

2 The Gurevich's Abstract State Machines

We quote here only the essential definitions of the Abstract State Machines defined in [12]. We refer to [3] for more details.
    Given a program Prog consisting of finite number of transition rules and a (class of) initial state(s) S₀, a Gurevich's Abstract State Machine M models the operational behavior of a real dynamic system S in terms of state transitions. A state is a first-order structure S_i representing the instantaneous configuration of S at time i.
    The basic form of a transition rule is the following function update f(t₁, ..., t_n): = t where f is an arbitrary n-ary function and t₁, ..., t_n, t are first-order terms. To fire this rule to a state S_i, i >= 0, evaluate all terms t₁, ..., t_n, t at S_i and update the function f to t on parameters t₁, ..., t_n. This produces another state S_i+1 which differs from S_i only in the new interpretation of the function f.
    There are some rule constructors.
The conditional constructor produces ``guarded'' transition rules of the form: if g thenR₁elseR₂ where g is a ground term (the guard of the rule) and R₁, R₂ are transition rules. To fire a guarded rule to a state S_i, i >= 0, evaluate the guard; if it is true, then execute R₁, otherwise execute R₂. The else part may be omitted.
    The parallel constructor produces the ``parallel'' synchronous rule of form: do forall x: x in U R(x) where R(x) is a transition rule with a variable x ranging over the universe U. The operational meaning of a parallel rule consists in the execution of R(x) for every x in U.
    In order to specify some systems which are characterized by sequential operations, we may need a sequential application of rules. This could be easily achieved by guarded rules with suitable flags. Because it would results in a syntactical heaviness, we make use of the sequential constructor: seq R₁ ... R_n Each rule R_i is fired only after R_i-1, i = 2,3,...,n, and causes the transition to a new state.
    If we need a sequential transition with more than one rule per each step, we can use the block constructor, do in-parallel R₁ ... R_nenddo to apply R₁, ..., R_n simultaneously, inside a seq constructor.
    We also make use of the following construct to let universes grow: extend U withx₁, ..., x_n withUpdates where Updates may (and should) depend on the x_i and are used to define certain functions for (some of) the new objects x_i of the resulting universe U.
    To avoid the trouble of writing a lengthly term t several times in a rule, we use the constructor, let x = t R(x) The let rule above is equivalent to the rule R(x -> t) resulting by substituting t for x.
    In ASM it is possible to express the non-determinism using the choose constructor: choose x in U s.t. cond(x) R(x) which allows to fire the rule R(x) choosing randomly an x in U satisfying conditions cond(x).
    State transitions of M may be influenced in two ways: internally, through the rules of the program Prog, or externally through the modifications of the environment. A non static function f that is not updated by any transition rule is called monitored function and its values are given non-deterministically by the environment. Non static functions updated by a transition rule are called controlled functions.
    A computation of S is modeled through a finite or infinite sequence S₀, S₁, ..., S_n, ... of states of M, where S₀ is an initial state and each S_n+1 is obtained from S_n by firing simultaneously all of the rules of Prog to S_n.

3 An overview of P3L

P3L - Pisa Parallel Programming Language - is a structured language based on skeletons (also called ``constructs'' in P3L terminology). A skeleton is a predefined set of higher-order functions. Each skeleton represents a parallel paradigm (a form of parallelism), while the functions executed sequentially are specified by a sequential programming language, called host language ¹.
Let us describe briefly the constructs of P3L as in [4]. A complete syntax definition can be found in [8].

The seq construct corresponds to a sequential process that, for each data item coming form an input stream, produces a new data item of an output stream.
The farm construct models processor farm parallelism. In this form of parallelism, a set of identical workers execute in parallel the tasks which come from an input stream, and produce an output stream of results.
The map construct models data parallel computations. In this form of parallelism, each input data item from an input stream is decomposed into a set of partitions, and assigned to a set of identical and parallel workers. The results produced by the workers are recomposed to make up a new data item of an output stream of results.
The pipe construct models pipeline parallelism. In this form of parallelism, a set of stages execute serially over a stream of input data, producing an output stream of results.
The reduce construct models the reduction of an array by an associative binary operator. A set of identical workers execute in parallel the computation of the binary operation between two elements of the array. Observe that the only reducible structures are arrays.
The loop construct models computations where, for each input data item, a loop body has to be iteratively executed, until a given condition is reached and an output data item is produced.

The P3L skeletons can be hierarchically compound in order to construct complex forms of parallelism: this property has important consequences on the semantics of the constructs. Each instance of construct, in fact, behaves as a functional module - Mod - elaborating data coming from the input channel (input stream) and putting results into the output channel (output stream). The order of tasks in a stream has to be preserved, i.e. [in₁, in₂, ..., in_n] -> Mod -> [f(in₁), f(in₂), ..., f(in_n)], being f the function computed by the module Mod.
A P3L program is represented by a nested structure of constructs named constructs' tree. All the leaves of the tree are sequential fragments embedded into the sequential construct seq.
The methodology of P3L includes a compiler able to analyze the tree and decide a strategy to optimize the program. The compiler exploits two different libraries, the templates library and the rewriting rules library, to drive the optimization. The template library includes the scheme to implement each of the parallel constructs of the language. The scheme also provides a set of analytical formulae to predict the behavior of the template. The rewriting rules library contains a set of rules which leave inalterate the semantics of the program. The compiler uses these rules and the performance formula of the template library to get a better performance or a better efficiency. The compiler also takes implementation decisions as the parallelism degree of the program and the grain of communications by using a view of the execution costs onto the target machine.

4 Graph Construction Phase

In this section we describe the compiler operation of transforming each P3L construct in the corresponding logical process graph representing its implementation template on the abstract architecture A. Therefore, we define the transition system formalizing the logical graph construction phase: starting from a program P, a direct graph representing the whole process network is built.
Notice that we suppose the input program be syntactically correct and optimized in the sense that no rewriting rules are applicable.

4.1 Signature

The direct graph representation leads to define a universe NODE of the graph nodes. Each node may have more then one incoming and outgoing edge. The functions fanin, fanout : NODE --> NAT, being NAT the set of natural numbers, yield respectively the number of all incoming and outgoing edges of a node. We establish a numeration mechanism univocally labeling each of the incoming/outgoing edge in a node m by a number i in 1 ... fanin/ fanout(m). Given a node m and an outgoing edge i, to detect which node is connected to m by the edge i, we introduce the function receiver : NODE × NAT --> NODE such that receiver(m,i) = n. We use the abbreviation receiver(m) = receiver(m,1).
To represent P3L programs and I/O communication channels, we introduce sets PROGRAM and CHANNEL.

Elements of PROGRAM are nested structures of the form:
pipe (in,out,p₁(in,out),...,p_k(in,out))
farm (in,out,p(in,out))
map (in,out,p(in,out))
loop (in,out,feedback,p(in,out),condition)
reduce (in,out,p(in,out))
where in,out,feedback in CHANNEL, and p, p₁,..., p_k in PROGRAM. (In the sequel we omit arguments in and out of a program p). According to the P3L syntax, we assume as constraint that the program p within a construct reduce contains an associative operator, and that the construct pipe must contain more than one stage.
We start from an initial configuration as depicted in Fig.1. P is the input program. Since we dynamically build the process network, at any stage of this phase we expand the network replacing the current node with a new network fragment according to the structure of the program associated to that node. The program fragment is associated with a node by the function pgm : NODE --> PROGRAM. We keep track of the current node by the 0-ary function compt : NODE yielding the current compilation point. The function succ : NODE --> NODE returns the next
compilation point.

Figure 1: Starting configuration of the logical graph

4.2 Transition Rules

4.2.1 The construct `pipe`

If the program of the compt is a pipe statement of the form pipe(in,out,p₁,...,p_k), a network performing the computation of the k stages in pipeline must be created (see Fig.2). By the PipeConstrRule the node compt is replaced by a chain of k nodes (k-1 new ones and compt redefined as the head of the chain); each node of the chain takes a stage p_i as program. The pipe stages p₁, ..., p_kwill be expanded in turn (accessed by the function succ) starting from the program p₁of the compt.

Figure 2: Logical graph for pipe

PipeConstrRule
if pgm(compt) = pipe(in,out,p1,...,pk)
then   let n₁ = compt
         extend NODE with n₂, ..., n_k
                              pgm(n_l): = pl
                              receiver(n_k) : = receiver(compt)
                              receiver(n_i): = n_i+1
                              succ(n_i): = n_i+1
                              succ(n_k): = succ(compt)
where 1 <= i < k, 1 <= l <= k

4.2.2 Parallel constructs `farm` and `map`

If the current program is a farm statement of the form farm(in,out,p), the node compt must be replaced by a network consisting of an emitter e - redefinition of compt -, a collector c and m workers n₁,..., n_m. The value m is the parallel degree of the (current) farm statement, and it is computed by the function parDegree : PROGRAM --> NAT that, given a program, returns the number of processes that have to be created to compute it in the most efficient way. By this function we leave abstract the parallel degree computation. However, it could be possible to refine the definition of the function parDegree taking in consideration computational models as in [18,7].
The newly created nodes are connected as depicted in Fig.3 (read par as farm) by the function receiver. The nodes e and c take programs farmEmit and farmCollect, respectively, since they will perform the emitter and collector operation for a farm (see next section); the workers' programs are defined as the farm body p and they will be expanded in turn accessed by the function succ. The expansion proceeds on the first node n₁.
The construction of the graph structure of a map statement having form map(in,out,p), is similar to the one for the farm construct, except for the programs mapEmit and mapCollect assigned to the emitter and the collector nodes (see Fig.3 reading par as map). These differences are outlined in the rule by representing the specific map statements enclosed between square brackets.

Figure 3: Logical graph for farm[map]

FarmConstrRule [MapConstrRule]
if pgm (compt) = farm(in,out,p) [map(in,out,p)]
then let m = parDegree(farm(in,out,p))
                        [parDegree(map(in,out,p))]
                   e = compt
         extend   NODE with n₁, ..., n_m, c
                             pgm(n_l): = p
                             pgm(e): = farmEmit [mapEmit(in,out,p)]
                             pgm(c): = farmCollect [mapCollect(in,out,p)]
                             receiver(e,i): = n_i
                             receiver(c): = receiver(compt)
                             receiver(n_i): = c
                  if m > 1
                  then   succ(n_i): = n_i+1
                                           succ(n_m): = succ(compt)
                                           fanout(e): = m
                                           fanin(c): = m
                                           compt : = n₁
where 1 <= i < m, 1 <= l <= m

4.2.3 The construct `loop`

The loop statement loop(in,out,feedback,p,condition) produces a network as in Fig.4. The node compt (now renamed li) takes the program loopIn becoming the input process of the loop instruction, and is connected to a new node n. This new node receives the program p specified in the body of loop and is connected to the node lo. lo gets program loopOut and therefore works as the output process of the loop; it will make use of the information yielded by the function pgmcond to control the exit condition of the loop. Nodes lo and li are in their turn connected by the feedback channel. The construction proceeds on the node n containing the body of the loop.

Figure 4: Logical graph for loop

LoopConstrRule
if pgm(compt) = loop(in,out,feedback,p,condition)
then let li = compt
          extend NODE with n, lo
                           pgm(li): = loopIn
                           pgm(lo): = loopOut
                           pgmcond(lo) : = condition
                           pgm(n): = p
                           receiver(li) : = n
                           receiver(n): = lo
                           receiver(lo,1): = receiver(compt)
                           receiver(lo,2): = li
                           succ(n): = succ(compt)
                           compt : = n

4.2.4 The construct `reduce`

The program p argument of an instruction reduce(in,out,p) is defined in terms of an associative binary operation. The logical graph associated with reduce is a binary balanced tree structure. Therefore, we expand the graph replacing the current node compt by a binary tree whose leaves are all connected to an emitter node e through the function receiver (see Fig.5). Let s = size(in)/2. To build a balanced structure, we define a function redSucc:NAT--> NAT associating to an integer i the index of its unique child node in the (static) tree structure: redSucc(i) = ceil(i / 2)+s if odd(i); redSucc(i) = i / 2 + s if even(i).
Each tree node takes red(p) as program to indicate that p is a sequential program inside a reduce statement. The node e gets the program redEmit. The compilation carries on through the successor of the compilation point.

Figure 5: Logical graph for reduce

ReduceConstrRule
if pgm(compt) = reduce(in,out,p)
then let k = lg(size(in)+1)
                   s = size(in/2)
                   e = compt
          extend NODE with n₁,..., n_k
                            pgm(n_l): = red(p)
                            pgm(e): = redEmit
                            receiver(e,i): = n_i
                            receiver(n_j): = n_redSucc(j)
                            receiver(n_k): = receiver(compt)
                             fanout(e): = s
                            compt : = succ(compt)
where 1 <= i <= s, 1 <= j < k, 1 <= l <= k

4.2.5 The construct `seq`

A sequential statement seq(p) simply involves the update of the compilation point by its successor (Fig.6).

Figure 6: Logical graph for seq

SeqConstrRule
if pgm (compt) = seq(p)
then compt : = succ(compt)

5 Running Phase

We assume an active process is associated to each node of the network obtained after the logical graph construction phase. Therefore, the run of a given program P corresponds to the execution of the resulting network of processes. In the sequel we do not make any distinction between nodes and processes.

5.1 Signature Refinement

Among all the nodes of the network, we can distinguish the root as the unique element of NODE which is not the receiver of any other node, more formally: root in NODE /\ (forall n in NODE forall i in INT receiver(n,i) \= root).
Each node of the logical process graph has, as attached information, a program in the set {mapEmit, matCollect, farmEmit, farmCollet, seq(q), loopIn, loopOut, redEmit, red(q)}, being q an external sequential program code. Each active process computes the relative program and sends the output to the receiver node. A node can have mode:

receiving, if the process is waiting for information from its sender process;
working, when the process is computing its program;
finished, when the process is no more active. After receiving the signal EOS (End Of Stream) and sending it out, a process has definitively concluded its own computation and might be killed. (Note that we do not describe here the process of eliminating the no more active processes because this regards an actual implementation).

All the nodes apart from the root has initial mode receiving, whereas root has initial mode working. Note that for a loopIn node we need to refine the mode receiving in receivingIn; the reason for that is explained further.
    The function val : NODE --> DATA represents the local ``memory'' of a process and is updated anytime a sending action is performed by another process (see macro Send below). It is initially undefined for all the nodes apart from the root, where it holds IN(P), the input of P.
    All processes work upon changing data through the communication channels accessible by the function receiver. The sending and receiving operations are here voluntarily left abstract because we are not interested in modeling the message passing mechanism which can be considered as part of the implementation. Therefore, the communication among processes is abstractly formalized through the macros Send and Receive.
    Firing the macro ``Send val(m) to n'', m,n in NODE, allows to update val(n) to val(m). The effective updating happens if and when n is in mode receiving and it has executed a macro Receive. Otherwise, by the blocking effect of the macro Send, the computation of the node m suspends and will start again as soon as such a condition will hold.
    If a node n executes the macro Receive, then val(n) is updated to x, provided there is a node m such that n = receiver(m,i) for some i, that has executed the macro ``Send x to n''. Otherwise, the execution of n suspends (by the blocking effect of the macro Receive), to restart as soon as there will be a working node ready to send to n the expected value.
Note that for a loopIn node we make use of the macro ReceiveFromLoop which is a variant of the macro Receive, as explained further.
    To describe the map and the farm collector's operation, we need to take a counter of the workers' processes that have finished their own computation. This leads to the function count : NODE --> NAT which is initialized to 0 for all nodes n such that pgm(n) in {mapCollector,farmCollector}.

5.2 Transition Rules

We now describe the transition rules for the running phase. They are split into two groups: the Receiving Rules describing the reaction of a process upon receiving data; the Working Rules formalizing the operation of a working process according to the program to compute.

5.2.1 Receiving Rules

They have the effect to update val(node) when node is in mode receiving. Among all nodes of the network, we need to distinguish between those having program loopIn and all the others (as a node loopIn can receive data from two channels). All nodes in mode receiving, sequentially receive data and change mode in working (see Receiving Rule 1).

Receiving Rule 1 If mode(node) = receiving then seq Receive mode(node): = working			Receiving Rule 2 If pgm(node) = `loopIn`& mode(node) = receivingLoopRes then seq ReceiveFromLoop mode(node): = working

A node n which has program loopIn can receive values in two ways: from the outside ``input'' node, i.e. the node m such that receiver(m,0) = n, and from the related node loopOut, i.e. the node m such that receiver(m,1) = n. According with the semantics of the loop instruction, loopIn must give priority to the value received by loopOut. In order to manage such a situation, we refine the function mode on a node loopIn which gets mode receiving if it is ready to receive values from outside - as any other node of the network - and mode receivingLoopRes when it is waiting the result of the loop computation from its related node loopOut.
The Receiving Rule 2 describes the dynamics of a node loopIn in mode receivingLoopRes. The macro ReceiveFromLoop updates the value of val(node) just like the macro Receive, but the condition receiver(m,1) = node & pgm(m)= loopOut must be also satisfied, being m the sending node. That guarantees the node loopIn to get the result of the loop computation sent by the node loopOut.

5.2.2 Working Rules

In this section we describe the dynamics of a process in mode working.

The construct farm

A construct farm allows the parallel execution of tasks which come from an input stream and produces an output stream of results. The logical structure of a farm is shown in Fig.3: two processes emitter (e) and collector (c) perform the distribution of the data and the collection of the results, respectively. They are connected to a set of workers (w_i) which are instances of the module corresponding to the nested construct p inside the farm. The emitter's and collector's operations are formalized by the programs farmEmit and farmCollect, respectively.

* The program farmEmit.
The items of the input data stream (on which the tasks have to be computed) arrive at the emitter process which must distribute them to the workers. Therefore, if the received value is not the EOS which marks the end of the input stream, the process selects non-deterministically one of the ready sons, if any, sends to it the received item and changes mode, being now ready to receive the subsequent item of the input stream (see FarmEmitRule 1).
The set readySons (node) = { n in NODE | mode(n) = receiving & exists i: receiver(node,i) = n} contains those receiver nodes which are ready to receive and compute a new item.
When receives the EOS, the emitter sends EOS in broadcast to all its workers and changes mode to finished, having ended its own activity (see FarmEmitRule 2).

FarmEmitRule 1

If pgm(node) = farmEmit &
      mode(node) = working &
     val(node) \= EOS&
     readySons(node)\= {}
then
    choose n in readySons(node)
    seq Send val(node) to n
           mode(node): = receiving

FarmEmitRule 2

If pgm(node) = farmEmit &
      mode(node) = working &
      val(node) = EOS
then
    seq
       Broadcast(val(node))
       mode(node): = finished

The following macro performs the parallel sending of the same signal to all receivers:

Broadcast(val(n)) = do forall i in [1..fanout(n)]
Send val(n) to receiver(n,i)

* The program farmCollect.
The results of the workers' computation are received by the collector. Task results have to be sent out whenever available and in the same order of the corresponding data into the input stream. To perform the reorder activity, the collector has a buffer where to temporarily save the received results. The collector receives each item together with a tag representing the item's position into the output stream; each item is saved into the buffer - by the function insert(pos,buffer,value) - at the tag position. The function lastsent, initialized to 0, yields the tag of the last sent item. Thus, at any time, all consecutive items with tags starting from lastsent + 1 can be sent out. The function lenghtReadySequence yields the length of the larger sequence of consecutive items into the buffer which starts from lastsent + 1. If the collector receives an item (which is not the EOS) with tag greater than lastsent + 1, it inserts the item into the buffer and gets mode receiving (see FarmCollectRule 1). Otherwise, it first updates lenghtReadySequence, sends out all the items in the range between lastsent + 1 and lastsent + lengthReadySequence, updates lastsent, and then changes mode ready to receive new items.

FarmCollectRule 1

If pgm(node) = farmCollect &
       mode(node) = working &
        val(node)) \= EOS
then If tag(val(node)) > lastsent +1
         then buffer(node): = insert( tag(val(node)),buffer(node),val(node))
                      mode(node) : = receiving
         elseseq
                           update(lengthReadySequence)
                 do in parallel
                                        SendReadySequence(lengthReadySequence)
                                        lastsent : = lastsent + lengthReadySequence
                 enddo
                           mode(node) : = receiving

Note that to simplify the notation, we have skipped the argument buffer (node ) into the functions lastsent and lengthReadySequence.
    The macro update(lengthReadySequence) has the effect to set the variable lengthReadySequence to the value of the length of the larger sequence of results contained into the buffer and starting from lastsent + 1.
    The macro SendReadySequence(lengthReadySequence) stands for the following update:
              do forall i in [lastsent+1..lastsent+ lengthReadySequence]
                                     Send buffer(node)[i] to receiver(node)
    If the received data is EOS (see FarmCollectRule 2), the collector can not simply propagate the EOS and finish, because, before ending, it has to be sure that all its workers have finished, i.e. the collector must have received EOS from all workers. Therefore, upon receiving EOS, the collector increments a counter until this counter gets value fanin(node) -1. When that happens, the collector is sure to have received the last EOS, then sends out EOS and finishes.

FarmCollectRule 2

If pgm(node) = farmCollect &
       mode(node) = working &
       val(node) = EOS
then If count(node) = fanin(node) -1
         thenseq
                  Send val(node) to receiver(node)
                   mode(node): = finished
         else count(node): = count(node) +1
                  mode(node): = receiving

The construct map

A construct map allows another kind of parallelism which consists of decomposing a data item from an input stream into a set of partitions to be assigned for computation to identical workers which work in parallel. The results produced by the workers are recomposed to make up a new data item of an output stream of results.
    The logical structure of a map is shown in Fig.3: two processes emitter (e) and collector (c) perform the decomposition of the data and the composition of the results, respectively. They are connected to a set of workers (n_i) each of which computes the task p on the single data partition.
    The decomposition and the subsequent composition of data lead to introduce the functions partition and composition and, as integrity constraint, we assume composition to be the inverse function of partition modulo the application of the program p, i.e. composition(p(partition(d))) = p(d), d in DATA.
    The emitter's and collector's operations are formalized by the programs mapEmit and mapCollect, respectively.

* The program mapEmit.
When the emitter receives a data which is not EOS (see MapEmitRule 1), it decomposes the data - by the function partition - into a number of partitions given by the number of its related workers, and then sends each partition item to the corresponding worker. The mode changes to receiving. After receiving EOS (see MapEmitRule 2), the emitter sends EOS in broadcast to all its workers and changes mode to finished, thus stopping its (own) activity.

MapEmitRule 1
If pgm(node) = mapEmit(in,out,p)&
      mode(node) = working &
      val(node)\= EOS
then seq
         doforall i in [1...fanout(node)]
                       Send proj(i,partition(p,in,out, val(node),fanout(node)))
                                to receiver(node,i)
                  mode(node): = receiving

MapEmitRule 2
If pgm(node) = mapEmit(in,out,p)&
       mode(node) = working &
      val(node) = EOS
then seq
                  Broadcast(val(node))
                  mode(node): = finished

* The program mapCollect.
Collectors for map and farm work in a similar way. The main difference is that the map collector sends out the result of a map computation only after composing all the results received from the workers. The collector saves into its buffer all the significant (i.e. different from EOS) results received from the workers. The buffer size must be initialized to zero (see MapCollectRule 1).

MapCollectRule 1

If pgm(node) = mapCollect(in,out,p)&
     mode(node) = working &
     val(node)\= EOS
then
         buffer(node): = insert(buffer(node),val(node))
         mode(node): = receiving

The collector might receive EOS from a certain worker before having received all the workers' results (see MapCollectRule 2). Therefore, it must count the received EOS (by incrementing a suitable counter as described for the farm) and each time an EOS arrives, it must control if the buffer is full (i.e. the buffer size is equal to the number of workers) to send out the final result under composition of the buffer data. When the last EOS arrives, it sends it out and finishe

MapCollectRule 2
If pgm(node) = mapCollect(in,out,p)&
      mode(node) = working &
     val(node) = EOS
then If count(node) = fanin(node) -1
         thenseq
                           Send val(node) to receiver(node)
                           mode(node): = finished
         else count(node): = count(node) +1
            if size(buffer(node)) = fanin(node)
            then size(buffer(node)) : = 0
                        seq
                              Send composition(p,in,out,buffer(node))
                                          to receiver(node)
                              mode : = receiving
            else mode(node): = receiving

The program seq

A process which has to compute a sequential program, upon receiving a data different from EOS, sends out the result of computing the associated program on the received input (see SeqRule 1). Otherwise, if it receives EOS changes mode in finished after sending, in turn, EOS (see SeqRule 2).

SeqRule 1

If pgm(node) = seq(p)&
      val(node)\= EOS&
      mode(node) = working
then   seq
            Send p(val(node)) to
                        receiver(node)
             mode(node): = receiving

SeqRule 2

If pgm(node) = seq(p) &
      val(node) = EOS&
      mode(node) = working
then seq
            Send val(node) to
                       receiver(node)
             mode(node): = finished

The program redEmit

The instruction reduce also involves a process emitter to perform the data distribution to the associated workers. The received inputs are arrays of primitive data and size double respect to the number of workers.
When the emitter receives an input which is not EOS (see RedEmitRule 1), it sends to each worker a pair of data in order to compute the associative (binary) operation on such input pair. The i-th worker receives the pair i-th consisting of the components (2i-1)-th and 2i-th of the array. The macro SendPair formalizes the pair data distribution, sending sequentially each component. When receives the EOS, the emitter sends EOS in broadcast to all its workers and changes mode to finished (see RedEmitRule 2).

RedEmitRule 1 If pgm(node) = `redEmit`& mode(node) = working & val(node)\= `EOS` then seq doforall i in [1..fanout(node)] SendPair i_th(val(node)) to receiver(node,i) mode(node): = receiving where SendPair (a,b) to N = = seq Send a to N Send b to N
RedEmitRule 2 If pgm(node) = `redEmit`& mode(node) = working & val(node) = `EOS` then seq Broadcast(val(node)) mode(node): = finished

The program red

A worker computing the program within a reduce instruction, works as a usual sequential process. However, we need to distinguish these two categories because the worker of reduce needs to wait for two inputs before computing. We make use of the function aux_val on NODE to save the first received input.

RedRule 1
If pgm(node) = red(p)&
      val(node) \= EOS&
      mode(node) = working
then
     If auxVal(node)\not = undef
     then seq
                 Send p(val(node), auxVal(node))
                            to receiver(node)
                  mode(node): = receiving
               auxVal(node): = undef
     else auxVal(node): = val(node)
               mode(node): = receiving

RedRule 2
If pgm(node) = red(p)&
      val(node) = EOS&
      mode(node) = working
then seq
              Send val(node) to receiver(node)
              mode(node): = finished

The construct loop

The logical structure of a loop is shown in Fig.4: two processes loopIn (li) and loopOut (lo) perform the iterated calls of the nested construct p inside the loop. Their operation modes are formalized by the programs loopIn and loopOut.

* The program loopIn.
The process computing p can take its input data items either from the input stream, or from the stream of the results produced by previous calls of itself - if the final loop condition has not been reached -. The process loopIn discriminates between input and feedback streams. Therefore, upon receiving a data different from EOS and EOL (``End Of Loop'', see below), the process loopIn sends the data to the connected worker and changes mode in receivingLoop so temporarily stopping the receiving data from the input stream (see LoopInRule 1).

LoopInRule 1
If pgm(node) = loopIn&
      val(node) \= (EOS, EOL) &
      mode(node) = working
then seq
                  Send val(node) to receiver(node)
                  mode(node): = receivingLoop

The EOL signal is sent by the process loopOut along the feedback channel when the final condition is reached. Therefore, when EOL arrives, the process loopIn gets mode receiving so it is ready again to receive data from the input stream (see LoopInRule 2).

LoopInRule 2
If pgm(node) = loopIn&
      val(node) = EOL&
      mode(node) = working
then
         mode(node): = receiving

If the data received is an EOS, the process sends it out and finishes (see LoopInRule 3).

LoopInRule 3
If pgm(node) = loopIn&
      val(node) = EOS&
      mode(node) = working
then seq
                  Send val(node) to receiver(node)
                  mode(node): = finished

* The program loopOut.

The process loopOut takes the results produced by p and, in case the final condition has been reached, sends out these results, producing a new item of the output data steam, and simultaneously sends the signal EOL to the process loopIn (see LoopOutRule 1).

LoopOutRule 1
If pgm(node) = loopOut &
      val(node) \= EOS&
      mode(node) = working
then If pgmCond(val(node))
         then seq
                        do in parallel
                                  Send val(node) to receiver(node,0)
                                  Send(EOL) to receiver(node,1)
                      enddo
                            mode(node): = receiving
         elseseq
                            Send val(node) to receiver(node,1)
                            mode(node): = receiving

Otherwise, if the final condition has not been reached, since a new call of p has to occur, the process loopOut sends the results to the other process loopIn over the feedback channel. When it receives EOS, sends it out as a new data of the output data stream and finishes (see LoopOutRule 2).

LoopOutRule 2
If pgm(node) = loopOut&
      val(node)) = EOS&
      mode(node) = working
then seq
                  Send (val(node))   to receiver(node,0)
                  mode(node): = finished

6 Rewriting Rules

As outlined in other works [1,2] on P3L, many rewriting rules on the equivalence between skeleton trees can be established. These rewriting rules are currently applied during the compilation phase to optimize the constructors' tree. In [1,2] only an informal proof based on the skeleton tree transformation is provided.
Starting from our operational model, we have proved (proofs are omitted because of lack of space) the following rewriting rules considered as the most interesting by the P3L designers [1,2]. The automatic proof checking of these properties using PVS is under investigation.

Proposition 1 For each input stream and program p, the following equivalences hold

farm(in,out,p(in,out)) = p(in,out)
farm(in,out,farm(in,out,p(in,out))) = farm(in,out,p(in,out))
map(in,out,p(in,out)) = p(in,out)
map(in,out,map(in,out,p(in,out)) = map(in,out,p(in,out))

These propositions have been proved under the assumptions that emitters and collectors never alter data while sending; collectors correctly reorder and recompose data; there exists a constraint relation among functions partition, composition and projection.

7 Related works and conclusions

In literature a complete formal operational semantics of the P3L language has been never given. All attempts existing are either too much `abstract', so unable to catch the operational behavior of the skeletons, or too much `concrete', thus too close to the implementation level.
    In [13] a functional semantics of the constructs is formalized in the Miranda-like notation, i.e. as definition of the functions computed by the construct on a single input value. Nevertheless, the parallel behavior, that is the kind of parallelism exploited in the implementation of each construct, is informally provided as description of the behavior of logical process graphs.
    In other works (see for ex. [1,8,4,
    The main achievement of this paper is to give a formal operational semantics which is machine-independent, so able to abstract from the management of parallel tasks and communication primitives of any specific architecture. In fact, in our model the semantics of each P3L construct is caught by the operational behavior of its corresponding template which is a pure logical process graph. Moreover, our model has been taken as a mathematical basis to prove the most interesting rewriting rules which are currently used to optimize the actual compilation phase. The only existing attempts to prove such rules are presented in [1] in an informal manner.
    As future work we intend to extend the proposed model to give a complete formal semantics of the recent extensions of P3L [17], to prove other properties of the new compilers and eventually to drive the porting of the parallel framework on different platforms.

Acknowledgments. Special thanks to Egon Börger for his invaluable help and encouragement to start this work.

References

[1]: M. Aldinucci, M. Coppola, ``Optimisation techniques for structured parallel programs''. Graduation thesis (in italian), Dip. di Informatica Univ. di Pisa, 1997.
[2]: M. Aldinucci, M. Coppola, M. Danelutto, ``Rewriting skeleton programs: how to evaluate the data-parallel stream-parallel tradeoff''. In Proc. of CMPP'98. University of Passau TR MIP-9805.
[3]: http://www.eecs.umich.edu/gasm
[4]: B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, M. Vanneschi, ``P3L: A Structured High level programming language and its structured support'', in Concurrency: Practice and Experience, vl.7, 1995, pp.225-255.
[5]: C. Beierle, E. Börger, I. Durdanivich, U. Glässer, E. Riccobene, ``Refining Abstract State Machine Specifications of the Steam Boiler Control to Well Documented Executable Code'', in Formal Methods for Industrial Applications: The Steam-Boiler Case Study LNCS 1165, pp.52-78, 1996.
[6]: E. Boerger, W. Schulte, ``A Modular Design for the Java VM architecture'', in E. Boerger (Ed.): Architecture Design and Validation Methods, Springer, 1998.
[7]: D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian, von T. Eicken, ``LogP: Towards a Realistic Model of Parallel Computation'', Proc. of 4th ACM SIGPLAN, 1993.
[8]: M. Danelutto, ``P3L-PQE2000: Syntax''. Intern. note. Ver. 0.8. Univ. di Pisa, Dip. di Informatica, May 1996.
[9]: M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti, M. Vanneschi ``A methodology for the development and the support of massively parallel programs''. TR-25/91. Ver. 0.8. Univ. di Pisa, Dip. di Informatica, Dec. 1991.
[10]: J. Darlington, A.J. Field, P.G. Harrison, P.H.J. Kelly, D.W.N. Sharp, Q. Wu, R.L. While. ``Parallel Programming Using Skeleton Functions'', Proc. of PARLE93, pp. 146-60. Springer, 1993.
[11]: A. Dold, T. Gaul, V. Vialard, W. Zimmerman, ``ASM-Based Mechanized Verification of Compilter Back-Ends'', Proc. of the 28th Annual Conference of the German Society of Computer Science, Tech. Rep., Magdeburg University, 1998.
[12]: Y. Gurevich, ``Evolving Algebras 1993: Lipari Guide''; in E. Börger (Ed.): Specification and Validation Methods, Oxford University Press (1995).
[13]: S. Pelagatti, ``A methodology for the development and the support of massively parallel programs'', Ph.D. thesis, Dip. di Informatica Univ. di Pisa, 1993.
[14]: S. Pelagatti, ``Structured Development of Parallel Programs''; Taylor & Francis (Ed.), 1997.
[15]: K. Winter, ``Model Checking for Abstract State Machines', in: JUCS, Vol.3, No.5, pp. 689-701, 1997.
[16]: W. Zimmerman, T. Gaul, ``On the Construction of Correct Compiler Back-Ends: An ASM Approach'', in: JUCS, Vol.3, No.5, pp. 504-567, 1997.
[17]: http://www.di.unipi.it/ ~ susanna/p3l.html
[18]: L.G. Valiant, ``A Bridging Model for Parallel Computation'', Comm. of the ACM, 3(8), pp. 103-111, 1990.
[19]

Footnotes:

¹ In the current implementation the language used is C++.

File translated from T_EX by T_TH, version 2.32.
On 13 Jan 2000, 15:20.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and or fee.
SAC 2000 March 19-21 Como, Italy
(c) 2000 ACM 1-58113-239-5/00/003>...>$5.00