Copyright ACM, 2000
A FORMAL MODEL FOR THE PARALLEL SEMANTICS OF P3L
A. Cavarra
Dip. di Matematica e Informatica
Università di Catania
Italy
E. Riccobene
Dip. di Matematica e Informatica
Università di Catania
Italy
A. Zavanella
Dip. di Informatica
Università di Pisa
Italy
Abstract
Some formalizations have been developed in order to
give a complete description of the parallel semantics of P3L, a
programming language which supports both task and data parallelism. However,
the description level provided by the formalisms employed is either too
`abstract' (unable to grasp the operational behavior of the skeletons
of the language) or too `concrete' (close to implementation details
and therefore tied to a particular machine model).
This paper shows how Gurevich's Abstract
State Machines (ASMs) have been successfully employed to describe the
parallel semantics of the P3L skeletons while completely abstracting from
any implementation detail.
The proposed model describes both the operation
of the compiler, which builds a network of processes from
a given program, and the computation of the running processes.
The model has also been taken as a theoretical basis
to prove some of the most interesting rewriting rules used by the compiler
to improve program performance and efficiency.
1 Introduction
P3L (the Pisa Parallel Programming Language) is a parallel
language based on the skeleton idea. A skeleton is a built-in higher-order
function representing a parallel construct which may take sequential
code as a parameter. The sequential part of the program can be written
in a host (sequential) language.
The validity of using formal methods in software development is widely
recognized, especially in defining the semantics of programming
languages. Formal semantics offers many advantages: a precise and unambiguous
meaning; the possibility of applying mathematical reasoning on the formal
model to prove (where available, even with machine support) rewriting
rules and other properties; and the opportunity of using tools for language
development and design, and for automatically generating compilers and interpreters.
Several attempts [1,8,4,13] have been made to formalize the semantics
of P3L. Some of them describe the functional behavior of the constructs
by equations such as

farm(f(x_{1}, x_{2}, ..., x_{n})) = (f(x_{1}), f(x_{2}), ..., f(x_{n}))
i.e. computing f on a stream means computing f on each element of the
stream. Clearly, this notation is unable to capture the ``parallel meaning''
of the construct. On the other hand, approaches as in [1,8,4]
describe the parallel behavior of a P3L construct in ``human language''
and give examples of the corresponding implementations on particular architectures
(Transputers, machines using PVM, etc.).
Indeed, what we need is a formalism powerful enough
to abstract from: (a) the creation of parallel activities, (b)
the communication system, and (c) the effect of the environment during the
compilation process. Moreover, we need a framework rigorous enough to prove
properties of the compiler and, possibly, to drive the porting of the framework
to different platforms.
In this paper we develop a complete operational
semantics of P3L programs using Gurevich's Abstract State Machines
(ASMs) (previously called
evolving algebras [12]).
This formalism describes the operational behavior of real dynamic systems
in terms of transition rules, and it has been applied successfully to specify
real programming languages and architectures, to validate standard language
implementations, to verify real-time protocols, etc. (see the available materials
and the ASM bibliography in [3]).
By using ASMs we are able to tune the ``abstraction level'' and to
express the behavior of P3L independently of any particular implementation.
According to the skeleton-based language approach
[4,10,9],
a skeleton on a specified architecture is implemented by a template,
i.e. a parametric process network strictly depending on the target architecture.
In order to give a machine-independent semantics of the P3L constructs,
we describe their operational behavior in terms of a P3L virtual machine
M based on an abstract architecture A,
which includes an unbounded number of processing nodes interacting by means
of a complete interconnection topology. The implementation template of
a P3L construct in A
is unique and corresponds to a well-defined logical process graph
[13,1].
The semantics of each construct is captured by the computational behavior
of its corresponding (logical) network of processes.
In terms of the operations of the virtual machine M,
we describe (a) the compiler actions that build the logical process graph
from the given program, and (b) the subsequent computation of the process
network, which yields the semantics of the program. Programs for M
are ASM transition rules. We provide transition rules both for the logical
graph construction phase and for the process-network running phase.
In this way we compile ``P3L programs'' into ``ASM programs'' executable
by the P3L virtual machine M.
The advantages are the following. (1) We give the meaning of P3L programs
in terms of ASM programs, which have a precise semantics. (2) We describe
the operational behavior of P3L programs independently of any machine,
abstracting from any choice of specific architecture. (3) We take the formal
model as a mathematical basis to prove some interesting rewriting rules
currently used by the compiler to optimize the source code. All other attempts
[1,2]
to prove such properties are informal. Our proofs are mechanizable through
the use of theorem provers or model checkers, exploiting the existing encodings
of ASMs in PVS [11]
and SMV [15].
(4) By applying the technique of abstract machine model refinement, it is
possible to drive, in a controlled and reliable way, the implementation
of the P3L parallel constructs on a specific target architecture. This
technique of stepwise refinement to executable code has already been successfully
employed in real case studies of software system development [6,16,5].
The paper is organized as follows. Section 2
briefly introduces the reader to Gurevich's Abstract State Machines.
An overview of P3L is given in Section 3. The compiler
operation resulting in the generation of the logical process graph is presented
in Section 4, whereas the computation of the
running processes on the network is described in Section 5.
The most relevant rewriting rules used by the P3L compiler to improve
program performance and efficiency are stated in Section 6,
where we also report proof results. Related work and conclusions are discussed
in Section 7.
2 Gurevich's Abstract State Machines
We quote here only the essential definitions of Abstract
State Machines as given in [12];
we refer to [3]
for more details.
Given a program Prog consisting of a finite
number of transition rules and a (class of) initial state(s)
S_{0}, a
Gurevich's Abstract State Machine M
models the operational behavior of a real dynamic system S
in terms of state transitions. A state is a first-order structure
S_{i} representing the instantaneous configuration of S
at time i.
The basic form of a transition rule is the
function update

f(t_{1}, ..., t_{n}) := t

where f is an arbitrary n-ary function and t_{1}, ...,
t_{n}, t are first-order terms. To fire this rule
at a state S_{i}, i >= 0, evaluate all terms t_{1},
..., t_{n}, t at S_{i} and update the function
f
to t on parameters t_{1}, ..., t_{n}.
This produces another state S_{i+1} which differs from S_{i}
only in the new interpretation of the function f.
There are some rule constructors.
The conditional constructor produces ``guarded'' transition
rules of the form:

if g then R_{1} else R_{2}

where g is a ground term (the guard of the rule) and R_{1},
R_{2}
are transition rules. To fire a guarded rule at a state S_{i},
i >= 0, evaluate the guard; if it is true, then execute
R_{1},
otherwise execute R_{2}. The else
part may be omitted.
The parallel constructor produces the ``parallel''
synchronous rule of the form:

do forall x: x in U R(x)

where R(x) is a transition rule with a variable x ranging
over the universe U. The operational meaning of a parallel rule is
the simultaneous execution of R(x) for every x in U.
In order to specify systems characterized
by sequential operations, we may need a sequential application of rules.
This could easily be achieved by guarded rules with suitable flags, but
since that would result in syntactic heaviness, we make use of the sequential
constructor:

seq R_{1} ... R_{n}

Each rule R_{i}, i = 2, 3, ..., n, is fired only after R_{i-1},
and causes the transition to a new state.
If we need a sequential transition with more than
one rule per step, we can use the block constructor

do in parallel R_{1} ... R_{n} enddo

to apply R_{1}, ..., R_{n} simultaneously
inside a seq constructor.
We also make use of the following constructor to let
universes grow:

extend U with x_{1}, ..., x_{n} with Updates

where Updates may (and should) depend on the x_{i}
and are used to define certain functions for (some of) the new objects
x_{i}
of the resulting universe U.
To avoid the trouble of writing a lengthy term
t several times in a rule, we use the constructor

let x = t in R(x)

The let rule above is equivalent to the rule R(t/x)
obtained by substituting t for x.
In ASMs it is possible to express nondeterminism
using the choose constructor:

choose x in U s.t. cond(x) R(x)

which fires the rule R(x) for an x chosen nondeterministically
in U among those satisfying the condition cond(x).
State transitions of M
may be influenced in two ways:
internally, through the rules of
the program Prog, or externally, through modifications
of the environment. A non-static function f that is not updated
by any transition rule is called a
monitored function, and its values
are given nondeterministically by the environment. Non-static functions
updated by transition rules are called controlled functions.
A computation of S
is modeled through a finite or infinite sequence S_{0}, S_{1},
..., S_{n}, ... of states of M,
where S_{0} is an initial state and each S_{n+1} is obtained
from S_{n} by simultaneously firing all the rules of Prog
at S_{n}.
3 An overview of P3L
P3L (the Pisa Parallel Programming Language)
is a structured language based on skeletons (also called ``constructs''
in P3L terminology). A skeleton is a predefined higher-order
function. Each skeleton represents a parallel paradigm (a form of parallelism),
while the functions executed sequentially are specified in a sequential
programming language, called the host language^{1}.
Let us briefly describe the constructs of P3L as
in [4].
A complete syntax definition can be found in [8].

The seq construct corresponds to a sequential
process that, for each data item coming from an input stream, produces
a new data item of an output stream.

The farm construct models processor farm
parallelism. In this form of parallelism, a set of identical workers
execute in parallel the tasks which come from an input stream, and produce
an output stream of results.

The map construct models data parallel
computations. In this form of parallelism, each data item from
an input stream is decomposed into a set of partitions, which are assigned
to a set of identical parallel workers. The results produced by
the workers are recomposed to make up a new data item of an output stream
of results.

The pipe construct models pipeline parallelism.
In this form of parallelism, a set of stages execute serially over
a stream of input data, producing an output stream of results.

The reduce construct models the reduction
of an array by an associative binary operator. A set of identical
workers
execute in parallel the computation of the binary operation between two
elements of the array. Observe that the only reducible structures are arrays.

The loop construct models computations where,
for each input data item, a loop body has to be iteratively executed,
until a given condition is reached and an output data item is produced.
The P3L skeletons can be hierarchically composed in order to construct
complex forms of parallelism; this property has important consequences
for the semantics of the constructs. Each instance of a construct, in fact,
behaves as a functional module Mod elaborating data coming from
the input channel (input stream) and putting results into the output channel
(output stream). The order of tasks in a stream has to be preserved, i.e.

[in_{1}, in_{2}, ..., in_{n}] -> Mod -> [f(in_{1}), f(in_{2}), ..., f(in_{n})]

where f is the function computed by the module Mod.
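This functional, order-preserving view of a module can be sketched in Python (a toy model, not the P3L implementation; all names are ours):

```python
def module(f):
    """A construct instance Mod computing f: maps the input stream
    [in_1, ..., in_n] to [f(in_1), ..., f(in_n)], preserving order."""
    return lambda stream: [f(x) for x in stream]

def pipe(*stages):
    """Hierarchical composition: feed one module's output stream
    into the next, as in a multi-stage pipe."""
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run

double = module(lambda x: 2 * x)
inc = module(lambda x: x + 1)
```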
A P3L program is represented by a nested structure
of constructs named the
constructs' tree. All the leaves of the tree
are sequential fragments embedded into the sequential construct
seq.
The P3L methodology includes a compiler able
to analyze the tree and decide on a strategy to optimize the program. The
compiler exploits two different libraries, the template library and the
rewriting-rule library, to drive the optimization. The template library
includes the scheme to implement each of the parallel constructs of the
language. The scheme also provides a set of analytical formulae to predict
the behavior of the template. The rewriting-rule library contains a set
of rules which leave the semantics of the program unaltered. The compiler
uses these rules and the performance formulae of the template library to
obtain better performance or better efficiency. The compiler also takes
implementation decisions, such as the parallelism degree of the program
and the grain of communications, by using a view of the execution
costs on the target machine.
4 Graph Construction Phase
In this section we describe the compiler operation
of transforming each P3L construct into the corresponding logical process
graph representing its implementation template on the abstract architecture
A.
Therefore, we define the transition system formalizing the logical graph
construction phase: starting from a program P, a directed graph
representing the whole process network is built.
Notice that we suppose the input program to be syntactically
correct and optimized, in the sense that no rewriting rules are applicable.
4.1 Signature
The directed graph representation leads us to define a universe NODE
of graph nodes. Each node may have more than one incoming and outgoing
edge. The functions fanin, fanout : NODE -> NAT, where NAT is
the set of natural numbers, yield respectively the number of incoming
and outgoing edges of a node. We establish a numbering mechanism uniquely
labeling each incoming/outgoing edge of a node m by a number
i
in 1 ... fanin(m)/fanout(m). Given a node m and an outgoing
edge i, to detect which node is connected to m by the edge
i,
we introduce the function

receiver : NODE × NAT -> NODE

such that receiver(m,i) = n. We use the abbreviation
receiver(m) = receiver(m,1).
To represent P3L programs and I/O communication
channels, we introduce the sets PROGRAM and CHANNEL.
Elements of PROGRAM are nested structures of the form:

pipe(in, out, p_{1}(in,out), ..., p_{k}(in,out))
farm(in, out, p(in,out))
map(in, out, p(in,out))
loop(in, out, feedback, p(in,out), condition)
reduce(in, out, p(in,out))

where in, out, feedback in CHANNEL,
and p, p_{1}, ..., p_{k}
in PROGRAM. (In the sequel we omit the arguments
in and out of a program p.)
According to the P3L syntax, we assume as a constraint that the program p
within a reduce construct contains an associative
operator, and that a pipe construct must
contain more than one stage.
We start from the initial configuration depicted
in Fig. 1, where P is the input
program. Since we build the process network dynamically, at any stage of
this phase we expand the network by replacing the current node with a new
network fragment according to the structure of the program associated with
that node. A program fragment is associated with a node by the function
pgm : NODE -> PROGRAM. We keep track of the current node by the 0-ary
function compt : NODE yielding the current compilation point.
The function succ : NODE -> NODE returns the next
compilation point.
Figure 1: Starting configuration of the logical graph
4.2 Transition Rules
4.2.1 The construct pipe
If the program of compt is a pipe statement of the form
pipe(in,out,p_{1},...,p_{k}),
a network performing the computation of the k stages in pipeline must be
created (see Fig. 2). By the PipeConstrRule the
node compt is replaced by a chain of k nodes (k-1 new ones, plus compt
redefined as the head of the chain); each node of the chain takes a stage
p_{i}
as its program. The pipe stages p_{1}, ..., p_{k} will
be expanded in turn (accessed via the function succ), starting from
the program p_{1} of compt.
Figure 2: Logical graph for pipe
PipeConstrRule
if pgm(compt) = pipe(in,out,p_{1},...,p_{k})
then let n_{1} = compt
extend NODE with n_{2}, ..., n_{k}
pgm(n_{l}) := p_{l}
receiver(n_{k}) := receiver(compt)
receiver(n_{i}) := n_{i+1}
succ(n_{i}) := n_{i+1}
succ(n_{k}) := succ(compt)
where 1 <= i < k, 1 <= l <= k
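The effect of the PipeConstrRule can be mimicked with an ordinary dictionary-based graph (a sketch under our own encoding; node and field names are hypothetical):

```python
def pipe_constr(graph, compt, stages):
    """Replace compt by a chain of k nodes: k-1 fresh nodes are created,
    compt becomes the head, and each node takes one stage as its program."""
    old = dict(graph[compt])                 # remember receiver/succ of compt
    k = len(stages)
    chain = [compt] + ["%s_n%d" % (compt, i) for i in range(2, k + 1)]
    for node, stage in zip(chain, stages):
        graph[node] = {"pgm": stage}
    for a, b in zip(chain, chain[1:]):       # receiver(n_i) := n_{i+1}
        graph[a]["receiver"] = b             # succ(n_i)     := n_{i+1}
        graph[a]["succ"] = b
    graph[chain[-1]]["receiver"] = old["receiver"]   # n_k inherits compt's
    graph[chain[-1]]["succ"] = old["succ"]           # receiver and successor
    return chain[0]                          # expansion resumes at the head

g = {"c": {"pgm": ("pipe", "p1", "p2", "p3"), "receiver": "out", "succ": None}}
head = pipe_constr(g, "c", ["p1", "p2", "p3"])
```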
4.2.2 The parallel constructs farm and map
If the current program is a farm statement of the form farm(in,out,p),
the node compt must be replaced by a network consisting of an emitter
e (a redefinition of compt), a collector c, and
m workers n_{1}, ..., n_{m}.
The value m is the parallel degree of the (current) farm statement, and
it is computed by the function parDegree : PROGRAM -> NAT that,
given a program, returns the number of processes that have to be created
to compute it in the most efficient way. By means of this function we leave
the parallel-degree computation abstract. However, it would be possible to
refine the definition of parDegree by taking into consideration
computational models as in [18,7].
The newly created nodes are connected
as depicted in Fig. 3 (read par
as farm) by the function
receiver. The nodes e and c take the programs farmEmit
and farmCollect, respectively,
since they will perform the emitter and collector operations for a farm
(see the next section); the workers' programs are defined as the farm body
p,
and they will be expanded in turn, accessed via the function succ.
The expansion proceeds on the first node n_{1}.
The construction of the graph structure
for a map statement of the form map(in,out,p)
is similar to the one for the farm construct, except for the programs
mapEmit
and mapCollect assigned
to the emitter and collector nodes (see Fig. 3, reading
par as map). These differences
are outlined in the rule by enclosing the map-specific statements
in square brackets.
Figure 3: Logical graph for farm[map]
FarmConstrRule [MapConstrRule]
if pgm(compt) = farm(in,out,p) [map(in,out,p)]
then let m = parDegree(farm(in,out,p)) [parDegree(map(in,out,p))]
e = compt
extend NODE with n_{1}, ..., n_{m}, c
pgm(n_{l}) := p
pgm(e) := farmEmit [mapEmit(in,out,p)]
pgm(c) := farmCollect [mapCollect(in,out,p)]
receiver(e,l) := n_{l}
receiver(c) := receiver(compt)
receiver(n_{l}) := c
if m > 1 then succ(n_{i}) := n_{i+1}
succ(n_{m}) := succ(compt)
fanout(e) := m
fanin(c) := m
compt := n_{1}
where 1 <= i < m, 1 <= l <= m
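The wiring performed by the FarmConstrRule can be sketched in the same dictionary-based encoding used above (our own names; parDegree is passed in as a plain number):

```python
def farm_constr(graph, compt, body, m):
    """Replace compt with an emitter (compt redefined), m workers, and a
    collector, wired as in the farm template; m plays parDegree's role."""
    old = dict(graph[compt])
    workers = ["%s_w%d" % (compt, i) for i in range(1, m + 1)]
    collector = "%s_c" % compt
    graph[compt] = {"pgm": "farmEmit", "fanout": m,
                    "receiver": {i: w for i, w in enumerate(workers, 1)}}
    for w in workers:
        graph[w] = {"pgm": body, "receiver": collector}
    for a, b in zip(workers, workers[1:]):   # succ chains the workers so the
        graph[a]["succ"] = b                 # compiler can expand them in turn
    graph[workers[-1]]["succ"] = old["succ"]
    graph[collector] = {"pgm": "farmCollect", "fanin": m,
                        "receiver": old["receiver"]}
    return workers[0]                        # new compilation point: n_1

g = {"c": {"pgm": ("farm", "p"), "receiver": "out", "succ": None}}
compt = farm_constr(g, "c", "p", 3)
```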
4.2.3 The construct loop
The loop statement loop(in,out,feedback,p,condition)
produces a network as in Fig. 4. The node compt
(now renamed li) takes the program loopIn,
becoming the input process of the loop instruction, and is connected to
a new node n. This new node receives the program p
specified in the body of the loop and is connected to the node lo. lo
gets the program loopOut
and therefore works as the output process of the loop; it will make use
of the information yielded by the function pgmcond to control the
exit condition of the loop. The nodes lo and
li are in turn connected by the feedback channel. The construction
proceeds on the node n containing the body of the loop.
Figure 4: Logical graph for loop
LoopConstrRule
if pgm(compt) = loop(in,out,feedback,p,condition)
then let li = compt
extend NODE with n, lo
pgm(li) := loopIn
pgm(lo) := loopOut
pgmcond(lo) := condition
pgm(n) := p
receiver(li) := n
receiver(n) := lo
receiver(lo,1) := receiver(compt)
receiver(lo,2) := li
succ(n) := succ(compt)
compt := n
4.2.4 The construct reduce
The program p,
argument of an instruction reduce(in,out,p),
is defined in terms of an associative binary operation. The logical graph
associated with reduce
is a balanced binary tree. Therefore, we expand the graph by replacing
the current node compt with a binary tree whose leaves are all connected
to an emitter node e through the function
receiver (see Fig. 5).
Let s = size(in)/2.
To build a balanced structure, we define a function redSucc : NAT -> NAT
associating to an index i the index of the unique node that combines
n_{i}'s result in the (static) tree structure:

redSucc(i) = ceil(i/2) + s if odd(i);
redSucc(i) = i/2 + s if even(i).

Each tree node takes red(p)
as its program, to indicate that p
is a sequential program inside a reduce
statement. The node e gets the program redEmit.
The compilation carries on through the successor of the compilation point.
Figure 5: Logical graph for reduce
ReduceConstrRule
if pgm(compt) = reduce(in,out,p)
then let k = size(in) - 1
s = size(in)/2
e = compt
extend NODE with n_{1}, ..., n_{k}
pgm(n_{l}) := red(p)
pgm(e) := redEmit
receiver(e,i) := n_{i}
receiver(n_{j}) := n_{redSucc(j)}
receiver(n_{k}) := receiver(compt)
fanout(e) := s
compt := succ(compt)
where 1 <= i <= s, 1 <= j < k, 1 <= l <= k
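The tree shape induced by redSucc can be checked concretely (a sketch; we take size(in) = 8, so s = 4 and the tree has size(in) - 1 = 7 nodes):

```python
import math

def red_succ(i, s):
    """Index of the node receiving n_i's result in the reduction tree:
    ceil(i/2) + s if i is odd, i/2 + s if i is even."""
    return math.ceil(i / 2) + s if i % 2 == 1 else i // 2 + s

# For size(in) = 8 (s = 4): nodes 1..4 are the leaves fed by the emitter;
# pairs (1,2) and (3,4) feed nodes 5 and 6, which feed the root node 7.
s = 4
parents = [red_succ(i, s) for i in range(1, 7)]
```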
4.2.5 The construct seq
A sequential statement seq(p)
simply involves updating the compilation point to its successor (Fig. 6).
Figure 6: Logical graph for seq
SeqConstrRule
if pgm(compt) = seq(p)
then compt := succ(compt)
5 Running Phase
We assume that an active process is associated
with each node of the network obtained after the logical graph construction
phase. Therefore, running a given program P corresponds to
executing the resulting network of processes. In the sequel we do
not make any distinction between nodes and processes.
5.1 Signature Refinement
Among all the nodes of the network, we can distinguish the
root
as the unique element of NODE which is not the receiver of any other
node; more formally:

root in NODE /\ (forall n in NODE, forall i in NAT: receiver(n,i) \= root).
Each node of the logical process graph
has, as attached information, a program in the set {mapEmit,
mapCollect, farmEmit, farmCollect, seq(q), loopIn, loopOut, redEmit, red(q)},
where q is an external
sequential program code. Each active process computes its program
and sends the output to the receiver node. A node can have mode:

receiving, if the process is waiting for data
from its sender process;

working, when the process is computing its program;

finished, when the process is no longer active. After
receiving the signal EOS
(End Of Stream) and sending it out, a process has definitively concluded
its own computation and might be killed. (Note that we do not describe
here the elimination of no-longer-active processes, because this
concerns an actual implementation.)
All the nodes apart from the root have initial mode
receiving,
whereas
root has initial mode working. Note that for a loopIn
node we need to refine the mode receiving into receivingLoopRes;
the reason for this is explained below.
The function

val : NODE -> DATA

represents the local ``memory'' of a process and is updated any time a sending
action is performed by another process (see the macro Send below). It
is initially undefined for all nodes apart from the root, where
it holds
IN(P),
the input of P.
All processes work upon data exchanged
through the communication channels accessible via the function receiver.
The sending and receiving operations are deliberately left abstract here
because we are not interested in modeling the message-passing mechanism,
which can be considered part of the implementation. Therefore, the communication
among processes is abstractly formalized through the macros Send
and Receive.
Firing the macro ``Send val(m) to n'',
m, n in NODE, updates val(n)
to val(m). The actual update happens if and when n is
in mode receiving and has executed the macro
Receive. Otherwise,
by the blocking effect of the macro Send, the computation of the
node m suspends and restarts as soon as this condition holds.
If a node n executes the macro
Receive, then val(n) is updated to x, provided
there is a node
m, with
n = receiver(m,i) for some i,
that has executed the macro ``Send x to n''. Otherwise,
the execution of n suspends (by the blocking effect of the macro
Receive),
to restart as soon as there is a working node ready to send n
the expected value.
Note that for a loopIn
node we make use of the macro ReceiveFromLoop, a variant
of the macro Receive, as explained below.
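The blocking behavior of Send and Receive can be approximated by a rendezvous over a one-slot queue per node (a rough sketch of the abstract macros, not the actual mechanism, which the model deliberately leaves open):

```python
import queue
import threading

# One single-slot mailbox per node plays the role of the pending update of
# val(n): put() blocks while the slot is occupied and get() blocks while it
# is empty, approximating the mutual suspension of Send and Receive.
mailbox = {"n": queue.Queue(maxsize=1)}

def send(value, receiver):      # ``Send value to receiver''
    mailbox[receiver].put(value)

def receive(node):              # node executes the macro Receive
    return mailbox[node].get()

sender = threading.Thread(target=send, args=(7, "n"))
sender.start()
val_n = receive("n")            # val(n) is updated to the sent value
sender.join()
```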
To describe the operation of the map and farm
collectors, we need a counter of the worker processes that have
finished their own computation. This leads to the function
count : NODE -> NAT, initialized to 0 for all nodes n such that
pgm(n) in {mapCollect, farmCollect}.
5.2 Transition Rules
We now describe the transition rules for the running phase.
They are split into two groups: the Receiving Rules, describing the
reaction of a process upon receiving data, and the
Working Rules, formalizing
the operation of a working process according to the program it computes.
5.2.1 Receiving Rules
They have the effect of updating val(node) when node
is in mode receiving. Among all the nodes of the network, we need to
distinguish between those having program loopIn
and all the others (since a loopIn node
can receive data from two channels). All nodes in mode receiving
sequentially receive data and change mode to
working (see Receiving Rule 1).
Receiving Rule 1
If mode(node) = receiving
then seq
Receive
mode(node) := working

Receiving Rule 2
If pgm(node) = loopIn &
mode(node) = receivingLoopRes
then seq
ReceiveFromLoop
mode(node) := working
A node n which has program loopIn
can receive values in two ways: from the outside ``input'' node, i.e. the
node m such that receiver(m,1) = n, and from
the related loopOut node,
i.e. the node m such that receiver(m,2) = n.
According to the semantics of the
loop
instruction, loopIn
must give priority to the value received from
loopOut.
In order to manage this situation, we refine the function
mode
on a loopIn node, which
gets mode receiving if it is ready to receive values from outside
(as any other node of the network) and mode receivingLoopRes when
it is waiting for the result of the loop computation from its related
loopOut node.
Receiving Rule 2 describes
the dynamics of a loopIn node
in mode
receivingLoopRes. The macro ReceiveFromLoop updates
the value of val(node) just like the macro Receive,
but the condition receiver(m,2) = node & pgm(m) = loopOut
must also be satisfied, where m is the sending node. This guarantees
that the loopIn node gets
the result of the loop computation sent by the loopOut node.
5.2.2 Working Rules
In this section we describe the dynamics of a process in
mode working.
The construct farm
A farm construct
allows the parallel execution of tasks which come from an input stream
and produces an output stream of results. The logical structure of a farm
is shown in Fig. 3: two processes, the emitter (e)
and the collector (c), perform the distribution of the data and
the collection of the results, respectively. They are connected to a set
of workers (n_{i}) which are instances of the module
corresponding to the construct p
nested inside the farm. The
emitter's and collector's operations are formalized by the programs
farmEmit
and farmCollect, respectively.
* The program farmEmit.
The items of the input data stream (on which the tasks
have to be computed) arrive at the emitter process, which must distribute
them to the workers. Therefore, if the received value is not the EOS
which marks the end of the input stream, the process nondeterministically
selects one of the ready sons, if any, sends it the received item, and
changes mode, being now ready to receive the subsequent item of the input
stream (see FarmEmitRule 1).
The set readySons(node)
= {n in NODE | mode(n) = receiving & exists i: receiver(node,i) = n}
contains those receiver nodes which are ready to receive and
compute a new item.
When it receives the EOS,
the emitter broadcasts EOS
to all its workers and changes mode to finished, having ended
its own activity (see FarmEmitRule 2).
FarmEmitRule 1
If pgm(node) = farmEmit &
mode(node) = working &
val(node) \= EOS &
readySons(node) \= {}
then choose n in readySons(node)
seq
Send val(node) to n
mode(node) := receiving

FarmEmitRule 2
If pgm(node) = farmEmit &
mode(node) = working &
val(node) = EOS
then seq
Broadcast(val(node))
mode(node) := finished
The following macro performs the parallel sending of the
same signal to all receivers:
Broadcast(val(n)) = do forall i in [1..fanout(n)]
Send val(n) to receiver(n,i)
* The program farmCollect.
The results of the workers' computation are received
by the collector. Task results have to be sent out whenever available, and
in the same order as the corresponding data in the input stream. To perform
this reordering, the collector has a buffer in which to temporarily
save the received results. The collector receives each item together with
a tag representing the item's position in the output stream; each
item is saved into the buffer, by the function insert(pos,buffer,value),
at the tag position. The function lastsent, initialized
to 0, yields the tag of the last item sent. Thus, at any time, all consecutive
items with tags starting from lastsent + 1 can be sent out. The
function lengthReadySequence yields the length of the longest sequence
of consecutive items in the buffer which starts at lastsent + 1.
If the collector receives an item (which is not the EOS) with
tag greater than lastsent + 1, it inserts the item into the buffer
and gets mode receiving (see
FarmCollectRule 1). Otherwise, it first
updates
lengthReadySequence, sends out all the items in the range
between lastsent + 1 and lastsent + lengthReadySequence,
updates lastsent, and then changes mode, ready to receive new items.
FarmCollectRule 1
If pgm(node) = farmCollect &
mode(node) = working &
val(node) \= EOS
then If tag(val(node)) > lastsent + 1
then
buffer(node) := insert(tag(val(node)), buffer(node), val(node))
mode(node) := receiving
else seq
update(lengthReadySequence)
do in parallel
SendReadySequence(lengthReadySequence)
lastsent := lastsent + lengthReadySequence
enddo
mode(node) := receiving
Note that, to simplify the notation, we have omitted the
argument
buffer(node) in the functions lastsent and lengthReadySequence.
The macro update(lengthReadySequence)
sets the variable lengthReadySequence to the length
of the longest sequence of results contained in the buffer
and starting at lastsent + 1.
The macro SendReadySequence(lengthReadySequence)
stands for the following update:

do forall i in [lastsent+1 .. lastsent+lengthReadySequence]
Send buffer(node)[i] to receiver(node)
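The interplay of buffer, lastsent, and lengthReadySequence can be sketched as a small reorder buffer (our own encoding; tags are dictionary keys):

```python
def collect(buffer, lastsent, tag, value):
    """Insert a tagged result, then emit the longest run of consecutive
    items starting at lastsent + 1 (i.e. the ready sequence)."""
    buffer[tag] = value
    ready = []
    while lastsent + 1 in buffer:       # computes lengthReadySequence
        lastsent += 1                   # while sending the items out
        ready.append(buffer.pop(lastsent))
    return lastsent, ready

buf, last = {}, 0
last, out1 = collect(buf, last, 2, "r2")   # tag 2 > lastsent+1: buffered only
last, out2 = collect(buf, last, 1, "r1")   # tag 1 completes the run 1, 2
```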
If the received data is EOS
(see FarmCollectRule 2), the collector cannot simply propagate
the
EOS and finish
because, before ending, it has to be sure that all its workers have finished,
i.e. the collector must have received
EOS
from all workers. Therefore, upon receiving EOS,
the collector increments a counter until this counter reaches the value
fanin(node) - 1.
When that happens, the collector is sure to have received the last
EOS;
it then sends out
EOS and
finishes.
FarmCollectRule 2
If pgm(node) = farmCollect &
mode(node) = working &
val(node) = EOS
then If count(node) = fanin(node) - 1
then seq
Send val(node) to receiver(node)
mode(node) := finished
else
count(node) := count(node) + 1
mode(node) := receiving
The construct map
A map construct
provides another kind of parallelism, which consists of decomposing a data
item from an input stream into a set of partitions, assigned for computation
to identical workers operating in parallel. The results produced by the
workers are recomposed to make up a new data item of an output stream of
results.
The logical structure of a map
is shown in Fig. 3: two processes, the emitter (e)
and the collector (c), perform the decomposition of the data and
the composition of the results, respectively. They are connected to a set
of workers (n_{i}), each of which computes the task
p
on a single data partition.
The decomposition and the subsequent
composition of data lead us to introduce the functions partition and
composition; as an integrity constraint, we assume composition
to be the inverse function of partition modulo the application of
the program p, i.e.
composition(p(partition(d))) = p(d), for all d in DATA.
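A toy Python illustration of this integrity constraint (our own example, not from the paper): splitting an input, mapping p over the partitions, and recomposing must agree with applying p to the whole data item. The concrete `partition`, `composition` and `p` below are hypothetical instances chosen to satisfy the constraint.

```python
def partition(d, n):
    """Split list d into n contiguous chunks (len(d) divisible by n)."""
    size = len(d) // n
    return [d[i * size:(i + 1) * size] for i in range(n)]

def composition(parts):
    """Inverse of partition: concatenate the chunks back into one list."""
    return [x for part in parts for x in part]

def p(d):
    # an element-wise worker program; any map-compatible p would do
    return [x * x for x in d]

# the constraint: composition(p(partition(d))) = p(d)
d = list(range(8))
assert composition([p(part) for part in partition(d, 4)]) == p(d)
```

Note that the constraint holds here precisely because p acts element-wise, which is what makes it suitable as the worker program of a map.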
The emitter's and collector's operations
are formalized by the programs
mapEmit
and mapCollect, respectively.
* The program mapEmit.
When the emitter receives a data item which is not EOS
(see MapEmitRule 1), it decomposes the data, by the function partition,
into a number of partitions equal to the number of its associated workers,
and then sends each partition to the corresponding worker. The mode
changes to receiving. Upon receiving EOS
(see MapEmitRule 2), the emitter broadcasts EOS
to all its workers and changes mode to finished, thus stopping
its own activity.
MapEmitRule 1
If pgm(node) = mapEmit(in,out,p) &
mode(node) = working &
val(node) != EOS
then seq
  do forall i in [1..fanout(node)]
    Send proj(i, partition(p,in,out,val(node),fanout(node)))
      to receiver(node,i)
  mode(node) := receiving
MapEmitRule 2
If pgm(node) = mapEmit(in,out,p) &
mode(node) = working &
val(node) = EOS
then seq
  Broadcast(val(node))
  mode(node) := finished

* The program mapCollect.
Collectors for map
and farm work in a similar
way. The main difference is that the map collector sends out the result
of a map computation
only after composing all the results received from the workers. The collector
saves into its buffer all the significant results received from the workers
(i.e. those different from EOS). The buffer size must be initialized
to zero (see MapCollectRule 1).
MapCollectRule 1
If pgm(node) = mapCollect(in,out,p) &
mode(node) = working &
val(node) != EOS
then
  buffer(node) := insert(buffer(node), val(node))
  mode(node) := receiving
The collector might receive EOS
from a certain worker before having received all the workers' results (see
MapCollectRule 2). Therefore, it must count the received EOS signals
(by incrementing a suitable counter, as described for the farm)
and, each time an EOS arrives, it must check whether the buffer is full
(i.e. the buffer size equals the number of workers), in which case it sends
out the final result obtained by composing the buffered data. When the last EOS
arrives, the collector sends it out and finishes.
MapCollectRule 2
If pgm(node) = mapCollect(in,out,p) &
mode(node) = working &
val(node) = EOS
then If count(node) = fanin(node) - 1
  then seq
    Send val(node) to receiver(node)
    mode(node) := finished
  else
    count(node) := count(node) + 1
    if size(buffer(node)) = fanin(node)
      then size(buffer(node)) := 0
        seq
          Send composition(p,in,out,buffer(node)) to receiver(node)
          mode(node) := receiving
      else mode(node) := receiving
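The collector's buffering and composition can be sketched in a few lines of Python (our own simplification, not the paper's ASM model: here the collector composes as soon as the buffer is full, rather than checking fullness only when an EOS arrives). `map_collect`, the flat `incoming` message list, and the string `"EOS"` marker are all illustrative assumptions.

```python
EOS = "EOS"  # distinguished end-of-stream signal

def map_collect(incoming, fanin, compose):
    """Process the collector's incoming message sequence: buffer non-EOS
    results, emit compose(buffer) once the buffer holds fanin results,
    and forward the final EOS after fanin EOS signals have arrived."""
    buffer, count, output = [], 0, []
    for v in incoming:
        if v != EOS:
            buffer.append(v)
            if len(buffer) == fanin:       # buffer full: compose and send
                output.append(compose(buffer))
                buffer = []                # reset buffer size to zero
        else:
            count += 1
            if count == fanin:             # last EOS: propagate and finish
                output.append(EOS)
                break
    return output
```

With two workers and `sum` as composition, `map_collect([1, 2, EOS, 3, 4, EOS], 2, sum)` composes each pair of partial results and then forwards the final EOS.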
The program seq
A process which has to compute a sequential program, upon
receiving a data item different from EOS,
sends out the result of computing the associated program on the received
input (see SeqRule 1). Otherwise, if it receives EOS,
it changes mode to finished after sending out, in turn, EOS
(see SeqRule 2).
SeqRule 1
If pgm(node) = seq(p) &
val(node) != EOS &
mode(node) = working
then seq
  Send p(val(node)) to receiver(node)
  mode(node) := receiving
SeqRule 2
If pgm(node) = seq(p) &
val(node) = EOS &
mode(node) = working
then seq
  Send val(node) to receiver(node)
  mode(node) := finished
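The two rules together describe a simple stream protocol, which can be sketched as a Python generator (our own rendering; the name `seq_process` and the string `"EOS"` marker are illustrative assumptions):

```python
EOS = "EOS"  # distinguished end-of-stream signal

def seq_process(p, incoming):
    """Apply the sequential program p to every item until EOS arrives,
    then forward EOS itself, mirroring SeqRule 1 and SeqRule 2."""
    for v in incoming:
        if v == EOS:
            yield EOS      # SeqRule 2: propagate EOS and finish
            return
        yield p(v)         # SeqRule 1: send p(val(node)) downstream
```

For example, `list(seq_process(lambda x: x + 1, [1, 2, EOS]))` yields `[2, 3, EOS]`: every item is transformed and the terminator is passed through unchanged.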
The program redEmit
The instruction reduce
also involves an emitter process to perform the data distribution
to the associated workers. The received inputs are arrays of primitive
data whose size is double the number of workers.
When the emitter receives an input
which is not EOS (see
RedEmitRule 1), it sends to each worker a pair of data items on which
the associative (binary) operation is to be computed. The i-th
worker receives the i-th pair, consisting of the (2i-1)-th
and 2i-th components of the array. The macro SendPair formalizes the
distribution of a pair, sending each component sequentially. When it receives
EOS, the emitter broadcasts EOS
to all its workers and changes mode to finished (see RedEmitRule 2).
RedEmitRule 1
If pgm(node) = redEmit &
mode(node) = working &
val(node) != EOS
then seq
  do forall i in [1..fanout(node)]
    SendPair i_th(val(node)) to receiver(node,i)
  mode(node) := receiving
where SendPair (a,b) to N == seq
  Send a to N
  Send b to N
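The pairwise indexing can be made explicit in a short Python sketch (our own; `red_emit` and the dict-as-channels encoding are illustrative assumptions). Worker i, counting from 1, receives the (2i-1)-th and 2i-th components, which in 0-based indexing are positions 2i-2 and 2i-1:

```python
def red_emit(array, fanout):
    """Distribute the input array pairwise: worker i (1-based) receives
    the (2i-1)-th and 2i-th components, so len(array) must be 2*fanout."""
    assert len(array) == 2 * fanout
    # SendPair i_th(val(node)) to receiver(node, i), for each worker i
    return {i: (array[2 * i - 2], array[2 * i - 1])
            for i in range(1, fanout + 1)}

# e.g. with fanout 2: worker 1 gets (10, 20), worker 2 gets (30, 40)
assert red_emit([10, 20, 30, 40], 2) == {1: (10, 20), 2: (30, 40)}
```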


RedEmitRule 2
If pgm(node) = redEmit &
mode(node) = working &
val(node) = EOS
then seq
  Broadcast(val(node))
  mode(node) := finished




The program red
A worker computing the program within a reduce
instruction works as a usual sequential process. However, we need to distinguish
the two categories, because a worker of reduce
must wait for two inputs before computing. We make use of the function
auxVal on NODE to save the first input received.
RedRule 1
If pgm(node) = red(p) &
val(node) != EOS &
mode(node) = working
then
  If auxVal(node) != undef
    then seq
      Send p(val(node), auxVal(node)) to receiver(node)
      mode(node) := receiving
      auxVal(node) := undef
    else
      auxVal(node) := val(node)
      mode(node) := receiving
RedRule 2
If pgm(node) = red(p) &
val(node) = EOS &
mode(node) = working
then seq
  Send val(node) to receiver(node)
  mode(node) := finished
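A reduce worker's two-phase behavior can be sketched as a Python generator (our own rendering; `red_worker`, the `None`-for-undef encoding, and the string `"EOS"` marker are illustrative assumptions):

```python
EOS = "EOS"  # distinguished end-of-stream signal

def red_worker(p, incoming):
    """A reduce worker: save the first input in aux (playing the role of
    auxVal(node)), then on the second input send p(second, first), as in
    RedRule 1; forward EOS and stop, as in RedRule 2."""
    aux = None                 # auxVal(node) = undef
    for v in incoming:
        if v == EOS:
            yield EOS
            return
        if aux is None:
            aux = v            # first component of the pair: just save it
        else:
            # RedRule 1 sends p(val(node), auxVal(node)); for an
            # associative, commutative p the argument order is immaterial
            yield p(v, aux)
            aux = None
```

For example, with addition as p, the input pairs (10, 20) and (30, 40) produce the partial results 30 and 70 followed by EOS.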


The construct loop
The logical structure of a loop
is shown in Fig.4: two processes loopIn (li)
and loopOut (lo) perform the iterated calls of the nested construct
p
inside the loop.
Their operation modes are formalized by the programs loopIn
and loopOut.
* The program loopIn.
The process computing p
can take its input data items either from the input stream or from the
stream of results produced by previous calls of itself, if the final
loop condition has not yet been reached. The process loopIn
discriminates between the input and the feedback streams. Upon receiving
a data item different from EOS
and EOL (``End Of Loop'', see below), the process loopIn
sends the data to the connected worker and changes mode to receivingLoop,
thus temporarily suspending the reception of data from the input stream
(see LoopInRule 1).
LoopInRule 1
If pgm(node) = loopIn &
val(node) != EOS & val(node) != EOL &
mode(node) = working
then seq
  Send val(node) to receiver(node)
  mode(node) := receivingLoop
The EOL
signal is sent by the process loopOut
along the feedback channel when the final condition is reached. Therefore,
when EOL arrives, the
process loopIn switches its mode to receiving, so that it is again ready
to receive data from the input stream (see LoopInRule 2).
LoopInRule 2
If pgm(node) = loopIn &
val(node) = EOL &
mode(node) = working
then
  mode(node) := receiving
If the data received is an EOS,
the process sends it out and finishes (see
LoopInRule 3).
LoopInRule 3
If pgm(node) = loopIn &
val(node) = EOS &
mode(node) = working
then seq
  Send val(node) to receiver(node)
  mode(node) := finished
* The program loopOut.
The process loopOut
takes the results produced by p
and, in case the final condition has been reached, sends them out,
producing a new item of the output data stream, and simultaneously sends
the signal EOL to the process loopIn (see LoopOutRule 1).
LoopOutRule 1
If pgm(node) = loopOut &
val(node) != EOS &
mode(node) = working
then If pgmCond(val(node))
  then seq
    do in parallel
      Send val(node) to receiver(node,0)
      Send(EOL) to receiver(node,1)
    enddo
    mode(node) := receiving
  else seq
    Send val(node) to receiver(node,1)
    mode(node) := receiving
Otherwise, if the final condition has not been reached, a new call of p
has to occur, so the process loopOut
sends the results to the process loopIn
over the feedback channel. When it receives EOS,
it sends it out as a new item of the output data stream and finishes (see
LoopOutRule 2).

LoopOutRule 2
If pgm(node) = loopOut &
val(node) = EOS &
mode(node) = working
then seq
  Send val(node) to receiver(node,0)
  mode(node) := finished

6 Rewriting Rules
As outlined in other works [1,2]
on P3L, many rewriting rules establishing the equivalence between skeleton trees
can be stated. These rewriting rules are currently applied during
the compilation phase to optimize the construct tree. In [1,2]
only an informal proof, based on the skeleton tree transformation, is provided.
Starting from our operational model,
we have proved (proofs are omitted for lack of space) the following
rewriting rules, considered the most interesting by the P3L designers [1,2].
The automatic checking of these proofs using PVS is under investigation.
Proposition 1 For each input stream and program
p,
the following equivalences hold

farm(in,out,p(in,out)) = p(in,out)

farm(in,out,farm(in,out,p(in,out))) = farm(in,out,p(in,out))

map(in,out,p(in,out)) = p(in,out)

map(in,out,map(in,out,p(in,out))) = map(in,out,p(in,out))
These propositions have been proved under the assumptions
that emitters and collectors never alter data while sending, that collectors
correctly reorder and recompose data, and that a constraint relation holds
among the functions partition, composition and projection.
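As a sanity check (ours, not a proof) of the first equivalence, a farm can be simulated as round-robin distribution of tagged items, per-worker application of p, and order-restoring collection; under the assumptions above its input/output behavior coincides with that of p. The function `farm` and its encoding are illustrative assumptions.

```python
def farm(p, stream, n_workers=3):
    """Simulate a farm: the emitter tags items and distributes them
    round-robin, workers apply p independently, and the collector
    restores the original stream order."""
    queues = [[] for _ in range(n_workers)]
    for tag, v in enumerate(stream):           # emitter: tag and distribute
        queues[tag % n_workers].append((tag, v))
    results = {}
    for q in queues:                           # workers: apply p to each item
        for tag, v in q:
            results[tag] = p(v)
    return [results[t] for t in sorted(results)]   # collector: reorder

# farm(in, out, p) = p: same output stream as applying p directly
p = lambda x: x * 10
stream = [3, 1, 4, 1, 5]
assert farm(p, stream) == [p(x) for x in stream]
```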
7 Related works and conclusions
In the literature, a complete formal operational
semantics of the P3L language has never been given. All existing attempts
are either too `abstract', and thus unable to capture the operational behavior
of the skeletons, or too `concrete', and thus too close to the implementation
level.
In [13]
a functional semantics of the constructs is formalized in a Miranda-like
notation, i.e. as a definition of the function computed by each construct
on a single input value. The parallel behavior, however, that is the
kind of parallelism exploited in the implementation of each construct,
is only provided informally, as a description of the behavior of logical
process graphs.
In other works (see for ex. [1,8,4,
The main achievement of this paper
is to give a formal operational semantics which is machine-independent,
and thus able to abstract from the management of parallel tasks and from the
communication primitives of any specific architecture. In fact, in our model
the semantics of each P3L construct is captured by the operational behavior
of its corresponding template, which is a pure logical process graph.
Moreover, our model has been taken as a mathematical basis to prove the most
interesting rewriting rules currently used to optimize the actual compilation
phase. The only existing attempts to prove such rules are presented,
informally, in [1].
As future work we intend to extend
the proposed model to give a complete formal semantics of the recent extensions
of P3L [17], to prove other properties of the new compilers, and eventually
to drive the porting of the parallel framework onto different platforms.
Acknowledgments. Special thanks to Egon Börger
for his invaluable help and encouragement to start this work.
References

[1]

M. Aldinucci, M. Coppola, ``Optimisation techniques for structured
parallel programs''. Graduation thesis (in Italian), Dip. di Informatica,
Univ. di Pisa, 1997.

[2]

M. Aldinucci, M. Coppola, M. Danelutto, ``Rewriting skeleton
programs: how to evaluate the data-parallel stream-parallel tradeoff''.
In Proc. of CMPP'98. University of Passau TR MIP-9805.

[3]

http://www.eecs.umich.edu/gasm

[4]

B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, M. Vanneschi,
``P3L: A Structured High level programming language and its structured
support'', in Concurrency: Practice and Experience, vol. 7, 1995,
pp. 225-255.

[5]

C. Beierle, E. Börger, I. Durdanovic, U. Glässer,
E. Riccobene, ``Refining Abstract State Machine Specifications of the Steam
Boiler Control to Well Documented Executable Code'', in Formal Methods
for Industrial Applications: The Steam-Boiler Case Study, LNCS 1165,
pp. 52-78, 1996.

[6]

E. Börger, W. Schulte, ``A Modular Design for the Java VM
architecture'', in E. Börger (Ed.): Architecture Design and Validation
Methods, Springer, 1998.

[7]

D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser,
E. Santos, R. Subramonian, T. von Eicken, ``LogP: Towards a Realistic Model
of Parallel Computation'', Proc. of 4th ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming, 1993.

[8]

M. Danelutto, ``P3L-PQE2000: Syntax''. Internal note, Ver.
0.8. Univ. di Pisa, Dip. di Informatica, May 1996.

[9]

M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti, M.
Vanneschi, ``A methodology for the development and the support of massively
parallel programs''. TR-25/91, Ver. 0.8. Univ. di Pisa, Dip. di Informatica,
Dec. 1991.

[10]

J. Darlington, A.J. Field, P.G. Harrison, P.H.J. Kelly, D.W.N.
Sharp, Q. Wu, R.L. While, ``Parallel Programming Using Skeleton Functions'',
Proc. of PARLE '93, pp. 146-160. Springer, 1993.

[11]

A. Dold, T. Gaul, V. Vialard, W. Zimmerman, ``ASM-Based Mechanized
Verification of Compiler Back-Ends'', Proc. of the 28th Annual Conference
of the German Society of Computer Science, Tech. Rep., Magdeburg University,
1998.

[12]

Y. Gurevich, ``Evolving Algebras 1993: Lipari Guide''; in
E. Börger (Ed.): Specification and Validation Methods, Oxford University
Press (1995).

[13]

S. Pelagatti, ``A methodology for the development and the
support of massively parallel programs'', Ph.D. thesis, Dip. di Informatica
Univ. di Pisa, 1993.

[14]

S. Pelagatti, ``Structured Development of Parallel Programs'';
Taylor & Francis (Ed.), 1997.

[15]

K. Winter, ``Model Checking for Abstract State Machines'',
in: JUCS, Vol. 3, No. 5, pp. 689-701, 1997.

[16]

W. Zimmerman, T. Gaul, ``On the Construction of Correct Compiler
Back-Ends: An ASM Approach'', in: JUCS, Vol. 3, No. 5, pp. 504-567, 1997.

[17]

http://www.di.unipi.it/~susanna/p3l.html

[18]

L.G. Valiant, ``A Bridging Model for Parallel Computation'',
Comm. of the ACM, 33(8), pp. 103-111, 1990.

Footnotes:
^{1} In the
current implementation the language used is C++.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy
otherwise, to republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SAC 2000, March 19-21, Como, Italy
(c) 2000 ACM 1-58113-239-5/00/003 $5.00