DIRTRIBUTED QUERY OPTIMIZATION USING PERF JOINS
Ramzi Haraty, Roula Fany
Lebanese American University
P.O. Box 135053
Beirut, Lebanon
rharaty@beirut.lau.edu.lb, rolafany@sodetel.net.lb
ABSTRACT
The advent of telecommunication era and the constant development of hardware and network structures have encouraged the decentralization of data while increasing the needs to access information from different sites. Query optimization strategies aim to minimize the cost of transferring data across networks. Many techniques and algorithms have been proposed to optimize queries. Perhaps one of the more important algorithms is the AHY algorithm using semijoins that is implemented by Apers, Hevner and Yao. Nowadays, a new technique called PERF (Partially Encoded Record Filters), presented by Kenneth Ross seems to bring some improvement over semijoins. PERF joins are twoway semijoins using a bit vector as their backward phase. Our research encompasses applying PERF joins to the AHY algorithm and producing the AHYPERF algorithm. Programs were designed to implement both the AHY and AHYPERF. Several experiments were conducted and the results showed a very considerable enhancement of AHYPERF over the original AHY.
Keywords: Distributed Query Optimization, Semi Joins, and PERF Joins.
1. Introduction
The recent telecommunication boom has encouraged business expansion resulting in the decentralization of data while increasing the needs for instant information access.
A distributed database system (DDBS) is a collection of sites connected on a common highbandwidth network [5]. Logically, data belongs to the same system but physically it is spread over the sites of the network, making the distribution invisible to the user [6]. Each site is an autonomous
database with its processing capability and data storage capacity. The advantage of this distribution resides in achieving performance, reliability, availability, and modularity.
Distributed query processing is the process of retrieving data from different sites. Accessing data from sites involves transmission via communication links that creates delays. The basic challenge is to design and develop efficient query processing techniques and strategies to minimize the communication cost.
Nowadays, with the explosion of interest in data warehouses and the development of huge applications such as federation and mediation over heterogeneous and objectoriented databases, there is a pressing need for data reduction. This is the main purpose of query optimization which estimates the cost of alternative query plans in order to choose the best plan to answer quickly and efficiently, complex and expensive queries [7].
The query optimization problem was addressed many times, from different perspectives, and a lot of work has been done. Proposed algorithms and techniques can be categorized in two main approaches:
Some might add another category which is the hybrid approach, merging both data reduction and time reduction.
In this paper, we will mainly focus on the first approach. One of the most popular and important algorithms suggested for query optimization with minimum cost was algorithm GENERAL (total cost) presented by Apers, Hevner and Yao [2]. The advent of AHY was a revolution in query optimization domain because it introduced semijoins as reducers in the query optimization process. It uses the threephased approach method that consists of the following:
Using local operations such as projections and selections, data was filtered then shipped to other sites. Reduction was made by gradually applying semijoins while transferring data from one site to another. The improvement made over the traditional unoptimized method was enormous.
A decade after, and with the continuous research and methods developed, a new technique called PERF (Partially Encoded Record Filter) was presented by Kenneth Ross [4]. This method adds to semijoins another dimension that is the backward phase that will be used to eliminate unnecessary redundant semijoins by using bit vectors.
In this paper we present an improvement over AHY using PERF joins applied to AHY algorithm. The paper is organized as follows: Section 2 discusses the original AHY algorithm. Section 3 presents our own improvement over AHY – the AHYPERF algorithm. Section 4 presents the experimental results. And section 5 presents the conclusion.
2. The AHY Algorithm
For a special class of queries called simple queries, Apers, Hevner and Yao have developed a collection of algorithms to minimize the cost of query processing. The algorithms developed in were SERIAL and PARALLEL for total time and minimum response time optimization respectively [2]. Then, later they proposed an extended version suitable for general queries called GENERAL. Using this algorithm, data transmissions used for reducing a relation and the transmission of the reduced relation to the assembly site form a schedule for this relation. Throughout this algorithm, different schedules are built and compared until the most optimal one is chosen. Note also that the transmission cost between two computers is a linear function of the size of the data. The incoming selectivity of a schedule for a relation is the product of selectivity of all the attributes in the schedule. As our study mainly focuses on total cost optimization, we will limit our discussion on AHY to SERIAL and GENERAL:
In other terms, in a distributed database system, it is better to first perform initial local processing in order to reduce the amount of data to be transmitted. After initial processing, if the relations contain only one attribute that is the joining attribute, queries are called simple [1]. Algorithm GENERAL tries to decompose complex queries into simple ones. Then algorithm SERIAL orders relations by increasing order in terms of size and starts creating schedules for the simple queries. The resulting schedules are examined and integrated to form an optimized schedule [3]. For every candidate schedule to relation R_{i} containing a transmission of a joining attribute from the same relation R_{i}, the algorithm adds a new candidate schedule without the transmission of this joining attribute. This step is important because a relation cannot be reduced by the selectivity of its own joining attribute. Then, the schedule that minimizes the total time of transmitting R_{i}, if only one joining attribute, is considered. The selected schedule is called BEST_{ij.} Note that procedure TOTAL does not take into consideration the redundant transmissions due to the fact that schedules are constructed separately. This problem is considered in procedure COLLECTIVE that constructs only one basic strategy for the entire query and then tries to do some variations in order to reach the most optimal schedule [1]. In this paper we will discuss algorithm AHY GENERAL Total time version using procedure TOTAL.
2.1 Complexity Analysis of Algorithm GENERAL (Total Time)
The worst case complexity of algorithm GENERAL was proved to be O(ó m^{2}) where ó is the number of different simple queries (i.e., different joining attribute to which algorithm GENERAL is applied) and m is the number of relations in the query.
3. The AHYPERF Algorithm
Partially Encoded Record Filter, is a new twoway semijoin implementation primitive. The basic idea of PERF is as follows  consider two relations R and S. Apply the following steps to the two tables:
The fourth step is known as the backward phase. The main utility of PERF is that it minimizes this phase and hence makes the forward phase (step 2) cost greater than the backward phase. PERF joins can be better enhanced by sending back to R not all the bit vector corresponding the P_{R} but only the 0s part or 1s part according to which one is less in size and hence has lower transmission cost. Figure 1 illustrates the PERF join technique.
R(A,X) R(X,B)
1 
a1 
10 
100 
b1 
1 

2 
a2 
20 
30 
b2 
2 

3 
a3 
30 
50 
b3 
3 

4 
a4 
40 
90 
b4 
4 

5 
a5 
50 
20 
b5 
5 

6 
a6 
66 
70 
b6 
6 
PERF(R) PERF(S)
1 
0 
0 
1 

2 
1 
1 
2 

3 
1 
1 
3 

4 
0 
0 
4 

5 
1 
1 
5 

6 
0 
0 
6 
Figure 1: PERF for R and S.
When applying PERF to AHY algorithm the following is performed:
3.1 Algorithm SERIAL_PERF
S_{1£ } S_{2 £ }… £ S_{m}
R_{1 Õ } R_{2} Õ    R_{n Õ } result node
Or else if R_{r} is a result at the result node,
then there are two strategies:
R_{1Õ } R_{2} Õ    Õ R_{r} Õ    R_{n Õ } R_{r}
Or
R_{1Õ } R_{2} … Õ R_{r1} Õ R _{r+1 Õ } … R_{n Õ } R_{r}
Select the one with minimum total time.
If PERF _{Ri Ri + 1 j} = 1 then
Cost = 0
Else
Cost = C_{0} + C_{1} * b_{ik} + (b_{ik} * ñ _{(i + 1) k} )/8
where C_{0} + C_{1} * b_{ik} is the linear function of transmission cost that is equal to the fixed cost per byte transmitted (C_{1}) multiplied by the size in bytes of the join attribute projected. This is the usual cost of a semijoin known as the forward cost. (b_{ik} * ñ _{(i + 1) k} )/8 is the backward cost that is the cost of transmitting back to R_{i} the bit vector consisting of only matching values of the corresponding attribute. For simplicity of this equation, we are considering attribute k of width 1 byte.
3.2 Algorithm TOTAL_PERF
ART_{i1}+C(s_{1}*SLT_{i1})£ .£ ART_{ió }+C(s_{i}*SLT_{ió })
where ART is the arrival time of the best
schedule, SLT is the accumulated
attribute selectivity of the best schedule,
and s is the selectivity of the
corresponding relation.
As it can be seen, the PERF version of AHY algorithm does not eliminate redundant transmissions from the schedules but it makes their cost 0 when they occur. This can be made possible by adding a little overhead on the transmission cost that is the backward cost. Using this fact, if a transmission was done from site A to site B using a join attribute j, then every other transmission from A to B using j will have a zero cost and every transmission from B to A using j will have also a zero cost. From this point, a PERF join can be seen as a nonredundant symmetric function. This fundamental property allowed us to enhance over the traditional AHY GENERAL Total Time.
We note that the reduction effect of PERF is proportional to the width of the attributes used. In section 5, we show results from different width selections to clarify this issue.
3.3 Complexity Analysis of the AHYPERF Algorithm
As far as complexity is concerned, there was not a considerable increase in the complexity of AHY algorithm since data will be still scanned in the same way and for the same number of times. Ordering will also be done in the same fashion. What is added is only the maintenance of the PERF list. According to its implementation, PERF list could be very easily maintained and with minimum complexity time. In our case, PERF list was implemented as a threedimensional array. So globally, and without loss of generality, we can assume that PERF version of AHY algorithm takes no more than O (ó m^{2}) where ó is the number of different simple queries, and m is the number of relations in the query.
4. Experimental Results
Different scenarios were conceived in order to evaluate the performance of the different algorithms and for each scenario programs were run 1500 times. Different kinds of results are collected:
4.1 Scenario 1:
In this scenario the attribute width is taken as 1 byte for all attributes.
TYPE 
AHY 
AHYPERF 
AHYPERF /AHY 
22 
24.61 
31.70 
7.09 
23 
37.31 
43.86 
6.55 
24 
47.13 
53.79 
6.67 
32 
22.58 
30.25 
7.68 
33 
34.00 
40.36 
6.37 
34 
43.10 
48.90 
5.80 
42 
30.38 
37.87 
7.50 
43 
37.04 
43.70 
6.66 
44 
44.61 
50.55 
5.94 
52 
37.25 
44.92 
7.67 
53 
42.27 
49.01 
6.75 
54 
48.18 
54.37 
6.19 
TOT: 
37.37 
44.11 
6.74 
TYPE 
AHYPERF/AHY 
22 
9.59 
23 
10.82 
24 
13.08 
32 
10.15 
33 
10.17 
34 
10.97 
42 
11.33 
43 
11.36 
44 
11.59 
52 
13.31 
53 
12.91 
54 
13.27 
TOT: 
11.54 
Graphically, the results are represented as follows: comparing AHYPERF to AHY: we notice that AHYPERF outperforms AHY in all cases.
4.2 Scenario 2:
In this scenario the attribute width is taken as 50 bytes for all attributes.
TYPE 
AHY 
AHYPERF 
AHYPERF /AHY 
22 
25.29 
33.22 
7.93 
23 
37.53 
45.27 
7.74 
24 
48.39 
56.21 
7.83 
32 
22.08 
30.98 
8.90 
33 
34.34 
41.98 
7.64 
34 
42.81 
50.02 
7.21 
42 
29.81 
38.92 
9.11 
43 
36.27 
44.01 
7.75 
44 
43.99 
51.33 
7.34 
52 
36.49 
46.24 
9.75 
53 
41.75 
50.01 
8.26 
54 
48.35 
55.89 
7.54 
TOT: 
37.26 
45.34 
8.08 
TYPE 
AHYPERF/AHY 
22 
10.86 
23 
12.81 
24 
15.59 
32 
11.71 
33 
12.32 
34 
13.39 
42 
13.74 
43 
13.18 
44 
14.19 
52 
16.63 
53 
15.64 
54 
16.30 
TOT: 
13.86 
Graphically, the results are represented as follows: comparing AHYPERF to AHY: we notice that AHYPERF outperforms AHY in all cases.
We used two different scenarios in order to study the performance of the above mentioned algorithms from different perspectives. For each scenario, we compared the performance of the algorithms with respect to each other and with respect to the unoptimized solution. Using different scenarios we studied better the behavior of all algorithms under a variety of circumstances. We could be able to note that AHYPERF has the best performance for a field width of 50 bytes. This result was expected because of the overhead added by PERF to the backward phase. Remember that PERF concept consists of returning back to the original site a bit vector representing the matching tuples. This overhead is somehow more considerable when the original field width is <= 1 byte because it might be more profitable sometimes not to send back this data. But when having a width of 50 bytes, the backward cost becomes negligible as compared to the forward cost.
Finally, we can conclude that the results of our experiments were up to the expectations and proved the power of PERF joins and their advantage in optimizing the total time of distributed queries.
6. Conclusion
In this paper, we have fully exposed both concepts of semijoins and PERF joins and then, we have taken an optimization algorithm using semijoins (AHY) and enhanced it by applying PERF joins (AHYPERF). In addition, we have discussed the advantages of PERF joins over semijoins which mainly consist of removing the cost associated with redundant transmissions by adding a relatively negligible cost to the backward phase of each PERF join.
Experimental results confirmed our expectations by showing a considerable enhancement over the AHY algorithm. Different series of experiments were conducted, allowing us to study even better the efficiency of PERF joins from different perspectives and to consider the best case for which PERF joins perform at most. We could then, based on our experiments recommend the use of PERF joins for huge textual and graphic distributed databases where the width of some join attributes is quite large, as well as for ordinary data.
References
[1] Alan R. Hevner, O. Qi Wu and S. Bing Yao, "Query Optimization on Local Area Networks", ACM Transactions on Office Information Vol.3, No.1 Pages: 35  62, January 1985.
[2] Peter M.G. Apers, Alan R. Hevner and S. Bing Yao , "Optimization Algorithms For Distributed Queries", IEEE Transactions On Software Engineering, Vol. Se9, No.1, Pages: 5768, January 1983.
[3] Todd Bealor, "Semijoin Strategies For Total Cost Minimization In Distributed Query Processing", Master Thesis, University of Windsor, Canada, 1995.
[4] Zhe Li, K. A. Ross, "PERF Join: An Alternative to TwoWay Semijoin and Bloomjoin", Columbia University, New York, 1995.
[5] LiYun Chen, "An Algorithm For Distributed Query Processing In a Fiber Distributed Data Interface, HighBandwidth, Token Ring Local Area Network", Master Thesis, North Dakota State University, May 1990.
[6] R. ElMasri , S. Navathe, "Fundamentals of Database Systems", Second Edition, Addison Wesley, 1994.
[7] D. Barbara, W. DuMouchel, C. Faloustos, P. J. Haas, J. M. Hellerstein, Y. Iaonnidies, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross and K. C. Sevcik, "The New Jersey Data Reduction Report", Bulletin Of The Technical Committee On Data Engineering, Pages: 345, December 1997.
Biography
Ramzi A. Haraty is an Assistant Professor of Computer Science at the Lebanese American University in Beirut, Lebanon. He received his B.S. and M.S. degrees in Computer Science from Minnesota State University  Mankato, Minnesota, and his Ph.D. in Computer Science from North Dakota State University  Fargo, North Dakota. His research interests include database management systems, artificial intelligence, and multilevel secure systems engineering. He has well over 35 journal and conference paper publications.
Roula Fany is currently with the Beirut Riyad Bank. She received her Maters of Science degree in Computer Science from the Lebanese American University.