
THE APPLICATION OF ASSOCIATION RULE MINING TO REMOTELY SENSED DATA

 

 

Jianning Dong, William Perrizo, Qin Ding and Jingkai Zhou

North Dakota State University

Fargo, ND 58105

Perrizo@plains.nodak.edu

 

 

KEYWORDS

 

Association rule mining, data mining, remotely sensed data

 

ABSTRACT

 

The explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Data mining is such a technique. In this paper, we consider the mining of association rules from remotely sensed data and its application in precision agriculture. Based on the characteristics of the remotely sensed data and the problem itself, we present a bit-oriented formal model and discuss the issues of partitioning quantitative attributes into equal, unequal and discontinuous partitions. We propose two new pruning techniques and compare their performance with a base algorithm. An improvement in performance is shown when using these pruning techniques.

 

 

  1. INTRODUCTION

    In the last decade, we have seen an explosive growth in our capabilities to both generate and collect data. Advances in scientific data collection (e.g. from remote sensors and satellites), the widespread introduction of bar codes for almost all commercial products, and the computerization of many business transactions (e.g. credit card purchases) and government transactions (e.g. tax returns) have generated a flood of data. Advances in data storage technology, such as faster, higher capacity, and cheaper devices (e.g. magnetic disks, CD-ROMs), better database management systems, and data warehouse technology, have allowed us to transform this data deluge into "mountains" of stored data. But what we really want is information.

     

    Data mining, which is also referred to as knowledge discovery in databases, gives users the tools to extract previously unknown and potentially useful information (such as knowledge rules, constraints, and irregularities) from data in databases [CHY96]. Data mining has been recognized as one of the promising areas of research encompassing databases, statistics and machine learning.

     

     

     

    The primary goal of data mining in practice is to assist strategic decision making. It does that by building a model of the real world based on data collected from a variety of sources. Models can be used to predict as well as describe [Fl98].

     

    Association rule mining is one of the important data mining techniques. Association rules are primarily useful for describing the behavior that is captured in the database. A rule can be expressed as X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y. Each rule has two measures of value: support, which indicates the frequency of the occurring pattern, and confidence, which denotes the strength of the implication in the rule. The support of the rule X ⇒ Y is support(X ∪ Y). The confidence c of the rule X ⇒ Y in the transaction set D means that c% of the transactions in D that contain X also contain Y, which can be written as the ratio support(X ∪ Y) / support(X).
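    For example, a minimal sketch of computing support and confidence over a small transaction set (the item names below are made up for illustration only):

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]

def support(itemset):
    # Fraction of transactions that contain every item of itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
print(support(X | Y))                # support of X => Y: 2/4 = 0.5
print(support(X | Y) / support(X))   # confidence of X => Y: 0.5 / 0.75 = 0.67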

     

    Given a user specified minimum support and minimum confidence, the problem of mining association rules is to find all the association rules whose support and confidence are larger than the respective thresholds. Thus, it can be decomposed into two sub-problems:

    * Finding the large itemsets which have support above the predetermined minimum support.

    * Deriving all rules, based on each large itemset, which have confidence above the predetermined minimum confidence.

     

    Efficiency is an important issue for a mining algorithm. Many interesting and efficient data mining algorithms have been proposed ([AIS93], [ASR94], and [CNF96]). Performance improvement is an essential concern for a particular algorithm because mining association rules may require multiple scans of the database. In this paper, we propose two new pruning techniques for generating candidate itemsets to mine association rules from remotely sensed data in precision agriculture.

     

  2. BASE ALGORITHM (APRIORI ALGORITHM)

    Various algorithms have been proposed to discover the large itemsets [AIS93], [ASR94]. The Apriori algorithm is one of the most popular algorithms for mining association rules in a centralized database. The main idea of Apriori is outlined in the following [FUM96].

     

     

    Table 1. Notation for the mining algorithm

    k-itemset    An itemset having k items.
    Lk           Set of large k-itemsets.
    Ck           Set of candidate k-itemsets.

     

    The Apriori algorithm is shown in Figure 1.

     

    L1 = { large 1-itemsets };
    for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
        Ck = apriori-gen( Lk-1 );       // New candidates
        forall transactions t ∈ D do begin
            Ct = subset( Ck, t );       // Candidates contained in t
            forall candidates c ∈ Ct do
                c.count++;
        end
        Lk = { c ∈ Ck | c.count ≥ minsup }
    end
    Answer = ∪k Lk;

     

    Figure 1. Apriori algorithm
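    For illustration only (a sketch, not the authors' implementation), the same loop can be written in Python, with transactions given as sets of items and minsup as an absolute support count:

from itertools import combinations

def apriori_gen(Lprev, k):
    # Join step: union pairs of large (k-1)-itemsets; prune step: keep only those
    # candidates whose every (k-1)-subset is itself large.
    items = list(Lprev)
    candidates = set()
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            union = items[i] | items[j]
            if len(union) == k and all(frozenset(s) in Lprev
                                       for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

def apriori(transactions, minsup):
    # transactions: list of sets of items; minsup: absolute support count.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {c for c, n in counts.items() if n >= minsup}   # large 1-itemsets
    answer = set(Lk)
    k = 2
    while Lk:
        Ck = apriori_gen(Lk, k)                          # new candidates
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:                               # candidate contained in t
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= minsup}
        answer |= Lk
        k += 1
    return answer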

     

    3. MINING ASSOCIATION RULES FROM IMAGERY DATA

     

    3.1 Problem Definition

     

    An example is described as follows. We have an image scene and a yield map, both digital photos taken of the same field. The image scene contains red, blue and green bands; it represents the reflectance levels of each pixel of the scene. The yield map also contains red, blue and green bands, and there is a formula to convert these three bands into a gray scale (one band). We thus have five attributes in our database: pixel-number, red (band1), blue (band2), green (band3) and yield (band4). The problem of mining association rules from this imagery data is to discover the associations between band1, band2, band3 and band4. This will help farmers understand what combinations of spectral band values are associated with high crop yield.

     

    3.2 Partitioning Quantitative Attributes

     

    The Apriori algorithm can only handle categorical basket data. When we have quantitative attributes in the database, we cannot directly apply the techniques for mining association rules in basket databases. In a basket database, we say a specific transaction includes a set of items, meaning those items appear in the transaction, so we can use a boolean value TRUE or FALSE to express the relationship between an item and a transaction. But for quantitative attributes, say one box of beer versus ten boxes of beer, this representation cannot distinguish between the two values. To deal with this type of data, several approaches have been proposed. The most common method is simply to partition the attribute values into intervals; the intervals are mapped to consecutive integers, such that the order of the intervals is preserved [SRI96]. Without loss of generality, we partition the values into equal, uneven, or discontinuous intervals. We then have a set of <attribute, interval> pairs. The "item" <attribute, interval1> would be "1" if the attribute had a value within interval 1 in the original tuple, and "0" otherwise.
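    As a small illustration (the band names, the 0-255 value range and the chosen depths are assumptions for 8-bit imagery, not the paper's data), partitioning a value into 2^d equal intervals and forming the resulting boolean items could be sketched as:

def interval_index(value, d, lo=0, hi=255):
    # Map a value in [lo, hi] to one of 2**d equal intervals (0-based index).
    width = (hi - lo + 1) // (2 ** d)
    return min((value - lo) // width, 2 ** d - 1)

def to_items(pixel, depths):
    # Turn a pixel {band: value} into boolean "items" of the form (band, interval).
    return {(band, interval_index(value, depths[band])) for band, value in pixel.items()}

# A hypothetical 8-bit pixel, with d = 2 for band1/band4 and d = 3 for band2/band3:
pixel = {"band1": 40, "band2": 140, "band3": 200, "band4": 240}
print(sorted(to_items(pixel, {"band1": 2, "band2": 3, "band3": 3, "band4": 2})))
# [('band1', 0), ('band2', 4), ('band3', 6), ('band4', 3)]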

     

    3.3 Finding Large Itemsets from Imagery Data

     

    In data mining, it is essential to collect a sufficient amount of data so that we can derive meaningful conclusions from it. As a result, the amount of data tends to be huge. There are 40 million pixels in one image scene. Efficiency is an important issue for a mining algorithm.

     

    Finding the large itemsets can be done iteratively, in the sense that the large itemsets discovered in one iteration are used as the basis to generate the candidate itemsets for the next iteration. For example, in the Apriori algorithm, the large k-itemsets are generated at the kth iteration; in the next iteration, a heuristic is used to extend the large k-itemsets into candidate (k+1)-itemsets. The heuristic used to construct candidate itemsets from large itemsets is crucial to performance. Clearly, in order to be efficient, the heuristic should only generate candidates with a high likelihood of being large itemsets, because for each candidate we need to count its appearances in the database. The more candidate itemsets there are, the more processing cost is required to discover the large itemsets. A performance study [PCY97] shows that the set of candidate itemsets generated during an early iteration is generally orders of magnitude larger than the set of large itemsets it really contains. Therefore, the initial candidate set generation, especially for the large 2-itemsets, is the key issue in improving the performance of data mining.

     

  4. NEW PRUNING TECHNIQUES FOR FAST DATA MINING

 

Two effective pruning techniques for candidate itemset generation are proposed to progressively improve the efficiency of the base algorithm.

 

4.1 Technique one

 

Lemma 1. A pixel value cannot belong to two different intervals of the same band.

Lemma 2. A combination of k intervals (k > 1) from the same band has support zero.

 

From Lemma 2, we know that any candidate k-itemset (k > 1) containing two or more intervals from the same band must have support zero. It is impossible for such candidate k-itemsets to become large k-itemsets later. The first pruning technique proposed in this paper is therefore that we do not need to generate candidate k-itemsets among "items" that belong to the same band with different intervals.

We compare the Apriori algorithm and our new algorithm for generating candidate 2-itemsets. The notation is shown in Table 2.

 

Table 2. Notation for comparison

 

Ck        Candidate k-itemsets
Lk        Large k-itemsets
| Ck |    Number of itemsets in the candidate k-itemsets
| Lk |    Number of itemsets in the large k-itemsets
*         An operation for concatenation
Rj        Number of intervals in band j

 

1. Apriori algorithm: According to the fact that any subset of a large itemset must also be a large itemset, Apriori uses L1 * L1 to generate the candidate 2-itemsets C2. The number of candidate 2-itemsets is

| C2 |apriori = C(| L1 |, 2) = | L1 | (| L1 | - 1) / 2,

where C(x, 2) denotes the number of unordered pairs that can be chosen from x items.

2. New algorithm: Based on Lemma 2, we do not pair intervals from the same band when generating candidate 2-itemsets. Assume | L1 | = R1 + R2 + … + Rn. Then

| C2 |new = R1 (R2 + R3 + … + Rn) + R2 (R3 + … + Rn) + … + Rn-1 Rn = Σ_{j=1..n-1} Rj (Σ_{k=j+1..n} Rk).

 

The number of candidate 2-itemsets generated by the new algorithm is much smaller than that generated by Apriori. The number of pruned candidate 2-itemsets is

| C2 |prune1 = | C2 |apriori - | C2 |new = Σ_{j=1..n} C(Rj, 2) = Σ_{j=1..n} Rj (Rj - 1) / 2.

 

Note that when n and Rj are large, | C2 |prune1 becomes an extremely large number. For example, if the imagery data has eight bands and each band has sixteen intervals, the number of pruned candidate 2-itemsets is 8 x 16 x (16 - 1) / 2 = 960. This sharply reduces the processing cost. The new algorithm employs this effective pruning technique to progressively reduce the number of candidate 2-itemsets, thus relieving the performance bottleneck of the whole process.
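A minimal sketch of candidate 2-itemset generation with this same-band pruning, assuming each item of L1 is represented as a (band, interval) pair:

from itertools import combinations

def gen_c2_prune1(large1):
    # Technique one: generate candidate 2-itemsets only from items in different bands.
    # large1 is an iterable of (band, interval) pairs.
    return [frozenset([a, b])
            for a, b in combinations(large1, 2)
            if a[0] != b[0]]                      # skip pairs from the same band

# With 8 bands of 16 intervals each, Apriori would generate C(128, 2) = 8128
# candidate 2-itemsets; same-band pruning removes 8 * C(16, 2) = 960 of them.
items = [(band, i) for band in range(8) for i in range(16)]
print(len(gen_c2_prune1(items)))   # 8128 - 960 = 7168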

 

4.2 Technique two

 

During the process of data mining, allowing user interaction with the mining engine and using the user's prior knowledge helps to speed up the mining algorithms by restricting the search space.

 

In precision agriculture applications, farmers want to know which combinations of reflectance intervals from certain bands imply certain agricultural phenomena, such as high yield. This means farmers are only interested in rules in which reflectance bands appear in the antecedent and other bands appear in the consequent. The mining task is to find the association rules X ⇒ Y, where X = { collection of reflectance intervals }, Y = { output interval }, and X ∩ Y = ∅.

 

This gives us a heuristic: if a large itemset does not contain an item from Y, the rules derived from it later will be uninteresting to the user. In this paper, we propose another pruning technique for candidate itemset generation based on this heuristic.

 

Assume there are N bands in the imagery data; 1, ..., R are the indexes of the reflectance bands and R+1, ..., N are the indexes of the output bands. The association rules we mine should have the form:

band1 ∧ … ∧ bandR ⇒ band(R+1) ∧ … ∧ bandN.

 

  1. Consider only one band, bandN, in the output. The association rule has the form: band1 ∧ … ∧ band(N-1) ⇒ bandN.

 

For candidate 2-itemsets, since we know that bandN should be in every candidate itemset, the new algorithm only generates those candidate itemsets in which one item is a bandN interval and the other is a bandk (k = 1, ..., N-1) interval. The candidate 2-itemsets are of the form:

{<bandN, range>, <bandk, range>}, k = 1, ..., N-1.

 

Then the number of candidate 2-itemsets is

| C2 |new = RN Σ_{j=1..N-1} Rj.

 

The reason is that we are not interested in itemsets that do not contain a bandN interval. We prune those candidate itemsets in which no interval is chosen from bandN. The number of pruned candidate 2-itemsets is

 

| C2 |prune2 = Σ_{j=1..N-2} Rj (Σ_{k=j+1..N-1} Rk).

 

Compared to the Apriori algorithm, the pruning technique described in technique one additionally removes

| C2 |prune1 = Σ_{j=1..N} C(Rj, 2).

The total number of pruned candidate 2-itemsets is

| C2 |prune = | C2 |prune1 + | C2 |prune2 = C(Σ_{j=1..N-1} Rj, 2) + C(RN, 2).

 

The remaining steps are the same as in the Apriori algorithm.
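A minimal sketch of this constrained generation step, assuming the same (band, interval) item representation as before and that the output band index is known:

def gen_c2_with_output(large1, output_band):
    # Technique two (single output band): every candidate 2-itemset must contain
    # an interval of the output band; the other item comes from any other band.
    out_items = [x for x in large1 if x[0] == output_band]
    other_items = [x for x in large1 if x[0] != output_band]
    return [frozenset([o, i]) for o in out_items for i in other_items]

# Illustrative large 1-itemsets: four bands with two large intervals each,
# band 4 being the output band.
large1 = [(1, 1), (1, 2), (2, 5), (2, 6), (3, 2), (3, 7), (4, 2), (4, 4)]
print(len(gen_c2_with_output(large1, output_band=4)))   # 2 * (2 + 2 + 2) = 12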

 

2. Consider two bands in the output. The association rule has the form:

band1 ∧ … ∧ band(N-2) ⇒ band(N-1) ∧ bandN.

 

For candidate 2-itemsets, the new algorithm only generates those candidate itemsets in which at least one item is a band(N-1) or bandN interval. The candidate 2-itemsets are of the form:

{<bandk, range>, <band(N-1), range>}, k = 1, ..., N-2, or

{<bandk, range>, <bandN, range>}, k = 1, ..., N-1.

 

The number of candidate 2-itemsets is

| C2 |new = R(N-1) Σ_{j=1..N-2} Rj + RN Σ_{j=1..N-1} Rj.

 

We prune those candidate itemsets in which no interval is chosen from either band(N-1) or bandN. The number of pruned candidate 2-itemsets is

| C2 |prune2 = Σ_{j=1..N-3} Rj (Σ_{k=j+1..N-2} Rk).

Compared to the Apriori algorithm, the pruning technique described in technique one additionally removes

| C2 |prune1 = Σ_{j=1..N} C(Rj, 2).

 

The total number of pruned candidate 2-itemsets is

| C2 |prune = | C2 |prune1 + | C2 |prune2,

and the remaining steps are the same as in the Apriori algorithm.

 

3. In the general case, the association rule has the form:

band1 ∧ … ∧ bandM ⇒ band(M+1) ∧ … ∧ bandN.

 

This means there are (N-M) bands in the output. We only generate those candidate itemsets in which at least one item is a bandj interval (j = M+1, …, N).

 

The candidate 2-itemsets are of the form:

{<bandk, range>, <bandj, range>}, where k = 1, …, M and j = M+1, …, N, or

{<bandm, range>, <bandn, range>}, where m = M+1, …, N-1 and n = m+1, …, N.

 

The number of candidate 2-itemsets is

| C2 |new = (Σ_{k=1..M} Rk) (Σ_{j=M+1..N} Rj) + Σ_{j=M+1..N-1} Rj (Σ_{k=j+1..N} Rk).

 

We prune those candidate itemsets in which no interval is chosen from any output band bandj (j = M+1, …, N). The number of pruned candidate 2-itemsets is

| C2 |prune2 = Σ_{j=1..M-1} Rj (Σ_{k=j+1..M} Rk).

 

Compared to the Apriori algorithm, the pruning technique described in technique one additionally removes

| C2 |prune1 = Σ_{j=1..N} C(Rj, 2).

 

The total number of pruned candidate 2-itemsets is

| C2 |prune = | C2 |prune1 + | C2 |prune2 = C(Σ_{k=1..M} Rk, 2) + Σ_{j=M+1..N} C(Rj, 2).

 

The remaining steps are the same as in the Apriori algorithm.

From a mathematical point of view, when Rj and N are large, that is, when the number of intervals and the number of bands are large, the new algorithm prunes a large number of unnecessary candidate 2-itemsets and thus greatly improves the performance of the whole mining process.
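As a quick sanity check of these formulas (a sketch only; the band sizes R and the input/output split M below are arbitrary illustrative values), the pruned and retained candidate counts should add up to the Apriori total:

def candidate_counts(R, M):
    # R[j] = number of intervals in band j+1; bands 1..M are inputs, M+1..N outputs.
    total = sum(R)
    apriori = total * (total - 1) // 2                       # C(|L1|, 2)
    inputs, outputs = sum(R[:M]), R[M:]
    new = inputs * sum(outputs) + sum(outputs[j] * sum(outputs[j + 1:])
                                      for j in range(len(outputs)))
    prune1 = sum(r * (r - 1) // 2 for r in R)                # same-band pairs
    prune2 = sum(R[j] * sum(R[j + 1:M]) for j in range(M))   # input-only pairs
    return apriori, new, prune1, prune2

# 8 bands with 16 intervals each, one output band (M = 7 input bands):
apriori, new, p1, p2 = candidate_counts([16] * 8, M=7)
print(apriori, new, p1 + p2, apriori - new)                  # 8128 1792 6336 6336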

 

4.3 Summary of the new algorithm

 

The following summarizes the phases of the new algorithm, which adds the new pruning techniques to the base algorithm.

 

Phase 1. Choose one of the partitioning methods (equal-depth, uneven-depth, or discontinuous partitioning) to determine the intervals.

Phase 2. From the large 1-itemsets, apply the new pruning techniques (technique one and technique two) to generate the candidate 2-itemsets.

Phase 3. Apply the remaining steps of the Apriori algorithm.

 

4.4 Performance analysis

 

Figure 2 shows how the two pruning techniques improve the performance.

Suppose we have 8 bands and each band has the same number of intervals. From the figure, the greater the number of intervals in each band, the smaller the number of candidate 2-itemsets (this example considers candidate 2-itemsets), and the greater the performance improvement.

 

From the graph we can see that the number of comparisons required under the Apriori algorithm increases markedly as the number of intervals increases. A reduced number of comparisons is required when using pruning technique number 1. However, using pruning technique number 2, the number of comparisons is greatly reduced for larger numbers of intervals.

 

From this analysis, we conclude that, for larger numbers of intervals, these pruning techniques, especially technique number 2, will substantially reduce the execution time of the Apriori algorithm. Since the execution time of the Apriori algorithm grows exponentially in the number of intervals, this is a very important issue.

 

Figure 2. Comparison of Candidate 2-itemsets
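The curves of Figure 2 are not reproduced here; as a rough substitute, the short sketch below tabulates the candidate 2-itemset counts given by the formulas of this section for 8 bands with r intervals each, assuming a single output band for technique two:

# Candidate 2-itemset counts for 8 bands with r intervals each (one output band
# assumed for technique two), using the formulas from Section 4.
for r in (2, 4, 8, 16, 32):
    n_items = 8 * r
    apriori = n_items * (n_items - 1) // 2        # C(|L1|, 2)
    tech1 = apriori - 8 * r * (r - 1) // 2        # minus same-band pairs
    tech2 = r * (7 * r)                           # output-band item paired with the rest
    print(r, apriori, tech1, tech2)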

 

4.5 An Example for Applying New Algorithm

 

Consider the following imagery data. Assume this table contains four attributes (bands) and has five tuples (pixels).

 

Table 3. An example database for new algorithm

 

Pixel    Band1    Band2    Band3    Band4
1        40       140      200      240
2        50       130      210      250
3        45       135      210      190
4        100      180      50       100
5        110      170      40       120

 

From the domain knowledge, we know that band1, band2 and band3 refer to reflectance data and band4 refers to yield data. The association rules the user would like to mine are of the form: band1 ∧ band2 ∧ band3 ⇒ band4.

 

Now we know exactly that band1, band2 and band3 are input data and band4 is output data. The following are the three phases of applying the new algorithm.

 

Phase 1. Assume the user selects equal-depth partitioning, with diameter d = 2 for band1 and band4 and d = 3 for band2 and band3. The number of intervals equals 2^d. (See the notation for the corresponding band intervals and diameters in Table 4 and Table 5.)

 

Table 4. Diameter two for band1 and band4

 

 

         [0, 63]    [64, 127]    [128, 191]    [192, 255]
Band1    B11        B12          B13           B14
Band4    B41        B42          B43           B44

 

Table 5. Diameter three for band2 and band3

 

 

         [0, 31]    [32, 63]    [64, 95]    [96, 127]    [128, 159]    [160, 191]    [192, 223]    [224, 255]
Band2    B21        B22         B23         B24          B25           B26           B27           B28
Band3    B31        B32         B33         B34          B35           B36           B37           B38

 

After selecting the partitioning method, we map each value in Table 3 into intervals. Figure 3 shows the result.

 

Pixel    b11  b12  b13  b14  b21  b25  b26  b28  b31  b32  b37  b38  b41  b42  b43  b44
1        1    0    0    0    0    1    0    0    0    0    1    0    0    0    0    1
2        1    0    0    0    0    1    0    0    0    0    1    0    0    0    0    1
3        1    0    0    0    0    1    0    0    0    0    1    0    0    0    1    0
4        0    1    0    0    0    0    1    0    0    1    0    0    0    1    0    0
5        0    1    0    0    0    0    1    0    0    1    0    0    0    1    0    0
 

Figure 3. An example of partitioning the values into intervals
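The rows of Figure 3 can be derived from the raw values of Table 3 with a small sketch (not part of the original method; the 0-255 value range and the b<band><interval> naming follow the example above):

# Build the boolean items of Figure 3 from the raw values in Table 3.
# idx maps an 8-bit value to one of 2**d equal intervals (0-based).
idx = lambda v, d: v * (2 ** d) // 256

table3 = {1: (40, 140, 200, 240), 2: (50, 130, 210, 250), 3: (45, 135, 210, 190),
          4: (100, 180, 50, 100), 5: (110, 170, 40, 120)}
depths = (2, 3, 3, 2)                     # d for band1..band4

bitmap = {pixel: {f"b{band}{idx(v, d) + 1}"
                  for band, (v, d) in enumerate(zip(values, depths), start=1)}
          for pixel, values in table3.items()}
print(sorted(bitmap[1]))   # ['b11', 'b25', 'b37', 'b44']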

 

Phase 2

We use the Apriori algorithm as the base mining algorithm and apply our new pruning techniques for candidate 2-itemset generation. Assume minsup = 40% and minconf = 60%.

 

  1. Candidate 1-itemsets: all the intervals we created are candidate 1-itemsets. They are: {b11, b12, b13, b14, b21, b22, b23, b24, b25, b26, b27, b28, b31, b32, b33, b34, b35, b36, b37, b38, b41, b42, b43, b44}.
  2. Large 1-itemsets: Since minsup = 40%, an itemset must be supported by at least two transactions. We scan the database, count the support for each candidate 1-itemset, and prune those whose support is less than two, obtaining the large 1-itemsets {b11(3), b12(2), b25(3), b26(2), b32(2), b37(3), b42(2), b44(2)} (the numbers in parentheses are supports).
  3. Candidate 2-itemsets: applying the new algorithm, we generate | C2 |new = 2 x (2 + 2 + 2) = 12 candidate 2-itemsets.

    They are: {{b42, b11}, {b42, b12}, {b42, b25}, {b42, b26}, {b42, b32}, {b42, b37}, {b44, b11}, {b44, b12}, {b44, b25}, {b44, b26}, {b44, b32}, {b44, b37}}.

     

    According to the formulas we derived, applying pruning technique one gives

    | C2 |prune1 = 1 + 1 + 1 + 1 = 4,

    and applying pruning technique two gives

    | C2 |prune2 = 2 x (2 + 2) + 2 x 2 = 12,

    so the total number of pruned candidate 2-itemsets is

    | C2 |prune1 + | C2 |prune2 = 4 + 12 = 16.

    Applying the Apriori algorithm instead, the number of candidate 2-itemsets would be

    | C2 |apriori = (8 x 7) / 2 = 28.

     

    The percentage of pruning is 16/28 ≈ 57%, so the execution efficiency of the mining process is improved.

     

    Phase 3

    The remaining steps are the same as in the Apriori algorithm.

     

  4. Large 2-itemsets: We scan the database and count the support for each candidate 2-itemset, obtaining the large 2-itemsets { {b42, b12}(2), {b42, b26}(2), {b42, b32}(2), {b44, b11}(2), {b44, b25}(2), {b44, b37}(2) } (the numbers in parentheses are supports).
  5. Candidate 3-itemsets: Applying the Apriori algorithm, we have: { {b42, b12, b26}, {b42, b12, b32}, {b42, b26, b32}, {b44, b11, b25}, {b44, b11, b37}, {b44, b25, b37} }.
  6. Large 3-itemsets: We scan the database and count the support for each candidate 3-itemset, obtaining the large 3-itemsets { {b42, b12, b26}(2), {b42, b12, b32}(2), {b42, b26, b32}(2), {b44, b11, b25}(2), {b44, b11, b37}(2), {b44, b25, b37}(2) } (the numbers in parentheses are supports).
  7. Candidate 4-itemsets: Applying the Apriori algorithm, we have { {b42, b12, b26, b32}, {b44, b11, b25, b37} }.
  8. Large 4-itemsets: We scan the database and count the support for each candidate 4-itemset, obtaining the large 4-itemsets { {b42, b12, b26, b32}(2), {b44, b11, b25, b37}(2) }.
  9. The candidate 5-itemsets are empty, so the large itemset generation algorithm terminates.
  10. Derive the association rules.

 

Applying domain knowledge, we are interested in those rules in which band1, band2, and band3 appear in the antecedent and band4 appears in the consequent. We derive the following association rules:

b12 ∧ b26 ∧ b32 ⇒ b42, with support = 40% and confidence = 100%.

b11 ∧ b25 ∧ b37 ⇒ b44, with support = 40% and confidence = 66.7%.
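As a quick check (a sketch using the bitmap rows of Figure 3), the support and confidence of these two rules can be recomputed directly:

# Check the two rules' support and confidence against the bitmap rows of Figure 3.
rows = {1: {"b11", "b25", "b37", "b44"}, 2: {"b11", "b25", "b37", "b44"},
        3: {"b11", "b25", "b37", "b43"}, 4: {"b12", "b26", "b32", "b42"},
        5: {"b12", "b26", "b32", "b42"}}

def support(itemset):
    # Number of rows that contain every item of itemset.
    return sum(itemset <= row for row in rows.values())

for antecedent, consequent in ([{"b12", "b26", "b32"}, {"b42"}],
                               [{"b11", "b25", "b37"}, {"b44"}]):
    both = support(antecedent | consequent)
    print(both / len(rows), both / support(antecedent))   # support, confidence
# 0.4 1.0      for b12 ^ b26 ^ b32 => b42
# 0.4 0.666... for b11 ^ b25 ^ b37 => b44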

 

In precision agriculture, the user's domain knowledge helps us eliminate some unnecessary steps in deriving rules. For example, we need not compute the confidence for rules such as b11 ⇒ b44 ∧ b25 ∧ b37, b25 ⇒ b44 ∧ b11 ∧ b37, b37 ⇒ b44 ∧ b11 ∧ b25, b11 ∧ b25 ⇒ b44 ∧ b37, b11 ∧ b37 ⇒ b44 ∧ b25, and b44 ⇒ b11 ∧ b25 ∧ b37. It is obvious that for one large 4-itemset, we only need to compute one rule out of eight. The percentage of improvement is (1 - 1/8) = 87.5%.

 

  5. CONCLUSION

    In this paper, we defined a new data mining problem --- mining association rules from imagery data and its application in precision agriculture.

     

    Since the efficiency of a mining algorithm is a very important issue in data mining, we proposed two simple and effective pruning techniques for candidate 2-itemset generation. Our performance analysis showed that, by exploiting the nature of the problem and the characteristics of imagery data, we can prune a significant number of unnecessary candidate itemsets during the very early phase of the mining process. We presented an algorithm that applies the new pruning techniques to the base algorithm.

     

  6. REFERENCES

 

[AIS93] R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules Between Sets of Items in Large Databases. Proc. ACM SIGMOD International Conference on Management of Data, May 1993.

[ASR94] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules. Proc. International Conference on Very Large Data Bases, Santiago, Chile, September 1994.

[CHY96] Ming-Syan Chen, Jiawei Han, and Philip S. Yu, Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.

[CNF96] D.W. Cheung, V.T. Ng, A.W. Fu and Y. Fu, Efficient Mining of Association Rules in Distributed Databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.

[DSE] Stuart E. Dreyfus, The Art and Theory of Dynamic Programming. Academic Press.

[FUM96] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.

[HKK96] E.H. Han, G. Karypis and V. Kumar, Clustering Based on Association Rule Hypergraphs. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, May 1997.

[PCY97] J.S. Park, M.S. Chen and P.S. Yu, Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, September/October 1997.

[SRI96] R. Srikant and R. Agrawal, Mining Quantitative Association Rules in Large Relational Tables. Proc. ACM SIGMOD International Conference on Management of Data, June 1996.

[WAY91] J. Way and E.A. Smith, The Evolution of Synthetic Aperture Radar Systems and Their Progression to the EOS SAR. IEEE Transactions on Geoscience and Remote Sensing, 1991.

[YSJ96] Show-Jane Yen and Arbee L.P. Chen, An Efficient Approach to Discovering Knowledge from Large Databases. Proc. Fourth International Conference on Parallel and Distributed Information Systems (PDIS), 1996.

Copyright 2000 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC 2000 March 19-21 Como, Italy
(c) 2000 ACM 1-58113-239-5/00/003...$5.00