THE APPLICATION OF ASSOCIATION RULE MINING
TO REMOTELY SENSED DATA
Jianning Dong, William Perrizo, Qin Ding and Jingkai Zhou
North Dakota State University
Fargo, ND 58105
KEYWORDS
Association rule mining, data mining, remotely sensed data
ABSTRACT
The explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Data mining is such a technique. In this paper, we consider the mining of association rules from remotely sensed data and its application in precision agriculture. Based on the characteristics of the remotely sensed data and the problem itself, we present a bit-oriented formal model and discuss the issues of partitioning quantitative attributes into equal, unequal and discontinuous partitions. We propose two new pruning techniques and compare their performance with a base algorithm. An improvement in performance is shown when using these pruning techniques.
1. INTRODUCTION
In the last decade, we have seen an explosive growth in our capabilities to both generate and collect data. Advances in scientific data collection (e.g. from remote sensors and satellites), the widespread introduction of bar codes for almost all commercial products, and the computerization of many businesses (e.g. credit card purchases) and government transactions (e.g. tax returns) have generated a flood of data. Advances in data storage technology, such as faster, higher capacity, and cheaper devices (e.g. magnetic disks, CD-ROMs), better database management systems, and data warehouse technology, have allowed us to transform this data deluge into "mountains" of stored data. But what we really want is information.
Data mining, which is also referred to as knowledge discovery in databases, gives users the tools to extract previously unknown and potentially useful information (such as knowledge rules, constraints, and irregularities) from data in databases [CHY96]. Data mining has been recognized as one of the promising areas of research encompassing databases, statistics and machine learning.
The primary goal of data mining in practice is to assist strategic decision making. It does that by building a model of the real world based on data collected from a variety of sources. Models can be used to predict as well as describe [Fl98].
Association rule mining is one of the important data mining techniques. Association rules are primarily useful for describing the behavior that is captured in the database. An association rule can be expressed as X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that transactions of the database which contain X tend to contain Y. Each rule has two measures of value: support, which indicates the frequency of the occurring pattern, and confidence, which denotes the strength of the implication in the rule. The support of the rule X ⇒ Y is support(X ∪ Y). The confidence c of the rule X ⇒ Y in the transaction set D means that c% of the transactions in D that contain X also contain Y; it can be written as the ratio support(X ∪ Y) / support(X).
Given a user-specified minimum support and minimum confidence, the problem of mining association rules is to find all association rules whose support and confidence are larger than the respective thresholds. Thus, it can be decomposed into two subproblems:
* Finding the large itemsets, i.e. those which have support above the predetermined minimum support.
* Deriving, from each large itemset, all rules which have confidence above the predetermined minimum confidence.
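As a small illustration of these two measures, the following sketch computes support and confidence for one rule over a toy transaction set; the transactions and the rule are invented for illustration and are not from the paper's data:

```python
# Support and confidence of the rule X => Y over a toy set of transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]

def support(itemset):
    """Fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

x, y = {"a"}, {"b"}
supp = support(x | y)        # support(X u Y) = 2/4 = 0.5
conf = supp / support(x)     # 0.5 / 0.75 = 2/3
```
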
Efficiency is an important issue for a mining algorithm. Many interesting and efficient data mining algorithms have been proposed ([AIS93], [ASR94], and [CNF96]). Performance improvement is an essential concern for a particular algorithm because mining association rules may require multiple scans of the database. In this paper, we propose two new pruning techniques for candidate itemset generation when mining association rules from remotely sensed data in precision agriculture.
2. THE APRIORI ALGORITHM
Various algorithms have been proposed to discover the large itemsets [AIS93], [ASR94]. The Apriori algorithm is one of the most popular algorithms for mining association rules in a centralized database. The main idea of Apriori is outlined in the following [FUM96].
Table 1. Notation for the mining algorithm
k-itemset : An itemset having k items.
Lk        : Set of large k-itemsets.
Ck        : Set of candidate k-itemsets.
The Apriori algorithm is shown in Figure 1.
L1 = { large 1-itemsets };
for ( k = 2; L(k-1) ≠ ∅; k++ ) do begin
    Ck = apriori-gen( L(k-1) );    // New candidates
    forall transactions t ∈ D do begin
        Ct = subset( Ck, t );      // Candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = { c ∈ Ck | c.count ≥ minsup }
end
Answer = ∪k Lk;
Figure 1. Apriori algorithm
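As a rough illustration (not the paper's implementation), the pseudocode of Figure 1 can be rendered in Python as follows; apriori_gen performs the standard join-and-prune candidate generation, and the tiny database D is invented:

```python
from itertools import combinations

def apriori_gen(prev_large):
    """Join large (k-1)-itemsets; prune candidates with a non-large subset."""
    candidates = set()
    for a in prev_large:
        for b in prev_large:
            union = a | b
            if len(union) == len(a) + 1:          # join step
                if all(frozenset(s) in prev_large
                       for s in combinations(union, len(union) - 1)):
                    candidates.add(union)          # prune step passed
    return candidates

def apriori(transactions, minsup):
    """Return all itemsets whose support count is at least minsup."""
    items = {frozenset([i]) for t in transactions for i in t}
    large = {c for c in items
             if sum(c <= t for t in transactions) >= minsup}
    answer = set(large)
    while large:
        candidates = apriori_gen(large)
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        large = {c for c, n in counts.items() if n >= minsup}
        answer |= large
    return answer

# Illustrative usage on a tiny invented database.
D = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc")]
result = apriori(D, minsup=2)
```
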
3. MINING ASSOCIATION RULES FROM IMAGERY DATA
3.1 Problem Definition
An example is described as follows. We have an image scene and a yield map, both digital images taken of the same field. The image scene contains red, blue and green bands, representing the reflectance levels of each pixel of the scene. The yield map also contains red, blue and green bands, and there is a formula to convert these three bands into a gray scale (one band). We thus have five attributes in our database: pixel-number, red (band1), blue (band2), green (band3) and yield (band4). The problem of mining association rules from this imagery data is to discover the associations between band1, band2, band3 and band4. This will help farmers understand what combination of spectral bands leads to a high crop yield.
3.2 Partitioning Quantitative Attributes
The Apriori algorithm can only handle categorical basket data. When we have quantitative attributes in the database, we cannot directly apply the techniques for mining association rules in basket databases. In a basket database, we say a specific transaction includes a set of items, meaning that those items appear in the transaction, so a boolean value TRUE or FALSE can express the relationship between an item and a transaction. But for quantitative attributes, say one box of beer versus ten boxes of beer, this representation cannot tell the difference between the two values. To deal with this type of data, several approaches have been proposed. The most common method is to simply partition the attribute values into intervals and map the intervals to consecutive integers, such that the order of the intervals is preserved [SRI96]. Without loss of generality, we separate the values into equal, uneven, or discontinuous partitions. We then have a set of <attribute, interval> pairs. The "item" <attribute, interval1> would be "1" if the attribute had a value within interval 1 in the original tuple, and "0" otherwise.
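The interval mapping described above can be sketched as follows, using equal-width intervals over the 0-255 pixel range as in the later example; the helper names are ours, not the paper's:

```python
def make_intervals(n_intervals, lo=0, hi=255):
    """Split [lo, hi] into n_intervals equal-width, order-preserving intervals."""
    width = (hi - lo + 1) // n_intervals
    return [(lo + i * width, lo + (i + 1) * width - 1) for i in range(n_intervals)]

def to_items(value, intervals):
    """Boolean encoding: 1 for the interval containing value, 0 elsewhere."""
    return [1 if a <= value <= b else 0 for (a, b) in intervals]

intervals = make_intervals(4)     # [(0, 63), (64, 127), (128, 191), (192, 255)]
bits = to_items(100, intervals)   # value 100 falls in the second interval
```
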
3.3 Finding Large Itemsets from Imagery Data
In data mining, it is essential to collect a sufficient amount of data so that we can derive meaningful conclusions. As a result, the amount of data tends to be huge; there are 40 million pixels in one image scene. Efficiency is therefore an important issue for a mining algorithm.
Finding large itemsets can be done iteratively, in the sense that the large itemsets discovered in one iteration are used as the basis to generate the candidate itemsets for the next iteration. For example, in the Apriori algorithm, the large k-itemsets are generated at the kth iteration; in the next iteration, a heuristic is used to extend the large k-itemsets into candidate (k+1)-itemsets. The heuristic used to construct candidate itemsets from large itemsets is crucial to performance. Clearly, to be efficient, the heuristic should only generate candidates with a high likelihood of being large itemsets, because for each candidate we need to count its appearances in the database: the larger the set of candidate itemsets, the more processing cost is required to discover the large itemsets. A performance study [PCY97] shows that the candidate itemsets generated during an early iteration are generally orders of magnitude larger than the set of large itemsets they really contain. Therefore, the initial candidate set generation, especially for the large 2-itemsets, is the key issue in improving the performance of data mining.
4. NEW PRUNING TECHNIQUES FOR CANDIDATE ITEMSET GENERATION
Two effective pruning techniques for candidate itemset generation are proposed to progressively improve the efficiency of the base algorithm.
4.1 Technique one
Lemma 1. A pixel value cannot belong to two different intervals from the same band.
Lemma 2. Any combination of k intervals (k > 1) from the same band has support zero.
From Lemma 2, we know that, for any candidate k-itemset (1 < k < n) generated from the same band, the support must be zero; it is impossible for such candidate k-itemsets to become large k-itemsets later. The first pruning technique proposed in this paper is therefore that we need not generate candidate k-itemsets among those "items" which are in the same band with different intervals.
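A minimal sketch of this pruning rule, assuming items are represented as (band, interval) pairs (a representation we choose for illustration):

```python
from itertools import combinations

def gen_c2_pruned(large_items):
    """Candidate 2-itemsets, skipping same-band pairs (Lemma 2).

    Each item is a (band, interval) pair; two intervals of the same band
    can never co-occur in one pixel, so those pairs are never generated.
    """
    return [{a, b} for a, b in combinations(large_items, 2) if a[0] != b[0]]

# Illustrative: two bands with two large intervals each.
L1 = [("band1", 0), ("band1", 1), ("band2", 0), ("band2", 1)]
C2 = gen_c2_pruned(L1)   # 4 cross-band pairs; plain Apriori would generate 6
```
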
We compare the Apriori algorithm and our new algorithm in generating candidate 2-itemsets. The notation is shown in Table 2.
Table 2. Notation for comparison
Ck   : Candidate k-itemsets
Lk   : Large k-itemsets
|Ck| : Number of itemsets in the candidate k-itemsets
|Lk| : Number of itemsets in the large k-itemsets
*    : An operation for concatenation
Rj   : Number of intervals in band j
1. Apriori algorithm: Based on the fact that any subset of a large itemset must also be a large itemset, Apriori uses L1 * L1 to generate the set of candidate 2-itemsets C2:
|C2|apriori = |L1| ( |L1| - 1 ) / 2.
2. New algorithm: Based on Lemma 2, we do not choose intervals from the same band when generating candidate 2-itemsets. Assume |L1| = R1 + R2 + ... + Rn. Then
|C2|new = R1 (R2 + R3 + ... + Rn) + R2 (R3 + R4 + ... + Rn) + ... + R(n-2) (R(n-1) + Rn) + R(n-1) Rn
        = Σ_{j=1}^{n-1} Rj Σ_{k=j+1}^{n} Rk.
The number of candidate 2-itemsets generated by the new algorithm is much less than that generated by Apriori:
|C2|prune1 = |C2|apriori - |C2|new = Σ_{j=1}^{n} Rj ( Rj - 1 ) / 2.
Note that when n and the Rj are large, |C2|prune1 becomes an extremely large number. For example, if the imagery data has eight bands and each band has sixteen intervals, the number of pruned candidate 2-itemsets is 8 × 16 × (16 - 1) / 2 = 960. This sharply reduces the processing cost. The new algorithm employs effective pruning techniques to progressively reduce the number of candidate 2-itemsets, thus relieving the performance bottleneck of the whole process.
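The counts above are easy to verify numerically; a small sketch (the function names are ours):

```python
def c2_apriori(R):
    """Candidate 2-itemsets from plain Apriori, with |L1| = sum(R) items."""
    s = sum(R)
    return s * (s - 1) // 2

def c2_prune1(R):
    """Same-band pairs pruned by technique one: sum of Rj*(Rj-1)/2."""
    return sum(r * (r - 1) // 2 for r in R)

R = [16] * 8                 # eight bands, sixteen intervals each
pruned = c2_prune1(R)        # 8 * (16 * 15 / 2) = 960, matching the text
kept = c2_apriori(R) - pruned
```
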
4.2 Technique two
During the process of data mining, allowing user interaction with the mining engine and using the user's prior knowledge help to speed up the mining algorithms by restricting the search space.
In precision agriculture applications, farmers want to know which "combinations of intervals of reflectance from certain bands imply certain agricultural phenomena, such as high yield". That is, farmers are only interested in those rules where reflectance bands appear in the antecedent and other bands appear in the consequent. The mining task is to find the association rules X ⇒ Y, where X = { collection of reflectance intervals }, Y = { output intervals }, and X ∩ Y = ∅.
This gives us a heuristic: if a large itemset we mine does not contain Y, the rules derived from it later will be uninteresting to the user. In this paper, we propose another pruning technique for candidate itemset generation based on this heuristic.
Assume there are N bands in the imagery data, where 1, .., M are the indexes of the reflectance bands and M+1, .., N are the indexes of the output bands. The association rules we mine should have the form:
band1 ∧ ... ∧ bandM ⇒ band(M+1) ∧ ... ∧ bandN.
1. Consider one band in the output. For candidate 2-itemsets, since we know that bandN should be in every candidate itemset, the new algorithm only generates those candidate itemsets in which one item is a bandN interval and the other is a bandk (k = 1, .., N-1) interval. The candidate 2-itemsets are of the form:
{<bandN, range>, <bandk, range>}, (k = 1, .., N-1).
Then the number of candidate 2-itemsets is
|C2|new = RN Σ_{j=1}^{N-1} Rj.
The reason is that we are not interested in those itemsets which do not contain bandN. We prune those candidate itemsets in which no interval is chosen from bandN. The number of pruned candidate 2-itemsets is
|C2|prune2 = Σ_{j=1}^{N-2} Rj Σ_{k=j+1}^{N-1} Rk.
Compared to the Apriori algorithm, applying the pruning technique described in Technique One gives
|C2|prune1 = Σ_{j=1}^{N} Rj ( Rj - 1 ) / 2.
The total number of pruned candidate 2-itemsets is
|C2|prune = |C2|prune1 + |C2|prune2
          = ( Σ_{j=1}^{N-1} Rj ) ( Σ_{j=1}^{N-1} Rj - 1 ) / 2 + RN ( RN - 1 ) / 2.
And the remaining steps are the same as in the Apriori algorithm.
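For this single-output-band case, the counts above partition the plain Apriori candidate set, which can be checked numerically; the band sizes below are invented for illustration:

```python
def counts_one_output(R):
    """Candidate 2-itemset counts when the last band of R is the only output.

    R[j] is the number of intervals in band j+1.
    Returns (apriori, new, prune1, prune2).
    """
    total = sum(R)
    apriori = total * (total - 1) // 2            # plain Apriori
    new = R[-1] * sum(R[:-1])                     # bandN paired with the rest
    prune1 = sum(r * (r - 1) // 2 for r in R)     # same-band pairs
    prune2 = sum(R[j] * sum(R[j + 1:-1]) for j in range(len(R) - 2))
    assert apriori == new + prune1 + prune2       # the counts partition C2
    return apriori, new, prune1, prune2

counts = counts_one_output([4, 8, 8, 4])          # invented interval counts
```
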
2. Consider two bands in the output. The association rules have the form:
band1 ∧ ... ∧ band(N-2) ⇒ band(N-1) ∧ bandN.
For candidate 2-itemsets, the new algorithm only generates those candidate itemsets in which at least one item is a band(N-1) or bandN interval. The candidate 2-itemsets are of the form:
{<bandk, range>, <band(N-1), range>}, (k = 1, .., N-2), or
{<bandk, range>, <bandN, range>}, (k = 1, .., N-1).
The number of candidate 2-itemsets is
|C2|new = R(N-1) Σ_{j=1}^{N-2} Rj + RN Σ_{j=1}^{N-1} Rj.
We prune those candidate itemsets in which no interval is chosen from either band(N-1) or bandN. The number of pruned candidate 2-itemsets is
|C2|prune2 = Σ_{j=1}^{N-3} Rj Σ_{k=j+1}^{N-2} Rk.
Compared to the Apriori algorithm, applying the pruning technique described in Technique One gives
|C2|prune1 = Σ_{j=1}^{N} Rj ( Rj - 1 ) / 2.
The total number of pruned candidate 2-itemsets is
|C2|prune = |C2|prune1 + |C2|prune2.
And the remaining steps are the same as in the Apriori algorithm.
3. In the general case, the association rules have the form:
band1 ∧ ... ∧ bandM ⇒ band(M+1) ∧ ... ∧ bandN.
That is, there are (N - M) bands in the output. We only generate those candidate itemsets in which at least one item is a bandj interval (j = M+1, .., N). The candidate 2-itemsets are of the form:
{<bandk, range>, <bandj, range>}, where k = 1, .., M and j = M+1, .., N, or
{<bandm, range>, <bandn, range>}, where m = M+1, .., N-1 and n = m+1, .., N.
The number of candidate 2-itemsets is
|C2|new = ( Σ_{j=M+1}^{N} Rj ) ( Σ_{k=1}^{M} Rk ) + Σ_{j=M+1}^{N-1} Rj Σ_{k=j+1}^{N} Rk.
We prune those candidate itemsets in which no interval is chosen from any bandj (j = M+1, .., N). The number of pruned candidate 2-itemsets is
|C2|prune2 = Σ_{j=1}^{M-1} Rj Σ_{k=j+1}^{M} Rk.
Compared to the Apriori algorithm, applying the pruning technique described in Technique One gives
|C2|prune1 = Σ_{j=1}^{N} Rj ( Rj - 1 ) / 2.
The total number of pruned candidate 2-itemsets is
|C2|prune = |C2|prune1 + |C2|prune2
          = ( Σ_{k=1}^{M} Rk ) ( Σ_{k=1}^{M} Rk - 1 ) / 2 + Σ_{j=M+1}^{N} Rj ( Rj - 1 ) / 2.
And the remaining steps are the same as in the Apriori algorithm.
From a mathematical point of view, when the Rj are large and N is large, i.e. when both the number of intervals and the number of bands are large, our new algorithm prunes a great many unnecessary candidate 2-itemsets, greatly improving the performance of the whole mining process.
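The general-case counts can likewise be checked numerically; a sketch under our naming, with invented band sizes:

```python
def counts_general(R, M):
    """Candidate 2-itemset counts with bands 1..M input, M+1..N output.

    R holds per-band interval counts. Returns (new, prune).
    """
    A = sum(R[:M])                                 # input intervals
    B = sum(R[M:])                                 # output intervals
    apriori = (A + B) * (A + B - 1) // 2
    # Kept: input-output pairs plus cross-band pairs among output bands.
    cross_out = sum(R[j] * sum(R[j + 1:]) for j in range(M, len(R)))
    new = A * B + cross_out
    # Pruned: closed form C(A, 2) + sum over output bands of C(Rj, 2).
    prune = A * (A - 1) // 2 + sum(r * (r - 1) // 2 for r in R[M:])
    assert apriori == new + prune                  # sanity check
    return new, prune

new, prune = counts_general([4, 8, 8, 4], M=3)     # band4 as the only output
```
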
4.3 Summary of the new algorithm
The following summarizes the phases of the new algorithm, which adds the new pruning techniques to the base algorithm.
Phase 1. Choose one of the partitioning methods (equal depth, uneven depth, or discontinuous partition) to determine the intervals.
Phase 2. From the large 1-itemsets, apply the new pruning techniques (Technique One and Technique Two) to generate the candidate 2-itemsets.
Phase 3. Apply the remaining steps of the Apriori algorithm.
4.4 Performance analysis
The following figure shows how the two pruning techniques improve the performance.
Suppose we have eight bands and each band has the same number of intervals. From the figure, the greater the number of intervals in each band, the smaller the number of candidate 2-itemsets generated, and the greater the performance improvement.
From the graph we can see that the number of comparisons required under the Apriori algorithm increases markedly as the number of intervals increases. A reduced number of comparisons is required when using pruning technique number 1. However, using pruning technique number 2, the number of comparisons is greatly reduced for larger numbers of intervals.
From this analysis, we conclude that, for larger numbers of intervals, these pruning techniques, especially technique number 2, will substantially reduce the execution time of the Apriori algorithm. Since the execution time of the Apriori algorithm grows exponentially in the number of intervals, this is a very important issue.
Figure 2. Comparison of Candidate 2itemsets
4.5 An Example of Applying the New Algorithm
Consider the following imagery data. Assume this table contains four attributes (bands) and has five tuples (pixels).
Table 3. An example database for new algorithm
Pixel   Band1   Band2   Band3   Band4
1        40     140     200     240
2        50     130     210     250
3        45     135     210     190
4       100     180      50     100
5       110     170      40     120
From domain knowledge, we know that band1, band2 and band3 refer to reflectance data and band4 refers to yield data. The association rules the user would like to mine are of the form: band1 ∧ band2 ∧ band3 ⇒ band4.
We now know that band1, band2 and band3 are input data and band4 is output data. The following are the three phases of applying the new algorithm.
Phase 1. Assume the user selects equal-depth partitioning, with diameter d = 2 for band1 and band4 and diameter d = 3 for band2 and band3; the number of intervals equals 2^d. (See the notation for the corresponding band intervals and diameters in Tables 4 and 5.)
Table 4. Diameter two for band1 and band4
Interval   [0, 63]   [64, 127]   [128, 191]   [192, 255]
Band1      B11       B12         B13          B14
Band4      B41       B42         B43          B44
Table 5. Diameter three for band2 and band3
Interval   [0, 31]   [32, 63]   [64, 95]   [96, 127]   [128, 159]   [160, 191]   [192, 223]   [224, 255]
Band2      B21       B22        B23        B24         B25          B26          B27          B28
Band3      B31       B32        B33        B34         B35          B36          B37          B38
After selecting the partitioning method, we map each value in Table 3 into intervals. Figure 3 shows the result.
Pixel  b11 b12 b13 b14  b21 b25 b26 b28  b31 b32 b37 b38  b41 b42 b43 b44
1       1   0   0   0    0   1   0   0    0   0   1   0    0   0   0   1
2       1   0   0   0    0   1   0   0    0   0   1   0    0   0   0   1
3       1   0   0   0    0   1   0   0    0   0   1   0    0   0   1   0
4       0   1   0   0    0   0   1   0    0   1   0   0    0   1   0   0
5       0   1   0   0    0   0   1   0    0   1   0   0    0   1   0   0
Figure 3. An example of partitioning the values into intervals
Phase 2
We use the Apriori algorithm as the base mining algorithm, and apply our new pruning techniques for candidate 2-itemset generation. Assume minsup = 40% and minconf = 60%. The large 1-itemsets are {b11, b12, b25, b26, b32, b37, b42, b44}. Applying the new pruning techniques, twelve candidate 2-itemsets are generated: {{b42, b11}, {b42, b12}, {b42, b25}, {b42, b26}, {b42, b32}, {b42, b37}, {b44, b11}, {b44, b12}, {b44, b25}, {b44, b26}, {b44, b32}, {b44, b37}}.
According to the formulas we derived, applying pruning technique one gives
|C2|prune1 = 1 + 1 + 1 + 1 = 4,
and applying pruning technique two gives
|C2|prune2 = 2 × (2 + 2) + 2 × 2 = 12,
so the total number of pruned candidate 2-itemsets is
|C2|prune1 + |C2|prune2 = 4 + 12 = 16.
Applying the Apriori algorithm, the number of candidate 2-itemsets is
|C2|apriori = (8 × 7) / 2 = 28.
The percentage of pruning is 16 / 28 ≈ 57%, which improves the execution efficiency of the mining process.
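These example numbers can be reproduced directly from the per-band counts of large 1-itemsets:

```python
# Per-band counts of large 1-itemsets in the example (two per band):
# band1 {b11, b12}, band2 {b25, b26}, band3 {b32, b37}, band4 {b42, b44}.
R = [2, 2, 2, 2]

apriori = sum(R) * (sum(R) - 1) // 2                 # 8 * 7 / 2 = 28
prune1 = sum(r * (r - 1) // 2 for r in R)            # same-band pairs: 4
# Technique two: cross-band pairs among the input bands 1..3.
prune2 = sum(R[j] * sum(R[j + 1:3]) for j in range(2))   # 2*(2+2) + 2*2 = 12
remaining = apriori - prune1 - prune2                # 12 candidate 2-itemsets
ratio = (prune1 + prune2) / apriori                  # 16/28, about 57%
```
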
Phase 3
The remaining steps are the same as in the Apriori algorithm.
Applying domain knowledge, we are interested in those rules in which band1, band2, and band3 appear in the antecedent and band4 in the consequent. We derive the following association rules:
b12 ∧ b26 ∧ b32 ⇒ b42, with support = 40% and confidence = 100%.
b11 ∧ b25 ∧ b37 ⇒ b44, with support = 40% and confidence = 66.7%.
In precision agriculture, the user's domain knowledge helps us eliminate unnecessary steps in deriving rules. For example, we need not compute the confidence for rules such as b11 ⇒ b44 ∧ b25 ∧ b37, b25 ⇒ b44 ∧ b11 ∧ b37, b37 ⇒ b44 ∧ b11 ∧ b25, b11 ∧ b25 ⇒ b44 ∧ b37, b11 ∧ b37 ⇒ b44 ∧ b25, and b44 ⇒ b11 ∧ b25 ∧ b37. It is obvious that for one large 4-itemset, we only need to compute one rule out of eight. The percentage of improvement is (1 - 1/8) = 87.5%.
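The support and confidence of the derived rules can be checked directly against the interval encoding of Figure 3:

```python
# Interval labels per pixel, from the partitioning shown in Figure 3.
pixels = [
    {"b11", "b25", "b37", "b44"},   # pixel 1
    {"b11", "b25", "b37", "b44"},   # pixel 2
    {"b11", "b25", "b37", "b43"},   # pixel 3
    {"b12", "b26", "b32", "b42"},   # pixel 4
    {"b12", "b26", "b32", "b42"},   # pixel 5
]

def support(itemset):
    return sum(itemset <= p for p in pixels) / len(pixels)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

s1 = support({"b12", "b26", "b32", "b42"})          # 2/5 = 40%
c1 = confidence({"b12", "b26", "b32"}, {"b42"})     # 2/2 = 100%
s2 = support({"b11", "b25", "b37", "b44"})          # 2/5 = 40%
c2 = confidence({"b11", "b25", "b37"}, {"b44"})     # 2/3, about 66.7%
```
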
5. CONCLUSION
In this paper, we defined a new data mining problem, mining association rules from imagery data, and described its application in precision agriculture.
Since the efficiency of a mining algorithm is a very important issue in data mining, we proposed two simple and effective pruning techniques for candidate 2-itemset generation. Our performance analysis showed that by exploiting the nature of the problem and the characteristics of imagery data, we can prune a significant number of unnecessary candidate itemsets during the very early phase of the mining process. We presented an algorithm that applies the new pruning techniques to the base algorithm.
REFERENCES
[AIS93] R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules Between Sets of Items in Large Databases. Proc. ACM SIGMOD International Conference, May 1993.
[ASR94] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules. Proc. International Conference on Very Large Databases, Santiago, Chile, September 1994.
[CHY96] Ming-Syan Chen, Jiawei Han, and Philip S. Yu, Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[CNF96] D.W. Cheung, V.T. Ng, A.W. Fu and Y. Fu, Efficient Mining of Association Rules in Distributed Databases. IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
[DSE] Stuart E. Dreyfus, The Art and Theory of Dynamic Programming. Academic Press.
[FUM96] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[HKK96] E.-H. Han, G. Karypis and Vipin Kumar, Clustering Based On Association Rule Hypergraphs. 1997 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, May 1997.
[PCY97] J.S. Park, M.S. Chen and P.S. Yu, Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, September/October 1997.
[SRI96] R. Srikant and R. Agrawal, Mining Quantitative Association Rules in Large Relational Tables. Proc. ACM SIGMOD International Conference, June 1996.
[WAY91] J. Way and E.A. Smith, The Evolution of Synthetic Aperture Radar Systems and Their Progression to the EOS SAR. IEEE Transactions on Geoscience and Remote Sensing, 1991.
[YSJ96] Show-Jane Yen and Arbee L.P. Chen, An Efficient Approach to Discovering Knowledge from Large Databases. Fourth International Conference on Parallel and Distributed Information Systems (PDIS), 1996.