Data Mining
Association Rules: Advanced Concepts and Algorithms
Lecture Notes for Chapter 7, Introduction to Data Mining
by Tan, Steinbach, Kumar

Continuous and Categorical Attributes

Example of an association rule:
  {Number of Pages in [5,10), Browser = Mozilla} -> {Buy = No}

How can the association analysis formulation be applied to attributes that are not asymmetric binary variables?

Handling Categorical Attributes

- Transform categorical attributes into asymmetric binary variables
- Introduce a new "item" for each distinct attribute-value pair
  - Example: replace the Browser Type attribute with
    - Browser Type = Internet Explorer
    - Browser Type = Mozilla
    - Browser Type = Netscape

Handling Categorical Attributes

- Potential issues
  - What if an attribute has many possible values?
    - Example: the attribute Country has more than 200 possible values
    - Many of the attribute values may have very low support
    - Potential solution: aggregate the low-support attribute values
  - What if the distribution of attribute values is highly skewed?
    - Example: 95% of the visitors have Buy = No, so most items will be associated with the (Buy = No) item
    - Potential solution: drop the highly frequent items
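A minimal Python sketch of this binarization step (the records and attribute names here are hypothetical, chosen only to mirror the slide's example):

```python
# Each session is a dict of attribute -> value.
sessions = [
    {"Browser Type": "Mozilla", "Gender": "Male", "Buy": "No"},
    {"Browser Type": "Internet Explorer", "Gender": "Female", "Buy": "Yes"},
]

def binarize(records):
    """Turn each record into a transaction of asymmetric binary items,
    one item per distinct attribute-value pair."""
    return [{f"{attr}={value}" for attr, value in rec.items()} for rec in records]

transactions = binarize(sessions)
# e.g. {'Browser Type=Mozilla', 'Gender=Male', 'Buy=No'}
```

An itemset miner such as Apriori can then run on these transactions unchanged.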

Handling Continuous Attributes

- Different kinds of rules:
  - Age in [21,35) AND Salary in [70k,120k) -> Buy
  - Salary in [70k,120k) AND Buy -> Age: mu = 28, sigma = 4
- Different methods:
  - Discretization-based
  - Statistics-based
  - Non-discretization based
    - minApriori

Handling Continuous Attributes

- Use discretization
- Unsupervised:
  - Equal-width binning
  - Equal-depth binning
  - Clustering
- Supervised: place bin boundaries using the class labels, e.g.

  Class       v1   v2   v3   v4   v5   v6   v7   v8   v9
  Anomalous    0    0   20   10   20    0    0    0    0
  Normal     150  100    0    0    0  100  100  150  100
             [--bin1--] [----bin3----] [-------bin2--------]
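A short sketch of the two unsupervised binning schemes, assuming NumPy and a 1-D array of attribute values (the helper names are mine, not from the slides):

```python
import numpy as np

values = np.array([23.0, 25, 28, 31, 35, 42, 51, 70, 120])

def equal_width_edges(x, k):
    """Split the value range into k intervals of equal length."""
    return np.linspace(x.min(), x.max(), k + 1)

def equal_depth_edges(x, k):
    """Equal-frequency bins: each bin holds roughly the same number
    of observations, so the edges are quantiles."""
    return np.quantile(x, np.linspace(0, 1, k + 1))

# Assign each value to a bin (0..k-1) using the interior edges.
k = 3
bin_ids = np.digitize(values, equal_width_edges(values, k)[1:-1])
```

Clustering-based discretization would instead group the values with, e.g., k-means and use the cluster boundaries as interval edges.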

Discretization Issues

- The size of the discretized intervals affects support and confidence
  - If the intervals are too small, rules may not have enough support
  - If the intervals are too large, rules may not have enough confidence
- Potential solution: use all possible intervals
    {Refund = No, Income = $51,250} -> {Cheat = No}
    {Refund = No, 60K <= Income <= 80K} -> {Cheat = No}
    {Refund = No, 0K <= Income <= 1B} -> {Cheat = No}

Discretization Issues

- Execution time
  - If an interval contains n values, there are on average O(n^2) possible ranges
- Too many rules
    {Refund = No, Income = $51,250} -> {Cheat = No}
    {Refund = No, 51K <= Income <= 52K} -> {Cheat = No}
    {Refund = No, 50K <= Income <= 60K} -> {Cheat = No}

Approach by Srikant & Agrawal

- Preprocess the data
  - Discretize attributes using equi-depth partitioning
    - Use the partial completeness measure to determine the number of partitions
    - Merge adjacent intervals as long as their support is less than max-support
- Apply existing association rule mining algorithms
- Determine the interesting rules in the output

Approach by Srikant & Agrawal

- Discretization will lose information
  - Use the partial completeness measure to determine how much information is lost
- Let C be the frequent itemsets obtained by considering all ranges of attribute values, and P the frequent itemsets obtained by considering all ranges over the partitions
- P is K-complete w.r.t. C if P is a subset of C and, for every X in C, there exists an X' in P such that:
  1. X' is a generalization of X and support(X') <= K * support(X), where K >= 1
  2. for every Y contained in X there exists Y' contained in X' such that support(Y') <= K * support(Y)
- Given K (the partial completeness level), the number of intervals (N) can be determined

Interestingness Measure

- Given an itemset Z = {z1, z2, ..., zk} and its generalization Z' = {z1', z2', ..., zk'}:
  - P(Z): support of Z
  - E_Z'(Z): expected support of Z based on Z',
      E_Z'(Z) = P(z1)/P(z1') * P(z2)/P(z2') * ... * P(zk)/P(zk') * P(Z')
- Z is R-interesting w.r.t. Z' if P(Z) >= R * E_Z'(Z)
- Example: report the specialized rule {Refund = No, 51K <= Income <= 52K} -> {Cheat = No} only if its support is at least R times the support predicted from its generalization {Refund = No, 50K <= Income <= 60K} -> {Cheat = No}

Interestingness Measure

- For a rule S: X -> Y and its generalization S': X' -> Y':
  - P(Y|X): confidence of X -> Y
  - P(Y'|X'): confidence of X' -> Y'
  - E_S'(Y|X): expected confidence of X -> Y based on S',
      E_S'(Y|X) = P(y1)/P(y1') * P(y2)/P(y2') * ... * P(yk)/P(yk') * P(Y'|X')
- Rule S is R-interesting w.r.t. its ancestor rule S' if
  - support: P(S) >= R * E_S'(S), or
  - confidence: P(Y|X) >= R * E_S'(Y|X)
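The expected-support computation in a minimal Python sketch (the support numbers below are made up for illustration; they are not from the slides):

```python
def expected_support(item_sup, gen_item_sup, gen_sup):
    """E_Z'(Z) = P(z1)/P(z1') * ... * P(zk)/P(zk') * P(Z')."""
    e = gen_sup
    for p, p_gen in zip(item_sup, gen_item_sup):
        e *= p / p_gen
    return e

def r_interesting(sup_z, expected, R):
    return sup_z >= R * expected

# Hypothetical: Z = {51K<=Income<=52K, Cheat=No} with generalization
# Z' = {50K<=Income<=60K, Cheat=No}.
e = expected_support(item_sup=[0.05, 0.30], gen_item_sup=[0.20, 0.30], gen_sup=0.10)
print(e)                              # ~0.025
print(r_interesting(0.04, e, R=1.5))  # True: 0.04 >= 1.5 * 0.025
```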

Statistics-based Methods

- Example: {Browser = Mozilla, Buy = Yes} -> Age: mu = 23
- The rule consequent is a continuous variable, characterized by its statistics: mean, median, standard deviation, etc.
- Approach:
  - Withhold the target variable from the rest of the data
  - Apply existing frequent itemset generation on the rest of the data
  - For each frequent itemset, compute the descriptive statistics of the corresponding target variable
    - The frequent itemset becomes a rule by introducing the target variable as the rule consequent
  - Apply a statistical test to determine the interestingness of the rule

Statistics-based Methods

- How to determine whether an association rule is interesting?
  - Compare the statistics for the segment of the population covered by the rule (mean mu) against the segment of the population not covered by it (mean mu')
  - Statistical hypothesis testing:
    - Null hypothesis H0: mu' = mu + Delta
    - Alternative hypothesis H1: mu' > mu + Delta
    - Test statistic:
        Z = (mu' - mu - Delta) / sqrt(s1^2/n1 + s2^2/n2)
      which has zero mean and variance 1 under the null hypothesis

Statistics-based Methods

- Example: r: {Browser = Mozilla, Buy = Yes} -> Age: mu = 23
  - The rule is interesting if the difference between mu and mu' is greater than 5 years (i.e., Delta = 5)
  - For r, suppose n1 = 50 and s1 = 3.5
  - For r' (the complement, with mu' = 30): n2 = 250 and s2 = 6.5
  - Z = (30 - 23 - 5) / sqrt(3.5^2/50 + 6.5^2/250) = 3.11
  - For a one-sided test at the 95% confidence level, the critical Z-value for rejecting the null hypothesis is 1.64. Since Z = 3.11 > 1.64, r is an interesting rule.
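The same test as a runnable Python sketch (numbers from the example above; the 1.64 critical value is hard-coded, as on the slide):

```python
from math import sqrt

def z_statistic(mu_prime, mu, delta, s1, n1, s2, n2):
    """Z = (mu' - mu - delta) / sqrt(s1^2/n1 + s2^2/n2)."""
    return (mu_prime - mu - delta) / sqrt(s1**2 / n1 + s2**2 / n2)

z = z_statistic(mu_prime=30, mu=23, delta=5, s1=3.5, n1=50, s2=6.5, n2=250)
print(round(z, 2))   # 3.11
print(z > 1.64)      # True: reject H0, so the rule is interesting
```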

Min-Apriori (Han et al.)

Document-term matrix:

  TID  W1  W2  W3  W4  W5
  D1    2   2   0   0   1
  D2    0   0   1   2   2
  D3    2   3   0   0   0
  D4    0   0   1   0   1
  D5    1   1   1   0   2

Example: W1 and W2 tend to appear together in the same document.

Min-Apriori

- The data contains only continuous attributes of the same "type"
  - e.g., frequency of words in a document
- Potential solution: convert into a 0/1 matrix and then apply the existing algorithms
  - but this loses the word frequency information
- Discretization does not apply, because users want associations among words, not among ranges of word counts

Min-Apriori

- How to determine the support of a word?
  - If we simply sum up its frequency, the support count can be greater than the total number of documents!
  - Normalize the word vectors, e.g., using the L1 norm, so that each word has a support of exactly 1.0

  Normalized document-term matrix:

  TID    W1    W2    W3    W4    W5
  D1   0.40  0.33  0.00  0.00  0.17
  D2   0.00  0.00  0.33  1.00  0.33
  D3   0.40  0.50  0.00  0.00  0.00
  D4   0.00  0.00  0.33  0.00  0.17
  D5   0.20  0.17  0.33  0.00  0.33

Min-Apriori

- New definition of support: sum the smallest entry across the itemset's columns,
    sup(C) = sum over documents i of ( min over words j in C of D(i, j) )
- Example:
    sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17

Anti-monotone Property of Support

- Example (on the normalized matrix above):
    sup(W1)         = 0.40 + 0.00 + 0.40 + 0.00 + 0.20 = 1.0
    sup(W1, W2)     = 0.33 + 0.00 + 0.40 + 0.00 + 0.17 = 0.9
    sup(W1, W2, W3) = 0.00 + 0.00 + 0.00 + 0.00 + 0.17 = 0.17
- Support never increases as the itemset grows, so Apriori-style pruning remains valid.
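A minimal NumPy sketch of this min-based support, using the normalized matrix from the slide:

```python
import numpy as np

# Rows are documents D1..D5, columns are words W1..W5.
D = np.array([
    [0.40, 0.33, 0.00, 0.00, 0.17],
    [0.00, 0.00, 0.33, 1.00, 0.33],
    [0.40, 0.50, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.33, 0.00, 0.17],
    [0.20, 0.17, 0.33, 0.00, 0.33],
])

def min_support(D, word_cols):
    """sup(C) = sum_i min_{j in C} D[i, j]."""
    return D[:, word_cols].min(axis=1).sum()

print(min_support(D, [0]))         # sup(W1)       = 1.0
print(min_support(D, [0, 1]))      # sup(W1,W2)    ~ 0.9
print(min_support(D, [0, 1, 2]))   # sup(W1,W2,W3) = 0.17
```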

Multi-level Association Rules

Example concept hierarchy:

- Food
  - Bread: Wheat, White
  - Milk: Skim, 2% (brands such as Foremost, Kemps)
- Electronics
  - Computers: Desktop, Laptop, Accessory (Printer, Scanner)
  - Home: DVD, TV

Multi-level Association Rules

- Why should we incorporate a concept hierarchy?
  - Rules at the lower levels may not have enough support to appear in any frequent itemset
  - Rules at the lower levels of the hierarchy are overly specific
    - e.g., skim milk -> white bread, 2% milk -> wheat bread, skim milk -> wheat bread, etc. are all indicative of an association between milk and bread

Multi-level Association Rules

- How do support and confidence vary as we traverse the concept hierarchy?
  - If X is the parent item for both X1 and X2, then sigma(X) >= sigma(X1) + sigma(X2)
  - If sigma(X1 ∪ Y1) >= minsup, and X is the parent of X1 and Y is the parent of Y1, then
      sigma(X ∪ Y1) >= minsup, sigma(X1 ∪ Y) >= minsup, and sigma(X ∪ Y) >= minsup
  - If conf(X1 -> Y1) >= minconf, then conf(X1 -> Y) >= minconf

Multi-level Association Rules

- Approach 1: extend the current association rule formulation by augmenting each transaction with the higher-level items (sketched below)
    Original transaction:  {skim milk, wheat bread}
    Augmented transaction: {skim milk, wheat bread, milk, bread, food}
- Issues:
  - Items that reside at higher levels have much higher support counts
    - if the support threshold is low, there will be too many frequent patterns involving items from the higher levels
  - Increased dimensionality of the data
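A minimal Python sketch of the augmentation step in Approach 1, assuming the concept hierarchy is given as a child -> parent mapping (the mapping below covers just the slide's example items):

```python
PARENT = {
    "skim milk": "milk", "2% milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
    "milk": "food", "bread": "food",
}

def augment(transaction, parent=PARENT):
    """Add every ancestor of every item in the transaction."""
    items = set(transaction)
    for item in transaction:
        while item in parent:
            item = parent[item]
            items.add(item)
    return items

print(augment({"skim milk", "wheat bread"}))
# {'skim milk', 'wheat bread', 'milk', 'bread', 'food'}
```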

Multi-level Association Rules

- Approach 2:
  - Generate the frequent patterns at the highest level first
  - Then generate the frequent patterns at the next highest level, and so on
- Issues:
  - I/O requirements will increase dramatically because we need to perform more passes over the data
  - May miss some potentially interesting cross-level association patterns

Sequence Data

Sequence database:

  Object  Timestamp  Events
  A       10         2, 3, 5
  A       20         6, 1
  A       23         1
  B       11         4, 5, 6
  B       17         2
  B       21         7, 8, 1, 2
  B       28         1, 6
  C       14         1, 8, 7

Examples of Sequence Data

  Sequence database | Sequence                                      | Element (transaction)                                                    | Event (item)
  Customer          | Purchase history of a given customer          | A set of items bought by a customer at time t                            | Books, dairy products, CDs, etc.
  Web data          | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
  Event data        | History of events generated by a given sensor | Events triggered by a sensor at time t                                   | Types of alarms generated by sensors
  Genome sequences  | DNA sequence of a particular species          | An element of the DNA sequence                                           | Bases A, T, G, C

  Example: the sequence < {E1, E2} {E1, E3} {E2} {E3, E4} {E2} >, where each set is an element (transaction) and each Ei is an event (item)

Formal Definition of a Sequence

- A sequence is an ordered list of elements (transactions):
    s = < e1 e2 e3 ... >
- Each element contains a collection of events (items):
    ei = {i1, i2, ..., ik}
- Each element is attributed to a specific time or location
- The length of a sequence, |s|, is given by the number of elements in the sequence
- A k-sequence is a sequence that contains k events (items)

Examples of Sequences

- Web sequence:
    < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >
- Sequence of initiating events causing the nuclear accident at Three Mile Island (http://stellar-...):
    < {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main waterpump trips} {main turbine trips} {reactor pressure increases} >
- Sequence of books checked out at a library:
    < {Fellowship of the Ring} {The Two Towers} {Return of the King} >

Formal Definition of a Subsequence

- A sequence < a1 a2 ... an > is contained in another sequence < b1 b2 ... bm > (m >= n) if there exist integers i1 < i2 < ... < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, ..., an ⊆ b_in
- The support of a subsequence w is the fraction of data sequences that contain w
- A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is >= minsup)

  Data sequence              Subsequence      Contained?
  < {2,4} {3,5,6} {8} >      < {2} {3,5} >    Yes
  < {1,2} {3,4} >            < {1} {2} >      No
  < {2,4} {2,4} {2,5} >      < {2} {4} >      Yes

Sequential Pattern Mining: Definition

- Given:
  - a database of sequences
  - a user-specified minimum support threshold, minsup
- Task: find all subsequences with support >= minsup
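A small Python sketch of the containment test and the resulting support count (sequences are represented as lists of sets; these names are mine, not from the slides):

```python
def is_subsequence(sub, seq):
    """Greedily match each element of `sub` to a later element of `seq`
    that contains it (set inclusion), preserving order."""
    i = 0
    for element in seq:
        if i < len(sub) and sub[i] <= element:
            i += 1
    return i == len(sub)

def support(w, db):
    """Fraction of data sequences that contain w."""
    return sum(is_subsequence(w, s) for s in db) / len(db)

db = [
    [{2, 4}, {3, 5, 6}, {8}],
    [{1, 2}, {3, 4}],
    [{2, 4}, {2, 4}, {2, 5}],
]
print(is_subsequence([{2}, {3, 5}], db[0]))  # True
print(is_subsequence([{1}, {2}], db[1]))     # False
print(support([{2}, {4}], db))               # 2/3 on this toy database
```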

Sequential Pattern Mining: Challenge

- Given a sequence: < {a, b} {c, d, e} {f} {g, h, i} >
- Examples of subsequences: < {a} {c, d} {f} {g} >, < {c, d, e} >, < {b} {g} >, etc.
- How many k-subsequences can be extracted from a given n-sequence?
    n = 9 events, k = 4; picking, e.g., positions  Y _ _ Y Y _ _ _ Y  yields < {a} {d, e} {i} >
    Answer: C(n, k) = C(9, 4) = 126
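Each k-subsequence corresponds to one choice of k of the n event positions, which a quick Python check confirms:

```python
from itertools import combinations
from math import comb

events = list("abcdefghi")   # the 9 events of the example sequence
print(comb(9, 4))                          # 126
print(len(list(combinations(events, 4))))  # 126 position choices
```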

Sequential Pattern Mining: Example

  Object  Timestamp  Events
  A       1          1, 2, 4
  A       2          2, 3
  A       3          5
  B       1          1, 2
  B       2          2, 3, 4
  C       1          1, 2
  C       2          2, 3, 4
  C       3          2, 4, 5
  D       1          2
  D       2          3, 4
  D       3          4, 5
  E       1          1, 3
  E       2          2, 4, 5

Minsup = 50%. Examples of frequent subsequences:

  < {1,2} >        s = 60%
  < {2,3} >        s = 60%
  < {2,4} >        s = 80%
  < {3} {5} >      s = 80%
  < {1} {2} >      s = 80%
  < {2} {2} >      s = 60%
  < {1} {2,3} >    s = 60%
  < {2} {2,3} >    s = 60%
  < {1,2} {2,3} >  s = 60%

Extracting Sequential Patterns

- Given n events: i1, i2, i3, ..., in
- Candidate 1-subsequences:
    < {i1} >, < {i2} >, ..., < {in} >
- Candidate 2-subsequences:
    < {i1, i2} >, < {i1, i3} >, ..., and < {i1} {i1} >, < {i1} {i2} >, ..., < {in-1} {in} >
- Candidate 3-subsequences:
    < {i1, i2, i3} >, < {i1, i2, i4} >, ..., < {i1, i2} {i1} >, < {i1, i2} {i2} >, ...,
    < {i1} {i1, i2} >, ..., < {i1} {i1} {i1} >, < {i1} {i1} {i2} >, ...

Generalized Sequential Pattern (GSP)

- Step 1: make the first pass over the sequence database D to yield all the 1-element frequent sequences
- Step 2: repeat until no new frequent sequences are found
  - Candidate generation: merge pairs of frequent subsequences found in the (k-1)th pass to generate candidate sequences that contain k items
  - Candidate pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences
  - Support counting: make a new pass over the sequence database D to find the support of these candidate sequences
  - Candidate elimination: eliminate candidate k-sequences whose actual support is less than minsup

Candidate Generation

- Base case (k = 2): merging two frequent 1-sequences < {i1} > and < {i2} > produces two candidate 2-sequences: < {i1} {i2} > and < {i1, i2} >
- General case (k > 2): a frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2
  - The resulting candidate is the sequence w1 extended with the last event of w2:
    - if the last two events in w2 belong to the same element, the last event in w2 becomes part of the last element in w1
    - otherwise, the last event in w2 becomes a separate element appended to the end of w1

Candidate Generation Examples

- Merging w1 = < {1} {2, 3} {4} > and w2 = < {2, 3} {4, 5} > produces the candidate < {1} {2, 3} {4, 5} >, because the last two events in w2 (4 and 5) belong to the same element
- Merging w1 = < {1} {2, 3} {4} > and w2 = < {2, 3} {4} {5} > produces the candidate < {1} {2, 3} {4} {5} >, because the last two events in w2 (4 and 5) do not belong to the same element
- We do not have to merge w1 = < {1} {2, 6} {4} > and w2 = < {1} {2} {4, 5} > to produce the candidate < {1} {2, 6} {4, 5} >, because if the latter is a viable candidate, it can be obtained by merging w1 with < {2, 6} {4, 5} >
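A sketch of the merge step, with sequences represented as tuples of tuples of events (my representation, not the slides'):

```python
def drop_first(seq):
    """Remove the first event from the first element."""
    head = seq[0][1:]
    return ((head,) + seq[1:]) if head else seq[1:]

def drop_last(seq):
    """Remove the last event from the last element."""
    tail = seq[-1][:-1]
    return (seq[:-1] + (tail,)) if tail else seq[:-1]

def merge(w1, w2):
    """Return the candidate k-sequence from two frequent (k-1)-sequences,
    or None if the GSP merge condition does not hold."""
    if drop_first(w1) != drop_last(w2):
        return None
    last = w2[-1][-1]
    if len(w2[-1]) > 1:                      # last two events of w2 share an element
        return w1[:-1] + (w1[-1] + (last,),)
    return w1 + ((last,),)                   # append as a new element

w1 = ((1,), (2, 3), (4,))
print(merge(w1, ((2, 3), (4, 5))))      # ((1,), (2, 3), (4, 5))
print(merge(w1, ((2, 3), (4,), (5,))))  # ((1,), (2, 3), (4,), (5,))
```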

GSP Example

(figure: a worked GSP pass, merging frequent 3-sequences into candidate 4-sequences and pruning them)

Timing Constraints (I)

- Illustrated on the pattern < {A, B} {C} {D, E} >:
  - xg: max-gap, the largest allowed time difference between consecutive elements
  - ng: min-gap, the smallest allowed time difference between consecutive elements
  - ms: maximum span, the largest allowed time difference between the first and the last element
