Data Mining Courseware: chap6-basic-association-analysis.ppt (163文库)

Resource description

Data Mining
Association Analysis: Basic Concepts and Algorithms

Lecture Notes for Chapter 6
Introduction to Data Mining
by Tan, Steinbach, Kumar

Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Association Rule Mining

- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions: [table not preserved in the extracted text]

Example of association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

- Itemset
  - A collection of one or more items
    - Example: {Milk, Bread, Diaper}
  - k-itemset
    - An itemset that contains k items
- Support count ($\sigma$)
  - Frequency of occurrence of an itemset
  - E.g. $\sigma(\{Milk, Bread, Diaper\}) = 2$
- Support ($s$)
  - Fraction of transactions that contain an itemset
  - E.g. $s(\{Milk, Bread, Diaper\}) = 2/5$
- Frequent itemset
  - An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

- Association rule
  - An implication expression of the form X → Y, where X and Y are itemsets
  - Example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics
  - Support (s): fraction of transactions that contain both X and Y
  - Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

$$s = \frac{\sigma(\{Milk, Diaper, Beer\})}{|T|} = \frac{2}{5} = 0.4$$

$$c = \frac{\sigma(\{Milk, Diaper, Beer\})}{\sigma(\{Milk, Diaper\})} = \frac{2}{3} = 0.67$$

Association Rule Mining Task

- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach:
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  - Computationally prohibitive!

Mining Association Rules

Example of rules:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements

Mining Association Rules

- Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

[Figure: itemset lattice over items A to E, from null through the 1-, 2-, 3- and 4-itemsets down to ABCDE]

Given d items, there are $2^d$ possible candidate itemsets.

Frequent Itemset Generation

- Brute-force approach:
  - Each itemset in the lattice is a candidate frequent itemset
  - Count the support of each candidate by scanning the database
  - Match each transaction against every candidate
  - Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates and w the maximum transaction width. This is expensive since $M = 2^d$!

Computational Complexity

- Given d unique items:
  - Total number of itemsets = $2^d$
  - Total number of possible association rules:

$$R = \sum_{k=1}^{d-1}\left[\binom{d}{k} \times \sum_{j=1}^{d-k}\binom{d-k}{j}\right] = 3^d - 2^{d+1} + 1$$

  - If d = 6, R = 602 rules

Frequent Itemset Generation Strategies

- Reduce the number of candidates (M)
  - Complete search: $M = 2^d$
  - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
  - Reduce size of N as the size of itemset increases
  - Used by DHP and vertical-based mining algorithms
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction
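
The support and confidence definitions can be checked with a few lines of Python. This is only a minimal sketch: the five transactions are an assumption (the market-basket table itself is not reproduced in the extracted text; the rows below are the usual textbook table and agree with the counts quoted in these notes), and the helper names `support_count`, `support` and `confidence` are illustrative.

```python
# Assumed market-basket table (not shown in the extracted slides).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions=TRANSACTIONS):
    """sigma(X): number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for t in transactions if items.issubset(t))


def support(itemset, transactions=TRANSACTIONS):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(lhs, rhs, transactions=TRANSACTIONS):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    both = set(lhs) | set(rhs)
    return support_count(both, transactions) / support_count(lhs, transactions)


if __name__ == "__main__":
    print(support({"Milk", "Diaper", "Beer"}))                 # 0.4
    print(round(confidence({"Milk", "Diaper"}, {"Beer"}), 2))  # 0.67
```

All six rules listed for {Milk, Diaper, Beer} share the same numerator, which is why their supports coincide while their confidences differ.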

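The figure of 602 rules for d = 6 can be reproduced by brute force. The sketch below (function names are illustrative, not from the slides) counts every rule X → Y with X and Y non-empty and disjoint, and compares the result with the closed form $3^d - 2^{d+1} + 1$.

```python
from itertools import combinations


def count_rules_brute_force(d):
    """Count rules X -> Y with X, Y non-empty and disjoint over d items."""
    items = range(d)
    total = 0
    for k in range(1, d):  # size of the antecedent X
        for lhs in combinations(items, k):
            rest = [i for i in items if i not in lhs]
            for j in range(1, len(rest) + 1):  # size of the consequent Y
                total += sum(1 for _ in combinations(rest, j))
    return total


def count_rules_closed_form(d):
    """Closed form of the same count."""
    return 3**d - 2 ** (d + 1) + 1


if __name__ == "__main__":
    print(count_rules_brute_force(6), count_rules_closed_form(6))  # 602 602
```
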
Reducing Number of Candidates

- Apriori principle:
  - If an itemset is frequent, then all of its subsets must also be frequent
- Apriori principle holds due to the following property of the support measure:

$$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \ge s(Y)$$

  - Support of an itemset never exceeds the support of its subsets
  - This is known as the anti-monotone property of support

Illustrating Apriori Principle

[Figure: itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned]

Illustrating Apriori Principle

Minimum support = 3

Items (1-itemsets):
  Item     Count
  Bread    4
  Coke     2
  Milk     4
  Beer     3
  Diaper   4
  Eggs     1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
  Itemset            Count
  {Bread, Milk}      3
  {Bread, Beer}      2
  {Bread, Diaper}    3
  {Milk, Beer}       2
  {Milk, Diaper}     3
  {Beer, Diaper}     3

Triplets (3-itemsets):
  Itemset                  Count
  {Bread, Milk, Diaper}    3

If every subset is considered: $\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 41$
With support-based pruning: 6 + 6 + 1 = 13

Apriori Algorithm

- Method:
  - Let k = 1
  - Generate frequent itemsets of length 1
  - Repeat until no new frequent itemsets are identified
    - Generate length (k+1) candidate itemsets from length k frequent itemsets
    - Prune candidate itemsets containing subsets of length k that are infrequent
    - Count the support of each candidate by scanning the DB
    - Eliminate candidates that are infrequent, leaving only those that are frequent

Reducing Number of Comparisons

- Candidate counting:
  - Scan the database of transactions to determine the support of each candidate itemset
  - To reduce the number of comparisons, store the candidates in a hash structure
    - Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
- Hash function
- Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node)

[Figure: hash tree holding the 15 candidates; the hash function groups items 1, 4, 7 into one branch, 2, 5, 8 into a second and 3, 6, 9 into a third]

Association Rule Discovery: Hash Tree

[Figures: the candidate hash tree, highlighting in turn the branch reached by hashing on 1, 4 or 7, on 2, 5 or 8, and on 3, 6 or 9]

Subset Operation

Given a transaction t, what are the possible subsets of size 3?

Subset Operation Using Hash Tree

[Figures: transaction {1 2 3 5 6} is split level by level and hashed down the candidate tree. First level: 1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}. Second level under 1: 1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6}]

Match transaction against 11 out of 15 candidates.

Factors Affecting Complexity

- Choice of minimum support threshold
  - Lowering support threshold results in more frequent itemsets
  - This may increase number of candidates and max length of frequent itemsets
- Dimensionality (number of items) of the data set
  - More space is needed to store support count of each item
  - If number of frequent items also increases, both computation and I/O costs may also increase
- Size of database
  - Since Apriori makes multiple passes, run time of algorithm may increase with number of transactions
- Average transaction width
  - Transaction width increases with denser data sets
  - This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)
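
The Apriori method (generate, prune, count, eliminate) can be sketched in Python as follows. This is a plain illustrative version, not the authors' implementation: candidates are generated by taking unions of pairs of frequent k-itemsets, support is counted by a linear scan instead of a hash tree, and the transaction table is assumed (it is not reproduced in the extracted text).

```python
from itertools import combinations

# Assumed market-basket table (not shown in the extracted slides).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def apriori(transactions, minsup_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # k = 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= minsup_count}
    result = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: unions of two frequent k-itemsets of size k+1.
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == k + 1}
        # Candidate pruning: every k-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Support counting: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c.issubset(t))
                  for c in candidates}
        # Elimination of infrequent candidates.
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    for itemset, n in sorted(apriori(TRANSACTIONS, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), n)
```

The pruning step is where the Apriori principle pays off: a candidate is discarded without touching the database as soon as one of its k-subsets is missing from the previous level.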

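The subset operation can be illustrated without building the tree. The sketch below deliberately swaps the hash tree for a flat Python dictionary (itself a hash table): it enumerates the size-3 subsets of a transaction and looks each one up among the 15 candidates. It shows the counting idea only, not the tree traversal from the slides, and `count_supports` is an illustrative name.

```python
from itertools import combinations

# The 15 candidate 3-itemsets from the slide, each in ascending item order.
CANDIDATES = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]


def count_supports(transactions, candidates, k=3):
    """Count candidate occurrences by hashing each k-subset of each transaction."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # sorted() makes every subset come out in ascending order,
        # matching the way the candidates are stored.
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts


if __name__ == "__main__":
    counts = count_supports([{1, 2, 3, 5, 6}], CANDIDATES)
    print(sorted(c for c, n in counts.items() if n > 0))
```

A transaction of width 5 has 10 subsets of size 3, so this costs 10 lookups instead of 15 full comparisons; the gap widens as the number of candidates grows.
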
Compact Representation of Frequent Itemsets

- Some itemsets are redundant because they have identical support as their supersets
- Number of frequent itemsets $= 3 \times \sum_{k=1}^{10} \binom{10}{k}$
- Need a compact representation
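
As a worked evaluation of that count (the data table behind it is not reproduced in the extracted text, so only the arithmetic is shown), the inner sum is the number of non-empty subsets of a 10-element set:

```latex
3 \times \sum_{k=1}^{10} \binom{10}{k}
  = 3 \times \left(2^{10} - 1\right)
  = 3 \times 1023
  = 3069
```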

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice with a border separating the frequent from the infrequent itemsets; the maximal itemsets are marked along the border]

Closed Itemset

- An itemset is closed if none of its immediate supersets has the same support as the itemset

  Itemset      Support
  {A,B,C}      2
  {A,B,D}      3
  {A,C,D}      2
  {B,C,D}      3
  {A,B,C,D}    2

Maximal vs Closed Itemsets

  TID   Items
  1     ABC
  2     ABCD
  3     BCE
  4     ACDE
  5     DE

[Figure: itemset lattice over A to E, each node annotated with the ids of the transactions that contain it; itemsets not supported by any transaction are marked]

Maximal vs Closed Frequent Itemsets

Minimum support = 2
# Closed = 9
# Maximal = 4

[Figure: the same lattice, distinguishing itemsets that are closed and maximal from those that are closed but not maximal]

Maximal vs Closed Itemsets

[Figure not preserved in the extracted text]

Alternative Methods for Frequent Itemset Generation

- Traversal of itemset lattice
  - General-to-specific vs specific-to-general
  - Equivalence classes
  - Breadth-first vs depth-first
- Representation of database
  - Horizontal vs vertical data layout
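
The counts "# Closed = 9" and "# Maximal = 4" can be reproduced from the five-transaction table (ABC, ABCD, BCE, ACDE, DE) with minimum support 2. The following is a brute-force sketch over all subsets of {A, ..., E}, which is fine for five items but not how one would mine a large database; the function names are illustrative.

```python
from itertools import combinations

TRANSACTIONS = {1: "ABC", 2: "ABCD", 3: "BCE", 4: "ACDE", 5: "DE"}
ITEMS = "ABCDE"


def frequent_itemsets(minsup_count):
    """Support count of every itemset that meets the threshold."""
    freq = {}
    for k in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            s = frozenset(combo)
            n = sum(1 for t in TRANSACTIONS.values() if s.issubset(t))
            if n >= minsup_count:
                freq[s] = n
    return freq


def closed_and_maximal(minsup_count):
    """Return (closed frequent itemsets, maximal frequent itemsets)."""
    freq = frequent_itemsets(minsup_count)
    closed, maximal = [], []
    for s, n in freq.items():
        supersets = [s | {i} for i in ITEMS if i not in s]
        # Closed: no immediate superset has the same support. An infrequent
        # superset has strictly lower support, so only frequent ones matter.
        if all(freq.get(u) != n for u in supersets):
            closed.append(s)
        # Maximal: no immediate superset is frequent at all.
        if not any(u in freq for u in supersets):
            maximal.append(s)
    return closed, maximal


if __name__ == "__main__":
    closed, maximal = closed_and_maximal(2)
    print(len(closed), len(maximal))  # 9 4
    print(sorted("".join(sorted(m)) for m in maximal))
```

Every maximal frequent itemset is also closed, which the output confirms: the four maximal itemsets are among the nine closed ones.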

FP-growth Algorithm

- Use a compressed representation of the database using an FP-tree
- Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

FP-tree Construction

Transaction database:
  TID   Items
  1     {A,B}
  2     {B,C,D}
  3     {A,C,D,E}
  4     {A,D,E}
  5     {A,B,C}
  6     {A,B,C,D}
  7     {B,C}
  8     {A,B,C}
  9     {A,B,D}
  10    {B,C,E}

After reading TID=1: null → A:1 → B:1
After reading TID=2: null → A:1 → B:1, and null → B:1 → C:1 → D:1

FP-Tree Construction

[Figure: the complete FP-tree for the transaction database, with node counts such as A:7, B:5, B:3 and C:3, and a header table holding one pointer per item A to E]

- Pointers are used to assist frequent itemset generation

FP-growth

[Figure: the part of the FP-tree containing the paths that end in D]

Conditional pattern base for D:
P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}

Recursively apply FP-growth on P

Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD

Tree Projection

Set enumeration tree:
[Figure: set enumeration tree over items A to E]

Possible extension: E(A) = {B,C,D,E}
Possible extension: E(ABC) = {D,E}

Tree Projection

- Items are listed in lexicographic order
- Each node P stores the following information:
  - Itemset for node P
  - List of possible lexicographic extensions of P: E(P)
  - Pointer to projected database of its ancestor node
  - Bitvector containing information about which transactions in the projected database contain the itemset

Projected Database

  TID   Original database   Projected database for node A
  1     {A,B}               {B}
  2     {B,C,D}             {}
  3     {A,C,D,E}           {C,D,E}
  4     {A,D,E}             {D,E}
  5     {A,B,C}             {B,C}
  6     {A,B,C,D}           {B,C,D}
  7     {B,C}               {}
  8     {A,B,C}             {B,C}
  9     {A,B,D}             {B,D}
  10    {B,C,E}             {}

For each transaction T, projected transaction at node A is $T \cap E(A)$

ECLAT

- For each item, store a list of transaction ids (tids)

Horizontal data layout:
  TID   Items
  1     A,B,E
  2     B,C,D
  3     C,E
  4     A,C,D
  5     A,B,C,D
  6     A,E
  7     A,B
  8     A,B,C
  9     A,C,D
  10    B

Vertical data layout (TID-lists):
  A: 1, 4, 5, 6, 7, 8, 9
  B: 1, 2, 5, 7, 8, 10
  C: 2, 3, 4, 8, 9
  D: 2, 4, 5, 9
  E: 1, 3, 6

ECLAT

- Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets.
  A: 1, 4, 5, 6, 7, 8, 9
  B: 1, 2, 5, 7, 8, 10
  AB: 1, 5, 7, 8
- 3 traversal approaches: top-down, bottom-up and hybrid
- Advantage: very fast support counting
- Disadvantage: intermediate tid-lists may become too large for memory

Rule Generation

- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  - If {A,B,C,D} is a frequent itemset, candidate rules:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are $2^k - 2$ candidate association rules (ignoring L → ∅ and ∅ → L)

Rule Generation

- How to efficiently generate rules from frequent itemsets?
  - In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
  - But confidence of rules generated from the same itemset has an anti-monotone property
  - e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
    - Confidence is anti-monotone w.r.t. number of items on the RHS of the rule

Rule Generation for Apriori Algorithm

[Figure: lattice of rules, showing a low-confidence rule and the rules pruned beneath it]

Rule Generation for Apriori Algorithm

- Candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- join(CD → AB, BD → AC) would produce the candidate rule D → ABC
- Prune rule D → ABC if its subset AD → BC
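
The ECLAT idea of counting support by intersecting tid-lists can be sketched as a small depth-first search. This is an illustrative version under simple assumptions: the tid-lists are copied from the vertical layout in these notes, the traversal is bottom-up by prefix, and `eclat` is a hypothetical helper name.

```python
# Tid-lists as given in the vertical data layout.
VERTICAL = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}


def eclat(prefix, items, minsup_count, out):
    """Depth-first search; items is a list of (item, tidset) pairs that may extend prefix."""
    for i, (item, tids) in enumerate(items):
        if len(tids) >= minsup_count:
            itemset = prefix + (item,)
            out[itemset] = len(tids)  # support count = length of the tid-list
            # The tid-list of a (k+1)-itemset is the intersection of the
            # tid-lists of two k-itemsets that share the same prefix.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items[i + 1:]]
            eclat(itemset, suffix, minsup_count, out)
    return out


if __name__ == "__main__":
    frequent = eclat((), sorted(VERTICAL.items()), 2, {})
    print(sorted(VERTICAL["A"] & VERTICAL["B"]))  # [1, 5, 7, 8]
    print(frequent[("A", "B")])                   # 4
```

No pass over the original transactions is needed after the vertical layout is built, which is the "very fast support counting" advantage; the intermediate sets held in `suffix` are the tid-lists that can grow too large for memory.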

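Rule generation from a single frequent itemset can be sketched as below. This version simply enumerates all $2^k - 2$ candidate rules and filters them by confidence; it does not implement the level-wise merging and pruning of consequents described for the Apriori algorithm. The transaction table is an assumption (the usual market-basket example, not reproduced in the extracted text), and `sigma` and `generate_rules` are illustrative names.

```python
from itertools import combinations

# Assumed market-basket table (not shown in the extracted slides).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def sigma(itemset):
    """Support count of an itemset."""
    items = set(itemset)
    return sum(1 for t in TRANSACTIONS if items.issubset(t))


def generate_rules(L, minconf):
    """All rules f -> L - f with confidence >= minconf.

    Assumes L occurs in at least one transaction, so sigma(f) > 0 for every subset f.
    """
    L = frozenset(L)
    rules = []
    for k in range(1, len(L)):  # size of the antecedent f
        for lhs in combinations(sorted(L), k):
            lhs = frozenset(lhs)
            conf = sigma(L) / sigma(lhs)
            if conf >= minconf:
                rules.append((lhs, L - lhs, conf))
    return rules


if __name__ == "__main__":
    for lhs, rhs, conf in generate_rules({"Milk", "Diaper", "Beer"}, 0.6):
        print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

Because every rule from the same itemset shares the numerator sigma(L), moving items from the antecedent to the consequent can only shrink or keep the confidence, which is the anti-monotone property that the pruning step exploits.
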