GUIDE : MS. ANAGHA CHAUDHARI
A sequence : < (ef) (ab) (df) c b > A sequence database SID sequence An element may contain a set of items. Items within an element are unordered 10 <a(abc)(ac)d(cf)> and we list them alphabetically. 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> <a(bc)df> is a subsequence of 40 <eg(af)cbc> <a(abc)(ac)d(cf)> Given support threshold min_sup =2, <(ab)c> is a sequential pattern 6
CHALLENGES ON SEQUENTIAL PATTERN MINING A huge number of possible sequential patterns are hidden in databases A mining algorithm should  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold  be highly efficient, scalable, involving only a small number of database scans  be able to incorporate various kinds of user-specific constraints 7
The Apriori Algorithm—An Example Supmin = 2 Itemset sup Itemset sup Database TDB {A} 2 Tid Items L1 {A} 2 C1 {B} 3 {B} 3 10 A, C, D {C} 3 1st scan {C} 3 20 B, C, E {D} 1 {E} 3 30 A, B, C, E {E} 3 40 B, E C2 Itemset sup C2 Itemset {A, B} 1 L2 Itemset sup 2nd scan {A, B} {A, C} 2 {A, C} 2 {A, C} {A, E} 1 {B, C} 2 {B, C} 2 {A, E} {B, E} 3 {B, E} 3 {B, C} {C, E} 2 {C, E} 2 {B, E} {C, E} Itemset 3rd scan L3 Itemset sup C3 {B, C, E} {B, C, E} 2 10
The Apriori Algorithm [Pseudo-Code] Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk != ; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return k Lk; 11
APRIORI ADV/DISADV  Advantages:  Uses large itemset property.  Easily parallelized  Easy to implement.  Disadvantages:  Assumes transaction database is memory resident.  Requires up to m database scans.
 J. Han, J. Pei, and Y. Yin 2000  Depth-first search  Avoid explicit candidate generation  Adopt divide-and-conquer strategy  Two step approach Step1:Build a compact data structure called FP tree Step2:Extract frequent itemsets from FP tree.
Step 1: FP-Tree Construction  FP-Tree is constructed using 2 passes over the data-set: Pass 1:  Scan data and find support for each item.  Discard infrequent items.  Sort frequent items in decreasing order based on their support.
Pass 2: Nodes correspond to items and have a counter 1. FP-Growth reads 1 transaction at a time and maps it to a path 2. Fixed order is used, so paths can overlap when transactions share items (when they have the same prfix ). – In this case, counters are incremented 3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines) – The more paths that overlap, the higher the compression. FP-tree may fit in memory. 4. Frequent itemsets extracted from the FP-Tree.
 Start from each frequent length-1 pattern (as an initial suffix pattern)  construct its conditional pattern base (a ―subdatabase,‖which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern)  Construct its (conditional) FP-tree, and perform mining recursively on such a tree.  The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
Table : Table after first scan of database Table : Transactional data
Fig . FP – Tree Construction
EXAMPLE CONT Table:Mining FP Tree by creating conditional (sub)-pattern bases
EXAMPLE CONT Fig.The conditional FP-tree associated with the conditiona node I3
FP-FROWTH ADV/DISADV Advantages of FP-Growth • only 2 passes over data-set • ―compresses‖ data-set • no candidate generation • much faster than Apriori Disadvantages of FP-Growth • FP-Tree may not fit in memory!! • FP-Tree is expensive to build
APPLICATIONS Customer shopping sequences:  First buy computer, then CD-ROM, and then digital camera, within 3 months. Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc. Telephone calling patterns, Weblog click streams DNA sequences and gene structures 22
THANK YOU
Sequential pattern mining

Sequential pattern mining

  • 1.
    GUIDE : MS.ANAGHA CHAUDHARI
  • 6.
    A sequence :< (ef) (ab) (df) c b > A sequence database SID sequence An element may contain a set of items. Items within an element are unordered 10 <a(abc)(ac)d(cf)> and we list them alphabetically. 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> <a(bc)df> is a subsequence of 40 <eg(af)cbc> <a(abc)(ac)d(cf)> Given support threshold min_sup =2, <(ab)c> is a sequential pattern 6
  • 7.
    CHALLENGES ON SEQUENTIAL PATTERNMINING A huge number of possible sequential patterns are hidden in databases A mining algorithm should  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold  be highly efficient, scalable, involving only a small number of database scans  be able to incorporate various kinds of user-specific constraints 7
  • 10.
    The Apriori Algorithm—AnExample Supmin = 2 Itemset sup Itemset sup Database TDB {A} 2 Tid Items L1 {A} 2 C1 {B} 3 {B} 3 10 A, C, D {C} 3 1st scan {C} 3 20 B, C, E {D} 1 {E} 3 30 A, B, C, E {E} 3 40 B, E C2 Itemset sup C2 Itemset {A, B} 1 L2 Itemset sup 2nd scan {A, B} {A, C} 2 {A, C} 2 {A, C} {A, E} 1 {B, C} 2 {B, C} 2 {A, E} {B, E} 3 {B, E} 3 {B, C} {C, E} 2 {C, E} 2 {B, E} {C, E} Itemset 3rd scan L3 Itemset sup C3 {B, C, E} {B, C, E} 2 10
  • 11.
    The Apriori Algorithm[Pseudo-Code] Ck: Candidate itemset of size k Lk : frequent itemset of size k L1 = {frequent items}; for (k = 1; Lk != ; k++) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1 with min_support end return k Lk; 11
  • 12.
    APRIORI ADV/DISADV  Advantages:  Uses large itemset property.  Easily parallelized  Easy to implement.  Disadvantages:  Assumes transaction database is memory resident.  Requires up to m database scans.
  • 13.
    J. Han, J. Pei, and Y. Yin 2000  Depth-first search  Avoid explicit candidate generation  Adopt divide-and-conquer strategy  Two step approach Step1:Build a compact data structure called FP tree Step2:Extract frequent itemsets from FP tree.
  • 14.
    Step 1: FP-TreeConstruction  FP-Tree is constructed using 2 passes over the data-set: Pass 1:  Scan data and find support for each item.  Discard infrequent items.  Sort frequent items in decreasing order based on their support.
  • 15.
    Pass 2: Nodes correspondto items and have a counter 1. FP-Growth reads 1 transaction at a time and maps it to a path 2. Fixed order is used, so paths can overlap when transactions share items (when they have the same prfix ). – In this case, counters are incremented 3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines) – The more paths that overlap, the higher the compression. FP-tree may fit in memory. 4. Frequent itemsets extracted from the FP-Tree.
  • 16.
     Start fromeach frequent length-1 pattern (as an initial suffix pattern)  construct its conditional pattern base (a ―subdatabase,‖which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern)  Construct its (conditional) FP-tree, and perform mining recursively on such a tree.  The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from a conditional FP-tree.
  • 17.
    Table : Tableafter first scan of database Table : Transactional data
  • 18.
    Fig . FP– Tree Construction
  • 19.
    EXAMPLE CONT Table:Mining FPTree by creating conditional (sub)-pattern bases
  • 20.
    EXAMPLE CONT Fig.The conditionalFP-tree associated with the conditiona node I3
  • 21.
    FP-FROWTH ADV/DISADV Advantages ofFP-Growth • only 2 passes over data-set • ―compresses‖ data-set • no candidate generation • much faster than Apriori Disadvantages of FP-Growth • FP-Tree may not fit in memory!! • FP-Tree is expensive to build
  • 22.
    APPLICATIONS Customer shopping sequences:  First buy computer, then CD-ROM, and then digital camera, within 3 months. Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc. Telephone calling patterns, Weblog click streams DNA sequences and gene structures 22
  • 26.