S1: Data Mining
Copyright: S. Shekhar, C. S. Dept., University of Minnesota, Minneapolis, MN 55455.
S2: Data Mining - Application Domains
  • Motivation - Many applications, e.g. Market Analysis
    • Which items sell together? (diaper & beer)
    • Who buys Palm PDAs? (Customer segmentation)
    • Should this credit card application be approved?
    • Section 27.2 (pp. 891) lists more
  • Idea of data mining (Data-driven, inductive, backward)
    • ...Get model from learning data --> check with test data
    • ...Refine model if mismatch(prediction, reality)
    • Different from forward deductive methods
    • ...Build model --> deduce conclusions --> match with data
    • ...Refine model if mismatch(prediction, reality)
  • History
    • Statistics and Machine Learning (AI) started it
    • ...often main memory based software
    • ...few patterns: classification, clustering, ...
    • Database researchers contributed
    • ...scalability to large databases >> main memories
    • ...new patterns: e.g. association rules
    • ...integration with SQL, data warehouse etc.
  • Study tools listed in Table 27.1 (pp. 893).
    • Which came from a database vendor?


S3: Data Mining - Conceptual Data Model
  • Given following Tables
    • ...items( instance-id, item-type-id, attribute1, attribute2, ...)
    • ...transactions( xid, subset of item-types, customer-id, timestamp )
    • ...item-types( item-type-id, type-name, ... )
  • Pattern family - Associations
    • Subsets of item-types that occur together
    • ...Customer buying item-type IT1 also buys item-type IT2
    • Sequential Associations:
    • ...Customer buying item-type IT1 will buy item-type IT2 later
  • Pattern Family - Categorization / Segmentation
    • partition item-types into groups
    • Unsupervised Clustering - number of groups not given
    • ...e.g. market segmentation
    • Supervised Classification
    • ...optional table category( instance-id, class-label )
    • ...Examples of each category are available
    • ...e.g. credit card application --> { grant, decline }
  • Q. Map following to pattern families:
    • amazon.com: people buying book b1 also bought book b2
    • <Smoking, lung cancer>, <El Niño, mild winter in Minnesota>
    • Minnesota has 2 seasons: road-repair and snow
    • Admit if (GPA > 3) and (GRE score > 90th percentile)


S4: Association Rules - Conceptual Data Model
  • Association, association rules
    • Let I = a set of items, T = a set of transactions
    • ...An association A is a subset of I occurring together in T
    • ...Support(A) = fraction of T containing A
    • Let S1 and S2 be subsets of I
    • ...Association rule S1 --> S2
    • ...Support(S1 --> S2) = support(S1 union S2)
    • ...Confidence(S1 --> S2) = support(S1 union S2) / support(S1)
  • Example
    • Study the dataset Table 27.1 (pp. 872)
    • Identify values for I, T
    • Compute support and confidence for rules (see the sketch at the end of this slide)
    • ...(Milk => Juice), (Bread => Juice)
  • Variations of Association rules
    • Hierarchical set of Items (Fig. 27.3, pp. 880)
    • Negative associations
    • Continuous attributes
  • Q. Compare association rules with statistical concepts:
    • support(X --> Y), confidence(X --> Y)
    • conditional probability P(X|Y), correlation(X,Y)
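  • Sketch (Python): a minimal implementation of the two measures above; the
    transaction data and function names are illustrative, not the textbook's.

      def support(itemset, transactions):
          # fraction of transactions containing every item in itemset
          return sum(1 for t in transactions if itemset <= t) / len(transactions)

      def confidence(lhs, rhs, transactions):
          # confidence(lhs --> rhs) = support(lhs union rhs) / support(lhs)
          return support(lhs | rhs, transactions) / support(lhs, transactions)

      T = [{"milk", "bread", "juice"}, {"milk", "juice"},
           {"bread", "eggs"}, {"milk", "bread"}]
      print(support({"milk"}, T))                # 0.75
      print(confidence({"milk"}, {"juice"}, T))  # 0.666...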


S5: Association Rules - Logical Data Model
  • Idea: Extend relational model and SQL
    • Specify patterns over a relational schema
    • Specify data mining request to extract
    • ...pattern meeting certain criteria
    • Analogy: data-cube model for DWH
  • Status:
    • Research prototypes for association rules
    • Open problems
    • ...Common data model for many families of patterns
    • ...syntax for specifying patterns
  • BTW: Statistical data mining supported in SQL (Oracle, MS SQL Server)
    • Correlations, covariance
    • Linear regression
    • We may see association rules in SQL if the market picks up!


S6: Association Rules Mining - Physical Data Model
  • Efficient algorithms to mine association rules
    • Given an item set I, a transaction set T, and thresholds,
    • Find association rules S1 --> S2 with
    • ...support above the given threshold
  • Naive Idea: Exhaustive enumeration (pp. 872)
    • Steps
    • ...Generate candidate subsets of I
    • ...Scan T to compute their support
    • ...Select subsets with adequate support
    • Downside: complexity exponential(size(I))
  • Better Idea: Apriori (Algorithm 27.1, pp. 873)
    • Idea: Generate candidates in order of size (sketched at the end of this slide)
    • Apriori_gen: combine subsets that differ only in the last item
    • ...(A1,A2,A3), (A1,A2,A6) => (A1,A2,A3,A6)
    • ...(A1,A2,A3), (A1,A3,A6) => not combined
    • ...efficient enumeration!
    • Prune candidate families as soon as possible
    • ...Note: Support is anti-monotone - it never increases as candidates grow
    • ...Support(S1 union S2) <= min( support(S1), support(S2) )
    • ...Pruning can be applied in each iteration
    • Steps - see pp. 860-861
  • Properties - correctness, completeness
  • Data structures, storage methods
    • Candidate sets in the Apriori algorithm
    • ...Sorted candidate sets make apriori_gen() efficient
    • Set data structures - bitmap vectors for I
    • Candidate sets of different sizes - hash tree
  • Other Ideas
    • Sampling (27.2.3)
    • Frequent Pattern Tree (Algorithm 27.2, section 27.2.4)
    • ...Example Frequent Pattern Tree (Fig. 27.2, pp. 877)
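  • Sketch (Python): a level-wise miner in the spirit of Apriori; an
    illustrative reconstruction, not the textbook's exact Algorithm 27.1.

      from itertools import combinations

      def apriori(transactions, min_support):
          n = len(transactions)
          def frequent(cands):
              return {c for c in cands
                      if sum(1 for t in transactions if c <= t) / n >= min_support}
          level = frequent({frozenset([i]) for t in transactions for i in t})
          result = set(level)
          while level:
              # apriori_gen: join sorted itemsets that differ only in the last item
              tuples = sorted(tuple(sorted(s)) for s in level)
              cands = {frozenset(a) | frozenset(b)
                       for a, b in combinations(tuples, 2) if a[:-1] == b[:-1]}
              # prune: every subset one item smaller must itself be frequent
              cands = {c for c in cands if all(c - {x} in level for x in c)}
              level = frequent(cands)
              result |= level
          return result

      T = [frozenset(t) for t in ({1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5})]
      print(sorted(map(sorted, apriori(T, 0.5))))  # frequent itemsets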


S7: Association Rules Mining - Exercises
  • Determine truth / falsehood for an AR: LHS --> RHS
    • A. confidence(AR) is never less than support(AR).
    • B. confidence(LHS --> RHS) = confidence(RHS --> LHS)
    • C. support(LHS --> RHS) = support(RHS --> LHS)
  • Consider the following set of transactions: (tid, itemset)
    (100, < 1, 3, 4 > ), (101, < 2, 3, 5 > ),
    (102, < 1, 2, 3, 5 > ), (103, < 2, 5 > )
    • Find all associations with support of at least 0.5
    • Find all association rules with support of at least 0.5
    • ...and confidence of at least 0.5 (a brute-force checker is sketched below)
  • Repeat the previous question with thresholds of 40 percent
    • Set of Item-types = { A, B, C, D, E, F, G }
      T1={A, D, E}, T6={A, D, E, G}
      T2={A, D, F}, T7={A, B, D, F}
      T3={A, E, F}, T8={A, B, D, F, G}
      T4={A, B, D, F}, T9={B, D, E, G}
      T5={B, D, F}, T10={A, D, E}
  • Suppose half the transactions include item A and the other
    • ...half include item B. How many non-trivial association rules
    • ...can be inferred from this information? Provide limits
    • ...on support and confidence of these rules.
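  • Sketch (Python): a brute-force checker for the four-transaction exercise
    above; enumeration is fine at this scale.

      from itertools import combinations

      T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
      items = sorted({i for t in T for i in t})

      def sup(s):
          return sum(1 for t in T if s <= t) / len(T)

      for k in range(1, len(items) + 1):
          for c in combinations(items, k):
              if sup(set(c)) >= 0.5:
                  print(c, sup(set(c)))  # associations with support >= 0.5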


S8: Classifiers - Logical Data Model
  • Problem Definition
    • Given: A relation R(A1, A2, ..., An, C)
    • Find: A function f mapping (A1, ..., An) to C
    • Objectives: Minimize classification error
    • ...difference( C, f(A1, ..., An) )
    • ...over learning as well as testing samples
    • Constraints: categorical C, independent samples, representative
  • Choices for function "f"
    • Decision Rules
    • Decision Tree
    • Others, e.g. Bayesian classifiers, regression
  • Decision Trees
    • Structure
    • ...Interior nodes = categorical attributes from (A1, ..., An)
    • ...Leaves = class labels from domain(C)
    • ...Edges = values from domain(Ai), where Ai is the parent node's attribute
    • Property: acts as a search tree mapping tuples in R --> leaves, i.e. class labels
    • Example: Fig. 27.5, pp. 883 (a hand-coded sketch follows at the end of this slide)
  • Q? Find class labels for tuples in Fig. 27.6 using DT in Fig. 27.5
  • Strengths and Weaknesses of Decision Trees
    • Can work with heterogeneous data
    • Decision boundaries are parallel to the axes
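  • Sketch (Python): a hand-coded decision tree as one concrete choice of "f";
    the attribute names and thresholds are made up, not those of Fig. 27.5.

      def f(salary, age):
          # interior nodes test attributes; leaves return class labels
          if salary < 20_000:
              return "decline"
          if salary > 50_000:
              return "grant"
          return "grant" if age >= 25 else "decline"  # mid-range salary

      print(f(salary=30_000, age=30))  # grant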


S9: Decision Tree Mining - Physical Data Model
  • Efficient algorithms to mine decision trees
    • Given: a table R(A1, A2, ..., An, C)
    • Find: a decision tree T
    • Objective: Minimize classification error
    • Constraint: prefer smaller trees; avoid unique columns (e.g. primary keys)
  • Naive Idea: Exhaustive enumeration
    • 1. Generate candidate trees
    • 2. Scan R to compute their classification errors
    • 3. Select a tree with small classification error and small size
    • Downside: complexity exponential(number of columns of R)
  • Alternative Idea: ID3 (Algorithm 27.3, pp. 883)
    • Idea: greedy algorithm - each split reduces entropy
    • 1. Select an attribute for the root node
    • ...Choose the attribute with the highest information gain (sketched at the end of this slide)
    • ...Partition tuples over domain of chosen attribute
    • 2. Recursion on each child
    • 3. Stopping criteria - don't recurse on nodes with few samples
  • Example: Fig. 27.6
    • See the spreadsheet for the entropy of each table partition
    • Compare attributes by entropy - salary wins
    • salary in (20K..50K) => either age or acctBalance
    • Result = Fig. 27.7
  • Strengths and Weaknesses of ID3
    • Faster than exhaustive
    • Heuristic - no guarantee of solution quality
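  • Sketch (Python): the entropy and information-gain computations behind
    ID3's attribute choice; the toy (value, label) pairs are illustrative.

      from collections import Counter
      from math import log2

      def entropy(labels):
          n = len(labels)
          return -sum(c / n * log2(c / n) for c in Counter(labels).values())

      def info_gain(pairs):
          # pairs: (attribute value, class label) for one candidate attribute
          base = entropy([label for _, label in pairs])
          rem = 0.0
          for v in {value for value, _ in pairs}:
              part = [label for value, label in pairs if value == v]
              rem += len(part) / len(pairs) * entropy(part)
          return base - rem  # entropy reduction from splitting on the attribute

      salary = [("low", "decline"), ("low", "decline"),
                ("high", "grant"), ("high", "decline")]
      print(info_gain(salary))  # 0.311...; the highest-gain attribute wins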


S10: Clustering - Logical Data Model
  • Problem Definition
    • Given: A relation R(A1, A2, ..., An),
    • ...and a similarity function between rows of R
    • Find: A set of groups of rows in R
    • Objectives: Groups should be
    • ...cohesive: tuples within a group are similar to each other
    • ...not coupled: tuples across groups are dissimilar
    • Constraints: number of clusters may be given, significant clusters
  • Choices for similarity measures
    • Distance functions, e.g. Euclidean, Manhattan, string edit distance, graph distance
    • Lp metrics (Manhattan is L1, Euclidean is L2)
  • Choices for group representations
    • Centers, e.g. mean, medoid, (mean and standard deviation)
    • Boolean expression on attributes
    • Named subsets of rows
  • Example, Figure 27.8, pp. 887
    • Construct a Manhattan distance matrix for the rows (a sketch follows at the end of this slide)
    • Q? Which points are close to p1?
    • Q? Which points are close to p2?
    • Q? For these two clusters
    • ...Compute the centers
    • ...Provide a boolean expression on attributes
    • Q? Which point is furthest from its nearest neighbor?
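  • Sketch (Python): building the Manhattan distance matrix; the coordinates
    are illustrative, not those of Fig. 27.8.

      points = {"p1": (25, 2), "p2": (30, 5), "p3": (52, 20), "p4": (55, 25)}

      def manhattan(a, b):
          return sum(abs(x - y) for x, y in zip(a, b))

      for name, p in points.items():  # one matrix row per point
          print(name, [manhattan(p, q) for q in points.values()])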


S11: Cluster Mining - Physical Data Model
  • Efficient algorithms to mine cluster-centers
    • Given: a table R(A1, A2, ..., An),
    • Find: clusters, i.e. center points
    • Objective: Minimize average distance(rows of R, nearest center points)
    • Constraint: K = number of clusters,
    • ...optional - center points belong to R
  • Naive Idea: Exhaustive enumeration
    • If domains of attributes are small
    • 1. Enumerate cross-product of domains of attributes
    • 2. Enumerate size-K subsets of tuples
    • 3. Choose the size-K subset with the smallest average distance to R.
  • Alternative Idea: K-Means (Algorithm 27.4, pp. 886)
    • Idea: greedy algorithm - each iteration reduces the average distance
    • 1. Select K points at random
    • 2. Assign tuples of R to the nearest points
    • 3. Compute the center of each cluster
    • 4. Repeat steps 2 and 3 until a local minimum is reached (sketched at the end of this slide)
  • Example: Fig. 27.8 with Manhattan distance
    • Try the K-means algorithm with each of the following pairs of starting points
    • ...(Age, Years of Service) = (0,0), (1,1)
    • ...(Age, Years of Service) = (10,10), (50,20)
    • Do you converge to the same solution from both starting points?
    • Result = Fig. 27.7
  • Strengths and Weaknesses of K-means
    • Faster than exhaustive
    • Heuristic - no guarantee of solution quality
    • Assumes equal size disjoint clusters
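  • Sketch (Python): a minimal K-means following the four steps above, with
    Manhattan assignment and mean centers; purely illustrative.

      import random

      def manhattan(a, b):
          return sum(abs(x - y) for x, y in zip(a, b))

      def kmeans(points, k, iters=100, seed=0):
          centers = random.Random(seed).sample(points, k)  # 1. K random points
          for _ in range(iters):
              clusters = [[] for _ in range(k)]
              for p in points:  # 2. assign each tuple to its nearest center
                  nearest = min(range(k), key=lambda i: manhattan(p, centers[i]))
                  clusters[nearest].append(p)
              # 3. recompute each center (empty clusters keep their old center)
              new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centers[i]
                     for i, c in enumerate(clusters)]
              if new == centers:  # 4. stop once a local minimum is reached
                  break
              centers = new
          return centers

      pts = [(0, 0), (1, 1), (10, 10), (50, 20)]
      print(kmeans(pts, k=2))  # compare runs with different seeds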


S12: Outliers - Logical Data Model
  • Problem Definition
    • Given: A relation R(A1, A2, ..., An),
    • ...and a similarity function between rows of R
    • Find: rows in R which are dissimilar to most points in R
    • Objectives: maximize dissimilarity function
    • Constraints: number of outliers may be given, significant outliers
  • Choices for similarity measures between rows
    • Distance functions, e.g. Euclidean, Manhattan, string edit distance, graph distance
    • Lp metrics (Manhattan is L1, Euclidean is L2)
  • Choices for aggregate dissimilarity measures
    • Distance to K nearest neighbors
    • Density of neighborhood outside the expected range
    • Attribute difference with nearby neighbors
  • Example, Figure 27.8, pp. 887 using Manhattan distance
    • Q? Which point is furthest from its nearest neighbor? (a kNN-distance sketch follows below)
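  • Sketch (Python): the "distance to K nearest neighbors" measure from the
    list above; the points are illustrative.

      def manhattan(a, b):
          return sum(abs(x - y) for x, y in zip(a, b))

      def knn_distance(p, points, k=1):
          # distance to the k-th nearest other point; larger = more outlying
          return sorted(manhattan(p, q) for q in points if q != p)[k - 1]

      points = [(0, 0), (1, 0), (0, 1), (10, 10)]
      print(max(points, key=lambda p: knn_distance(p, points)))  # (10, 10)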


S13: Outlier Mining - Physical Data Model
  • Efficient algorithms to mine outliers
    • Given: a table R(A1, A2, ..., An),
    • Find: outlier rows
    • Objective: maximize dissimilarity_measure(rows of R, outlier row)
    • Constraint: Top K or statistically significant
  • Naive Idea: Exhaustive enumeration
    • 1. Compute a dissimilarity measure for each row of R against the rest of R
    • 2. Choose top-K rows, or those with scores outside expected range
  • Alternative Idea: Clustering based
    • Idea: use clusters as a summary of the data to reduce computational cost (sketched below)
    • 1. Cluster R, e.g. via K-means
    • 2. Compute distance of each tuple in R to nearest cluster center
    • 3. Choose top-K rows, or those with scores outside expected range
  • Strengths and Weaknesses of cluster based outliers
    • Faster than exhaustive
    • Heuristic - no guarantee of solution quality
    • May miss outliers, e.g. 100 points on a unit circle and 1 at center
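  • Sketch (Python): the cluster-based method above, given cluster centers
    (e.g. from K-means); the data and top_k parameter are illustrative.

      def manhattan(a, b):
          return sum(abs(x - y) for x, y in zip(a, b))

      def cluster_outliers(rows, centers, top_k=1):
          # score each row by its distance to the nearest cluster center
          ranked = sorted(rows, key=lambda r: min(manhattan(r, c) for c in centers))
          return ranked[-top_k:]  # the top_k rows with the largest scores

      rows = [(0, 0), (1, 1), (10, 10), (11, 9), (50, 50)]
      centers = [(0.5, 0.5), (10.5, 9.5)]
      print(cluster_outliers(rows, centers))  # [(50, 50)]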

