S1: Data Mining
- Learning Goals:
- Understand Data Mining Tasks
- Understand Data Mining Patterns
- Learn about Algorithms to mine the patterns
- Outline
- Data Mining - motivation, history
- Conceptual Data Models - Patterns, Categorizing patterns
- Logical Data Models - 4 Pattern families
- Physical Data Models - Algorithms
- Readings
- Learning strategy:
- Learn new terms and compare them to what we already know!
Copyright: S. Shekhar, C. S. Dept., University of Minnesota, Minneapolis, MN 55455.
S2: Data Mining - Application Domains
- Motivation - Many applications, e.g. Market Analysis
- Which items sell together? (diaper & beer)
- Who buys Palm PDAs? (Customer segmentation)
- Should this credit card application be approved?
- Section 27.2 (pp. 891) lists more
- Idea of data mining (Data-driven, inductive, backward)
- ...Get model from learning data --> check with test data
- ...Refine model if mismatch(prediction, reality)
- Different from forward deductive methods
- ...Build model --> deduce conclusions --> match with data
- ...Refine model if mismatch(prediction, reality)
- History
- Statistics and Machine Learning (AI) started it
- ...often main memory based software
- ...few patterns: classification, clustering, ...
- Database researchers contributed
- ...scalability to large databases >> main memories
- ...new patterns: e.g. association rules
- ...integration with SQL, data warehouses, etc.
- Study tools listed in Table 27.1 (pp. 893).
- Which came from a database vendor?
S3: Data Mining - Conceptual Data Model
- Given the following tables
- ...items( instance-id, item-type-id, attribute1, attribute2, ...)
- ...transactions( xid, subset of item-types, customer-id, timestamp )
- ...item-types( item-type-id, type-name, ... )
- Pattern family - Associations
- Subsets of similar item-types
- ...Customer buying item-type IT1 also buys item-type IT2
- Sequential Associations:
- ...Customer buying item-type IT1 will buy item-type IT2 later
- Pattern Family - Categorization / Segmentation
- partition item-types into groups
- Unsupervised Clustering - number of groups not given
- ...e.g. market segmentation
- Supervised Classification
- ...optional table category( instance-id, class-label )
- ...Examples of each category are available
- ...e.g. credit card application --> { grant, decline }
- Q. Map following to pattern families:
- amazon.com: people buying book b1 also bought book b2
- <Smoking, Lung cancer>, <El Nino, mild winter in Minnesota>
- Minnesota has 2 seasons: road-repair and snow
- Admit if (GPA > 3) and (GRE score > 90 percentile)
S4: Association Rules - Conceptual Data Model
- Association, association rules
- Let I = a set of items, T = a set of transactions
- ...Association A is a subset of I occurring together in T
- ...Support(S1) = fraction of T containing S1
- Let S1 and S2 be subsets of I
- ...Association Rule S1 --> S2
- ...Support(S1 --> S2) = support(S1 union S2)
- ...Confidence(S1 --> S2) = support(S1 union S2) / support(S1)
- Example
- Study the dataset Table 27.1 (pp. 872)
- Identify values for I, T
- Compute support and confidence for rules
- ...(Milk => Juice), (Bread => Juice)
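- A minimal sketch of these definitions in Python, using a made-up transaction list (not the textbook's Table 27.1):

    # Hypothetical transactions; each is a set of item-types.
    T = [{"milk", "bread", "juice"},
         {"milk", "juice"},
         {"bread", "butter"},
         {"milk", "bread", "butter"}]

    def support(itemset, transactions):
        """Fraction of transactions containing every item in `itemset`."""
        return sum(set(itemset) <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """support(lhs union rhs) / support(lhs)."""
        return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

    print(support({"milk", "juice"}, T))       # 0.5
    print(confidence({"milk"}, {"juice"}, T))  # 0.5 / 0.75 = 0.667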
- Variations of Association rules
- Hierarchical set of Items (Fig. 27.3, pp. 880)
- Negative associations
- Continuous attributes
- Q. Compare association rules with statistical concepts:
- support(X --> Y), confidence(X --> Y)
- conditional probability P(Y|X), correlation(X, Y)
S5: Association Rules - Logical Data Model
- Idea: Extend relational model and SQL
- Specify patterns over a relational schema
- Specify data mining request to extract
- ...pattern meeting certain criteria
- Analogy: data-cube model for DWH
- Status:
- Research prototypes for association rules
- Open problems
- ...Common data model for many families of patterns
- ...syntax for specifying patterns
- BTW: Statistical data mining supported in SQL (Oracle, MS SQL Server)
- Correlations, covariance
- Linear regression
- We may see association rules in SQL if the market picks up!
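- These primitives are easy to state outside SQL as well; a hedged Python sketch using the standard population formulas (no vendor SQL syntax assumed):

    import math

    def covariance(xs, ys):
        """Population covariance of two equal-length sequences."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

    def correlation(xs, ys):
        """Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
        return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

    def linear_regression(xs, ys):
        """Least-squares fit y = a*x + b; returns (a, b)."""
        a = covariance(xs, ys) / covariance(xs, xs)
        b = sum(ys) / len(ys) - a * sum(xs) / len(xs)
        return a, b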
S6: Association Rules Mining - Physical Data Model
- Efficient algorithms to mine association rules
- Given an item set I, a transaction set T, and thresholds
- Find Association Rules S1 --> S2 with
- ...support above the given threshold
- Naive Idea: Exhaustive enumeration (pp. 872)
- Steps
- ...Generate candidate subsets of I
- ...Scan T to compute their support
- ...Select subsets with adequate support
- Downside: complexity exponential(size(I))
- Better Idea: Apriori (Algorithm 27.1, pp. 873)
- Idea: Generate candidates in order of size
- Apriori_gen: combine subsets which differ only in last item
- ...(A1,A2,A3), (A1,A2,A6) => (A1,A2,A3,A6)
- ...(A1,A2,A3), (A1,A3,A6) => not combined
- ...efficient enumeration!
- Prune candidate families as soon as possible
- ...Note: Support is monotonically non-increasing in candidate size
- ...Support(S1 union S2) <= min( support(S1), support(S2) )
- ...Pruning can be applied in each iteration
- Steps - see pp. 860-861 and the sketch below
- Properties - correctness, completeness
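- A minimal sketch of this level-wise search in Python (an illustration of the idea, not Algorithm 27.1 verbatim; transactions are assumed to be sets of hashable item-ids):

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {frozenset: support} for all frequent itemsets."""
        n = len(transactions)
        def sup(s):
            return sum(s <= t for t in transactions) / n
        # Level 1: frequent single item-types.
        singletons = {frozenset([i]) for t in transactions for i in t}
        freq = {s: sup(s) for s in singletons if sup(s) >= min_support}
        result = dict(freq)
        while freq:
            # apriori_gen: join itemsets that differ only in the last item.
            level = sorted(tuple(sorted(s)) for s in freq)
            candidates = set()
            for a, b in combinations(level, 2):
                if a[:-1] == b[:-1]:
                    c = frozenset(a) | frozenset(b)
                    # Prune: every one-smaller subset must itself be frequent.
                    if all(frozenset(sub) in freq
                           for sub in combinations(sorted(c), len(c) - 1)):
                        candidates.add(c)
            freq = {c: sup(c) for c in candidates if sup(c) >= min_support}
            result.update(freq)
        return result

- The pruning test inside the join is exactly the monotonicity note above: a candidate survives only if all of its one-smaller subsets were frequent.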
- Data structures, storage methods
- Candidate set in Apriori algorithm
- ...Sorted candidate sets efficient for apriori_gen()
- Set data structures - bitmap vectors for I
- Candidate set of different sizes - Hash tree
- Other Ideas
- Sampling (27.2.3)
- Frequent Pattern Tree (Algorithm 27.2, section 27.2.4)
- ...Example Frequent Pattern Tree (Fig. 27.2, pp. 877)
S7: Association Rules Mining - Exercises
- Determine truth / falsehood for an AR: LHS --> RHS
- A. confidence(AR) is never less than support(AR).
- B. support(LHS --> RHS) = support(RHS --> LHS)
- C. confidence(LHS --> RHS) = confidence(RHS --> LHS)
- Consider the following set of transactions: (tid, itemset)
(100, <1, 3, 4>), (101, <2, 3, 5>),
(102, <1, 2, 3, 5>), (103, <2, 5>)
- Find all associations with support of at least 0.5
- Find all association rules with support of at least 0.5
- ...and confidence of at least 0.5
- Repeat the previous question for thresholds of 40 percent (a brute-force checker is sketched below)
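- A hedged brute-force checker for these exercises in Python (fine at this scale, exponential in general - which is why Apriori exists):

    from itertools import combinations

    T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    items = sorted({i for t in T for i in t})

    def sup(s):
        return sum(set(s) <= t for t in T) / len(T)

    for k in range(1, len(items) + 1):
        for s in combinations(items, k):
            if sup(s) >= 0.5:                       # frequent association
                print("association", s, sup(s))
                for j in range(1, k):               # rules that split s
                    for lhs in combinations(s, j):
                        rhs = tuple(i for i in s if i not in lhs)
                        if sup(s) / sup(lhs) >= 0.5:
                            print("  rule", lhs, "-->", rhs)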
- Set of Item-types = { A, B, C, D, E, F, G }
T1={A, D, E}, T6={A, D, E, G}
T2={A, D, F}, T7={A, B, D, F}
T3={A, E, F}, T8={A, B, D, F, G}
T4={A, B, D, F}, T9={B, D, E, G}
T5={B, D, F}, T10={A, D, E}
- Suppose half the transactions include item A and another
- ...half include item B. How many non-trivial association rules
- ...can be inferred from this information? Provide bounds
- ...on the support and confidence of these rules.
S8: Classifiers - Logical Data Model
- Problem Definition
- Given: A relation R(A1, A2, ..., An, C)
- Find: A function f mapping (A1, ..., An) to C
- Objectives: Minimize classification error
- ...difference( C, f(A1, ..., An) )
- ...over learning as well as testing samples
- Constraints: categorical C; independent, representative samples
- Choices for function "f"
- Decision Rules
- Decision Tree
- Others, e.g. Bayesian classifier, regression
- Decision Trees
- Structure
- ...Interior nodes = a categorical attribute from (A1, ..., An)
- ...Leaves = class labels from domain(C)
- ...Edges = a value from domain(Ai), Ai associated with the parent node
- Property: a search tree: tuples in R --> leaves, i.e. class labels
- Example Fig. 27.5 pp. 883
- Q? Find class labels for tuples in Fig. 27.6 using DT in Fig. 27.5
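- To make the structure concrete, a hedged sketch of a tiny decision tree as nested tuples in Python; the attributes and labels here are illustrative, not those of Fig. 27.5:

    # Interior node: (attribute, {edge-value: child}); leaf: a class label.
    tree = ("salary", {
        "low":  "decline",
        "mid":  ("age", {"young": "decline", "old": "grant"}),
        "high": "grant",
    })

    def classify(row, node=tree):
        """Walk from the root to a leaf using the row's attribute values."""
        while isinstance(node, tuple):        # interior node
            attribute, children = node
            node = children[row[attribute]]
        return node                           # leaf = class label

    print(classify({"salary": "mid", "age": "old"}))  # grant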
- Strengths and Weaknesses of Decision Trees
- Can work with heterogeneous data
- Decision boundaries are parallel to the axes
S9: Decision Tree Mining - Physical Data Model
- Efficient algorithms to mine decision trees
- Given: a table R(A1, A2, ..., An, C)
- Find: a decision tree T
- Objective: Minimize classification error
- Constraints: prefer smaller trees; avoid unique columns (e.g. primary keys)
- Naive Idea: Exhaustive enumeration
- 1. Generate candidate trees
- 2. Scan R to compute their classification errors
- 3. Select a tree with small classification error and small size
- Downside: complexity exponential(number of columns of R)
- Alternative Idea: ID3 (Algorithm 27.3, pp. 883)
- Idea: greedy algorithm - each node split reduces entropy
- 1. Select an attribute for the root node
- ...Choose the attribute with the highest information gain
- ...Partition tuples over domain of chosen attribute
- 2. Recursion on each child
- 3. Stopping criteria - don't recurse on nodes with few samples
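- A minimal Python sketch of the entropy and information-gain computations behind step 1 (categorical attributes assumed; rows are dicts keyed by attribute name):

    import math
    from collections import Counter

    def entropy(labels):
        """H = -sum p * log2(p) over the class-label distribution."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(rows, attribute, class_attr):
        """Entropy reduction from partitioning `rows` on `attribute`."""
        before = entropy([r[class_attr] for r in rows])
        parts = {}
        for r in rows:
            parts.setdefault(r[attribute], []).append(r[class_attr])
        after = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
        return before - after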
- Example: Fig. 27.6
- See spreadsheet for the entropy function on table partitions
- Compare attributes by entropy - salary wins
- salary in (20K..50K) => either age or acctBalance
- Result = Fig. 27.7
- Strengths and Weaknesses of ID3
- Faster than exhaustive
- Heuristic - no guarantee of solution quality
S10: Clustering - Logical Data Model
- Problem Definition
- Given: A relation R(A1, A2, ..., An),
- ...and a similarity function between rows of R
- Find: A set of groups of rows in R
- Objectives: Groups should be
- ...cohesive: tuples within a group are similar to each other
- ...not coupled: tuples across groups are dissimilar
- Constraints: number of clusters may be given, significant clusters
- Choices for similarity measures
- Distance functions, e.g. Euclidean, Manhattan, string edits, graph-distance
- Lp metrics, e.g. L1 (Manhattan), L2 (Euclidean)
- Choices for group representations
- Centers, e.g. mean, medoid, (mean and standard deviation)
- Boolean expressions on attributes
- Named subsets of rows
- Example, Figure 27.8, pp. 887
- Construct a Manhattan distance matrix for rows
- Q? Which points are close to p1?
- Q? Which points are close to p2?
- Q? For these two clusters
- ...Compute the centers
- ...Provide boolean expression on attributes
- Q? Which point is furthest from its nearest neighbor?
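- A sketch of the distance-matrix construction in Python, with made-up (Age, Years of Service) rows rather than the values of Fig. 27.8:

    def manhattan(p, q):
        """L1 distance: sum of absolute coordinate differences."""
        return sum(abs(a - b) for a, b in zip(p, q))

    points = [(25, 2), (27, 3), (52, 20), (55, 25)]   # hypothetical rows
    matrix = [[manhattan(p, q) for q in points] for p in points]
    for row in matrix:
        print(row)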
S11: Cluster Mining - Physical Data Model
- Efficient algorithms to mine cluster-centers
- Given: a table R(A1, A2, ..., An),
- Find: clusters, i.e. center points
- Objective: Minimize average distance(rows of R, nearest center points)
- Constraint: K = number of clusters,
- ...optional - center points belong to R
- Naive Idea: Exhaustive enumeration
- If domains of attributes are small
- 1. Enumerate cross-product of domains of attributes
- 2. Enumerate size-K subsets of tuples
- 3. Choose the size-K subset with the smallest average distance to R.
- Alternative Idea: K-Means (Algorithm 27.4, pp. 886)
- Idea: greedy algorithm - each iteration reduces the average distance
- 1. Select K points at random
- 2. Assign tuples of R to nearest points
- 3. Compute center of each cluster
- 4. Repeat steps 2 and 3 until convergence to a local minimum (sketch below)
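- A hedged Python sketch of steps 1-4, reusing the manhattan() helper sketched on the previous slide; cluster means serve as centers, so centers need not belong to R:

    import random

    def k_means(rows, k, iterations=20, seed=0):
        """Lloyd-style iteration; converges only to a local minimum."""
        random.seed(seed)
        centers = random.sample(rows, k)            # step 1: random start
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for r in rows:                          # step 2: nearest center
                i = min(range(k), key=lambda j: manhattan(r, centers[j]))
                clusters[i].append(r)
            for i, c in enumerate(clusters):        # step 3: recompute centers
                if c:
                    centers[i] = tuple(sum(v) / len(c) for v in zip(*c))
        return centers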
- Example: Fig. 27.8 with Manhattan distance
- Try the K-means algorithm with the following two pairs of starting points
- ...(Age, Years of Service) = (0,0), (1,1)
- ...(Age, Years of Service) = (10,10), (50,20)
- Do you converge to the same solution from both starting points?
- Result = Fig. 27.7
- Strengths and Weaknesses of K-means
- Faster than exhaustive
- Heuristic - no guarantee of solution quality
- Assumes disjoint clusters of roughly equal size
S12: Outliers - Logical Data Model
- Problem Definition
- Given: A relation R(A1, A2, ..., An),
- ...and a similarity function between rows of R
- Find: rows in R which are dissimilar to most points in R
- Objectives: maximize dissimilarity function
- Constraints: number of outliers may be given, significant outliers
- Choices for similarity measures between rows
- Distance functions, e.g. Euclidean, Manhattan, string edits, graph-distance
- Lp metrics, e.g. L1 (Manhattan), L2 (Euclidean)
- Choices for aggregate dissimilarity measures
- Distance to K nearest neighbors
- Density of neighborhood outside the expected range
- Attribute difference with nearby neighbors
- Example, Figure 27.8, pp. 887, using Manhattan distance
- Q? Which point is furthest from its nearest neighbor?
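- A Python sketch of the first aggregate measure, scoring each row by the distance to its K-th nearest neighbor (K = 1 answers the question above); the points are made up:

    def manhattan(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    def knn_distance(rows, k=1):
        """Score per row: distance to its k-th nearest other row."""
        return [sorted(manhattan(r, s) for s in rows if s is not r)[k - 1]
                for r in rows]

    points = [(1, 1), (2, 1), (1, 2), (9, 9)]        # hypothetical rows
    print(max(zip(knn_distance(points), points)))    # most outlying point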
S13: Outlier Mining - Physical Data Model
- Efficient algorithms to mine outliers
- Given: a table R(A1, A2, ..., An),
- Find: outlier rows
- Objective: maximize dissimilarity_measure(rows of R, outlier row)
- Constraint: Top K or statistically significant
- Naive Idea: Exhaustive enumeration
- 1. Compute the dissimilarity measure for each row in R against the rest of R
- 2. Choose top-K rows, or those with scores outside expected range
- Alternative Idea: Clustering based
- Idea : Use clusters as summary of data to reduce computational costs
- 1. Cluster R, e.g. via K-means
- 2. Compute distance of each tuple in R to nearest cluster center
- 3. Choose top-K rows, or those with scores outside expected range
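- A hedged Python sketch of this scheme, reusing the k_means() and manhattan() helpers sketched on the earlier slides:

    def cluster_outliers(rows, k=3, top=2):
        """Rank rows by distance to their nearest K-means center."""
        centers = k_means(rows, k)                        # step 1
        scores = [min(manhattan(r, c) for c in centers)   # step 2
                  for r in rows]
        ranked = sorted(zip(scores, rows), reverse=True)
        return ranked[:top]                               # step 3: top-K rows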
- Strengths and Weaknesses of cluster based outliers
- Faster than exhaustive
- Heuristic - no guarantee of solution quality
- May miss outliers, e.g. 100 points on a unit circle and 1 at center