S1: Data Mining
- Learning Goals:
- Understand Data Mining Tasks
- Understand Data Mining Patterns
- Learn about Algorithms to mine the patterns
- Outline
- Data Mining - motivation, history
- Conceptual Data Models - Patterns, Categorizing patterns
- Logical Data Models - 4 Pattern families
- Physical Data Models - Algorithms
- Readings
- Learning strategy:
- Learn new terms and compare them to what we already know!
Copyright: S. Shekhar, C. S. Dept., University of Minnesota, Minneapolis, MN 55455.
S2: Data Mining - Application Domains
- Motivation - Many applications, e.g. Market Analysis
- Which items sell together? (diaper & beer)
- Who buys Palm PDAs? (Customer segmentation)
- Should this credit card application be approved?
- Section 27.2 (pp. 891) lists more
- Idea of data mining (Data-driven, inductive, backward)
- ...Get model from learning data --> check with test data
- ...Refine model if mismatch(prediction, reality)
- Different from forward deductive methods
- ...Build model --> deduce conclusions --> match with data
- ...Refine model if mismatch(prediction, reality)
- History
- Statistics and Machine Learning (AI) started it
- ...often main memory based software
- ...few patterns: classification, clustering, ...
- Database researchers contributed
- ...scalability to large databases >> main memories
- ...new patterns: e.g. association rules
- ...integration with SQL, data warehouses, etc.
- Study tools listed in Table 27.1 (pp. 893).
- Which came from a database vendor?
S3: Data Mining - Conceptual Data Model
- Given the following tables
- ...items( instance-id, item-type-id, attribute1, attribute2, ...)
- ...transactions( xid, subset of item-types, customer-id, timestamp )
- ...item-types( item-type-id, type-name, ... )
- Pattern family - Associations
- Subsets of similar item-types
- ...Customer buying item-type IT1 also buys item-type IT2
- Sequential Associations:
- ...Customer buying item-type IT1 will buy item-type IT2 later
- Pattern Family - Categorization / Segmentation
- partition item-types into groups
- Unsupervised Clustering - number of groups not given
- ...e.g. market segmentation
- Supervised Classification
- ...optional table category( instance-id, class-label )
- ...Examples of each category are available
- ...e.g. credit card application --> { grant, decline }
- Q. Map following to pattern families:
- amazon.com: people buying book b1 also bought book b2
- <Smoking, Lung cancer>, <El Nino, mild winter in Minnesota>
- Minnesota has 2 seasons: road-repair and snow
- Admit if (GPA > 3) and (GRE score > 90 percentile)
S4: Association Rules - Conceptual Data Model
- Association, association rules
- Let I = a set of items, T = a set of transactions
- ...Association A is a subset of I occurring together in T
- ...Support(S1) = fraction of T containing S1
- Let S1 and S2 be subsets of I
- ...Association Rule S1 --> S2
- ...Support(S1 --> S2) = support(S1 union S2)
- ...Confidence(S1 --> S2) = support(S1 union S2) / support(S1)
- Example
- Study the dataset Table 27.1 (pp. 872)
- Identify values for I, T
- Compute support and confidence for rules
- ...(Milk => Juice), (Bread => Juice)
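- A minimal sketch of these definitions in Python, using a made-up transaction list (not the textbook's Table 27.1):

    # Hypothetical transactions; each is a set of item-types.
    T = [{"milk", "bread", "juice"},
         {"milk", "juice"},
         {"bread", "butter"},
         {"milk", "bread", "butter"}]

    def support(itemset, transactions):
        """Fraction of transactions containing every item in `itemset`."""
        return sum(set(itemset) <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """support(lhs union rhs) / support(lhs)."""
        return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

    print(support({"milk", "juice"}, T))       # 0.5
    print(confidence({"milk"}, {"juice"}, T))  # 0.5 / 0.75 = 0.667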
- Variations of Association rules
- Hierarchical set of Items (Fig. 27.3, pp. 880)
- Negative associations
- Continuous attributes
- Q. Compare association rules with statistical concepts:
- support(X --> Y), confidence(X --> Y)
- conditional probability P(Y|X), correlation(X, Y)
S5: Association Rules - Logical Data Model
- Idea: Extend relational model and SQL
- Specify patterns over a relational schema
- Specify data mining request to extract
- ...pattern meeting certain criteria
- Analogy: data-cube model for DWH
- Status:
- Research prototypes for association rules
- Open problems
- ...Common data model for many families of patterns
- ...syntax for specifying patterns
- BTW: Statistical data mining supported in SQL (Oracle, MS SQL Server)
- Correlations, covariance
- Linear regression
- We may see association rules in SQL if the market picks up!
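- These primitives are easy to state outside SQL as well; a hedged Python sketch using the standard population formulas (no vendor SQL syntax assumed):

    import math

    def covariance(xs, ys):
        """Population covariance of two equal-length sequences."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

    def correlation(xs, ys):
        """Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
        return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

    def linear_regression(xs, ys):
        """Least-squares fit y = a*x + b; returns (a, b)."""
        a = covariance(xs, ys) / covariance(xs, xs)
        b = sum(ys) / len(ys) - a * sum(xs) / len(xs)
        return a, b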
S6: Association Rules Mining - Physical Data Model
- Efficient algorithms to mine association rules
- Given an item set I, a transaction set T, and thresholds
- Find Association Rules S1 --> S2 with
- ...support above the given threshold
- Naive Idea: Exhaustive enumeration (pp. 872)
- Steps
- ...Generate candidate subsets of I
- ...Scan T to compute their support
- ...Select subsets with adequate support
- Downside: complexity exponential(size(I))
- Better Idea: Apriori (Algorithm 27.1, pp. 873)
- Idea: Generate candidates in order of size
- Apriori_gen: combine subsets which differ only in last item
- ...(A1,A2,A3), (A1,A2,A6) => (A1,A2,A3,A6)
- ...(A1,A2,A3), (A1,A3,A6) => not combined
- ...efficient enumeration!
- Prune candidate families as soon as possible
- ...Note: Support is monotonically non-increasing in candidate size
- ...Support(S1 union S2) <= min( support(S1), support(S2) )
- ...Pruning can be applied in each iteration
- Steps - see pp. 860-861 and the sketch below
- Properties - correctness, completeness
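- A minimal sketch of this level-wise search in Python (an illustration of the idea, not Algorithm 27.1 verbatim; transactions are assumed to be sets of hashable item-ids):

    from itertools import combinations

    def apriori(transactions, min_support):
        """Return {frozenset: support} for all frequent itemsets."""
        n = len(transactions)
        def sup(s):
            return sum(s <= t for t in transactions) / n
        # Level 1: frequent single item-types.
        singletons = {frozenset([i]) for t in transactions for i in t}
        freq = {s: sup(s) for s in singletons if sup(s) >= min_support}
        result = dict(freq)
        while freq:
            # apriori_gen: join itemsets that differ only in the last item.
            level = sorted(tuple(sorted(s)) for s in freq)
            candidates = set()
            for a, b in combinations(level, 2):
                if a[:-1] == b[:-1]:
                    c = frozenset(a) | frozenset(b)
                    # Prune: every one-smaller subset must itself be frequent.
                    if all(frozenset(sub) in freq
                           for sub in combinations(sorted(c), len(c) - 1)):
                        candidates.add(c)
            freq = {c: sup(c) for c in candidates if sup(c) >= min_support}
            result.update(freq)
        return result

- The pruning test inside the join is exactly the monotonicity note above: a candidate survives only if all of its one-smaller subsets were frequent.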
- Data structures, storage methods
- Candidate set in Apriori algorithm
- ...Sorted candidate sets efficient for apriori_gen()
- Set data structures - bitmap vectors for I
- Candidate set of different sizes - Hash tree
- Other Ideas
- Sampling (27.2.3)
- Frequent Pattern Tree (Algorithm 27.2, section 27.2.4)
- ...Example Frequent Pattern Tree (Fig. 27.2, pp. 877)
S7: Association Rules Mining - Exercises
- Determine truth / falsehood for an AR: LHS --> RHS
- A. confidence(AR) is never less than support(AR).
- B. support(LHS --> RHS) = support(RHS --> LHS)
- C. confidence(LHS --> RHS) = confidence(RHS --> LHS)
- Consider the following set of transactions: (tid, itemset)
(100, <1, 3, 4>), (101, <2, 3, 5>),
(102, <1, 2, 3, 5>), (103, <2, 5>)
- Find all associations with support of at least 0.5
- Find all association rules with support of at least 0.5
- ...and confidence of at least 0.5
- Repeat the previous question for thresholds of 40 percent (a brute-force checker is sketched below)
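- A hedged brute-force checker for these exercises in Python (fine at this scale, exponential in general - which is why Apriori exists):

    from itertools import combinations

    T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    items = sorted({i for t in T for i in t})

    def sup(s):
        return sum(set(s) <= t for t in T) / len(T)

    for k in range(1, len(items) + 1):
        for s in combinations(items, k):
            if sup(s) >= 0.5:                       # frequent association
                print("association", s, sup(s))
                for j in range(1, k):               # rules that split s
                    for lhs in combinations(s, j):
                        rhs = tuple(i for i in s if i not in lhs)
                        if sup(s) / sup(lhs) >= 0.5:
                            print("  rule", lhs, "-->", rhs)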
- Set of Item-types = { A, B, C, D, E, F, G }
T1={A, D, E}, T6={A, D, E, G}
T2={A, D, F}, T7={A, B, D, F}
T3={A, E, F}, T8={A, B, D, F, G}
T4={A, B, D, F}, T9={B, D, E, G}
T5={B, D, F}, T10={A, D, E}
- Suppose half the transactions include item A and another
- ...half include item B. How many non-trivial association rules
- ...can be inferred from this information? Provide bounds
- ...on the support and confidence of these rules.
S8: Classifiers - Logical Data Model
- Problem Definition
- Given: A relation R(A1, A2, ..., An, C)
- Find: A function f mapping (A1, ..., An) to C
- Objectives: Minimize classification error
- ...difference( C, f(A1, ..., An) )
- ...over learning as well as testing samples
- Constraints: categorical C; independent, representative samples
- Choices for function "f"
- Decision Rules
- Decision Tree
- Others, e.g. Bayesian classifier, regression
- Decision Trees
- Structure
- ...Interior nodes = a categorical attribute from (A1, ..., An)
- ...Leaves = class labels from domain(C)
- ...Edges = a value from domain(Ai), Ai associated with the parent node
- Property: a search tree: tuples in R --> leaves, i.e. class labels
- Example Fig. 27.5 pp. 883
- Q? Find class labels for tuples in Fig. 27.6 using DT in Fig. 27.5
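- To make the structure concrete, a hedged sketch of a tiny decision tree as nested tuples in Python; the attributes and labels here are illustrative, not those of Fig. 27.5:

    # Interior node: (attribute, {edge-value: child}); leaf: a class label.
    tree = ("salary", {
        "low":  "decline",
        "mid":  ("age", {"young": "decline", "old": "grant"}),
        "high": "grant",
    })

    def classify(row, node=tree):
        """Walk from the root to a leaf using the row's attribute values."""
        while isinstance(node, tuple):        # interior node
            attribute, children = node
            node = children[row[attribute]]
        return node                           # leaf = class label

    print(classify({"salary": "mid", "age": "old"}))  # grant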
- Strengths and Weaknesses of Decision Trees
- Can work with heterogeneous data
- Decision boundaries are parallel to the axes
S9: Decision Tree Mining - Physical Data Model
- Efficient algorithms to mine decision trees
- Given: a table R(A1, A2, ..., An, C)
- Find: a decision tree T
- Objective: Minimize classification error
- Constraints: prefer smaller trees; avoid unique columns (e.g. primary keys)
- Naive Idea: Exhaustive enumeration
- 1. Generate candidate trees
- 2. Scan R to compute their classification errors
- 3. Select a tree with small classification error and small size
- Downside: complexity exponential(number of columns of R)
- Alternative Idea: ID3 (Algorithm 27.3, pp. 883)
- Idea: greedy algorithm - each node split reduces entropy
- 1. Select an attribute for the root node
- ...Choose the attribute with the highest information gain
- ...Partition tuples over domain of chosen attribute
- 2. Recursion on each child
- 3. Stopping criteria - don't recurse on nodes with few samples
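- A minimal Python sketch of the entropy and information-gain computations behind step 1 (categorical attributes assumed; rows are dicts keyed by attribute name):

    import math
    from collections import Counter

    def entropy(labels):
        """H = -sum p * log2(p) over the class-label distribution."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(rows, attribute, class_attr):
        """Entropy reduction from partitioning `rows` on `attribute`."""
        before = entropy([r[class_attr] for r in rows])
        parts = {}
        for r in rows:
            parts.setdefault(r[attribute], []).append(r[class_attr])
        after = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
        return before - after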
- Example: Fig. 27.6
- See spreadsheet for the entropy function on table partitions
- Compare attributes by entropy - salary wins
- salary in (20K..50K) => either age or acctBalance
- Result = Fig. 27.7
- Strengths and Weaknesses of ID3
- Faster than exhaustive
- Heuristic - no guarantee of solution quality
S10: Clustering - Logical Data Model
- Problem Definition
- Given: A relation R(A1, A2, ..., An),
- ...and a similarity function between rows of R
- Find: A set of groups of rows in R
- Objectives: Groups should be
- ...cohesive: tuples within a group are similar to each other
- ...not coupled: tuples across groups are dissimilar
- Constraints: number of clusters may be given, significant clusters
- Choices for similarity measures
- Distance functions, e.g. Euclidean, Manhattan, string edits, graph-distance
- Lp metrics, e.g. L1 (Manhattan), L2 (Euclidean)
- Choices for group representations
- Centers, e.g. mean, medoid, (mean and standard deviation)
- Boolean expressions on attributes
- Named subsets of rows
- Example, Figure 27.8, pp. 887
- Construct a Manhattan distance matrix for rows
- Q? Which points are close to p1?
- Q? Which points are close to p2?
- Q? For these two clusters
- ...Compute the centers
- ...Provide boolean expression on attributes
- Q? Which point is furthest from its nearest neighbor?
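- A sketch of the distance-matrix construction in Python, with made-up (Age, Years of Service) rows rather than the values of Fig. 27.8:

    def manhattan(p, q):
        """L1 distance: sum of absolute coordinate differences."""
        return sum(abs(a - b) for a, b in zip(p, q))

    points = [(25, 2), (27, 3), (52, 20), (55, 25)]   # hypothetical rows
    matrix = [[manhattan(p, q) for q in points] for p in points]
    for row in matrix:
        print(row)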
S11: Cluster Mining - Physical Data Model
- Efficient algorithms to mine cluster-centers
- Given: a table R(A1, A2, ..., An),
- Find: clusters, i.e. center points
- Objective: Minimize average distance(rows of R, nearest center points)
- Constraint: K = number of clusters,
- ...optional - center points belong to R
- Naive Idea: Exhaustive enumeration
- If domains of attributes are small
- 1. Enumerate cross-product of domains of attributes
- 2. Enumerate size-K subsets of tuples
- 3. Choose the size-K subset with the smallest average distance to R.
- Alternative Idea: K-Means (Algorithm 27.4, pp. 886)
- Idea: greedy algorithm - each iteration reduces the average distance
- 1. Select K points at random
- 2. Assign tuples of R to nearest points
- 3. Compute center of each cluster
- 4. Repeat steps 2 and 3 until convergence to a local minimum (sketch below)
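- A hedged Python sketch of steps 1-4, reusing the manhattan() helper sketched on the previous slide; cluster means serve as centers, so centers need not belong to R:

    import random

    def k_means(rows, k, iterations=20, seed=0):
        """Lloyd-style iteration; converges only to a local minimum."""
        random.seed(seed)
        centers = random.sample(rows, k)            # step 1: random start
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for r in rows:                          # step 2: nearest center
                i = min(range(k), key=lambda j: manhattan(r, centers[j]))
                clusters[i].append(r)
            for i, c in enumerate(clusters):        # step 3: recompute centers
                if c:
                    centers[i] = tuple(sum(v) / len(c) for v in zip(*c))
        return centers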
- Example: Fig. 27.8 with Manhattan distance
- Try the K-means algorithm with the following two pairs of starting points
- ...(Age, Years of Service) = (0,0), (1,1)
- ...(Age, Years of Service) = (10,10), (50,20)
- Do you converge to the same solution from both starting points?
- Result = Fig. 27.7
- Strengths and Weaknesses of K-means
- Faster than exhaustive
- Heuristic - no guarantee of solution quality
- Assumes disjoint clusters of roughly equal size
S12: Outliers - Logical Data Model
- Problem Definition
- Given: A relation R(A1, A2, ..., An),
- ...and a similarity function between rows of R
- Find: rows in R which are dissimilar to most points in R
- Objectives: maximize dissimilarity function
- Constraints: number of outliers may be given, significant outliers
- Choices for similarity measures between rows
- Distance functions, e.g. Euclidean, Manhattan, string edits, graph-distance
- Lp metrics, e.g. L1 (Manhattan), L2 (Euclidean)
- Choices for aggregate dissimilarity measures
- Distance to K nearest neighbors
- Density of neighborhood outside the expected range
- Attribute difference with nearby neighbors
- Example, Figure 27.8, pp. 887, using Manhattan distance
- Q? Which point is furthest from its nearest neighbor?
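- A Python sketch of the first aggregate measure, scoring each row by the distance to its K-th nearest neighbor (K = 1 answers the question above); the points are made up:

    def manhattan(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    def knn_distance(rows, k=1):
        """Score per row: distance to its k-th nearest other row."""
        return [sorted(manhattan(r, s) for s in rows if s is not r)[k - 1]
                for r in rows]

    points = [(1, 1), (2, 1), (1, 2), (9, 9)]        # hypothetical rows
    print(max(zip(knn_distance(points), points)))    # most outlying point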
S13: Outlier Mining - Physical Data Model
- Efficient algorithms to mine outliers
- Given: a table R(A1, A2, ..., An),
- Find: outlier rows
- Objective: maximize dissimilarity_measure(rows of R, outlier row)
- Constraint: Top K or statistically significant
- Naive Idea: Exhaustive enumeration
- 1. Compute the dissimilarity measure for each row in R against the rest of R
- 2. Choose top-K rows, or those with scores outside expected range
- Alternative Idea: Clustering based
- Idea : Use clusters as summary of data to reduce computational costs
- 1. Cluster R, e.g. via K-means
- 2. Compute distance of each tuple in R to nearest cluster center
- 3. Choose top-K rows, or those with scores outside expected range
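- A hedged Python sketch of this scheme, reusing the k_means() and manhattan() helpers sketched on the earlier slides:

    def cluster_outliers(rows, k=3, top=2):
        """Rank rows by distance to their nearest K-means center."""
        centers = k_means(rows, k)                        # step 1
        scores = [min(manhattan(r, c) for c in centers)   # step 2
                  for r in rows]
        ranked = sorted(zip(scores, rows), reverse=True)
        return ranked[:top]                               # step 3: top-K rows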
- Strengths and Weaknesses of cluster based outliers
- Faster than exhaustive
- Heuristic - no guarantee of solution quality
- May miss outliers, e.g. 100 points on a unit circle and 1 at center