S1: Data Warehouses and Data Mining
- Learning Goals:
- Understand DWH and Data Mining facilities
- New Performance needs
- Solutions: New features in database server
- Outline
- Datawarehouses - motivation, operations,
- Conceptual Data Models
- Logical Data Models
- Physical Data Models
- Readings
- Learning strategy:
- Learn new terms, Compare those to what we know!
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S2: DataWarehouses - history
- OLAP Databases (vs. OLTP Databases)
- Infrequent Periodically refresh
- Frequent read-only aggregate reports
- Central repository of data for analysis
- Process shown in Fig. 28.1, pp. 901
- Marketed by consultants in late 1980s
- Sold methodology and DWH servers
- Methodology to gather data from many sources
- ...data cleaning, refresh etc.
- New vocabulory - OLAP, star schema, pivot, ...
- Custom software products with consulting packages
- ...Many absorbed into relational DBMS product families !
- Academics got involved in early 1990s
- Data Cube - an abstract model of hierarchical reports
- ...specifies all summary reports together
- SQL - cube operator
- Performance enhancement - e.g. star join strategy
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S3: Datawarehouses - Common Features
- Motivation: OLAP, DSS, EIS, ...
- Few metrics of evaluations - e.g. sales, profit
- Analysis of impact of many dimension on the metric
- ...Via hierarchy of multiple aggregation reports
- ...e.g. top ten stores = > drill down to find out why
- Abstract Example
- Multi-dim data - Figure 28.4, pp. 904
- Rollup - Figure 28.5, pp. 905
- Drill down (by dimension & concept hierarchy) - Figure 28.6, pp. 905
- Ex. relation R(SI) schema: (store, item, sales)
- Data: (s1, i1, 3), (s1, i2, 40), (s2, i1, 20), (s2, i2, 6)
- Report R(S): Sales summary by stores: (s1, 43), (s2, 26)
- Report R(I): Sales summary by items: (i1, 23), (i2, 46)
- Report R on total Sales : 69
- Aggregation Report Hierarchies
- ...dimension lattice cube: R(SI) -- > { R(S), R(I) } -- > R
- ...concept hierarchy, e.g. R(S) -- > R(Store groups by region)
- Operations in a datawarehouse (Section 28.5, pp. 910)
- Pivot: get non-relational view
- Dice/Slice: select parts of reports
- Navigation in hierarchy of reports
- ...rollup: R(SI) -- > R(S) -- > R
- ...drill-down: R -- > R(I) -- > R(IS)
- Match items in "Data" menu in Excel to DWH operations
- PivotTable and PivotChart Report...
- SubTotals
- Group and Ungroup
- Consolidate
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S4: DataWarehouses - Conceptual Data Model
- Multi-dimensional data model
- Star Schema (Fig. 28.7, pp. 906)
- 2 types of tables: 1 fact table, dimension tables
- dimension table Ti = < Di key, Di attributes >
- fact table = < D1, D2, ..., Dn, mesaurement y >
- Comparison with Entity Relationship Model
- Star schema is subset of ER model
- ...It is an n-ry relationship
- ...1 relationship table, N entities
- Snowflake Schema (Fig. 28.8)- normalized dim. tables
- Multi-dimensional representation of fact table
- Function y = f(D1, D2, ..., Dn)
- ...natural for statistical analysis
- N dimension cube: domain(D1) X domain(D2) X ... X domain(Dn)
- ...each cell has value of y
- ...pivot spreadsheet view if n = 2
- Smaller cubes to show summary reports
- Q? Show multi-dim. cube representation of
- ...reports R(SI), R(S), R(I) and R.
- Arrange four cubes and sub-cubes to show
- ...relationship of partial totals to total-sales
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S5: DataWarehouses - Logical Data Model
- OLAP Analysis Queries to generate reports
- Fact table: y = f(D1, D2, ..., Dn)
- Reports: aggregate(y) = g(subset of dimensions)
- ...power(2, N) distinct aggregation reports
- Each report needs a "group by" SQL query
- Extending SQL Query Language
- Cube operator - generates data for all reports
- ...Shorthand for power(2, N) group by queries
- Cube(Fact table, subset with N dimensions) =
- ...Union of power(2,N) "group by" queries on the fact table
- ALL keyword - dual of NULL
- ...fills in columns missing from summary reports
- ...to denote that those have been aggregated away.
- cube( R(SI) ) schema = (store, item, sales) = R(SI) schema
- tuples: (s1, i1, 3), (s1, i2, 40), (s2, i1, 20), (s2, i2, 9)
- ...(s1, all, 43), (s2, all, 26) from R(S)
- ...(all, i1, 23), (all, i2, 46) from R(I)
- ...(all, all, 69) from R
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S6: DataWarehouses - Logical Data Model
- Rollup( Fact table, ordered list of N dimensions ) =
- Union of N "group by" queries on the fact table
- ...Aggregate away I right-most dimensions, I = 1, 2, ..., N
- rollup(fact(store, item, sales), (store, item) )
- tuples: (s1, i1, 3), (s1, i2, 40), (s2, i1, 20), (s2, i2, 9)
- ...(s1, all, 43), (s2, all, 26) from R(S)
- ...(all, all, 69) from R
- rollup(fact(store, item, sales), (item, store) )
- tuples: (s1, i1, 3), (s1, i2, 40), (s2, i1, 20), (s2, i2, 9)
- ...(all, i1, 23), (all, i2, 46) from R(I)
- ...(all, all, 69) from R
- ...Note: result lists column in same order as in the list!
- Ex. Write SQL expressions to get reports R(S)
- (A) to get reports R(S) from data table R(SI)
- (B) to get reports R(I) from cube( R(SI) )
- (C) to get cube( R(SI), (SI) ) from data table R(SI)
- (D) to get rollup( R(SI), (SI) ) from data table R(SI)
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S7: DataWarehouses - Logical Data Model
- Oracle examples for part (C) in previous slide
- given table fact(store, item, sales) = R(SI)
select store, item, sum(sales)
from fact
group by cube(store, item)
- Oracle examples for part (D) in previous slide
select store, item, sum(sales)
from fact
group by rollup(store, item)
- Oracle operations: supp. book - pp. 220
- GROUP BY cluase may have ROLLUP, CUBE
- Operations: ranking : RANK,PERCENT_RANK, ROW_NUMBER, ...
- ...windowing/reporting : STDDEV, VARIANCE, FIRST_VALUE, ...
- ...Statistics : covariance, correlation, linear regression
- ...Lag/lead : compare similar time periods e.g. 2001 Q1, 2002 Q1
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)
S8: DataWarehouses - Physical Data Model
- DWH logical data model can be implemented on
- Custom systems using flat files
- Relational database servers (Relational OLAP)
- New performance needs of DWH - > new solutions
- ...bit-map index, star-join strategy,
- ...view maintenace, parallel processing
- Storage Methods: Bit-map index (Fig. S9.3, pp 218)
- index = < attribute-value, bitmap vector >
- Example of bit-map indices
- ...R(SI).store index tuples: (s1, 1100), (s2, 0011)
- ...R(SI).item index tuples: (i1, 1010), (i2, 1010)
- Storage overhead comparison(bit-map index, B+trees)
- ...bitmaps = 1 bit per tuple in the table
- ...index-size = (nR / 8) * size(Domain)
- ...size of B+tree = (nR)*(key-size + pointer size)
- Useful for Attributes with a small domain of values
- ...e.g. state - domain has 50 values for USA
- Join Strategies: Star Join Strategy
- For N-ry join (fact table, all dimension tables)
- Scans fact table once only!
- Better than sequence of N binary joins
- ...when size(fact table) >> size(dimension tables)
- Suppl. book, pp. 217 under Query Optimization
- Q. Review the relational table in Figure 6.1 (pp. 158)
- Which column may be indexed using bit-map index?
- How many entries in index? How many bit per bit-vectors?
- List first few bits for each bit-vectors in index.
- Q. Compare I/O cost for bit-map index and B+tree secondary index for
- assume index : table.A1, table.A2
- select(table, A1 = value1)
- select(table, A1 = value1 and A2 = value2)
- insert a new tuple
- Q. How will Oracle cluster data-structure store a cluster
- ...consisting of a fact table, and its dimension tables?
Copyright: S. Shekhar, C. S. Dept.,
University of Minnesota, Minneapolis, MN 55455. ,,.(home)