Enclosed are a few sample questions for the mid-semester examination.
More questions will be added by Thursday's lecture.
Validation Methodologies
A validation methodology allows researchers to back up their claims
in an objective manner. It allows the reader of a paper to reproduce the
results and conclusions of the author. This is a fundamental requirement
for science and scientific research. List a few papers in your reading
list and their validation methodologies. Do the papers describe their
methodology in sufficient detail to ensure reproducibility of their
claims?
Which validation methodologies are commonly used in the database
literature to back up the following kinds of claims:
correctness, completeness
deadlock freedom, absence of starvation
Relational algebra and tuple calculus are equivalent in expressive power
worst case complexity
superior throughput (e.g. TPS)
superior response time for a given benchmark (e.g. Sequoia 2000)
Simulation is a popular validation methodology in Database Systems
Research. What are the strengths and weaknesses of this methodology?
What is the statistical "confidence level" of performance
measures (e.g. mean response time) based on simulation?
Why is it important to design simulation experiments to provide
appropriate confidence levels?
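To make the notion of a confidence level concrete, here is a minimal sketch of how a confidence interval for a simulated mean response time might be computed, assuming independent replications and a large-sample normal approximation; the function name `mean_ci` and the Gaussian workload are illustrative assumptions, not taken from any paper in the readings.

```python
import math
import random

def mean_ci(samples, z=1.96):
    """Approximate 95% confidence interval for the mean, assuming
    independent samples and a large-sample normal approximation."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

random.seed(42)
# Simulated mean response times (ms) from 30 independent replications.
runs = [random.gauss(120, 15) for _ in range(30)]
lo, hi = mean_ci(runs)
print(f"mean response time in [{lo:.1f}, {hi:.1f}] ms with ~95% confidence")
```

The interval narrows as the number of replications grows, which is why experiments must be designed with enough replications to reach the desired confidence level.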
Compare and contrast the following pairs of concepts
within simulation methodology.
(a) Workload simulation: trace-driven vs. statistical-distribution-driven,
(b) Performance measure: response time vs. throughput,
(c) Candidate comparison: statistical-mean-based vs. 95th-percentile-based.
Hint: Define each term, and identify the strengths and weaknesses of each.
Examine the performance studies reported in the various papers covered
in the lectures. Identify the style of workload simulation, the choice
of performance measures, and the candidate comparison. Whenever possible,
use the alternatives identified in the previous question for these dimensions.
Repeat the previous question for the dimension of estimating the
performance of alternative candidates.
The performance of candidates can be estimated in two different ways:
(a) algebraic cost model or (b) measurements on execution of a
software prototype.
For example, query optimizers use algebraic cost models to compare
performance of alternative query processing strategies (recall Csci 5708).
In contrast, some research papers (e.g. Sequoia 2000, Buffer Pool
Management) implemented candidates and measured time elapsed during execution.
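A toy illustration of alternative (a), the algebraic cost model: the page-count formulas below are textbook-style I/O estimates of the kind an optimizer might use; the exact formulas and function names are assumptions for illustration, not the model of any particular system.

```python
def seq_scan_cost(pages):
    """Estimated I/O cost of scanning all pages of a table."""
    return pages

def nested_loop_join_cost(outer_pages, inner_pages):
    """Naive page-level nested loop join: read the outer once,
    and rescan the inner once per outer page."""
    return outer_pages + outer_pages * inner_pages

def index_lookup_cost(btree_height, leaf_pages=1):
    """Point lookup: root-to-leaf traversal plus the data page."""
    return btree_height + leaf_pages

# Compare strategies for a 100-page outer and 50-page inner table.
print(nested_loop_join_cost(100, 50))  # 5100 page I/Os
print(seq_scan_cost(100))              # 100 page I/Os
```

An algebraic model like this is cheap to evaluate over many candidates, while measurement of a prototype captures effects (caching, CPU cost) the model abstracts away.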
Consider the claims made in research papers comparing data models,
e.g. the object, object-relational, and relational data models.
Which methodologies can be used to compare them?
What are the evaluation measures for this task?
Weekly Topics
Past, Present and Future of Database Research
Justify the need for DB research even though
main memory is getting cheaper and bigger,
CPUs and networks are getting faster,
DB technology is maturing, and
commercial DBMSs are becoming commodities.
List 2 key research accomplishments of database research
in the following subareas: Relational Databases,
Transaction Management, Distributed Databases.
List 5 key research challenges for database research
according to the readings.
List a few current research challenges for database research
by answering the following questions:
Name the major pull information technologies of the last decade.
Name the major killer applications of the last decade.
List a few challenges presented to traditional databases
by these new technologies and applications.
Describe the TPC and Sequoia 2000 benchmarks in the following areas:
Hardware/Software features being evaluated
Measures of performance
Datasets
Workload (queries, computations)
Scalability parameters
Consider the problem of purchasing a spatial database management
system to support the data types and operations in the
Open Geodata Interchange Standard.
Critique the Sequoia 2000 benchmark for comparing DBMSs in this context.
Consider the problem of purchasing a database management
system to support e-commerce transactions on the Internet.
Critique the TPC benchmarks for comparing DBMSs in this context.
In the Benchmark Handbook, Gray suggests that the goal of
benchmarking is to assist end-users in answering the following
question: "Which computer should I buy for my workload?"
Benchmarking allows one to choose the cheapest system that does the job.
List a few other questions that benchmarking can help answer
for programmers, software product developers, and end-users.
In the Benchmark Handbook,
Gray defined four criteria for evaluating benchmarks: Relevance,
Portability, Scalability, Simplicity.
Compare the TPC and Sequoia 2000 benchmarks on these criteria.
Data Models and Query Languages
Compare and contrast the following pairs:
Relational data model and object-oriented data model
Relational data model and object-relational data model
Object-relational data model and object-oriented data model
Rules and Integrity Constraints
Cold cache and Warm/hot cache (cold start vs. warm start)
Consider the support for spatial data types and operators
(e.g. those in the Open Geodata Interchange Standard) inside a query language
such as SQL. Which data models facilitate it? Explain.
Type constructors (e.g. arrays, composites, collections) allow
definition of user-defined data types (e.g. polygon) from base types (e.g.
float).
List two type constructors supported by Postgres.
List two type constructors supported by ObjectStore.
What are path expressions and nested dot notations?
Does the query language of Postgres support them?
Does the query language of ObjectStore support them?
Define 'impedance mismatch' and explain how ObjectStore
mitigates this mismatch.
Does POSTGRES support inheritance? Does it allow a class to be derived?
Explain briefly how 'swizzling' supports the ObjectStore
goal of enhancing database performance.
How does POSTGRES justify its use of no-overwrite storage?
Given that sequential I/O is substantially faster than random I/O,
is the no-overwrite solution guaranteed to offer worse performance?
List two unique issues related to the data models and query languages
needed by Spatial Databases. Explain briefly.
Storage and Indexing
Compare and contrast design choices made by Operating Systems and
those made by DBMSs in the following areas:
index-structures on collection of pages inside a file/table,
allocation of disk blocks (in terms of their
physical proximity on disk tracks/cylinders) for a file/table,
buffer pool management (prefetching, replacement policy).
Define the following terms and explain their effect on buffer pool management:
working set size, page access pattern.
Define the following page replacement policies: LRU, MRU, FIFO,
hot-set, DBMIN.
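As a concrete reference point for these policies, a minimal LRU buffer pool can be sketched as below; MRU and FIFO differ only in which page is chosen for eviction. The class name and interface are illustrative assumptions, not any DBMS's actual buffer manager.

```python
from collections import OrderedDict

class LRUBufferPool:
    """Tiny page cache that evicts the least recently used page."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> frame (dummy payload)
        self.hits = self.misses = 0

    def access(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)     # mark most recently used
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict the LRU page
            self.pages[page_id] = object()

pool = LRUBufferPool(capacity=3)
for p in [1, 2, 3, 1, 4, 1]:   # page 2 is evicted when page 4 arrives
    pool.access(p)
print(pool.hits, pool.misses)  # 2 hits, 4 misses
```

Note how LRU fits a re-referencing point-query workload, while a looping sequential scan slightly larger than the pool would miss on every page under LRU, which motivates MRU, hot-set, and DBMIN.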
What is the difference between the page access patterns of
programs executed by operating systems and those of the
query processing strategies executed inside a DBMS?
Answer the following about the DBMIN buffer pool management policy:
Identify the working set sizes, page access patterns, and
appropriate page replacement policies for the
following query processing strategies:
(a) point search using an index,
(b) scanning all pages of a table,
(c) nested loop join when the inner table has no index.
Define the Query Locality Set Model (QLSM).
Is the DBMIN page replacement strategy useful for Object-Oriented
Databases? Explain briefly.
Is the DBMIN page replacement strategy useful for Spatial Databases?
Explain briefly.
List two unique issues related to the storage and indexing needs
of Spatial Databases. Explain briefly.
Transactions, Recovery
List the four ACID properties for transactions.
Compare and contrast: transaction failure, system failure, and
media failure.
Which features of the following affect the design of recovery algorithms:
main memory buffer pool management
file system data-structures on disk
Answer the following about the Computing Surveys paper on recovery:
Define the dimensions (e.g. atomic, steal, force, checkpoint-type)
used to classify crash recovery algorithms.
List three types of checkpointing. How do they affect recovery
algorithms?
Does it cover recovery when the computer has a stable main
memory with battery backup? Justify.
Concurrency Control
What are the problems with using two-phase read/write locking
for concurrency control on operations on B+ trees?
Compare and contrast: B+ trees, B-link Trees.
Compare and contrast: two-phase read/write locking, tree locking
protocol (Yao, Lehman).
Answer the following about the Yao/Lehman tree locking protocol:
Does it follow two-phase locking?
Is it free of deadlock? Of starvation?
What is the order of visiting nodes for a search operation? For an insert operation?
Simulation is a popular validation methodology in Database Systems
Research. How can a simulation study help in the study of tree locking
protocols?
Query Processing
Distributed Databases
Compare and contrast the following pairs:
vertical and horizontal partitioning
fragmentation and replication
semi-join and bloom-join
query optimization in a DBMS vs. that in a distributed DBMS.
commit protocol in a DBMS vs. that in a distributed DBMS.
Consider the following assumptions made in distributed database
research in early 1980s. Identify the applications (e.g. mobile wireless,
desk-top clients to internet, parallel databases, mirrored databases),
where these assumptions may still be true:
unreliable communication
single schema across all sites
vertical and horizontal fragmentation
Parallel Databases
Data Warehouse
Compare and contrast the following pairs:
data cube, group by aggregate queries in SQL
NULL, ALL
B+ tree index and bitmap index
bitmap index and bit-sliced index
projection index and value-list index
group set index and join index
Define the following classes of aggregate functions:
distributive, algebraic, holistic.
Classify the following geometric aggregate functions as
distributive, algebraic, or holistic:
(a) minimum bounding circle, (b) center of mass, (c) minimum bounding
orthogonal rectangle, (d) convex hull.
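One way to see why the minimum bounding orthogonal rectangle is distributive: MBRs computed over partitions of the data can be merged into the global MBR without revisiting the points. A minimal sketch, with illustrative function names:

```python
def mbr(points):
    """Minimum bounding orthogonal rectangle (xmin, ymin, xmax, ymax)
    of a list of 2-D points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def merge_mbr(a, b):
    """Distributivity: two partition MBRs combine into the global MBR."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

part1 = [(0, 0), (2, 3)]
part2 = [(5, 1), (4, 7)]
# Merging partition MBRs gives the same result as one global pass.
print(merge_mbr(mbr(part1), mbr(part2)) == mbr(part1 + part2))  # True
```

By contrast, a holistic aggregate admits no such constant-size partition summary.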
Which index is most efficient for the following SQL aggregate
operations: minimum, average, count? Briefly justify your answer.
You may use Table 3.5 (p. 549) from the textbook. However, the
justification should be based on first principles of database systems.
Which attributes are suitable for bitmap indices? Characterize
these attributes in terms of
the cardinality of their domain of values, the number of
bytes needed to store their values, the read-to-update ratio, and the
selectivity of range queries on the attribute.
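A minimal sketch of why low-cardinality attributes suit bitmap indices: the index keeps one bit-vector per distinct value, so storage grows with the domain size, while selection predicates reduce to bitwise operations over the vectors. Representing each bit-vector as a Python integer is an illustrative assumption, not how any particular DBMS stores bitmaps.

```python
def build_bitmap_index(values):
    """One bit-vector per distinct attribute value; bit i is set
    when row i holds that value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

# Low-cardinality attribute: only two distinct values over five rows.
genders = ["M", "F", "F", "M", "F"]
idx = build_bitmap_index(genders)
# AND/OR in queries become bitwise ops; COUNT becomes a popcount.
print(bin(idx["F"]))  # bits for rows 1, 2, 4 are set
```

A high-cardinality attribute (e.g. a key) would need nearly one bit-vector per row, which is why B+-tree or value-list indices are preferred there.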
Data Mining
Consider the following sets of items and transactions:
Items: I1 = CS courses; I2 = CS students.
Transactions (i.e. subsets of items): T1 = courses on a student
transcript (subsets of I1); T2 = students in a course (subsets of I2).
What is the typical size of transactions in T1?
Provide an example of an association rule using I1 and T1.
Explain the meaning of association rules using I1 and T1.
What is the typical size of transactions in T2?
Provide an example of an association rule using I2 and T2.
Explain the meaning of association rules using I2 and T2.
Compare and contrast the following pairs:
association vs. statistical correlation
confidence vs. support
Apriori algorithm vs. AprioriTid algorithm
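For reference, the support and confidence of a rule lhs => rhs can be computed directly from a set of transactions. The transcripts and course numbers below are hypothetical examples in the style of I1/T1, not real data.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Conditional frequency of rhs given lhs:
    support(lhs union rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Hypothetical student transcripts (T1-style transactions over I1 = courses).
transcripts = [
    {"Csci5708", "Csci8701"},
    {"Csci5708", "Csci8701", "Csci5511"},
    {"Csci5708", "Csci5511"},
    {"Csci5511"},
]
rule_lhs, rule_rhs = {"Csci5708"}, {"Csci8701"}
print(support(rule_lhs | rule_rhs, transcripts))   # 0.5
print(confidence(rule_lhs, rule_rhs, transcripts)) # 2/3
```

Support measures how often the rule applies at all; confidence measures how reliable it is when the left-hand side holds, which is why the two are thresholded separately.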
Consider a brute-force algorithm to find association rules. Given N
items, it first generates all possible subsets of the items. The set of
transactions is then scanned to compute the support of each subset.
Subsets with support below the given threshold are discarded to produce
the final set of association rules.
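The brute-force algorithm described above can be sketched as follows. Enumerating every itemset is exponential (2^N subsets of N items), which is exactly the cost that apriori's candidate pruning avoids; the names below are illustrative.

```python
from itertools import combinations

def brute_force_frequent(items, transactions, min_support):
    """Enumerate every non-empty itemset, count its support in a pass
    over the transactions, and keep those meeting min_support."""
    frequent = {}
    for k in range(1, len(items) + 1):
        for itemset in combinations(sorted(items), k):
            support = sum(1 for t in transactions if set(itemset) <= t)
            if support >= min_support:
                frequent[itemset] = support
    return frequent

transactions = [{"db", "os"}, {"db", "ai"}, {"db", "os", "ai"}]
items = {"db", "os", "ai"}
print(brute_force_frequent(items, transactions, min_support=2))
```

Despite the exponential enumeration, the brute-force approach makes only one scan per subset size, which hints at when it might still be acceptable (very small N or very dense data).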
Compare the computational complexity of this algorithm with that of the
apriori algorithm for finding association rules. When would you use the
brute-force algorithm over the apriori algorithm?