Enclosed are a few sample questions for the mid-semester examination.
More questions will be added by Thursday's lecture.
Validation Methodologies
A validation methodology allows researchers to back up their claims
in an objective manner. It allows the reader of a paper to reproduce the
results and conclusions of the author. This is a fundamental requirement
for science and scientific research. List a few papers in your reading
list and their validation methodologies. Do the papers describe their
methodology in sufficient detail to ensure reproducibility of their
claims?
Which validation methodologies are commonly used in the database
literature to back up the following kinds of claims:
correctness, completeness
deadlock freedom, absence of starvation
Relational algebra and tuple calculus are equivalent in expressive power
worst case complexity
superior throughput (e.g. TPS)
superior response time for a given benchmark (e.g. Sequoia 2000)
Simulation is a popular validation methodology in Database Systems
Research. What are the strengths and weaknesses of this methodology?
What is the statistical "confidence level" of performance
measures (e.g. mean response time) based on simulation?
Why is it important to design simulation experiments to provide
appropriate confidence levels?
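To make the notion of a confidence level concrete, here is a minimal sketch of how a confidence interval for a simulated mean response time might be computed, assuming independent replications and a large-sample normal approximation; the function name `mean_ci` and the Gaussian workload are illustrative assumptions, not taken from any paper in the readings.

```python
import math
import random

def mean_ci(samples, z=1.96):
    """Approximate 95% confidence interval for the mean, assuming
    independent samples and a large-sample normal approximation."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half

random.seed(42)
# Simulated mean response times (ms) from 30 independent replications.
runs = [random.gauss(120, 15) for _ in range(30)]
lo, hi = mean_ci(runs)
print(f"mean response time in [{lo:.1f}, {hi:.1f}] ms with ~95% confidence")
```

The interval narrows as the number of replications grows, which is why experiments must be designed with enough replications to reach the desired confidence level.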
Compare and contrast the following pairs of concepts
within simulation methodology.
(a) Workload simulation: trace-driven vs. statistical-distribution-driven,
(b) Performance measure: response time vs. throughput,
(c) Candidate comparison: statistical-mean-based vs. 95th-percentile-based.
Hint: Define each term, and identify the strengths and weaknesses of each.
Examine the performance studies reported in the various papers covered
in the lectures. Identify the style of workload simulation, the choice
of performance measures, and the candidate comparison. Whenever possible,
use the alternatives identified in the previous question for these dimensions.
Repeat the previous question for the dimension of estimating the
performance of alternative candidates.
The performance of candidates can be estimated in two different ways:
(a) algebraic cost model or (b) measurements on execution of a
software prototype.
For example, query optimizers use algebraic cost models to compare
performance of alternative query processing strategies (recall Csci 5708).
In contrast, some research papers (e.g. Sequoia 2000, Buffer Pool
Management) implemented candidates and measured time elapsed during execution.
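A toy illustration of alternative (a), the algebraic cost model: the page-count formulas below are textbook-style I/O estimates of the kind an optimizer might use; the exact formulas and function names are assumptions for illustration, not the model of any particular system.

```python
def seq_scan_cost(pages):
    """Estimated I/O cost of scanning all pages of a table."""
    return pages

def nested_loop_join_cost(outer_pages, inner_pages):
    """Naive page-level nested loop join: read the outer once,
    and rescan the inner once per outer page."""
    return outer_pages + outer_pages * inner_pages

def index_lookup_cost(btree_height, leaf_pages=1):
    """Point lookup: root-to-leaf traversal plus the data page."""
    return btree_height + leaf_pages

# Compare strategies for a 100-page outer and 50-page inner table.
print(nested_loop_join_cost(100, 50))  # 5100 page I/Os
print(seq_scan_cost(100))              # 100 page I/Os
```

An algebraic model like this is cheap to evaluate over many candidates, while measurement of a prototype captures effects (caching, CPU cost) the model abstracts away.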
Consider the claims made in research papers comparing data models,
e.g. the object, object-relational, and relational data models.
Which methodologies can be used to compare them?
What are the evaluation measures for this task?
Weekly Topics
Past, Present and Future of Database Research
Justify the need for DB research even though
main memory is getting cheaper and bigger,
CPUs and networks are getting faster,
DB technology is maturing, and
commercial DBMSs are becoming commodities.
List 2 key research accomplishments of database research
in the following subareas: Relational Databases,
Transaction Management, Distributed Databases.
List 5 key research challenges for database research
according to the readings.
List a few current research challenges for database research
by answering the following questions:
Name the major pull information technologies of the last decade.
Name the major killer applications of the last decade.
List a few challenges presented to traditional databases
by these new technologies and applications.
Describe the TPC and Sequoia 2000 benchmarks in the following areas:
Hardware/Software features being evaluated
Measures of performance
Datasets
Workload (queries, computations)
Scalability parameters
Consider the problem of purchasing a spatial database management
system to support the data types and operations in the
Open Geodata Interchange Standard.
Critique the Sequoia 2000 benchmark for comparing DBMSs in this context.
Consider the problem of purchasing a database management
system to support e-commerce transactions on the Internet.
Critique the TPC benchmarks for comparing DBMSs in this context.
In the Benchmark Handbook, Gray suggests that the goal of
benchmarking is to assist end-users in answering the following
question: "Which computer should I buy for my workload?"
Benchmarking allows one to choose the cheapest system that does the job.
List a few other questions that benchmarking can help answer
for programmers, software product developers, and end-users.
In the Benchmark Handbook,
Gray defined four criteria for evaluating benchmarks: Relevance,
Portability, Scalability, Simplicity.
Compare the TPC and Sequoia 2000 benchmarks on these criteria.
Data Models and Query Languages
Compare and contrast the following pairs:
Relational data model and object-oriented data model
Relational data model and object-relational data model
Object-relational data model and object-oriented data model
Rules and Integrity Constraints
Cold cache and Warm/hot cache (cold start vs. warm start)
Consider the support for spatial data types and operators
(e.g. those in the Open Geodata Interchange Standard) inside a query language
such as SQL. Which data models facilitate it? Explain.
Type constructors (e.g. arrays, composites, collections) allow
definition of user-defined data types (e.g. polygon) from base types (e.g.
float).
List two type constructors supported by Postgres.
List two type constructors supported by ObjectStore.
What are path expressions and nested dot notations?
Does the query language of Postgres support them?
Does the query language of ObjectStore support them?
Define 'impedance mismatch' and explain how ObjectStore
mitigates this mismatch.
Does POSTGRES support inheritance? Does it allow a class to be derived?
Explain briefly how 'swizzling' supports the ObjectStore
goal of enhancing database performance.
How does POSTGRES justify its use of no-overwrite storage?
Given that sequential I/O is substantially faster than random I/O,
is the no-overwrite solution guaranteed to offer worse performance?
List two unique issues related to the data models and query languages
needed by Spatial Databases. Explain briefly.
Storage and Indexing
Compare and contrast design choices made by Operating Systems and
those made by DBMSs in the following areas:
index-structures on collection of pages inside a file/table,
allocation of disk blocks (in terms of their
physical proximity on disk tracks/cylinders) for a file/table,
buffer pool management (prefetching, replacement policy).
Define the following terms and explain their effect on buffer pool management:
working set size, page access pattern.
Define the following page replacement policies: LRU, MRU, FIFO,
hot-set, DBMIN.
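As a concrete reference point for these policies, a minimal LRU buffer pool can be sketched as below; MRU and FIFO differ only in which page is chosen for eviction. The class name and interface are illustrative assumptions, not any DBMS's actual buffer manager.

```python
from collections import OrderedDict

class LRUBufferPool:
    """Tiny page cache that evicts the least recently used page."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> frame (dummy payload)
        self.hits = self.misses = 0

    def access(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)     # mark most recently used
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict the LRU page
            self.pages[page_id] = object()

pool = LRUBufferPool(capacity=3)
for p in [1, 2, 3, 1, 4, 1]:   # page 2 is evicted when page 4 arrives
    pool.access(p)
print(pool.hits, pool.misses)  # 2 hits, 4 misses
```

Note how LRU fits a re-referencing point-query workload, while a looping sequential scan slightly larger than the pool would miss on every page under LRU, which motivates MRU, hot-set, and DBMIN.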
What is the difference between the page access patterns of
programs executed by operating systems and those of the
query processing strategies executed inside a DBMS?
Answer the following about the DBMIN buffer pool management policy:
Identify the working set sizes, page access patterns, and
appropriate page replacement policies for the
following query processing strategies:
(a) point search using an index,
(b) scanning all pages of a table,
(c) nested loop join when the inner table has no index.
Define the Query Locality Set Model (QLSM).
Is the DBMIN page replacement strategy useful for Object-Oriented
Databases? Explain briefly.
Is the DBMIN page replacement strategy useful for Spatial Databases?
Explain briefly.
List two unique issues related to the storage and indexing needs
of Spatial Databases. Explain briefly.
Transactions, Recovery
List the four ACID properties for transactions.
Compare and contrast: transaction failure, system failure, and
media failure.
Which features of the following affect the design of recovery algorithms:
main memory buffer pool management
file system data-structures on disk
Answer the following about the Computing Surveys paper on recovery:
Define the dimensions (e.g. atomic, steal, force, checkpoint-type)
used to classify crash recovery algorithms.
List three types of checkpointing. How do they affect recovery
algorithms?
Does it cover recovery when the computer has a stable main
memory with battery backup? Justify.
Concurrency Control
What are the problems with using two-phase read/write locking
for concurrency control on operations on B+ trees?
Compare and contrast: B+ trees, B-link Trees.
Compare and contrast: two-phase read/write locking, tree locking
protocol (Yao, Lehman).
Answer the following about the Yao/Lehman tree locking protocol:
Does it follow two-phase locking?
Is it free of deadlock? Of starvation?
What is the order of visiting nodes for a search operation? For an insert operation?
Simulation is a popular validation methodology in Database Systems
Research. How can a simulation study help in the study of tree locking
protocols?
Query Processing
Distributed Databases
Compare and contrast the following pairs:
vertical and horizontal partitioning
fragmentation and replication
semi-join and bloom-join
query optimization in a DBMS vs. that in a distributed DBMS.
commit protocol in a DBMS vs. that in a distributed DBMS.
Consider the following assumptions made in distributed database
research in early 1980s. Identify the applications (e.g. mobile wireless,
desk-top clients to internet, parallel databases, mirrored databases),
where these assumptions may still be true:
unreliable communication
single schema across all sites
vertical and horizontal fragmentation
Parallel Databases
Data Warehouse
Compare and contrast the following pairs:
data cube, group by aggregate queries in SQL
NULL, ALL
B+ tree index and bitmap index
bitmap index and bit-sliced index
projection index and value-list index
group set index and join index
Define the following classes of aggregate functions:
distributive, algebraic, holistic.
Classify the following geometric aggregate functions as
distributive, algebraic, or holistic:
(a) minimum bounding circle, (b) center of mass, (c) minimum bounding
orthogonal rectangle, (d) convex hull.
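One way to see why the minimum bounding orthogonal rectangle is distributive: MBRs computed over partitions of the data can be merged into the global MBR without revisiting the points. A minimal sketch, with illustrative function names:

```python
def mbr(points):
    """Minimum bounding orthogonal rectangle (xmin, ymin, xmax, ymax)
    of a list of 2-D points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def merge_mbr(a, b):
    """Distributivity: two partition MBRs combine into the global MBR."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

part1 = [(0, 0), (2, 3)]
part2 = [(5, 1), (4, 7)]
# Merging partition MBRs gives the same result as one global pass.
print(merge_mbr(mbr(part1), mbr(part2)) == mbr(part1 + part2))  # True
```

By contrast, a holistic aggregate admits no such constant-size partition summary.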
Which index is most efficient for the following SQL aggregate
operations: minimum, average, count? Briefly justify your answer.
You may use Table 3.5 (p. 549) from the textbook. However, the
justification should be based on first principles of database systems.
Which attributes are suitable for bitmap indices? Characterize
these attributes in terms of
the cardinality of their domain of values, the number of
bytes needed to store their values, the read-to-update ratio, and the
selectivity of range queries on the attribute.
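A minimal sketch of why low-cardinality attributes suit bitmap indices: the index keeps one bit-vector per distinct value, so storage grows with the domain size, while selection predicates reduce to bitwise operations over the vectors. Representing each bit-vector as a Python integer is an illustrative assumption, not how any particular DBMS stores bitmaps.

```python
def build_bitmap_index(values):
    """One bit-vector per distinct attribute value; bit i is set
    when row i holds that value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

# Low-cardinality attribute: only two distinct values over five rows.
genders = ["M", "F", "F", "M", "F"]
idx = build_bitmap_index(genders)
# AND/OR in queries become bitwise ops; COUNT becomes a popcount.
print(bin(idx["F"]))  # bits for rows 1, 2, 4 are set
```

A high-cardinality attribute (e.g. a key) would need nearly one bit-vector per row, which is why B+-tree or value-list indices are preferred there.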
Data Mining
Consider the following sets of items and transactions:
Items: I1 = CS courses; I2 = CS students.
Transactions (i.e. subsets of items): T1 = courses on a student
transcript (subsets of I1); T2 = students in a course (subsets of I2).
What is the typical size of transactions in T1?
Provide an example of an association rule using I1 and T1.
Explain the meaning of association rules using I1 and T1.
What is the typical size of transactions in T2?
Provide an example of an association rule using I2 and T2.
Explain the meaning of association rules using I2 and T2.
Compare and contrast the following pairs:
association vs. statistical correlation
confidence vs. support
Apriori algorithm vs. AprioriTid algorithm
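For reference, the support and confidence of a rule lhs => rhs can be computed directly from a set of transactions. The transcripts and course numbers below are hypothetical examples in the style of I1/T1, not real data.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Conditional frequency of rhs given lhs:
    support(lhs union rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# Hypothetical student transcripts (T1-style transactions over I1 = courses).
transcripts = [
    {"Csci5708", "Csci8701"},
    {"Csci5708", "Csci8701", "Csci5511"},
    {"Csci5708", "Csci5511"},
    {"Csci5511"},
]
rule_lhs, rule_rhs = {"Csci5708"}, {"Csci8701"}
print(support(rule_lhs | rule_rhs, transcripts))   # 0.5
print(confidence(rule_lhs, rule_rhs, transcripts)) # 2/3
```

Support measures how often the rule applies at all; confidence measures how reliable it is when the left-hand side holds, which is why the two are thresholded separately.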
Consider a brute-force algorithm to find association rules. Given N
items, it first generates all possible subsets of the items. The set of
transactions is then scanned to compute the support of each subset.
Subsets with support below the given threshold are discarded to produce
the final set of association rules.
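The brute-force algorithm described above can be sketched as follows. Enumerating every itemset is exponential (2^N subsets of N items), which is exactly the cost that apriori's candidate pruning avoids; the names below are illustrative.

```python
from itertools import combinations

def brute_force_frequent(items, transactions, min_support):
    """Enumerate every non-empty itemset, count its support in a pass
    over the transactions, and keep those meeting min_support."""
    frequent = {}
    for k in range(1, len(items) + 1):
        for itemset in combinations(sorted(items), k):
            support = sum(1 for t in transactions if set(itemset) <= t)
            if support >= min_support:
                frequent[itemset] = support
    return frequent

transactions = [{"db", "os"}, {"db", "ai"}, {"db", "os", "ai"}]
items = {"db", "os", "ai"}
print(brute_force_frequent(items, transactions, min_support=2))
```

Despite the exponential enumeration, the brute-force approach makes only one scan per subset size, which hints at when it might still be acceptable (very small N or very dense data).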
Compare the computational complexity of this algorithm with that of the
apriori algorithm for finding association rules. When would you use the
brute-force algorithm over the apriori algorithm?