lab2

Lab 2 : Query Optimization

The goal of Lab 2 is to learn about query optimizers by examining and analyzing query execution plans of the following SQL query statments on given database tables, movies, actors and casting.

Part A: Preliminaries

Database Schemas

CREATE TABLE actor
(id INTEGER NOT NULL, name VARCHAR(35), PRIMARY KEY (id));

CREATE INDEX actor_name ON actor(name);

CREATE TABLE movie
(id INTEGER NOT NULL, title VARCHAR(70)
,yr DECIMAL(4), score FLOAT, votes INTEGER, director INTEGER
,PRIMARY KEY (id), FOREIGN KEY (director) REFERENCES actor(id)
);
CREATE INDEX movie_title ON movie(title);
CREATE INDEX movie_votes ON movie(votes);

CREATE TABLE casting
(movieid INTEGER NOT NULL, actorid INTEGER NOT NULL
,ord INTEGER, PRIMARY KEY (movieid, actorid)
,OREIGN KEY (movieid) REFERENCES movie(id)
, FOREIGN KEY (actorid) REFERENCES actor(id)
);

CREATE INDEX casting_movie ON casting(movieid);
CREATE INDEX casting_actor ON casting(actorid);
CREATE INDEX casting_ord   ON casting(ord);

NOTE: TA account(S1G30) has these three tables with data. Since these tables are quite large, you should NOT copy those in your own accounts. You can refer to these tables either using prefixing owner account name or using synonyms. For example,

SELECT * FROM S1G30.movie;
or
CREATE SYNONYM movie FOR S1G30.movie;
SELECT * FROM movie;

TA executed the following ANALYZE statements so that Oracle uses cost-based optimization.

ANALYZE TABLE movie COMPUTE STATISTICS;
ANALYZE TABLE actor COMPUTE STATISTICS;
ANALYZE TABLE casting COMPUTE STATISTICS;

Examine indices info related to above tables.
(Use INDEX_NAME, TABLE_NAME, LEAF_BLOCKS,AVG_LEAF_BLOCKS_PER_KEY, AVG_DATA_BLOCKS_PER_KEY, NUM_ROWS, DISTINCT_KEYS) columns in ALL_INDEXES table.)

Setup for using EXPLAIN PLAN.

The EXPLAIN PLAN statement displays execution plans chosen by the Oracle optimizer for SELECT, UPDATE, INSERT, and DELETE statements. A statement's execution plan is the sequence of operations Oracle performs to run the statement.
1. Creating the PLAN_TABLE Output Table
  ORACLE stores the execution plans of a given SQL into a table called PLAN_TABLE. Execute the following SQL script for it.
```
SQL> @$ORACLE_HOME/rdbms/admin/utlxplan
```
2. Running EXPLAIN PLAN
  The following SQL code template is used to populate PLAN_TABLE with execution plan for a SQL statement. You need to edit the code template to put specific value for <some-name> before running it.
```
EXPLAIN PLAN
SET STATEMENT_ID = '<some-name>'
FOR
<select statement to be analyzed>;
```
  NOTE: PLAN_TABLE allows duplicate values in statement_id column. This may result in spurious data for execution plans for queries using an existing value of statement_id. You may consider cleaning up old information using DELETE FROM PLAN_TABLE; before asking for execution plan for a new query.
3. Displaying PLAN_TABLE Output
  Use the following SQL to view created execution plans
```
SELECT LPAD(' ', 2*LEVEL)||OPERATION||' '||OPTIONS||' '||OBJECT_NAME Query_Plan
FROM PLAN_TABLE
CONNECT BY PRIOR ID = PARENT_ID and STATEMENT_ID = '<some-name>'
START WITH ID=1 and STATEMENT_ID = '<some-name>'
ORDER BY ID;

or

SELECT id, parent_id, level, position, operation, options, object_name, cost, bytes
FROM PLAN_TABLE
CONNECT BY PRIOR ID = PARENT_ID and STATEMENT_ID = '<some-name>'
START WITH ID=1 and STATEMENT_ID = '<some-name>'
ORDER BY ID;
```
4. NOTES:
  1) You may save these statement via "SAVE <some-name>" command, recall it via "@<some-name>" and edit it via "EDIT <some-name>".
  2) You may use "spool <some-name>" command to start recording a session and "spool off" to stop recording.
  3) Additional Information on Explain Plan is available from

Part B: Range Query and Index

Recall that there is a secondary index on the VOTES column of MOVIE table.
Write SQL statements for the follwoing VOTES based range queries.
- Q1: List the movies with votes between 0 and 1000.
- Q2: List the movies with votes between 0 and 50000.
Run the explain plan for each Q1 and Q2 using the instructions given above.
Submit the explain plan output of above two queries and compare these plans.
Explain why the index on VOTES is not always used for range queries based on VOTES attribute.
Derive an algebric condition on the selectivity of range query in terms of record size, block size, key-value size, pointer value size and depth of secondary index to quantify your arguement.

Part C: Selectivity

Recall that there is a secondary index on the VOTES column of MOVIE table.
Consider following SQL queries for the VOTES based range queries .
- Q1: SELECT title FROM movie WHERE votes < 1000;
- Q2: SELECT title FROM movie WHERE votes > 40000;
Run the explain plan for each Q1 and Q2 and Submit the output.
Compare the execution plans for the two queries. Which query is cheaper to execute? Why?
Compare the number of tuples selected from each query. Which query do you expect to have lower selectivity (i.e. returns fewer rows)?
Explain why the index on VOTES is not used for Q2.
(Hint: The cost based optimizer assumes that VOTES values follow an uniform distribution.)

Part D: String Matching and Index

Recall that there is a secondary index on the NAME column of ACTOR table.
Consider following SQL queries.
- Q1: SELECT name FROM actor WHERE name LIKE 'W%';
- Q2: SELECT * FROM actor WHERE name LIKE 'W%';
- Q3: SELECT name FROM actor WHERE substr(name, 1, 1) = 'W';
- Q4: SELECT name FROM actor WHERE name LIKE '%w';
Run the explain plan for each Q1, Q2, Q3 and Q4, and submit the output.
Compare the execution plans of query Q1 and Q2. What's the differnence? Explain.
Explain the execution plans of query Q3 and Q4. Explain why the index on NAME is not always used for queries based on NAME attribute.

Part E: Aggregate Operations

Consider following two equivalent SQL queries.
- Q1: SELECT title FROM movie WHERE votes ≥ (SELECT MAX(votes) FROM movie);
- Q2: SELECT title FROM movie WHERE votes ≥ ALL (SELECT votes FROM movie) ;
Run the explain plan for each Q1 and Q2 and submit the output.
Is there any difference in the execution plans for Q1 and Q2? Which query is more efficient to execute? Explain.

Part F: Join and Selection

Recall that there are indices on TITLE and VOTES.
Consider following two equivalent SQL queries for retrieving titles of movies with votes larger than that of Star Wars.
- Q1: SELECT title FROM movie WHERE votes > (SELECT votes FROM movie WHERE title = 'Star Wars');
- Q2: SELECT m1.title FROM movie m1, movie m2 WHERE m1.votes > m2.votes AND m2.title = 'Star Wars';
Run the explain plan for each Q1 and Q2 and submit the output.
Is there any difference in the execution plans for Q1 and Q2? Which query is more efficient to execute? Explain.

Part G: Alternate Join Strategies

Recall that there are indices on CASTING.ACTORID, CASTING.ORD and ACTOR.ID.
Consider following three equivalent SQL queries for retrieving leading actors, i.e. actors which have had a leading roles in at least one movie. The two of them have optimizer hint. Hints provide a mechanism to direct the optimizer to choose a certain query execution plan based on join method.
- Q1: SELECT name FROM casting, actor WHERE id=actorid AND ord = 1;
- Q2: SELECT /*+ USE_MERGE (casting actor) */ name FROM casting, actor WHERE id=actorid AND ord = 1;
- Q3: SELECT /*+ ORDERED USE_NL (actor) */ name FROM casting, actor WHERE id=actorid AND ord = 1;
Run the explain plan for each Q1, Q2 and Q2 and submit the output.
Compare and contrast three exeuction plans for Q1,Q2 and Q3? Is there any difference? Why?