A Baseball Database Server: Part II, Batting Statistics
In this lab we turn-up the heat on our Baseball Database and find some
real statistics. I have written an HTML page that provides a form
which allows a client to design their own query of the major batting
statistics: ball2.html. Your task in this lab
is to write a JSP which queries the baseball database in MySQL and returns
an HTML table of the results of the query. This is a very complicated
process which must take client queries on up to 22
fields with numerous constraints
for aggregate statistics or individual rows, and create a single
SQL query for your ResultTable class.
Your lab will be judged on several features
- Throrough testing of the java classes you write, and the JSP response page for the client.
- The style of your java code, and thoroughness of your comments. With a program this complex, it is hard to get every detail correct, so it is very important that your java class is simple and clear. In particular, you should not try to write one big constructor!!
- Simplicity of your JSP request page. This means minimal java code, nothing more than calling constructors and methods.
MySQL Server
Everyone will access the same database on MySQL. This database is on
a MySQL server running on a local machine danryan, not
dbserver, and you will access it as an anonymous user,
cs102, with access limited only to making SELECT
queries. The commandline is
mysql cs102 -h danryan -u cs102 -P 34463 -p
and the password is cs102 (easy to remember!!).
SQL SELECT Command
You will need a more robust SQL SELECT command for this
weeks lab. This is still not as rich as SQL allows, but it is sufficient
for satisfying every client query by a single SQL SELECT
query.
SELECT columns FROM tables WHERE where_conditions GROUP BY groups HAVING have_conditions ORDER BY column [ASC | DESC] LIMIT offset, countThe where_conditions will be extended beyond
LIKE
operator (for pattern matching) to include comparison
operators
>, >=, <, <=, =, !=which will be used in conditions having the form
column compare valuewhere value can be a string, integer or decimal value. The have_conditions are exactly like the where_conditions, but they may additionally include aggregate operations where column names must occur in where_conditions. The
SELECT block, located after the
SELECT keyword may contain a list of
column names as well as
aggregate operations and aliases. For this
lab, the only aggregate function you need is SUM. You will
need alias specifications, which take the form
column-or-aggregate AS aliasin the column-list. You MUST use aliases for aggregate functions, especially if there are constraints on these values (in which case the constraints must be placed under the
HAVING block) or you are using an
aggregate to order the results, in the ORDER block.
These are the only places aggregates may occur. In particular, they
may not occur in the WHERE block or
GROUP block. (These are constraints placed by
SQL.)
Here are some things to keep in mind
- Aggregates can only appear in the
SELECTblock, theHAVINGblock and theORDERblock. - You must use an alias for an aggregate whenever you need
to refer to the aggregate in the
HAVINGorORDERblock. - You may use an alias which is identical to a column name.
In this case,
whenever the alias appears in the
HAVINGorORDERblock it refers to the alias, and everywhere else, it refers to the column. For exampleSELECT first,last,SUM(HR) AS HR, SUM(AB)/SUM(HR) AS ab-per-hr FROM Master,Batting WHERE Master.playerID = Batting.playerID GROUP BY Master.playerID HAVING HR > 500 ORDER BY HR ASC LIMIT 0,25;TheHRin theSELECTblock aggregateSUM(HR)andSUM(AB)/SUM(HR)refers to the columnHRof the Batting table; all other occurrences refer to the alias created to refer toSUM(HR). - All aggregates must appear in the
column_list of the
SELECTblock. The non-aggregate columns of a table, such as theplayerIDoryear, need not appear in theSELECTblock to constrain the query. - You will not need more than one column or aggregate in the
ORDERfield, although SQL permits multiple values. - The SQL default for ordering is
ASC, although the queries generated for yourResultTablewill always specify this value. - SQL allows one integer in the
LIMITblock, although the queries generated for yourResultTablewill always produce the offset as well as the count.
Java Query Code
You will need your html package code and
ResultTable class code. These will not require
modification, if they are properly functioning.
Download the new
sql package,
and the database package
and unarchive it by
tar xfz sql.tgz
tar xfz database.tgz
These will produce directories called sql
and database.The first contains
the new java classes to implement the more robust SELECT
SQL command for this week's lab. Your ResultTable does not
depend upon the actual details of SelectQuery, so it
will work without modification.
I have implemented the SelectQuery as a single class
with numerous methods for adding various constraints. One very important
modification is that you do not have to add all your columns
at the time of construction of a SelectQuery. It will
allow you to add columns after its construction, and will store all
columns in the order you add them. It does keeps
multiple instances of the same column, so it is up to you to add a
column once to the column list in SelectQuery. Read the
Javadoc documentation for more on the variety of methods provided
for you.
Baseball Database
For this lab, you will only need two tables, Master and
Batting. (The extra credit adds a selected
use of the Fielding table.) The primary key for the
Master table is playerID, and this provides the
key link which identifies a player with all other player data tables.
You will find a description of these tables in MySQL using
SHOW COLUMNS FROM table;
and a description of the
Lahman baseball database (version 5-3).
The Batting table is organized with each row corresponding to
a single player (determined by the playerID), in a single
year and a single team (determined by the teamID,
which accesses the Teams table, which you will not need for
this lab.) The row contains the major statistics for that player.
You will need to compute some statistics yourself. There is a seasonal
and aggregate form. I have provided the seasonal version here, and leave
it to you to determine the correct notation for the aggregate form.
One (apparent) quirk of MySQL is that it computes using DECIMAL,
which provides only two digits. I have not discovered how to produce
a third digit appropriate for baseball statistics. All columns
refer to columns (unless specifically noted)
in the Batting table:
- Singles:
(H-2B-3B-HR), which I will abbreviate as1B. - Batting Average:
H/AB - On Base Percentage:
(H+BB+HBP)/(AB+BB+HBP) - Slugging Percentage:
(HR*4 + 3B*3 + 2B*2 + 1B)/AB - age:
year - Master.birth
Your query must report the first and last
name, in that order, for any row of data. The user optionally requests
among twenty other major statistics, for seasonal or aggregate totals.
The third reported value is the value on which the order is based,
if it is not the default last name. (This is the case,
even if the user does not specifically request this statistic reported.
The order of all other statistics does not matter for this lab.
The SelectQuery class will produce the column list
for the SELECT SQL query in the order you add the columns
(starting with those you pass in the constructor.) It does not
ensure a column is reported only once. It only adds a column through
the addColumn() method, so it is up to you to ensure
columns are not multiply reported.
The database has been optimized to report queries in under 5 seconds. If you do not recieve a response in this time, there is a good chance there is a problem with your query.
Whenever appropriate, I have use correct field names as values in the
form for ball2.html, your baseball database server webpage.
For example, the teamID of the Chicago Cubs
is "CHN" and for the Chicago White Sox is
"CHA". I have also tried to use column names where
appropriate, for parameter names, such as with teamID,
to identify the users input for the team constraint. I have also tried
to ensure that reasonable values would be passed your way in most
fields. For the exceptions, such as the year constraint, you may ignore
any non-integer input values for given for a year; any stupid values--those
not between 1876 and 2005 are due to client stupidity, so they
must suffer the consequences!!
I have use "all" at all places where the client does not wish
to add any constraint. For statistics, this means that the client does
wish your query to report the statistic, but no constraint is placed on
the value. I have used "none" to signify that the user
does not wish the query to report this value. Remember, that if a column
is used to order the data, it must be reported after the first and last
names, even if the user does not specifically request it.
Extra Credit
For extra credit you will implement the Position constraint.
The table Fielding contains a row for each playerID,
each year, each teamID (identification of
team played for--an index to the Teams table) played for that
year,
each pos (fielding position,
one of "P","C","1B","2B","3B","SS" or "OF") played on that team,
and the fielding statistics.
You will need to access the position. For simplicity, you may assume
a player played a position for a team in a given year if they played
that position for the team in at least one game.
For example, Eddie Mathews played 79 games
at first base for Houston in 1967 and 24 games at third base for them, so
he could be counted as a first baseman or third baseman that year. He
also played first base (13 games)
and third base (21 games) with Detroit in 1967. So, for the year
1967 Mathews has 4 rows in the Fielding table. He has two rows in the
Batting table, one for each team he played for in 1967. There
is a field, stint which distinguishes the rows by the
team played for, in the order of appearance. Eddie Mathews played for
Houston first, so his stint value is 1, and then played
for Detroit that same year, so his stint value is 2.
This field is common to both Batting and Fielding.
(The value of stint does not depend upon the multiple
positions played in the Fielding table, so Mathews
stint value is 1 for both rows in which he played for
Houston in 1967, and is 2 for both rows in which he played for
Detroit in 1967.)
Tracking down the players position and correctly computing batting totals introduces special problems, since you must now search on a third table, besides Master and Batting. Here are some things to be aware
- Batting and Fielding have several rows in common,
so you will have new ambiguities. One example, is with
year, but there are several others. You will need to be careful in qualifying these fields. - You need to be careful about simply adding the Fielding
table to a query. The following query is intended to produce the
homerun leaders
SELECT first,last,SUM(HR) AS hr FROM Master,Batting,Fielding WHERE Master.playerID=Batting.playerID GROUP BY Master.playerID ORDER BY hr DESC LIMIT 10;Without Fielding table the query takes a couple of seconds and returns the correct values. Even though the Fielding table is never directly accessed, MySQL uses the table in producing its query results leading to a very long query with gives badly incorrect values. - Finding the most homeruns by a third baseman in a career has some subtlties created by the explanation of Eddie Mathews year in 1967. Mathews played third base every year of his career (and was one of the great third basemen), so his career homeruns should equal is homeruns in years he played thirdbase. But it is rather subtle to figure-out how write a SQL query which verifies this.
Handin
Your lab is due Sunday at 11:59pm. See handin.