A Baseball Database Server: Part II, Batting Statistics
In this lab we turn-up the heat on our Baseball Database and find some
real statistics. I have written an HTML page that provides a form
which allows a client to design their own query of the major batting
statistics: ball2.html. Your task in this lab
is to write a JSP which queries the baseball database in MySQL and returns
an HTML table of the results of the query. This is a very complicated
process which must take client queries on up to 22
fields with numerous constraints
for aggregate statistics or individual rows, and create a single
SQL query for your
Your lab will be judged on several features
- Throrough testing of the java classes you write, and the JSP response page for the client.
- The style of your java code, and thoroughness of your comments. With a program this complex, it is hard to get every detail correct, so it is very important that your java class is simple and clear. In particular, you should not try to write one big constructor!!
- Simplicity of your JSP request page. This means minimal java code, nothing more than calling constructors and methods.
Everyone will access the same database on MySQL. This database is on
a MySQL server running on a local machine danryan, not
dbserver, and you will access it as an anonymous user,
cs102, with access limited only to making
queries. The commandline is
mysql cs102 -h danryan -u cs102 -P 34463 -pand the password is cs102 (easy to remember!!).
You will need a more robust SQL
SELECT command for this
weeks lab. This is still not as rich as SQL allows, but it is sufficient
for satisfying every client query by a single SQL
SELECT columns FROM tables WHERE where_conditions GROUP BY groups HAVING have_conditions ORDER BY column [ASC | DESC] LIMIT offset, countThe where_conditions will be extended beyond
LIKEoperator (for pattern matching) to include
>, >=, <, <=, =, !=which will be used in conditions having the form
column compare valuewhere value can be a string, integer or decimal value. The have_conditions are exactly like the where_conditions, but they may additionally include aggregate operations where column names must occur in where_conditions. The
SELECTblock, located after the
SELECTkeyword may contain a list of column names as well as aggregate operations and aliases. For this lab, the only aggregate function you need is
SUM. You will need alias specifications, which take the form
column-or-aggregate AS aliasin the column-list. You MUST use aliases for aggregate functions, especially if there are constraints on these values (in which case the constraints must be placed under the
HAVINGblock) or you are using an aggregate to order the results, in the
ORDERblock. These are the only places aggregates may occur. In particular, they may not occur in the
GROUPblock. (These are constraints placed by SQL.)
Here are some things to keep in mind
- Aggregates can only appear in the
HAVINGblock and the
- You must use an alias for an aggregate whenever you need
to refer to the aggregate in the
- You may use an alias which is identical to a column name.
In this case,
whenever the alias appears in the
ORDERblock it refers to the alias, and everywhere else, it refers to the column. For example
SELECT first,last,SUM(HR) AS HR, SUM(AB)/SUM(HR) AS ab-per-hr FROM Master,Batting WHERE Master.playerID = Batting.playerID GROUP BY Master.playerID HAVING HR > 500 ORDER BY HR ASC LIMIT 0,25;The
SUM(AB)/SUM(HR)refers to the column
HRof the Batting table; all other occurrences refer to the alias created to refer to
- All aggregates must appear in the
column_list of the
SELECTblock. The non-aggregate columns of a table, such as the
year, need not appear in the
SELECTblock to constrain the query.
- You will not need more than one column or aggregate in the
ORDERfield, although SQL permits multiple values.
- The SQL default for ordering is
ASC, although the queries generated for your
ResultTablewill always specify this value.
- SQL allows one integer in the
LIMITblock, although the queries generated for your
ResultTablewill always produce the offset as well as the count.
Java Query Code
You will need your html package code and
ResultTable class code. These will not require
modification, if they are properly functioning.
tar xfz sql.tgz tar xfz database.tgzThese will produce directories called
database.The first contains the new java classes to implement the more robust
SELECTSQL command for this week's lab. Your
ResultTabledoes not depend upon the actual details of
SelectQuery, so it will work without modification.
I have implemented the
SelectQuery as a single class
with numerous methods for adding various constraints. One very important
modification is that you do not have to add all your columns
at the time of construction of a
SelectQuery. It will
allow you to add columns after its construction, and will store all
columns in the order you add them. It does keeps
multiple instances of the same column, so it is up to you to add a
column once to the column list in
SelectQuery. Read the
Javadoc documentation for more on the variety of methods provided
For this lab, you will only need two tables, Master and
Batting. (The extra credit adds a selected
use of the Fielding table.) The primary key for the
Master table is
playerID, and this provides the
key link which identifies a player with all other player data tables.
You will find a description of these tables in MySQL using
SHOW COLUMNS FROM table;and a description of the Lahman baseball database (version 5-3). The Batting table is organized with each row corresponding to a single player (determined by the
playerID), in a single
yearand a single team (determined by the
teamID, which accesses the Teams table, which you will not need for this lab.) The row contains the major statistics for that player.
You will need to compute some statistics yourself. There is a seasonal
and aggregate form. I have provided the seasonal version here, and leave
it to you to determine the correct notation for the aggregate form.
One (apparent) quirk of MySQL is that it computes using
which provides only two digits. I have not discovered how to produce
a third digit appropriate for baseball statistics. All columns
refer to columns (unless specifically noted)
in the Batting table:
(H-2B-3B-HR), which I will abbreviate as
- Batting Average:
- On Base Percentage:
- Slugging Percentage:
(HR*4 + 3B*3 + 2B*2 + 1B)/AB
year - Master.birth
Your query must report the
name, in that order, for any row of data. The user optionally requests
among twenty other major statistics, for seasonal or aggregate totals.
The third reported value is the value on which the order is based,
if it is not the default
last name. (This is the case,
even if the user does not specifically request this statistic reported.
The order of all other statistics does not matter for this lab.
SelectQuery class will produce the column list
SELECT SQL query in the order you add the columns
(starting with those you pass in the constructor.) It does not
ensure a column is reported only once. It only adds a column through
addColumn() method, so it is up to you to ensure
columns are not multiply reported.
The database has been optimized to report queries in under 5 seconds. If you do not recieve a response in this time, there is a good chance there is a problem with your query.
Whenever appropriate, I have use correct field names as values in the
form for ball2.html, your baseball database server webpage.
For example, the
teamID of the Chicago Cubs
"CHN" and for the Chicago White Sox is
"CHA". I have also tried to use column names where
appropriate, for parameter names, such as with
to identify the users input for the team constraint. I have also tried
to ensure that reasonable values would be passed your way in most
fields. For the exceptions, such as the year constraint, you may ignore
any non-integer input values for given for a year; any stupid values--those
not between 1876 and 2005 are due to client stupidity, so they
must suffer the consequences!!
I have use
"all" at all places where the client does not wish
to add any constraint. For statistics, this means that the client does
wish your query to report the statistic, but no constraint is placed on
the value. I have used
"none" to signify that the user
does not wish the query to report this value. Remember, that if a column
is used to order the data, it must be reported after the first and last
names, even if the user does not specifically request it.
For extra credit you will implement the Position constraint.
The table Fielding contains a row for each
teamID (identification of
team played for--an index to the Teams table) played for that
pos (fielding position,
one of "P","C","1B","2B","3B","SS" or "OF") played on that team,
and the fielding statistics.
You will need to access the position. For simplicity, you may assume
a player played a position for a team in a given year if they played
that position for the team in at least one game.
For example, Eddie Mathews played 79 games
at first base for Houston in 1967 and 24 games at third base for them, so
he could be counted as a first baseman or third baseman that year. He
also played first base (13 games)
and third base (21 games) with Detroit in 1967. So, for the year
1967 Mathews has 4 rows in the Fielding table. He has two rows in the
Batting table, one for each team he played for in 1967. There
is a field,
stint which distinguishes the rows by the
team played for, in the order of appearance. Eddie Mathews played for
Houston first, so his
stint value is 1, and then played
for Detroit that same year, so his
stint value is 2.
This field is common to both Batting and Fielding.
(The value of
stint does not depend upon the multiple
positions played in the Fielding table, so Mathews
stint value is 1 for both rows in which he played for
Houston in 1967, and is 2 for both rows in which he played for
Detroit in 1967.)
Tracking down the players position and correctly computing batting totals introduces special problems, since you must now search on a third table, besides Master and Batting. Here are some things to be aware
- Batting and Fielding have several rows in common,
so you will have new ambiguities. One example, is with
year, but there are several others. You will need to be careful in qualifying these fields.
- You need to be careful about simply adding the Fielding
table to a query. The following query is intended to produce the
SELECT first,last,SUM(HR) AS hr FROM Master,Batting,Fielding WHERE Master.playerID=Batting.playerID GROUP BY Master.playerID ORDER BY hr DESC LIMIT 10;Without Fielding table the query takes a couple of seconds and returns the correct values. Even though the Fielding table is never directly accessed, MySQL uses the table in producing its query results leading to a very long query with gives badly incorrect values.
- Finding the most homeruns by a third baseman in a career has some subtlties created by the explanation of Eddie Mathews year in 1967. Mathews played third base every year of his career (and was one of the great third basemen), so his career homeruns should equal is homeruns in years he played thirdbase. But it is rather subtle to figure-out how write a SQL query which verifies this.
Your lab is due Sunday at 11:59pm. See handin.