2.12 Finding the Top N Values in a Set
2.12.1 Problem
You want to find the first N elements of an ordered set. In most cases, this means
that you want to find the top N records. Assume that the grading
rules of the school require professors to use only the best two
Scores from the term papers each student submitted. You need to write
a query that returns that information. You don't
want all Scores for each student, only the top two.
2.12.2 Solution
The most straightforward solution to this problem is to
use
the TOP keyword. TOP is a MS SQL Server extension to SQL that allows
you to limit a query so that it returns only the first N records. The
following query returns the top two Scores for each student in each
course:
SELECT s1.StudentName, s1.CourseId, s1.TermPaper, MAX(s1.Score) Score
FROM Students s1
GROUP BY s1.CourseId, s1.StudentName, s1.TermPaper
HAVING MAX(s1.Score) IN
(SELECT TOP 2 s2.Score
FROM Students s2
WHERE s1.CourseId=s2.CourseId AND
s1.StudentName=s2.StudentName
ORDER BY s2.Score DESC)
ORDER BY s1.StudentName, s1.CourseId, s1.Score DESC
An alternative solution is a bit less Transact-SQL-specific and a bit
less intuitive. It is, however, more general, and it conforms with
the SQL standard:
SELECT s1.StudentName,s1.CourseId, s1.TermPaper, MAX(s1.Score) Score
FROM Students s1 INNER JOIN Students s2
ON s1.CourseId=s2.CourseId AND
s1.StudentName=s2.StudentName
GROUP BY s1.CourseId, s1.StudentName, s1.TermPaper
HAVING SUM(CASE WHEN s1.Score <= s2.Score THEN 1 END) <= 2
ORDER BY s1.StudentName, s1.CourseId, s1.Score DESC
Both queries list the two highest-scoring term papers for each
student in each course and order the results in descending order by
Score. The results will resemble the following output:
StudentName CourseId TermPaper Score
------------ --------- ---------- ------
Andrew ACCN101 4 15.60
Andrew ACCN101 3 11.00
Andrew MGMT120 3 23.10
Andrew MGMT120 2 21.70
Bert ACCN101 1 13.40
Bert ACCN101 3 13.00
Cindy ACCN101 2 16.20
Cindy ACCN101 1 12.10
Cindy MGMT120 3 16.00
Cindy MGMT120 2 14.40
2.12.3 Discussion
Two solutions to the problem of finding the top N rows are presented
in this recipe. The first solution uses the TOP keyword, which is a
SQL extension implemented by MS SQL Server. The second solution uses
a self-join instead of the TOP keyword and is useful to know if you
ever need to work in a database environment where TOP is not
supported.
2.12.3.1 Using TOP
The first query
uses TOP — an extension to SQL
provided by MS SQL Server. The key thing to notice is that the SELECT
TOP statement is used in the HAVING clause. While the outer query
prepares a list of rows, the HAVING clause makes sure that only rows
that match the two Scores for each student in each course are
returned in the final result.
The way that TOP works is that SELECT TOP N causes the database
server to return only the first N rows returned by the query.
It's very important when using TOP to also specify
an ORDER BY clause. Since we want the top two Scores, we used
ORDER BY s2.Score DESC in the
subquery. If you don't specify an ORDER BY clause,
you'll still get two rows returned, but they will be
a random set of two rows. In this sense, the use of the word TOP is a
bit misleading.
|
If you're not familiar with TOP, you may want to
consult the MS SQL Server documentation for a complete description of
this extension.
|
|
The core of the query is in the HAVING clause. It checks if the Score
of the current group is in the list of the top two Scores for the
group:
HAVING MAX(s1.Score) IN
(SELECT TOP 2 s2.Score
FROM Students s2
WHERE s1.CourseId=s2.CourseId AND
s1.StudentName=s2.StudentName
ORDER BY s2.Score DESC)
Note that the MAX function does not have any real effect since the
GROUP BY clause separates the data into single rows anyway. There is
only one item per group, so MAX returns the Score for that one item.
Even though we aren't summarizing the Score, the MAX
function is necessary to comply with the rules of SQL. We used
MAX(s1.Score) in the select list, so we must use
MAX(s1.Score) here as well. The same results could
be obtained by not aggregating the Score column and instead adding it
to the GROUP BY clause. However, performance would suffer
somewhat — the optimizer works better with fewer columns for
grouping.
The subquery lists all Scores for the current course and the student,
and it ranks them in descending order (using the ORDER BY clause).
This ensures that the first two records returned reflect the top two
Scores for the student in the course. ORDER BY is necessary here,
specifically because TOP is being used.
|
Always remember to use ORDER BY in conjunction with TOP.
|
|
2.12.3.2 Using a self-join
The second solution in this recipe
uses
a self-join to find the two best papers for each student in each
course. In the FROM clause we join the Students table to itself. The
two copies are given aliases of s1 and s2, respectively. The WHERE
clause restricts the join such that we get the Cartesian product (all
combinations) of all term papers for each student/course group.
By applying the GROUP BY clause, we form groups identified by a
course — a student and a term paper from the first table combined
with all possible combinations of those same term papers from the
second table. Since we join the tables by course ID and student name,
the corresponding columns have equal values and can be treated as one
column. In other words, s1.StudentName will always be the same as
s2.StudentName. The same is true for s1.CourseId and s2.CourseId.
This is not the case with the term paper column. By using a
self-join, we produced a Cartesian product of all possible pairs of
term papers for a given student in a given course. This might sound a
bit confusing, so consider the following example that shows the
Cartesian product of term papers for the student named Andrew in the
course named ACCN101:
StudentName CourseId s1.TermPaper s1.Score s2.TermPaper s2.Score
------------ --------- ------------- --------- ------------- ---------
Andrew ACCN101 4 15.60 4 15.60
Andrew ACCN101 4 15.60 2 10.40
Andrew ACCN101 4 15.60 3 11.00
Andrew ACCN101 2 10.40 4 15.60
Andrew ACCN101 2 10.40 2 10.40
Andrew ACCN101 2 10.40 3 11.00
Andrew ACCN101 3 11.00 4 15.60
Andrew ACCN101 3 11.00 2 10.40
Andrew ACCN101 3 11.00 3 11.00
After the joining and grouping has taken place, the HAVING clause
uses the CASE statement to identify all cases where the Score from
the first table (s1) is less than or equal to the Score from the
second table (s2). For each row in the group, the CASE statement
returns 1 if the condition is met and NULL otherwise. The SUM
function then sums the result within a group. The result is that, for
each term paper, we know now how many other papers by the same
student for the same course have a less-than-or-equal-to Score.
Finally, the HAVING clause causes the query to return only those
groups that have just one or two less-than-or-equal-to Scores.
To illustrate this, let's work through an example in
detail. Here are the first three rows from the previous example for
the student named Andrew:
Andrew ACCN101 4 15.60 4 15.60
Andrew ACCN101 4 15.60 2 10.40
Andrew ACCN101 4 15.60 3 11.00
Apply the CASE statement to the first group, and
you'll see that it returns 4 for the first row, NULL
for the second row, and NULL for the third row. The sum of those
three values (1+NULL+NULL) is 1, which is smaller than 2 as required
by the HAVING condition. Therefore, the group is reported in the
result. Since the grouping is by student name, course ID, and term
paper number, an aggregate function must be used to report the term
paper Score. Our query uses the MAX function, but any function could
have been used because there is only one term paper in a group.
Similar reasoning applies to the second group of three rows. In that
group, the CASE statement returns a 1 for all three rows. The SUM of
those values (1+1+1) is 3. Since 3 is not <=2, as the HAVING
clause requires, the group is excluded from the result. The third
group of three rows for Andrew does meet the specified condition,
and, in the final result, you can see that Andrew's
best Scores came from his first and third term papers.
|
Since TOP is not being used, the ORDER BY clause in the second query
only serves to make the output more readable. If order is not
important, you'll get better performance by leaving
off the ORDER BY clause.
|
|
|