2.12 Finding the Top N Values in a Set

2.12.1 Problem

You want to find the first N elements of an ordered set. In most cases, this means that you want to find the top N records. Assume that the grading rules of the school require professors to use only the best two Scores from the term papers each student submitted. You need to write a query that returns that information. You don't want all Scores for each student, only the top two.

2.12.2 Solution

The most straightforward solution to this problem is to use the TOP keyword. TOP is a MS SQL Server extension to SQL that allows you to limit a query so that it returns only the first N records. The following query returns the top two Scores for each student in each course:

SELECT  s1.StudentName, s1.CourseId, s1.TermPaper, MAX(s1.Score) Score
FROM Students s1
GROUP BY s1.CourseId, s1.StudentName, s1.TermPaper
HAVING MAX(s1.Score) IN 
   (SELECT TOP 2 s2.Score 
       FROM Students s2
       WHERE s1.CourseId=s2.CourseId AND
          s1.StudentName=s2.StudentName
    ORDER BY s2.Score DESC)
ORDER BY s1.StudentName, s1.CourseId, s1.Score DESC

An alternative solution is a bit less Transact-SQL-specific and a bit less intuitive. It is, however, more general, and it conforms with the SQL standard:

SELECT  s1.StudentName,s1.CourseId, s1.TermPaper, MAX(s1.Score) Score
FROM Students s1 INNER JOIN Students s2
   ON s1.CourseId=s2.CourseId AND
      s1.StudentName=s2.StudentName
GROUP BY s1.CourseId, s1.StudentName, s1.TermPaper
HAVING SUM(CASE WHEN s1.Score <= s2.Score THEN 1 END) <= 2
ORDER BY s1.StudentName, s1.CourseId, s1.Score DESC

Both queries list the two highest-scoring term papers for each student in each course and order the results in descending order by Score. The results will resemble the following output:

StudentName  CourseId  TermPaper  Score  
------------ --------- ---------- ------ 
Andrew       ACCN101   4          15.60
Andrew       ACCN101   3          11.00
Andrew       MGMT120   3          23.10
Andrew       MGMT120   2          21.70
Bert         ACCN101   1          13.40
Bert         ACCN101   3          13.00
Cindy        ACCN101   2          16.20
Cindy        ACCN101   1          12.10
Cindy        MGMT120   3          16.00
Cindy        MGMT120   2          14.40

2.12.3 Discussion

Two solutions to the problem of finding the top N rows are presented in this recipe. The first solution uses the TOP keyword, which is a SQL extension implemented by MS SQL Server. The second solution uses a self-join instead of the TOP keyword and is useful to know if you ever need to work in a database environment where TOP is not supported.

2.12.3.1 Using TOP

The first query uses TOP — an extension to SQL provided by MS SQL Server. The key thing to notice is that the SELECT TOP statement is used in the HAVING clause. While the outer query prepares a list of rows, the HAVING clause makes sure that only rows that match the two Scores for each student in each course are returned in the final result.

The way that TOP works is that SELECT TOP N causes the database server to return only the first N rows returned by the query. It's very important when using TOP to also specify an ORDER BY clause. Since we want the top two Scores, we used ORDER BY s2.Score DESC in the subquery. If you don't specify an ORDER BY clause, you'll still get two rows returned, but they will be a random set of two rows. In this sense, the use of the word TOP is a bit misleading.

If you're not familiar with TOP, you may want to consult the MS SQL Server documentation for a complete description of this extension.

The core of the query is in the HAVING clause. It checks if the Score of the current group is in the list of the top two Scores for the group:

HAVING MAX(s1.Score) IN 
   (SELECT TOP 2 s2.Score 
       FROM Students s2
       WHERE s1.CourseId=s2.CourseId AND
          s1.StudentName=s2.StudentName
    ORDER BY s2.Score DESC)

Note that the MAX function does not have any real effect since the GROUP BY clause separates the data into single rows anyway. There is only one item per group, so MAX returns the Score for that one item. Even though we aren't summarizing the Score, the MAX function is necessary to comply with the rules of SQL. We used MAX(s1.Score) in the select list, so we must use MAX(s1.Score) here as well. The same results could be obtained by not aggregating the Score column and instead adding it to the GROUP BY clause. However, performance would suffer somewhat — the optimizer works better with fewer columns for grouping.

The subquery lists all Scores for the current course and the student, and it ranks them in descending order (using the ORDER BY clause). This ensures that the first two records returned reflect the top two Scores for the student in the course. ORDER BY is necessary here, specifically because TOP is being used.

Always remember to use ORDER BY in conjunction with TOP.

2.12.3.2 Using a self-join

The second solution in this recipe uses a self-join to find the two best papers for each student in each course. In the FROM clause we join the Students table to itself. The two copies are given aliases of s1 and s2, respectively. The WHERE clause restricts the join such that we get the Cartesian product (all combinations) of all term papers for each student/course group.

By applying the GROUP BY clause, we form groups identified by a course — a student and a term paper from the first table combined with all possible combinations of those same term papers from the second table. Since we join the tables by course ID and student name, the corresponding columns have equal values and can be treated as one column. In other words, s1.StudentName will always be the same as s2.StudentName. The same is true for s1.CourseId and s2.CourseId.

This is not the case with the term paper column. By using a self-join, we produced a Cartesian product of all possible pairs of term papers for a given student in a given course. This might sound a bit confusing, so consider the following example that shows the Cartesian product of term papers for the student named Andrew in the course named ACCN101:

StudentName  CourseId  s1.TermPaper  s1.Score  s2.TermPaper  s2.Score 
------------ --------- ------------- --------- ------------- ---------  
Andrew       ACCN101   4             15.60     4             15.60
Andrew       ACCN101   4             15.60     2             10.40
Andrew       ACCN101   4             15.60     3             11.00
Andrew       ACCN101   2             10.40     4             15.60
Andrew       ACCN101   2             10.40     2             10.40
Andrew       ACCN101   2             10.40     3             11.00
Andrew       ACCN101   3             11.00     4             15.60
Andrew       ACCN101   3             11.00     2             10.40
Andrew       ACCN101   3             11.00     3             11.00

After the joining and grouping has taken place, the HAVING clause uses the CASE statement to identify all cases where the Score from the first table (s1) is less than or equal to the Score from the second table (s2). For each row in the group, the CASE statement returns 1 if the condition is met and NULL otherwise. The SUM function then sums the result within a group. The result is that, for each term paper, we know now how many other papers by the same student for the same course have a less-than-or-equal-to Score. Finally, the HAVING clause causes the query to return only those groups that have just one or two less-than-or-equal-to Scores.

To illustrate this, let's work through an example in detail. Here are the first three rows from the previous example for the student named Andrew:

Andrew       ACCN101   4             15.60     4             15.60
Andrew       ACCN101   4             15.60     2             10.40
Andrew       ACCN101   4             15.60     3             11.00

Apply the CASE statement to the first group, and you'll see that it returns 4 for the first row, NULL for the second row, and NULL for the third row. The sum of those three values (1+NULL+NULL) is 1, which is smaller than 2 as required by the HAVING condition. Therefore, the group is reported in the result. Since the grouping is by student name, course ID, and term paper number, an aggregate function must be used to report the term paper Score. Our query uses the MAX function, but any function could have been used because there is only one term paper in a group.

Similar reasoning applies to the second group of three rows. In that group, the CASE statement returns a 1 for all three rows. The SUM of those values (1+1+1) is 3. Since 3 is not <=2, as the HAVING clause requires, the group is excluded from the result. The third group of three rows for Andrew does meet the specified condition, and, in the final result, you can see that Andrew's best Scores came from his first and third term papers.

Since TOP is not being used, the ORDER BY clause in the second query only serves to make the output more readable. If order is not important, you'll get better performance by leaving off the ORDER BY clause.