SQL Window Functions | Advanced SQL

Starting here? This lesson is part of a full-length tutorial in using SQL for Data Analysis. Check out the beginning.

In this lesson we'll cover:

Intro to window functions
Basic windowing syntax
The usual suspects: SUM, COUNT, and AVG
ROW_NUMBER()
RANK() and DENSE_RANK()
NTILE
LAG and LEAD
Defining a window alias
Advanced windowing techniques

This lesson uses data from Washington DC's Capital Bikeshare Program, which publishes detailed trip-level historical data on their website. The data was downloaded in February 2014, but is limited to data collected during the first quarter of 2012. Each row represents one ride. Most fields are self-explanatory, except rider_type: "Registered" indicates a monthly membership to the rideshare program, and "Casual" indicates that the rider bought a 3-day pass. The start_time and end_time fields were cleaned up from their original forms to suit SQL date formatting; they are stored in this table as timestamps.

Intro to window functions

PostgreSQL's documentation does an excellent job of introducing the concept of Window Functions:

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

The most practical example of this is a running total:

SELECT duration_seconds, SUM(duration_seconds) OVER (ORDER BY start_time) AS running_total FROM tutorial.dc_bikeshare_q1_2012

You can see that the above query creates an aggregation (running_total) without using GROUP BY. Let's break down the syntax and see how it works.

Basic windowing syntax

The first part of the above aggregation, SUM(duration_seconds), looks a lot like any other aggregation. Adding OVER designates it as a window function. You could read the above aggregation as "take the sum of duration_seconds over the entire result set, in order by start_time."

If you'd like to narrow the window from the entire dataset to individual groups within the dataset, you can use PARTITION BY to do so:

SELECT start_terminal, duration_seconds, SUM(duration_seconds) OVER (PARTITION BY start_terminal ORDER BY start_time) AS running_total FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08'

The above query partitions the rows by start_terminal. Within each value of start_terminal, rows are ordered by start_time, and the running total sums duration_seconds across the current row and all previous rows. Scroll down until the start_terminal value changes and you will notice that running_total starts over. That's what happens when you group using PARTITION BY. In case you're still stumped by ORDER BY, it simply orders by the designated column(s) the same way a regular ORDER BY clause would, except that it treats every partition as separate. It is also what makes the total a running one: without ORDER BY, each value will simply be the sum of all the duration_seconds values in its respective start_terminal. Try running the above query without ORDER BY to get an idea:

SELECT start_terminal, duration_seconds, SUM(duration_seconds) OVER (PARTITION BY start_terminal) AS start_terminal_total FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08'

The ORDER BY and PARTITION BY clauses define what is referred to as the "window": the ordered subset of data over which calculations are made.
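To see the difference between the two windows side by side, here is a minimal sketch that runs both on a handful of made-up rows. It assumes Python's built-in sqlite3 module (with SQLite 3.25 or later, which supports window functions) rather than Mode, and the table name rides and its values are hypothetical stand-ins for tutorial.dc_bikeshare_q1_2012.

```python
import sqlite3

# Hypothetical miniature stand-in for tutorial.dc_bikeshare_q1_2012.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rides (start_terminal INTEGER, start_time TEXT, duration_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [
        (31000, "2012-01-01 08:00:00", 100),
        (31000, "2012-01-01 09:00:00", 200),
        (31000, "2012-01-01 10:00:00", 300),
        (31001, "2012-01-01 08:30:00", 50),
        (31001, "2012-01-01 09:30:00", 70),
    ],
)

# With ORDER BY the sum accumulates row by row inside each partition;
# without it, every row gets the total for its whole partition.
query = """
SELECT start_terminal,
       duration_seconds,
       SUM(duration_seconds) OVER (PARTITION BY start_terminal ORDER BY start_time) AS running_total,
       SUM(duration_seconds) OVER (PARTITION BY start_terminal) AS start_terminal_total
  FROM rides
 ORDER BY start_terminal, start_time
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In this toy data the running_total column restarts when start_terminal changes, while start_terminal_total repeats the same value on every row of a terminal.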

Practice Problem

Write a modification of the above example query that shows the duration of each ride as a percentage of the total time accrued by riders from each start_terminal.


The usual suspects: SUM, COUNT, and AVG

When using window functions, you can apply the same aggregates that you would under normal circumstances: SUM, COUNT, and AVG. The easiest way to understand these is to re-run the previous example with some additional functions:

SELECT start_terminal, duration_seconds, SUM(duration_seconds) OVER (PARTITION BY start_terminal) AS running_total, COUNT(duration_seconds) OVER (PARTITION BY start_terminal) AS running_count, AVG(duration_seconds) OVER (PARTITION BY start_terminal) AS running_avg FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08'

Alternatively, the same functions with ORDER BY:

SELECT start_terminal, duration_seconds, SUM(duration_seconds) OVER (PARTITION BY start_terminal ORDER BY start_time) AS running_total, COUNT(duration_seconds) OVER (PARTITION BY start_terminal ORDER BY start_time) AS running_count, AVG(duration_seconds) OVER (PARTITION BY start_terminal ORDER BY start_time) AS running_avg FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08'

Make sure you plug those previous two queries into Mode and run them. This next practice problem is very similar to the examples, so try modifying the above code rather than starting from scratch.
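If you don't have Mode at hand, the ordered version can be approximated on a few invented rows. This is only a sketch: it assumes Python's sqlite3 module (SQLite 3.25 or later), and the rides table and its values are hypothetical.

```python
import sqlite3

# Hypothetical sample rows; not taken from the real bikeshare table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rides (start_terminal INTEGER, start_time TEXT, duration_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [
        (31000, "2012-01-01 08:00:00", 100),
        (31000, "2012-01-01 09:00:00", 200),
        (31000, "2012-01-01 10:00:00", 300),
        (31001, "2012-01-01 08:30:00", 50),
        (31001, "2012-01-01 09:30:00", 70),
    ],
)

# Running sum, count, and average over the same ordered window.
query = """
SELECT start_terminal,
       duration_seconds,
       SUM(duration_seconds) OVER (PARTITION BY start_terminal ORDER BY start_time) AS running_total,
       COUNT(duration_seconds) OVER (PARTITION BY start_terminal ORDER BY start_time) AS running_count,
       AVG(duration_seconds) OVER (PARTITION BY start_terminal ORDER BY start_time) AS running_avg
  FROM rides
 ORDER BY start_terminal, start_time
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each row's running_avg is simply its running_total divided by its running_count, which is a quick way to sanity-check the output.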

Practice Problem

Write a query that shows a running total of the duration of bike rides (similar to the last example), but grouped by end_terminal, and with ride duration sorted in descending order.

ROW_NUMBER()

ROW_NUMBER() does just what it sounds like: it displays the number of a given row. It starts at 1 and numbers the rows according to the ORDER BY part of the window statement. ROW_NUMBER() does not require you to specify a variable within the parentheses:

SELECT start_terminal, start_time, duration_seconds, ROW_NUMBER() OVER (ORDER BY start_time) AS row_number FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08'

Using the PARTITION BY clause will allow you to begin counting 1 again in each partition. The following query starts the count over again for each terminal:

SELECT start_terminal, start_time, duration_seconds, ROW_NUMBER() OVER (PARTITION BY start_terminal ORDER BY start_time) AS row_number FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08'
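The two numbering schemes can be compared in a single result set. The sketch below assumes Python's sqlite3 module (SQLite 3.25 or later) and a hypothetical rides table with invented values; the aliases overall_row and terminal_row are illustrative names only.

```python
import sqlite3

# Hypothetical sample rows standing in for the bikeshare data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rides (start_terminal INTEGER, start_time TEXT, duration_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [
        (31000, "2012-01-01 08:00:00", 100),
        (31001, "2012-01-01 08:30:00", 50),
        (31000, "2012-01-01 09:00:00", 200),
        (31001, "2012-01-01 09:30:00", 70),
        (31000, "2012-01-01 10:00:00", 300),
    ],
)

# overall_row numbers every ride; terminal_row restarts at 1 per terminal.
query = """
SELECT start_terminal,
       start_time,
       ROW_NUMBER() OVER (ORDER BY start_time) AS overall_row,
       ROW_NUMBER() OVER (PARTITION BY start_terminal ORDER BY start_time) AS terminal_row
  FROM rides
 ORDER BY start_time
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```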

RANK() and DENSE_RANK()

RANK() is slightly different from ROW_NUMBER(). If you order by start_time, for example, it might be the case that some terminals have rides with two identical start times. In this case, they are given the same rank, whereas ROW_NUMBER() gives them different numbers. In the results of the following query, you'll notice the 4th and 5th observations for start_terminal 31000: they are both given a rank of 4, and the following result receives a rank of 6:

SELECT start_terminal, duration_seconds, RANK() OVER (PARTITION BY start_terminal ORDER BY start_time) AS rank FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08'

You can also use DENSE_RANK() instead of RANK() depending on your application. Imagine a situation in which three entries have the same value. Using either command, they will all get the same rank. For the sake of this example, let's say it's "2." Here's how the two commands would evaluate the next results differently:

RANK() would give the identical rows a rank of 2, then skip ranks 3 and 4, so the next result would be 5.
DENSE_RANK() would still give all the identical rows a rank of 2, but the following row would be 3—no ranks would be skipped.
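The three-way tie described above can be reproduced with a tiny made-up table. This sketch assumes Python's sqlite3 module (SQLite 3.25 or later); the rides table, its values, and the aliases row_num, ride_rank, and ride_dense_rank are hypothetical.

```python
import sqlite3

# One short ride, three rides with an identical duration, one longer ride.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (duration_seconds INTEGER)")
conn.executemany(
    "INSERT INTO rides VALUES (?)",
    [(10,), (20,), (20,), (20,), (30,)],
)

query = """
SELECT duration_seconds,
       ROW_NUMBER() OVER (ORDER BY duration_seconds) AS row_num,
       RANK()       OVER (ORDER BY duration_seconds) AS ride_rank,
       DENSE_RANK() OVER (ORDER BY duration_seconds) AS ride_dense_rank
  FROM rides
 ORDER BY duration_seconds, row_num
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The tied rows share rank 2 under both functions; the final row is ranked 5 by RANK() and 3 by DENSE_RANK(), while ROW_NUMBER() never repeats a value.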

Practice Problem

Write a query that shows the 5 longest rides from each starting terminal, ordered by terminal, and longest to shortest rides within each terminal. Limit to rides that occurred before Jan. 8, 2012.


NTILE

You can use window functions to identify what percentile (or quartile, or any other subdivision) a given row falls into. The syntax is NTILE(number of buckets). In this case, ORDER BY determines which column to use to determine the quartiles (or whatever number of 'tiles you specify). For example:

SELECT start_terminal, duration_seconds, NTILE(4) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS quartile, NTILE(5) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS quintile, NTILE(100) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS percentile FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08' ORDER BY start_terminal, duration_seconds

Looking at the results from the query above, you can see that the percentile column doesn't calculate exactly as you might expect. If you only had two records and you were measuring percentiles, you'd expect one record to define the 1st percentile, and the other record to define the 100th percentile. Using the NTILE function, what you'd actually see is one record in the 1st percentile, and one in the 2nd percentile. You can see this in the results for start_terminal 31000—the percentile column just looks like a numerical ranking. If you scroll down to start_terminal 31007, you can see that it properly calculates percentiles because there are more than 100 records for that start_terminal. If you're working with very small windows, keep this in mind and consider using quartiles or similarly small bands.
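The small-window behavior is easy to reproduce. The following sketch assumes Python's sqlite3 module (SQLite 3.25 or later) and a hypothetical rides table in which one terminal has only two rides and another has eight; all values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (start_terminal INTEGER, duration_seconds INTEGER)")
# Terminal 1 has two rides; terminal 2 has eight.
sample = [(1, 100), (1, 200)] + [(2, 100 * i) for i in range(1, 9)]
conn.executemany("INSERT INTO rides VALUES (?, ?)", sample)

query = """
SELECT start_terminal,
       duration_seconds,
       NTILE(4)   OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS quartile,
       NTILE(100) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS percentile
  FROM rides
 ORDER BY start_terminal, duration_seconds
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With fewer rows than buckets, the percentile column is just a row count (1 and 2 for terminal 1, 1 through 8 for terminal 2), whereas the quartile column for terminal 2 splits the eight rides evenly into four groups of two.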

Practice Problem

Write a query that shows only the duration of the trip and the percentile into which that duration falls (across the entire dataset—not partitioned by terminal).


LAG and LEAD

It can often be useful to compare rows to preceding or following rows, especially if you've got the data in an order that makes sense. You can use LAG or LEAD to create columns that pull values from other rows—all you need to do is enter which column to pull from and how many rows away you'd like to do the pull. LAG pulls from previous rows and LEAD pulls from following rows:

SELECT start_terminal, duration_seconds, LAG(duration_seconds, 1) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS lag, LEAD(duration_seconds, 1) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS lead FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08' ORDER BY start_terminal, duration_seconds

This is especially useful if you want to calculate differences between rows:

SELECT start_terminal, duration_seconds, duration_seconds - LAG(duration_seconds, 1) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS difference FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08' ORDER BY start_terminal, duration_seconds

The first row of the difference column is null because there is no previous row from which to pull. Similarly, using LEAD will create nulls at the end of the dataset. If you'd like to make the results a bit cleaner, you can wrap it in an outer query to remove nulls:

SELECT * FROM ( SELECT start_terminal, duration_seconds, duration_seconds - LAG(duration_seconds, 1) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS difference FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08' ORDER BY start_terminal, duration_seconds ) sub WHERE sub.difference IS NOT NULL
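To make the position of the nulls concrete, here is a minimal sketch on invented rows. It assumes Python's sqlite3 module (SQLite 3.25 or later); the rides table and the aliases previous_duration and next_duration are hypothetical. SQL nulls come back as None in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (start_terminal INTEGER, duration_seconds INTEGER)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [(31000, 100), (31000, 250), (31000, 400), (31001, 60), (31001, 90)],
)

# LAG looks one row back and LEAD one row ahead, within each terminal.
query = """
SELECT start_terminal,
       duration_seconds,
       LAG(duration_seconds, 1)  OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS previous_duration,
       LEAD(duration_seconds, 1) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS next_duration,
       duration_seconds - LAG(duration_seconds, 1)
           OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS difference
  FROM rides
 ORDER BY start_terminal, duration_seconds
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)

# Same effect as the outer-query filter: keep rows that have a previous row.
non_null = [row for row in rows if row[4] is not None]
```

Note that the nulls appear at the first (LAG) and last (LEAD) row of every partition, not only at the edges of the whole result set.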

Defining a window alias

If you're planning to write several window functions into the same query, using the same window, you can create an alias. Take the NTILE example above:

SELECT start_terminal, duration_seconds, NTILE(4) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS quartile, NTILE(5) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS quintile, NTILE(100) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS percentile FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08' ORDER BY start_terminal, duration_seconds

This can be rewritten as:

SELECT start_terminal, duration_seconds, NTILE(4) OVER ntile_window AS quartile, NTILE(5) OVER ntile_window AS quintile, NTILE(100) OVER ntile_window AS percentile FROM tutorial.dc_bikeshare_q1_2012 WHERE start_time < '2012-01-08' WINDOW ntile_window AS (PARTITION BY start_terminal ORDER BY duration_seconds) ORDER BY start_terminal, duration_seconds

The WINDOW clause, if included, should always come after the WHERE clause (and after any GROUP BY or HAVING clause) and before ORDER BY.
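As a quick check that the alias form and the inline form are equivalent, the sketch below runs both against a few invented rows and compares the results. It assumes Python's sqlite3 module with a SQLite version that supports the WINDOW clause (3.25 or later, as far as we know); the rides table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rides (start_terminal INTEGER, start_time TEXT, duration_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO rides VALUES (?, ?, ?)",
    [
        (31000, "2012-01-01 08:00:00", 100),
        (31000, "2012-01-02 08:00:00", 200),
        (31000, "2012-01-03 08:00:00", 300),
        (31000, "2012-01-04 08:00:00", 400),
        (31000, "2012-01-09 08:00:00", 500),  # excluded by the WHERE filter
    ],
)

inline_query = """
SELECT start_terminal,
       duration_seconds,
       NTILE(2) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS half,
       NTILE(4) OVER (PARTITION BY start_terminal ORDER BY duration_seconds) AS quartile
  FROM rides
 WHERE start_time < '2012-01-08'
 ORDER BY start_terminal, duration_seconds
"""

# WINDOW sits after WHERE and before ORDER BY.
alias_query = """
SELECT start_terminal,
       duration_seconds,
       NTILE(2) OVER ntile_window AS half,
       NTILE(4) OVER ntile_window AS quartile
  FROM rides
 WHERE start_time < '2012-01-08'
WINDOW ntile_window AS (PARTITION BY start_terminal ORDER BY duration_seconds)
 ORDER BY start_terminal, duration_seconds
"""

inline_rows = conn.execute(inline_query).fetchall()
alias_rows = conn.execute(alias_query).fetchall()
print(alias_rows)
print(inline_rows == alias_rows)
```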

Advanced windowing techniques

You can check out a complete list of window functions in Postgres (the syntax Mode uses) in the Postgres documentation. If you're using window functions on a connected database, you should look at the appropriate syntax guide for your system.

If you're interested, we rounded up the top five most popular window functions and expanded on the commonalities of window functions in Python and SQL.

SQL Window Functions | Advanced SQL - Mode (2024)

