# Cracking Hard SQL: The Window-Function Patterns That Recur

A junior analyst once handed me a 200-line query — nested subqueries inside nested subqueries, the same table joined to itself four times — and asked why it timed out. The task was simple: the top three earners per department. I rewrote it in about ten lines with one window function, and it ran instantly. That gap, between the brute-force SQL people reach for first and the clean SQL that the problem actually wants, is what the hard and medium LeetCode SQL problems are really testing. And the good news, which took me too long to internalize: **those problems are not infinite — they're five or six recurring patterns wearing different costumes.** Learn the patterns and "hard SQL" stops being hard.

This is a tour of the patterns I see over and over, in interviews and in production. The unifying tool is the **window function** — now finally available everywhere (even MySQL got them in 8.0 last year), so there's no excuse left to brute-force these with self-joins.

## Pattern 1: Top-N per group (the one everyone gets wrong first)

The canonical problem: "the highest-paid employee in each department," or its harder sibling, "the top three." The instinct is a correlated subquery — "where salary equals the max salary for this department" — which works for top-1, breaks awkwardly on ties, and falls apart entirely for top-N. The pattern that actually fits is **partition, order, rank, filter**.

```mermaid
graph LR
    A["All rows(employees)"]
    B["PARTITION BY dept(one window per group)"]
    C["ORDER BY salary DESC(within each group)"]
    D["ROW_NUMBER / RANK / DENSE_RANK(number the rows)"]
    E["WHERE rn <= N(keep the top N)"]
    A --> B --> C --> D --> E
          
```

The top-N-per-group pattern, which solves a whole family of problems: partition the rows into one window per group, order within each window, number them, and filter to the top N. The only real decision is which ranking function — ROW_NUMBER (no ties, exactly N rows), RANK (ties share a rank and leave gaps), or DENSE_RANK (ties share, no gaps). "Top 3 distinct salaries" wants DENSE_RANK; "any 3 people" wants ROW_NUMBER. Picking the wrong one is the most common subtle bug.

```sql
-- top 3 earners per department, ties handled by choosing the right ranker
SELECT department, name, salary
FROM (
  SELECT department, name, salary,
         DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
  FROM employees
) t
WHERE rnk <= 3;
```

The three ranking functions are not interchangeable, and the difference is the whole game: `ROW_NUMBER` gives every row a distinct number (exactly N rows out, ties broken arbitrarily), `RANK` lets ties share a number and then skips (1,2,2,4), and `DENSE_RANK` lets ties share with no gap (1,2,2,3). "Second highest salary" — a LeetCode classic — is really "where DENSE_RANK = 2," and getting tie semantics right is what separates a correct answer from a plausible-looking wrong one.

## Pattern 2: Comparing a row to its neighbours

"Find numbers that appear three times consecutively." "Find days where revenue rose versus the day before." The brute-force approach joins the table to itself on `id = id+1` and `id = id+2` — and it works, but it's clumsy and slow. The pattern is `LAG` and `LEAD`, which let a row see the values before and after it without a join at all.

```sql
-- numbers appearing in 3+ consecutive rows, via LAG (no self-join)
SELECT DISTINCT num
FROM (
  SELECT num,
         LAG(num, 1) OVER (ORDER BY id) AS prev1,
         LAG(num, 2) OVER (ORDER BY id) AS prev2
  FROM logs
) t
WHERE num = prev1 AND num = prev2;
```

Once `LAG`/`LEAD` are in your toolkit, an entire class of "compare to the previous/next row" problems collapses from multi-join contortions into a single readable pass. Day-over-day deltas, detecting changes, "did the value increase" — all the same shape.

## Pattern 3: Gaps and islands (the one that looks impossible)

This is the pattern that stumps people, and it's beautiful once it clicks. "Find the longest streak of consecutive winning days." "Find ranges of consecutive available seats." You have rows that form runs (islands) separated by gaps, and you need to identify or measure the runs. The trick is almost magical: **subtract a row number from the sequence value, and every row in the same consecutive run shares the same difference.** That difference becomes a group key you can aggregate on.

```sql
-- group consecutive ids into "islands"; the (id - row_number) is constant per run
SELECT MIN(id) AS island_start, MAX(id) AS island_end, COUNT(*) AS length
FROM (
  SELECT id,
         id - ROW_NUMBER() OVER (ORDER BY id) AS grp   -- constant within a run
  FROM seats
  WHERE status = 'available'
) t
GROUP BY grp
ORDER BY length DESC;
```

Why it works: if ids are consecutive (5,6,7), the row numbers within the filtered set are also consecutive (1,2,3), so `id - row_number` is constant (4,4,4). The moment there's a gap, the difference jumps, starting a new group. It feels like a trick the first time; by the third gaps-and-islands problem it's just the tool you reach for.

## Pattern 4: Conditional aggregation (pivoting rows to columns)

"Show, per month, total sales for each of the three product categories as columns." The pattern is `SUM` (or `COUNT`) wrapped around a `CASE` — one aggregate per output column, each counting only the rows that match its condition. It's the portable, every-database way to pivot, and it's clearer than vendor-specific `PIVOT` syntax.

```sql
-- pivot category rows into columns with conditional aggregation
SELECT
  month,
  SUM(CASE WHEN category = 'electronics' THEN amount ELSE 0 END) AS electronics,
  SUM(CASE WHEN category = 'apparel'     THEN amount ELSE 0 END) AS apparel,
  SUM(CASE WHEN category = 'grocery'     THEN amount ELSE 0 END) AS grocery
FROM sales
GROUP BY month;
```

The same shape counts conditionally (replace `amount` with `1`), computes ratios (a conditional sum over a total), and builds the kind of cross-tab summaries that dashboards live on. Conditional aggregation is the SQL workhorse hiding behind half of all reporting queries.

## Pattern 5: The anti-join, and the NULL trap that fails silently

"Customers who never placed an order." This is an **anti-join**, and there are three ways to write it — but one of them is a trap.

| Approach | Verdict |
| --- | --- |
| `LEFT JOIN ... WHERE right.id IS NULL` | Reliable and clear — my default |
| `WHERE NOT EXISTS (SELECT 1 ...)` | Reliable, often the optimizer's favourite |
| `WHERE id NOT IN (SELECT fk ...)` | Dangerous — breaks silently if the subquery returns any NULL |

**`NOT IN` with a nullable subquery is the bug that returns zero rows and no error.** If the subquery behind `NOT IN` returns even a single NULL, the whole predicate evaluates to NULL (not true) for every row, and you silently get an empty result — no error, no warning, just wrong. It's three-valued logic doing exactly what the standard says, and it has burned more analysts than almost any other SQL gotcha. Prefer `NOT EXISTS` or the `LEFT JOIN / IS NULL` form, which both handle NULLs correctly. If you must use `NOT IN`, guarantee the subquery can't return NULL. This one is worth memorizing because it fails in the worst way — quietly.

## One more: dedup, keep the latest

Not flashy, but it's everywhere in real work: "keep only the most recent row per user." It's just the top-N pattern with N=1 and an ordering by timestamp — but worth naming because you'll write it constantly when cleaning data.

```sql
-- one row per user: the most recent, dropping the rest
SELECT user_id, status, updated_at
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM user_events
) t
WHERE rn = 1;
```

**The mental shift that makes all of this click: stop thinking row-by-row, start thinking in windows and groups.** Brute-force SQL comes from imagining a loop — "for each row, go find the related rows" — which leads straight to correlated subqueries and self-joins. Window functions let you instead describe a *frame* of rows relative to the current one and compute over it declaratively. Once you think "what window do I need — partitioned how, ordered how, framed how?" the five patterns above are just answers to that single question, and the 200-line query becomes ten. This set-oriented mindset is also exactly what makes your SQL fast, because the planner can optimize a window function far better than a pile of self-joins.

## What to carry away

Hard SQL problems aren't a bottomless pit of cleverness — they're a small set of recurring patterns. Top-N per group is partition-order-rank-filter (and choosing ROW_NUMBER vs RANK vs DENSE_RANK by tie semantics). Neighbour comparisons are `LAG`/`LEAD`, not self-joins. Gaps-and-islands is the `value - row_number` grouping trick. Pivoting is conditional aggregation with `CASE`. Anti-joins are `NOT EXISTS` or `LEFT JOIN / IS NULL` — never nullable `NOT IN`. And dedup is just top-N with N=1.

Underneath all six is one shift: think in sets and windows, not row-by-row loops. That's what window functions were built for, it's what the planner optimizes best, and it's what turns a timed-out 200-line query into ten readable lines. Master these patterns and you'll recognize the costume the moment a "hard" problem walks in — which is most of solving it. The same set-thinking pays off everywhere SQL runs, from a warehouse [dimensional model](dimensional-modeling-kimball) to the analytical queries a [warehouse design](designing-a-data-warehouse) is built to serve.
