In production environments handling high-throughput data, pagination is often implemented naively using LIMIT and OFFSET. While functionally correct for small, static datasets, this approach degrades in two critical areas: data consistency and query latency. Engineers frequently report "phantom rows" (duplicates appearing across pages) or skipped records when sorting by non-unique columns such as timestamps. Furthermore, as the offset increases, query cost grows linearly (O(N)), causing severe bottlenecks in large-scale applications.
1. The Non-Deterministic Sorting Problem
The root cause of duplicate or missing records during pagination lies in non-deterministic sorting. SQL standards do not guarantee a specific order for rows with identical values in the ORDER BY clause. When a query sorts by a column with low cardinality or frequent duplicates (e.g., posted_at, status), the database engine is free to return tied rows in any order based on the current execution plan, disk state, or parallel worker availability.
Consider a scenario where 100 articles were published at the exact same second. If you request LIMIT 10 OFFSET 0, the DB returns an arbitrary set of 10. When you request OFFSET 10, the DB generates the sort again. Without a unique tie-breaker, the "random" order of the tied rows may change, causing rows from Page 1 to reappear on Page 2.
ORDER BY posted DESC is unsafe for pagination if multiple records can be created within the same timestamp resolution (e.g., milliseconds).
Visualizing the Instability
The following table illustrates how the internal sort order can shift between two identical queries, leading to data inconsistencies.
| Execution | ID (PK) | Posted (Non-Unique) | Internal Index Position |
|---|---|---|---|
| Query A (Page 1) | 105 | 10:30:00 | 1 |
| Query A (Page 1) | 104 | 10:30:00 | 2 (Returned) |
| Query B (Page 2) | 102 | 10:30:00 | 1 (Skipped via Offset) |
| Query B (Page 2) | 104 | 10:30:00 | 2 (Returned again!) |
2. Solution: Deterministic Tie-Breakers
To guarantee stability, the sorting criteria must be unique for every row. This is achieved by appending a Tie-Breaker column to the ORDER BY clause. The Primary Key (`id`) is the ideal candidate because it is indexed, unique, and non-null.
The database engine will first sort by the business requirement (`posted`), and upon encountering ties, it will strictly resolve them using the `id`. This ensures that for any given dataset, the sequence of rows remains immutable across query executions.
```sql
-- Unstable Query (Do Not Use)
SELECT id, title, posted
FROM articles
ORDER BY posted DESC
LIMIT 10 OFFSET 10;
```

```sql
-- Stable Query (Best Practice)
SELECT id, title, posted
FROM articles
ORDER BY posted DESC, id DESC -- Deterministic Tie-Breaker
LIMIT 10 OFFSET 10;
```
While implicit row identifiers (e.g., SQLite's ROWID or PostgreSQL's ctid) can serve as tie-breakers, explicit Primary Keys are preferred for portability and to avoid issues after operations like VACUUM or table rewrites.
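The tie-breaker guarantee can be verified with a runnable sanity check. This is a minimal sketch using Python's built-in sqlite3 module and a throwaway in-memory `articles` table (the seed data is an assumption for illustration): even when every row shares the same timestamp, `ORDER BY posted DESC, id DESC` partitions the rows into non-overlapping pages.

```python
import sqlite3

# Hypothetical dataset: 100 articles, all "published" in the same second,
# which is exactly the worst case for a non-unique sort key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, posted TEXT)")
conn.executemany(
    "INSERT INTO articles (id, title, posted) VALUES (?, ?, ?)",
    [(i, f"article {i}", "2023-11-10 10:30:00") for i in range(1, 101)],
)

def page(offset):
    # Stable query: (posted, id) fully determines the order, so page
    # boundaries cannot shift between executions.
    rows = conn.execute(
        "SELECT id FROM articles ORDER BY posted DESC, id DESC LIMIT 10 OFFSET ?",
        (offset,),
    ).fetchall()
    return [r[0] for r in rows]

# Walking every page yields each id exactly once: no duplicates, no skips.
seen = [i for off in range(0, 100, 10) for i in page(off)]
assert sorted(seen) == list(range(1, 101))
```

Dropping `id DESC` from the ORDER BY would make the same loop free to return duplicated or missing ids, depending on the engine's execution plan.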
3. High-Performance Strategy: Keyset Pagination
While adding a tie-breaker solves consistency issues, OFFSET pagination still suffers from the "Deep Pagination" performance problem. To return records 1,000,000 to 1,000,010, the database must fetch and sort 1,000,010 rows, discard the first million, and return the rest. This creates unnecessary I/O and CPU overhead.
Keyset Pagination (also known as the Seek Method or Cursor-based Pagination) eliminates OFFSET entirely. Instead of skipping rows, the query "seeks" directly to the last record of the previous page using the index.
Implementation Logic
The client must track the values of the sort columns from the last received item. For the next request, these values act as a cursor.
```sql
-- Optimized Keyset Query
-- Assume the last item of Page 1 had: posted='2023-11-10', id=542
SELECT id, title, posted
FROM articles
WHERE (posted, id) < ('2023-11-10', 542) -- Tuple (row-value) comparison
ORDER BY posted DESC, id DESC
LIMIT 10;
```
This approach leverages a composite index on (posted, id) to seek directly to the target record in O(log N) time, regardless of how deep the page is. Note that not every engine supports row-value comparisons; where they are unavailable, the predicate can be expanded to the equivalent `posted < '2023-11-10' OR (posted = '2023-11-10' AND id < 542)`.
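The seek pattern can be sketched end to end with Python's built-in sqlite3 module (SQLite 3.15+ is needed for row-value comparisons). The `next_page` helper and the sample data below are illustrative assumptions, not part of the original schema; the cursor is simply the `(posted, id)` pair of the last row received.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, posted TEXT)")
# Hypothetical data: 50 articles spread over 5 days, so timestamps repeat.
conn.executemany(
    "INSERT INTO articles (id, title, posted) VALUES (?, ?, ?)",
    [(i, f"article {i}", f"2023-11-{(i % 5) + 10:02d}") for i in range(1, 51)],
)
conn.execute("CREATE INDEX idx_articles_posted_id ON articles (posted, id)")

def next_page(cursor=None, size=10):
    """Fetch one page; cursor is the (posted, id) of the last row seen."""
    if cursor is None:
        return conn.execute(
            "SELECT posted, id FROM articles "
            "ORDER BY posted DESC, id DESC LIMIT ?", (size,)).fetchall()
    return conn.execute(
        "SELECT posted, id FROM articles "
        "WHERE (posted, id) < (?, ?) "          # row-value seek, no OFFSET
        "ORDER BY posted DESC, id DESC LIMIT ?", (*cursor, size)).fetchall()

cursor, seen = None, []
while True:
    rows = next_page(cursor)
    if not rows:
        break
    seen.extend(r[1] for r in rows)
    cursor = rows[-1]  # (posted, id) of the last row becomes the next cursor

# The seek loop visits every row exactly once, in the same order as a
# single full stable query.
full = [r[0] for r in conn.execute(
    "SELECT id FROM articles ORDER BY posted DESC, id DESC")]
assert seen == full
```

Because each iteration filters with an indexed range predicate instead of discarding skipped rows, page 1 and page 100,000 cost roughly the same.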
| Feature | Offset Pagination | Keyset Pagination |
|---|---|---|
| Complexity | O(N) (degrades linearly with depth) | O(log N) (near-constant at any depth) |
| Consistency | Unstable without tie-breaker | Inherently stable |
| Data Change | Insertions shift page boundaries | Robust to insertions/deletions |
| Navigation | Random Jump (Page 1 -> 50) | Sequential Only (Next/Prev) |
Keyset pagination is the standard for infinite scroll interfaces (like Twitter or Facebook feeds) where users sequentially consume data. If your UI requires specific page numbers (e.g., "Go to Page 5"), OFFSET remains necessary, but must be paired with the deterministic tie-breaker optimization discussed in Section 2.
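In an HTTP API, the `(posted, id)` pair is usually shipped to the client as an opaque token rather than as raw column values. A minimal sketch, assuming base64-encoded JSON; the `encode_cursor`/`decode_cursor` helper names are hypothetical:

```python
import base64
import json

def encode_cursor(posted, id_):
    """Serialize the keyset cursor into an opaque, URL-safe token."""
    payload = json.dumps({"posted": posted, "id": id_}).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")

def decode_cursor(token):
    """Recover the (posted, id) pair from a token sent by the client."""
    payload = json.loads(base64.urlsafe_b64decode(token))
    return payload["posted"], payload["id"]

token = encode_cursor("2023-11-10", 542)
assert decode_cursor(token) == ("2023-11-10", 542)
```

Opaque tokens keep the sort columns an implementation detail: the server can later change the cursor contents (or sign them) without breaking API clients.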
Conclusion
Data consistency in pagination is not optional. The default behavior of SQL databases prioritizes speed over sort stability for ties, requiring explicit architectural decisions from engineers. To build robust systems:
- Mandatory: Always include a unique column (Primary Key) in your `ORDER BY` clause.
- Performance: Adopt Keyset Pagination for large datasets or infinite scrolling features to maintain fast, predictable query times at scale.
- UX Trade-off: Use Offset Pagination only when random page access is a strict business requirement, understanding the performance costs involved.