SQL Pagination: Fixing Duplicates & Lag

In production environments handling high-throughput data, pagination is often implemented naively using LIMIT and OFFSET. While functionally correct for small, static datasets, this approach inevitably degrades in two critical areas: data consistency and query latency. Engineers frequently report "phantom rows" (duplicates appearing across pages) or skipped records when sorting by non-unique columns like timestamps. Furthermore, as the offset increases, query cost grows linearly with it (O(N)), causing severe bottlenecks in large-scale applications.

1. The Non-Deterministic Sorting Problem

The root cause of duplicate or missing records during pagination lies in non-deterministic sorting. The SQL standard does not guarantee a specific order for rows with identical values in the ORDER BY columns. When a query sorts by a column with low cardinality or frequent duplicates (e.g., posted_at, status), the database engine is free to return tied rows in any order, depending on the current execution plan, disk state, or parallel worker availability.

Consider a scenario where 100 articles were published at the exact same second. If you request LIMIT 10 OFFSET 0, the DB returns an arbitrary set of 10. When you request OFFSET 10, the DB performs the sort again. Without a unique tie-breaker, the "random" order of the tied rows may change, causing rows from Page 1 to reappear on Page 2, and others to be skipped entirely.

Common Pitfall: Relying on ORDER BY created_at DESC is unsafe for pagination if multiple records can be created within the same timestamp resolution (e.g., milliseconds).
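To make the failure concrete, the tie scenario is easy to reproduce. The sketch below assumes a minimal articles table and uses PostgreSQL's generate_series to create 100 rows sharing one timestamp; table and column names match the examples later in this article.

-- Minimal reproduction (PostgreSQL syntax assumed)
CREATE TABLE articles (
    id     SERIAL PRIMARY KEY,
    title  TEXT NOT NULL,
    posted TIMESTAMP NOT NULL
);

-- 100 rows tied on the exact same timestamp
INSERT INTO articles (title, posted)
SELECT 'Article ' || n, TIMESTAMP '2023-11-10 10:30:00'
FROM generate_series(1, 100) AS n;

-- Every row ties on "posted", so the engine may order them differently
-- on each execution; these two "pages" can overlap or skip rows.
SELECT id, title FROM articles ORDER BY posted DESC LIMIT 10 OFFSET 0;
SELECT id, title FROM articles ORDER BY posted DESC LIMIT 10 OFFSET 10;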

Visualizing the Instability

The following table illustrates how the internal sort order can shift between two identical queries, leading to data inconsistencies.

| Execution | ID (PK) | Posted (Non-Unique) | Internal Sort Position |
|-----------|---------|---------------------|------------------------|
| Query A (Page 1) | 105 | 10:30:00 | 1 |
| Query A (Page 1) | 104 | 10:30:00 | 2 (Returned) |
| Query B (Page 2) | 102 | 10:30:00 | 1 (Skipped via Offset) |
| Query B (Page 2) | 104 | 10:30:00 | 2 (Returned again!) |

2. Solution: Deterministic Tie-Breakers

To guarantee stability, the sorting criteria must be unique for every row. This is achieved by appending a Tie-Breaker column to the ORDER BY clause. The Primary Key (`id`) is the ideal candidate because it is indexed, unique, and non-null.

The database engine will first sort by the business requirement (`posted`), and upon encountering ties, it will strictly resolve them using the `id`. This ensures that for any given dataset, the sequence of rows remains immutable across query executions.


-- Unstable Query (Do Not Use)
SELECT id, title, posted 
FROM articles 
ORDER BY posted DESC 
LIMIT 10 OFFSET 10;

-- Stable Query (Best Practice)
SELECT id, title, posted 
FROM articles 
ORDER BY posted DESC, id DESC -- Deterministic Tie-Breaker
LIMIT 10 OFFSET 10;

Architecture Note: While internal row identifiers (like SQLite's ROWID or PostgreSQL's ctid) can serve as tie-breakers, explicit Primary Keys are preferred for portability and to avoid issues after operations like VACUUM or table rewrites.
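For the stable query to stay fast, the tie-breaker should also be covered by an index that matches the full ORDER BY. A minimal sketch, assuming PostgreSQL syntax and a hypothetical index name:

-- Composite index matching the stable sort order, letting the engine
-- read rows pre-sorted instead of running an explicit sort step.
CREATE INDEX idx_articles_posted_id ON articles (posted DESC, id DESC);

The same index also serves the keyset queries in the next section, which seek on the (posted, id) pair.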

3. High-Performance Strategy: Keyset Pagination

While adding a tie-breaker solves consistency issues, OFFSET pagination still suffers from the "Deep Pagination" performance problem. To return rows 1,000,001 through 1,000,010, the database must fetch and sort 1,000,010 rows, discard the first million, and return the final ten. This creates unnecessary I/O and CPU overhead.
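For concreteness, a deep page under OFFSET pagination looks like the query below; even with the composite index in place, the engine still walks past a million rows in sorted order before returning anything.

-- Deep OFFSET: the engine scans 1,000,010 rows in sorted order
-- and discards the first 1,000,000 of them.
SELECT id, title, posted
FROM articles
ORDER BY posted DESC, id DESC
LIMIT 10 OFFSET 1000000;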

Keyset Pagination (also known as the Seek Method or Cursor-based Pagination) eliminates OFFSET entirely. Instead of skipping rows, the query "seeks" directly to the last record of the previous page using the index.

Implementation Logic

The client must track the values of the sort columns from the last received item. For the next request, these values act as a cursor.


-- Optimized Keyset Query
-- Assume the last item of Page 1 had: posted='2023-11-10', id=542
SELECT id, title, posted
FROM articles
WHERE 
  (posted, id) < ('2023-11-10', 542) -- Tuple comparison
ORDER BY 
  posted DESC, id DESC
LIMIT 10;

This approach leverages the composite index on (posted, id) to jump directly to the target record in O(log N) time, regardless of how deep the page is.
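One portability caveat: row-value (tuple) comparison is supported by PostgreSQL and SQLite, and MySQL accepts the syntax (how well it uses the index varies by version), but some engines, such as SQL Server, do not support it at all. The predicate can always be expanded into plain boolean logic, sketched below with LIMIT kept for consistency with the earlier examples:

-- Logically equivalent keyset predicate without row-value syntax
SELECT id, title, posted
FROM articles
WHERE posted < '2023-11-10'
   OR (posted = '2023-11-10' AND id < 542)
ORDER BY posted DESC, id DESC
LIMIT 10;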

Comparison: Offset vs. Keyset

| Feature | Offset Pagination | Keyset Pagination |
|---------|-------------------|-------------------|
| Complexity | O(N) (linear degradation with depth) | O(log N) (effectively flat per page) |
| Consistency | Unstable without a tie-breaker | Inherently stable |
| Data Changes | Insertions shift page boundaries | Immune to insertions/deletions |
| Navigation | Random jump (Page 1 → 50) | Sequential only (Next/Prev) |

Keyset pagination is the standard for infinite scroll interfaces (like Twitter or Facebook feeds) where users sequentially consume data. If your UI requires specific page numbers (e.g., "Go to Page 5"), OFFSET remains necessary, but must be paired with the deterministic tie-breaker optimization discussed in Section 2.
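One note on the "Sequential only" limitation: a "previous page" is still achievable with keyset pagination by flipping both the comparison and the sort direction, then restoring display order in an outer query. A sketch using the same cursor values as before:

-- "Previous page" relative to the cursor (posted='2023-11-10', id=542):
-- seek backwards with an ascending scan, then re-sort for display.
SELECT *
FROM (
    SELECT id, title, posted
    FROM articles
    WHERE (posted, id) > ('2023-11-10', 542)
    ORDER BY posted ASC, id ASC
    LIMIT 10
) AS prev_page
ORDER BY posted DESC, id DESC;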

Conclusion

Data consistency in pagination is not optional. The default behavior of SQL databases prioritizes speed over sort stability for ties, requiring explicit architectural decisions from engineers. To build robust systems:

  1. Mandatory: Always include a unique column (Primary Key) in your ORDER BY clause.
  2. Performance: Adopt Keyset Pagination for large datasets or infinite scrolling features to keep query times flat regardless of page depth.
  3. UX Trade-off: Use Offset Pagination only when random page access is a strict business requirement, understanding the performance costs involved.
