When building APIs that return a non-trivial amount of data, we need to make design decisions based on our underlying database. For example, a developer may implement an endpoint that lets clients retrieve a window of a larger collection. As implementers of an API, we have a choice between page-based paging and cursor-based paging.
This post will show how to implement both approaches and why cursor-based paging is what performance-focused developers should prefer when choosing between the two options.
What is Paging?
For developers getting into building data-driven APIs, paging is an important concept to understand. A collection can exist in one of two logical categories: bounded and unbounded.
A bounded collection has a perceived logical limit. For example, a family would likely have an upper bound to the number of children within the family unit. In most circumstances, a family would have 2-4 children. Thus, we’d likely return all the children in a single response when building an API around a family resource.
An unbounded collection has no known limit. These are usually collections associated with user input, time-series data, or other data collection mechanisms. Social media platforms, for example, operate on unbounded resources: on Twitter, tweets are sent by millions of individuals around the planet simultaneously. It is impossible to return the complete collection to any client in these cases; instead, the API produces a fragment based on request criteria.
Regarding APIs, paging is the developer’s attempt to give clients a partial result set from a larger data source that may be impossible to interact with otherwise.
Choosing The Right Paging Approach
When developers consider paging, there are typically two approaches:
- Paging by a page number and a page size
- Paging using a cursor identifier and a page size
The term cursor is overloaded here and shouldn’t be confused with a relational database’s concept of a cursor. While the idea shares similarities with the database feature, the two are unrelated.
A cursor is an identifier used to retrieve the elements that follow it in subsequent paging requests. Users can think of it as asking the question, “what comes after this cursor identifier?”.
Let’s look at the two approaches in a request format.
### Paging By Page and Size
GET https://localhost:5001/pictures/paging?page=100000&size=10
Accept: application/json
### Paging By Cursor and Size
GET https://localhost:5001/pictures/cursor?after=999990&size=10
Accept: application/json
The two HTTP requests look very similar from a user’s perspective but operate fundamentally differently.
For a paging request, we calculate a window in our data store and retrieve those elements. Here, we’ll page a collection of Picture entities to retrieve the results on a particular page.
var query = db
    .Pictures
    .OrderBy(x => x.Id)
    // skip every row before the requested page, then take one page
    .Skip((page - 1) * size)
    .Take(size);
While this approach works, developers should be aware of its logical caveats, the biggest being a shifting window as new data arrives. The other issue depends on the underlying database: when paging with logical windows, the database has to scan past every row that precedes the requested page before it can return the page to the client. Thus, the farther we page into the collection, the more performance degrades.
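To make the shifting-window caveat concrete, here is a small in-memory sketch with hypothetical ids ordered newest first (the typical ordering for a feed): an item inserted between two page requests pushes the tail of page one onto page two, so the client sees the same item twice.
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical ids, ordered newest first (as a social feed would be).
var ids = new List<int> { 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };

// The client fetches page 1 with a page size of 3: 10, 9, 8.
var page1 = ids.Skip(0).Take(3).ToList();

// A new item arrives before the client requests page 2.
ids.Insert(0, 11);

// Page 2 is now 8, 7, 6, so item 8 appears on both pages.
var page2 = ids.Skip(3).Take(3).ToList();

Console.WriteLine(string.Join(", ", page1)); // 10, 9, 8
Console.WriteLine(string.Join(", ", page2)); // 8, 7, 6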
Well, how does cursor paging work? Instead of calculating a logical window, we use the previous result to retrieve the next one. The cursor can be any value that reliably identifies where the next window starts; fields like auto-incrementing identifiers or timestamps are ideal for cursor-based paging. Cursor-based paging also doesn’t suffer from incoming data throwing our paging results off, as long as our ordering is deterministic.
Let’s see how to implement cursor paging as a LINQ Query.
var query = db
    .Pictures
    .OrderBy(x => x.Id)
    // will use the index
    .Where(x => x.Id > after)
    .Take(size);
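As a usage sketch, a client can walk the entire collection by feeding the last Id it received back in as the next after value. This hypothetical console client assumes the /pictures/cursor endpoint shown earlier and models only the parts of the response it needs:
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Json;

var client = new HttpClient { BaseAddress = new Uri("https://localhost:5001") };
var after = 0;

while (true)
{
    // Ask for the window that comes after the current cursor.
    var page = await client.GetFromJsonAsync<CursorPage>(
        $"/pictures/cursor?after={after}&size=10");

    if (page is null || page.Pictures.Count == 0)
        break;

    foreach (var picture in page.Pictures)
        Console.WriteLine($"{picture.Id}: {picture.Url}");

    // The last Id in this window becomes the cursor for the next request.
    after = page.Pictures[^1].Id;
}

// Minimal response shapes for this sketch; the property names mirror the sample's JSON.
record PictureDto(int Id, string Url);
record CursorPage(List<PictureDto> Pictures);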
Paging is an expensive operation no matter how we approach it. Ever wonder why Google stops showing result pages after a maximum of 25 pages? Well, there’s a point of diminishing returns in both value and performance.
Comparing Performance
One of the most noticeable drawbacks to paging with page and size, when compared to cursor paging, is the performance impact. Let’s look at two implementations in an ASP.NET Core web application.
Note: We are using ASP.NET Core Minimal APIs for this sample.
using System.Linq;
using CursorPaging;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddDbContext<Database>();

var app = builder.Build();

await using (var scope = app.Services.CreateAsyncScope())
{
    var db = scope.ServiceProvider.GetService<Database>();
    var logger = scope.ServiceProvider.GetService<ILogger<WebApplication>>();
    var result = await Database.SeedPictures(db);
    logger.LogInformation($"Seed operation returned with code {result}");
}

if (app.Environment.IsDevelopment())
{
    app.UseDeveloperExceptionPage();
}

app
    .UseDefaultFiles()
    .UseStaticFiles();

app.MapGet("/pictures/paging", async http =>
{
    var page = http.Request.Query.TryGetValue("page", out var pages)
        ? int.Parse(pages.FirstOrDefault() ?? string.Empty)
        : 1;

    var size = http.Request.Query.TryGetValue("size", out var sizes)
        ? int.Parse(sizes.FirstOrDefault() ?? string.Empty)
        : 10;

    await using var db = http.RequestServices.GetRequiredService<Database>();
    var total = await db.Pictures.CountAsync();

    var query = db
        .Pictures
        .OrderBy(x => x.Id)
        .Skip((page - 1) * size)
        .Take(size);

    var logger = http.RequestServices.GetRequiredService<ILogger<Database>>();
    logger.LogInformation($"Using Paging:\n{query.ToQueryString()}");

    var results = await query.ToListAsync();

    await http.Response.WriteAsJsonAsync(new PagingResult
    {
        Page = page,
        Size = size,
        Pictures = results.ToList(),
        TotalCount = total,
        Sql = query.ToQueryString()
    });
});

app.MapGet("/pictures/cursor", async http =>
{
    var after = http.Request.Query.TryGetValue("after", out var afters)
        ? int.Parse(afters.FirstOrDefault() ?? string.Empty)
        : 0;

    var size = http.Request.Query.TryGetValue("size", out var sizes)
        ? int.Parse(sizes.FirstOrDefault() ?? string.Empty)
        : 10;

    await using var db = http.RequestServices.GetRequiredService<Database>();
    var logger = http.RequestServices.GetRequiredService<ILogger<Database>>();
    var total = await db.Pictures.CountAsync();

    var query = db
        .Pictures
        .OrderBy(x => x.Id)
        // will use the index
        .Where(x => x.Id > after)
        .Take(size);

    logger.LogInformation($"Using Cursor:\n{query.ToQueryString()}");
    var results = await query.ToListAsync();

    await http.Response.WriteAsJsonAsync(new CursorResult
    {
        TotalCount = total,
        Pictures = results,
        Cursor = new CursorResult.CursorItems
        {
            After = results.Select(x => (int?) x.Id).LastOrDefault(),
            Before = results.Select(x => (int?) x.Id).FirstOrDefault()
        },
        Sql = query.ToQueryString()
    });
});

app.Run();
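The listing above references a few supporting types from the CursorPaging project that aren’t shown here: the Picture entity, the PagingResult and CursorResult response shapes, and the Database context with its SeedPictures helper. Here is a minimal sketch of what they might look like, inferred from the properties and columns used above; the connection string and seeding logic are assumptions.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

namespace CursorPaging;

// Entity matching the columns in the generated SQL (Id, Created, Url).
public class Picture
{
    public int Id { get; set; }
    public DateTimeOffset Created { get; set; }
    public string Url { get; set; } = string.Empty;
}

// Response shape for the page/size endpoint.
public class PagingResult
{
    public int Page { get; set; }
    public int Size { get; set; }
    public int TotalCount { get; set; }
    public List<Picture> Pictures { get; set; } = new();
    public string Sql { get; set; } = string.Empty;
}

// Response shape for the cursor endpoint.
public class CursorResult
{
    public int TotalCount { get; set; }
    public List<Picture> Pictures { get; set; } = new();
    public CursorItems Cursor { get; set; } = new();
    public string Sql { get; set; } = string.Empty;

    public class CursorItems
    {
        public int? After { get; set; }
        public int? Before { get; set; }
    }
}

// EF Core context backed by SQLite; connection string and seeding details are assumptions.
public class Database : DbContext
{
    public DbSet<Picture> Pictures => Set<Picture>();

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseSqlite("Data Source=pictures.db");

    public static async Task<int> SeedPictures(Database db)
    {
        await db.Database.EnsureCreatedAsync();
        if (await db.Pictures.AnyAsync())
            return 0;

        // Insert sample rows in batches; the post seeds over a million records.
        for (var i = 0; i < 1_000_000; i += 10_000)
        {
            var batch = Enumerable.Range(i + 1, 10_000).Select(n => new Picture
            {
                Created = DateTimeOffset.UtcNow,
                Url = $"https://example.com/pictures/{n}.jpg"
            });

            await db.Pictures.AddRangeAsync(batch);
            await db.SaveChangesAsync();
        }

        return await db.Pictures.CountAsync();
    }
}
Note that Id becomes an auto-incrementing primary key by EF Core convention, which is exactly what makes it a good cursor.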
We’ll seed a SQLite database with over 1 million rows and attempt to page as far into the collection as possible.
### Paging By Page and Size
GET https://localhost:5001/pictures/paging?page=100000&size=10
Accept: application/json
### Paging By Cursor and Size
GET https://localhost:5001/pictures/cursor?after=999990&size=10
Accept: application/json
So what are our results from this experiment?
For our page/size request, we see a total response time of 225ms:
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Date: Wed, 23 Jun 2021 12:46:27 GMT
Server: Kestrel
Transfer-Encoding: chunked
> 2021-06-23T084627.200.json
Response code: 200 (OK); Time: 225ms; Content length: 1328 bytes
Let’s see how the cursor request fares in a similar situation:
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Date: Wed, 23 Jun 2021 12:46:27 GMT
Server: Kestrel
Transfer-Encoding: chunked
> 2021-06-23T084627-1.200.json
Response code: 200 (OK); Time: 52ms; Content length: 1370 bytes
Wow! 52ms compared to 225ms! That’s a substantial difference while still producing the same results. Where does this performance improvement come from? Let’s compare the SQL that EF Core generates for each query.
-- Using Page Size
.param set @__p_1 10
.param set @__p_0 999990
SELECT "p"."Id", "p"."Created", "p"."Url"
FROM "Pictures" AS "p"
ORDER BY "p"."Id"
LIMIT @__p_1 OFFSET @__p_0
-- Using Cursor
.param set @__after_0 999990
.param set @__p_1 10
SELECT "p"."Id", "p"."Created", "p"."Url"
FROM "Pictures" AS "p"
WHERE "p"."Id" > @__after_0
ORDER BY "p"."Id"
LIMIT @__p_1
We can see that the first query (page/size) uses the OFFSET keyword, which tells our database provider, SQLite, to scan past the offset number of rows before applying the LIMIT. As we can imagine, scanning through millions of rows can become relatively expensive.
The cursor query, on the other hand, uses the Id column in its WHERE clause. Because Id is indexed, SQLite can seek directly to the matching rows instead of scanning everything that comes before them, allowing the query to run far more efficiently.
Conclusion
As we saw in the code and examples above, cursor-based paging becomes more performant as our datasets grow larger. That said, logical paging using page and size has its place in applications. The approach is much simpler to implement, and with smaller datasets that change infrequently, it might be the easier choice. As developers, it’s incumbent on us to choose the best approach for our scenario. Still, with an almost 4x improvement in query performance, it is hard to argue against cursor-based paging.
As always, thanks for reading.