I’ve grown fond of SQLite during my time as a developer advocate. The file-based database with native drivers for every operating system allows me to ship self-contained data-driven sample apps to developers worldwide. Developers can clone a demo and run it immediately without the need for additional dependencies like a database server, Docker container, or remote hosting environment. It’s excellent, and I’m sure my readers appreciate it.
Typically, when writing demos, I need to seed an SQLite instance with enough data to make my point. In my previous post, I needed to show the performance impact of paging deep into a large dataset using EF Core. While writing that post, I learned the fastest way to seed an SQLite database using EF Core, and how to work around EF Core with raw SQL when needed.
The Problem
When working with a database server, we usually have bulk import mechanisms we can leverage. For example, with Microsoft SQL Server, folks may be familiar with SqlBulkCopy. Well, SQLite doesn’t have any such feature, and it doesn’t need one. We’re writing to a file-based database, so we’re only limited by our development machine’s resources: memory, disk write speed, and the efficiency of our code.
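For contrast, here’s roughly what a SqlBulkCopy load looks like on SQL Server. This is only a sketch for comparison; the table, columns, and connection string are hypothetical and not part of this demo.
// Hypothetical SQL Server bulk load, for contrast only.
var table = new DataTable();
table.Columns.Add("Id", typeof(int));
table.Columns.Add("Url", typeof(string));
// ... add rows to the DataTable ...

using var connection = new SqlConnection("Server=.;Database=Demo;Trusted_Connection=True;");
await connection.OpenAsync();
using var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "Pictures" };
await bulkCopy.WriteToServerAsync(table);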
We’ll be writing over 1,000,000 Picture records to our database for this demo.
In this example code, we’ll be using NuGet packages like Entity Framework Core, Spectre.Console, and Bogus.
public class Picture
{
    public int Id { get; set; }
    public string Url { get; set; }
    public DateTimeOffset Created { get; set; } = DateTimeOffset.Now;
}
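The seeding methods below take a Database parameter, which is the demo’s EF Core DbContext. If you’re following along without the sample project, a minimal sketch might look like this; the connection string and configuration here are my assumptions, and the GitHub sample has the real thing.
public class Database : DbContext
{
    // assumed connection string; the sample project defines its own
    public const string ConnectionString = "Data Source=pictures.db";

    public DbSet<Picture> Pictures => Set<Picture>();

    protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
        => optionsBuilder.UseSqlite(ConnectionString);
}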
Let’s see how not to do it!
Using Entity Framework Core (Don’t)
If we’re using EF Core, we may be tempted to use our DbContext directly to store values in our database. While it is certainly possible, it will be very slow. Let’s take a look at the code. Again, don’t do this unless you want to waste your time.
public static async Task<SeedOperationResult> SeedPictures(Database database, string label = "normal")
{
    try
    {
        await database.Database.MigrateAsync();

        // already pre-loaded
        if (await database.Pictures.AnyAsync())
            return SeedOperationResult.Skip;

        // hold onto your butts
        var data = new Faker<Picture>()
            .RuleFor(p => p.Url, f => f.Image.PicsumUrl())
            .GenerateForever();

        await AnsiConsole
            .Progress()
            .AutoClear(false)
            .StartAsync(async ctx =>
            {
                const int max = 1_000_000;
                var task = ctx.AddTask($"Loading pictures - {label}", true, max);
                task.StartTask();

                while (task.Value < task.MaxValue)
                {
                    const int amount = 10_000;
                    var page = data.Take(amount);
                    await database.Pictures.AddRangeAsync(page);
                    await database.SaveChangesAsync();
                    database.ChangeTracker.Clear();
                    task.Increment(amount);
                }

                task.StopTask();
            });

        return SeedOperationResult.Seeded;
    }
    catch (Exception e)
    {
        AnsiConsole.WriteException(e);
        return SeedOperationResult.Error;
    }
}
This code takes about 180 seconds to write 1,000,000 records. In my scenario, that was too long to make users wait. However, we can cut that time roughly in half by disabling the automatic detection of changes on our entities.
database.ChangeTracker.AutoDetectChangesEnabled = false;
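If the context is reused afterwards, it’s worth turning change detection back on when you’re done. Here’s a minimal sketch of how I’d wrap the insert loop above:
database.ChangeTracker.AutoDetectChangesEnabled = false;
try
{
    // ... the AddRangeAsync / SaveChangesAsync loop from above ...
}
finally
{
    // restore the default so later operations behave as expected
    database.ChangeTracker.AutoDetectChangesEnabled = true;
}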
Using Bulk SQL Scripts
The other approach is to build bulk insert scripts in code and then execute them against our underlying SQLite database. So let’s see how to do just that!
public static async Task<SeedOperationResult> SeedPicturesWithSql(Database database, string label = "normal")
{
    try
    {
        await database.Database.MigrateAsync();

        // already pre-loaded
        if (await database.Pictures.AnyAsync())
            return SeedOperationResult.Skip;

        // hold onto your butts
        var data = new Faker<Picture>()
            .RuleFor(p => p.Url, f => f.Image.PicsumUrl())
            .GenerateForever();

        await AnsiConsole
            .Progress()
            .AutoClear(false)
            .StartAsync(async ctx =>
            {
                const int max = 1_000_000;
                var task = ctx.AddTask($"Loading pictures - {label}", true, max);
                task.StartTask();

                var id = 0;
                await using var sqlConnection = new SqliteConnection(CursorPaging.Database.ConnectionString);
                await sqlConnection.OpenAsync();

                while (task.Value < task.MaxValue)
                {
                    const int amount = 250_000;
                    var values = new StringBuilder();
                    foreach (var picture in data.Take(amount))
                    {
                        values.AppendFormat("\n({0}, '{1}', '{2:yyyy-MM-dd HH:mm:ss}'),",
                            ++id,
                            picture.Url,
                            picture.Created
                        );
                    }

                    // remove the trailing comma
                    values.Remove(values.Length - 1, 1);

                    var command = sqlConnection.CreateCommand();
                    command.CommandText = $"insert into Pictures (Id, Url, Created) values {values};";
                    command.CommandType = CommandType.Text;
                    await command.ExecuteNonQueryAsync();
                    task.Increment(amount);
                }

                task.StopTask();
            });

        return SeedOperationResult.Seeded;
    }
    catch (Exception e)
    {
        AnsiConsole.WriteException(e);
        return SeedOperationResult.Error;
    }
}
This code executes in about 7 seconds for over 1,000,000 records. Wowza! Note that naive string building only works here because the generated URLs contain no single quotes; with arbitrary data, you’d need to escape values or use parameters. We can get slightly faster still, though we’re at the point of diminishing returns. Let’s see it anyway.
Using Command and Parameters
I found this approach on the Microsoft documentation site. It claims that the SQL statement generated by the command is cached, which gives us increased performance. In my testing, this approach was sometimes faster than generating SQL statements, and sometimes not. That said, it is more memory efficient than building large strings.
public static async Task<SeedOperationResult> SeedPicturesWithCommand(Database database, string label = "normal")
{
    try
    {
        await database.Database.MigrateAsync();

        // already pre-loaded
        if (await database.Pictures.AnyAsync())
            return SeedOperationResult.Skip;

        // hold onto your butts
        var data = new Faker<Picture>()
            .RuleFor(p => p.Url, f => f.Image.PicsumUrl())
            .GenerateForever();

        await AnsiConsole
            .Progress()
            .AutoClear(false)
            .StartAsync(async ctx =>
            {
                const int max = 1_000_000;
                var task = ctx.AddTask($"Loading pictures - {label}", true, max);
                task.StartTask();

                var id = 0;
                while (task.Value < task.MaxValue)
                {
                    const int amount = 250_000;
                    using (var transaction = await database.Database.BeginTransactionAsync())
                    {
                        var command = database.Database.GetDbConnection().CreateCommand();
                        command.CommandText =
                            "insert into Pictures (Id, Url, Created) values ($id, $url, $created);";

                        var parameterId = command.CreateParameter();
                        parameterId.ParameterName = "$id";
                        command.Parameters.Add(parameterId);

                        var parameterUrl = command.CreateParameter();
                        parameterUrl.ParameterName = "$url";
                        command.Parameters.Add(parameterUrl);

                        var parameterCreated = command.CreateParameter();
                        parameterCreated.ParameterName = "$created";
                        command.Parameters.Add(parameterCreated);

                        for (var i = 0; i < amount; i++)
                        {
                            var picture = data.First();
                            parameterId.Value = ++id;
                            parameterUrl.Value = picture.Url;
                            parameterCreated.Value = picture.Created.ToString("yyyy-MM-dd HH:mm:ss");
                            await command.ExecuteNonQueryAsync();
                        }

                        await transaction.CommitAsync();
                    }
                    task.Increment(amount);
                }

                task.StopTask();
            });

        return SeedOperationResult.Seeded;
    }
    catch (Exception e)
    {
        AnsiConsole.WriteException(e);
        return SeedOperationResult.Error;
    }
}
This approach takes about 6 seconds to insert 1,000,000 records.
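To recap the rough timings on my machine for 1,000,000 records: the straightforward EF Core approach took about 180 seconds (roughly half that with AutoDetectChangesEnabled set to false), the bulk SQL scripts took about 7 seconds, and the reused parameterized command took about 6 seconds.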
Conclusion
If you want to try out any of these approaches for yourself, well, you’re in luck! I’ve uploaded the sample project to GitHub, where you can play around with all three methods.
If you happen to have an even faster way, please let me know on Twitter. My username is @buhakmeh.