As web developers, it’s important to remind ourselves that we push our work onto the internet for the world to see. Our development team utilizes Windows Azure for many environments, and all of those environments need to be publicly accessible. There are ways to limit accessibility on Windows Azure, but the audience looking at our work varies, so a hard constraint becomes unmanageable for us.
The accessibility of our environments helps stakeholders see iterations of our progress faster and provide feedback. In turn, that feedback goes back into our development, making the deliverable better for everyone.
In practice, we usually have three environments in Windows Azure App Services: Development, Staging, and Production. We want all of these to be visible to people, but we all know people aren’t the only ones viewing the web. The majority of web traffic comes from automated processes that scan your site, and many others, to provide search results. I put a generous amount of effort into a previous blog post on search, and I’m proud of it. If you want to learn how to build a great search experience in your ASP.NET Core application, I highly recommend it.
In this post, I’ll explain why a robots.txt file is essential for your public-facing ASP.NET Core applications. Additionally, you’ll learn how to write or generate a few variations of the file. Finally, you’ll learn how to serve a specific file based on your current hosting environment.
This post is by no means limited to Windows Azure users; the approach works anywhere you host your ASP.NET Core applications.
If you want the code, you can download it from my GitHub repository. It targets ASP.NET Core 3.0 but could be modified to support lower SDK versions.
What Is a Robots.txt File?
As mentioned in the previous section, most of the web’s traffic comes from automated processes. These processes are known as web crawlers and are operated by data collection and search companies like Google, Facebook, and Microsoft.
Wikipedia sums it up perfectly:
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering). –Wikipedia
While algorithms can differ, most crawlers start at your homepage and follow the links on your site in an attempt to find as much content as possible. The exciting part is that you can talk to these crawlers!
If talking to spiders makes your skin crawl, don’t worry; I promise there won’t be any insects involved. Instead, there is a simple plain-text file called robots.txt. This file contains instructions for crawlers, telling them where to look in your site for meaningful content, and even where not to look. As Wikipedia explains:
The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. –Wikipedia
It’s important to note that the instructions in the file are no guarantee that a crawler will obey your wishes. There are bad bots that crawl sites for many other reasons, especially if your site contains valuable information.
Why It Is Important
I’m glad you are asking the question:
Why is it essential to have a robots.txt file for my ASP.NET Core application?
While you may not have much content within your application, you still want individuals to find your primary domain. Pointing crawlers to the pages that matter most to you and your users can make a more significant impact.
In my case, we have multiple domains that can host very similar content. Crawlers may mistake those near-duplicate domains for an attempt to create spam sites, and that kind of duplication can hurt the overall ranking of a site in search engine results.
Even worse, your non-primary sites could rank higher than your primary! Imagine users entering critical information into your development environment, only for it to be lost. Yikes!
Generating Your Own Robots.txt
A robots.txt file has a few components, but the essential parts are who and what:
- Who do you want to scan your site?
- What parts of the site do you want to be indexed?
Let’s take a look at a simple robots.txt file:
User-Agent: *
Allow: /
The first line of the file defines which User-Agent can scan the site. The User-Agent is a string value passed by the client. For example, Google’s bot passes a value of Googlebot, but Google also operates an army of other bots.
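To make that concrete, here is a hypothetical file that gives Google’s main crawler full access while turning every other bot away; the policy is made up purely for illustration:
User-Agent: Googlebot
Allow: /

User-Agent: *
Disallow: /
Crawlers follow the group whose User-Agent line matches them most specifically, so Googlebot uses the first block while everything else falls through to the wildcard.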
The second line of the file tells the bot what it can index. The example informs the bot to start at the root path and work its way through everything it can reach. You can also Disallow specific paths, but don’t assume this takes the place of securing your site. As mentioned before, bots do not have to respect the robots.txt file.
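As another sketch, you can leave the site open while keeping crawlers out of a couple of areas; the /admin/ and /search/ paths below are placeholders, not paths from the sample project:
User-Agent: *
Disallow: /admin/
Disallow: /search/
Anything not explicitly disallowed remains fair game for the crawler.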
I recommend generating a few variations using this online tool. As the tool mentions, you can damage your site’s search results if done incorrectly, so be careful and think through the file you’ll be serving.
The Code
You’ve learned a lot up to this point:
- You know what a crawler is
- You know how to talk to crawlers via a robots.txt file
- You have generated a few variations
In this section, I’ll show you how to write a piece of middleware that takes inspiration from ASP.NET Core’s use of environments. Our goal is to serve a robots.txt file unique to each environment we have.
We want to tell bots to stay away from our development environments while boosting the importance of our production sites.
Robots.txt Per Environment
The first step is to realize that ASP.NET Core provides an environment mechanism. By default, we get Development, Staging, and Production. We can create any environments we like, but let’s start with these.
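As a quick illustration, the current environment surfaces in code through IWebHostEnvironment; the built-in helpers cover the defaults, and IsEnvironment works with any custom name, such as the hypothetical "QA" below:
// Inside Startup.Configure, where env is the injected IWebHostEnvironment.
var name = env.EnvironmentName;      // "Development", "Staging", "Production", or your own
var isDev = env.IsDevelopment();     // helper for the built-in Development environment
var isQa = env.IsEnvironment("QA");  // works for any custom environment name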
Our robots.txt files can match these environments:
- robots.txt
- robots.Production.txt
- robots.Development.txt
The robots.txt file is our fallback, while the other files are specific to their environments. ASP.NET Core reads the current environment from the ASPNETCORE_ENVIRONMENT environment variable, which can be set at runtime or via launch settings.
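As a sketch of what those files might contain (the exact rules are entirely up to you), the Development file could turn every crawler away while the Production file opens the site up:
# robots.Development.txt - keep bots out of non-production sites
User-Agent: *
Disallow: /

# robots.Production.txt - let bots index everything
User-Agent: *
Allow: /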
Robots.txt Middleware Code
The middleware is straightforward. It scans a directory for the files based on our current environment. In my sample, I place the files in the content root of the project, not the web root. You could place them anywhere, but in general, I don’t want the static file middleware to accidentally serve the incorrect file.
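One practical note: because these files live outside wwwroot, double-check that they make it into your publish output. Depending on your project’s default item globs, a hint like the following in the .csproj may be needed (a sketch; the robots*.txt pattern assumes you follow the naming above):
<ItemGroup>
  <None Update="robots*.txt" CopyToPublishDirectory="PreserveNewest" />
</ItemGroup>
With the files deployed alongside the application, the middleware below can find them at runtime.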
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;

public static class RobotsTxtMiddlewareExtensions
{
    public static IApplicationBuilder UseRobotsTxt(
        this IApplicationBuilder builder,
        IWebHostEnvironment env,
        string rootPath = null
    )
    {
        // Branch the pipeline only for /robots.txt requests and hand the middleware
        // the environment name plus the folder to look in.
        return builder.MapWhen(
            ctx => ctx.Request.Path.StartsWithSegments("/robots.txt"),
            b => b.UseMiddleware<RobotsTxtMiddleware>(env.EnvironmentName, rootPath ?? env.ContentRootPath));
    }
}
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public class RobotsTxtMiddleware
{
    // Served when no robots file is found on disk.
    const string Default = "User-Agent: *\nAllow: /";

    private readonly RequestDelegate next;
    private readonly string environmentName;
    private readonly string rootPath;

    public RobotsTxtMiddleware(
        RequestDelegate next,
        string environmentName,
        string rootPath
    )
    {
        this.next = next;
        this.environmentName = environmentName;
        this.rootPath = rootPath;
    }

    public async Task InvokeAsync(HttpContext context)
    {
        if (context.Request.Path.StartsWithSegments("/robots.txt"))
        {
            var generalRobotsTxt = Path.Combine(rootPath, "robots.txt");
            var environmentRobotsTxt = Path.Combine(rootPath, $"robots.{environmentName}.txt");
            string output;

            // Try the environment-specific file first
            if (File.Exists(environmentRobotsTxt))
            {
                output = await File.ReadAllTextAsync(environmentRobotsTxt);
            }
            // then fall back to robots.txt
            else if (File.Exists(generalRobotsTxt))
            {
                output = await File.ReadAllTextAsync(generalRobotsTxt);
            }
            // and finally to a general default
            else
            {
                output = Default;
            }

            context.Response.ContentType = "text/plain";
            await context.Response.WriteAsync(output);
        }
        else
        {
            await next(context);
        }
    }
}
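If you want a quick way to verify the selection logic, here is a minimal sketch of a test that drives the middleware directly with a DefaultHttpContext; it assumes xUnit and writes a throwaway robots.Development.txt to a temp folder:
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Xunit;

public class RobotsTxtMiddlewareTests
{
    [Fact]
    public async Task Serves_The_Environment_Specific_File()
    {
        // Arrange: a temp content root containing only the Development file.
        var root = Directory.CreateDirectory(
            Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())).FullName;
        await File.WriteAllTextAsync(
            Path.Combine(root, "robots.Development.txt"), "User-Agent: *\nDisallow: /");

        var middleware = new RobotsTxtMiddleware(_ => Task.CompletedTask, "Development", root);

        var context = new DefaultHttpContext();
        context.Request.Path = "/robots.txt";
        context.Response.Body = new MemoryStream();

        // Act: run the middleware against the fake request.
        await middleware.InvokeAsync(context);

        // Assert: the Development rules came back.
        context.Response.Body.Position = 0;
        var body = await new StreamReader(context.Response.Body).ReadToEndAsync();
        Assert.Contains("Disallow: /", body);
    }
}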
Calling the Robots.txt Middleware in Startup.cs
Like any other middleware in an ASP.NET Core application, we need to register it in our middleware pipeline.
public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
    if (env.IsDevelopment())
    {
        app.UseDeveloperExceptionPage();
    }
    else
    {
        app.UseExceptionHandler("/Error");
        app.UseHsts();
    }

    app.UseHttpsRedirection();

    // Register our middleware before the static file middleware
    app.UseRobotsTxt(env);

    app.UseStaticFiles();
    app.UseRouting();
    app.UseAuthorization();
    app.UseEndpoints(endpoints => { endpoints.MapRazorPages(); });
}
The registration extension takes IWebHostEnvironment as a parameter. Passing in the IWebHostEnvironment allows the middleware to read the environment name and to fall back to the content root path if a root path is not provided at registration.
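If you keep the files somewhere other than the content root, the optional rootPath parameter points the middleware at that folder; the "seo" directory below is a made-up example:
// Hypothetical: robots files live in an "seo" folder under the content root.
app.UseRobotsTxt(env, Path.Combine(env.ContentRootPath, "seo"));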
Serving robots.txt
By navigating to /robots.txt, we see the file served for our current environment.
Hooray! We did it! By setting a breakpoint, we can see that we hit the middleware for the environment we expect.
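As a rough local check (a sketch; the port comes from your launch profile and may differ), you can switch the environment and request the file from a second terminal:
# Start the site with a specific environment (PowerShell: $env:ASPNETCORE_ENVIRONMENT = "Staging")
ASPNETCORE_ENVIRONMENT=Staging dotnet run

# From another terminal, inspect what crawlers would see
curl http://localhost:5000/robots.txt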
To download the project, go to my GitHub repository page. Note, you’ll need the .NET Core 3.0 SDK.
Conclusion
While the code may seem trivial, serving the correct robots.txt file can have an enormous impact on the success of your site. I cannot recommend enough that you think about who sees your content, including bots. It’s also worth remembering that most visitors find your site through a search, so in reality, your most important user may be a bot.