DOM Parsing, Query Selectors, and JavaScript with AngleSharp

I recently saw a tweet from Chris Heilmann, Principal Program Manager for Browser Tools at Microsoft. In it, he gave a quick tip about parsing HTML content with the built-in DOMParser JavaScript class rather than resorting to regular expressions and the inevitable heartache that edge cases and malformed HTML can bring.

The tweet made me think about the advantages modern browsers have over a custom home-grown solution when parsing HTML content, especially when working in .NET. The document object model (DOM) is essential to client-side web development and is, for all intents and purposes, confined to that domain.

What if we could leverage the power of the DOM in .NET and utilize query selector functionality to parse and retrieve information from HTML content?

We could spin up a browser and proxy messages across, but it’s challenging to do in .NET; trust me, I tried. But we can do something very similar using a .NET Foundation library called AngleSharp.

AngleSharp is a .NET library that gives you the ability to parse angle-bracket-based hypertext formats such as HTML, SVG, and MathML. The library also supports XML without validation. AngleSharp can parse CSS as well, through the companion AngleSharp.Css package. Its HTML parser is built upon the official W3C specification, so it produces a portable HTML5 DOM representation of the given source code that is compatible with the results in evergreen browsers. Standard DOM features such as querySelector and querySelectorAll are also available for tree traversal.

AngleSharp also runs on multiple platforms, including .NET Core, .NET Framework, Xamarin, UWP, macOS, and more.

In the use case outlined above, we’ll use AngleSharp to load some HTML content and read an element’s content in two ways: first with query selector syntax, and then with JavaScript.

To get started, let’s create a simple console application and install the AngleSharp.Js package, which brings in the core AngleSharp package as a dependency.

> dotnet new console -o Sample 
> cd Sample
> dotnet add package AngleSharp.Js

Next, we’ll need some HTML content. For this sample, let’s load a simple HTML document.

const string basic = @"
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>";

After this, we’ll need to create a BrowsingContext. This class will allow us to load our HTML into a C# object that we can inspect using query selectors.

var config = Configuration.Default.WithJs();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(req => req.Content(basic));

Our first attempt to read information from the loaded HTML content uses the built-in C# query selector interface. Let’s find all the h1 elements and take the inner HTML of the first one.

var heading = document
        .QuerySelectorAll("h1")
        .Select(x => x.InnerHtml)
        .First();

Console.WriteLine(heading);

Executing our application, we get the output of My First Heading. Now let’s use JavaScript to retrieve the same value.

var script = 
        document.ExecuteScript("document.querySelectorAll('h1')[0].innerHTML");

Console.WriteLine(script);

As expected, we get the same result as before. Let’s see the entire solution, which uses top-level statements and therefore needs C# 9 or later.

using System;
using System.Linq;
using AngleSharp;
using AngleSharp.Js;

const string basic = @"
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>";

var config = Configuration.Default.WithJs();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(req => req.Content(basic));

var heading = document
        .QuerySelectorAll("h1")
        .Select(x => x.InnerHtml)
        .First();

Console.WriteLine(heading);

var script = 
        document.ExecuteScript("document.querySelectorAll('h1')[0].innerHTML");

Console.WriteLine(script);
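
One thing to keep in mind with the query selector approach: First() throws when no element matches. When only a single element is expected, QuerySelector may be a gentler option, since it returns the first match or null. The following is a minimal sketch under the same setup as above; the fallback strings and the variable names headingText and missingText are my own illustrative choices, and WithJs() is left out because no script is executed.

```csharp
using System;
using AngleSharp;

const string basic = @"
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>";

// No JavaScript engine is needed here, so the default configuration is enough.
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(basic));

// QuerySelector returns the first matching element, or null when nothing matches.
// TextContent gives the element's text without any nested markup.
var headingText = document.QuerySelector("h1")?.TextContent ?? "(no h1 found)";

// There is no h2 in the document, so the fallback value is used instead of throwing.
var missingText = document.QuerySelector("h2")?.TextContent ?? "(no h2 found)";

Console.WriteLine(headingText);
Console.WriteLine(missingText);
```

The null-conditional operator keeps a missing element from turning into an exception, which is handy when the markup comes from a source we don’t control.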

Conclusion

AngleSharp is a W3C-compliant parser that gives C# developers access to the same tools that client-side JavaScript developers enjoy. In this post, we saw two approaches to slicing and dicing HTML content to get the data we need: the C# query selector interface and JavaScript execution. Regular expressions can solve many problems, but parsing HTML (especially mangled HTML) with them can be a severe roadblock to achieving our goals. In my opinion, this library is a better approach for folks looking to solve markup-based problems.
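
To make the point about mangled HTML a little more concrete, here is a small sketch that feeds deliberately broken markup to AngleSharp’s HtmlParser class. The input string is made up for illustration. Under the HTML parsing rules, an opening p tag implicitly closes a paragraph that is still open, so I’d expect the parser to recover two separate paragraphs here, much as a browser would.

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

// Deliberately malformed: neither <p> is closed, and </body> and </html> are missing.
const string mangled = "<html><body><p>First paragraph<p>Second paragraph";

// HtmlParser parses a string directly, without a browsing context.
var parser = new HtmlParser();
var document = parser.ParseDocument(mangled);

// Collect the text of every paragraph the parser recovered.
var paragraphs = document
        .QuerySelectorAll("p")
        .Select(p => p.TextContent)
        .ToList();

Console.WriteLine(paragraphs.Count);

foreach (var text in paragraphs)
{
    Console.WriteLine(text);
}
```

A regular expression that looks for matching opening and closing p tags would find nothing in this input, while the parser still produces a usable tree.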

I hope you found this post helpful in your journey to parse HTML, and I’d like to thank Maarten Balliauw for sharing this library with me.