I recently saw a tweet from Chris Heilmann, Principal Program Manager for Browser Tools at Microsoft. In it, he gave a quick tip about parsing HTML content using functionality built into the browser.
The tweet made me think about the advantages modern browsers have over a home-rolled solution when parsing HTML content, especially when working in .NET. The document object model (DOM) is essential to client-side web development and, for all intents and purposes, is confined to that domain.
What if we could leverage the power of the DOM in .NET and utilize query selector functionality to parse and retrieve information from HTML content?
We could spin up a browser and proxy messages across, but it’s challenging to do in .NET; trust me, I tried. But we can do something very similar using a .NET Foundation library called AngleSharp.
AngleSharp is a .NET library that gives you the ability to parse angle-bracket-based hypertext such as HTML, SVG, and MathML. The library also supports XML without validation, and it can parse CSS. AngleSharp includes a parser built upon the official W3C specification, which produces a portable HTML5 DOM representation of the given source code and aims for compatibility with the results you would get in evergreen browsers. Standard DOM features such as querySelector and querySelectorAll also work for tree traversal.
AngleSharp also runs on multiple platforms, including .NET Core, .NET Framework, Xamarin, UWP, macOS, and more.
To get started, let’s install the AngleSharp.Js package in a simple Console application, which will include the core AngleSharp packages.
    > dotnet new console -o Sample
    > cd Sample
    > dotnet add package AngleSharp.Js
Next, we’ll need some HTML content. For this sample, let’s load a simple HTML document.
    const string basic = @"
    <!DOCTYPE html>
    <html>
    <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    </body>
    </html>";
After this, we’ll need to create a BrowsingContext. This class will allow us to load our HTML into a C# object that we can inspect using query selectors.
    var config = Configuration.Default.WithJs();
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(req => req.Content(basic));
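If a page doesn't need script execution, the context can likely be built from Configuration.Default alone, without the WithJs() call. The following is a minimal sketch under that assumption; the html string and its contents are made up for illustration and are not part of the walkthrough.

```csharp
using System;
using AngleSharp;

// Hypothetical sample markup, used only for this sketch.
const string html =
    "<html><head><title>Sample</title></head><body><p>Hello</p></body></html>";

// Configuration.Default on its own: no JavaScript engine is attached.
var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(html));

// The loaded document exposes DOM-style members to C#.
var title = document.Title;
var paragraph = document.QuerySelector("p")?.TextContent;

Console.WriteLine(title);     // Sample
Console.WriteLine(paragraph); // Hello
```

Keeping WithJs() out of the configuration means only the core AngleSharp package is exercised, which may be enough when the goal is purely to read markup.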
Our first attempt to read the information from our loaded HTML content will be made using the built-in C# query selector interface. Let’s find all the h1 elements and print the inner HTML of the first one.
    var heading = document
        .QuerySelectorAll("h1")
        .Select(x => x.InnerHtml)
        .First();

    Console.WriteLine(heading);
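The selector strings are ordinary CSS selectors, so classes and other selector forms should work as they do in a browser. Here is a small sketch of that idea; the list markup and the "done" class are hypothetical and exist only for this example.

```csharp
using System;
using System.Linq;
using AngleSharp;

// Hypothetical markup for illustration only.
const string html = @"
<ul>
  <li class=""done"">Write</li>
  <li>Edit</li>
  <li class=""done"">Publish</li>
</ul>";

var context = BrowsingContext.New(Configuration.Default);
var document = await context.OpenAsync(req => req.Content(html));

// Class selector: only the <li> elements marked as done.
var done = document
    .QuerySelectorAll("li.done")
    .Select(li => li.TextContent)
    .ToList();

Console.WriteLine(string.Join(", ", done)); // Write, Publish

// QuerySelector returns the first match, or null when nothing matches.
var missing = document.QuerySelector("table");
Console.WriteLine(missing is null); // True
```

Because QuerySelector can return null, a null check (or the ?. operator) is worth having before reading properties from its result.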
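Next, let’s run the same query from JavaScript by calling ExecuteScript on the document. Note that querySelectorAll returns a list of nodes, which has no innerHTML property of its own, so the script uses querySelector to get the first matching element.

    var script = document.ExecuteScript("document.querySelector('h1').innerHTML");

    Console.WriteLine(script);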
As expected, we get the same result as we did before. Let’s see the entire solution (which uses top-level statements).
    using System;
    using System.Linq;
    using AngleSharp;
    using AngleSharp.Js;

    const string basic = @"
    <!DOCTYPE html>
    <html>
    <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    </body>
    </html>";

    // Enable JavaScript support and load the HTML into a document.
    var config = Configuration.Default.WithJs();
    var context = BrowsingContext.New(config);
    var document = await context.OpenAsync(req => req.Content(basic));

    // Query the DOM from C#.
    var heading = document
        .QuerySelectorAll("h1")
        .Select(x => x.InnerHtml)
        .First();

    Console.WriteLine(heading);

    // Query the DOM from JavaScript; querySelector returns the first match.
    var script = document.ExecuteScript("document.querySelector('h1').innerHTML");

    Console.WriteLine(script);
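Scripts are not limited to reading values. As a hedged sketch, the snippet below has a script change the document title and then reads the title back from C#. It assumes the JavaScript engine exposes the document.title setter the way a browser does; the html string is hypothetical.

```csharp
using System;
using AngleSharp;
using AngleSharp.Js;

// Hypothetical sample markup, used only for this sketch.
const string html =
    "<html><head><title>Before</title></head><body></body></html>";

var config = Configuration.Default.WithJs();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(req => req.Content(html));

// The script changes the title through the JavaScript DOM API...
document.ExecuteScript("document.title = 'After';");

// ...and the change should be visible from the C# side.
var title = document.Title;
Console.WriteLine(title);
```

Since both sides work against the same document object, this kind of round trip can be handy for pages that build part of their content with script.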
I hope you found this post helpful in your journey to parse HTML, and I’d like to thank Maarten Balliauw for sharing this library with me.