As human beings, we have an innate ability to pick out valuable information from less-than-ideal situations. Our brain can process the things we need while ignoring the nonsense. Unfortunately, web mainstays like HTML and HTTP don’t have those same capabilities, and in fact, don’t appreciate surprises.

As programmers, we need to take additional steps when passing data to an HTML page to make sure it doesn’t break the HTML conventions. Additionally, when sending a request to our server, we may need to adhere to HTTP rules to ensure the server does not reject our web request.

This post will look at the WebUtility class and break down how its methods operate on our input to encode/decode potentially breaking data.

What is Encoding/Decoding?

Encoding and decoding are the opposite sides of the same process: taking input and sanitizing it for its intended recipient.

The method of encoding takes a human-readable value and converts characters that may interfere with the recipient. In the case of HTML, those characters might be angle brackets (< and >). Forgetting to encode user input is a common cause of cross-site scripting (XSS) vulnerabilities: a user enters data into the application, and the site renders it without first encoding it. We’ll get into more details in the HTML encoding part of this post.

The process of decoding reverses the encoding process: it takes escaped characters and reverts them to their original values. Decoding can be useful in scenarios where the initial value needs to be verified by a human set of eyes but is ultimately sent to a non-human recipient. We may have seen encoded URLs with values like %20, which indicates a space in our URL. A human may find it difficult to read Hello%2C+World, but its original value, Hello, World, reads naturally.
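As a quick illustration of that round trip, the sketch below runs the Hello, World value through WebUtility.UrlEncode and WebUtility.UrlDecode. The variable names are only for illustration.

```csharp
using System;
using System.Net;

string original = "Hello, World";

// Encode: the comma becomes %2C and the space becomes +.
var encoded = WebUtility.UrlEncode(original);
Console.WriteLine(encoded); // Hello%2C+World

// Decode: reverses the process and recovers the original text.
var decoded = WebUtility.UrlDecode(encoded);
Console.WriteLine(decoded); // Hello, World
```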

Let’s get into HTML and URL encoding specifics and how the two differ from each other.

The Utility Classes - WebUtility and HttpUtility

The .NET Runtime has a WebUtility class that is the safest choice in .NET for performing core encoding tasks targeted at the web. The type is part of the System.Net namespace and can be found in the System.Runtime assembly. There is also the HttpUtility class, found in the System.Web namespace, that can perform additional encoding/decoding on HTML attributes, query strings, and JavaScript, as well as operations that accept an Encoding, such as UTF-8 or UTF-16 (which .NET calls Unicode).
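To give a feel for those extra HttpUtility operations, here is a minimal sketch. It assumes a modern .NET target where System.Web.HttpUtility is available; the sample strings are made up for the example.

```csharp
using System;
using System.Collections.Specialized;
using System.Web;

// ParseQueryString splits a query string and URL-decodes each value.
NameValueCollection query = HttpUtility.ParseQueryString("name=Hello%2C+World&lang=C%23");
Console.WriteLine(query["name"]); // Hello, World
Console.WriteLine(query["lang"]); // C#

// JavaScriptStringEncode escapes characters that would end a
// JavaScript string literal early, such as double quotes.
string js = HttpUtility.JavaScriptStringEncode("He said \"hi\"");
Console.WriteLine(js); // He said \"hi\"
```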

For the sake of this post, we’ll be sticking to the WebUtility implementations.

URL Encoding

Let’s take a look at the UrlEncode method found in the WebUtility type. In this method, we’ll see which characters are considered safe and which ones need to be changed to respect our URL recipient.

public static string? UrlEncode(string? value)
{
    if (string.IsNullOrEmpty(value))
        return value;

    int safeCount = 0;
    int spaceCount = 0;
    for (int i = 0; i < value.Length; i++)
    {
        char ch = value[i];
        if (IsUrlSafeChar(ch))
        {
            safeCount++;
        }
        else if (ch == ' ')
        {
            spaceCount++;
        }
    }

    int unexpandedCount = safeCount + spaceCount;
    if (unexpandedCount == value.Length)
    {
        if (spaceCount != 0)
        {
            // Only spaces to encode
            return value.Replace(' ', '+');
        }

        // Nothing to expand
        return value;
    }

    int byteCount = Encoding.UTF8.GetByteCount(value);
    int unsafeByteCount = byteCount - unexpandedCount;
    int byteIndex = unsafeByteCount * 2;

    // Instead of allocating one array of length `byteCount` to store
    // the UTF-8 encoded bytes, and then a second array of length
    // `3 * byteCount - 2 * unexpandedCount`
    // to store the URL-encoded UTF-8 bytes, we allocate a single array of
    // the latter and encode the data in place, saving the first allocation.
    // We store the UTF-8 bytes to the end of this array, and then URL encode to the
    // beginning of the array.
    byte[] newBytes = new byte[byteCount + byteIndex];
    Encoding.UTF8.GetBytes(value, 0, value.Length, newBytes, byteIndex);

    GetEncodedBytes(newBytes, byteIndex, byteCount, newBytes);
    return Encoding.UTF8.GetString(newBytes);
}
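To make the buffer arithmetic in the method's comment concrete, the sketch below repeats the same counting for two sample inputs and compares the result with the length of the encoded string. CountUnexpanded and EncodedLength are hypothetical helpers written for this example; they are not part of WebUtility.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper: counts the chars that UrlEncode keeps as a single byte
// (safe characters, plus spaces, which become '+').
static int CountUnexpanded(string value)
{
    int count = 0;
    foreach (char ch in value)
    {
        bool alphanumeric = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9');
        if (alphanumeric || "-_.!*() ".IndexOf(ch) >= 0)
            count++;
    }
    return count;
}

// Mirrors the sizing logic: every unsafe UTF-8 byte grows from 1 byte to 3 ("%XX").
static int EncodedLength(string value)
{
    int byteCount = Encoding.UTF8.GetByteCount(value);
    int unsafeByteCount = byteCount - CountUnexpanded(value);
    int byteIndex = unsafeByteCount * 2; // extra room needed in front of the UTF-8 bytes
    return byteCount + byteIndex;        // equals 3 * byteCount - 2 * unexpandedCount
}

// "a b,c": 5 bytes, 4 unexpanded, 1 unsafe byte, so 5 + 2 = 7
Console.WriteLine(EncodedLength("a b,c"));
Console.WriteLine(WebUtility.UrlEncode("a b,c")); // a+b%2Cc

// "\u00e9" (e with an acute accent): 2 UTF-8 bytes, both unsafe, so 2 + 4 = 6
Console.WriteLine(EncodedLength("\u00e9"));
Console.WriteLine(WebUtility.UrlEncode("\u00e9")); // %C3%A9
```

Because the final array is exactly as long as the encoded output, the UTF-8 bytes can sit at the end of the array while the encoded bytes are written from the start.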

The most notable part of the method is the call to IsUrlSafeChar. What are the values that we can safely add to a URL? Looking at the method, we can see an unoptimized implementation.

if (ch >= 'a' && ch <= 'z' || ch >= 'A' && ch <= 'Z' || ch >= '0' && ch <= '9')
    return true;

switch (ch)
{
    case '-':
    case '_':
    case '.':
    case '!':
    case '*':
    case '(':
    case ')':
        return true;
}

return false;

It turns out that everything except ASCII letters, digits, and the characters -, _, ., !, *, (, and ) is considered unsafe. Looking further down the original UrlEncode method, we can see what happens to all those unwanted values.

  1. The method converts space ( ) values into + symbols.
  2. The method converts the value into UTF-8 bytes using Encoding.UTF8.GetBytes, and GetEncodedBytes replaces each remaining unsafe byte with a percent sign followed by two hexadecimal digits (for example, %2C for a comma). Finally, Encoding.UTF8.GetString turns the encoded bytes back into a string.
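The steps above can be sketched in a few lines. This is a simplified illustration, not the runtime's optimized implementation; IsSafeByte and PercentEncode are hypothetical names introduced for the example, and the output is compared against WebUtility.UrlEncode.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper mirroring the safe-character rule shown earlier.
static bool IsSafeByte(byte b) =>
    (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9') ||
    "-_.!*()".IndexOf((char)b) >= 0;

// Simplified sketch of URL encoding over the UTF-8 bytes of the input.
static string PercentEncode(string value)
{
    var sb = new StringBuilder();
    foreach (byte b in Encoding.UTF8.GetBytes(value))
    {
        if (IsSafeByte(b))
            sb.Append((char)b);                      // safe: copy as-is
        else if (b == (byte)' ')
            sb.Append('+');                          // space: becomes '+'
        else
            sb.Append('%').Append(b.ToString("X2")); // anything else: %XX in uppercase hex
    }
    return sb.ToString();
}

string input = "caf\u00e9 & cr\u00e8me~";
Console.WriteLine(PercentEncode(input));        // caf%C3%A9+%26+cr%C3%A8me%7E
Console.WriteLine(WebUtility.UrlEncode(input)); // same output
```

Note that the tilde (~) is not on the safe list, so it is encoded as %7E, and each accented letter expands to two percent-encoded bytes because it takes two bytes in UTF-8.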

Let’s take a look at HTML encoding now and see how it differs from URL encoding.

HTML Encoding

In the same WebUtility type, we’ll find the HtmlEncode method. We use this method when we want to display a value within existing HTML without letting that value damage the HTML structure. Let’s see how .NET implements this method.

public static void HtmlEncode(string? value, TextWriter output)
{
    if (output == null)
    {
        throw new ArgumentNullException(nameof(output));
    }
    if (string.IsNullOrEmpty(value))
    {
        output.Write(value);
        return;
    }

    ReadOnlySpan<char> valueSpan = value.AsSpan();

    // Don't create ValueStringBuilder if we don't have anything to encode
    int index = IndexOfHtmlEncodingChars(valueSpan);
    if (index == -1)
    {
        output.Write(value);
        return;
    }

    // For small inputs we allocate on the stack. In most cases a buffer three
    // times larger the original string should be sufficient as usually not all
    // characters need to be encoded.
    // For larger string we rent the input string's length plus a fixed
    // conservative amount of chars from the ArrayPool.
    ValueStringBuilder sb = value.Length < 80 ?
        new ValueStringBuilder(stackalloc char[256]) :
        new ValueStringBuilder(value.Length + 200);

    sb.Append(valueSpan.Slice(0, index));
    HtmlEncode(valueSpan.Slice(index), ref sb);

    output.Write(sb.AsSpan());
    sb.Dispose();
}

We can find the interesting part of the HtmlEncode method in its call to IndexOfHtmlEncodingChars.

private static int IndexOfHtmlEncodingChars(ReadOnlySpan<char> input)
{
    for (int i = 0; i < input.Length; i++)
    {
        char ch = input[i];
        if (ch <= '>')
        {
            switch (ch)
            {
                case '<':
                case '>':
                case '"':
                case '\'':
                case '&':
                    return i;
            }
        }
#if ENTITY_ENCODE_HIGH_ASCII_CHARS
        else if (ch >= 160 && ch < 256)
        {
            return i;
        }
#endif // ENTITY_ENCODE_HIGH_ASCII_CHARS
        else if (char.IsSurrogate(ch))
        {
            return i;
        }
    }

    return -1;
}

The HTML encoding process makes sure that the characters <, >, ", ', and & are flagged to be replaced by their HTML-friendly counterparts. Surrogate characters are flagged as well so that they can be written as numeric character references. What values replace these characters? We can find that in another HtmlEncode method.

switch (ch)
{
    case '<':
        output.Append("&lt;");
        break;
    case '>':
        output.Append("&gt;");
        break;
    case '"':
        output.Append("&quot;");
        break;
    case '\'':
        output.Append("&#39;");
        break;
    case '&':
        output.Append("&amp;");
        break;
    default:
        output.Append(ch);
        break;
}

Each replacement is an HTML character reference that starts with an ampersand (&) and ends with a semicolon (;). Cool!
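Here is a short sketch of those replacements in action, using both the string-returning overload and the TextWriter overload of WebUtility.HtmlEncode. The input strings are invented for the example.

```csharp
using System;
using System.IO;
using System.Net;

// Untrusted input that would run as script if written into a page unencoded.
string userInput = "<script>alert('hi')</script>";

var encoded = WebUtility.HtmlEncode(userInput);
Console.WriteLine(encoded);
// &lt;script&gt;alert(&#39;hi&#39;)&lt;/script&gt;

// The TextWriter overload writes the result to a writer instead of returning it.
var writer = new StringWriter();
WebUtility.HtmlEncode("Tom & \"Jerry\"", writer);
Console.WriteLine(writer.ToString()); // Tom &amp; &quot;Jerry&quot;

// HtmlDecode reverses the process.
Console.WriteLine(WebUtility.HtmlDecode(encoded) == userInput); // True

// A surrogate pair (here U+1F600) is written as a numeric character reference.
Console.WriteLine(WebUtility.HtmlEncode("\U0001F600")); // &#128512;
```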

Conclusion

As we saw in the URL and HTML encoding implementations, both accomplish the same goal: changing a value so that it is safe for its recipient. In HTML encoding, we change the characters that may break an existing HTML page’s structure so that the client can render them safely. In URL encoding, we change the characters that have a special meaning in a URL, which would otherwise cause the recipient to misinterpret the full URL value.

If you’re finding strange behaviors in your rendered ASP.NET pages, you might want to check if you are encoding values properly. Luckily, ASP.NET users get automatic encoding when using Razor, so it is not a common problem but something to keep in mind.

If a server rejects your requests, it might be that there are values in your URL that break its structure. Check for the characters mentioned above, and be sure to encode each value you place into the URL before sending your request.
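As a minimal sketch of that advice, the example below encodes only the query value and then shows what happens when the whole URL is encoded instead. The example.com address and the search term are placeholders.

```csharp
using System;
using System.Net;

string term = "C# & .NET";

// Encode only the value, then place it into the URL.
string url = "https://example.com/search?q=" + WebUtility.UrlEncode(term);
Console.WriteLine(url);
// https://example.com/search?q=C%23+%26+.NET

// Encoding the whole URL also escapes ':', '/', '?' and '=',
// so the result is no longer a usable address.
string whole = "https://example.com/search?q=" + term;
Console.WriteLine(WebUtility.UrlEncode(whole));
// https%3A%2F%2Fexample.com%2Fsearch%3Fq%3DC%23+%26+.NET
```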

I hope you enjoyed this post, and let me know in the comments about a time encoding helped you solve a difficult problem. As always, thanks for reading.