This post is quite a few days overdue; I’ve been a little caught up with planning and implementing a few more features for the blog. Most of them are under the covers, and some not yet finished, but one feature that is visible, at least to a certain extent, is the sanitizing that is now being performed on comments. Before the feature was implemented, to protect users against XSS attacks, I HTML encoded the comments. This works well, but completely prevents users from using tags such as <strong>, <em> and <blockquote>.
To solve this “problem”, I set out to find a simple and safe way of sanitizing comments before storing them in the database, and thus being relatively certain that they can be rendered safely without the need to encode them.
At first I looked for code snippets and I did find a few, of which the most notable ones were found within a “thread” on RefactorMyCode.com started by Jeff Atwood. Many of the code snippets looked promising at first, but nearly all of them shared one common flaw, and that is their usage of regular expressions to ”parse” the targeted HTML. Several of the snippets had flawed expressions that allowed a potential attacker to quite easily circumvent the sanitizer.
Continuing my search, I remembered having heard about an OSS DOM parser project a year ago, and searching through some old emails (gotta love Gmail), I found the email that mentioned it — HtmlAgilityPack. At first I was a bit discouraged by the lack of activity within the project for the past two-and-a-half years or so, but after being unable to find a decent alternative, I decided to give it a go, and fix any potential bugs as I discovered them. So far, I’ve been unable to find any bugs that has affected the ability to sanitize HTML.
The biggest advantage of using a DOM parser over mere regular expressions, is that the DOM parser can’t be fooled as easily, and it also has knowledge of concepts such as self-closing tags, and can also handle unclosed tags properly. Not handling unclosed tags properly is the most comment flaw in most regular expressions that attempt to “parse” HTML.
Using HtmlAgilityPack also gave me another advantage, and that is a fairly rich object model for describing blocks of HTML.
As I started developing the sanitizer, I had to choose whether to blacklist or whitelist tags and attributes — after about 10 seconds of pondering I decided to go with whitelisting. By opting for whitelisting, the risk of being the victim of an attack vector you don’t know about, is significantly reduced…
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
namespace Wayloop.Blog.Core.Markup
{
public static class HtmlSanitizer
{
private static readonly IDictionary<string, string[]> Whitelist;
static HtmlSanitizer()
{
Whitelist = new Dictionary<string, string[]> {
{ "a", new[] { "href" } },
{ "strong", null },
{ "em", null },
{ "blockquote", null },
};
}
public static string Sanitize(string input)
{
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(input);
SanitizeNode(htmlDocument.DocumentNode);
return htmlDocument.DocumentNode.WriteTo().Trim();
}
private static void SanitizeChildren(HtmlNode parentNode)
{
for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--) {
SanitizeNode(parentNode.ChildNodes[i]);
}
}
private static void SanitizeNode(HtmlNode node)
{
if (node.NodeType == HtmlNodeType.Element) {
if (!Whitelist.ContainsKey(node.Name)) {
node.ParentNode.RemoveChild(node);
return;
}
if (node.HasAttributes) {
for (int i = node.Attributes.Count - 1; i >= 0; i--) {
HtmlAttribute currentAttribute = node.Attributes[i];
string[] allowedAttributes = Whitelist[node.Name];
if (!allowedAttributes.Contains(currentAttribute.Name)) {
node.Attributes.Remove(currentAttribute);
}
}
}
}
if (node.HasChildNodes) {
SanitizeChildren(node);
}
}
}
}
Hello,
That is a very, very bery good class. Saved my behind.
Thanks a lot!