XSS Weakness in strip_tags and some notes on parsing HTML/XML

There is another Cross-Site Scripting (XSS) Weakness in the Rails method strip_tag(). The problem was found in the HTML::Tokenizer which has bugs when parsing non-printable ASCII characters.

According to the original post, this has been fixed in Rails 2.3.5 and there is a patch for the 2.2. branch. Earlier versions are unsupported. Upgrade to a newer version if you make use of this method.

The workaround is this:

Users using strip_tags can pass the resulting output to the regular escaping functionality:

  <%= h(strip_tag(…)) %>

However, this is not how it should be. The strip_tags() method should work correctly. The workaround does work, but strip_tags() is based on HTML::Tokenizer which uses a very naive approach to parsing HTML code. It is based on regular expressions to analyze the code. For serious/enterprise implementations, you should not use an error-prone parser library.

  • The REXML is a little better, but not very fast for large amounts of data. It has some bugs and it’s not 100% standard compliant. For larger amounts of data, it may even be used to use a pull parser: REXML::Parsers::PullParser. Some people have successfully parsed HTML with it.
  • And there is libxml, which is a real parser, now with ruby bindings. We haven’t used it with (X)HTML, though. It has a pull parser too, and its quite like the REXML pull parser. LibXML is an extensive C-library which might not available on exotic Linux-derivates or Windows. Nokogiri is also based on LibXML.
  • Update: If you’re using JRuby, you can use tried and tested Java XHTML/XML parsers. For example Apache Xerces or the pull parser Woodstox which supports “almost well-formed” documents (like legacy (X)HTML content).