Enemies of Valid Strict XHTML: Part 1

Microsoft seems to have caught the XHTML validity bug of late. The marketing buzz surrounding its latest online publishing title, Expression Web, mentions XHTML validity in nearly every other sentence. Visual Studio 2005 also touts XHTML validity as a goal and a feature. That's a great start, but neither product is a silver bullet for creating valid Strict XHTML sites. What pitfalls do we need to watch out for? What are the enemies of valid XHTML sites?

Setting The Stage

"Getting Standards Right" is a multifaceted objective. It encompasses much more than creating sites with valid XHTML. It encompasses using appropriate markup (semantics) and accessibility and best practices for user experiences.

This three-part series, however, will focus solely on valid Strict XHTML markup. It will not address DOCTYPE selection or debate whether Strict XHTML markup is "right," "dead," "appropriate," or "better." I'm going to set aside those debates for the moment and begin with valid Strict XHTML as a decided goal.

These concepts apply to anyone creating a dynamic website, regardless of framework, scripting language, content repository, or target browsers. They apply equally to the PHP developer, the .NET developer, and the Ruby developer; equally to someone building with the assistance of a content management system and someone building a custom e-commerce website for a Fortune 500 company.

This concept is universal because even products or frameworks that can produce valid markup can also produce non-valid markup. And the number one reason is obvious:

Garbage into CMS leads to garbage in a browser

"Garbage In. Garbage Out." It's one of the first axioms any computer science student learns. In our case, if your input doesn't validate, then your output is most likely not going to validate either.

With web apps and Strict XHTML, the axiom is especially true. If your content repository contains garbage HTML, then you've already lost the battle.

Where Does Garbage Come From?

Garbage can originate from a variety of sources, but ultimately a single entity is responsible for letting it in.

Most garbage starts with people.

  • You
  • Me
  • The client's admin assistant
  • Aunt Ethel

These folks are provided tools to enter data into the web application. The primary difference between sites that produce valid XHTML and invalid XHTML is the handling of data between the user and the content repository.

Enemy #1: Direct HTML Entry

By and large the fastest way for your application to start producing invalid XHTML is by allowing end users to input raw HTML.

People get lazy or forget or don't know that they need to properly nest their tags. Who can expect the client's admin assistant to remember not to nest a block element inside an inline element? Should Aunt Ethel even be expected to close her image elements? Raw HTML entry is the autobahn to invalid XHTML.
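To make those failure modes concrete, here is a minimal Python sketch (Python is only my choice for illustration) that checks whether a stored fragment is at least well-formed XML. The helper name `is_well_formed` is hypothetical. Mind the limits: well-formedness is a weaker test than Strict XHTML validity. It catches an unclosed image element and mis-nested tags, but not a block element nested inside an inline one, which would need a validating parser and the DTD. Because no DTD is loaded, it will also reject named entities such as &nbsp;.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML once wrapped in a root element.

    Hypothetical helper: checks well-formedness only, not validity
    against the XHTML 1.0 Strict DTD.
    """
    try:
        # Wrap in a single root so multi-element fragments can be parsed.
        ET.fromstring(f"<div>{fragment}</div>")
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    samples = [
        '<p>Hello <em>world</em></p>',      # properly nested
        '<p>Hello <em>world</p></em>',      # mis-nested tags
        '<img src="cat.png" alt="Cat">',    # image element never closed
        '<img src="cat.png" alt="Cat" />',  # self-closed image element
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```

A check like this could sit between the entry form and the repository, but it only tells you that something is wrong; it does not fix what the user typed.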

By allowing unaltered HTML to infiltrate your data repository, you've already lost the battle. You're resigned to sweeping up after the elephants in the parade: taking invalid markup and trying to make it valid after the fact. The problem is exacerbated further when you are dealing with an existing repository of information.

Imagine if 90% of the knowledge base articles at Microsoft contained invalid markup. You would have to either clean it up at the source or fix it as it's served up.

Solution

Choose and deploy a non-HTML markup as the content repository format. Markdown, Textile, and similar content entry syntaxes let users enter content in a meaningful way; that content is then transformed to XHTML upon request. If the output is invalid XHTML, then the transformation failed, but the repository remains valid. Once the transformation is fixed, all content is served correctly.

For instance, Vine Type stores content in Markdown syntax. This vastly simplifies producing valid markup, and the syntax is easy for folks to learn.

People who are only fairly comfortable with email can learn in about twelve seconds that _this_ produces "italics" and **this** produces "bold." Of course, they actually produce <em></em> and <strong></strong>, but this example shows the power of non-HTML markup: _this_ could produce anything upon transformation. The same cannot be said for input that resides in the data repository as user-entered HTML.
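As a rough illustration of the idea, the Python sketch below stores nothing but plain text and builds the XHTML at request time. It is not how Vine Type or any real Markdown processor works; it handles only _this_ and **this**, and the function name `to_xhtml` and its `emphasis` and `strong` parameters are names I made up for the example. Any raw HTML a user types is escaped rather than passed through, and changing the output element is a one-argument change to the transformation, not a repository cleanup.

```python
import html
import re


def to_xhtml(text: str, emphasis: str = "em", strong: str = "strong") -> str:
    """Transform a tiny Markdown-like subset into XHTML paragraphs.

    Hypothetical sketch: supports only _emphasis_ and **strong**.
    """
    # Blank lines separate paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    rendered = []
    for paragraph in paragraphs:
        # Escape &, < and > so user-typed HTML never reaches the output as markup.
        body = html.escape(paragraph, quote=False)
        # Excluding "<" from the match keeps a span from straddling a tag
        # emitted by the earlier substitution, so output stays properly nested.
        body = re.sub(r"\*\*([^*<]+?)\*\*", rf"<{strong}>\1</{strong}>", body)
        body = re.sub(r"_([^_<]+?)_", rf"<{emphasis}>\1</{emphasis}>", body)
        rendered.append(f"<p>{body}</p>")
    return "\n".join(rendered)


if __name__ == "__main__":
    print(to_xhtml("_this_ and **this**"))
    # <p><em>this</em> and <strong>this</strong></p>
    print(to_xhtml("_this_", emphasis="cite"))
    # <p><cite>this</cite></p>
    print(to_xhtml("<b>raw"))
    # <p>&lt;b&gt;raw</p>
```

The trade-off in this sketch is that overlapping markers are left as literal text instead of being guessed at, which loses some formatting but never produces mis-nested tags.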

1 05 Sep 2006

I agree with your comments and also think that developers of free and commercial "rich HTML editors" need to take more responsibility. 99% of dirty markup gets into our product through a "rich HTML" edit box. A simple copy-and-paste from Microsoft Word creates something awful.

I guess we'll struggle with ignorant users till the end of days, but validating, parsing and cleaning of markup needs to get better.

2 05 Sep 2006

Milan: Shhh! You're giving away my Enemy #2! :-)