Validation | June 28, 2004

Recently there has been quite a bit of discussion about validation. Some people feel that validation is extremely important, and that all sites should validate without exception. Others feel that validation is just another tool in a developer’s armoury, to be used if and when they please. My take on validation lies somewhere between these two extremes.

When I build a site, I start by building a number of generic templates. I first mark the content up and then slowly add layers of style. When I’ve finished a template I’ll run it through the validator as a final check to make sure everything is OK. I’ve been coding standards-compliant HTML for a while now, so have a pretty good idea what is and what isn’t valid. Most of the time my templates are fine. If they don’t validate it’s usually because of some minor grammatical error that’s easily fixed.

Occasionally bugs do crop up. If it’s a particularly strange bug I’ll run the template through the validator just as a sanity check. It’s always good to tick off obvious problems before starting more in-depth bug busting. For less seasoned developers, however, the validator really should be your first port of call. I’m amazed by the number of people who post “browser problems” to mailing lists when in fact the problem is down to invalid code.

Once the templates have been accepted and signed off by the client, it’s time to start building the site proper. This is the time when small validation errors are most likely to creep in. One area where errors occur is in dealing with copy. Clients will supply you with the site content, often in Word format, and you’ll simply copy and paste this content into the page/database. However, when you do this you can end up entering incorrectly encoded characters like a ” instead of a &rdquo;.
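As a rough illustration of the kind of clean-up that can catch this, here is a minimal Python sketch that swaps the typographic characters Word tends to produce for their HTML entities. The function name and the list of characters are assumptions made for the example; real pasted copy may contain other characters that need the same treatment.

```python
# Typographic characters that commonly arrive in copy pasted from Word,
# mapped to the matching HTML entities. The list is illustrative, not complete.
WORD_CHARACTERS = {
    "\u201c": "&ldquo;",   # left double quote
    "\u201d": "&rdquo;",   # right double quote
    "\u2018": "&lsquo;",   # left single quote
    "\u2019": "&rsquo;",   # right single quote / apostrophe
    "\u2013": "&ndash;",   # en dash
    "\u2014": "&mdash;",   # em dash
    "\u2026": "&hellip;",  # ellipsis
}

# Build the translation table once; str.translate accepts multi-character values.
_TABLE = str.maketrans(WORD_CHARACTERS)


def encode_word_characters(text):
    """Return text with the characters above replaced by HTML entities."""
    return text.translate(_TABLE)
```

Running pasted copy through something like `encode_word_characters()` before it reaches the page or database leaves ordinary ASCII text untouched.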

On most jobs you’ll have more than one developer working on the site and not everybody will be as up to speed with web standards as you. Because the templates are already done, it’s usually only minor errors that creep in at this point, things like image tags or breaks not being closed properly. Authoring tools can also be a source of validation errors, although they are getting more standards friendly with each new release.

Once a site is built I usually give it a quick once-over with the validator. On small sites I tend to validate every page, but on large sites this isn’t always practical. It would be handy if the W3C validator had a “batch option”; until that happens, the WDG HTML Validator is quite useful. Some authoring tools have validators built in, but in my experience they tend not to be very accurate.

When the site launches, it’s quite easy for validation errors to creep back in. A very common error is unencoded ampersands in links to external sites. If you’ve built a custom CMS, you could check and correct the errors on input (if you use MT, there are plug-ins that will do this). However, this will add to the cost of the build and, to be honest, most clients don’t care about validation. They are much more interested in the business objectives of the site and their ROI. The closest I’ve ever come to clients caring about validation is clients who want their site to adhere to priority 2 accessibility checkpoints, and this in itself is fairly rare.
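To give an idea of what correcting on input might look like, here is a minimal Python sketch that encodes bare ampersands while leaving existing entities alone. The function name and the regular expression are assumptions made for illustration, not how any particular CMS or plug-in does it, and the pattern is a heuristic: anything that merely looks like an entity (an ampersand, a name, then a semicolon) is left untouched.

```python
import re

# An ampersand that does NOT already begin a character reference such as
# &amp;, &#38; or &#x26;. The negative lookahead skips anything entity-shaped.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def encode_bare_ampersands(markup):
    """Replace unencoded ampersands with &amp;, leaving existing references as they are."""
    return BARE_AMPERSAND.sub("&amp;", markup)
```

Because already encoded ampersands are skipped, the function can be run on the same content more than once without double-encoding it.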

The reason I validate sites is mostly out of professional pride. It’s nice to know that you’ve done a good job and created a valid site. However, until people start serving up web pages as application/xhtml+xml instead of text/html, there really isn’t the need to encode every ampersand and close every break. I’m not saying that you shouldn’t strive for validation, as it really is best practice. Just that sites are living, breathing things that are there to be used, and it’s not always possible to catch every last validation error.

Posted at June 28, 2004 9:41 PM

Comments

Samuel Sidler said on June 28, 2004 9:55 PM

Well said, Andy.

Personally, I’ve given people templates that validated perfectly prior to their consumption. What’s more, they don’t even change the template. Simple things such as the XHTML-defunct bold and italic tags are entered by them. Does it hurt my pride? Not really.

Sites should validate when possible, but even I have been guilty of writing ones that don’t for the “greater good” (a nice looking site).

Geoffrey said on June 28, 2004 11:08 PM

I’ve often wondered about why the italic tag gets shunned by XHTML purists. I understand the need to give tags semantic meaning, but isn’t there a difference between giving something emphasis and showing that something is a book title, for instance? Since bold and strong have similar meanings, it doesn’t matter as much there.
I use the em tag whenever I want a word or phrase to have emphasis, but I still use the italic for titles. It validates XHTML Transitional 1.0. Any thoughts on this? Is there a better way?

jim said on June 29, 2004 4:27 AM

Yep, the validator is a developer’s best friend. I had a layout recently that all of a sudden was completely messed up - wasn’t until I tried the validator again that I noticed a bracket in my CSS where there should have been an open brace. Sometimes a simple typo can wreak havoc, making validation something of a necessary quality control.

Anne said on June 29, 2004 6:39 AM

If you don’t encode ampersands, links will break in older browsers like Netscape 4. And XHTML must be served as ‘application/xhtml+xml’. You should check RFC3023 for that. The W3C has nothing useful to say about HTTP and shouldn’t say anything about it.

If you don’t serve XHTML as ‘application/xhtml+xml’ what is the use? You can’t use it in combination with XSLT anymore to extract meta data, since you are not sure your page is well-formed, which is a parser requirement. You’d be better off using HTML. Also, sending XHTML as ‘text/html’ means you are sending invalid HTML to the client, which doesn’t make sense at all. You think you are coding in XHTML, but actually you are sending invalid pages to the client.

There is no benefit from that in my opinion.

Oh, and if you are building your own CMS it is no effort to make sure the output validates.

Isofarro said on June 29, 2004 10:06 AM

Geoffrey writes: “I’ve often wondered about why the italic tag gets shunned by XHTML purists. I understand the need to give tags semantic meaning, but isn’t there a difference between giving something emphasis and showing that something is a book title for instance?”

It’s often a good idea to explain why something has to be in italics. If there’s a piece of text in italics - why? If the reason it’s in italics is that it’s the title of a book, then <i class="bookTitle"> is a good idea.

There are reasons for emphasising text - document those reasons. This adds value to your content.

Tim said on June 29, 2004 10:46 AM

Anne, you are a purist. Not everyone else is. Get over it ;)

Tim said on June 29, 2004 10:52 AM

Anne,

one other thing: I use application/xhtml+xml on my own site, but if I used it on sites at work and some invalid code got in (as it sometimes does) I would rather that my users saw a functioning web site rather than a beige XML parsing error screen!

We’re working towards full validation. It’s just that we’re not there yet. Our frameworks may validate, but our CMSs don’t yet have the capability to ensure that invalid code doesn’t get through. It’s, y’know, money and all that - the clients wouldn’t see any ROI (in this case) from spending not-insignificant amounts of time tightening up our CM tools. They don’t have money to burn…

Randy Charles Morin said on June 29, 2004 12:07 PM

Although I usually side w/ the good enough crowd, in this case, I’ll side w/ Anne. From looking at Andy’s site, he’s obviously a good HTML, oops, I mean xHTML coder. He takes ‘professional pride’ in his work. Anne is simply showing Andy where he can improve.

Jim,
The valid XHTML thing is not difficult. The problem is that most sites use the string building construct (PHP, ASP, JSP). Until people get away from the string building construct, XML will always be difficult.

pid said on June 29, 2004 4:18 PM

George: it’s not a purist thing, it’s just that the i tag has no meaning for non-visual browsers.
The em tag is intended to replace it, by representing the meaning, (of italic in the case of print); this is what people mean when they talk about semantics in relation to markup.

Anne: people come to understand that the mime type is important, but while they’re learning it’s not helpful to be pedantic, and insist that they’re wrong.

The introduction of the XHTML standard is part of the transition to interoperable documents and future markup languages. If people are not able to practice these skills then the jump to modular XHTML, (when browsers are capable and widespread), will be too big and the standard won’t be adopted.

There must be compromise, and it’s better to compromise on something that can be fixed by changing one line in the webserver config - than to insist on something that will leave substantial portions of the web in the dark ages.
(e.g. using italic tags)

pid said on June 29, 2004 4:19 PM

randy: eh?

what string building construct?

Geoffrey said on June 29, 2004 4:50 PM

pid: I understand the reasoning behind adding meaning to the tag, but I’m not so sure emphasis is what I want to give to a book title. I want to give it a “title”. So to me it seems like we are just swapping one tag for another, neither doing the proper job. But in the context of a visual medium I see the problem and that’s why I use em 95% of the time. But I sometimes still use italic for book titles. (I can’t help myself!)

isofarro: <i class="bookTitle"> This seems like overkill to me. I think I’d rather stick with em if the solution is adding classes to italic tags.

Anyway, thanks for the comments. And yes, I get called George all the time. ;)

Tim said on June 29, 2004 5:38 PM

Geoffrey,

for book titles, use the <cite> tag. It is rendered as italic by default, but of course you can change that.

Michael Schmidle said on June 29, 2004 5:43 PM

I think some of you do not really understand the meaning of <i> and the reason why it is still part of XHTML. Not every semantic meaning can be expressed by <em>. The best example would be the following:
She has a je ne sais pas quoi about her.

She has a <i lang="fr">je ne sais pas quoi</i> about her.

A speech browser should not emphasize but put a French accent to the “je ne sais pas quoi”. A graphical browser is not able to do so, but can print it in italic—that is a purpose among others of the <i> tag.

Matthew Thomas wrote an excellent article about when semantic markup goes bad.

DH said on June 29, 2004 6:24 PM

Andy - thanks for another informative and insightful post. Always good to hear about method in site development. Thank you.

Geoffrey said on June 29, 2004 7:16 PM

Tim: Thanks for the tip. I hadn’t considered using cite, and maybe that’s a good choice. In the past I’ve used it to denote the author of a pull-quote, but I’m thinking it makes sense here too. Cool.

Michael: Great resource, Thanks!

Back to work…

Anne said on June 29, 2004 7:50 PM

If you can’t guarantee sites that validate or are at least well-formed (that means your characters have to be encoded properly as well) why would you ever use XML?

At least HTML lets you get away with it. If you still think XHTML is better, then forward compatibility is irrelevant and not an argument, since your documents aren’t well-formed anyway.

And I have to be a purist, otherwise I couldn’t code XHTML ;-)

Randy Charles Morin said on June 30, 2004 1:42 AM

String building is the way PHP, ASP and most of the Web scripting languages create their output. Basically, you write bytes to an output stream and they appear on the client as written and unvalidated. This is very error prone and that’s why XHTML or for that matter XML is often not well-formed.

If you are familiar w/ ASP.NET, then writing the XML directly in an ASP page is string building, whereas using the XmlTextWriter class is not, and is much less prone to validation issues.

steve said on June 30, 2004 9:54 AM

1) The good thing about XHTML, even if we cannot always have servers - or user agents - that handle it in all its glory, is that being XML, any validating parser fed the DTD will do a validation job. Last year I wrote myself a couple of such tools (one in .NET using the built-in parser, one in Java using Apache Xerces) so I can more easily keep my own personal sites valid, and I’ve published them as GPLd freeware (www.ravnaandtines.com).

2) In the ideal world, the italicisation (or other formatting) for languages would be handled by a style declaration that picked up on the language definition directly, like

span[lang] {font-style:italic; voice-family: attr(lang);}

and the code would be like

Of course one can do better with a little <span lang="fr">savoir faire</span>

Alas, the majority browser also needs

span.lang {font-style:italic;}

and <span class="lang" lang="fr">

Of course all of these take a few more bytes than the un-adorned <i>…

Isofarro said on June 30, 2004 12:13 PM

pid asks: “what string building construct?”

PHP and ASP build their pages dynamically by essentially concatenating strings together. Either the page is an HTML page with dynamic parts generated by code, or a templating system is used that creates a string which effectively represents the page to be returned. Either way, the environment treats the generated HTML page as nothing more than a string of text.

The problem with this approach is that it is hard to ensure validation of the markup. It takes more effort to ensure that generated pages are valid.

I guess what Randy is hinting at, and something I’ve been thinking about for a while, is to use a Document Object Model (DOM) as a templating system. By treating a dynamically generated page as a document object rather than a string of text, it’s easier to produce valid documents.

Comparing dynamically generated HTML pages with dynamically generated XML documents - generating XML documents by concatenating strings is frowned upon. Yet for generating HTML documents this seems the only mainstream way of doing the job.
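A minimal sketch of the document-object approach, written in Python with the standard library’s ElementTree rather than PHP or ASP (the helper names `build_link_list` and `render` are made up for the example): the page fragment is assembled as a tree of elements, and only turned into text at the very end.

```python
import xml.etree.ElementTree as ET


def build_link_list(links):
    """Build a <ul> of links as an element tree instead of concatenating strings."""
    ul = ET.Element("ul")
    for href, label in links:
        li = ET.SubElement(ul, "li")
        a = ET.SubElement(li, "a", {"href": href})
        a.text = label
    return ul


def render(element):
    """Serialise the tree to text; the serialiser escapes & and < and closes every tag."""
    return ET.tostring(element, encoding="unicode")


if __name__ == "__main__":
    fragment = build_link_list([("http://example.com/?a=1&b=2", "Fish & chips")])
    print(render(fragment))
```

Since the serialiser, not the developer, writes the angle brackets and ampersands, an unclosed tag or unencoded ampersand can’t slip in the way it can with string concatenation. This only ensures well-formed output; whether the elements are valid against a particular DTD is still up to the code that builds the tree.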

steve said on June 30, 2004 12:17 PM

PS forgot to say — those tools do perform batch validation (this folder and any sub-folder) — which was Andy’s feature request.
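This is not the code of those tools, just a rough Python sketch of the same folder-and-sub-folder idea. It assumes pages are stored as `.xhtml` files, and it checks well-formedness only, not validity against the DTD: the standard library parser used here doesn’t validate and doesn’t read external DTDs, so pages relying on named entities such as `&nbsp;` may be reported as errors. The function name `check_folder` is hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_folder(root, pattern="*.xhtml"):
    """Return {path: error message} for each file under root that is not well-formed XML."""
    problems = {}
    # rglob walks the folder and every sub-folder beneath it.
    for path in sorted(Path(root).rglob(pattern)):
        try:
            ET.parse(path)
        except ET.ParseError as exc:
            problems[str(path)] = str(exc)
    return problems


if __name__ == "__main__":
    for name, error in check_folder(".").items():
        print(f"{name}: {error}")
```

An empty result means every matching file parsed cleanly; each entry otherwise gives the file and the parser’s line and column for the first problem it hit.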

ptgamer said on June 30, 2004 2:28 PM

Hi Andy,
I’m a huge fan of your writings, love almost everything you write.

There is an issue that is consuming my mind. XHTML purists want semantic meaning in all tags. Ok, not a really hard job to do.
But what if you are Portuguese (like myself)… should we tag in English, not useful for “my users” or should we tag in Portuguese? And if so, what to do about the punctuation, what to do with a word like: “Navegação” (Navigation)?

Thanks for your attention.

Charl van Niekerk said on June 30, 2004 3:21 PM

I know that a lot of people will argue with me about this, but is it really so difficult to code valid XHTML?

I know that if you already have a site, it is very difficult to get everything converted (especially when the site is large). But if you are a well-informed developer, striving towards proper markup really isn’t that bad. It can even often save you time, and of course it helps cross-browser compatibility and even accessibility.

XHTML is much stricter, but also much less prone to human error once it validates. In other words, if you forget some ending tag, it will warn you. I saved myself a lot of time with debugging by simply validating.

Ok, I’m not going to go off into all of the advantages of valid XHTML against tag soup again. The only point I am trying to make is that if your markup doesn’t validate, there is something wrong with it. Period. And it is probably better to fix it.

Otherwise, you might just run into problems one day, and by taking shortcuts now you will just be wasting your own time later by trying to fix them. Prevention is better than cure, I always say.

Pauly said on July 1, 2004 4:57 PM

Somehow it’s amusing that someone further up the thread had to use underscores to emphasise something… No offense, hello, first time here.