The Truth About HTML5 Micro-semantics and Schema.org
A common claim made about the new HTML5 structural elements is that they are “more semantic.”
In my view, the new elements are “more semantic” in the same way fruit-flavored candy bars are “more nutritious”—not at all.
Nevertheless, the question of semantics in HTML5 gives us an excellent excuse to take a quick trip through the big picture of “semantic” markup. We’ll look at where semantic markup came from and what semantic markup promised to deliver but never quite did, and we’ll finish with a quick look at something you can use right now—new schemas put forward by the major search engine companies (Google, Microsoft, and Yahoo) that will ideally improve the display of your search results.
By the end of this chapter, your markup nerd-dar will be so finely tuned you’ll be able to separate the markup poseurs using semantic as a mere buzzword from the hard-core markup wonks who are still waiting for the Semantic Web to arrive, any day now...
When it comes to the Web, there are actually two kinds of “semantics”: the nitty-gritty markup of a given web page and the so-called Semantic Web. Let’s start with the semantic markup we practice every day as web designers.
“Semantic markup” was one of the cornerstones of the web standards movement. In 2003 Jeffrey Zeldman, perhaps the best-known advocate for semantic markup and web standards, wrote this on his blog (www.zeldman.com/daily/0303a.shtml):
CSS combined with lean semantic markup makes sites faster, more portable, and more accessible. The combination helps sites work in more existing environments and is the best hope of preparing them for environments that have not yet been developed.
This was a major change in both theory and practice for web designers. We’d keep all the styling information about a page in a separate CSS file and describe the content with “lean, semantic markup,” as Zeldman put it.
Here’s a (slightly reworked) example of semantic markup Zeldman used in a 2002 Digital Web article (www.digital-web.com/articles/999_of_websites_are_obsolete/). First, Zeldman borrowed some “unsemantic” markup from an e-commerce site to show what we were moving away from (try not to shudder when you read it):
<td width="100%"><font face="verdana,helvetica,arial" size="+1" color="#CCCC66"><span class="header"><b>Join now! </b></span></font></td>
And then, with CSS handling the styling, the markup simply became this:
<h2>Join now!</h2>
And lo and behold, there it was: lean, semantic markup that we pretty much take for granted now. It was a big, and extremely worthwhile, shift in practice.
But what makes this example “semantic” and not the first one? Semantic is just a fancy way of saying “meaningful,” and by using a heading tag (<h2>), we make the text mean something to browsers (and screen readers): “This is a heading.” Screen readers can (and do) use these headings to navigate around a document, and browsers can give these elements default styling (for example, making the heading a block-level element).
It also makes it easy for us humans to read. When we scan the markup, there’s no doubt about what this text is—it’s a heading. Simple, right?
This highlights the two key groups that matter in “semantic markup”: humans and machines (browsers, screen readers, search engines, and so on). It should be both “human readable” and “machine readable”—Semantic Markup 101.
“Machine readable” semantic markup has other benefits. Search engines can scan, index, and search our content in a way that’s much harder (if not impossible) with Flash sites or web sites consisting purely of images (as print designers are occasionally wont to churn out).
That said, Google doesn’t care much about what markup you use.
These Problems Have Been Solved
Here’s the thing: the solution to these problems has been around for more than a decade, no matter what flavor of HTML you are using. Search engines can index our content, screen readers can understand it, and our lean, semantic markup makes it easy to read and maintain.
Then the pedants took over.
People in web design circles began to think, “Well, if semantic is good, then more semantic must be better, right?”
Not really.
Beyond the point of human readability and basic machine readability, “more semantic” doesn’t mean anything (irony ahoy!). But this hasn’t stopped people debating which elements are more semantic or more appropriate, which nine times out of ten is about as useful as debating whether it’s “splade” (or, more correctly, “splayd” for you sticklers out there) or “spork.” (Splade, obviously.)
There’s No Such Thing As “More” Semantic
I humbly propose that the unqualified use of “more semantic” be banned from web design discussions about HTML elements posthaste.
Whenever you hear someone going on about something being “more semantic,” ask them this simple question:
“For who?”
If all they can come back with is “But it’s ... MORE SEMANTIC!” they’re just making a vague claim about nothing. But if they say something like “More semantic for screen readers,” that’s a valid claim we can evaluate.
Do screen readers really do anything different for these “more semantic” elements? Are they supported at all? Or do they cause bugs like the HTML5 elements did when they were first used? (See www.accessibleculture.org/blog/2010/11/html5-plus-aria-sanity-check/.)
(Remember: because of the no-JavaScript IE6–8 issues, using HTML5 elements for accessibility is about as useful as dieting on doughnuts.)
Likewise, if they say “But it’s more semantic for search engines,” we can evaluate that specific claim. What do Google’s developer guidelines say? What does the SEO community think? And so on.
But please, no more unqualified claims of “But it’s more semantic” when discussing HTML5. These dubious assumptions have been attaching themselves like barnacles to the good ship Web Standards for years, and it’s time we revved up the high-pressure hose and cleaned them off (assuming that’s how barnacles are, in fact, removed).
OK, mini-rant over. The human readability and basic machine readability problems have been solved, this is where we’re at, and we may hope that HTML5 will take us forward. But before we get to HTML5’s approach, let’s talk about the Big Idea™ behind semantic markup.
Big Ideas in Semantic Markup: The Semantic Web
What if we could take the “machine-readable” part of semantic markup further? What if the machines (and browsers in particular) could read our markup and know not just what content appeared but what given blocks of content actually meant?
That’s the big idea behind semantic markup. If we can describe the content of our pages accurately and specifically, then machines can do cool stuff with the data.
This is (or perhaps was) partly the idea behind the Semantic Web, a big, broad concept that would be driven by the XML-ified Web. (Read more about it here: http://en.wikipedia.org/wiki/Semantic_Web.) The Web would be a perfectly described library of documents, marked up in excruciating detail with XML. An XML-based future was something many influential people believed in. In fact, in the same 2002 article that gave us the earlier <h2> example, Zeldman described web standards as a way we can “transition from HTML, the language of the Web’s past, to XML, the language of its future.”
However, as we saw in Chapter 1, the move to XML died, and with it the dream of a true Semantic Web. Instead, the Web became a wonderful platform for applications, went social, and kept on being the Web we know and love. But it wasn’t the capital-S Semantic Web people had hoped for.
We need to keep this history in mind when people talk about “semantic” elements in any situation, whether it’s HTML5 or whatever HTML evolves into next. What kind of “semantics” are they referring to: the basic human- and machine-readable semantics we all use every day or the dead-end dream of the XML-powered Semantic Web?
Semantics: Not Dead Yet (Or: Google & Co Drop a Micro-Semantic Bombshell)
There’s actually a third option that sits between the lean, semantic markup we use now and the pie-in-the-sky Semantic Web, called microdata (and microformats), which adds a layer of metadata to our markup.
(A variety of approaches compete here, particularly microformats, microdata, and RDFa. But I’ll just be referring to the overall concept as micro-semantics, which is also known as “structured data.”)
With micro-semantics, we simply embed semantic data into our existing HTML document. Let’s look at how micro-semantics could help daily life on the Web.
E-commerce with Real (Micro) Semantics
Let’s use online shopping as an example. Here, truly semantic markup could theoretically help desktop browsers (in other words, all of us), the visually impaired using screen readers, and search engines.
Those are just a few examples of what’s possible when we have truly semantic markup. Machines—browsers, screen readers, and search engines—can easily pick out useful information and do cool things with it (such as create a comparison shopping list).
The problem is, to use different tags to describe this data, the HTML spec would need a squillion different tags. Every kind of content—from poems to products to policy documents—would need its own tags so the machines knew what the content was. The list of HTML tags would literally be a small dictionary or, rather, a very large dictionary as more and more tags were added to the spec. Authors writing about HTML would quite likely lose their minds.
The good news is we can mark up our content and make this comparison shopping possible (especially the search engine example) without needing any more HTML tags. We simply annotate our existing HTML with attributes and values that machines can read. (I’ll talk more about this soon.)
Adding a handful of new elements HTML5-style, however, is not a path to “more semantic” documents. They don’t help machines do much with the data, and our markup becomes more cluttered—hardly a way to make it more readable.
Instead, we need a new mechanism to describe this data. Ideally that’s where HTML5 will lead us.
Can the Real Semantics Please Stand Up?
I know what you’re thinking. “If only we had a way of adding tags that didn’t pollute the entire spec. Some sort of eXtensible Markup Language.” But as we saw in Chapter 1, we tried that, and it failed.
Clearly we need a way to extend HTML that doesn’t involve adding a dictionary’s worth of elements to the spec or trying to XML-ify the Web.
There is a third option, and a bunch of people have been working on various solutions for quite a few years.
Here’s the idea in a nutshell: just attach attributes with values from an agreed bunch of terms to our existing HTML. Here’s an example (I’ve made up the attribute and value):
<div class="myclass" semanticdata="mysemanticvalue"> ... content ...</div>
As you can see, it’s pretty simple. But it’s worth teasing out the terminology because the different terms and implementations can make a simple idea seem far more complex than it actually is.
We need to distinguish between several pieces of the micro-semantics pie:
One group that has been doing cool things with micro-semantic data is the microformats community. They have an active community (http://microformats.org/), a microformats way to use HTML as infrastructure (the class attribute), and specific microformat vocabularies. These are the various parts of the micro-semantics pie and demonstrate how communities have been able to come together to do semantics in a meaningful way on the Web.
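To make the class-based approach concrete, here is a minimal sketch of a contact card marked up with the hCard microformat. The class names (vcard, fn, url, org, tel) come from the hCard vocabulary; the person, company, and contact details are invented placeholders.

```html
<!-- hCard: the root class "vcard" marks this element as a contact card -->
<div class="vcard">
  <!-- "fn" is the formatted name; "url" is the person's home page -->
  <a class="fn url" href="http://example.com/">Jane Example</a>,
  <!-- "org" is the organization the person belongs to -->
  <span class="org">Example Widgets Ltd</span>
  <!-- "tel" is a telephone number -->
  <span class="tel">+1-555-0100</span>
</div>
```

Nothing here is new HTML: a browser renders it as an ordinary link and two spans, while a microformats-aware tool can read the class names as contact fields.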
You may have heard of and perhaps implemented microformats in the past. Unfortunately, as I write, their future has been more or less killed off by the search giants, which have proposed a new way forward for micro-semantics or “structured data.”
Why Should We Care About Micro-semantics?
In 2011 Google, Microsoft, and Yahoo launched what may be the biggest effort to get real semantics into HTML documents in the history of the Web.
And how did they launch it? With a blog post and a web site that had all the pizzazz of a “My First HTML Page” template knocked out during a hurried lunch break (see Figure 5-1). And they also managed to single-handedly annoy everyone already invested in the process who’ve been evangelizing micro-semantics for years. Not a good start.
Figure 5-1. Schema.org. Who said semantics weren’t sexy? Oh ... everyone. Right.
Schema.org: The Future of Semantics?
In mid-2011 a handful of engineers from Google, Microsoft, and Yahoo decided they didn’t like the current, community-driven approaches and announced they were picking HTML5’s microdata as the winning infrastructure (that is, the HTML attributes we should use to add micro-semantic data). And so they released Schema.org (http://schema.org/)—a list of vocabularies, or “schemas,” that the major search engines would use to display richer search results.
In this way, all three parts of the micro-semantic pie were changed. The infrastructure (HTML5’s microdata), the vocabularies (Schema.org), and the drivers (corporations, not communities) were all new.
(You can read Google’s announcement at http://googlewebmastercentral.blogspot.com/2011/06/introducing-schemaorg-search-engines.html, Microsoft’s announcement at www.bing.com/community/site_blogs/b/search/archive/2011/06/02/bing-google-and-yahoo-unite-to-build-the-web-of-objects.aspx, and Yahoo’s at http://developer.yahoo.com/blogs/ydn/introducing-schema-org-collaboration-structured-data-44741.html.)
Figure 5-2 shows an example of a richer search result.
Figure 5-2. Google “iPhone review,” and you’ll get results similar to this one. Note how much metadata is included: rating, reviewer, date, and breadcrumbs are all present here.
Couldn’t We Do This Before?
This is similar to the Rich Snippets micro-semantics initiative Google launched in 2009, which you may have heard about (or even implemented). But Rich Snippets supported only a handful of existing vocabularies and let authors choose between microdata, microformats, and RDFa. (Plus, it was supported only by Google.)
Now we have one “approved” infrastructure for implementation (microdata), one set of vocabularies at a central location, and a big reason for implementing them: support in Google, Bing, and Yahoo.
That’s a big deal.
(Keep in mind this is purely for search result display, not search ranking. It’s important our clients know the difference.)
What’s remarkable isn’t the search giants choosing one infrastructure but rather the 300-odd vocabularies that will potentially define semantics on the Web for years to come. And it was all done behind closed doors with no standards process (or community involvement) whatsoever.
The Semantic Web We’ve Been Waiting For?
Make no mistake, this is the biggest, actually-supported thing to happen for semantics on the Web since, well, pretty much forever.
Way back in Chapter 1 we looked at how XML was supposed to transform semantics on the Web but didn’t. (It was just Architecture Astronauts at work.) We’ve also looked at how HTML5 adds a few semantic elements that either are harmful or add up to very little. (Adding more elements to HTML proper isn’t a solution for semantics.)
This approach of micro-semantics promises a middle way interested communities have been exploring for some time. Let’s run through the existing approaches before we look at the Schema.org launch (and everything that was so horribly wrong with it).
The microformats community has been developing and advocating micro-semantics with reasonable success for years, after kicking off in 2004 (see http://microformats.org/wiki/history-of-microformats). This is from http://microformats.org/about:
Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. Instead of throwing away what works today, microformats intend to solve simpler problems first by adapting to current behaviors and usage patterns (e.g. XHTML, blogging).
For example, in February 2011, all of Facebook’s events were published using microformats (see http://microformats.org/2011/02/17/facebook-adds-hcalendar-hcard). And with the appropriate browser extension (such as the Google Calendar extension for Chrome), a button would appear next to an event, which you could click to add the details to your calendar. Pretty neat, eh?
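As a rough illustration of what event markup of this kind looks like, here is a minimal hCalendar sketch. The class names (vevent, summary, dtstart, location) belong to the hCalendar vocabulary; the event itself is made up, and this is not Facebook’s actual markup.

```html
<!-- hCalendar: the root class "vevent" marks this element as an event -->
<div class="vevent">
  <!-- "summary" is the event title -->
  <span class="summary">Web Standards Meetup</span>
  <!-- "dtstart" carries the machine-readable start time in the title attribute -->
  <abbr class="dtstart" title="2011-02-17T19:00">February 17, 7pm</abbr>
  <!-- "location" is where the event takes place -->
  <span class="location">Example Hall, Springfield</span>
</div>
```

A calendar extension only needs to find the vevent element and read those few classes to offer an “add to calendar” button.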
How did Tantek Çelik, one of the founders of Microformats.org, react to the Google and Microsoft Schema.org announcement (http://twitter.com/#!/t/status/77083481494142976)?
#schemaorg spits in the eyes of every person and company that worked on open vocabularies like vCard, iCalendar, etc.
Ouch.
Microformats was a simple, straightforward, limited-by-design approach to micro-semantics.
RDFa (or Resource Description Framework in Attributes) was the W3C’s much more complex (but more flexible) approach to machine-readable data, which has been kicking around since 1997 as just “RDF.” (RDFa was started in 2004.) It never really captured developer interest in any significant way, but it’s still hanging around.
As debate raged about the Schema.org announcement mid-June, Mark Pilgrim quipped the following (http://twitter.com/#!/diveintomark/status/80980932957450240—link now 404s; this was before Pilgrim’s Internet disappearing act):
The W3C: failing to make RDF palatable since 1997
Zing.
But there have been some interesting real-world uses, such as the GoodRelations vocabulary for e-commerce (www.heppnetz.de/projects/goodrelations/) that could drive the e-commerce example we looked at earlier.
Web designers generally prefer the simplicity of microformats to the flexibility and complexity of RDFa. Nevertheless, a community interested in micro-semantics had grown around RDFa.
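For comparison with the class-based microformats approach, here is a minimal RDFa sketch using the Dublin Core vocabulary. The attributes (xmlns:dc, about, property, content) are RDFa 1.0 syntax; the article URL, title, author, and date are invented placeholders.

```html
<!-- The xmlns:dc declaration binds the "dc" prefix to the Dublin Core vocabulary -->
<!-- "about" names the resource that the properties below describe -->
<div xmlns:dc="http://purl.org/dc/elements/1.1/"
     about="http://example.com/articles/semantics">
  <!-- "property" attaches the element's text to the resource as dc:title -->
  <h2 property="dc:title">The Trouble with Semantics</h2>
  <p>By <span property="dc:creator">Jane Example</span>,
     <!-- "content" supplies a machine-readable value in place of the visible text -->
     published <span property="dc:date" content="2011-06-15">June 15, 2011</span>.</p>
</div>
```

The prefix declaration and URI-based vocabularies are where the extra flexibility comes from, and also much of the extra complexity.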
How did Manu Sporny, the current chair of the W3C’s RDF Working Group, react to Google and Microsoft’s Schema.org announcement? In “The False Choice of Schema.org” (http://manu.sporny.org/2011/false-choice/), he said this:
Schema.org is the work of only a handful of people under the guise of three very large companies. It is not the community of thousands of Web Developers that RDFa and Microformats relied upon to build truly open standards. This is not how we do things on the Web.
Yikes.
Finally we have microdata, the new format used in Schema.org.
Nothing compels web authors to add esoteric metadata to their pages like several competing, slightly different metadata formats. So Ian Hickson, the HTML5 editor, decided microformats was too cold and RDFa was too hot, and invented a third approach, microdata, that he felt was just right (so to speak). (Here’s Hickson’s lengthy WHATWG post introducing the feature: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-May/019681.html.)
Note that microdata, as far as the HTML5 spec is concerned, is about providing the infrastructure (with new, valid attributes) for adding micro-semantics. It doesn’t specify what those vocabularies should be or who should invent or maintain them. It is completely separate from the actual vocabularies on Schema.org (for example).
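Here is a minimal sketch of that infrastructure on its own. The attributes itemscope, itemtype, and itemprop are the ones microdata defines; the vocabulary URL and property names are hypothetical, chosen only to show that microdata itself does not care which vocabulary you plug in.

```html
<!-- "itemscope" creates an item; "itemtype" names the vocabulary it uses -->
<!-- (http://example.com/vocab/Product is a made-up vocabulary, not a real one) -->
<div itemscope itemtype="http://example.com/vocab/Product">
  <!-- "itemprop" names a property; its value is the element's text -->
  <span itemprop="name">Example Kettle</span>
  <span itemprop="price">29.95</span>
  <!-- on a link, the property's value is the href rather than the link text -->
  <a itemprop="url" href="http://example.com/kettle">Details</a>
</div>
```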
And this is the format that won, in a blessed-by-the-tech-giants sense.
(For a lengthier discussion of the various formats and the implications of Schema.org, see Henri Sivonen’s excellent “Schema.org and Pre-Existing Communities” at http://hsivonen.iki.fi/schema-org-and-communities/.)
Microdata and Schema.org
Now Google, Microsoft, and Yahoo are pushing not only a single format (microdata) but also a single set of vocabularies for real semantics on the Web.
Everything has a specific vocabulary (or “schema”): books, movies, events, organizations, places, people, restaurants, products, reviews ... you name it. (See the full list here: http://schema.org/docs/full.html.)
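As a sketch of how one of these schemas is applied, here is a film marked up with the Schema.org Movie type using microdata. The type URLs and the property names (name, director, duration, trailer) are taken from Schema.org; the film, the director, and the trailer URL are placeholders, and writing the duration as an ISO 8601 value is my assumption about the expected format.

```html
<!-- The whole block is one item of type Movie -->
<div itemscope itemtype="http://schema.org/Movie">
  <h2 itemprop="name">An Example Film</h2>
  <!-- The director is itself an item (a Person) nested inside the Movie -->
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
    Director: <span itemprop="name">Jane Example</span>
  </div>
  <!-- <meta> carries a value that is not displayed on the page -->
  <meta itemprop="duration" content="PT1H45M">
  <a itemprop="trailer" href="http://example.com/trailer.html">Trailer</a>
</div>
```

The visible page is unchanged; the extra attributes are there purely so a search engine can pick out the title, director, running time, and trailer.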
There are even schemas for identifying parts of web pages themselves, including the header, footer, sidebar, and navigation. I guess ARIA, HTML5, and so on, weren’t enough.
If this takes off, and that’s a big “if,” it will be a huge revolution in how we mark up our pages—bigger than XHTML, HTML5, and whatever flavor of HTML comes next.
Has the Semantic Web finally arrived?
How Not to Launch an Initiative
“schema.org ... there’s just nothing quite like throwing away years of vocabulary/ontology work”
—Jay Myers, June 3, 2011; http://twitter.com/#!/jaymyers/status/76344419867037696
Well, not if the tepid launch of this new initiative is anything to go by. It was pretty much a textbook case of what not to do.
Here are a few things they could have handled slightly better:
<br> | | <a name=Movie><a href=../Movie>Movie</a>: <span class="slot">duration</span>, <span class="slot">director</span>, <span class="slot">actors</span>, <span class="slot">producer</span>, <span class="slot">trailer</span> <br> | | <span class="slot">productionCompany</span>, <span class="slot">musicBy</span><br>
(That describes movies, by the way.) How are we supposed to take these micro-semantics seriously when they can’t even use basic HTML semantics on their own web site? (Update: In 2012 this markup was improved by changing the list to ... a giant table, complete with nested tables and spacer cells. Go figure.)
And let’s not mention the huge number of schemas listed (more than 300!), the fact microdata hasn’t been implemented correctly (see http://jenitennison.com/blog/node/156), or the issues about patents (see www.seobythesea.com/?p=5608 halfway down). What a mess.
All this for potentially the biggest change to web semantics since the Web kicked off.
What Do the People Behind Schema.org Think?
Kavi Goel, a product manager at Google, participated in a session at SemTech 2011 (the “Semantic Technology Conference”) that discussed Schema.org. And some of the responses don’t exactly inspire confidence. (See the W3C’s official transcript here: www.w3.org/2011/06/semtech-bof-notes-smaller.html.)
Here’s an example (slightly abridged):
Ivan Herman: Schema.org is out there, ... how do you envisage the process for the future whereby schema.org might be a place where new vocabs are developed. I [sic] place to make it a more open social process?
Kavi Goel: I don’t have a great answer right now. I don’t think any one company wants to own this in its entirety. By going with 3, we showed we [Google] weren’t just doing it. [...]
Then it leaves the question of where is the completely open discussion ... We don’t have an answer yet, but this is important. We’ll need to sort out the stuff that’s out there.
Kevin Marks: Ours [microformats] has an edit button, yours has a feedback button. The CORE of microformats is we reach agreement. YOU said “we did it in a closed room”. You haven’t shown your work, your evidence, how others can get involved. This is the most worrying thing.
Kavi Goel: That’s a totally valid point. Microformats did a great job creating an open community.
There’s no good answer for why we didn’t do that.
Coming to microformats with a whole bunch of new things could have been an option. We did want to get something out there.
Earlier in the discussion, Goel said this:
The achievement was to get something out there. We know it’s not perfect. We can make it better. We hope this can be a step toward great adoption.
Here’s hoping. The rush to “get something out there” seems to have done more harm than good at this stage, but they can redeem themselves. We now have one format and one set of vocabularies to use for micro-semantics on the Web. If Google (and/or Microsoft) actually throws some resources at it and someone at either company actually takes ownership of the project, it could be a very big deal indeed.
To the credit of those involved in Schema.org, consultation is finally taking place, and interested parties are discussing a way forward. See, for example, “Schema.org Workshop—A Path Forward” at http://semanticweb.com/schema-org-workshop-a-path-forward/.
Also see the sporadically updated Schema.org blog for further outreach efforts: http://blog.schema.org/.
Wrapping Up: Semantics and HTML
The waves from the Schema.org announcement are still rippling out across the Web as I write. But even so, we can still say a few things about semantics, HTML, and what we should do:
Ultimately, Schema.org is a case of glass half-full/glass half-empty. We now have a well-supported, standard set of semantic schemas we can easily add to any HTML structure. And if we search with Google, Bing, or Yahoo, we can get tangible results. The chicken-and-egg problem of adding semantic data has been solved, the format has been chosen, and the schemas have been released.
But rushing the launch (which was underwhelming, to say the least), abandoning any standards process whatsoever for the vocabularies, and trampling years of existing work are heavy prices to pay.