More Searchable Content and Content Types

The emphasis throughout this book has been on providing crawlers with textual content that is semantically marked up in HTML. However, less accessible document types—such as multimedia, content behind forms, and scanned historical documents—are being integrated into search engine results pages (SERPs) more and more as search engines evolve in the ways they collect, parse, and interpret data. Greater demand, availability, and usage also fuel this trend.

Engines Will Make Crawling Improvements

The search engines are breaking down some of the traditional limitations on crawling. Content types that search engines previously could not crawl or interpret are being addressed. For example, in mid-2008 reports began to surface that Google was finding links within JavaScript (http://www.seomoz.org/ugc/new-reality-google-follows-links-in-javascript-4930). It is certainly possible that the search engines will eventually begin to execute JavaScript to find content embedded within it.
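
To illustrate what “finding links within JavaScript” can mean in practice, here is a minimal sketch in Python of one plausible approach: scanning script blocks for URL-like strings without executing any code. The regular expressions and the sample markup are invented for illustration; the search engines have not published how they actually do this.

import re

# Hypothetical sketch: even without executing JavaScript, a crawler can
# harvest link candidates by scanning script blocks for URL-like strings.
SCRIPT_RE = re.compile(r"<script[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)
URL_RE = re.compile(r"""["'](https?://[^"']+|/[^"'\s]+)["']""")

def extract_js_link_candidates(html):
    """Return URL-like strings found inside <script> blocks."""
    candidates = []
    for script in SCRIPT_RE.findall(html):
        candidates.extend(URL_RE.findall(script))
    return candidates

sample = "<script>document.write('<a href=\"/products/widgets\">Widgets</a>');</script>"
print(extract_js_link_candidates(sample))  # ['/products/widgets']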

In June 2008, Google announced that it was crawling and indexing Flash content (http://googlewebmastercentral.blogspot.com/2008/06/improved-flash-indexing.html). In particular, this announcement indicated that Google was finding text and links within the content. However, there were still major limitations in Google’s ability to deal with Flash-based content. For example, it applied only to Flash implementations that did not rely on external JavaScript calls, which many Flash-based systems use.

Perhaps the bigger problem is that Flash is not inherently textual. Much like video, the medium offers little incentive to use a lot of text, and that limits what a search engine can interpret. So, although this is a step forward, the real returns for people who want to build all-Flash sites will probably have to wait until social signals become a stronger factor in search rankings.

Another major historical limitation of search engines is dealing with forms. The classic example is a search query box on a publisher’s website. There is little point in a search engine punching in random queries just to see what results the site returns. However, there are other cases in which a much simpler form is in use, such as one that a user fills out to get access to a downloadable article.

Search engines could potentially fill out such forms, perhaps according to a protocol with predefined rules, to gain access to such content in a form they can index and include in their search results. A lot of valuable content is currently isolated behind such simple forms, and defining such a protocol is certainly within the realm of possibility (though it is no easy task, to be sure). Google has stated that it has this capability, but that it will use it only on very important but otherwise inaccessible sites (http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html).
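
As a thought experiment, the Python sketch below shows how crawling through a simple GET-based form might work: pick plausible values for the form’s fields, construct the resulting URL, and fetch whatever page comes back for indexing. The page URL, form action, and field names are all invented; Google has not published the mechanics of its form-crawling system.

from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

def crawl_simple_get_form(page_url, action, field_values):
    """Submit a GET form with predefined values; return the body for indexing."""
    target = urljoin(page_url, action) + "?" + urlencode(field_values)
    with urlopen(target) as response:   # fetch the result page
        return response.read()

# Hypothetical usage: a form that gates a downloadable article.
# body = crawl_simple_get_form("http://example.com/library",
#                              "/articles/search",
#                              {"topic": "widgets", "format": "html"})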

This is but one specific example, and there may be other scenarios where the search engines might perform form submissions and gain access to currently inaccessible content.

Engines Are Getting New Content Sources

As we noted earlier, Google’s stated mission is “to organize the world’s information and make it universally accessible and useful.” This is a powerful statement, particularly in light of the fact that so much information has not yet made its way online.

As part of its efforts to move more data onto the Web, in 2004 Google launched an initiative to scan books so that they could be incorporated into its Book Search engine (http://books.google.com/). This became the subject of a lawsuit by authors and libraries, but a settlement was evidently reached in late 2008 (http://books.google.com/googlebooks/agreement/). The agreement is still subject to full ratification by the parties, but that is expected to be resolved before the end of 2009. In addition to books, many other historical documents are worth scanning. Google is not the only organization pursuing such missions (e.g., see http://www.recaptcha.net).

Similarly, content owners hold lots of other proprietary information that is not generally available to the public. Some of this information is locked up behind logins for subscription-based content. To give such content owners an incentive to make that content searchable, Google came up with its First Click Free concept (discussed in Chapter 6), a program that allows Google to crawl subscription-based content.

However, a lot of other content is not on the Web at all, and this is information that the search engines want to index. To access it, they can approach the content owners and work out proprietary content deals, an activity that all of the search engines pursue.

Multimedia Is Becoming Indexable

Content in images, audio, and video is currently not indexable by the search engines, but all the major engines are working on solutions to this problem. In the case of images, optical character recognition (OCR) technology has been around for decades. The main challenge in applying it to search has been that it is a relatively compute-intensive process. As computing power continues to get cheaper, this becomes less of an obstacle.
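
As a rough illustration, the Python sketch below runs OCR over an image using the open source Tesseract engine (via the pytesseract wrapper) and normalizes the output for indexing. This is just one freely available tool, not a description of what any search engine actually runs, and the file name is invented.

from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract (also requires the Tesseract binary)

def ocr_text_for_indexing(image_path):
    """Extract text from an image so it can be fed to an ordinary text index."""
    text = pytesseract.image_to_string(Image.open(image_path))
    return " ".join(text.split())   # collapse whitespace/newlines in the OCR output

# print(ocr_text_for_indexing("scanned_page.png"))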

In the meantime, creative solutions are being found. Google is already getting users to annotate images under the guise of a game, with Google Image Labeler (http://images.google.com/imagelabeler/). In this game, users agree to record labels describing what is in an image. Participants work in pairs, and every time they produce matching labels they score points, with more points awarded for more detailed labels.
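
The mechanic is easy to picture in code. The Python sketch below is a simplified guess at the matching logic: only labels that both players produce are trusted, and (in this invented scoring scheme) multiword labels earn more points. The actual game’s scoring rules are not public.

def score_round(labels_a, labels_b):
    """Keep only labels both players agreed on; award more for detailed labels."""
    matches = {l.strip().lower() for l in labels_a} & {l.strip().lower() for l in labels_b}
    points = sum(10 + 5 * label.count(" ") for label in matches)  # invented point values
    return matches, points

matches, points = score_round({"dog", "golden retriever"}, {"Golden Retriever", "park"})
print(matches, points)  # {'golden retriever'} 15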

Or consider http://recaptcha.net. This site is helping to complete the digitization of books from the Internet Archive and old editions of the New York Times. These have been partially digitized using scanning and OCR software. OCR is not a perfect technology, and in many cases the software cannot determine a word with 100% confidence. Recaptcha.net assists by using humans to figure out what these words are and feeding the answers back into the database of digitized documents.

First, Recaptcha.net takes the unresolved words and puts them into a database. These words are then fed to blogs that use the site’s CAPTCHA solution for security purposes. These are the boxes you see on blogs and account sign-up screens that ask you to type the characters displayed, such as the one shown in Figure 13-3.

Figure 13-3. Recaptcha.net CAPTCHA screen

In this example, the user is expected to type in morning. Recaptcha.net uses the human input in these CAPTCHA screens to figure out the word in the scanned book that OCR could not resolve, and it feeds that answer back to improve the quality of the digitized book.
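
A simplified sketch of that feedback loop, in Python: each challenge pairs a control word whose answer is already known with an unknown word from a scanned document, and a user’s guess for the unknown word is counted only if the control word was answered correctly. The word identifier and the three-vote acceptance threshold here are invented for illustration.

from collections import Counter

votes = {}   # word_id -> Counter of guesses from users who passed the control word

def record_answer(word_id, control_ok, guess, threshold=3):
    """Return the accepted transcription once enough users agree, else None."""
    if not control_ok:
        return None   # the user failed the known control word; discard the guess
    counts = votes.setdefault(word_id, Counter())
    counts[guess.strip().lower()] += 1
    best, n = counts.most_common(1)[0]
    return best if n >= threshold else None

for _ in range(3):
    accepted = record_answer("nyt-1905-page3-word17", True, "morning")
print(accepted)   # 'morning' -- accepted once three users agree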

Similarly, speech-to-text solutions can be applied to audio and video files to extract more data from them. This is also a relatively compute-intensive technology, so it has not yet been applied in search. But it is a solvable problem as well, and we should see search engines using it within the next decade.
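
Conceptually, the pipeline is simple, as the Python sketch below shows: transcribe the audio track, then index the transcript like any other document. The transcribe() function is a stand-in for a real speech-to-text engine and returns a canned string here just so the sketch runs; the media file name and transcript are invented.

from collections import defaultdict

def transcribe(media_path):
    """Stand-in for a real speech-to-text engine; returns a canned transcript."""
    return "welcome to our talk on search engine optimization"

inverted_index = defaultdict(set)   # term -> set of media files containing it

def index_media(media_path):
    for token in transcribe(media_path).lower().split():
        inverted_index[token].add(media_path)

index_media("keynote.mp4")
print(inverted_index["optimization"])   # {'keynote.mp4'} -- spoken words become searchable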

The business problem the search engines face is that demand for information and content in these challenging-to-index formats is increasing exponentially. Search results that fail to accurately include this type of data will begin to be deemed irrelevant or wrong.

The emergence of YouTube in late 2008 as the #2 search engine (ahead of Yahoo! and Microsoft) is a powerful warning signal. Users want this alternative type of content, and they want a lot of it. That demand will ultimately rule the day, and users will get what they want. For this reason, improved techniques for indexing such alternative content types are an urgent priority for the search engines.

Interactive content is also growing on the Web, with technologies such as Flash and AJAX leading the way. In spite of the indexing challenges they bring to search engines, these technologies continue to spread because of the experience they offer users with broadband connectivity. The search engines are hard at work on solutions to better understand the content wrapped up in these technologies as well.

Over time, our view of what is “interactive” will change drastically. Two- or three-dimensional first-person shooter games and movies will continue to morph and become increasingly interactive. Further in the future, these may become full immersion experiences, similar to the Holodeck on “Star Trek.” You can also expect to see interactive movies where the audience influences the plot with both virtual and human actors performing live. These types of advances are not the immediate concern of today’s SEO practitioner, but staying in tune with where things are headed over time can provide a valuable perspective.
