5. Building Queries

When I first started writing this book, I wanted to immediately jump in and tell you all about all the nifty stuff you can do with information trapping. I wanted to get right to the examples. I wanted to do the search engine equivalent of a quadruple axel and explain how easy it would be for you to do it, too.

However, it wouldn’t have been particularly useful for me to show you how to do the equivalent of a fancy jump if I hadn’t yet shown you how to put your skates on. Now that we’ve gone through the basic elements that are part of every information trapper’s toolbox, you know how to use your browser as an information-trapping tool, how to find and read RSS feeds, and how to monitor pages using a page monitor.

Now it’s time to learn what to trap, where to trap it, and why. This chapter presents some of the theory behind, and the practice of, setting up good queries. So go get a cool drink, sit down in front of your computer, and crack your knuckles—we’re into the fun part!

In later chapters, we’ll tackle what to do with the information you find, how to keep up with your feeds (even when you’re on the go!), and how to republish what you gather.

For now, let’s begin by thinking about the obvious question: what are you hunting?

What Do You Want to Monitor?

What are you trying to trap in the first place, and how are you going to do it? What you’re trying to trap can be boiled down to two types of information: internal and external.

Internal information is information out on the Web that mentions, references, or discusses a Web site you are interested in, whether it’s your own, your company’s, or someone else’s. When I give presentations about search engines and information trapping, I find that many people want to track interest in their own Web sites. Monitoring which other Web sites link to your own, what they are saying, or how many people are reading the RSS feed you’re publishing is extremely valuable. We’ll spend Chapter 6 looking at finding and trapping internal information.

External information is everything else. As you might imagine, the concept of external information is much larger and much more inclusive than internal information. The challenge with external information lies in building the perfect query to trap information of interest because there are so many ways to describe and express a topic, idea, or even a name. For example, say you want to monitor the news for mentions of George W. Bush (which is way too general a thing to want to monitor, but more about that in a moment). Do you monitor for "George Bush", "George W. Bush", "George Walker Bush", "GW Bush", or something else entirely?

Getting the most out of the search engine

The fact that any topic can be expressed in a huge number of different ways (and therefore via a huge number of different queries) means that you’ve got to know a little bit about how to get the most out of a search engine. Yes, there are a huge number of queries you could use to track your topic. But only a few of them will work well. In addition, you’ve got to consider the size of the data pools where you’ll be trapping. Google News has over 4,500 sources. Google Web had over eight billion pages at last count. Both of these data pools are constantly being added to and updated. You want to make sure you get a manageable number of results that completely covers your topic. That’s the goal. That’s your mantra: manageable but complete, manageable but complete.

Practicing the theory of onions

Getting too many results for general searches will always be a problem when using search engines with large pools of data (which amounts to pretty much any Web search engine, news search engine, and so on).

When developing queries, I advocate practicing “the theory of onions” to narrow a search as much as possible. This involves first developing a query for your topic that is as tight and specific as you can—condensed, like the middle of an onion. Then you run that search and see what kind of results you get. Are they manageable? Probably, if your query was very specific. Are they complete? Probably not, if your query was as specific as it should have been. Next, make your query a little more general—unwrap it like you’re moving out through the layers of an onion. Run the query again. Ask yourself the same two questions: is it manageable? Is it complete? Repeat this process until you’re getting a good number of search results without feeling like you’ve been hit by the equivalent of a data firehose.
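The onion loop can be sketched in a few lines of Python. This is only an illustration: the thresholds (10 and 200), the function name widen_until_complete, and the result counts are all made up for the example, and count_results stands in for whatever search engine you would actually check by hand.

```python
from typing import Callable, Sequence


def widen_until_complete(
    layers: Sequence[str],
    count_results: Callable[[str], int],
    min_results: int = 10,    # assumed floor for "complete enough"
    max_results: int = 200,   # assumed ceiling for "manageable"
) -> str:
    """Start at the tightest query and move outward one layer at a time.

    Stops at the first layer that returns enough results, and never
    accepts a layer that returns more than max_results.
    """
    chosen = layers[0]
    for query in layers:
        n = count_results(query)
        if n > max_results:
            break              # the firehose: keep the previous layer
        chosen = query
        if n >= min_results:
            break              # manageable and reasonably complete
    return chosen


# Hypothetical result counts, ordered from most specific to most general.
fake_counts = {
    '"flowering dogwood" anthracnose treatment georgia': 3,
    '"flowering dogwood" anthracnose': 40,
    "tree disease": 5000,
}

best = widen_until_complete(list(fake_counts), fake_counts.__getitem__)
print(best)
```

With these invented counts, the first query is too sparse, the third is a deluge, and the loop settles on the middle layer.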

Experiment on your own a little using the theory of onions and see what happens. Most people are so used to plugging a couple words into a search engine and going to town that they don’t really “get” the idea of building as specific a query as possible. If you’re having problems understanding how to start with a very specific search query, here are some possible solutions that might get you unstuck.

Your language isn’t unique enough. Every topic has words of its own: its own vocabulary. A great example of this is medical terms. If you want to monitor causes of abdominal pain, a search for "stomach ache" will get you one level of results, while searching for "peptic ulcer" or "gastroesophageal reflux disease" will get you an entirely different, more technical level of results. If you are familiar with the language of your topic, try to include it in your search queries. It will help you immensely in narrowing down your results.

You’re not using enough words. The way some people put queries into search engines, you’d think they were being charged by the letter. Why not use more words? Last time I checked, Google had a query limit of 32 words. Use the limit! Throw in all the words about your topic that occur to you.

You’re not narrowing in enough. Get specific! Don’t use tree, use dogwood. Don’t use bird, use cardinal. If you have to rework your search and get more general, that’s fine. That’s the point. You’re experimenting here. Experimenting now will pay off later in saved time.

Of course, how much you adjust your query—how much more or less specific you have to get—is going to depend a lot on the kind of resource you’re using. Let’s get a bird’s eye view of some of the different types of resources you’ll search so that you’ll know what to expect.

Types of Search Engines

In general, two types of search engines index text-based data (multimedia is different; we’ll look at multimedia trapping in Chapter 8). This holds for every kind of data that might be searched, from scientific journals to news stories to Web pages.

The first kind is the full-text search engine. Google is a good example. Full-text search engines try to index every word on every page that they come across. As you might expect, this amounts to a lot of words.

The second kind is the searchable subject index. Yahoo’s Directory and the Open Directory Project are good examples. A searchable subject index doesn’t index every word of every page, or even index every available page on the Web. Instead, it tries to list sites. And it doesn’t list all the words of a site, but rather just the name of the site, the URL, and some kind of brief description.

As you might imagine, there’s a big difference between a narrow effective query in a full-text search engine, and a narrow effective query in a searchable subject index. The difference revolves around the “pool” of data you’re searching.

Let’s look, for example, at a Web site about trees that has 100 pages with 100 words on each page. Google will attempt to index that entire site. When Google finishes, that site will be represented in Google’s index by 10,000 words of its content. Compare that to Yahoo’s Directory or the Open Directory Project, which would index the same site with only the Web site’s title and a description. In this case, the Web site about trees would be lucky to have 100 words representing its content.

Because of how much information about a site is indexed in each kind of search engine, you have to be careful about what you search for. A search for something as specific as "southeastern knotty limbed birch tree" would be perfect for Google but almost useless in Yahoo’s Directory. Meanwhile, a simple search for "birch trees" could bring you useful results at the Open Directory, but a deluge of results at Google.
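To make the difference concrete, here is a toy model in Python of how one site might be represented in each kind of index. The page text, the directory entry, and the helper names are invented for illustration; real search engines are far more sophisticated than a set of words.

```python
def words(text: str) -> set:
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


# A made-up two-page site about trees.
pages = [
    "the southeastern knotty limbed birch tree prefers damp soil",
    "birch trees and oak trees compared",
]

# A made-up directory listing: just a title and a short description.
directory_entry = "Tree Talk a guide to birch trees and other hardwoods"

# Full-text index: every word on every page.
full_text_index = set().union(*(words(p) for p in pages))

# Subject index: only the words in the listing.
directory_index = words(directory_entry)


def matches(query: str, index: set) -> bool:
    """True if every query word appears in the index (Boolean "and")."""
    return words(query) <= index


specific = "southeastern knotty limbed birch tree"
general = "birch trees"

print(matches(specific, full_text_index))   # found in the full-text index
print(matches(specific, directory_index))   # too specific for the directory
print(matches(general, directory_index))    # general query finds the listing
```

The specific query only has a chance in the full-text index, because the directory never saw most of those words.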

You’ll deal mostly with full-text search engines, so focus on generating a concentrated, narrow query.

Is there more you can do to get focused and specific in your queries besides putting in as many words as possible and using topic-specific language? Absolutely. You can take advantage of the different syntaxes the search engines offer. You can also exclude keywords. We won’t go over all the different syntax that search engines offer; instead, I’ll give you some common syntax and then show you where to look for the more esoteric, quadruple-axel stuff.

Basic Searching Syntax

How do you search for things in search engines? You type in words. But how do you tell the search engine which words you want to search for, which ones you want to avoid, and so on? You use Boolean logic. Boolean logic simply tells a search engine that you want to search for something and something else, or for something and not something else, or for one of several different words, and so on.

For example, the search query beans -rice tells the search engine, “I want to search for beans and not rice,” and that’s Boolean. The minus sign in front of rice is a Boolean operator; it tells the search engine to make sure that word doesn’t appear in any of the results. If you’ve been using search engines for any length of time, you’re familiar with using minus signs to exclude words and plus signs to include them. In fact, you may have been using Boolean logic all along without knowing it!

The first thing you need to know about a search engine is its Boolean default. The Boolean default is the Boolean operator that the search engine will use if you enter your query with no Boolean operators (for example: beans rice cheese lettuce tomato). Will it search for all the words? If so, then it’s using the Boolean “and.” Will it search for any of the words? (That is, will it return as a result any Web page that has at least one of your query words in it?) If so, it’s using the Boolean “or.”

If you’re trying to get your results as focused as possible, “and” is far better than “or,” for the obvious reason that it will give you fewer and more relevant results. Unsurprisingly, most search engines nowadays default to the Boolean “and.”
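A tiny Python sketch shows why the default matters. The three "pages" and the search function below are invented for the example; the point is only that the same query words return fewer results under “and” than under “or.”

```python
# Three made-up pages, keyed by a short name.
docs = {
    "burrito": "beans rice cheese lettuce tomato",
    "chili": "beans tomato onion",
    "salad": "lettuce tomato cucumber",
}


def search(query: str, pages: dict, default: str = "and") -> list:
    """Return the names of pages matching the query under the given default."""
    terms = query.lower().split()
    combine = all if default == "and" else any
    return [
        name
        for name, text in pages.items()
        if combine(term in text.split() for term in terms)
    ]


and_hits = search("beans tomato", docs, default="and")
or_hits = search("beans tomato", docs, default="or")

print(and_hits)  # only pages containing both words
print(or_hits)   # pages containing either word
```

Under “and,” the salad page drops out because it has no beans; under “or,” every page with a tomato comes along for the ride.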

Experiment a little by doing the following: Begin a query by entering into a search engine a bunch of relevant words you think might narrow your search results. Since most search engines default to “and,” you don’t need to add anything to the query words to make sure they’re required in the results. If you want to make sure a word does not appear in the results, however, put a minus sign (-) in front of it. And to make sure that words appear only as a phrase, group them with quotes.

Take for example the following query: "cardiac arrest" treatment prevention -"heart attack". This query would return pages that contain the phrase "cardiac arrest" and the words treatment and prevention, but would exclude any pages that also contain the phrase "heart attack".
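If you build queries like this often, the three rules (plain words, quoted phrases, minus signs for exclusions) are easy to capture in a small helper. The following Python sketch uses a hypothetical function name, build_query, and simply assembles the query string; it does not talk to any search engine.

```python
def build_query(phrases=(), words=(), exclude=()):
    """Assemble a query string.

    phrases: terms wrapped in quotes so they match only as a phrase
    words:   plain words, required under a Boolean "and" default
    exclude: terms prefixed with a minus sign (quoted if multi-word)
    """
    parts = [f'"{p}"' for p in phrases]
    parts += list(words)
    parts += ["-" + (f'"{x}"' if " " in x else x) for x in exclude]
    return " ".join(parts)


query = build_query(
    phrases=["cardiac arrest"],
    words=["treatment", "prevention"],
    exclude=["heart attack"],
)
print(query)
```

The output is the same cardiac arrest query, ready to paste into a search box.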


Tip

Be sure to use great caution when excluding words! The example above was designed to find medical professional-level results—in other words, Web pages whose audience of medical professionals would refer to “cardiac arrest” instead of a “heart attack.” If you run this same search, you’ll find that the results are from sources like professional medical journals, the American Heart Association, hospitals, and so forth. It’s high-level information—just removing one phrase wiped out a whole level of results. Very powerful, but also very damaging if you use it inappropriately.


Special Searching Syntax

When you enter a plain query word in a search box, the search engine looks for that word anywhere on a page. It will look in the title of the page or in the body of the page. One way to reduce your number of results further is to use special syntax that limits where the search engine will look for a word. Search engines vary in the kinds of special syntax they offer, but in this section you’ll find some almost-universal syntax you can use in your searches. In addition, look in the search engine’s Help documents for information on the right way to express the syntax.

The title syntax

To use the special title syntax on Google and Yahoo, you enter intitle:keyword. This syntax restricts your searching to the title of a Web page. It’s useful when you don’t have many words by which to limit your topic. Instead, you’re limiting where your word might be found. When your word is found in the title of a Web page, the page usually contains a lot of information about that keyword.

Be sure to try to get as specific as possible even when you’re searching titles—a page that has your very specific query word in the title will probably get you a jackpot of information. On the other hand, you may find you need to get more general before you start getting many results. Don’t use too many query words with this syntax.

The URL syntax

To conduct a special URL syntax search on Yahoo and Google, you must enter inurl:keyword. This syntax searches only the URLs of pages for the keyword you specify.

But be careful. We are definitely in quadruple-axel territory here. Since there’s no guarantee that people will use a full word in a Web page name, it’s tough to use this syntax. I tend to use it mostly when I’m searching blog entries because many blogs put the full title of their entry in their URL.

The site syntax

To conduct a site syntax search on Yahoo and Google, enter site: followed by the domain you want (for example, site:edu or site:cnn.com). This syntax restricts your search to either a top-level domain (.org, .edu, .com, .uk, etc.) or to a single domain (like CNN.com). Why would you want to do this? Many domains tend to have their own “flavor” of information (with the exception of the .com domain).

For example, material on .edu domains, especially if you’re searching for scientific/professional information, tends to be more academic (though occasionally student pages can also be found there). Information on .org domains can slant toward the nonprofit (though this is less true than it used to be). If you need to search within a single domain, check and see if that domain offers its own search or alert service before you go to the services offered by a search engine. Remember, there’s no guarantee that a search engine will index every single page on a site.


Warning

Be aware that any time you exclude the .com domain from a search, you’re excluding a lot of pages. Try your search without using site syntax and then try it again using site syntax to see what information you’d be missing.


Other useful syntax

Title, URL, and site syntax are common to a lot of search engines, whether they’re for news, Web searches, or RSS feeds. That’s because these elements are common to all Web pages. But many search engines offer additional syntax that can help your searching. Keep an eye out for the following, which you can use to narrow down your searches.

Location

Some search engines, especially news search engines, offer you a way to narrow your searches by the location of the source. Google News does this, for example. To use the location syntax, enter location: and the two-letter postal code of the state you want to search. For example, location:ca will find news from sources in California. This syntax also works with the name of a country (try location:ireland). General search engines sometimes allow you to narrow your search by country or region of the world, but less often by state or city.

Page size

Sometimes you can narrow down your search results by page size. This lets you skip pages that contain only a minimal amount of information, but it also means you might miss useful pages.

Using the Advanced Search Form

Every time you use a search engine, without fail, look for the advanced search form. Most search engines have a basic form with a query box and maybe a couple of options. Until you visit the advanced search form (there’s usually a link next to the simple search) you’ll have little idea of what that search engine is capable of. For instance, compare Google’s front page (Figure 5.1) to its advanced search page (Figure 5.2). The front page has a simple query box, while the advanced page has several query boxes for providing all kinds of search information, and even a few pull-down menus!

Figure 5.1. Google’s front page is very simple.


Figure 5.2. Google’s advanced search page can be a bit overwhelming. See what I mean?


There are thousands and thousands of search engines and interfaces available, but they all have some basic guidelines in common. Look at the advanced search. Look at the help files. For each syntax you use, ask yourself: Is this going to narrow my results? Is it going to get manageable results while keeping them comprehensive? Experiment with them.

And my biggest recommendation for using syntax is this: combine syntax when you can. Searching just in a title or just in one set of domains is powerful enough, but when you do both of those things together you can really zoom in on what you want to find.
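As a rough illustration of combining syntax, this Python sketch glues keywords together with the intitle:, inurl:, and site: operators described earlier. The helper name with_syntax is hypothetical, and not every search engine supports every operator, so check the help files before relying on the result.

```python
def with_syntax(keywords, title=None, url=None, site=None):
    """Combine plain keywords with optional special-syntax restrictions."""
    parts = list(keywords)
    if title:
        parts.append(f"intitle:{title}")   # word must appear in the page title
    if url:
        parts.append(f"inurl:{url}")       # word must appear in the page URL
    if site:
        parts.append(f"site:{site}")       # restrict to a domain or top-level domain
    return " ".join(parts)


combined = with_syntax(["dogwood", "anthracnose"], title="dogwood", site="edu")
print(combined)
```

Here the title restriction and the .edu restriction work together: only academic pages with dogwood in the title are candidates.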

The suggestions and the techniques I’ve gone over thus far are very useful for all kinds of data collections, be they Web pages, news search engines, article collections, sets of RSS feeds, or what have you. However, they are not useful for two other kinds of information collections—tags and conversations. For those you’ll need to use a different approach.

Tags and Conversations

What are tags and conversations? Conversations, of course, are discussions that might happen on mailing lists or on public forums. You can find them via general search engines, but there are also many specialty search engines that index only conversations.

Tags you might not know about. A tag is a keyword that someone can use to describe a resource in a directory. Usually a search engine or directory that indexes tags has many people “tagging” resources at the same time. The index of words used to describe the contents of a directory built that way is called a folksonomy, a taxonomy developed by a group of people. Tags are not full descriptions or site titles—usually they’re just a word or two.


Tip

You can learn a lot more about tagging and folksonomies at Wikipedia, including an overview on folksonomy at en.wikipedia.org/wiki/Folksonomy.


You will need to change your strategy when developing monitors for tags and conversations. Why? The big answer is language.

In the case of tags, you’re not searching for summaries or even fully articulated concepts. Instead you’re searching for a word or two words. In the case of conversations, you’re looking for much more informal language, such as sentence fragments and scraps of conversations.

These kinds of data pools are very different from a structured, organized Web page. And for that reason you’ll have to use different strategies for monitoring them. Later in this book, we look at where you can go to search and monitor tags and conversations, but here we’re going to stick with the idea of queries and how to create them.

For now, let’s take a look at searching with tags.

Searching within tags

The theory of onions is extremely important when searching huge data sets. Getting very narrow and specific is paramount. With tag searching, you have less to search. Tags are usually only a word or two. You don’t have to abandon the onion completely, but try starting a little more general than you would normally. Say you had four levels for describing a bird:

1. Bird

2. Raptor

3. Hawk

4. Red-shouldered hawk

Level 1 is going to be too general no matter what you search. Level 4 would be great for Web search or a news search, but might be too specific for a tag search. When searching tags, try to stay at Level 2 or 3. Try one-word, or at most two-word, searches. Don’t get too extensive or complicated.

Searching within tags can be boiled down to basically two ideas: simpler, more general. Giving instruction about searching within conversations is a little more complex.

Searching within conversations

When you think about searching within a conversation, whether it takes place on a mailing list or an online forum, try to think about how you’d talk about your topic. Create queries that reflect how you’d discuss the topic out loud.

If you’re focusing on conversations that professionals are having, try to use their vocabulary. (If you’re monitoring mailing lists of medical professionals, try to use medical terminology.) When you’re monitoring for technical information, you can get a little geekier and use model numbers or version numbers. Read through some of the types of conversations you want to track so that you can get a sense of the kind of language that’s being used. Remember the example earlier in the chapter where simply searching for the phrase "cardiac arrest" and removing the phrase "heart attack" changed the results so dramatically?

And take advantage of the advanced search forms. Tag searches don’t have a lot of advanced options, but many conversation search engines do. Use the available special syntax to narrow your results.


Note

We’re just scratching the surface of tags and conversations with a few words to get you thinking about how to generate queries when approaching these resources. We’ll get into some serious digging later on.


The next few chapters examine the “where” of information trapping.
