Searching the Network

Leaving aside the communication-oriented p2p implementations, probably the next most common reason for running a p2p network is distributing and sharing resources. A core function of such a network is the ability to search for specific resources, distributed services, or, for that matter, people.

For the purposes of this discussion of principles, it’s convenient to restrict the focus to finding content, as in the common file-sharing applications. Although the related issues of storing and retrieving content are mentioned in this context, the Content Management section looks specifically at those aspects.

The seemingly trivial issue of finding content is actually quite complex in practical implementation. Different networks—and for that matter, different clients—have their own approaches to this issue, depending on the architecture and on the surprisingly seldom discussed philosophy behind how to search for something.

Bit 3.2 Atomistic networks can be characterized as “search and discovery”.

How search is (or can be) implemented depends to a great extent on how the p2p client designers view content storage and its acquisition through simple queries.


Atomistic Search

We look first at implementations that use atomistic content storage. The simple image most users have of an atomistic network (such as Gnutella, whose implementation details are discussed in Chapter 7, although it serves as the example here) is that the search function simply looks through the file names in its database, reporting back anything that matches the text string in the query.

Because the atomistic model has no centralized content database like a Web search engine or, say, Napster, all nodes within query range must act individually. The query horizon means that any search is by necessity incomplete. Each responding node must search its own locally stored content (for example, a list of files). And because the search is not coordinated, the collected unfiltered result is likely to contain many duplicates when popular files are stored by many nodes.
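The uncoordinated nature of atomistic search can be illustrated with a minimal sketch. The node names, file lists, and matching rule below are invented for demonstration; the point is only that each node answers from its own local list, so the merged raw result repeats popular files:

```python
# Hypothetical sketch of atomistic search: each node matches the query
# against its own local file list, and the raw merged result from the
# flooded query naturally contains duplicates for popular files.

def local_search(query, shared_files):
    """Return local file names containing the query string (case-insensitive)."""
    q = query.lower()
    return [name for name in shared_files if q in name.lower()]

# Three independent nodes, each with its own shared-file list.
nodes = {
    "node_a": ["holiday_photos.zip", "popular_song.mp3"],
    "node_b": ["popular_song.mp3", "lecture_notes.txt"],
    "node_c": ["Popular_Song.mp3"],  # same file, different capitalization
}

# Flood the query to every node within the horizon and merge raw results.
hits = []
for node, files in nodes.items():
    for match in local_search("song", files):
        hits.append((node, match))

print(hits)  # the same song is reported by all three nodes
```

Nothing in this model deduplicates or ranks the results; any filtering is left to the querying client.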

Furthermore, the search is at the mercy of contributing users, who can apply arbitrary name variants (and even misspellings) to the same file, or the same name to different files. It’s also possible for a responding node to disconnect after providing a hit result, making its files inaccessible for the rest of the session.

However, a number of hidden assumptions in this filename-search image aren’t necessarily true; they stem from our preconceptions about how centralized or atomistic search is implemented and how content is stored and managed.

Question and Answer

Many p2p technologies specify only the query and result formats at the protocol level. Exactly how the search is implemented in a particular client is completely up to the developer of the software, and so is the content of the response.

The situation is much like asking a person a question. Spoken or written languages have public syntax rules to enable us to understand the communication, but the process of deriving an answer is completely internal and hidden from public inspection. It should therefore not be a surprise that the same question can return different responses depending on who you ask and in what context.

Search on atomistic content functions much the same. Different clients can process the query in different ways. Some look only at the superficial names of the local shared files, perhaps even requiring strict matches. Others might look at a larger context when determining a match—for instance, parent folder name, fuzzy-logic matches, associated descriptive texts, even a detailed inspection of file content.
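The contrast between a strict matcher and a more context-aware one can be sketched as follows. The helper names, file path, and similarity cutoff are invented for illustration; the fuzzy variant here uses simple string similarity and the parent folder name as its "larger context":

```python
# Sketch of how two clients can answer the same query differently:
# one requires an exact file-name match, the other also considers the
# parent folder and tolerates close misspellings.

import difflib
import os

def strict_match(query, path):
    # Hit only when the query equals the file name (minus extension).
    name, _ = os.path.splitext(os.path.basename(path))
    return query.lower() == name.lower()

def fuzzy_match(query, path, cutoff=0.7):
    # Hit when the query sufficiently resembles the file name
    # or its parent folder name.
    name, _ = os.path.splitext(os.path.basename(path))
    folder = os.path.basename(os.path.dirname(path))
    for candidate in (name, folder):
        ratio = difflib.SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        if ratio >= cutoff:
            return True
    return False

path = "shared/blues_guitar/slow_bluez.mp3"
print(strict_match("slow blues", path))  # False: exact match only
print(fuzzy_match("slow blues", path))   # True: tolerates the misspelling
```

Both clients are protocol-conformant; the query instigator has no way of knowing which matching policy produced a given hit.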

Bit 3.3 Atomistic search returns arbitrary, source-determined relevancy.

Unfortunately, as the instigator of a search, you have little way of determining the quality or completeness of the different results that eventually return. A response can even be bogus, intentionally misleading in order to entice you to download malicious code.


Compare this inexactness with centralized search engines, where we can be fairly confident that our query will be processed the same way every time: no surprises, and reproducible. But is this always desirable? Are we not then dependent on one search algorithm and its possible failings or unsuitability for our purposes? To be sure, the big search engines provide “advanced” syntax options to fine-tune our query, and we can move on to another engine (and another database) to try a different approach. Meta-search engines can even query a host of different machines, yet the results are often frustratingly dependent on our ability to frame the right match criteria; we have to guess sufficiently close to the format of the result we’re looking for. And for the most part, we can deal effectively only with text, with little intelligence in the way results are sorted, filtered, and ranked.

Not Just Files

Interestingly, there is no requirement in p2p protocols such as Gnutella that queries must have anything to do with file names or any static content at all. Broadening our perspectives for a moment, it’s sufficient that a query can be intelligently parsed in some client-relevant context to return a result.

Why not, for example, have a mix of many different kinds of clients on the same network? Some could implement Gnutella’s original intent of collecting recipes. A query for “nutella spread” could then return actual recipes (or pointers to recipe files, or URIs to web resources with such recipes). On the other hand, a query formulated as a mathematical equation might find a client somewhere that can make sense of it and return a calculated answer. Another client might parse it differently and return a pointer to a list of applicable theorems. Some clients might be entirely image oriented. Yet others might provide dictionary entries, translation services, or encyclopedia-style articles based on the query words. Another valid response could be a list of people who have expressed an interest in subjects suggested by the query.

The client-based process works because of the way the query protocol is defined. Clients who can’t parse a particular request do nothing, except pass it on.

Bit 3.4 Atomistic search entities respond only to those queries they can process into a relevant “hit”.

Again, the responding software decides what constitutes a hit. Client software can be designed any way at all as long as it understands and responds to protocol messages.
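This pass-it-on behavior can be sketched with a toy dispatch model. The handler functions, the placeholder recipe text, and the character whitelist are all invented for illustration; each client either produces a hit or returns nothing and lets the query travel on:

```python
# Toy sketch of the pass-it-on rule: every client applies its own
# parser to the query; a client that cannot make sense of it returns
# None and simply forwards the query unanswered.

def calculator_client(query):
    """Answer only queries that parse as simple arithmetic."""
    try:
        # Restrict to digits and operators so eval is safe here.
        if not query or not all(c in "0123456789+-*/(). " for c in query):
            return None
        return str(eval(query))
    except (SyntaxError, ZeroDivisionError):
        return None

def recipe_client(query):
    """Answer only queries matching a known recipe name."""
    recipes = {"nutella spread": "placeholder recipe text"}
    return recipes.get(query.lower())

def flood(query, clients):
    # Every node sees the query; only those producing a hit respond.
    responses = (client(query) for client in clients)
    return [hit for hit in responses if hit is not None]

clients = [calculator_client, recipe_client]
print(flood("2 + 3 * 4", clients))      # ['14']: only the calculator answers
print(flood("nutella spread", clients))  # only the recipe client answers
```

The query format is the only shared contract; what constitutes a parseable query, and a hit, is private to each client.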


So, what might initially be seen as a grave fault in the p2p network, an imperfection in search, can in some situations turn out to be an unexpectedly powerful feature. New functionality can be deployed to respond to specific queries, with no change to the existing network protocol, even when these extensions were not anticipated in the original protocols and clients. As with HTML-rendering browsers, the clients that understand a new feature will respond; those that don’t simply ignore it.

More Innovation Needed

So far, however, we see little such innovation in the field. To a large extent, the majority of users are interested in swapping files, mostly music and video.

Most developers, and consequently most clients, focus myopically on simple file searches and expect to switch to download mode when the user selects a hit.

It’s important to realize that file download is not the only response and that the end result could just as well be from a network-located service.

Bit 3.5 Peer networks can provide innovative services to simple queries.

Clients and users should be open to new kinds of queries and new types of responses to network “search”, which may require sophisticated filtering of received results.


A prototype for nontraditional search responses was demonstrated in June 2000 by InfraSearch (www.infrasearch.com) by programmers Gene Kan, Yaroslav Faybishenko, Spencer Kimball, and Tracy Scott. The prototype was a Web-accessed search engine—in reality just a front-end server to a small local network of Gnutella-aware clients, each specialized for a particular context. The proof-of-concept experiment included an image database, stock quotes through a proxy, an archive of news headlines, and a simple calculator. Each responded to whatever query made sense according to its parsing rules. The front-end process received the node results and presented a compilation to the querying user.

One vision for the InfraSearch project was to redefine the paradigm of Web search, so that results would be based on actual site content at search time, instead of on an out-of-date database laboriously compiled by various Web-indexing robots. Some saw this innovation as heralding yet another shake-out (YASO) in the dot-com field, this time for the likes of Yahoo! and traditional search engine sites in general.

InfraSearch (the company) was subsequently acquired by Sun Microsystems in March 2001, clearly with the aim of strengthening relevant aspects of its Project JXTA p2p effort. JXTA is described fully in Chapter 10. Further information about its design and software (known as JXTA Search and presented as Distributed Search for Distributed Networks) is found at the search.jxta.org site. The design centers on an open XML protocol called the Query Routing Protocol (QRP).

The need for new search methods has been highlighted by studies of the amount of data stored in distributed systems, of which the Web is but one example. Consider that “good” search engines index (and incompletely, at that) perhaps at most half of the openly accessible billion or so static Web pages. This indexed content forms a partial and out-of-date database for the search algorithms applied. Additionally, a large and unknown number of corresponding dynamic pages can’t be indexed at all. Such pages are the so-called “deep content” from site databases, which is estimated to be another order of magnitude larger, ten billion or more pages.

Furthermore, it was estimated in a study in early 2001 by The Industry Standard (issue archive at www.thestandard.com) that distributed networks held over 550 billion content documents (not “pages”). This incredible mass of mostly unindexed data can safely be assumed to be growing rapidly with increased use of p2p networks in general, and distributed content management within business in particular.

It’s also becoming generally recognized that a much wider array of client devices needs to access distributed content, not just the traditional PC with Web browser. This range of devices includes roving laptops, handheld PDAs, mobile phones, public kiosk terminals, and likely also devices that we can’t at the moment foresee. Such clients will not only be search consumers but in all probability also increasingly function as content providers and local search agents. This capability can be realized only with some model of distributed network and a whole new range of so-called “deep search” methods with highly directed queries. It’s likely, too, that we might need more search dimensions or structuring protocols to keep from being flooded by returns that are irrelevant to a particular purpose. For example, a small PDA should be able to frame the request, or filter the returns, so that only small-screen text results are shown.

File Storage Model Affects Search

Other p2p networks, atomistic but with distributed storage, provide a counterpoint to the fully atomistic model. Mojo Nation and Freenet, for instance, force a unified search approach. This is not because they represent a less atomistic p2p model as such, but because of the way their distributed storage is designed. The focus is squarely on files as content, published to the network, and inaccessible except through the unique identity code assigned by the publishing agents.

Searches are still spread among many nodes, but these nodes act as collective agents for a unified search algorithm that must produce a valid file identity. Client software must conform to more detailed specifications to function on the network. Although likely to be more efficient than the simplistic-but-open, protocol-based model discussed earlier, such networks are also more rigidly bound to some initial vision of purpose. They are also bound to a single method of searching and presenting hits, which might not be equally valid for all users or types of content.
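The identity-keyed access that such networks enforce can be sketched as follows. This is a loose illustration, not the actual Freenet or Mojo Nation key scheme; the single dictionary stands in for storage spread across many nodes:

```python
# Hedged sketch of identity-keyed publishing: a file is reachable only
# via the key derived from its content at publish time, never via
# free-text search over names.

import hashlib

store = {}  # stands in for storage spread across many nodes

def publish(data: bytes) -> str:
    """Insert content and return the only handle it can be fetched by."""
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def retrieve(key: str):
    # No name-based lookup exists; an exact key is required.
    return store.get(key)

key = publish(b"some published document")
assert retrieve(key) == b"some published document"
print(key[:16], "...")  # the identity code a search must ultimately yield
```

Any search layer on such a network must therefore resolve a query into one of these exact identity codes before retrieval can even begin.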

Bit 3.6 Storage model decisions can restrict innovative functionality.

For example, an explicit reliance on externalized file metadata (for instance, user-supplied descriptions) can make context or functionality-based responses impossible, or at least hard to implement without redesigning the network.


A chosen search model invariably represents a trade-off between the open-ended approach on one hand and constraint issues on the other—requirements on efficient (or just fast) search, good content availability, secure (especially encrypted) and redundant storage, publishing as opposed to sharing, and possibly anonymity of source. The decision thus boils down to which philosophy influences the design and working model of the implementation. If the Gnutella focus can most easily be stated as sharing content, as files in a search-and-discovery kind of network, then other solutions are primarily about sharing bandwidth, storage, and services.

But content might be stored in a distributed manner. And more importantly, search might be implemented as a distributed service, shared among many nodes for content not necessarily stored on just the local system.

Distributed Search

While simple content-sharing systems are based on content stored locally, and thus also searched locally, more sophisticated content-publishing systems decouple content from the individual nodes, at least to some extent.

Publishing systems make sharing an explicit act of “uploading” a file to the network as a whole. The upload may be either a physical submission of the entire file and its description, or less commonly, a submission of only the descriptor metadata. The point is that the shared content list is kept separate from any particular node and is often managed by a distributed tracking system.

The content itself is usually split up and distributed or replicated across many nodes in some hash-determined way. Its availability is therefore vastly better and independent of the online status of individual nodes. Searching in these situations is quite different from the fully atomistic model, because queries in distributed searches are directed to specific search services that can respond with pointers to current storage locations.
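Hash-determined placement can be sketched as follows. The node list, chunk size, and placement rule are invented for demonstration (real systems use far larger chunks and more sophisticated schemes); the point is that a chunk's locations follow deterministically from its content:

```python
# Illustrative sketch of hash-determined placement: content is split
# into fixed-size chunks, and each chunk is assigned to nodes by
# hashing it, so replica locations follow from the data itself rather
# than from where the publisher happens to be online.

import hashlib

NODES = ["node_0", "node_1", "node_2", "node_3"]
CHUNK_SIZE = 8  # unrealistically small, for demonstration

def place(chunk: bytes, replicas: int = 2):
    """Pick `replicas` adjacent nodes, starting from the chunk's hash."""
    start = int.from_bytes(hashlib.sha256(chunk).digest()[:4], "big") % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replicas)]

def distribute(data: bytes):
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    return {index: place(chunk) for index, chunk in enumerate(chunks)}

layout = distribute(b"a larger file split across the network")
for index, holders in layout.items():
    print(index, holders)  # each chunk is held by two nodes
```

Because placement is deterministic, a search service holding only the content key can compute or look up where every chunk currently lives, which is exactly the pointer-style response described above.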

Bit 3.7 Distributed search and storage promote content availability.

Performance is usually better as well, although other design considerations might make a distributed storage network initially slower to respond.


The previous discussions refer to how the chosen storage model affects the search strategies and their performance, but in a general way. In the following, relevant storage models are examined and explained in more detail.
