CHAPTER 31

WEB MONITORING AND CONTENT FILTERING

Steven Lovaas

31.1 INTRODUCTION

31.2 SOME TERMINOLOGY

31.3 MOTIVATION

31.3.1 Prevention of Dissent

31.3.2 Protection of Children

31.3.3 Supporting Organizational Human Resources Policy

31.3.4 Enforcement of Laws

31.4 GENERAL TECHNIQUES

31.4.1 Matching the Request

31.4.2 Matching the Host

31.4.3 Matching the Domain

31.4.4 Matching the Content

31.5 IMPLEMENTATION

31.5.1 Manual “Bad URL” Lists

31.5.2 Third-Party Block Lists

31.6 ENFORCEMENT

31.6.1 Proxies

31.6.2 Firewalls

31.6.3 Parental Tools

31.7 VULNERABILITIES

31.7.1 Spoofing

31.7.2 Tunneling

31.7.3 Encryption

31.7.4 Anonymity

31.7.5 Translation Sites

31.7.6 Caching Services

31.8 THE FUTURE

31.9 SUMMARY

31.10 FURTHER READING

31.11 NOTES

31.1 INTRODUCTION.

The Internet has been called a cesspool, sometimes in reference to the number of virus-infected and hacker-controlled machines, but more often in reference to the amount of objectionable content available at a click of the mouse. This chapter deals with efforts to monitor and control access to some of this content. Applications that perform this kind of activity are controversial: Privacy and free-speech advocates regularly refer to “censorware,” while the writers of such software tend to use the term “content filtering.” This chapter uses “content filtering,” without meaning to take a side in the argument by so doing. For more on the policy and legal issues surrounding Web monitoring and content filtering, see Chapters 48 and 72 in this Handbook.

This chapter briefly discusses the possible motivations leading to the decision to filter content, without debating the legitimacy of these motives. Given the variety of good and bad reasons to monitor and filter Web content, this chapter reviews the various techniques used in filtering, as well as some ways in which monitoring and filtering can be defeated.

31.2 SOME TERMINOLOGY.

Proxy—a computer that intercedes in a communication on behalf of a client. The proxy receives the client request and then regenerates the request with the proxy's own address as the source address. Thus, the server only sees identifying information from the proxy, and the client's identity remains hidden (at least from a network addressing point of view). Proxies are widely used by organizations to control the outward flow of traffic (mostly Web traffic) and to protect users from direct connection with potentially damaging Web sites.

Anonymizing proxy—a proxy that allows users to hide their Web activity. Typically such a proxy is located outside organizational boundaries and is often used to get around filtering rules.

Privacy-enhancing technologies (PET)—a class of technologies that helps users keep their network usage private. These include encryption, anonymizing proxies, mixing networks, and onion routing.

Encryption—reversible garbling of text (plaintext) into random-looking data (ciphertext) that cannot be reversed without using secret information (keys). See Chapter 7 in this Handbook for details of cryptography.

Mixing networks—users encrypt their outbound messages with their recipient's (or recipients') public key(s) and then also with the public key of a “Mix server.” The encrypted message is sent to the Mix server, which acts as a message broker and decrypts the cryptogram to determine the true recipient's address and to forward the message.1

Onion routing—an anonymous and secure routing protocol developed by the Naval Research Laboratory, Center for High Assurance Computer Systems in the 1990s. Messages are sent through a network (cloud) of onion routers that communicate with each other using public key cryptography. Once a temporary circuit of such routers has been established, the sender encrypts the outbound message in layers, using the public key of each onion router in the circuit.

Specifically, the architecture provides for bi-directional communication even though no one but the initiator's proxy server knows anything but previous and next hops in the communication chain. This implies that neither the respondent nor his proxy server nor any external observer need know the identity of the initiator or his proxy server.2

A third-generation implementation of onion routing is Tor (the onion router).3

31.3 MOTIVATION.

As a general concept, an individual or group chooses to monitor network activity and filter content through an authority relationship. The most common relationships leading to monitoring and filtering are:

  • Governments controlling their citizenry
  • Parents and schools protecting their children
  • Organizations enforcing their policies

31.3.1 Prevention of Dissent.

Probably the most common reason for the opposition to content filtering is that many repressive governments filter content to prevent dissent. If citizens (or subjects) cannot access information that either reflects badly on the government or questions the country's philosophical or religious doctrines, then the citizens probably will not realize the degree to which they are being repressed.

In countries such as the United States, where the Constitution guarantees freedoms of speech and the press, many people feel that only those with something to hide would advocate censorship. This notion, combined with objectionable content so readily available on the Internet, may be why efforts to question the use of filtering software in libraries and other public places have met with resistance or indifference. Chapter 72 gives many examples of countries using filtering products to limit the rights of their citizens. These examples should be a cautionary tale to readers in countries that are currently more liberal in information policy.

31.3.2 Protection of Children.

For the same reasons that convenience stores, at least partially, hide adult magazines from view, the government has required that some classes of information on the Internet be off-limits to children in public schools. This stems from the perception of the schools' role as a surrogate parent and government liability involved in possible failure in that role. A variety of regulations have required the use of content monitoring and filtering in schools and libraries, on the theory that such public or publicly supported terminals should not be using taxpayers' money to provide objectionable content to children. The U.S. Supreme Court has ruled several of these efforts unconstitutional after extreme protest from libraries. Most recently, the Child Internet Protection Act (CIPA)4 required school districts that receive certain kinds of government funding to use filtering technologies. Most schools have implemented Web filtering, aggressively limiting the Web content available to students. Estimates of the amount of blocked content vary widely, but one school district claims to filter about 10 percent of its total Web traffic due to questionable content.5

The obvious target of filtering technology in schools is to keep students from viewing material considered harmful to minors. Another less-publicized reason to filter Web content is to prevent students from doing things on the Web that they might not do if they knew that someone was watching. The hope is to keep students from getting a criminal record before they even graduate from high school, protecting them from their bad judgment while they are learning to develop good judgment and learning the rules of society. Arguably, this is the job of a parent and not the job of a school, but schools providing Internet access nonetheless must provide this kind of protection.

High school students and kindergarteners generally have differing levels of maturity, and a reasonable filtering policy would include a flexible, gradual degree of filtering depending on age. Schools could use less-intrusive methods of controlling access to Web content as well, such as direct supervision of students using computers. Parents also have the opportunity to filter the content that their children view at home.

31.3.3 Supporting Organizational Human Resources Policy.

Organizations have a variety of reasons for monitoring and filtering the Web content accessed by their employees. The simplest is a desire to keep employees doing work-related activity while at work. Despite studies indicating that the flexibility to conduct limited personal business, such as banking or e-mail, from work produces happier, more productive workers, managers sometimes view personal use of business computers as stealing time from the company.

A more pragmatic reason to filter some Web content is to prevent “hostile workplace” liability under Equal Employment Opportunity laws. The problem with this approach is that many kinds of content would potentially be offensive to coworkers, so it is difficult to use filtering technology to guarantee that no one will be offended while still allowing reasonable work-related use of the Internet.

Some organizations choose to monitor Web traffic with automated systems, rather than blocking anything, and to notify users of the monitoring. The notification could be in general policy documents or in the form of a pop-up window announcing that the user is about to view content that might violate policy. Either way, the notion is that the organization might avoid liability by having warned the user, but might also avoid privacy and freedom-of-speech complaints. The monitor-and-notify approach also sends the message that the organization trusts its employees but wants to maintain a positive and productive work environment.

31.3.4 Enforcement of Laws.

Law enforcement agencies rarely engage in content filtering, but traffic monitoring is an often-utilized tool for investigating computer crime. Investigation involves catching the traffic as it actually arrives at its destination, so filtering would be counterproductive. In this case, proving identity is the key to getting usable evidence, so privacy-enhancing technologies are real problems. In some cases, the logs of Web proxies—which would identify the real source address of the client machine—are available with a subpoena. Investigations of this sort often include child pornography, drug production, or—increasingly—the theft of computer hardware protected by asset-recovery software.

31.4 GENERAL TECHNIQUES.

Filtering methods for Web browsing can focus on targets, sources, the general class of address, and traffic content.

31.4.1 Matching the Request.

The simplest technique used by filtering technologies is the matching of strings against lists of keywords. Every Web request uses a uniform resource locator (URL) in the general form:

protocol://server.organization.top-level-domain/path-to-file.file-format

Filtering a request for a URL can examine any portion of the string for a match against prohibited strings, as illustrated in the sketch that follows this list:

  • Filters can match the protocol field to enforce policies about the use of encrypted Web traffic (HTTP versus HTTPS). This is more of a general security concern than it is a Web filtering issue, although an organization worried about the need to analyze all traffic could prevent the use of encrypted Web traffic by blocking all HTTPS requests.
  • The server and organization fields describe who is hosting the content. Filtering based on these strings is a broad approach, as it leads to blocking either an entire Web server or an entire organization's Web traffic. See the discussion of server blocking in Section 31.4.2.
  • The top-level-domain field can be used to filter content; see Section 31.4.3 for attempts to set up an .xxx domain.
  • The path-to-file field includes the actual title of the requested Web page, so it varies most between individual requests. Whether it is more likely than other fields to contain information useful in filtering depends on the naming convention of the server. This field (and the file-format field) is optional, as shown in the request for www.wiley.com, which directs the server to display its default page.
  • The file-format field tells the Web browser how to handle the text displayed in the page, whether in straight html format or encoded in some other file format (e.g., doc, pdf), or whether to allow dynamic generation of content (e.g., asp). Few filtering products use this field to filter traditional kinds of objectionable content, although enforcing other policies regarding dynamic code in high-security environments can require matching this field.
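
To make the field-by-field approach concrete, here is a minimal sketch in Python, using entirely hypothetical block lists and host names, that splits a URL into the fields described above and checks each one against a prohibited-string list. A commercial filter would rely on far larger databases and more careful normalization.

    from urllib.parse import urlparse

    # Hypothetical policy lists; a real product would use a much larger database.
    BLOCKED_PROTOCOLS = {"https"}          # e.g., a site that must inspect all traffic
    BLOCKED_HOSTS = {"www.example-bad.com"}
    BLOCKED_TLDS = {"xxx"}
    BLOCKED_PATH_KEYWORDS = {"casino", "warez"}

    def url_matches_policy(url: str) -> bool:
        """Return True if any field of the URL matches a prohibited string."""
        parts = urlparse(url)
        host = (parts.hostname or "").lower()
        tld = host.rsplit(".", 1)[-1] if "." in host else ""
        path = parts.path.lower()

        if parts.scheme in BLOCKED_PROTOCOLS:
            return True
        if host in BLOCKED_HOSTS:
            return True
        if tld in BLOCKED_TLDS:
            return True
        return any(word in path for word in BLOCKED_PATH_KEYWORDS)

    print(url_matches_policy("http://www.example-bad.com/index.html"))  # True (host match)
    print(url_matches_policy("http://www.wiley.com"))                   # False (no match)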

31.4.2 Matching the Host.

Some filtering systems attempt to distinguish between acceptable and unacceptable sources on the Web by inspecting the particular servers or general information portals such as search engines.

31.4.2.1 Block Lists of Servers.

Objectionable content tends to be concentrated on individual servers. This naturally leads some organizations to block access to those servers. The two methods of blocking servers are by Internet Protocol (IP) address and by name.

Blocking by IP address simply denies all traffic (or all HTTP traffic) to and from certain addresses. This tactic involves several difficulties based on how addresses are used. First, IP addresses are not permanent. While the numeric addresses of most large commercial servers tend to remain the same over time, many smaller servers have dynamically assigned addresses that may change periodically. Thus, blocking an IP address may prevent access to content hosted on a completely different server than intended. Second, commercial servers often host content for many different customers, and blocking the server as a whole will block all of the contents rather than just the objectionable ones. This is a particular problem for very large service providers like AOL, which grants every user the ability to host a Web site. Blocking the AOL servers because of objectionable content would potentially overblock large numbers of personal Web pages. Third, address-based blocking creates the possibility of malicious listing, a practice in which an attacker (or a competitor) spoofs the address of a Web server and provides objectionable content likely to land the server on blocking lists. Some filtering products allow users to submit “bad” sites, which provides another opportunity for malicious listing.
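
Blocking by IP address amounts to a membership test against a list of addresses or address ranges. A minimal sketch, using made-up addresses from the reserved documentation ranges, follows; the instability of real-world address assignments described above is exactly what makes such a list hard to maintain.

    import ipaddress

    # Hypothetical blocked ranges; real lists change constantly.
    BLOCKED_NETWORKS = [
        ipaddress.ip_network("192.0.2.0/24"),      # a whole hosting provider's range
        ipaddress.ip_network("198.51.100.17/32"),  # a single "bad" server
    ]

    def destination_is_blocked(ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in BLOCKED_NETWORKS)

    print(destination_is_blocked("198.51.100.17"))  # True
    print(destination_is_blocked("203.0.113.5"))    # False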

Blocking by name involves the Domain Name System (DNS), in which a human-readable name (e.g., www.wiley.com) maps to a computer-readable IP address (in this case, 208.215.179.146). DNS allows an organization to change the physical address of its Web server by updating the DNS listing to point to the new IP address, although this does rely on the system as a whole propagating the change in a reasonable amount of time. A device that monitors or filters traffic based on a domain name needs to be able to periodically refresh its list of name-to-address mappings in order to avoid blocking sites whose addresses periodically change. As with address-based blocking, name-based blocking also risks malicious listing as well as both under- and overblocking. Many organizations register multiple names for their servers, for a variety of reasons. For instance, an organization might register its name in the .net, .com, .org, and .biz top-level domains to prevent the kind of misdirection described in Section 31.4.4 (whitehouse.com). Other reasons for registering a name in multiple domains include preventing competitors from using name registration to steal customers and preventing speculative buying of similar names by individuals hoping to sell them for significant profit. Web hosting companies also provide service to many different organizations, so a variety of URLs would point to the IP address of the same hosting server. Thus, blocking by server name could underblock by not accounting for all the possible registered names pointing to the same server and overblock by matching the server name of a service provider that hosts sites for many different customers who use the provider's server name in the URL of their Web pages.
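
The periodic refresh of name-to-address mappings is easy to sketch. The fragment below, which assumes a small list of purely hypothetical blocked names, re-resolves each name and rebuilds the corresponding address set; a production device would run this on a schedule and handle resolution failures more carefully.

    import socket

    BLOCKED_NAMES = ["bad.example.org", "worse.example.net"]  # hypothetical names

    def refresh_blocked_addresses(names):
        """Re-resolve each blocked name so the filter tracks address changes."""
        addresses = set()
        for name in names:
            try:
                _, _, addr_list = socket.gethostbyname_ex(name)
                addresses.update(addr_list)
            except socket.gaierror:
                # Name no longer resolves; skip it rather than failing the refresh.
                continue
        return addresses

    # Called periodically (for example, from a scheduler) to rebuild the address list.
    blocked_ips = refresh_blocked_addresses(BLOCKED_NAMES)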

31.4.2.2 Block/Modify Intermediaries.

The Web has become an enormous repository of information, requiring the development of powerful search tools to find information. Early search engines gave way to more sophisticated information portals like Yahoo!, Google, AOL, and MSN. By allowing advanced searches and customized results, these portals give users easy access to information that would be difficult or impossible to find using manual search techniques. Portals have become wildly popular tools for accessing the Internet as a whole; Google claimed 100 million search queries per day at the end of 2000 and announced in 2004 that its site index contained over 6 billion items.6 Information access at this scale makes portals natural targets for monitoring and filtering. Few commercial organizations prevent their employees from using popular portals, since they have become so much a part of the way people use the Internet. Some countries, however, have blocked access to certain portals for their citizens, hoping to control access to information that might tend to violate national laws (e.g., access to Nazi memorabilia in France) or inspire citizen resistance to the government (e.g., access to information about the Tiananmen Square massacre in China). For further discussion of these issues, see Chapter 72 in this Handbook.

31.4.3 Matching the Domain.

Since 2000, the Internet Corporation for Assigned Names and Numbers (ICANN) has been reviewing requests for a new top-level domain, .xxx, which would allow providers of sexually related content to voluntarily reregister their sites. Such a domain would be easy to filter in the URL of the request, which presumably would appeal to supporters of Web filtering. It would also allow content providers to show that they were complying with laws preventing children from accessing inappropriate material, by enabling more effective parental filtering. Nevertheless, the move has met resistance on both fronts. Conservative religious groups fear that establishing a .xxx domain would legitimize pornography, while not all sexual content providers agree that the perceived benefit would outweigh either the increased filtering of their sites or the easier monitoring of their clients' traffic.

In 2007, ICANN rejected the most recent revision of the .xxx domain proposal, citing the lack of unanimity in the sex content provider community as well as the fear that ICANN might be placed in the position of regulating content, which is outside the organization's charter.7

31.4.4 Matching the Content.

String matching is simple to do, but difficult to do without both over- and underblocking. For instance, one of the most common categories for content filtering (particularly in the United States) is sex. Blocking all content exactly matching the word "sex" would fail to match the words "sexy" and "sexual." To avoid this kind of underblocking, word lists need to be very long to account for all permutations. A slightly more effective tactic is to block all words containing the string "sex," but this would overblock the words "Essex," "Sussex," and "asexual." Looking for all strings beginning with the combination "sex" would overblock "sexton," "sextet," and "sextant." Simple string matching also ignores context, so blocking "sex" would match in cases where a survey page asked the respondent to identify gender using the word "sex" or in pages describing inherited sex traits or gender roles or sexual discrimination lawsuits.
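
The trade-off is easy to demonstrate in a few lines of Python: a bare substring match catches "Essex" and "sextant" along with the intended targets (overblocking), while an exact word match misses "sexy" and "sexual" (underblocking).

    import re

    samples = ["sex", "sexy", "sexual", "Essex", "asexual", "sextant"]

    substring = [w for w in samples if "sex" in w.lower()]              # matches every sample
    exact_word = [w for w in samples if re.fullmatch(r"sex", w, re.I)]  # matches only "sex"

    print(substring)   # ['sex', 'sexy', 'sexual', 'Essex', 'asexual', 'sextant'] -> overblocks
    print(exact_word)  # ['sex']                                                  -> underblocks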

Other difficulties in string matching involve the vagaries of language. URLs can be displayed in any language whose character set a computer recognizes, so a filter will underblock requests in a language for which it lacks word lists. More generally, in any language it is possible to obfuscate the contents of a site with a seemingly benign URL to avoid filtering. The classic example of this is the pornography site www.whitehouse.com, presumably set up to catch visitors who mistakenly typed “com” when trying to reach the U.S. White House Web site (www.whitehouse.gov). More recently, spam marketing campaigns have been setting up Web sites linked in e-mail, with meaningless strings of numbers and characters in the URL (e.g., http://2sfh.com/7hioh), making the sites difficult to filter.

In a 2006 study, Veritest compared three of the industry-leading Web filter products (WebSense, SmartFilter, and SurfControl). The winning product underblocked 7 sites, overblocked 8 sites, and miscategorized 10 sites, from a preselected list of 600 URLs. The two competing products fared worse, underblocking 23 and 14 sites, and overblocking 9 and 12 sites out of 600.8 If this is the performance of the industry's leading edge, then clearly the technology is still developing.

Given the difficulties of accurate matching based on text or address related to the Web page request, a natural alternative is to examine the page content itself. Of course, content matching needs to have access to the unencrypted data in transit, so encrypted Web sessions cause a real problem for this tactic. Some organizations allow (or require) that HTTPS sessions terminate on the organization's own proxy server, potentially allowing the proxy to decrypt the data and perform content analysis.

31.4.4.1 Text.

It is possible, although resource intensive, to watch the network traffic stream and look for text that matches a list of undesired content. This sort of matching typically does little analysis of context and so is prone to the same kind of false positives (overblocking) and false negatives (underblocking) described in Section 31.4.4. Moreover, as an increasing amount of Web content involves pictures and sounds, text matching becomes less effective.

31.4.4.2 Graphics.

A promising new technique, with applications in visual searching as well as visual content blocking, breaks a graphic image into smaller objects by color or pattern. The technique then evaluates each object against a database of reference images for matching with desired criteria. In the case of blocking objectionable sex content, objects can be evaluated for skin tone and either blocked outright or referred to administrators for manual review if the match is inconclusive. Although content-based filtering has not yet developed into a commercial product, the tools exist and the technology seems applicable not only to still images, but also to video and even audio content.9 In 2006, the NASA Inspector General's office used an image-search program called Web ContExt to snare an employee who had been trafficking in child pornography.10

31.5 IMPLEMENTATION.

With the exception of content-based matching, which has not yet reached the market in any significant way, most filtering—whether of address, domain, or keyword—involves matching text lists.

31.5.1 Manual “Bad URL” Lists.

Many firewalls provide the capability for administrators to block individual URLs in the firewall configuration. Entered manually, these rules are good for one-time blocking when a security alert or investigation identifies sites hosting viruses or other malware. This approach is also useful for demonstrating the general filtering abilities of the firewall and for testing other Web-blocking technologies. For instance, in an organization using a commercial blocking solution on a Web proxy, a simple URL-blocking rule on the organization's border firewall would provide some easy spot testing of the effectiveness of the commercial solution. Given the extreme size and constant growth of the Web, however, the manual approach does not scale well to protect against all the possible sources of objectionable material.

31.5.2 Third-Party Block Lists.

With the enormous size of the Web, the more typical approach is to use a third-party block list. Most of these are commercial products with proprietary databases, developed through a combination of automated “Web crawlers” and human technicians evaluating Web sites. Some companies have attempted to prevent researchers from trying to learn about blocking lists and strategies, but the U.S. Copyright Office granted a Digital Millennium Copyright Act (DMCA) exemption in 2003 for fair use by researchers studying these lists.11 Web-filter companies continue to oppose such exemptions. Two open-source filtering alternatives also exist, with publicly viewable (and customizable) block lists that run on caching proxies: SquidGuard12 and DansGuardian.13

31.6 ENFORCEMENT.

Filtering of Web traffic typically occurs either at a network choke point, such as a firewall or Web proxy, or on the individual client machine. Economies of scale lead organizations to filter on a network device, while products designed for parental control of children's Internet use usually reside on individual home computers.

31.6.1 Proxies.

A proxy server is a device that accepts a request from a client computer and then redirects that request to its ultimate destination. Proxies serve a variety of purposes for an organization, including reduction of traffic over expensive wide-area network links and Internet connections, increased performance through caching frequently accessed Web pages, and protection of internal users through hiding their actual IP addresses from the destination Web servers. Proxies also represent a natural locus of control for the organization, enabling authentication and tracking of Web requests that go through this single device. Most browsers support manual configuration of a proxy for all Web traffic as well as automatic discovery of proxies running on the organization's network. Organizations that use Web proxies typically allow outbound Web traffic only from the IP address of the proxy, thus forcing all HTTP traffic to use the proxy. Use of an encrypted Web session (HTTPS) is possible through a proxy, although either at the expense of the ability to monitor content (if the proxy merely passes the traffic through) or at the expense of the end-to-end privacy of the encrypted link (if the proxy decrypts and reencrypts the session).
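
The client side of this arrangement can be sketched with the widely used third-party Python requests library; the proxy address below is purely hypothetical, and in practice the browser or operating system usually supplies the setting through manual configuration or automatic proxy discovery.

    import requests

    # Hypothetical organizational proxy; all outbound Web traffic is sent through it.
    proxies = {
        "http": "http://proxy.example.internal:3128",
        "https": "http://proxy.example.internal:3128",
    }

    # The request leaves the client addressed to the proxy, which regenerates it toward
    # the destination server; the destination sees only the proxy's address.
    response = requests.get("http://www.wiley.com/", proxies=proxies, timeout=10)
    print(response.status_code)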

Individuals also use proxies to maintain the privacy of their activities on the network, as described in Section 31.7.4. Thus, in addition to serving as a natural vehicle for content-filtering applications, proxies also represent a serious threat to those same applications.

31.6.2 Firewalls.

A firewall's job is to analyze information about the traffic passing through it and apply policy rules based on that information. Maintaining acceptable response time and throughput requires that the firewall do its job quickly and efficiently. In order to do so, most firewalls merely look at network-layer information, such as source and destination addresses and ports. More recently, firewall vendors have been adding more features to increase security and product appeal. Many companies now call their more advanced firewalls “service gateways” or “security gateways” as the notion of Unified Threat Management (UTM) becomes more popular. These UTM devices combine many features that formerly required individual devices, such as antivirus, intrusion detection, and filtering of both junk e-mail and Web content.

Sophisticated traffic examination increases the demand on firewall hardware. In order to reduce the performance hit caused by increased packet inspection, many firewalls allow the administrator to define particular rules or protocols for advanced checking. For instance, since viruses are most prevalent in email, Web, and peer-to-peer connections, the firewall administrator might need to configure only antivirus checking on rules applying to these protocols. Similarly, if the firewall needs only to monitor outbound HTTP requests from a single IP address, the Web proxy, then the extra processing load of the monitoring function can be constrained to that traffic profile.

The decision between filtering Web traffic at the proxy server (letting the firewall just pass the traffic from that address) and filtering Web traffic at the firewall (having the firewall do the URL inspection) depends on the amount of Web traffic and the budget (one device or two). The decision also affects the strength of the assertion that the organization is successfully filtering objectionable content. If the organization's border firewall performs the filtering, then this assertion depends on the firewall being the only way for traffic to leave the organization's network. Other traffic vectors, including wireless networking, protocol tunneling, and anonymizing proxies, may come into play. If the organization relies on client computers to use a Web proxy by policy, then the degree to which users can circumvent this policy should also be a consideration.

31.6.3 Parental Tools.

Although client-based Web filtering is not common in large organizations because of the expense and management of such services on a large scale, products enabling parents to block content for their children at home have become a big business. Many large ISPs, such as AOL and MSN, offer parental content-blocking tools as a free feature of their services. Other companies sell stand-alone products that install on a home computer, with password-protected parental administrative access to content-blocking functions. Net Nanny, Cybersitter, and Cyber Patrol are some of the more popular offerings. These products typically reside on individual computers rather than on a network device, although if a home computer is set up as a hub for network connectivity (such as with Microsoft's Internet Connection Sharing), then the controls can filter traffic in the same way as an organizational proxy server. Many of these products also filter other traffic, including e-mail, peer-to-peer file sharing, and instant messaging, as well as offering foreign language filtering, destination-address blocking, and time-of-day access rules.14

31.7 VULNERABILITIES.

No security scheme, whether physical or logical, is completely free of vulnerabilities, and Web filtering is certainly no exception to this rule. Users who want to access blocked content have a variety of tactics available to them, although solutions vary in ease of use. IP spoofing, protocol tunneling, and some forms of encryption are not trivial practices, and therefore are the tools of technically adept users in reasonably small number. Other technologies, however, such as anonymizing proxies, translation sites, and caching services, are easy ways for the average user to defeat filtering. Web-filtering vendors are constantly striving to make their products more effective, while privacy and free-speech advocates support ongoing efforts to defeat what they call censorware.

31.7.1 Spoofing.

In an organization that performs Web filtering on a proxy server, the organization's border firewall must allow outbound HTTP requests from the IP address of that proxy. A user who can configure traffic to look as though it is coming from the proxy's IP address might be able to get traffic through the firewall without actually going through the proxy. This tactic, known as address spoofing, takes advantage of lax routing policies on routers that forward all unknown traffic to default gateways without checking to see if that traffic came from a direction consistent with its reported source address. The drawback of spoofing, from the attacker's point of view, is that a large organization with significant amounts of Web traffic traversing the network will notice the temporary unavailability of the proxy caused by the spoofing.

An organization can defeat spoofing by configuring internal routers to check their routing tables, to see if the address of a packet coming into an interface is consistent with the networks available via that interface. The router drops packets with source addresses inconsistent with their actual source. This tactic, called reverse-path filtering, requires a more capable (and expensive) router, so it is generally not available for the home user trying to set up network-based protection for parental control.
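
Reverse-path filtering is a router feature, but the underlying check is simple enough to sketch: given a routing table, a packet is dropped when the route back to its claimed source address does not point out the interface on which the packet arrived. The routing table below is entirely hypothetical.

    import ipaddress

    # Hypothetical routing table: network -> interface the router uses to reach it.
    ROUTES = {
        ipaddress.ip_network("10.1.0.0/16"): "eth0",   # internal clients
        ipaddress.ip_network("10.2.0.0/16"): "eth1",   # server subnet, including the Web proxy
    }

    def passes_reverse_path_check(src_ip: str, arrival_interface: str) -> bool:
        """Drop packets whose source address is not reachable via the arrival interface."""
        addr = ipaddress.ip_address(src_ip)
        for network, interface in ROUTES.items():
            if addr in network:
                return interface == arrival_interface
        return False  # unknown source: fail closed

    # A packet claiming to come from the proxy subnet but arriving on the client
    # interface is spoofed, so the check fails and the router drops it.
    print(passes_reverse_path_check("10.2.0.5", "eth0"))  # False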

31.7.2 Tunneling.

A more problematic tactic, because it relies on the behavior of applications rather than on subverting network-layer protections, is protocol or application tunneling. An application can encapsulate any other application's information as a packet of generic data and send it across the network. Virtual private network (VPN) clients use this approach to send traffic through an encrypted tunnel.

Protocol tunneling (sometimes called dynamic application tunneling)15 relies on applications that send data on commonly allowed ports. For example, a user might tunnel a Web session through the secure shell (SSH) application, which uses TCP port 22. SSH often is allowed through firewalls, and because it uses both authentication and encryption, it can be hard to monitor the difference between a legitimate SSH session and a covert tunnel. In this example, the Web browser issues a request for an HTTP page on TCP port 80, and another application running on the client system captures and redirects the port 80 request into an SSH-encrypted tunnel. The other end of the SSH tunnel could be the destination server or a proxy server somewhere between the client and server. The client application (in this case, the Web browser) is unaware of the traffic diversion and needs no altered configuration. This is similar to the approach used by traditional VPNs.

Application tunneling (also called static application tunneling)16 requires reconfiguration of the client application to redirect requests through a different port on the client machine. Typically, the user redefines the destination address for the application to a localhost address (in the 127.0.0.x range, referring to the local device), and an application then sets up a connection from that local port to the destination on an allowed network port. This approach requires alteration of the client application's configuration. Some Secure Sockets Layer (SSL) VPN products use this approach to tunnel one or more protocols, or all traffic, across an encrypted HTTP tunnel.
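
A static application tunnel of the kind just described boils down to a local listener that relays whatever it receives to a far-end host on an allowed port. The sketch below relays plain TCP and omits the encryption that a real SSH or SSL tunnel would add; the host names and ports are hypothetical.

    import socket
    import threading

    LOCAL_ADDR = ("127.0.0.1", 8080)           # the client application is pointed here
    REMOTE_ADDR = ("tunnel.example.org", 443)  # far end of the tunnel, on an allowed port

    def pipe(src, dst):
        """Copy bytes one way until either side closes."""
        try:
            while True:
                data = src.recv(4096)
                if not data:
                    break
                dst.sendall(data)
        finally:
            src.close()
            dst.close()

    def serve():
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind(LOCAL_ADDR)
        listener.listen(5)
        while True:
            client, _ = listener.accept()
            remote = socket.create_connection(REMOTE_ADDR)
            # Relay traffic in both directions between the local application and the far end.
            threading.Thread(target=pipe, args=(client, remote), daemon=True).start()
            threading.Thread(target=pipe, args=(remote, client), daemon=True).start()

    if __name__ == "__main__":
        serve()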

Tunneling via protocol or application generally requires access to either configure existing application settings or install extra software. Thus, tunneling is unavailable to users in organizations that give end users limited control over their computers. Tunneling is more of a problem for parents, whose technically adroit children can install tunneling applications to get around parental controls.

31.7.3 Encryption.

Security professionals generally encourage the use of encryption, since it protects sensitive information in transit across a network. However, when users encrypt data to hide transactions that violate organizational policy, encryption can become a liability instead of an asset from the organization's perspective.

In an encrypted Web session using HTTPS (which is HTTP over SSL), the contents of the session are encrypted and thus unavailable for monitoring or filtering based on URL or data content. However, the source and destination IP addresses of the session, which are visible to TCP as the session is set up, remain visible on the network. This is necessary for routers to be able to get the encrypted data packets from one end of the transaction to the other. Thus, while an HTTPS session is immune to filtering by URL text matching or content-based filtering, blocking the destination server is still effective.
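
A short sketch makes the point: everything up to the TCP connection and the TLS handshake is visible to a device in the path, while everything after it is ciphertext. The example reuses the chapter's www.wiley.com address and is illustrative only.

    import socket
    import ssl

    # The TCP connection (and thus the destination address and port) is visible to any
    # network device in the path; everything after the TLS handshake is opaque to it.
    ctx = ssl.create_default_context()
    raw = socket.create_connection(("www.wiley.com", 443))        # visible: destination, port
    tls = ctx.wrap_socket(raw, server_hostname="www.wiley.com")   # from here on, encrypted
    tls.sendall(b"GET / HTTP/1.1\r\nHost: www.wiley.com\r\nConnection: close\r\n\r\n")
    print(tls.recv(200))  # a filter in the path sees only ciphertext for this exchange
    tls.close()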

As mentioned in the previous section, VPN technologies represent another use of encryption to protect the content of a transaction. Again, although the VPN encrypts the contents of the transaction, the IP addresses of the endpoints must remain visible in order to transport the data to their destination.

A related and more subtle technique, called steganography, embeds data inside other information to avoid detection altogether. For instance, a user could embed a text message within the information used to encode a picture. For more details on steganography and other cryptographic techniques, see Chapter 7 in this Handbook.

31.7.4 Anonymity.

Most Web monitoring and content filtering relies on the identity of the user. Content filtering can happen without identity information, but unless an organization or country chooses to impose draconian broad-based filtering of all Web requests (and some do choose this approach), the organization might like to enforce filtering requirements for only some users. For instance, a school district might wish to impose stricter filtering on elementary school students than on high school students and use a more permissive policy for teachers and administrative employees.

The bane of this approach is anonymity. When privacy concerns by individuals lead to the use of anonymizing technologies on a scale that makes identity of users difficult to determine, then the only way to comply with policies and laws requiring filtering of content for some groups is to filter for all groups at the level of the strictest requirement. In the case of a school district, if the network cannot distinguish between student traffic and teacher traffic, then the district must impose the requirements of student content-filtering on teachers as well. In fact, many school districts have made this choice, due to both the extreme numbers of hard-to-find anonymizing proxies and the technical difficulty of separating student and teacher use of a common school network infrastructure. Even so, external anonymizing proxies bedevil network administrators' attempts to force compliance with filtering regulations.

Other uses of privacy-enhancing technologies (PET) include network-based anonymity schemes, such as mixing networks and onion routing. The project known as TOR (originally an acronym for the onion router) has been gaining popularity in recent years. For more information about anonymity and identity, see Chapter 70 in this Handbook. For more about PET, see Chapter 42 in this Handbook.

31.7.5 Translation Sites.

Language-translation sites, such as Babel Fish (http://babelfish.yahoo.com),17 also offer the possibility of avoiding content filters. The user enters a URL and clicks a button to request a translation of the text of the site. The user's session is between the client computer and Yahoo!, with the requested URL merely passed as keystroke data within the HTTP session, so the potentially blocked site is made available so long as the user can get to Yahoo! This is a special case of a proxy, in that the request typed into Babel Fish generates a request to the server at the other end but the client does not receive the exact results of that reply; the display, instead, includes all the graphics of the original, but the text is translated into the requested language. So, from the point of view of a monitoring tool, the client is always connected with Yahoo!, and even a content filter looking at text within HTTP packets could be stymied by the foreign language. Recognizing the potential for abuse, Yahoo! published a Terms of Use document for the Babel Fish site prohibiting using the service for “items subject to US embargo, hate materials (e.g. Nazi memorabilia),… pornography, prostitution,… [or] gambling items,”18 among many other classes of activities or products.

Translation sites are also troublesome for system administrators because requesting a translation into the language a site is already written in usually returns the content unchanged. For example, asking for a French-to-English translation of an English-language Web site simply passes the content through without alteration.

31.7.6 Caching Services.

One of the primary uses for proxy servers has been to reduce network traffic by saving local copies of frequently requested pages. This behavior can circumvent Web filtering, so long as the user can get to the caching server. Google, for example, caches many pages in an effort to provide very fast response to search requests. Large graphic and video files take up the most bandwidth, so Google's Image Search feature often caches them. Often graphic files are available even for sites that no longer exist. Users wanting to bypass blocks on sexually explicit content often use Google Image Search. Google does provide a feature called Safe Search, with three levels of voluntary filtering. The default (middle) setting filters “explicit images only,” while the strict setting filters “both explicit text and explicit images.”19 Google notes, “no filter is 100% accurate, but Safe Search should eliminate most inappropriate material.”20 The Safe Search feature is configurable per user, and offers no password protection, so it is not a Web filtering technology so much as it is a voluntary sex-based search filter.

31.8 THE FUTURE.

As more ways of distributing content emerge, the content-filtering industry will doubtless evolve to cover the new technologies. At present, vendors sell filtering products for e-mail, Web chat, newsgroups, instant messaging, peer-to-peer file sharing, and FTP, as well as filtering of Web requests. New features will appear in “traditional” Web filtering as well. The latest versions of the home filtering products NetNanny and McAfee Parental Controls now offer the ability to force safe search options in the major search engines (like Google Safe Search, described in Section 31.7.6), and provide “object recognition,” which recognizes certain versions of Web objects (like visit counters) that are commonly used in pornography sites.21

Supporters of content filtering, and those who are required to use it and need a reliable product, will be encouraged by the growth of the industry but perhaps disappointed that the problem never seems to be completely solved. Advances in image recognition could provide much better filtering, but may well spawn new ways to alter content to circumvent these tools. Those who decry these products as censorware will point out that, historically, most attempts to censor speech have failed in the end. Eventually, the two sides will probably work out an uneasy compromise, as has happened regarding sales of “adult” content in print and video. As long as some people insist on their right to distribute information that other people find offensive, this conflict is likely to continue.

31.9 SUMMARY.

For a variety of reasons, some better than others, groups of people with power over, or responsibility for, other groups of people want to control the kind of information to which those other groups have access. Web content-filtering products (or censorware) provide this kind of control using computer technology to screen content across the Internet. Free speech and privacy advocates argue that content filtering prevents legitimate, legal access to information. Even should one grant the legitimacy of filtering in some cases, current technologies are prone to error, both failing to block some objectionable content and blocking some sites that contain no such content.

Most filtering techniques involve matching a text string against a database of undesirable content. A Web request filter looks at some portion of the URL, while a content filter looks at the text portion of a returned page. Lists of blocked servers can also be used to block traffic, either by address or by server name. Attempts continue to add top-level domain as a filtering criterion, by the addition of a .xxx domain. Other methods of blocking content examine nontextual parts of a Web page, including graphics. Research into image recognition and flesh-tone matching is progressing, and government agencies have used some image-recognition tools in prosecuting cases, but image recognition has not yet entered the commercial market.

With every protective, or overprotective, strategy comes a group of people dedicated to its defeat. Content filtering has a number of vulnerabilities, chief among which is the use of anonymity via privacy-enhancing technologies such as anonymizing proxies and onion routing. Other ways to defeat Web filtering include the use of protocol and application tunneling, encryption, Web translation sites, and caching services. Filtering technologies have been improving over the years, as has the inventiveness of those dedicated to thwarting them. As information outlets continue to proliferate and new communications media appear, this kind of conflict between protective technologies and privacy-enhancing circumvention is likely to continue.

31.10 FURTHER READING

Barracuda Networks. “CIPA Compliance and the Barracuda Web Filter,” www.barracudanetworks.com/ns/downloads/Barracuda_WP_CIPA.pdf (retrieved April 7, 2007).

The Censorware Project, http://censorware.net/ (retrieved April 7, 2007).

Electronic Privacy Information Center. “EPIC Censorware Page,” www.epic.org/free_speech/censorware/ (retrieved April 7, 2007).

Kongshem, Lars. “Censorware: How Well Does Internet Filtering Software Protect Students?” Online School, www.electronic-school.com/0198f1.html (retrieved April 7, 2007).

Secure Computing. “Best Practices for Monitoring and Filtering Internet Access in the Workplace,” www.securecomputing.com/webform.cfm?id=92&ref=pdtwp1295 (retrieved April 7, 2007; note: registration required).

“Seth Finkelstein's Anticensorware Investigations—Censorware Exposed,” http://sethf.com/anticensorware/ (retrieved April 7, 2007).

31.11 NOTES

1. D. Chaum, “Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms,” Communications of the ACM 24, No. 2 (February 1981); Available at: http://world.std.com/~franl/crypto/chaum-acm-1981.html.

2. D. M. Goldschlag, M. G. Reed, and P. F. Syverson, “Hiding Routing Information,” Workshop on Information Hiding—Proceedings. May 1996, Cambridge, UK.

3. www.onion-router.net/.

4. Codified at 47 U.S.C. § 254(h) and 20 U.S.C. § 9134.

5. Poudre School District (Fort Collins, Colorado), Information Technology Services, www.psdschools.org/services/infotech/index.aspx (retrieved April 7, 2007).

6. “Google Milestones,” www.google.com/intl/en/corporate/history.html (retrieved April 7, 2007).

7. Internet Corporation for Assigned Names and Numbers, “Board Rejects .XXX Domain Application,” March 30, 2007; www.icann.org/announcements/announcement-30mar07.htm (retrieved April 4, 2007).

8. “Websense: Web Filtering Effectiveness Study,” January 2006; www.lionbridge.com/NR/rdonlyres/websensecontentfilte7fmspvtsryjhojtsecqomzmiriqoefctif.pdf (retrieved April 7, 2007).

9. “Using eVe for Content Filtering,” eVision Visual Search Technology, www.evisionglobal.com/business/cf.html (retrieved April 4, 2007).

10. “NASA HQ Raided in Kiddie Porn Probe,” The Smoking Gun, March 31, 2006; www.thesmokinggun.com/archive/0331061nasa1.html (retrieved April 4, 2007).

11. S. Finkelstein, “DMCA 1201 Exemption Transcript,” April 11, 2003, http://sethf.com/anticensorware/hearing_dc.php (retrieved April 4, 2007).

12. “Another SquidGuard Website,” www.squidguard.org/ (retrieved April 4, 2007).

13. “DansGuardian, True Web Content Filtering for All,” http://dansguardian.org/ (retrieved April 4, 2007).

14. Top Ten Reviews, “Internet Filter Review 2007,” http://internet-filter-review.toptenreviews.com/ (retrieved April 4, 2007).

15. SSH Communications Security, “Secure Application Connectivity,” www.ssh.com/solutions/applications/secure-app-connectivity.html (retrieved April 7, 2007).

16. SSH Communications Security, “Secure Application Connectivity.”

17. Named after the tiny Babel fish in Douglas Adams's science-fiction classic Hitchhiker's Guide to the Galaxy (New York: Del Rey, 1995). The Babel fish, when inserted in a person's ear, would instantly enable the person to understand any spoken language. The fish name is used for a translation technology developed by AltaVista, now owned by Yahoo!, and is offered as a free service at http://babelfish.yahoo.com (retrieved April 7, 2007). It is also the name of a commercial translation firm that has registered the trademark, BabelFish.com, www.babelfish.com/ (retrieved April 7, 2007).

18. “Yahoo! Search Builder Terms of Use,” http://help.yahoo.com/help/us/ysearch/ysearch-01.html?fr=bf-home (retrieved April 7, 2007).

19. “Google Preferences,” http://images.google.com/preferences?q=we+live+together&um=1&hl=en (retrieved April 7, 2007).

20. “Google Help Center,” http://images.google.com/intl/en/help/customize.html#safe (retrieved April 7, 2007).

21. Internet Filter Review, “Internet Filter Terms,” http://internet-filter-review.toptenreviews.com/short-definitions.html (retrieved April 7, 2007).
