Chapter 1. Introduction to Cisco Wide Area Application Services (WAAS)

IT organizations are struggling with two opposing challenges: to provide high levels of application performance for an increasingly distributed workforce, and to consolidate costly infrastructure to streamline management, improve data protection, and contain costs. Separating the growing remote workforce from the locations where IT wants to deploy infrastructure is the wide-area network (WAN), which introduces tremendous delay, packet loss, congestion, and bandwidth limitations, all of which can impede a user’s ability to interact with applications in a high-performance manner.

Cisco Wide Area Application Services (WAAS) is a solution designed to bridge the divide between application performance and infrastructure consolidation in WAN environments. By employing robust optimizations at multiple layers, Cisco WAAS is able to ensure high-performance access to distant application infrastructure, including file services, e-mail, intranet, portal applications, and data protection. By mitigating the performance-limiting factors of the WAN, Cisco WAAS not only improves performance, but also positions IT organizations to consolidate distributed infrastructure, control costs more effectively, and strengthen their position on data protection and compliance.

The purpose of this book is to discuss the Cisco WAAS solution in depth, including a thorough examination of how to design and deploy Cisco WAAS solutions. This chapter provides an introduction to the performance barriers that are created by the WAN, and a technical introduction to Cisco WAAS. This chapter also examines the software architecture of Cisco WAAS, and outlines how each of the fundamental optimization components overcomes those application performance barriers. The chapter ends with a discussion of how Cisco WAAS fits into a network-based architecture of optimization technologies, and how these technologies can be deployed in conjunction with Cisco WAAS to provide a holistic solution for improving application performance over the WAN.

Understanding Application Performance Barriers

Before examining how Cisco WAAS overcomes performance challenges created by network conditions in the WAN, it is important to understand how those conditions impact application performance. Applications today are increasingly robust and complex compared to applications ten years ago, and this trend is expected to continue. Many enterprise applications are multitiered, having a presentation layer (commonly composed of web services), which in turn accesses an application tier of servers, which interacts with a database tier (commonly referred to as an n-tier architecture). These distinct layers commonly interact with one another using middleware, which is a subsystem that connects disparate software components or architectures. As of this writing, the majority of applications in use are client/server, involving only a single tier on the server side (for instance, a simple file server). However, n-tier application infrastructures are becoming increasingly popular.

Layer 4 Through Layer 7

Server application instances, whether single-tier or n-tier, primarily interact with user application instances at the application layer of the Open Systems Interconnection (OSI) model. At this layer, application layer control and data messages are exchanged to perform functions based on the business process or transaction being performed. For instance, a user may ‘GET’ an object stored on a web server using HTTP. Interaction at this layer is complex, as the number of operations that can be performed over a proprietary protocol or even a standards-based protocol can be literally in the hundreds or thousands. Beneath the application layers on a given pair of nodes sits a hierarchy of lower layers connecting the server application instance and the user application instance, which also adds complexity and performance constraints.

For instance, data that is to be transmitted between application instances might pass through a shared (and prenegotiated) presentation layer. This layer may or may not be present depending on the application, as many applications have built-in semantics around data representation. This layer is responsible for ensuring that the data conforms to a specific structure, such as ASCII or Extensible Markup Language (XML).

From the presentation layer, the data might be delivered to a session layer, which is responsible for establishing an overlay session between two endpoints. Session layer protocols provide applications with the capability to manage checkpoints and recovery of atomic upper-layer protocol (ULP) exchanges, which occur at a transactional or procedural layer as compared to the transport of raw segments (provided by the Transmission Control Protocol, discussed later). Similar to the presentation layer, many applications may have built-in semantics around session management and may not use a discrete session layer. However, some applications, commonly those that use remote procedure calls (RPC), do require a discrete session layer.

Whether or not the data to be exchanged between a user application instance and server application instance requires the use of a presentation layer or session layer, data to be transmitted across an internetwork will be handled by a transport protocol. The transport protocol is primarily responsible for data multiplexing, that is, ensuring that data transmitted by a node is able to be processed by the appropriate application process on the recipient node. Commonly used transport layer protocols include the Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and Stream Control Transmission Protocol (SCTP). The transport protocol is commonly responsible for providing guaranteed delivery and adaptation to changing network conditions, such as bandwidth changes or congestion. Some transport protocols, such as UDP, do not provide such capabilities. Applications that leverage UDP either implement their own means of guaranteed delivery or congestion control, or these capabilities simply are not required for the application.

The components mentioned previously, including transport, session, presentation, and application layers, represent a grouping of services that dictate how application data is exchanged between disparate nodes. These components are commonly called Layer 4 through Layer 7 services, or L4–7 services, or application networking services (ANS). L4–7 services rely on the packet routing and forwarding services provided by lower layers, including the network, data link, and physical layers, to move segments of application data in network packets between nodes that are communicating. With the exception of network latency caused by distance and the speed of light, L4–7 services generally add the largest amount of operational latency to the performance of an application. This is due to the tremendous amount of processing that must take place to move data into and out of buffers (transport layer), maintain long-lived sessions between nodes (session layer), ensure data conforms to representation requirements (presentation layer), and exchange application control and data messages based on the task being performed (application layer).

Figure 1-1 shows an example of how L4–7 presents application performance challenges.

Figure 1-1. L4–7 Performance Challenges

The performance challenges caused by L4–7 can generally be classified into the following categories: latency, bandwidth inefficiencies, and throughput. These are examined in the following sections.


Latency

L4–7 latency is the sum of the latency components added by each of the four layers involved: application, presentation, session, and transport. Given that presentation layer, session layer, and transport layer latency are typically low and have minimal impact on overall performance, this section focuses on latency that is incurred at the application layer. It should be noted that, although significant, the latency added by L4–7 processing in the node itself is typically minimal compared to latency found in the network itself, and far less than the performance impact of application layer latency caused by protocol chatter over a high-latency network.

Application layer latency is defined as the operational latency of an application protocol and is generally exhibited when applications or protocols have a “send-and-wait” type of behavior. An example of application layer latency can be observed when accessing a file on a file server using the Common Internet File System (CIFS) protocol, which is predominant in environments using Windows clients and Windows servers, or network-attached storage (NAS) devices that are being accessed by Windows clients. In such a case, the client and server must exchange a series of “administrative” messages prior to any data being sent to a user.

For instance, the client must first establish the session to the server, and establishment of this session involves validation of user authenticity against an authority such as a domain controller. Then, the client must establish a connection to the specific share (or named pipe), which requires that client authorization be examined. Once the user is authenticated and authorized, a series of messages is exchanged to traverse the directory structure and gather metadata. After the file is identified, a series of lock requests must be sent in series (based on file type), and then file I/O requests (such as read, write, or seek) can be exchanged between the user and the server. Each of these messages requires that a small amount of data be exchanged over the network, causing operational latency that may be unnoticed in a local-area network (LAN) environment but is significant when operating over a WAN.

Figure 1-2 shows an example of how application layer latency alone in a WAN environment can significantly impede the response time and overall performance perceived by a user. In this example, the one-way latency is 100 ms, leading to a situation where only 3 KB of data is exchanged in 600 ms of time.

Figure 1-2. Latency-Sensitive Application Example
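The arithmetic behind this example can be sketched in a few lines of Python. The sketch assumes three strictly serial request/response exchanges that each return 1 KB of payload; that breakdown is an assumption chosen to match the 3 KB and 600 ms figures above, and the function names are illustrative only.

```python
def send_and_wait_ms(exchanges: int, one_way_latency_ms: float) -> float:
    """Network wait time when each request must be answered before the next is sent."""
    round_trip_ms = 2 * one_way_latency_ms
    return exchanges * round_trip_ms


def delivered_kb_per_second(payload_kb: float, elapsed_ms: float) -> float:
    """Useful payload delivered per second of elapsed time."""
    return payload_kb * 1000.0 / elapsed_ms


# Assumption: three serial exchanges, each returning 1 KB of payload.
elapsed_ms = send_and_wait_ms(exchanges=3, one_way_latency_ms=100.0)
rate = delivered_kb_per_second(payload_kb=3.0, elapsed_ms=elapsed_ms)
print(f"{elapsed_ms:.0f} ms elapsed, {rate:.1f} KB/s delivered")
```

Under these assumptions the user receives about 5 KB per second regardless of how much bandwidth the link offers, because the elapsed time is dominated by waiting on round trips rather than by transmission.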

It should be noted that although the presentation, session, and transport layers do indeed add latency, it is commonly negligible in comparison to application layer latency. It should also be noted that the transport layer performance itself is commonly subject to the amount of perceived latency in the network due to the slowness associated with relieving transmission windows and other factors. The impact of network latency on application performance is examined later in this chapter, in the section “Network Infrastructure.”

Bandwidth Inefficiencies

The lack of available network bandwidth (discussed in the section, “Network Infrastructure”) coupled with application layer inefficiencies in the realm of data transfer creates an application performance barrier. This performance barrier is manifest when an application is inefficient in the way information is exchanged between two communicating nodes. For instance, assume that ten users are in a remote office that is connected to the corporate campus network by way of a T1 (1.544 Mbps). If these users use an e-mail server (such as Microsoft Exchange) in the corporate campus network, and an e-mail message with a 1-MB attachment is sent to each of these users, the e-mail message needs to be transferred once for each user, or ten times. Such scenarios can massively congest enterprise WANs, and similarities can be found across many different applications:

  • Redundant e-mail attachments being downloaded over the WAN multiple times by multiple users

  • Multiple copies of the same file stored on distant file servers being accessed over the WAN by multiple users

  • Multiple copies of the same web object stored on distant intranet portals or application servers being accessed over the WAN by multiple users
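To put the e-mail example above in numbers, the following sketch estimates how long the T1 is occupied when the same attachment crosses it once versus ten times. It assumes 1 MB means 10^6 bytes and ignores protocol overhead and competing traffic, so the results are best-case figures.

```python
T1_BPS = 1_544_000            # T1 line rate in bits per second
ATTACHMENT_BYTES = 1_000_000  # 1 MB, taken here as 10**6 bytes (assumption)
USERS = 10


def transfer_seconds(total_bytes: int, link_bps: int) -> float:
    """Best-case time to move total_bytes over a link, ignoring overhead and loss."""
    return total_bytes * 8 / link_bps


single_copy = transfer_seconds(ATTACHMENT_BYTES, T1_BPS)
all_copies = transfer_seconds(ATTACHMENT_BYTES * USERS, T1_BPS)
print(f"one copy: {single_copy:.1f} s, {USERS} copies: {all_copies:.1f} s")
```

Even in this idealized case, the ten redundant copies keep the link fully busy for most of a minute, roughly ten times longer than a single copy would.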

In many cases, the data contained in objects being accessed across the gamut of applications used by remote office users will likely contain a significant amount of redundancy. For instance, one user might send an e-mail attachment to another user over the corporate WAN, while another user accesses that same file (or a different version of that file) using a file server protocol over the WAN. The packet network itself has historically been independent of the application network, meaning that characteristics of data were generally not considered, examined, or leveraged when routing information throughout the corporate network.

Some applications and protocols have since added semantics that help to minimize the bandwidth inefficiencies of applications operating in WAN environments. For instance, the web browsers of today have built-in client-side caching capabilities. Objects from Internet sites and intranet applications that are transferred over the WAN have metadata included in the protocol header that provides information to the client browser, thus allowing the browser to make a determination of whether or not caching should be used for the object. By employing a client-side cache in such applications, the repeated transmission of objects can be mitigated when the same user requests the object using the same application. Although this improves performance for that particular user, this information goes completely unused when a different user attempts to access that same object, as the application cache is wholly contained on each individual client and not shared across multiple users. Application-level caching is isolated not only to the user that cached the object, but also to the application within that user’s workstation. This means that while the user’s browser has a particular file cached, a different application has no means of leveraging that cached object. Some applications require that software upgrades be added to provide caching functionality.
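The isolation of client-side caches can be illustrated with a small counting model. The sketch below is hypothetical: it compares the number of WAN transfers when each user has a private cache against a single cache shared by everyone at the site, with an illustrative object name and request pattern.

```python
def wan_fetches(requests, shared_cache: bool) -> int:
    """Count objects that must cross the WAN for a sequence of (user, object) requests."""
    caches = {}   # cache owner -> set of cached object names
    fetches = 0
    for user, obj in requests:
        owner = "site" if shared_cache else user
        cache = caches.setdefault(owner, set())
        if obj not in cache:
            fetches += 1      # cache miss: the object is transferred over the WAN
            cache.add(obj)
    return fetches


# Hypothetical workload: ten users each request the same object twice.
requests = [(f"user{i}", "report.pdf") for i in range(10)] * 2
print(wan_fetches(requests, shared_cache=False))
print(wan_fetches(requests, shared_cache=True))
```

With private caches, each user's second request is served locally, but the object still crosses the WAN once per user. With a shared cache, it crosses the WAN once in total.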

Although the previous two sections focused primarily on latency and bandwidth utilization as application layer performance challenges, the items discussed in the next section, “Network Infrastructure,” also impact application layer performance. The next section focuses primarily on the network infrastructure aspects that impact end-to-end performance, and also discusses how these challenges have a direct impact on L4–7 and end-to-end performance.

Network Infrastructure

The network itself also creates a tremendous number of application performance barriers. In many cases, the challenges found in L4–7 are exacerbated by the challenges that are manifest in the network infrastructure itself. For instance, the impact of application layer latency is amplified when network infrastructure latency is high. The impact of application layer bandwidth inefficiencies is amplified when the amount of available bandwidth in the network is not sufficient. Packet loss has an adverse effect on application performance, generally indirectly, as transport protocols react to loss events to normalize connection throughput around the available network capacity. This section focuses specifically on the issues that are present in the network infrastructure that negatively impact application performance, and also examines how these issues impact the L4–7 challenges discussed previously. These issues include bandwidth constraints, network latency, and loss and congestion.

Bandwidth Constraints

Network bandwidth can create performance constraints related to application performance. Bandwidth found in the LAN has evolved over the years from Fast Ethernet (100 Mbps), to Gigabit Ethernet (1 Gbps), to 10-Gigabit Ethernet (10 Gbps), and eventually 100-Gigabit Ethernet (100 Gbps) will begin to be deployed. Generally speaking, the bandwidth capacity on the LAN is not a limitation from an application performance perspective. WAN bandwidth, on the other hand, is not increasing as rapidly as LAN bandwidth, and the price per megabit of bandwidth is significantly higher than it is on the LAN. This is largely due to the fact that WAN bandwidth is commonly provided as a service from a carrier or service provider, and the connections must traverse a “cloud” of network locations to connect two geographically distant networks. Most carriers have done a substantial amount of research into what levels of oversubscription in the core network are tolerable to their customers, with the exception being dedicated circuits where the bandwidth is guaranteed.

Nevertheless, WAN bandwidth is far more costly than LAN bandwidth, and the most common WAN circuits found today are orders of magnitude smaller in bandwidth than what can be deployed in a LAN. The most common WAN link found in today’s remote office and branch office environment is the T1 (1.544 Mbps), which is roughly 1/64 the capacity of a Fast Ethernet connection, which is itself being phased out in favor of Gigabit Ethernet.

When examining application performance in WAN environments, it is important to note the bandwidth disparity that exists between LAN and WAN environments, as the WAN is what connects the many geographically distributed locations. Such a bandwidth disparity makes environments where nodes are on disparate LANs and separated by a WAN susceptible to a tremendous amount of oversubscription. In these cases, the amount of bandwidth that is able to be used for service is tremendously smaller than the amount of bandwidth capacity found on either of the LAN segments connecting the devices that are attempting to communicate. This problem is exacerbated by the fact that there are commonly tens, hundreds, or even in some cases thousands of nodes that are trying to compete for this precious WAN bandwidth.

Figure 1-3 provides an example of the oversubscription found in a simple WAN environment with two locations, each with multiple nodes attached to the LAN via Fast Ethernet (100 Mbps), contending for available bandwidth on a T1. In this example, the location with the server is also connected to the WAN via a T1, and the potential for exceeding 500:1 oversubscription is realized.

Figure 1-3. Network Oversubscription in a WAN Environment
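The oversubscription ratio is simply the aggregate LAN-side capacity divided by the WAN capacity. In the sketch below, the node count of eight is an assumption chosen for illustration and is not taken from the figure.

```python
FAST_ETHERNET_MBPS = 100.0
T1_MBPS = 1.544


def oversubscription_ratio(nodes: int, lan_mbps: float, wan_mbps: float) -> float:
    """Aggregate LAN capacity of all nodes divided by the shared WAN capacity."""
    return nodes * lan_mbps / wan_mbps


# A single Fast Ethernet node already oversubscribes a T1 by roughly 65:1.
print(f"{oversubscription_ratio(1, FAST_ETHERNET_MBPS, T1_MBPS):.1f}:1")

# Assumed node count of eight: enough to push the ratio past 500:1.
print(f"{oversubscription_ratio(8, FAST_ETHERNET_MBPS, T1_MBPS):.1f}:1")
```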

When oversubscription is encountered, traffic that is competing for available WAN bandwidth must be queued to the extent allowed by the intermediary network devices. The queuing and scheduling disciplines applied can be dictated by a configured policy for control and bandwidth allocation (such as quality of service, or QoS) on the intermediary network elements. In any case, if queues become exhausted, packets must be dropped, as there is no memory available in the oversubscribed network device to store the data for service. Loss of packets will likely impact the application’s ability to achieve higher levels of throughput and, in the case of a connection-oriented transport protocol, likely cause the communicating nodes to adjust their rate of transmission to a level that allows them to use only their fair share of the available bandwidth.

As an example, consider a user transmitting a file by way of the File Transfer Protocol (FTP). The user is attached to a Fast Ethernet LAN, as is the server, but a T1 WAN separates the two locations. The maximum achievable throughput would be limited by the T1, as it is the slowest link in the path of communication. Thus, the application throughput (assuming 100 percent efficiency and no packet loss) would be limited to roughly 1.544 Mbps (megabits per second), or 193 kBps (kilobytes per second). Given that packet loss is inevitable, and no transport protocol is 100 percent efficient, it is likely that the user would see approximately 90 percent of line-rate in terms of application throughput, or roughly 1.39 Mbps (174 kBps).

Taking the example one step further, if two users were performing the same test (FTP transfer over a T1), the router queues (assuming no QoS policy favoring one user over the other) would quickly become exhausted as the connections began discovering available bandwidth. As packets begin to get dropped by the router, the transport protocol would react to the loss and adjust throughput accordingly. The net result is that both nodes would rapidly converge to a point where they were sharing the bandwidth fairly, and connection throughput would oscillate around this point of convergence (roughly 50 percent of 1.39 Mbps, or 695 kbps, which equals 86.8 kBps). This example is simplistic in that it assumes no latency in the WAN and no packet loss other than the drops caused by the exhausted router queues. The impact of transport protocols will be examined as part of the discussions on network latency, loss, and congestion.
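The figures in the FTP example can be reproduced with a short calculation. The 90 percent efficiency factor and the even two-way split come from the text above. The helper name is illustrative, and the small differences from the quoted figures come from rounding 1.3896 Mbps to 1.39 Mbps.

```python
T1_MBPS = 1.544
EFFICIENCY = 0.90   # approximate share of line rate seen by the application


def mbps_to_kilobytes_per_second(mbps: float) -> float:
    """Convert megabits per second to kilobytes per second (1 byte = 8 bits)."""
    return mbps * 1000 / 8


line_rate_kBps = mbps_to_kilobytes_per_second(T1_MBPS)
effective_mbps = T1_MBPS * EFFICIENCY
per_user_mbps = effective_mbps / 2   # two connections sharing the link fairly

print(f"line rate:  {T1_MBPS:.3f} Mbps = {line_rate_kBps:.0f} kBps")
print(f"effective:  {effective_mbps:.2f} Mbps = "
      f"{mbps_to_kilobytes_per_second(effective_mbps):.0f} kBps")
print(f"per user:   {per_user_mbps * 1000:.0f} kbps = "
      f"{mbps_to_kilobytes_per_second(per_user_mbps):.1f} kBps")
```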

Network Latency

The example at the end of the previous section did not take into account network latency. Network latency is the amount of time taken for data to traverse a network in between two communicating devices. Network latency is considered the “silent killer” of application performance, as most network administrators have simply tried (and failed) to circumvent application performance problems by adding bandwidth to the network. Put simply, network latency can have a significant effect on the amount of network capacity that can be consumed by two communicating nodes.

In a campus LAN, latency is generally under 1 ms, meaning the amount of time for data transmitted by a node to be received by the recipient is less than 1 ms. This number may of course increase based on how geographically dispersed the campus LAN is, and also on what levels of utilization and oversubscription are encountered. In a WAN, latency is generally measured in tens or hundreds of milliseconds, much higher than what is found in the LAN. Latency is caused largely by propagation delay: light or electrons in a transmission medium generally travel at about 66 percent of the speed of light in a vacuum (roughly 2 × 10⁸ meters per second). Although this seems extremely fast on the surface, when stretched over a great distance, the latency can be quite noticeable. For instance, in a network that spans 3000 miles (4.8 million meters), the distance between New York and San Francisco, it would take roughly 24.1 ms in one direction for a packet to traverse the network from one end to the other. This of course assumes no serialization delays, loss, or congestion in the network, and that the most direct route is chosen through the network with little to no deviation in distance. It would therefore take at least 48.2 ms for a transmitting node to receive an acknowledgment for a segment that was sent, assuming no time was required for the recipient to process that the data was received.
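The propagation-delay estimate can be checked with a short calculation. The sketch uses the propagation speed and distance quoted above and assumes a perfectly direct path with no serialization, queuing, or processing delay. The round-trip figure is simply twice the one-way figure.

```python
METERS_PER_MILE = 1609.344
PROPAGATION_MPS = 2e8   # about 66 percent of the speed of light in a vacuum


def one_way_delay_ms(distance_miles: float) -> float:
    """Propagation delay over a direct path, ignoring all other sources of delay."""
    distance_m = distance_miles * METERS_PER_MILE
    return distance_m / PROPAGATION_MPS * 1000


one_way = one_way_delay_ms(3000)
round_trip = 2 * one_way
print(f"one way: {one_way:.1f} ms, round trip: {round_trip:.1f} ms")
```

The result is roughly 24 ms in one direction and roughly 48 ms before an acknowledgment can return. Real paths are longer than the straight-line distance, so measured latency is typically higher.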

Figure 1-4 shows how latency in its simplest form can impact the performance of a telephone conversation, which is analogous to two nodes communicating over an internetwork with 1 second of one-way latency.

Figure 1-4. Challenges of Network Latency

The reason network latency has an impact on application performance is two-fold. First, network latency introduces delays that impact mechanisms that control rate of transmission. For instance, connection-oriented, guaranteed-delivery transport protocols such as TCP use a sliding-window mechanism to track what transmitted data has been successfully received by a peer and how much additional data can be sent. As data is received, acknowledgments are generated, which not only notify the sender that the data is received, but also relieve window capacity so more data can be transmitted if available. Transport protocol control messages are exchanged between nodes on the network, so any latency found in the network will also impact the rate at which these control messages can be exchanged. Overall, this impacts the rate at which data can be drained from a sender’s transmission buffer into the network. This has a cascading effect, which causes the second impact on application performance for those applications that rely on transport protocols that are susceptible to performance barriers caused by latency. This second impact is discussed later in this section.

Latency not only delays the receipt of data and the subsequent receipt of the acknowledgment for that data, but also can be so large that it actually renders a node unable to leverage all of the available bandwidth capacity. This problem is encountered when the capacity of the network, which is the amount of data that can be in flight at any one given time, is greater than the sliding-window capacity of the sender. For instance, a DS3 (45 Mbps, or roughly 5.63 MBps) with 100 ms of latency can have up to 563 KB (5.63 MBps × 0.1 seconds) of data in flight and traversing the link at any point in time (assuming the link is 100 percent utilized). This “network capacity” is called the bandwidth delay product (BDP), and is calculated by multiplying the network bandwidth (after conversion to bytes) by the amount of latency. Given that many computers today have only a small amount of memory allocated for each TCP connection (64 KB, unless window scaling is used), if the network BDP exceeds 64 KB, the transmitting node will not be able to successfully “fill the pipe.” This is primarily due to the fact that the window is not relieved quickly enough because of the latency, and the buffer is not big enough to keep the link full. This also assumes that the recipient has large enough buffers on the distant end to allow the sender to continue transmission without delay.
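The BDP calculation, and the throughput ceiling that a small window imposes, can be sketched as follows. The sketch treats the 100 ms figure as the time needed for an acknowledgment to relieve the window (an assumption) and uses 65,535 bytes, the largest window a TCP header can advertise without window scaling.

```python
MAX_UNSCALED_WINDOW_BYTES = 65_535   # limit of the 16-bit TCP window field


def bandwidth_delay_product_bytes(bandwidth_bps: float, latency_s: float) -> float:
    """Bytes that can be in flight on the link at any one time."""
    return bandwidth_bps / 8 * latency_s


def window_limited_throughput_bps(window_bytes: int, round_trip_s: float) -> float:
    """Upper bound on throughput when at most one window can be sent per round trip."""
    return window_bytes * 8 / round_trip_s


ds3_bps = 45_000_000
latency_s = 0.100   # assumption: time for an acknowledgment to relieve the window

bdp = bandwidth_delay_product_bytes(ds3_bps, latency_s)
ceiling = window_limited_throughput_bps(MAX_UNSCALED_WINDOW_BYTES, latency_s)
print(f"BDP: {bdp / 1000:.1f} KB")
print(f"64 KB window ceiling: {ceiling / 1e6:.2f} Mbps "
      f"({ceiling / ds3_bps:.1%} of the DS3)")
```

Under these assumptions a single connection with an unscaled window can use only a little over one-tenth of the DS3, no matter how idle the link is.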

Figure 1-5 shows an example of how latency and small buffers render the transmitter unable to fully capitalize on the available bandwidth capacity.

Figure 1-5. Latency and Small Transmission Buffers

The second impact on application performance is related to application-specific messages that must be exchanged using latency-sensitive transport protocols. Most applications today are very robust and require that a series of messages be exchanged between nodes before any real “work” is done. In many cases, these control messages are exchanged in a serial fashion, where each builds upon the last until ultimately small pieces of usable data are exchanged. This type of behavior, where applications exhibit send-and-wait behavior, is also known as “application ping-pong,” because many messages must be exchanged in sequence and in order before any actual usable data is exchanged. In many cases, these same applications exchange only a small amount of data, and each small piece of data is followed by yet another series of control messages leading up to the next small piece of data.

As this section has shown, latency has an impact on the transmitting node’s transport protocol and its ability to effectively utilize available WAN capacity. Furthermore, applications that exhibit “ping-pong” behavior are impacted even further due to the latency encountered when exchanging application layer messages over the impacted transport protocol. The next section examines the impact of packet loss and congestion on throughput and application performance.

Loss and Congestion

Packet loss and congestion also have a negative impact on application throughput. Although packet loss can be caused by anything from signal degradation to faulty hardware, it is most commonly the result of either of the following two scenarios:

  • Internal oversubscription of allocated connection memory within a transmitting node

  • Oversubscribed intermediary network device queues

Packet loss is not generally a scenario that can be proactively reported to a transmitter; that is, a router that drops a particular packet cannot notify a transmitting node that a specific packet has been dropped due to a congested queue. Packet loss is generally handled reactively by a transmitting node based on the acknowledgments that are received from the recipient or the lack thereof. For instance, in the case of a connection-oriented transport protocol, if 5 KB of data is sent in five unique 1-KB sequences, an acknowledgment of only four of the five segments would cause the transmitter to retransmit the missing segment. This behavior varies among transport protocols and is also dependent upon the extensions to the transport protocol that are being used, but the general behavior remains consistent: an unacknowledged segment is likely a segment that was contained in a packet that was lost, that was not received correctly (due to signal degradation or errors), or that was dropped because the recipient buffer was oversubscribed. Duplicate acknowledgments (double and triple) may also be used to indicate the window position of a segment that was not successfully received, to specify what the transmitter should resend.
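The five-segment example can be modeled with cumulative acknowledgments, in which each acknowledgment names the next byte the receiver expects. This is a deliberately simplified sketch, not a TCP implementation: the function names are illustrative, and it assumes the third 1-KB segment is the one that is lost.

```python
from collections import Counter

SEGMENT_BYTES = 1024


def cumulative_acks(arrivals, segment_bytes=SEGMENT_BYTES):
    """Return the acknowledgment number sent after each arriving segment.

    Each acknowledgment carries the offset of the next in-order byte expected.
    """
    received = set()
    next_expected = 0
    acks = []
    for offset in arrivals:
        received.add(offset)
        while next_expected in received:
            next_expected += segment_bytes
        acks.append(next_expected)
    return acks


def repeated_ack(acks):
    """Return the acknowledgment number seen more than once, or None."""
    repeats = [number for number, count in Counter(acks).items() if count > 1]
    return repeats[0] if repeats else None


# Five segments are sent at offsets 0..4096; the one at offset 2048 is lost.
arrivals = [0, 1024, 3072, 4096]
acks = cumulative_acks(arrivals)
print(acks)                # the receiver keeps asking for offset 2048
print(repeated_ack(acks))  # start of the segment the sender should resend
```

The repeated acknowledgment number tells the sender where the gap begins, which is how duplicate acknowledgments identify the segment to retransmit.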

In the case of TCP, the lack of an acknowledgment causes the transmitter not only to resend, but also to re-evaluate the rate at which it was sending data. A loss of a segment causes TCP to adjust its window capacity to a lower value to cover scenarios where too much data was being sent—either too much data for the network to deliver (due to oversubscription of the network) or too much data for the recipient to receive (due to congested receive buffers). The net effect is that, upon encountering packet loss and subsequently having to retransmit data, the transmitter will decrease the overall throughput of the connection to try and find a rate that will not oversubscribe the network or the recipient. This behavior is called congestion avoidance, as TCP adjusts its rate to match the available capacity in the network and the recipient.

The most common TCP implementation found today, TCP Reno, reduces the congestion window by 50 percent upon encountering packet loss. Although reducing the congestion window by 50 percent does not necessarily correlate to a 50 percent decrease in connection throughput, this reduction can certainly constrain a connection’s ability to saturate the link. During the congestion avoidance phase with TCP Reno, successful transmission (signaled by receipt of acknowledgments) causes the congestion window to increase by roughly one segment size per round trip. The purpose of the congestion window is to allow TCP first to react to packet loss, which ensures throughput is adjusted to available capacity, and second to keep probing for additional available capacity by continually increasing the congestion window as transmissions succeed.

Figure 1-6 shows an example of how packet loss impacts the TCP congestion window, which impacts overall application throughput.

Figure 1-6. Impact of Packet Loss on Throughput

This “backoff” behavior not only helps TCP normalize around the available network capacity and available capacity in the recipient buffer, but also helps to ensure fairness among nodes that are competing for the available WAN bandwidth.
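The halve-on-loss, grow-on-success pattern described above can be sketched with a toy model. This is a simplified illustration that works at round-trip granularity: it assumes the window grows by one segment per loss-free round trip and is halved in any round trip that sees a loss, and it ignores slow start, fast recovery, and timeouts. The starting window and loss schedule are arbitrary.

```python
def congestion_window_history(round_trips, loss_rounds, start_segments=10):
    """Track the congestion window (in segments) across a series of round trips."""
    cwnd = start_segments
    history = []
    for rtt in range(round_trips):
        history.append(cwnd)
        if rtt in loss_rounds:
            cwnd = max(1, cwnd // 2)   # multiplicative decrease on loss
        else:
            cwnd += 1                  # additive increase after a loss-free round trip
    return history


# Arbitrary scenario: a single loss during the fifth round trip (index 4).
history = congestion_window_history(round_trips=8, loss_rounds={4})
print(history)
```

In this scenario the window climbs from 10 to 14 segments, drops to 7 after the loss, and then needs several more round trips to recover. On a high-latency WAN each of those round trips is long, which is why a single loss can depress throughput for a noticeable period.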

Introduction to Cisco WAAS

The previous sections examined the most common causes of application performance challenges found in WAN environments. Although the previous sections certainly did not cover every possible performance barrier, they summarized and briefly examined the largest of these problems. With this fundamental understanding of what contributes to application performance challenges, one might ask, “How are they solved?” Each application performance challenge has an appropriate solution, and these solutions must be implemented in a hierarchical manner with the appropriate solution in the appropriate point within the network, as shown in Table 1-1.

Table 1-1. Solutions to Application Performance Barriers Found in the WAN

Performance Barrier | Technology Solution
Application layer latency | Application layer optimization, including parallelization of serial tasks, prefetching, message prediction, local response handling, and object prepositioning
Application layer bandwidth consumption | Application layer object caching with local delivery at the edge of the network near the requesting user
Network bandwidth consumption and congestion | Compression, data suppression, QoS, application layer object caching
Packet loss sensitivity | Optimized transport protocol implementation with advanced congestion avoidance algorithms, TCP proxy architectures, rate-based transmission protocols, or forward error correction (FEC)
Network throughput | Optimized transport protocol implementation with advanced congestion avoidance algorithms, large transmit and receive buffers, window scaling
Prioritization and resource allocation | End-to-end QoS, including basic classification, deep packet inspection, prequeuing operations, hierarchical queuing and scheduling, post-queuing optimization

Cisco WAAS provides a solution to the performance barriers presented by the WAN by employing a series of application-agnostic optimizations, also known as WAN optimization, in conjunction with a series of application-specific optimizations, also known as application acceleration. WAN optimization refers to employing techniques at the network or transport layer that apply across any application protocol using that network or transport protocol. Application acceleration refers to employing optimizations directly against an application or an application protocol that it uses. WAN optimization has broad applicability, whereas application acceleration has focused applicability.

Cisco WAAS is a solution that is transparent in three domains:

  • Client nodes: No changes are needed on a client node to benefit from the optimization provided by Cisco WAAS.

  • Servers: No changes are needed on a server node to benefit from Cisco WAAS.

  • Network: Cisco WAAS provides the strongest levels of interoperability with technologies deployed in the network, including QoS, NetFlow, IP service-level agreements (IP SLA), access control lists (ACL), firewall policies, and more. Transparency in the network is unique to Cisco WAAS.

This unique combination of three domains of transparency allows Cisco WAAS the least disruptive introduction into the enterprise IT infrastructure of any WAN optimization or application acceleration solution.

The following sections examine the WAN optimization and application acceleration components of Cisco WAAS in detail.

WAN Optimization

Cisco WAAS implements a number of WAN optimization capabilities to help overcome challenges encountered in the WAN. These optimizations include a foundational set of three key elements:

  • Data Redundancy Elimination (DRE): DRE is an advanced compression mechanism that uses disk and memory. DRE minimizes the amount of redundant data found on the WAN by utilizing a loosely synchronized compression history on Wide Area Application Engine (WAE) peers. When redundant data is identified, the WAE will send a signature referencing that data to the peer as opposed to sending the original data, thereby providing potentially very high levels of compression. Data that is nonredundant is added to the compression history on both peers and is sent across the WAN with newly generated signatures.

  • Persistent LZ Compression (PLZ): PLZ is a variant of the Lempel-Ziv (LZ) compression algorithm. The WAE uses a persistent session history to extend the compression capabilities of basic LZ, which helps minimize bandwidth consumption for data traversing the WAN. PLZ is helpful for data that is identified as nonredundant by DRE, and can also compress signatures that are sent by DRE on behalf of redundant chunks of data.

  • Transport Flow Optimization (TFO): TFO is a series of TCP optimizations that helps mitigate performance barriers associated with TCP. TFO includes large initial windows, selective acknowledgment and extensions, window scaling, and an advanced congestion avoidance algorithm that helps “fill the pipe” while preserving fairness among optimized and unoptimized connections.

Determining which optimization to apply is a function of the Application Traffic Policy (ATP), which can be managed discretely per WAAS device or within the Cisco WAAS Central Manager console, and is also dependent upon the optimization negotiation that occurs between WAAS devices during automatic discovery (discussed later in this chapter in “Other Features”).

The data path for optimization within the Cisco WAAS device is the TCP proxy, which is used for each connection that is being optimized by Cisco WAAS. The TCP proxy allows Cisco WAAS to transparently insert itself as a TCP-compliant intermediary. In this way, Cisco WAAS devices can receive and temporarily buffer data sent from a host and locally acknowledge data segments when appropriate. By employing a TCP proxy, Cisco WAAS can also send larger blocks of data to the optimization software components, which permits higher levels of compression to be realized when compared to per-packet architectures in which the compression domain may be limited by the size of the packets being received.

Data in the TCP proxy is then passed through the associated optimization components based on the configured policy, and the optimized traffic is transmitted across the WAN using the optimized TCP implementation. By implementing a TCP proxy, Cisco WAAS can shield communicating nodes from unruly WAN conditions such as packet loss or congestion. Should the loss of a segment be encountered, Cisco WAAS devices can extract the segment from the TCP proxy retransmission queue and retransmit the optimized segment, thereby removing the need for the original transmitting node to retransmit the data that was lost in transit. Transmitting nodes enjoy the benefits of having LAN-like TCP performance, exhibiting the characteristics of minimal packet loss and rapid acknowledgment. By using a TCP proxy, Cisco WAAS allows data to be drained from the transmitting nodes more quickly and nearly eliminates the propagation of performance-limiting challenges encountered in the WAN.

Figure 1-7 shows the Cisco WAAS TCP proxy architecture and how it provides a buffer that prevents WAN performance from impacting transmitting nodes.

Figure 1-7. Cisco WAAS TCP Proxy Architecture
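The local acknowledgment and retransmission behavior described above can be sketched in a few lines of Python. This is a conceptual model only: the class `ProxySender` and its method names are hypothetical, sequence numbers are simple counters rather than TCP byte offsets, and nothing here reflects the actual Cisco WAAS implementation.

```python
class ProxySender:
    """WAN-facing side of a toy TCP proxy (hypothetical class, not WAAS code)."""

    def __init__(self):
        self.next_seq = 0
        self.retransmit_queue = {}   # seq -> optimized segment not yet acked by the peer

    def accept_from_host(self, segment):
        """Buffer a segment from the local host and acknowledge it locally."""
        seq = self.next_seq
        self.retransmit_queue[seq] = segment
        self.next_seq += 1
        return seq                   # the returned value stands in for the local ACK

    def peer_acked(self, seq):
        """The peer device confirmed receipt, so the buffered copy can be released."""
        self.retransmit_queue.pop(seq, None)

    def segment_lost(self, seq):
        """Retransmit from the proxy's own queue; the original host is not involved."""
        return self.retransmit_queue[seq]


proxy = ProxySender()
first = proxy.accept_from_host(b"segment-0")
second = proxy.accept_from_host(b"segment-1")
proxy.peer_acked(first)               # segment 0 crossed the WAN successfully
resent = proxy.segment_lost(second)   # segment 1 was lost on the WAN
```

The transmitting host sees only the local acknowledgment; recovery from WAN loss is handled entirely between the two proxy devices.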

The following sections examine each of these optimizations in more detail.

Data Redundancy Elimination

DRE is an advanced, lossless compression algorithm that leverages both memory (high throughput and high I/O rates) and disk (persistent and large compression history). DRE examines data in-flight for redundant patterns (patterns that have been previously identified). As redundant patterns are identified, they are replaced with a signature that references the redundant pattern within the peer WAAS device compression history. As these signatures are only 5 or 6 bytes in size (depending on the breakpoints identified within the data), and the redundant pattern identified could potentially be tens or hundreds of kilobytes, DRE can provide significant levels of compression for flows containing data that has been previously identified, which helps minimize bandwidth consumption on the WAN.

DRE is bidirectional, meaning patterns identified during one direction of traffic flow can be leveraged for traffic flowing in the opposite direction. DRE is also application agnostic in that patterns identified within a flow for one application can be leveraged to optimize flows for a different application. An example of the bidirectional and application-agnostic characteristics of DRE is as follows. Assume two users are located in the same remote office, which is connected to the corporate campus by way of a T1 WAN. Both the remote office and the corporate campus have Cisco WAAS devices installed. Should the first user download an e-mail containing an attachment, the compression history on each of the WAAS devices in the connection path would be updated with the relevant data patterns contained within the flow. Should the second user have a copy of that file, or a file containing similarities, and upload that file by way of another application such as FTP, the compression history that was previously built from the e-mail transfer could be leveraged to provide tremendous levels of compression for the FTP upload.

Hierarchical Chunking and Pattern Matching

As data from a connection configured for DRE optimization enters the TCP proxy, it is buffered for a short period of time. After data builds up in the buffer, the large block of buffered data is passed to DRE to enter a process known as encoding. Encoding is the process of taking transmitted data in from a transmitting node, eliminating redundancy, updating the compression library with any new data, and transmitting compressed messages.

DRE encoding calculates a message validity signature over the original block of data. This signature is used by the decoding process on the peer WAE to ensure correctness when rebuilding the message based on the signatures contained in the encoded message. A sliding window is used over the block of data to be compressed, which employs a CPU-efficient calculation to identify breakpoints within the data based on the actual data being transferred, which is also known as content-based chunking. Content-based chunking relies on the actual data itself to identify breakpoints within the data and, as such, is less sensitive to slight changes (additions, removals, changes) upon subsequent transfers of the same or similar data. With content-based chunking, if a small amount of data is inserted into a chunk during the next transmission, the chunk boundaries shift with the insertion of data, allowing DRE better isolation of new data, which helps retain high levels of compression as the other chunks remain valid.
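Content-based chunking can be illustrated with a simple rolling calculation. The following Python sketch is an assumption-laden stand-in for whatever calculation DRE actually uses: the window size, the divisor, and the use of a plain byte sum as the rolling value are all chosen for readability, and the function name `content_chunks` is hypothetical.

```python
import random

WINDOW = 16     # bytes covered by the sliding window (assumed value)
DIVISOR = 64    # a breakpoint is declared when the rolling sum % DIVISOR == 0 (assumed)


def content_chunks(data):
    """Split data at breakpoints chosen by the content itself."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte                    # byte entering the window
        if i >= WINDOW:
            rolling -= data[i - WINDOW]    # byte leaving the window
        if rolling % DIVISOR == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])        # trailing data with no breakpoint
    return chunks


rng = random.Random(7)
original = bytes(rng.randrange(256) for _ in range(4096))
edited = original[:2000] + b"NEW DATA" + original[2000:]   # small insertion mid-stream

before = content_chunks(original)
after = content_chunks(edited)
lost = set(before) - set(after)
print(len(before), "chunks;", len(lost), "no longer match after the insertion")
```

Because each breakpoint depends only on the bytes inside the window, an insertion can disturb only the chunks that overlap the inserted region and the window immediately after it. Chunks elsewhere in the stream are rediscovered unchanged, which is what allows previously learned signatures to remain useful.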

Chunks are identified at multiple layers, and aggregate chunks referencing smaller, lower-layer chunks can be identified. Due to this multi-layer approach to chunking, DRE is hierarchical in that one chunk may reference a number of smaller, lower-layer chunks. If higher-layer chunks are identified as redundant, a single signature can be used to reference a larger number of lower-layer chunks in aggregate form. In essence, DRE aggregation provides a multiresolution view of the same data using chunks of different sizes and levels.

Each chunk that is identified is assigned a 5-byte signature. This signature is used as the point of reference on each Cisco WAAS device for that particular chunk of data. As DRE is encoding data, if any chunk of data is found within the DRE compression history, it is considered redundant, and the signature is transmitted instead of the chunk. For instance, if a 32-KB chunk was found to be redundant and was replaced with the associated signature, an effective compression ratio of over 6,000:1 would be realized for that particular chunk of data. If any chunk of data is not found in the DRE compression history, it is added to the local compression history for later use. In this case, both the chunk and the signature are transmitted to allow the peer to update its DRE compression history.
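The arithmetic behind the per-chunk compression ratio is straightforward. The short Python sketch below works through it, assuming a 5-byte signature and treating 32 KB as 32,768 bytes; the helper name `effective_ratio` is hypothetical.

```python
SIGNATURE_BYTES = 5   # signature size stated in the text


def effective_ratio(chunk_bytes, signature_bytes=SIGNATURE_BYTES):
    """Bytes represented per byte sent when a redundant chunk is replaced by its signature."""
    return chunk_bytes / signature_bytes


print(effective_ratio(32 * 1024))   # 6553.6
print(effective_ratio(256))         # 51.2
```

The ratio applies only to chunks already present in the compression history. A nonredundant chunk is sent along with its signature, so its first transfer is slightly larger than the original data.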

Figure 1-8 illustrates the encoding process.

Figure 1-8. Data Redundancy Elimination Encoding

After the encoding process is complete, the encoding WAE transmits the encoded message with the message validity signature that was calculated for the original block of data. Aside from the message validity signature, the encoded message contains signatures for data patterns that are recognized as redundant, and signatures and data for data patterns that are identified as nonredundant.

Message Validation

DRE uses two means of verifying that encoded messages can be properly rebuilt and match the original data being transmitted. As the decoding WAAS device (closest to the recipient) receives an encoded message, it begins to parse the encoded messages to separate signatures that were sent without an associated chunk of data (redundant data that should exist in the compression history) and signatures that were sent with an accompanying chunk of data (nonredundant data that should be added to the compression history).

As the decoding WAE receives an encoded message, each signature identifying redundant data is used to search the DRE compression history and is replaced with the appropriate chunk of data if found. If the signature and associated chunk of data are not found, a synchronous nonacknowledgment is sent to the encoding WAE to request that the signature and chunk of data both be re-sent. This allows the WAE to rebuild the message with the missing chunk while also updating its local compression history. For chunks of data that are sent with an accompanying signature, the local compression history is updated, and the signature is removed from the message so that only the data remains.

Once the decoding WAAS device has rebuilt the original message based on the encoded data and chunks from the compression history, it then generates a new message validity signature. This message validity signature, which is calculated over the rebuilt message, is compared against the original message validity signature generated by the encoding WAAS device. If the two signatures match, the decoding WAAS device knows that the message has been rebuilt correctly, and the message is returned to the TCP proxy for transmission to the recipient. If the two signatures do not match, the decoding WAAS device sends a synchronous nonacknowledgment over the entire message, requesting that the encoding WAAS device send all of the signatures and data chunks associated with the message that failed decoding. This allows the decoding WAAS device to update its compression history and transmit the message as intended.
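The encode, decode, and validate cycle can be sketched end to end in Python. This is a simplified model under several assumptions that are not taken from Cisco WAAS: chunks are fixed-size rather than content-based, the 5-byte chunk signature and the message validity signature are both derived from SHA-1 purely as stand-ins, and a nonacknowledgment is represented by a returned tuple. The names `sig`, `encode`, and `decode` are hypothetical.

```python
import hashlib

CHUNK = 8   # fixed-size chunks keep the sketch short (assumption; DRE chunks by content)


def sig(chunk):
    """5-byte stand-in signature for a chunk (assumption: truncated SHA-1)."""
    return hashlib.sha1(chunk).digest()[:5]


def encode(block, history):
    """Return (message validity signature, items) and update the encoder history."""
    items = []
    for i in range(0, len(block), CHUNK):
        chunk = block[i:i + CHUNK]
        s = sig(chunk)
        if s in history:
            items.append((s, None))      # redundant: signature only
        else:
            history[s] = chunk
            items.append((s, chunk))     # nonredundant: signature plus data
    return hashlib.sha1(block).digest(), items


def decode(message, history):
    """Rebuild the block; return (data, None) or (None, nack)."""
    validity, items = message
    rebuilt = bytearray()
    for s, chunk in items:
        if chunk is None:
            if s not in history:
                return None, ("NACK", s)           # ask for the signature and chunk again
            chunk = history[s]
        else:
            history[s] = chunk                     # learn the new chunk
        rebuilt += chunk
    if hashlib.sha1(bytes(rebuilt)).digest() != validity:
        return None, ("NACK", "entire message")    # rebuilt data failed validation
    return bytes(rebuilt), None


encoder_history, decoder_history = {}, {}
block = b"ABCDEFGH" * 4
data, nack = decode(encode(block, encoder_history), decoder_history)

# A decoder with an empty history cannot resolve a signature-only message.
_, missing = decode(encode(block, encoder_history), {})
```

The second `encode` call produces only signatures because every chunk is already in the encoder history, which is why a decoder without a matching history must respond with a nonacknowledgment.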

Persistent LZ Compression

Cisco WAAS can also employ Persistent LZ Compression, or PLZ, as an optimization based on configured policy. PLZ is a lossless compression algorithm that uses an extended compression history to achieve higher levels of compression than standard LZ variants can achieve. PLZ is helpful for data that has not been identified as redundant by DRE, and can even provide additional compression for DRE-encoded messages, as the DRE signatures are compressible. PLZ is similar in operation to DRE in that it uses a sliding window to analyze data patterns for redundancy, but the compression history is based in memory only and is far smaller than that found in DRE.
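The benefit of a compression history that persists across messages can be demonstrated with the DEFLATE implementation in the Python standard library. This is only an analogy for PLZ: `zlib` is used as a stand-in LZ-family compressor, and the sample messages and host name are invented for illustration.

```python
import zlib

messages = [
    b"GET /reports/q1.xls HTTP/1.1\r\nHost: files.example.com\r\n\r\n",
    b"GET /reports/q2.xls HTTP/1.1\r\nHost: files.example.com\r\n\r\n",
]

# Baseline: every message is compressed on its own, with no shared history.
independent = [len(zlib.compress(m)) for m in messages]

# Persistent history: one compressor and one decompressor live for the whole session.
compressor = zlib.compressobj()
decompressor = zlib.decompressobj()
persistent = []
for m in messages:
    # Z_SYNC_FLUSH emits everything buffered so far without resetting the history.
    out = compressor.compress(m) + compressor.flush(zlib.Z_SYNC_FLUSH)
    persistent.append(len(out))
    assert decompressor.decompress(out) == m   # the peer can decode each message as it arrives

print("independent:", independent)
print("persistent: ", persistent)
```

The second message is nearly identical to the first, so with a persistent history it can be expressed mostly as references to data already seen, whereas the independent compressor must encode it from scratch.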

Transport Flow Optimization

Cisco WAAS TFO is a series of optimizations that is leveraged for connections that are configured for optimization. By employing TFO, communicating nodes are shielded from performance-limiting WAN conditions such as packet loss and latency. Furthermore, TFO allows nodes to more efficiently use available network capacity and minimize the impact of retransmission. TFO provides the following suite of optimizations:

  • Large initial windows: Large initial windows, found in RFC 3390, allow TFO to mitigate the latency associated with connection setup, as the initial congestion window is increased. This allows the connection to more quickly identify the bandwidth ceiling during slow-start and enter congestion avoidance at a more rapid pace.

  • Selective acknowledgment (SACK) and extensions: SACK, found in RFCs 2018 and 2883, allows a recipient node to explicitly notify the transmitting node what ranges of data have been received within the current window. With SACK, if a block of data goes unacknowledged, the transmitting node need only retransmit the block of data that was not acknowledged. SACK helps minimize the bandwidth consumed upon retransmission of a lost segment.

  • Window scaling: Window scaling, found in RFC 1323, allows communicating nodes to have an enlarged window. This allows for larger amounts of data to be outstanding and unacknowledged in the network at any given time, which allows end nodes to better utilize available WAN bandwidth.

  • Large buffers: Large TCP buffers on the WAAS device provide the memory capacity necessary to keep WAN connections with a high bandwidth-delay product (BDP) full of data. This helps mitigate the negative impact of high-bandwidth networks that also have high latency.

  • Advanced congestion avoidance: Cisco WAAS employs an advanced congestion avoidance algorithm that provides bandwidth scalability (fill the pipe, used in conjunction with window scaling and large buffers) without compromising on cross-connection fairness. Unlike standard TCP implementations that use linear congestion avoidance, TFO leverages the history of packet loss for each connection to dynamically adjust the rate of congestion window increase when loss is not being encountered. TFO also uses a less-conservative backoff algorithm should packet loss be encountered (decreasing the congestion window by 12.5 percent as opposed to 50 percent), which allows the connection to retain higher levels of throughput in the presence of packet loss. Cisco WAAS TFO is based on Binary Increase Congestion (BIC) TCP.
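The relationship among window size, latency, and achievable throughput that motivates window scaling and large buffers can be worked through numerically. The Python sketch below uses illustrative values (a 45-Mbps link with a 100-ms round-trip time) that are assumptions rather than figures from this chapter; the function names are hypothetical. Integer arithmetic is used so the results are exact.

```python
MAX_UNSCALED_WINDOW = 65_535   # largest window a 16-bit TCP window field can advertise
MAX_SHIFT = 14                 # largest window scale shift permitted by RFC 1323


def bdp_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bps * rtt_ms // 8000


def max_throughput_bps(window_bytes, rtt_ms):
    """Upper bound on throughput when at most one window is sent per round trip."""
    return window_bytes * 8000 // rtt_ms


def window_scale_shift(target_window_bytes):
    """Smallest shift that lets the advertised window cover the target size."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < target_window_bytes and shift < MAX_SHIFT:
        shift += 1
    return shift


bdp = bdp_bytes(45_000_000, 100)
print(bdp)                                           # 562500
print(max_throughput_bps(MAX_UNSCALED_WINDOW, 100))  # 5242800
print(window_scale_shift(bdp))                       # 4
```

Under these assumptions, a connection limited to a 65,535-byte window can use only about 5.2 Mbps of the 45-Mbps link, no matter how much bandwidth is available. A scale shift of 4, together with buffers of at least 562,500 bytes, is needed before the connection can fill the link.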

Figure 1-9 shows a comparison between typical TCP implementations and TFO. Notice how TFO is more quickly able to realize available network capacity and begin leveraging it. When congestion is encountered, TFO is able to more intelligently adjust its throughput to accommodate other connections while preserving bandwidth scalability.

Figure 1-9. Comparison of TCP Reno and Cisco WAAS TFO
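The effect of the less-conservative backoff can be estimated with a short calculation. The Python sketch below compares a 50 percent reduction with a 12.5 percent reduction, assuming for simplicity that the window then grows linearly by one segment per round trip in both cases. That assumption understates TFO, which adjusts its rate of increase dynamically; the function name `rounds_to_recover` is hypothetical.

```python
def rounds_to_recover(cwnd_at_loss, backoff_fraction, increase_per_round=1):
    """Round trips needed to regain the pre-loss window after a single loss event."""
    cwnd = cwnd_at_loss * (1 - backoff_fraction)   # window immediately after backoff
    rounds = 0
    while cwnd < cwnd_at_loss:
        cwnd += increase_per_round                 # simplified linear growth (assumption)
        rounds += 1
    return rounds


print(rounds_to_recover(80, 0.5))     # 40
print(rounds_to_recover(80, 0.125))   # 10
```

With an 80-segment window at the time of loss, the smaller reduction leaves the connection at 70 segments rather than 40, so far less capacity goes unused while the window recovers.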

Whereas this section focused on the WAN optimization components of Cisco WAAS, the next section focuses on the application acceleration components of Cisco WAAS.

Application Acceleration

Application acceleration refers to employing optimizations directly against applications or the application protocols that they use. Whereas WAN optimization refers to techniques employed generally against a network layer or transport layer protocol (Cisco WAAS employs them against the transport layer), application acceleration is employed at higher layers. The optimizations found in application acceleration are in many ways common across applications and application protocols, but because they must be specific to each application or application protocol, these optimizations may be implemented differently.

Ensuring application correctness (don’t break the application), data integrity (don’t corrupt the data), and data coherency (don’t serve stale data) is of paramount importance in any application acceleration solution. With WAN optimization components, ensuring these items is generally easy, as the optimizations employed are done against a lower layer with well-defined semantics for operation. With application acceleration, however, ensuring these items is more difficult, as applications and application protocols are more diverse, complex, and finicky with respect to how they must be handled.

Table 1-2 lists the high-level application acceleration techniques that can be found within Cisco WAAS. Note that this list is not all-inclusive, and focuses on the techniques that are commonly applied to accelerated applications, but others certainly exist.

Table 1-2. Cisco WAAS Application Acceleration Techniques

Acceleration Technique | Functional Description and Value
Object caching | Object caching allows Cisco WAAS to, when safe, store copies of previously accessed objects (files, other content) to be reused by subsequent users. This only occurs when the application state permits caching, and cached objects are served to users only if application state requirements are met and the object has been validated against the origin server as having not changed. Caching mitigates latency (objects served locally), saves WAN bandwidth (does not have to be transferred over the WAN), minimizes server workload (does not have to be transferred from the server), and improves application performance.
Local response handling | By employing stateful optimization, Cisco WAAS can locally respond to certain message types on behalf of the server. This only occurs when the application state permits such behavior, and can help minimize the perceived latency as fewer messages are required to traverse the WAN. As with object caching, this helps reduce the workload encountered on the server while also improving application performance.
Prepositioning | Prepositioning is used to allow an administrator to specify what content should be proactively copied to a remote Cisco WAAS object cache. This helps improve first-user performance by better ensuring a “cache hit,” and can also be used to populate the DRE compression history. Population of the DRE compression history is helpful in environments where the object being prepositioned may be written back from the remote location with some changes applied, which is common in software development and CAD/CAM environments.
Read-ahead | Read-ahead allows Cisco WAAS to, when safe, increase read request sizes on behalf of users, or initiate subsequent read requests on behalf of users, to have the origin server transmit data ahead of the user request. This allows the data to reach the edge device in a more timely fashion, which in turn means the requesting user is served more quickly. Read-ahead is helpful in cache-miss scenarios, or in cases where the object is not fully cached. Read-ahead minimizes the WAN latency penalty by prefetching information.
Write-behind | Write-behind allows Cisco WAAS to, when safe, locally acknowledge write requests from a user application. This allows Cisco WAAS to streamline the transfer of data over the WAN, minimizing the impact of WAN latency.
Multiplexing | Multiplexing refers to a group of optimizations that can be applied independently of one another or in tandem. These include fast connection setup, TCP connection reuse, and message parallelization. Multiplexing helps overcome WAN latency associated with TCP connections or application layer messages, thereby improving performance.

The application of each of these optimizations is determined dynamically for each connection or user session. Because Cisco WAAS is strategically placed in between two communicating nodes, it is in a unique position not only to examine application messages being exchanged to determine what the state of the connection or session is, but also to leverage state messages being exchanged between communicating nodes to determine what level of optimization can safely be applied.

As of Cisco WAAS v4.0.13, Cisco WAAS employs these optimizations against the CIFS protocol and certain MS-RPC operations. WAAS also provides a local print services infrastructure for the remote office, which helps keep print traffic off of the WAN if the local file and print server have been consolidated. Releases beyond v4.0.13 will add additional application protocols to this list.

The following sections provide an example of each of the application acceleration techniques provided by Cisco WAAS. It is important to note that Cisco WAAS employs application layer acceleration capabilities only when safe to do so. The determination on “safety” is made based on state information and metadata exchanged between the two communicating nodes. In any circumstance where it is not safe to perform an optimization, Cisco WAAS dynamically adjusts its level of acceleration to ensure compliance with protocol semantics, data integrity, and data coherency.

Object and Metadata Caching

Object and metadata caching are techniques employed by Cisco WAAS to allow an edge device to retain a history of previously accessed objects and their metadata. Unlike DRE, which maintains a history of previously seen data on the network (with no correlation to the upper-layer application), object and metadata caching are specific to the application being used, and the cache is built with pieces of an object or the entire object, along with its associated metadata. With caching, if a user attempts to access an object, directory listing, or file attributes that are stored in the cache, such as a file previously accessed from a particular file server, the file can be safely served from the edge device, assuming the user has successfully completed authorization and authentication and the object has been validated (verified that it has not changed). Caching requires that the origin server notify the client that caching is permitted through the use of opportunistic locks or other state propagation mechanisms.

Object caching provides numerous benefits, including:

  • LAN-like access to cached objects: Objects that can be safely served out of cache are served at LAN speeds by the WAE adjacent to the requester.

  • WAN bandwidth savings: Object caching minimizes the transfer of redundant objects over the network, thereby minimizing overall WAN bandwidth consumption.

  • Server offload: Object caching minimizes the amount of workload that must be managed by the server being accessed. By safely offloading work from the server, IT organizations may be in a position to minimize the number of servers necessary to support an application.

Figure 1-10 shows an example of object caching and a cache hit as compared to a cache miss.

Figure 1-10. Examining Cache Hit and Cache Miss Scenarios

As shown in Figure 1-10, when a cache hit occurs, object transfers are done on the LAN adjacent to the requesting node, which minimizes WAN bandwidth consumption and improves performance. When a cache miss occurs, the object is fetched from the origin server in an optimized fashion and, if applicable, the data read from the origin server is used to build the cache to improve performance for subsequent users. This is often referred to as the “first-user penalty” for caching.
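The hit, miss, and validation logic can be modeled with a small Python sketch. The classes `Origin` and `EdgeCache` are hypothetical, and a simple version counter stands in for the protocol-specific validation and state mechanisms (such as opportunistic locks) that determine when caching is actually safe.

```python
class Origin:
    """Toy origin server: path -> (version, data)."""

    def __init__(self):
        self.files = {}

    def version(self, path):
        return self.files[path][0]

    def read(self, path):
        return self.files[path]


class EdgeCache:
    """Toy edge object cache that validates against the origin before serving."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}
        self.wan_bytes = 0   # object bytes pulled across the WAN

    def read(self, path):
        current = self.origin.version(path)      # small validation exchange
        cached = self.store.get(path)
        if cached is not None and cached[0] == current:
            return cached[1], "hit"              # served locally at LAN speed
        version, data = self.origin.read(path)   # full transfer over the WAN
        self.wan_bytes += len(data)
        self.store[path] = (version, data)       # build the cache for later users
        return data, "miss"


origin = Origin()
origin.files["/docs/plan.doc"] = (1, b"x" * 1000)
edge = EdgeCache(origin)

_, first = edge.read("/docs/plan.doc")    # first user pays the transfer cost
_, second = edge.read("/docs/plan.doc")   # second user is served from cache
origin.files["/docs/plan.doc"] = (2, b"y" * 1200)
_, third = edge.read("/docs/plan.doc")    # changed on the origin: validation fails
```

Validation still crosses the WAN on every request in this model, but it carries only a version check rather than the object itself, which is why a hit consumes almost no WAN bandwidth.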


Prepositioning

Prepositioning is a function by which an administrator can specify which objects should be proactively placed in the cache of a specific edge device or group of edge devices. By using prepositioning, an administrator can ensure high-performance access to an object for the first requesting user (assuming caching is safe to be used for the user’s session), eliminating the first-user penalty. Prepositioning is helpful in environments where large object transfers are necessary. For instance, CAD/CAM, medical imaging, software distribution, and software development all require the movement of large files, and prepositioning can help improve performance for remote users while also offloading the WAN and servers in the data center. Prepositioning can also be used as a means of prepopulating the DRE compression history.


Read-Ahead

Read-ahead is a technique that is useful both in application scenarios where caching can be applied and in scenarios where caching cannot be applied. With read-ahead, a Cisco WAAS device may, when applicable, increase the size of the application layer read request on behalf of the user, or generate additional read requests on behalf of the user. The goal of read-ahead is twofold:

  • When used in a cache-miss scenario, provide near-LAN response times to overcome the first-user penalty. Read-ahead, in this scenario, allows the WAE to begin immediate and aggressive population of the edge cache.

  • When used in a scenario where caching is not permitted, aggressively fetch data on behalf of the user to mitigate network latency. Read-ahead, in this scenario, is not used to populate a cache with the object, but rather to proactively fetch data that a user may request. Data prefetched in this manner is only briefly cached to satisfy immediate read requests that are for blocks of data that have been read ahead.

Figure 1-11 shows an example of how read-ahead can allow data to begin transmission more quickly over the WAN, thereby minimizing the performance impact of WAN latency.

Figure 1-11. Read-Ahead in Caching and Noncaching Scenarios
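The latency savings from read-ahead can be approximated with simple arithmetic. The Python sketch below counts WAN round trips for a sequential file read, with and without read-ahead. The file size, read size, read-ahead factor, and 80-ms round-trip time are illustrative assumptions, and the model ignores transfer time and considers only the number of round trips; the function name `wan_round_trips` is hypothetical.

```python
import math


def wan_round_trips(file_bytes, client_read_bytes, read_ahead_factor=1):
    """WAN round trips needed when each WAN request covers read_ahead_factor client reads."""
    client_reads = math.ceil(file_bytes / client_read_bytes)
    return math.ceil(client_reads / read_ahead_factor)


RTT_MS = 80                 # assumed WAN round-trip time
FILE_BYTES = 1024 * 1024    # 1-MB file
READ_BYTES = 32 * 1024      # client issues 32-KB reads

without = wan_round_trips(FILE_BYTES, READ_BYTES)
with_read_ahead = wan_round_trips(FILE_BYTES, READ_BYTES, read_ahead_factor=8)

print(without, "round trips,", without * RTT_MS, "ms of WAN latency")
print(with_read_ahead, "round trips,", with_read_ahead * RTT_MS, "ms of WAN latency")
```

Under these assumptions, the serial case spends 32 round trips (2560 ms) waiting on the WAN, whereas fetching eight reads per request reduces this to 4 round trips (320 ms).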


Write-Behind

Write-behind is an optimization that is complementary to read-ahead optimization. Whereas read-ahead focuses on getting the information to the edge more quickly, write-behind focuses on getting the information to the core more quickly, at least from the perspective of the transmitting node. In reality, write-behind is a technique by which a Cisco WAAS device can positively acknowledge receipt of an application layer write request, when safe, to allow the transmitting node to continue to write data. This optimization is commonly employed against application protocols that exhibit high degrees of ping-pong, especially as data is written back to the origin server.

As an optimization that positively acknowledges write requests that have not yet been received by the server being written to, write-behind is only employed against protocols that support information recovery in the event of disconnection (for instance, through temporary files) and is only employed when safe to do so. For applications that do not support information recovery in the event of loss, this optimization cannot be safely applied.
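The distinction between a locally acknowledged write and a write that must wait for the server can be sketched as follows. The class `WriteBehind`, the `safe` flag, and the callback are hypothetical; in practice, the safety decision is derived from protocol state and is not a simple Boolean supplied by the caller.

```python
from collections import deque


class WriteBehind:
    """Toy write-behind queue (hypothetical; not the WAAS implementation)."""

    def __init__(self, send_to_server):
        self.send_to_server = send_to_server   # callable(offset, data), crosses the WAN
        self.pending = deque()                 # writes acknowledged locally, not yet sent

    def write(self, offset, data, safe=True):
        if not safe:
            self.send_to_server(offset, data)  # pass through and wait for the server
            return "server-ack"
        self.pending.append((offset, data))    # acknowledge now, deliver later
        return "local-ack"

    def drain(self):
        """Deliver queued writes to the server in the order they were received."""
        while self.pending:
            self.send_to_server(*self.pending.popleft())


server_log = []
wb = WriteBehind(lambda offset, data: server_log.append((offset, data)))

r1 = wb.write(0, b"abc")               # client may continue writing immediately
queued_before_drain = list(server_log)
wb.drain()
r2 = wb.write(3, b"def", safe=False)   # not safe: the server must acknowledge
```

Between the local acknowledgment and the drain, the data exists only on the edge device, which is why the technique is limited to protocols that can recover from a disconnection.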


Multiplexing

Multiplexing is a term that refers to any process where multiple message signals are combined into a single message signal. Multiplexing, as it relates to Cisco WAAS, refers to the following optimizations:

  • TCP connection reuse: By reusing existing established connections rather than creating new connections, TCP setup latency can be mitigated, thereby improving performance. TCP connection reuse is applied only on subsequent connections between the same client and server pair over the same destination port.

  • Message parallelization: For protocols that support batch requests, Cisco WAAS can parallelize otherwise serial tasks into batch requests. This helps minimize the latency penalty, as it is amortized across a series of batched messages as opposed to being experienced on a per-message basis. For protocols that do not support batch requests, Cisco WAAS may “predict” subsequent messages and presubmit those messages on behalf of the user in an attempt to mitigate latency.
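The matching rule for TCP connection reuse (same client, same server, same destination port) can be expressed as a small connection pool. The Python sketch below is a conceptual model with a hypothetical `ConnectionPool` class; connections are represented by dictionaries, the handshake is only counted rather than performed, and the addresses are documentation-range examples.

```python
class ConnectionPool:
    """Toy pool that reuses idle connections keyed by (client, server, port)."""

    def __init__(self):
        self.idle = {}        # (client, server, port) -> list of idle connections
        self.handshakes = 0   # number of new TCP setups that were required

    def acquire(self, client, server, port):
        key = (client, server, port)
        if self.idle.get(key):
            return self.idle[key].pop()   # reuse: no setup latency incurred
        self.handshakes += 1              # a real pool would open a connection here
        return {"key": key}

    def release(self, conn):
        self.idle.setdefault(conn["key"], []).append(conn)


pool = ConnectionPool()
c1 = pool.acquire("192.0.2.10", "198.51.100.5", 445)
pool.release(c1)
c2 = pool.acquire("192.0.2.10", "198.51.100.5", 445)   # same pair and port: reused
c3 = pool.acquire("192.0.2.10", "198.51.100.5", 80)    # different port: new connection
print(pool.handshakes)  # 2
```

Only the second request to port 445 avoids a handshake; the request to port 80 does not match an idle connection and therefore pays the full setup cost.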

This section focused on the application-specific acceleration components of Cisco WAAS, including caching, prepositioning, read-ahead, write-behind, and multiplexing. The next section focuses on the integration aspects of Cisco WAAS as it relates to the ecosystem that is the enterprise IT infrastructure, as well as additional value-added features that are part of the Cisco WAAS solution.

Other Features

Cisco WAAS is a unique application acceleration and WAN optimization solution in that it is the only solution that not only provides the most seamless interoperability with existing network features, but also integrates physically into the Cisco Integrated Services Router (ISR). With the Cisco ISR, customers can deploy enterprise edge connectivity to the WAN, switching, wireless, voice, data, WAN optimization, and security in a single platform for the branch office. (The router modules and the appliance platforms are examined in the next chapter.) The following are some of the additional features that are provided with the Cisco WAAS solution:

  • Network transparencyCisco WAAS is fundamentally transparent in three domains—client transparency, server transparency (no software installation or configuration changes required on clients or servers), and network transparency. Network transparency allows Cisco WAAS to interoperate with existing networking and security functions such as firewall policies, optimized routing, QoS, and end-to-end performance monitoring.

  • Enterprise-class scalabilityCisco WAAS can scale to tens of gigabits of optimized throughput and tens of millions of optimized TCP connections using the Cisco Application Control Engine (ACE), which is an external load-balancer and is discussed in detail in Chapter 6, “Data Center Network Integration”. Without external load balancing, Cisco WAAS can scale to tens of gigabits of optimized throughput and over one million TCP connections using the Web Cache Coordination Protocol version 2 (WCCPv2), which is discussed in both Chapter 4, “Network Integration and Interception,” and Chapter 6.

  • Trusted WAN optimizationCisco WAAS is a trusted WAN optimization and application acceleration solution in that it integrates seamlessly with many existing security infrastructure components such as firewalls, intrusion detection systems (IDS), intrusion prevention systems (IPS), and virtual private network (VPN) solutions. Integration work has been done on not only Cisco WAAS but adjacent Cisco security products to ensure that security posture is not compromised when Cisco WAAS is deployed. Cisco WAAS also supports disk encryption (using AES-256 encryption) with centrally managed keys. This mitigates the risk of data loss or data leakage if a WAE is compromised or stolen.

  • Automatic discovery: Cisco WAAS devices can automatically discover one another during the establishment of a TCP connection and negotiate a policy to employ. This eliminates the need to configure complex and tedious overlay networks. Without overlay topologies to maintain, administrators do not have to manage an optimization domain and topology separate from the routing domain.

  • Scalable, secure central management: Cisco WAAS devices are managed and monitored by the Cisco WAAS Central Manager. The Central Manager can be deployed in a highly available fashion using two Cisco WAAS devices. The Central Manager is secure in that any exchange of data between the Central Manager and a managed Cisco WAAS device is done using SSL, and management access to the Central Manager is encrypted using HTTPS for web browser access or SSH for console access (Telnet is also available). The Central Manager provides a simplified means of configuring a system of devices through device groups, and provides role-based access control (RBAC) to enable segregation of management and monitoring. The Central Manager is discussed in more detail in Chapter 7, “System and Device Management.”
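
The automatic discovery behavior described in the list above can be pictured with a minimal sketch. It assumes, purely for illustration, that each device advertises a set of optimizations while the TCP connection is being established and that the negotiated policy is whatever both sides offered. The `Device` class, the `negotiate` function, and the optimization labels are hypothetical and do not represent the actual Cisco WAAS implementation or wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for an optimization device on the connection path."""
    name: str
    offered: frozenset  # placeholder labels for the optimizations this device offers


def negotiate(first: Device, second: Device) -> frozenset:
    """Return the optimizations both devices offered.

    Assumed rule for this sketch only: the policy is the set intersection
    of what each device advertised during connection setup.
    """
    return first.offered & second.offered


# Two devices that discover each other on the same TCP connection.
branch = Device("branch-device", frozenset({"transport", "dedup", "compression"}))
core = Device("core-device", frozenset({"transport", "compression"}))

policy = negotiate(branch, core)
print(sorted(policy))  # prints ['compression', 'transport']
```

Because the policy is derived per connection from what the two endpoints advertise, no peer list or overlay topology has to be configured in advance, which is the property the automatic discovery bullet emphasizes.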


IT organizations are challenged with the need to provide high levels of application performance for an increasingly distributed workforce. At the same time, they face the opposing challenge of consolidating costly infrastructure to contain capital and operational expenditures. Organizations find themselves caught between two conflicting realities: distributing costly infrastructure to remote offices to meet the performance requirements of an increasingly distributed workforce, and consolidating that infrastructure out of those same remote offices to control capital and operational costs and complexity. Cisco WAAS employs a series of WAN optimization and application acceleration techniques to overcome the fundamental performance limitations of WAN environments, allowing remote users to enjoy near-LAN performance when working with centralized application infrastructure and content.

