Chapter 10. IP Fragmentation and Reassembly

Introduction

In this chapter we describe the IP fragmentation and reassembly processing that we postponed in Chapter 8.

An important capability of IP is that it can fragment a packet that is too large to be transmitted by the selected hardware interface. The oversized packet is split into two or more IP fragments, each of which is small enough to be transmitted on the selected network. Fragments may be further split by routers farther along the path to the final destination. Thus, at the destination host, an IP datagram can be contained in a single IP packet or, if it was fragmented in transit, it can arrive in multiple IP packets. Because individual fragments may take different paths to the destination host, only the destination host has a chance to see all the fragments. Thus only the destination host can reassemble the fragments into a complete datagram to be delivered to the appropriate transport protocol.

Figure 8.5 shows that 0.3% (72,786/27,881,978) of the packets received were fragments and 0.12% (260,484/(29,447,726 - 796,084)) of the datagrams sent were fragmented. On world.std.com, 9.5% of the packets received were fragments. world has more NFS activity, which is a common source of IP fragmentation.

Three fields in the IP header implement fragmentation and reassembly: the identification field (ip_id), the flags field (the 3 high-order bits of ip_off), and the offset field (the 13 low-order bits of ip_off). The flags field is composed of three 1-bit flags. Bit 0 is reserved and must be 0, bit 1 is the “don’t fragment” (DF) flag, and bit 2 is the “more fragments” (MF) flag. In Net/3, the flag and offset fields are combined and accessed by ip_off, as shown in Figure 10.1.

Figure 10.1. ip_off controls fragmentation of an IP packet.

Net/3 accesses the DF and MF bits by masking ip_off with IP_DF and IP_MF respectively. An IP implementation must allow an application to request that the DF bit be set in an outgoing datagram.

Net/3 does not provide application-level control over the DF bit when using UDP or TCP.

A process may construct and send its own IP headers with the raw IP interface (Chapter 32). The DF bit may be set by the transport layers directly such as when TCP performs path MTU discovery.

The remaining 13 bits of ip_off specify the fragment’s position within the original datagram, measured in 8-byte units. Accordingly, every fragment except the last must contain a multiple of 8 bytes of data so that the following fragment starts on an 8-byte boundary. Figure 10.2 illustrates the relationship between the byte offset within the original datagram and the fragment offset (low-order 13 bits of ip_off) in the fragment’s IP header.

Figure 10.2. Fragmentation of a 65535-byte datagram.

Figure 10.2 shows a maximally sized IP datagram divided into 8190 fragments. Each fragment contains 8 bytes except the last, which contains only 3 bytes. We also show the MF bit set in all the fragments except the last. This is an unrealistic example, but it illustrates several implementation issues.

The numbers above the original datagram are the byte offsets for the data portion of the datagram. The fragment offset (ip_off) is computed from the start of the data portion of the datagram. It is impossible for a fragment to include a byte beyond offset 65514 since the reassembled datagram would be larger than 65535 bytes, the maximum value of the ip_len field. This restricts the maximum value of ip_off to 8189 (8189 × 8 = 65512), which leaves room for 3 bytes in the last fragment. If IP options are present, the offset must be smaller still.
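The field layout and the size limit described above can be sketched in a few lines of C. This is a minimal illustration, not Net/3 code: the mask values follow from the bit positions given earlier, while the names IP_OFFMASK and IP_MAXPACKET, the frag_fields structure, and both helper functions are assumptions introduced for the example. ip_off is taken in host byte order.

```c
#include <assert.h>
#include <stdint.h>

#define IP_DF        0x4000   /* "don't fragment" flag (bit 1 of the flags) */
#define IP_MF        0x2000   /* "more fragments" flag (bit 2 of the flags) */
#define IP_OFFMASK   0x1fff   /* low-order 13 bits: offset in 8-byte units */
#define IP_MAXPACKET 65535    /* largest value that ip_len can hold */

/* Hypothetical helper type: the three pieces packed into ip_off. */
struct frag_fields {
    int      df;        /* nonzero if DF is set */
    int      mf;        /* nonzero if MF is set */
    uint32_t byte_off;  /* offset of the fragment's data, in bytes */
};

/* Split a host-order ip_off value into flags and a byte offset. */
struct frag_fields decode_ip_off(uint16_t ip_off)
{
    struct frag_fields f;

    f.df = (ip_off & IP_DF) != 0;
    f.mf = (ip_off & IP_MF) != 0;
    f.byte_off = (uint32_t)(ip_off & IP_OFFMASK) * 8;
    return f;
}

/*
 * Return 1 if a fragment with this offset and data length could belong
 * to a datagram whose header is hlen bytes and whose total length still
 * fits in ip_len, 0 otherwise.
 */
int frag_fits(uint16_t ip_off, uint32_t hlen, uint32_t datalen)
{
    uint32_t end = (uint32_t)(ip_off & IP_OFFMASK) * 8 + datalen;

    return hlen + end <= IP_MAXPACKET;
}
```

With a 20-byte header, frag_fits(8189, 20, 3) accepts the last fragment of Figure 10.2, while a fourth data byte or a 24-byte header (one with options) pushes the datagram past 65535 bytes and is rejected.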

Because an IP internet is connectionless, fragments from one datagram may be interleaved with those from another at the destination. ip_id uniquely identifies the fragments of a particular datagram. The source system sets ip_id in each datagram to a unique value for all datagrams using the same source (ip_src), destination (ip_dst), and protocol (ip_p) values for the lifetime of the datagram on the internet.

To summarize, ip_id identifies the fragments of a particular datagram, ip_off positions the fragment within the original datagram, and the MF bit marks every fragment except the last.

Code Introduction

The reassembly data structures appear in a single header. Reassembly and fragmentation processing is found in two C files. The three files are listed in Figure 10.3.

Figure 10.3. Files discussed in this chapter.

File                 Description
-------------------  --------------------------
netinet/ip_var.h     reassembly data structures
netinet/ip_output.c  fragmentation code
netinet/ip_input.c   reassembly code

Global Variables

Only one global variable, ipq, is described in this chapter.

Figure 10.4. Global variable introduced in this chapter.

Variable  Type          Description
--------  ------------  ---------------
ipq       struct ipq *  reassembly list

Statistics

The statistics modified by the fragmentation and reassembly code are shown in Figure 10.5. They are a subset of the statistics in the ipstat structure described in Figure 8.4.

Figure 10.5. Statistics collected in this chapter.

ipstat member   Description
--------------  ---------------------------------------------------------------------------------------
ips_cantfrag    #datagrams not sent because fragmentation was required but was prohibited by the DF bit
ips_odropped    #output packets dropped because of a memory shortage
ips_ofragments  #fragments transmitted
ips_fragmented  #packets fragmented for output

Fragmentation

We now return to ip_output and describe the fragmentation code. Recall from Figure 8.25 that if a packet fits within the MTU of the selected outgoing interface, it is transmitted in a single link-level frame. Otherwise the packet must be fragmented and transmitted in multiple frames. A packet may be a complete datagram or it may itself be a fragment that was created by a previous system. We describe the fragmentation code in three parts:

  • determine fragment size (Figure 10.6),

    Figure 10.6. ip_output function: determine fragment size.

    ----------------------------------------------------------------------- ip_output.c
    253     /*
    254      * Too large for interface; fragment if possible.
    255      * Must be able to put at least 8 bytes per fragment.
    256      */
    257     if (ip->ip_off & IP_DF) {
    258         error = EMSGSIZE;
    259         ipstat.ips_cantfrag++;
    260         goto bad;
    261     }
    262     len = (ifp->if_mtu - hlen) & ~7;
    263     if (len < 8) {
    264         error = EMSGSIZE;
    265         goto bad;
    266     }
    ----------------------------------------------------------------------- ip_output.c
  • construct fragment list (Figure 10.7), and

    Figure 10.7. ip_output function: construct fragment list.

    ----------------------------------------------------------------------- ip_output.c
    267     {
    
    268         int     mhlen, firstlen = len;
    269         struct mbuf **mnext = &m->m_nextpkt;
    
    270         /*
    271          * Loop through length of segment after first fragment,
    272          * make new header and copy data of each part and link onto chain.
    273          */
    274         m0 = m;
    275         mhlen = sizeof(struct ip);
    276         for (off = hlen + len; off < (u_short) ip->ip_len; off += len) {
    277             MGETHDR(m, M_DONTWAIT, MT_HEADER);
    278             if (m == 0) {
    279                 error = ENOBUFS;
    280                 ipstat.ips_odropped++;
    281                 goto sendorfree;
    282             }
    283             m->m_data += max_linkhdr;
    284             mhip = mtod(m, struct ip *);
    285             *mhip = *ip;
    286             if (hlen > sizeof(struct ip)) {
    287                 mhlen = ip_optcopy(ip, mhip) + sizeof(struct ip);
    288                 mhip->ip_hl = mhlen >> 2;
    289             }
    290             m->m_len = mhlen;
    291             mhip->ip_off = ((off - hlen) >> 3) + (ip->ip_off & ~IP_MF);
    292             if (ip->ip_off & IP_MF)
    293                 mhip->ip_off |= IP_MF;
    294             if (off + len >= (u_short) ip->ip_len)
    295                 len = (u_short) ip->ip_len - off;
    296             else
    297                 mhip->ip_off |= IP_MF;
    298             mhip->ip_len = htons((u_short) (len + mhlen));
    299             m->m_next = m_copy(m0, off, len);
    300             if (m->m_next == 0) {
    301                 (void) m_free(m);
    302                 error = ENOBUFS;    /* ??? */
    303                 ipstat.ips_odropped++;
    304                 goto sendorfree;
    305             }
    306             m->m_pkthdr.len = mhlen + len;
    307             m->m_pkthdr.rcvif = (struct ifnet *) 0;
    308             mhip->ip_off = htons((u_short) mhip->ip_off);
    309             mhip->ip_sum = 0;
    310             mhip->ip_sum = in_cksum(m, mhlen);
    311             *mnext = m;
    312             mnext = &m->m_nextpkt;
    313             ipstat.ips_ofragments++;
    314         }
    ----------------------------------------------------------------------- ip_output.c
  • construct initial fragment and send fragments (Figure 10.8).

Figure 10.8. ip_output function: send fragments.

----------------------------------------------------------------------- ip_output.c
315         /*
316          * Update first fragment by trimming what's been copied out
317          * and updating header, then send each fragment (in order).
318          */
319         m = m0;
320         m_adj(m, hlen + firstlen - (u_short) ip->ip_len);
321         m->m_pkthdr.len = hlen + firstlen;
322         ip->ip_len = htons((u_short) m->m_pkthdr.len);
323         ip->ip_off = htons((u_short) (ip->ip_off | IP_MF));
324         ip->ip_sum = 0;
325         ip->ip_sum = in_cksum(m, hlen);
326       sendorfree:
327         for (m = m0; m; m = m0) {
328             m0 = m->m_nextpkt;
329             m->m_nextpkt = 0;
330             if (error == 0)
331                 error = (*ifp->if_output) (ifp, m,
332                                       (struct sockaddr *) dst, ro->ro_rt);
333             else
334                 m_freem(m);
335         }

336         if (error == 0)
337             ipstat.ips_fragmented++;
338     }
----------------------------------------------------------------------- ip_output.c

253-261

The fragmentation algorithm is straightforward, but the implementation is complicated by the manipulation of the mbuf structures and chains. If fragmentation is prohibited by the DF bit, ip_output discards the packet and returns EMSGSIZE. If the datagram was generated on this host, a transport protocol passes the error back to the process, but if the datagram is being forwarded, ip_forward generates an ICMP destination unreachable error with an indication that the packet could not be forwarded without fragmentation (Figure 8.21).

Net/3 does not implement the path MTU discovery algorithms used to probe the path to a destination and discover the largest transmission unit supported by all the intervening networks. Sections 11.8 and 24.2 of Volume 1 describe path MTU discovery for UDP and TCP.

262-266

len, the number of data bytes in each fragment, is computed as the MTU of the interface less the size of the packet’s header and then rounded down to an 8-byte boundary by clearing the low-order 3 bits (& ~7). If the MTU is so small that each fragment contains less than 8 bytes, ip_output returns EMSGSIZE.

Each new fragment contains an IP header, some of the options from the original packet, and at most len data bytes.
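The size calculation at lines 262-266 can be tried in isolation. The following is a minimal sketch under the assumption that the MTU and header length are plain integers; frag_data_len and frag_count are hypothetical helper names, not Net/3 functions, and -1 stands in for the EMSGSIZE error path.

```c
#include <assert.h>

/*
 * Data bytes carried by each fragment except possibly the last:
 * the MTU less the header, rounded down to a multiple of 8.
 * Returns -1 where ip_output would fail with EMSGSIZE.
 */
int frag_data_len(int mtu, int hlen)
{
    int len;

    if (mtu < hlen)
        return -1;
    len = (mtu - hlen) & ~7;      /* clear the low-order 3 bits */
    return (len < 8) ? -1 : len;
}

/* Number of fragments needed to carry datalen bytes, len bytes at a time. */
int frag_count(int datalen, int len)
{
    assert(len >= 8 && (len & 7) == 0);
    return (datalen + len - 1) / len;   /* round up */
}
```

For a 296-byte MTU and a 20-byte header, frag_data_len returns 272, and 1032 bytes of data then require four fragments. An MTU of 27 leaves only 7 bytes after the header, which rounds down to 0 and is rejected.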

The code in Figure 10.7, which is the start of a C compound statement, constructs the list of fragments starting with the second fragment. The original packet is converted into the initial fragment after the list is created (Figure 10.8).

267-269

The extra block allows mhlen, firstlen, and mnext to be declared closer to their use in the function. These variables are in scope until the end of the block and hide any similarly named variables outside the block.

270-276

Since the original mbuf chain becomes the first fragment, the for loop starts with the offset of the second fragment: hlen + len. For each fragment ip_output takes the following actions:

  • 277-284

    Allocate a new packet mbuf and adjust its m_data pointer to leave room for a 16-byte link-layer header (max_linkhdr). If ip_output didn’t do this, the network interface driver would have to allocate an additional mbuf to hold the link header or move the data. Both are time-consuming tasks that are easily avoided here.

  • 285-290

    Copy the IP header and IP options from the original packet into the new packet. The former is copied with a structure assignment. ip_optcopy copies only those options that get copied into each fragment (Section 10.4).

  • 291-297

    Set the offset field (ip_off) for the fragment including the MF bit. If MF is set in the original packet, then MF is set in all the fragments. If MF is not set in the original packet, then MF is set for every fragment except the last.

  • 298

    Set the length of this fragment accounting for a shorter header (ip_optcopy may not have copied all the options) and a shorter data area for the last fragment. The length is stored in network byte order.

  • 299-305

    Copy the data from the original packet into this fragment. m_copy allocates additional mbufs if necessary. If m_copy fails, ENOBUFS is posted. Any mbufs already allocated are discarded at sendorfree.

  • 306-314

    Adjust the mbuf packet header of the newly created fragment to have the correct total length, clear the new fragment’s interface pointer, convert ip_off to network byte order, compute the checksum for the new fragment, and link the fragment to the previous fragment through m_nextpkt.
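Stripped of the mbuf handling, the loop reduces to arithmetic on offsets and the MF bit. The sketch below is an illustration under simplifying assumptions: it plans every fragment, including the first (which Net/3 builds afterward from the original packet), it works on the data portion only, and the frag_plan structure and plan_fragments function are hypothetical names. All values are in host byte order.

```c
#include <assert.h>
#include <stdint.h>

#define IP_MF      0x2000   /* "more fragments" flag */
#define IP_OFFMASK 0x1fff   /* 13-bit offset in 8-byte units */

/* Hypothetical record describing one planned fragment. */
struct frag_plan {
    uint16_t ip_off;    /* offset field plus MF, host byte order */
    uint16_t datalen;   /* data bytes carried by this fragment */
};

/*
 * Plan the fragments of a packet whose data portion is datalen bytes.
 * orig_off is the packet's own ip_off (it may already be a fragment);
 * len is the per-fragment data size, a multiple of 8.
 * Returns the number of fragments, or -1 if out[] is too small.
 */
int plan_fragments(uint16_t orig_off, int datalen, int len,
                   struct frag_plan *out, int max)
{
    int n = 0;
    int off;

    assert(len >= 8 && (len & 7) == 0);
    for (off = 0; off < datalen; off += len) {
        int last = (off + len >= datalen);
        uint16_t field = (uint16_t)((orig_off & IP_OFFMASK) + (off >> 3));

        /* MF is set unless this is the last piece of an unfragmented tail. */
        if ((orig_off & IP_MF) || !last)
            field |= IP_MF;
        if (n == max)
            return -1;
        out[n].ip_off = field;
        out[n].datalen = (uint16_t)(last ? datalen - off : len);
        n++;
    }
    return n;
}
```

Planning 1032 data bytes with len equal to 272 yields offset fields 0, 34, 68, and 102, with MF set on all but the last. If the packet being split is itself a fragment with MF set, every piece keeps MF, as the text describes.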

In Figure 10.8, ip_output constructs the initial fragment and then passes each fragment to the interface layer.

315-325

The original packet is converted into the first fragment by trimming the extra data from its end, setting the MF bit, converting ip_len and ip_off to network byte order, and computing the new checksum. All the IP options are retained in this fragment. At the destination host, only the IP options from the first fragment of a datagram are retained when the datagram is reassembled (Figure 10.28). Some options, such as source routing, must be copied into each fragment even though the option is discarded during reassembly.

326-338

At this point, ip_output has either a complete list of fragments or an error has occurred and the partial list of fragments must be discarded. The for loop traverses the list either sending or discarding fragments according to error. Any error encountered while sending fragments causes the remaining fragments to be discarded.
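The send-or-discard traversal is a small pattern worth seeing on its own. The sketch below is not Net/3 code: struct pkt stands in for an mbuf packet, the two callbacks stand in for if_output and m_freem, and the demo_ functions and counters exist only to exercise the loop. It assumes that the output callback takes ownership of the packet it is handed whether or not it succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an mbuf packet: only the per-packet link. */
struct pkt {
    struct pkt *nextpkt;
    int         seq;
};

/* Demo callbacks (assumptions for illustration): count what happens. */
int sent_count;
int freed_count;

int demo_output(struct pkt *p)
{
    if (p->seq == 2)
        return 55;          /* simulate a failure on the third packet */
    sent_count++;
    return 0;
}

void demo_free(struct pkt *p)
{
    (void)p;
    freed_count++;
}

/*
 * Walk the list once. While error is 0, hand each packet to output();
 * after the first error, discard everything that remains.
 * Returns the final error value.
 */
int send_or_free(struct pkt *head, int error,
                 int (*output)(struct pkt *),
                 void (*discard)(struct pkt *))
{
    struct pkt *p, *next;

    assert(output != NULL && discard != NULL);
    for (p = head; p != NULL; p = next) {
        next = p->nextpkt;      /* save the link before detaching */
        p->nextpkt = NULL;
        if (error == 0)
            error = output(p);
        else
            discard(p);
    }
    return error;
}
```

Calling send_or_free with a nonzero error on entry frees the whole list without sending anything, which corresponds to the case where building the fragment list failed partway through.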

ip_optcopy Function

During fragmentation, ip_optcopy (Figure 10.9) copies the options from the incoming packet (if the packet is being forwarded) or from the original datagram (if the datagram is locally generated) into the outgoing fragments.

Figure 10.9. ip_optcopy function.

----------------------------------------------------------------------- ip_output.c
395 int
396 ip_optcopy(ip, jp)
397 struct ip *ip, *jp;
398 {
399     u_char *cp, *dp;
400     int     opt, optlen, cnt;

401     cp = (u_char *) (ip + 1);
402     dp = (u_char *) (jp + 1);
403     cnt = (ip->ip_hl << 2) - sizeof(struct ip);
404     for (; cnt > 0; cnt -= optlen, cp += optlen) {
405         opt = cp[0];
406         if (opt == IPOPT_EOL)
407             break;
408         if (opt == IPOPT_NOP) {
409             /* Preserve for IP mcast tunnel's LSRR alignment. */
410             *dp++ = IPOPT_NOP;
411             optlen = 1;
412             continue;
413         } else
414             optlen = cp[IPOPT_OLEN];
415         /* bogus lengths should have been caught by ip_dooptions */
416         if (optlen > cnt)
417             optlen = cnt;
418         if (IPOPT_COPIED(opt)) {
419             bcopy((caddr_t) cp, (caddr_t) dp, (unsigned) optlen);
420             dp += optlen;
421         }
422     }
423     for (optlen = dp - (u_char *) (jp + 1); optlen & 0x3; optlen++)
424         *dp++ = IPOPT_EOL;
425     return (optlen);
426 }
----------------------------------------------------------------------- ip_output.c

395-422

The arguments to ip_optcopy are: ip, a pointer to the IP header of the outgoing packet; and jp, a pointer to the IP header of the newly created fragment. ip_optcopy initializes cp and dp to point to the first option byte in each packet and advances cp and dp as it processes each option. The first for loop copies a single option during each iteration stopping when it encounters an EOL option or when it has examined all the options. NOP options are copied to preserve any alignment constraints in the subsequent options.

  • The Net/2 release discarded NOP options.

If IPOPT_COPIED indicates that the copied bit is on, ip_optcopy copies the option to the new fragment. Figure 9.5 shows which options have the copied bit set. If an option length is too large, it is truncated; ip_dooptions should have already discovered this type of error.

423-426

The second for loop pads the option list out to a 4-byte boundary. This is required because the packet’s header length (ip_hl) is measured in 4-byte units. It also ensures that the transport header that follows is aligned on a 4-byte boundary. Many transport protocols are designed so that their 32-bit header fields fall on 32-bit boundaries when the transport header itself starts on one, which improves performance on CPUs that have difficulty accessing unaligned 32-bit words.

Figure 10.10 illustrates the operation of ip_optcopy.

Figure 10.10. Not all options are copied during fragmentation.

In Figure 10.10 we see that ip_optcopy does not copy the timestamp option (its copied bit is 0) but does copy the LSRR option (its copied bit is 1). ip_optcopy has also added a single EOL option to pad the new options to a 4-byte boundary.
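The same selection and padding logic can be exercised on a plain byte array. This is a sketch modeled on Figure 10.9, not a replacement for it: copy_frag_options is a hypothetical name, it works on raw option bytes instead of struct ip pointers, and it adds a guard against option lengths smaller than 2, which the original leaves to ip_dooptions. The destination buffer is assumed to hold at least cnt bytes rounded up to a multiple of 4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOPT_EOL        0                 /* end of option list */
#define IPOPT_NOP        1                 /* no operation */
#define IPOPT_COPIED(o)  ((o) & 0x80)      /* high-order bit: copy on fragmentation */

/*
 * Copy into dst the options from src (cnt bytes) that belong in every
 * fragment, then pad with EOL to a 4-byte boundary.
 * Returns the padded length of the copied options.
 */
int copy_frag_options(const uint8_t *src, int cnt, uint8_t *dst)
{
    const uint8_t *cp = src;
    uint8_t *dp = dst;
    int optlen;

    for (; cnt > 0; cnt -= optlen, cp += optlen) {
        uint8_t opt = cp[0];

        if (opt == IPOPT_EOL)
            break;
        if (opt == IPOPT_NOP) {
            *dp++ = IPOPT_NOP;      /* kept to preserve alignment */
            optlen = 1;
            continue;
        }
        if (cnt < 2 || cp[1] < 2)   /* added guard: malformed length */
            break;
        optlen = cp[1];
        if (optlen > cnt)
            optlen = cnt;           /* truncate an overlong option */
        if (IPOPT_COPIED(opt)) {
            memcpy(dp, cp, (size_t)optlen);
            dp += optlen;
        }
    }
    for (optlen = (int)(dp - dst); optlen & 0x3; optlen++)
        *dp++ = IPOPT_EOL;
    return optlen;
}
```

Given a 7-byte LSRR option (type 131) followed by an 8-byte timestamp option (type 68) and one EOL, the function copies only the LSRR option and appends a single EOL, returning 8.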

Reassembly

Now that we have described the fragmentation of a datagram (or of a fragment), we return to ipintr and the reassembly process. In Figure 8.15 we omitted the reassembly code from ipintr and postponed its discussion. ipintr can pass only entire datagrams up to the transport layer for processing. Fragments that are received by ipintr are passed to ip_reass, which attempts to reassemble fragments into complete datagrams. The code from ipintr is shown in Figure 10.11.

Figure 10.11. ipintr function: fragment processing.

------------------------------------------------------------------------ ip_input.c
271   ours:
272     /*
273      * If offset or IP_MF are set, must reassemble.
274      * Otherwise, nothing need be done.
275      * (We could look in the reassembly queue to see
276      * if the packet was previously fragmented,
277      * but it's not worth the time; just let them time out.)
278      */
279     if (ip->ip_off & ~IP_DF) {
280         if (m->m_flags & M_EXT) {   /* XXX */
281             if ((m = m_pullup(m, sizeof(struct ip))) == 0) {
282                 ipstat.ips_toosmall++;
283                 goto next;
284             }
285             ip = mtod(m, struct ip *);
286         }
287         /*
288          * Look for queue of fragments
289          * of this datagram.
290          */
291         for (fp = ipq.next; fp != &ipq; fp = fp->next)
292             if (ip->ip_id == fp->ipq_id &&
293                 ip->ip_src.s_addr == fp->ipq_src.s_addr &&
294                 ip->ip_dst.s_addr == fp->ipq_dst.s_addr &&
295                 ip->ip_p == fp->ipq_p)
296                 goto found;
297         fp = 0;
298       found:

299         /*
300          * Adjust ip_len to not reflect header,
301          * set ip_mff if more fragments are expected,
302          * convert offset of this to bytes.
303          */
304         ip->ip_len -= hlen;
305         ((struct ipasfrag *) ip)->ipf_mff &= ~1;
306         if (ip->ip_off & IP_MF)
307             ((struct ipasfrag *) ip)->ipf_mff |= 1;
308         ip->ip_off <<= 3;

309         /*
310          * If datagram marked as having more fragments
311          * or if this is not the first fragment,
312          * attempt reassembly; if it succeeds, proceed.
313          */
314         if (((struct ipasfrag *) ip)->ipf_mff & 1 || ip->ip_off) {
315             ipstat.ips_fragments++;
316             ip = ip_reass((struct ipasfrag *) ip, fp);
317             if (ip == 0)
318                 goto next;
319             ipstat.ips_reassembled++;
320             m = dtom(ip);
321         } else if (fp)
322             ip_freef(fp);

323     } else
324         ip->ip_len -= hlen;
------------------------------------------------------------------------ ip_input.c

271-279

Recall that ip_off contains the DF bit, the MF bit, and the fragment offset. The DF bit is masked out and if either the MF bit or fragment offset is nonzero, the packet is a fragment that must be reassembled. If both are zero, the packet is a complete datagram: the reassembly code is skipped and the else clause at the end of Figure 10.11 is executed, which excludes the header length from the total datagram length.

280-286

m_pullup moves data in an external cluster into the data area of the mbuf. Recall that the SLIP interface (Section 5.3) may return an entire IP packet in an external cluster if it does not fit in a single mbuf. Also m_devget can return the entire packet in a cluster (Section 2.6). Before the mtod macro will work (Section 2.6), m_pullup must move the IP header from the cluster into the data area of an mbuf.

287-297

Net/3 keeps incomplete datagrams on the global doubly linked list, ipq. The name is somewhat confusing since the data structure isn’t a queue. That is, insertions and deletions can occur anywhere in the list, not just at the ends. We’ll use the term list to emphasize this fact.

ipintr performs a linear search of the list to locate the appropriate datagram for the current fragment. Remember that fragments are uniquely identified by the 4-tuple: {ip_id, ip_src, ip_dst, ip_p}. Each entry in ipq is a list of fragments and fp points to the appropriate list if ipintr finds a match.

Net/3 uses linear searches to access many of its data structures. While simple, this method can become a bottleneck in hosts supporting large numbers of network connections.
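The search at lines 291-297 relies on a circular doubly linked list whose head node acts as a sentinel. A minimal sketch of that arrangement follows; struct reass_hdr and the three functions are hypothetical simplifications (addresses are plain 32-bit integers and the fragment list itself is omitted), not the Net/3 definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified reassembly header. */
struct reass_hdr {
    struct reass_hdr *next, *prev;
    uint16_t id;        /* identification field */
    uint8_t  proto;     /* protocol */
    uint32_t src, dst;  /* source and destination addresses */
};

/* An empty list is a head node that points to itself. */
void reass_list_init(struct reass_hdr *head)
{
    head->next = head->prev = head;
}

/* Link h into the list immediately after the head node. */
void reass_list_insert(struct reass_hdr *head, struct reass_hdr *h)
{
    h->next = head->next;
    h->prev = head;
    head->next->prev = h;
    head->next = h;
}

/* Linear search on the 4-tuple; returns NULL when nothing matches. */
struct reass_hdr *reass_list_find(struct reass_hdr *head, uint16_t id,
                                  uint8_t proto, uint32_t src, uint32_t dst)
{
    struct reass_hdr *fp;

    for (fp = head->next; fp != head; fp = fp->next)
        if (fp->id == id && fp->src == src &&
            fp->dst == dst && fp->proto == proto)
            return fp;
    return NULL;
}
```

Because the loop stops when it returns to the head node, the head itself is never compared against the fragment, which is why it never needs a fragment list of its own.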

298-303

At found, the packet is modified by ipintr to facilitate reassembly:

  • 304

    ipintr changes ip_len to exclude the standard IP header and any options. We must keep this in mind to avoid confusion with the standard interpretation of ip_len, which includes the standard header, options, and data. ip_len is also changed if the reassembly code is skipped because this is not a fragment.

  • 305-307

    ipintr copies the MF flag into the low-order bit of ipf_mff, which overlays ip_tos (&= ~1 clears the low-order bit only). Notice that ip must be cast to a pointer to an ipasfrag structure before ipf_mff is a valid member. Section 10.6 and Figure 10.14 describe the ipasfrag structure.

    Although RFC 1122 requires the IP layer to provide a mechanism that enables the transport layer to set ip_tos for every outgoing datagram, it only recommends that the IP layer pass ip_tos values to the transport layer at the destination host. Since the low-order bit of the TOS field must always be 0, it is available to hold the MF bit while ip_off (where the MF bit is normally found) is used by the reassembly algorithm.

    ip_off can now be accessed as a 16-bit offset instead of 3 flag bits and a 13-bit offset.

  • 308

    ip_off is multiplied by 8 to convert from 8-byte to 1-byte units.
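The three modifications can be sketched on a reduced header. This is an illustration only: struct frag_hdr holds just the fields involved, tos_mff stands for the byte that Net/3 reaches as ip_tos or ipf_mff depending on the cast, and prep_for_reass is a hypothetical name. Values are in host byte order.

```c
#include <assert.h>
#include <stdint.h>

#define IP_MF 0x2000    /* "more fragments" flag in ip_off */

/* Hypothetical reduced header: only the fields that ipintr adjusts. */
struct frag_hdr {
    uint8_t  tos_mff;   /* TOS byte; low-order bit borrowed for MF */
    uint16_t len;       /* total length on entry, data length on exit */
    uint16_t off;       /* flags + 8-byte units on entry, bytes on exit */
};

void prep_for_reass(struct frag_hdr *h, unsigned hlen)
{
    h->len = (uint16_t)(h->len - hlen);        /* exclude header and options */

    h->tos_mff = (uint8_t)(h->tos_mff & ~1u);  /* clear the borrowed bit */
    if (h->off & IP_MF)                        /* must test before shifting */
        h->tos_mff |= 1;

    h->off = (uint16_t)(h->off << 3);          /* flag bits shift out of the field */
}
```

The order matters: the MF bit has to be saved before the shift, because multiplying by 8 pushes all three flag bits out of the 16-bit field and leaves a plain byte offset.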

ipf_mff and ip_off determine if ipintr should attempt reassembly. Figure 10.12 describes the different cases and the corresponding actions. Remember that fp points to the list of fragments the system has previously received for the datagram. Most of the work is done by ip_reass.

Figure 10.12. IP fragment processing in ipintr and ip_reass.

ip_off   ipf_mff  fp       Description                          Action
-------  -------  -------  -----------------------------------  -----------------------------------------------------
0        false    null     complete datagram                    no assembly required
0        false    nonnull  complete datagram                    discard the previous fragments
any      true     null     fragment of new datagram             initialize new fragment list with this fragment
any      true     nonnull  fragment of incomplete datagram      insert into existing fragment list, attempt reassembly
nonzero  false    null     tail fragment of new datagram        initialize new fragment list
nonzero  false    nonnull  tail fragment of incomplete datagram insert into existing fragment list, attempt reassembly
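The six rows collapse into two tests, mirroring the if and else if at lines 314-322 of Figure 10.11. The following sketch expresses that decision as a function; the enum and the name classify_fragment are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical labels for the four distinct actions in the table. */
enum frag_action {
    FRAG_DELIVER,             /* complete datagram, nothing queued */
    FRAG_DELIVER_DROP_OLD,    /* complete datagram, discard stale fragments */
    FRAG_START_LIST,          /* first fragment seen for this datagram */
    FRAG_INSERT_AND_TRY       /* add to existing list, attempt reassembly */
};

/*
 * byte_off is ip_off after conversion to bytes, mff is the saved MF bit,
 * and have_fp is nonzero when a reassembly header was found.
 */
enum frag_action classify_fragment(uint16_t byte_off, int mff, int have_fp)
{
    if (mff || byte_off != 0)
        return have_fp ? FRAG_INSERT_AND_TRY : FRAG_START_LIST;
    return have_fp ? FRAG_DELIVER_DROP_OLD : FRAG_DELIVER;
}
```

In Net/3 the two fragment cases are both handled inside ip_reass, which creates the list when fp is null; the sketch separates them only to match the rows of the table.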

309-322

If ip_reass is able to assemble a complete datagram by combining the current fragment with previously received fragments, it returns a pointer to the reassembled datagram. If reassembly is not possible, ip_reass saves the fragment and ipintr jumps to next to process the next packet (Figure 8.12).

323-324

This else branch is taken when a complete datagram arrives and ip_len is modified as described earlier. This is the normal flow, since most received datagrams are not fragments.

If a complete datagram is available after reassembly processing, it is passed up to the appropriate transport protocol by ipintr (Figure 8.15):

    (*inetsw[ip_protox[ip->ip_p]].pr_input) (m, hlen);

ip_reass Function

ipintr passes ip_reass a fragment to be processed and a pointer to the matching reassembly header from ipq. ip_reass attempts to assemble and return a complete datagram or links the fragment into the datagram’s reassembly list for reassembly when the remaining fragments arrive. The head of each reassembly list is an ipq structure, shown in Figure 10.13.

Figure 10.13. ipq structure.

-------------------------------------------------------------------------- ip_var.h
 52 struct ipq {
 53     struct ipq *next, *prev;    /* to other reass headers */
 54     u_char  ipq_ttl;            /* time for reass q to live */
 55     u_char  ipq_p;              /* protocol of this fragment */
 56     u_short ipq_id;             /* sequence id for reassembly */
 57     struct ipasfrag *ipq_next, *ipq_prev;
 58     /* to ip headers of fragments */
 59     struct in_addr ipq_src, ipq_dst;
 60 };
-------------------------------------------------------------------------- ip_var.h

52-60

The four fields required to identify a datagram’s fragments, ip_id, ip_p, ip_src, and ip_dst, are kept in the ipq structure at the head of each reassembly list. Net/3 constructs the list of datagrams with next and prev and the list of fragments with ipq_next and ipq_prev.

The IP header of incoming IP packets is converted to an ipasfrag structure (Figure 10.14) before it is placed on a reassembly list.

Figure 10.14. ipasfrag structure.

------------------------------------------------------------------------- ip_var.h
 66 struct  ipasfrag {
 67 #if BYTE_ORDER == LITTLE_ENDIAN
 68     u_char  ip_hl:4,
 69         ip_v:4;
 70 #endif
 71 #if BYTE_ORDER == BIG_ENDIAN
 72     u_char  ip_v:4,
 73         ip_hl:4;
 74 #endif
 75     u_char  ipf_mff;        /* XXX overlays ip_tos: use low bit
 76                              * to avoid destroying tos;
 77                              * copied from (ip_off&IP_MF) */
 78     short   ip_len;
 79     u_short ip_id;
 80     short   ip_off;
 81     u_char  ip_ttl;
 82     u_char  ip_p;
 83     u_short ip_sum;
 84     struct  ipasfrag *ipf_next; /* next fragment */
 85     struct  ipasfrag *ipf_prev; /* previous fragment */
 86 };
------------------------------------------------------------------------- ip_var.h

66-86

ip_reass collects fragments for a particular datagram on a circular doubly linked list joined by the ipf_next and ipf_prev members. These pointers overlay the source and destination addresses in the IP header. The ipf_mff member overlays ip_tos from the ip structure. The other members are the same.

Figure 10.15 illustrates the relationship between the fragment header list (ipq) and the fragments (ipasfrag).

Figure 10.15. The fragment header list, ipq, and fragments.

Down the left side of Figure 10.15 is the list of reassembly headers. The first node in the list is the global ipq structure, ipq. It never has a fragment list associated with it. The ipq list is a doubly linked list used to support fast insertions and deletions. The next and prev pointers reference the next or previous ipq structure, which we have shown by terminating the arrows at the corners of the structures.

Each ipq structure is the head node of a circular doubly linked list of ipasfrag structures. Incoming fragments are placed on these fragment lists ordered by their fragment offset. We’ve highlighted the pointers for these lists in Figure 10.15.

Figure 10.15 still does not show all the complexity of the reassembly structures. The reassembly code is difficult to follow because it relies so heavily on casting pointers to three different structures on the underlying mbuf. We’ve seen this technique already, for example, when an ip structure overlays the data portion of an mbuf.

Figure 10.16 illustrates the relationship between an mbuf, an ipq structure, an ipasfrag structure, and an ip structure.

Figure 10.16. An area of memory can be accessed through multiple structures.

A lot of information is contained within Figure 10.16:

  • All the structures are located within the data area of an mbuf.

  • The ipq list consists of ipq structures joined by next and prev. Within the structure, the four fields that uniquely identify an IP datagram are saved (shaded in Figure 10.16).

  • Each ipq structure is treated as an ipasfrag structure when accessed as the head of a linked list of fragments. The fragments are joined by ipf_next and ipf_prev, which overlay the ipq structures’ ipq_next and ipq_prev members.

  • Each ipasfrag structure overlays the ip structure from the incoming fragment. The data that arrived with the fragment follows the structure in the mbuf. The members that have a different meaning in the ipasfrag structure than they do in the ip structure are shaded.
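The overlay only works if corresponding members sit at the same byte offsets in both structures. The sketch below checks that property for a simplified pair of layouts. It is an illustration under explicit assumptions: ip_view and frag_view are hypothetical names, the version and header-length bit-fields are merged into one byte, and the two list pointers are represented as 32-bit slots, since the overlay depends on a pointer occupying the same space as a 32-bit address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the header as an IP header. */
struct ip_view {
    uint8_t  vhl;       /* version and header length */
    uint8_t  tos;
    int16_t  len;
    uint16_t id;
    int16_t  off;
    uint8_t  ttl;
    uint8_t  p;
    uint16_t sum;
    uint32_t src;       /* source address */
    uint32_t dst;       /* destination address */
};

/* Hypothetical view of the same bytes as a queued fragment. */
struct frag_view {
    uint8_t  vhl;
    uint8_t  mff;       /* overlays tos */
    int16_t  len;
    uint16_t id;
    int16_t  off;
    uint8_t  ttl;
    uint8_t  p;
    uint16_t sum;
    uint32_t next;      /* overlays src (a pointer in Net/3) */
    uint32_t prev;      /* overlays dst (a pointer in Net/3) */
};

/* Return 1 if every overlaid member lines up, 0 otherwise. */
int overlay_is_consistent(void)
{
    return offsetof(struct frag_view, mff)  == offsetof(struct ip_view, tos)
        && offsetof(struct frag_view, len)  == offsetof(struct ip_view, len)
        && offsetof(struct frag_view, off)  == offsetof(struct ip_view, off)
        && offsetof(struct frag_view, next) == offsetof(struct ip_view, src)
        && offsetof(struct frag_view, prev) == offsetof(struct ip_view, dst)
        && sizeof(struct frag_view) == sizeof(struct ip_view);
}
```

The check compares layouts only and does not access one structure through a pointer to the other. On a system where pointers are wider than 32 bits, two real pointers would not fit in the space of the two addresses, so this particular overlay would not carry over unchanged.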

Figure 10.15 showed the physical connections between the reassembly structures and Figure 10.16 illustrated the overlay technique used by ip_reass. In Figure 10.17 we show the reassembly structures from a logical point of view: this figure shows the reassembly of three datagrams and the relationship between the ipq list and the ipasfrag structures.

Figure 10.17. Reassembly of three IP datagrams.

The head of each reassembly list contains the id, protocol, source, and destination address of the original datagram. Only the ip_id field is shown in the figure. Each fragment list is ordered by the offset field; a fragment is labeled with MF if its MF bit is set, and missing fragments appear as shaded boxes. The numbers within each fragment show the starting and ending byte offset for the fragment relative to the data portion of the original datagram, not to the IP header of the original datagram.

The example is constructed to show three UDP datagrams with no IP options and 1024 bytes of data each. The total length of each datagram is 1052 (20 + 8 + 1024) bytes, which is well within the 1500-byte MTU of an Ethernet. The datagrams encounter a SLIP link on the way to the destination, and the router at that link fragments the datagrams to fit within a typical 296-byte SLIP MTU. Each datagram arrives as four fragments. The first fragment contains a standard 20-byte IP header, the 8-byte UDP header, and 264 bytes of data. The second and third fragments contain a 20-byte IP header and 272 bytes of data. The last fragment has a 20-byte header and 216 bytes of data (1032 = 272 × 3 + 216).

In Figure 10.17, datagram 5 is missing a single fragment containing bytes 272 through 543. Datagram 6 is missing the first fragment, bytes 0 through 271, and the end of the datagram starting at offset 816. Datagram 7 is missing the first three fragments, bytes 0 through 815.
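Whether a fragment list like these is complete can be decided with one pass over the ordered list: each fragment must start where the previous one ended, and the last one must have MF clear. The sketch below shows that check on an array. It is a simplified illustration, not the Net/3 reassembly code: struct frag_range and reass_complete are hypothetical names, the array is assumed to be sorted by offset, and overlapping fragments are treated as incomplete, not trimmed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one queued fragment. */
struct frag_range {
    uint16_t off;   /* byte offset within the datagram's data */
    uint16_t len;   /* data bytes in this fragment */
    int      mff;   /* nonzero if more fragments follow */
};

/*
 * f[] holds n fragments sorted by offset.
 * Returns the total data length if the datagram is complete, else -1.
 */
int reass_complete(const struct frag_range *f, int n)
{
    int next = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (f[i].off != next)
            return -1;              /* hole before this fragment */
        next += f[i].len;
    }
    if (n == 0 || f[n - 1].mff)
        return -1;                  /* tail fragment has not arrived */
    return next;
}
```

Datagram 5 from Figure 10.17 fails at its second entry, since offset 544 follows a fragment that ended at 272. Once the missing fragment is inserted, the same check returns 1032.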

Figure 10.18 lists ip_reass. Remember that ipintr calls ip_reass when an IP fragment has arrived for this host, and after any options have been processed.

Figure 10.18. ip_reass function: datagram reassembly.

------------------------------------------------------------------------ ip_input.c
337 /*
338  * Take incoming datagram fragment and try to
339  * reassemble it into whole datagram.  If a chain for
340  * reassembly of this datagram already exists, then it
341  * is given as fp; otherwise have to make a chain.
342  */
343 struct ip *
344 ip_reass(ip, fp)
345 struct ipasfrag *ip;
346 struct ipq *fp;
347 {
348     struct mbuf *m = dtom(ip);
349     struct ipasfrag *q;
350     struct mbuf *t;
351     int     hlen = ip->ip_hl << 2;
352     int     i, next;

353     /*
354      * Presence of header sizes in mbufs
355      * would confuse code below.
356      */
357     m->m_data += hlen;
358     m->m_len -= hlen;
                                                                                   
                                 /* reassembly code */                             
                                                                                   
465   dropfrag:
466     ipstat.ips_fragdropped++;
467     m_freem(m);
468     return (0);
469 }
------------------------------------------------------------------------ ip_input.c

343-358

When ip_reass is called, ip points to the fragment and fp either points to the matching ipq structure or is null.

Since reassembly involves only the data portion of each fragment, ip_reass adjusts m_data and m_len in the mbuf containing the fragment to exclude the fragment's IP header.
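The effect of this adjustment, and of the matching adjustment that later makes the header visible again, can be sketched with a toy structure. This is an illustration only: toy_mbuf, hide_header, and show_header are hypothetical names, and the structure carries just the two members involved rather than a real mbuf.

```c
#include <assert.h>

/* Toy stand-in for the two mbuf members that ip_reass adjusts. */
struct toy_mbuf {
    unsigned char *m_data;  /* start of valid data */
    int m_len;              /* number of valid bytes */
};

/* Hide the first hlen bytes (the IP header) from later processing. */
void
hide_header(struct toy_mbuf *m, int hlen)
{
    assert(hlen >= 0 && hlen <= m->m_len);
    m->m_data += hlen;
    m->m_len -= hlen;
}

/* Undo hide_header so that the header is visible again. */
void
show_header(struct toy_mbuf *m, int hlen)
{
    m->m_data -= hlen;
    m->m_len += hlen;
}
```

The header bytes are never copied or erased. They remain in the buffer just in front of m_data, which is why the header of the first fragment can be recovered when reassembly completes.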

465-469

When an error occurs during reassembly, the function jumps to dropfrag, which increments ips_fragdropped, discards the fragment, and returns a null pointer.

Dropping fragments usually incurs a serious performance penalty at the transport layer since the entire datagram must be retransmitted. TCP is careful to avoid fragmentation, but a UDP application must take steps to avoid fragmentation on its own. [Kent and Mogul 1987] explain why fragmentation should be avoided.

All IP implementations must be able to reassemble a datagram of up to 576 bytes. There is no general way to determine the size of the largest datagram that can be reassembled by a remote host. We’ll see in Section 27.5 that TCP has a mechanism to determine the size of the maximum datagram that can be processed by the remote host. UDP has no such mechanism, so many UDP-based protocols (e.g., RIP, TFTP, BOOTP, SNMP, and DNS) are designed around the 576-byte limit.

We’ll show the reassembly code in seven parts, starting with Figure 10.19.

Figure 10.19. ip_reass function: create reassembly list.

----------------------------------------------------------------------- ip_input.c
359     /*
360      * If first fragment to arrive, create a reassembly queue.
361      */
362     if (fp == 0) {
363         if ((t = m_get(M_DONTWAIT, MT_FTABLE)) == NULL)
364             goto dropfrag;
365         fp = mtod(t, struct ipq *);
366         insque(fp, &ipq);
367         fp->ipq_ttl = IPFRAGTTL;
368         fp->ipq_p = ip->ip_p;
369         fp->ipq_id = ip->ip_id;
370         fp->ipq_next = fp->ipq_prev = (struct ipasfrag *) fp;
371         fp->ipq_src = ((struct ip *) ip)->ip_src;
372         fp->ipq_dst = ((struct ip *) ip)->ip_dst;
373         q = (struct ipasfrag *) fp;
374         goto insert;
375     }
----------------------------------------------------------------------- ip_input.c

Create reassembly list

359-366

When fp is null, ip_reass creates a reassembly list with the first fragment of the new datagram. It allocates an mbuf to hold the head of the new list (an ipq structure), and calls insque to insert the structure in the list of reassembly lists.

Figure 10.20 lists the functions that manipulate the datagram and fragment lists.

Figure 10.20. Queueing functions used by ip_reass.

Function   Description
--------   -----------
insque     Insert node just after prev.
           void insque(void *node, void *prev);
remque     Remove node from list.
           void remque(void *node);
ip_enq     Insert fragment p just after fragment prev.
           void ip_enq(struct ipasfrag *p, struct ipasfrag *prev);
ip_deq     Remove fragment p.
           void ip_deq(struct ipasfrag *p);

  • The functions insque and remque are defined in machdep.c for the 386 version of Net/3. Each machine has its own machdep.c file in which customized versions of kernel functions are defined, typically to improve performance. This file also contains architecture-dependent functions such as the interrupt handler support, cpu and device configuration, and memory management functions.

  • insque and remque exist primarily to maintain the kernel’s run queue. Net/3 can use them for the datagram reassembly list because both lists have next and previous pointers as the first two members of their respective node structures. These functions work for any similarly structured list, although the compiler may issue some warnings. This is yet another example of accessing memory through two different structures.

  • In all the kernel structures the next pointer always precedes the previous pointer (Figure 10.14, for example). This is because the insque and remque functions were first implemented on the VAX using the insque and remque hardware instructions, which require this ordering of the forward and backward pointers.

  • The fragment lists are not joined with the first two members of the ipasfrag structures (Figure 10.14) so Net/3 calls ip_enq and ip_deq instead of insque and remque.
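The list manipulation described in these notes can be sketched in portable C. This is not the machdep.c implementation: qnode, q_init, q_insert, and q_remove are hypothetical names, and the sketch assumes only that the next pointer comes first and the previous pointer second in each node.

```c
#include <assert.h>
#include <stddef.h>

/* Any node whose first two members are the next and previous
 * pointers, in that order. */
struct qnode {
    struct qnode *next;
    struct qnode *prev;
};

/* Make an empty circular list: the head points at itself. */
void
q_init(struct qnode *head)
{
    head->next = head->prev = head;
}

/* Insert node just after prev (the same contract as insque). */
void
q_insert(struct qnode *node, struct qnode *prev)
{
    assert(node != prev);
    node->next = prev->next;
    node->prev = prev;
    prev->next->prev = node;
    prev->next = node;
}

/* Unlink node from the list it is on (the same contract as remque). */
void
q_remove(struct qnode *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = node->prev = NULL;
}
```

A structure that begins with the same two pointers can be passed to these functions through a cast, which is the trick that lets one pair of routines serve several kernel lists.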

Reassembly timeout

367

The time-to-live field (ipq_ttl) is required by RFC 1122 and limits the time Net/3 waits for fragments to complete a datagram. It is different from the TTL field in the IP header, which limits the amount of time a packet circulates in the internet. The IP header TTL field is reused as the reassembly timeout since the header TTL is not needed once the fragment arrives at its final destination.

In Net/3, the initial value of the reassembly timeout is 60 (IPFRAGTTL). Since ipq_ttl is decremented every time the kernel calls ip_slowtimo and the kernel calls ip_slowtimo every 500 ms, the system discards an IP reassembly list if it hasn’t assembled a complete IP datagram within 30 seconds of receiving any one of the datagram’s fragments. The reassembly timer starts ticking on the first call to ip_slowtimo after the list is created.
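The 30-second figure follows from simple arithmetic, sketched below. TICK_MS and reass_lifetime_ms are hypothetical names. The sketch assumes, as the text states, that the counter is decremented once per call to ip_slowtimo and that the list is discarded on the call that brings it to 0.

```c
#include <assert.h>

#define TICK_MS 500   /* assumed interval between ip_slowtimo calls */

/*
 * Earliest and latest time, in milliseconds after the list is created,
 * at which a list created with the given ttl is discarded.  The first
 * tick can arrive any time from just after creation up to one full
 * interval later.
 */
void
reass_lifetime_ms(int ttl, int *earliest, int *latest)
{
    assert(ttl > 0);
    *earliest = (ttl - 1) * TICK_MS;  /* first tick arrives at once */
    *latest = ttl * TICK_MS;          /* first tick arrives a full interval later */
}
```

With a ttl of 60 the two bounds are 29,500 ms and 30,000 ms, so a list survives for just under 30 seconds at best.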

RFC 1122 recommends that the reassembly time be between 60 and 120 seconds and that an ICMP time exceeded error be sent to the source host if the timer expires and the first fragment of the datagram has been received. The header and options of the other fragments are always discarded during reassembly and an ICMP error must contain the first 64 bits of the erroneous datagram (or less if the datagram was shorter than 8 bytes). So, if the kernel hasn’t received fragment 0, it can’t send an ICMP message.

Net/3’s timer is a bit too short and Net/3 neglects to send the ICMP message when a fragment is discarded. The requirement to return the first 64 bits of the datagram ensures that the first portion of the transport header is included, which allows the error message to be returned to the application that generated it. Note that TCP and UDP purposely put their port numbers in the first 8 bytes of their headers for this reason.

Datagram identifiers

368-375

ip_reass saves ip_p, ip_id, ip_src, and ip_dst in the ipq structure allocated for this datagram, points the ipq_next and ipq_prev pointers to the ipq structure (i.e., it constructs a circular list with one node), points q at this structure, and jumps to insert (Figure 10.25) where it inserts the first fragment, ip, into the new reassembly list.

The next part of ip_reass, shown in Figure 10.21, is executed when fp is not null and locates the correct position in the existing list for the new fragment.

Figure 10.21. ip_reass function: find position in reassembly list.

----------------------------------------------------------------------- ip_input.c
376     /*
377      * Find a fragment which begins after this one does.
378      */
379     for (q = fp->ipq_next; q != (struct ipasfrag *) fp; q = q->ipf_next)
380         if (q->ip_off > ip->ip_off)
381             break;
----------------------------------------------------------------------- ip_input.c

376-381

Since fp is not null, the for loop searches the datagram’s fragment list to locate a fragment with an offset greater than ip_off.

The byte ranges contained within fragments may overlap at the destination. This can happen when a transport-layer protocol retransmits a datagram that gets sent along a route different from the one followed by the original datagram. The fragmentation pattern may also be different, resulting in overlaps at the destination. The transport protocol must be able to force IP to use the original ID field in order for the datagram to be recognized as a retransmission at the destination.

Net/3 does not provide a mechanism for a transport protocol to ensure that IP ID fields are reused on a retransmitted datagram. ip_output always assigns a new value by incrementing the global integer ip_id when preparing a new datagram (Figure 8.22). Nevertheless, a Net/3 system could receive overlapping fragments from a system that lets the transport layer retransmit IP datagrams with the same ID field.

Figure 10.22 illustrates the different ways in which the fragment may overlap with existing fragments. The fragments are numbered according to the order in which they arrive at the destination host. The reassembled fragment is shown at the bottom of Figure 10.22. The shaded areas of the fragments are the duplicate bytes that are discarded.

Figure 10.22. The byte range of fragments may overlap at the destination.

In the following discussion, an earlier fragment is a fragment that previously arrived at the host.

The code in Figure 10.23 trims or discards incoming fragments.

Figure 10.23. ip_reass function: trim incoming packet.

------------------------------------------------------------------------ ip_input.c
382     /*
383      * If there is a preceding fragment, it may provide some of
384      * our data already.  If so, drop the data from the incoming
385      * fragment.  If it provides all of our data, drop us.
386      */
387     if (q->ipf_prev != (struct ipasfrag *) fp) {
388         i = q->ipf_prev->ip_off + q->ipf_prev->ip_len - ip->ip_off;
389         if (i > 0) {
390             if (i >= ip->ip_len)
391                 goto dropfrag;
392             m_adj(dtom(ip), i);
393             ip->ip_off += i;
394             ip->ip_len -= i;
395         }
396     }
------------------------------------------------------------------------ ip_input.c

382-396

ip_reass discards bytes that overlap the end of an earlier fragment by trimming the new fragment (the front of fragment 5 in Figure 10.22) or discarding the new fragment (fragment 6) if all its bytes arrived in an earlier fragment (fragment 4).
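The front-trimming rule can be isolated from the mbuf handling. The sketch below works on plain byte ranges; range and trim_incoming are hypothetical names, and the function assumes the preceding fragment does not start after the incoming one, which the search in Figure 10.21 guarantees.

```c
#include <assert.h>

/* Byte range of a fragment within the datagram's data portion. */
struct range {
    int off;
    int len;
};

/*
 * Trim the front of the incoming fragment in where it duplicates bytes
 * already held by the preceding fragment prev.  Returns 0 if in is
 * entirely covered and should be dropped, 1 otherwise.
 */
int
trim_incoming(const struct range *prev, struct range *in)
{
    int i = prev->off + prev->len - in->off;   /* bytes of overlap */

    assert(prev->off <= in->off);
    if (i > 0) {
        if (i >= in->len)
            return 0;
        in->off += i;
        in->len -= i;
    }
    return 1;
}
```

For example, if the list already holds bytes 0 through 271 and a fragment arrives covering bytes 200 through 471, the overlap is 72 bytes and the fragment is trimmed to start at offset 272 with 200 bytes remaining.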

The code in Figure 10.24 trims or discards existing fragments.

Figure 10.24. ip_reass function: trim existing packets.

----------------------------------------------------------------------- ip_input.c
397     /*
398      * While we overlap succeeding fragments trim them or,
399      * if they are completely covered, dequeue them.
400      */
401     while (q != (struct ipasfrag *) fp && ip->ip_off + ip->ip_len > q->ip_off) {
402         i = (ip->ip_off + ip->ip_len) - q->ip_off;
403         if (i < q->ip_len) {
404             q->ip_len -= i;
405             q->ip_off += i;
406             m_adj(dtom(q), i);
407             break;
408         }
409         q = q->ipf_next;
410         m_freem(dtom(q->ipf_prev));
411         ip_deq(q->ipf_prev);
412     }
----------------------------------------------------------------------- ip_input.c

397-412

If the current fragment partially overlaps the front of an earlier fragment, the duplicate data is trimmed from the earlier fragment (the front of fragment 2 in Figure 10.22). Any earlier fragments that are completely overlapped by the arriving fragment are discarded (fragment 3).
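The same logic can be sketched on an array of byte ranges instead of a linked list of mbufs. The names range and trim_existing are hypothetical, and the array is assumed to hold only the fragments that follow the insertion point, sorted by offset.

```c
#include <assert.h>
#include <string.h>

/* Byte range of a fragment within the datagram's data portion. */
struct range {
    int off;
    int len;
};

/*
 * succ holds the n fragments that follow the insertion point.
 * Discard those completely covered by the new fragment in, trim the
 * front of the first one that is only partly covered, and return the
 * number of fragments that remain in succ.
 */
int
trim_existing(const struct range *in, struct range *succ, int n)
{
    int end = in->off + in->len;   /* first byte past the new fragment */
    int drop = 0;                  /* leading entries fully covered */
    int i;

    assert(n >= 0);
    while (drop < n && end > succ[drop].off) {
        i = end - succ[drop].off;          /* overlapping bytes */
        if (i < succ[drop].len) {          /* partial: trim the front */
            succ[drop].off += i;
            succ[drop].len -= i;
            break;
        }
        drop++;                            /* complete: discard it */
    }
    memmove(succ, succ + drop, (size_t)(n - drop) * sizeof(*succ));
    return n - drop;
}
```

If a fragment covering bytes 0 through 599 arrives while the list holds fragments starting at 272, 544, and 816, the first is discarded, the second is trimmed to start at offset 600, and the third is untouched.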

In Figure 10.25, the incoming fragment is inserted into the reassembly list.

Figure 10.25. ip_reass function: insert packet.

----------------------------------------------------------------------- ip_input.c
413   insert:
414     /*
415      * Stick new fragment in its place;
416      * check for complete reassembly.
417      */
418     ip_enq(ip, q->ipf_prev);
419     next = 0;
420     for (q = fp->ipq_next; q != (struct ipasfrag *) fp; q = q->ipf_next) {
421         if (q->ip_off != next)
422             return (0);
423         next += q->ip_len;
424     }
425     if (q->ipf_prev->ipf_mff & 1)
426         return (0);
----------------------------------------------------------------------- ip_input.c

413-426

After trimming, ip_enq inserts the fragment into the list and the list is scanned to determine if all the fragments have arrived. If any fragment is missing, or the last fragment in the list has ipf_mff set, ip_reass returns 0 and waits for more fragments.
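The completeness test amounts to checking that the fragments tile the data with no holes and that the last one has MF clear. A minimal sketch on an array follows; frag and datagram_complete are hypothetical names, and the fragments are assumed to be sorted by offset with overlaps already trimmed.

```c
#include <assert.h>

struct frag {
    int off;   /* starting data offset */
    int len;   /* data bytes */
    int mf;    /* "more fragments" bit from the header */
};

/*
 * Return the total data length if the datagram is complete, or -1 if
 * a hole remains or the last fragment still has MF set.
 */
int
datagram_complete(const struct frag *f, int n)
{
    int next = 0, i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        if (f[i].off != next)   /* hole before this fragment */
            return -1;
        next += f[i].len;
    }
    if (f[n - 1].mf)            /* the end has not arrived yet */
        return -1;
    return next;
}
```

Applied to datagram 5 in Figure 10.17, the scan stops at the fragment starting at offset 544 because next is still 272. Once the missing fragment arrives, the function returns 1032.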

When the current fragment completes a datagram, the entire list is converted to an mbuf chain by the code shown in Figure 10.26.

Figure 10.26. ip_reass function: reassemble datagram.

------------------------------------------------------------------------ ip_input.c
427     /*
428      * Reassembly is complete; concatenate fragments.
429      */
430     q = fp->ipq_next;
431     m = dtom(q);
432     t = m->m_next;
433     m->m_next = 0;
434     m_cat(m, t);
435     q = q->ipf_next;
436     while (q != (struct ipasfrag *) fp) {
437         t = dtom(q);
438         q = q->ipf_next;
439         m_cat(m, t);
440     }
------------------------------------------------------------------------ ip_input.c

427-440

If all the fragments for the datagram have been received, the while loop reconstructs the datagram from the fragments with m_cat.

Figure 10.27 shows the relationships between mbufs and the ipq structure for a datagram composed of three fragments.

Figure 10.27. m_cat reassembles the fragments within mbufs.

The darkest areas in the figure mark the data portions of a packet and the lighter shaded areas mark the unused portions of the mbufs. We show three fragments, each contained in a chain of two mbufs: a packet header and a cluster. The m_data pointer in the first mbuf of each fragment points to the packet data, not the packet header. Therefore, the mbuf chain constructed by m_cat includes only the data portion of the fragments.

This is the typical scenario when a fragment contains more than 208 bytes of data (Section 2.6). The “frag” portion of the mbufs is the IP header from the fragment. The m_data pointer of the first mbuf in each chain points beyond “opts” because of the code in Figure 10.18.

Figure 10.28 shows the reassembled datagram using the mbufs from all the fragments. Notice that the IP header and options from fragments 2 and 3 are not included in the reassembled datagram.

Figure 10.28. The reassembled datagram.

The header of the first fragment is still being used as an ipasfrag structure. It is restored to a valid IP datagram header by the code shown in Figure 10.29.

Figure 10.29. ip_reass function: datagram reassembly.

----------------------------------------------------------------------- ip_input.c
441     /*
442      * Create header for new ip packet by
443      * modifying header of first packet;
444      * dequeue and discard fragment reassembly header.
445      * Make header visible.
446      */
447     ip = fp->ipq_next;
448     ip->ip_len = next;
449     ip->ipf_mff &= ~1;
450     ((struct ip *) ip)->ip_src = fp->ipq_src;
451     ((struct ip *) ip)->ip_dst = fp->ipq_dst;
452     remque(fp);
453     (void) m_free(dtom(fp));
454     m = dtom(ip);
455     m->m_len += (ip->ip_hl << 2);
456     m->m_data -= (ip->ip_hl << 2);
457     /* some debugging cruft by sklower, below, will go away soon */
458     if (m->m_flags & M_PKTHDR) {    /* XXX this should be done elsewhere */
459         int     plen = 0;
460         for (t = m; m; m = m->m_next)
461             plen += m->m_len;
462         t->m_pkthdr.len = plen;
463     }
464     return ((struct ip *) ip);
----------------------------------------------------------------------- ip_input.c

Reconstruct datagram header

441-456

ip_reass points ip to the first fragment in the list and changes the ipasfrag structure back to an ip structure by restoring the length of the datagram to ip_len, the source address to ip_src, and the destination address to ip_dst, and by clearing the low-order bit in ipf_mff. (Recall from Figure 10.14 that ipf_mff in the ipasfrag structure overlays ip_tos in the ip structure.)

ip_reass removes the entire packet from the reassembly list with remque, discards the ipq structure that was the head of the list, and adjusts m_len and m_data in the first mbuf to include the previously hidden IP header and options from the first fragment.

Compute packet length

457-464

The code here is always executed, since the first mbuf for the datagram is always a packet header. The for loop computes the number of data bytes in the mbuf chain and saves the value in m_pkthdr.len.
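The length computation is a plain walk down the chain. The sketch below uses a toy structure with only the two members the loop touches; toy_mbuf and chain_length are hypothetical names, not part of Net/3.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an mbuf: the link to the next buffer in the chain
 * and the number of valid bytes in this one. */
struct toy_mbuf {
    struct toy_mbuf *m_next;
    int m_len;
};

/* Sum m_len over every buffer in the chain starting at m. */
int
chain_length(const struct toy_mbuf *m)
{
    int plen = 0;

    for (; m != NULL; m = m->m_next) {
        assert(m->m_len >= 0);
        plen += m->m_len;
    }
    return plen;
}
```

For the example datagrams in Figure 10.17, a chain holding 292, 272, 272, and 216 bytes (the first buffer includes the restored 20-byte header) sums to 1052, the total length of the original datagram.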

The purpose of the copied bit in the option type field should be clear now. Since the only options retained at the destination are those that appear in the first fragment, only options that control processing of the packet as it travels toward its destination are copied. Options that collect information while in transit are not copied, since the information collected is discarded at the destination when the packet is reassembled.

ip_slowtimo Function

As shown in Section 7.4, each protocol in Net/3 may specify a function to be called every 500 ms. For IP, that function is ip_slowtimo, shown in Figure 10.30, which times out the fragments on the reassembly list.

Figure 10.30. ip_slowtimo function.

------------------------------------------------------------------------------ ip_input.c
515 void
516 ip_slowtimo(void)
517 {
518     struct ipq *fp;
519     int     s = splnet();

520     fp = ipq.next;
521     if (fp == 0) {
522         splx(s);
523         return;
524     }
525     while (fp != &ipq) {
526         --fp->ipq_ttl;
527         fp = fp->next;
528         if (fp->prev->ipq_ttl == 0) {
529             ipstat.ips_fragtimeout++;
530             ip_freef(fp->prev);
531         }
532     }
533     splx(s);
534 }
------------------------------------------------------------------------------ ip_input.c

515-534

ip_slowtimo traverses the list of partial datagrams and decrements the reassembly TTL field of each. If the field drops to 0, ip_freef is called to discard the fragments associated with the datagram. ip_slowtimo runs at splnet to prevent the lists from being modified by incoming packets.
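One detail of this loop is easy to miss: the list pointer is advanced before the expired entry is released, so the traversal never touches freed memory. The sketch below shows the same pattern on a toy list in user space. The names rq, rq_add, and rq_sweep are hypothetical, and malloc and free stand in for the kernel's mbuf allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy reassembly-list head: list links plus a tick counter. */
struct rq {
    struct rq *next;
    struct rq *prev;
    int ttl;
};

/* Allocate an entry with the given ttl and link it just after head. */
struct rq *
rq_add(struct rq *head, int ttl)
{
    struct rq *n = malloc(sizeof(*n));

    assert(n != NULL);
    n->ttl = ttl;
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
    return n;
}

/*
 * Decrement the ttl of every entry; unlink and free each entry whose
 * ttl reaches 0.  Returns the number of entries that expired.
 */
int
rq_sweep(struct rq *head)
{
    struct rq *fp = head->next, *dead;
    int expired = 0;

    while (fp != head) {
        --fp->ttl;
        dead = fp;
        fp = fp->next;          /* step first: dead may be freed below */
        if (dead->ttl == 0) {
            dead->prev->next = dead->next;
            dead->next->prev = dead->prev;
            free(dead);
            expired++;
        }
    }
    return expired;
}
```

An empty list is represented by a head whose next and prev members point at the head itself, so rq_sweep returns 0 immediately in that case.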

ip_freef is shown in Figure 10.31.

Figure 10.31. ip_freef function.

----------------------------------------------------------------------- ip_input.c
474 void
475 ip_freef(fp)
476 struct ipq *fp;
477 {
478     struct ipasfrag *q, *p;

479     for (q = fp->ipq_next; q != (struct ipasfrag *) fp; q = p) {
480         p = q->ipf_next;
481         ip_deq(q);
482         m_freem(dtom(q));
483     }
484     remque(fp);
485     (void) m_free(dtom(fp));
486 }
----------------------------------------------------------------------- ip_input.c

470-486

ip_freef removes and releases every fragment on the list pointed to by fp and then releases the list itself.

ip_drain Function

In Figure 7.14 we showed that IP defines ip_drain as the function to be called when the kernel needs additional memory. This usually occurs during mbuf allocation, which we described with Figure 2.13. ip_drain is shown in Figure 10.32.

Figure 10.32. ip_drain function.

-------------------------------------------------------------- ip_input.c
538 void
539 ip_drain()
540 {

541     while (ipq.next != &ipq) {
542         ipstat.ips_fragdropped++;
543         ip_freef(ipq.next);
544     }
545 }
-------------------------------------------------------------- ip_input.c

538-545

The simplest way for IP to release memory is to discard all the IP fragments on the reassembly list. For IP fragments that belong to a TCP segment, TCP eventually retransmits the data. IP fragments that belong to a UDP datagram are lost and UDP-based protocols must handle this at the application layer.

Summary

In this chapter we showed how ip_output splits an outgoing datagram into fragments if it is too large to be transmitted on the selected network. Since fragments may themselves be fragmented as they travel toward their final destination and may take multiple paths, only the destination host can reassemble the original datagram.

ip_reass accepts incoming fragments and attempts to reassemble datagrams. If it is successful, the datagram is passed back to ipintr and then to the appropriate transport protocol. Every IP implementation must reassemble datagrams of up to 576 bytes. The only limit for Net/3 is the number of mbufs that are available. ip_slowtimo discards incomplete datagrams when not all of their fragments have been received within a reasonable amount of time.

Exercises

10.1

Modify ip_slowtimo to send an ICMP time exceeded message when it discards an incomplete datagram (Figure 11.1).

10.2

The recorded route in a fragmented datagram may be different in each fragment. When a datagram is reassembled at the destination host, which return route is available to the transport protocols?

Answer 10.2

After reassembly, only the options from the initial fragment are available to the transport protocols.

10.3

Draw a picture showing the mbufs involved in the ipq structure and its associated fragment list for the fragment with an ID of 7 in Figure 10.17.

Answer 10.3

The fragment is read into a cluster since the data length (216 + 20) is greater than 208 (Figure 2.16).

m_pullup in Figure 10.11 moves the first 40 bytes into a separate mbuf as in Figure 2.18.

10.4

[Auerbach 1994] suggests that after fragmenting a datagram, the last fragment should be sent first. If the receiving system gets that last fragment first, it can use the offset to allocate an appropriately sized reassembly buffer for the datagram. Modify ip_output to send the last fragment first.

[Auerbach 1994] notes that some commercial TCP/IP implementations have been known to crash if they receive the last fragment first.

10.5

Use the statistics in Figure 8.5 to answer the following questions. What is the average number of fragments per reassembled datagram? What is the average number of fragments created when an outgoing datagram is fragmented?

Answer 10.5

The average number of received fragments per datagram is [equation missing from source].

The average number of fragments created for an outgoing datagram is [equation missing from source].

10.6

What happens to a packet when the reserved bit in ip_off is set?

Answer 10.6

In Figure 10.11 the packet is initially processed as a fragment. The reserved bit is discarded when ip_off is left shifted. The resulting packet is processed as a fragment or as a complete datagram, depending on the values of the MF and offset bits.
