Chapter 2. Mbufs: Memory Buffers

Introduction

Networking protocols place many demands on the memory management facilities of the kernel. These demands include easily manipulating buffers of varying sizes, prepending and appending data to the buffers as the lower layers encapsulate data from higher layers, removing data from buffers (as headers are removed as data packets are passed up the protocol stack), and minimizing the amount of data copied for all these operations. The performance of the networking protocols is directly related to the memory management scheme used within the kernel.

In Chapter 1 we introduced the memory buffer used throughout the Net/3 kernel: the mbuf, which is an abbreviation for “memory buffer.” In this chapter we look in more detail at mbufs and at the functions within the kernel that are used to manipulate them, as we will encounter mbufs on almost every page of the text. Understanding mbufs is essential for understanding the rest of the text.

The main use of mbufs is to hold the user data that travels from the process to the network interface, and vice versa. But mbufs are also used to contain a variety of other miscellaneous data: source and destination addresses, socket options, and so on.

Figure 2.1 shows the four different kinds of mbufs that we’ll encounter, depending on the M_PKTHDR and M_EXT flags in the m_flags member. The differences between the four mbufs in Figure 2.1, from left to right, are as follows:

Figure 2.1. Four different types of mbufs, depending on the m_flags value.

  1. If m_flags equals 0, the mbuf contains only data. There is room in the mbuf for up to 108 bytes of data (the m_dat array). The m_data pointer points somewhere in this 108-byte buffer. We show it pointing to the start of the buffer, but it can point anywhere in the buffer. The m_len member specifies the number of bytes of data, starting at m_data. Figure 1.6 was an example of this type of mbuf.

    In Figure 2.1 there are six members in the m_hdr structure, and its total size is 20 bytes. When we look at the C definition of this structure (Figure 2.8) we’ll see that the first four members occupy 4 bytes each and the last two occupy 2 bytes each. We don’t try to differentiate between the 4-byte members and the 2-byte members in Figure 2.1.

  2. The second type of mbuf has an m_flags value of M_PKTHDR, specifying a packet header, that is, the first mbuf describing a packet of data. The data is still contained within the mbuf itself, but because of the 8 bytes taken by the packet header, only 100 bytes of data fit within this mbuf (in the m_pktdat array). Figure 1.10 was an example of this type of mbuf.

    The m_pkthdr.len value is the total length of all the data in the mbuf chain for this packet: the sum of the m_len values for all the mbufs linked through the m_next pointer, as shown in Figure 1.8. The m_pkthdr.rcvif member is not used for output packets, but for received packets it contains a pointer to the receiving interface’s ifnet structure (Figure 3.6).

  3. The next type of mbuf does not contain a packet header (M_PKTHDR is not set) but contains more than 208 bytes of data, so an external buffer called a cluster is used (M_EXT is set). Room is still allocated in the mbuf itself for the packet header structure, but it is unused—we show it shaded in Figure 2.1. Instead of using multiple mbufs to contain the data (the first with 100 bytes of data, and all the rest with 108 bytes of data each), Net/3 allocates a cluster of size 1024 or 2048 bytes. The m_data pointer in the mbuf points somewhere inside this cluster.

    The Net/3 release supports seven different architectures. Four define the size of a cluster as 1024 bytes (the traditional value) and three define it as 2048. The reason 1024 has been used historically is to save memory: if the cluster size is 2048, about one-quarter of each cluster is unused for Ethernet packets (1500 bytes maximum). We’ll see in Section 27.5 that the Net/3 TCP never sends more than the cluster size per TCP segment, so with a cluster size of 1024, almost one-third of each 1500-byte Ethernet frame is unused. But [Mogul 1993, Figure 15.15] shows that a sizable performance improvement occurs on an Ethernet when maximum-sized frames are sent instead of 1024-byte frames. This is a performance-versus-memory tradeoff. Older systems used 1024-byte clusters to save memory while newer systems with cheaper memory use 2048 to increase performance. Throughout this text we assume a cluster size of 2048.

    Unfortunately different names have been used for what we call clusters. The constant MCLBYTES is the size of these buffers (1024 or 2048) and the names of the macros to manipulate these buffers are MCLGET, MCLALLOC, and MCLFREE. This is why we call them clusters. But we also see that the mbuf flag is M_EXT, which stands for “external” buffer. Finally, [Leffler et al. 1989] calls them mapped pages. This latter name refers to their implementation, and we’ll see in Section 2.9 that clusters can be shared when a copy is required.

    We would expect the minimum value of m_len to be 209 for this type of mbuf, not 208 as we indicate in the figure. That is, a record with 208 bytes of data can be stored in two mbufs, with 100 bytes in the first and 108 in the second. The source code, however, has a bug and allocates a cluster if the size is greater than or equal to 208.

  4. The final type of mbuf contains a packet header and contains more than 208 bytes of data. Both M_PKTHDR and M_EXT are set.

There are numerous additional points we need to make about Figure 2.1:

  • The size of the mbuf structure is always 128 bytes. This means the amount of unused space following the m_ext structure in the two mbufs on the right in Figure 2.1 is 88 bytes (128 - 20 - 8 - 12).

  • A data buffer with an m_len of 0 bytes is OK since some protocols (e.g., UDP) allow 0-length records.

  • In each of the mbufs we show the m_data member pointing to the beginning of the corresponding buffer (either the mbuf buffer itself or a cluster). This pointer can point anywhere in the corresponding buffer, not necessarily the front.

  • Mbufs with a cluster always contain the starting address of the buffer (m_ext.ext_buf) and its size (m_ext.ext_size). We assume a size of 2048 throughout this text. The m_data and m_ext.ext_buf members are not the same (as we show) unless m_data also points to the first byte of the buffer. The third member of the m_ext structure, ext_free, is not currently used by Net/3.

  • The m_next pointer links together the mbufs forming a single packet (record) into an mbuf chain, as in Figure 1.8.

  • The m_nextpkt pointer links multiple packets (records) together to form a queue of mbufs. Each packet on the queue can be a single mbuf or an mbuf chain. The first mbuf of each packet contains a packet header. If multiple mbufs define a packet, the m_nextpkt member of the first mbuf is the only one used—the m_nextpkt members of the remaining mbufs on the chain are all null pointers.

Figure 2.2 shows an example of two packets on a queue. It is a modification of Figure 1.8. We have placed the UDP datagram onto the interface output queue (showing that the 14-byte Ethernet header has been prepended to the IP header in the first mbuf on the chain) and have added a second packet to the queue: a TCP segment containing 1460 bytes of user data. The TCP data is contained in a cluster and an mbuf has been prepended to contain its Ethernet, IP, and TCP headers. With the cluster we show that the data pointer into the cluster (m_data) need not point to the front of the cluster. We show that the queue has a head pointer and a tail pointer. This is how the interface output queues are handled in Net/3. We have also added the m_ext structure to the mbuf with the M_EXT flag set and have shaded in the unused pkthdr structure of this mbuf.

Figure 2.2. Two packets on a queue: first with 192 bytes of data and second with 1514 bytes of data.

The first mbuf with the packet header for the UDP datagram has a type of MT_DATA, but the first mbuf with the packet header for the TCP segment has a type of MT_HEADER. This is a side effect of the different ways UDP and TCP prepend headers to their data, and it makes no difference. Mbufs of these two types are essentially the same. It is the m_flags value of M_PKTHDR in the first mbuf on the chain that indicates a packet header.

Careful readers may note a difference between our picture of an mbuf (the Net/3 mbuf, Figure 2.1) and the picture in [Leffler et al. 1989, p. 290], a Net/1 mbuf. The changes were made in Net/2: adding the m_flags member, renaming the m_act pointer to be m_nextpkt, and moving this pointer to the front of the mbuf.

The difference in the placement of the protocol headers in the first mbuf for the UDP and TCP examples is caused by UDP calling M_PREPEND (Figure 23.15 and Exercise 23.1) while TCP calls MGETHDR (Figure 26.25).

Code Introduction

The mbuf functions are in a single C file and the mbuf macros and various mbuf definitions are in a single header, as shown in Figure 2.3.

Figure 2.3. Files discussed in this chapter.

File                 Description
sys/mbuf.h           mbuf structure, mbuf macros and definitions
kern/uipc_mbuf.c     mbuf functions

Global Variables

One global variable is introduced in this chapter, shown in Figure 2.4.

Figure 2.4. Global variables introduced in this chapter.

Variable    Datatype         Description
mbstat      struct mbstat    mbuf statistics (Figure 2.5)

Statistics

Various statistics are maintained in the global structure mbstat, described in Figure 2.5.

Figure 2.5. Mbuf statistics maintained in the mbstat structure.

mbstat member    Description
m_clfree         #free clusters
m_clusters       #clusters obtained from page pool
m_drain          #times protocol's drain functions called to reclaim space
m_drops          #times failed to find space (not used)
m_mbufs          #mbufs obtained from page pool (not used)
m_mtypes[256]    counter of current mbuf allocations: MT_xxx index
m_spare          spare field (not used)
m_wait           #times waited for space (not used)

This structure can be examined with the netstat -m command; Figure 2.6 shows some sample output. The two values printed for the number of mapped pages in use are m_clusters (34) minus m_clfree (32), giving the number of clusters currently in use (2), and m_clusters (34).

Figure 2.6. Sample mbuf statistics.

netstat -m output                                        mbstat member
99 mbufs in use:
        1 mbufs allocated to data                        m_mtypes[MT_DATA]
        43 mbufs allocated to packet headers             m_mtypes[MT_HEADER]
        17 mbufs allocated to protocol control blocks    m_mtypes[MT_PCB]
        20 mbufs allocated to socket names and addresses m_mtypes[MT_SONAME]
        18 mbufs allocated to socket options             m_mtypes[MT_SOOPTS]
2/34 mapped pages in use                                 (see text)
80 Kbytes allocated to network (20% in use)              (see text)
0 requests for memory denied                             m_drops
0 requests for memory delayed                            m_wait
0 calls to protocol drain routines                       m_drain

The number of Kbytes of memory allocated to the network is the mbuf memory (99 x 128 bytes) plus the cluster memory (34 x 2048 bytes) divided by 1024. The percentage in use is the mbuf memory (99 x 128 bytes) plus the cluster memory in use (2 x 2048 bytes) divided by the total network memory (80 Kbytes), times 100.
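Plugging the numbers from Figure 2.6 into these formulas gives a quick check of the arithmetic:

    mbuf memory     =  99 x  128 = 12,672 bytes
    cluster memory  =  34 x 2048 = 69,632 bytes
    total           = (12,672 + 69,632) / 1024 = 80 Kbytes (truncated)
    in use          = (12,672 + 2 x 2048) / (80 x 1024) = 16,768 / 81,920 = 20%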

Kernel Statistics

The mbuf statistics show a common technique that we see throughout the Net/3 sources. The kernel keeps track of certain statistics in a global variable (the mbstat structure in this example). A process (in this case the netstat program) examines the statistics while the kernel is running.

Rather than provide system calls to fetch the statistics maintained by the kernel, the process obtains the address within the kernel of the data structure in which it is interested by reading the information saved by the link editor when the kernel was built. The process then calls the kvm(3) functions to read the corresponding location in the kernel’s memory by using the special file /dev/mem. If the kernel’s data structure changes from one release to the next, any program that reads that structure must also change.
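The following is a minimal sketch of this technique, assuming a BSD-style kvm(3) interface (kvm_open, kvm_nlist, kvm_read) and the conventional "_mbstat" symbol name; it is an illustration, not code from the netstat sources, and the exact nlist initialization and mbstat member types vary by release.

/*
 * Sketch: read the kernel's mbstat structure the way netstat does,
 * through the kvm(3) functions and the kernel namelist.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <fcntl.h>
#include <kvm.h>
#include <nlist.h>
#include <stdio.h>

static struct nlist nl[] = {
    { "_mbstat" },              /* [0] address of the kernel's mbstat */
    { NULL }
};

int
main(void)
{
    kvm_t *kd;
    struct mbstat mb;

    if ((kd = kvm_open(NULL, NULL, NULL, O_RDONLY, "mbstat")) == NULL)
        return (1);
    if (kvm_nlist(kd, nl) != 0 || nl[0].n_value == 0)
        return (1);             /* symbol not found in kernel namelist */
    if (kvm_read(kd, nl[0].n_value, &mb, sizeof(mb)) != sizeof(mb))
        return (1);
    printf("%lu/%lu mapped pages in use\n",
           (u_long)(mb.m_clusters - mb.m_clfree), (u_long)mb.m_clusters);
    kvm_close(kd);
    return (0);
}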

Mbuf Definitions

There are a few constants that we encounter repeatedly when dealing with mbufs. Their values are shown in Figure 2.7. All are defined in mbuf.h except MCLBYTES, which is defined in /usr/include/machine/param.h.

Figure 2.7. Mbuf constants from mbuf.h.

Constant     Value (#bytes)    Description
MCLBYTES     2048              size of an mbuf cluster (external buffer)
MHLEN        100               max amount of data in mbuf with packet header
MINCLSIZE    208               smallest amount of data to put into cluster
MLEN         108               max amount of data in normal mbuf
MSIZE        128               size of each mbuf

mbuf Structure

Figure 2.8 shows the definition of the mbuf structure.

Figure 2.8. Mbuf structures.

-------------------------------------------------------------------------- mbuf.h
 60 /* header at beginning of each mbuf: */
 61 struct m_hdr {
 62     struct mbuf *mh_next;       /* next buffer in chain */
 63     struct mbuf *mh_nextpkt;    /* next chain in queue/record */
 64     int     mh_len;             /* amount of data in this mbuf */
 65     caddr_t mh_data;            /* pointer to data */
 66     short   mh_type;            /* type of data (Figure 2.10) */
 67     short   mh_flags;           /* flags (Figure 2.9) */
 68 };

 69 /* record/packet header in first mbuf of chain; valid if M_PKTHDR set */
 70 struct pkthdr {
 71     int     len;                /* total packet length */
 72     struct ifnet *rcvif;        /* receive interface */
 73 };

 74 /* description of external storage mapped into mbuf, valid if M_EXT set */
 75 struct m_ext {
 76     caddr_t ext_buf;            /* start of buffer */
 77     void    (*ext_free) ();     /* free routine if not the usual */
 78     u_int   ext_size;           /* size of buffer, for ext_free */
 79 };

 80 struct mbuf {
 81     struct m_hdr m_hdr;
 82     union {
 83         struct {
 84             struct pkthdr MH_pkthdr;    /* M_PKTHDR set */
 85             union {
 86                 struct m_ext MH_ext;    /* M_EXT set */
 87                 char    MH_databuf[MHLEN];
 88             } MH_dat;
 89         } MH;
 90         char    M_databuf[MLEN];    /* !M_PKTHDR, !M_EXT */
 91     } M_dat;
 92 };

 93 #define m_next      m_hdr.mh_next
 94 #define m_len       m_hdr.mh_len
 95 #define m_data      m_hdr.mh_data
 96 #define m_type      m_hdr.mh_type
 97 #define m_flags     m_hdr.mh_flags
 98 #define m_nextpkt   m_hdr.mh_nextpkt
 99 #define m_act       m_nextpkt
100 #define m_pkthdr    M_dat.MH.MH_pkthdr
101 #define m_ext       M_dat.MH.MH_dat.MH_ext
102 #define m_pktdat    M_dat.MH.MH_dat.MH_databuf
103 #define m_dat       M_dat.M_databuf
-------------------------------------------------------------------------- mbuf.h

The mbuf structure is defined as an m_hdr structure, followed by a union. As the comments indicate, the contents of the union depend on the flags M_PKTHDR and M_EXT.

93-103

These 11 #define statements simplify access to the members of the structures and unions within the mbuf structure. We will see this technique used throughout the Net/3 sources whenever we encounter a structure containing other structures or unions.

We previously described the purpose of the first two members in the mbuf structure: the m_next pointer links mbufs together into an mbuf chain and the m_nextpkt pointer links mbuf chains together into a queue of mbufs.

Figure 1.8 differentiated between the m_len member of each mbuf and the m_pkthdr.len member in the packet header. The latter is the sum of all the m_len members of all the mbufs on the chain.

There are five independent values for the m_flags member, shown in Figure 2.9.

Figure 2.9. m_flags values.

m_flags        Description
M_BCAST        sent/received as link-level broadcast
M_EOR          end of record
M_EXT          cluster (external buffer) associated with this mbuf
M_MCAST        sent/received as link-level multicast
M_PKTHDR       first mbuf that forms a packet (record)
M_COPYFLAGS    M_PKTHDR | M_EOR | M_BCAST | M_MCAST

We have already described the M_EXT and M_PKTHDR flags. M_EOR is set in an mbuf containing the end of a record. The Internet protocols (e.g., TCP) never set this flag, since TCP provides a byte-stream service without any record boundaries. The OSI and XNS transport layers, however, do use this flag. We will encounter this flag in the socket layer, since this layer is protocol independent and handles data to and from all the transport layers.

The next two flags, M_BCAST and M_MCAST, are set in an mbuf when the packet will be sent to or was received from a link-layer broadcast address or multicast address. These two constants are flags between the protocol layer and the interface layer (Figure 1.3).

The final value, M_COPYFLAGS, specifies the flags that are copied when an mbuf containing a packet header is copied.

Figure 2.10 shows the MT_xxx constants used in the m_type member to identify the type of data stored in the mbuf. Although we tend to think of an mbuf as containing user data that is sent or received, mbufs can contain a variety of different data structures. Recall in Figure 1.6 that an mbuf was used to hold a socket address structure with the destination address for the sendto system call. Its m_type member was set to MT_SONAME.

Figure 2.10. Values for m_type member.

Mbuf m_type    Used in Net/3    Description                     Memory type
               TCP/IP code
MT_CONTROL           •          extra-data protocol message     M_MBUF
MT_DATA              •          dynamic data allocation         M_MBUF
MT_FREE                         should be on free list          M_FREE
MT_FTABLE            •          fragment reassembly header      M_FTABLE
MT_HEADER            •          packet header                   M_MBUF
MT_HTABLE                       IMP host tables                 M_HTABLE
MT_IFADDR                       interface address               M_IFADDR
MT_OOBDATA                      expedited (out-of-band) data    M_MBUF
MT_PCB                          protocol control block          M_PCB
MT_RIGHTS                       access rights                   M_MBUF
MT_RTABLE                       routing tables                  M_RTABLE
MT_SONAME            •          socket name                     M_MBUF
MT_SOOPTS            •          socket options                  M_SOOPTS
MT_SOCKET                       socket structure                M_SOCKET

Not all of the mbuf type values in Figure 2.10 are used in Net/3. Some are historical (MT_HTABLE), and others are not used in the TCP/IP code but are used elsewhere in the kernel. For example, MT_OOBDATA is used by the OSI and XNS protocols, but TCP handles out-of-band data differently (as we describe in Section 29.7). We describe the use of other mbuf types when we encounter them later in the text.

The final column of this figure shows the M_xxx values associated with the piece of memory allocated by the kernel for the different types of mbufs. There are about 60 possible M_xxx values assigned to the different types of memory allocated by the kernel’s malloc function and MALLOC macro. Figure 2.6 showed the mbuf allocation statistics from the netstat -m command including the counters for each MT_xxx type. The vmstat -m command shows the kernel’s memory allocation statistics including the counters for each M_xxx type.

Since mbufs have a fixed size (128 bytes) there is a limit for what an mbuf can be used for—the data contents cannot exceed 108 bytes. Net/2 used an mbuf to hold a TCP protocol control block (which we cover in Chapter 24), using the mbuf type of MT_PCB. But 4.4BSD increased the size of this structure from 108 bytes to 140 bytes, forcing the use of a different type of kernel memory allocation for the structure.

Observant readers may have noticed that in Figure 2.10 we say that mbufs of type MT_PCB are not used, yet Figure 2.6 shows a nonzero counter for this type. The Unix domain protocols use this type of mbuf, and it is important to remember that the statistics are for mbuf usage across all protocol suites, not just the Internet protocols.

Simple Mbuf Macros and Functions

There are more than two dozen macros and functions that deal with mbufs (allocate an mbuf, free an mbuf, etc.). We look at the source code for only a few of the macros and functions, to show how they’re implemented.

Some operations are provided as both a macro and function. The macro version has an uppercase name that begins with M, and the function has a lowercase name that begins with m_. The difference between the two is the standard time-versus-space tradeoff. The macro version is expanded inline by the C preprocessor each time it is used (requiring more code space), but it executes faster since it avoids a function call (which can be expensive on some architectures). The function version, on the other hand, costs only a few instructions each time it is invoked (push the arguments onto the stack, call the function, etc.), taking less code space but more execution time.

m_get Function

We’ll look first at the function that allocates an mbuf: m_get, shown in Figure 2.11. This function merely expands the macro MGET.

Figure 2.11. m_get function: allocate an mbuf.

---------------------------------------------------------------------- uipc_mbuf.c
134 struct mbuf *
135 m_get(nowait, type)
136 int     nowait, type;
137 {
138     struct mbuf *m;

139     MGET(m, nowait, type);
140     return (m);
141 }
---------------------------------------------------------------------- uipc_mbuf.c

Notice that the Net/3 code does not use ANSI C argument declarations. All the Net/3 system headers, however, do provide ANSI C function prototypes for all kernel functions, if an ANSI C compiler is being used. For example, the <sys/mbuf.h> header includes the line

struct mbuf *m_get(int, int);

These function prototypes provide compile-time checking of the arguments and return values whenever a kernel function is called.

The caller specifies the nowait argument as either M_WAIT or M_DONTWAIT, depending whether it wants to wait if the memory is not available. As an example of the difference, when the socket layer asks for an mbuf to store the destination address of the sendto system call (Figure 1.6) it specifies M_WAIT, since blocking at this point is OK. But when the Ethernet device driver asks for an mbuf to store a received frame (Figure 1.10) it specifies M_DONTWAIT, since it is executing as a device interrupt handler and cannot be put to sleep waiting for an mbuf. In this case it is better for the device driver to discard the Ethernet frame if the memory is not available.

MGET Macro

Figure 2.12 shows the MGET macro. A call to MGET to allocate the mbuf to hold the destination address for the sendto system call (Figure 1.6) might look like

    MGET(m, M_WAIT, MT_SONAME);
    if (m == NULL)
        return(ENOBUFS);

Figure 2.12. MGET macro.

---------------------------------------------------------------------- mbuf.h
154 #define MGET(m, how, type) { \
155     MALLOC((m), struct mbuf *, MSIZE, mbtypes[type], (how)); \
156     if (m) { \
157         (m)->m_type = (type); \
158         MBUFLOCK(mbstat.m_mtypes[type]++;) \
159         (m)->m_next = (struct mbuf *)NULL; \
160         (m)->m_nextpkt = (struct mbuf *)NULL; \
161         (m)->m_data = (m)->m_dat; \
162         (m)->m_flags = 0; \
163     } else \
164         (m) = m_retry((how), (type)); \
165 }
---------------------------------------------------------------------- mbuf.h

Even though the caller specifies M_WAIT, the return value must still be checked, since, as we’ll see in Figure 2.13, waiting for an mbuf does not guarantee that one will be available.

Figure 2.13. m_retry function.

---------------------------------------------------------------------- uipc_mbuf.c
 92 struct mbuf *
 93 m_retry(i, t)
 94 int     i, t;
 95 {
 96     struct mbuf *m;

 97     m_reclaim();
 98 #define m_retry(i, t)   (struct mbuf *)0
 99     MGET(m, i, t);
100 #undef m_retry
101     return (m);
102 }
---------------------------------------------------------------------- uipc_mbuf.c

154-157

MGET first calls the kernel’s MALLOC macro, which is the general-purpose kernel memory allocator. The array mbtypes converts the mbuf MT_xxx value into the corresponding M_xxx value (Figure 2.10). If the memory can be allocated, the m_type member is set to the argument’s value.

158

The appropriate counter in the kernel structure that keeps mbuf statistics for each type of mbuf (mbstat) is incremented. The macro MBUFLOCK changes the processor priority (Figure 1.13) while executing the statement specified as its argument, and then resets the priority to its previous value. This prevents network device interrupts from occurring while the statement mbstat.m_mtypes[type]++; is executing, because mbufs can be allocated at various layers within the kernel. Consider a system that implements the ++ operator in C using three steps: (1) load the current value into a register, (2) increment the register, and (3) store the register into memory. Assume the counter’s value is 77 and MGET is executing at the socket layer. Assume steps 1 and 2 are executed (the register’s value is 78) and a device interrupt occurs. If the device driver also executes MGET for the same type of mbuf, the value in memory is fetched (77), incremented (78), and stored back into memory. When step 3 of the interrupted execution of MGET resumes, it stores its register (78) into memory. But the counter should be 79, not 78, so the counter has been corrupted.

159-160

The two mbuf pointers, m_next and m_nextpkt, are set to null pointers. It is the caller’s responsibility to add the mbuf to a chain or queue, if necessary.

161-162

Finally the data pointer is set to point to the beginning of the 108-byte mbuf buffer and the flags are set to 0.

163-164

If the call to the kernel’s memory allocator fails, m_retry is called (Figure 2.13). The first argument is either M_WAIT or M_DONTWAIT.

m_retry Function

Figure 2.13 shows the m_retry function.

92-97

The first function called by m_retry is m_reclaim. We’ll see in Section 7.4 that each protocol can define a “drain” function to be called by m_reclaim when the system gets low on available memory. We’ll also see in Figure 10.32 that when IP’s drain function is called, all IP fragments waiting to be reassembled into IP datagrams are discarded. TCP’s drain function does nothing and UDP doesn’t even define a drain function.

98-102

Since there’s a chance that more memory might be available after the call to m_reclaim, the MGET macro is called again, to try to obtain the mbuf. Before expanding the MGET macro (Figure 2.12), m_retry is defined to be a null pointer. This prevents an infinite loop if the memory still isn’t available: the expansion of MGET will set m to this null pointer instead of calling the m_retry function. After the expansion of MGET, this temporary definition of m_retry is undefined, in case there is another reference to MGET later in the source file.

Mbuf Locking

In the functions and macros that we’ve looked at in this section, other than the call to MBUFLOCK in Figure 2.12, there are no calls to the spl functions to protect these functions and macros from being interrupted. What we haven’t shown, however, is that the macro MALLOC contains an splimp at the beginning and an splx at the end. The macro MFREE contains the same protection. Mbufs are allocated and released at all layers within the kernel, so the kernel must protect the data structures that it uses for memory allocation.

Additionally, the macros MCLALLOC and MCLFREE, which allocate and release an mbuf cluster, are surrounded by an splimp and an splx, since they modify a linked list of available clusters.

Since the memory allocation and release macros along with the cluster allocation and release macros are protected from interrupts, we normally do not encounter calls to the spl functions around macros and functions such as MGET and m_get.
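As a concrete picture of this protection, here is a minimal sketch of the spl bracketing idiom, in the style of MBUFLOCK and the protection inside MALLOC. The helper name example_count_mbuf is ours; this is not the actual Net/3 macro expansion.

/* Sketch of the spl bracketing idiom (hypothetical helper, not Net/3 source).
 * splimp() raises the processor priority enough to block network device
 * interrupts and returns the previous priority; splx() restores it. */
static void
example_count_mbuf(int type)
{
    int s;

    s = splimp();
    mbstat.m_mtypes[type]++;    /* safe from interrupt-level MGET calls */
    splx(s);
}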

m_devget and m_pullup Functions

We encounter the m_pullup function when we show the code for IP, ICMP, IGMP, UDP, and TCP. It is called to guarantee that the specified number of bytes (the size of the corresponding protocol header) are contiguous in the first mbuf of a chain; otherwise the specified number of bytes are copied to a new mbuf and made contiguous. To understand the usage of m_pullup we must describe its implementation and its interaction with both the m_devget function and the mtod and dtom macros. This description also provides additional insight into the usage of mbufs in Net/3.

m_devget Function

When an Ethernet frame is received, the device driver calls the function m_devget to create an mbuf chain and copy the frame from the device into the chain. Depending on the length of the received frame (excluding the Ethernet header), there are four different possibilities for the resulting mbuf chain. The first two possibilities are shown in Figure 2.14.

Figure 2.14. First two types of mbufs created by m_devget.

  1. The left mbuf in Figure 2.14 is used when the amount of data is between 0 and 84 bytes. In this figure we assume there are 52 bytes of data: a 20-byte IP header and a 32-byte TCP header (the standard 20-byte TCP header plus 12 bytes of TCP options) but no TCP data. Since the data in the mbuf returned by m_devget starts with the IP header, the realistic minimum value for m_len is 28: 20 bytes for an IP header, 8 bytes for a UDP header, and a 0-length UDP datagram.

    m_devget leaves 16 bytes unused at the beginning of the mbuf. Although the 14-byte Ethernet header is not stored here, room is allocated for a 14-byte Ethernet header on output, should the same mbuf be used for output. We’ll encounter two functions that generate a response by using the received mbuf as the outgoing mbuf: icmp_reflect and tcp_respond. In both cases the size of the received datagram is normally less than 84 bytes, so it costs nothing to leave room for 16 bytes at the front, which saves time when building the outgoing datagram. The reason 16 bytes are allocated, and not 14, is to have the IP header longword aligned in the mbuf.

  2. If the amount of data is between 85 and 100 bytes, the data still fits in a packet header mbuf, but there is no room for the 16 bytes at the beginning. The data starts at the beginning of the m_pktdat array and any unused space is at the end of this array. The mbuf on the right in Figure 2.14 shows this example, assuming 85 bytes of data.

  3. Figure 2.15 shows the third type of mbuf created by m_devget. Two mbufs are required when the amount of data is between 101 and 207 bytes. The first 100 bytes are stored in the first mbuf (the one with the packet header), and the remainder are stored in the second mbuf. In this example we show a 104-byte datagram. No attempt is made to leave 16 bytes at the beginning of the first mbuf.

    Figure 2.15. Third type of mbuf created by m_devget.

  4. Figure 2.16 shows the fourth type of mbuf created by m_devget. If the amount of data is greater than or equal to 208 (MINCLSIZE), one or more clusters are used. The example in the figure assumes a 1500-byte Ethernet frame with 2048-byte clusters. If 1024-byte clusters are in use, this example would require two mbufs, each with the M_EXT flag set, and each pointing to a cluster.

    Figure 2.16. Fourth type of mbuf created by m_devget.

mtod and dtom Macros

The two macros mtod and dtom are also defined in mbuf.h. They simplify complex mbuf structure expressions.

   #define mtod(m,t)   ((t)((m)->m_data))
   #define dtom(x)     ((struct mbuf *)((int)(x) & ~(MSIZE-1)))

mtod (“mbuf-to-data”) returns a pointer to the data associated with an mbuf, and casts the pointer to a specified type. For example, the code

struct mbuf *m;
struct ip *ip;

ip = mtod(m, struct ip *);
ip->ip_v = IPVERSION;

stores in ip the data pointer of the mbuf (m_data). The type cast is required by the C compiler and the code then references the IP header using the pointer ip. We see this macro used when a C structure (often a protocol header) is stored in an mbuf. This macro works if the data is stored in the mbuf itself (Figures 2.14 and 2.15) or if the data is stored in a cluster (Figure 2.16).

The macro dtom (“data-to-mbuf”) takes a pointer to data anywhere within the data portion of the mbuf and returns a pointer to the mbuf structure itself. For example, if we know that ip points within the data area of an mbuf, the sequence

struct mbuf *m;
struct ip *ip;

m = dtom(ip);

stores the pointer to the beginning of the mbuf in m. By knowing that MSIZE (128) is a power of 2, and that mbufs are always aligned by the kernel’s memory allocator on MSIZE byte blocks of memory, dtom just clears the appropriate low-order bits in its argument pointer to find the beginning of the mbuf.
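For example (with addresses chosen purely for illustration), if MSIZE is 128 then ~(MSIZE-1) is 0xffffff80 on a 32-bit system, so a data pointer of 0x80f5342c is masked down to 0x80f53400, the address of the mbuf that contains it.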

There is a problem with dtom: it doesn’t work if its argument points to a cluster, or within a cluster, as in Figure 2.16. Since there is no pointer from the cluster back to the mbuf structure, dtom cannot be used. This leads to the next function, m_pullup.

m_pullup Function and Contiguous Protocol Headers

The m_pullup function has two purposes. The first is when one of the protocols (IP, ICMP, IGMP, UDP, or TCP) finds that the amount of data in the first mbuf (m_len) is less than the size of the minimum protocol header (e.g., 20 for IP, 8 for UDP, 20 for TCP). m_pullup is called on the assumption that the remaining part of the header is in the next mbuf on the chain. m_pullup rearranges the mbuf chain so that the first N bytes of data are contiguous in the first mbuf on the chain. N is an argument to the function that must be less than or equal to 100 (MHLEN). If the first N bytes are contiguous in the first mbuf, then both of the macros mtod and dtom will work.

For example, we’ll encounter the following code in the IP input routine:

     if (m->m_len < sizeof(struct ip) &&
         (m = m_pullup(m, sizeof(struct ip))) == 0)  {
             ipstat.ips_toosmall++;
             goto next;
     }
     ip = mtod(m, struct ip *);

If the amount of data in the first mbuf is less than 20 (the size of the standard IP header), m_pullup is called. m_pullup can fail for two reasons: (1) if it needs another mbuf and its call to MGET fails, or (2) if the total amount of data in the mbuf chain is less than the requested number of contiguous bytes (what we called N, which in this case is 20). The second reason is the most common cause of failure. In this example, if m_pullup fails, an IP counter is incremented and the IP datagram is discarded. Notice that this code assumes the reason for failure is that the amount of data in the mbuf chain is less than 20 bytes.

In actuality, m_pullup is rarely called in this scenario (notice that C’s && operator only calls it when the mbuf length is smaller than expected) and when it is called, it normally fails. The reason can be seen by looking at Figure 2.14 through Figure 2.16: there is room in the first mbuf, or in the cluster, for at least 100 contiguous bytes, starting with the IP header. This allows for the maximum IP header of 60 bytes followed by 40 bytes of TCP header. (The other protocols—ICMP, IGMP, and UDP—have headers smaller than 40 bytes.) If the data bytes are available in the mbuf chain (the packet is not smaller than the minimum required by the protocol), then the required number of bytes should always be contiguous in the first mbuf. But if the received packet is too short (m_len is less than the expected minimum), then m_pullup is called and it returns an error, since the required amount of data is not available in the mbuf chain.

Berkeley-derived kernels maintain a variable named MPFail that is incremented each time m_pullup fails. On a Net/3 system that had received over 27 million IP datagrams, MPFail was 9. The counter ipstat.ips_toosmall was also 9 and all the other protocol counters (i.e., ICMP, IGMP, UDP, and TCP) following a failure of m_pullup were 0. This confirms our statement that most failures of m_pullup are because the received IP datagram was too small.

m_pullup and IP Fragmentation and Reassembly

The second use of m_pullup concerns IP reassembly and TCP reassembly. Assume IP receives a packet of length 296, which is a fragment of a larger IP datagram. The mbuf passed from the device driver to IP input looks like the one we showed in Figure 2.16: the 296 bytes of data are stored in a cluster. We show this in Figure 2.17.

Figure 2.17. An IP fragment of length 296.

The problem is that the IP fragmentation algorithm keeps the individual fragments on a doubly linked list, using the source and destination IP address fields in the IP header to hold the forward and backward list pointers. (These two IP addresses are saved, of course, in the head of the list, since they must be put back into the reassembled datagram. We describe this in Chapter 10.) But if the IP header is in a cluster, as shown in Figure 2.17, these linked list pointers would be in the cluster, and when the list is traversed at some later time, the pointer to the IP header (i.e., the pointer to the beginning of the cluster) could not be converted into the pointer to the mbuf. This is the problem we mentioned earlier in this section: the dtom macro cannot be used if m_data points into a cluster, because there is no back pointer from the cluster to the mbuf. IP fragmentation cannot store the links in the cluster as shown in Figure 2.17.

To solve this problem the IP fragmentation routine always calls m_pullup when a fragment is received, if the fragment is contained in a cluster. This forces the 20-byte IP header into its own mbuf. The code looks like

if (m->m_flags & M_EXT) {
    if ((m = m_pullup (m, sizeof (struct ip))) == 0) {
        ipstat.ips_toosmall++;
        goto next;
    }
    ip = mtod(m, struct ip *);
}

Figure 2.18 shows the resulting mbuf chain, after m_pullup is called. m_pullup allocates a new mbuf, prepends it to the chain, and moves the first 40 bytes of data from the cluster into the new mbuf. The reason it moves 40 bytes, and not just the requested 20, is to try to save an additional call at a later time when IP passes the datagram to a higher-layer protocol (e.g., ICMP, IGMP, UDP, or TCP). The magic number 40 (max_protohdr in Figure 7.17) is because the largest protocol header normally encountered is the combination of a 20-byte IP header and a 20-byte TCP header. (This assumes that other protocol suites, such as the OSI protocols, are not compiled into the kernel.)

Figure 2.18. An IP fragment of length 296, after calling m_pullup.

In Figure 2.18 the IP fragmentation algorithm can save a pointer to the IP header contained in the mbuf on the left, and this pointer can be converted into a pointer to the mbuf itself using dtom at a later time.

Avoidance of m_pullup by TCP Reassembly

The reassembly of TCP segments uses a different technique to avoid calling m_pullup. This is because m_pullup is expensive: memory is allocated and data is copied from a cluster to an mbuf. TCP tries to avoid data copying whenever possible.

Chapter 19 of Volume 1 mentions that about one-half of TCP data is bulk data (often 512 or more bytes of data per segment) and the other half is interactive data (of which about 90% of the segments contain less than 10 bytes of data). Hence, when TCP receives segments from IP they are usually in the format shown on the left of Figure 2.14 (a small amount of interactive data, stored in the mbuf itself) or in the format shown in Figure 2.16 (bulk data, stored in a cluster). When TCP segments arrive out of order, they are stored on a doubly linked list by TCP. As with IP fragmentation, fields in the IP header are used to hold the list pointers, which is OK since these fields are no longer needed once the IP datagram is accepted by TCP. But the same problem arises with the conversion of a list pointer into the corresponding mbuf pointer, when the IP header is stored in a cluster (Figure 2.17).

To solve the problem, we’ll see in Section 27.9 that TCP stores the mbuf pointer in some unused fields in the TCP header, providing a back pointer of its own from the cluster to the mbuf, just to avoid calling m_pullup for every out-of-order segment. If the IP header is contained in the data portion of the mbuf (Figure 2.18), then this back pointer is superfluous, since the dtom macro would work on the list pointer. But if the IP header is contained in a cluster, this back pointer is required. We’ll examine the source code that implements this technique when we describe tcp_reass in Section 27.9.

Summary of m_pullup Usage

We’ve described three main points about m_pullup.

  • Most device drivers do not split the first portion of an IP datagram between mbufs. Therefore the possible calls to m_pullup that we’ll encounter in every protocol (IP, ICMP, IGMP, UDP, and TCP), just to assure that the protocol header is stored contiguously, rarely take place. When these calls to m_pullup do occur, it is normally because the IP datagram is too small, in which case m_pullup returns an error, the datagram is discarded, and an error counter is incremented.

  • m_pullup is called for every received IP fragment, when the IP fragment is stored in a cluster. This means that m_pullup is called for almost every received fragment, since the length of most fragments is greater than 208 bytes.

  • As long as TCP segments are not fragmented by IP, the receipt of a TCP segment, whether it be in order or out of order, should not invoke m_pullup. This is one reason to avoid IP fragmentation with TCP.

Summary of Mbuf Macros and Functions

Figure 2.19 lists the macros and Figure 2.20 lists the functions that we’ll encounter in the code that operates on mbufs. The macros in Figure 2.19 are shown as function prototypes, not as #define statements, to show the data types of the arguments. We will not go through the source code implementation of these routines since they are concerned primarily with manipulating the mbuf data structures and involve no networking issues. Also, there are additional mbuf macros and functions used elsewhere in the Net/3 sources that we don’t show in these two figures since we won’t encounter them in the text.

Figure 2.19. Mbuf macros that we’ll encounter in the text.

MCLGET

Get a cluster (an external buffer) and set the data pointer (m_data) of the existing mbuf pointed to by m to point to the cluster. If memory for a cluster is not available, the M_EXT flag in the mbuf is not set on return.

    void MCLGET(struct mbuf *m, int nowait);

MFREE

Free the single mbuf pointed to by m. If m points to a cluster (M_EXT is set), the cluster’s reference count is decremented but the cluster is not released until its reference count reaches 0 (as discussed in Section 2.9). On return, the pointer to m’s successor (pointed to by m->m_next, which can be null) is stored in n.

    void MFREE(struct mbuf *m, struct mbuf *n);

MGETHDR

Allocate an mbuf and initialize it as a packet header. This macro is similar to MGET (Figure 2.12) except the M_PKTHDR flag is set and the data pointer (m_data) points to the 100-byte buffer just beyond the packet header.

    void MGETHDR(struct mbuf *m, int nowait, int type);

MH_ALIGN

Set the m_data pointer of an mbuf containing a packet header to provide room for an object of size len bytes at the end of the mbuf’s data area. The data pointer is also longword aligned.

    void MH_ALIGN(struct mbuf *m, int len);

M_PREPEND

Prepend len bytes of data in front of the data in the mbuf pointed to by m. If room exists in the mbuf, just decrement the pointer (m_data) and increment the length (m_len) by len bytes. If there is not enough room, a new mbuf is allocated, its m_next pointer is set to m, a pointer to the new mbuf is stored in m, and the data pointer of the new mbuf is set so that the len bytes of data go at the end of the mbuf (i.e., MH_ALIGN is called). Also, if a new mbuf is allocated and the existing mbuf had its packet header flag set, the packet header is moved from the existing mbuf to the new one.

    void M_PREPEND(struct mbuf *m, int len, int nowait);

dtom

Convert the pointer x, which must point somewhere within the data area of an mbuf, into a pointer to the beginning of the mbuf.

    struct mbuf *dtom(void *x);

mtod

Type cast the pointer to the data area of the mbuf pointed to by m to type.

    type mtod(struct mbuf *m, type);

Figure 2.20. Mbuf functions that we’ll encounter in the text.

m_adj

Remove len bytes of data from the mbuf pointed to by m. If len is positive, that number of bytes is trimmed from the start of the data in the mbuf chain, otherwise the absolute value of len bytes is trimmed from the end of the data in the mbuf chain.

    void m_adj(struct mbuf *m, int len);

m_cat

Concatenate the mbuf chain pointed to by n to the end of the mbuf chain pointed to by m. We encounter this function when we describe IP reassembly (Chapter 10).

    void m_cat(struct mbuf *m, struct mbuf *n);

m_copy

A three-argument version of m_copym that implies a fourth argument of M_DONTWAIT.

    struct mbuf *m_copy(struct mbuf *m, int offset, int len);

m_copydata

Copy len bytes of data from the mbuf chain pointed to by m into the buffer pointed to by cp. The copying starts from the specified byte offset from the beginning of the data in the mbuf chain.

    void m_copydata(struct mbuf *m, int offset, int len, caddr_t cp);

m_copyback

Copy len bytes of data from the buffer pointed to by cp into the mbuf chain pointed to by m. The data is stored starting at the specified byte offset in the mbuf chain. The mbuf chain is extended with additional mbufs if necessary.

    void m_copyback(struct mbuf *m, int offset, int len, caddr_t cp);

m_copym

Create a new mbuf chain and copy len bytes of data starting at offset from the mbuf chain pointed to by m. A pointer to the new mbuf chain is returned as the value of the function. If len equals the constant M_COPYALL, the remainder of the mbuf chain starting at offset is copied. We say more about this function in Section 2.9.

    struct mbuf *m_copym(struct mbuf *m, int offset, int len, int nowait);

m_devget

Create a new mbuf chain with a packet header and return the pointer to the chain. The len and rcvif fields in the packet header are set to len and ifp. The function copy is called to copy the data from the device interface (pointed to by buf) into the mbuf. If copy is a null pointer, the function bcopy is called. off is 0 since trailer protocols are no longer supported. We described this function in Section 2.6.

    struct mbuf *m_devget(char *buf, int len, int off, struct ifnet *ifp,
                          void (*copy)(const void *, void *, u_int));

m_free

A function version of the macro MFREE.

    struct mbuf *m_free(struct mbuf *m);

m_freem

Free all the mbufs in the chain pointed to by m.

    void m_freem(struct mbuf *m);

m_get

A function version of the MGET macro. We showed this function in Figure 2.11.

    struct mbuf *m_get(int nowait, int type);

m_getclr

This function calls the MGET macro to get an mbuf and then zeros the 108-byte buffer.

    struct mbuf *m_getclr(int nowait, int type);

m_gethdr

A function version of the MGETHDR macro.

    struct mbuf *m_gethdr(int nowait, int type);

m_pullup

Rearrange the existing data in the mbuf chain pointed to by m so that the first len bytes of data are stored contiguously in the first mbuf in the chain. If this function succeeds, then the mtod macro returns a pointer that correctly references a structure of size len. We described this function in Section 2.6.

    struct mbuf *m_pullup(struct mbuf *m, int len);

In all the prototypes the argument nowait is either M_WAIT or M_DONTWAIT, and the argument type is one of the MT_xxx constants shown in Figure 2.10.

As an example of M_PREPEND, this macro was called when the IP and UDP headers were prepended to the user’s data in the transition from Figure 1.7 to Figure 1.8, causing another mbuf to be allocated. But when this macro was called again (in the transition from Figure 1.8 to Figure 2.2) to prepend the Ethernet header, room already existed in the mbuf for the headers.

The data type of the last argument for m_copydata is caddr_t, which stands for “core address.” This data type is normally defined in <sys/types.h> to be a char *. It was originally used internally by the kernel, but got externalized when used by certain system calls. For example, the mmap system call, in both 4.4BSD and SVR4, uses caddr_t as the type of the first argument and as the return value type.
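To give a feel for how several of these macros and functions fit together, the following sketch builds a single-mbuf packet from a contiguous buffer, roughly the single-cluster case of what a driver’s receive path does. The function name example_build_pkt is ours and the error handling is abbreviated; this is an illustration of the macro usage described above, not code from the Net/3 sources.

/* Sketch (hypothetical helper, not Net/3 source): copy len bytes from a
 * contiguous buffer into a single packet-header mbuf, attaching a cluster
 * when the data won't fit in the 100-byte m_pktdat array. */
struct mbuf *
example_build_pkt(char *src, int len, struct ifnet *ifp)
{
    struct mbuf *m;

    if (len > MCLBYTES)
        return (NULL);                  /* this sketch handles one cluster only */
    MGETHDR(m, M_DONTWAIT, MT_DATA);    /* packet header mbuf */
    if (m == NULL)
        return (NULL);
    if (len > MHLEN) {                  /* too big for the mbuf itself */
        MCLGET(m, M_DONTWAIT);          /* try to attach a 2048-byte cluster */
        if ((m->m_flags & M_EXT) == 0) {
            m_freem(m);                 /* cluster allocation failed */
            return (NULL);
        }
    }
    m->m_len = m->m_pkthdr.len = len;
    m->m_pkthdr.rcvif = ifp;
    bcopy(src, mtod(m, caddr_t), len);  /* copy into mbuf or cluster */
    return (m);
}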

Summary of Net/3 Networking Data Structures

This section summarizes the types of data structures we’ll encounter in the Net/3 networking code. Other data structures are used in the Net/3 kernel (interested readers should examine the <sys/queue.h> header), but the following are the ones we’ll encounter in this text.

  1. An mbuf chain: a list of mbufs, linked through the m_next pointer. We’ve seen numerous examples of these already.

  2. A linked list of mbuf chains with a head pointer only. The mbuf chains are linked using the m_nextpkt pointer in the first mbuf of each chain.

    Figure 2.21 shows this type of list. Examples of this data structure are a socket’s send buffer and receive buffer.

    Figure 2.21. Linked list of mbuf chains with head pointer only.

    The top two mbufs form the first record on the queue, and the three mbufs on the bottom form the second record on the queue. For a record-based protocol, such as UDP, we can encounter multiple records per queue, but for a protocol such as TCP that has no record boundaries, we’ll find only a single record (one mbuf chain possibly consisting of multiple mbufs) per queue.

    To append an mbuf to the first record on the queue requires going through all the mbufs comprising the first record, until the one with a null m_next pointer is encountered. To append an mbuf chain comprising a new record to the queue requires going through all the records until the one with a null m_nextpkt pointer is encountered.

  3. A linked list of mbuf chains with head and tail pointers.

    Figure 2.22 shows this type of list. We encounter this with the interface queues (Figure 3.13), and showed an earlier example in Figure 2.2.

    Figure 2.22. Linked list with head and tail pointers.

    The only change in this figure from Figure 2.21 is the addition of a tail pointer, to simplify the addition of new records.

  4. A doubly linked, circular list.

    Figure 2.23 shows this type of list, which we encounter with IP fragmentation and reassembly (Chapter 10), protocol control blocks (Chapter 22), and TCP’s out-of-order segment queue (Section 27.9).

    Figure 2.23. Doubly linked, circular list.

    The elements in the list are not mbufs—they are structures of some type that are defined with two consecutive pointers: a next pointer followed by a previous pointer. Both pointers must appear at the beginning of the structure. If the list is empty, both the next and previous pointers of the head entry point to the head entry.

    For simplicity in the figure we show the back pointers pointing at another back pointer. Obviously all the pointers contain the address of the structure pointed to, that is, the address of a forward pointer (since the forward and backward pointers are always at the beginning of the structure).

    This type of data structure allows easy traversal either forward or backward, and allows easy insertion or deletion at any point in the list.

    The functions insque and remque (Figure 10.20) are called to insert and delete elements in the list; a sketch of the pointer manipulation they perform follows this list.
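A minimal sketch of this convention, using a hypothetical element type and helper names (the real insque and remque operate on structures laid out exactly this way):

/* Every structure placed on one of these circular lists begins with a
 * forward pointer followed by a backward pointer. */
struct qelem {
    struct qelem *q_forw;
    struct qelem *q_back;
};

/* Insert elem after pred (what insque does). */
void
example_insque(struct qelem *elem, struct qelem *pred)
{
    elem->q_forw = pred->q_forw;
    elem->q_back = pred;
    pred->q_forw->q_back = elem;
    pred->q_forw = elem;
}

/* Remove elem from the list it is on (what remque does). */
void
example_remque(struct qelem *elem)
{
    elem->q_forw->q_back = elem->q_back;
    elem->q_back->q_forw = elem->q_forw;
}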

m_copy and Cluster Reference Counts

One obvious advantage with clusters is being able to reduce the number of mbufs required to contain large amounts of data. For example, if clusters were not used, it would require 10 mbufs to contain 1024 bytes of data: the first one with 100 bytes of data, the next eight with 108 bytes of data each, and the final one with 60 bytes of data. There is more overhead involved in allocating and linking 10 mbufs than there is in allocating a single mbuf containing the 1024 bytes in a cluster. A disadvantage with clusters is the potential for wasted space. In our example it takes 2176 bytes using a cluster (2048 + 128), versus 1280 bytes without a cluster (10 x 128).

An additional advantage with clusters is being able to share a cluster between multiple mbufs. We encounter this with TCP output and the m_copy function, but describe it in more detail now.

As an example, assume the application performs a write of 4096 bytes to a TCP socket. Assuming the socket’s send buffer was previously empty, and that the receiver’s window is at least 4096, the following operations take place. One cluster is filled with the first 2048 bytes by the socket layer and the protocol’s send routine is called. The TCP send routine appends the mbuf to its send buffer, as shown in Figure 2.24, and calls tcp_output.

Figure 2.24. TCP socket send buffer containing 2048 bytes of data.

The socket structure contains the sockbuf structure, which holds the head of the list of mbufs on the send buffer: so_snd.sb_mb.

Assuming a TCP maximum segment size (MSS) of 1460 for this connection (typical for an Ethernet), tcp_output builds a segment to send containing the first 1460 bytes of data. It also builds an mbuf containing the IP and TCP headers, leaves room for a link-layer header (16 bytes), and passes this mbuf chain to IP output. The mbuf chain ends up on the interface’s output queue, which we show in Figure 2.25.

Figure 2.25. TCP socket send buffer and resulting segment on interface’s output queue.

In our UDP example in Section 1.9, UDP took the mbuf chain containing the datagram, prepended an mbuf for the protocol headers, and passed the chain to IP output. UDP did not keep the mbuf in its send buffer. TCP cannot do this since TCP is a reliable protocol and it must maintain a copy of the data that it sends, until the data is acknowledged by the other end.

In this example tcp_output calls the function m_copy, requesting a copy be made of 1460 bytes, starting at offset 0 from the start of its send buffer. But since the data is in a cluster, m_copy creates an mbuf (the one on the lower right of Figure 2.25) and initializes it to point to the correct place in the existing cluster (the beginning of the cluster in this example). The length of this mbuf is 1460, even though an additional 588 bytes of data are in the cluster. We show the length of the mbuf chain as 1514, accounting for the Ethernet, IP, and TCP headers.

We also show this mbuf on the lower right of Figure 2.25 containing a packet header, yet this isn’t the first mbuf in the chain. When m_copy makes a copy of an mbuf that contains a packet header and the copy starts from offset 0 in the original mbuf, the packet header is also copied verbatim. Since this mbuf is not the first mbuf in the chain, this extraneous packet header is just ignored. The m_pkthdr.len value of 2048 in this extraneous packet header is also ignored.

This sharing of clusters prevents the kernel from copying the data from one mbuf into another—a big savings. It is implemented by providing a reference count for each cluster that is incremented each time another mbuf points to the cluster, and decremented each time a cluster is released. Only when the reference count reaches 0 is the memory used by the cluster available for some other use. (See Exercise 2.4.)

For example, when the bottom mbuf chain in Figure 2.25 reaches the Ethernet device driver and its contents have been copied to the device, the driver calls m_freem. This function releases the first mbuf with the protocol headers and then notices that the second mbuf in the chain points to a cluster. The cluster reference count is decremented, but since its value becomes 1, it is left alone. It cannot be released since it is still in the TCP send buffer.

Continuing our example, tcp_output returns after passing the 1460-byte segment to IP, since the remaining 588 bytes in the send buffer don’t comprise a full-sized segment. (In Chapter 26 we describe in detail the conditions under which tcp_output sends data.) The socket layer continues processing the data from the application: the remaining 2048 bytes are placed into an mbuf with a cluster, TCP’s send routine is called again, and this new mbuf is appended to the socket’s send buffer. Since a full-sized segment can be sent, tcp_output builds another mbuf chain with the protocol headers and the next 1460 bytes of data. The arguments to m_copy specify a starting offset of 1460 bytes from the start of the send buffer and a length of 1460 bytes. This is shown in Figure 2.26, assuming the mbuf chain is again on the interface output queue (so the length of the first mbuf in the chain reflects the Ethernet, IP, and TCP headers).

Figure 2.26. Mbuf chain to send next 1460-byte TCP segment.

This time the 1460 bytes of data come from two clusters: the first 588 bytes are from the first cluster in the send buffer and the next 872 bytes are from the second cluster in the send buffer. It takes two mbufs to describe these 1460 bytes, but again m_copy does not copy the 1460 bytes of data—it references the existing clusters.

This time we do not show a packet header with either of the mbufs on the bottom right of Figure 2.26. The reason is that the starting offset in the call to m_copy is nonzero. Also, we show the second mbuf in the socket send buffer containing a packet header, even though it is not the first mbuf in the chain. This is a property of the sosend function, and this extraneous packet header is just ignored.

We encounter the m_copy function about a dozen times throughout the text. Although the name implies that a physical copy is made of the data, if the data is contained in a cluster, an additional reference is made to the cluster instead.
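The reference-count logic itself is simple; the following sketch uses hypothetical names (Net/3 keeps the counts in a separate array outside the clusters themselves, as Exercise 2.4 explores), but the idea is the same.

/* Sketch of cluster sharing via reference counts (hypothetical names,
 * not the MCLALLOC/MCLFREE implementation). */
struct cluster {
    int  refcnt;             /* number of mbufs pointing into this cluster */
    char data[2048];         /* MCLBYTES of storage */
};

void
cluster_addref(struct cluster *c)
{
    c->refcnt++;             /* e.g., m_copy referencing the cluster again */
}

int                          /* returns 1 if the cluster may now be reused */
cluster_release(struct cluster *c)
{
    if (--c->refcnt > 0)
        return (0);          /* still referenced, e.g., by TCP's send buffer */
    return (1);              /* last reference gone */
}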

Alternatives

Mbufs are far from perfect and they are berated regularly. Nevertheless, they form the basis for all the Berkeley-derived networking code in use today.

A research implementation of the Internet protocols by Van Jacobson [Partridge 1993] has done away with the complex mbuf data structures in favor of large contiguous buffers. [Jacobson 1993] claims a speed improvement of one to two orders of magnitude, although many other changes were made besides getting rid of mbufs.

The complexity of mbufs is a tradeoff that avoids allocating large fixed buffers that are rarely filled to capacity. At the time mbufs were being designed, a VAX-11/780 with 4 megabytes of memory was a big system, and memory was an expensive resource that needed to be carefully allocated. Today memory is inexpensive, and the focus has shifted toward higher performance and simplicity of code.

The performance of mbufs is also dependent on the amount of data stored in the mbuf. [Hutchinson and Peterson 1991] show that the amount of time required for mbuf processing is nonlinear with respect to the amount of data.

Summary

We’ll encounter mbufs in almost every function in the text. Their main purpose is to hold the user data that travels from the process to the network interface, and vice versa, but mbufs are also used to contain a variety of other miscellaneous data: source and destination addresses, socket options, and so on.

There are four types of mbufs, depending whether the M_PKTHDR and M_EXT flags are on or off:

  • no packet header, with 0 to 108 bytes of data in mbuf itself,

  • packet header, with 0 to 100 bytes of data in mbuf itself,

  • no packet header, with data in cluster (external buffer), and

  • packet header, with data in cluster (external buffer).

We looked at the source code for a few of the mbuf macros and functions, but did not present the source code for all the mbuf routines. Figures 2.19 and 2.20 provide the function prototypes and descriptions of all the mbuf routines that we encounter in the text.

We looked at the operation of two functions that we’ll encounter: m_devget, which is called by many network device drivers to store a received frame; and m_pullup, which is called by all the input routines to place the required protocol headers into contiguous storage in an mbuf.

The clusters (external buffers) pointed to by an mbuf can be shared by m_copy. This is used, for example, by TCP output, because a copy of the data being transmitted must be maintained by the sender until that data is acknowledged by the other end. Sharing clusters through reference counts is a performance improvement over making a physical copy of the data.

Exercises

2.1

In Figure 2.9 the M_COPYFLAGS value was defined. Why was the M_EXT flag not copied?

Answer: The M_EXT flag is a property of the mbuf itself, not a property of the packet described by the mbuf.

2.2

In Section 2.6 we listed two reasons that m_pullup can fail. There are really three reasons. Obtain the source code for this function (Appendix B) and discover the additional reason.

Answer: The caller asks for more than 100 (MHLEN) contiguous bytes.

2.3

To avoid the problems we described in Section 2.6 with the dtom macro when the data is in a cluster, why not just add a back pointer to the mbuf for each cluster?

Answer: This is infeasible since clusters can be pointed to by multiple mbufs (Section 2.9). Also, there is no room in a cluster for a back pointer (Exercise 2.4).

2.4

Since the size of an mbuf cluster is a power of 2 (typically 1024 or 2048), space cannot be taken within the cluster for the reference count. Obtain the Net/3 sources (Appendix B) and determine where these reference counts are stored.

Answer: In the macros MCLALLOC and MCLFREE in <sys/mbuf.h> we see that the reference count is an array named mclrefcnt. This array is allocated when the kernel is initialized in the file machdep.c.

2.5

In Figure 2.5 we noted that the two counters m_drops and m_wait are not currently implemented. Modify the mbuf routines to increment these counters when appropriate.
