Chapter 4. Interfaces: Ethernet

Introduction

In Chapter 3 we discussed the data structures used by all interfaces and the initialization of those data structures. In this chapter we show how the Ethernet device driver operates once it has been initialized and is receiving and transmitting frames. The second half of this chapter covers the generic ioctl commands for configuring network devices. Chapter 5 covers the SLIP and loopback drivers.

We won’t go through the entire source code for the Ethernet driver, since it is around 1,000 lines of C code (half of which is concerned with the hardware details of one particular interface card), but we do look at the device-independent Ethernet code and how the driver interfaces with the rest of the kernel.

If the reader is interested in going through the source code for a driver, the Net/3 release contains the source code for many different interfaces. Access to the interface’s technical specifications is required to understand the device-specific commands. Figure 4.1 shows the various drivers provided with Net/3, including the LANCE driver, which we discuss in this text.

Figure 4.1. Ethernet drivers available in Net/3.

Device                                             File
DEC DEUNA Interface                                vax/if/if_de.c
3Com Ethernet Interface                            vax/if/if_ec.c
Excelan EXOS 204 Interface                         vax/if/if_ex.c
Interlan Ethernet Communications Controller        vax/if/if_il.c
Interlan NP100 Ethernet Communications Controller  vax/if/if_ix.c
Digital Q-BUS to NI Adapter                        vax/if/if_qe.c
CMC ENP-20 Ethernet Controller                     tahoe/if/if_enp.c
Excelan EXOS 202(VME) & 203(QBUS)                  tahoe/if/if_ex.c
ACC VERSAbus Ethernet Controller                   tahoe/if/if_ace.c
AMD 7990 LANCE Interface                           hp300/dev/if_le.c
NE2000 Ethernet                                    i386/isa/if_ne.c
Western Digital 8003 Ethernet Adapter              i386/isa/if_we.c

Network device drivers are accessed through the seven function pointers in the ifnet structure (Figure 3.6). Figure 4.2 lists the entry points to our three example drivers.

Figure 4.2. Interface functions for the example drivers.

ifnet        Ethernet      SLIP      Loopback  Description
if_init      leinit                            hardware initialization
if_output    ether_output  sloutput  looutput  accept and queue frame for transmission
if_start     lestart                           begin transmission of frame
if_done                                        output complete (unused)
if_ioctl     leioctl       slioctl   loioctl   handle ioctl commands from a process
if_reset     lereset                           reset the device to a known state
if_watchdog                                    watch the device for failures or collect statistics

Input functions are not included in Figure 4.2 as they are interrupt-driven for network devices. The configuration of interrupt service routines is hardware-dependent and beyond the scope of this book. We’ll identify the functions that handle device interrupts, but not the mechanism by which these functions are invoked.

Only the if_output and if_ioctl functions are called with any consistency. if_init, if_done, and if_reset are never called or only called from device-specific code (e.g., leinit is called directly by leioctl). if_start is called only by the ether_output function.
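The dispatch through function pointers can be sketched in a few lines of user-space C. The structure below is a hypothetical, much-reduced stand-in for ifnet; demo_ifnet, demo_output, and send_packet are names invented for this illustration and are not Net/3 identifiers. It shows only how a caller reaches a driver's output routine without knowing which driver it is talking to.

```c
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for struct ifnet. */
struct demo_ifnet {
    const char   *if_name;                              /* e.g. "le", "sl", "lo" */
    int         (*if_output)(struct demo_ifnet *, size_t);
    int         (*if_done)(struct demo_ifnet *);        /* may be left null */
    unsigned long if_opackets;
};

/* Stand-in for a driver's output routine: it only counts the packet. */
static int demo_output(struct demo_ifnet *ifp, size_t len)
{
    (void)len;
    ifp->if_opackets++;
    return 0;
}

/* Protocol-side caller: it knows the function pointer, not the driver. */
static int send_packet(struct demo_ifnet *ifp, size_t len)
{
    if (ifp->if_output == NULL)
        return -1;                  /* no output routine installed */
    return (*ifp->if_output)(ifp, len);
}
```

A caller would initialize one structure per interface, for example `struct demo_ifnet le0 = { "le", demo_output, NULL, 0 };`, and then call `send_packet(&le0, 64)`.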

Code Introduction

The code for the Ethernet device driver and the generic interface ioctls resides in two headers and three C files, which are listed in Figure 4.3.

Figure 4.3. Files discussed in this chapter.

File                Description
netinet/if_ether.h  Ethernet structures
net/if.h            ioctl command definitions
net/if_ethersubr.c  generic Ethernet functions
hp300/dev/if_le.c   LANCE Ethernet driver
net/if.c            ioctl processing

Global Variables

The global variables shown in Figure 4.4 include the protocol input queues, the LANCE interface structure, and the Ethernet broadcast address.

Figure 4.4. Global variables introduced in this chapter.

Variable            Datatype            Description
arpintrq            struct ifqueue      ARP input queue
clnlintrq           struct ifqueue      CLNP input queue
ipintrq             struct ifqueue      IP input queue
le_softc            struct le_softc []  LANCE Ethernet interface
etherbroadcastaddr  u_char []           Ethernet broadcast address

le_softc is an array, since there can be several Ethernet interfaces.

Statistics

The statistics collected in the ifnet structure for each interface are described in Figure 4.5.

Figure 4.5. Statistics maintained in the ifnet structure.

ifnet member      Description                                    Used by SNMP
if_collisions     #collisions on CSMA interfaces
if_ibytes         total #bytes received                          •
if_ierrors        #packets received with input errors            •
if_imcasts        #packets received as multicasts or broadcasts  •
if_ipackets       #packets received on interface                 •
if_iqdrops        #packets dropped on input, by this interface   •
if_lastchange     time of last change to statistics              •
if_noproto        #packets destined for unsupported protocol     •
if_obytes         total #bytes sent                              •
if_oerrors        #output errors on interface                    •
if_omcasts        #packets sent as multicasts                    •
if_opackets       #packets sent on interface                     •
if_snd.ifq_drops  #packets dropped during output                 •
if_snd.ifq_len    #packets in output queue                       •

Figure 4.6 shows some sample output from the netstat command, which includes statistics from the ifnet structure.

Figure 4.6. Sample interface statistics.

netstat -i output

Name  Mtu   Network     Address            Ipkts Ierrs     Opkts  Oerrs   Coll
le0   1500  <Link>8.0.9.13.d.33         28680519   814  29234729     12  942798
le0   1500  128.32.33   128.32.33.5     28680519   814  29234729     12  942798
sl0*  296   <Link>                         54036     0     45402      0      0
sl0*  296   128.32.33   128.32.33.5        54036     0     45402      0      0
sl1   296   <Link>                         40397     0     33544      0      0
sl1   296   128.32.33   128.32.33.5        40397     0     33544      0      0
sl2*  296   <Link>                             0     0         0      0      0
sl3*  296   <Link>                             0     0         0      0      0
lo0   1536  <Link>                        493599     0    493599      0      0
lo0   1536  127         127.0.0.1         493599     0    493599      0      0

The first column contains if_name and if_unit displayed as a string. If the interface is shut down (IFF_UP is not set), an asterisk appears next to the name. In Figure 4.6, sl0, sl2, and sl3 are shut down.

The second column shows if_mtu. The output under the “Network” and “Address” headings depends on the type of address. For link-level addresses, the contents of sdl_data from the sockaddr_dl structure are displayed. For IP addresses, the subnet and unicast addresses are displayed. The remaining columns are if_ipackets, if_ierrors, if_opackets, if_oerrors, and if_collisions.

  • Approximately 3% of the packets collide on output (942,798/29,234,729 ≈ 3.2%).

  • The SLIP output queues are never full on this machine since there are no output errors for the SLIP interfaces.

  • The 12 Ethernet output errors are problems detected by the LANCE hardware during transmission. Some of these errors may also be counted as collisions.

  • The 814 Ethernet input errors are also problems detected by the hardware, such as packets that are too short or that have invalid checksums.
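The collision figure in the first observation is simply if_collisions divided by if_opackets. A small helper, written here as an illustrative sketch (collision_percent is not a Net/3 or netstat function), makes the arithmetic explicit:

```c
/* Collisions as a percentage of output packets. */
static double collision_percent(unsigned long collisions,
                                unsigned long opackets)
{
    if (opackets == 0)
        return 0.0;     /* avoid dividing by zero on an idle interface */
    return 100.0 * (double)collisions / (double)opackets;
}
```

Calling `collision_percent(942798UL, 29234729UL)` with the le0 counters from the sample output yields a value a little above 3.2.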

SNMP Variables

Figure 4.7 shows a single interface entry object (ifEntry) from the SNMP interface table (ifTable), which is constructed from the ifnet structures for each interface.

Figure 4.7. Variables in interface table: ifTable.

Interface table, index = <ifIndex>

SNMP variable      ifnet member              Description
ifIndex            if_index                  uniquely identifies the interface
ifDescr            if_name                   text name of interface
ifType             if_type                   type of interface (e.g., Ethernet, SLIP, etc.)
ifMtu              if_mtu                    MTU of the interface in bytes
ifSpeed            (see text)                nominal speed of the interface in bits per second
ifPhysAddress      ac_enaddr                 media address (from arpcom structure)
ifAdminStatus      (see text)                desired state of the interface (IFF_UP flag)
ifOperStatus       if_flags                  operational state of the interface (IFF_UP flag)
ifLastChange       (see text)                last time the statistics changed
ifInOctets         if_ibytes                 total #input bytes
ifInUcastPkts      if_ipackets - if_imcasts  #input unicast packets
ifInNUcastPkts     if_imcasts                #input broadcast or multicast packets
ifInDiscards       if_iqdrops                #packets discarded because of implementation limits
ifInErrors         if_ierrors                #packets with errors
ifInUnknownProtos  if_noproto                #packets destined to an unknown protocol
ifOutOctets        if_obytes                 #output bytes
ifOutUcastPkts     if_opackets - if_omcasts  #output unicast packets
ifOutNUcastPkts    if_omcasts                #output broadcast or multicast packets
ifOutDiscards      if_snd.ifq_drops          #output packets dropped because of implementation limits
ifOutErrors        if_oerrors                #output packets dropped because of errors
ifOutQLen          if_snd.ifq_len            output queue length
ifSpecific         n/a                       SNMP object ID for media-specific information (not implemented)

The ISODE SNMP agent derives ifSpeed from if_type and maintains an internal variable for ifAdminStatus. The agent reports ifLastChange based on if_lastchange in the ifnet structure but relative to the agent’s boot time, not the boot time of the system. The agent returns a null variable for ifSpecific.
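Two of the SNMP variables in Figure 4.7 are not stored directly but are computed as differences of ifnet counters. The sketch below assumes a hypothetical demo_ifstats structure that holds only the four counters involved; it is not the layout used by the ISODE agent or by Net/3.

```c
/* Hypothetical subset of the ifnet counters used in Figure 4.7. */
struct demo_ifstats {
    unsigned long if_ipackets;  /* all packets received */
    unsigned long if_imcasts;   /* received broadcasts and multicasts */
    unsigned long if_opackets;  /* all packets sent */
    unsigned long if_omcasts;   /* sent broadcasts and multicasts */
};

/* ifInUcastPkts = if_ipackets - if_imcasts */
static unsigned long if_in_ucast_pkts(const struct demo_ifstats *s)
{
    return s->if_ipackets - s->if_imcasts;
}

/* ifOutUcastPkts = if_opackets - if_omcasts */
static unsigned long if_out_ucast_pkts(const struct demo_ifstats *s)
{
    return s->if_opackets - s->if_omcasts;
}
```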

Ethernet Interface

Net/3 Ethernet device drivers all follow the same general design. This is common for most Unix device drivers because the writer of a driver for a new interface card often starts with a working driver for another card and modifies it. In this section we’ll provide a brief overview of the Ethernet standard and outline the design of an Ethernet driver. We’ll refer to the LANCE driver to illustrate the design.

Figure 4.8 illustrates Ethernet encapsulation of an IP packet.


Figure 4.8. Ethernet encapsulation of an IP packet.

Ethernet frames consist of 48-bit destination and source addresses followed by a 16-bit type field that identifies the format of the data carried by the frame. For IP packets, the type is 0x0800 (2048). The frame is terminated with a 32-bit CRC (cyclic redundancy check), which detects errors in the frame.

We are describing the original Ethernet framing standard published in 1982 by Digital Equipment Corp., Intel Corp., and Xerox Corp., as it is the most common form used today in TCP/IP networks. An alternative form is specified by the IEEE (Institute of Electrical and Electronics Engineers) 802.2 and 802.3 standards. Section 2.2 in Volume 1 describes the differences between the two forms. See [Stallings 1987] for more information on the IEEE standards.

Encapsulation of IP packets for Ethernet is specified by RFC 894 [Hornig 1984] and for 802.3 networks by RFC 1042 [Postel and Reynolds 1988].

We will refer to the 48-bit Ethernet addresses as hardware addresses. The translation from IP to hardware addresses is done by the ARP protocol described in Chapter 21 (RFC 826 [Plummer 1982]) and from hardware to IP addresses by the RARP protocol (RFC 903 [Finlayson et al. 1984]). Ethernet addresses come in two types, unicast and multicast. A unicast address specifies a single Ethernet interface, and a multicast address specifies a group of Ethernet interfaces. An Ethernet broadcast is a multicast received by all interfaces. Ethernet unicast addresses are assigned by the device’s manufacturer, although some devices allow the address to be changed by software.

Some DECNET protocols require the hardware addresses of a multihomed host to be identical, so DECNET must be able to change the Ethernet unicast address of a device.

Figure 4.9 illustrates the data structures and functions that are part of the Ethernet interface.


Figure 4.9. Ethernet device driver.

In figures, a function is identified by an ellipse (leintr), data structures by a box (le_softc[0]), and a group of functions by a rounded box (ARP protocol).

In the top left corner of Figure 4.9 we show the input queues for the OSI Connectionless Network Layer (clnl) protocol, IP, and ARP. We won’t say anything more about clnlintrq, but include it to emphasize that ether_input demultiplexes Ethernet frames into multiple protocol queues.

Technically, OSI uses the term Connectionless Network Protocol (CLNP versus CLNL) but we show the terminology used by the Net/3 code. The official standard for CLNP is ISO 8473. [Stallings 1993] summarizes the standard.

The le_softc interface structure is in the center of Figure 4.9. We are interested only in the ifnet and arpcom portions of the structure. The remaining portions are specific to the LANCE hardware. We showed the ifnet structure in Figure 3.6 and the arpcom structure in Figure 3.26.

leintr Function

We start with the reception of Ethernet frames. For now, we assume that the hardware has been initialized and the system has been configured so that leintr is called when the interface generates an interrupt. In normal operation, an Ethernet interface receives frames destined for its unicast hardware address and for the Ethernet broadcast address. When a complete frame is available, the interface generates an interrupt and the kernel calls leintr.

In Chapter 12, we’ll see that many Ethernet interfaces may be configured to receive Ethernet multicast frames (other than broadcasts).

Some interfaces can be configured to run in promiscuous mode in which the interface receives all frames that appear on the network. The tcpdump program described in Volume 1 can take advantage of this feature using BPF.

leintr examines the hardware and, if a frame has arrived, calls leread to transfer the frame from the interface to a chain of mbufs (with m_devget). If the hardware reports that a frame transmission has completed or an error has been detected (such as a bad checksum), leintr updates the appropriate interface statistics, resets the hardware, and calls lestart, which attempts to transmit another frame.

All Ethernet device drivers deliver their received frames to ether_input for further processing. The mbuf chain constructed by the device driver does not include the Ethernet header, so it is passed as a separate argument to ether_input. The ether_header structure is shown in Figure 4.10.

Figure 4.10. The ether_header structure.

------------------------------------------------------------------------ if_ether.h
 38 struct ether_header {
 39     u_char  ether_dhost[6];     /* Ethernet destination address */
 40     u_char  ether_shost[6];     /* Ethernet source address */
 41     u_short ether_type;         /* Ethernet frame type */
 42 };
------------------------------------------------------------------------ if_ether.h

38-42

The Ethernet CRC is not generally available. It is computed and checked by the interface hardware, which discards frames that arrive with an invalid CRC. The Ethernet device driver is responsible for converting ether_type between network and host byte order. Outside of the driver, it is always in host byte order.
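The byte-order conversion of the type field can be tried outside the kernel. The sketch below uses a hypothetical demo_ether_header that mirrors Figure 4.10 with fixed-width types, and a helper named frame_type that is not part of Net/3; it assumes a POSIX system that provides ntohs in <arpa/inet.h>.

```c
#include <arpa/inet.h>  /* ntohs */
#include <stdint.h>
#include <string.h>

/* Mirrors the layout of ether_header using fixed-width types. */
struct demo_ether_header {
    uint8_t  ether_dhost[6];    /* destination address */
    uint8_t  ether_shost[6];    /* source address */
    uint16_t ether_type;        /* frame type */
};

/* Return the type field of a raw frame in host byte order. */
static uint16_t frame_type(const uint8_t *frame)
{
    uint16_t type_net;

    /* The type follows the two 6-byte addresses; memcpy avoids
     * any assumption about the alignment of the frame buffer. */
    memcpy(&type_net, frame + 12, sizeof type_net);
    return ntohs(type_net);
}
```

For a frame whose bytes 12 and 13 are 0x08 and 0x00, frame_type returns 0x0800 regardless of the byte order of the host.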

leread Function

The leread function (Figure 4.11) starts with a contiguous buffer of memory passed to it by leintr and constructs an ether_header structure and a chain of mbufs. The chain contains the data from the Ethernet frame. leread also passes the incoming frame to BPF.

Figure 4.11. leread function.

------------------------------------------------------------------------- if_le.c
528 leread(unit, buf, len)
529 int     unit;
530 char   *buf;
531 int     len;
532 {
533     struct le_softc *le = &le_softc[unit];
534     struct ether_header *et;
535     struct mbuf *m;
536     int     off, resid, flags;

537     le->sc_if.if_ipackets++;
538     et = (struct ether_header *) buf;
539     et->ether_type = ntohs((u_short) et->ether_type);
540     /* adjust input length to account for header and CRC */
541     len = len - sizeof(struct ether_header) - 4;
542     off = 0;

543     if (len <= 0) {
544         if (ledebug)
545             log(LOG_WARNING,
546                 "le%d: ierror(runt packet): from %s: len=%d\n",
547                 unit, ether_sprintf(et->ether_shost), len);
548         le->sc_runt++;
549         le->sc_if.if_ierrors++;
550         return;
551     }
552     flags = 0;
553     if (bcmp((caddr_t) etherbroadcastaddr,
554              (caddr_t) et->ether_dhost, sizeof(etherbroadcastaddr)) == 0)
555         flags |= M_BCAST;
556     if (et->ether_dhost[0] & 1)
557         flags |= M_MCAST;

558     /*
559      * Check if there's a bpf filter listening on this interface.
560      * If so, hand off the raw packet to enet.
561      */
562     if (le->sc_if.if_bpf) {
563         bpf_tap(le->sc_if.if_bpf, buf, len + sizeof(struct ether_header));

564         /*
565          * Keep the packet if it's a broadcast or has our
566          * physical ethernet address (or if we support
567          * multicast and it's one).
568          */

569         if ((flags & (M_BCAST | M_MCAST)) == 0 &&
570             bcmp(et->ether_dhost, le->sc_addr,
571                  sizeof(et->ether_dhost)) != 0)
572             return;
573     }
574     /*
575      * Pull packet off interface.  Off is nonzero if packet
576      * has trailing header; m_devget will then force this header
577      * information to be at the front, but we still have to drop
578      * the type and length which are at the front of any trailer data.
579      */
580     m = m_devget((char *) (et + 1), len, off, &le->sc_if, 0);
581     if (m == 0)
582         return;
583     m->m_flags |= flags;
584     ether_input(&le->sc_if, et, m);
585 }
------------------------------------------------------------------------- if_le.c

528-539

The leintr function passes three arguments to leread: unit, which identifies the particular interface card that received a frame; buf, which points to the received frame; and len, the number of bytes in the frame (including the header and the CRC).

The function constructs the ether_header structure by pointing et to the front of the buffer and converting the Ethernet type value to host byte order.

540-551

The number of data bytes is computed by subtracting the sizes of the Ethernet header and the CRC from len. Runt packets, which are too short to be a valid Ethernet frame, are logged, counted, and discarded.
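The length adjustment on line 541 and the runt test on line 543 amount to the following arithmetic. The constants and function names are illustrative only; leread itself uses sizeof(struct ether_header) and the literal 4.

```c
#define DEMO_ETHER_HDR_LEN 14   /* two 6-byte addresses plus 2-byte type */
#define DEMO_ETHER_CRC_LEN 4    /* 32-bit CRC at the end of the frame */

/* Number of data bytes in a received frame of frame_len bytes. */
static int payload_length(int frame_len)
{
    return frame_len - DEMO_ETHER_HDR_LEN - DEMO_ETHER_CRC_LEN;
}

/* A frame with no data bytes left after the header and CRC is a runt. */
static int is_runt(int frame_len)
{
    return payload_length(frame_len) <= 0;
}
```

A 64-byte frame therefore carries 46 data bytes, and any frame of 18 bytes or fewer is treated as a runt by this test.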

552-557

Next, the destination address is examined to determine if it is the Ethernet broadcast or an Ethernet multicast address. The Ethernet broadcast address is a special case of an Ethernet multicast address; it has every bit set. etherbroadcastaddr is an array defined as

    u_char  etherbroadcastaddr[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

This is a convenient way to define a 48-bit value in C. This technique works only if we assume that characters are 8-bit values, something that isn’t guaranteed by ANSI C.

If bcmp reports that etherbroadcastaddr and ether_dhost are the same, the M_BCAST flag is set.

An Ethernet multicast address is identified by the low-order bit of the most significant byte of the address. Figure 4.12 illustrates this.


Figure 4.12. Testing for an Ethernet multicast address.

In Chapter 12 we’ll see that not all Ethernet multicast frames are IP multicast datagrams and that IP must examine the packet further.

If the multicast bit is on in the address, M_MCAST is set in the mbuf header. The order of the tests is important: first ether_input compares the entire 48-bit address to the Ethernet broadcast address, and if they are different it checks the low-order bit of the most significant byte to identify an Ethernet multicast address (Exercise 4.1).
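The ordered test (broadcast first, then the multicast bit) can be written as a small stand-alone classifier. This is a sketch of the technique, not the kernel code: classify_dest, demo_broadcast, and the DEMO_ constants are invented for the example, and memcmp stands in for the kernel's bcmp.

```c
#include <string.h>

enum { DEMO_UNICAST = 0, DEMO_MCAST = 1, DEMO_BCAST = 2 };

static const unsigned char demo_broadcast[6] =
    { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

/* Classify a 48-bit Ethernet destination address. */
static int classify_dest(const unsigned char dhost[6])
{
    /* Compare all 48 bits first: the broadcast address also has
     * the multicast bit set, so it must be recognized before the
     * bit test below. */
    if (memcmp(dhost, demo_broadcast, sizeof demo_broadcast) == 0)
        return DEMO_BCAST;
    /* Low-order bit of the first byte marks a multicast address. */
    if (dhost[0] & 1)
        return DEMO_MCAST;
    return DEMO_UNICAST;
}
```

Reversing the two tests would classify every broadcast as an ordinary multicast, which is why the order matters.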

558-573

If the interface is tapped by BPF, the frame is passed directly to BPF by calling bpf_tap. We’ll see that for SLIP and the loopback interfaces, a special BPF frame is constructed since those networks do not have a link-level header (unlike Ethernet).

When an interface is tapped by BPF, it can be configured to run in promiscuous mode and receive all Ethernet frames that appear on the network instead of the subset of frames normally received by the hardware. The packet is discarded by leread if it was sent to a unicast address that does not match the interface’s address.

574-585

m_devget (Section 2.6) copies the data from the buffer passed to leread to an mbuf chain it allocates. The first argument to m_devget points to the first byte after the Ethernet header, which is the first data byte in the frame. If m_devget runs out of memory, leread returns immediately. Otherwise the broadcast and multicast flags are set in the first mbuf in the chain, and ether_input processes the packet.

ether_input Function

ether_input, shown in Figure 4.13, examines the ether_header structure to determine the type of data that has been received and then queues the received packet for processing.

Figure 4.13. ether_input function.

------------------------------------------------------------------ if_ethersubr.c
196 void
197 ether_input(ifp, eh, m)
198 struct ifnet *ifp;
199 struct ether_header *eh;
200 struct mbuf *m;
201 {
202     struct ifqueue *inq;
203     struct llc *l;
204     struct arpcom *ac = (struct arpcom *) ifp;
205     int     s;

206     if ((ifp->if_flags & IFF_UP) == 0) {
207         m_freem(m);
208         return;
209     }
210     ifp->if_lastchange = time;
211     ifp->if_ibytes += m->m_pkthdr.len + sizeof(*eh);
212     if (bcmp((caddr_t) etherbroadcastaddr, (caddr_t) eh->ether_dhost,
213              sizeof(etherbroadcastaddr)) == 0)
214         m->m_flags |= M_BCAST;
215     else if (eh->ether_dhost[0] & 1)
216         m->m_flags |= M_MCAST;
217     if (m->m_flags & (M_BCAST | M_MCAST))
218         ifp->if_imcasts++;

219     switch (eh->ether_type) {
220     case ETHERTYPE_IP:
221         schednetisr(NETISR_IP);
222         inq = &ipintrq;
223         break;

224     case ETHERTYPE_ARP:
225         schednetisr(NETISR_ARP);
226         inq = &arpintrq;
227         break;

228     default:
229         if (eh->ether_type > ETHERMTU) {
230             m_freem(m);
231             return;
232         }
                                                                                
                                       /* OSI code */                           
                                                                                
307     }

308     s = splimp();
309     if (IF_QFULL(inq)) {
310         IF_DROP(inq);
311         m_freem(m);
312     } else
313         IF_ENQUEUE(inq, m);
314     splx(s);
315 }
------------------------------------------------------------------ if_ethersubr.c

Broadcast and multicast recognition

196-209

The arguments to ether_input are ifp, a pointer to the receiving interface’s ifnet structure; eh, a pointer to the Ethernet header of the received packet; and m, a pointer to the received packet (excluding the Ethernet header).

Any packets that arrive on an inoperative interface are silently discarded. The interface may not have been configured with a protocol address, or may have been disabled by an explicit request from the ifconfig(8) program (Section 6.6).

210-218

The variable time is a global timeval structure that the kernel maintains with the current time and date, as the number of seconds and microseconds past the Unix Epoch (00:00:00 January 1, 1970, Coordinated Universal Time [UTC]). A brief discussion of UTC can be found in [Itano and Ramsey 1993]. We’ll encounter the timeval structure throughout the Net/3 sources:

struct timeval {
  long  tv_sec;             /* seconds */
  long  tv_usec;            /* and microseconds */
};

ether_input updates if_lastchange with the current time and increments if_ibytes by the size of the incoming packet (the packet length plus the 14-byte Ethernet header).

Next, ether_input repeats the tests done by leread to determine if the packet is a broadcast or multicast packet.

Some kernels may not have been compiled with the BPF code, so the test must also be done in ether_input.

Link-level demultiplexing

219-227

ether_input jumps according to the Ethernet type field. For an IP packet, schednetisr schedules an IP software interrupt and the IP input queue, ipintrq, is selected. For an ARP packet, the ARP software interrupt is scheduled and arpintrq is selected.

An isr is an interrupt service routine.

In previous BSD releases, ARP packets were processed immediately while at the network interrupt level by calling arpinput directly. By queueing the packets, they can be processed at the software interrupt level.

If other Ethernet types are to be handled, a kernel programmer would add additional cases here. Alternatively, a process can receive other Ethernet types using BPF. For example, RARP servers are normally implemented using BPF under Net/3.

228-307

The default case processes unrecognized Ethernet types or packets that are encapsulated according to the 802.3 standard (such as the OSI connectionless transport). The Ethernet type field and the 802.3 length field occupy the same position in an Ethernet frame. The two encapsulations can be distinguished because the range of types in an Ethernet encapsulation is distinct from the range of lengths in the 802.3 encapsulation (Figure 4.14). We have omitted the OSI code. [Stallings 1993] contains a description of the OSI link-level protocols.

Figure 4.14. Ethernet type and 802.3 length fields.

Range       Description
0-1500      IEEE 802.3 length field
1501-65535  Ethernet type field:
              2048  IP packet
              2054  ARP packet

There are many additional Ethernet type values that are assigned to various protocols; we don’t show them in Figure 4.14. RFC 1700 [Reynolds and Postel 1994] contains a list of the more common types.
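The range test from Figure 4.14 reduces to a single comparison against the Ethernet MTU, which is what line 229 of ether_input does with ETHERMTU. The sketch below uses its own constant and function name for illustration.

```c
#define DEMO_ETHERMTU 1500  /* largest value that can be an 802.3 length */

/* Return 1 if the 16-bit field (host byte order) is an 802.3 length,
 * 0 if it must be interpreted as an Ethernet type. */
static int is_8023_length(unsigned short field)
{
    return field <= DEMO_ETHERMTU;
}
```

The IP type 2048 (0x0800) and the ARP type 2054 (0x0806) both fall above the boundary, so neither can be mistaken for a length.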

Queue the packet

308-315

Finally, ether_input places the packet on the selected queue or discards the packet if the queue is full. We’ll see in Figures 7.23 and 21.16 that the default limit for the IP and ARP input queues is 50 (ipqmaxlen) packets each.

When ether_input returns, the device driver tells the hardware that it is ready to receive the next packet, which may already be present in the device. The packet input queues are processed when the software interrupt scheduled by schednetisr occurs (Section 1.12). Specifically, ipintr is called to process the packets on the IP input queue, and arpintr is called to process the packets on the ARP input queue.
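The full-queue check followed by either a drop or an enqueue can be modeled in user space. The sketch below is a simplified analogue of what IF_QFULL, IF_DROP, and IF_ENQUEUE accomplish together; demo_queue, demo_pkt, and demo_enqueue are hypothetical names, and the interrupt blocking that the kernel does with splimp and splx is deliberately left out.

```c
#include <stddef.h>

struct demo_pkt {
    struct demo_pkt *next;
};

/* Simplified analogue of struct ifqueue. */
struct demo_queue {
    struct demo_pkt *head;
    struct demo_pkt *tail;
    int len;        /* packets currently queued */
    int maxlen;     /* limit on queued packets */
    int drops;      /* packets discarded because the queue was full */
};

/* Append p to q; return 1 if queued, 0 if dropped. */
static int demo_enqueue(struct demo_queue *q, struct demo_pkt *p)
{
    if (q->len >= q->maxlen) {  /* queue is full */
        q->drops++;             /* count the drop; caller frees p */
        return 0;
    }
    p->next = NULL;
    if (q->tail == NULL)
        q->head = p;            /* first packet in an empty queue */
    else
        q->tail->next = p;
    q->tail = p;
    q->len++;
    return 1;
}
```

As in ether_input, a packet that cannot be queued is counted and discarded rather than blocking the caller.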

ether_output Function

We now examine the output of Ethernet frames, which starts when a network-level protocol such as IP calls the if_output function, specified in the interface’s ifnet structure. The if_output function for all Ethernet devices is ether_output (Figure 4.2). ether_output takes the data portion of an Ethernet frame, encapsulates it with the 14-byte Ethernet header, and places it on the interface’s send queue. This is a large function so we describe it in four parts:

  • verification,

  • protocol-specific processing,

  • frame construction, and

  • interface queueing.

Figure 4.15 includes the first part of the function.

Figure 4.15. ether_output function: verification.

---------------------------------------------------------------- if_ethersubr.c
 49 int
 50 ether_output(ifp, m0, dst, rt0)
 51 struct ifnet *ifp;
 52 struct mbuf *m0;
 53 struct sockaddr *dst;
 54 struct rtentry *rt0;
 55 {
 56     short   type;
 57     int     s, error = 0;
 58     u_char  edst[6];
 59     struct mbuf *m = m0;
 60     struct rtentry *rt;
 61     struct mbuf *mcopy = (struct mbuf *) 0;
 62     struct ether_header *eh;
 63     int     off, len = m->m_pkthdr.len;
 64     struct arpcom *ac = (struct arpcom *) ifp;

 65     if ((ifp->if_flags & (IFF_UP | IFF_RUNNING)) != (IFF_UP | IFF_RUNNING))
 66         senderr(ENETDOWN);
 67     ifp->if_lastchange = time;
 68     if (rt = rt0) {
 69         if ((rt->rt_flags & RTF_UP) == 0) {
 70             if (rt0 = rt = rtalloc1(dst, 1))
 71                 rt->rt_refcnt--;
 72             else
 73                 senderr(EHOSTUNREACH);
 74         }
 75         if (rt->rt_flags & RTF_GATEWAY) {
 76             if (rt->rt_gwroute == 0)
 77                 goto lookup;
 78             if (((rt = rt->rt_gwroute)->rt_flags & RTF_UP) == 0) {
 79                 rtfree(rt);
 80                 rt = rt0;
 81     lookup:     rt->rt_gwroute = rtalloc1(rt->rt_gateway, 1);

 82                 if ((rt = rt->rt_gwroute) == 0)
 83                     senderr(EHOSTUNREACH);
 84             }
 85         }
 86         if (rt->rt_flags & RTF_REJECT)
 87             if (rt->rt_rmx.rmx_expire == 0 ||
 88                 time.tv_sec < rt->rt_rmx.rmx_expire)
 89                 senderr(rt == rt0 ? EHOSTDOWN : EHOSTUNREACH);
 90     }
---------------------------------------------------------------- if_ethersubr.c

49-64

The arguments to ether_output are ifp, which points to the outgoing interface’s ifnet structure; m0, the packet to send; dst, the destination address of the packet; and rt0, routing information.

65-67

The macro senderr is called throughout ether_output.

#define senderr(e) { error = (e); goto bad;}

senderr saves the error code and jumps to bad at the end of the function, where the packet is discarded and ether_output returns error.

If the interface is up and running, ether_output updates the last change time for the interface. Otherwise, it returns ENETDOWN.
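The combination of the flag test and the senderr macro can be exercised in isolation. In the sketch below the macro is copied from the text, while demo_output, the freed argument, and the DEMO_ flag values are assumptions made for the example; ENETDOWN is taken from <errno.h> on a POSIX system.

```c
#include <errno.h>

/* Illustrative flag bits; the real values come from net/if.h. */
#define DEMO_IFF_UP      0x01
#define DEMO_IFF_RUNNING 0x40

#define senderr(e) { error = (e); goto bad; }

/* Returns 0 on success or an errno value; *freed is set to 1 when
 * the error path, which would discard the packet, has run. */
static int demo_output(int if_flags, int *freed)
{
    int error = 0;
    const int need = DEMO_IFF_UP | DEMO_IFF_RUNNING;

    *freed = 0;
    /* Both flags must be set, not just one of them. */
    if ((if_flags & need) != need)
        senderr(ENETDOWN);

    /* ... the frame would be built and queued here ... */
    return error;

bad:
    *freed = 1;     /* stands in for freeing the mbuf chain */
    return error;
}
```

Funneling every failure through one label keeps the cleanup in a single place, which is the purpose the macro serves throughout ether_output.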

Host route

68-74

rt0 points to the routing entry located by ip_output and passed to ether_output. If ether_output is called from BPF, rt0 can be null, in which case control passes to the code in Figure 4.16. Otherwise, the route is verified. If the route is not valid, the routing tables are consulted and EHOSTUNREACH is returned if a route cannot be located. At this point, rt0 and rt point to a valid route for the next-hop destination.

Figure 4.16. ether_output function: network protocol processing.

-------------------------------------------------------------------- if_ethersubr.c
 91     switch (dst->sa_family) {

 92     case AF_INET:
 93         if (!arpresolve(ac, rt, m, dst, edst))
 94             return (0);         /* if not yet resolved */
 95         /* If broadcasting on a simplex interface, loopback a copy */
 96         if ((m->m_flags & M_BCAST) && (ifp->if_flags & IFF_SIMPLEX))
 97             mcopy = m_copy(m, 0, (int) M_COPYALL);
 98         off = m->m_pkthdr.len - m->m_len;
 99         type = ETHERTYPE_IP;
100         break;
101     case AF_ISO:
                                                                                 
                                       /* OSI code */                            
                                                                                 
142     case AF_UNSPEC:
143         eh = (struct ether_header *) dst->sa_data;
144         bcopy((caddr_t) eh->ether_dhost, (caddr_t) edst, sizeof(edst));
145         type = eh->ether_type;
146         break;

147     default:
148         printf("%s%d: can't handle af%d\n", ifp->if_name, ifp->if_unit,
149                dst->sa_family);
150         senderr(EAFNOSUPPORT);
151     }
-------------------------------------------------------------------- if_ethersubr.c

Gateway route

75-85

If the next hop for the packet is a gateway (versus a final destination), a route to the gateway is located and pointed to by rt. If a gateway route cannot be found, EHOSTUNREACH is returned. At this point, rt points to the route for the next-hop destination. The next hop may be a gateway or the final destination.

Avoid ARP flooding

86-90

The RTF_REJECT flag is enabled by the ARP code to discard packets to the destination when the destination is not responding to ARP requests. This is described with Figure 21.24.
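The condition on lines 87-88 of Figure 4.15 decides whether a route marked RTF_REJECT is still rejecting packets. Isolated as a predicate (route_rejects is a name invented here, and plain long values stand in for rmx_expire and time.tv_sec), it reads:

```c
/* Return 1 if a route flagged RTF_REJECT should still discard packets.
 * rmx_expire == 0 means the rejection has no expiration time;
 * otherwise the route rejects until the clock reaches rmx_expire. */
static int route_rejects(long rmx_expire, long now)
{
    return rmx_expire == 0 || now < rmx_expire;
}
```

Once the current time reaches the expiration time, the predicate becomes false and packets are passed on again.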

ether_output processing continues according to the destination address of the packet. Since Ethernet devices respond only to Ethernet addresses, to send a packet, ether_output must find the Ethernet address that corresponds to the IP address of the next-hop destination. The ARP protocol (Chapter 21) implements this translation. Figure 4.16 shows how the driver accesses the ARP protocol.

IP output

91-101

ether_output jumps according to sa_family in the destination address. We show only the AF_INET, AF_ISO, and AF_UNSPEC cases in Figure 4.16 and have omitted the code for AF_ISO.

The AF_INET case calls arpresolve to determine the Ethernet address corresponding to the destination IP address. If the Ethernet address is already in the ARP cache, arpresolve returns 1 and ether_output proceeds. Otherwise this IP packet is held by ARP, and when ARP determines the address, it calls ether_output from the function in_arpinput.

Assuming the ARP cache contains the hardware address, ether_output checks if the packet is going to be broadcast and if the interface is simplex (i.e., it can’t receive its own transmissions). If both tests are true, m_copy makes a copy of the packet. After the switch, the copy is queued as if it had arrived on the Ethernet interface. This is required by the definition of broadcasting; the sending host must receive a copy of the packet.

We’ll see in Chapter 12 that multicast packets may also be looped back to be received on the output interface.

Explicit Ethernet output

142-146

Some protocols, such as ARP, need to specify the Ethernet destination and type explicitly. The address family constant AF_UNSPEC indicates that dst points to an Ethernet header. bcopy duplicates the destination address in edst and assigns the Ethernet type to type. It isn’t necessary to call arpresolve (as for AF_INET) because the Ethernet destination address has been provided explicitly by the caller.

Unrecognized address families

147-151

Unrecognized address families generate a console message and ether_output returns EAFNOSUPPORT.

In the next section of ether_output, shown in Figure 4.17, the Ethernet frame is constructed.

Table 4.17. ether_output function: Ethernet frame construction.

---------------------------------------------------------------------- if_ethersubr.c
152     if (mcopy)
153         (void) looutput(ifp, mcopy, dst, rt);
154     /*
155      * Add local net header.  If no space in first mbuf,
156      * allocate another.
157      */
158     M_PREPEND(m, sizeof(struct ether_header), M_DONTWAIT);
159     if (m == 0)
160         senderr(ENOBUFS);
161     eh = mtod(m, struct ether_header *);
162     type = htons((u_short) type);
163     bcopy((caddr_t) &type, (caddr_t) &eh->ether_type,
164         sizeof(eh->ether_type));
165     bcopy((caddr_t)edst, (caddr_t)eh->ether_dhost, sizeof (edst));
166     bcopy((caddr_t)ac->ac_enaddr, (caddr_t)eh->ether_shost,
167         sizeof(eh->ether_shost));
---------------------------------------------------------------------- if_ethersubr.c

Ethernet header

152-167

If the code in the switch made a copy of the packet, the copy is processed as if it had been received on the output interface by calling looutput. The loopback interface and looutput are described in Section 5.4.

M_PREPEND ensures that there is room for 14 bytes at the front of the packet.

Most protocols arrange to leave room at the front of the mbuf chain so that M_PREPEND needs only to adjust some pointers (e.g., sosend for UDP output in Section 16.7 and igmp_sendreport in Section 13.6).

ether_output forms the Ethernet header from type, edst, and ac_enaddr (Figure 3.26). ac_enaddr is the unicast Ethernet address associated with the output interface and is the source Ethernet address for all frames transmitted on the interface. ether_output overwrites the source address the caller may have specified in the ether_header structure with ac_enaddr. This makes it more difficult to forge the source address of an Ethernet frame.
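
The header construction just described can be sketched outside the kernel. This is a minimal illustration, not the Net/3 code: struct eth_hdr and build_eth_hdr are hypothetical names, and memcpy stands in for bcopy. It shows the two points made above: the type field is stored in network byte order, and the source address always comes from the interface.

```c
#include <arpa/inet.h>   /* htons */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDR_LEN 6

/* Hypothetical user-space stand-in for the kernel's struct ether_header. */
struct eth_hdr {
    uint8_t  dhost[ETH_ADDR_LEN];  /* destination address */
    uint8_t  shost[ETH_ADDR_LEN];  /* source address */
    uint16_t type;                 /* frame type, network byte order */
};

/*
 * Fill in a 14-byte header.  The source is always the interface's own
 * address, so anything the caller left in shost is overwritten.
 */
void
build_eth_hdr(struct eth_hdr *eh, const uint8_t *dst,
              const uint8_t *if_addr, uint16_t type)
{
    uint16_t net_type = htons(type);

    assert(eh != NULL && dst != NULL && if_addr != NULL);
    memcpy(&eh->type, &net_type, sizeof(eh->type));
    memcpy(eh->dhost, dst, ETH_ADDR_LEN);
    memcpy(eh->shost, if_addr, ETH_ADDR_LEN);
}
```

Passing 0x0800 (the value of ETHERTYPE_IP) as type yields the bytes 08 00 in the header regardless of the host's byte order.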

At this point, the mbuf contains a complete Ethernet frame except for the 32-bit CRC, which is computed by the Ethernet hardware during transmission. The code shown in Figure 4.18 queues the frame for transmission by the device.

Table 4.18. ether_output function: output queueing.

--------------------------------------------------------------------- if_ethersubr.c
168     s = splimp();
169     /*
170      * Queue message on interface, and start output if interface
171      * not yet active.
172      */
173     if (IF_QFULL(&ifp->if_snd)) {
174         IF_DROP(&ifp->if_snd);
175         splx(s);
176         senderr(ENOBUFS);
177     }
178     IF_ENQUEUE(&ifp->if_snd, m);
179     if ((ifp->if_flags & IFF_OACTIVE) == 0)
180         (*ifp->if_start) (ifp);
181     splx(s);
182     ifp->if_obytes += len + sizeof(struct ether_header);
183     if (m->m_flags & M_MCAST)
184         ifp->if_omcasts++;
185     return (error);

186   bad:
187     if (m)
188         m_freem(m);
189     return (error);
190 }
--------------------------------------------------------------------- if_ethersubr.c

168-185

If the output queue is full, ether_output discards the frame and returns ENOBUFS. If the output queue is not full, the frame is placed on the interface’s send queue, and the interface’s if_start function transmits the next frame if the interface is not already active.

186-190

The senderr macro jumps to bad where the frame is discarded and an error code is returned.
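
The queueing policy can be modeled with a small bounded FIFO. The sketch below is an assumption-laden illustration, not the kernel's IF_QFULL, IF_DROP, and IF_ENQUEUE macros: struct pkt, struct pktq, pktq_enqueue, and pktq_dequeue are hypothetical names. As in ether_output, a full queue counts a drop and reports ENOBUFS, and the caller remains responsible for freeing the rejected packet.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical packet and queue types modeled on mbuf and struct ifqueue. */
struct pkt {
    struct pkt *next;
    int         id;
};

struct pktq {
    struct pkt *head, *tail;
    int len;      /* packets currently queued */
    int maxlen;   /* queue limit */
    int drops;    /* packets rejected because the queue was full */
};

/* Append p to q; returns 0, or ENOBUFS (counting a drop) when q is full. */
int
pktq_enqueue(struct pktq *q, struct pkt *p)
{
    assert(q != NULL && p != NULL);
    if (q->len >= q->maxlen) {
        q->drops++;
        return ENOBUFS;
    }
    p->next = NULL;
    if (q->tail == NULL)
        q->head = p;
    else
        q->tail->next = p;
    q->tail = p;
    q->len++;
    return 0;
}

/* Remove and return the packet at the head of q, or NULL if q is empty. */
struct pkt *
pktq_dequeue(struct pktq *q)
{
    struct pkt *p;

    assert(q != NULL);
    p = q->head;
    if (p != NULL) {
        q->head = p->next;
        if (q->head == NULL)
            q->tail = NULL;
        p->next = NULL;
        q->len--;
    }
    return p;
}
```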

lestart Function

The lestart function dequeues frames from the interface output queue and arranges for them to be transmitted by the LANCE Ethernet card. If the device is idle, the function is called to begin transmitting frames. An example appears at the end of ether_output (Figure 4.18), where lestart is called indirectly through the interface’s if_start function.

If the device is busy, it generates an interrupt when it completes transmission of the current frame. The driver calls lestart to dequeue and transmit the next frame. Once started, the protocol layer can queue frames without calling lestart since the driver dequeues and transmits frames until the queue is empty.

Figure 4.19 shows the lestart function. lestart assumes splimp has been called to block any device interrupts.

Table 4.19. lestart function.

---------------------------------------------------------------------------- if_le.c
325 lestart(ifp)
326 struct ifnet *ifp;
327 {
328     struct le_softc *le = &le_softc[ifp->if_unit];
329     struct letmd *tmd;
330     struct mbuf *m;
331     int     len;

332     if ((le->sc_if.if_flags & IFF_RUNNING) == 0)
333         return (0);
                                                                                   
                                /* device-specific code */                         
                                                                                   
335     do {
                                                                                   
                                  /* device-specific code */                       
                                                                                   
340         IF_DEQUEUE(&le->sc_if.if_snd, m);
341         if (m == 0)
342             return (0);
343         len = leput(le->sc_r2->ler2_tbuf[le->sc_tmd], m);
344         /*
345          * If bpf is listening on this interface, let it
346          * see the packet before we commit it to the wire.
347          */
348         if (ifp->if_bpf)
349             bpf_tap(ifp->if_bpf, le->sc_r2->ler2_tbuf[le->sc_tmd],
350                     len);
                                                                                   
                                  /* device-specific code */                       
                                                                                   
359     } while (++le->sc_txcnt < LETBUF);
360     le->sc_if.if_flags |= IFF_OACTIVE;
361     return (0);
362 }
---------------------------------------------------------------------------- if_le.c

Interface must be initialized

325-333

If the interface is not initialized, lestart returns immediately.

Dequeue frame from output queue

335-342

If the interface is initialized, the next frame is removed from the queue. If the interface output queue is empty, lestart returns.

Transmit frame and pass to BPF

343-350

leput copies the frame in m to the hardware buffer pointed to by the first argument to leput. If the interface is tapped by BPF, the frame is passed to bpf_tap. We have omitted the device-specific code that initiates the transmission of the frame from the hardware buffer.

Repeat if device is ready for more frames

359

lestart stops passing frames to the device when le->sc_txcnt equals LETBUF. Some Ethernet interfaces can queue more than one outgoing Ethernet frame. For the LANCE driver, LETBUF is the number of hardware transmit buffers available to the driver, and le->sc_txcnt keeps track of how many of the buffers are in use.
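
The shape of this loop can be seen in a stripped-down model. The sketch below is hypothetical throughout (struct txstate, tx_start, and the value of NTBUF are ours, and frames are reduced to a counter); it keeps only the control flow of lestart: return early when the queue empties, otherwise stop once every transmit buffer is in use and mark the device busy.

```c
#include <assert.h>
#include <stddef.h>

#define NTBUF 2   /* hypothetical count of hardware transmit buffers */

/* Hypothetical driver state: frames waiting in software, buffers in use. */
struct txstate {
    int queued;   /* frames waiting on the send queue */
    int txcnt;    /* hardware transmit buffers currently in use */
    int oactive;  /* nonzero once every buffer is occupied */
};

/*
 * Move frames from the send queue into free transmit buffers.
 * Returns the number of frames handed to the device by this call.
 */
int
tx_start(struct txstate *st)
{
    int moved = 0;

    assert(st != NULL && st->txcnt < NTBUF);
    do {
        if (st->queued == 0)
            return moved;      /* queue empty: device not marked busy */
        st->queued--;
        moved++;
    } while (++st->txcnt < NTBUF);
    st->oactive = 1;
    return moved;
}
```

With five frames queued and both buffers free, one call moves two frames and sets oactive; with a single frame queued, the call moves it and returns without marking the device busy.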

Mark device as busy

360-362

Finally, lestart turns on IFF_OACTIVE in the ifnet structure to indicate the device is busy transmitting frames.

There is an unfortunate side effect to queueing multiple frames in the device for transmission. According to [Jacobson 1988a], the LANCE chip is able to transmit queued frames with very little delay between frames. Unfortunately, some [broken] Ethernet devices drop the frames because they can’t process the incoming data fast enough.

This interacts badly with an application such as NFS that sends large UDP datagrams (often greater than 8192 bytes) that are fragmented by IP and queued in the LANCE device as multiple Ethernet frames. Fragments are lost on the receiving side, resulting in many incomplete datagrams and high delays as NFS retransmits the entire UDP datagram.

Jacobson noted that Sun’s LANCE driver only queued one frame at a time, perhaps to avoid this problem.

ioctl System Call

The ioctl system call supports a generic command interface used by a process to access features of a device that aren’t supported by the standard system calls. The prototype for ioctl is:

int ioctl (int fd, unsigned long com, ...);

fd is a descriptor, usually a device or network connection. Each type of descriptor supports its own set of ioctl commands specified by the second argument, com. A third argument is shown as “...” in the prototype, since it is a pointer of some type that depends on the ioctl command being invoked. If the command is retrieving information, the third argument must point to a buffer large enough to hold the data. In this text, we discuss only the ioctl commands applicable to socket descriptors.

The prototype we show for system calls is the one used by a process to issue the system call. We’ll see in Chapter 15 that the function within the kernel that implements a system call has a different prototype.

We describe the implementation of the ioctl system call in Chapter 17 but we discuss the implementation of individual ioctl commands throughout the text.

The first ioctl commands we discuss provide access to the network interface structures that we have described. Throughout the text we summarize ioctl commands as shown in Figure 4.20.

Table 4.20. Interface ioctl commands.

Command          Third argument     Function   Description

SIOCGIFCONF      struct ifconf *    ifconf     retrieve list of interface configuration
SIOCGIFFLAGS     struct ifreq *     ifioctl    get interface flags
SIOCGIFMETRIC    struct ifreq *     ifioctl    get interface metric
SIOCSIFFLAGS     struct ifreq *     ifioctl    set interface flags
SIOCSIFMETRIC    struct ifreq *     ifioctl    set interface metric

The first column shows the symbolic constant that identifies the ioctl command (the second argument, com). The second column shows the type of the third argument passed to the ioctl system call for the command shown in the first column. The third column names the function that implements the command. The fourth column briefly describes what the command does.

Figure 4.21 shows the organization of the various functions that process ioctl commands. The shaded functions are the ones we describe in this chapter. The remaining functions are described in other chapters.

Figure 4.21. ioctl functions described in this chapter.

ifioctl Function

The ioctl system call routes the five commands shown in Figure 4.20 to the ifioctl function shown in Figure 4.22.

Table 4.22. ifioctl function: overview and SIOCGIFCONF.

------------------------------------------------------------------------------------ if.c
394 int
395 ifioctl(so, cmd, data, p)
396 struct socket *so;
397 int     cmd;
398 caddr_t data;
399 struct proc *p;
400 {
401     struct ifnet *ifp;
402     struct ifreq *ifr;
403     int     error;

404     if (cmd == SIOCGIFCONF)
405         return (ifconf(cmd, data));

406     ifr = (struct ifreq *) data;
407     ifp = ifunit(ifr->ifr_name);
408     if (ifp == 0)
409         return (ENXIO);
410     switch (cmd) {
                                                                               
           /* other interface ioctl commands (Figures 4.29 and 12.11) */       
                                                                               
447     default:
448         if (so->so_proto == 0)
449             return (EOPNOTSUPP);
450         return ((*so->so_proto->pr_usrreq) (so, PRU_CONTROL,
451                                             cmd, data, ifp));
452     }
453     return (0);
454 }
------------------------------------------------------------------------------------ if.c

394-405

For the SIOCGIFCONF command, ifioctl calls ifconf to construct a table of variable-length ifreq structures.

406-410

For the remaining ioctl commands, the data argument is a pointer to an ifreq structure. ifunit searches the ifnet list for an interface with the text name provided by the process in ifr->ifr_name (e.g., "sl0", "le1", or "lo0"). If there is no matching interface, ifioctl returns ENXIO. The remaining code depends on cmd and is described with Figure 4.29.

447-454

If the interface ioctl command is not recognized, ifioctl forwards the command to the user-request function of the protocol associated with the socket on which the request was made. For IP, these commands are issued on a UDP socket and udp_usrreq is called. The commands that fall into this category are described in Figure 6.10. Section 23.10 describes the udp_usrreq function in detail.

If control falls out of the switch, 0 is returned.

ifconf Function

ifconf provides a standard way for a process to discover the interfaces present and the addresses configured on a system. Interface information is represented by ifreq and ifconf structures shown in Figures 4.23 and 4.24.

Table 4.23. ifreq structure.

---------------------------------------------------------------------------- if.h
262 struct  ifreq {
263 #define IFNAMSIZ    16
264     char    ifr_name[IFNAMSIZ];                 /* if name, e.g. "en0" */
265     union {
266         struct  sockaddr ifru_addr;
267         struct  sockaddr ifru_dstaddr;
268         struct  sockaddr ifru_broadaddr;
269         short   ifru_flags;
270         int ifru_metric;
271         caddr_t ifru_data;
272     } ifr_ifru;
273 #define ifr_addr    ifr_ifru.ifru_addr          /* address */
274 #define ifr_dstaddr ifr_ifru.ifru_dstaddr       /* other end of p-to-p link */
275 #define ifr_broadaddr   ifr_ifru.ifru_broadaddr /* broadcast address */
276 #define ifr_flags   ifr_ifru.ifru_flags         /* flags */
277 #define ifr_metric  ifr_ifru.ifru_metric        /* metric */
278 #define ifr_data    ifr_ifru.ifru_data          /* for use by interface */
279 };
---------------------------------------------------------------------------- if.h

Table 4.24. ifconf structure.

----------------------------------------------------------------------------- if.h
292 struct  ifconf {
293     int ifc_len;                    /* size of associated buffer */
294     union {
295         caddr_t ifcu_buf;
296         struct  ifreq *ifcu_req;
297     } ifc_ifcu;
298 #define ifc_buf ifc_ifcu.ifcu_buf   /* buffer address */
299 #define ifc_req ifc_ifcu.ifcu_req   /* array of structures returned */
300 };

----------------------------------------------------------------------------- if.h

262-279

An ifreq structure contains the name of an interface in ifr_name. The remaining members in the union are accessed by the various ioctl commands. As usual, macros simplify the syntax required to access the members of the union.

292-300

In the ifconf structure, ifc_len is the size in bytes of the buffer pointed to by ifc_buf. The buffer is allocated by a process but filled in by ifconf with an array of variable-length ifreq structures. For the ifconf function, ifr_addr is the relevant member of the union in the ifreq structure. Each ifreq structure has a variable length because the length of ifr_addr (a sockaddr structure) varies according to the type of address. The sa_len member from the sockaddr structure must be used to locate the end of each entry. Figure 4.25 illustrates the data structures manipulated by ifconf.

Figure 4.25. ifconf data structures.

In Figure 4.25, the data on the left is in the kernel and the data on the right is in a process. We’ll refer to this figure as we discuss the ifconf function listed in Figure 4.26.

Table 4.26. ifconf function.

------------------------------------------------------------------------- if.c
462 int
463 ifconf(cmd, data)
464 int     cmd;
465 caddr_t data;
466 {
467     struct ifconf *ifc = (struct ifconf *) data;
468     struct ifnet *ifp = ifnet;
469     struct ifaddr *ifa;
470     char   *cp, *ep;
471     struct ifreq ifr, *ifrp;
472     int     space = ifc->ifc_len, error = 0;
473     ifrp = ifc->ifc_req;
474     ep = ifr.ifr_name + sizeof(ifr.ifr_name) - 2;

475     for (; space > sizeof(ifr) && ifp; ifp = ifp->if_next) {
476         strncpy(ifr.ifr_name, ifp->if_name, sizeof(ifr.ifr_name) - 2);
477         for (cp = ifr.ifr_name; cp < ep && *cp; cp++)
478             continue;
479         *cp++ = '0' + ifp->if_unit;
480         *cp = '\0';
481         if ((ifa = ifp->if_addrlist) == 0) {
482             bzero((caddr_t) & ifr.ifr_addr, sizeof(ifr.ifr_addr));
483             error = copyout((caddr_t) & ifr, (caddr_t) ifrp,
484                             sizeof(ifr));
485             if (error)
486                 break;
487             space -= sizeof(ifr), ifrp++;
488         } else
489             for (; space > sizeof(ifr) && ifa; ifa = ifa->ifa_next) {
490                 struct sockaddr *sa = ifa->ifa_addr;
491                 if (sa->sa_len <= sizeof(*sa)) {
492                     ifr.ifr_addr = *sa;
493                     error = copyout((caddr_t) & ifr, (caddr_t) ifrp,
494                                     sizeof(ifr));
495                     ifrp++;
496                 } else {
497                     space -= sa->sa_len - sizeof(*sa);
498                     if (space < sizeof(ifr))
499                         break;
500                     error = copyout((caddr_t) & ifr, (caddr_t) ifrp,
501                                     sizeof(ifr.ifr_name));
502                     if (error == 0)
503                         error = copyout((caddr_t) sa,
504                                   (caddr_t) & ifrp->ifr_addr, sa->sa_len);
505                     ifrp = (struct ifreq *)
506                         (sa->sa_len + (caddr_t) & ifrp->ifr_addr);
507                 }
508                 if (error)
509                     break;
510                 space -= sizeof(ifr);
511             }
512     }
513     ifc->ifc_len -= space;
514     return (error);
515 }
------------------------------------------------------------------------- if.c

462-474

The two arguments to ifconf are: cmd, which is ignored; and data, which points to a copy of the ifconf structure specified by the process.

ifc is data cast to an ifconf structure pointer. ifp traverses the interface list starting at ifnet (the head of the list), and ifa traverses the address list for each interface. cp and ep control the construction of the text interface name within ifr, which is the ifreq structure that holds an interface name and address before they are copied to the process’s buffer. ifrp points to this buffer and is advanced after each address is copied. space is the number of bytes remaining in the process’s buffer, cp is used to search for the end of the name, and ep marks the last possible location for the numeric portion of the interface name.

475-488

The for loop traverses the list of interfaces. For each interface, the text name is copied to ifr_name followed by the text representation of the if_unit number. If no addresses have been assigned to the interface, an address of all 0s is constructed, the resulting ifreq structure is copied to the process, space is decreased, and ifrp is advanced.
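
The name construction on lines 476-480 can be tried in isolation. The sketch below reproduces those steps in a user-space function; make_ifname and NAMSIZ are hypothetical names introduced for illustration. Note that appending '0' + unit produces a single character, so this technique yields a sensible name only for unit numbers 0 through 9.

```c
#include <assert.h>
#include <string.h>

#define NAMSIZ 16   /* mirrors IFNAMSIZ */

/*
 * Build "<name><unit>" in out, following the steps of lines 476-480.
 * Like the kernel code, this handles only single-digit unit numbers.
 */
void
make_ifname(char out[NAMSIZ], const char *name, int unit)
{
    char *cp;
    char *ep = out + NAMSIZ - 2;   /* last slot available for the digit */

    assert(name != NULL && unit >= 0 && unit <= 9);
    strncpy(out, name, NAMSIZ - 2);
    for (cp = out; cp < ep && *cp; cp++)
        continue;
    *cp++ = (char)('0' + unit);
    *cp = '\0';
}
```

A driver name longer than 14 characters is truncated so that the digit and the terminating null byte always fit within the 16-byte field.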

489-515

If the interface has one or more addresses, the for loop processes each one. The address is added to the interface name in ifr and then ifr is copied to the process. Addresses longer than a standard sockaddr structure don’t fit in ifr and are copied directly out to the process. After each address, space and ifrp are adjusted. After all the interfaces are processed, the length of the buffer is updated (ifc->ifc_len) and ifconf returns. The ioctl system call takes care of copying the new contents of the ifconf structure back to the ifconf structure in the process.

Example

Figure 4.27 shows the configuration of the interface structures after the Ethernet, SLIP, and loopback interfaces have been initialized.

Figure 4.27. Interface and address data structures.

Figure 4.28 shows the contents of ifc and buffer after the following code is executed.

Figure 4.28. Data returned by the SIOCGIFCONF command.

    struct ifconf ifc;     /* SIOCGIFCONF adjusts this */
    char buffer[144];      /* contains interface addresses when ioctl returns */
    int s;                 /* any socket */

    ifc.ifc_len = 144;
    ifc.ifc_buf = buffer;
    if (ioctl(s, SIOCGIFCONF, &ifc) < 0) {
        perror("ioctl failed");
        exit(1);
    }

There are no restrictions on the type of socket specified with the SIOCGIFCONF command, which, as we have seen, returns the addresses for all protocol families.

In Figure 4.28, ifc_len has been changed from 144 to 108 by ioctl since the three addresses returned in the buffer only occupy 108 (3×36) bytes. Three sockaddr_dl addresses are returned and the last 36 bytes of the buffer are unused. The first 16 bytes of each entry contain the text name of the interface. In this case only 3 of the 16 bytes are used.

ifr_addr has the form of a sockaddr structure, so the first value is the length (20 bytes) and the second value is the type of address (18, AF_LINK). The next value is sdl_index, which is different for each interface as is sdl_type (6, 28, and 24 correspond to IFT_ETHER, IFT_SLIP, and IFT_LOOP).

The next three values are sdl_nlen (the length of the text name), sdl_alen (the length of the hardware address), and sdl_slen (unused). sdl_nlen is 3 for all three entries. sdl_alen is 6 for the Ethernet address and 0 for both the SLIP and loopback interfaces. sdl_slen is always 0.

Finally, the text interface name appears, followed by the hardware address (Ethernet only). Neither the SLIP nor the loopback interface stores a hardware-level address in the sockaddr_dl structure.

In the example, only sockaddr_dl addresses are returned (because no other address types were configured in Figure 4.27), so each entry in the buffer is the same size. If other addresses (e.g., IP or OSI addresses) were configured for an interface, they would be returned along with the sockaddr_dl addresses, and the size of each entry would vary according to the type of address returned.
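
Because the entries vary in size, a process must step through the returned buffer using the length byte at the start of each address rather than indexing it as an array. The sketch below shows one way to do that on a raw byte buffer. It is a model under stated assumptions: entry_size, count_entries, NAMSIZ, and SA_SIZE are hypothetical names, and the layout assumed is the one described above (a 16-byte name followed by an address whose first byte is its length).

```c
#include <assert.h>
#include <stddef.h>

#define NAMSIZ   16   /* mirrors IFNAMSIZ */
#define SA_SIZE  16   /* size of a standard sockaddr */

/*
 * Size of the entry starting at e.  The first byte after the name is
 * sa_len; an address shorter than a sockaddr still fills a whole ifreq.
 */
size_t
entry_size(const unsigned char *e)
{
    size_t alen = e[NAMSIZ];

    return NAMSIZ + (alen > SA_SIZE ? alen : SA_SIZE);
}

/* Count the complete entries in the first len bytes of buf. */
int
count_entries(const unsigned char *buf, size_t len)
{
    size_t off = 0;
    int n = 0;

    assert(buf != NULL);
    while (len - off >= NAMSIZ + SA_SIZE) {
        size_t sz = entry_size(buf + off);

        if (sz > len - off)
            break;              /* truncated final entry */
        off += sz;
        n++;
    }
    return n;
}
```

For the buffer in Figure 4.28, each address is 20 bytes long, so every entry occupies 16 + 20 = 36 bytes and the 108 bytes returned hold three entries.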

Generic Interface ioctl commands

The four remaining interface commands from Figure 4.20 (SIOCGIFFLAGS, SIOCGIFMETRIC, SIOCSIFFLAGS, and SIOCSIFMETRIC) are handled by the ifioctl function. Figure 4.29 shows the case statements for these commands.

Table 4.29. ifioctl function: flags and metrics.

--------------------------------------------------------------------------- if.c
410     switch (cmd) {
411     case SIOCGIFFLAGS:
412         ifr->ifr_flags = ifp->if_flags;
413         break;

414     case SIOCGIFMETRIC:
415         ifr->ifr_metric = ifp->if_metric;
416         break;

417     case SIOCSIFFLAGS:
418         if (error = suser(p->p_ucred, &p->p_acflag))
419             return (error);
420         if (ifp->if_flags & IFF_UP && (ifr->ifr_flags & IFF_UP) == 0) {
421             int     s = splimp();
422             if_down(ifp);
423             splx(s);
424         }
425         if (ifr->ifr_flags & IFF_UP && (ifp->if_flags & IFF_UP) == 0) {
426             int     s = splimp();
427             if_up(ifp);
428             splx(s);
429         }
430         ifp->if_flags = (ifp->if_flags & IFF_CANTCHANGE) |
431             (ifr->ifr_flags & ~IFF_CANTCHANGE);
432         if (ifp->if_ioctl)
433             (void) (*ifp->if_ioctl) (ifp, cmd, data);
434         break;

435     case SIOCSIFMETRIC:
436         if (error = suser(p->p_ucred, &p->p_acflag))
437             return (error);
438         ifp->if_metric = ifr->ifr_metric;
439         break;
--------------------------------------------------------------------------- if.c

SIOCGIFFLAGS and SIOCGIFMETRIC

410-416

For the two SIOCGxxx commands, ifioctl copies the if_flags or if_metric value for the interface into the ifreq structure. For the flags, the ifr_flags member of the union is used and for the metric, the ifr_metric member is used (Figure 4.23).
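
From the process side, retrieving the flags might look like the following sketch. get_if_flags is a hypothetical helper, not part of any library; it fills in ifr_name, issues SIOCGIFFLAGS, and reads the result from ifr_flags. The header names are those of current BSD-derived and Linux systems and are an assumption with respect to Net/3.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* expose struct ifreq on glibc; harmless elsewhere */
#endif
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

/*
 * Fetch the flags of the interface called name through socket fd.
 * Returns 0 and stores the flags in *flags, or -1 with errno set by ioctl.
 */
int
get_if_flags(int fd, const char *name, short *flags)
{
    struct ifreq ifr;

    assert(name != NULL && flags != NULL);
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, name, sizeof(ifr.ifr_name) - 1);
    if (ioctl(fd, SIOCGIFFLAGS, &ifr) < 0)
        return -1;
    *flags = ifr.ifr_flags;
    return 0;
}
```

A process would typically pass a datagram socket, such as one returned by socket(AF_INET, SOCK_DGRAM, 0), and test the result against IFF_UP. Interface names vary between systems, so a name such as "lo0" should be treated as an example only.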

SIOCSIFFLAGS

417-429

To change the interface flags, the calling process must have superuser privileges. If the process is shutting down a running interface, if_down is called; if it is bringing up an interface that isn’t running, if_up is called.

Ignore IFF_CANTCHANGE flags

430-434

Recall from Figure 3.7 that some interface flags cannot be changed by a process. The expression (ifp->if_flags & IFF_CANTCHANGE) clears the interface flags that can be changed by the process, and the expression (ifr->ifr_flags & ~IFF_CANTCHANGE) clears the flags in the request that may not be changed by the process. The two expressions are ORed together and saved as the new value for ifp->if_flags. Before returning, the request is passed to the if_ioctl function associated with the device (e.g., leioctl for the LANCE driver, Figure 4.31).
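
The masking can be checked with a few concrete values. In the sketch below, merge_flags is a hypothetical function, and the F_ constants, including the single-bit F_CANTCHANGE mask, are simplified values chosen for illustration; the real definitions are in if.h.

```c
#include <assert.h>

/* Hypothetical flag values for illustration; the real ones are in if.h. */
#define F_UP          0x01
#define F_RUNNING     0x40
#define F_PROMISC     0x100
#define F_CANTCHANGE  (F_RUNNING)   /* bits a process may not modify */

/* Keep protected bits from cur, take every other bit from req. */
int
merge_flags(int cur, int req)
{
    return (cur & F_CANTCHANGE) | (req & ~F_CANTCHANGE);
}
```

A request that tries to set a protected bit has no effect on it, and a request that omits a protected bit does not clear it. Every unprotected bit, set or clear, is taken from the request.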

SIOCSIFMETRIC

435-439

Changing the interface metric is easier; as long as the process has superuser privileges, ifioctl copies the new metric into if_metric for the interface.

if_down and if_up Functions

With the ifconfig program, an administrator can enable and disable an interface by setting or clearing the IFF_UP flag through the SIOCSIFFLAGS command. Figure 4.30 shows the code for the if_down and if_up functions.

Table 4.30. if_down and if_up functions.

--------------------------------------------------------------- if.c
292 void
293 if_down(ifp)
294 struct ifnet *ifp;
295 {
296     struct ifaddr *ifa;

297     ifp->if_flags &= ~IFF_UP;
298     for (ifa = ifp->if_addrlist; ifa; ifa = ifa->ifa_next)
299         pfctlinput(PRC_IFDOWN, ifa->ifa_addr);
300     if_qflush(&ifp->if_snd);
301     rt_ifmsg(ifp);
302 }

308 void
309 if_up(ifp)
310 struct ifnet *ifp;
311 {
312     struct ifaddr *ifa;

313     ifp->if_flags |= IFF_UP;
314     rt_ifmsg(ifp);
315 }
--------------------------------------------------------------- if.c

292-302

When an interface is shut down, the IFF_UP flag is cleared and the PRC_IFDOWN command is issued by pfctlinput (Section 7.7) for each address associated with the interface. This gives each protocol an opportunity to respond to the interface being shut down. Some protocols, such as OSI, terminate connections using the interface. IP attempts to reroute connections through other interfaces if possible. TCP and UDP ignore failing interfaces and rely on the routing protocols to find alternate paths for the packets.

if_qflush discards any packets queued for the interface. The routing system is notified of the change by rt_ifmsg. TCP retransmits the lost packets automatically; UDP applications must explicitly detect and respond to this condition on their own.

308-315

When an interface is enabled, the IFF_UP flag is set and rt_ifmsg notifies the routing system that the interface status has changed.

Ethernet, SLIP, and Loopback

We saw in Figure 4.29 that for the SIOCSIFFLAGS command, ifioctl calls the if_ioctl function for the interface. In our three sample interfaces, the slioctl and loioctl functions return EINVAL for this command, which is ignored by ifioctl. Figure 4.31 shows the leioctl function and SIOCSIFFLAGS processing of the LANCE Ethernet driver.

Table 4.31. leioctl function: SIOCSIFFLAGS.

--------------------------------------------------------------------------- if_le.c
614 leioctl(ifp, cmd, data)
615 struct ifnet *ifp;
616 int     cmd;
617 caddr_t data;
618 {
619     struct ifaddr *ifa = (struct ifaddr *) data;
620     struct le_softc *le = &le_softc[ifp->if_unit];
621     struct lereg1 *ler1 = le->sc_r1;
622     int     s = splimp(), error = 0;

623     switch (cmd) {
                                                                                  
                       /* SIOCSIFADDR code (Figure 6.28) */                       
                                                                                  
638     case SIOCSIFFLAGS:
639         if ((ifp->if_flags & IFF_UP) == 0 &&
640             ifp->if_flags & IFF_RUNNING) {
641             LERDWR(le->sc_r0, LE_STOP, ler1->ler1_rdp);
642             ifp->if_flags &= ~IFF_RUNNING;
643         } else if (ifp->if_flags & IFF_UP &&
644                    (ifp->if_flags & IFF_RUNNING) == 0)
645             leinit(ifp->if_unit);
646         /*
647          * If the state of the promiscuous bit changes, the interface
648          * must be reset to effect the change.
649          */
650         if (((ifp->if_flags ^ le->sc_iflags) & IFF_PROMISC) &&
651             (ifp->if_flags & IFF_RUNNING)) {
652             le->sc_iflags = ifp->if_flags;
653             lereset(ifp->if_unit);
654             lestart(ifp);
655         }
656         break;
                                                                                  
              /* SIOCADDMULTI and SIOCDELMULTI code (Figure 12.31) */             
                                                                                  
672     default:
673         error = EINVAL;
674     }
675     splx(s);
676     return (error);
677 }
--------------------------------------------------------------------------- if_le.c

614-623

leioctl casts the third argument, data, to an ifaddr structure pointer and saves the value in ifa. The le pointer references the le_softc structure indexed by ifp->if_unit. The switch statement, based on cmd, makes up the main body of the function.

638-656

Only the SIOCSIFFLAGS case is shown in Figure 4.31. By the time ifioctl calls leioctl, the interface flags have been changed. The code shown here forces the physical interface into a state that matches the configuration of the flags. If the interface is going down (IFF_UP is not set), but the interface is operating, the interface is shut down. If the interface is going up but is not operating, the interface is initialized and restarted.

If the promiscuous bit has been changed, the interface is shut down, reset, and restarted to implement the change.

The expression including the exclusive OR and IFF_PROMISC is true only if the request changes the IFF_PROMISC bit.
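
The exclusive OR test generalizes to any flag bit. A minimal sketch, assuming illustrative flag values and the hypothetical helpers flag_changed and needs_reset:

```c
#include <assert.h>

#define F_RUNNING  0x40    /* hypothetical values for illustration */
#define F_PROMISC  0x100

/* Nonzero if any bit selected by mask differs between now and saved. */
int
flag_changed(int now, int saved, int mask)
{
    return ((now ^ saved) & mask) != 0;
}

/* Reset only when the promiscuous bit flipped on a running interface. */
int
needs_reset(int now, int saved)
{
    return flag_changed(now, saved, F_PROMISC) && (now & F_RUNNING) != 0;
}
```

The XOR leaves a 1 only in bit positions where the two words differ, so masking with the bit of interest detects a change in either direction, on to off or off to on.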

672-677

The default case for unrecognized commands posts EINVAL, which is returned at the end of the function.

Summary

In this chapter we described the implementation of the LANCE Ethernet device driver, which we refer to throughout the text. We saw how the Ethernet driver detects broadcast and multicast addresses on input, how the Ethernet and 802.3 encapsulations are detected, and how incoming frames are demultiplexed to the appropriate protocol queue. In Chapter 21 we’ll see how IP addresses (unicast, broadcast, and multicast) are converted into the correct Ethernet addresses on output.

Finally, we discussed the generic ioctl commands that access the interface-layer data structures.

Exercises

4.1

In leread, the M_MCAST flag (in addition to M_BCAST) is always set when a broadcast packet is received. Compare this behavior to the code in ether_input. Why are the flags set in leread and ether_input? Does it matter? Which is correct?

4.1

leread must examine the packet to decide if it needs to be discarded after it is passed to BPF. Since a BPF tap can enable promiscuous mode on the interface, packets may be addressed to some other system on the Ethernet and must be discarded after BPF has processed them.

When the interface is not tapped, the tests must be done in ether_input.

4.2

In ether_input (Figure 4.13), what would happen if the test for the broadcast address and the test for a multicast address were swapped? What would happen if the if on the test for a multicast address were not preceded by an else?

4.2

If the tests were reversed, the broadcast flag would never be set.

If the second if wasn’t preceded by an else, every broadcast packet would also have the multicast flag set.
