Chapter 32. Raw IP

Introduction

A process accesses the raw IP layer by creating a socket of type SOCK_RAW in the Internet domain. There are three uses for raw sockets:

  1. Raw sockets allow a process to send and receive ICMP and IGMP messages.

    The Ping program uses this type of socket to send ICMP echo requests and to receive ICMP echo replies.

    Some routing daemons use this feature to track ICMP redirects that are processed by the kernel. We saw in Section 19.7 that Net/3 generates an RTM_REDIRECT message on a routing socket when a redirect is processed, obviating the need for this use of raw sockets.

    This feature is also used to implement protocols based on ICMP, such as router advertisement and router solicitation (Section 9.6 of Volume 1), which use ICMP but are better implemented as user processes than within the kernel.

    The multicast routing daemon uses a raw IGMP socket to send and receive IGMP messages.

  2. Raw sockets let a process build its own IP headers. The Traceroute program uses this feature to build its own UDP datagrams, including the IP and UDP headers.

  3. Raw sockets let a process read and write IP datagrams with an IP protocol type that the kernel doesn’t support.

    The gated program uses this to support three routing protocols that are built directly on IP: EGP, HELLO, and OSPF.

    This type of raw socket can also be used to experiment with new transport layers on top of IP, instead of adding support to the kernel. It is usually much easier to debug code within a user process than it is within the kernel.
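As a user-level illustration of the first use, the following sketch builds an ICMP echo request in a buffer, as a Ping-like program might before writing it to a raw ICMP socket. The helper names inet_cksum and build_echo_request are hypothetical and do not come from the Net/3 sources. Creating the socket and sending the message are omitted because they require superuser privileges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Internet checksum (RFC 1071): one's complement of the one's complement
 * sum of the data, taken as 16-bit big-endian words. */
uint16_t inet_cksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)data[i] << 8 | data[i + 1];
    if (len & 1)                        /* pad an odd trailing byte with zero */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                   /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Build an ICMP echo request (type 8, code 0) in buf.
 * Returns the message length, or 0 if the buffer is too small. */
size_t build_echo_request(uint8_t *buf, size_t buflen, uint16_t id,
                          uint16_t seq, const void *payload, size_t paylen)
{
    size_t len = 8 + paylen;            /* 8-byte ICMP header + data */
    uint16_t ck;

    assert(buf != NULL);
    if (buflen < len)
        return 0;
    buf[0] = 8;                         /* type: echo request */
    buf[1] = 0;                         /* code */
    buf[2] = 0;                         /* checksum is zero while summing */
    buf[3] = 0;
    buf[4] = (uint8_t)(id >> 8);
    buf[5] = (uint8_t)(id & 0xff);
    buf[6] = (uint8_t)(seq >> 8);
    buf[7] = (uint8_t)(seq & 0xff);
    if (paylen)
        memcpy(buf + 8, payload, paylen);
    ck = inet_cksum(buf, len);
    buf[2] = (uint8_t)(ck >> 8);
    buf[3] = (uint8_t)(ck & 0xff);
    return len;
}
```

A receiver can validate such a message by running inet_cksum over the whole message, checksum field included; a result of 0 means the checksum is correct.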

This chapter examines the implementation of raw IP sockets.

Code Introduction

There are five raw IP functions in a single C file, shown in Figure 32.1.

Figure 32.1. File discussed in this chapter.

File                Description
netinet/raw_ip.c    raw IP functions

Figure 32.2 shows the relationship of the five raw IP functions to other kernel functions.

Figure 32.2. Relationship of raw IP functions to rest of kernel.

The shaded ellipses are the five functions that we cover in this chapter. Be aware that the “rip” prefix used within the raw IP functions stands for “raw IP” and not the “Routing Information Protocol,” whose common acronym is RIP.

Global Variables

Figure 32.3 shows the four global variables introduced in this chapter.

Figure 32.3. Global variables introduced in this chapter.

Variable         Datatype              Description
rawinpcb         struct inpcb          head of the raw IP Internet PCB list
ripsrc           struct sockaddr_in    contains sender’s IP address on input
rip_recvspace    u_long                default size of socket receive buffer, 8192 bytes
rip_sendspace    u_long                default size of socket send buffer, 8192 bytes

Statistics

Raw IP maintains two of the counters in the ipstat structure (Figure 8.4). We describe these in Figure 32.4.

Figure 32.4. Raw IP statistics maintained in the ipstat structure.

ipstat member    Description                                         Used by SNMP
ips_noproto      #packets with an unknown or unsupported protocol    •
ips_rawout       total #raw IP packets generated

The use of the ips_noproto counter with SNMP is shown in Figure 8.6. Figure 8.5 shows some sample output of these two counters.

Raw IP protosw Structure

Unlike all other protocols, raw IP is accessed through multiple entries in the inetsw array. There are four entries in this structure with a socket type of SOCK_RAW, each with a different protocol value:

  • IPPROTO_ICMP (protocol value of 1),

  • IPPROTO_IGMP (protocol value of 2),

  • IPPROTO_RAW (protocol value of 255), and

  • raw wildcard entry (protocol value of 0).

The first two entries for ICMP and IGMP were described earlier (Figures 11.12 and 13.9). The difference in these four entries can be summarized as follows:

  • If the process creates a raw socket (SOCK_RAW) with a nonzero protocol value (the third argument to socket), and if that value matches IPPROTO_ICMP, IPPROTO_IGMP, or IPPROTO_RAW, then the corresponding protosw entry is used.

  • If the process creates a raw socket with a nonzero protocol value that is not known to the kernel, the wildcard entry with a protocol of 0 is matched by pffindproto. This allows a process to handle any IP protocol that is not known to the kernel, without making kernel modifications.
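The two rules above can be sketched as a small table lookup. This is a much-reduced model, not the kernel's pffindproto: the proto_entry structure, the T_xxx constants, and the find_proto name are hypothetical stand-ins, and the table lists only a few entries.

```c
#include <assert.h>
#include <stddef.h>

enum { T_STREAM = 1, T_DGRAM = 2, T_RAW = 3 };   /* stand-ins for SOCK_xxx */

struct proto_entry {            /* hypothetical, much-reduced protosw */
    int type;
    int protocol;
    const char *name;
};

static const struct proto_entry table[] = {
    { T_DGRAM,  17,  "udp" },
    { T_STREAM, 6,   "tcp" },
    { T_RAW,    255, "raw (IPPROTO_RAW)" },
    { T_RAW,    1,   "icmp" },
    { T_RAW,    2,   "igmp" },
    { T_RAW,    0,   "raw wildcard" },
};

/* Look up an entry for a nonzero protocol value. An exact match on
 * protocol and type wins; otherwise a raw request falls back to the
 * first raw entry whose protocol is 0. Returns NULL if nothing fits. */
const struct proto_entry *find_proto(int protocol, int type)
{
    const struct proto_entry *maybe = NULL;
    size_t i;

    assert(protocol != 0);      /* this sketch models only the nonzero case */
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        const struct proto_entry *pr = &table[i];

        if (pr->protocol == protocol && pr->type == type)
            return pr;
        if (type == T_RAW && pr->type == T_RAW &&
            pr->protocol == 0 && maybe == NULL)
            maybe = pr;         /* remember the wildcard, keep looking */
    }
    return maybe;
}
```

With this table, a raw request for protocol 89 (OSPF, which this model's "kernel" does not know) lands on the wildcard entry, while the same protocol with a datagram type finds nothing.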

We saw in Section 7.8 that all entries in the ip_protox array that are unknown are set to point to the entry for IPPROTO_RAW, whose protocol switch entry we show in Figure 32.5.

Figure 32.5. The raw IP protosw structure.

Member          inetsw[3]            Description
pr_type         SOCK_RAW             raw socket
pr_domain       &inetdomain          raw IP is part of the Internet domain
pr_protocol     IPPROTO_RAW (255)    appears in the ip_p field of the IP header
pr_flags        PR_ATOMIC|PR_ADDR    socket layer flags, not used by protocol processing
pr_input        rip_input            receives messages from IP layer
pr_output       0                    not used by raw IP
pr_ctlinput     0                    not used by raw IP
pr_ctloutput    rip_ctloutput        respond to administrative requests from a process
pr_usrreq       rip_usrreq           respond to communication requests from a process
pr_init         0                    not used by raw IP
pr_fasttimo     0                    not used by raw IP
pr_slowtimo     0                    not used by raw IP
pr_drain        0                    not used by raw IP
pr_sysctl       0                    not used by raw IP

We describe the three functions that begin with rip_ in this chapter. We also cover the function rip_output, which is not in the protocol switch entry but is called by rip_usrreq when a raw IP datagram is output.

The fifth raw IP function, rip_init, is contained only in the wildcard entry. The initialization function must be called only once, so it could appear in either the IPPROTO_RAW entry or in the wildcard entry.

What Figure 32.5 doesn’t show, however, is that other protocols (ICMP and IGMP) also reference some of the raw IP functions in their protosw entries. Figure 32.6 compares the relevant fields in the protosw entries for the four SOCK_RAW protocols. To highlight the differences, values in these rows are in a bolder font when they differ.

Figure 32.6. Comparison of protocol switch values for raw sockets.

                 SOCK_RAW protocol type
protosw entry    IPPROTO_ICMP (1)    IPPROTO_IGMP (2)    IPPROTO_RAW (255)    wildcard (0)
pr_input         icmp_input          igmp_input          rip_input            rip_input
pr_output        rip_output          rip_output          rip_output           rip_output
pr_ctloutput     rip_ctloutput       rip_ctloutput       rip_ctloutput        rip_ctloutput
pr_usrreq        rip_usrreq          rip_usrreq          rip_usrreq           rip_usrreq
pr_init          0                   igmp_init           0                    rip_init
pr_sysctl        icmp_sysctl         0                   0                    0
pr_fasttimo      0                   igmp_fasttimo       0                    0

The implementation of raw sockets has changed with the different BSD releases. The entry with a protocol of IPPROTO_RAW has always been used as the wildcard entry in the ip_protox table for unknown IP protocols. The entry with a protocol of 0 has always been the default entry, to allow processes to read and write IP datagrams with a protocol that the kernel doesn’t support.

Usage of the IPPROTO_RAW entry by a process started when Traceroute was developed by Van Jacobson, because Traceroute was the first process that needed to write its own IP headers (to change the TTL field). The kernel patches to 4.3BSD and Net/1 to support Traceroute included a change to rip_output so that if the protocol was IPPROTO_RAW, it was assumed the process had passed a complete IP datagram, including the IP header. This was changed with Net/2 when the IP_HDRINCL socket option was introduced, removing this overloading of the IPPROTO_RAW protocol and allowing a process to send its own IP header with the wildcard entry.

rip_init Function

The domaininit function calls the raw IP initialization function rip_init (Figure 32.7) at system initialization time.

Figure 32.7. rip_init function.

---------------------------------------------------------------------------- raw_ip.c
 47 void
 48 rip_init()
 49 {

 50     rawinpcb.inp_next = rawinpcb.inp_prev = &rawinpcb;
 51 }
---------------------------------------------------------------------------- raw_ip.c

The only action performed by this function is to set the next and previous pointers in the head PCB (rawinpcb) to point to itself. This is an empty doubly linked list.
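The list idiom can be sketched outside the kernel as follows. The pcb structure and the pcb_xxx helper names are hypothetical stand-ins for struct inpcb and the kernel's own queue routines; the point is only that an empty list is a head whose two links point back at itself, so insertion and removal need no special case.

```c
#include <assert.h>
#include <stddef.h>

struct pcb {                    /* hypothetical stand-in for struct inpcb */
    struct pcb *next, *prev;
};

/* Make head an empty list: both links point back at the head itself. */
void pcb_list_init(struct pcb *head)
{
    assert(head != NULL);
    head->next = head->prev = head;
}

/* Insert p immediately after head. */
void pcb_insert(struct pcb *head, struct pcb *p)
{
    p->next = head->next;
    p->prev = head;
    head->next->prev = p;
    head->next = p;
}

/* Unlink p from whatever list it is on. */
void pcb_remove(struct pcb *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
    p->next = p->prev = NULL;
}

/* Count the entries by walking until we are back at the head,
 * the same loop shape rip_input uses to scan the raw PCB list. */
size_t pcb_count(const struct pcb *head)
{
    const struct pcb *p;
    size_t n = 0;

    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```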

Whenever a socket of type SOCK_RAW is created by the socket system call, we’ll see that the raw IP PRU_ATTACH function creates an Internet PCB and puts it onto the rawinpcb list.

rip_input Function

Since all entries in the ip_protox array for unknown protocols are set to point to the entry for IPPROTO_RAW (Section 7.8), and since the pr_input function for this protocol is rip_input (Figure 32.6), this function is called for all IP datagrams that have a protocol value that the kernel doesn’t recognize. But from Figure 32.2 we see that both ICMP and IGMP also call rip_input. This happens under the following conditions:

  • icmp_input calls rip_input for all unknown ICMP message types and for all ICMP messages that are not reflected.

  • igmp_input calls rip_input for all IGMP packets.

One reason for calling rip_input in these two cases is to allow a process with a raw socket to handle new ICMP and IGMP messages that might not be supported by the kernel.

Figure 32.8 shows the rip_input function.

Figure 32.8. rip_input function.

------------------------------------------------------------------------- raw_ip.c
 59 void
 60 rip_input(m)
 61 struct mbuf *m;
 62 {
 63     struct ip *ip = mtod(m, struct ip *);
 64     struct inpcb *inp;
 65     struct socket *last = 0;

 66     ripsrc.sin_addr = ip->ip_src;
 67     for (inp = rawinpcb.inp_next; inp != &rawinpcb; inp = inp->inp_next) {
 68         if (inp->inp_ip.ip_p && inp->inp_ip.ip_p != ip->ip_p)
 69             continue;
 70         if (inp->inp_laddr.s_addr &&
 71             inp->inp_laddr.s_addr == ip->ip_dst.s_addr)
 72             continue;
 73         if (inp->inp_faddr.s_addr &&
 74             inp->inp_faddr.s_addr == ip->ip_src.s_addr)
 75             continue;
 76         if (last) {
 77             struct mbuf *n;
 78             if (n = m_copy(m, 0, (int) M_COPYALL)) {
 79                 if (sbappendaddr(&last->so_rcv, &ripsrc,
 80                                  n, (struct mbuf *) 0) == 0)
 81                     /* should notify about lost packet */
 82                     m_freem(n);
 83                 else
 84                     sorwakeup(last);
 85             }
 86         }
 87         last = inp->inp_socket;
 88     }
 89     if (last) {
 90         if (sbappendaddr(&last->so_rcv, &ripsrc,
 91                          m, (struct mbuf *) 0) == 0)
 92             m_freem(m);
 93         else
 94             sorwakeup(last);
 95     } else {
 96         m_freem(m);
 97         ipstat.ips_noproto++;
 98         ipstat.ips_delivered--;
 99     }
100 }
------------------------------------------------------------------------- raw_ip.c

Save source IP address

59-66

The source address from the IP datagram is put into the global variable ripsrc, which becomes an argument to sbappendaddr whenever a matching PCB is found. Unlike UDP, there is no concept of a port number with raw IP, so the sin_port field in the sockaddr_in structure is always 0.

Search all raw IP PCBs for one or more matching entries

67-88

Raw IP handles its list of PCBs differently from UDP and TCP. We saw that these two protocols maintain a pointer to the PCB for the most recently received datagram (a one-behind cache) and call the generic function in_pcblookup to search for a single “best” match when the received datagram does not match the cache entry. Raw IP has completely different criteria for a matching PCB, so it searches the PCB list itself. in_pcblookup cannot be used because a raw IP datagram can be delivered to multiple sockets, so every PCB on the raw PCB list must be scanned. This is similar to UDP’s handling of a received datagram destined for a broadcast or multicast address (Figure 23.26).

Compare protocols

68-69

If the protocol field in the PCB is nonzero, and if it doesn’t match the protocol field in the IP header, the PCB is ignored. This implies that a raw socket with a protocol value of 0 (the third argument to socket) can match any received raw IP datagram.

Compare local and foreign IP addresses

70-75

If the local address in the PCB is nonzero, and if it doesn’t match the destination IP address in the IP header, the PCB is ignored. If the foreign address in the PCB is nonzero, and if it doesn’t match the source IP address in the IP header, the PCB is ignored.

These three tests imply that a process can create a raw socket with a protocol of 0, not bind a local address, and not connect to a foreign address, and the process receives all datagrams processed by rip_input.

Lines 71 and 74 both contain the same bug: the test for equality should be a test for inequality.

Pass copy of received datagram to processes

76-94

sbappendaddr passes a copy of the received datagram to the process. The use of the variable last is similar to what we saw in Figure 23.26: since sbappendaddr releases the mbuf after placing it onto the appropriate queue, if more than one process receives a copy of the datagram, rip_input must make a copy by calling m_copy. But if only one process receives the datagram, there’s no need to make a copy.
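The effect of the last variable can be seen in a small model that counts copies instead of handling mbufs. The receiver structure and the deliver function are hypothetical; the loop has the same shape as the one in rip_input, so n matching receivers cost n-1 copies and the final match consumes the original.

```c
#include <assert.h>
#include <stddef.h>

struct receiver {
    int matches;     /* nonzero if this receiver should get the datagram */
    int got_copy;    /* set when handed a freshly made copy */
    int got_orig;    /* set when handed the original buffer */
};

/* Deliver one datagram to every matching receiver.
 * Returns the number of copies that had to be made. */
size_t deliver(struct receiver *r, size_t n)
{
    struct receiver *last = NULL;
    size_t copies = 0, i;

    assert(r != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!r[i].matches)
            continue;
        if (last) {             /* another match follows: copy for last */
            last->got_copy = 1;
            copies++;
        }
        last = &r[i];           /* defer this one until we know more */
    }
    if (last)
        last->got_orig = 1;     /* final match gets the original */
    return copies;
}
```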

Undeliverable datagram

95-99

If no matching sockets are found for the datagram, the mbuf is released, ips_noproto is incremented, and ips_delivered is decremented. This latter counter was incremented by IP just before calling rip_input (Figure 8.15). It must be decremented so that the two SNMP counters, ipInDiscards and ipInDelivers (Figure 8.6), are correct, since the datagram was not really delivered to a transport layer.

At the beginning of this section we mentioned that icmp_input calls rip_input for unknown message types and for messages that are not reflected. This means that the receipt of an ICMP host unreachable causes ips_noproto to be incremented if there are no raw listeners whose PCB is matched by rip_input. That’s one reason this counter has such a large value in Figure 8.5. The description of this counter as being “unknown or unsupported protocols” is not entirely accurate.

Net/3 does not generate an ICMP destination unreachable message with code 2 (protocol unreachable) when an IP datagram is received with a protocol field that is not handled by either the kernel or some process through a raw socket. RFC 1122 says an implementation should generate this ICMP error. (See Exercise 32.4.)

rip_output Function

We saw in Figure 32.6 that rip_output is called for output for raw sockets by ICMP, IGMP, and raw IP. Output occurs when the application calls one of the five write functions: send, sendto, sendmsg, write, or writev. If the socket is connected, any of the five functions can be called, although a destination address cannot be specified with sendto or sendmsg. If the socket is unconnected, only sendto and sendmsg can be called, and a destination address must be specified.

The function rip_output is shown in Figure 32.9.

Figure 32.9. rip_output function.

------------------------------------------------------------------------- raw_ip.c
105 int
106 rip_output(m, so, dst)
107 struct mbuf *m;
108 struct socket *so;
109 u_long  dst;
110 {
111     struct ip *ip;
112     struct inpcb *inp = sotoinpcb(so);
113     struct mbuf *opts;
114     int     flags = (so->so_options & SO_DONTROUTE) | IP_ALLOWBROADCAST;

115     /*
116      * If the user handed us a complete IP packet, use it.
117      * Otherwise, allocate an mbuf for a header and fill it in.
118      */
119     if ((inp->inp_flags & INP_HDRINCL) == 0) {
120         M_PREPEND(m, sizeof(struct ip), M_WAIT);
121         ip = mtod(m, struct ip *);
122         ip->ip_tos = 0;
123         ip->ip_off = 0;
124         ip->ip_p = inp->inp_ip.ip_p;
125         ip->ip_len = m->m_pkthdr.len;
126         ip->ip_src = inp->inp_laddr;
127         ip->ip_dst.s_addr = dst;
128         ip->ip_ttl = MAXTTL;
129         opts = inp->inp_options;
130     } else {
131         ip = mtod(m, struct ip *);
132         if (ip->ip_id == 0)
133             ip->ip_id = htons(ip_id++);
134         opts = NULL;
135         /* XXX prevent ip_output from overwriting header fields */
136         flags |= IP_RAWOUTPUT;
137         ipstat.ips_rawout++;
138     }
139     return (ip_output(m, opts, &inp->inp_route, flags, inp->inp_moptions));
140 }
------------------------------------------------------------------------- raw_ip.c

Kernel fills in IP header

119-128

If the IP_HDRINCL socket option is not set, M_PREPEND allocates room for an IP header, and fields in the IP header are filled in. The fields that are not filled in here are left for ip_output to initialize (Figure 8.22). The protocol field is set to the value stored in the PCB, which we’ll see in Figure 32.10 is the third argument to the socket system call.

Figure 32.10. rip_usrreq function: PRU_ATTACH request.

------------------------------------------------------------------------- raw_ip.c
194 int
195 rip_usrreq(so, req, m, nam, control)
196 struct socket *so;
197 int     req;
198 struct mbuf *m, *nam, *control;
199 {
200     int     error = 0;
201     struct inpcb *inp = sotoinpcb(so);
202     extern struct socket *ip_mrouter;
203     switch (req) {

204     case PRU_ATTACH:
205         if (inp)
206             panic("rip_attach");
207         if ((so->so_state & SS_PRIV) == 0) {
208             error = EACCES;
209             break;
210         }
211         if ((error = soreserve(so, rip_sendspace, rip_recvspace)) ||
212             (error = in_pcballoc(so, &rawinpcb)))
213             break;
214         inp = (struct inpcb *) so->so_pcb;
215         inp->inp_ip.ip_p = (int) nam;
216         break;
------------------------------------------------------------------------- raw_ip.c

The TOS is set to 0 and the TTL to 255 (MAXTTL). These values are always used for a raw socket when the kernel fills in the header. This differs from UDP and TCP, where the process can set the TTL and TOS with the IP_TTL and IP_TOS socket options.

129

Any IP options set by the process with the IP_OPTIONS socket option are passed to ip_output through the opts variable.

Caller fills in IP header: IP_HDRINCL socket option

130-133

If the IP_HDRINCL socket option is set, the caller supplies a completed IP header at the front of the datagram. The only modification made to this IP header is to set the ID field if the value supplied by the process is 0. The ID field of an IP datagram can be 0. The assignment of the ID field here by rip_output is just a convention that allows the process to set it to 0, asking the kernel to assign an ID value based on the kernel’s current ip_id variable.

134-136

The opts variable is set to a null pointer, which ignores any IP options the process may have set with the IP_OPTIONS socket option. The convention here is that if the caller builds its own IP header, that header includes any IP options the caller might want. The flags variable must also include the IP_RAWOUTPUT flag, telling ip_output to leave the header alone.

137

The counter ips_rawout is incremented. Running Traceroute causes this variable to be incremented by 1 for each datagram sent by Traceroute.

The operation of rip_output has changed over time. When the IP_HDRINCL socket option is used in Net/3, the only change made to the IP header by rip_output is to set the ID field, if the process sets it to 0. The Net/3 ip_output function does nothing to the IP header fields because the IP_RAWOUTPUT flag is set. Net/2, however, always set certain fields in the IP header, even if the IP_HDRINCL socket option was set: the IP version was set to 4, the fragment offset was set to 0, and the more-fragments flag was cleared.

rip_usrreq Function

The protocol’s user-request function is called for a variety of operations. As with the UDP and TCP user-request functions, rip_usrreq is a large switch statement, with one case for each PRU_xxx request.

The PRU_ATTACH request, shown in Figure 32.10, is from the socket system call.

194-206

Since the socket function creates a new socket structure each time it is called, that structure cannot yet point to an Internet PCB. If it does, the kernel panics.

Verify superuser

207-210

Only the superuser can create a raw socket. This is to prevent random users from writing their own IP datagrams to the network.

Create Internet PCB and reserve buffer space

211-215

Space is reserved for input and output queues, and in_pcballoc allocates a new Internet PCB. The PCB is added to the raw IP PCB list (rawinpcb). The PCB is linked to the socket structure. The nam argument to rip_usrreq is the third argument to the socket system call: the protocol. It is stored in the PCB since it is used by rip_input to demultiplex received datagrams, and its value is placed into the protocol field of outgoing datagrams by rip_output (if IP_HDRINCL is not set).

A raw IP socket can be connected to a foreign IP address similar to a UDP socket being connected to a foreign IP address. This fixes the foreign IP address from which the raw socket receives datagrams, as we saw in rip_input. Since raw IP is a connectionless protocol like UDP, a PRU_DISCONNECT request can occur in two cases:

  1. When a connected raw socket is closed, PRU_DISCONNECT is called before PRU_DETACH.

  2. When a connect is issued on an already-connected raw socket, soconnect issues the PRU_DISCONNECT request before the PRU_CONNECT request.

Figure 32.11 shows the PRU_DISCONNECT, PRU_ABORT, and PRU_DETACH requests.

Figure 32.11. rip_usrreq function: PRU_DISCONNECT, PRU_ABORT, and PRU_DETACH requests.

------------------------------------------------------------------------- raw_ip.c
217     case PRU_DISCONNECT:
218         if ((so->so_state & SS_ISCONNECTED) == 0) {
219             error = ENOTCONN;
220             break;
221         }
222         /* FALLTHROUGH */

223     case PRU_ABORT:
224         soisdisconnected(so);
225         /* FALLTHROUGH */

226     case PRU_DETACH:
227         if (inp == 0)
228             panic("rip_detach");
229         if (so == ip_mrouter)
230             ip_mrouter_done();
231         in_pcbdetach(inp);
232         break;
------------------------------------------------------------------------- raw_ip.c

217-222

If the socket is not already connected, the error ENOTCONN is returned.

223-225

A PRU_ABORT request should never be issued for a raw IP socket, but this case also handles the fall through from PRU_DISCONNECT. The socket is marked as disconnected.

226-230

The close system call issues the PRU_DETACH request, and this case also handles the fall through from the PRU_DISCONNECT and PRU_ABORT requests. If the socket structure is the one used for multicast routing (ip_mrouter), multicast routing is disabled by calling ip_mrouter_done. Normally the mrouted(8) daemon issues the DVMRP_DONE socket option to disable multicast routing, so this check handles the case of the router daemon terminating (i.e., crashing) without issuing the socket option.

231

The Internet PCB is released by in_pcbdetach, which also removes the PCB from the list of raw IP PCBs (rawinpcb).

A raw IP socket can be bound to a local IP address with the PRU_BIND request, shown in Figure 32.12. We saw in rip_input that the socket will receive only datagrams sent to this IP address.

Figure 32.12. rip_usrreq function: PRU_BIND request.

------------------------------------------------------------------------- raw_ip.c
233     case PRU_BIND:
234         {
235             struct sockaddr_in *addr = mtod(nam, struct sockaddr_in *);

236             if (nam->m_len != sizeof(*addr)) {
237                 error = EINVAL;
238                 break;
239             }
240             if ((ifnet == 0) ||
241                 ((addr->sin_family != AF_INET) &&
242                  (addr->sin_family != AF_IMPLINK)) ||
243                 (addr->sin_addr.s_addr &&
244                  ifa_ifwithaddr((struct sockaddr *) addr) == 0)) {
245                 error = EADDRNOTAVAIL;
246                 break;
247             }
248             inp->inp_laddr = addr->sin_addr;
249             break;
250         }
------------------------------------------------------------------------- raw_ip.c

233-250

The process fills in a sockaddr_in structure with the local IP address. The following three conditions must all be true, or else the error EADDRNOTAVAIL is returned:

  1. at least one interface must be configured,

  2. the address family must be AF_INET (or AF_IMPLINK, a historical artifact), and

  3. if the IP address being bound is not 0.0.0.0, it must correspond to a local interface. For the call to ifa_ifwithaddr to succeed, the port number in the caller’s sockaddr_in must be 0.

The local IP address is stored in the PCB.
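The three conditions can be expressed as a pure function over a list of interface addresses. This is a model only: the check_bind name, the FAM_xxx constants, and the flat array of addresses are hypothetical stand-ins for the ifnet list and ifa_ifwithaddr, and the port-number requirement is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum { FAM_INET = 2, FAM_IMPLINK = 3 };   /* stand-ins for AF_xxx values */

/* Return 0 if addr may be bound, or EADDRNOTAVAIL if any of the three
 * conditions fails. ifaddrs holds the nif configured local addresses. */
int check_bind(const uint32_t *ifaddrs, size_t nif, int family, uint32_t addr)
{
    size_t i;

    assert(ifaddrs != NULL || nif == 0);
    if (nif == 0)                               /* 1. no interface configured */
        return EADDRNOTAVAIL;
    if (family != FAM_INET && family != FAM_IMPLINK)   /* 2. wrong family */
        return EADDRNOTAVAIL;
    if (addr == 0)                              /* 0.0.0.0 is always allowed */
        return 0;
    for (i = 0; i < nif; i++)                   /* 3. must be a local address */
        if (ifaddrs[i] == addr)
            return 0;
    return EADDRNOTAVAIL;
}
```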

A process can also connect a raw IP socket to a particular foreign IP address. We saw in rip_input that this restricts the process so that it receives only IP datagrams with a source IP address equal to the connected IP address. A process has the option of calling bind, connect, both, or neither, depending on the type of filtering it wants rip_input to place on received datagrams. Figure 32.13 shows the PRU_CONNECT request.

Figure 32.13. rip_usrreq function: PRU_CONNECT request.

------------------------------------------------------------------------- raw_ip.c
251     case PRU_CONNECT:
252         {
253             struct sockaddr_in *addr = mtod(nam, struct sockaddr_in *);

254             if (nam->m_len != sizeof(*addr)) {
255                 error = EINVAL;
256                 break;
257             }
258             if (ifnet == 0) {
259                 error = EADDRNOTAVAIL;
260                 break;
261             }
262             if ((addr->sin_family != AF_INET) &&
263                 (addr->sin_family != AF_IMPLINK)) {
264                 error = EAFNOSUPPORT;
265                 break;
266             }
267             inp->inp_faddr = addr->sin_addr;
268             soisconnected(so);
269             break;
270         }
------------------------------------------------------------------------- raw_ip.c

251-270

If the caller’s sockaddr_in is initialized correctly and at least one IP interface is configured, the specified foreign IP address is stored in the PCB. Notice that this process differs from the connection of a UDP socket to a foreign address. In the UDP case, in_pcbconnect acquires a route to the foreign address and also stores the outgoing interface as the local address (Figure 22.9). With raw IP, only the foreign IP address is stored in the PCB, and unless the process also calls bind, only the foreign address is compared by rip_input.

A call to shutdown specifying that the process has finished sending data generates the PRU_SHUTDOWN request, although it is rare for a process to issue this system call for a raw IP socket. Figure 32.14 shows the PRU_CONNECT2 and PRU_SHUTDOWN requests.

Figure 32.14. rip_usrreq function: PRU_CONNECT2 and PRU_SHUTDOWN requests.

---------------------------------------------------------------------------- raw_ip.c
271     case PRU_CONNECT2:
272         error = EOPNOTSUPP;
273         break;

274         /*
275          * Mark the connection as being incapable of further input.
276          */
277     case PRU_SHUTDOWN:
278         socantsendmore(so);
279         break;
---------------------------------------------------------------------------- raw_ip.c

271-273

The PRU_CONNECT2 request is not supported for a raw IP socket.

274-279

socantsendmore sets the socket’s flags to prevent any future output. (The comment in the source speaks of input, but the call affects only the sending side.)

In Figure 23.14 we showed how the five write functions call the protocol’s pr_usrreq function with a PRU_SEND request. We show this request in Figure 32.15.

Figure 32.15. rip_usrreq function: PRU_SEND request.

-------------------------------------------------------------------------- raw_ip.c
280         /*
281          * Ship a packet out.  The appropriate raw output
282          * routine handles any massaging necessary.
283          */
284     case PRU_SEND:
285         {
286             u_long  dst;

287             if (so->so_state & SS_ISCONNECTED) {
288                 if (nam) {
289                     error = EISCONN;
290                     break;
291                 }
292                 dst = inp->inp_faddr.s_addr;
293             } else {
294                 if (nam == NULL) {
295                     error = ENOTCONN;
296                     break;
297                 }
298                 dst = mtod(nam, struct sockaddr_in *)->sin_addr.s_addr;
299             }
300             error = rip_output(m, so, dst);
301             m = NULL;
302             break;
303         }
-------------------------------------------------------------------------- raw_ip.c

280-303

If the socket state is connected, the caller cannot specify a destination address (the nam argument). Likewise, if the state is unconnected, a destination address is required. If all is OK, in either state, dst is set to the destination IP address. rip_output sends the datagram. The mbuf pointer m is set to a null pointer, to prevent it from being released at the end of the function. This is because the interface output routine will release the mbuf after it has been output. (Remember that rip_output passes the mbuf chain to ip_output, which appends it to the interface’s output queue.)
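The destination rules reduce to a small decision function. The sketch below is a model with a hypothetical name (pick_destination) and plain integers in place of the socket, PCB, and mbuf arguments; the error values are the ones returned by the PRU_SEND case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Choose the destination for a send on a raw socket.
 * connected: nonzero if the socket is connected
 * faddr:     foreign address stored in the PCB by connect
 * nam:       destination supplied with the call, or NULL if none
 * Returns 0 and stores the destination in *dst, or an errno value. */
int pick_destination(int connected, uint32_t faddr,
                     const uint32_t *nam, uint32_t *dst)
{
    assert(dst != NULL);
    if (connected) {
        if (nam != NULL)
            return EISCONN;     /* connected: no address allowed */
        *dst = faddr;
    } else {
        if (nam == NULL)
            return ENOTCONN;    /* unconnected: address required */
        *dst = *nam;
    }
    return 0;
}
```

From the process's point of view, EISCONN corresponds to calling sendto or sendmsg with an address on a connected raw socket, and ENOTCONN to calling write, writev, or send on an unconnected one.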

The final part of rip_usrreq is shown in Figure 32.16. The PRU_SENSE request, generated by the fstat system call, returns nothing. The PRU_SOCKADDR and PRU_PEERADDR requests are from the getsockname and getpeername system calls, respectively. The remaining requests are not supported.

Figure 32.16. rip_usrreq function: remaining requests.

---------------------------------------------------------------------------- raw_ip.c
304     case PRU_SENSE:
305         /*
306          * fstat: don't bother with a blocksize.
307          */
308         return (0);

309         /*
310          * Not supported.
311          */
312     case PRU_RCVOOB:
313     case PRU_RCVD:
314     case PRU_LISTEN:
315     case PRU_ACCEPT:
316     case PRU_SENDOOB:
317         error = EOPNOTSUPP;
318         break;

319     case PRU_SOCKADDR:
320         in_setsockaddr(inp, nam);
321         break;

322     case PRU_PEERADDR:
323         in_setpeeraddr(inp, nam);
324         break;

325     default:
326         panic("rip_usrreq");
327     }
328     if (m != NULL)
329         m_freem(m);
330     return (error);
331 }
---------------------------------------------------------------------------- raw_ip.c

319-324

The functions in_setsockaddr and in_setpeeraddr fetch the information from the PCB, storing the result in the nam argument.

rip_ctloutput Function

The setsockopt and getsockopt system calls invoke the rip_ctloutput function. Only one IP socket option is handled here, along with eight socket options related to multicast routing.

Figure 32.17 shows the first part of the rip_ctloutput function.

Figure 32.17. rip_ctloutput function: process IP_HDRINCL socket option.

------------------------------------------------------------------------------- raw_ip.c
144 int
145 rip_ctloutput(op, so, level, optname, m)
146 int     op;
147 struct socket *so;
148 int     level, optname;
149 struct mbuf **m;
150 {
151     struct inpcb *inp = sotoinpcb(so);
152     int     error;

153     if (level != IPPROTO_IP)
154         return (EINVAL);

155     switch (optname) {

156     case IP_HDRINCL:
157         if (op == PRCO_SETOPT || op == PRCO_GETOPT) {
158             if (m == 0 || *m == 0 || (*m)->m_len < sizeof(int))
159                         return (EINVAL);
160             if (op == PRCO_SETOPT) {
161                 if (*mtod(*m, int *))
162                             inp->inp_flags |= INP_HDRINCL;
163                 else
164                     inp->inp_flags &= ~INP_HDRINCL;
165                 (void) m_free(*m);
166             } else {
167                 (*m)->m_len = sizeof(int);
168                 *mtod(*m, int *) = inp->inp_flags & INP_HDRINCL;
169             }
170             return (0);
171         }
172         break;
------------------------------------------------------------------------------- raw_ip.c

144-172

The mbuf that supplies the new value of the option, or that will hold the current value, must be at least the size of an integer. For the setsockopt system call, the flag is set if the integer value in the mbuf is nonzero and cleared otherwise. For the getsockopt system call, the value returned in the mbuf is either 0 or the nonzero value of the flag. The function then returns directly, to avoid the processing at the end of the switch statement for other IP options.
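The set/clear/get logic just described can be tried outside the kernel with a minimal sketch. Here a plain integer buffer and a length stand in for the mbuf and its m_len; the `sketch_` names and the flag value 0x08 are assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_INP_HDRINCL 0x08   /* hypothetical flag bit for this sketch */

enum sketch_op { SKETCH_SETOPT, SKETCH_GETOPT };

/*
 * Mirror of the IP_HDRINCL case: buf and *len play the role of the
 * mbuf data and m_len. Returns 0 on success or EINVAL if the buffer
 * is missing or smaller than an int.
 */
int sketch_hdrincl(enum sketch_op op, int *inp_flags, int *buf, size_t *len)
{
    if (buf == NULL || len == NULL || *len < sizeof(int))
        return EINVAL;

    if (op == SKETCH_SETOPT) {
        if (*buf)
            *inp_flags |= SKETCH_INP_HDRINCL;    /* nonzero: set flag */
        else
            *inp_flags &= ~SKETCH_INP_HDRINCL;   /* zero: clear flag */
    } else {
        *len = sizeof(int);
        /* Returns 0 or the flag bit itself, not a normalized 1. */
        *buf = *inp_flags & SKETCH_INP_HDRINCL;
    }
    return 0;
}
```

A caller of getsockopt should therefore test the returned value for nonzero rather than compare it with 1.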

Figure 32.18 shows the last portion of the rip_ctloutput function. It handles eight multicast routing socket options.

Figure 32.18. rip_ctloutput function: process multicast routing socket options.

------------------------------------------------------------------------- raw_ip.c
173     case DVMRP_INIT:
174     case DVMRP_DONE:
175     case DVMRP_ADD_VIF:
176     case DVMRP_DEL_VIF:
177     case DVMRP_ADD_LGRP:
178     case DVMRP_DEL_LGRP:
179     case DVMRP_ADD_MRT:
180     case DVMRP_DEL_MRT:
                                                                               
                                  /* shown in Figure 14.9 */                   
                                                                               
188     }
189     return (ip_ctloutput(op, so, level, optname, m));
190 }
------------------------------------------------------------------------- raw_ip.c

173-188

These eight socket options are valid only for the setsockopt system call. They are processed by the ip_mrouter_cmd function as discussed with Figure 14.9.

189

Any other IP socket options, such as IP_OPTIONS to set the IP options, are processed by ip_ctloutput.

Summary

Raw sockets provide three capabilities for an IP host.

  1. They are used to send and receive ICMP and IGMP messages.

  2. They allow a process to build its own IP headers.

  3. They allow additional IP-based protocols to be supported in a user process.

We saw that raw IP output is simple: it just fills in a few fields in the IP header, but it also allows a process to supply its own IP header. This allows diagnostic programs to create any type of IP datagram.
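As a rough illustration of the two output paths, the sketch below fills in a header when the process has not supplied one and leaves the header alone otherwise. The structure, the `sketch_` names, the exact list of fields filled in, and the TTL value are assumptions for illustration, and the header-included path is simplified (the kernel may still adjust some fields). The one point taken directly from the chapter is that the protocol field is copied from the protocol given when the socket was created.

```c
#include <stdint.h>

#define SKETCH_MAXTTL 255   /* assumed default TTL for this sketch */

/* Only the fields this sketch touches; not the wire layout of an IP header. */
struct sketch_ip {
    uint8_t  ip_tos;
    uint16_t ip_len;
    uint16_t ip_off;
    uint8_t  ip_ttl;
    uint8_t  ip_p;
    uint32_t ip_src;
    uint32_t ip_dst;
};

/*
 * hdrincl  - nonzero if the process supplied its own IP header
 * so_proto - protocol value given as the third argument to socket
 */
void sketch_rip_fill(struct sketch_ip *ip, int hdrincl, uint8_t so_proto,
                     uint16_t total_len, uint32_t laddr, uint32_t dst)
{
    if (hdrincl)
        return;                 /* simplified: use the caller's header as is */

    ip->ip_tos = 0;
    ip->ip_off = 0;
    ip->ip_p   = so_proto;      /* 0 or 255 end up here unchanged */
    ip->ip_len = total_len;
    ip->ip_src = laddr;
    ip->ip_dst = dst;
    ip->ip_ttl = SKETCH_MAXTTL;
}
```

With a socket protocol of 0 or 255 and no header supplied by the process, the sketch stores a reserved value in ip_p, which is the situation Exercise 32.1 asks about.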

Raw IP input provides three types of filtering for incoming IP datagrams. The process chooses to receive datagrams based on (1) the protocol field, (2) the source IP address (set by connect), and (3) the destination IP address (set by bind). The process chooses which combination of these three filters (if any) to apply.
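The three filters can be written as a single predicate. This is a user-space sketch with hypothetical names, assuming that a protocol of 0 and an address of 0 mean "no filter"; it is meant to show the order and effect of the tests, not to reproduce the kernel's rip_input.

```c
#include <stdint.h>

/* Per-socket filter state (hypothetical structure for this sketch). */
struct sketch_rawpcb {
    uint8_t  proto;   /* protocol from socket(); 0 matches any protocol */
    uint32_t laddr;   /* local address set by bind; 0 if not bound */
    uint32_t faddr;   /* foreign address set by connect; 0 if not connected */
};

/* Return 1 if the datagram should be delivered to this socket, else 0. */
int sketch_rip_match(const struct sketch_rawpcb *pcb,
                     uint8_t ip_p, uint32_t ip_src, uint32_t ip_dst)
{
    /* (1) protocol field */
    if (pcb->proto != 0 && pcb->proto != ip_p)
        return 0;
    /* (2) destination IP address must equal the bound local address */
    if (pcb->laddr != 0 && pcb->laddr != ip_dst)
        return 0;
    /* (3) source IP address must equal the connected foreign address */
    if (pcb->faddr != 0 && pcb->faddr != ip_src)
        return 0;
    return 1;
}
```

A socket created with protocol 0 that never calls bind or connect passes every test, while a socket created with protocol 255 fails the first test for any datagram carrying a different protocol value.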

Exercises

32.1

Assume the IP_HDRINCL socket option is not set. What value will rip_output place into the IP header protocol field (ip_p) when the third argument to socket is 0? What value will rip_output place into this field when the third argument to socket is IPPROTO_RAW (255)?

32.1

0 in the first example, and 255 in the second. Both of these values are reserved in RFC 1700 [Reynolds and Postel 1994] and should not appear in datagrams. This means, for example, that a socket created with a protocol of IPPROTO_RAW should always have the IP_HDRINCL socket option set, and datagrams written to the socket should have a valid protocol value.

32.2

A process creates a raw socket with a protocol value of IPPROTO_RAW (255). What type of IP datagrams will the process receive on this socket?

32.2

Since the IP protocol value of 255 is reserved, datagrams should never appear on the wire with this protocol value. Since this is a nonzero protocol value, the first of the three tests in rip_input will ignore every received datagram that does not have this protocol value. Therefore the process should not receive any datagrams on the socket.

32.3

A process creates a raw socket with a protocol value of 0. What type of IP datagrams will the process receive on this socket?

32.3

Even though this protocol value is reserved and datagrams should never appear on the wire with this value, the first of the three tests in rip_input allows datagrams with any protocol value to be received by sockets of this type. The only input filtering that occurs for this type of raw socket is based on the source and destination IP addresses, if the process calls either connect or bind, or both.

32.4

Modify rip_input to send an ICMP destination unreachable with code 2 (protocol unreachable) when appropriate. Be careful not to generate the error for received ICMP and IGMP packets for which rip_input is called.

32.4

Since the ip_protox array (Figure 7.22) contains information about which protocols the kernel supports, the ICMP error should be generated only when there are no raw listeners for the protocol and the pointer inetsw[ip_protox[ip->ip_p]].pr_input equals rip_input.

32.5

If a process wants to write its own IP datagrams with its own IP header, what are the differences in using a raw IP socket with the IP_HDRINCL option, and using BPF (Chapter 31)?

32.5

In both cases the process must build its own IP header, in addition to whatever follows the IP header (UDP datagram, TCP segment, or whatever). With a raw IP socket, output is normally done using sendto specifying the destination address as an Internet socket address structure containing an IP address. ip_output is called and normal IP routing is done based on the destination IP address.

BPF requires the process to supply a complete data-link header, such as an Ethernet header. Output is normally done by calling write, since a destination address cannot be specified. The packet is passed directly to the interface output function, bypassing ip_output (Figure 31.20). The process selects the outgoing interface using the BIOCSETIF ioctl (Figure 31.16). Since IP routing is not performed, the destination of the packet is limited to another system on an attached network (unless the process duplicates the IP routing function and sends the packet to a router on an attached network, for the router to forward based on the destination IP address).

32.6

When would a process read from a raw IP socket, and when would it read from BPF?

32.6

A raw IP socket receives only IP datagrams destined for an IP protocol that the kernel does not process itself. A process cannot receive TCP segments or UDP datagrams on a raw socket, for example.

BPF can receive all frames received on a specified interface, regardless of whether they are IP datagrams or not. The BIOCPROMISC ioctl can put the interface into a promiscuous mode, to receive datagrams that are not even destined for this host.
