A process sends and receives the routing messages described in the previous chapter by using a socket in the routing domain. The socket system call is issued specifying a family of PF_ROUTE and a socket type of SOCK_RAW.
The process can then send five routing messages to the kernel:
RTM_ADD: add a new route.
RTM_DELETE: delete an existing route.
RTM_GET: fetch all the information about a route.
RTM_CHANGE: change the gateway, interface, or metrics of an existing route.
RTM_LOCK: specify which metrics the kernel should not modify.
Additionally, the process can receive any of the other seven types of routing messages that are generated by the kernel when some event, such as interface down, redirect received, etc., occurs.
This chapter looks at the routing domain, the routing control blocks that are created for each routing socket, the function that handles messages from a process (route_output), the function that sends routing messages to one or more processes (raw_input), and the various functions that support all the socket operations on a routing socket.
Before describing the routing socket functions, we need to discuss additional details about the routing domain; the SOCK_RAW protocol supported in the routing domain; and routing control blocks, one of which is associated with each routing socket.
Figure 20.1 lists the domain structure for the PF_ROUTE domain, named routedomain.
Table 20.1. routedomain structure.
| Member | | Description |
|---|---|---|
| | | protocol family for domain |
| | | name |
| | | domain initialization, Figure 18.30 |
| | | not used in routing domain |
| | | not used in routing domain |
| | | protocol switch structure, Figure 20.2 |
| | | pointer past end of protocol switch structure |
| | | filled in by |
| | | not used in routing domain |
| | | not used in routing domain |
| | | not used in routing domain |
Unlike the Internet domain, which supports multiple protocols (TCP, UDP, ICMP, etc.), only one protocol (of type SOCK_RAW) is supported in the routing domain. Figure 20.2 lists the protocol switch entry for the PF_ROUTE domain.
Table 20.2. The routing protocol protosw structure.
| Member | | Description |
|---|---|---|
| | | raw socket |
| | | part of the routing domain |
| | | |
| | | socket layer flags, not used by protocol processing |
| | | this entry not used; |
| | | called for |
| | | control input function |
| | | not used |
| | | respond to communication requests from a process |
| | | initialization |
| | | not used |
| | | not used |
| | | not used |
| | | for |
Each time a routing socket is created with a call of the form

    socket(PF_ROUTE, SOCK_RAW, protocol);

the corresponding PRU_ATTACH request to the protocol’s user-request function (route_usrreq) allocates a routing control block and links it to the socket structure. The protocol can restrict the messages sent to the process on this socket to one particular family. If a protocol of AF_INET is specified, for example, only routing messages containing Internet addresses will be sent to the process. A protocol of 0 causes all routing messages from the kernel to be sent on the socket.
Recall that we call these structures routing control blocks, not raw control blocks, to avoid confusion with the raw IP control blocks in Chapter 32.
Figure 20.3 shows the definition of the rawcb structure.
Table 20.3. rawcb structure.

    ----------------------------------------------------------------------- raw_cb.h
    39 struct rawcb {
    40     struct rawcb *rcb_next;     /* doubly linked list */
    41     struct rawcb *rcb_prev;
    42     struct socket *rcb_socket;  /* back pointer to socket */
    43     struct sockaddr *rcb_faddr; /* destination address */
    44     struct sockaddr *rcb_laddr; /* socket's address */
    45     struct sockproto rcb_proto; /* protocol family, protocol */
    46 };
    47 #define sotorawcb(so)       ((struct rawcb *)(so)->so_pcb)
    ----------------------------------------------------------------------- raw_cb.h

Additionally, a global of the same name, rawcb, is allocated as the head of the doubly linked list. Figure 20.4 shows the arrangement.
39-47
We showed the sockproto structure in Figure 19.26. Its sp_family member is set to PF_ROUTE and its sp_protocol member is set to the third argument to the socket system call. The rcb_faddr member is permanently set to point to route_src, which we described with Figure 19.26. rcb_laddr is always a null pointer.
The raw_init function, shown in Figure 20.5, is the protocol initialization function in the protosw structure in Figure 20.2. We described the entire initialization of the routing domain with Figure 18.29.
Table 20.5. raw_init function: initialize doubly linked list of routing control blocks.

    -------------------------------------------------------------------- raw_usrreq.c
    38 void
    39 raw_init()
    40 {
    41     rawcb.rcb_next = rawcb.rcb_prev = &rawcb;
    42 }
    -------------------------------------------------------------------- raw_usrreq.c

38-42
The function initializes the doubly linked list of routing control blocks by setting the next and previous pointers of the head structure to point to itself.
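The value of a self-referencing head is that insertion, removal, and traversal need no special case for an empty list. The following is a minimal user-space sketch of that idea, not the kernel code: the cb_model structure and the cb_list_* function names are assumptions made for illustration, and only the two link members of rawcb are modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct rawcb: list links only. */
struct cb_model {
    struct cb_model *next;
    struct cb_model *prev;
};

/* Same idea as raw_init: an empty list is a head that points to itself. */
void cb_list_init(struct cb_model *head)
{
    assert(head != NULL);
    head->next = head->prev = head;
}

/* Link a new control block in right after the head. */
void cb_list_insert(struct cb_model *head, struct cb_model *cb)
{
    assert(head != NULL && cb != NULL);
    cb->next = head->next;
    cb->prev = head;
    head->next->prev = cb;
    head->next = cb;
}

/* Unlink a control block; no empty-list special case is needed. */
void cb_list_remove(struct cb_model *cb)
{
    assert(cb != NULL);
    cb->prev->next = cb->next;
    cb->next->prev = cb->prev;
    cb->next = cb->prev = cb;
}

/* Walk the list the way raw_input does: stop on returning to the head. */
int cb_list_count(const struct cb_model *head)
{
    int n = 0;
    const struct cb_model *p;

    assert(head != NULL);
    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

Because the head is itself a list element, removing the only control block leaves the head pointing to itself again, which is exactly the state that the initialization function sets up.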
As we showed in Figure 18.11, route_output is called when the PRU_SEND request is issued to the protocol’s user-request function, which is the result of a write operation by a process to a routing socket. In Figure 18.9 we indicated that five different types of routing messages are accepted by the kernel from a process.
Since this function is invoked as a result of a write by a process, the data from the process (the routing message to be processed) is in an mbuf chain from sosend. Figure 20.6 shows an overview of the processing steps, assuming the process sends an RTM_ADD command, specifying three addresses: the destination, its gateway, and a network mask (hence this is a network route, not a host route).
There are numerous points to note in this figure, most of which we’ll cover as we proceed through the source code for route_output. Also note that, to save space, we omit the RTAX_ prefix for each array index in the rt_addrinfo structure.
The process specifies which socket address structures follow the fixed-length rt_msghdr structure by setting the bitmask rtm_addrs. We show a bitmask of 0x07, which corresponds to a destination address, a gateway address, and a network mask (Figure 19.19). The RTM_ADD command requires the first two; the third is optional. Another optional address, the genmask, specifies the mask to be used for generating cloned routes.
The write system call (the sosend function) copies the buffer from the process into an mbuf chain in the kernel.
m_copydata copies the mbuf chain into a buffer that route_output obtains using malloc. It is easier to access all the information in the structure and the socket address structures that follow when stored in a single contiguous buffer than it is when stored in an mbuf chain.
The function rt_xaddrs is called by route_output to take the bitmask and build the rt_addrinfo structure that points into the buffer. The code in route_output references these structures using the names shown in the fifth column in Figure 19.19. The bitmask is also copied into the rti_addrs member.
route_output normally modifies the rt_msghdr structure. If an error occurs, the corresponding errno value is returned in rtm_errno (for example, EEXIST if the route already exists); otherwise the flag RTF_DONE is logically ORed into the rtm_flags supplied by the process.
The rt_msghdr structure and the addresses that follow become input to 0 or more processes that are reading from a routing socket. The buffer is first converted back into an mbuf chain by m_copyback. raw_input goes through all the routing PCBs and passes a copy to the appropriate processes. We also show that a process with a routing socket receives a copy of each message it writes to that socket unless it disables the SO_USELOOPBACK socket option.
To avoid receiving a copy of their own routing messages, some programs, such as route, call shutdown with a second argument of 0 to prevent any data from being received on the routing socket.
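The relationship between the bitmask and the addresses that follow the fixed-length header can be modeled in user space. The following is a minimal sketch, not kernel or library code: the addr_builder structure, the add_addr and pad_to_long names, and the 256-byte buffer are assumptions made for illustration. Each address carries its own length in its first byte (as sa_len does), is padded to a multiple of sizeof(long), and sets the bit for its slot.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_MAX 256

/* Hypothetical builder for the variable-length part of a routing message. */
struct addr_builder {
    unsigned char buf[BUF_MAX];  /* addresses, each padded to a long boundary */
    size_t        len;           /* bytes used so far */
    unsigned      addrs;         /* bitmask of the slots present, like rtm_addrs */
    int           next_slot;     /* lowest slot index still allowed */
};

/* Round n up to a multiple of sizeof(long); a length of 0 still uses one long. */
static size_t pad_to_long(size_t n)
{
    return n > 0 ? 1 + ((n - 1) | (sizeof(long) - 1)) : sizeof(long);
}

/*
 * Append one address for the given slot (0 = destination, 1 = gateway,
 * 2 = netmask). addr[0] must hold the address length, as sa_len does.
 * Slots must be added in increasing order because a reader walks the
 * bitmask from bit 0 upward. Returns 0 on success, -1 on error.
 */
int add_addr(struct addr_builder *b, int slot, const unsigned char *addr)
{
    size_t padded;

    assert(b != NULL && addr != NULL);
    padded = pad_to_long(addr[0]);
    if (slot < b->next_slot || slot > 7 || b->len + padded > BUF_MAX)
        return -1;
    memset(b->buf + b->len, 0, padded);      /* zero the padding bytes */
    memcpy(b->buf + b->len, addr, addr[0]);
    b->len += padded;
    b->addrs |= 1u << slot;
    b->next_slot = slot + 1;
    return 0;
}
```

Adding a destination, a gateway, and a netmask in that order leaves the bitmask at 0x07, the value used in the example above. An address offered out of slot order is rejected, since a reader that walks the bitmask upward would otherwise attach it to the wrong slot.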
We examine the source code for route_output in seven parts. Figure 20.7 shows an overview of the function.
Table 20.7. Summary of route_output processing steps.

    ------------------------------------------------------------------------------
    int
    route_output()
    {
        R_Malloc() to allocate buffer;
        m_copydata() to copy from mbuf chain into buffer;
        rt_xaddrs() to build rt_addrinfo{};

        switch (message type) {
        case RTM_ADD:
            rtrequest(RTM_ADD);
            rt_setmetrics();
            break;

        case RTM_DELETE:
            rtrequest(RTM_DELETE);
            break;

        case RTM_GET:
        case RTM_CHANGE:
        case RTM_LOCK:
            rtalloc1();
            switch (message type) {
            case RTM_GET:
                rt_msg2(RTM_GET);
                break;
            case RTM_CHANGE:
                change appropriate fields;
                /* fall through */
            case RTM_LOCK:
                set rmx_locks;
                break;
            }
            break;
        }
        set rtm_errno if error, else set RTF_DONE flag;
        m_copyback() to copy from buffer into mbuf chain;
        raw_input();    /* mbuf chain to appropriate processes */
    }
    ------------------------------------------------------------------------------

The first part of route_output is shown in Figure 20.8.
Table 20.8. route_output function: initial processing, copy message from mbuf chain.

    --------------------------------------------------------------------------- rtsock.c
    113 int
    114 route_output(m, so)
    115 struct mbuf *m;
    116 struct socket *so;
    117 {
    118     struct rt_msghdr *rtm = 0;
    119     struct rtentry *rt = 0;
    120     struct rtentry *saved_nrt = 0;
    121     struct rt_addrinfo info;
    122     int     len, error = 0;
    123     struct ifnet *ifp = 0;
    124     struct ifaddr *ifa = 0;
    125 #define senderr(e) { error = e; goto flush;}
    126     if (m == 0 || ((m->m_len < sizeof(long)) &&
    127                    (m = m_pullup(m, sizeof(long))) == 0))
    128         return (ENOBUFS);
    129     if ((m->m_flags & M_PKTHDR) == 0)
    130         panic("route_output");
    131     len = m->m_pkthdr.len;
    132     if (len < sizeof(*rtm) ||
    133         len != mtod(m, struct rt_msghdr *)->rtm_msglen) {
    134         dst = 0;
    135         senderr(EINVAL);
    136     }
    137     R_Malloc(rtm, struct rt_msghdr *, len);
    138     if (rtm == 0) {
    139         dst = 0;
    140         senderr(ENOBUFS);
    141     }
    142     m_copydata(m, 0, len, (caddr_t) rtm);
    143     if (rtm->rtm_version != RTM_VERSION) {
    144         dst = 0;
    145         senderr(EPROTONOSUPPORT);
    146     }
    147     rtm->rtm_pid = curproc->p_pid;
    148     info.rti_addrs = rtm->rtm_addrs;
    149     rt_xaddrs((caddr_t) (rtm + 1), len + (caddr_t) rtm, &info);
    150     if (dst == 0)
    151         senderr(EINVAL);
    152     if (genmask) {
    153         struct radix_node *t;
    154         t = rn_addmask((caddr_t) genmask, 1, 2);
    155         if (t && Bcmp(genmask, t->rn_key, *(u_char *) genmask) == 0)
    156             genmask = (struct sockaddr *) (t->rn_key);
    157         else
    158             senderr(ENOBUFS);
    159     }
    --------------------------------------------------------------------------- rtsock.c

113-136
The mbuf chain is checked for validity: its length must be at least the size of an rt_msghdr structure and must equal the rtm_msglen value, which is fetched from the first longword of the data portion of the mbuf.
137-142
A buffer is allocated to hold the entire message and m_copydata copies the message from the mbuf chain into the buffer.
143-146
The version of the message is checked. In the future, should a new version of the routing messages be introduced, this member could be used to provide support for older versions.
147-149
The process ID is copied into rtm_pid and the bitmask supplied by the process is copied into info.rti_addrs, a member of a structure local to this function. The function rt_xaddrs (shown in the next section) fills in the eight socket address pointers in the info structure to point into the buffer now containing the message.
150-151
A destination address is a required address for all commands. If the info.rti_info[RTAX_DST] element is a null pointer, EINVAL is returned. Remember that dst refers to this array element (Figure 19.19).
152-159
A genmask is optional and is used as the network mask for routes created when the RTF_CLONING flag is set (Figure 19.8). rn_addmask adds the mask to the tree of masks, first searching for an existing entry for the mask and then referencing that entry if found. If the mask is found or added to the mask tree, an additional check is made that the entry in the mask tree really equals the genmask value, and, if so, the genmask pointer is replaced with a pointer to the mask in the mask tree.
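The length and version checks at the start of route_output follow a common pattern for validating a self-describing message before trusting any of its other members. The following is a minimal user-space sketch of that pattern, not the kernel code: the msghdr_model structure, the validate_msg name, and the HDR_VERSION_MODEL value are assumptions made for illustration, and the header is cut down to three members.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define HDR_VERSION_MODEL 3   /* hypothetical stand-in for RTM_VERSION */

/* Hypothetical, cut-down header; the real rt_msghdr has more members. */
struct msghdr_model {
    unsigned short msglen;    /* length the sender claims for the whole message */
    unsigned char  version;
    unsigned char  type;
};

/*
 * Return 0 if the buffer passes three checks: it is long enough to hold a
 * header, its claimed length equals its actual length, and its version is
 * the one this code understands. Otherwise return an errno value.
 */
int validate_msg(const void *buf, size_t len)
{
    struct msghdr_model hdr;

    assert(buf != NULL);
    if (len < sizeof(hdr))
        return EINVAL;
    memcpy(&hdr, buf, sizeof(hdr));   /* copy out to avoid alignment assumptions */
    if (len != hdr.msglen)
        return EINVAL;
    if (hdr.version != HDR_VERSION_MODEL)
        return EPROTONOSUPPORT;
    return 0;
}
```

Checking the length before reading the version matters: until the buffer is known to hold a complete header, reading any member other than the first risks looking past the end of the data.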
Figure 20.9 shows the next part of route_output, which handles the RTM_ADD and RTM_DELETE commands.
Table 20.9. route_output function: process RTM_ADD and RTM_DELETE commands.

    -------------------------------------------------------------------------- rtsock.c
    160     switch (rtm->rtm_type) {
    161     case RTM_ADD:
    162         if (gate == 0)
    163             senderr(EINVAL);
    164         error = rtrequest(RTM_ADD, dst, gate, netmask,
    165                           rtm->rtm_flags, &saved_nrt);
    166         if (error == 0 && saved_nrt) {
    167             rt_setmetrics(rtm->rtm_inits,
    168                           &rtm->rtm_rmx, &saved_nrt->rt_rmx);
    169             saved_nrt->rt_refcnt--;
    170             saved_nrt->rt_genmask = genmask;
    171         }
    172         break;
    173     case RTM_DELETE:
    174         error = rtrequest(RTM_DELETE, dst, gate, netmask,
    175                           rtm->rtm_flags, (struct rtentry **) 0);
    176         break;
    -------------------------------------------------------------------------- rtsock.c

162-163
An RTM_ADD command requires the process to specify a gateway.
164-165
rtrequest processes the request. The netmask pointer can be null if the route being entered is a host route. If all is OK, the pointer to the new routing table entry is returned through saved_nrt.
166-172
The rt_metrics structure is copied from the caller’s buffer into the routing table entry. The reference count is decremented and the genmask pointer is stored (possibly a null pointer).
173-176
Processing the RTM_DELETE command is simple because all the work is done by rtrequest. Since the final argument is a null pointer, rtrequest calls rtfree if the reference count is 0, deleting the entry from the routing table (Figure 19.7).
The next part of the processing is shown in Figure 20.10, which handles the common code for the RTM_GET, RTM_CHANGE, and RTM_LOCK commands.
Table 20.10. route_output function: common processing for RTM_GET, RTM_CHANGE, and RTM_LOCK.

    -------------------------------------------------------------------------- rtsock.c
    177     case RTM_GET:
    178     case RTM_CHANGE:
    179     case RTM_LOCK:
    180         rt = rtalloc1(dst, 0);
    181         if (rt == 0)
    182             senderr(ESRCH);
    183         if (rtm->rtm_type != RTM_GET) {     /* XXX: too grotty */
    184             struct radix_node *rn;
    185             extern struct radix_node_head *mask_rnhead;
    186             if (Bcmp(dst, rt_key(rt), dst->sa_len) != 0)
    187                 senderr(ESRCH);
    188             if (netmask && (rn = rn_search(netmask,
    189                                            mask_rnhead->rnh_treetop)))
    190                 netmask = (struct sockaddr *) rn->rn_key;
    191             for (rn = rt->rt_nodes; rn; rn = rn->rn_dupedkey)
    192                 if (netmask == (struct sockaddr *) rn->rn_mask)
    193                     break;
    194             if (rn == 0)
    195                 senderr(ETOOMANYREFS);
    196             rt = (struct rtentry *) rn;
    197         }
    -------------------------------------------------------------------------- rtsock.c

177-182
Since all three commands reference an existing entry, rtalloc1 locates the entry. If the entry isn’t found, ESRCH is returned.
183-187
For the RTM_CHANGE and RTM_LOCK commands, a network match is inadequate: an exact match with the routing table key is required. Therefore, if the dst argument doesn’t equal the routing table key, the match was a network match and ESRCH is returned.
188-193
Even with an exact match, if there are duplicate keys, each with a different network mask, the correct entry must still be located. If a netmask argument was supplied, it is looked up in the mask table (mask_rnhead). If found, the netmask pointer is replaced with the pointer to the mask in the mask tree. Each leaf node in the duplicate key list is examined, looking for an entry with an rn_mask pointer that equals netmask. This test compares the pointers, not the structures that they point to. This works because all masks appear in the mask tree, and only one copy of each unique mask is stored in this tree. In the common case, keys are not duplicated, so the for loop iterates once. If a host entry is being modified, a mask must not be specified and then both netmask and rn_mask are null pointers (which are equal). But if an entry that has an associated mask is being modified, that mask must be specified as the netmask argument.
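Comparing pointers instead of contents is safe only when every distinct value is stored exactly once, a technique often called interning. The following is a minimal user-space sketch of that idea, not the kernel code: a flat table stands in for the radix tree of masks, and the mask_table structure, the mask_lookup and mask_intern names, and the fixed 8-byte mask length are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_MASKS 16
#define MASK_LEN  8

/* Hypothetical flat table standing in for the kernel's tree of masks. */
struct mask_table {
    unsigned char masks[MAX_MASKS][MASK_LEN];
    int           count;
};

/* Return the table's single copy of this mask, or NULL if it is not present. */
const unsigned char *mask_lookup(const struct mask_table *t,
                                 const unsigned char *mask)
{
    int i;

    assert(t != NULL && mask != NULL);
    for (i = 0; i < t->count; i++)
        if (memcmp(t->masks[i], mask, MASK_LEN) == 0)
            return t->masks[i];
    return NULL;
}

/* Return the canonical copy, adding it first if needed; NULL if the table is full. */
const unsigned char *mask_intern(struct mask_table *t,
                                 const unsigned char *mask)
{
    const unsigned char *found = mask_lookup(t, mask);

    if (found != NULL)
        return found;
    if (t->count == MAX_MASKS)
        return NULL;
    memcpy(t->masks[t->count], mask, MASK_LEN);
    return t->masks[t->count++];
}
```

Once a caller has replaced its own mask pointer with the canonical one, two masks are equal exactly when their pointers are equal, so the search through the duplicate keys needs only a pointer comparison per entry.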
194-195
If the for loop terminates without finding a matching network mask, ETOOMANYREFS is returned.
The comment XXX is because this function must go to all this work to find the desired entry. All these details should be hidden in another function similar to rtalloc1 that detects a network match and handles a mask argument.
The next part of this function, shown in Figure 20.11, continues processing the RTM_GET command. This command is unique among the commands supported by route_output in that it can return more data than it was passed. For example, only a single socket address structure is required as input, the destination, but at least two are returned: the destination and its gateway. With regard to Figure 20.6, this means the buffer allocated for m_copydata to copy into might need to be increased in size.
Table 20.11. route_output function: RTM_GET processing.

    -------------------------------------------------------------------------------- rtsock.c
    198         switch (rtm->rtm_type) {
    199         case RTM_GET:
    200             dst = rt_key(rt);
    201             gate = rt->rt_gateway;
    202             netmask = rt_mask(rt);
    203             genmask = rt->rt_genmask;
    204             if (rtm->rtm_addrs & (RTA_IFP | RTA_IFA)) {
    205                 if (ifp = rt->rt_ifp) {
    206                     ifpaddr = ifp->if_addrlist->ifa_addr;
    207                     ifaaddr = rt->rt_ifa->ifa_addr;
    208                     rtm->rtm_index = ifp->if_index;
    209                 } else {
    210                     ifpaddr = 0;
    211                     ifaaddr = 0;
    212                 }
    213             }
    214             len = rt_msg2(RTM_GET, &info, (caddr_t) 0,
    215                           (struct walkarg *) 0);
    216             if (len > rtm->rtm_msglen) {
    217                 struct rt_msghdr *new_rtm;
    218                 R_Malloc(new_rtm, struct rt_msghdr *, len);
    219                 if (new_rtm == 0)
    220                     senderr(ENOBUFS);
    221                 Bcopy(rtm, new_rtm, rtm->rtm_msglen);
    222                 Free(rtm);
    223                 rtm = new_rtm;
    224             }
    225             (void) rt_msg2(RTM_GET, &info, (caddr_t) rtm,
    226                            (struct walkarg *) 0);
    227             rtm->rtm_flags = rt->rt_flags;
    228             rtm->rtm_rmx = rt->rt_rmx;
    229             rtm->rtm_addrs = info.rti_addrs;
    230             break;
    -------------------------------------------------------------------------------- rtsock.c

198-203
Four pointers are stored in the rti_info array: dst, gate, netmask, and genmask. The latter two might be null pointers. These pointers in the info structure point to the socket address structures that will be returned to the process.
204-213
The process can set the bits RTA_IFP and RTA_IFA in the rtm_addrs bitmask. If either or both are set, the process wants to receive the contents of both the ifaddr structures pointed to by this routing table entry: the link-level address of the interface (pointed to by rt_ifp->if_addrlist) and the protocol address for this entry (pointed to by rt_ifa->ifa_addr). The interface index is also returned.
214-224
rt_msg2 is called with a null third pointer to calculate the length of the routing message corresponding to RTM_GET and the addresses pointed to by the info structure. If the length of the result message exceeds the length of the input message, then a new buffer is allocated, the input message is copied into the new buffer, the old buffer is released, and rtm is set to point to the new buffer.
225-230
rt_msg2 is called again, this time with a nonnull third pointer, which builds the result message in the buffer. Three members of the rt_msghdr structure (rtm_flags, rtm_rmx, and rtm_addrs) are then filled in.
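Calling the same builder twice, once to measure and once to write, is a general way to size a variable-length reply without guessing. The following is a minimal user-space sketch of that two-pass pattern, not the kernel code: the build_reply and fill_reply names are assumptions made for illustration, and NUL-terminated strings stand in for socket address structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical builder in the style of rt_msg2: it always returns the number
 * of bytes the reply needs, and writes the reply only when buf is non-null.
 */
size_t build_reply(const char *const *parts, int nparts, char *buf)
{
    size_t len = 0;
    int i;

    for (i = 0; i < nparts; i++) {
        size_t n = strlen(parts[i]) + 1;   /* keep each terminating NUL */
        if (buf != NULL)
            memcpy(buf + len, parts[i], n);
        len += n;
    }
    return len;
}

/*
 * Pass 1 sizes the reply, the buffer grows only if the reply will not fit,
 * and pass 2 fills it in. Returns the (possibly new) buffer, or NULL on
 * allocation failure, in which case the old buffer is left untouched.
 */
char *fill_reply(char *msg, size_t msglen, const char *const *parts,
                 int nparts, size_t *outlen)
{
    size_t need;

    assert(msg != NULL && outlen != NULL);
    need = build_reply(parts, nparts, NULL);
    if (need > msglen) {
        char *bigger = malloc(need);
        if (bigger == NULL)
            return NULL;
        memcpy(bigger, msg, msglen);   /* preserve the request first */
        free(msg);
        msg = bigger;
    }
    *outlen = build_reply(parts, nparts, msg);
    return msg;
}
```

The design keeps the sizing logic and the writing logic in one function, so the two passes cannot disagree about how long the reply is.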
Figure 20.12 shows the processing of the RTM_CHANGE and RTM_LOCK commands.
Table 20.12. route_output function: RTM_CHANGE and RTM_LOCK processing.

    -------------------------------------------------------------------------- rtsock.c
    231         case RTM_CHANGE:
    232             if (gate && rt_setgate(rt, rt_key(rt), gate))
    233                 senderr(EDQUOT);
    234             /* new gateway could require new ifaddr, ifp; flags may also be
    235                different; ifp may be specified by ll sockaddr when protocol
    236                address is ambiguous */
    237             if (ifpaddr && (ifa = ifa_ifwithnet(ifpaddr)) &&
    238                 (ifp = ifa->ifa_ifp))
    239                 ifa = ifaof_ifpforaddr(ifaaddr ? ifaaddr : gate,
    240                                        ifp);
    241             else if ((ifaaddr && (ifa = ifa_ifwithaddr(ifaaddr))) ||
    242                      (ifa = ifa_ifwithroute(rt->rt_flags,
    243                                             rt_key(rt), gate)))
    244                 ifp = ifa->ifa_ifp;
    245             if (ifa) {
    246                 struct ifaddr *oifa = rt->rt_ifa;
    247                 if (oifa != ifa) {
    248                     if (oifa && oifa->ifa_rtrequest)
    249                         oifa->ifa_rtrequest(RTM_DELETE,
    250                                             rt, gate);
    251                     IFAFREE(rt->rt_ifa);
    252                     rt->rt_ifa = ifa;
    253                     ifa->ifa_refcnt++;
    254                     rt->rt_ifp = ifp;
    255                 }
    256             }
    257             rt_setmetrics(rtm->rtm_inits, &rtm->rtm_rmx,
    258                           &rt->rt_rmx);
    259             if (rt->rt_ifa && rt->rt_ifa->ifa_rtrequest)
    260                 rt->rt_ifa->ifa_rtrequest(RTM_ADD, rt, gate);
    261             if (genmask)
    262                 rt->rt_genmask = genmask;
    263             /*
    264              * Fall into
    265              */
    266         case RTM_LOCK:
    267             rt->rt_rmx.rmx_locks &= ~(rtm->rtm_inits);
    268             rt->rt_rmx.rmx_locks |=
    269                 (rtm->rtm_inits & rtm->rtm_rmx.rmx_locks);
    270             break;
    271         }
    272         break;
    273     default:
    274         senderr(EOPNOTSUPP);
    275     }
    -------------------------------------------------------------------------- rtsock.c

231-233
If a gate address was passed by the process, rt_setgate is called to change the gateway for the entry.
234-244
The new gateway (if changed) can also require new rt_ifp and rt_ifa pointers. The process can specify these new values by passing either an ifpaddr socket address structure or an ifaaddr socket address structure. The former is tried first, and then the latter. If neither is passed by the process, the rt_ifp and rt_ifa pointers are left alone.
245-256
If an interface was located (ifa is nonnull), then the existing rt_ifa pointer for the route is compared to the new value. If it has changed, new values for rt_ifp and rt_ifa are stored in the routing table entry. Before doing this the interface request function (if defined) is called with a command of RTM_DELETE. The delete is required because the link-layer information from one type of network to another can be quite different, say changing a route from an X.25 network to an Ethernet, and the output routines must be notified.
259-260
If an interface request function is defined, it is called with a command of RTM_ADD.
261-262
If the process specifies the genmask argument, the pointer to the mask that was obtained in Figure 20.8 is saved in rt_genmask.
266-270
The RTM_LOCK command updates the bitmask stored in rt_rmx.rmx_locks. Figure 20.13 shows the values of the different bits in this bitmask, one value per metric.
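Table 20.13. Constants to initialize or lock metrics.
| Constant | Value | Description |
|---|---|---|
| RTV_RPIPE | | initialize or lock rmx_recvpipe |
| RTV_SPIPE | | initialize or lock rmx_sendpipe |
| RTV_SSTHRESH | | initialize or lock rmx_ssthresh |
| RTV_RTT | | initialize or lock rmx_rtt |
| RTV_RTTVAR | | initialize or lock rmx_rttvar |
| RTV_HOPCOUNT | | initialize or lock rmx_hopcount |
| RTV_MTU | | initialize or lock rmx_mtu |
| RTV_EXPIRE | | initialize or lock rmx_expire |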
The rmx_locks member of the rt_metrics structure in the routing table entry is the bitmask telling the kernel which metrics to leave alone. That is, those metrics specified by rmx_locks won’t be updated by the kernel. The only use of these metrics by the kernel is with TCP, as noted with Figure 27.3. The rmx_pksent metric cannot be locked or initialized, but it turns out this member is never even referenced or updated by the kernel.
The rtm_inits value in the message from the process specifies the bitmask of which metrics were just initialized by rt_setmetrics. The rtm_rmx.rmx_locks value in the message specifies the bitmask of which metrics should now be locked. The value of rt_rmx.rmx_locks is the bitmask in the routing table of which metrics are currently locked. First, any bits to be initialized (rtm_inits) are unlocked. Any bits that are both initialized (rtm_inits) and locked (rtm_rmx.rmx_locks) are locked.
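The two-step update is easier to follow with concrete bit values. The following is a minimal user-space sketch of the two statements under case RTM_LOCK: the update_locks name and its parameter names are assumptions made for illustration, and plain integers stand in for the structure members.

```c
#include <assert.h>

/*
 * Sketch of the lock update.
 *   cur_locks: metrics currently locked in the routing table entry
 *   inits:     metrics the message selects (rtm_inits)
 *   msg_locks: metrics the message wants locked (rtm_rmx.rmx_locks)
 * Returns the new lock bitmask.
 */
unsigned long update_locks(unsigned long cur_locks, unsigned long inits,
                           unsigned long msg_locks)
{
    cur_locks &= ~inits;               /* unlock every selected metric */
    cur_locks |= (inits & msg_locks);  /* relock those also marked locked */

    /* No selected metric stays locked unless the message asked for it. */
    assert((cur_locks & inits & ~msg_locks) == 0);
    return cur_locks;
}
```

With current locks of 0x03, selected metrics of 0x05, and requested locks of 0x04, the first statement leaves 0x02 and the second yields 0x06. Bits outside the selected set are never touched, so a lock bit in the message has no effect unless the same bit is also set in the selection bitmask.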
273-275
This default is for the switch at the beginning of Figure 20.9 and catches any of the routing commands other than the five that are supported in messages from a process.
The final part of route_output, shown in Figure 20.14, sends the reply to raw_input.
Table 20.14. route_output function: pass results to raw_input.

    ----------------------------------------------------------------------------- rtsock.c
    276   flush:
    277     if (rtm) {
    278         if (error)
    279             rtm->rtm_errno = error;
    280         else
    281             rtm->rtm_flags |= RTF_DONE;
    282     }
    283     if (rt)
    284         rtfree(rt);
    285     {
    286         struct rawcb *rp = 0;
    287         /*
    288          * Check to see if we don't want our own messages.
    289          */
    290         if ((so->so_options & SO_USELOOPBACK) == 0) {
    291             if (route_cb.any_count <= 1) {
    292                 if (rtm)
    293                     Free(rtm);
    294                 m_freem(m);
    295                 return (error);
    296             }
    297             /* There is another listener, so construct message */
    298             rp = sotorawcb(so);
    299         }
    300         if (rtm) {
    301             m_copyback(m, 0, rtm->rtm_msglen, (caddr_t) rtm);
    302             Free(rtm);
    303         }
    304         if (rp)
    305             rp->rcb_proto.sp_family = 0;    /* Avoid us */
    306         if (dst)
    307             route_proto.sp_protocol = dst->sa_family;
    308         raw_input(m, &route_proto, &route_src, &route_dst);
    309         if (rp)
    310             rp->rcb_proto.sp_family = PF_ROUTE;
    311     }
    312     return (error);
    313 }
    ----------------------------------------------------------------------------- rtsock.c

276-282
flush is the label jumped to by the senderr macro defined at the beginning of the function. If an error occurred it is returned in the rtm_errno member; otherwise the RTF_DONE flag is set.
283-284
If a route is being held, it is released. The call to rtalloc1 at the beginning of Figure 20.10 holds the route, if found.
285-296
The SO_USELOOPBACK socket option is true by default and specifies that the sending process is to receive a copy of each routing message that it writes to a routing socket. (If the sender doesn’t receive a copy, it can’t receive any of the information returned by RTM_GET.) If that option is not set, and the total count of routing sockets is less than or equal to 1, there are no other processes to receive the message and the sender doesn’t want a copy. The buffer and mbuf chain are both released and the function returns.
297-299
There is at least one other listener but the sending process does not want a copy. The pointer rp, which defaults to null, is set to point to the routing control block for the sender and is also used as a flag that the sender doesn’t want a copy.
300-303
The buffer is converted back into an mbuf chain (Figure 20.6) and the buffer released.
304-305
If rp is set, some other process might want the message but the sender does not want a copy. The sp_family member of the sender’s routing control block is temporarily set to 0, but the sp_family of the message (the route_proto structure, shown with Figure 19.26) has a family of PF_ROUTE. This trick prevents raw_input from passing a copy of the result to the sending process because raw_input does not pass a copy to any socket with an sp_family of 0.
306-308
If dst is a nonnull pointer, the address family of that socket address structure becomes the protocol of the routing message. With the Internet protocols this value would be PF_INET. A copy is passed to the appropriate listeners by raw_input.
309-313
If the sp_family member in the calling process was temporarily set to 0, it is reset to PF_ROUTE, its normal value.
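The delivery decision in this final part of the function reduces to three outcomes that depend on two inputs. The following is a minimal user-space sketch of that decision, not the kernel code: the reply_action enumeration and the choose_reply_action name are assumptions made for illustration.

```c
#include <assert.h>

enum reply_action {
    REPLY_DROP,        /* nobody wants it: free everything and return */
    REPLY_TO_ALL,      /* deliver normally, sender included */
    REPLY_TO_OTHERS    /* hide the sender's control block while delivering */
};

/*
 * loopback:  nonzero if the sender left SO_USELOOPBACK enabled
 * listeners: total number of routing sockets, sender included
 */
enum reply_action choose_reply_action(int loopback, int listeners)
{
    assert(listeners >= 1);   /* the sender itself holds one routing socket */
    if (loopback)
        return REPLY_TO_ALL;
    if (listeners <= 1)
        return REPLY_DROP;
    return REPLY_TO_OTHERS;
}
```

Only the third outcome needs the temporary change to the sender's sp_family; in the other two cases the control block list is either used unchanged or never consulted.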
The rt_xaddrs function is called only once from route_output (Figure 20.8) after the routing message from the process has been copied from the mbuf chain into a buffer and after the bitmask from the process (rtm_addrs) has been copied into the rti_addrs member of an rt_addrinfo structure. The purpose of rt_xaddrs is to take this bitmask and set the pointers in the rti_info array to point to the corresponding address in the buffer. Figure 20.15 shows the function.
Table 20.15. rt_xaddrs function: fill rti_info array with pointers.

    -------------------------------------------------------------------------- rtsock.c
    330 #define ROUNDUP(a) \
    331     ((a) > 0 ? (1 + (((a) - 1) | (sizeof(long) - 1))) : sizeof(long))
    332 #define ADVANCE(x, n) (x += ROUNDUP((n)->sa_len))
    333 static void
    334 rt_xaddrs(cp, cplim, rtinfo)
    335 caddr_t cp, cplim;
    336 struct rt_addrinfo *rtinfo;
    337 {
    338     struct sockaddr *sa;
    339     int     i;
    340     bzero(rtinfo->rti_info, sizeof(rtinfo->rti_info));
    341     for (i = 0; (i < RTAX_MAX) && (cp < cplim); i++) {
    342         if ((rtinfo->rti_addrs & (1 << i)) == 0)
    343             continue;
    344         rtinfo->rti_info[i] = sa = (struct sockaddr *) cp;
    345         ADVANCE(cp, sa);
    346     }
    347 }
    -------------------------------------------------------------------------- rtsock.c

330-340
The array of pointers is set to 0 so all the pointers to address structures not appearing in the bitmask will be null.
341-347
Each of the 8 (RTAX_MAX) possible bits in the bitmask is tested and, if set, a pointer is stored in the rti_info array to the corresponding socket address structure. The ADVANCE macro takes the sa_len field of the socket address structure, rounds it up to a multiple of 4 bytes (sizeof(long)), and increments the pointer cp accordingly.
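The pointer walk can be tried in user space with plain byte buffers. The following is a minimal sketch, not the kernel code: the split_addrs name, the ADDR_MAX and ROUNDUP_MODEL macros, and the use of unsigned char pointers in place of socket address structures are assumptions made for illustration. Like the original, it trusts the length byte at the start of each address.

```c
#include <assert.h>
#include <stddef.h>

#define ADDR_MAX 8   /* hypothetical stand-in for RTAX_MAX */

/* Round a length up to a multiple of sizeof(long); 0 still consumes one long. */
#define ROUNDUP_MODEL(a) \
    ((a) > 0 ? (1 + (((a) - 1) | (sizeof(long) - 1))) : sizeof(long))

/*
 * Walk the bytes in [cp, cplim). For each bit set in addrs, record where the
 * corresponding address starts and step past it. The first byte of each
 * address is its own length, as with sa_len. Slots whose bits are clear are
 * left null. Returns the number of pointers set.
 */
int split_addrs(const unsigned char *cp, const unsigned char *cplim,
                unsigned addrs, const unsigned char *out[ADDR_MAX])
{
    int i, found = 0;

    assert(cp != NULL && cplim != NULL && out != NULL);
    for (i = 0; i < ADDR_MAX; i++)
        out[i] = NULL;
    for (i = 0; i < ADDR_MAX && cp < cplim; i++) {
        if ((addrs & (1u << i)) == 0)
            continue;
        out[i] = cp;
        cp += ROUNDUP_MODEL((size_t)cp[0]);
        found++;
    }
    return found;
}
```

Because absent addresses occupy no space in the buffer, a bitmask of 0x05 would place the first address in slot 0 and the second in slot 2, leaving slot 1 null.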
The rt_setmetrics function was called twice from route_output: when a new route was added and when an existing route was changed. The rtm_inits member in the routing message from the process specifies which of the metrics the process wants to initialize from the rtm_rmx array. The bit values in the bitmask are shown in Figure 20.13.
Notice that both rtm_addrs and rtm_inits are bitmasks in the message from the process, the former specifying the socket address structures that follow, and the latter specifying which metrics are to be initialized. Socket address structures whose bits don’t appear in rtm_addrs don’t even appear in the routing message, to save space. But the entire rt_metrics array always appears in the fixed-length rt_msghdr structure; elements in the array whose bits are not set in rtm_inits are ignored.
Figure 20.16 shows the rt_setmetrics function.
Table 20.16. rt_setmetrics function: set elements of the rt_metrics structure.

    --------------------------------------------------------------------- rtsock.c
    314 void
    315 rt_setmetrics(which, in, out)
    316 u_long  which;
    317 struct rt_metrics *in, *out;
    318 {
    319 #define metric(f, e) if (which & (f)) out->e = in->e;
    320     metric(RTV_RPIPE, rmx_recvpipe);
    321     metric(RTV_SPIPE, rmx_sendpipe);
    322     metric(RTV_SSTHRESH, rmx_ssthresh);
    323     metric(RTV_RTT, rmx_rtt);
    324     metric(RTV_RTTVAR, rmx_rttvar);
    325     metric(RTV_HOPCOUNT, rmx_hopcount);
    326     metric(RTV_MTU, rmx_mtu);
    327     metric(RTV_EXPIRE, rmx_expire);
    328 #undef metric
    329 }
    --------------------------------------------------------------------- rtsock.c

314-318
The which argument is always the rtm_inits member of the routing message from the process. in points to the rt_metrics structure from the process, and out points to the rt_metrics structure in the routing table entry that is being created or modified.
319-329
Each of the 8 bits in the bitmask is tested and if set, the corresponding metric is copied. Notice that when a new routing table entry is being created with the RTM_ADD command, route_output calls rtrequest, which sets the entire routing table entry to 0 (Figure 19.9). Hence, any metrics not specified by the process in the routing message default to 0.
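The selective copy can be reproduced with a smaller structure. The following is a minimal user-space sketch of the same bitmask-driven copy, not the kernel code: the metrics_model structure, the SEL_* bits, and the set_metrics name are assumptions made for illustration, and only three metrics are modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector bits and a three-metric structure for illustration. */
#define SEL_MTU      0x1u
#define SEL_HOPCOUNT 0x2u
#define SEL_RTT      0x4u

struct metrics_model {
    unsigned long mtu;
    unsigned long hopcount;
    unsigned long rtt;
};

/* Copy only the members whose selector bit is set; leave the rest of *out alone. */
void set_metrics(unsigned which, const struct metrics_model *in,
                 struct metrics_model *out)
{
    assert(in != NULL && out != NULL);
#define COPY_IF(bit, member) \
    do { if (which & (bit)) out->member = in->member; } while (0)
    COPY_IF(SEL_MTU, mtu);
    COPY_IF(SEL_HOPCOUNT, hopcount);
    COPY_IF(SEL_RTT, rtt);
#undef COPY_IF
}
```

The sketch wraps the macro body in do { ... } while (0), a common precaution that keeps an if-based macro safe inside an enclosing if/else. The original macro omits it, which is harmless there because each use is a standalone statement.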
All routing messages destined for a process, both those that originate from within the kernel and those that originate from a process, are given to raw_input, which selects the processes to receive the message. Figure 18.11 summarizes the four functions that call raw_input.
When a routing socket is created, the family is always PF_ROUTE and the protocol, the third argument to socket, can be 0, which means the process wants to receive all routing messages, or a value such as AF_INET, which restricts the socket to messages containing addresses of that specific protocol family. A routing control block is created for each routing socket (Section 20.3) and these two values are stored in the sp_family and sp_protocol members of the rcb_proto structure.
Figure 20.17 shows the raw_input function.
Table 20.17. raw_input function: pass routing messages to 0 or more processes.

    ----------------------------------------------------------------------- raw_usrreq.c
    51 void
    52 raw_input(m0, proto, src, dst)
    53 struct mbuf *m0;
    54 struct sockproto *proto;
    55 struct sockaddr *src, *dst;
    56 {
    57     struct rawcb *rp;
    58     struct mbuf *m = m0;
    59     int     sockets = 0;
    60     struct socket *last;
    61     last = 0;
    62     for (rp = rawcb.rcb_next; rp != &rawcb; rp = rp->rcb_next) {
    63         if (rp->rcb_proto.sp_family != proto->sp_family)
    64             continue;
    65         if (rp->rcb_proto.sp_protocol &&
    66             rp->rcb_proto.sp_protocol != proto->sp_protocol)
    67             continue;
    68         /*
    69          * We assume the lower level routines have
    70          * placed the address in a canonical format
    71          * suitable for a structure comparison.
    72          *
    73          * Note that if the lengths are not the same
    74          * the comparison will fail at the first byte.
    75          */
    76 #define equal(a1, a2) \
    77     (bcmp((caddr_t)(a1), (caddr_t)(a2), a1->sa_len) == 0)
    78         if (rp->rcb_laddr && !equal(rp->rcb_laddr, dst))
    79             continue;
    80         if (rp->rcb_faddr && !equal(rp->rcb_faddr, src))
    81             continue;
    82         if (last) {
    83             struct mbuf *n;
    84             if (n = m_copy(m, 0, (int) M_COPYALL)) {
    85                 if (sbappendaddr(&last->so_rcv, src,
    86                                  n, (struct mbuf *) 0) == 0)
    87                     /* should notify about lost packet */
    88                     m_freem(n);
    89                 else {
    90                     sorwakeup(last);
    91                     sockets++;
    92                 }
    93             }
    94         }
    95         last = rp->rcb_socket;
    96     }
    97     if (last) {
    98         if (sbappendaddr(&last->so_rcv, src,
    99                          m, (struct mbuf *) 0) == 0)
    100             m_freem(m);
    101         else {
    102             sorwakeup(last);
    103             sockets++;
    104         }
    105     } else
    106         m_freem(m);
    107 }
    ----------------------------------------------------------------------- raw_usrreq.c

51-61
In all four calls to raw_input
that we’ve seen, the proto, src
, and dst
arguments are pointers to the three globals route_proto, route_src
, and route_dst
, which are declared and initialized as shown with Figure 19.26.
62-67
The for
loop goes through every routing control block checking for a match. The family in the control block (normally PF_ROUTE
) must match the family in the sockproto
structure or the control block is skipped. Next, if the protocol in the control block (the third argument to socket
) is nonzero, it must match the protocol in the sockproto
structure, or the control block is skipped. Hence a process that creates a routing socket with a protocol of 0 receives all routing messages.
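The two filter tests at lines 63-67 can be isolated as a small predicate. The following is a minimal user-space sketch, not kernel code: the structure `sockproto_sketch` and the function `rcb_accepts` are hypothetical names introduced here for illustration, and the numeric family values used when calling it are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's sockproto structure. */
struct sockproto_sketch {
    unsigned short sp_family;   /* protocol family, e.g. PF_ROUTE */
    unsigned short sp_protocol; /* 0 = wildcard, else an address family */
};

/*
 * Return true if a control block whose filter is `rcb` should receive a
 * message tagged `msg`. This mirrors only the family and protocol tests;
 * the local and foreign address tests are omitted.
 */
bool
rcb_accepts(const struct sockproto_sketch *rcb,
            const struct sockproto_sketch *msg)
{
    if (rcb->sp_family != msg->sp_family)
        return false;   /* different domain: skip this control block */
    if (rcb->sp_protocol != 0 && rcb->sp_protocol != msg->sp_protocol)
        return false;   /* specific filter that does not match: skip */
    return true;        /* wildcard protocol, or exact match */
}
```

A control block with `sp_protocol` of 0 passes the second test for every message, which is why such a socket sees all routing messages.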
68-81
These two tests compare the local address in the control block and the foreign address in the control block, if specified. Currently the process is unable to set the rcb_laddr
or rcb_faddr
members of the control block. Normally a process would set the former with bind
and the latter with connect
, but that is not possible with routing sockets in Net/3. Instead, we’ll see that route_usrreq
permanently connects the socket to the route_src
socket address structure, which is OK since that is always the src
argument to this function.
82-107
If last
is nonnull, it points to the most recently seen socket
structure that should receive this message. If this variable is nonnull, a copy of the message is appended to that socket’s receive buffer by m_copy
and sbappendaddr
, and any processes waiting on this receive buffer are awakened. Then last
is set to point to this socket that just matched the previous tests. The use of last
is to avoid calling m_copy
(an expensive operation) if only one process is to receive the message.
If N processes are to receive the message, the first N - 1 receive a copy and the final one receives the message itself.
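The deferred-copy idiom built around last can be tried outside the kernel. The sketch below is an illustration under simplifying assumptions: a heap-allocated string stands in for the mbuf chain, `malloc` plus `strcpy` stands in for m_copy, and the names `receiver` and `deliver` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct receiver {
    int   wants;  /* nonzero if this receiver passed the filter tests */
    char *msg;    /* message delivered to this receiver, or NULL */
};

/*
 * Deliver `msg` (heap-allocated; ownership passes to this function) to
 * every receiver with `wants` set. Returns the number of copies made.
 */
int
deliver(struct receiver *rcv, int n, char *msg)
{
    struct receiver *last = NULL;
    int copies = 0;

    for (int i = 0; i < n; i++) {
        if (!rcv[i].wants)
            continue;
        if (last != NULL) {
            /* Another match follows, so the earlier one gets a copy. */
            char *dup = malloc(strlen(msg) + 1);
            if (dup != NULL) {
                strcpy(dup, msg);
                last->msg = dup;
                copies++;
            }
        }
        last = &rcv[i];
    }
    if (last != NULL)
        last->msg = msg;  /* the final match takes the original */
    else
        free(msg);        /* nobody wanted the message */
    return copies;
}
```

With two interested receivers the function makes exactly one copy, and with a single receiver it makes none, which is the saving the kernel code is after.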
The variable sockets
that is incremented within this function is not used. Since it is incremented only when a message is passed to a process, if it is 0 at the end of the function it indicates that no process received the message (but the value isn’t stored anywhere).
route_usrreq
is the routing protocol’s user-request function. It is called for a variety of operations. Figure 20.18 shows the function.
Table 20.18. route_usrreq
function: process PRU_
xxx requests.
----------------------------------------------------------------------------- rtsock.c
 64 int
 65 route_usrreq(so, req, m, nam, control)
 66 struct socket *so;
 67 int     req;
 68 struct mbuf *m, *nam, *control;
 69 {
 70     int     error = 0;
 71     struct rawcb *rp = sotorawcb(so);
 72     int     s;
 73     if (req == PRU_ATTACH) {
 74         MALLOC(rp, struct rawcb *, sizeof(*rp), M_PCB, M_WAITOK);
 75         if (so->so_pcb = (caddr_t) rp)
 76             bzero(so->so_pcb, sizeof(*rp));
 77     }
 78     if (req == PRU_DETACH && rp) {
 79         int     af = rp->rcb_proto.sp_protocol;
 80         if (af == AF_INET)
 81             route_cb.ip_count--;
 82         else if (af == AF_NS)
 83             route_cb.ns_count--;
 84         else if (af == AF_ISO)
 85             route_cb.iso_count--;
 86         route_cb.any_count--;
 87     }
 88     s = splnet();
 89     error = raw_usrreq(so, req, m, nam, control);
 90     rp = sotorawcb(so);
 91     if (req == PRU_ATTACH && rp) {
 92         int     af = rp->rcb_proto.sp_protocol;
 93         if (error) {
 94             free((caddr_t) rp, M_PCB);
 95             splx(s);
 96             return (error);
 97         }
 98         if (af == AF_INET)
 99             route_cb.ip_count++;
100         else if (af == AF_NS)
101             route_cb.ns_count++;
102         else if (af == AF_ISO)
103             route_cb.iso_count++;
104         route_cb.any_count++;
105         rp->rcb_faddr = &route_src;
106         soisconnected(so);
107         so->so_options |= SO_USELOOPBACK;
108     }
109     splx(s);
110     return (error);
111 }
----------------------------------------------------------------------------- rtsock.c
64-77
The PRU_ATTACH
request is issued when the process calls socket
. Memory is allocated for a routing control block. The pointer returned by MALLOC
is stored in the so_pcb
member of the socket
structure, and if the memory was allocated, the rawcb
structure is set to 0.
78-87
The close
system call issues the PRU_DETACH
request. If the socket
structure points to a protocol control block, two of the counters in the route_cb
structure are decremented: one is the any_count
and one is based on the protocol.
91-104
If the request is PRU_ATTACH
and the socket points to a routing control block, a check is made for an error from raw_usrreq
. Two of the counters in the route_cb
structure are then incremented: one is the any_count
and one is based on the protocol.
105-106
The foreign address in the routing control block is set to route_src
. This permanently connects the new socket to receive routing messages from the PF_ROUTE
family.
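The counter bookkeeping done on attach and detach is symmetric, so it can be sketched as one helper taking a signed delta. This is an illustrative user-space sketch, not the kernel's code: `route_cb_sketch`, `route_cb_adjust`, and the `FAM_` constants are hypothetical stand-ins for route_cb and the AF_INET, AF_NS, and AF_ISO constants, and their numeric values are assumptions.

```c
#include <assert.h>

/* Hypothetical stand-ins for the address family constants. */
enum { FAM_ANY = 0, FAM_INET = 2, FAM_NS = 6, FAM_ISO = 7 };

struct route_cb_sketch {
    int ip_count;   /* routing sockets created with protocol AF_INET */
    int ns_count;   /* ... with protocol AF_NS */
    int iso_count;  /* ... with protocol AF_ISO */
    int any_count;  /* all routing sockets, whatever their protocol */
};

/* Apply delta (+1 on attach, -1 on detach) for a socket with protocol af. */
void
route_cb_adjust(struct route_cb_sketch *cb, int af, int delta)
{
    if (af == FAM_INET)
        cb->ip_count += delta;
    else if (af == FAM_NS)
        cb->ns_count += delta;
    else if (af == FAM_ISO)
        cb->iso_count += delta;
    cb->any_count += delta;  /* always adjusted, even for protocol 0 */
}
```

A socket created with a protocol of 0 changes only `any_count`, so the per-family counters reflect only sockets that asked for a specific family.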
raw_usrreq
performs most of the processing for the user request in the routing domain. It was called by route_usrreq
in the previous section. The reason the user-request processing is divided between these two functions is that other protocols (e.g., the OSI CLNP) call raw_usrreq
but not route_usrreq. raw_usrreq
is not intended to be the pr_usrreq
function for a protocol. Instead it is a common subroutine called by the various pr_usrreq
functions.
Figure 20.19 shows the beginning and end of the raw_usrreq
function. The body of the switch
is discussed in separate figures following this figure.
Table 20.19. Body of raw_usrreq
function.
--------------------------------------------------------------- raw_usrreq.c
119 int
120 raw_usrreq(so, req, m, nam, control)
121 struct socket *so;
122 int     req;
123 struct mbuf *m, *nam, *control;
124 {
125     struct rawcb *rp = sotorawcb(so);
126     int     error = 0;
127     int     len;
128     if (req == PRU_CONTROL)
129         return (EOPNOTSUPP);
130     if (control && control->m_len) {
131         error = EOPNOTSUPP;
132         goto release;
133     }
134     if (rp == 0) {
135         error = EINVAL;
136         goto release;
137     }
138     switch (req) {
                                /* switch cases */
262     default:
263         panic("raw_usrreq");
264     }
265   release:
266     if (m != NULL)
267         m_freem(m);
268     return (error);
269 }
--------------------------------------------------------------- raw_usrreq.c
119-129
The PRU_CONTROL
request is from the ioctl
system call and is not supported in the routing domain.
130-133
If control information was passed by the process (using the sendmsg
system call) an error is returned, since the routing domain doesn’t use this optional information.
134-137
If the socket
structure doesn’t point to a routing control block, an error is returned. If a new socket is being created, it is the caller’s responsibility (i.e., route_usrreq
) to allocate this control block and store the pointer in the so_pcb
member before calling this function.
262-269
The default
for this switch
catches two requests that are not handled by case
statements: PRU_BIND
and PRU_CONNECT
. The code for these two requests is present but commented out in Net/3. Therefore issuing the bind
or connect
system calls on a routing socket causes a kernel panic. This is a bug. Fortunately it requires a superuser process to create this type of socket.
We now discuss the individual case
statements. Figure 20.20 shows the processing for the PRU_ATTACH
and PRU_DETACH
requests.
Table 20.20. raw_usrreq
function: PRU_ATTACH
and PRU_DETACH
requests.
----------------------------------------------------------------------- raw_usrreq.c
139         /*
140          * Allocate a raw control block and fill in the
141          * necessary info to allow packets to be routed to
142          * the appropriate raw interface routine.
143          */
144     case PRU_ATTACH:
145         if ((so->so_state & SS_PRIV) == 0) {
146             error = EACCES;
147             break;
148         }
149         error = raw_attach(so, (int) nam);
150         break;
151         /*
152          * Destroy state just before socket deallocation.
153          * Flush data or not depending on the options.
154          */
155     case PRU_DETACH:
156         if (rp == 0) {
157             error = ENOTCONN;
158             break;
159         }
160         raw_detach(rp);
161         break;
----------------------------------------------------------------------- raw_usrreq.c
139-148
The PRU_ATTACH
request is a result of the socket
system call. A routing socket must be created by a superuser process.
149-150
The function raw_attach
(Figure 20.24) links the control block into the doubly linked list. The nam
argument is the third argument to socket
and gets stored in the control block.
151-159
The PRU_DETACH
is issued by the close
system call. The test of a null rp
pointer is superfluous, since the test was already done before the switch
statement.
160-161
raw_detach
(Figure 20.25) removes the control block from the doubly linked list.
Figure 20.21 shows the processing of the PRU_CONNECT2, PRU_DISCONNECT
, and PRU_SHUTDOWN
requests.
Table 20.21. raw_usrreq
function: PRU_CONNECT2, PRU_DISCONNECT
, and PRU_SHUTDOWN
requests.
--------------------------------------------------------------------- raw_usrreq.c
186     case PRU_CONNECT2:
187         error = EOPNOTSUPP;
188         goto release;
189     case PRU_DISCONNECT:
190         if (rp->rcb_faddr == 0) {
191             error = ENOTCONN;
192             break;
193         }
194         raw_disconnect(rp);
195         soisdisconnected(so);
196         break;
197         /*
198          * Mark the connection as being incapable of further input.
199          */
200     case PRU_SHUTDOWN:
201         socantsendmore(so);
202         break;
--------------------------------------------------------------------- raw_usrreq.c
186-188
The PRU_CONNECT2
request is from the socketpair
system call and is not supported in the routing domain.
189-196
Since a routing socket is always connected (Figure 20.18), the PRU_DISCONNECT
request is issued by close
before the PRU_DETACH
request. The socket must already be connected to a foreign address, which is always true for a routing socket. raw_disconnect
and soisdisconnected
complete the processing.
197-202
The PRU_SHUTDOWN
request is from the shutdown
system call when the argument specifies that no more writes will be performed on the socket. socantsendmore
disables further writes.
The most common request for a routing socket, PRU_SEND
, and the PRU_ABORT
and PRU_SENSE
requests are shown in Figure 20.22.
Table 20.22. raw_usrreq
function: PRU_SEND, PRU_ABORT
, and PRU_SENSE
requests.
---------------------------------------------------------------------- raw_usrreq.c
203         /*
204          * Ship a packet out. The appropriate raw output
205          * routine handles any massaging necessary.
206          */
207     case PRU_SEND:
208         if (nam) {
209             if (rp->rcb_faddr) {
210                 error = EISCONN;
211                 break;
212             }
213             rp->rcb_faddr = mtod(nam, struct sockaddr *);
214         } else if (rp->rcb_faddr == 0) {
215             error = ENOTCONN;
216             break;
217         }
218         error = (*so->so_proto->pr_output) (m, so);
219         m = NULL;
220         if (nam)
221             rp->rcb_faddr = 0;
222         break;
223     case PRU_ABORT:
224         raw_disconnect(rp);
225         sofree(so);
226         soisdisconnected(so);
227         break;
228     case PRU_SENSE:
229         /*
230          * stat: don't bother with a blocksize.
231          */
232         return (0);
---------------------------------------------------------------------- raw_usrreq.c
203-217
The PRU_SEND
request is issued by sosend
when the process writes to the socket. If a nam
argument is specified, that is, the process specified a destination address using either sendto
or sendmsg
, an error is returned because route_usrreq
always sets rcb_faddr
for a routing socket.
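The address checks at lines 208-217 reduce to a small decision table. The function below is a hypothetical user-space sketch of that table (the name `send_precheck` and its boolean parameters are introduced here for illustration); it returns the errno value the kernel code would set, or 0 if the send may proceed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * has_nam:   the caller supplied a destination (sendto or sendmsg)
 * connected: the control block already has a foreign address
 */
int
send_precheck(bool has_nam, bool connected)
{
    if (has_nam)
        return connected ? EISCONN : 0;  /* explicit address on a connected socket */
    return connected ? 0 : ENOTCONN;     /* no address anywhere */
}
```

Because a routing socket is always connected, only the `connected == true` column applies to it: a plain write succeeds, and supplying a destination address yields EISCONN.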
218-222
The message in the mbuf chain pointed to by m
is passed to the protocol’s pr_output
function, which is route_output
.
223-227
If a PRU_ABORT
request is issued, the control block is disconnected, the socket is released, and the socket is disconnected.
228-232
The PRU_SENSE
request is issued by the fstat
system call. The function returns OK.
Figure 20.23 shows the remaining PRU_
xxx requests.
Table 20.23. raw_usrreq
function: final part.
---------------------------------------------------------------------- raw_usrreq.c
233         /*
234          * Not supported.
235          */
236     case PRU_RCVOOB:
237     case PRU_RCVD:
238         return (EOPNOTSUPP);
239     case PRU_LISTEN:
240     case PRU_ACCEPT:
241     case PRU_SENDOOB:
242         error = EOPNOTSUPP;
243         break;
244     case PRU_SOCKADDR:
245         if (rp->rcb_laddr == 0) {
246             error = EINVAL;
247             break;
248         }
249         len = rp->rcb_laddr->sa_len;
250         bcopy((caddr_t) rp->rcb_laddr, mtod(nam, caddr_t), (unsigned) len);
251         nam->m_len = len;
252         break;
253     case PRU_PEERADDR:
254         if (rp->rcb_faddr == 0) {
255             error = ENOTCONN;
256             break;
257         }
258         len = rp->rcb_faddr->sa_len;
259         bcopy((caddr_t) rp->rcb_faddr, mtod(nam, caddr_t), (unsigned) len);
260         nam->m_len = len;
261         break;
---------------------------------------------------------------------- raw_usrreq.c
233-243
These five requests are not supported.
244-261
The PRU_SOCKADDR
and PRU_PEERADDR
requests are from the getsockname
and getpeername
system calls respectively. The former always returns an error, since the bind
system call, which sets the local address, is not supported in the routing domain. The latter always returns the contents of the socket address structure route_src
, which was set by route_usrreq
as the foreign address.
The raw_attach
function, shown in Figure 20.24, was called by raw_usrreq
to finish processing the PRU_ATTACH
request.
Table 20.24. raw_attach
function.
------------------------------------------------------------------------- raw_cb.c
 49 int
 50 raw_attach(so, proto)
 51 struct socket *so;
 52 int     proto;
 53 {
 54     struct rawcb *rp = sotorawcb(so);
 55     int     error;
 56     /*
 57      * It is assumed that raw_attach is called
 58      * after space has been allocated for the
 59      * rawcb.
 60      */
 61     if (rp == 0)
 62         return (ENOBUFS);
 63     if (error = soreserve(so, raw_sendspace, raw_recvspace))
 64         return (error);
 65     rp->rcb_socket = so;
 66     rp->rcb_proto.sp_family = so->so_proto->pr_domain->dom_family;
 67     rp->rcb_proto.sp_protocol = proto;
 68     insque(rp, &rawcb);
 69     return (0);
 70 }
------------------------------------------------------------------------- raw_cb.c
49-64
The caller must have already allocated the raw protocol control block. soreserve
sets the high-water marks for the send and receive buffers to 8192 bytes each. This should be more than adequate for the routing messages.
65-67
A pointer to the socket
structure is stored in the protocol control block along with the dom_family
(which is PF_ROUTE
from Figure 20.1 for the routing domain) and the proto
argument (which is the third argument to socket
).
68-70
insque
adds the control block to the front of the doubly linked list headed by the global rawcb
.
The raw_detach
function, shown in Figure 20.25, was called by raw_usrreq
to finish processing the PRU_DETACH
request.
Table 20.25. raw_detach
function.
------------------------------------------------------------------------- raw_cb.c
 75 void
 76 raw_detach(rp)
 77 struct rawcb *rp;
 78 {
 79     struct socket *so = rp->rcb_socket;
 80     so->so_pcb = 0;
 81     sofree(so);
 82     remque(rp);
 83     free((caddr_t) (rp), M_PCB);
 84 }
------------------------------------------------------------------------- raw_cb.c
75-84
The so_pcb
pointer in the socket
structure is set to null and the socket is released. The control block is removed from the doubly linked list by remque
and the memory used for the control block is released by free
.
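The list manipulation done by insque in raw_attach and by remque in raw_detach can be sketched with a circular doubly linked list whose head is a sentinel node, as the global rawcb is. This is an illustrative user-space sketch, not the kernel routines: `node`, `list_init`, `list_insert_after`, and `list_remove` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *next;
    struct node *prev;
    int id;
};

/* An empty circular list is a head that points to itself. */
void
list_init(struct node *head)
{
    head->next = head->prev = head;
}

/* Insert elem immediately after pred, as insque(elem, pred) does. */
void
list_insert_after(struct node *elem, struct node *pred)
{
    elem->next = pred->next;
    elem->prev = pred;
    pred->next->prev = elem;
    pred->next = elem;
}

/* Unlink elem from whatever list it is on, as remque(elem) does. */
void
list_remove(struct node *elem)
{
    elem->prev->next = elem->next;
    elem->next->prev = elem->prev;
    elem->next = elem->prev = NULL;
}
```

Inserting after the head places each new element at the front of the list, so the most recently created control block is the first one visited by a traversal that starts at the head's `next` pointer.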
The raw_disconnect
function, shown in Figure 20.26, was called by raw_usrreq
to process the PRU_DISCONNECT
and PRU_ABORT
requests.
Table 20.26. raw_disconnect
function.
--------------------------------------------------------------------- raw_cb.c
 88 void
 89 raw_disconnect(rp)
 90 struct rawcb *rp;
 91 {
 92     if (rp->rcb_socket->so_state & SS_NOFDREF)
 93         raw_detach(rp);
 94 }
--------------------------------------------------------------------- raw_cb.c
88-94
If the socket does not reference a descriptor, raw_detach
releases the socket and control block.
A routing socket is a raw socket in the PF_ROUTE
domain. Routing sockets can be created only by a superuser process. If a nonprivileged process wants to read the routing information contained in the kernel, the sysctl
system call supported by the routing domain can be used (we described this in the previous chapter).
This chapter was our first encounter with the protocol control blocks (PCBs) that are normally associated with each socket. In the routing domain a special rawcb
contains information about the routing socket: the local and foreign addresses, the address family, and the protocol. We’ll see in Chapter 22 that the larger Internet protocol control block (inpcb
) is used with UDP, TCP, and raw IP sockets. The concepts are the same, however: the socket
structure is used by the socket layer, and the PCB, a rawcb
or an inpcb
, is used by the protocol layer. The socket
structure points to the PCB and vice versa.
The route_output
function handles the five routing requests that can be issued by a process. raw_input
delivers a routing message to one or more routing sockets, depending on the protocol and address family. The various PRU_
xxx requests for a routing socket are handled by raw_usrreq
and route_usrreq
. In later chapters we’ll encounter additional xxx_usrreq
functions, one per protocol (UDP, TCP, and raw IP), each consisting of a switch
statement to handle each request.
20.1 | List two ways a process can receive the return value from |
20.1 | The return value is returned in the |
20.2 | What happens when a process specifies a nonzero protocol argument to the |
20.2 | For a |
20.3 | Routes in the routing table (other than ARP entries) never time out. Implement a timeout on routes. |