Chapter 19. Routing Requests and Routing Messages

Introduction

The various protocols within the kernel don’t access the routing trees directly, using the functions from the previous chapter, but instead call a few functions that we describe in this chapter: rtalloc and rtalloc1 are two that perform routing table lookups, rtrequest adds and deletes routing table entries, and rtinit is called by most interfaces when the interface goes up or down.

Routing messages communicate information in two directions. A process such as the route command or one of the routing daemons (routed or gated) writes routing messages to a routing socket, causing the kernel to add a new route, delete an existing route, or modify an existing route. The kernel also generates routing messages that can be read by any routing socket when events occur in which the processes might be interested: an interface has gone down, a redirect has been received, and so on. In this chapter we cover the formats of these routing messages and the information contained therein, and we save our discussion of routing sockets until the next chapter.

Another interface provided by the kernel to the routing tables is through the sysctl system call, which we describe at the end of this chapter. This system call allows a process to read the entire routing table or a list of all the configured interfaces and interface addresses.

rtalloc and rtalloc1 Functions

rtalloc and rtalloc1 are the functions normally called to look up an entry in the routing table. Figure 19.1 shows rtalloc.

Figure 19.1. rtalloc function.

------------------------------------------------------------------------- route.c
 58 void
 59 rtalloc(ro)
 60 struct route *ro;
 61 {
 62     if (ro->ro_rt && ro->ro_rt->rt_ifp && (ro->ro_rt->rt_flags & RTF_UP))
 63         return;                 /* XXX */
 64     ro->ro_rt = rtalloc1(&ro->ro_dst, 1);
 65 }
------------------------------------------------------------------------- route.c

58-65

The argument ro is often the pointer to a route structure contained in an Internet PCB (Chapter 22) which is used by UDP and TCP. If ro already points to an rtentry structure (ro_rt is nonnull), and that structure points to an interface structure, and the route is up, the function returns. Otherwise rtalloc1 is called with a second argument of 1. We’ll see the purpose of this argument shortly.
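The early return in rtalloc amounts to a one-entry route cache held by the caller. The following user-space sketch illustrates that test. The trimmed structures, the stub_lookup stand-in for rtalloc1, the lookups counter, and the value of RTF_UP are assumptions made for the example; none of this is kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define RTF_UP 0x1              /* route usable (value assumed for this sketch) */

struct ifnet   { int if_unit; };                 /* stand-in for the interface */
struct rtentry { struct ifnet *rt_ifp; int rt_flags; };
struct route   { struct rtentry *ro_rt; };       /* cached route, as in a PCB */

int lookups;                    /* counts calls to the stubbed table search */

/* Stub standing in for rtalloc1: always "finds" the same entry. */
struct rtentry *
stub_lookup(struct rtentry *table_entry)
{
    lookups++;
    return table_entry;
}

/* Same test as rtalloc: skip the search when the cached route is usable. */
void
cached_rtalloc(struct route *ro, struct rtentry *table_entry)
{
    assert(ro != NULL);
    if (ro->ro_rt && ro->ro_rt->rt_ifp && (ro->ro_rt->rt_flags & RTF_UP))
        return;
    ro->ro_rt = stub_lookup(table_entry);
}

/* Returns the number of table searches made by three consecutive calls,
 * the last of which follows the route going down. */
int
demo_cached_rtalloc(void)
{
    struct ifnet ifn = { 0 };
    struct rtentry rt = { &ifn, RTF_UP };
    struct route ro = { NULL };

    lookups = 0;
    cached_rtalloc(&ro, &rt);   /* cache empty: search */
    cached_rtalloc(&ro, &rt);   /* cache valid: no search */
    rt.rt_flags &= ~RTF_UP;     /* route goes down */
    cached_rtalloc(&ro, &rt);   /* cache stale: search again */
    return lookups;
}
```

Only the first and third calls reach the table search, so a protocol that sends many datagrams to the same destination pays for one lookup as long as its cached route stays up.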

rtalloc1, shown in Figure 19.2, calls the rnh_matchaddr function, which is always rn_match (Figure 18.17) for Internet addresses.

Figure 19.2. rtalloc1 function.

------------------------------------------------------------------------- route.c
 66 struct rtentry *
 67 rtalloc1(dst, report)
 68 struct sockaddr *dst;
 69 int     report;
 70 {
 71     struct radix_node_head *rnh = rt_tables[dst->sa_family];
 72     struct rtentry *rt;
 73     struct radix_node *rn;
 74     struct rtentry *newrt = 0;
 75     struct rt_addrinfo info;
 76     int     s = splnet(), err = 0, msgtype = RTM_MISS;

 77     if (rnh && (rn = rnh->rnh_matchaddr((caddr_t) dst, rnh)) &&
 78         ((rn->rn_flags & RNF_ROOT) == 0)) {
 79         newrt = rt = (struct rtentry *) rn;
 80         if (report && (rt->rt_flags & RTF_CLONING)) {
 81             err = rtrequest(RTM_RESOLVE, dst, SA(0),
 82                             SA(0), 0, &newrt);
 83             if (err) {
 84                 newrt = rt;
 85                 rt->rt_refcnt++;
 86                 goto miss;
 87             }
 88             if ((rt = newrt) && (rt->rt_flags & RTF_XRESOLVE)) {
 89                 msgtype = RTM_RESOLVE;
 90                 goto miss;
 91             }
 92         } else
 93             rt->rt_refcnt++;
 94     } else {
 95         rtstat.rts_unreach++;
 96       miss:if (report) {
 97             bzero((caddr_t) & info, sizeof(info));
 98             info.rti_info[RTAX_DST] = dst;
 99             rt_missmsg(msgtype, &info, 0, err);
100         }
101     }
102     splx(s);
103     return (newrt);
104 }
------------------------------------------------------------------------- route.c

66-76

The first argument is a pointer to a socket address structure containing the address to search for. The sa_family member selects the routing table to search.

Call rn_match

77-78

If the following three conditions are met, the search is successful.

  1. A routing table exists for the protocol family,

  2. rn_match returns a nonnull pointer, and

  3. the matching radix_node does not have the RNF_ROOT flag set.

Remember that the two leaves that mark the end of the tree both have the RNF_ROOT flag set.

Search fails

94-101

If the search fails because any one of the three conditions is not met, the statistic rts_unreach is incremented and if the second argument to rtalloc1 (report) is nonzero, a routing message is generated that can be read by any interested processes on a routing socket. The routing message has the type RTM_MISS, and the function returns a null pointer.

79

If all three of the conditions are met, the lookup succeeded and the pointer to the matching radix_node is stored in rt and newrt. Notice that in the definition of the rtentry structure (Figure 18.24) the two radix_node structures are at the beginning, and, as shown in Figure 18.8, the first of these two structures contains the leaf node. Therefore the pointer to a radix_node structure returned by rn_match is really a pointer to an rtentry structure, which is the matching leaf node.
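The cast on line 79 relies purely on structure layout. A minimal sketch, assuming heavily trimmed versions of the two structures (only the position of rt_nodes matters), shows why a pointer to the leaf is also a pointer to the enclosing entry. The function names leaf_to_rtentry and demo_leaf_cast are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed-down stand-ins; only the layout matters here. */
struct radix_node { int rn_flags; };

struct rtentry {
    struct radix_node rt_nodes[2];   /* must stay first: [0] is the leaf */
    int rt_flags;
    int rt_refcnt;
};

/* Recover the enclosing rtentry from the leaf pointer a tree search returns. */
struct rtentry *
leaf_to_rtentry(struct radix_node *rn)
{
    /* The cast is valid only while rt_nodes sits at offset 0. */
    assert(offsetof(struct rtentry, rt_nodes) == 0);
    return (struct rtentry *) rn;
}

/* Returns 1 if the round trip through the leaf pointer yields the same entry. */
int
demo_leaf_cast(void)
{
    struct rtentry rt = { { { 0 }, { 0 } }, 0, 7 };
    struct radix_node *rn = &rt.rt_nodes[0];   /* what the search hands back */
    struct rtentry *found = leaf_to_rtentry(rn);

    return found == &rt && found->rt_refcnt == 7;
}
```

If another member were ever placed ahead of rt_nodes, the cast would silently point into the middle of the entry, which is why the sketch asserts the offset.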

Create clone entries

80-82

If the caller specified a nonzero second argument, and if the RTF_CLONING flag is set, rtrequest is called with a command of RTM_RESOLVE to create a new rtentry structure that is a clone of the one that was located. This feature is used by ARP and for multicast addresses.

Clone creation fails

83-87

If rtrequest returns an error, newrt is set back to the entry returned by rn_match and its reference count is incremented. A jump is made to miss where an RTM_MISS message is generated.

Check for external resolution

88-91

If rtrequest succeeds but the newly cloned entry has the RTF_XRESOLVE flag set, a jump is made to miss, this time to generate an RTM_RESOLVE message. The intent of this message is to notify a user process when the route is created, and it could be used with the conversion of IP addresses to X.121 addresses.

Increment reference count for normal successful search

92-93

When the search succeeds but the RTF_CLONING flag is not set, this statement increments the entry’s reference count. This is the normal flow through the function, which then returns the nonnull pointer.

For a small function, rtalloc1 has many options in how it operates. There are seven different flows through the function, summarized in Figure 19.3.

Figure 19.3. Summary of operation of rtalloc1.

                  report    RTF_CLONING  RTM_RESOLVE  RTF_XRESOLVE  routing message  rt_refcnt  return
                  argument  flag         return       flag          generated                   value
------------------------------------------------------------------------------------------------------
entry not found   0                                                                             null
                  1                                                 RTM_MISS                    null
entry found                 0                                                        ++         ptr
                  0                                                                  ++         ptr
                  1         1            OK           0                              ++         ptr
                  1         1            OK           1             RTM_RESOLVE      ++         ptr
                  1         1            error                      RTM_MISS         ++         ptr

We note that the first two rows (entry not found) are impossible if a default route exists. Also we show rt_refcnt being incremented in the fifth and sixth rows when the call to rtrequest with a command of RTM_RESOLVE is OK. The increment is done by rtrequest.
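The seven flows can also be written as a small decision function. This is a sketch of the summary only, not of rtalloc1 itself: the enum, the outcome structure, and the five boolean parameters are hypothetical names introduced for the example.

```c
#include <assert.h>

enum msg { MSG_NONE, MSG_MISS, MSG_RESOLVE };   /* hypothetical names */

struct outcome {
    enum msg msg;        /* routing message generated, if any */
    int      refcnt_inc; /* 1 if an entry's rt_refcnt is incremented */
    int      is_null;    /* 1 if rtalloc1 returns a null pointer */
};

/*
 * found:      the tree search matched a non-root leaf
 * report:     second argument to rtalloc1
 * cloning:    the matched entry has RTF_CLONING set
 * resolve_ok: rtrequest(RTM_RESOLVE) succeeded
 * xresolve:   the new clone has RTF_XRESOLVE set
 */
struct outcome
rtalloc1_outcome(int found, int report, int cloning, int resolve_ok, int xresolve)
{
    struct outcome o = { MSG_NONE, 0, 0 };

    if (!found) {
        o.is_null = 1;
        if (report)
            o.msg = MSG_MISS;
        return o;
    }
    o.refcnt_inc = 1;                   /* every "entry found" row holds a route */
    if (report && cloning) {
        if (!resolve_ok)
            o.msg = MSG_MISS;           /* clone creation failed */
        else if (xresolve)
            o.msg = MSG_RESOLVE;        /* external resolution requested */
    }
    return o;
}

/* Spot-check three rows of the summary. */
void
check_rows(void)
{
    assert(rtalloc1_outcome(0, 1, 0, 0, 0).msg == MSG_MISS);
    assert(rtalloc1_outcome(1, 1, 1, 1, 1).msg == MSG_RESOLVE);
    assert(rtalloc1_outcome(1, 0, 1, 0, 0).msg == MSG_NONE);
}
```

The last check shows the role of the report argument: with report set to 0, an entry with RTF_CLONING set is returned as is, with no clone and no message.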

RTFREE Macro and rtfree Function

The RTFREE macro, shown in Figure 19.4, calls the rtfree function only if the reference count is less than or equal to 1; otherwise it just decrements the reference count.

Figure 19.4. RTFREE macro.

------------------------------------------------------------------------- route.h
209 #define RTFREE(rt) \
210     if ((rt)->rt_refcnt <= 1) \
211         rtfree(rt); \
212     else \
213         (rt)->rt_refcnt--;      /* no need for function call */
------------------------------------------------------------------------- route.h

209-213

The rtfree function, shown in Figure 19.5, releases an rtentry structure when there are no more references to it. We’ll see in Figure 22.7, for example, that when a process control block is released, if it points to a routing entry, rtfree is called.

Figure 19.5. rtfree function: release an rtentry structure.

------------------------------------------------------------------------- route.c
105 void
106 rtfree(rt)
107 struct rtentry *rt;
108 {
109     struct ifaddr *ifa;

110     if (rt == 0)
111         panic("rtfree");
112     rt->rt_refcnt--;
113     if (rt->rt_refcnt <= 0 && (rt->rt_flags & RTF_UP) == 0) {
114         if (rt->rt_nodes->rn_flags & (RNF_ACTIVE | RNF_ROOT))
115             panic("rtfree 2");
116         rttrash--;
117         if (rt->rt_refcnt < 0) {
118             printf("rtfree: %x not freed (neg refs)\n", rt);
119             return;
120         }
121         ifa = rt->rt_ifa;
122         IFAFREE(ifa);
123         Free(rt_key(rt));
124         Free(rt);
125     }
126 }
------------------------------------------------------------------------- route.c

105-115

The entry’s reference count is decremented and if it is less than or equal to 0 and the route is not usable, the entry can be released. If either of the flags RNF_ACTIVE or RNF_ROOT is set, this is an internal error. If RNF_ACTIVE is set, this structure is still part of the routing table tree. If RNF_ROOT is set, this structure is one of the end markers built by rn_inithead.

116

rttrash is a debugging counter of the number of routing entries not in the routing tree, but not released. It is incremented by rtrequest when it begins deleting a route, and then decremented here. Its value should normally be 0.

Release interface reference

117-122

A check is made that the reference count is not negative, and then IFAFREE decrements the reference count for the ifaddr structure and releases it by calling ifafree when it reaches 0.

Release routing memory

123-124

The memory occupied by the routing entry key and its gateway is released. We’ll see in rt_setgate that the memory for both is allocated in one contiguous chunk, allowing both to be released with a single call to Free. Finally the rtentry structure itself is released.

Routing Table Reference Counts

The handling of the routing table reference count, rt_refcnt, differs from most other reference counts. We see in Figure 18.2 that most routes have a reference count of 0, yet the routing table entries without any references are not deleted. We just saw the reason in rtfree: an entry with a reference count of 0 is not deleted unless the entry’s RTF_UP flag is not set. The only time this flag is cleared is by rtrequest when a route is deleted from the routing tree.

Most routes are used in the following fashion.

  • If the route is created automatically as a route to an interface when the interface is configured (which is typical for Ethernet interfaces, for example), then rtinit calls rtrequest with a command of RTM_ADD, creating the new entry and setting the reference count to 1. rtinit then decrements the reference count to 0 before returning.

    A point-to-point interface follows a similar procedure, so the route starts with a reference count of 0.

    If the route is created manually by the route command or by a routing daemon, a similar procedure occurs, with route_output calling rtrequest with a command of RTM_ADD, setting the reference count to 1. This is then decremented by route_output to 0 before it returns.

    Therefore all newly created routes start with a reference count of 0.

  • When an IP datagram is sent on a socket, be it TCP or UDP, we saw that ip_output calls rtalloc, which calls rtalloc1. In Figure 19.3 we saw that the reference count is incremented by rtalloc1 if the route is found.

    The located route is called a held route, since a pointer to the routing table entry is being held by the protocol, normally in a route structure contained within a protocol control block. An rtentry structure that is being held by someone else cannot be deleted, which is why rtfree doesn’t release the structure until its reference count reaches 0.

  • A protocol releases a held route by calling RTFREE or rtfree. We saw this in Figure 8.24 when ip_output detects a change in the destination address. We’ll encounter it in Chapter 22 when a protocol control block that holds a route is released.

Part of the confusion we’ll encounter in the code that follows is that rtalloc1 is often called only to verify that a route to the destination exists, by a caller that doesn’t want to hold the route. Since rtalloc1 increments the counter, the caller immediately decrements it.

Consider a route being deleted by rtrequest. The RTF_UP flag is cleared, and if no one is holding the route (its reference count is 0), rtfree should be called. But rtfree considers it an error for the reference count to go below 0, so rtrequest checks whether its reference count is less than or equal to 0, and, if so, increments it and calls rtfree. Normally this sets the reference count to 1 and rtfree decrements it to 0 and deletes the route.
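The interplay between deletion and held routes can be traced with a small user-space model. This is a sketch under simplifying assumptions: the entry is reduced to its count and flags, the value of RTF_UP is assumed, the freed counter is hypothetical, and the negative-count check and the rttrash bookkeeping of the real rtfree are omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define RTF_UP 0x1                 /* value assumed for this sketch */

struct rtentry { int rt_refcnt; int rt_flags; };

int freed;                         /* counts entries actually released */

/* Mirrors the rtfree test: drop one reference, release only if unused and down. */
void
model_rtfree(struct rtentry *rt)
{
    assert(rt != NULL);
    rt->rt_refcnt--;
    if (rt->rt_refcnt <= 0 && (rt->rt_flags & RTF_UP) == 0) {
        free(rt);
        freed++;
    }
}

/* Mirrors the tail of RTM_DELETE when the caller does not ask for the pointer. */
void
model_delete(struct rtentry *rt)
{
    assert(rt != NULL);
    rt->rt_flags &= ~RTF_UP;
    if (rt->rt_refcnt <= 0) {
        rt->rt_refcnt++;           /* so the decrement in rtfree lands on 0 */
        model_rtfree(rt);
    }
}

/* Returns 1 if a held route survives deletion and an unheld one does not. */
int
demo_refcnt(void)
{
    struct rtentry *rt = calloc(1, sizeof(*rt));
    int a, b, c;

    assert(rt != NULL);
    freed = 0;
    rt->rt_flags = RTF_UP;         /* new route: count 0, up */
    rt->rt_refcnt++;               /* a protocol holds it */
    model_delete(rt);              /* deleted from the tree while held */
    a = freed;                     /* still 0: the holder keeps it alive */
    model_rtfree(rt);              /* the holder lets go */
    b = freed;                     /* 1: now released */

    rt = calloc(1, sizeof(*rt));
    assert(rt != NULL);
    rt->rt_flags = RTF_UP;
    model_delete(rt);              /* unheld route: released at once */
    c = freed;                     /* 2 */
    return a == 0 && b == 1 && c == 2;
}
```

In the first case the memory outlives the deletion until the holder releases its reference; this interval is what the rttrash counter tracks in the kernel.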

rtrequest Function

The rtrequest function is the focal point for adding and deleting routing table entries. Figure 19.6 shows some of the other functions that call it.

Figure 19.6. Summary of functions that call rtrequest.

rtrequest is a switch statement with one case per command: RTM_ADD, RTM_DELETE, and RTM_RESOLVE. Figure 19.7 shows the start of the function and the RTM_DELETE command.

Figure 19.7. rtrequest function: RTM_DELETE command.

------------------------------------------------------------------------- route.c
290 int
291 rtrequest(req, dst, gateway, netmask, flags, ret_nrt)
292 int     req, flags;
293 struct sockaddr *dst, *gateway, *netmask;
294 struct rtentry **ret_nrt;
295 {
296     int     s = splnet();
297     int     error = 0;
298     struct rtentry *rt;
299     struct radix_node *rn;
300     struct radix_node_head *rnh;
301     struct ifaddr *ifa;
302     struct sockaddr *ndst;
303 #define senderr(x) { error = x ; goto bad; }

304     if ((rnh = rt_tables[dst->sa_family]) == 0)
305         senderr(ESRCH);
306     if (flags & RTF_HOST)
307         netmask = 0;

308     switch (req) {
309     case RTM_DELETE:
310         if ((rn = rnh->rnh_deladdr(dst, netmask, rnh)) == 0)
311             senderr(ESRCH);
312         if (rn->rn_flags & (RNF_ACTIVE | RNF_ROOT))
313             panic("rtrequest delete");
314         rt = (struct rtentry *) rn;
315         rt->rt_flags &= ~RTF_UP;
316         if (rt->rt_gwroute) {
317             rt = rt->rt_gwroute;
318             RTFREE(rt);
319             (rt = (struct rtentry *) rn)->rt_gwroute = 0;
320         }
321         if ((ifa = rt->rt_ifa) && ifa->ifa_rtrequest)
322             ifa->ifa_rtrequest(RTM_DELETE, rt, SA(0));
323         rttrash++;
324         if (ret_nrt)
325             *ret_nrt = rt;
326         else if (rt->rt_refcnt <= 0) {
327             rt->rt_refcnt++;
328             rtfree(rt);
329         }
330         break;
------------------------------------------------------------------------- route.c

290-307

The second argument, dst, is a socket address structure specifying the key to be added to or deleted from the routing table. The sa_family from this key selects the routing table. If the flags argument indicates a host route (instead of a route to a network), the netmask pointer is set to null, ignoring any value the caller may have passed.

Delete from routing tree

309-315

The rnh_deladdr function (rn_delete from Figure 18.17) deletes the entry from the routing table tree and returns a pointer to the corresponding rtentry structure. The RTF_UP flag is cleared.

Remove reference to gateway routing table entry

316-320

If the entry is an indirect route through a gateway, RTFREE decrements the rt_refcnt member of the gateway’s entry and deletes it if the count reaches 0. The rt_gwroute pointer is set to null and rt is set back to point to the entry that was deleted.

Call interface request function

321-322

If an ifa_rtrequest function is defined for this entry, that function is called. This function is used by ARP, for example, in Chapter 21 to delete the corresponding ARP entry.

Return pointer or release reference

323-330

The rttrash global is incremented because the entry may not be released in the code that follows. If the caller wants the pointer to the rtentry structure that was deleted from the routing tree (if ret_nrt is nonnull), then that pointer is returned, but the entry cannot be released: it is the caller’s responsibility to call rtfree when it is finished with the entry. If ret_nrt is null, the entry can be released: if the reference count is less than or equal to 0, it is incremented, and rtfree is called. The break causes the function to return.

Figure 19.8 shows the next part of the function, which handles the RTM_RESOLVE command. This function is called with this command only from rtalloc1, when a new entry is to be created from an entry with the RTF_CLONING flag set.

Figure 19.8. rtrequest function: RTM_RESOLVE command.

------------------------------------------------------------------------- route.c
331     case RTM_RESOLVE:
332         if (ret_nrt == 0 || (rt = *ret_nrt) == 0)
333             senderr(EINVAL);
334         ifa = rt->rt_ifa;
335         flags = rt->rt_flags & ~RTF_CLONING;
336         gateway = rt->rt_gateway;
337         if ((netmask = rt->rt_genmask) == 0)
338             flags |= RTF_HOST;
339         goto makeroute;
------------------------------------------------------------------------- route.c

331-339

The final argument, ret_nrt, is used differently for this command: it contains the pointer to the entry with the RTF_CLONING flag set (Figure 19.2). The new entry will have the same rt_ifa pointer, the same flags (with the RTF_CLONING flag cleared), and the same rt_gateway. If the entry being cloned has a null rt_genmask pointer, the new entry has its RTF_HOST flag set, because it is a host route; otherwise the new entry is a network route and the network mask of the new entry is copied from the rt_genmask value. We give an example of cloned routes with a network mask at the end of this section. This case continues at the label makeroute, which is in the next figure.
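The flag handling on lines 335-338 is compact enough to isolate. The sketch below assumes the flag values shown (they are assumptions for the example, not taken from the listing) and uses the hypothetical function name clone_flags.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values assumed for this sketch. */
#define RTF_UP       0x1
#define RTF_HOST     0x4
#define RTF_CLONING  0x100

/* Flags for the clone: the parent's flags minus RTF_CLONING, plus RTF_HOST
 * when the parent has no genmask (the clone is then a host route). */
int
clone_flags(int parent_flags, const void *genmask)
{
    int flags = parent_flags & ~RTF_CLONING;

    if (genmask == NULL)
        flags |= RTF_HOST;
    return flags;
}

/* Self-check of the two cases described in the text. */
void
check_clone_flags(void)
{
    int parent = RTF_UP | RTF_CLONING;
    int mask_present = 1;          /* any non-null pointer will do */

    assert(clone_flags(parent, NULL) == (RTF_UP | RTF_HOST));
    assert(clone_flags(parent, &mask_present) == RTF_UP);
}
```

Clearing RTF_CLONING in the copy matters: without it, a later lookup that matched the clone would try to clone it again.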

Figure 19.9 shows the RTM_ADD command.

Figure 19.9. rtrequest function: RTM_ADD command.

------------------------------------------------------------------------- route.c
340     case RTM_ADD:
341         if ((ifa = ifa_ifwithroute(flags, dst, gateway)) == 0)
342             senderr(ENETUNREACH);

343       makeroute:
344         R_Malloc(rt, struct rtentry *, sizeof(*rt));
345         if (rt == 0)
346             senderr(ENOBUFS);
347         Bzero(rt, sizeof(*rt));
348         rt->rt_flags = RTF_UP | flags;
349         if (rt_setgate(rt, dst, gateway)) {
350             Free(rt);
351             senderr(ENOBUFS);
352         }
353         ndst = rt_key(rt);
354         if (netmask) {
355             rt_maskedcopy(dst, ndst, netmask);
356         } else
357             Bcopy(dst, ndst, dst->sa_len);

358         rn = rnh->rnh_addaddr((caddr_t) ndst, (caddr_t) netmask,
359                               rnh, rt->rt_nodes);
360         if (rn == 0) {
361             if (rt->rt_gwroute)
362                 rtfree(rt->rt_gwroute);
363             Free(rt_key(rt));
364             Free(rt);
365             senderr(EEXIST);
366         }
367         ifa->ifa_refcnt++;
368         rt->rt_ifa = ifa;
369         rt->rt_ifp = ifa->ifa_ifp;
370         if (req == RTM_RESOLVE)
371             rt->rt_rmx = (*ret_nrt)->rt_rmx;    /* copy metrics */
372         if (ifa->ifa_rtrequest)
373             ifa->ifa_rtrequest(req, rt, SA(ret_nrt ? *ret_nrt : 0));
374         if (ret_nrt) {
375             *ret_nrt = rt;
376             rt->rt_refcnt++;
377         }
378         break;
379     }
380   bad:
381     splx(s);
382     return (error);
383 }
------------------------------------------------------------------------- route.c

Locate corresponding interface

340-342

The function ifa_ifwithroute finds the appropriate local interface for the destination (dst), returning a pointer to its ifaddr structure.

Allocate memory for routing table entry

343-348

An rtentry structure is allocated. Recall that this structure contains both the two radix_node structures for the routing tree and the other routing information. The structure is zeroed and the rt_flags are set from the caller’s flags, including the RTF_UP flag.

Allocate and copy gateway address

349-352

The rt_setgate function (Figure 19.11) allocates memory for both the routing table key (dst) and its gateway. It then copies gateway into the new memory and sets the pointers rt_key, rt_gateway, and rt_gwroute.

Copy destination address

353-357

The destination address (the routing table key dst) must now be copied into the memory pointed to by rn_key. If a network mask is supplied, rt_maskedcopy logically ANDs dst and netmask, forming the new key. Otherwise dst is copied into the new key. The reason for logically ANDing dst and netmask is to guarantee that the key in the table has already been ANDed with its mask, so when a search key is compared against the key in the table only the search key needs to be ANDed. For example, the following command adds another IP address (an alias) to the Ethernet interface le0, with subnet 12 instead of 13:

bsdi $ ifconfig le0 inet 140.252.12.63 netmask 0xffffffe0 alias

The problem is that we’ve incorrectly specified all one bits for the host ID. Nevertheless, when the key is stored in the routing table we can verify with netstat that the address is first logically ANDed with the mask:

Destination      Gateway            Flags     Refs     Use  Interface
140.252.12.32    link#1             U C         0      0  le0
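The masking in this example can be checked by hand. The sketch below works on 32-bit IPv4 addresses in host byte order, which is a simplification: rt_maskedcopy operates on the bytes of socket address structures. The helper names ipv4, masked_key, and key_matches are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build a host-order IPv4 address from its dotted-decimal parts. */
uint32_t
ipv4(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    assert(a < 256 && b < 256 && c < 256 && d < 256);
    return (a << 24) | (b << 16) | (c << 8) | d;
}

/* The key stored in the table: the destination ANDed with its mask. */
uint32_t
masked_key(uint32_t dst, uint32_t netmask)
{
    return dst & netmask;
}

/* A lookup ANDs only the search key, since the stored key is already masked. */
int
key_matches(uint32_t search, uint32_t stored_key, uint32_t netmask)
{
    return (search & netmask) == stored_key;
}
```

With the alias above, masked_key(ipv4(140, 252, 12, 63), 0xffffffe0) yields ipv4(140, 252, 12, 32), the destination that netstat prints.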

Add entry to routing tree

358-366

The rnh_addaddr function (rn_addroute from Figure 18.17) adds this rtentry structure, with its destination and mask, to the routing table tree. If an error occurs, the structures are released and EEXIST returned (i.e., the entry is already in the routing table).

Store interface pointers

367-369

The ifaddr structure’s reference count is incremented and the pointers to its ifaddr and ifnet structures are stored.

Copy metrics for newly cloned route

370-371

If the command was RTM_RESOLVE (not RTM_ADD), the entire metrics structure is copied from the cloned entry into the new entry. If the command was RTM_ADD, the caller can set the metrics after this function returns.

Call interface request function

372-373

If an ifa_rtrequest function is defined for this entry, that function is called. ARP uses this to perform additional processing for both the RTM_ADD and RTM_RESOLVE commands (Section 21.13).

Return pointer and increment reference count

374-378

If the caller wants a copy of the pointer to the new structure, it is returned through ret_nrt and the rt_refcnt reference count is incremented from 0 to 1.

Example: Cloned Routes with Network Masks

The only use of the rt_genmask value is with cloned routes created by the RTM_RESOLVE command in rtrequest. If an rt_genmask pointer is nonnull, then the socket address structure pointed to by this pointer becomes the network mask of the newly created route. In our routing table, Figure 18.2, the cloned routes are for the local Ethernet and for multicast addresses. The following example from [Sklower 1991] provides a different use of cloned routes. Another example is in Exercise 19.2.

Consider a class B network, say 128.1, that is behind a point-to-point link. The subnet mask is 0xffffff00, the typical value that uses 8 bits for the subnet ID and 8 bits for the host ID. We need a routing table entry for all possible 254 subnets, with a gateway value of a router that is directly connected to our host and that knows how to reach the link to which the 128.1 network is connected.

The easiest solution, assuming the gateway router isn’t our default router, is a single entry with a destination of 128.1.0.0 and a mask of 0xffff0000. Assume, however, that the topology of the 128.1 network is such that each of the possible 254 subnets can have different operational characteristics: RTTs, MTUs, delays, and so on. If a separate routing table entry were used for each subnet, we would see that whenever a connection is closed, TCP would update the routing table entry with statistics about that route: its RTT, RTT variance, and so on (Figure 27.3). While we could create up to 254 entries by hand using the route command, one per subnet, a better solution is to use the cloning feature.

One entry is created by the system administrator with a destination of 128.1.0.0 and a network mask of 0xffff0000. Additionally, the RTF_CLONING flag is set and the genmask is set to 0xffffff00, which differs from the network mask. If the routing table is searched for 128.1.2.3, and an entry does not exist for the 128.1.2 subnet, the entry for 128.1 with the mask of 0xffff0000 is the best match. A new entry is created (since the RTF_CLONING flag is set) with a destination of 128.1.2 and a network mask of 0xffffff00 (the genmask value). The next time any host on this subnet is referenced, say 128.1.2.88, it will match this newly created entry.
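The 128.1 scenario can be replayed with a toy table. This is a sketch under stated assumptions: a small array with a linear best-match search stands in for the radix tree, addresses are 32-bit values in host byte order, and the names entry, table, add_entry, best_match, and lookup are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAXRT 8

struct entry {
    uint32_t key, mask;    /* destination (already masked) and network mask */
    uint32_t genmask;      /* mask given to clones; 0 means "host route" */
    int      cloning;      /* stands in for RTF_CLONING */
};

struct entry table[MAXRT];
int nentries;

/* Add an entry, storing the key ANDed with its mask; returns its index. */
int
add_entry(uint32_t key, uint32_t mask, uint32_t genmask, int cloning)
{
    assert(nentries < MAXRT);
    table[nentries] = (struct entry){ key & mask, mask, genmask, cloning };
    return nentries++;
}

/* Linear best-match search (most specific mask wins); -1 if nothing matches. */
int
best_match(uint32_t addr)
{
    int i, best = -1;

    for (i = 0; i < nentries; i++)
        if ((addr & table[i].mask) == table[i].key &&
            (best < 0 || table[i].mask > table[best].mask))
            best = i;
    return best;
}

/* Look up addr; if the match is a cloning entry, create and return the clone. */
int
lookup(uint32_t addr)
{
    int i = best_match(addr);

    if (i >= 0 && table[i].cloning) {
        uint32_t m = table[i].genmask ? table[i].genmask : 0xffffffffu;

        return add_entry(addr, m, 0, 0);
    }
    return i;
}
```

After add_entry(0x80010000, 0xffff0000, 0xffffff00, 1) creates the administrator's entry, a lookup of 128.1.2.3 (0x80010203) creates a clone with key 0x80010200 and mask 0xffffff00, and a later lookup of 128.1.2.88 (0x80010258) returns that clone without creating another entry.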

rt_setgate Function

Each leaf in the routing tree has a key (rt_key, which is just the rn_key member of the radix_node structure contained at the beginning of the rtentry structure), and an associated gateway (rt_gateway). Both are socket address structures specified when the routing table entry is created. Memory is allocated for both structures by rt_setgate, as shown in Figure 19.10.

Figure 19.10. Example of routing table keys and associated gateways.

This example shows two of the entries from Figure 18.2, the ones with keys of 127.0.0.1 and 140.252.13.33. The former’s gateway member points to an Internet socket address structure, while the latter’s points to a data-link socket address structure that contains an Ethernet address. The former was entered into the routing table by the route system when the system was initialized, and the latter was created by ARP.

We purposely show the two structures pointed to by rt_key and rt_gateway one right after the other, since they are allocated together by rt_setgate, which we show in Figure 19.11.

Figure 19.11. rt_setgate function.

----------------------------------------------------------------------- route.c
384 int
385 rt_setgate(rt0, dst, gate)
386 struct rtentry *rt0;
387 struct sockaddr *dst, *gate;
388 {
389     caddr_t new, old;
390     int     dlen = ROUNDUP(dst->sa_len), glen = ROUNDUP(gate->sa_len);
391     struct rtentry *rt = rt0;

392     if (rt->rt_gateway == 0 || glen > ROUNDUP(rt->rt_gateway->sa_len)) {
393         old = (caddr_t) rt_key(rt);
394         R_Malloc(new, caddr_t, dlen + glen);
395         if (new == 0)
396             return 1;
397         rt->rt_nodes->rn_key = new;
398     } else {
399         new = rt->rt_nodes->rn_key;
400         old = 0;
401     }
402     Bcopy(gate, (rt->rt_gateway = (struct sockaddr *) (new + dlen)), glen);
403     if (old) {
404         Bcopy(dst, new, dlen);
405         Free(old);
406     }
407     if (rt->rt_gwroute) {
408         rt = rt->rt_gwroute;
409         RTFREE(rt);
410         rt = rt0;
411         rt->rt_gwroute = 0;
412     }
413     if (rt->rt_flags & RTF_GATEWAY) {
414         rt->rt_gwroute = rtalloc1(gate, 1);
415     }
416     return 0;
417 }
----------------------------------------------------------------------- route.c

Set lengths from socket address structures

384-391

dlen is the length of the destination socket address structure, and glen is the length of the gateway socket address structure. The ROUNDUP macro rounds the value up to the next multiple of 4 bytes, but the size of most socket address structures is already a multiple of 4.
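The rounding can be sketched as follows. The unit is fixed at 4 bytes to match the text, and a length of 0 is treated as occupying one full unit; both choices, along with the names roundup4 and chunk_size, are assumptions of this example and not the macro from the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Round len up to a multiple of 4; a length of 0 still occupies 4 bytes. */
size_t
roundup4(size_t len)
{
    return len > 0 ? 1 + ((len - 1) | 3) : 4;
}

/* Size of the single chunk that holds the key followed by the gateway. */
size_t
chunk_size(size_t dst_len, size_t gate_len)
{
    assert(dst_len <= 255 && gate_len <= 255);   /* sa_len is a single byte */
    return roundup4(dst_len) + roundup4(gate_len);
}
```

For instance, roundup4(13) gives 16 while roundup4(16) stays 16, so a 16-byte Internet socket address needs no padding.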

Allocate memory

392-397

If memory has not been allocated for this routing table key and gateway yet, or if glen is greater than the current size of the structure pointed to by rt_gateway, a new piece of memory is allocated and rn_key is set to point to the new memory.

Use memory already allocated for key and gateway

398-401

An adequately sized piece of memory is already allocated for the key and gateway, so new is set to point to this existing memory.

Copy new gateway

402

The new gateway structure is copied and rt_gateway is set to point to the socket address structure.

Copy key from old memory to new memory

403-406

If a new piece of memory was allocated, the routing table key (dst) is copied right before the gateway field that was just copied. The old piece of memory is released.

Release gateway routing pointer

407-412

If the routing table entry contains a nonnull rt_gwroute pointer, that structure is released by RTFREE and the rt_gwroute pointer is set to null.

Locate and store new gateway routing pointer

413-415

If the routing table entry is an indirect route, rtalloc1 locates the entry for the new gateway, which is stored in rt_gwroute. If an invalid gateway is specified for an indirect route, an error is not returned by rt_setgate, but the rt_gwroute pointer will be null.
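The memory handling in rt_setgate can be modeled in user space. This is a sketch under simplifying assumptions: addresses are treated as opaque byte strings, the rt_gwroute handling is left out, and the keygate structure, the set_gate function, and the ROUNDUP4 macro are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ROUNDUP4(n) ((n) > 0 ? 1 + (((n) - 1) | 3) : 4)

struct keygate {
    unsigned char *key;      /* start of the chunk */
    unsigned char *gateway;  /* key plus the rounded key length */
    size_t         glen;     /* rounded length of the current gateway */
};

/* Store dst and gate in one chunk, reallocating only if gate no longer fits.
 * Returns 0 on success, 1 if memory could not be allocated. */
int
set_gate(struct keygate *kg, const void *dst, size_t dst_len,
         const void *gate, size_t gate_len)
{
    size_t dlen = ROUNDUP4(dst_len), glen = ROUNDUP4(gate_len);

    assert(kg != NULL && dst != NULL && gate != NULL);
    if (kg->gateway == NULL || glen > kg->glen) {
        unsigned char *chunk = calloc(1, dlen + glen);

        if (chunk == NULL)
            return 1;
        memcpy(chunk, dst, dst_len);   /* key sits at the front of the chunk */
        free(kg->key);                 /* one free covers old key and gateway */
        kg->key = chunk;
    }
    kg->gateway = kg->key + dlen;      /* gateway follows the rounded key */
    memcpy(kg->gateway, gate, gate_len);
    kg->glen = glen;
    return 0;
}
```

Because the key and gateway share one allocation, replacing a gateway with one of the same or smaller size touches no allocator at all, and releasing the entry takes a single free of the key pointer.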

rtinit Function

There are four calls to rtinit from the Internet protocols to add or delete routes associated with interfaces.

  • in_control calls rtinit twice when the destination address of a point-to-point interface is set (Figure 6.21). The first call specifies RTM_DELETE to delete any existing route to the destination; the second call specifies RTM_ADD to add the new route.

  • in_ifinit calls rtinit to add a network route for a broadcast network or a host route for a point-to-point link (Figure 6.19). If the route is for an Ethernet interface, the RTF_CLONING flag is automatically set by in_ifinit.

  • in_ifscrub calls rtinit to delete an existing route for an interface.

Figure 19.12 shows the first part of the rtinit function. The cmd argument is always RTM_ADD or RTM_DELETE.

Figure 19.12. rtinit function: call rtrequest to handle command.

------------------------------------------------------------------------- route.c
441 int
442 rtinit(ifa, cmd, flags)
443 struct ifaddr *ifa;
444 int     cmd, flags;
445 {
446     struct rtentry *rt;
447     struct sockaddr *dst;
448     struct sockaddr *deldst;
449     struct mbuf *m = 0;
450     struct rtentry *nrt = 0;
451     int     error;

452     dst = flags & RTF_HOST ? ifa->ifa_dstaddr : ifa->ifa_addr;
453     if (cmd == RTM_DELETE) {
454         if ((flags & RTF_HOST) == 0 && ifa->ifa_netmask) {
455             m = m_get(M_WAIT, MT_SONAME);
456             deldst = mtod(m, struct sockaddr *);
457             rt_maskedcopy(dst, deldst, ifa->ifa_netmask);
458             dst = deldst;
459         }
460         if (rt = rtalloc1(dst, 0)) {
461             rt->rt_refcnt--;
462             if (rt->rt_ifa != ifa) {
463                 if (m)
464                     (void) m_free(m);
465                 return (flags & RTF_HOST ? EHOSTUNREACH
466                         : ENETUNREACH);
467             }
468         }
469     }
470     error = rtrequest(cmd, dst, ifa->ifa_addr, ifa->ifa_netmask,
471                       flags | ifa->ifa_flags, &nrt);
472     if (m)
473         (void) m_free(m);
------------------------------------------------------------------------- route.c

Get destination address for route

452

If the route is to a host, the destination address is the other end of the point-to-point link. Otherwise we’re dealing with a network route and the destination address is the unicast address of the interface (masked with ifa_netmask).

Mask network address with network mask

453-459

If a route is being deleted, the destination must be looked up in the routing table to locate its routing table entry. If the route being deleted is a network route and the interface has an associated network mask, an mbuf is allocated and the destination address is copied into the mbuf by rt_maskedcopy, logically ANDing the caller’s address with the mask. dst is set to point to the masked copy in the mbuf, and that is the destination looked up in the next step.

Search for routing table entry

460-469

rtalloc1 searches the routing table for the destination address. If the entry is found, its reference count is decremented (since rtalloc1 incremented the reference count). If the pointer to the interface’s ifaddr in the routing table does not equal the caller’s argument, an error is returned.

Process request

470-473

rtrequest executes the command, either RTM_ADD or RTM_DELETE. When it returns, if an mbuf was allocated earlier, it is released.

Figure 19.13 shows the second half of rtinit.

Table 19.13. rtinit function: second half.

------------------------------------------------------------------------- route.c
474     if (cmd == RTM_DELETE && error == 0 && (rt = nrt)) {
475         rt_newaddrmsg(cmd, ifa, error, nrt);
476         if (rt->rt_refcnt <= 0) {
477             rt->rt_refcnt++;
478             rtfree(rt);
479         }
480     }
481     if (cmd == RTM_ADD && error == 0 && (rt = nrt)) {
482         rt->rt_refcnt--;
483         if (rt->rt_ifa != ifa) {
484             printf("rtinit: wrong ifa (%x) was (%x)\n", ifa,
485                    rt->rt_ifa);
486             if (rt->rt_ifa->ifa_rtrequest)
487                 rt->rt_ifa->ifa_rtrequest(RTM_DELETE, rt, SA(0));
488             IFAFREE(rt->rt_ifa);
489             rt->rt_ifa = ifa;
490             rt->rt_ifp = ifa->ifa_ifp;
491             ifa->ifa_refcnt++;
492             if (ifa->ifa_rtrequest)
493                 ifa->ifa_rtrequest(RTM_ADD, rt, SA(0));
494         }
495         rt_newaddrmsg(cmd, ifa, error, nrt);
496     }
497     return (error);
498 }
------------------------------------------------------------------------- route.c

Generate routing message on successful delete

474-480

If a route was deleted, and rtrequest returned 0 along with a pointer to the rtentry structure that was deleted (in nrt), a routing socket message is generated by rt_newaddrmsg. If the reference count is less than or equal to 0, it is incremented and the route is released by rtfree.

Successful add

481-482

If a route was added, and rtrequest returned 0 along with a pointer to the rtentry structure that was added (in nrt), the reference count is decremented (since rtrequest incremented it).

Incorrect interface

483-494

If the pointer to the interface’s ifaddr in the new routing table entry does not equal the caller’s argument, an error occurred. Recall that rtrequest determines the ifa pointer that is stored in the new entry by calling ifa_ifwithroute (Figure 19.9). When this error occurs the following steps take place: an error message is output to the console, the old address’s ifa_rtrequest function is called (if defined) with a command of RTM_DELETE, the old ifaddr structure is released, the rt_ifa pointer is set to the value specified by the caller and rt_ifp to that address’s interface, the reference count of the caller’s ifaddr structure is incremented, and the new address’s ifa_rtrequest function (if defined) is called with a command of RTM_ADD.

Generate routing message

495

A routing socket message is generated by rt_newaddrmsg for the RTM_ADD command.

rtredirect Function

When an ICMP redirect is received, icmp_input calls rtredirect and then calls pfctlinput (Figure 11.27). This latter function calls udp_ctlinput and tcp_ctlinput, which go through all the UDP and TCP protocol control blocks. If the PCB is connected to the foreign address that has been redirected, and if the PCB holds a route to that foreign address, the route is released by rtfree. The next time any of these control blocks is used to send an IP datagram to that foreign address, rtalloc will be called and the destination will be looked up in the routing table, possibly finding a new (redirected) route.

The purpose of rtredirect, the first half of which is shown in Figure 19.14, is to validate the information in the redirect, update the routing table immediately, and then generate a routing socket message.

Table 19.14. rtredirect function: validate received redirect.

------------------------------------------------------------------------- route.c
147 int
148 rtredirect(dst, gateway, netmask, flags, src, rtp)
149 struct sockaddr *dst, *gateway, *netmask, *src;
150 int     flags;
151 struct rtentry **rtp;
152 {
153     struct rtentry *rt;
154     int     error = 0;
155     short  *stat = 0;
156     struct rt_addrinfo info;
157     struct ifaddr *ifa;

158     /* verify the gateway is directly reachable */
159     if ((ifa = ifa_ifwithnet(gateway)) == 0) {
160         error = ENETUNREACH;
161         goto out;
162     }
163     rt = rtalloc1(dst, 0);
164     /*
165      * If the redirect isn't from our current router for this dst,
166      * it's either old or wrong.  If it redirects us to ourselves,
167      * we have a routing loop, perhaps as a result of an interface
168      * going down recently.
169      */
170 #define equal(a1, a2) (bcmp((caddr_t)(a1), (caddr_t)(a2), (a1)->sa_len) == 0)
171     if (!(flags & RTF_DONE) && rt &&
172         (!equal(src, rt->rt_gateway) || rt->rt_ifa != ifa))
173         error = EINVAL;
174     else if (ifa_ifwithaddr(gateway))
175         error = EHOSTUNREACH;
176     if (error)
177         goto done;
178     /*
179      * Create a new entry if we just got back a wildcard entry
180      * or if the lookup failed.  This is necessary for hosts
181      * which use routing redirects generated by smart gateways
182      * to dynamically build the routing tables.
183      */
184     if ((rt == 0) || (rt_mask(rt) && rt_mask(rt)->sa_len < 2))
185         goto create;
------------------------------------------------------------------------- route.c

147-157

The arguments are dst, the destination IP address of the datagram that caused the redirect (HD in Figure 8.18); gateway, the IP address of the router to use as the new gateway field for the destination (R2 in Figure 8.18); netmask, which is a null pointer; flags, which is RTF_GATEWAY and RTF_HOST; src, the IP address of the router that sent the redirect (R1 in Figure 8.18); and rtp, which is a null pointer. We indicate that netmask and rtp are both null pointers when called by icmp_input, but these arguments might be nonnull when called from other protocols.

New gateway must be directly connected

158-162

The new gateway must be directly connected or the redirect is invalid.

Locate routing table entry for destination and validate redirect

163-177

rtalloc1 searches the routing table for a route to the destination. The following conditions must all be true, or the redirect is invalid and an error is returned. Notice that icmp_input ignores any error return from rtredirect. ICMP does not generate an error in response to an invalid redirect; it just ignores it.

  • the RTF_DONE flag must not be set;

  • rtalloc1 must have located a routing table entry for dst;

  • the address of the router that sent the redirect (src) must equal the current rt_gateway for the destination;

  • the interface for the new gateway (the ifa returned by ifa_ifwithnet) must equal the current interface for the destination (rt_ifa), that is, the new gateway must be on the same network as the current gateway; and

  • the new gateway cannot redirect this host to itself, that is, there cannot exist an attached interface with a unicast address or a broadcast address equal to gateway.
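The tests at lines 171-175 can be sketched with simplified types as follows. This is only an illustration under stated assumptions: struct sk_route, check_redirect, the integer interface identifiers, and SK_RTF_DONE are hypothetical stand-ins for the kernel's rtentry, ifaddr pointers, and flag. As in the listing, a failed lookup (a null rt) skips the first test and falls through to the route creation code described next.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SK_RTF_DONE 0x40   /* illustrative value for this sketch only */

struct sk_route {               /* hypothetical stand-in for an rtentry */
    unsigned char gateway[4];   /* current next-hop router */
    int           ifa_id;       /* identifies the outgoing interface address */
};

/*
 * Return 0 if the redirect passes the checks, else an errno value.
 * rt may be NULL when the lookup found no route.
 * gw_ifa_id identifies the interface address on the new gateway's network;
 * gw_is_local is nonzero when the new gateway is one of our own addresses.
 */
static int
check_redirect(int flags, const struct sk_route *rt,
               const unsigned char src[4], int gw_ifa_id, int gw_is_local)
{
    assert(src != NULL);
    if (!(flags & SK_RTF_DONE) && rt != NULL &&
        (memcmp(src, rt->gateway, 4) != 0 || rt->ifa_id != gw_ifa_id))
        return EINVAL;          /* not from our current router, or wrong net */
    if (gw_is_local)
        return EHOSTUNREACH;    /* redirect points back at this host */
    return 0;
}
```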

Must create a new route

178-185

If a route to the destination was not found, or if the routing table entry that was located is the default route, a new entry is created for the destination. As the comment indicates, a host with access to multiple routers can use this feature to learn of the correct router when the default is not correct. The test for the default route is whether the routing table entry has an associated mask whose length field is less than 2, since the mask for the default route is rn_zeros (Figure 18.35).

Figure 19.15 shows the second half of this function.

Table 19.15. rtredirect function: second half.

------------------------------------------------------------------------- route.c
186     /*
187      * Don't listen to the redirect if it's
188      * for a route to an interface.
189      */
190     if (rt->rt_flags & RTF_GATEWAY) {
191         if (((rt->rt_flags & RTF_HOST) == 0) && (flags & RTF_HOST)) {
192             /*
193              * Changing from route to net => route to host.
194              * Create new route, rather than smashing route to net.
195              */
196           create:
197             flags |= RTF_GATEWAY | RTF_DYNAMIC;
198             error = rtrequest((int) RTM_ADD, dst, gateway,
199                               netmask, flags,
200                               (struct rtentry **) 0);
201             stat = &rtstat.rts_dynamic;
202         } else {
203             /*
204              * Smash the current notion of the gateway to
205              * this destination.  Should check about netmask!!!
206              */
207             rt->rt_flags |= RTF_MODIFIED;
208             flags |= RTF_MODIFIED;
209             stat = &rtstat.rts_newgateway;
210             rt_setgate(rt, rt_key(rt), gateway);
211         }
212     } else
213         error = EHOSTUNREACH;
214   done:
215     if (rt) {
216         if (rtp && !error)
217             *rtp = rt;
218         else
219             rtfree(rt);
220     }
221   out:
222     if (error)
223         rtstat.rts_badredirect++;
224     else if (stat != NULL)
225         (*stat)++;

226     bzero((caddr_t) & info, sizeof(info));
227     info.rti_info[RTAX_DST] = dst;
228     info.rti_info[RTAX_GATEWAY] = gateway;
229     info.rti_info[RTAX_NETMASK] = netmask;
230     info.rti_info[RTAX_AUTHOR] = src;
231     rt_missmsg(RTM_REDIRECT, &info, flags, error);
232 }
------------------------------------------------------------------------- route.c

Create new host route

186-195

If the current route to the destination is a network route and the redirect is a host redirect and not a network redirect, a new host route is created for the destination and the existing network route is left alone. We mentioned that the flags argument always specifies RTF_HOST since the Net/3 ICMP considers all received redirects as host redirects.

Create route

196-201

rtrequest creates the new route, setting the RTF_GATEWAY and RTF_DYNAMIC flags. The netmask argument is a null pointer, since the new route is a host route with an implied mask of all one bits. stat points to a counter that is incremented later.

Modify existing host route

202-211

This code is executed when the current route to the destination is already a host route. A new entry is not created, but the existing entry is modified. The RTF_MODIFIED flag is set and rt_setgate changes the rt_gateway field of the routing table entry to the new gateway address.

Ignore if destination is directly connected

212-213

If the current route to the destination is a direct route (the RTF_GATEWAY flag is not set), it is a redirect for a destination that is already directly connected. EHOSTUNREACH is returned.
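The three outcomes for a route that was found (lines 190-213) reduce to a small decision on two sets of flags. The following sketch is a simplified restatement and not the kernel code: the enum sk_action, the function redirect_action, and the SK_ flag values are hypothetical names used only here. It does not cover the separate jump to create that is taken when the lookup fails or returns the default route.

```c
#define SK_RTF_GATEWAY 0x2   /* illustrative flag values for this sketch */
#define SK_RTF_HOST    0x4

enum sk_action { SK_CREATE_HOST_ROUTE, SK_MODIFY_GATEWAY, SK_REJECT };

/* Decide what a validated redirect does to the route that was found. */
static enum sk_action
redirect_action(int rt_flags, int redirect_flags)
{
    if (!(rt_flags & SK_RTF_GATEWAY))
        return SK_REJECT;               /* destination is directly connected */
    if (!(rt_flags & SK_RTF_HOST) && (redirect_flags & SK_RTF_HOST))
        return SK_CREATE_HOST_ROUTE;    /* keep the network route intact */
    return SK_MODIFY_GATEWAY;           /* overwrite the existing gateway */
}
```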

Return pointer and increment statistic

214-225

If a routing table entry was located, it is either returned (if rtp is nonnull and there were no errors) or released by rtfree. The appropriate statistic is incremented.

Generate routing message

226-232

An rt_addrinfo structure is cleared and a routing socket message is generated by rt_missmsg. This message is sent by raw_input to any processes interested in the redirect.

Routing Message Structures

Routing messages consist of a fixed-length header followed by up to eight socket address structures. The fixed-length header is one of the following three structures:

  • rt_msghdr

  • if_msghdr

  • ifa_msghdr

Figure 18.11 provided an overview of which functions generated the different messages and Figure 18.9 showed which structure is used by each message type. The first three members of the three structures have the same data type and meaning: the message length, version, and type. This allows the receiver of the message to decode the message. Also, each structure has a member that encodes which of the eight potential socket address structures follow the structure (a bitmask): the rtm_addrs, ifm_addrs, and ifam_addrs members.

Figure 19.16 shows the most common of the structures, rt_msghdr. The RTM_IFINFO message uses an if_msghdr structure, shown in Figure 19.17. The RTM_NEWADDR and RTM_DELADDR messages use an ifa_msghdr structure, shown in Figure 19.18.

Table 19.16. rt_msghdr structure.

------------------------------------------------------------------------ route.h
139 struct rt_msghdr {
140     u_short rtm_msglen;         /* to skip over non-understood messages */
141     u_char  rtm_version;        /* future binary compatibility */
142     u_char  rtm_type;           /* message type */

143     u_short rtm_index;          /* index for associated ifp */
144     int     rtm_flags;          /* flags, incl. kern & message, e.g. DONE */
145     int     rtm_addrs;          /* bitmask identifying sockaddrs in msg */
146     pid_t   rtm_pid;            /* identify sender */
147     int     rtm_seq;            /* for sender to identify action */
148     int     rtm_errno;          /* why failed */
149     int     rtm_use;            /* from rtentry */
150     u_long  rtm_inits;          /* which metrics we are initializing */
151     struct rt_metrics rtm_rmx;  /* metrics themselves */
152 };
------------------------------------------------------------------------ route.h

Table 19.17. if_msghdr structure.

--------------------------------------------------------------------------- if.h
235 struct if_msghdr {
236     u_short ifm_msglen;         /* to skip over non-understood messages */
237     u_char  ifm_version;        /* future binary compatability */
238     u_char  ifm_type;           /* message type */

239     int     ifm_addrs;          /* like rtm_addrs */
240     int     ifm_flags;          /* value of if_flags */
241     u_short ifm_index;          /* index for associated ifp */
242     struct if_data ifm_data;    /* statistics and other data about if */
243 };
--------------------------------------------------------------------------- if.h

Table 19.18. ifa_msghdr structure.

--------------------------------------------------------------------------- if.h
248 struct ifa_msghdr {
249     u_short ifam_msglen;        /* to skip over non-understood messages */
250     u_char  ifam_version;       /* future binary compatability */
251     u_char  ifam_type;          /* message type */

252     int     ifam_addrs;         /* like rtm_addrs */
253     int     ifam_flags;         /* value of ifa_flags */
254     u_short ifam_index;         /* index for associated ifp */
255     int     ifam_metric;        /* value of ifa_metric */
256 };
--------------------------------------------------------------------------- if.h

Note that the first three members across the three different structures have the same data types and meanings.
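Because the length, version, and type come first in all three headers, a receiver can step through a buffer of messages without knowing which header each one uses. The following is a minimal sketch of that idea, assuming the messages are packed back to back in host byte order; struct sk_msg_prefix and count_type are hypothetical names, not part of the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The three leading members every routing message header shares. */
struct sk_msg_prefix {
    unsigned short msglen;   /* total length, header plus addresses */
    unsigned char  version;
    unsigned char  type;
};

/*
 * Step through a buffer holding several back-to-back messages and count
 * how many carry the given type, using only the common prefix.
 * Returns -1 if a length field is inconsistent with the buffer.
 */
static int
count_type(const unsigned char *buf, size_t buflen, unsigned char type)
{
    struct sk_msg_prefix p;
    size_t off = 0;
    int n = 0;

    assert(buf != NULL);
    while (off < buflen) {
        if (buflen - off < sizeof(p))
            return -1;
        memcpy(&p, buf + off, sizeof(p));   /* avoids unaligned access */
        if (p.msglen < sizeof(p) || p.msglen > buflen - off)
            return -1;
        if (p.type == type)
            n++;
        off += p.msglen;                    /* skip what we don't decode */
    }
    return n;
}
```

Only after checking the type would a real reader cast the buffer to the matching header structure.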

The three variables rtm_addrs, ifm_addrs, and ifam_addrs are bitmasks defining which socket address structures follow the header. Figure 19.19 shows the constants used with these bitmasks.

Table 19.19. Constants used to refer to members of rti_info array.

Bitmask                  Array index             Name in
Constant       Value     Constant        Value   rtsock.c   Description
-----------    -----     ------------    -----   --------   -----------------------------------------------
RTA_DST        0x01      RTAX_DST        0       dst        destination socket address structure
RTA_GATEWAY    0x02      RTAX_GATEWAY    1       gate       gateway socket address structure
RTA_NETMASK    0x04      RTAX_NETMASK    2       netmask    netmask socket address structure
RTA_GENMASK    0x08      RTAX_GENMASK    3       genmask    cloning mask socket address structure
RTA_IFP        0x10      RTAX_IFP        4       ifpaddr    interface name socket address structure
RTA_IFA        0x20      RTAX_IFA        5       ifaaddr    interface address socket address structure
RTA_AUTHOR     0x40      RTAX_AUTHOR     6                  socket address structure for author of redirect
RTA_BRD        0x80      RTAX_BRD        7       brdaddr    broadcast or point-to-point destination address
                         RTAX_MAX        8                  #elements in an rti_info[] array

The bitmask value is always the constant 1 left shifted by the number of bits specified by the array index. For example, 0x20 (RTA_IFA) is 1 left shifted by five bits (RTAX_IFA). We’ll see this fact used in the code.
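The relationship between the two sets of constants can be checked mechanically. The sketch below copies the values from Figure 19.19 into a local enum and array (it does not include the kernel header) and verifies the shift relationship; the names rta_value and masks_match_indexes are hypothetical.

```c
/* Array indexes and bitmask values as listed in Figure 19.19. */
enum { RTAX_DST, RTAX_GATEWAY, RTAX_NETMASK, RTAX_GENMASK,
       RTAX_IFP, RTAX_IFA, RTAX_AUTHOR, RTAX_BRD, RTAX_MAX };

static const int rta_value[RTAX_MAX] = {
    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
};

/* Return 1 if every bitmask equals 1 shifted left by its array index. */
static int
masks_match_indexes(void)
{
    int i;

    for (i = 0; i < RTAX_MAX; i++)
        if (rta_value[i] != (1 << i))
            return 0;
    return 1;
}
```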

The socket address structures that are present always occur in order of increasing array index, one right after the other. For example, if the bitmask is 0x87, the first socket address structure contains the destination, followed by the gateway, followed by the network mask, followed by the broadcast address.
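The ordering rule can be illustrated by a small helper that lists which array indexes are present in a bitmask, lowest first. This is a sketch for illustration only; addr_order and SK_RTAX_MAX are hypothetical names.

```c
#include <assert.h>

#define SK_RTAX_MAX 8   /* number of address slots, as in Figure 19.19 */

/*
 * Fill order[] with the array indexes whose bits are set in addrs,
 * lowest index first, and return how many were found.
 */
static int
addr_order(int addrs, int order[SK_RTAX_MAX])
{
    int i, n = 0;

    assert(order != 0);
    for (i = 0; i < SK_RTAX_MAX; i++)
        if (addrs & (1 << i))
            order[n++] = i;
    return n;
}
```

With a bitmask of 0x87 the helper reports four addresses at indexes 0, 1, 2, and 7, matching the destination, gateway, network mask, and broadcast address order given in the text.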

The array indexes in Figure 19.19 are used within the kernel to refer to its rt_addrinfo structure, shown in Figure 19.20. This structure holds the same bitmask that we described, indicating which addresses are present, and pointers to those socket address structures.

Table 19.20. rt_addrinfo structure: encode which addresses are present and pointers to them.

------------------------------------------------------------------------- route.h
199 struct rt_addrinfo {
200     int     rti_addrs;          /* bitmask, same as rtm_addrs */
201     struct sockaddr *rti_info[RTAX_MAX];
202 };
------------------------------------------------------------------------- route.h

For example, if the RTA_GATEWAY bit is set in the rti_addrs member, then the member rti_info[RTAX_GATEWAY] is a pointer to a socket address structure containing the gateway’s address. In the case of the Internet protocols, the socket address structure is a sockaddr_in containing the gateway’s IP address.

The fifth column in Figure 19.19 shows the names used for the corresponding members of an rti_info array throughout the file rtsock.c. These definitions look like

#define dst    info.rti_info[RTAX_DST]

We’ll encounter these names in many of the source files later in this chapter. The RTAX_AUTHOR element is not assigned a name because it is never passed from a process to the kernel.

We’ve already encountered this rt_addrinfo structure twice: in rtalloc1 (Figure 19.2) and rtredirect (Figure 19.14). Figure 19.21 shows the format of this structure when built by rtalloc1, after a routing table lookup fails, when rt_missmsg is called.

Figure 19.21. rt_addrinfo structure passed by rtalloc1 to rt_missmsg.

All the unused pointers are null because the structure is set to 0 before it is used. Also note that the rti_addrs member is not initialized with the appropriate bitmask because when this structure is used within the kernel, a null pointer in the rti_info array indicates a nonexistent socket address structure. The bitmask is needed only for messages between a process and the kernel.

Figure 19.22 shows the format of the structure built by rtredirect when it calls rt_missmsg.

Figure 19.22. rt_addrinfo structure passed by rtredirect to rt_missmsg.

The following sections show how these structures are placed into the messages sent to a process.

Figure 19.23 shows the route_cb structure, which we’ll encounter in the following sections. It contains four counters: one each for the IP, XNS, and OSI protocols, and an “any” counter. Each counter is the number of routing sockets currently in existence for that domain.

Table 19.23. route_cb structure: counters of routing socket listeners.

-------------------------------------------------------------------------- route.h
203 struct route_cb {
204     int     ip_count;           /* IP */
205     int     ns_count;           /* XNS */
206     int     iso_count;          /* ISO */
207     int     any_count;          /* sum of above three counters */
208 };
------------------------------------------------------------------------- route.h

203-208

By keeping track of the number of routing socket listeners, the kernel avoids building a routing message and calling raw_input to send the message when there aren’t any processes waiting for a message.

rt_missmsg Function

The function rt_missmsg, shown in Figure 19.24, takes the structures shown in Figures 19.21 and 19.22, calls rt_msg1 to build a corresponding variable-length message for a process in an mbuf chain, and then calls raw_input to pass the mbuf chain to all appropriate routing sockets.

Table 19.24. rt_missmsg function.

------------------------------------------------------------------------- rtsock.c
516 void
517 rt_missmsg(type, rtinfo, flags, error)
518 int     type, flags, error;
519 struct rt_addrinfo *rtinfo;
520 {
521     struct rt_msghdr *rtm;
522     struct mbuf *m;
523     struct sockaddr *sa = rtinfo->rti_info[RTAX_DST];

524     if (route_cb.any_count == 0)
525         return;

526     m = rt_msg1(type, rtinfo);
527     if (m == 0)
528         return;

529     rtm = mtod(m, struct rt_msghdr *);
530     rtm->rtm_flags = RTF_DONE | flags;
531     rtm->rtm_errno = error;
532     rtm->rtm_addrs = rtinfo->rti_addrs;

533     route_proto.sp_protocol = sa ? sa->sa_family : 0;
534     raw_input(m, &route_proto, &route_src, &route_dst);
535 }
------------------------------------------------------------------------- rtsock.c

516-525

If there aren’t any routing socket listeners, the function returns immediately.

Build message in mbuf chain

526-528

rt_msg1 (Section 19.12) builds the appropriate message in an mbuf chain, and returns the pointer to the chain. Figure 19.25 shows an example of the resulting mbuf chain, using the rt_addrinfo structure from Figure 19.22. The information needs to be in an mbuf chain because raw_input calls sbappendaddr to append the mbuf chain to a socket’s receive buffer.

Figure 19.25. Mbuf chain built by rt_msg1 corresponding to Figure 19.22.

Finish building message

529-532

The two members rtm_flags and rtm_errno are set to the values passed by the caller. The rtm_addrs member is copied from the rti_addrs value. We showed this value as 0 in Figures 19.21 and 19.22, but rt_msg1 calculates and stores the appropriate bitmask, based on which pointers in the rti_info array are nonnull.

Set protocol of message, call raw_input

533-534

The final three arguments to raw_input specify the protocol, source, and destination of the routing message. These three structures are initialized as

struct  sockaddr  route_dst = { 2, PF_ROUTE, };
struct  sockaddr  route_src = { 2, PF_ROUTE, };
struct  sockproto route_proto = { PF_ROUTE, };

The first two structures are never modified by the kernel. The sockproto structure, shown in Figure 19.26, is one we haven’t seen before.

Table 19.26. sockproto structure.

------------------------------------------------------------------------- socket.h
128 struct sockproto {
129     u_short sp_family;          /* address family */
130     u_short sp_protocol;        /* protocol */
131 };
------------------------------------------------------------------------- socket.h

The family is never changed from its initial value of PF_ROUTE, but the protocol is set each time raw_input is called. When a process creates a routing socket by calling socket, the third argument (the protocol) specifies the protocol in which the process is interested. The caller of raw_input sets the sp_protocol member of the route_proto structure to the protocol of the routing message. In the case of rt_missmsg, it is set to the sa_family of the destination socket address structure (if specified by the caller), which in Figures 19.21 and 19.22 would be AF_INET.

rt_ifmsg Function

In Figure 4.30 we saw that if_up and if_down both call rt_ifmsg, shown in Figure 19.27, to generate a routing socket message when an interface goes up or down.

Table 19.27. rt_ifmsg function.

------------------------------------------------------------------------- rtsock.c
540 void
541 rt_ifmsg(ifp)
542 struct ifnet *ifp;
543 {
544     struct if_msghdr *ifm;
545     struct mbuf *m;
546     struct rt_addrinfo info;

547     if (route_cb.any_count == 0)
548         return;

549     bzero((caddr_t) & info, sizeof(info));
550     m = rt_msg1(RTM_IFINFO, &info);
551     if (m == 0)
552         return;

553     ifm = mtod(m, struct if_msghdr *);
554     ifm->ifm_index = ifp->if_index;
555     ifm->ifm_flags = ifp->if_flags;
556     ifm->ifm_data = ifp->if_data;   /* structure assignment */
557     ifm->ifm_addrs = 0;

558     route_proto.sp_protocol = 0;
559     raw_input(m, &route_proto, &route_src, &route_dst);
560 }
------------------------------------------------------------------------- rtsock.c

547-548

If there aren’t any routing socket listeners, the function returns immediately.

Build message in mbuf chain

549-552

An rt_addrinfo structure is set to 0 and rt_msg1 builds an appropriate message in an mbuf chain. Notice that all socket address pointers in the rt_addrinfo structure are null, so only the fixed-length if_msghdr structure becomes the routing message; there are no addresses.

Finish building message

553-557

The interface’s index, flags, and if_data structure are copied into the message in the mbuf and the ifm_addrs bitmask is set to 0.

Set protocol of message, call raw_input

558-559

The protocol of the routing message is set to 0 because this message can apply to all protocol suites. It is a message about an interface, not about some specific destination. raw_input delivers the message to the appropriate listeners.

rt_newaddrmsg Function

In Figure 19.13 we saw that rtinit calls rt_newaddrmsg with a command of RTM_ADD or RTM_DELETE when an interface has an address added or deleted. Figure 19.28 shows the first half of the function.

Table 19.28. rt_newaddrmsg function: first half: create ifa_msghdr message.

------------------------------------------------------------------------- rtsock.c
569 void
570 rt_newaddrmsg(cmd, ifa, error, rt)
571 int     cmd, error;
572 struct ifaddr *ifa;
573 struct rtentry *rt;
574 {
575     struct rt_addrinfo info;
576     struct sockaddr *sa;
577     int     pass;
578     struct mbuf *m;
579     struct ifnet *ifp = ifa->ifa_ifp;

580     if (route_cb.any_count == 0)
581         return;

582     for (pass = 1; pass < 3; pass++) {
583         bzero((caddr_t) & info, sizeof(info));
584         if ((cmd == RTM_ADD && pass == 1) ||
585             (cmd == RTM_DELETE && pass == 2)) {
586             struct ifa_msghdr *ifam;
587             int     ncmd = cmd == RTM_ADD ? RTM_NEWADDR : RTM_DELADDR;

588             ifaaddr = sa = ifa->ifa_addr;
589             ifpaddr = ifp->if_addrlist->ifa_addr;
590             netmask = ifa->ifa_netmask;
591             brdaddr = ifa->ifa_dstaddr;
592             if ((m = rt_msg1(ncmd, &info)) == NULL)
593                 continue;
594             ifam = mtod(m, struct ifa_msghdr *);
595             ifam->ifam_index = ifp->if_index;
596             ifam->ifam_metric = ifa->ifa_metric;
597             ifam->ifam_flags = ifa->ifa_flags;
598             ifam->ifam_addrs = info.rti_addrs;
599         }
------------------------------------------------------------------------- rtsock.c

580-581

If there aren’t any routing socket listeners, the function returns immediately.

Generate two routing messages

582

The for loop iterates twice because two messages are generated. If the command is RTM_ADD, the first message is of type RTM_NEWADDR and the second message is of type RTM_ADD. If the command is RTM_DELETE, the first message is of type RTM_DELETE and the second message is of type RTM_DELADDR. The RTM_NEWADDR and RTM_DELADDR messages are built from an ifa_msghdr structure, while the RTM_ADD and RTM_DELETE messages are built from an rt_msghdr structure. The function generates two messages because one message describes the interface address that was added or deleted and the other describes the corresponding routing table entry.
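The ordering of the two messages can be restated as a small function of the command and the pass number. This is a sketch only: the enum sk_msg uses local placeholder values rather than the kernel's RTM_xxx constants, and newaddr_msg_type is a hypothetical name.

```c
#include <assert.h>

enum sk_msg { SK_RTM_ADD, SK_RTM_DELETE, SK_RTM_NEWADDR, SK_RTM_DELADDR };

/*
 * Message type produced on a given pass (1 or 2) of the loop.
 * cmd must be SK_RTM_ADD or SK_RTM_DELETE.
 */
static enum sk_msg
newaddr_msg_type(enum sk_msg cmd, int pass)
{
    assert(cmd == SK_RTM_ADD || cmd == SK_RTM_DELETE);
    assert(pass == 1 || pass == 2);
    if (cmd == SK_RTM_ADD)
        return pass == 1 ? SK_RTM_NEWADDR : SK_RTM_ADD;
    return pass == 1 ? SK_RTM_DELETE : SK_RTM_DELADDR;
}
```

The effect is that a listener hears about a new address before the route that uses it, and hears about a deleted route before the address it depended on goes away.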

583

An rt_addrinfo structure is set to 0.

Generate message with up to four addresses

588-591

Pointers to four socket address structures containing information about the interface address that has been added or deleted are stored in the rti_info array. Recall from Figure 19.19 that ifaaddr, ifpaddr, netmask, and brdaddr reference elements of the rti_info array in the rt_addrinfo structure named info. rt_msg1 builds the appropriate message in an mbuf chain. Notice that sa is set to point to the ifa_addr structure, and we’ll see at the end of the function that the family of this socket address structure becomes the protocol of the routing message.

Finish building message

594-598

The remaining members of the ifa_msghdr structure are filled in with the interface’s index, metric, and flags, along with the bitmask set by rt_msg1.

Figure 19.29 shows the second half of rt_newaddrmsg, which creates an rt_msghdr message with information about the routing table entry that was added or deleted.

Table 19.29. rt_newaddrmsg function: second half, create rt_msghdr message.

------------------------------------------------------------------------- rtsock.c
600         if ((cmd == RTM_ADD && pass == 2) ||
601             (cmd == RTM_DELETE && pass == 1)) {
602             struct rt_msghdr *rtm;

603             if (rt == 0)
604                 continue;
605             netmask = rt_mask(rt);
606             dst = sa = rt_key(rt);
607             gate = rt->rt_gateway;
608             if ((m = rt_msg1(cmd, &info)) == NULL)
609                 continue;
610             rtm = mtod(m, struct rt_msghdr *);
611             rtm->rtm_index = ifp->if_index;
612             rtm->rtm_flags |= rt->rt_flags;
613             rtm->rtm_errno = error;
614             rtm->rtm_addrs = info.rti_addrs;
615         }
616         route_proto.sp_protocol = sa ? sa->sa_family : 0;
617         raw_input(m, &route_proto, &route_src, &route_dst);
618     }
619 }
------------------------------------------------------------------------- rtsock.c

Build message

600-609

Pointers to three socket address structures are stored in the rti_info array: the rt_mask, rt_key, and rt_gateway structures. sa is set to point to the destination address, and its family becomes the protocol of the routing message. rt_msg1 builds the appropriate message in an mbuf chain.

610-614

Additional fields in the rt_msghdr structure are filled in, including the bitmask set by rt_msg1.

Set protocol of message, call raw_input

616-619

The protocol of the routing message is set and raw_input passes the message to the appropriate listeners. The function returns after two iterations through the loop.

rt_msg1 Function

The functions described in the previous three sections each called rt_msg1 to build the appropriate routing message. In Figure 19.25 we showed the mbuf chain that was built by rt_msg1 from the rt_msghdr and rt_addrinfo structures in Figure 19.22. Figure 19.30 shows the function.

Table 19.30. rt_msg1 function: obtain and initialize mbuf.

------------------------------------------------------------------------- rtsock.c
399 static struct mbuf *
400 rt_msg1(type, rtinfo)
401 int     type;
402 struct rt_addrinfo *rtinfo;
403 {
404     struct rt_msghdr *rtm;
405     struct mbuf *m;
406     int     i;
407     struct sockaddr *sa;
408     int     len, dlen;

409     m = m_gethdr(M_DONTWAIT, MT_DATA);
410     if (m == 0)
411         return (m);
412     switch (type) {

413     case RTM_DELADDR:
414     case RTM_NEWADDR:
415         len = sizeof(struct ifa_msghdr);
416         break;

417     case RTM_IFINFO:
418         len = sizeof(struct if_msghdr);
419         break;

420     default:
421         len = sizeof(struct rt_msghdr);
422     }
423     if (len > MHLEN)
424         panic("rt_msg1");
425     m->m_pkthdr.len = m->m_len = len;
426     m->m_pkthdr.rcvif = 0;
427     rtm = mtod(m, struct rt_msghdr *);
428     bzero((caddr_t) rtm, len);

429     for (i = 0; i < RTAX_MAX; i++) {
430         if ((sa = rtinfo->rti_info[i]) == NULL)
431             continue;
432         rtinfo->rti_addrs |= (1 << i);
433         dlen = ROUNDUP(sa->sa_len);
434         m_copyback(m, len, dlen, (caddr_t) sa);
435         len += dlen;
436     }
437     if (m->m_pkthdr.len != len) {
438         m_freem(m);
439         return (NULL);
440     }
441     rtm->rtm_msglen = len;
442     rtm->rtm_version = RTM_VERSION;
443     rtm->rtm_type = type;
444     return (m);
445 }
------------------------------------------------------------------------- rtsock.c

Get mbuf and determine fixed size of message

399-422

An mbuf with a packet header is obtained and the length of the fixed-size message is stored in len. Two of the message types in Figure 18.9 use an ifa_msghdr structure, one uses an if_msghdr structure, and the remaining nine use an rt_msghdr structure.

Verify structure fits in mbuf

423-424

The size of the fixed-length structure must fit entirely within the data portion of the packet header mbuf, because the mbuf pointer is cast to a structure pointer using mtod and the structure is then referenced through the pointer. The largest of the three structures is if_msghdr, which at 84 bytes is less than MHLEN (100).

Initialize mbuf packet header and zero structure

425-428

The two fields in the packet header are initialized and the structure in the mbuf is set to 0.

Copy socket address structures into mbuf chain

429-436

The caller passes a pointer to an rt_addrinfo structure. The socket address structures corresponding to all the nonnull pointers in the rti_info array are copied into the mbuf by m_copyback. The value 1 is left shifted by the RTAX_xxx index to generate the corresponding RTA_xxx bitmask (Figure 19.19), and each individual bitmask is logically ORed into the rti_addrs member, which the caller can store on return into the corresponding member of the message structure. The ROUNDUP macro rounds the size of each socket address structure up to the next multiple of 4 bytes.
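The loop just described can be sketched in isolation. The following is a minimal user-space illustration, not the kernel code: the names sk_sockaddr, sk_roundup, and sk_scan are hypothetical, and the alignment is fixed at 4 bytes to match the description above rather than taken from the kernel's ROUNDUP macro.

```c
#include <stddef.h>

/* Hypothetical stand-in for a socket address: only sa_len matters here. */
struct sk_sockaddr {
    unsigned char sa_len;
    unsigned char sa_family;
    char          sa_data[14];
};

/* Round len up to a multiple of 4 bytes (assumed alignment);
 * a length of 0 is treated as 4. */
static unsigned sk_roundup(unsigned len)
{
    return len > 0 ? 1u + ((len - 1u) | 3u) : 4u;
}

/* Walk an array of n pointers: set bit i in the returned mask for every
 * nonnull entry and add its rounded length to *total_len. */
static int sk_scan(struct sk_sockaddr *const info[], int n, int *total_len)
{
    int mask = 0;
    int i;

    for (i = 0; i < n; i++) {
        if (info[i] == NULL)
            continue;
        mask |= 1 << i;
        *total_len += (int) sk_roundup(info[i]->sa_len);
    }
    return mask;
}
```

With a 16-byte address in slot 0 and a 5-byte mask in slot 2, sk_scan returns the mask 0x5 and adds 16 + 8 = 24 bytes to the running length.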

437-440

If, when the loop terminates, the length in the mbuf packet header does not equal len, the function m_copyback wasn’t able to obtain a required mbuf.

Store length, version, and type

441-445

The length, version, and message type are stored in the first three members of the message structure. Again, all three xxx_msghdr structures start with the same three members, so this code works with all three structures even though the pointer rtm is a pointer to an rt_msghdr structure.
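The reliance on a shared prefix can be illustrated outside the kernel. The sketch below is an assumption-laden example, not the Net/3 definitions: the structure names, their trailing members, and the SK_VERSION value are all hypothetical. Where the kernel simply casts the pointer, this sketch goes through a union, because standard C only guarantees access to a common initial sequence when the structures are members of the same union.

```c
#include <stddef.h>

/* Two hypothetical headers that share their first three members. */
struct sk_rt_msghdr {
    unsigned short msglen;
    unsigned char  version;
    unsigned char  type;
    unsigned short index;
};

struct sk_ifa_msghdr {
    unsigned short msglen;
    unsigned char  version;
    unsigned char  type;
    int            addrs;
};

union sk_any_msghdr {
    struct sk_rt_msghdr  rt;
    struct sk_ifa_msghdr ifa;
};

#define SK_VERSION 3   /* placeholder version number for this sketch */

/* Fill in the three common members through the rt view of the union. */
static void sk_stamp(union sk_any_msghdr *m, unsigned short len,
                     unsigned char type)
{
    m->rt.msglen  = len;
    m->rt.version = SK_VERSION;
    m->rt.type    = type;
}
```

After sk_stamp runs, the same three values can be read back through the ifa member, because both structures lay out msglen, version, and type identically.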

rt_msg2 Function

rt_msg1 constructs a routing message in an mbuf chain, and the three functions that called it then called raw_input to append the mbuf chain to the receive buffer of one or more sockets. rt_msg2 is different: it builds a routing message in a memory buffer, not an mbuf chain, and takes as an argument a pointer to a walkarg structure that is used when rt_msg2 is called by the two functions that handle the sysctl system call for the routing domain. rt_msg2 is called in two different scenarios:

  1. from route_output to process the RTM_GET command, and

  2. from sysctl_dumpentry and sysctl_iflist to process a sysctl system call.

Before looking at rt_msg2, Figure 19.31 shows the walkarg structure that is used in scenario 2. We go through all these members as we encounter them.

Table 19.31. walkarg structure: used with the sysctl system call in the routing domain.

----------------------------------------------------------------------- rtsock.c
 41 struct walkarg {
 42     int     w_op;               /* NET_RT_xxx */
 43     int     w_arg;              /* RTF_xxx for FLAGS, if_index for IFLIST */
 44     int     w_given;            /* size of process' buffer */
 45     int     w_needed;           /* #bytes actually needed (at end) */
 46     int     w_tmemsize;         /* size of buffer pointed to by w_tmem */
 47     caddr_t w_where;            /* ptr to process' buffer (maybe null) */
 48     caddr_t w_tmem;             /* ptr to our malloc'ed buffer */
 49 };
----------------------------------------------------------------------- rtsock.c

Figure 19.32 shows the first half of the rt_msg2 function. This portion is similar to the first half of rt_msg1.

Table 19.32. rt_msg2 function: copy socket address structures.

------------------------------------------------------------------------- rtsock.c
446 static int
447 rt_msg2(type, rtinfo, cp, w)
448 int     type;
449 struct rt_addrinfo *rtinfo;
450 caddr_t cp;
451 struct walkarg *w;
452 {
453     int     i;
454     int     len, dlen, second_time = 0;
455     caddr_t cp0;

456     rtinfo->rti_addrs = 0;
457   again:
458     switch (type) {

459     case RTM_DELADDR:
460     case RTM_NEWADDR:
461         len = sizeof(struct ifa_msghdr);
462         break;

463     case RTM_IFINFO:
464         len = sizeof(struct if_msghdr);
465         break;

466     default:
467         len = sizeof(struct rt_msghdr);
468     }
469     if (cp0 = cp)
470         cp += len;
471     for (i = 0; i < RTAX_MAX; i++) {
472         struct sockaddr *sa;

473         if ((sa = rtinfo->rti_info[i]) == 0)
474             continue;
475         rtinfo->rti_addrs |= (1 << i);
476         dlen = ROUNDUP(sa->sa_len);
477         if (cp) {
478             bcopy((caddr_t) sa, cp, (unsigned) dlen);
479             cp += dlen;
480         }
481         len += dlen;
482     }
------------------------------------------------------------------------- rtsock.c

446-455

Since this function stores the resulting message in a memory buffer, the caller specifies the start of that buffer in the cp argument. It is the caller’s responsibility to ensure that the buffer is large enough for the message that is generated. To help the caller determine this size, if the cp argument is null, rt_msg2 doesn’t store anything but processes the input and returns the total number of bytes required to hold the result. We’ll see that route_output uses this feature and calls this function twice: first to determine the size and then to store the result, after allocating a buffer of the correct size. When rt_msg2 is called by route_output, the final argument is null. This final argument is nonnull when called as part of the sysctl system call processing.
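The "call once with a null pointer to get the size, then call again to store" convention is a general pattern. The following is a minimal sketch of it under stated assumptions: sk_build and sk_build_alloc are hypothetical names, the 4-byte header is a toy layout, and malloc stands in for whatever allocator the real caller uses.

```c
#include <stdlib.h>
#include <string.h>

#define SK_HDRLEN 4   /* assumed size of the fixed header in this sketch */

/* Build header + payload. If cp is NULL nothing is stored; either way
 * the number of bytes the message needs is returned. */
static size_t sk_build(char *cp, const char *payload, size_t plen)
{
    size_t len = SK_HDRLEN + plen;

    if (cp != NULL) {
        memset(cp, 0, SK_HDRLEN);
        cp[0] = (char) len;              /* toy length field; len < 128 assumed */
        memcpy(cp + SK_HDRLEN, payload, plen);
    }
    return len;
}

/* Size the message, allocate exactly that much, then build it for real. */
static char *sk_build_alloc(const char *payload, size_t plen, size_t *lenp)
{
    size_t len = sk_build(NULL, payload, plen);   /* pass 1: size only */
    char  *buf = malloc(len);

    if (buf == NULL)
        return NULL;
    *lenp = sk_build(buf, payload, plen);         /* pass 2: store */
    return buf;
}
```

The caller owns the returned buffer and must release it with free.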

Determine size of structure

458-470

The size of the fixed-length message structure is set based on the message type. If the cp pointer is nonnull, it is incremented by this size.

Copy socket address structures

471-482

The for loop goes through the rti_info array, and for each element that is a nonnull pointer it sets the appropriate bit in the rti_addrs bitmask, copies the socket address structure (if cp is nonnull), and updates the length.

Figure 19.33 shows the second half of rt_msg2, most of which handles the optional walkarg structure.

Table 19.33. rt_msg2 function: handle optional walkarg argument.

------------------------------------------------------------------------- rtsock.c
483     if (cp == 0 && w != NULL && !second_time) {
484         struct walkarg *rw = w;

485         rw->w_needed += len;
486         if (rw->w_needed <= 0 && rw->w_where) {
487             if (rw->w_tmemsize < len) {
488                 if (rw->w_tmem)
489                     free(rw->w_tmem, M_RTABLE);
490                 if (rw->w_tmem = (caddr_t)
491                     malloc(len, M_RTABLE, M_NOWAIT))
492                     rw->w_tmemsize = len;
493             }
494             if (rw->w_tmem) {
495                 cp = rw->w_tmem;
496                 second_time = 1;
497                 goto again;
498             } else
499                 rw->w_where = 0;
500         }
501     }
502     if (cp) {
503         struct rt_msghdr *rtm = (struct rt_msghdr *) cp0;

504         rtm->rtm_version = RTM_VERSION;
505         rtm->rtm_type = type;
506         rtm->rtm_msglen = len;
507     }
508     return (len);
509 }
------------------------------------------------------------------------- rtsock.c

483-484

This if statement is true only when a pointer to a walkarg structure was passed and this is the first loop through the function. The variable second_time was initialized to 0 but can be set to 1 within this if statement, and a jump made back to the label again in Figure 19.32. The test for cp being a null pointer is superfluous since whenever the w pointer is nonnull, the cp pointer is null, and vice versa.

Check if data to be stored

485-486

w_needed is incremented by the size of the message. This variable is initialized to 0 minus the size of the user’s buffer to the sysctl function. For example, if the buffer size is 500 bytes, w_needed is initialized to -500. As long as it does not become positive, there is room in the buffer. w_where is a pointer to the buffer in the calling process. It is null if the process doesn’t want the result; in that case the process just wants sysctl to return the size of the result, so that it can allocate a buffer and call sysctl again. rt_msg2 doesn’t copy the data back to the process (that is up to the caller), but if the w_where pointer is null, there’s no need for rt_msg2 to malloc a buffer to hold the result and loop back through the function again, storing the result in this buffer. There are really five different scenarios that this function handles, summarized in Figure 19.34.

Table 19.34. Summary of different scenarios for rt_msg2.

called from     cp        w         w.w_where   second_time   Description
-------------   -------   -------   ---------   -----------   -------------------------------------
route_output    null      null                                wants return length
                nonnull   null                                wants result
sysctl_rtable   null      nonnull   null        0             process wants return length
                null      nonnull   nonnull     0             first time around to calculate length
                nonnull   nonnull   nonnull     1             second time around to store result

Allocate buffer first time or if message length increases

487-493

w_tmemsize is the size of the buffer pointed to by w_tmem. It is initialized to 0 by sysctl_rtable, so the first time rt_msg2 is called for a given sysctl request, the buffer must be allocated. Also, if the size of the result increases, the existing buffer must be released and a new (larger) buffer allocated.

Go around again and store result

494-499

If w_tmem is nonnull, a buffer already exists or one was just allocated. cp is set to point to this buffer, second_time is set to 1, and a jump is made to again. The if statement at the beginning of this figure won’t be true during this second pass, since second_time is now 1. If w_tmem is null, the call to malloc failed, so the pointer to the buffer in the process is set to null, preventing anything from being returned.
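The buffer handling amounts to a grow-only scratch buffer that is reused across calls. A minimal user-space sketch follows; sk_scratch and sk_scratch_get are hypothetical names, and the standard malloc and free stand in for the kernel allocator. Unlike the kernel code, this sketch resets the recorded size to 0 when allocation fails.

```c
#include <stdlib.h>

/* Hypothetical scratch buffer: mirrors the roles of w_tmem and w_tmemsize. */
struct sk_scratch {
    char  *mem;
    size_t size;
};

/* Return a buffer of at least len bytes, or NULL if allocation fails.
 * The existing buffer is kept whenever it is already large enough. */
static char *sk_scratch_get(struct sk_scratch *s, size_t len)
{
    if (s->size < len) {
        free(s->mem);                 /* free(NULL) is a no-op */
        s->mem  = malloc(len);
        s->size = s->mem ? len : 0;
    }
    return s->mem;
}
```

Because most routing messages in one dump have similar sizes, a buffer managed this way is typically allocated only a few times per request rather than once per message.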

Store length, version, and type

502-509

If cp is nonnull, the first three elements of the message header are stored. The function returns the length of the message.

sysctl_rtable Function

This function handles the sysctl system call on a routing socket. It is called by net_sysctl as shown in Figure 18.11.

Before going through the source code, Figure 19.35 shows the typical use of this system call with respect to the routing table. This example is from the arp program.

Table 19.35. Example of sysctl with routing table.

-------------------------------------------------------------------------
    int      mib[6];
    size_t   needed;
    char     *buf, *lim, *next;
    struct rt_msghdr  *rtm;

    mib[0] = CTL_NET;
    mib[1] = PF_ROUTE;
    mib[2] = 0;
    mib[3] = AF_INET;        /* address family; can be 0 */
    mib[4] = NET_RT_FLAGS;   /* operation */
    mib[5] = RTF_LLINFO;     /* flags; can be 0 */

    if (sysctl(mib, 6, NULL, &needed, NULL, 0) < 0)
        quit("sysctl error, estimate");

    if ( (buf = malloc(needed)) == NULL)
        quit("malloc");

    if (sysctl(mib, 6, buf, &needed, NULL, 0) < 0)
        quit("sysctl error, retrieval");

    lim = buf + needed;
    for (next = buf; next < lim; next += rtm->rtm_msglen) {
        rtm = (struct rt_msghdr *)next;
        ...  /* do whatever */
    }
-------------------------------------------------------------------------

The first three elements in the mib array cause the kernel to call sysctl_rtable to process the remaining elements.

mib[4] specifies the operation. Three operations are supported.

  1. NET_RT_DUMP: return the routing table corresponding to the address family specified by mib[3]. If the address family is 0, all routing tables are returned.

    An RTM_GET routing message is returned for each routing table entry containing two, three, or four socket address structures per message: those addresses pointed to by rt_key, rt_gateway, rt_netmask, and rt_genmask. The final two pointers might be null.

  2. NET_RT_FLAGS: the same as the previous command except mib[5] specifies an RTF_xxx flag (Figure 18.25), and only entries with this flag set are returned.

  3. NET_RT_IFLIST: return information on all the configured interfaces. If the mib[5] value is nonzero it specifies an interface index and only the interface with the corresponding if_index is returned. Otherwise all interfaces on the ifnet linked list are returned.

    For each interface one RTM_IFINFO message is returned, with information about the interface itself, followed by one RTM_NEWADDR message for each ifaddr structure on the interface’s if_addrlist linked list. If the mib[3] value is nonzero, RTM_NEWADDR messages are returned for only the addresses with an address family that matches the mib[3] value. Otherwise mib[3] is 0 and information on all addresses is returned.

    This operation is intended to replace the SIOCGIFCONF ioctl (Figure 4.26).

One problem with this system call is that the amount of information returned can vary, depending on the number of routing table entries or the number of interfaces. Therefore the first call to sysctl typically specifies a null pointer as the third argument, which means: don’t return any data, just return the number of bytes of return information. As we see in Figure 19.35, the process then calls malloc, followed by sysctl to fetch the information. This second call to sysctl again returns the number of bytes through the fourth argument (which might have changed since the previous call), and this value provides the pointer lim that points just beyond the final byte of data that was returned. The process then steps through the routing messages in the buffer, using the rtm_msglen member to step to the next message.
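Stepping through the returned buffer can be tried without a kernel. The sketch below assumes a toy header, sk_hdr, whose first member is the message length, and a hypothetical helper, sk_count_msgs, that counts the messages in a buffer. It adds two defensive checks that the loop in Figure 19.35 does not make: it stops on a truncated header and on a length that is too small or runs past the end.

```c
#include <stddef.h>
#include <string.h>

/* Toy header for this sketch: length first, as in the routing messages. */
struct sk_hdr {
    unsigned short msglen;
    unsigned char  version;
    unsigned char  type;
};

/* Count the messages in buf[0..needed), advancing by each msglen. */
static int sk_count_msgs(const char *buf, size_t needed)
{
    const char *lim = buf + needed;   /* just beyond the last byte */
    const char *next = buf;
    int count = 0;

    while (next < lim) {
        struct sk_hdr h;
        size_t left = (size_t) (lim - next);

        if (left < sizeof h)
            break;                            /* truncated header */
        memcpy(&h, next, sizeof h);           /* avoids alignment concerns */
        if (h.msglen < sizeof h || h.msglen > left)
            break;                            /* bad length: stop */
        count++;
        next += h.msglen;
    }
    return count;
}
```

A real consumer would do its per-message work where this sketch increments count.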

Figure 19.36 shows the values for these six mib variables that various Net/3 programs specify to access the routing table and interface list.

Table 19.36. Examples of programs that call sysctl to obtain routing table and interface list.

mib[]   arp            route         netstat       routed          gated           rwhod
-----   ------------   -----------   -----------   -------------   -------------   -------------
0       CTL_NET        CTL_NET       CTL_NET       CTL_NET         CTL_NET         CTL_NET
1       PF_ROUTE       PF_ROUTE      PF_ROUTE      PF_ROUTE        PF_ROUTE        PF_ROUTE
2       0              0             0             0               0               0
3       AF_INET        0             0             AF_INET         0               AF_INET
4       NET_RT_FLAGS   NET_RT_DUMP   NET_RT_DUMP   NET_RT_IFLIST   NET_RT_IFLIST   NET_RT_IFLIST
5       RTF_LLINFO     0             0             0               0               0

The first three programs fetch entries from the routing table and the last three fetch the interface list. The routed program supports only the Internet routing protocols, so it specifies a mib[3] value of AF_INET, while gated supports other protocols, so its value for mib[3] is 0.

Figure 19.37 shows the organization of the three sysctl_xxx functions that we cover in the following sections.

Figure 19.37. Functions that support the sysctl system call for routing sockets.

Figure 19.38 shows the sysctl_rtable function.

Table 19.38. sysctl_rtable function: process sysctl system call requests.

------------------------------------------------------------------------- rtsock.c
705 int
706 sysctl_rtable(name, namelen, where, given, new, newlen)
707 int    *name;
708 int     namelen;
709 caddr_t where;
710 size_t *given;
711 caddr_t *new;
712 size_t  newlen;
713 {
714     struct radix_node_head *rnh;
715     int     i, s, error = EINVAL;
716     u_char  af;
717     struct walkarg w;

718     if (new)
719         return (EPERM);


720     if (namelen != 3)
721         return (EINVAL);
722     af = name[0];
723     Bzero(&w, sizeof(w));
724     w.w_where = where;
725     w.w_given = *given;
726     w.w_needed = 0 - w.w_given;
727     w.w_op = name[1];
728     w.w_arg = name[2];

729     s = splnet();
730     switch (w.w_op) {

731     case NET_RT_DUMP:
732     case NET_RT_FLAGS:
733         for (i = 1; i <= AF_MAX; i++)
734             if ((rnh = rt_tables[i]) && (af == 0 || af == i) &&
735                 (error = rnh->rnh_walktree(rnh,
736                                            sysctl_dumpentry, &w)))
737                 break;
738         break;

739     case NET_RT_IFLIST:
740         error = sysctl_iflist(af, &w);
741     }
742     splx(s);
743     if (w.w_tmem)
744         free(w.w_tmem, M_RTABLE);
745     w.w_needed += w.w_given;
746     if (where) {
747         *given = w.w_where - where;
748         if (*given < w.w_needed)
749             return (ENOMEM);
750     } else {
751         *given = (11 * w.w_needed) / 10;
752     }
753     return (error);
754 }
------------------------------------------------------------------------- rtsock.c

Validate arguments

705-719

The new argument is used when the process is calling sysctl to set the value of a variable, which isn’t supported with the routing tables. Therefore this argument must be a null pointer.

720-721

namelen must be 3 because at this point in the processing of the system call, three elements in the name array remain: name[0], the address family (what the process specifies as mib[3]); name[1], the operation (mib[4]); and name[2], the flags (mib[5]).

Initialize walkarg structure

723-728

A walkarg structure (Figure 19.31) is set to 0 and the following members are initialized: w_where is the address in the calling process of the buffer for the results (this can be a null pointer, as we mentioned); w_given is the size of the buffer in bytes (this is meaningless on input if w_where is a null pointer, but it must be set on return to the amount of data that would have been returned); w_needed is set to the negative of the buffer size; w_op is the operation (the NET_RT_xxx value); and w_arg is the flags value.

Dump routing table

731-738

The NET_RT_DUMP and NET_RT_FLAGS operations are handled the same way: a loop is made through all the routing tables (the rt_tables array), and if the routing table is in use and either the address family argument was 0 or the address family argument matches the family of this routing table, the rnh_walktree function is called to process the entire routing table. In Figure 18.17 we show that this function is normally rn_walktree. The second argument to this function is the address of another function that is called for each leaf of the routing tree (sysctl_dumpentry). The third argument is just a pointer to anything, which rn_walktree passes unchanged to the sysctl_dumpentry function. Here it is a pointer to the walkarg structure that contains all the information about this sysctl call.

Return interface list

739-740

The NET_RT_IFLIST operation calls the function sysctl_iflist, which goes through all the ifnet structures.

Release buffer

743-744

If a buffer was allocated by rt_msg2 to contain a routing message, it is now released.

Update w_needed

745

The size of each message was added to w_needed by rt_msg2. Since this variable was initialized to the negative of w_given, its value can now be expressed as

w_needed = 0 - w_given + totalbytes

where totalbytes is the sum of all the message lengths added by rt_msg2. By adding the value of w_given back into w_needed, we get

w_needed = 0 - w_given + totalbytes + w_given
         = totalbytes

the total number of bytes. Since the two values of w_given in this equation end up canceling each other, when the process specifies w_where as a null pointer it need not initialize the value of w_given. Indeed, we see in Figure 19.35 that the variable needed was not initialized.
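The bookkeeping can be traced with a few lines of user-space C. This is only a sketch: sk_walk, sk_walk_init, sk_walk_add, and sk_walk_total are hypothetical names that mirror the roles of w_given and w_needed.

```c
/* Hypothetical mirror of the two walkarg counters. */
struct sk_walk {
    int given;    /* size of the caller's buffer */
    int needed;   /* starts at -given, grows by each message length */
};

static void sk_walk_init(struct sk_walk *w, int given)
{
    w->given  = given;
    w->needed = 0 - given;
}

/* Account for one message; return 1 while the total still fits. */
static int sk_walk_add(struct sk_walk *w, int len)
{
    w->needed += len;
    return w->needed <= 0;
}

/* Adding given back cancels the initial offset, leaving the total. */
static int sk_walk_total(const struct sk_walk *w)
{
    return w->needed + w->given;
}
```

Starting from a 500-byte buffer, a 108-byte message leaves needed at -392 and still fits; a further 400-byte message pushes needed to 8, so it no longer fits, and the total comes out as 508 bytes.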

Return actual size of message

746-749

If where is nonnull, the number of bytes stored in the buffer is returned through the given pointer. If this value is less than the total number of bytes required for the complete result (w_needed), ENOMEM is returned because the returned information has been truncated.

Return estimated size of message

750-752

When the where pointer is null, the process just wants the total number of bytes returned. A 10% fudge factor is added to the size, in case the size of the desired tables increases between this call to sysctl and the next.
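The two return paths can be summarized in a small sketch. The function sk_report and its parameters are hypothetical; it only restates the logic of lines 746-752 with the pointer arithmetic replaced by plain byte counts.

```c
#include <errno.h>
#include <stddef.h>

/* have_buf: nonzero if the caller supplied a buffer.
 * stored:   bytes actually copied into that buffer.
 * total:    bytes the complete result requires.
 * The size to report is written through given. */
static int sk_report(int have_buf, size_t stored, size_t total, size_t *given)
{
    if (have_buf) {
        *given = stored;
        return stored < total ? ENOMEM : 0;   /* truncated result */
    }
    *given = (11 * total) / 10;               /* add 10% headroom */
    return 0;
}
```

For a result of 3840 bytes and no buffer, the reported estimate is 4224 bytes.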

sysctl_dumpentry Function

In the previous section we described how this function is called by rn_walktree, which in turn is called by sysctl_rtable. Figure 19.39 shows the function.

Table 19.39. sysctl_dumpentry function: process one routing table entry.

------------------------------------------------------------------------- rtsock.c
623 int
624 sysctl_dumpentry(rn, w)
625 struct radix_node *rn;
626 struct walkarg *w;
627 {
628     struct rtentry *rt = (struct rtentry *) rn;
629     int     error = 0, size;
630     struct rt_addrinfo info;

631     if (w->w_op == NET_RT_FLAGS && !(rt->rt_flags & w->w_arg))
632         return 0;
633     bzero((caddr_t) & info, sizeof(info));
634     dst = rt_key(rt);
635     gate = rt->rt_gateway;
636     netmask = rt_mask(rt);
637     genmask = rt->rt_genmask;
638     size = rt_msg2(RTM_GET, &info, 0, w);
639     if (w->w_where && w->w_tmem) {
640         struct rt_msghdr *rtm = (struct rt_msghdr *) w->w_tmem;

641         rtm->rtm_flags = rt->rt_flags;
642         rtm->rtm_use = rt->rt_use;
643         rtm->rtm_rmx = rt->rt_rmx;
644         rtm->rtm_index = rt->rt_ifp->if_index;
645         rtm->rtm_errno = rtm->rtm_pid = rtm->rtm_seq = 0;
646         rtm->rtm_addrs = info.rti_addrs;
647         if (error = copyout((caddr_t) rtm, w->w_where, size))
648             w->w_where = NULL;
649         else
650             w->w_where += size;
651     }
652     return (error);
653 }
------------------------------------------------------------------------- rtsock.c

623-630

Each time this function is called, its first argument points to a radix_node structure, which is also a pointer to an rtentry structure. The second argument points to the walkarg structure that was initialized by sysctl_rtable.

Check flags of routing table entry

631-632

If the process specified a flag value (mib[5]), this entry is skipped if the rt_flags member doesn’t have the desired flag set. We see in Figure 19.36 that the arp program uses this to select only those entries with the RTF_LLINFO flag set, since these are the entries of interest to ARP.

Form routing message

633-638

The following four pointers in the rti_info array are copied from the routing table entry: dst, gate, netmask, and genmask. The first two are always nonnull, but the other two can be null. rt_msg2 forms an RTM_GET message.

Copy message back to process

639-651

If the process wants the message returned and a buffer was allocated by rt_msg2, the remainder of the routing message is formed in the buffer pointed to by w_tmem and copyout copies the message back to the process. If the copy was successful, w_where is incremented by the number of bytes copied.

sysctl_iflist Function

This function, shown in Figure 19.40, is called directly by sysctl_rtable to return the interface list to the process.

Table 19.40. sysctl_iflist function: return list of interfaces and their addresses.

------------------------------------------------------------------------- rtsock.c
654 int
655 sysctl_iflist(af, w)
656 int     af;
657 struct walkarg *w;
658 {
659     struct ifnet *ifp;
660     struct ifaddr *ifa;
661     struct rt_addrinfo info;
662     int     len, error = 0;

663     bzero((caddr_t) & info, sizeof(info));
664     for (ifp = ifnet; ifp; ifp = ifp->if_next) {
665         if (w->w_arg && w->w_arg != ifp->if_index)
666             continue;
667         ifa = ifp->if_addrlist;
668         ifpaddr = ifa->ifa_addr;
669         len = rt_msg2(RTM_IFINFO, &info, (caddr_t) 0, w);
670         ifpaddr = 0;
671         if (w->w_where && w->w_tmem) {
672             struct if_msghdr *ifm;

673             ifm = (struct if_msghdr *) w->w_tmem;
674             ifm->ifm_index = ifp->if_index;
675             ifm->ifm_flags = ifp->if_flags;
676             ifm->ifm_data = ifp->if_data;
677             ifm->ifm_addrs = info.rti_addrs;
678             if (error = copyout((caddr_t) ifm, w->w_where, len))
679                 return (error);
680             w->w_where += len;
681         }
682         while (ifa = ifa->ifa_next) {
683             if (af && af != ifa->ifa_addr->sa_family)
684                 continue;
685             ifaaddr = ifa->ifa_addr;
686             netmask = ifa->ifa_netmask;
687             brdaddr = ifa->ifa_dstaddr;
688             len = rt_msg2(RTM_NEWADDR, &info, 0, w);
689             if (w->w_where && w->w_tmem) {
690                 struct ifa_msghdr *ifam;
691                 ifam = (struct ifa_msghdr *) w->w_tmem;
692                 ifam->ifam_index = ifa->ifa_ifp->if_index;
693                 ifam->ifam_flags = ifa->ifa_flags;
694                 ifam->ifam_metric = ifa->ifa_metric;
695                 ifam->ifam_addrs = info.rti_addrs;
696                 if (error = copyout(w->w_tmem, w->w_where, len))
697                     return (error);
698                 w->w_where += len;
699             }
700         }
701         ifaaddr = netmask = brdaddr = 0;
702     }
703     return (0);
704 }
------------------------------------------------------------------------- rtsock.c

This function is a for loop that iterates through each interface starting with the one pointed to by ifnet. Then a while loop proceeds through the linked list of ifaddr structures for each interface. An RTM_IFINFO routing message is generated for each interface and an RTM_NEWADDR message for each address.

Check interface index

654-666

The process can specify a nonzero flags argument (mib[5] in Figure 19.36) to select only the interface with a matching if_index value.

Build routing message

667-670

The only socket address structure returned with the RTM_IFINFO message is ifpaddr. The message is built by rt_msg2. The pointer ifpaddr in the info structure is then set to 0, since the same info structure is used for generating the subsequent RTM_NEWADDR messages.

Copy message back to process

671-681

If the process wants the message returned, the remainder of the if_msghdr structure is filled in, copyout copies the buffer to the process, and w_where is incremented.

Iterate through address structures, check address family

682-684

Each ifaddr structure for the interface is processed and the process can specify a nonzero address family (mib[3] in Figure 19.36) to select only the interface addresses of the given family.

Build routing message

685-688

Up to three socket address structures are returned in each RTM_NEWADDR message: ifaaddr, netmask, and brdaddr. The message is built by rt_msg2.

Copy message back to process

689-699

If the process wants the message returned, the remainder of the ifa_msghdr structure is filled in, copyout copies the buffer to the process, and w_where is incremented.

701

These three pointers in the info array are set to 0, since the same array is used for the next interface message.

Summary

Routing messages all have the same format: a fixed-length structure followed by a variable number of socket address structures. There are three different types of messages, each corresponding to a different fixed-length structure, and the first three elements of each structure identify the length, version, and type of message. A bitmask in each structure identifies which socket address structures follow the fixed-length structure.

These messages are passed between a process and the kernel in two different ways. Messages can be passed in either direction, one message per read or write, across a routing socket. This allows a superuser process complete read and write access to the kernel’s routing tables. This is how routing daemons such as routed and gated implement their desired routing policy.

Alternatively any process can read the contents of the kernel’s routing tables using the sysctl system call. This does not involve a routing socket and does not require special privileges. The entire result, normally consisting of many routing messages, is returned as part of the system call. Since the process does not know the size of the result, a method is provided for the system call to return this size without returning the actual result.

Exercises

19.1

What is the difference between the RTF_DYNAMIC and RTF_MODIFIED flags? Can both be set for a given routing table entry?

19.1

The RTF_DYNAMIC flag is set in Figure 19.15 when the route is created by a redirect, and the RTF_MODIFIED flag is set when the gateway field of an existing route is modified by a redirect. If a route is created by a redirect and then later modified by another redirect, both flags will be set.

19.2

What happens when the default route is entered with a command of the form

bsdi $ route add default -cloning -genmask 255.255.255.255 sun

19.2

A host route is created for each host accessed through the default route. TCP can then maintain and update routing metrics for each individual host (Figure 27.3).

19.3

Estimate the space required by sysctl to dump a routing table that contains 15 ARP entries and 20 routes.

19.3

Each rt_msghdr structure requires 76 bytes. Two sockaddr_in structures are present for a host route (destination and gateway) giving a message size of 108 bytes. The message size for each ARP entry is 112 bytes: one sockaddr_in and one sockaddr_dl. The total size is then (15 × 112 + 20 × 108) or 3840 bytes. A network route (instead of a host route) requires an additional 8 bytes for the network mask (116 bytes for the message instead of 108), so if the 20 routes are all network routes, the total size is 4000 bytes.
