13.4. File System I/O

Two distinct methods perform file system I/O:

  • read(), write(), and related system calls

  • Memory-mapping of a file into the process's address space

Both methods are implemented the same way: a file is mapped into an address space and then paged I/O is performed on the pages within the mapped address space. Although it may be obvious that memory mapping is done when we memory-map a file into a process's address space, it is less obvious that the read() and write() system calls also map a file before reading or writing it. The major differences between these two methods are where the file is mapped and who does the mapping; a process calls mmap() to map the file into its address space for memory mapped I/O, and the kernel maps the file into the kernel's address space for read and write. The two methods are contrasted in Figure 13.5.

Figure 13.5. The read()/write() vs. mmap() Methods for File I/O


13.4.1. Memory Mapped I/O

A request to memory-map a file into an address space is handled by the file system vnode method vop_map() and the seg_vn memory segment driver (see “The seg_map Segment”). A process requests that a file be mapped into its address space. Once the mapping is established, the address space represented by the file appears as regular memory and the file system can perform I/O by simply accessing that memory.

Memory mapping of files hides the real work of reading and writing the file because the seg_vn memory segment driver quietly works with the file system to perform the I/Os without the need for process-initiated system calls. I/O is performed, in units of pages, upon reference to the pages mapped into the address space; reads are initiated by a memory access; writes are initiated as the VM system finds dirty pages in the mapped address space.

The system call mmap() calls the file system for the requested file with the vop_map() vnode method. In turn, the file system calls the address space map function for the current address space, and the mapping is created. The protection flags passed into the mmap() system call are reduced to the subset allowed by the file permissions. If mandatory locking is set for the file, then mmap() returns an error.
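The sequence above can be sketched at user level. This is a minimal, hedged example (the file path and function name are illustrative, not from the source): the file is written with write(), then read back purely through a mapping, so the loads from the mapped address perform the I/O with no further system calls.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write a file with write(), then read it back through a mapping;
 * ordinary loads from 'p' fault the pages in, with no read() calls. */
int mmap_read_demo(const char *path)
{
    const char msg[] = "hello, mapped file";
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, msg, sizeof msg) != (ssize_t)sizeof msg)
        return -1;

    char *p = mmap(NULL, sizeof msg, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    int ok = (strcmp(p, msg) == 0);   /* memory access does the file I/O */
    munmap(p, sizeof msg);
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Note that requesting PROT_WRITE on a MAP_SHARED mapping of a file opened read-only would fail, which is the user-visible effect of the protection-flag reduction described above.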

Once the file mapping is created in the process's address space, file pages are read when a fault occurs in the address space. A fault occurs the first time a memory address within the mapped segment is accessed because at this point, no physical page of memory is at that location. The memory management unit causes a hardware trap for that memory segment; the memory segment calls its fault function to handle the I/O for that address. The segvn_fault() routine handles a fault for a file mapping in a process address space and then calls the file system to read in the page for the faulted address, as shown below.

segvn_fault(hat, seg, addr, len, type, rw) {

        for ( page = all pages in region ) {

                advise = lookup_advise(page);   /* Look up madvise settings for page */
                if (advise == MADV_SEQUENTIAL)
                        free_all_pages_up_to(page);

                /* segvn will read at most 64k ahead */
                if (len > PVN_GETPAGE_SZ)
                        len = PVN_GETPAGE_SZ;

                vp = segvp(seg);
                vp_off = segoff(seg);

                /* Read 64k at a time if the next page is not in memory,
                 * else just a page
                 */
                if (hat_probe(addr + PAGESIZE) == TRUE)
                        len = PAGESIZE;

                /* Ask the file system for the next len bytes of pages */
                VOP_GETPAGE(vp, vp_off, len, &vpprot, plp, plsz,
                        seg, addr + (vp_off - off), arw, cred);
        }
}

For each page fault, seg_vn reads in an 8-Kbyte page at the fault location. In addition, seg_vn initiates a read-ahead of the next eight pages at each 64-Kbyte boundary. Memory mapped read-ahead uses the file system cluster size (used by the read() and write() system calls) unless the segment is mapped MA_SHARED or memory advice MADV_RANDOM is set.
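The length calculation in the pseudocode can be distilled into a small standalone function. This is a sketch under stated assumptions: PAGESIZE and PVN_GETPAGE_SZ values are taken from the text (8-Kbyte pages, 64-Kbyte clusters), and the function name and the next_page_resident parameter (modeling the hat_probe() check) are illustrative.

```c
#include <stddef.h>

#define PAGESIZE        8192                 /* 8-Kbyte page, as in the text */
#define PVN_GETPAGE_SZ  (8 * PAGESIZE)       /* 64-Kbyte read cluster */

/* How much to read for one fault: a single page if the neighbouring page
 * is already resident (the hat_probe() case), otherwise up to a 64-Kbyte
 * cluster. 'remaining' is the number of bytes left in the faulting region. */
size_t fault_read_len(size_t remaining, int next_page_resident)
{
    if (next_page_resident)
        return PAGESIZE;
    return remaining > PVN_GETPAGE_SZ ? PVN_GETPAGE_SZ : remaining;
}
```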

Recall that you can provide paging advice for the pages within a memory-mapped segment by using the madvise() system call. As the segvn_fault() pseudocode shows, this advice is consulted on each fault to decide when to free pages behind the current position as the file is read.

Modified pages remain unwritten to disk until the fsflush daemon passes over them, at which point they are written out. You can also use the memcntl() system call to initiate a synchronous or asynchronous write of pages.
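The same effect can be sketched portably. memcntl() is Solaris-specific; msync() is the POSIX analogue used here, with MS_ASYNC scheduling a write-back and MS_SYNC writing and waiting. The file path and function name are illustrative.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dirty a mapped page, schedule an asynchronous write-back, then force
 * a synchronous one, rather than waiting for the flush daemon. */
int flush_demo(const char *path)
{
    long pg = sysconf(_SC_PAGESIZE);
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, pg) != 0)
        return -1;

    char *p = mmap(NULL, (size_t)pg, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 'x';                                   /* dirty the page */
    if (msync(p, (size_t)pg, MS_ASYNC) != 0)      /* queue the write */
        return -1;
    p[1] = 'y';
    if (msync(p, (size_t)pg, MS_SYNC) != 0)       /* write and wait */
        return -1;

    munmap(p, (size_t)pg);
    close(fd);
    unlink(path);
    return 0;
}
```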

13.4.2. read() and write() System Calls

The vop_read() and vop_write() vnode methods implement reading and writing for the read() and write() system calls. As shown in Figure 13.5, the seg_map segment driver maps the file into the kernel's address space during these calls. The seg_vn driver could be used for this mapping; however, seg_vn is a complex segment driver that handles all of the process address space requirements (such as mapping protections, copy-on-write fault handling, and shared memory), so a lighter-weight driver, seg_map, performs the mapping instead. The read() and write() paths require only a few basic mapping functions because they do not map files into the process's address space; instead, during each system call they copy data between the process and the portion of the file that seg_map has mapped into the kernel's address space. The lighter-weight seg_map driver improves performance through a shorter code path and reduced locking complexity.
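The map-then-copy structure of this read path can be sketched at user level. This is a model, not the kernel implementation: the function name is illustrative, and the mapping here lives in the process rather than the kernel, but the steps (map the file range, copy to the caller's buffer, unmap) mirror the description above.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* A read() built on a mapping: map the relevant range of the file (as
 * the kernel's seg_map driver does), copy into the caller's buffer,
 * then drop the mapping. */
ssize_t map_and_copy_read(int fd, void *buf, size_t len, off_t off)
{
    long pg = sysconf(_SC_PAGESIZE);
    off_t aligned = off & ~((off_t)pg - 1);   /* mmap offset must be aligned */
    size_t delta = (size_t)(off - aligned);

    char *p = mmap(NULL, len + delta, PROT_READ, MAP_SHARED, fd, aligned);
    if (p == MAP_FAILED)
        return -1;

    memcpy(buf, p + delta, len);              /* the "copyout" step */
    munmap(p, len + delta);
    return (ssize_t)len;
}
```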

13.4.3. The seg_map Segment

The seg_map segment maintains mappings of pieces of files into the kernel address space and is used only by the file systems. Every time a read or write system call occurs, the seg_map segment driver locates or creates a virtual address space where the page of the file can be mapped. Then, the system call can copy the data to or from the user address space.

The seg_map segment provides a full set of segment driver interfaces (see “Memory Segments”); however, the file system directly uses a small subset of these interfaces without going through the generic segment interface. The subset handles the bulk of the work that is done by the seg_map segment for file read and write operations. The functions used by the file systems are shown in Table 13-8.

Table 13-8. seg_map Functions Used by the File Systems
Function Name Description
segmap_getmap(), segmap_getmapflt() Retrieves or creates a mapping for a range of the file at the given offset and length.
segmap_release() Releases the mapping for a given file at a given address.
segmap_pagecreate() Creates new page(s) of memory and slots in the seg_map segment for a given file. Used for extending files or writing to holes during a write.
segmap_pageunlock() Unlocks pages in the segment that was locked during segmap_pagecreate().

At any time, the seg_map segment has some portion of the total file system cache mapped into the kernel address space. The maximum size of the seg_map segment differs among hardware architectures, is often only a fraction of the total physical memory size, and contains only a small proportion of the total file system cache. Note that even though the size of the seg_map segment is fixed, the pages that it references can be stolen by the page scanner; as a result, only a portion of the seg_map segment may be resident (especially on machines where the seg_map segment approaches the size of physical memory).

A single seg_map segment is created at boot time. The segment is sized according to a table of machine types (see Table 13-9) and is capped at the amount of free memory just after the kernel has booted. For example, on an Ultra-1 (sun4u architecture), the maximum size of seg_map is 256 Mbytes. A 64-Mbyte machine booting the Solaris kernel will most likely end up with a seg_map segment of about 50 Mbytes, since that is roughly the amount of free memory remaining at that point in boot. A 512-Mbyte sun4u system will have a seg_map size of the full 256 Mbytes, since free memory is much larger than 256 Mbytes during boot.

Table 13-9. Architecture-Specific Sizes of Solaris 7 seg_map Segment
Architecture Systems Maximum Size of seg_map
sun4c SPARC 1, 2 4 Mbytes
sun4m SPARC 5, 10, 20 16 Mbytes
sun4d SPARC 1000,2000 32 Mbytes
sun4u UltraSPARC 256 Mbytes

We can take a look at the seg_map segment on a running system by using adb with the $seg macro, as shown below.

# adb -k /dev/ksyms /dev/mem
physmem 3b73

segkmap/J
segkmap:
segkmap:        3000022df50

3000022df50$<seg

3000022df50:    base            size            as
                2a750000000     7432000         104236e0
3000022df68:    next            prev            ops
                104234a0        3000022df88     segmap_ops
3000022df80:    data
                300001b1d68

We can see that on this system, the segkmap segment has been created at boot as 0x7432000 bytes, or 121,839,616 bytes. This system was a 128-Mbyte Ultra-1, and we can see that free memory was smaller than the 256-Mbyte maximum segment size for the sun4u architecture. Hence, the segment was created at whatever the size of free memory was at that point. Once segkmap is created, the segment interfaces are called directly from the file system code during the read and write operations.

The seg_map segment driver divides the segment into block-sized slots that represent blocks in the files it maps. The seg_map block size for the Solaris kernel is 8,192 bytes. A 128-Mbyte segkmap segment, for example, is divided into 16,384 slots (128 Mbytes / 8 Kbytes). The seg_map segment driver maintains a hash list of its page mappings so that it can easily locate existing blocks; the list is keyed on vnode and offset. One list entry exists for each slot in the segkmap segment. The structure for each slot in a seg_map segment is defined in the <vm/seg_map.h> header file, shown below.

Example. Header File <vm/seg_map.h>
/*
 * Each smap struct represents a MAXBSIZE-sized mapping to the
 * <sm_vp, sm_off> given in the structure. The location of
 * the structure in the array gives the virtual address of the
 * mapping. Structure rearranged for 64-bit sm_off.
 */
struct  smap {
        struct  vnode   *sm_vp;         /* vnode pointer (if mapped) */

        /*
         * These next 3 entries can be coded as
         * ushort_t's if we are tight on memory.
         */
        struct  smap    *sm_hash;       /* hash pointer */
        struct  smap    *sm_next;       /* next pointer */
        struct  smap    *sm_prev;       /* previous pointer */
        u_offset_t      sm_off;         /* file offset for mapping */

        ushort_t        sm_bitmap;      /* bitmap for locked translations */
        ushort_t        sm_refcnt;      /* reference count for uses */
};

struct  smfree {
        struct  smap    *sm_free;       /* free list array pointer */
        kmutex_t        sm_mtx;         /* protects smap data of this color */
        kcondvar_t      sm_free_cv;
        ushort_t        sm_want;        /* someone wants a slot of this color */
};

The smap structure fields are:

  • sm_vp — The file (vnode) this slot represents (if slot not empty)

  • sm_hash, sm_next, sm_prev — Hash list reference pointers

  • sm_off — The file (vnode) offset for a block-sized chunk in this slot in the file

  • sm_bitmap — Bitmap to maintain translation locking

  • sm_refcnt — The number of references to this mapping caused by concurrent reads

The important fields in the smap structure are the file and offset fields, sm_vp and sm_off. These fields identify which page of a file is represented by each slot in the segment.
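A toy model can make the lookup concrete. This is a sketch, not kernel code: the names, the tiny table size, and the integer file_id (standing in for the vnode pointer) are all illustrative. A lookup that finds a valid slot with matching file and offset corresponds to a reclaim; a lookup that evicts a stale slot corresponds to a reuse.

```c
#define NSLOTS   16          /* toy table: a real segkmap has thousands */
#define SMAP_BSZ 8192        /* block size, as in seg_map */

struct slot {
    int  file_id;            /* stands in for the vnode pointer */
    long off;                /* file offset of the mapped block */
    int  valid;
};

static struct slot slots[NSLOTS];

static unsigned slot_hash(int file_id, long off)
{
    return ((unsigned)file_id * 31u + (unsigned)(off / SMAP_BSZ)) % NSLOTS;
}

/* Return the slot for <file, offset>, reporting whether an existing
 * mapping was reclaimed (hit) or a stale slot was reused (miss). */
struct slot *slot_lookup(int file_id, long off, int *reclaimed)
{
    struct slot *s = &slots[slot_hash(file_id, off)];
    *reclaimed = (s->valid && s->file_id == file_id && s->off == off);
    if (!*reclaimed) {               /* reuse: evict and remap */
        s->file_id = file_id;
        s->off = off;
        s->valid = 1;
    }
    return s;
}
```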

We can observe the seg_map slot activity with the kstat statistics that are collected for the seg_map segment driver. These statistics are visible with the netstat command, as shown below.

# netstat -k segmap
segmap:
fault 8366623 faulta 0 getmap 16109564 get_use 11723 get_reclaim 15257790
get_reuse 825178 get_unused 0 get_nofree 0 rel_async 710244 rel_write 749677
rel_free 16370 rel_abort 0 rel_dontneed 709733 release 15343517 pagecreate 1009281

Table 13-10 describes the segmap statistics.

Table 13-10. Statistics from the seg_map Segment Driver
Field Name Description
fault The number of times segmap_fault was called, usually as a result of a read or write system call.
faulta The number of times the segmap_faulta function was called. It is called to initiate asynchronous paged I/O on a file.
getmap The number of times the segmap_getmap function was called. It is called by the read and write system calls each time a read or write call is started. It sets up a slot in the seg_map segment for the requested range on the file.
get_use The number of times getmap found an empty slot in the segment and used it.
get_reclaim The number of times getmap found a valid mapping for the file and offset already in the seg_map segment.
get_reuse The number of times getmap deleted the mapping in a nonempty slot and created a new mapping for the file and offset requested.
get_unused Not used—always zero.
get_nofree The number of times a request for a slot was made and none was available on the internal free list of slots. This number is usually zero because each slot is put on the free list when release is called at the end of each I/O. Hence, ample free slots are usually available.
rel_async The slot was released with a delayed I/O on it.
rel_write The slot was released as a result of a write system call.
rel_free The slot was released, and the VM system was told that the page may be needed again but to free it and retain its file/offset information. These pages are placed on the cache list tail so that they are not the first to be reused.
rel_abort The slot was released and asked to be removed from the seg_map segment as a result of a failed or aborted write.
rel_dontneed The slot was released, and the VM system was told to free the page because it won't be needed again. These pages are placed on the cache list head so they will be reused first.
release The slot was released, and the release was not accounted for by rel_abort, rel_async, or rel_write.
pagecreate Pages created in the segmap_pagecreate function.

Our example segmap statistics show that a slot was reclaimed 15,257,790 times out of a total of 16,109,564 getmap calls, a 95% slot reuse with the correct file and offset, or a 95% cache hit ratio for file system pages in segmap. Note that the actual page cache hit ratio may be higher: even on a segmap miss, the pages may still be in the page cache, requiring only that the address translations for the page be reloaded. A lower segmap hit ratio combined with a high page cache hit ratio is typical of large-memory machines, on which segmap is limited to 256 Mbytes of what may be gigabytes of physical memory.
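The hit-ratio arithmetic above is simply the fraction of getmap calls satisfied by a reclaim; a one-line helper (name illustrative) makes the calculation explicit.

```c
/* Fraction of getmap calls satisfied by an existing mapping (get_reclaim). */
double segmap_hit_ratio(unsigned long get_reclaim, unsigned long getmap)
{
    return getmap ? (double)get_reclaim / (double)getmap : 0.0;
}
```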

Writing is a similar process. Again, segmap_getmap is called to retrieve or create a mapping for the file and offset, the I/O is done, and the segmap slot is released. An additional step is involved if the file is being extended or a new page is being created within a hole of a file: segmap_pagecreate is called to create and lock the new pages, and segmap_pageunlock() is then called to unlock the pages that were locked during segmap_pagecreate.

The segmap cache can grow and shrink as pages are paged in and out and as pages are stolen by the page scanner, but the maximum size of the segmap cache is capped at an architecture-specific limit.
