Appendix A

Computer Virus Basics

This appendix is intended to give a quick overview of malicious software for those who are not already familiar with the subject. The origins of malicious software are described, followed by a discussion of competing definitions for a virus and a worm. The structure of a simple IBM PC virus is then provided. The appendix concludes by dispelling a common misconception about how viruses and Trojans gain control from their hosts. It is shown that viruses and Trojans can utilize string matching to randomize the location of the jump instruction in the host that sends control to the malware.

A.1 Origins of Malicious Software

The origin of the modern computer virus can be traced back to 1949, when John von Neumann presented lectures that encompassed the theory and organization of complicated automata [310]. Neumann postulated that a computer program could reproduce itself. Bell Laboratories employees eventually gave life to Neumann's theory in the 1950s in a game dubbed Core Wars. In this game, two programmers would unleash software organisms and watch as the programs attempted to lay claim to the address space in which they fought. The Core Wars were described in a May 1984 issue of Scientific American [91]. Ken Thompson, winner of the prestigious ACM Turing Award, mentioned in his Turing Award lecture that he had experimented with self-replicating code as an undergraduate [300]. At that time he had challenged himself to write the smallest self-replicating program possible in the usual vehicle, the FORTRAN programming language.

The notion of a backdoor in software, as we know it today, can be traced back to the late 1960s. The concept of multiuser operating systems grew out of the need to make efficient use of expensive computing machinery. Prior to this, physical controls could be used to maintain security of batch processing machines, yet the effectiveness of such controls began to wane as soon as programs from different users began sharing memory on a single computer.

It was when the military began utilizing multiuser and networked operating systems that security issues surrounding these systems came to a head. The discussion of computer subversion was addressed in an article by Petersen and Turn in the 1967 AFIPS Conference [223]. The question of security control in resource sharing systems was the focus of a series of events in 1967. The Advanced Research Projects Agency was asked to form a task force to study and recommend hardware and software safeguards to protect classified information in multiaccess resource sharing computer systems. The task force contained a technical panel that included, among others, James P. Anderson and Daniel J. Edwards. The report of the task force was printed in 1970 and published by the Rand Corporation under ARPA sponsorship [321]. The RAND report identifies a class of active infiltration that utilize “trap-door” entry points into the system to bypass the security facilities and permit direct access to data.

The notion of a Trojan horse was identified by Daniel J. Edwards1 and was described as a “Trojan horse” in the Anderson report in 1972 [6]. One of the earliest Trojans was a segment of binary code that was inserted into Multics binary code and was distributed to all sites. Paul Karger and Roger Schell described this Trojan in 1974 [151].

In 1982, J. F. Shoch and J. A. Hupp published their work on computer worms [271]. While at the Xerox Palo Alto Research Center (PARC) they investigated the use of worms for the purposes of performing distributed computation over a network. They called their program a worm in honor of the computer tapeworm described in the science fiction novel The Shockware Rider [43], a cyberpunk novel that predates the term cyberpunk. These worms adhered to a prespecified protocol that enabled them to move from machine to machine without user intervention. They were programmed to look for idle machines in order to exploit free CPU power and were designed to control their numbers.

Scattered reports indicate that a computer virus spread on Apple computers as early as 1981. Also, numerous sources indicate that a virus called the Elk Cloner was written in 1983 [129]. This virus became well known in the Apple community and spread on Apple Computer's DOS 3.3 operating system. Fred Cohen did a considerable amount of the initial research on computer viruses in the early and mid 1980s [66, 67, 69]. His doctoral dissertation covered virus theory and his faculty advisor, Professor Leonard Adleman (co-inventor of the RSA algorithm), also worked on the problem.

A.2 Trojans, Viruses, and Worms: What Is the Difference?

Researchers have proposed competing definitions for what a worm is and what a virus is. For example, McAfee and Haynes refer to the infamous self-replicating program that Robert T. Morris, Jr. wrote as a virus.2 Morris was a graduate student at Cornell University at the time, and his creation spread like wildfire across the Internet/Arpanet in November of 1988. McAfee and Haynes define a virus to be a computer program that is created to infect other programs with copies of itself and that has the ability to clone itself. They define a worm to be a program that can burrow its way into systems to manipulate, destroy, or alter data, and that does not contain instructions to replicate. In fact, they even state that worms cannot replicate.3

In contrast, Peter J. Denning referred to the Morris program as the Internet Worm [88]. Denning stated that some media reports mistakenly called the invading program a virus rather than a worm. He defined a virus to be a code segment that embeds itself inside a legitimate program, that is activated when the program is activated, and that then embeds another copy of itself in another legitimate but uninfected program.

To shed light on this apparent dichotomy, the exact nature of the Morris program must be considered. The Morris program had three different infection vectors [88]. Whenever an infection vector succeeded it established a connection with a remote shell and fed the shell a 99-line bootstrap program together with the commands that were needed to compile and execute it. If everything worked, the rest of the worm was sent to the new machine. The parent worm then issued commands to construct a new worm from the delivered pieces. The McAfee-Haynes definition of a worm is more in line with the beneficial worms that Shoch and Hupp researched. Under this definition, a worm does not contain explicit instructions to replicate and hence adheres to the McAfee-Haynes notion of a worm.

It is perhaps safest to say that there is no single correct definition of a worm. The research at XEROX showed that worms could be quite useful for performing distribute computations when they are carefully controlled. News agencies like to go hog wild whenever a rapidly spreading program hits the Internet, and so this has cast worms with malicious propagation vectors in a negative light.

Rather than making the distinction between a virus and a worm contingent upon the insertion of a program's own code into a host program, we prefer to make the distinction contingent upon whether or not user intervention is required for replication. We prefer the following definition of worm.

Definition 5 A computer worm is a program that causes possibly altered copies of itself to be created without any user intervention.

We use the possibly altered language to account for the fact that programs can change their own code (such as polymorphic viruses). We prefer the following definition of a virus.

Definition 6 A computer virus is a program that when triggered by an action of the user, causes possibly altered copies of itself to be created.

The beneficial XEROX programs and the Morris program of '88 are worms under these definitions. Also, many e-mail “worms” are viruses under these definitions. This holds whenever the user has to open his or her inbox or open a specific e-mail for the virus to propagate.

The biological analogs of computer viruses and worms form a possible justification for our definitions. A worm will crawl beneath the earth of its own volition. In order for the common cold virus to spread, a host has to place him or herself within the proximity of another potential host and then cough, sneeze, and so forth. This is analogous to a virus that will only spread if the user types in the name of the infected program and then hits the carriage return or if the user double-clicks on the icon of the infected program, and so on. We are not arguing that our definitions are any better or worse than any other definitions. They just seem to make a good deal of sense.

An important aspect to our definition is that it says nothing about computer networks. If a program can spread on its own from program to program on a single machine while the user is asleep, then the program may be justifiably referred to as a worm.

Fortunately, no one would argue about the definition of a Trojan horse program. A Trojan horse program does not replicate. If it replicated, then it would be a virus or a worm.

Definition 7 A Trojan horse is a program that does not replicate and that performs something that the user does not intend the infected program to do.

As shown in Chapter 9, a Trojan horse can be as simple as an exploitable bug that is deliberately placed in a program. Some exploitable bugs such as buffer overruns or improperly handled race conditions can appear to be honest programming errors.

A.3 A Simple DOS COM Infector

An example will go a long way to show how a virus spreads on a typical disk operating system. Although admittedly outdated, a good example is a virus that infects MSDOS COM files. A COM file is an executable MSDOS program that has the .com extension at the end of the executable's file name. The reason why these programs are so simple to infect is that COM files are direct memory images of programs. The format of a COM file is as follows. It must be less than 64 kilobytes and the first instruction begins at offset 100h (256 bytes). The first 256 bytes can be all zeros, for instance.

When a COM file is executed, DOS loads what is called the program segment prefix (PSP) at the first available segment in memory. This is a data structure that is 256 bytes long and contains bookkeeping information for DOS. This bookkeeping information may overwrite the bytes that were there when the COM file was initially copied from the disk into random access memory. Immediately following the program segment prefix the code for the COM file begins, so MSDOS sends control to offset of 100h to execute the COM program. COM files must be small enough to fit within one segment, thereby limiting their maximum length to be just short of 64 kilobytes. When the COM file is finished running, it can send control back to MSDOS by executing the int 20h interrupt instruction, or it can execute the ret instruction that employs a return address on the stack.

The simplest form of virus that does not destroy data in its host is an appending virus. An appending COM virus stores virtually all of its code immediately following the last byte in the COM file. This code is called the virus body. Simply putting this code there will not allow the virus to gain control when the host COM file is executed by the operating system. So, the virus overwrites the first few executable bytes of the COM file with an unconditional jump that sends control to the beginning of the appended virus. The overwritten bytes are not destroyed, however. Instead, they are carefully saved within the virus. When the virus is finished executing, the executable bytes in memory that were overwritten are restored in memory. Control is then sent to the first executable instruction in the COM file. The COM file is then in its original form, save for the virus hanging out just beyond the last byte of the COM file. This is illustrated in Figure A.1. A COM infector must make certain that its operation does not modify the program segment prefix before sending control to the host. Otherwise, the host program might not operate correctly.

The simplest way to prevent a virus form infecting a host multiple times is to design it to detect itself. This can be accomplished by choosing a 20-byte value IDENTIFIER uniformly at random and then including it in the virus as a data constant. The virus can then detect itself by performing string matching using these 20 bytes. Since the identifier is 160 bits long, mismatches should occur with negligible probability. The downside is that this gives antiviral programs an excellent way to identify the virus as well.

Many viruses use tiny identification strings to let the virus detect itself. Sometimes it is a small string at a fixed offset that is used for identification. At the very least this could lead to false virus identifications. At the worst it could lead to programming errors. As the virus is developed its offsets will inevitably change. This could cause the virus to go out of control and escape the author's machine unexpectedly. This problem does not occur with a large randomly chosen identifier.

It is easy to embed a constant within assembly code. This can be accomplished by placing an unconditional jump instruction just before the constant that sends control immediately following the constant. To utilize the constant, a label can be placed just before it. This approach may be used in this example virus. The data that it jumps over is the value IDENTIFIER and the value HOSTINSTR. The value HOSTINSTR is the bytes that were overwritten at the beginning of the host COM file. They are stored as a constant so that the virus can repair the host at runtime prior to sending control to it. An example Intel assembly language implementation of this method is given below.

images

Figure A.1 Simple appending COM virus

images

One thing that a virus typically needs to do is get a pointer to itself at run-time so that it can copy itself. A common approach is to call a subroutine and then obtain the return address of the subroutine from the stack.4 This is not always necessary.

Suppose that it is known that the size of the virus can be stored within a two-byte quantity. In this case the size of the virus can be guessed and used within the source with the knowledge that it is probably incorrect. The true size of a virus can be found after it is assembled. This value can then be fed back into the virus by copying it into the assembly source. This way the virus source knows how large the virus is. An MSDOS call exists that lets a virus learn the size of a COM file. An appending virus can then get its offset by subtracting the size of the virus from the size of the COM file.

The following is an overview of the execution of a COM file P1 that is infected with this simple appending COM virus.

  1. P1 begins executing. The jump instruction at offset 100h sends control to the virus at the end of P1.
  2. The virus chooses a COM file P2 uniformly at random from all COM files on the computer.
  3. The virus opens P2 for reading and writing. If this fails, control is sent to step 10.
  4. If P2 contains the string IDENTIFIER, control is sent to step 10.
  5. If P2 does not have enough room for the virus, control is sent to step 10.
  6. The virus copies itself to the end of the last byte of P2 in RAM.
  7. The virus stores the first three or four bytes starting at offset 100h in P2 within the new copy of the virus.
  8. The virus overwrites the first three or four bytes in P2 that start at offset 100h with a jump instruction that sends control to the new virus.
  9. The virus writes the changes to P2 to the disk and closes P2.
  10. The virus repairs the first 3 or 4 bytes of code in P1 that start at offset 100h in RAM using the stored values.
  11. The virus sends control to offset 100h in P1.

The operations of opening files, closing files, reading files, writing files, and so on, are standard in any disk operating system. In older versions of MS-DOS these functions are called by invoking interrupt 21h. This interrupt is reserved for DOS function calls. The standard way to invoke a DOS call is to load the Intel CPU register AH with a function value and then invoke the int 21h instruction. Table A.1 shows the function values needed to implement this simple COM infector.

For example, the following assembly code closes a file.

images

It loads the 8-bit Intel CPU register AH with the 8-bit quantity 3Eh and then invokes interrupt 21h. The DOS interrupt reads the value in AH and determines that the program wishes to perform the close file operation.

A.4 Viruses Don't Have to Gain Control Before the Host

It is surprising how many people assume that viruses have to get control before their hosts. This is clearly the easiest way to implement a virus, but it is by no means the only way. A virus can search the host for a small set of code with known functionality and then replace it with a jump to the virus. An experimental Macintosh virus dubbed the SysBeep virus did exactly this.5 SysBeep is a Macintosh OS call that takes a single argument, an integer indicating the duration of the system beep as measured in ticks. The SysBeep call can be implemented using the “move” instruction followed by the trap for SysBeep. The move instruction pushes an integer onto the stack. Address register A7 is the stack pointer. The trap instruction is two bytes and has the hexadecimal value A9C8. Only these two instructions are needed to invoke a system beep and as it turns out these two instructions were extremely common for making the call.

images

Table A.1 Old MSDOS calls

The following is an example of a call to SysBeep. It pushes the 16 bits of data stored in data register D0 onto the stack in the MOVE.W instruction. The mnemonic _SysBeep is a call to the Macintosh OS SysBeep call. This is a debugging listing that demonstrates the size of the instructions. The semicolon is used to indicate a comment that extends to the end of the line that it is on.

images

Below is a different debugging listing that shows the actual bytes of the instructions, that being 3F2E0008A9C8.

images

The machine he was using was a Macintosh SE/30 that used the Motorola 68030 microprocessor. It is a complex instruction set computer (CISC), and so the instructions varied in size. The move instruction in the first listing above is 2 bytes long. The move instruction in the second listing is 4 bytes long.

A search algorithm was implemented that looked for a move word instruction followed by A9C8. The search algorithm took into account the fact that the move word instruction may vary in size. When found, the virus safely stored these two instructions within the body of the virus. It then replaced these two instructions with an unconditional jump to the body of the virus, adding padding bytes if necessary. Just before sending control back to the host, the virus repaired the two instructions using the two saved instructions. It then sent control to the first of the two instructions. The virus was written entirely in ANSI C.

The more general lesson is that a virus can search for any small set of instructions and overwrite them with an unconditional jump to the virus. This can be accomplished by choosing an offset into the code uniformly at random, searching the code for the small set of instructions from there on, and the wrapping around to the beginning of the code if one is not found when the end is reached. The search terminates when either the small set of instructions is reached, or the search leads back to the randomly chosen offset. This way, the small set of instructions that are replaced are chosen randomly from all possible sets of instructions that can be replaced.

Great care must be taken, for suppose that an “if” statement that occurs before the overwritten code sends control into the middle of the overwritten code. In this case, the virus might not have a chance to repair the code before the host sends control to it. This will more than likely cause a crash or bus error. The problem is that of atomicity. A virus can heuristically check that no jump instructions send control to the middle of the overwritten code.

This approach is a countermeasure to generic decryption. Recall that in generic decryption, the first several thousand instructions of a program are emulated to try to get a polymorph to decrypt itself if present. If a polymorphic virus gains control before the host and decrypts itself, then generic decryption should succeed as well as subsequent string-matching algorithms that are performed. By randomizing the location of the instruction that jumps to the virus, there is a good chance that the virus will slip by the generic decryption engine.

1From the NSA.

2See page 80 of [185].

3See page 29 of [185].

4This is so common that it is used as a virus detection heuristic.

5Experimentally demonstrated at Yale in the early 1990s by the first author and Matt Hastings.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.133.12.172