Chapter 4
Although software assurance is more than just writing secure code, writing secure code is a critical component of ensuring the resiliency of software security controls. Reports in full disclosure and security mailing lists are evidence that software written today is rife with vulnerabilities that can be exploited. A majority of these weaknesses can be attributed to insecure software design and/or implementation, and it is vitally important that software first and foremost be reliable and, second, be less prone to attack and more resilient when it is attacked. Successful hackers today are individuals who have a thorough understanding of programming. It is therefore imperative that software developers who write code also have a thorough understanding of how their code can be exploited, so that they can effectively protect their software and data. Today’s security landscape calls for software developers who additionally have a security mindset. This chapter will cover the basics of programming concepts, delve into common software coding vulnerabilities and defensive coding techniques and processes, cover code analysis and code protection techniques, and finally discuss build environment security considerations that are to be factored into the software.
As a CSSLP, you are expected to
This chapter will cover each of these objectives in detail. It is imperative that you fully understand the objectives and be familiar with how to apply them in the software that your organization builds.
Although it may seem that the responsibility for insecure software lies primarily on the software developers who write the code, opinions vary, and the debate on who is ultimately responsible for a software breach is ongoing. Holding the coder solely responsible would be unreasonable since software is not developed in a silo. Software has many stakeholders, as depicted in Figure 4.1, and eventually all play a crucial role in the development of secure software. Ultimately, it is the organization (or company) that will be blamed for software security issues, and this state cannot be ignored.
Who is a programmer? What is their most important skill? A programmer is essentially someone who uses technical know-how and skills to solve problems that the business has. The most important skill a programmer (used synonymously with a coder) has is problem solving. Programmers use their skills to construct business problem-solving programs (software) to automate manual processes, improving the efficiency of the business. They use programming languages to write programs. In the following section, we will learn about computer architecture, types of programming languages and code, and program utilities, such as assemblers, compilers, and interpreters.
Most modern-day computers are primarily composed of the computer processor, system memory, and input/output (I/O) devices. Figure 4.2 depicts a simplified illustration of modern-day computer architecture.
The computer processor is more commonly known as the central processing unit (CPU). The CPU is made up of the
Because CPU registers have only limited memory space, memory is augmented by system memory and secondary storage devices, such as the hard disks, digital video disks (DVDs), compact disks (CDs), and USB keys/fobs. The system memory is also commonly known as random access memory (RAM). The RAM is the main component with which the CPU communicates. I/O devices are used by the computer system to interact with external interfaces. Some common examples of input devices include a keyboard, mouse, etc., and some common examples of output devices include the monitor, printers, etc. The communication between each of these components occurs via a gateway channel that is called the bus.
The CPU, at its most basic level of operation, processes data based on binary codes that are internally defined by the processor chip manufacturer. These instruction codes are made up of several operational codes called opcodes, which tell the CPU what functions to perform. When a software program runs, the CPU reads the instruction codes and data stored in the computer system memory and performs the intended operation on the data. The instructions and data must first be loaded into system memory from an input device or a secondary storage device. Once this happens, the CPU performs the following four functions for each instruction:
The fetch–decode–execute–store cycle is also known as the machine cycle. A basic understanding of this process is necessary for a CSSLP because they need to be aware of what happens to the code written by a programmer at the machine level.
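The machine cycle can be illustrated with a toy simulator. The following is a minimal sketch, not a model of any real processor: the opcodes (LOAD, ADD, STORE, HALT), the single accumulator, and the memory layout are all invented for illustration.

```java
import java.util.Arrays;

/** Toy illustration of the fetch, decode, execute, store cycle. */
final class ToyMachineCycle {
    // Hypothetical opcodes for this sketch only; real opcodes are defined by the chip manufacturer.
    static final int HALT = 0;  // HALT       : stop the program
    static final int LOAD = 1;  // LOAD addr  : accumulator = memory[addr]
    static final int ADD = 2;   // ADD addr   : accumulator += memory[addr]
    static final int STORE = 3; // STORE addr : memory[addr] = accumulator

    /** Runs the program held in memory as (opcode, operand) pairs and returns the memory. */
    static int[] run(int[] memory) {
        int pc = 0;  // program counter: address of the next instruction
        int acc = 0; // accumulator register
        while (true) {
            // Fetch the instruction and its operand from memory.
            int opcode = memory[pc];
            int operand = memory[pc + 1];
            pc += 2;
            // Decode the opcode, then execute it; STORE writes the result back to memory.
            switch (opcode) {
                case LOAD: acc = memory[operand]; break;
                case ADD: acc += memory[operand]; break;
                case STORE: memory[operand] = acc; break;
                case HALT: return memory;
                default: throw new IllegalStateException("Unknown opcode " + opcode);
            }
        }
    }

    public static void main(String[] args) {
        // Program occupies addresses 0..7; data values 2 and 3 sit at 8 and 9; the result goes to 10.
        int[] memory = {LOAD, 8, ADD, 9, STORE, 10, HALT, 0, 2, 3, 0};
        System.out.println(Arrays.toString(run(memory)));
    }
}
```

Note that the program and its data share the same memory array, which is why corrupting data in memory can alter what the processor executes.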
When the software program executes, the program allocates storage space in memory so that the program code and data can be loaded and processed as the programmer has intended it. The CPU registers are used to store the most immediate data; the compilers use the registers to cache frequently used function values and local variables that are defined in the source code of the program. However, since there are only a limited number of registers, most programs, especially the large ones, place their data values on the system memory (RAM) and use these values by referencing their unique addresses. Internal memory layout has the following segments: program text, data, stack, and heap, as depicted in Figure 4.3. Physically the stack and the heap are allocated areas on the RAM. The allocation of storage space in memory (also known as a memory object) is called instantiation. Program code uses the variables defined in the source code to access memory objects.
The series of execution instructions (program code) is contained in the program text segment. The next segment is the read–write data segment, which is the area in memory that contains both initialized and uninitialized global data. Function arguments, local data, and saved register values, such as the return address, are placed on the stack part of the RAM. The extended stack pointer (ESP) register holds the memory address of the top of the stack. Variable sized objects and objects that are too large to be placed on the stack are dynamically allocated on the heap part of the RAM. Heap memory can also be attacked, but for the most part with software, memory attacks on the stack are most prevalent.
The stack is an area of memory used to store function arguments and local variables, and it is allocated when a function in the source code is called to execute. When the function execution begins, space is allocated (pushed) on the stack, and when the function terminates, the allocated space is removed (popped off) the stack. These are known as the PUSH and POP operations. The stack is managed as a LIFO (last in, first out) data structure, which means that the most recently pushed data are the first to be popped. When a function is called, memory is allocated first in the higher addresses. The PUSH direction is from higher memory addresses to lower memory addresses, and the POP direction is from lower memory addresses to higher memory addresses. This is important to understand because the ESP moves from higher memory to lower memory addresses, and, without proper management, serious security breaches can result.
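The PUSH and POP directions described above can be sketched with a small simulation. This is an illustrative model only: the class name ToyCallStack and the array standing in for RAM are assumptions, and a real stack holds return addresses and saved registers as well as data.

```java
/** Toy LIFO stack that grows from higher addresses toward lower addresses. */
final class ToyCallStack {
    private final int[] memory; // stands in for the stack region of RAM
    private int sp;             // stack pointer: address of the most recently pushed value

    ToyCallStack(int size) {
        memory = new int[size];
        sp = size; // empty stack: the pointer sits just past the highest address
    }

    void push(int value) {
        if (sp == 0) {
            throw new IllegalStateException("stack overflow");
        }
        memory[--sp] = value; // PUSH moves the pointer toward lower addresses
    }

    int pop() {
        if (sp == memory.length) {
            throw new IllegalStateException("stack underflow");
        }
        return memory[sp++]; // POP moves the pointer back toward higher addresses
    }

    int stackPointer() {
        return sp;
    }

    public static void main(String[] args) {
        ToyCallStack stack = new ToyCallStack(8);
        stack.push(10);
        stack.push(20);
        System.out.println("sp after two pushes: " + stack.stackPointer());
        System.out.println("popped: " + stack.pop());
        System.out.println("sp after one pop: " + stack.stackPointer());
    }
}
```

The bounds checks in push and pop are the kind of management the text refers to; without them, a write could land outside the region reserved for the stack.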
Software hackers often have a thorough understanding of this machine cycle and how memory management happens, and without appropriate protection mechanisms in place, they can circumvent higher-level security controls by manipulating instruction and data pointers at the lowest level, as is the case with memory buffer overflow attacks and reverse engineering. These will be covered later in this chapter under the section about common software vulnerabilities and countermeasures.
Mastering all the processor instruction codes would be extremely onerous for a programmer, if it is even humanly possible. Even an extremely simple program would require the programmer to write lines of code that manipulate data using opcodes, and in a fast-paced day and age where speed of delivery is critically important for the success of business, software programs, like any other product, cannot take an inordinate amount of time to create. To ease the programmer’s effort and shorten the time to delivery of software, simpler programming languages that abstract the raw processor instruction codes have been developed. There are many programming languages that exist today.
Software developers use a programming language to create programs, and they can choose a low-level programming language. A low-level programming language is closely related to the hardware (CPU) instruction codes. It offers little to no abstraction from the language that the machine understands, which is binary codes (0s and 1s). When there is no abstraction and the programmer writes code in 0s and 1s to manipulate data and processor instructions, which is a rarity, they are coding in machine language. However, the most common low-level programming language today is the assembly language, which offers little abstraction from the machine language using opcodes. Appendix C has a listing of the common opcodes used in assembly language for abstracting processor instruction codes in an Intel 80186 or higher microprocessor (CPU) chip. Machine language and assembly language are both examples of low-level programming languages. An assembler converts assembly code into machine code.
In contrast, high-level programming languages (HLL) isolate program execution instruction details and computer architecture semantics from the program’s functional specification itself. High-level programming languages abstract raw processor instruction codes into a notation that the programmer can easily understand. The specialized notation with which a programmer abstracts low-level instruction codes is called the syntax, and each programming language has its own syntax. This way, the programmer is focused on writing code that addresses business requirements instead of being concerned with manipulating instruction and data pointers at the microprocessor level. This certainly makes software development simpler and the software program more easily understandable. It is, however, important to recognize that with the evolution of programming languages and integrated development environments (IDEs) and tools that facilitate the creation of software programs, even professionals lacking the internal knowledge of how their software program will execute at the machine level are now capable of developing software. This can be seriously damaging from a security standpoint because software creators may not necessarily understand or be aware of the protection mechanisms and controls that need to be developed and therefore inadvertently leave them out.
Today, the evolution of programming languages has given us goal-oriented programming languages that are also known as very high-level programming languages (VHLL). The level of abstraction in some of the VHLLs has been so increased that the syntax for programming in these VHLLs is like writing in English. Additionally, languages such as the natural language offer even greater abstraction and are based on solving problems using logic based on constraints given to the program instead of using the algorithms written in code by the software programmer. Natural languages are infrequently used in business settings and are also known as logic programming languages or constraint-based programming languages.
Figure 4.4 illustrates the evolution of programming languages from the low-level machine language to the VHLL natural language.
The syntax in which a programmer writes their program code is the source code. Source code needs to be converted into a set of instruction codes that the computer can understand and process. The code that the machine understands is the machine code, which is also known as native code. In some cases, instead of converting the source code into machine code, the source code is simply interpreted and run by a separate program. Depending on how the program is executed on the computer, HLL can be categorized into compiled languages and interpreted languages.
The predominant form of programming language is the compiled language. Examples include COBOL, FORTRAN, BASIC, Pascal, C, C++, and Visual Basic. The source code that the programmer writes is converted into machine code. The conversion itself is a two-step process, as depicted in Figure 4.5, consisting of compilation and linking.
There are two types of linking: static linking and dynamic linking. When the linker copies all functions, variables, and libraries needed for the program to run into the executable itself, it is referred to as static linking. Static linking offers the benefit of faster processing speed and ease of portability and distribution because the required dependencies are present within the executable itself. However, depending on the size and number of dependency files, the final executable can be bloated, and appropriate space considerations need to be taken into account. Unlike static linking, in dynamic linking only the names and respective locations of the needed object code files are placed in the final executable, and actual linking does not happen until runtime, when both the executable and the library files are placed in memory. Although this requires less space, dynamically linked executables can fail if their dependencies cannot be found at runtime. Dynamic linking should be chosen only after careful consideration of security, especially if the linked object files are supplied from a remote location and are open source in nature. A hacker can maliciously corrupt a dependent library, and when that library is linked at runtime, every program that depends on it can be compromised.
While programs written in compiled languages can be directly run on the processor, interpreted languages require an intermediary host program to read and execute each statement of instruction line by line. The source code is not compiled or converted into processor-specific instruction codes. Common examples of interpreted languages include REXX, PostScript, Perl, Ruby, and Python. Programs written in interpreted languages are slower in execution speed, but they provide the benefit of quicker changes because there is no need for recompilation and relinking, as is the case with those written in compiled languages.
To leverage the benefits provided by compiled languages and interpreted languages, there is also a combination (hybrid) of both compiled and interpreted languages. Here, the source code is compiled into an intermediate stage that resembles object code. The intermediate stage code is then interpreted as required. Java is a common example of a hybrid language. In Java, the intermediate stage code that results upon compilation of source code is known as the byte code. The byte code resembles processor instruction codes, but it cannot be executed as such. It requires an independent host program that runs on the computer to interpret the byte code, and the Java Virtual Machine (JVM) provides this for Java. In .NET programming languages, the source code is compiled into what is known as the common intermediate language (CIL), formerly known as Microsoft Intermediate Language (MSIL). At runtime, the common language runtime’s (CLR) just-in-time (JIT) compiler converts the CIL code into native code, which is then executed by the machine.
Software development is a structured and methodical process that requires the interplay of people expertise, processes, and technologies. The software development life cycle (SDLC) is often broken down into multiple phases that are either sequential or parallel. In this section, we will learn about the prevalent SDLC models that are used to develop software. These include:
The waterfall model is one of the most traditional software development models still in use today. It is a highly structured, linear, and sequentially phased process characterized by predefined phases, each of which must be completed before one can move on to the next phase. Just as water can flow in only one direction down a waterfall, once a phase in the waterfall model is completed, one cannot go back to that phase. Winston W. Royce’s original waterfall model from 1970 has the following order of phases:
The waterfall model is useful for large-scale software projects because it brings structure by phases to the software development process. The National Institute of Standards and Technology (NIST) Special Publication 800-64 REV 1d, covering Security Considerations in the Information Systems Development Life Cycle, breaks the linear waterfall SDLC model into five generic phases: initiation, acquisition/development, implementation/assessment, operations/maintenance, and sunset (Figure 4.6). Today, there are several other modified versions of the original waterfall model that include different phases with slight or major variations, but the definitive characteristic of each is the unidirectional sequential phased approach to software development.
From a security standpoint, it is important to ensure that the security requirements are part of the requirements phase. Incorporating any missed security requirements at a later point in time will result in additional costs and delays to the project.
In the iterative model of software development, the project is broken into smaller versions and developed incrementally, as illustrated in Figure 4.7. This allows the development effort to be aligned with the business requirements, uncovering any important issues early in the project and therefore avoiding disastrous faulty assumptions. It is also commonly referred to as the prototyping model, in which each version is a prototype of the final release to manufacturing (RTM) version. Prototypes can be built to clarify requirements and then discarded, or they may evolve into the final RTM version. The primary advantage of this model is that it gives the customer or business more opportunity for input, which can help solidify the requirements before a lot of time, effort, and resources are invested. However, it must be recognized that if the planning cycles are too short, nonfunctional requirements, especially security requirements, can be missed. If they are too long, then the project can suffer from analysis paralysis and excessive implementation of the prototype.
The spiral model, as shown in Figure 4.8, is a software development model with elements of both the waterfall model and the prototyping model, generally used for larger projects. The key characteristic of this model is that each phase has a risk assessment review activity. The risk of not completing the software development project within the constraints of cost and time is estimated, and the results of the risk assessment activity are used to find out if the project needs to be continued or not. This way, should the success of completing the project be determined as questionable, then the project team has the opportunity to cut the losses before investing more into the project.
Agile development methodologies are gaining a lot of acceptance today, and most organizations are embracing them for their software development projects. Agile development methodologies are built on the foundation of iterative development with the goal of minimizing software development project failure rates by developing the software in multiple repetitions (iterations) and small timeframes (called timeboxes). Each iteration includes the full SDLC. The primary benefit of agile development methodologies is that changes can be made quickly. This approach uses feedback driven by regular tests and releases of the evolving software as its primary control mechanism, instead of planning, as in the case of the spiral model.
The two main agile development methodologies include:
In reality, the most conducive model for enterprise software development is usually a combination of two or more of these models. It is important, however, to realize that no model or combination of models can create inherently secure software. For software to be securely designed, developed, and deployed, a minimum set of security tasks needs to be effectively incorporated into the system development process, and the points of building security into the SDLC model should be identified.
Although secure software is the result of a confluence between people, process, and technology, in this chapter, we will primarily focus on the technology and process aspects of writing secure code. We will learn about the most common vulnerabilities that result from insecure coding, how an attacker can exploit those vulnerabilities, and the anatomy of the attack itself. We will also discuss security controls that must be put in place (in the code) to resist and thwart actions of threat agents.
Nowadays, most of the reported incidents of security breaches seem to have one thing in common: they are attacks that exploited some weakness in the software layer. Analysis of the breaches invariably indicates one of the following to be the root cause: design flaws, coding (implementation) issues, or improper configuration and operations, with attacks that exploit software coding weaknesses being the most prevalent. The Open Web Application Security Project (OWASP) Top 10 List and the Common Weakness Enumeration (CWE/SANS) Top 25 List of the most dangerous programming errors are testaments to the fact that software programming has a lot to do with its security. The 2010 OWASP Top 10 List, in addition to considering the most common application security issues from a weaknesses or vulnerabilities perspective (as did the 2004 and 2007 versions), views application security issues from an organizational risks (technical risk and business impact) perspective, as tabulated in Table 4.1. The 2009 CWE/SANS Top 25 List of the most dangerous programming errors is shown in Table 4.2.
The 2009 CWE/SANS Top 25 List of the most dangerous programming errors falls into the following three categories:
The categorization of the 2009 CWE/SANS Top 25 List of most dangerous programming errors is shown in Table 4.3.
It is recommended that you visit the respective Web sites for the OWASP Top 10 List and the CWE/SANS Top 25 List, as a CSSLP is expected to be familiar with programming issues that can lead to security breaches and know how to address them. The most common software security vulnerabilities and risks are covered in the following section. Each vulnerability or risk is first described as to what it is and how it occurs and is followed by a discussion of security controls that can be implemented to mitigate it.
OWASP Top 10 Rank | 1 |
CWE Top 25 Rank | 2, 9 |
Considered one of the most prevalent software (or application) security weaknesses, injection flaws occur when the user-supplied data are not validated before being processed by an interpreter. The attacker supplies data that are accepted as they are and interpreted as a command or part of a command, thus allowing the attacker to execute commands using any injection vector. Almost any data-accepting source is a potential injection vector if the data are not validated before they are processed. Common examples of injection vectors include QueryStrings, form input, and applets in Web applications. Injection flaws are easily discoverable using code review, and scanners, including fuzzing scans, can be used to detect them. There are several different types of injection attacks.
The most common injection flaws include SQL injection, OS command injection, LDAP injection, and XML injection.
This is probably the most well-known form of injection attack, as the databases that store business data are becoming the prime target for attackers. In SQL (Structured Query Language) injection, attackers exploit the way in which database queries are constructed. They supply input that, if not sanitized or validated, becomes part of the query that the database processes as a command. Let us consider an example of a vulnerable code implementation in which the query command text (sSQLQuery) is dynamically built using data supplied from text input fields (txtUserID and txtPassword) from the Web form.
string sSQLQuery = "SELECT * FROM USERS WHERE user_id = '" + txtUserID.Text + "' AND user_password = '" + txtPassword.Text + "'";
If the attacker supplies ' OR 1=1 -- as the txtUserID value, then the SQL Query command text that is generated is as follows:
string sSQLQuery = "SELECT * FROM USERS WHERE user_id = '" + "' OR 1=1 --" + "' AND user_password = '" + txtPassword.Text + "'";
This results in SQL syntax, as shown below, that the interpreter will evaluate and execute as a valid SQL command. In T-SQL, everything after the -- is treated as a comment and ignored.
SELECT * FROM USERS WHERE user_id = '' OR 1=1 --
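To see how the supplied value changes the structure of the command, the concatenation can be reproduced in a few lines of Java. This is a sketch of the vulnerable pattern only; the class and method names are hypothetical, and no database is involved.

```java
/** Demonstrates how string concatenation lets input rewrite a SQL command. */
final class SqlConcatDemo {
    /** Vulnerable pattern: raw input is spliced directly into the query text. */
    static String buildQuery(String userId, String password) {
        return "SELECT * FROM USERS WHERE user_id = '" + userId
                + "' AND user_password = '" + password + "'";
    }

    public static void main(String[] args) {
        // Expected use: both values stay inside their quoted string literals.
        System.out.println(buildQuery("jsmith", "s3cret"));
        // Attack: the leading quote closes the literal, and -- comments out the password check.
        System.out.println(buildQuery("' OR 1=1 --", "anything"));
    }
}
```

The second call prints a query whose WHERE clause is always true, regardless of the password value supplied.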
The attack flow in SQL injection comprises the following steps:
Upon determining that the application is susceptible to SQL injection, an attacker will attempt to force the database to respond with messages that potentially disclose internal database structure and values by passing in SQL commands that cause the database to error. Suppressing database error messages considerably thwarts SQL injection attacks, but it has been proven that this control measure is not sufficient to prevent SQL injection completely. Attackers have found a way to go around the use of error messages for constructing their SQL commands, as is evident in the variant of SQL injection known as blind SQL injection. In blind SQL injection, instead of using information from error messages to facilitate SQL injection, the attacker constructs simple Boolean SQL expressions (true/false questions) to probe the target database iteratively. Depending on whether the query was successfully executed, the attacker can determine the syntax and structure of the injection. The attacker can also note the response time to a query with a logically true condition and one with a false condition and use that information to determine if a query executes successfully or not.
This works on the same principle as the other injection attacks, in which the command string is generated dynamically using input supplied by the user. When the software allows the execution of operating system (OS) level commands using the supplied user input without sanitization or validation, it is said to be susceptible to OS Command injection. This could be seriously devastating to the business if the principle of least privilege is not designed into the environment that is being compromised. The two main types of OS Command injection are as follows:
An example of an OS Command injection, in which an attacker supplies a value for a QueryString parameter to execute the /bin/ls command, is given below:
http://www.mycompany.com/sensitive/cgi-bin/userData.pl?doc=%20%3B%20/bin/ls%20-l
%20 decodes to a space and %3B decodes to a ; so the command that is executed will be /bin/ls -l, listing the contents of the program’s working directory.
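The decoding step, and one way of rejecting such a value, can be sketched in Java. The allowlist pattern, the class name, and the use of /bin/cat as the command being run are assumptions made for illustration; the original script's actual behavior is not known. The decode call uses the Charset overload available in Java 10 and later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Pattern;

/** Decodes a QueryString value and applies an allowlist before it reaches an OS command. */
final class DocParameterCheck {
    // Hypothetical allowlist: letters, digits, underscore, and hyphen only.
    private static final Pattern SAFE_DOC = Pattern.compile("[A-Za-z0-9_-]{1,64}");

    /** Percent-decodes a raw QueryString value. */
    static String decode(String rawValue) {
        return URLDecoder.decode(rawValue, StandardCharsets.UTF_8);
    }

    static boolean isSafeDocName(String docName) {
        return docName != null && SAFE_DOC.matcher(docName).matches();
    }

    /**
     * Builds an argument list suitable for ProcessBuilder, so that no shell
     * ever parses the value. /bin/cat is a stand-in command for this sketch.
     */
    static List<String> buildCommand(String docName) {
        if (!isSafeDocName(docName)) {
            throw new IllegalArgumentException("rejected doc value");
        }
        return List.of("/bin/cat", docName);
    }

    public static void main(String[] args) {
        String decoded = decode("%20%3B%20/bin/ls%20-l");
        System.out.println("[" + decoded + "]");
        System.out.println("accepted: " + isSafeDocName(decoded));
        System.out.println(buildCommand("report2010"));
    }
}
```

Passing the arguments as a list rather than as one command string means that characters such as ; have no special meaning, and the allowlist rejects the value before that point in any case.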
LDAP is used to store information about users, hosts, and other objects. LDAP injection works on the same principle as SQL injection and OS command injection. Unsanitized and unvalidated input is used to construct or modify syntax, contents, and commands that are executed as an LDAP query. Compromise can lead to the disclosure of sensitive and private information as well as manipulation of content within the LDAP tree (hierarchical) structure. Say you have the LDAP query (_sldapQuery) built dynamically using the user-supplied input (userName) without any validation, as shown in the example below.
String _sldapQuery = "(cn=" + $userName + ")";
If the attacker supplies the wildcard “*”, information about all users listed in the directory will be disclosed. If the attacker supplies a value such as “sjohnson)(|password=*))”, the execution of the LDAP query will yield the password for the user sjohnson.
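One common defense is to escape the characters that have special meaning in an LDAP search filter before the value is placed in the query. The following is a minimal sketch of that idea based on the escaping rules of RFC 4515; the class and method names are hypothetical, and production code would more likely rely on the escaping facilities of its LDAP library.

```java
/** Escapes LDAP search filter metacharacters using the backslash-hex form of RFC 4515. */
final class LdapFilterEscaper {
    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\\': sb.append("\\5c"); break; // backslash
                case '*':  sb.append("\\2a"); break; // wildcard
                case '(':  sb.append("\\28"); break; // opens a filter
                case ')':  sb.append("\\29"); break; // closes a filter
                case '\0': sb.append("\\00"); break; // NUL
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the cn filter with the user-supplied value escaped. */
    static String buildFilter(String userName) {
        return "(cn=" + escape(userName) + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildFilter("sjohnson"));
        System.out.println(buildFilter("*"));
        System.out.println(buildFilter("sjohnson)(|password=*))"));
    }
}
```

After escaping, the wildcard and the parentheses are treated as literal characters of the cn value, so they can no longer match every entry or change the structure of the filter.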
XML injection occurs when the software does not properly filter or quote special characters or reserved words that are used in XML, allowing an attacker to modify the syntax, contents, or commands before execution. The two main types of XML injection are as follows:
In XPath injection, the XPath expression used to retrieve data from the XML data store is built dynamically using user-supplied input that is not validated or sanitized before processing. The structure of the query can thus be controlled by the user, and an attacker can take advantage of this weakness by injecting malformed XML expressions to perform malicious operations, such as modifying and controlling logic flow, retrieving unauthorized data, and circumventing authentication checks. XQuery injection works the same way as XPath injection, except that it is the XQuery (not XPath) expression that is built dynamically using unvalidated, unsanitized user-supplied input.
Consider the following XML document (accounts.xml) that stores the account information and PINs of customers and a snippet of Java code that uses an XPath query to retrieve authentication information:
<customers>
  <customer>
    <user_name>andrew</user_name>
    <accountnum>1234987655551379</accountnum>
    <pin>2358</pin>
    <homepage>/home/astrout</homepage>
  </customer>
  <customer>
    <user_name>dave</user_name>
    <accountnum>9865124576149436</accountnum>
    <pin>7523</pin>
    <homepage>/home/dclarke</homepage>
  </customer>
</customers>
The Java code used to retrieve the home directory based on the provided credentials is:
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression xPathExp = xpath.compile("//customers/customer[user_name/text()='" + login.getUserName() + "' and pin/text() = '" + login.getPIN() + "']/homepage/text()");
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("accounts.xml"));
String homepage = xPathExp.evaluate(doc);
By supplying the value andrew as the user name (returned by getUserName()) and the value ' or ''=' as the PIN (returned by getPIN()), the XPath expression becomes
//customers/customer[user_name/text()='andrew' and pin/text() = '' or ''='']/homepage/text()
Because and binds more tightly than or in XPath, the trailing or ''='' makes the predicate true for every customer. This will allow the user logging in as andrew to bypass authentication without supplying a valid PIN.
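The bypass, and one way to prevent it, can be reproduced with the standard javax.xml.xpath API. The sketch below is illustrative: the class name, the inlined copy of accounts.xml, and the variable names $user and $pin are assumptions. The safer version keeps the expression text fixed and binds the user-supplied values through an XPathVariableResolver, so they are always treated as data.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Contrasts a concatenated XPath query with one that binds input as variables. */
final class XPathLoginDemo {
    // Inlined copy of accounts.xml so that the sketch is self-contained.
    static final String ACCOUNTS_XML =
            "<customers>"
            + "<customer><user_name>andrew</user_name><accountnum>1234987655551379</accountnum>"
            + "<pin>2358</pin><homepage>/home/astrout</homepage></customer>"
            + "<customer><user_name>dave</user_name><accountnum>9865124576149436</accountnum>"
            + "<pin>7523</pin><homepage>/home/dclarke</homepage></customer>"
            + "</customers>";

    /** Vulnerable: the input becomes part of the expression itself. */
    static String lookupUnsafe(String user, String pin) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//customers/customer[user_name/text()='" + user
                + "' and pin/text() = '" + pin + "']/homepage/text()";
        return evaluate(xpath, expr);
    }

    /** Safer: the expression is constant and the input is bound to $user and $pin. */
    static String lookupSafe(String user, String pin) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, Object> vars = Map.of("user", user, "pin", pin);
        xpath.setXPathVariableResolver(name -> vars.get(name.getLocalPart()));
        String expr = "//customers/customer[user_name/text()=$user"
                + " and pin/text()=$pin]/homepage/text()";
        return evaluate(xpath, expr);
    }

    // Returns the string value of the first matching node, or "" when nothing matches.
    private static String evaluate(XPath xpath, String expr) {
        try {
            return xpath.evaluate(expr, new InputSource(new StringReader(ACCOUNTS_XML)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String injectedPin = "' or ''='";
        System.out.println("unsafe: [" + lookupUnsafe("andrew", injectedPin) + "]");
        System.out.println("safe:   [" + lookupSafe("andrew", injectedPin) + "]");
        System.out.println("valid:  [" + lookupSafe("dave", "7523") + "]");
    }
}
```

With the injected PIN, the concatenated query returns a home directory even though no valid PIN was given, while the variable-bound query returns an empty string.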
Regardless of whether an injection flaw exploits a database, OS command, a directory protocol and structure, or a document, they are all characterized by one or more of the following traits:
The consequences of injection flaws are varied and serious. The most common ones include:
Commonly used mitigation and prevention strategies and controls for injection flaws are as follows:
In the event that the code cannot be fixed, using an application layer firewall to detect injection attacks can be a compensating control.
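For SQL injection in particular, two widely used controls are allowlist validation of input and parameterized queries. The following is a minimal sketch of both against the USERS table used earlier in this section. The class name, the user ID pattern, and the decision to return a simple boolean are assumptions; the plaintext password comparison only mirrors the earlier example and is not a recommendation.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.regex.Pattern;

/** Combines allowlist validation with a parameterized query. */
final class SafeUserLookup {
    // Hypothetical allowlist: 1 to 32 letters, digits, underscores, dots, or hyphens.
    private static final Pattern USER_ID = Pattern.compile("[A-Za-z0-9_.-]{1,32}");

    static boolean isValidUserId(String userId) {
        return userId != null && USER_ID.matcher(userId).matches();
    }

    /**
     * Returns true when a row matches both values. The ? placeholders are bound
     * as data, so quotes or comment markers in the input cannot change the
     * structure of the command. The caller supplies an open JDBC connection.
     */
    static boolean credentialsMatch(Connection conn, String userId, String password)
            throws SQLException {
        if (!isValidUserId(userId)) {
            return false; // reject before the value reaches the database
        }
        String sql = "SELECT 1 FROM USERS WHERE user_id = ? AND user_password = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, userId);
            ps.setString(2, password);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```

Validation and parameterization complement each other: the allowlist rejects obviously malformed input early, and the parameterized query protects the values that must be allowed to contain arbitrary characters, such as the password.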