Independent Cybersecurity Consultant, Bhubaneswar, Odisha, India
(2)
Independent Cybersecurity Consultant, Mangalore, Karnataka, India
A malware analyst deals with hundreds of files every day. All the files on a system need to be categorized so that an analyst understands the potential damage that one file can do to the system. A malware analyst needs to be aware of the various file formats and how to identify them. In this chapter, you go through various kinds of files and learn how to identify their extensions and formats.
Visualizing a File in Its Native Hex Form
Everything that a computer understands ultimately boils down to binary: bits, each of which is either a 0 or a 1. Every file in our OS is binary. A common misconception is that every binary file is an executable file. In fact, all kinds of data (executable files, text files, HTML pages, software programs, PDFs, Word documents, PowerPoint slides, videos, audio, games, or anything else stored on a computer as a file) are stored in binary form. When opened, however, each file runs or is presented to the user differently based on the file’s extension or data format. Every byte of a file can be visualized in its hex form, as shown in Figure 3-1.
As an example, create a text file using Notepad and type some text in it, as shown in Figure 3-1. Open the newly created file using a hex editor. If you are on Windows, you can use Notepad++’s hex view, as shown in Figure 3-2, or any other hex editor.
The middle column in Figure 3-2 displays the file’s bytes in hex, and the right column displays each byte as its ASCII character, if that character is printable. Hex digits range from 0 to 9 and A to F, so every character you see in the middle column falls within that range. Where are the binary 0s and 1s we were talking about earlier? Hex is simply an alternative notation for the same bits, just as decimal is. In Figure 3-2, the hex value for the letter H is 0x48, which is 72 in decimal and 0100 1000 in binary. A hex editor shows the bits as hex numbers because that form is more compact and human-readable.
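The relationship between a character, its hex value, its decimal value, and its bits can be checked with a few lines of Python. The following is a minimal sketch rather than a replacement for a hex editor; the helper name hex_dump and the sample text are illustrative assumptions and do not come from any tool mentioned in this chapter.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a hex-editor style dump: offset, hex bytes, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Bytes outside the printable ASCII range are shown as a dot.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return "\n".join(lines)


# The letter H in its hex, decimal, and binary forms.
value = ord("H")
print(hex(value), value, format(value, "08b"))

# A short text, including the carriage return and line feed Notepad adds.
print(hex_dump(b"Hello\r\n"))
```

The dump mirrors the layout of a typical hex editor: the middle column holds the hex bytes, and the right column shows the same bytes as ASCII, with a dot standing in for anything that is not printable.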
Today, most programmers do not need to deal with files at the hex or binary level. But a malware analyst needs to look deep into a malware sample and hence cannot avoid understanding files in their native binary form, which is usually visualized in hex. As a malware analyst, reverse engineer, or detection engineer, getting comfortable with hex is a must.
Hash: Unique File Fingerprint
There are millions of files in this world, and we need a way to uniquely identify each one. The name of the file can’t be used as its unique identifier, because two files on two different computers, or even on the same computer, can have the same name. This is where hashing comes in handy. It is used in the malware analysis world to uniquely identify a malware sample.
Hashing is a method that generates an identifier string for a piece of data, one that is unique to that data for all practical purposes. The data for which the hash is created can range from a few raw bytes to the entire contents of a file. Hashing a file works by feeding the contents of the file through a program that implements a hashing algorithm, which generates a unique string for the content, as illustrated by Figure 3-3.
One common misconception around hashing a file is that changing the name of the file generates a new hash. Hashing depends only on the contents of the file. The name of the file is not part of the file contents, so it is not included in the hashing process and has no effect on the hash generated. Another important point to keep in mind is that changing even a single byte of the file’s content generates a new hash for the file, as illustrated by Figure 3-4.
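Both properties can be demonstrated with Python’s standard hashlib module. This is a minimal sketch; the sample content is made up, and sha256 is used only as one example of a hashing algorithm.

```python
import hashlib

content = b"Hello, malware analyst!"

# Hashing depends only on the bytes fed to the algorithm. A file name never
# enters the computation, so renaming a file cannot change its hash.
original = hashlib.sha256(content).hexdigest()
same_bytes_again = hashlib.sha256(bytes(content)).hexdigest()

# Change a single byte (the first one) and hash again.
modified_content = b"J" + content[1:]
modified = hashlib.sha256(modified_content).hexdigest()

print(original == same_bytes_again)  # True: same bytes, same hash
print(original == modified)          # False: one byte changed, new hash
```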
The hash value of a malware file is what is used in the malware analysis world to identify and refer to it. As you will learn in later chapters, whenever you have a malware file, you generate its hash and then look it up on the Internet for analysis. Alternatively, if you only have the hash of a malware file, you can use it to get more information for further analysis.
Three kinds of hashes are predominantly used in the malware world for files: md5, sha1, and sha256. Each is generated by tools that implement the corresponding hashing algorithm. Listing 3-1 shows the md5, sha1, and sha256 hashes for the same file.
Listing 3-1. The md5, sha1, and sha256 Hashes Generated for the Same File
To generate the hash for a file on Windows, you can use the HashMyFiles GUI tool, as shown in Figure 3-5. We generated the hash for C:\Windows\notepad.exe, which is the famous Notepad program to open text files on Windows systems. You can also use the QuickHash GUI tool.
Alternatively, you can use the md5deep, sha1deep, and sha256deep command-line tools on Windows to generate the md5, sha1, and sha256 hashes for a file, as shown in Listing 3-2.
Listing 3-2. md5deep, sha1deep, and sha256deep Command-Line Tools on Windows in Action
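When none of these tools is at hand, the same three hashes can be computed with Python’s standard hashlib module. The sketch below is an illustration, not the way the tools above work internally; the function name file_hashes and the throwaway demo file are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def file_hashes(path: str, chunk_size: int = 65536) -> dict:
    """Compute the md5, sha1, and sha256 hashes of a file's contents."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat even for large samples.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_hashes = file_hashes(tmp.name)
os.remove(tmp.name)

for name, digest in demo_hashes.items():
    print(f"{name}: {digest}")
```

To hash a real sample, pass its path to file_hashes, for example file_hashes(r"C:\Windows\notepad.exe").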
Identifying Files
There are two primary ways to identify files: file extensions and file formats. In this section, we go through each of these identification techniques and show how malicious actors can abuse some of them to fool users into running malware.
File Extension
The primary way the OS identifies a file is by using the file’s extension. On Windows, a file extension is a suffix to the name of the file, usually a period character (.) followed by three letters identifying the type of file; some examples are .txt, .exe, and .pdf. A file extension can be as short as one character or longer than ten characters. By default, File Explorer does not display file extensions on Windows, as seen in Figure 3-6, but you can configure your system to display the file extension for all files on the system, as explained in Chapter 2.
After disabling file extension hiding, you can view file extensions, as shown in Figure 3-7.
Table 3-1 lists some of the popular file extensions and the file type each extension indicates.
Table 3-1
Some of the Known File Extensions and the File Type They Indicate
Extension | File Type
.pdf | Adobe Portable Document Format
.exe | Microsoft Executable
.xlsx | Excel Microsoft Office Open XML Format document
.pptx | PowerPoint Microsoft Office Open XML Format document
File association is a method by which you can associate a file type or extension to be opened by a certain application. Usually, a file extension is the file property that creates an association with an application on the system.
As an experiment, take a freshly installed OS that doesn’t have Microsoft Office installed. Obtain any Microsoft PowerPoint file (.ppt or .pptx file extension) and copy it over to the Documents folder on your system. If you try opening the file, the OS displays an error message saying it can’t open the file, as shown in Figure 3-8. The reason is that no software is associated with Microsoft PowerPoint files, or rather with the .pptx file extension. Without a file association for the .pptx file extension, Windows does not know how to deal with these files when you try to open them, and it ends up displaying an error message.
Now on the same Windows machine, try to open a .jpeg or .png image file, and the OS succeeds in opening it, as shown in Figure 3-9. It succeeds in opening and displaying the image file without any issues because Windows has a default image viewer program installed on the system that is associated with the .jpeg and .png file extensions.
Why Disable Extension Hiding?
When you analyze a piece of malware, viewing the extension gives you a quick overview of the type of file you are dealing with. Also, when the malware sample is run, it can create multiple files on the system, and the ability to view their extensions helps you immediately figure out the type of files created by the malware. Malware authors also use extension faking and thumbnail faking techniques to deceive users into clicking the malware (as explained in the next sections). Knowing the correct extension of a file can help thwart some of these malicious techniques.
Extension Faking
Some malware is known to exploit extension hiding to fool users into clicking it, thereby infecting the system. Check out Sample-3-1 from the samples repository, which is illustrated by a similar file in Figure 3-10. As shown on the left, the sample appears to be a .pdf file at first glance, but in reality, the file is an executable. Its true extension, .exe, is hidden in Windows. The attacker exploits this hidden extension by craftily naming the file so that the visible part of the name ends in .pdf, which fools the end user into assuming that the file is a PDF document that is safe to open. As shown on the right in Figure 3-10, once you disable extension hiding in Windows, the .exe extension is visible.
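File names that follow this pattern can be flagged automatically during triage. The sketch below is one possible heuristic, not a complete detection; the function name and both extension sets are illustrative assumptions, and a real tool would use far more complete lists.

```python
from pathlib import PureWindowsPath

# Illustrative lists only; real triage would use far more complete sets.
EXECUTABLE_EXTENSIONS = {".exe", ".scr", ".com", ".bat"}
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".xlsx", ".pptx", ".jpg", ".png"}


def looks_like_extension_faking(filename: str) -> bool:
    """Flag names such as 'invoice.pdf.exe': the real (last) extension is
    executable, but the one before it looks like a harmless document."""
    suffixes = [s.lower() for s in PureWindowsPath(filename).suffixes]
    if len(suffixes) < 2:
        return False
    return (suffixes[-1] in EXECUTABLE_EXTENSIONS
            and suffixes[-2] in DOCUMENT_EXTENSIONS)


for name in ("invoice.pdf.exe", "notepad.exe", "report.pdf"):
    print(name, looks_like_extension_faking(name))
```

With extension hiding enabled, the first name would be displayed as invoice.pdf, which is exactly the situation the heuristic is meant to catch.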
Thumbnail Faking
Another method employed by attackers is to use fake thumbnails to deceive users into clicking malware. Check out the sample illustrated in Figure 3-11. In the left window, it appears that the file is a PDF. But the thumbnail of any file can be modified, which is what the attackers did in this sample. The file is an executable, as seen on the right. Its true extension is .exe, which becomes visible after disabling extension hiding. But by giving the file a fake PDF thumbnail, and with extension hiding enabled, the attacker manages to deceive the user into thinking that the file is a PDF. When the user clicks the file, it runs and infects the system.
Well-Known File Extensions
Table 3-2 features some well-known file extensions and the programs associated with them. The program associated with a file extension can be changed. For example, the .pdf extension can be associated with Adobe Acrobat PDF Reader, Foxit PDF Viewer, or any other program.
Table 3-2
Popular File Extensions and the Corresponding Default Program Associated with Each
Extension | Program
.png, .jpeg, .jpg | Windows Photo Viewer
.pdf | Adobe Acrobat Reader
.exe | Windows loader
.docx, .doc, .pptx, .xlsx | Microsoft Office tools
.mp3, .avi, .mpeg | VLC Media Player
File Extensions: Can We Rely on One Alone?
Can we rely on the extension of a file to decide the type of a file? The answer is no. For example, changing the extension of a file from .pptx to .jpeg doesn’t change the type of the file from a Microsoft PowerPoint file to a JPEG image file. The file is still a PowerPoint file with unchanged contents, just with the wrong extension. You can still make Microsoft PowerPoint load this file manually despite the wrong extension.
For malware analysts, this issue is amplified. Malware files are often dropped on the system without readable names or extensions. Malware is also known to fool users by dropping files with a fake file extension to mask the real type of the file. In the next section, we introduce file formats, a far more reliable way to identify the file type.
File Format: The Real File Extension
Let’s start by opening C:\Windows\Notepad.exe with the Notepad++ hex editor. Now do the same with other kinds of files on the system: ZIP archives, PNG images, and so forth. Note that files of the same type share some specific characters at the very start of the file. For example, ZIP files start with PK. A PNG file’s second, third, and fourth characters are PNG. Windows DOS executables start with MZ, as shown in Figure 3-12. These common starting bytes are called magic bytes. In Figure 3-12, the MZ characters are the ASCII equivalent of hex bytes 4d 5a.
These magic bytes are not located randomly in the file. They are part of what is known as the file header. Every file has a structure or format that defines how data should be stored in the file. The structure of the file is usually defined by headers, which hold meta information on the data stored in the file. Parsing the header and the magic bytes lets you identify the format or type of the file.
Each kind of file (audio, video, executable, PowerPoint, Excel, PDF document) has a file structure of its own to store its data. This file structure is called a file format. Further parsing of the headers can help determine a file’s characteristics. For Windows executable files, for example, parsing the header contents past the MZ magic bytes reveals whether the file is a DLL, an executable, or a sys file, whether it is 32- or 64-bit, and so on. By determining a file’s format, you can determine what its actual file extension should be.
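As an illustration of parsing past the magic bytes, the following sketch reads a few header fields of a Windows executable using only Python’s struct module. The field offsets and constants come from the publicly documented PE format rather than from this chapter, the function name describe_pe is hypothetical, and the sketch does not attempt to recognize sys files.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # Characteristics flag that marks a DLL
OPTIONAL_MAGIC = {0x10B: "32-bit (PE32)", 0x20B: "64-bit (PE32+)"}


def describe_pe(data: bytes) -> dict:
    """Read a few fields from the headers of a Windows executable image."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ magic bytes")
    # Offset 0x3C of the DOS header stores the file offset of the PE header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # 20-byte COFF header: Machine, NumberOfSections, TimeDateStamp,
    # PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader,
    # Characteristics.
    fields = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    characteristics = fields[6]
    # The optional header follows the COFF header and starts with a magic
    # value that distinguishes 32-bit from 64-bit images.
    (optional_magic,) = struct.unpack_from("<H", data, pe_offset + 24)
    return {
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
        "bitness": OPTIONAL_MAGIC.get(optional_magic, "unknown"),
    }
```

To try it on a real file, read the file’s bytes with open(path, "rb").read() and pass them to describe_pe. A truncated or malformed file raises an exception instead of returning a result.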
Figure 3-13 gives a general high-level overview of the structure of a file and its headers. As shown, the file’s format can be defined by multiple headers, which hold the offset, size, and other properties of the chunks of data held in the file.
Table 3-3 and Table 3-4 feature well-known executable and nonexecutable file formats and their corresponding magic bytes.
Table 3-3
Popular Executable File Formats and Their Magic Bytes
OS | File Type/Format | Magic Bytes HEX | Magic Bytes ASCII
Windows | Windows Executable | 4D 5A | MZ
Linux | Linux Executable | 7F 45 4C 46 | .ELF
macOS | Mach-O Executable | FE ED FA CE | ....
Table 3-4
Popular Nonexecutable File Formats and Their Magic Bytes
File Format/Type | File Extension | Magic Bytes HEX | Magic Bytes ASCII
PDF Document | .pdf | 25 50 44 46 | %PDF
Adobe Flash | .swf | 46 57 53 | FWS
Flash Video | .flv | 46 4C 56 | FLV
Video AVI files | .avi | 52 49 46 46 | RIFF
Zip compressed files | .zip | 50 4B | PK
Rar compressed files | .rar | 52 61 72 21 | Rar!
Microsoft document | .doc | D0 CF | ..
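The magic bytes listed in Table 3-3 and Table 3-4 can be turned into a small lookup routine. This is a minimal sketch limited to the signatures in those two tables; the names MAGIC_BYTES and identify_by_magic are illustrative, and real identification tools rely on much larger signature databases and deeper header parsing.

```python
# Signatures taken from Tables 3-3 and 3-4; this is not an exhaustive list.
MAGIC_BYTES = [
    (bytes.fromhex("4D5A"), "Windows executable"),
    (bytes.fromhex("7F454C46"), "Linux executable (ELF)"),
    (bytes.fromhex("FEEDFACE"), "Mach-O executable"),
    (bytes.fromhex("25504446"), "PDF document"),
    (bytes.fromhex("465753"), "Adobe Flash (.swf)"),
    (bytes.fromhex("464C56"), "Flash Video (.flv)"),
    (bytes.fromhex("52494646"), "RIFF container (for example .avi)"),
    (bytes.fromhex("504B"), "Zip compressed file"),
    (bytes.fromhex("52617221"), "Rar compressed file"),
    (bytes.fromhex("D0CF"), "Microsoft document (.doc)"),
]


def identify_by_magic(data: bytes) -> str:
    """Return a description for the first signature that matches the start
    of data, or 'unknown' if none of the listed signatures match."""
    for magic, description in MAGIC_BYTES:
        if data.startswith(magic):
            return description
    return "unknown"


print(identify_by_magic(b"%PDF-1.7"))
print(identify_by_magic(b"MZ\x90\x00"))
```

In practice, only the first few bytes of a file need to be read, for example with open(path, "rb").read(8). Keep in mind that magic bytes identify the container format: .docx, .xlsx, and .pptx files are ZIP archives internally, so they also start with PK.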
Identifying File Formats
While there are many tools to identify file formats, two are especially prominent. One is the file command-line tool on Linux. The other is TriD, available as the trid command-line tool on Windows, Linux, and macOS, or as TriDNet if you prefer a GUI. Both command-line tools take the path to a file as a command-line argument and report a verdict on the format of the file.
TriD and TriDNet
Open your command prompt in Windows and type the command shown in Listing 3-3.
c:\>trid.exe c:\Windows\notepad.exe
TrID/32 - File Identifier v2.24 - (C) 2003-16 By M.Pontello
Definitions found: 12117
Analyzing...
Collecting data from file: c:\Windows\notepad.exe
49.1% (.EXE) Microsoft Visual C++ compiled executable (generic) (16529/12/5)
19.5% (.DLL) Win32 Dynamic Link Library (generic) (6578/25/2)
Listing 3-3. trid.exe Command-Line Tool Identifying the Format of a File
In Listing 3-3, trid.exe lists the potential file formats. For notepad.exe located on our analysis box, trid.exe reports a 49.1% likelihood that it is an executable file compiled using Microsoft Visual C++. The higher the percentage, the more likely it is that the file has that format.
Alternatively, you can use TriDNet, which is the GUI version of the same trid command-line tool. The output of TriDNet for the same notepad.exe file analyzed in Listing 3-3 is shown in Figure 3-14.
File Command-Line Tool
The other very well-known file identification tool is the file command-line utility, primarily available on Linux. It is based on libmagic, a popular library that many detection tools use to identify file formats. Much like the TriD command-line tool, the file command-line tool takes the path to a file as an argument and reports the format of the file, as shown in Listing 3-4.
@ubuntu:~$ file notepad.exe
notepad.exe: PE32+ executable (GUI) x86-64, for MS Windows
Listing 3-4
File Command Line Tool on Linux, Identifying the Format of an Executable File
Manual Identification of File Formats
In the previous section, we introduced magic bytes, file headers, and their structures, and how they can be used to identify files manually. With tools like TriD available, it may seem unnecessary to remember these file format details or to manually open a file in a hex editor to identify its format.
But there are times when knowing the magic bytes for popular file formats does help. As malware analysts, we deal with a lot of data. That data may come from network packets, and in some cases, it might include the contents of a file that we are analyzing. Often, the data carries files sent by attackers or contains other files embedded within an outer parent file. Knowing the magic bytes and the general header structure of well-known file formats helps you quickly identify the presence of these embedded files in a huge data haystack, which improves analysis efficiency. For example, Figure 3-15 shows Wireshark displaying a packet capture file carrying a ZIP file in an HTTP response packet. Spotting the PK magic bytes in the packet payload lets an analyst quickly conclude that the packet holds a response from the server returning a zipped file. You saw in Table 3-4 that the magic bytes for the ZIP file format are PK.
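The same idea can be automated by searching a block of raw data for known magic bytes. The sketch below is a simple illustration under stated assumptions: the HTTP response bytes are made up for the example, the function and variable names are hypothetical, and the ZIP signature is extended to the four bytes 50 4B 03 04 that begin a ZIP local file header, which is a detail beyond the two-byte PK listed in Table 3-4.

```python
def find_signatures(blob: bytes, signatures: dict) -> list:
    """Return sorted (offset, name) pairs for every occurrence of each
    signature anywhere inside blob."""
    hits = []
    for name, magic in signatures.items():
        start = blob.find(magic)
        while start != -1:
            hits.append((start, name))
            start = blob.find(magic, start + 1)
    return sorted(hits)


# Hypothetical HTTP response carrying a ZIP file; the payload bytes are made up.
response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/octet-stream\r\n\r\n"
    b"PK\x03\x04\x14\x00\x00\x00"
)
SIGNATURES = {"zip": b"PK\x03\x04", "pdf": b"%PDF", "pe": b"MZ"}

for offset, name in find_signatures(response, SIGNATURES):
    print(f"possible {name} content at offset {offset}")
```

Short signatures such as MZ occur by chance in large amounts of data, so every hit is only a candidate that still needs to be confirmed by parsing the header that follows it.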
Summary
In this chapter, you learned about file extensions and file formats, as well as the structure, magic bytes, and headers that form the identity of a file format. Using freely available command-line tools, you can quickly identify the type of a malware file and set up the right analysis environment for the file based on its type. Knowledge of magic bytes helps you manually identify the presence of files in various data sources, such as packet payloads and packed files.