© Abhijit Mohanta, Anoop Saldanha 2020
A. Mohanta, A. Saldanha, Malware Analysis and Detection Engineering, https://doi.org/10.1007/978-1-4842-6193-4_3

3. Files and File Formats

Abhijit Mohanta (1) and Anoop Saldanha (2)
(1) Independent Cybersecurity Consultant, Bhubaneswar, Odisha, India
(2) Independent Cybersecurity Consultant, Mangalore, Karnataka, India

A malware analyst deals with hundreds of files every day. All the files on a system need to be categorized so that an analyst understands the potential damage that one file can do to the system. A malware analyst needs to be aware of the various file formats and how to identify them. In this chapter, you go through various kinds of files and learn how to identify their extensions and formats.

Visualizing a File in Its Native Hex Form

Everything that a computer understands ultimately boils down to binary: bits, each of which is either a 0 or a 1. Every file in our OS is binary. A common misconception is that every binary file is an executable file. In fact, all kinds of data (executable files, text files, HTML pages, software programs, PDFs, Word documents, PowerPoint slides, videos, audio, games, or whatever else is stored on a computer as a file) are stored in the form of a binary file. But when opened, each file runs or is presented to the user differently based on the file’s extension or data format. Every byte of a file can be visualized in its hex form, as shown in Figure 3-1.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig1_HTML.jpg
Figure 3-1

A text file created on Windows using Notepad

As an example, create a text file using Notepad and type some text in it, as shown in Figure 3-1. Open the newly created file using a hex editor. If you are on Windows, you can use Notepad++’s hex view, as shown in Figure 3-2, or any other hex editor.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig2_HTML.jpg
Figure 3-2

Opening the text file using Notepad++ Hex Editor plugin’s Hex View

The middle column in Figure 3-2 displays the file’s bytes in hex, and the right column displays each byte as its ASCII character, if it is printable. Hex digits range over 0–9 and A–F, so any character you check in the middle column falls within that range. Where are the binary 0s and 1s we were talking about earlier? Hex is simply an alternative representation of the same bits, just as decimal notation is. In Figure 3-2, the hex value for the letter H is 0x48, which is 72 in decimal and 0100 1000 in binary. A hex editor shows the binary content as hex numbers because that form is more human-readable.
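The relationship between a character, its hex value, its decimal value, and its bits can be checked with a few lines of Python. This is only a small sketch; the helper names describe_byte and hex_dump_line are made up for illustration, and the second helper merely imitates the hex and ASCII columns of a hex editor row.

```python
# Show one character in its hex, decimal, and binary representations.
def describe_byte(ch: str) -> str:
    value = ord(ch)  # numeric value of the character
    return f"{ch!r}: hex 0x{value:02X}, decimal {value}, binary {value:08b}"


# Imitate one row of a hex editor: hex column first, then the ASCII column.
# Non-printable bytes are shown as a dot, as most hex editors do.
def hex_dump_line(data: bytes) -> str:
    hex_part = " ".join(f"{b:02X}" for b in data)
    ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in data)
    return f"{hex_part}  {ascii_part}"


print(describe_byte("H"))
print(hex_dump_line(b"Hello\r\n"))
```

The carriage return and line feed at the end of the sample bytes have no printable form, so they appear as dots in the ASCII column.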

Today, most programmers do not need to deal with files at the hex or binary level. But a malware analyst needs to look deep into a malware sample and hence must understand files in their native binary form, which is usually visualized in hex. For a malware analyst, reverse engineer, or detection engineer, getting comfortable with hex is a must.

Hash: Unique File Fingerprint

There are millions of files in this world, and we need a way to uniquely identify each one. The name of a file can’t be used as its unique identifier: two files on two different computers, or even on the same computer, can have the same name. This is where hashing comes in handy, and it is used in the malware analysis world to uniquely identify a malware sample.

Hashing is a method that generates an identifier string for a piece of data, one that is unique to that data for all practical purposes. The data for which the hash is created can range from a few raw bytes to the entire contents of a file. Hashing a file works by taking the contents of the file and feeding them through a hashing algorithm, which generates the identifier string for the content, as illustrated by Figure 3-3.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig3_HTML.jpg
Figure 3-3

Illustration of how the hash of a file is generated

One common misconception around hashing a file is that changing the name of the file generates a new hash. Hashing depends only on the contents of the file. The name of the file is not part of the file contents, so it is not included in the hashing process and does not affect the generated hash. Another important point to keep in mind is that changing even a single byte of data in the file’s content generates a new hash for the file, as illustrated by Figure 3-4.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig4_HTML.jpg
Figure 3-4

Modifying a single byte in a file generates a different unique hash
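Both properties can be observed with Python’s standard hashlib module. The snippet below is a minimal sketch: the sample byte strings and the helper name sha256_hex are invented for illustration, and no file name is involved anywhere in the computation, only content.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the sha256 hash of the given content as a hex string."""
    return hashlib.sha256(data).hexdigest()


original = b"Hello malware analyst"
modified = b"Hello malware analysT"  # exactly one byte differs

# Identical content always produces the identical hash.
print(sha256_hex(original) == sha256_hex(b"Hello malware analyst"))

# A single changed byte produces a completely different hash.
print(sha256_hex(original))
print(sha256_hex(modified))
```

Because the function only ever sees bytes, renaming a file on disk cannot influence the result.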

The hash value of a malware file is what is used in the malware analysis world to identify and refer to it. As you will learn in later chapters, whenever you have a malware file, you generate its hash and then look it up on the Internet for analysis. Alternatively, if you only have the hash of a malware file, you can use it to get more information for further analysis.

Three kinds of hashes are predominantly used in the malware world for files (md5, sha1, and sha256), each generated by its own hashing algorithm. Listing 3-1 shows the md5, sha1, and sha256 hashes for the same file.
MD5 - 28193d0f7543adf4197ad7c56a2d430c
SHA1 - f34cda04b162d02253e7d84efd399f325f400603
SHA256 - 50e4975c41234e8f71d118afbe07a94e8f85566fce63a4e383f1d5ba16178259
Listing 3-1

The md5, sha1, and sha256 Hashes Generated for the Same File

To generate the hash for a file on Windows, you can use the HashMyFiles GUI tool, as shown in Figure 3-5. We generated the hash for C:\Windows\notepad.exe, which is the famous Notepad program to open text files on Windows systems. You can also use the QuickHash GUI tool.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig5_HTML.jpg
Figure 3-5

HashMyFiles tool that can generate md5, sha1, sha256 and other hashes for a file

Alternatively, you can use the md5deep, sha1deep, and sha256deep command-line tools on Windows to generate the md5, sha1, and sha256 hashes for a file, as shown in Listing 3-2.
C:\>md5deep C:\Windows\notepad.exe
a4f6df0e33e644e802c8798ed94d80ea  C:\Windows\notepad.exe
C:\>sha1deep C:\Windows\notepad.exe
fc64b1ef19e7f35642b2a2ea5f5d9f4246866243  C:\Windows\notepad.exe
C:\>sha256deep C:\Windows\notepad.exe
b56afe7165ad341a749d2d3bd925d879728a1fe4a4df206145c1a69aa233f68b  C:\Windows\notepad.exe
Listing 3-2

md5deep, sha1deep, and sha256deep Command-Line Tools on Windows in Action
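If none of these tools is at hand, the same three hashes can be computed with Python’s standard hashlib module. The following is a sketch rather than a replacement for the tools above; the function name hash_file and the 64 KB chunk size are arbitrary choices made for this example.

```python
import hashlib
import sys


def hash_file(path: str, chunk_size: int = 65536) -> dict:
    """Compute the md5, sha1, and sha256 hashes of a file's contents.

    The file is read in chunks so that large samples are not loaded
    into memory all at once.
    """
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


if __name__ == "__main__":
    # Usage: python hash_file.py <path> [<path> ...]
    for file_path in sys.argv[1:]:
        for name, digest in hash_file(file_path).items():
            print(f"{name.upper()} - {digest}")
```

Feeding every chunk to all three hashers means the file is read from disk only once.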

Identifying Files

There are two primary ways to identify files: file extensions and file formats. In this section, we go through each of these file identification techniques and describe how malicious actors can abuse some of them to fool users into running malware.

File Extension

The primary way the OS identifies a file is by using the file’s extension. On Windows, a file extension is a suffix to the name of the file, which is usually a period character (.) followed by three letters identifying the type of file; some examples are .txt, .exe, and .pdf. A file extension can be as short as one character or longer than ten characters. By default, file extensions are not displayed on Windows by the File Explorer, as seen in Figure 3-6, but you can configure your system to display the file extension for all files on the system, as explained in Chapter 2.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig6_HTML.jpg
Figure 3-6

Default file view on Windows with file extensions hidden

After disabling file extension hiding, you can view file extensions, as shown in Figure 3-7.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig7_HTML.jpg
Figure 3-7

File extension visible for files, after disabling Extension Hiding

Table 3-1 lists some of the popular file extensions and the file type each extension indicates.
Table 3-1

Some of the Known File Extensions and the File Type They Indicate

Extension    File Type
.pdf         Adobe Portable Document Format
.exe         Microsoft Executable
.xlsx        Excel Microsoft Office Open XML Format document
.pptx        PowerPoint Microsoft Office Open XML Format document
.docx        Word Microsoft Office Open XML Format document
.zip         ZIP compressed archive
.dll         Dynamic Link Library
.7z          7-Zip compressed file
.dat         Data file
.xml         XML file
.jar         Java archive file
.bat         Windows batch file
.msi         Windows installer package

File Association: How an OS Uses File Extensions

File association is a method by which you can associate a file type or extension to be opened by a certain application. Usually, a file extension is the file property that creates an association with an application on the system.

As an experiment, take a freshly installed OS that doesn’t have Microsoft Office installed. Obtain any Microsoft PowerPoint file (.ppt or .pptx file extension) and copy it over to the Documents folder on your system. If you try opening the file, the OS throws an error message saying it can’t open the file, as shown in Figure 3-8. The reason for this is the lack of a software association with Microsoft PowerPoint type files, or rather with the .pptx file extension. Without a file association for the .pptx file extension, Windows does not know how to deal with these files when you try to open them, and it ends up throwing an error message.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig8_HTML.jpg
Figure 3-8

Windows unable to open a .pptx PowerPoint file without a file association for this particular extension

Now on the same Windows machine, try to open a .jpeg or .png image file, and the OS succeeds in opening it, as shown in Figure 3-9. It succeeds in opening and displaying the image file without any issues because Windows has a default image viewer program installed on the system that is associated with the .jpeg and .png file extensions.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig9_HTML.jpg
Figure 3-9

Windows displays an image file, whose extension .jpeg has a file association with an image viewer application on the system

Why Disable Extension Hiding?

When you analyze a piece of malware, viewing the extension gives you a quick overview of the type of file you are dealing with. Also, when the malware sample is run, it can create multiple files on the system, and the ability to view their extension helps you immediately figure out the type of files created by the malware on the system. Malware authors also use extension faking and thumbnail faking techniques to deceive users into clicking the malware (as explained in the next sections). Knowing the correct extension of a file can help thwart some of these malicious techniques.

Extension Faking
Some malware is known to exploit extension hiding to fool users into clicking it, thereby infecting the system. Check out Sample-3-1 from the samples repository, which is illustrated by a similar file in Figure 3-10. As shown on the left, the sample appears to be a .pdf file at first glance, but in reality, the file is an executable. Its true extension, .exe, is hidden by Windows. The attacker exploits this hidden extension by craftily naming the file so that .pdf appears to be its suffix, which fools the end user into assuming that the file is a PDF document that is safe to click open. As shown on the right in Figure 3-10, once you disable extension hiding in Windows, the .exe extension is visible.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig10_HTML.jpg
Figure 3-10

Malware executable file using extension faking by being craftily named by attackers with a fake .pdf extension suffix
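A simple check for this naming trick can be sketched in Python. Everything here is an illustrative assumption rather than an established detection rule: the function name looks_like_extension_faking is hypothetical, and the two extension sets are short examples that a real tool would need to extend.

```python
from pathlib import PureWindowsPath

# Illustrative, non-exhaustive sets chosen for this sketch.
EXECUTABLE_EXTS = {".exe", ".scr", ".bat", ".com", ".msi"}
DECOY_EXTS = {".pdf", ".docx", ".xlsx", ".pptx", ".jpg", ".png", ".txt"}


def looks_like_extension_faking(filename: str) -> bool:
    """Flag names such as 'invoice.pdf.exe': a harmless-looking extension
    immediately followed by a real, executable extension."""
    suffixes = [s.lower() for s in PureWindowsPath(filename).suffixes]
    if len(suffixes) < 2:
        return False
    return suffixes[-1] in EXECUTABLE_EXTS and suffixes[-2] in DECOY_EXTS


for name in ("invoice.pdf.exe", "notepad.exe", "report.final.pdf"):
    print(name, looks_like_extension_faking(name))
```

A name that passes this check is not necessarily safe; the check only catches the specific double-extension pattern described above.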

Thumbnail Faking
Another method employed by attackers is to use fake thumbnails to deceive users into clicking malware. Check out the sample illustrated in Figure 3-11. In the left window, it appears that the file is a PDF. But the thumbnail of any file can be modified, which is what the attackers did in this sample. The file is an executable, as seen on the right. Its true extension is .exe, which becomes visible after disabling extension hiding. But by adding a fake PDF thumbnail to the document and with extension hiding enabled, the attacker manages to deceive the user into thinking that the file is a PDF. The file is clicked, and it infects the system.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig11_HTML.jpg
Figure 3-11

Malware executable file with a fake .pdf thumbnail

Well-Known File Extensions

Table 3-2 features some well-known file extensions and the programs associated with them. The program associated with a file extension can be changed. For example, the .pdf extension can be associated with Adobe Acrobat Reader, Foxit PDF Viewer, or any other program.
Table 3-2

Popular File Extensions and the Default Programs Associated with Them

Extension                    Program
.png, .jpeg, .jpg            Windows Photo Viewer
.pdf                         Adobe Acrobat Reader
.exe                         Windows loader
.docx, .doc, .pptx, .xlsx    Microsoft Office tools
.mp3, .avi, .mpeg            VLC Media Player

File Extensions: Can We Rely on One Alone?

Can we rely on the extension of a file to decide the type of a file? The answer is no. For example, changing the extension of a file from .pptx to .jpeg doesn’t change the type of the file from a Microsoft PowerPoint file to a JPEG image file. The file is still a PowerPoint file, just with the wrong extension; its contents are unchanged. You can still manually force Microsoft PowerPoint to load this file despite the wrong extension.

For malware analysts, this issue is amplified. Often, malware files are dropped on the system without readable names or an extension. Also, malware is known to fool users by dropping files on the system with a fake file extension to mask the real type of the file. In the next section, we introduce file formats, the foolproof way to identify the file type.

File Format: The Real File Extension

Let’s start by opening C:\Windows\notepad.exe with the Notepad++ hex editor. Now do the same with other kinds of files on the system: ZIP files, PNG images, and so forth. Note that files of the same type have some specific characters in common at the very start of the file. For example, ZIP files start with PK. A PNG file’s second, third, and fourth characters are PNG. Windows executables start with MZ, as shown in Figure 3-12. These common starting bytes are called magic bytes. In Figure 3-12, the MZ characters are the ASCII equivalent of hex bytes 4d 5a.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig12_HTML.jpg
Figure 3-12

Magic bytes for executable file types: MZ (4d 5a in hex)

These magic bytes are not located randomly in the file. They are part of what is known as the file header. Every file has a structure or format that defines how data should be stored in the file. The structure of the file is usually defined by headers, which hold meta information on the data stored in the file. Parsing the header and the magic bytes lets you identify the format or type of the file.

Every kind of file, whether audio, video, executable, PowerPoint, Excel, or PDF document, has a file structure of its own to store its data. This file structure is called a file format. Further parsing of the headers can help determine a file’s characteristics. For example, for Windows executable files, parsing the header contents past the MZ magic bytes reveals other characteristics of the file, such as whether it is a DLL, an executable, or a sys file, and whether it is 32- or 64-bit. You can determine the actual file extension of a file by determining its file format.
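To make the idea of parsing past the magic bytes concrete, here is a minimal Python sketch that reads two fields from the header of a Windows executable. The details are not covered in this section and are taken from the PE file format: the 4-byte value at offset 0x3C points to the PE signature, the Machine field follows the signature, and the Characteristics field carries the DLL flag (0x2000). The function name describe_pe is hypothetical, and telling a sys file apart is left out of the sketch.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # Characteristics flag that marks a DLL
MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x86-64)"}


def describe_pe(data: bytes) -> dict:
    """Read basic characteristics from the headers of a Windows executable."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ magic bytes")
    # Offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if len(data) < pe_offset + 24 or data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # Machine is the first field after the signature;
    # Characteristics sits 22 bytes after the start of the signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    (characteristics,) = struct.unpack_from("<H", data, pe_offset + 22)
    return {
        "bitness": MACHINE_NAMES.get(machine, f"unknown machine 0x{machine:04X}"),
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
    }
```

A typical call would be describe_pe(open(path, "rb").read(4096)), since both fields normally sit within the first few hundred bytes of the file.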

Figure 3-13 gives a general high-level overview of the structure of a file and its headers. As shown, the file’s format can be defined by multiple headers, which hold the offset, size, and other properties of the chunks of data held in the file.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig13_HTML.jpg
Figure 3-13

High-level overview of the structure and headers of a file

Table 3-3 and Table 3-4 feature well-known executable and nonexecutable file formats and their corresponding magic bytes.
Table 3-3

Popular Executable File Formats and Their Magic Bytes

OS         File Type/Format      Magic Bytes HEX    Magic Bytes ASCII
Windows    Windows Executable    4D 5A              MZ
Linux      Linux Executable      7F 45 4C 46        .ELF
macOS      Mach-O Executable     FE ED FA CE        ....

Table 3-4

Popular Nonexecutable File Formats and Their Magic Bytes

File Format/Type        File Extension    Magic Bytes HEX    Magic Bytes ASCII
PDF Document            .pdf              25 50 44 46        %PDF
Adobe Flash             .swf              46 57 53           FWS
Flash Video             .flv              46 4C 56           FLV
Video AVI files         .avi              52 49 46 46        RIFF
Zip compressed files    .zip              50 4B              PK
Rar compressed files    .rar              52 61 72 21        Rar!
Microsoft document      .doc              D0 CF              (not printable)

Identifying File Formats

While there are many tools to identify file formats, two are especially prominent. One is the file command-line tool on Linux. The other is TriD, which is available as the trid command-line tool on Windows, Linux, and macOS, or as TriDNet if you prefer a GUI. Both command-line tools take the path to a file as an argument and output a verdict on the format of the file.

TriD and TriDNet
Open your command prompt in Windows and type the command shown in Listing 3-3.
c:\>trid.exe c:\Windows\notepad.exe
TrID/32 - File Identifier v2.24 - (C) 2003-16 By M.Pontello
Definitions found:  12117
Analyzing...
Collecting data from file: c:\Windows\notepad.exe
 49.1% (.EXE) Microsoft Visual C++ compiled executable (generic) (16529/12/5)
 19.5% (.DLL) Win32 Dynamic Link Library (generic) (6578/25/2)
 13.3% (.EXE) Win32 Executable (generic) (4508/7/1)
 6.0% (.EXE) OS/2 Executable (generic) (2029/13)
 5.9% (.EXE) Generic Win/DOS Executable (2002/3)
Listing 3-3

trid.exe Command-Line Tool Identifying the Format of a File

In Listing 3-3, trid.exe lists the potential file formats, each with a match percentage. For notepad.exe located on our analysis box, trid.exe reports a 49.1% likelihood that it is an executable file compiled using Microsoft Visual C++. The higher the percentage, the more likely it is that the file has that format.

Alternatively, you can use TriDNet, which is the GUI version of the same trid command-line tool. The output of TriDNet for the same notepad.exe file used in Listing 3-3 is shown in Figure 3-14.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig14_HTML.jpg
Figure 3-14

TriDNet, the GUI alternative to the command line trid file identification tool

File Command-Line Tool
The other well-known file identification tool is the file command-line utility, primarily available on Linux. It is based on libmagic, a popular library that many detection tools use for identifying file formats. Much like the TriD command-line tool, the file command-line tool takes the path to a file as an argument and outputs the format of the file, as shown in Listing 3-4.
@ubuntu:~$ file notepad.exe
notepad.exe: PE32+ executable (GUI) x86-64, for MS Windows
Listing 3-4

File Command Line Tool on Linux, Identifying the Format of an Executable File
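The core idea behind such tools, matching the start of a file against known signatures, can be illustrated with the magic bytes from Table 3-3 and Table 3-4. The following Python snippet is a toy sketch only: the names MAGIC_BYTES and identify_by_magic are made up for this example, and real tools such as TriD and libmagic rely on far larger signature databases and look at much more than the first few bytes.

```python
# Magic bytes taken from Table 3-3 and Table 3-4 of this chapter.
MAGIC_BYTES = [
    (bytes.fromhex("4D5A"), "Windows executable"),
    (bytes.fromhex("7F454C46"), "Linux executable (ELF)"),
    (bytes.fromhex("FEEDFACE"), "Mach-O executable"),
    (bytes.fromhex("25504446"), "PDF document"),
    (bytes.fromhex("465753"), "Adobe Flash (.swf)"),
    (bytes.fromhex("464C56"), "Flash Video (.flv)"),
    (bytes.fromhex("52494646"), "RIFF container (for example, .avi)"),
    (bytes.fromhex("504B"), "ZIP compressed file"),
    (bytes.fromhex("52617221"), "RAR compressed file"),
    (bytes.fromhex("D0CF"), "Microsoft document (.doc)"),
]


def identify_by_magic(data: bytes) -> str:
    """Return the first format whose magic bytes match the start of data."""
    for magic, name in MAGIC_BYTES:
        if data.startswith(magic):
            return name
    return "unknown"


print(identify_by_magic(b"%PDF-1.7 ..."))
print(identify_by_magic(b"MZ\x90\x00"))
```

Note that a match on magic bytes alone is coarse. A .docx file, for instance, is itself a ZIP container and therefore also starts with PK, which is why dedicated tools parse further into the headers.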

Manual Identification of File Formats

In the previous section, we introduced magic bytes, file headers, and their structures, and how they can be used to identify files manually. With tools like TriD available, it may seem unnecessary to remember these file format details and to open a file manually in a hex editor to identify its format.

But there are times when knowing the various magic bytes for popular file formats does help. As malware analysts, we deal with a lot of data. The data that we deal with may come from network packets, and in some cases, it might include the contents of a file that we are analyzing. Often, the data carries files from malware attackers or contains other files embedded within an outer parent file. Knowing the magic bytes and the general header structure for well-known file formats helps you quickly identify the presence of these files embedded in a huge data haystack, which improves analysis efficiency. For example, Figure 3-15 shows Wireshark displaying a packet capture file carrying a ZIP file in an HTTP response packet. Spotting the PK magic bytes in the packet payload lets an analyst quickly conclude that the packet holds a response from the server returning a zipped file. You saw in Table 3-4 that the magic bytes for the ZIP file format are PK.
../images/491809_1_En_3_Chapter/491809_1_En_3_Fig15_HTML.jpg
Figure 3-15

Using magic bytes to quickly and manually identify the presence of files in other data like packet payloads
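The same manual search can be approximated in code. The Python sketch below looks for a few of the magic bytes from this chapter anywhere inside a blob of data, such as a reassembled packet payload. The names SIGNATURES and find_embedded_signatures and the sample payload are invented for illustration. Short signatures such as PK or MZ can also occur by chance in unrelated data, so every hit is only a hint that needs to be confirmed by inspecting the bytes that follow.

```python
# A few magic bytes from this chapter to search for inside arbitrary data.
SIGNATURES = {
    b"PK": "ZIP compressed file",
    b"MZ": "Windows executable",
    b"%PDF": "PDF document",
    b"\x7fELF": "Linux executable (ELF)",
    b"Rar!": "RAR compressed file",
}


def find_embedded_signatures(payload: bytes, signatures=SIGNATURES) -> list:
    """Return (offset, format name) pairs for every signature hit in payload."""
    hits = []
    for magic, name in signatures.items():
        start = 0
        while (pos := payload.find(magic, start)) != -1:
            hits.append((pos, name))
            start = pos + 1  # keep searching after this hit
    return sorted(hits)


# Hypothetical HTTP response whose body begins with a ZIP file.
payload = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/zip\r\n"
    b"\r\n"
    b"PK\x03\x04rest-of-the-archive"
)
for offset, name in find_embedded_signatures(payload):
    print(f"offset {offset}: possible {name}")
```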

Summary

In this chapter, you learned about file extensions and file formats, as well as the structure, magic bytes, and headers that form the identity of a file format. Using freely available command-line tools, you can quickly identify the type of a malware file and set up the right analysis environment for the file based on its type. Knowledge of magic bytes helps you manually identify the presence of files in various data sources, such as packet payloads and packed files.
