‘Reversing’ of malware network protocols with ‘angr’

One of the most difficult objectives to obtain in the analysis of a malicious binary is usually discovering all of the functionalities that it has. If in addition, these functions are only executed at the discretion of the attackers through its control center, things get complicated. For various reasons, many times we cannot carry out a full dynamic analysis, such as the fall of the malware infrastructure or the isolation of the sample to avoid contact with the C&C. In these cases the analysis of the interaction between the server of the attacker and the sample is usually slower, since you have to create a fictitious server or be continually patching/deceiving the sample, to take it through all the different paths that we want to investigate. Depending on the size and complexity of the analyzed code or the objective of the analysis, this task can vary its difficulty and extension over time.

I am going to propose a study example of the functionalities of a fictitious RAT that can be executed according to the orders received from your C&C panel. Our goal would be to create a server that simulates the attacker’s. For this we have to understand the communication protocol between the server and the sample installed on the victim’s device.

Instead of analyzing all the internal workings of the sample using the typical disassembly and debugging tools, I am going to delegate the responsibility of part of the analysis to a new tool that I had wanted to try for some time: ‘angr‘. ‘Angr’ is a working environment for the analysis of binaries, that makes use of both static and dynamic symbolic analysis of the code. This tool can be used for different purposes, listed on its website, in which there are also many examples. For this entry we are going to focus on the symbolic execution, which can be defined as the analysis of a program to determine what entry conditions must be given to execute different parts of its code.

Due to the complexity of the tool, in order to facilitate its understanding I will propose a simple analysis of a code created to illustrate its use:

https://gist.github.com/halos/15d48a46556645ae7ff2ecb3dfc95d73

The code simulates a RAT that receives instructions from a C&C from the function ‘call_home’, initializing the value of a global variable ‘c2_resp’ that will contain an order of the supposed C&C. Then the ‘exec_order’ function is executed, which, according to the information contained in this structure, will carry out some operations or others.

The values that the different fields of the structure can take are defined as constants at the beginning of the code. They contain random values so as not to give clues of the meaning of each one to possible analysts.

However, when dealing with malware in the laboratory, we lack the source code that generated it, and we only have the compiled binary. This would be all we see from the ‘exec_order’ function if we compile the previous code in C:

With a little more analysis you can get an assembly code that is easier to digest:

This code now shows a structure somewhat easier to understand, but even if it is a moderately organized code, its length can make its analysis difficult, as can be seen in the following image when we go deeper into the function:

However, in day-to-day malware, the logic of these types of programs is usually not so clear and not because their creators do not know how to program, but because they seek to hinder the study they know will be done of their creation.

To study this sample with ‘angr’, I will need to know the value of 4 memory addresses to guide the tool about where and how to work:

  1. Where does the code I want to study begin?
  2. Where does it end?
  3. What parts of the code do I want to reach?
  4. With what variables can I play?

The first two will serve ‘angr’ to delimit the work area on which to evaluate all possible paths and thus avoid the loss of time analyzing paths that do not serve us in this study. In the third point we will have several addresses that will mark the areas of interest that I want to reach and on which the study will focus. The 4th will point to the ‘buffer’ coming from the server which, depending on which values or others, will cause the program to reach the different zones of interest marked in point 3.

It can be seen in the code of the first graph shown, that the function to analyze starts at address 0x0804844d, so it has been assigned to the constant START_ADDR. The addresses to be found are stored in the FIND_ADDRS list. For example, the address 0x8048483 of the instruction “push str.Creando_file: __ s” has been added, from the execution branch to create a file (block with address 0x804847b).

Without going into deep details about the rest of the code, I just wish to point out that I initialize the analyzer with the compiled executable ‘a.out’ of the code in C passing it the start address. It is also necessary to indicate the server response buffer (BUFF_ADDR), which will be the variable to work on and I will indicate the addresses where to stop the analysis. Finally, I will tell it to start exploring the code with all these indications and I will print the results in a way that more or less can be understood by a human:

The first entry that appears comes to say that if the C&C wants to order the bot, it must delete a file (action that runs at the address 0x80485050), it will send a ‘buffer’ with the bytes “00 E9 BB 64 D4 D4 01 1F  AF 78 09 6E 00 00 00 00 00 00 00 00”.

Then, by way of explanation/curiosity, it appears how that address was reached by playing with the contents of the BUFF_ADDR memory address. It should be noted that ‘angr’ works at the bit level, hence the high indexes that appear for a buffer of only 20 bytes (‘c2_resp_0_160[159:128]’).

The first restriction indicates that the 4th DWORD must contain the value 0x64bbe900 (T_FILESYSTEM as it can be seen in the source code in C). However, on the content of the third byte three restrictions appear: that it be different from 0xaff80c17 and 0xc6e6ef6b, and it must also contain the value 0x1f01d4d4. Although with the third restriction it would be enough, this happens for a reason. When simulating the execution, ‘angr’ has to avoid entering certain branches and the condition is that the third byte does not have a value that makes it enter different ‘if’s. If we look at the created C code, the 3rd byte, which corresponds to the ‘fs_order’ field, it is first compared with the constants FSO_CREATE (line 55) and FSO_WRITE (line 67), and finally enters the deletion branch when it is equal to the constant FSO_DELETE (line 71).

This information will help us understand the communication protocol to obtain new IOCs, set up a server that pretends to be the attacker, modify the sample buffer live and this way go directly by focusing on certain functionalities, etc.