IL2CPP Reverse Engineering Part 2: Structural Overview & Finding the Metadata
Earlier we learned what IL2CPP is, how to set up a build environment, and compared the C#, IL, C++ and disassembly of a simple function.
In this article, you will learn:
- an overview of the key files in an IL2CPP application from a reverse-engineering perspective
- how an IL2CPP application loads the metadata we are interested in
- how to find the application binary’s metadata by hand in a disassembler (x64 and ARM)
- beginner-level disassembly navigation and tidying in IDA
- how to interpret C++ function calls in assembly language
Pre-requisites:
- Basic knowledge of high-level programming
- Basic knowledge of disassembly (the article uses IDA but Ghidra works equally well)
- Basic knowledge of what IL2CPP is – I recommend that you read part 1 first if you’re new to IL2CPP
Note: I chose Unity 2019.3.1 more or less at random for this walkthrough.
Different versions vary slightly although the overall principles are the same.
01 The Executable
IL2CPP applications are forged from two key components. First, there is the application code itself.
On Windows, the main executable of an IL2CPP application is essentially just a stub that loads UnityPlayer.dll and calls UnityMain.
For an IL2CPP game, this will select Unity’s IL2CPP initialization path and load the main application binary;
this is usually called GameAssembly.dll in the application’s root path but it can be placed elsewhere and renamed.
On Android, the application binary is libil2cpp.so, and on iOS everything is generally wrapped up into a single executable.
Other platforms use different layouts, but all of the binaries can be analyzed in the same way,
so the target platform doesn’t matter too much.
The application binary (which I’ll just call “the binary” from here on)
is the output created by taking the regular Mono DLLs for the application (e.g. Assembly-CSharp.dll and its dependencies,
as they would be shipped without IL2CPP) and running them through the IL2CPP transpiler. It is therefore
the main target for reverse engineering, since it contains the actual application code.
Besides the application code itself, the binary also contains a vast sea of binary-specific metadata, such as a list of pointers to
every C#-equivalent function, data about every type referenced by method code and so on. Many (most) binaries also expose
the IL2CPP API – a large group of exported functions allowing you to query and modify data in the application at
runtime – useful for dynamic analysis with a debugger. These APIs can be found in the export table and begin with the prefix il2cpp_.
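As a quick way to see which of these API names a given binary carries, the following is a minimal sketch that scans raw bytes for ASCII strings starting with il2cpp_. It is a heuristic string scan, not a real export table parser, so it can also pick up strings that are not exports; the function name find_il2cpp_names is hypothetical and chosen only for illustration.

```python
import re

# Heuristic: export names are stored in the binary as plain ASCII strings,
# so a byte-level regex finds candidates without parsing PE/ELF/Mach-O.
IL2CPP_NAME = re.compile(rb"il2cpp_[A-Za-z0-9_]+")


def find_il2cpp_names(blob):
    """Return a sorted list of unique 'il2cpp_*' strings found in blob.

    blob is the raw content of the application binary (bytes).
    This is a string scan only: a match is a candidate API name,
    not proof that the symbol is actually exported.
    """
    names = {m.group().decode("ascii") for m in IL2CPP_NAME.finditer(blob)}
    return sorted(names)


# Example usage (the path is an assumption, adjust it for your target):
#     with open("GameAssembly.dll", "rb") as f:
#         for name in find_il2cpp_names(f.read()):
#             print(name)
```

To confirm that a candidate really is exported, check it against the export table shown by your disassembler.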
02 The Metadata
The other main file of interest for analysts is global-metadata.dat (“the metadata”).
This file is a platform-independent data file created by IL2CPP containing all of the .NET metadata for the application.
This includes definitions (including symbols) for all of the types, methods, properties, fields and so on for the application.
Many of the structures within are similar to those used by the actual .NET runtime, but tweaked for IL2CPP.
Serge Lidin provides a thorough treatise of the metadata in the excellent book Expert .NET 2.0 IL Assembler.
The metadata file always uses little-endian, 32-bit-wide data regardless of the target platform,
with tables linked via indices rather than pointers. Therefore in principle,
if you are compiling the same application for multiple platforms,
you only need one copy of global-metadata.dat and different executable binaries for each platform.
In practice, builds are often customized with platform-specific functions for Windows, Android and so on.
The metadata file format is very simple.
It always starts with the signature 0xFAB11BAF (little-endian) followed by 4 bytes containing the metadata version number.
This is followed by a long list of offset/length pairs for the various tables of information, directly followed by the tables themselves.
Which tables are actually present depends on the version number, and there will also be corresponding changes in the binary for different versions.
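The layout just described can be sketched in a few lines. This is a minimal illustration, assuming every header field is an unsigned little-endian 32-bit integer; the function name read_metadata_header and the max_pairs parameter are hypothetical. Because the number of tables depends on the version, the caller decides how many offset/length pairs to read and the sketch makes no claim about what each pair refers to.

```python
import struct

METADATA_MAGIC = 0xFAB11BAF  # signature value given in the text


def read_metadata_header(data, max_pairs):
    """Return (version, pairs) from the start of a metadata file.

    pairs is a list of (offset, length) tuples. max_pairs is a
    caller-chosen cap: the real table count varies by metadata
    version and is not assumed here.
    """
    if len(data) < 8:
        raise ValueError("too short to contain a metadata header")
    # "<" = little-endian, "I" = unsigned 32-bit integer
    magic, version = struct.unpack_from("<II", data, 0)
    if magic != METADATA_MAGIC:
        raise ValueError("bad signature: 0x%08X" % magic)
    pairs = []
    pos = 8  # the pairs start right after signature + version
    while len(pairs) < max_pairs and pos + 8 <= len(data):
        pairs.append(struct.unpack_from("<II", data, pos))
        pos += 8
    return version, pairs


# Example usage (file name as used in the article):
#     with open("global-metadata.dat", "rb") as f:
#         version, pairs = read_metadata_header(f.read(), 4)
#     print(version, pairs)
```

Note that because the file is little-endian, the signature appears on disk as the bytes AF 1B B1 FA, which is what you will see in a hex editor.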
Info: The first non-beta IL2CPP version in 2015 was 15, and at the time of writing we are on version 27 (Unity 2020.2).
There was a long period of several years where the version number remained at 24;
however, multiple changes were made to the data format over time and the RE community has named them 24.1 – 24.4.
Version 24.2 (Unity 2019.1) brought substantial changes to the way the data is organized
– moving much of the data from global lists to per-module (per-assembly) lists instead,
with an extra table pointing to these lists for each assembly. Version 27 moves more global list data to per-assembly lists,
and also moves a large block of data (describing which methods use which types, other methods and string literals)
from the metadata file to the binary file.
Additionally, while the metadata and binary have typically moved in lockstep with version advances,
a divergence occurred in Unity 2019.3.7-2019.4.14 where the binary’s metadata was changed but the metadata file remained the same.
This version is numbered 24.3, but the metadata file format of 24.2 and 24.3 is the same – only the binary changed.
One may wonder why an overzealous publisher would want to ship their product with everything required to reconstruct
all of the types and method prototypes in plain sight. Ultimately, this data is required due to .NET’s heavy reliance on reflection
(known in other languages as runtime type information or RTTI) and attributed programming, and cannot be easily elided.
As is the case with Unity apps built with the Mono scripting backend, some developers choose to use canned obfuscation software
such as the popular BeeByte to arbitrarily redefine un-exported symbol names. These tools are useful as a roadblock to thwart the casual attacker,
but for anyone used to determining the meaning of code from the code itself rather than its symbols, such obfuscators have limited effect.
On its own, the metadata file can be used to reconstruct the entire structure of the application as it was when it was written in C#,
with more or less everything except for the actual source code of the methods themselves.
However, this gives us zero insight into the structure of the actual binary we’re analyzing.
To gain that insight, we need to combine the metadata file with the specific binary file we’re looking at,
and to do that, we first need to find the location of the binary’s own metadata structures.
This is crucial for successful reverse engineering, and is our goal for today.