IL2CPP Reverse Engineering Part 1: Hello World and the IL2CPP Toolchain

In this article, you will learn:

  • what IL2CPP is and why it exists
  • what the generated C++ source code and binary disassembly of a simple function look like compared to native C#, IL and C++ code
  • how to set up your environment to generate C++ source code and IL2CPP binaries from your own C# code so that you can examine and compare them with your original code
  • how to use IL2CPP at the command-line on arbitrary code without Unity

Introduction to IL2CPP

IL2CPP is an alternative application deployment model introduced into Unity in 2015 which is designed to bring significant performance improvements to Unity games. It’s a beautiful mess, and today we’re going to start picking it apart.

A standard Unity game is distributed as a series of .NET assemblies which are executed by a managed runtime on the target platform (Mono, in Unity's case), as is the norm for any .NET application. The premise of IL2CPP is to take these assemblies, parse the IL, generate equivalent C++ source code from it, then compile this C++ into machine code for faster, unmanaged execution. This is described quite well on this page of the Unity manual with this diagram:

[Diagram: the IL2CPP conversion process, from the Unity manual]

There are several excellent guides about how IL2CPP generates code, such as Unity’s own IL2CPP Internals blog series and Jackson Dunstan’s exquisitely detailed musings, so I’m not going to repeat that work here. Instead, I want to focus on the opposite perspective: how do we reverse engineer compiled IL2CPP binaries?

Unity games have traditionally been exceptionally easy to reverse engineer, generally requiring nothing more than a copy of ILSpy (or my preferred tool Telerik JustDecompile) and a dream. IL2CPP changes all that: we go from neat assemblies – often with all of the function and variable names intact – to straight-up machine code that we have to wade through in a disassembler. Suddenly, even finding the areas of interest becomes orders of magnitude harder. How can we make this task easier?

To answer that question, we’re going to need to develop a deep understanding of how IL2CPP manages types and data under the hood, and that’s what this series is all about. Buckle up!

Tracing a Path: Six Representations of Hello World

Consider the following trivial program:

using System;
 
namespace HelloWorld
{
    class Program
    {
        static void Main(string[] args) {
            var a = 1;
            var b = 2;
            Console.WriteLine("Hello World: {0}", a + b);
        }
    }
}

How does IL2CPP convert this to C++? In your mind’s eye, you might imagine the method gets translated something like this (setting aside the fact that real C++ code would likely use iostream and cout rather than printf):

#include <stdio.h>
 
int main(int argc, char **argv) {
    int a = 1;
    int b = 2;
    printf("Hello World: %d\r\n", a + b);
}

In fact, Main gets translated like this:

// System.Void HelloWorld.Program::Main(System.String[])
IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR void Program_Main_m7A2CC8035362C204637A882EDBDD0999B3D31776 (StringU5BU5D_t933FB07893230EA91C40FF900D5400665E87B14E* ___args0, const RuntimeMethod* method)
{
    static bool s_Il2CppMethodInitialized;
    if (!s_Il2CppMethodInitialized)
    {
        il2cpp_codegen_initialize_method (Program_Main_m7A2CC8035362C204637A882EDBDD0999B3D31776_MetadataUsageId);
        s_Il2CppMethodInitialized = true;
    }
    int32_t V_0 = 0;
    int32_t V_1 = 0;
    {
        V_0 = 2;
        int32_t L_0 = V_0;
        V_1 = ((int32_t)il2cpp_codegen_add((int32_t)1, (int32_t)L_0));
        int32_t L_1 = V_1;
        int32_t L_2 = L_1;
        RuntimeObject * L_3 = Box(Int32_t585191389E07734F19F3156FF88FB3EF4800D102_il2cpp_TypeInfo_var, &L_2);
        IL2CPP_RUNTIME_CLASS_INIT(Console_t5C8E87BA271B0DECA837A3BF9093AC3560DB3D5D_il2cpp_TypeInfo_var);
        Console_WriteLine_m22F0C6199F705AB340B551EA46D3DB63EE4C6C56(_stringLiteral331919585E3D6FC59F6389F88AE91D15E4D22DD4, L_3, /*hidden argument*/NULL);
        return;
    }
}

Woah! That’s a lot of gunk. What is going on here?

First, remember that IL2CPP starts with the IL code of an assembly as its input – not the C# source code. If we look at the IL of our trivial C# code, we get:

ldc.i4.1
stloc.0
ldc.i4.2
stloc.1
ldstr "Hello World: {0}"
ldloc.0
ldloc.1
add
box System.Int32
call System.Void System.Console::WriteLine(System.String,System.Object)
ret

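This code corresponds to the second braced block in our generated C++. IL is a stack-based pseudo-assembly language; IL2CPP performs a linear scan over the IL bytecode and translates it into non-stack-based equivalents in C++. Each value the IL pushes onto the evaluation stack becomes a C++ variable, which is why we see some redundant variables and assignments in the C++ code. These will hopefully be optimized away by the compiler to some extent. This also explains why we see a boxing operation: the overload being called is Console.WriteLine(String, Object), so the integer sum has to be boxed into an object first, and the box instruction is already present in the IL.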
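To make the idea of a linear scan more concrete, here is a toy sketch of how stack-based instructions can be turned into straight-line code with temporaries. It is not IL2CPP's actual implementation: the Instr record, the translate and hello_world_il functions, and the simplified String*, Object*, Box and Console_WriteLine names in the output are all made up for illustration, and only the handful of opcodes from the listing above are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified instruction record; not IL2CPP's real data model.
struct Instr {
    std::string op;       // "ldc.i4", "ldstr", "ldloc", "stloc", "add", "box", "call", "ret"
    std::string operand;  // constant, local index, or callee name
    int argc;             // number of arguments, used by "call" only
    Instr(std::string o, std::string a = "", int n = 0)
        : op(o), operand(a), argc(n) {}
};

// One linear pass over the IL. The evaluation stack exists only at
// translation time: it holds the names of the temporaries (L_0, L_1, ...)
// that stand in for the values the IL would have pushed.
std::string translate(const std::vector<Instr>& il) {
    std::vector<std::string> stack;
    std::ostringstream out;
    int next = 0;

    // Every push declares a fresh temporary and remembers its name.
    auto push = [&](const std::string& type, const std::string& expr) {
        std::string name = "L_" + std::to_string(next++);
        out << type << " " << name << " = " << expr << ";\n";
        stack.push_back(name);
    };
    auto pop = [&]() {
        assert(!stack.empty());
        std::string top = stack.back();
        stack.pop_back();
        return top;
    };

    for (const Instr& i : il) {
        if (i.op == "ldc.i4") {
            push("int32_t", i.operand);
        } else if (i.op == "ldstr") {
            push("String*", "\"" + i.operand + "\"");
        } else if (i.op == "ldloc") {
            push("int32_t", "V_" + i.operand);
        } else if (i.op == "stloc") {
            out << "V_" << i.operand << " = " << pop() << ";\n";
        } else if (i.op == "add") {
            std::string rhs = pop();
            std::string lhs = pop();
            push("int32_t", lhs + " + " + rhs);
        } else if (i.op == "box") {
            // The type operand is ignored in this sketch.
            push("Object*", "Box(" + pop() + ")");
        } else if (i.op == "call") {
            // Arguments were pushed left to right, so pop them in reverse.
            std::vector<std::string> args(i.argc);
            for (int k = i.argc - 1; k >= 0; --k) args[k] = pop();
            out << i.operand << "(";
            for (std::size_t k = 0; k < args.size(); ++k)
                out << (k ? ", " : "") << args[k];
            out << ");\n";
        } else if (i.op == "ret") {
            out << "return;\n";
        }
    }
    return out.str();
}

// The IL listing from this article, with operands split out by hand.
std::vector<Instr> hello_world_il() {
    return {
        {"ldc.i4", "1"}, {"stloc", "0"},
        {"ldc.i4", "2"}, {"stloc", "1"},
        {"ldstr", "Hello World: {0}"},
        {"ldloc", "0"}, {"ldloc", "1"},
        {"add"}, {"box", "System.Int32"},
        {"call", "Console_WriteLine", 2},
        {"ret"},
    };
}
```

Printing the result of translate(hello_world_il()) gives eleven statements, one per IL instruction, including lines such as int32_t L_3 = V_0; and Object* L_6 = Box(L_5);. Because every push gets its own temporary, the output has the same copy-heavy shape as the real generated code, even though the real generator is far more involved.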