iL2CppDumper

Examining the binary

With this knowledge in hand, we should be able to fire up our favourite disassembler (or maybe just the one that isn’t grotesquely overpriced for hobbyist users), load up the binary and PDB and take a look. The object of the game here is to start by looking at our known metadata location and work our way back through the chain of references and function calls to the starting point, so that we understand what we’re looking for in a real application.

Let’s navigate to g_MetadataRegistration, which is the Il2CppMetadataRegistration table (one of the two tables we are looking for) just to see what it looks like (IDA: press G and type the symbol name then Enter).

.rdata:0000000181D9DE30 g_MetadataRegistration db 0AFh
.rdata:0000000181D9DE31                 db  21h ; !
.rdata:0000000181D9DE32                 db    0
.rdata:0000000181D9DE33                 db    0
.rdata:0000000181D9DE34                 db    0
.rdata:0000000181D9DE35                 db    0
.rdata:0000000181D9DE36                 db    0
.rdata:0000000181D9DE37                 db    0
.rdata:0000000181D9DE38                 dq offset s_Il2CppGenericTypes
.rdata:0000000181D9DE40                 db  97h ; —
.rdata:0000000181D9DE41                 db    7
.rdata:0000000181D9DE42                 db    0
.rdata:0000000181D9DE43                 db    0
.rdata:0000000181D9DE44                 db    0
.rdata:0000000181D9DE45                 db    0
.rdata:0000000181D9DE46                 db    0
.rdata:0000000181D9DE47                 db    0
.rdata:0000000181D9DE48                 dq offset g_Il2CppGenericInstTable
.rdata:0000000181D9DE50                 db 0BFh ; ¿
.rdata:0000000181D9DE51                 db  2Dh ; -
.rdata:0000000181D9DE52                 db    0
.rdata:0000000181D9DE53                 db    0

We can see some pointers and counts. It’s a bit messy, but if we want to see it more clearly we can just click on a line in IDA and tap D repeatedly to toggle each item between 1, 2, 4 and 8 bytes. If we do that for each item, we get something more readable:

.rdata:0000000181D9DE30 g_MetadataRegistration dq 21AFh
.rdata:0000000181D9DE38                 dq offset s_Il2CppGenericTypes
.rdata:0000000181D9DE40                 dq 797h
.rdata:0000000181D9DE48                 dq offset g_Il2CppGenericInstTable
.rdata:0000000181D9DE50                 dq 2DBFh
.rdata:0000000181D9DE58                 dq offset s_Il2CppGenericMethodFunctions
.rdata:0000000181D9DE60                 dq 4DABh
.rdata:0000000181D9DE68                 dq offset g_Il2CppTypeTable
.rdata:0000000181D9DE70                 dq 306Dh
.rdata:0000000181D9DE78                 dq offset g_Il2CppMethodSpecTable
.rdata:0000000181D9DE80                 dq 0EE8h
.rdata:0000000181D9DE88                 dq offset g_FieldOffsetTable
.rdata:0000000181D9DE90                 dq 0EE8h
.rdata:0000000181D9DE98                 dq offset g_Il2CppTypeDefinitionSizesTable
.rdata:0000000181D9DEA0                 dq 4195h
.rdata:0000000181D9DEA8                 dq offset g_MetadataUsages
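What IDA does when you press D is simply a little-endian read over a wider span of bytes. As a minimal sketch in Python (the helper name read_le is made up for illustration), the bytes from the first listing decode like this:

```python
def read_le(raw: bytes) -> int:
    """Interpret raw bytes as an unsigned little-endian integer."""
    return int.from_bytes(raw, byteorder="little")


# Eight bytes at 0x181D9DE30, copied from the byte-by-byte listing.
first_qword = bytes([0xAF, 0x21, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])
# Eight bytes at 0x181D9DE40.
third_qword = bytes([0x97, 0x07, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])

print(hex(read_le(first_qword)))  # 0x21af
print(hex(read_le(third_qword)))  # 0x797
```

The least significant byte comes first in memory, which is why AF 21 reads as 21AFh rather than AF21h.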

If we compare this to Il2CppMetadataRegistration.c in our IL2CPP output, we see that it matches up nicely:

const Il2CppMetadataRegistration g_MetadataRegistration = 
{
    8623,
    s_Il2CppGenericTypes,
    1943,
    g_Il2CppGenericInstTable,
    11711,
    s_Il2CppGenericMethodFunctions,
    19883,
    g_Il2CppTypeTable,
    12397,
    g_Il2CppMethodSpecTable,
    3816,
    g_FieldOffsetTable,
    3816,
    g_Il2CppTypeDefinitionSizesTable,
    16789,
    g_MetadataUsages,
};

(The numbers are the same; they are shown in hexadecimal in the disassembly and in decimal in the C code.)
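The alternating count/pointer layout is the pattern to recognise when no symbols are available. The following is only a sketch of one possible plausibility check, not how any particular tool works: the function names, the image range and the max_count threshold are assumptions, and the pointer values are placeholders because the listing shows symbol names rather than addresses. Only the counts come from the table above.

```python
import struct

# Counts taken from the disassembly listing, in table order.
COUNTS = [0x21AF, 0x797, 0x2DBF, 0x4DAB, 0x306D, 0xEE8, 0xEE8, 0x4195]


def parse_pairs(blob: bytes):
    """Split a blob into (count, pointer) pairs of two little-endian qwords each."""
    n = len(blob) // 8
    qwords = struct.unpack(f"<{n}Q", blob[: n * 8])
    return list(zip(qwords[0::2], qwords[1::2]))


def looks_like_registration(pairs, image_start, image_end, max_count=0x100000):
    """Hypothetical test: small non-zero counts, pointers inside the image."""
    return bool(pairs) and all(
        0 < count <= max_count and image_start <= ptr < image_end
        for count, ptr in pairs
    )


# Assumed image range and placeholder pointers (not taken from the binary).
IMAGE_START, IMAGE_END = 0x180000000, 0x182000000
blob = b"".join(
    struct.pack("<QQ", count, 0x181000000 + 0x1000 * i)
    for i, count in enumerate(COUNTS)
)

pairs = parse_pairs(blob)
print([count for count, _ in pairs])
# [8623, 1943, 11711, 19883, 12397, 3816, 3816, 16789]
print(looks_like_registration(pairs, IMAGE_START, IMAGE_END))  # True
```

The printed counts are the decimal values from Il2CppMetadataRegistration.c, which confirms the hexadecimal and decimal listings describe the same table.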

So this is what we’ll be looking for in a real application, although there will be no symbols of course. If we click on the g_MetadataRegistration label and press X to open cross-references (xrefs), we can see everywhere in the binary that references this address. There is only one xref, and it takes us to:

.text:00000001803EC9D0 ?s_Il2CppCodegenRegistration@@YAXXZ proc near
.text:00000001803EC9D0                 push    rdi
.text:00000001803EC9D2                 sub     rsp, 20h
.text:00000001803EC9D6                 mov     rdi, rsp
.text:00000001803EC9D9                 mov     ecx, 8
.text:00000001803EC9DE                 mov     eax, 0CCCCCCCCh
.text:00000001803EC9E3                 rep stosd
.text:00000001803EC9E5                 lea     r8, unk_181D53A70 ; struct Il2CppCodeGenOptions *
.text:00000001803EC9EC                 lea     rdx, g_MetadataRegistration ; struct Il2CppMetadataRegistration *
.text:00000001803EC9F3                 lea     rcx, ?g_CodeRegistration@@3UIl2CppCodeRegistration@@B ; struct Il2CppCodeRegistration *
.text:00000001803EC9FA                 call    ?il2cpp_codegen_register@@YAXQEBUIl2CppCodeRegistration@@QEBUIl2CppMetadataRegistration@@QEBUIl2CppCodeGenOptions@@@Z
.text:00000001803EC9FF                 add     rsp, 20h
.text:00000001803ECA03                 pop     rdi
.text:00000001803ECA04                 retn
.text:00000001803ECA04 ?s_Il2CppCodegenRegistration@@YAXXZ endp

Here we see a function prologue from 1803EC9D0-1803EC9E4 which we ignore, three LEAs which load the addresses of our wanted structs into registers followed by a call to il2cpp_codegen_register, and finally the function epilogue – which we also ignore.

This is the compiled version of the function IL2CPP generated for us in Il2CppCodeRegistration.cpp:

void s_Il2CppCodegenRegistration()
{
    il2cpp_codegen_register (&g_CodeRegistration, &g_MetadataRegistration, &s_Il2CppCodeGenOptions);
}

64-bit binaries on Windows use the x64 calling convention, which states that the first four integer or pointer arguments to a function are passed in RCX, RDX, R8 and R9, from left to right. While it is obvious with our symbols which struct is which, there is no guarantee that the compiler will load the registers in argument order, and indeed it frequently doesn’t (in the listing above, R8 is loaded first and RCX last). However, since we know the order of the arguments to il2cpp_codegen_register, we know that, in other applications, RCX will always be a pointer to Il2CppCodeRegistration and RDX will always be a pointer to Il2CppMetadataRegistration.

Tip: If you are disassembling ARM binaries, ARMv7’s calling convention uses R0-R3 as the arguments (from left to right), and ARMv8 for 64-bit platforms uses X0-X7.
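To make the "register, not instruction order" rule concrete, here is a small sketch. The register lists come from the conventions described above; the function name label_operands and the convention keys are invented for this example.

```python
# Integer/pointer argument registers, left to right.
ARG_REGISTERS = {
    "win-x64": ["rcx", "rdx", "r8", "r9"],
    "armv7": ["r0", "r1", "r2", "r3"],
    "armv8-a64": ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7"],
}

# Parameter types of il2cpp_codegen_register, in declaration order.
PARAMS = [
    "Il2CppCodeRegistration",
    "Il2CppMetadataRegistration",
    "Il2CppCodeGenOptions",
]


def label_operands(loads, convention="win-x64"):
    """Map each loaded operand to a parameter by destination register."""
    role_of = dict(zip(ARG_REGISTERS[convention], PARAMS))
    return {role_of[reg]: operand for reg, operand in loads if reg in role_of}


# The three LEAs in the order they appear in the listing (R8 first, RCX last).
loads = [
    ("r8", "unk_181D53A70"),
    ("rdx", "g_MetadataRegistration"),
    ("rcx", "g_CodeRegistration"),
]

for role, operand in label_operands(loads).items():
    print(f"{role}: {operand}")
```

Shuffling the loads list gives the same labelling, which is the point: only the destination register identifies the argument.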

Let’s step back again by looking at the xrefs to s_Il2CppCodegenRegistration (click on the label and press X). We might expect a pointer to this function to be referenced by one of the startup hooks we discussed earlier alongside the first figure, and sure enough this is what we find:

.text:0000000180040980 code_reg_hook   proc near
.text:0000000180040980                 push    rdi
.text:0000000180040982                 sub     rsp, 20h
.text:0000000180040986                 mov     rdi, rsp
.text:0000000180040989                 mov     ecx, 8
.text:000000018004098E                 mov     eax, 0CCCCCCCCh
.text:0000000180040993                 rep stosd
.text:0000000180040995                 xor     r9d, r9d        ; int
.text:0000000180040998                 xor     r8d, r8d        ; void (*)(void)
.text:000000018004099B                 lea     rdx, ?s_Il2CppCodegenRegistration@@YAXXZ ; void (*)(void)
.text:00000001800409A2                 lea     rcx, unk_181FC626B ; this
.text:00000001800409A9                 call    ??0RegisterRuntimeInitializeAndCleanup@utils@il2cpp@@QEAA@P6AXXZ0H@Z
.text:00000001800409AE                 add     rsp, 20h
.text:00000001800409B2                 pop     rdi
.text:00000001800409B3                 retn
.text:00000001800409B3 code_reg_hook   endp

Indeed we find a function which passes the address of s_Il2CppCodegenRegistration as an argument to RegisterRuntimeInitializeAndCleanup, just as we expected!

This code snippet merits further explanation for newcomers to disassembly. First, you might notice some weird xor instructions where a register is XOR’ed with itself. This is a standard compiler idiom for setting a register to zero: if you XOR a number with itself, you always get zero. You could use mov r8d, 0 instead, but that encodes to 6 bytes (5 bytes for one of the original registers such as ecx, which needs no prefix byte), whereas the xor takes only 3 bytes here, and x86 processors recognise it as a zeroing idiom, so it is at least as fast.
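The byte counts can be read straight off the listing, since each instruction's size is the gap to the next address. The sketch below does that arithmetic; the mov r8d, 0 encoding at the end is hand-assembled for comparison and does not appear in the binary.

```python
# (address, instruction) pairs copied from the code_reg_hook listing.
LISTING = [
    (0x180040989, "mov ecx, 8"),
    (0x18004098E, "mov eax, 0CCCCCCCCh"),
    (0x180040993, "rep stosd"),
    (0x180040995, "xor r9d, r9d"),
    (0x180040998, "xor r8d, r8d"),
    (0x18004099B, "lea rdx, s_Il2CppCodegenRegistration"),
]


def instruction_sizes(listing):
    """Size of each instruction = address of the next one minus its own."""
    return {
        text: next_addr - addr
        for (addr, text), (next_addr, _) in zip(listing, listing[1:])
    }


sizes = instruction_sizes(LISTING)
print(sizes["xor r8d, r8d"])  # 3
print(sizes["mov ecx, 8"])    # 5

# XORing any value with itself clears every bit.
print(all(x ^ x == 0 for x in (0, 1, 0xCCCCCCCC, 0xFFFFFFFF)))  # True

# Hand-assembled `mov r8d, 0`: REX prefix, opcode, 32-bit immediate.
MOV_R8D_0 = bytes([0x41, 0xB8, 0x00, 0x00, 0x00, 0x00])
print(len(MOV_R8D_0))  # 6
```

A mov with a 32-bit immediate is 5 bytes for ecx, and one byte longer for r8d because the extended registers need a prefix byte.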

Secondly, notice here how a this pointer is passed as the first argument in RCX. Let’s look at the function prototype from the IL2CPP source code in libil2cpp/utils/RegisterRuntimeInitializeAndCleanup.cpp:

                RegisterRuntimeInitializeAndCleanup::RegisterRuntimeInitializeAndCleanup(CallbackFunction Initialize, CallbackFunction Cleanup, 
                    int order)

There are only three arguments, but the assembly code passes four. This is because in machine code, there are no classes, and all functions are global. Therefore, to know which object (class instance) is being used, every class method must receive a pointer to the instance. By convention, this is always passed as the first argument, and in C++ source code it is completely hidden from view. Therefore, we pass this in RCX and the first declared argument – Initialize – in RDX.
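As a rough model of the hidden argument, the sketch below pairs parameters with the Windows x64 registers and prepends this for member functions. It only covers integer and pointer arguments, and the function name assign_registers is invented for illustration.

```python
WIN_X64_ARG_REGISTERS = ["rcx", "rdx", "r8", "r9"]


def assign_registers(declared_params, is_member_function):
    """Pair parameters with registers, prepending the hidden `this` pointer."""
    params = (["this"] if is_member_function else []) + list(declared_params)
    if len(params) > len(WIN_X64_ARG_REGISTERS):
        # Further arguments go on the stack, which this sketch does not model.
        raise ValueError("more than four arguments")
    return dict(zip(WIN_X64_ARG_REGISTERS, params))


# Declared parameters of the RegisterRuntimeInitializeAndCleanup constructor.
ctor_params = ["Initialize", "Cleanup", "order"]
print(assign_registers(ctor_params, is_member_function=True))
# {'rcx': 'this', 'rdx': 'Initialize', 'r8': 'Cleanup', 'r9': 'order'}
```

This lines up with the disassembly: RCX receives the object address, RDX the address of s_Il2CppCodegenRegistration, and R8 and R9 are zeroed for Cleanup and order.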

To make it easier to find again, I gave this function the name code_reg_hook. To rename a function, click on its label and press N.

Finally, let’s step back one more time. This time, there are two xrefs.

The second one is a RUNTIME_FUNCTION struct in the .pdata section, and you can safely ignore it: .pdata holds a list of structs that Windows uses for exception handling and is not of interest to us. Clicking on the first item, we see it is part of a long list of function pointers:

; ...
.rdata:0000000181870BB0                 dq offset sub_1800407C0
.rdata:0000000181870BB8                 dq offset sub_180040800
.rdata:0000000181870BC0                 dq offset sub_1800405D0
.rdata:0000000181870BC8                 dq offset sub_180040660
.rdata:0000000181870BD0                 dq offset sub_180040840
.rdata:0000000181870BD8                 dq offset sub_1800408C0
.rdata:0000000181870BE0                 dq offset sub_180040900
.rdata:0000000181870BE8                 dq offset ??__E?wndTop@CWnd@@2V1@B@@YAXXZ
.rdata:0000000181870BF0                 dq offset sub_180040880
.rdata:0000000181870BF8                 dq offset code_reg_hook
.rdata:0000000181870C00                 dq offset sub_180040B40
.rdata:0000000181870C08                 dq offset sub_1800412C0
.rdata:0000000181870C10                 dq offset sub_180040E00
.rdata:0000000181870C18                 dq offset sub_180040C00
.rdata:0000000181870C20                 dq offset sub_180041000
.rdata:0000000181870C28                 dq offset sub_180040BC0
.rdata:0000000181870C30                 dq offset sub_180040B80
; ...

This is in fact what we hoped for. Remember how we discussed earlier that a library can execute initialization functions when it starts up? This is precisely that list! In a C++ application, this can – depending on how it has been compiled – include every static constructor and dynamic initializer in the application – including those in the standard library – which creates a very long list indeed. It’s an important list though, because almost every binary file with executable code has one, and it serves as our starting point: the first breadcrumb in the trail to the metadata.
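In memory, the table is nothing more than consecutive 8-byte pointers, so a hook can be located by scanning for its address. Here is a minimal sketch using five consecutive entries from the listing; the function name find_slot is hypothetical, and the entry values rely on IDA's sub_XXXXXXXX names encoding each function's address.

```python
import struct

# Five consecutive entries starting at 0x181870BF0 in the listing.
# code_reg_hook lives at 0x180040980 (see its disassembly above).
SLICE_START = 0x181870BF0
ENTRIES = [0x180040880, 0x180040980, 0x180040B40, 0x1800412C0, 0x180040E00]
raw = struct.pack(f"<{len(ENTRIES)}Q", *ENTRIES)


def find_slot(raw_table, table_address, target):
    """Return the address of the table slot holding `target`, or None."""
    for offset in range(0, len(raw_table) - 7, 8):
        (pointer,) = struct.unpack_from("<Q", raw_table, offset)
        if pointer == target:
            return table_address + offset
    return None


print(hex(find_slot(raw, SLICE_START, 0x180040980)))  # 0x181870bf8
```

The result matches the address shown next to code_reg_hook in the listing.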

Info: The init function table has a different location depending on what kind of files you are working with.

For PE files (Windows EXEs and DLLs), the init table is in the .rdata section right after the IAT (Import Address Table), which comes at the start of the section. An easy way to find it in some files is to search for __guard_check_icall_fptr and scroll down until you find a null (zero) pointer. The init table starts at the next address.

For ELF files (Linux, Android etc.), the table is stored in the .init_array section (and finalization functions are in the .fini_array section).

For Mach-O files (iOS), the table is stored in the __mod_init_func section.
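If you are scripting this, the lookup can be keyed on the file's magic bytes. The sketch below is only illustrative: the magic numbers are the standard file signatures rather than anything specific to IL2CPP, fat (multi-architecture) Mach-O files are not handled, and the names detect_format and INIT_TABLE_HINTS are made up.

```python
# Where to look for the init table, per the notes above.
INIT_TABLE_HINTS = {
    "PE": ".rdata, after the Import Address Table",
    "ELF": ".init_array",
    "Mach-O": "__mod_init_func",
}

# Little-endian Mach-O magics: 64-bit (0xFEEDFACF) and 32-bit (0xFEEDFACE).
MACHO_MAGICS = (b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe")


def detect_format(header: bytes):
    """Guess the container format from the first bytes of a file."""
    if header[:2] == b"MZ":
        return "PE"
    if header[:4] == b"\x7fELF":
        return "ELF"
    if header[:4] in MACHO_MAGICS:
        return "Mach-O"
    return None


for sample in (b"MZ\x90\x00", b"\x7fELF\x02\x01\x01", b"\xcf\xfa\xed\xfe"):
    fmt = detect_format(sample)
    print(fmt, "->", INIT_TABLE_HINTS[fmt])
```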

Why are the names all messed up? The long squiggly names are the result of name mangling – a process which guarantees every symbol relating to the binary is unique, and provides additional information to a debugger. By appending the full namespace and an encoded sequence of argument types to each symbol, multiple overloads of the same method still get unique symbols, for example. Not all symbol files use name mangling, but many do. Luckily, it doesn’t have any effect on this kind of reverse engineering – you will just learn to ignore all of the extra bits after a while.
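For the simple cases in this article, the readable part of an MSVC-mangled name can be recovered with a few string operations. This is a deliberately naive sketch, not a real demangler: it handles plain names, namespaced names and constructors (the ??0 prefix), ignores the encoded types after @@, and rejects anything else. The function name simple_msvc_name is hypothetical.

```python
def simple_msvc_name(mangled: str) -> str:
    """Recover the qualified name from a simple MSVC-mangled symbol."""
    if not mangled.startswith("?"):
        return mangled  # not mangled at all
    body = mangled[1:]
    is_ctor = body.startswith("?0")
    if is_ctor:
        body = body[2:]
    elif body.startswith("?"):
        raise ValueError("special name not handled by this sketch")
    qualified, sep, _ = body.partition("@@")
    if not sep:
        raise ValueError("no @@ terminator found")
    # Scopes are stored innermost first, so reverse them.
    parts = qualified.split("@")[::-1]
    if is_ctor:
        parts.append(parts[-1])  # a constructor is named after its class
    return "::".join(parts)


print(simple_msvc_name("?s_Il2CppCodegenRegistration@@YAXXZ"))
# s_Il2CppCodegenRegistration
print(simple_msvc_name(
    "??0RegisterRuntimeInitializeAndCleanup@utils@il2cpp@@QEAA@P6AXXZ0H@Z"
))
# il2cpp::utils::RegisterRuntimeInitializeAndCleanup::RegisterRuntimeInitializeAndCleanup
```

Symbols such as ??__E?wndTop@CWnd@@2V1@B@@YAXXZ (a dynamic initializer) use further special prefixes and raise ValueError here; for those, a proper demangler is the better choice.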

Next >> Finding binary metadata in the wild

Dec 27, 2020 djkaty.com
