Examining the binary
With this knowledge in hand, we should be able to fire up our favourite disassembler (or maybe just the one that isn’t grotesquely overpriced for hobbyist users),
load up the binary and PDB and take a look. The object of the game here is to start by looking at our known metadata location and
work our way back through the chain of references and function calls to the starting point, so that we understand what we’re looking for in a real application.
Let’s navigate to g_MetadataRegistration,
which is the Il2CppMetadataRegistration table (one of the two tables we are looking for) just to see what it looks like
(IDA: press G and type the symbol name then Enter).
.rdata:0000000181D9DE30 g_MetadataRegistration db 0AFh
.rdata:0000000181D9DE31 db 21h ; !
.rdata:0000000181D9DE32 db 0
.rdata:0000000181D9DE33 db 0
.rdata:0000000181D9DE34 db 0
.rdata:0000000181D9DE35 db 0
.rdata:0000000181D9DE36 db 0
.rdata:0000000181D9DE37 db 0
.rdata:0000000181D9DE38 dq offset s_Il2CppGenericTypes
.rdata:0000000181D9DE40 db 97h ; —
.rdata:0000000181D9DE41 db 7
.rdata:0000000181D9DE42 db 0
.rdata:0000000181D9DE43 db 0
.rdata:0000000181D9DE44 db 0
.rdata:0000000181D9DE45 db 0
.rdata:0000000181D9DE46 db 0
.rdata:0000000181D9DE47 db 0
.rdata:0000000181D9DE48 dq offset g_Il2CppGenericInstTable
.rdata:0000000181D9DE50 db 0BFh ; ¿
.rdata:0000000181D9DE51 db 2Dh ; -
.rdata:0000000181D9DE52 db 0
.rdata:0000000181D9DE53 db 0
We can see some pointers and counts. It’s a bit messy, but if we want to see it more clearly we can just click on a line in
IDA and tap D repeatedly to toggle each item between 1, 2, 4 and 8 bytes. If we do that for each item, we get something more readable:
.rdata:0000000181D9DE30 g_MetadataRegistration dq 21AFh
.rdata:0000000181D9DE38 dq offset s_Il2CppGenericTypes
.rdata:0000000181D9DE40 dq 797h
.rdata:0000000181D9DE48 dq offset g_Il2CppGenericInstTable
.rdata:0000000181D9DE50 dq 2DBFh
.rdata:0000000181D9DE58 dq offset s_Il2CppGenericMethodFunctions
.rdata:0000000181D9DE60 dq 4DABh
.rdata:0000000181D9DE68 dq offset g_Il2CppTypeTable
.rdata:0000000181D9DE70 dq 306Dh
.rdata:0000000181D9DE78 dq offset g_Il2CppMethodSpecTable
.rdata:0000000181D9DE80 dq 0EE8h
.rdata:0000000181D9DE88 dq offset g_FieldOffsetTable
.rdata:0000000181D9DE90 dq 0EE8h
.rdata:0000000181D9DE98 dq offset g_Il2CppTypeDefinitionSizesTable
.rdata:0000000181D9DEA0 dq 4195h
.rdata:0000000181D9DEA8 dq offset g_MetadataUsages
If we compare this to Il2CppMetadataRegistration.c in our IL2CPP output, we see that it matches up nicely:
const Il2CppMetadataRegistration g_MetadataRegistration =
{
8623,
s_Il2CppGenericTypes,
1943,
g_Il2CppGenericInstTable,
11711,
s_Il2CppGenericMethodFunctions,
19883,
g_Il2CppTypeTable,
12397,
g_Il2CppMethodSpecTable,
3816,
g_FieldOffsetTable,
3816,
g_Il2CppTypeDefinitionSizesTable,
16789,
g_MetadataUsages,
};
The numbers are the same in both listings; the disassembly shows them in hexadecimal and the C code shows them in decimal.
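To make the byte regrouping concrete, here is a minimal sketch of what happens when eight db items are merged into one dq: the bytes are read as a little-endian 64-bit integer. The helper name read_u64_le and the constant kFirstField are made up for this illustration; the byte values are the first eight bytes shown at g_MetadataRegistration.

```cpp
#include <cstddef>
#include <cstdint>

// Assemble an unsigned 64-bit value from 8 little-endian bytes, which is
// what regrouping eight `db` items into a single `dq` amounts to.
std::uint64_t read_u64_le(const unsigned char* bytes) {
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        // Byte 0 is the least significant, byte 7 the most significant.
        value |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
    }
    return value;
}

// The first eight bytes at g_MetadataRegistration, copied from the listing.
const unsigned char kFirstField[8] = {0xAF, 0x21, 0x00, 0x00,
                                      0x00, 0x00, 0x00, 0x00};
```

Calling read_u64_le(kFirstField) gives 0x21AF, which is 8623 in decimal: the first count in the C source.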
So this is what we’ll be looking for in a real application, although there will be no symbols of course.
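Without symbols, one way to look for a table of this shape is to scan a data section for a run of alternating small counts and in-image pointers. The sketch below is only a heuristic and not how any particular tool works: ImageRange, find_count_pointer_run and the max_count threshold are all assumptions made for illustration, and the number of pairs is left as a parameter rather than fixed to the layout shown in this build.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of the loaded image; not an IL2CPP type.
struct ImageRange {
    std::uint64_t base;  // lowest virtual address of the image
    std::uint64_t end;   // one past the highest virtual address
};

// Return the index of the first qword at which `pairs` consecutive
// (count, pointer) pairs begin, or -1 if no such run exists.
// A "count" must be non-zero and at most `max_count`; a "pointer"
// must fall inside the image.
long find_count_pointer_run(const std::vector<std::uint64_t>& qwords,
                            const ImageRange& image,
                            std::size_t pairs,
                            std::uint64_t max_count) {
    const std::size_t needed = pairs * 2;
    for (std::size_t start = 0; start + needed <= qwords.size(); ++start) {
        bool ok = true;
        for (std::size_t p = 0; p < pairs && ok; ++p) {
            const std::uint64_t count = qwords[start + 2 * p];
            const std::uint64_t ptr = qwords[start + 2 * p + 1];
            ok = count != 0 && count <= max_count &&
                 ptr >= image.base && ptr < image.end;
        }
        if (ok) return static_cast<long>(start);
    }
    return -1;
}
```

A scan like this will produce false positives in a large binary, so any candidate still needs to be confirmed by following the references back to the registration call as described in the rest of this section.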
If we click on the g_MetadataRegistration label and press X to open cross-references (xrefs),
we can see everywhere in the binary that references this address. There is only one xref, and it takes us to:
.text:00000001803EC9D0 ?s_Il2CppCodegenRegistration@@YAXXZ proc near
.text:00000001803EC9D0 push rdi
.text:00000001803EC9D2 sub rsp, 20h
.text:00000001803EC9D6 mov rdi, rsp
.text:00000001803EC9D9 mov ecx, 8
.text:00000001803EC9DE mov eax, 0CCCCCCCCh
.text:00000001803EC9E3 rep stosd
.text:00000001803EC9E5 lea r8, unk_181D53A70 ; struct Il2CppCodeGenOptions *
.text:00000001803EC9EC lea rdx, g_MetadataRegistration ; struct Il2CppMetadataRegistration *
.text:00000001803EC9F3 lea rcx, ?g_CodeRegistration@@3UIl2CppCodeRegistration@@B ; struct Il2CppCodeRegistration *
.text:00000001803EC9FA call ?il2cpp_codegen_register@@YAXQEBUIl2CppCodeRegistration@@QEBUIl2CppMetadataRegistration@@QEBUIl2CppCodeGenOptions@@@Z
.text:00000001803EC9FF add rsp, 20h
.text:00000001803ECA03 pop rdi
.text:00000001803ECA04 retn
.text:00000001803ECA04 ?s_Il2CppCodegenRegistration@@YAXXZ endp
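Here we see a function prologue from 1803EC9D0 to 1803EC9E4, which we can ignore (the rep stosd fills the 20h bytes of stack just reserved with 0CCh, a pattern MSVC emits when runtime stack checks are enabled),
three LEAs which load the addresses of our wanted structs into registers followed by a call to il2cpp_codegen_register,
and finally the function epilogue, which we also ignore.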
This is the compiled version of the function IL2CPP generated for us in Il2CppCodeRegistration.cpp:
void s_Il2CppCodegenRegistration()
{
il2cpp_codegen_register (&g_CodeRegistration, &g_MetadataRegistration, &s_Il2CppCodeGenOptions);
}
64-bit binaries on Windows use the Microsoft x64 calling convention, which passes the first four integer or pointer arguments to a function in
RCX, RDX, R8 and R9. While it is obvious with our symbols which struct is which, there is no guarantee that the compiler will
emit the instructions that load these registers in the same order every time, and indeed it frequently doesn’t (here, R8 is loaded first). However,
since we know the order of the arguments to il2cpp_codegen_register, we know that, in other applications,
RCX will always be a pointer to Il2CppCodeRegistration and RDX will always be a pointer to Il2CppMetadataRegistration at the point of the call.
Tip: If you are disassembling ARM binaries,
ARMv7’s calling convention uses R0-R3 as the arguments (from left to right), and ARMv8 for 64-bit platforms uses X0-X7.
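For the x64 case, the useful consequence is that the destination register of each LEA tells you which argument it is, whatever order the instructions appear in. The sketch below decodes the 7-byte form lea r64, [rip+disp32] and computes the address it loads. The names RipLea and decode_rip_lea are hypothetical, and the sample bytes in kLeaRdx are not taken from the listing: they are reconstructed from the instruction address 1803EC9EC and the target 181D9DE30 shown there.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Result of decoding one `lea r64, [rip+disp32]` instruction.
struct RipLea {
    bool valid;            // false if the bytes are not this instruction form
    unsigned reg;          // destination register: 1 = RCX, 2 = RDX, 8 = R8
    std::uint64_t target;  // absolute address the instruction computes
};

// Decode the encoding REX.W 8D /r with a RIP-relative operand.
// `address` is the virtual address of the first instruction byte.
RipLea decode_rip_lea(const unsigned char* code, std::size_t size,
                      std::uint64_t address) {
    RipLea out = {false, 0, 0};
    if (size < 7) return out;
    const unsigned char rex = code[0];
    const unsigned char modrm = code[2];
    // REX prefix with W=1 (48h..4Fh), opcode 8Dh, ModRM mod=00 rm=101.
    if ((rex & 0xF8) != 0x48 || code[1] != 0x8D || (modrm & 0xC7) != 0x05) {
        return out;
    }
    std::int32_t disp;
    std::memcpy(&disp, code + 3, sizeof disp);  // assumes a little-endian host
    out.valid = true;
    // REX.R (bit 2) extends the 3-bit register field to reach R8..R15.
    out.reg = ((modrm >> 3) & 7u) | ((rex & 0x04) ? 8u : 0u);
    // RIP-relative displacements are measured from the next instruction.
    out.target = address + 7 + static_cast<std::int64_t>(disp);
    return out;
}

// Reconstructed encoding of `lea rdx, g_MetadataRegistration` at 1803EC9EC.
const unsigned char kLeaRdx[7] = {0x48, 0x8D, 0x15, 0x3D, 0x14, 0x9B, 0x01};
```

Decoding kLeaRdx at address 0x1803EC9EC reports register 2 (RDX) and target 0x181D9DE30, so in an unlabelled binary the LEA whose destination is RDX just before the call is the candidate for Il2CppMetadataRegistration.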
Let’s step back again by looking at the xrefs to s_Il2CppCodegenRegistration (click on the label and press X).
We might expect a pointer to this function to be referenced by one of the startup hooks we discussed alongside the first figure, and sure enough this is what we find:
.text:0000000180040980 code_reg_hook proc near
.text:0000000180040980 push rdi
.text:0000000180040982 sub rsp, 20h
.text:0000000180040986 mov rdi, rsp
.text:0000000180040989 mov ecx, 8
.text:000000018004098E mov eax, 0CCCCCCCCh
.text:0000000180040993 rep stosd
.text:0000000180040995 xor r9d, r9d ; int
.text:0000000180040998 xor r8d, r8d ; void (*)(void)
.text:000000018004099B lea rdx, ?s_Il2CppCodegenRegistration@@YAXXZ ; void (*)(void)
.text:00000001800409A2 lea rcx, unk_181FC626B ; this
.text:00000001800409A9 call ??0RegisterRuntimeInitializeAndCleanup@utils@il2cpp@@QEAA@P6AXXZ0H@Z
.text:00000001800409AE add rsp, 20h
.text:00000001800409B2 pop rdi
.text:00000001800409B3 retn
.text:00000001800409B3 code_reg_hook endp
Indeed we find a function which passes the address of s_Il2CppCodegenRegistration as an argument to RegisterRuntimeInitializeAndCleanup,
just as we expected!
This code snippet merits further explanation for newcomers to disassembly.
First, you might notice some weird xor instructions where a register is XOR’ed with itself.
This is a standard compiler idiom for setting a register to zero: if you XOR a number with itself,
you always get zero. You can use mov r8d, 0 instead, but that encodes to 6 bytes,
whereas the xor needs only 3 and is recognised by modern x86 processors as a zeroing idiom, so it is never the slower choice.
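The size difference can be checked by writing out the encodings. The two byte arrays below are one valid encoding of each instruction, written from the x86-64 encoding rules rather than copied from the binary; the constant names are made up. For R8 to R15 both forms need a REX prefix, which is why the mov form comes to 6 bytes here (it is 5 for the older registers such as ECX, as in the mov ecx, 8 above).

```cpp
#include <cstddef>

// xor r8d, r8d : REX prefix (45h), opcode (33h), ModRM (C0h)
const unsigned char kXorR8d[] = {0x45, 0x33, 0xC0};

// mov r8d, 0   : REX prefix (41h), opcode B8h+r, 32-bit immediate
const unsigned char kMovR8dZero[] = {0x41, 0xB8, 0x00, 0x00, 0x00, 0x00};

static_assert(sizeof kXorR8d == 3, "xor r8d, r8d is 3 bytes");
static_assert(sizeof kMovR8dZero == 6, "mov r8d, 0 is 6 bytes");
```

The 3-byte size of the xor agrees with the listing, where xor r9d, r9d sits at 180040995 and the next instruction starts at 180040998.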
Secondly, notice here how a this pointer is passed as the first argument in RCX.
Let’s look at the function prototype from the IL2CPP source code in libil2cpp/utils/RegisterRuntimeInitializeAndCleanup.cpp:
RegisterRuntimeInitializeAndCleanup::RegisterRuntimeInitializeAndCleanup(CallbackFunction Initialize, CallbackFunction Cleanup,
int order)
There are only three declared arguments, but the assembly code passes four.
This is because in machine code, there are no classes, and all functions are global.
Therefore, to know which object (class instance) is being used, every non-static member function, constructors included, must receive a pointer to the instance.
By convention, this pointer is passed as the first argument, and in C++ source code it is completely hidden from view.
Therefore, this is passed in RCX and the first declared argument, Initialize, in RDX.
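As a rough illustration of the hidden argument, the sketch below shows a member function next to a free function that does the same job with the object pointer made explicit. Counter, CounterData and Counter_add are hypothetical names for this example only; the second form is approximately what the first looks like once compiled.

```cpp
// Hypothetical class used only to illustrate the hidden `this` argument.
class Counter {
public:
    explicit Counter(int start) : value_(start) {}

    // One declared argument; `this` is passed implicitly by the compiler.
    int add(int amount) {
        value_ += amount;
        return value_;
    }

private:
    int value_;
};

// The same operation as an ordinary function: the object pointer is now a
// visible first parameter, followed by the declared argument.
struct CounterData {
    int value;
};

int Counter_add(CounterData* self, int amount) {
    self->value += amount;
    return self->value;
}
```

A call such as counter.add(5) therefore passes two arguments at the machine level: the address of counter first (RCX on Windows x64), then 5 (RDX).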
To make it easier to find again, I gave this function the name code_reg_hook.
To rename a function, click on its label and press N.
Finally, let’s step back one more time and look at the xrefs to code_reg_hook. This time, there are two.
The second one is a RUNTIME_FUNCTION struct in the .pdata section and you can safely ignore it.
The .pdata section holds the table of structs Windows uses for exception handling and is not of interest to us.
Clicking on the first item, we see it is part of a long list of function pointers:
; ...
.rdata:0000000181870BB0 dq offset sub_1800407C0
.rdata:0000000181870BB8 dq offset sub_180040800
.rdata:0000000181870BC0 dq offset sub_1800405D0
.rdata:0000000181870BC8 dq offset sub_180040660
.rdata:0000000181870BD0 dq offset sub_180040840
.rdata:0000000181870BD8 dq offset sub_1800408C0
.rdata:0000000181870BE0 dq offset sub_180040900
.rdata:0000000181870BE8 dq offset ??__E?wndTop@CWnd@@2V1@B@@YAXXZ
.rdata:0000000181870BF0 dq offset sub_180040880
.rdata:0000000181870BF8 dq offset code_reg_hook
.rdata:0000000181870C00 dq offset sub_180040B40
.rdata:0000000181870C08 dq offset sub_1800412C0
.rdata:0000000181870C10 dq offset sub_180040E00
.rdata:0000000181870C18 dq offset sub_180040C00
.rdata:0000000181870C20 dq offset sub_180041000
.rdata:0000000181870C28 dq offset sub_180040BC0
.rdata:0000000181870C30 dq offset sub_180040B80
; ...
This is in fact what we hoped for.
Remember how we discussed earlier that a library can execute initialization functions when it starts up?
This is precisely that list! In a C++ application, depending on how it has been compiled, this can
include every static constructor and dynamic initializer in the application, including those in the standard library,
which creates a very long list indeed. It’s an important list though, because almost every binary file with executable
code has one, and it serves as our starting point: the first breadcrumb in the trail to the metadata.
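To see where an entry like code_reg_hook comes from, consider a global object whose constructor takes a callback. The sketch below is a simplified stand-in, not the IL2CPP implementation: RegisterInit, init_callbacks, codegen_registration and run_init_callbacks are hypothetical names, and storing the callback for a later call is an assumption about what such a registration class would do.

```cpp
#include <vector>

using Callback = void (*)();

// A function-local static avoids depending on the order in which globals
// in different source files are initialized.
std::vector<Callback>& init_callbacks() {
    static std::vector<Callback> callbacks;
    return callbacks;
}

// Hypothetical stand-in for a registration class: the constructor simply
// remembers the callback so that it can be invoked later.
struct RegisterInit {
    explicit RegisterInit(Callback initialize) {
        init_callbacks().push_back(initialize);
    }
};

int g_registered_calls = 0;
void codegen_registration() { ++g_registered_calls; }

// Constructing this global requires running code, so the compiler emits a
// small dynamic-initializer function for it and places a pointer to that
// function in the binary's init table.
static RegisterInit s_register_codegen(codegen_registration);

// Later, the runtime would walk the list and invoke each callback.
void run_init_callbacks() {
    for (Callback cb : init_callbacks()) {
        cb();
    }
}
```

By the time main starts, the callback has already been recorded, and calling run_init_callbacks() then invokes it. The compiler-generated initializer for s_register_codegen plays the role that code_reg_hook plays in the disassembly.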
Info: The init function table has a different location depending on what kind of file you are working with.
For PE files (Windows EXEs and DLLs), the init table is in the .rdata section right after the IAT (Import Address Table),
which comes at the start of the section. An easy way to find it in some files is to search for __guard_check_icall_fptr and
scroll down until you find a null (zero) pointer. The init table starts at the next address.
For ELF files (Linux, Android etc.), the table is stored in the .init_array section (and finalization functions are in the .fini_array section).
For Mach-O files (iOS), the table is stored in the __mod_init_func section.
Why are the names all messed up? The long squiggly names are the result of name mangling –
a process which guarantees every symbol relating to the binary is unique, and provides additional information to a debugger.
For example, because each symbol encodes the full namespace and the sequence of argument types, multiple overloads of the same method still get unique symbols.
Not all symbol files use name mangling, but many do. Luckily, it doesn’t have any effect on this kind of reverse engineering;
you will just learn to ignore all of the extra bits after a while.
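If you want to strip the extra bits mechanically, the readable part of a Microsoft-style mangled name is easy to pull out: after the leading ?, the name and its enclosing scopes are separated by @ (innermost first) and terminated by @@, and a ??0 prefix marks a constructor. The sketch below is a deliberately tiny helper under those assumptions. The function name qualified_name is hypothetical, it discards all type information, and it does not handle other special prefixes such as the ??__E seen in the init table.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Extract the qualified name from an MSVC-decorated symbol, ignoring the
// encoded types. Handles only plain names and constructors (??0).
std::string qualified_name(const std::string& decorated) {
    if (decorated.empty() || decorated[0] != '?') return decorated;
    std::size_t pos = 1;
    bool is_constructor = false;
    if (decorated.compare(pos, 2, "?0") == 0) {  // "??0" marks a constructor
        is_constructor = true;
        pos += 2;
    }
    const std::size_t end = decorated.find("@@", pos);
    if (end == std::string::npos) return decorated;

    // Collect the '@'-separated parts: name first, then enclosing scopes.
    std::vector<std::string> parts;
    while (pos < end) {
        std::size_t next = decorated.find('@', pos);
        if (next > end) next = end;
        parts.push_back(decorated.substr(pos, next - pos));
        pos = next + 1;
    }
    if (parts.empty()) return decorated;

    // Scopes are stored innermost first, so join them in reverse.
    std::string result;
    for (std::size_t i = parts.size(); i-- > 0;) {
        if (!result.empty()) result += "::";
        result += parts[i];
    }
    // A constructor's own name is the name of its class.
    if (is_constructor) result += "::" + parts[0];
    return result;
}
```

Applied to ?s_Il2CppCodegenRegistration@@YAXXZ this returns s_Il2CppCodegenRegistration, and the constructor symbol from the call above comes out as il2cpp::utils::RegisterRuntimeInitializeAndCleanup::RegisterRuntimeInitializeAndCleanup. For real work, your disassembler's built-in demangling is the better choice.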