Finding binary metadata in the wild
Now that we’ve examined our disassembly in easy mode, let’s turn our attention to a real application where we have no source code and no symbols,
and perform the analysis the opposite way around: start from the initialization function list and drill down to the metadata structs.
We can leverage what we’ve learned to have a better idea of what we’re looking for, and if we get stuck, we can turn to the IL2CPP library source code for help.
Note that this technique is not the only way to find the desired data, nor is it usually the fastest. However,
it is the method that gives the best understanding of the code. We briefly summarize other possible strategies below.
For this example, I’ll use a randomly chosen Android game built for ARMv8-A (64-bit): Subway Surfers v2.10.2.
Having extracted the APK with 7-Zip, we can find the binary at /lib/arm64-v8a/libil2cpp.so and load it into our disassembler.
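An APK is an ordinary ZIP archive, so the extraction step can also be scripted instead of using 7-Zip. The following is a minimal sketch using only the Python standard library; the helper names extract_il2cpp and looks_like_elf are made up for illustration, and the archive path is the one given above.

```python
import zipfile

# Path of the IL2CPP binary inside the APK, as given in the text above.
IL2CPP_PATH = "lib/arm64-v8a/libil2cpp.so"

def extract_il2cpp(apk, member=IL2CPP_PATH):
    """Return the raw bytes of the IL2CPP binary stored in an APK.

    `apk` may be a file name or a binary file-like object.
    (Hypothetical helper, not part of any existing tool.)
    """
    with zipfile.ZipFile(apk) as archive:
        return archive.read(member)

def looks_like_elf(data):
    """Cheap sanity check: every ELF file starts with the bytes 7F 45 4C 46."""
    return data[:4] == b"\x7fELF"
```

Calling extract_il2cpp with the path to the downloaded APK should return the same bytes that 7-Zip would write to disk, and looks_like_elf gives a quick check that the right file was picked before loading it into the disassembler.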
Press Ctrl+S to open the segment list and double-click on .init_array to navigate there. We find the following list:
.init_array:0000000002ADA620 ; ELF Initialization Function Table
.init_array:0000000002ADA620 ; ===========================================================================
.init_array:0000000002ADA620
.init_array:0000000002ADA620 ; Segment type: Pure data
.init_array:0000000002ADA620 AREA .init_array, DATA, ALIGN=3
.init_array:0000000002ADA620 ; ORG 0x2ADA620
.init_array:0000000002ADA620 off_2ADA620 DCQ sub_B4F89C ; DATA XREF: LOAD:off_88↑o
.init_array:0000000002ADA620 ; sub_1078EC4:loc_10790A0↑o ...
.init_array:0000000002ADA628 DCQ sub_B4FC64
.init_array:0000000002ADA630 DCQ sub_B4FD18
.init_array:0000000002ADA638 DCQ sub_B4FD34
.init_array:0000000002ADA640 DCQ sub_B50394
.init_array:0000000002ADA648 DCQ sub_B504C0
.init_array:0000000002ADA650 DCQ sub_B505B0
.init_array:0000000002ADA658 DCQ sub_B50624
.init_array:0000000002ADA660 DCQ sub_B50780
.init_array:0000000002ADA660 ; .init_array ends
.init_array:0000000002ADA660
Due to the way code is compiled, it’s often best to start the search from the end of the list, though that’s not always the case.
We double-click on each function in turn,
starting from the end, looking for something that resembles either the init hook or Il2CppCodeGenRegistration() itself.
Most of the functions contain calls to __cxa_atexit, which registers a function to be called at process exit or when the library is unloaded from memory;
we can immediately discard all of these, along with anything else that calls internal compiler-related functions,
typically starting with __cxa, __gxx and so on. You will soon learn to recognize these from experience.
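The initialization table is simply an array of 64-bit little-endian function pointers, so the walk from the last entry to the first can be sketched in a few lines of Python. This is only an illustration: the section bytes are rebuilt here from the addresses shown in the listing above, and it assumes the entries are already resolved, as the disassembler displays them. In a raw file on disk they may instead be filled in by relocations.

```python
import struct

def read_init_array(data):
    """Decode a raw .init_array section into a list of 64-bit function addresses."""
    count = len(data) // 8  # each entry is one 8-byte pointer
    return list(struct.unpack_from(f"<{count}Q", data))

# Rebuild the section contents from the addresses shown in the listing above.
listed = [0xB4F89C, 0xB4FC64, 0xB4FD18, 0xB4FD34, 0xB50394,
          0xB504C0, 0xB505B0, 0xB50624, 0xB50780]
section = struct.pack(f"<{len(listed)}Q", *listed)

entries = read_init_array(section)
scan_order = list(reversed(entries))  # visit the last entry first

for address in scan_order:
    print(f"sub_{address:X}")
```

The printed names match the labels the disassembler assigns, which makes it easy to tick off each candidate while clicking through them.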
The function at sub_B4FD18 looks interesting:
.text:0000000000B4FD18 ; __unwind {
.text:0000000000B4FD18 ADRP X0, #unk_2FDFD8B@PAGE
.text:0000000000B4FD1C ADRP X1, #sub_D1DB7C@PAGE
.text:0000000000B4FD20 ADD X0, X0, #unk_2FDFD8B@PAGEOFF
.text:0000000000B4FD24 ADD X1, X1, #sub_D1DB7C@PAGEOFF
.text:0000000000B4FD28 MOV X2, XZR
.text:0000000000B4FD2C MOV W3, WZR
.text:0000000000B4FD30 B loc_D67FC8
.text:0000000000B4FD30 ; } // starts at B4FD18
If you actually click through every function in the init table you’ll see that this one both looks vastly different to all the others,
and is much shorter. Furthermore, all it does is load four arguments and jump to another function: a pointer to unknown data in X0,
a function pointer in X1, and zeroes in X2 and W3 (the lower 32 bits of X3). IDA helps us here by defining function names with the sub_ prefix,
so we can easily see that X1 is a function pointer.
Tip: In ARMv8, loading a 64-bit address usually takes two instructions. ADRP loads the address of the 4 KB page containing the target into a register (it is computed relative to the program counter, with the low 12 bits cleared),
and then ADD adds the low 12 bits, the offset within that page, to make a complete address. Note that these instructions don’t have to be paired right next to each other, as you can see above.
XZR and WZR represent 64-bit and 32-bit “zero registers”. They are a shortcut and always contain the value zero.
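To make the address arithmetic concrete, here is a small Python model of an ADRP/ADD pair. It is a sketch of the calculation only, not an instruction decoder: the page deltas (0x2490 and 0x1CE) were worked out by hand from the addresses in the listing above rather than read from the instruction encodings, and the function names are illustrative.

```python
PAGE_MASK = 0xFFF          # low 12 bits: offset within a 4 KB page
MASK_64 = (1 << 64) - 1    # registers are 64 bits wide

def adrp(pc, page_delta):
    """Model ADRP: the page of the current instruction plus a signed number of pages."""
    return ((pc & ~PAGE_MASK) + (page_delta << 12)) & MASK_64

def add_imm(base, imm12):
    """Model ADD with a 12-bit immediate, as used for the @PAGEOFF part."""
    if not 0 <= imm12 <= PAGE_MASK:
        raise ValueError("immediate must fit in 12 bits")
    return (base + imm12) & MASK_64

# X0: ADRP at 0xB4FD18, ADD at 0xB4FD20 -> unk_2FDFD8B
x0 = add_imm(adrp(0xB4FD18, 0x2490), 0xD8B)

# X1: ADRP at 0xB4FD1C, ADD at 0xB4FD24 -> sub_D1DB7C
x1 = add_imm(adrp(0xB4FD1C, 0x1CE), 0xB7C)

print(hex(x0), hex(x1))
```

Because ADRP works from the page of the instruction itself, only the ADRP's address matters for the result; the matching ADD can sit anywhere later in the function.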
Recall from earlier that class methods require a this parameter as the first argument,
so it’s a reasonable guess that X0 is an instance pointer. We tap N on the various labels and name them with their suspected meanings:
.text:0000000000B4FD18 code_reg_hook
.text:0000000000B4FD18 ; __unwind {
.text:0000000000B4FD18 ADRP X0, #this@PAGE
.text:0000000000B4FD1C ADRP X1, #Il2CppCodeGenRegistration@PAGE
.text:0000000000B4FD20 ADD X0, X0, #this@PAGEOFF
.text:0000000000B4FD24 ADD X1, X1, #Il2CppCodeGenRegistration@PAGEOFF
.text:0000000000B4FD28 MOV X2, XZR
.text:0000000000B4FD2C MOV W3, WZR
.text:0000000000B4FD30 B loc_D67FC8
.text:0000000000B4FD30 ; } // starts at B4FD18
Now we double-click the unconditional branch to loc_D67FC8 and see what awaits. What we find looks scary –
it could be RegisterRuntimeInitializeAndCleanup –
so let’s back out and double-click on Il2CppCodeGenRegistration instead to see if it does what we expect:
.text:0000000000D1DB7C Il2CppCodeGenRegistration
.text:0000000000D1DB7C ; __unwind {
.text:0000000000D1DB7C ADRP X1, #off_2DB5048@PAGE
.text:0000000000D1DB80 LDR X1, [X1,#off_2DB5048@PAGEOFF]
.text:0000000000D1DB84 ADRP X0, #unk_2D40460@PAGE
.text:0000000000D1DB88 ADRP X2, #unk_24DF3DC@PAGE
.text:0000000000D1DB8C ADD X0, X0, #unk_2D40460@PAGEOFF
.text:0000000000D1DB90 ADD X2, X2, #unk_24DF3DC@PAGEOFF
.text:0000000000D1DB94 B loc_D71E34
Three pointers are loaded into X0-X2 and the code branches to another function. Referring back to the C definition of Il2CppCodeGenRegistration,
we see that this is exactly what it does, jumping to il2cpp_codegen_register.
So we have probably found our metadata! We name the addresses once again,
being careful to use the order matching the signature of il2cpp_codegen_register:
.text:0000000000D1DB7C Il2CppCodeGenRegistration
.text:0000000000D1DB7C ; __unwind {
.text:0000000000D1DB7C ADRP X1, #g_MetadataRegistration@PAGE
.text:0000000000D1DB80 LDR X1, [X1,#g_MetadataRegistration@PAGEOFF]
.text:0000000000D1DB84 ADRP X0, #g_CodeRegistration@PAGE
.text:0000000000D1DB88 ADRP X2, #s_Il2CppCodeGenOptions@PAGE
.text:0000000000D1DB8C ADD X0, X0, #g_CodeRegistration@PAGEOFF
.text:0000000000D1DB90 ADD X2, X2, #s_Il2CppCodeGenOptions@PAGEOFF
.text:0000000000D1DB94 B il2cpp_codegen_register
We double-click on g_MetadataRegistration to find a slight hiccup:
.got:0000000002DB5038 off_2DB5038 DCQ qword_301CD18 ; DATA XREF: sub_15B58B0+E4↑o
.got:0000000002DB5038 ; sub_15B58B0+E8↑r
.got:0000000002DB5040 off_2DB5040 DCQ qword_301CD20 ; DATA XREF: sub_12B5E64+BC↑o
.got:0000000002DB5040 ; sub_12B5E64+C0↑r
.got:0000000002DB5048 g_MetadataRegistration DCQ dword_2D41320
.got:0000000002DB5048 ; DATA XREF: Il2CppCodeGenRegistration↑o
.got:0000000002DB5048 ; Il2CppCodeGenRegistration+4↑r
.got:0000000002DB5050 off_2DB5050 DCQ qword_301CD28 ; DATA XREF: sub_21374A4+88↑o
.got:0000000002DB5050 ; sub_21374A4+8C↑r
.got:0000000002DB5058 off_2DB5058 DCQ qword_301CD30 ; DATA XREF: sub_13CA74C+7C↑o
.got:0000000002DB5058 ; sub_13CA74C+80↑r
.got:0000000002DB5060 off_2DB5060 DCQ qword_301CD38 ; DATA XREF: sub_148F624+128↑o
Well, it turns out that this was not the Il2CppMetadataRegistration struct after all,
but rather a pointer to it, so we rename the label pMetadataRegistration to make this clear
(note the p at the start – this is a traditional naming convention, but you can use whatever naming style makes it easiest for you),
and give dword_2D41320 the name g_MetadataRegistration, then double-click on it:
.data.rel.ro:0000000002D41320 g_MetadataRegistration DCD 0x89E3
.data.rel.ro:0000000002D41324 ALIGN 8
.data.rel.ro:0000000002D41328 off_2D41328 DCQ off_2CCDDB0
.data.rel.ro:0000000002D41330 dword_2D41330 DCD 0x1B11
.data.rel.ro:0000000002D41334 ALIGN 8
.data.rel.ro:0000000002D41338 off_2D41338 DCQ off_2D2DDD8
.data.rel.ro:0000000002D41340 DCB 0xBD
.data.rel.ro:0000000002D41341 DCB 0xAC
.data.rel.ro:0000000002D41342 DCB 0
.data.rel.ro:0000000002D41343 DCB 0
.data.rel.ro:0000000002D41344 DCB 0
.data.rel.ro:0000000002D41345 DCB 0
.data.rel.ro:0000000002D41346 DCB 0
.data.rel.ro:0000000002D41347 DCB 0
.data.rel.ro:0000000002D41348 DCQ unk_2327E58
.data.rel.ro:0000000002D41350 DCQ stru_10C88.st_info
.data.rel.ro:0000000002D41358 DCQ off_2B71F20
.data.rel.ro:0000000002D41360 DCB 0xAB
.data.rel.ro:0000000002D41361 DCB 0xB9
.data.rel.ro:0000000002D41362 DCB 0
.data.rel.ro:0000000002D41363 DCB 0
.data.rel.ro:0000000002D41364 DCB 0
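Bingo, we have found a data structure.
In a 64-bit binary, pointers are 8 bytes long, and each 4-byte count is padded (note the ALIGN 8 directives) so that the pointer after it stays 8-byte aligned. Every
field should therefore occupy 8 bytes (DCQ quad-word), so let’s tidy it up using the technique of tapping D from earlier and see what we get: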
.data.rel.ro:0000000002D41320 g_MetadataRegistration DCQ 0x89E3
.data.rel.ro:0000000002D41328 off_2D41328 DCQ off_2CCDDB0
.data.rel.ro:0000000002D41330 qword_2D41330 DCQ 0x1B11
.data.rel.ro:0000000002D41338 off_2D41338 DCQ off_2D2DDD8
.data.rel.ro:0000000002D41340 DCQ 0xACBD
.data.rel.ro:0000000002D41348 DCQ unk_2327E58
.data.rel.ro:0000000002D41350 DCQ stru_10C88.st_info
.data.rel.ro:0000000002D41358 DCQ off_2B71F20
.data.rel.ro:0000000002D41360 DCQ 0xB9AB
.data.rel.ro:0000000002D41368 DCQ unk_229CA54
.data.rel.ro:0000000002D41370 DCQ 0x2942
.data.rel.ro:0000000002D41378 DCQ unk_2F61B48
.data.rel.ro:0000000002D41380 DCQ 0x2942
.data.rel.ro:0000000002D41388 DCQ off_2F76558
.data.rel.ro:0000000002D41390 DCQ 0xA9D9
.data.rel.ro:0000000002D41398 DCQ off_2BF8380
This looks a lot like a list of counts and pointers, exactly as expected. There is a slight quirk where IDA
has incorrectly mapped the count at 2D41350 to an address. You can fix this by clicking on the label,
pressing U to undefine it, and then tapping D four times to turn it from bytes to a qword.
This exact same process can be repeated to find g_CodeRegistration, giving us the two key metadata structures we were looking for.
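The two steps we just performed by hand, following the pointer in the .got slot and then reading alternating counts and pointers, can be sketched in Python as follows. This is a minimal model under stated assumptions: memory is represented as a list of (base address, bytes) segments built from a few values in the listings above, the addresses are those shown by the disassembler rather than file offsets, and the helper names read and read_registration are hypothetical.

```python
import struct

def read(segments, address, fmt):
    """Unpack `fmt` at a virtual address from a list of (base, bytes) segments."""
    size = struct.calcsize(fmt)
    for base, data in segments:
        if base <= address and address + size <= base + len(data):
            return struct.unpack_from(fmt, data, address - base)
    raise ValueError(f"address {address:#x} is not mapped")

def read_registration(segments, got_slot, n_pairs):
    """Follow the pointer stored in a .got slot, then read (count, pointer) pairs."""
    (struct_address,) = read(segments, got_slot, "<Q")
    # Each pair is a 4-byte count, 4 bytes of padding and an 8-byte pointer.
    pairs = [read(segments, struct_address + 16 * i, "<i4xQ")
             for i in range(n_pairs)]
    return struct_address, pairs

# Toy memory image rebuilt from the listings: one .got slot and three pairs.
got = (0x2DB5048, struct.pack("<Q", 0x2D41320))
data = (0x2D41320, struct.pack("<i4xQi4xQi4xQ",
                               0x89E3, 0x2CCDDB0,
                               0x1B11, 0x2D2DDD8,
                               0xACBD, 0x2327E58))

address, pairs = read_registration([got, data], 0x2DB5048, 3)
for count, pointer in pairs:
    print(f"count={count:#x} pointer={pointer:#x}")
```

Reading each count as a 4-byte integer followed by padding, rather than as a full qword, mirrors the DCD plus ALIGN 8 layout the disassembler showed before we tidied it up.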
Note: Some compilers may merge together small functions,
or functions that are only called once – which is especially applicable to initialization code that is only called one time at application startup.
This process is called “inlining” and it is not uncommon to see the chain of Il2CppCodeGenRegistration to il2cpp_codegen_register to il2cpp::vm::MetadataCache::Register
as a single inlined function. Keep this in mind if the chain of calls in a real application doesn’t match up with what you expect.