Finding binary metadata in the wild

Now that we’ve examined our disassembly in easy mode, let’s turn our attention to a real application where we have no source code and no symbols, and perform the analysis the opposite way around: start from the initialization function list and drill down to the metadata structs. We can use what we’ve learned to get a better idea of what we’re looking for, and if we get stuck, we can turn to the IL2CPP library source code for help.

Note that this technique is not the only way to find the desired data, nor is it usually the fastest. However, it is the method that gives the best understanding of the code. We briefly summarize other possible strategies below.

For this example I’ll use a randomly chosen Android game built for ARMv8-A (64-bit): Subway Surfers v2.10.2.

Having extracted the APK with 7-Zip, we can find the binary at /lib/arm64-v8a/libil2cpp.so and load it into our disassembler.
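
If you would rather script the extraction than use 7-Zip, an APK is an ordinary ZIP archive, so something like the following minimal sketch should work. The function name extract_il2cpp and the output directory are hypothetical; the only detail taken from the text is the path of the library inside the archive.

```python
import zipfile
from pathlib import Path

# Path of the IL2CPP binary inside the APK, as given in the text
# (ZIP member names have no leading slash).
LIB_PATH = "lib/arm64-v8a/libil2cpp.so"


def extract_il2cpp(apk_path, out_dir):
    """Extract libil2cpp.so from an APK and return the path of the extracted file."""
    with zipfile.ZipFile(apk_path) as apk:
        if LIB_PATH not in apk.namelist():
            raise FileNotFoundError(f"{LIB_PATH} not found in {apk_path}")
        # ZipFile.extract recreates the member's directory structure under out_dir.
        return Path(apk.extract(LIB_PATH, path=out_dir))
```

Calling extract_il2cpp("game.apk", "extracted") (both names are placeholders) would leave the binary at extracted/lib/arm64-v8a/libil2cpp.so, ready to load into the disassembler.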

Press Ctrl+S to open the segment list and double-click on .init_array to navigate there. We find the following list:

.init_array:0000000002ADA620 ; ELF Initialization Function Table
.init_array:0000000002ADA620 ; ===========================================================================
.init_array:0000000002ADA620
.init_array:0000000002ADA620 ; Segment type: Pure data
.init_array:0000000002ADA620                 AREA .init_array, DATA, ALIGN=3
.init_array:0000000002ADA620                 ; ORG 0x2ADA620
.init_array:0000000002ADA620 off_2ADA620     DCQ sub_B4F89C          ; DATA XREF: LOAD:off_88↑o
.init_array:0000000002ADA620                                         ; sub_1078EC4:loc_10790A0↑o ...
.init_array:0000000002ADA628                 DCQ sub_B4FC64
.init_array:0000000002ADA630                 DCQ sub_B4FD18
.init_array:0000000002ADA638                 DCQ sub_B4FD34
.init_array:0000000002ADA640                 DCQ sub_B50394
.init_array:0000000002ADA648                 DCQ sub_B504C0
.init_array:0000000002ADA650                 DCQ sub_B505B0
.init_array:0000000002ADA658                 DCQ sub_B50624
.init_array:0000000002ADA660                 DCQ sub_B50780
.init_array:0000000002ADA660 ; .init_array   ends
.init_array:0000000002ADA660
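
The table above is simply an array of 8-byte little-endian function pointers, so it can also be decoded outside the disassembler. The sketch below rebuilds the nine entries from the listing by hand and walks them from the end. It assumes the section bytes already hold the final addresses; in a real file they may only be filled in by relocations at load time, which IDA applies for you. The helper name decode_init_array is hypothetical.

```python
import struct


def decode_init_array(section: bytes, section_address: int):
    """Return (slot_address, function_address) for each 8-byte entry."""
    entries = []
    for index, (target,) in enumerate(struct.iter_unpack("<Q", section)):
        entries.append((section_address + index * 8, target))
    return entries


# The nine function addresses from the listing, packed as raw little-endian bytes.
targets = [0xB4F89C, 0xB4FC64, 0xB4FD18, 0xB4FD34, 0xB50394,
           0xB504C0, 0xB505B0, 0xB50624, 0xB50780]
section = struct.pack("<9Q", *targets)

entries = decode_init_array(section, 0x2ADA620)

# Visit the candidates starting from the end of the table.
for slot, target in reversed(entries):
    print(f"{slot:#x} -> sub_{target:X}")
```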

Due to the way code is compiled, it’s often best to start the search from the end of the list, though that’s not always the case. We double-click on each function in turn, starting from the end, looking for something that resembles either the init hook or Il2CppCodeGenRegistration() itself.

Most of the functions contain calls to __cxa_atexit, which registers a function to be called when the library is unloaded from memory. We can immediately discard all of these, along with anything else that calls internal compiler-related functions, whose names typically start with __cxa, __gxx and so on. You will soon learn to recognize these from experience.

The function at sub_B4FD18 looks interesting:

.text:0000000000B4FD18 ; __unwind {
.text:0000000000B4FD18                 ADRP            X0, #[email protected]
.text:0000000000B4FD1C                 ADRP            X1, #sub_D1DB7C@PAGE
.text:0000000000B4FD20                 ADD             X0, X0, #[email protected]
.text:0000000000B4FD24                 ADD             X1, X1, #sub_D1DB7C@PAGEOFF
.text:0000000000B4FD28                 MOV             X2, XZR
.text:0000000000B4FD2C                 MOV             W3, WZR
.text:0000000000B4FD30                 B               loc_D67FC8
.text:0000000000B4FD30 ; } // starts at B4FD18

If you actually click through every function in the init table you’ll see that this one both looks vastly different to all the others, and is much shorter. Furthermore, its entire behaviour is to load four arguments and jump to another function: an unknown struct pointer in X0, a function pointer in X1, and zeroes in X2 and X3. IDA helps us here by defining function names with the sub_ prefix, so we can easily see that X1 is a function pointer.
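
It may help to see how an ADRP/ADD pair produces an address, since this pattern appears throughout ARM64 code: ADRP yields the 4 KiB page of the target relative to the instruction's own page, and ADD supplies the offset within that page. The sketch below decodes both instructions. The two instruction words are hand-encoded for illustration so that they resolve to the function at 0xD1DB7C examined next; they are an assumption, not bytes copied from the binary, and the helper names are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def decode_adrp(insn: int, pc: int) -> int:
    """Return the page address computed by an ADRP instruction located at pc."""
    if insn & 0x9F000000 != 0x90000000:
        raise ValueError("not an ADRP instruction")
    immlo = (insn >> 29) & 0x3        # low 2 bits of the page delta
    immhi = (insn >> 5) & 0x7FFFF     # high 19 bits of the page delta
    page_delta = sign_extend((immhi << 2) | immlo, 21)
    return (pc & ~0xFFF) + (page_delta << 12)


def decode_add_imm(insn: int) -> int:
    """Return the immediate of a 64-bit ADD (immediate) instruction."""
    if insn & 0xFF800000 != 0x91000000:
        raise ValueError("not a 64-bit ADD (immediate) instruction")
    imm12 = (insn >> 10) & 0xFFF
    shift = 12 if (insn >> 22) & 1 else 0
    return imm12 << shift


# Hand-encoded illustration values (assumed, not read from the binary):
ADRP_X1 = 0xD0000E61   # ADRP X1, <page of target>, placed at 0xB4FD1C
ADD_X1 = 0x912DF021    # ADD  X1, X1, #0xB7C

target = decode_adrp(ADRP_X1, 0xB4FD1C) + decode_add_imm(ADD_X1)
print(f"X1 = {target:#x}")
```

The @PAGE and @PAGEOFF suffixes in the listing are IDA's way of showing these two halves of the same address.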

Recall from earlier that class methods take a this parameter as their first argument, so it’s a reasonable guess that X0 is an instance pointer. We tap N on the various labels and name them with their suspected meanings:

.text:0000000000B4FD18 code_reg_hook
.text:0000000000B4FD18 ; __unwind {
.text:0000000000B4FD18                 ADRP            X0, #this@PAGE
.text:0000000000B4FD1C                 ADRP            X1, #Il2CppCodeGenRegistration@PAGE
.text:0000000000B4FD20                 ADD             X0, X0, #this@PAGEOFF
.text:0000000000B4FD24                 ADD             X1, X1, #Il2CppCodeGenRegistration@PAGEOFF
.text:0000000000B4FD28                 MOV             X2, XZR
.text:0000000000B4FD2C                 MOV             W3, WZR
.text:0000000000B4FD30                 B               loc_D67FC8
.text:0000000000B4FD30 ; } // starts at B4FD18

Now we double-click the unconditional branch to loc_D67FC8 and see what awaits. What we find looks scary – it could be RegisterRuntimeInitializeAndCleanup – so let’s back out and double-click on Il2CppCodeGenRegistration instead to see if it does what we expect:

.text:0000000000D1DB7C Il2CppCodeGenRegistration
.text:0000000000D1DB7C ; __unwind {
.text:0000000000D1DB7C                 ADRP            X1, #off_2DB5048@PAGE
.text:0000000000D1DB80                 LDR             X1, [X1,#off_2DB5048@PAGEOFF]
.text:0000000000D1DB84                 ADRP            X0, #[email protected]
.text:0000000000D1DB88                 ADRP            X2, #[email protected]
.text:0000000000D1DB8C                 ADD             X0, X0, #[email protected]
.text:0000000000D1DB90                 ADD             X2, X2, #[email protected]
.text:0000000000D1DB94                 B               loc_D71E34

Three pointers are loaded into X0-X2 and the code branches to another function. Referring back to the C definition of Il2CppCodeGenRegistration, we see that this is exactly what it does, jumping to il2cpp_codegen_register. So we have probably found our metadata! We name the addresses once again, being careful to use the order matching the signature of il2cpp_codegen_register:

.text:0000000000D1DB7C Il2CppCodeGenRegistration
.text:0000000000D1DB7C ; __unwind {
.text:0000000000D1DB7C                 ADRP            X1, #g_MetadataRegistration@PAGE
.text:0000000000D1DB80                 LDR             X1, [X1,#g_MetadataRegistration@PAGEOFF]
.text:0000000000D1DB84                 ADRP            X0, #g_CodeRegistration@PAGE
.text:0000000000D1DB88                 ADRP            X2, #[email protected]
.text:0000000000D1DB8C                 ADD             X0, X0, #g_CodeRegistration@PAGEOFF
.text:0000000000D1DB90                 ADD             X2, X2, #[email protected]
.text:0000000000D1DB94                 B               il2cpp_codegen_register

We double-click on g_MetadataRegistration to find a slight hiccup:

.got:0000000002DB5038 off_2DB5038     DCQ qword_301CD18       ; DATA XREF: sub_15B58B0+E4↑o
.got:0000000002DB5038                                         ; sub_15B58B0+E8↑r
.got:0000000002DB5040 off_2DB5040     DCQ qword_301CD20       ; DATA XREF: sub_12B5E64+BC↑o
.got:0000000002DB5040                                         ; sub_12B5E64+C0↑r
.got:0000000002DB5048 g_MetadataRegistration DCQ dword_2D41320
.got:0000000002DB5048                                         ; DATA XREF: Il2CppCodeGenRegistration↑o
.got:0000000002DB5048                                         ; Il2CppCodeGenRegistration+4↑r
.got:0000000002DB5050 off_2DB5050     DCQ qword_301CD28       ; DATA XREF: sub_21374A4+88↑o
.got:0000000002DB5050                                         ; sub_21374A4+8C↑r
.got:0000000002DB5058 off_2DB5058     DCQ qword_301CD30       ; DATA XREF: sub_13CA74C+7C↑o
.got:0000000002DB5058                                         ; sub_13CA74C+80↑r
.got:0000000002DB5060 off_2DB5060     DCQ qword_301CD38       ; DATA XREF: sub_148F624+128↑o
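
Each entry in this listing is an 8-byte slot in .got that holds another address, which is why the code reached it with ADRP followed by LDR (a load from memory) rather than ADRP followed by ADD. A minimal sketch of following such a slot is shown below. It rebuilds two of the slots from the listing by hand and assumes the slot contents are already resolved; in a real file they may be filled in by relocations at load time. The helper name read_qword is hypothetical.

```python
import struct


def read_qword(image: bytes, image_base: int, address: int) -> int:
    """Read the little-endian 64-bit value stored at a virtual address."""
    offset = address - image_base
    if offset < 0 or offset + 8 > len(image):
        raise ValueError(f"address {address:#x} is outside the image")
    (value,) = struct.unpack_from("<Q", image, offset)
    return value


# Two consecutive GOT slots from the listing (0x2DB5040 and 0x2DB5048),
# rebuilt by hand as raw bytes for illustration.
GOT_BASE = 0x2DB5040
got = struct.pack("<2Q", 0x301CD20, 0x2D41320)

slot = 0x2DB5048                       # the slot loaded into X1
struct_address = read_qword(got, GOT_BASE, slot)
print(f"{slot:#x} -> {struct_address:#x}")
```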

It turns out that this was not the Il2CppMetadataRegistration struct after all, but rather a pointer to it. We rename the label to pMetadataRegistration to make this clear (note the p at the start: this is a traditional naming convention, but you can use whatever naming style is easiest for you), give dword_2D41320 the name g_MetadataRegistration, then double-click on it:

.data.rel.ro:0000000002D41320 g_MetadataRegistration DCD 0x89E3
.data.rel.ro:0000000002D41324                 ALIGN 8
.data.rel.ro:0000000002D41328 off_2D41328     DCQ off_2CCDDB0
.data.rel.ro:0000000002D41330 dword_2D41330   DCD 0x1B11
.data.rel.ro:0000000002D41334                 ALIGN 8
.data.rel.ro:0000000002D41338 off_2D41338     DCQ off_2D2DDD8
.data.rel.ro:0000000002D41340                 DCB 0xBD
.data.rel.ro:0000000002D41341                 DCB 0xAC
.data.rel.ro:0000000002D41342                 DCB    0
.data.rel.ro:0000000002D41343                 DCB    0
.data.rel.ro:0000000002D41344                 DCB    0
.data.rel.ro:0000000002D41345                 DCB    0
.data.rel.ro:0000000002D41346                 DCB    0
.data.rel.ro:0000000002D41347                 DCB    0
.data.rel.ro:0000000002D41348                 DCQ unk_2327E58
.data.rel.ro:0000000002D41350                 DCQ stru_10C88.st_info
.data.rel.ro:0000000002D41358                 DCQ off_2B71F20
.data.rel.ro:0000000002D41360                 DCB 0xAB
.data.rel.ro:0000000002D41361                 DCB 0xB9
.data.rel.ro:0000000002D41362                 DCB    0
.data.rel.ro:0000000002D41363                 DCB    0
.data.rel.ro:0000000002D41364                 DCB    0

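Bingo, we have found a data structure. The counts are 32-bit integers, but in a 64-bit binary each one is padded to 8 bytes so that the pointer following it stays aligned (hence the ALIGN 8 directives), so every field effectively occupies a quad-word (DCQ). Let’s tidy it up using the technique of tapping D from earlier and see what we get: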

.data.rel.ro:0000000002D41320 g_MetadataRegistration DCQ 0x89E3
.data.rel.ro:0000000002D41328 off_2D41328     DCQ off_2CCDDB0
.data.rel.ro:0000000002D41330 qword_2D41330   DCQ 0x1B11
.data.rel.ro:0000000002D41338 off_2D41338     DCQ off_2D2DDD8
.data.rel.ro:0000000002D41340                 DCQ 0xACBD
.data.rel.ro:0000000002D41348                 DCQ unk_2327E58
.data.rel.ro:0000000002D41350                 DCQ stru_10C88.st_info
.data.rel.ro:0000000002D41358                 DCQ off_2B71F20
.data.rel.ro:0000000002D41360                 DCQ 0xB9AB
.data.rel.ro:0000000002D41368                 DCQ unk_229CA54
.data.rel.ro:0000000002D41370                 DCQ 0x2942
.data.rel.ro:0000000002D41378                 DCQ unk_2F61B48
.data.rel.ro:0000000002D41380                 DCQ 0x2942
.data.rel.ro:0000000002D41388                 DCQ off_2F76558
.data.rel.ro:0000000002D41390                 DCQ 0xA9D9
.data.rel.ro:0000000002D41398                 DCQ off_2BF8380

This looks a lot like a list of counts and pointers, exactly as expected. There is a slight quirk where IDA has incorrectly mapped the count at 2D41350 to an address. You can fix this by clicking on the label, pressing U to undefine it, and then tapping D four times to turn it from bytes to a qword.
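
The same count/pointer layout can be read programmatically. The sketch below assumes each pair is a 32-bit count, 4 bytes of padding, and a 64-bit pointer, which matches the DCD / ALIGN 8 / DCQ pattern in the untidied listing; it deliberately does not name the individual fields, since those depend on the IL2CPP version. The first two pairs from the listing are rebuilt by hand as test data, and the helper name parse_count_pointer_pairs is hypothetical.

```python
import struct

# int32 count, 4 padding bytes, 64-bit pointer: 16 bytes per pair.
PAIR = struct.Struct("<i4xQ")


def parse_count_pointer_pairs(data: bytes):
    """Split raw struct bytes into (count, pointer) tuples."""
    pairs = []
    for offset in range(0, len(data) - PAIR.size + 1, PAIR.size):
        pairs.append(PAIR.unpack_from(data, offset))
    return pairs


# First two pairs of g_MetadataRegistration from the tidied listing,
# packed by hand as four little-endian qwords.
raw = struct.pack("<4Q", 0x89E3, 0x2CCDDB0, 0x1B11, 0x2D2DDD8)

for count, pointer in parse_count_pointer_pairs(raw):
    print(f"count={count:#x} pointer={pointer:#x}")
```

A value that looks like a pointer where a count is expected, as at 2D41350, is easy to spot in this form: counts are small, while pointers fall inside the binary's address range.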

This exact same process can be repeated to find g_CodeRegistration, giving us the two key metadata structures we were looking for.