Author Topic: Teardown: Windows 10 on ARM - x86 Emulation 2/2  (Read 432 times)

Online javajolt

Teardown: Windows 10 on ARM - x86 Emulation 2/2
« on: February 06, 2020, 05:45:37 PM »
Binary JIT Translation

Before answering this question, let’s look at the translation of LdrInitializeThunk (the very first x86 function executed in an x86 process):

// Original x86 function
ntdll_775d0000!EXP+#LdrInitializeThunk:
775d1760 8bff            mov     edi,edi
775d1762 55              push    ebp
775d1763 8bec            mov     ebp,esp
775d1765 5d              pop     ebp
775d1766 90              nop
775d1767 e9f4310800      jmp     ntdll_775d0000!LdrInitializeThunk (77654960)

// JIT translation
00000000`0310a168 b81fcf9d str         wfp,[x28,#-4]!       // push ebp
00000000`0310a16c 2a1c03fd mov         wfp,w28              // mov ebp,esp
00000000`0310a170 b840479d ldr         wfp,[x28],#4         // pop ebp
00000000`0310a174 11420d29 add         w9,w9,#0x83,lsl #0xC
00000000`0310a178 1107f929 add         w9,w9,#0x1FE
00000000`0310a17c 97fffe4b bl          00000000`03109aa8    // (get jump function address)
00000000`0310a180 d61f0220 br          xip1                 // jmp ntdll_775d0000!LdrInitializeThunk
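
As a quick sanity check on the listing, the two add instructions can be replayed by hand. The sketch below is only arithmetic on the values shown above; the helper names are hypothetical, and the starting value of w9 (0x775d1762, the address of push ebp) is an inference from the numbers, not something the disassembly states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: applies the two ADD immediates from the listing
 * to a 32-bit value, the way the translated code updates w9. */
static uint32_t apply_jump_adds(uint32_t w9)
{
    w9 += (uint32_t)0x83 << 12; /* add w9,w9,#0x83,lsl #0xC */
    w9 += 0x1FE;                /* add w9,w9,#0x1FE */
    return w9;
}

/* x86 side: "E9 rel32" jumps to (address of the next instruction) + rel32.
 * The instruction is 5 bytes long: the opcode plus a 4-byte displacement. */
static uint32_t x86_jmp_rel32_target(uint32_t jmp_addr, int32_t rel32)
{
    return jmp_addr + 5u + (uint32_t)rel32;
}

/* Self-check against the addresses in the listing. The base 0x775D1762 is
 * an assumption (address of push ebp), chosen because it makes both
 * computations agree on the jmp target 0x77654960. */
static void check_jump_target(void)
{
    assert(x86_jmp_rel32_target(0x775D1767u, 0x000831F4) == 0x77654960u);
    assert(apply_jump_adds(0x775D1762u) == 0x77654960u);
}
```

If that assumption holds, the add pair simply rebuilds the absolute x86 jump target in w9 before the helper at 03109aa8 is called.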

To answer the question, observe the following call flow which illustrates the process behind this translation:

BTCpuSimulate
└─ EmulateX86Function
   ├─ sub_180022c60
   │  └─ sub_180022d20
   │     │
   │     ├─ sub_1800255b8
   │     │  └─ BTXtaDeconstructOpcode
   │     │
   │     └─ sub_180023e10
   │        ├─ BTXtaCreatePushPop     // writes "str wfp,[x28,#-4]!"       (@180042e04) push ebp
   │        ├─ sub_180041fc0
   │        │  └─ sub_180043e30       // writes "mov wfp,w28"              (@180043eb4) mov ebp,esp
   │        ├─ BTXtaCreatePushPop     // writes "ldr wfp,[x28],#4"         (@180042d1c) pop ebp
   │        ├─ sub_180040070
   │        │  └─ sub_18003fd30
   │        │     └─ sub_1800473f8    // writes "add w9,w9,#0x83,lsl #0xC" (@18004756c) jmp ...
   │        │                         // writes "add w9,w9,#0x1FE"         (@18004757c) jmp contd...
   │        └─ sub_18003fd30
   │           └─ sub_180040070
   │              └─ sub_180046550    // writes "bl 00000000`03109aa8"     (@180046580) jmp end...
   └─ sub_180020928
      └─ sub_180021008
         └─ memcpy

Translation begins at BTCpuSimulate and ends at a call to memcpy in sub_180021008. A large number of functions take part in translation, with sets of functions dedicated to groups of similar x86 instructions. Together, they all perform the same general task:

1. Deconstruct an x86 instruction's bytes into a struct (BTXtaDeconstructOpcode)

2. Create an ARM instruction from the deconstructed data (BTXtaCreatePushPop)

3. Save the converted ARM instruction to a buffer (sub_180021008)

4. Repeat the above three steps for each instruction in a function

5. Save the converted function to the JIT-cache buffer
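
The five steps can be sketched as a toy loop. This is not the xtajit.dll code: every name below is hypothetical, only three x86 instructions are recognised, and the ARM64 words are copied straight from the JIT listing above instead of being assembled from bit fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types; they only mirror the five steps in the list. */
typedef enum { OP_PUSH_EBP, OP_MOV_EBP_ESP, OP_POP_EBP, OP_UNKNOWN } ToyOp;

typedef struct {
    ToyOp  op;     /* which x86 instruction was recognised */
    size_t length; /* number of x86 bytes it occupies */
} ToyInstruction;

/* Step 1: deconstruct the bytes at code[0..avail) into a struct. */
static ToyInstruction toy_decode(const uint8_t *code, size_t avail)
{
    ToyInstruction ins = { OP_UNKNOWN, 0 };
    if (avail >= 1 && code[0] == 0x55) {
        ins.op = OP_PUSH_EBP;    ins.length = 1;
    } else if (avail >= 1 && code[0] == 0x5D) {
        ins.op = OP_POP_EBP;     ins.length = 1;
    } else if (avail >= 2 && code[0] == 0x8B && code[1] == 0xEC) {
        ins.op = OP_MOV_EBP_ESP; ins.length = 2;
    }
    return ins;
}

/* Step 2: create the ARM64 word (values taken from the JIT listing). */
static uint32_t toy_emit(ToyOp op)
{
    switch (op) {
    case OP_PUSH_EBP:    return 0xB81FCF9Du; /* str wfp,[x28,#-4]! */
    case OP_MOV_EBP_ESP: return 0x2A1C03FDu; /* mov wfp,w28 */
    case OP_POP_EBP:     return 0xB840479Du; /* ldr wfp,[x28],#4 */
    default:             return 0;
    }
}

/* Steps 3-5: stage each word, advance to the next instruction, then copy
 * the staged words to the cache. Stops at the first unrecognised
 * instruction and returns the number of words copied. */
static size_t toy_translate(const uint8_t *code, size_t size,
                            uint32_t *cache, size_t cache_words)
{
    uint32_t staging[16];
    size_t count = 0, offset = 0;

    assert(code != NULL && cache != NULL);
    while (offset < size && count < 16) {
        ToyInstruction ins = toy_decode(code + offset, size - offset);
        if (ins.op == OP_UNKNOWN)
            break;
        staging[count++] = toy_emit(ins.op); /* step 3 */
        offset += ins.length;                /* step 4 */
    }
    if (count > cache_words)
        count = cache_words;
    memcpy(cache, staging, count * sizeof staging[0]); /* step 5 */
    return count;
}
```

Feeding it the bytes 55 8b ec 5d from LdrInitializeThunk should yield the first three words of the JIT listing. The real translator differs in at least one respect visible in the call flow: it dispatches through per-instruction function pointers rather than a switch.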

As I mentioned earlier, x86 to ARM translation is quite complex. xtajit.dll's relatively large size is partly due to the overall size of the x86 instruction set and the need to emulate it in an ARM-based environment. Instructions are organized into groups: push and pop, for example, share some of the same functions during translation. In the end, the resulting instructions are saved to the JIT-cache buffer.

An x86 instruction's bytes are deconstructed in the following function:

// sub_180025b80
INT BTXtaDeconstructOpcode(PXTA_INFO xtaInfo, PXTA_INSTRUCTION xtaInstruction)
{
    if (xtaInstruction->dword_12 >= 15)
        goto loc_18005523c;
    offset = xtaInfo->InstructionOffset; // From start of function (0xA4)
    address = xtaInfo->InstructionAddr; // Absolute address of instruction
    BYTE opcode = *((PBYTE)(address + offset)); // Instruction opcode
    xtaInfo->InstructionOffset += 1; // Move to next byte
    if (opcode > 0xFF)
        goto loc_180025e18;
    switch (opcode)
    {
        // push opcodes
        case 0x50:
        case 0x51:
        case 0x52:
        case 0x53:
        case 0x54:
        case 0x55: // push ebp
        case 0x56:
        case 0x57:
            x8 = xtaInstruction->dword_1D;
            x12 = opcode & 7;
            if ((xtaInstruction->dword_1D & 1) == 0) // parenthesized: == binds tighter than &
                goto loc_18002db6c;
            xtaInstruction->dword_19 = 4;
            xtaInstruction->dword_1C = 0x14;
            xtaInstruction->TransferRegister = opcode & 7;
            xtaInstruction->dword_24 |= (1 << (opcode & 7)) | 0x10;
            xtaInstruction->GenerateArm64Func = BTXtaCreatePushPop;
            xtaInstruction->dword_28 |= 0x10;
            xtaInstruction->OtherFunc = sub_1800399b0;
            xtaInstruction->dword_8 |= 1;
            break;
        // mov r32,r/m32
        case 0x8b:
// ...
            break;
    }
    return 0;
}
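
The `opcode & 7` in the push case works because one-byte push and pop opcodes carry the register number in their low three bits. A minimal sketch of that decoding follows; the function and table names are hypothetical, while the opcode ranges and register order come from the standard x86 encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* x86 32-bit general-purpose registers in encoding order. */
static const char *const kX86Reg32[8] = {
    "eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"
};

/* Returns the register named by a one-byte push (0x50-0x57) or
 * pop (0x58-0x5F) opcode, or NULL for any other opcode. */
static const char *push_pop_register(uint8_t opcode)
{
    if (opcode < 0x50 || opcode > 0x5F)
        return NULL;
    return kX86Reg32[opcode & 7];
}

/* Bit 3 separates the two groups: 0x50-0x57 push, 0x58-0x5F pop. */
static int is_push(uint8_t opcode)
{
    return (opcode & 0xF8) == 0x50;
}

/* Self-check using the two opcodes from LdrInitializeThunk. */
static void check_push_pop_decode(void)
{
    assert(strcmp(push_pop_register(0x55), "ebp") == 0 && is_push(0x55));
    assert(strcmp(push_pop_register(0x5D), "ebp") == 0 && !is_push(0x5D));
    assert(push_pop_register(0x8B) == NULL); /* mov r32,r/m32 is not push/pop */
}
```

So for 0x55 the TransferRegister field ends up as 5, which is ebp.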

The matching ARM64 instruction is then generated by the function stored in GenerateArm64Func, here BTXtaCreatePushPop:

#define XTAR_X28       28
#define XTAI_PUSH_POP  0xb81fcc00
// sub_180042bb0
VOID BTXtaCreatePushPop(PXTA_DATA xtaData, PXTA_INSTRUCTION xtaInstruction,
    PJIT_CACHE_FUNCTION jitCacheFunc)
{
// ...
    if (/* do some checks... */)
    {
// ...
        // loc_180042de8
        jitCacheFunc->Buffer[0] = XTAI_PUSH_POP | (XTAR_X28 << 5) |
            gXtaContextRegs[xtaInstruction->TransferRegister + 0x6e];
    }
// ...
}

The final bytes are then written to a buffer. This buffer is later copied to the JIT-cache buffer with memcpy (see call graph above) and finally executed.
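
The bit packing in BTXtaCreatePushPop can be checked against the bytes in the JIT listing. In the sketch below, XTAR_X28 and XTAI_PUSH_POP are the constants from the function above; the pop template, the register number 29 for wfp and the helper names are my own assumptions, derived from the A64 encoding of STR/LDR (immediate) rather than from xtajit.dll.

```c
#include <assert.h>
#include <stdint.h>

#define XTAR_X28        28u         /* from the listing: w28 plays the role of esp */
#define XTAI_PUSH_POP   0xB81FCC00u /* from the listing: str wN,[xM,#-4]! template */
#define A64_LDR_POST_4  0xB8404400u /* assumption: ldr wN,[xM],#4 template */
#define A64_WFP         29u         /* assumption: table lookup for ebp yields w29 (wfp) */

/* push: pre-indexed store, Rn in bits 9:5, Rt in bits 4:0. */
static uint32_t encode_push(uint32_t rt)
{
    assert(rt < 32);
    return XTAI_PUSH_POP | (XTAR_X28 << 5) | rt;
}

/* pop: post-indexed load with the same register layout. */
static uint32_t encode_pop(uint32_t rt)
{
    assert(rt < 32);
    return A64_LDR_POST_4 | (XTAR_X28 << 5) | rt;
}

/* Self-check against the instruction bytes shown in the JIT listing. */
static void check_push_pop_encoding(void)
{
    /* The template is the 32-bit pre-index STR opcode plus imm9 = -4. */
    assert((0xB8000C00u | (((uint32_t)-4 & 0x1FFu) << 12)) == XTAI_PUSH_POP);
    assert(encode_push(A64_WFP) == 0xB81FCF9Du); /* str wfp,[x28,#-4]! */
    assert(encode_pop(A64_WFP)  == 0xB840479Du); /* ldr wfp,[x28],#4 */
}
```

Both results match the listing, which suggests the template constant already bakes in the 4-byte stack adjustment and only the two register fields vary per instruction.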

So, we finally witness the actual translation, and very little of what’s involved is surprising. Given that the constituent parts of an instruction are organized into bytes and bit fields, translation simply becomes a matter of deconstructing and reconstructing the parts for a different architecture. But recall the call flow once more:

BTCpuSimulate
└─ EmulateX86Function
   ├─ sub_180022c60
   │  └─ sub_180022d20
   │     │
   │     ├─ sub_1800255b8
   │     │  └─ BTXtaDeconstructOpcode
   │     │
   │     └─ sub_180023e10
   │        ├─ BTXtaCreatePushPop     // writes "str wfp,[x28,#-4]!"       (@180042e04) push ebp
   │        ├─ sub_180041fc0
   │        │  └─ sub_180043e30       // writes "mov wfp,w28"              (@180043eb4) mov ebp,esp
   │        ├─ BTXtaCreatePushPop     // writes "ldr wfp,[x28],#4"         (@180042d1c) pop ebp
   │        ├─ sub_180040070
   │        │  └─ sub_18003fd30
   │        │     └─ sub_1800473f8    // writes "add w9,w9,#0x83,lsl #0xC" (@18004756c) jmp ...
   │        │                         // writes "add w9,w9,#0x1FE"         (@18004757c) jmp contd...
   │        └─ sub_18003fd30
   │           └─ sub_180040070
   │              └─ sub_180046550    // writes "bl 00000000`03109aa8"     (@180046580) jmp end...
   └─ sub_180020928
      └─ sub_180021008
         └─ memcpy

This entire procedure is carried out for just four instructions (push ebp; mov ebp, esp; pop ebp; jmp). Imagine the overhead if it had to be performed for a full x86 binary.

Which brings us back to the use of XTA cache files: how are they created in the first place? This is done through independent background caching and compilation with the help of the caching service.

XTA Caching Service (XtaCache)

XtaCache is fairly small. Its purpose is to listen for module load notifications and to start the compiler if needed. Notifications are sent by the emulator (xtajit.dll) when a module is loaded. If necessary, the service will create a new XTA cache file in C:\Windows\XtaCache and start the compiler (xtac.exe). The compiler then performs a partial or full translation of the module's executable code. Once finished, it notifies the service, whereupon XtaCache saves the changes to the XTA cache file and closes all open file handles.

xtajit → XtaCache

Communication between xtajit and XtaCache is achieved using NtAlpcSendWaitReceivePort. Below is the call flow showing inter-process communication between the emulator and caching service:

BTCpuNotifyMapViewOfSection → sub_180014A80 → sub_180015AB0 → sub_180015C28 → sub_180015D80 → sub_180019010 → NtAlpcSendWaitReceivePort

BTCpuNotifyMapViewOfSection is called every time a module is loaded, since every module load goes through NtMapViewOfSection. Eventually, it passes a module file handle to NtAlpcSendWaitReceivePort, which sends the message to the caching service, XtaCache.

XtaCache → xtac

Once XtaCache receives a module file handle from the emulation module, it must determine whether or not the x86 module file needs to be compiled and written to a cache file. The following is the call flow leading to the execution of the compiler (xtac.exe):

1. NtAlpcSendWaitReceivePort receives the module file handle from xtajit.dll

2. NtCreateFile opens the module, and a SHA256 hash is created from the filename and PE header data

3. TpAllocWork creates the callback function which will start the compiler (xtac.exe)

4. The callback function formats the XTA cache file name and creates an XTA cache file (NtCreateFile)

5. NtCreateSection creates a section handle to be passed to the compiler on its command line

6. CreateProcessAsUserW starts the compiler with the command line: xtac.exe -p [section handle]

The following function is responsible for receiving and responding to an ALPC message:

NTSTATUS ProcessIncomingAlpcMessage(PALPC_CONTEXT pAlpcContext,
    PPORT_MESSAGE msg, PALPC_MESSAGE_ATTRIBUTES msgAttr)
{
/*...*/    
    if (msg->TotalLength >= 80 && msg->MsgType == 2)
    {
        if ((msgAttr->ValidAttributes & 0x10000000) != 0)
        {
            xtaCacheMsgAttr =
                (PXTACACHE_MESSAGE_ATTRIBUTES)AlpcGetMessageAttribute(
                    msgAttr, 0x10000000);
            if (xtaCacheMsgAttr->MsgType == 0xA1)
            {
                /* This will create the XTA cache file and start the compiler.
                 * hModuleNew is assigned a duplicate handle of hModule. */
                result = PrepareJitCacheFileCompilation(
                    pAlpcContext->qword_0, unknown_0->qword_28,
                    xtaCacheMsgAttr->hModule, unknown_2, &hModuleNew);
                if (NT_SUCCESS(result))
                {
/*...*/
                    if (hModuleNew != 0)
                    {
                        xtaCacheMsgAttr =
                            (PXTACACHE_MESSAGE_ATTRIBUTES)AlpcGetMessageAttribute(
                                &newMsgAttr, 0x10000000);
                        xtaCacheMsgAttr->hModule = hModuleNew;
                        xtaCacheMsgAttr->MsgType = 12;
                    }
                    xtaCacheMsg->Result = STATUS_SUCCESS;
                }
            }
        }
    }
/*...*/    
    // Send response to xtajit emulator
    result = NtAlpcSendWaitReceivePort(
        pAlpcContext->hAlpcPort, 0x10000, msg, newMsgAttr, NULL, NULL, NULL, 0);
        
    // Close the module file handle
    xtaCacheMsgAttr = (PXTACACHE_MESSAGE_ATTRIBUTES)AlpcGetMessageAttribute(
        msgAttr, 0x10000000);
    NtClose(xtaCacheMsgAttr->hModule);
    
    return result;
}
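
The nested conditions at the top of ProcessIncomingAlpcMessage amount to a small filter on incoming messages. The sketch below restates that filter with flattened, hypothetical structs: in the real code the 0xA1 type lives in an attribute fetched with AlpcGetMessageAttribute, and none of the Toy names exist in XtaCache.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HANDLE_ATTRIBUTE 0x10000000u /* attribute flag tested in the listing */
#define TOY_MSG_COMPILE      0xA1u       /* attribute MsgType tested in the listing */

/* Hypothetical, flattened stand-ins for the port message and its attributes. */
typedef struct {
    uint16_t TotalLength;
    uint16_t MsgType;
} ToyPortMessage;

typedef struct {
    uint32_t ValidAttributes;
    uint32_t AttrMsgType; /* stands in for xtaCacheMsgAttr->MsgType */
} ToyMessageAttributes;

/* Returns 1 when the message would reach PrepareJitCacheFileCompilation. */
static int should_prepare_compilation(const ToyPortMessage *msg,
                                      const ToyMessageAttributes *attr)
{
    if (msg->TotalLength < 80 || msg->MsgType != 2)
        return 0;
    /* Parentheses matter: without them, != would bind before &. */
    if ((attr->ValidAttributes & TOY_HANDLE_ATTRIBUTE) == 0)
        return 0;
    return attr->AttrMsgType == TOY_MSG_COMPILE;
}

/* Self-check: one accepted message and two rejected variants. */
static void check_message_filter(void)
{
    ToyPortMessage msg = { 80, 2 };
    ToyMessageAttributes attr = { TOY_HANDLE_ATTRIBUTE, TOY_MSG_COMPILE };
    assert(should_prepare_compilation(&msg, &attr) == 1);

    attr.ValidAttributes = 0; /* attribute flag missing */
    assert(should_prepare_compilation(&msg, &attr) == 0);

    attr.ValidAttributes = TOY_HANDLE_ATTRIBUTE;
    msg.TotalLength = 79;     /* message too short */
    assert(should_prepare_compilation(&msg, &attr) == 0);
}
```

Anything that fails the filter skips compilation but, judging by the listing, still gets a reply, since the send at the end of the function is unconditional.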

Eventually, a response is sent to the emulator (xtajit.dll) notifying it of the result, including any errors or anomalies in the message and message attributes.

Conclusion

It is worth noting that x86 emulation on ARM is a relatively new feature and is still in its infancy. Microsoft is one of the first to incorporate such a feature in its latest operating system. Supporting x86 emulation on ARM is a topic that is of increasing concern, as ARM processors become more and more ubiquitous and the concomitant demand for x86 emulation grows.

As traditionally x86- and x64-based operating systems migrate to ARM-based architectures, however, the drawbacks of emulation become more noticeable. Whereas x86 on x64 or ARM32 on ARM64 is largely a matter of switching contexts, x86 on ARM requires full-on CPU simulation in user mode, all in a single, infinite loop. The overhead incurred during such a process could be immense, depending on the size of the executing binary and the availability of cache files.

Free disk space could also become a concern, since the cache file directory grows as more x86 binaries are executed over time. Much remains to be seen as to what direction Microsoft and other software publishers will take with regard to x86 emulation. No doubt there are plenty of opportunities for innovation and improvement upon current techniques, and plenty to watch out for in future updates.

Resources

Below is a list of the materials used during research:

• Raspberry Pi 3 Model B+ (http://www.raspberrypi.org/products/raspberry-pi-3-model-b/)

• WOA Deployer for Raspberry Pi 3 (http://github.com/WOA-Project/WOA-Deployer-Rpi)

• Debugging Tools for Windows (http://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools)

• IDA Pro 7.2 (http://www.hex-rays.com/products/ida/support/download.shtml)

• PuTTY 0.71 and PSFTP (http://www.chiark.greenend.org.uk/~sgtatham/putty/)

• Sysinternals Suite for ARM64 (http://live.sysinternals.com/ARM64/)

• WoW64 internals ...re-discovering Heaven's Gate on ARM (http://wbenny.github.io/2018/11/04/wow64-internals.html)

• ARM Instruction Set Reference Guide (http://static.docs.arm.com/100076/0100/arm_instruction_set_reference_guide_100076_0100_00_en.pdf)

• A64 Instruction Set Architecture (http://static.docs.arm.com/ddi0596/a/DDI_0596_ARM_a64_instruction_set_architecture.pdf)

« Last Edit: February 06, 2020, 07:25:13 PM by javajolt »