Author Topic: Teardown: Windows 10 on ARM - x86 Emulation 1/2  (Read 476 times)

Offline javajolt

Teardown: Windows 10 on ARM - x86 Emulation 1/2
« on: February 06, 2020, 07:24:25 PM »
Upon reading the title of this article, one might pose the initial question: what would an ARM-based operating system do with an x86 instruction? Or a chunk of x86 instructions? Or an entire x86 binary? Windows 10 on ARM, for example, handles them by taking a set of x86 instructions such as the ones below:

push    ebp
mov     ebp,esp
pop     ebp
nop
jmp     ntdll_775d0000!LdrInitializeThunk

And translating them into the following ARM64 code:

str         wfp,[x28,#-4]!       // push ebp
mov         wfp,w28              // mov ebp,esp
ldr         wfp,[x28],#4         // pop ebp
add         w9,w9,#0x83,lsl #0xC
add         w9,w9,#0x1FE
bl          00000000`03109aa8    // (get jump function address)
br          xip1                 // jmp ntdll_775d0000!LdrInitializeThunk
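
The comments in the ARM listing suggest that w28 plays the role of esp and wfp the role of ebp. Under that assumption, the pre-indexed store and the post-indexed load behave like push and pop. The C sketch below models only that behaviour; the GuestState type and the helper names are hypothetical and are not taken from xtajit.dll.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical guest state: a small byte array stands in for the x86 stack. */
typedef struct {
    uint32_t esp;      /* held in w28 in the listing above (assumption) */
    uint32_t ebp;      /* held in wfp in the listing above (assumption) */
    uint8_t  mem[64];
} GuestState;

/* Models "str wfp,[x28,#-4]!": pre-index, so the base is decremented
 * by 4 first and the store then uses the updated address. */
void guest_push(GuestState *g, uint32_t value)
{
    assert(g->esp >= 4 && g->esp <= sizeof g->mem);
    g->esp -= 4;
    memcpy(&g->mem[g->esp], &value, sizeof value);
}

/* Models "ldr wfp,[x28],#4": post-index, so the load uses the current
 * base address and the base is incremented by 4 afterwards. */
uint32_t guest_pop(GuestState *g)
{
    uint32_t value;
    assert(g->esp + 4 <= sizeof g->mem);
    memcpy(&value, &g->mem[g->esp], sizeof value);
    g->esp += 4;
    return value;
}

/* push ebp ; mov ebp,esp ; pop ebp */
void run_stub_prologue(GuestState *g)
{
    guest_push(g, g->ebp);
    g->ebp = g->esp;
    g->ebp = guest_pop(g);
}
```

After run_stub_prologue returns, esp and ebp hold their original values again, which is why the stub leaves the x86 register state unchanged before the final jump.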

First off, ARM and x86 are completely different architectures. An ARM processor is incapable of executing x86 code and the hardware provides no means to do so. This leaves the task up to software developers to facilitate it themselves.

Microsoft's recent version of Windows 10 for ARM-based processors assumes such a task, by simulating an x86 processor entirely in userland. An emulator module (xtajit.dll) employs a form of just-in-time (JIT) translation to convert x86 code to ARM (shown above) within a loop, as the x86 process is executing. On each pass, a chunk of x86 code is translated to ARM, and the translation is executed.

All of this, as you might have guessed, can make running x86 programs comparatively slow. However, a cache of already-translated code (located in C:\Windows\XtaCache) eliminates much of the overhead. A compiler (xtac.exe) and a background caching service (XtaCache) handle full binary translation and caching. Hybrid binaries (located in C:\Windows\SyChpe32) containing x86-to-ARM stubs also help to reduce overhead.

In this article, I present what I believe are five key features of x86 emulation, concluding with an example of the raw opcode translation procedure.

CHPE DLLs

A peculiar system directory exists on Windows 10 for ARM. Located at C:\Windows\SyChpe32, this folder holds a small set of DLL files with the same names as the most frequently used libraries on Windows: ntdll.dll, kernel32.dll, advapi32.dll, etc. Several interesting characteristics are observable in these types of files, including a mixture of x86 and ARM functions and the presence of a strange section name in the header.

.hexpthk

When we first begin investigating a CHPE file (ntdll.dll), we discover a new section type in the header:

000001e0: 00 00 00 00 00 00 00 00 2e 68 65 78 70 74 68 6b .........hexpthk
000001f0: e0 53 00 00 00 10 00 00 00 54 00 00 00 04 00 00 .S.......T......
00000200: 00 00 00 00 00 00 00 00 00 00 00 00 20 00 00 60 ............ ..`
00000210: 2e 74 65 78 74 00 00 00 c2 5d 17 00 00 70 00 00 .text....]...p..

.hexpthk possibly stands for Hybrid Executable Push Thunk. The primary purpose of the 1300+ functions in this section is to provide a set of x86 jump stubs for the library's native ARM APIs. This eliminates the need for JIT translation or XTA cache file access, thus reducing a bit of overhead.
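
The 40 bytes starting at file offset 0x1e8 in the dump can be read as a standard PE section header. The sketch below decodes them by hand so that it does not depend on windows.h; the SectionInfo type and the function names are my own and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The fields of a 40-byte PE section header that matter here. */
typedef struct {
    char     name[9];             /* 8 bytes in the file plus a terminator */
    uint32_t virtual_size;        /* offset 8  */
    uint32_t virtual_address;     /* offset 12 */
    uint32_t size_of_raw_data;    /* offset 16 */
    uint32_t pointer_to_raw_data; /* offset 20 */
    uint32_t characteristics;     /* offset 36 */
} SectionInfo;

/* Reads a little-endian 32-bit value from a byte buffer. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

SectionInfo parse_section_header(const uint8_t *raw)
{
    SectionInfo s;
    assert(raw != NULL);
    memcpy(s.name, raw, 8);
    s.name[8] = '\0';
    s.virtual_size        = read_le32(raw + 8);
    s.virtual_address     = read_le32(raw + 12);
    s.size_of_raw_data    = read_le32(raw + 16);
    s.pointer_to_raw_data = read_le32(raw + 20);
    s.characteristics     = read_le32(raw + 36);
    return s;
}

/* Bytes 0x1e8..0x20f of the CHPE ntdll.dll, copied from the dump above. */
const uint8_t hexpthk_header[40] = {
    0x2e, 0x68, 0x65, 0x78, 0x70, 0x74, 0x68, 0x6b,
    0xe0, 0x53, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00,
    0x00, 0x54, 0x00, 0x00, 0x00, 0x04, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x20, 0x00, 0x00, 0x60,
};
```

Decoded this way, the section starts at RVA 0x1000 with a virtual size of 0x53e0, and its characteristics value 0x60000020 marks it as readable, executable code.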

One of the stubs in this section:

ntdll_773e0000!EXP+#RtlCreateUserProcess:
773e2520 8bff            mov     edi,edi
773e2522 55              push    ebp
773e2523 8bec            mov     ebp,esp
773e2525 5d              pop     ebp
773e2526 90              nop
773e2527 e9a4620d00      jmp     ntdll_773e0000!#RtlCreateUserProcess (774b87d0)

Which jumps to:

ntdll_773e0000!#RtlCreateUserProcess:
00000000`774b87d0 29ba7bfd stp         wfp,wlr,[sp,#-0x30]!
00000000`774b87d4 910003fd mov         fp,sp
00000000`774b87d8 52800028 mov         w8,#1
...
00000000`774b8810 94000004 bl          ntdll_773e0000!#RtlCreateUserProcessEx (00000000`774b8820)
00000000`774b8814 28c67bfd ldp         wfp,wlr,[sp],#0x30
00000000`774b8818 d65f03c0 ret
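
The jump at 773e2527 is an ordinary x86 near jump (opcode e9 followed by a 32-bit displacement), so its target can be checked by hand. A minimal sketch, with a function name of my own choosing:

```c
#include <assert.h>
#include <stdint.h>

/* Computes the target of an x86 near jump: opcode E9 followed by a
 * little-endian 32-bit displacement. The displacement is relative to
 * the address of the next instruction, i.e. insn_addr + 5. */
uint32_t jmp_rel32_target(uint32_t insn_addr, const uint8_t insn[5])
{
    assert(insn[0] == 0xE9);
    uint32_t disp = (uint32_t)insn[1] | ((uint32_t)insn[2] << 8) |
                    ((uint32_t)insn[3] << 16) | ((uint32_t)insn[4] << 24);
    /* Unsigned wrap-around gives the same result as a signed addition. */
    return insn_addr + 5u + disp;
}
```

Feeding it the bytes e9 a4 62 0d 00 at address 0x773e2527 gives 0x773e252c + 0x000d62a4 = 0x774b87d0, which is the address of the native #RtlCreateUserProcess shown in the second listing.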

push_thunk and pop_thunk

This leads us to another collection of special CHPE functions (in the .text section): the push_thunk and the pop_thunk. These attempt to fetch the translated function from the JIT cache buffer or XTA cache file. Ntdll.dll contains a total of 282 push_thunk and 1829 pop_thunk functions. Below is a disassembly snippet of #NtAllocateVirtualMemory$push_thunk:

ntdll_773e0000!#NtAllocateVirtualMemory$push_thunk:
00000000`77525630 a9bb53f3 stp         x19,x20,[sp,#-0x50]!
...
00000000`77525674 900002aa adrp        x10,ntdll_773e0000!_os_wowa64_dispatch_call (00000000`77579000)
00000000`77525678 b9400150 ldr         wip0,[x10]
00000000`7752567c b0000188 adrp        x8,ntdll_773e0000!NtAccessCheck (00000000`77556000)
00000000`77525680 11060109 add         w9,w8,#0x180
00000000`77525684 d63f0200 blr         xip0
00000000`77525688 d503201f nop
00000000`7752568c 37f80085 tbnz        x5,#0x1F,ntdll_773e0000!#NtAllocateVirtualMemory$push_thunk+0x6c (00000000`7752569c)
00000000`77525690 900002b0 adrp        xip0,ntdll_773e0000!_os_wowa64_dispatch_call (00000000`77579000)
00000000`77525694 b9402210 ldr         wip0,[xip0,#0x20]
00000000`77525698 d63f0200 blr         xip0
00000000`7752569c b9400ba8 ldr         w8,[fp,#8]
...
00000000`775256c4 d65f03c0 ret

XTA cache (.jc) files

We notice a number of XTA cache (.jc) files in the C:\Windows\XtaCache directory whose filenames correspond to several common DLLs:

ACWINRT.DLL.8CB3A2AB47A53C8A2B0154CD9DCFBAB3.323EB21400CAFAA0582C5009A75869C1.mp.1.jc
APPHELP.DLL.95D0BAD69AB6222384AF242317B56149.DF78C05157184445FAFBB6CE538964A5.mp.1.jc
BCRYPTPRIMITIVES.DLL.013B0F82E47EF5A5FCF5A8526296185B.EF8F27FC52B74ACB9AC3BC89FA928CE1.mp.1.jc
CFGMGR32.DLL.5D4A7278FAF3DBBDBF2DDB915FCAFAB5.F945F9A53B0D60BD9E78255821B9CF3C.mp.1.jc
CMD.EXE.82E4EB063821519C52FE2943A99BF400.1D875ED0DCA07279C17651C0E6C07D68.mp.1.jc
COMBASE.DLL.4F0AC5980CD35BD10454EEE975995AC5.27FF208B6378EEC51D8E744BAF092F94.mp.1.jc
CRYPT32.DLL.C5DAC220D689F508F0E17E2A66CA3F1A.320EA93B91430A11C07BCF035279C389.mp.1.jc
CRYPTBASE.DLL.16CA22D6D04B8B1996C66338057F9FBF.385D1574AF589DD7DBE99097ADFA13A7.mp.1.jc
DEVOBJ.DLL.C27E6011E5F5F63CCBE4C1EDDE39AB82.8F5C03EA8E04FE31C38209807883CEFF.mp.1.jc
KERNEL32.DLL.834794B203AF374A9362DB4ED6A773E1.DDEEC7202859318755E42D071DBE493B.mp.1.jc
KERNELBASE.DLL.FD274E0B2908F4427DD5E105D6793B46.DB966B70C90268F5B3A22AF2FFD62FB9.mp.1.jc
MMDEVAPI.DLL.4095F18EA4B2E4980A31F9CFF10ACB18.F3FFB7D241CFAF41DBF7A8EB1FA2F14C.mp.1.jc
MSASN1.DLL.592DF648596867992875D9F422985BE2.C91F12FC8B8289618C00A6A0B4CB63F6.mp.1.jc
MSVCP_WIN.DLL.7A9A3D8939BA215908F2D277F6036868.059202EDB6FBBF9C5D3CEA601F657783.mp.1.jc
MSVCRT.DLL.D751099CF1900AB3B0B21169E3ACAEA1.D32B8BAE88BDFA4E405C93DE74A58C58.mp.1.jc
NTDLL.DLL.25E64FDB4C4531A2DA3649AF45082708.A8B2D871AD511568138A61A746F3477E.mp.1.jc
SECHOST.DLL.B4519BA08D878093477A68086F2C73E9.6B8F8A4E1C74BF41ECB65F5BC09C99AE.mp.1.jc
SSPICLI.DLL.8A3767BF366F5B126EEE0A97F09F3821.177E02DDE1B9F225A2E156445C8C26C6.mp.1.jc
UCRTBASE.DLL.FDEE79B96ED912C4BECA94A2153A8DA6.2B41C9D2592D756F6DA93B24EBBAB8BB.mp.1.jc
WINBRAND.DLL.0B7D029DE53AA3517BAAE21FFF2958A2.080F7128AEA24497BE12FC799A471BCA.mp.1.jc
WINTRUST.DLL.36F3317E5E83FB1D21A49E8598B82079.1A4137576B1F56F2356651AA77E6C162.mp.1.jc
WLDP.DLL.52F5CBD679E57BDD90519D96701B1362.42321841F03DF5F906568D1916ABD9FC.mp.1.jc

Each .jc file begins with a header. Here is the one from the NTDLL.DLL cache file, mapped at 07640000:

0:000:x86> db 07640000
07640000  58 54 41 43 13 00 00 00-00 00 00 00 48 0f 01 00  XTAC........H...
07640010  1f 01 00 00 38 00 00 00-12 00 00 00 40 18 01 00  ....8.......@...
07640020  64 00 00 00 50 00 00 00-50 00 00 00 50 a1 00 00  d...P...P...P...
07640030  d4 a1 00 00 ec 0e 01 00-4e 00 54 00 44 00 4c 00  ........N.T.D.L.
07640040  4c 00 2e 00 44 00 4c 00-4c 00 00 00 00 00 00 00  L...D.L.L.......
07640050  42 4c 43 4b e4 0e 01 00-00 00 00 00 00 00 00 00  BLCK............
07640060  bf 39 03 d5 c0 03 5f d6-00 00 40 79 fd ff ff 17  .9...._...@y....
07640070  20 00 40 79 fb ff ff 17-c0 00 40 79 f9 ff ff 17   .@y......@y....

.jc header format

Below is the format for the cache file header:

XTAC header
0x00    DWORD    'XTAC'
0x04    DWORD    Always 0x13
0x08    DWORD    BOOL? Most files set to 0, some are 1
0x0c    DWORD    Offset of JC address pair table
0x10    DWORD    Number of JC address pairs (two DWORDs = 8 bytes per pair)
0x14    DWORD    Offset of the module name
0x18    DWORD    Length of module name (in bytes)
0x1c    DWORD    Offset of module NT pathname (usually at the very end of file)
0x20    DWORD    Length of module NT pathname (in bytes)
0x24    DWORD    Offset of BLCK stubs
0x28    DWORD    ?
0x2c    DWORD    Size of BLCK stubs (always 0xa150)
0x30    DWORD    ?
0x34    DWORD    ?
0x38    WCHAR[]  usually start of the module name
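
As a sanity check of the layout above, here is a small sketch that decodes the documented fields from the first 0x38 bytes of the NTDLL.DLL dump. The XtacHeader type and the function names are hypothetical, and the unknown DWORDs at 0x28, 0x30 and 0x34 are skipped.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Only the fields whose meaning is documented in the table above. */
typedef struct {
    uint32_t version;            /* 0x04 */
    uint32_t flag;               /* 0x08 */
    uint32_t pair_table_offset;  /* 0x0c */
    uint32_t pair_count;         /* 0x10 */
    uint32_t name_offset;        /* 0x14 */
    uint32_t name_length;        /* 0x18, in bytes */
    uint32_t path_offset;        /* 0x1c */
    uint32_t path_length;        /* 0x20, in bytes */
    uint32_t blck_offset;        /* 0x24 */
    uint32_t blck_size;          /* 0x2c */
} XtacHeader;

/* Reads a little-endian 32-bit value from a byte buffer. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 on success, 0 if the buffer is too short or lacks the magic. */
int parse_xtac_header(const uint8_t *buf, size_t len, XtacHeader *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 0x38 || memcmp(buf, "XTAC", 4) != 0)
        return 0;
    out->version           = read_le32(buf + 0x04);
    out->flag              = read_le32(buf + 0x08);
    out->pair_table_offset = read_le32(buf + 0x0c);
    out->pair_count        = read_le32(buf + 0x10);
    out->name_offset       = read_le32(buf + 0x14);
    out->name_length       = read_le32(buf + 0x18);
    out->path_offset       = read_le32(buf + 0x1c);
    out->path_length       = read_le32(buf + 0x20);
    out->blck_offset       = read_le32(buf + 0x24);
    out->blck_size         = read_le32(buf + 0x2c);
    return 1;
}

/* First 0x38 bytes of the NTDLL.DLL cache file, copied from the dump. */
const uint8_t ntdll_jc_header[0x38] = {
    0x58, 0x54, 0x41, 0x43, 0x13, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x48, 0x0f, 0x01, 0x00,
    0x1f, 0x01, 0x00, 0x00, 0x38, 0x00, 0x00, 0x00,
    0x12, 0x00, 0x00, 0x00, 0x40, 0x18, 0x01, 0x00,
    0x64, 0x00, 0x00, 0x00, 0x50, 0x00, 0x00, 0x00,
    0x50, 0x00, 0x00, 0x00, 0x50, 0xa1, 0x00, 0x00,
    0xd4, 0xa1, 0x00, 0x00, 0xec, 0x0e, 0x01, 0x00,
};
```

For this file the pair table starts at 0x10f48 and holds 0x11f pairs of 8 bytes each, so it ends at 0x11840, which is exactly the offset of the NT pathname. The module name length of 0x12 bytes matches the nine UTF-16 characters of "NTDLL.DLL".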

XTA cache (.jc) files are composed of a JC address pair table and translated x86-to-ARM code. Entries in this pointer table each contain two relative virtual addresses: the RVA of the original function in the x86 binary and the RVA of the translated function in the cache file, in that order.

Pointer table entries do not always point to the start of a function. They often point to return addresses within a function, i.e., the address of the instruction immediately following a call instruction. Hence multiple address pairs often exist for a single function.
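
To make the pair table concrete, here is a minimal lookup sketch. It assumes nothing about how the entries are ordered, so it scans linearly; the type name, the function name and the sample values are all hypothetical and were not read from a real cache file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the JC address pair table: two DWORDs, 8 bytes in total. */
typedef struct {
    uint32_t x86_rva;         /* RVA in the original x86 binary */
    uint32_t translated_rva;  /* RVA of the translated code in the cache file */
} JcAddressPair;

/* Linear scan, because this sketch makes no assumption about ordering.
 * Returns the matching entry, or NULL if the RVA has no translation. */
const JcAddressPair *find_pair(const JcAddressPair *table, size_t count,
                               uint32_t x86_rva)
{
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (table[i].x86_rva == x86_rva)
            return &table[i];
    }
    return NULL;
}

/* Illustrative values only: one function entry point followed by two
 * return addresses that fall inside the same function. */
const JcAddressPair sample_pairs[3] = {
    { 0x00001000u, 0x00000060u },
    { 0x00001012u, 0x00000098u },
    { 0x00001031u, 0x000000d4u },
};
```

Looking up 0x1012 in sample_pairs returns the second entry, which shows how execution could resume in translated code right after a call returns rather than only at the top of a function.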

Below is a translation of __mainCRTStartup from CMD.EXE.82E4EB063821519C52FE2943A99BF400.1D875ED0DCA07279C17651C0E6C07D68.mp.1.jc:

// __mainCRTStartup - translated ARM code from XTA cache file for C:\Windows\SysWOW64\cmd.exe
   1bebc:   528b49bb    mov w27, #0x5a4d                    // 'MZ'
   1bec0:   51405926    sub w6, w9, #0x16, lsl #12
   1bec4:   512374c6    sub w6, w6, #0x8dd
   1bec8:   48dffcc7    ldarh   w7, [x6]
   1becc:   6b1b00e2    subs    w2, w7, w27
   1bed0:   54000761    b.ne    0x1bfbc  // b.any
(...snip...)
Original (x86) __mainCRTStartup at 0x4168dd (base address 0x400000 + 0x168dd):
   0x004168dd      b84d5a0000     mov eax, 0x5a4d             // 'MZ'
   0x004168e2      663905000040.  cmp word [0x400000], ax     // [0x400000:2]=0xffff
   0x004168e9      7555           jne 0x416940
(...snip...)

Notice the equivalent "mov eax, 0x5a4d" and "mov w27, #0x5a4d" instructions. The memory operand of the cmp instruction ends up being split into four ARM instructions (two subs that form the address, a load, and a compare), while the relative branch jne translates directly into a b.ne. With the cache file available, the emulator can fetch an already-translated function when needed.
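
The two sub instructions are needed because an A64 add or sub immediate is limited to 12 bits, optionally shifted left by 12. The sketch below splits a 24-bit offset the same way. It assumes that w9 holds the x86 address 0x4168dd when the translated block is entered; that is my reading of the listing and has not been confirmed. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The two 12-bit pieces of an offset of up to 24 bits. */
typedef struct {
    uint32_t hi12;  /* encoded with "lsl #12" */
    uint32_t lo12;  /* encoded without a shift */
} SplitImm;

/* Splits an offset into the two immediates of an add/sub pair. */
SplitImm split_imm24(uint32_t offset)
{
    assert(offset < (1u << 24));
    SplitImm s = { offset >> 12, offset & 0xFFFu };
    return s;
}

/* Models "sub w6,w9,#hi,lsl #12" followed by "sub w6,w6,#lo". */
uint32_t sub_split(uint32_t base, SplitImm imm)
{
    uint32_t r = base - (imm.hi12 << 12);
    return r - imm.lo12;
}
```

Splitting 0x168dd yields 0x16 and 0x8dd, the immediates seen in the translated code, and 0x4168dd - 0x16000 - 0x8dd = 0x400000 is the address the x86 cmp reads from. The same split applied to 0x831fe gives 0x83 and 0x1fe, the immediates of the add pair in the very first ARM listing.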

So, the final question remains: how does the emulator (xtajit.dll) actually perform the translation?

XTA JIT (xtajit.dll)

It all begins in BTCpuSimulate, the beating heart of the x86 emulator. However, emulation doesn't start right away when an x86 process begins, because the emulation module, xtajit.dll, has not yet been loaded. Windows automatically loads the native ARM ntdll.dll (C:\Windows\System32\ntdll.dll), the CHPE ntdll.dll (C:\Windows\SyChpe32\ntdll.dll), and the wow64 DLLs required for emulation. Before emulation begins, the xtajit function BTCpuProcessInit is called:

NTSTATUS BTCpuProcessInit(PWCSTR wImageName, PVOID pCpuThreadSize)
{
    // Assumption: the failure status is handed back to the caller
    NTSTATUS status = GetProcessorPowerInformation();
    if (!NT_SUCCESS(status))
        return status;

    InitializeCacheDatabase(); // sub_18000E770

/*...*/

    // Gets Wow64InfoPtr from TEB
    PTEB teb = GetCurrentTeb();
    // Peb32Offset is a byte offset from the 64-bit TEB to the 32-bit TEB
    PTEB32 teb32 = (PTEB32)((PUCHAR)teb + teb->Peb32Offset);
    if (teb == teb32->SubSystemTib)
        wow64InfoPtr = teb32->TlsSlots[10]; // Wow64InfoPtr
    else
        wow64InfoPtr = teb->TlsSlots[10]; // Wow64InfoPtr
    wow64InfoPtr->CpuFlags |= 2; // WOW64_CPUFLAGS_SOFTWARE

/*...*/

    // Done
    return STATUS_SUCCESS;
}

Then the emulation starts in the BTCpuSimulate loop:

void BTCpuSimulate()
{
    // Calls RtlWow64GetCurrentCpuArea (sub_180015530)
    PWOW64_CPU_AREA cpuArea = GetWow64ArmCpuArea();

    while (1)
    {
        /* This function is in the wow64 module. Cross-process items might
           include calls to BTCpuNotifyMemory functions. Among those are:
           BTCpuNotifyMemoryAlloc, BTCpuNotifyMemoryFree,
           BTCpuNotifyMemoryProtect, BTCpuNotifyMemoryDirty */
        Wow64ProcessPendingCrossProcessItems();

        // Inside the CPU emulator
        GetCurrentTeb()->TlsSlots[2] = TRUE;

        if (gWow64CpuInfo->ProcessInitComplete)
            dmb_ish(); // Wait for all memory accesses to finish

        // Updates x86 registers in WOW64 ARM context structure (sub_180015038)
        CpupSwitchToX86(
            cpuArea->Wow64ArmContext, // Destination
            cpuArea->Wow64Context // Source
        );

        // Emulates/executes x86 instruction (sub_1800215d0)
        EmulateX86Function(
            gWow64CpuInfo->BTProperties, // Initialized in BTCpuProcessInit
            cpuArea->Wow64ArmContext // Same as 1st argument to CpupSwitchToX86
        );
    }
}

Registers for x86 are updated in CpupSwitchToX86, where a WOW64_CONTEXT structure inside a WOW64_CPU_AREA structure is updated with the newest values of each register from a source WOW64_CONTEXT. But the function perhaps deserving the most attention in this article is a function I have named EmulateX86Function.

EmulateX86Function

EmulateX86Function is, unsurprisingly, a very large and complex function, and it calls into several more that are just as large and complex. That is to be expected: converting between two completely different CPU architectures involves quite a bit of work. x86 instructions must be decoded, analyzed, and re-expressed as ARM instructions. Several routines cooperate to do this, all stemming from one root function: EmulateX86Function. Below is a (truncated) reverse engineering of this function:

// sub_1800215d0
void EmulateX86Function(PVOID btProperties, PWOW64_ARM_CONTEXT wow64ArmContext)
{
    PVOID jitCacheInfo = wow64ArmContext->JitCacheInfo;
    do
    {
        DWORD eip = wow64ArmContext->Eip;
        DWORD bitShift = jitCacheInfo->BitShift;
        DWORD mask = jitCacheInfo->Mask;
        PVOID funcTbl = jitCacheInfo->FunctionTable;
        DWORD offset = (eip >> bitShift) & mask;
        PVOID funcEntry = funcTbl + offset;
        PVOID funcEntryRes = NULL;
        int counter = 0;
       
        do
        {
            DWORD btFuncOff = funcEntry->BinaryTranslatedFunc;
            if (btFuncOff != 0)
            {
                PVOID origFunc = funcEntry->OriginalFunc;
                funcEntry++;
                if (origFunc == eip)
                {
                    funcEntryRes = funcEntry - 1;
                    break;
                }
            }
            else
            {
                funcEntryRes = NULL;
                break;
            }
        } while (++counter < 32);
       
        if (funcEntryRes != NULL)
        {
            PVOID dispatchInfo = btProperties->DispatchInfo;
            DWORD wordSize = dispatchInfo->WordSize;
            PVOID pJitCacheInfo = dispatchInfo->JitCacheInfo;
            // Offset of the dispatch routine in the JIT-cache buffer
            // (named dispatchOffset so it does not shadow the outer "offset")
            DWORD dispatchOffset = dispatchInfo->Offset;
            // Pointer to the JIT-cache buffer in memory
            PVOID pJitCache = pJitCacheInfo->pJitCache; // Offset 0x20 (32-bit)
            PVOID dispatcher = pJitCache + dispatchOffset;
           
            /* Example: EXP+LdrInitializeThunk (x86) is the first function.
             * Once BTDispatchRoutine returns, wow64ArmContext->Eip becomes
             * LdrInitializeThunk (ARM64). */
            BTDispatchRoutine(wow64ArmContext->Context, wow64ArmContext,
                dispatcher);
        }
    } while (wow64ArmContext->ExecutionFlag != 1);
   
    BTCpuSimulateComplete(btProperties, wow64ArmContext, bUnknown1, bUnknown2,
        bUnknown3, bUnknown4, bUnknown5);
}
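
The inner loop of the listing amounts to a bounded probe of a hashed table. Below is a self-contained sketch of that idea only. The FuncEntry and JitCacheIndex layouts are hypothetical, and treating the computed offset as an entry index (rather than a byte offset) is an assumption, since the real entry size is not shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROBES 32  /* same bound as "counter < 32" in the listing */

/* Hypothetical entry layout; a translated_func of 0 marks an empty slot. */
typedef struct {
    uint32_t original_func;    /* x86 address */
    uint32_t translated_func;  /* location of the translated code */
} FuncEntry;

/* Hypothetical stand-in for the JitCacheInfo fields used by the lookup. */
typedef struct {
    const FuncEntry *table;
    size_t           entries;    /* table length, used for bounds checks */
    uint32_t         bit_shift;
    uint32_t         mask;
} JitCacheIndex;

/* Hashes eip to a starting slot, then probes consecutive slots until a
 * match, an empty slot, the probe limit, or the end of the table. */
const FuncEntry *lookup_translation(const JitCacheIndex *idx, uint32_t eip)
{
    assert(idx != NULL && idx->table != NULL && idx->bit_shift < 32);
    size_t slot = (eip >> idx->bit_shift) & idx->mask;
    for (int probe = 0; probe < MAX_PROBES && slot < idx->entries;
         probe++, slot++) {
        const FuncEntry *e = &idx->table[slot];
        if (e->translated_func == 0)
            return NULL;   /* empty slot: nothing cached for this eip */
        if (e->original_func == eip)
            return e;      /* hit */
    }
    return NULL;           /* probe limit or end of table reached */
}
```

With a shift of 2 and a mask of 7, the addresses 0x1004 and 0x1024 both hash to slot 1, so the second one is found one slot further along. A miss is reported as soon as an empty slot is reached.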

So the emulation begins, but which function or functions does it hand off the translation to and where?

