PolyHook 2: C++17 x86/x64 Hooking Library

stevemk14ebr

Jul 22, 2018

CPOL

PolyHook v2 - the C++17 x86/x64 library supporting multiple methods of hooking

The Library

https://github.com/stevemk14ebr/PolyHook_2_0

Introduction

Previously: PolyHook V1 Article

I've spent the last 2 years rewriting PolyHook to fix a lot of the known edge cases in V1. I'll briefly cover how the implemented hooking methods work, but this is an advanced topic and you should first read my other article, which goes into depth on that. This article will focus on the edge cases, and on why it took me 2 years to get it working in release mode with modern compilers on multiple architectures. It's still not perfect, but it's significantly better in all ways. There's a lot to be said about just how deep the rabbit hole goes; I've only just recently crawled back out of it.

Background

Hooking is the process of redirecting the control flow of a program from its original path. It is typically used when the source code is not available, so it is an inherently low-level process that operates at the assembly level, or at least after the compilation stage. Different methods achieve different effects, but all of them allow executing a callback that fires just before a hooked function would run. Some methods also allow changing function arguments or return values. Furthermore, some methods modify the compiled program's code, while others abuse techniques that are transparent to the running program.

The Bugs

In V1, there were a few unhandled edge cases of inline hooks:

Jmps back into prologue not supported
Indirect prologue (jmp at beginning)
x64 stack touched
Failure to hook left original function malformed in a partially overwritten state
Hooking would race trampoline creation

And also a lot of bugs in other hooking methods:

Mutex acquired in Vectored Exception Handler
Breakpoint type and width not set in Dr7
IAT failed to find import thunk to hook

Let's see what all that means. We'll start with my favorite.

Jmps into prologue (1)

0:  55                      push   ebp
1:  89 e5                   mov    ebp,esp <-
3:  89 e5                   mov    ebp,esp  |
5:  89 e5                   mov    ebp,esp  |
7:  89 e5                   mov    ebp,esp  |
9:  90                      nop             |
a:  90                      nop             |
b:  7f f4                   jg     0x1   ----

Notice the jg assembly instruction jumps back to address 0x1. When performing a hook on x86, the above prologue is overwritten with a 5 byte e9 style jump so that it becomes the following:

0:  e9 ef be ad de          jmp    hook_callback <--
5:  89 e5                   mov    ebp,esp          |   <--- callback executes, runs the
7:  89 e5                   mov    ebp,esp          |        overwritten instructions and
9:  90                      nop                     |        returns here once done
a:  90                      nop                     |
b:  7f f4                   jg     0x1 -------------

That jg now points to byte ef, which belongs to the jmp. This is a problem: when it executes, it lands in the middle of the instruction, so the bytes are not interpreted as a jmp but as garbage. There are many ways to fix this, some more complex than others. We could re-encode the jg to point to 0x0 so that it lands on the jmp and no longer executes garbage. But taking that jmp would break the control flow: the user callback would fire a second time, and execution would not continue at mov ebp, esp like it did originally. So this is wrong.
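Detecting this situation can be sketched in a few lines. The following is only an illustration, not PolyHook's code: the helper names shortJccTarget and landsInsideHookJmp are made up here, offsets stand in for real addresses, and only short conditional jumps (opcodes 0x70-0x7F) are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Target offset of a short conditional jump (opcodes 0x70-0x7F, rel8).
// The displacement is relative to the end of the 2-byte instruction.
inline std::ptrdiff_t shortJccTarget(const std::uint8_t* code, std::size_t offset) {
    assert((code[offset] & 0xF0) == 0x70 && "expected a short Jcc opcode");
    const auto disp = static_cast<std::int8_t>(code[offset + 1]);
    return static_cast<std::ptrdiff_t>(offset) + 2 + disp;
}

// True when the target lands inside the bytes replaced by the hook jmp but
// not on its first byte, i.e. in the middle of the new instruction.
inline bool landsInsideHookJmp(std::ptrdiff_t target, std::size_t hookJmpSize) {
    return target > 0 && target < static_cast<std::ptrdiff_t>(hookJmpSize);
}

// The example prologue from the first listing in this section.
constexpr std::uint8_t kPrologue[] = {
    0x55, 0x89, 0xE5, 0x89, 0xE5, 0x89, 0xE5,
    0x89, 0xE5, 0x90, 0x90, 0x7F, 0xF4};
```

For the example prologue, shortJccTarget(kPrologue, 0xB) yields 0x1, and landsInsideHookJmp(0x1, 5) is true, which is exactly the case that needs a fixup.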

We could also try to build a jmp table: overwrite a little more of the prologue to make room for table entries, each a wider jmp that reaches all the way to wherever the trampoline is. The whole prologue section would be copied to the trampoline, and whenever the original code wants to execute one of those instructions, we redirect the conditional jmp to the bigger jmp, which lands on the copy.

0:  e9 ef be ad de          jmp    hook_callback
5:  e9 ef be ad de          jmp    trampoline_mov_ebp_esp <- points to copy in trampoline
a:  90                      nop
b:  7f f8                   jg     0x5

But this has a really big problem. IT'S SUPER HARD. The jmp table must live in the prologue because we only have -128 to +127 bytes of displacement to work with (the single signed byte of jg 7f f4). So the more fixups we have to do, the more of the prologue we overwrite, which could potentially mean even more fixups, which means... yeah, it's an unbounded recursive problem being solved in a fixed amount of space. And what happens when you need so many fixups that your jmp table grows until it hits the first jump you fixed (address b in this example)? I tried to implement this many times, but it introduces more edge cases than it fixes, and the problem can be solved better and more simply with the method described next.

The general solution that I chose was to K.I.S.S. and just expand the prologue section that is copied to the trampoline, then fix the jump there if its target is in range. Here is what the current example turns into:

0: e9 ef be ad de           jmp hook_callback
.... nops all the way down ... 
b: 90                       nop 

trampoline: 
100:  55                      push   ebp
101:  89 e5                   mov    ebp,esp <-
103:  89 e5                   mov    ebp,esp  |
105:  89 e5                   mov    ebp,esp  |
107:  89 e5                   mov    ebp,esp  |
109:  90                      nop             |
10a:  90                      nop             |
10b:  7f f4                   jg     0x101  ---

Let's look at a more complicated example that also requires a jmp table entry in the trampoline:

Original function:                                           
145804c [1]: 55                            push ebp     <--  
145804d [2]: 8b ec                         mov ebp, esp   | <-
145804f [2]: 74 fb                         je 0x145804c --   | <-
1458051 [2]: 74 ea                         je 0x145803d -----   |
1458053 [2]: 74 fa                         je 0x145804f ---------
1458055 [2]: 8b ec                         mov ebp, esp
1458057 [2]: 8b ec                         mov ebp, esp
1458059 [2]: 8b ec                         mov ebp, esp

Trampoline:
c11a20 [1]: 55                            push ebp     <-
c11a21 [2]: 8b ec                         mov ebp, esp  |
c11a23 [2]: 74 fb                         je 0xc11a20 ---   <-
c11a25 [2]: 74 07                         je 0xc11a2e ----   |
c11a27 [2]: 74 fa                         je 0xc11a23 -- |  --
c11a29 [5]: e9 27 66 84 00                jmp 0x1458055  |
c11a2e [5]: e9 0a 66 84 00                jmp 0x145803d <-

These jmps make it complicated to just move the prologue section. We have to move the whole thing as a chunk, and then redirect any conditional je whose target lies outside the chunk to a bigger jmp once it's relocated to the trampoline. This is because the je only has -128 to +127 bytes of displacement to work with, and it's extremely unlikely that the trampoline's buffer happened to be allocated that close to the original target. So expanding the prologue works, but it gets really complicated to redirect all the jmps so that code flow is preserved and every instruction stays within its displacement size. This is implemented in PolyHook V2.
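The displacement arithmetic behind the two e9 entries at the end of the trampoline can be checked by hand. Here is a small sketch of it; encodeJmpRel32 and fitsRel8 are names invented for this example, not PolyHook functions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Encode `jmp rel32` (opcode E9). The displacement is measured from the end
// of the 5-byte instruction and stored little-endian.
inline std::array<std::uint8_t, 5> encodeJmpRel32(std::uint64_t from, std::uint64_t to) {
    const std::int64_t delta =
        static_cast<std::int64_t>(to) - static_cast<std::int64_t>(from + 5);
    assert(delta >= INT32_MIN && delta <= INT32_MAX && "target out of rel32 range");
    const auto disp = static_cast<std::uint32_t>(static_cast<std::int32_t>(delta));
    return {0xE9,
            static_cast<std::uint8_t>(disp),
            static_cast<std::uint8_t>(disp >> 8),
            static_cast<std::uint8_t>(disp >> 16),
            static_cast<std::uint8_t>(disp >> 24)};
}

// Can a 2-byte short jump located at `from` reach `to` with its signed
// 8-bit displacement?
inline bool fitsRel8(std::uint64_t from, std::uint64_t to) {
    const std::int64_t delta =
        static_cast<std::int64_t>(to) - static_cast<std::int64_t>(from + 2);
    return delta >= -128 && delta <= 127;
}
```

With the addresses from the listing, encodeJmpRel32(0xc11a29, 0x1458055) produces e9 27 66 84 00, and fitsRel8(0xc11a25, 0x145803d) is false, which is why that je has to be routed through the jmp table entry at c11a2e instead.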

Indirect Prologue (2)

Turns out compilers like to optimize stuff! In release mode, many calls do not go directly to the function, but to a jmp table first. The following demonstrates this:

foo(); 

typical asm: 
call foo 

optimized asm: 
jmp 0x0 

jmp table:
0:  jmp foo_implementation     <- jmp to actual guts of foo
5:  jmp bar_implementation 
10: jmp foobar_implementation
...

So hooking would fail because this:

void (*pFnFoo)() = &foo;

would not point to the guts of foo but to the jmp in the jmp table. Hooking there went horribly wrong: the jmp table was left malformed, and other seemingly random functions would do who knows what, since they now pointed to who knows where. The fix was to follow these jmps until we land at real code. This also fixes hooking a function multiple times, as the second hook will just follow the first hook's jmp to the callback and hook the callback, chaining callback hooks at runtime in assembly... isn't that neat.
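A rough sketch of the "follow the jmps" idea, under simplifying assumptions: the code lives in a byte vector, offsets stand in for addresses, and only the 5-byte e9 form is followed (a real implementation would need a disassembler and would also have to handle other jmp encodings). The names followJmps and writeJmp are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Follow a chain of `jmp rel32` (E9) thunks inside `image`, starting at
// offset `start`, and return the offset of the first non-jmp instruction.
// `maxHops` guards against cycles.
inline std::size_t followJmps(const std::vector<std::uint8_t>& image,
                              std::size_t start, int maxHops = 16) {
    std::size_t at = start;
    for (int hop = 0; hop < maxHops; ++hop) {
        if (at >= image.size() || image.size() - at < 5 || image[at] != 0xE9) {
            return at;  // not a jmp thunk: treat this as real code
        }
        std::int32_t disp;
        std::memcpy(&disp, &image[at + 1], sizeof disp);  // little-endian host assumed
        at = static_cast<std::size_t>(static_cast<std::ptrdiff_t>(at) + 5 + disp);
    }
    return at;
}

// Helper for building test images: write a `jmp rel32` at `from` that
// lands on `to`.
inline void writeJmp(std::vector<std::uint8_t>& image, std::size_t from, std::size_t to) {
    assert(from + 5 <= image.size());
    const auto disp = static_cast<std::int32_t>(
        static_cast<std::ptrdiff_t>(to) - static_cast<std::ptrdiff_t>(from + 5));
    image[from] = 0xE9;
    std::memcpy(&image[from + 1], &disp, sizeof disp);
}
```

Because the loop keeps going until it sees something other than a jmp, the same routine covers both the compiler's jmp table and a previously installed hook.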

Stack touched (3)

56360477b000 [1]: 55                            push rbp
56360477b001 [3]: 48 89 e5                      mov rbp, rsp
56360477b004 [3]: 89 7d fc                      mov dword ptr [rbp - 4], edi
56360477b007 [4]: 83 7d fc 00                   cmp dword ptr [rbp - 4], 0       
56360477b00b [2]: 7e 15                         jle 0x56360477b022
56360477b00d [5]: b8 0f 00 00 00                mov eax, 0xf               
56360477b012 [1]: 50                            push rax                   <- oopsies just overwrote edi
56360477b013 [a]: 48 b8 4d 5a 53 04 36 56 00 00 movabs rax, 0x563604535a4d
56360477b01d [4]: 48 87 04 24                   xchg qword ptr [rsp], rax
56360477b021 [1]: c3                            ret

On x64 in PolyHook V1, the gadget push, mov, xchg, ret was used to jmp back to the original function, and the push from that gadget clobbers stack values. This caused hard-to-diagnose behavior differences in hooked functions. In V2, this is fixed by using the FF 25 style jmp.

ff 25 ef be ad de        jmp [0xdeadbeef] 
deadbeef:                &original_function

As you can see, there is no stack or register usage involved, so it's fine. It does mix code and data however as the destination to jmp to is actually written into memory somewhere in the .text section...it's fine with careful book-keeping and in V2, I write this data at the very end of the trampoline where the data can never be accidentally executed as code.
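To make the encoding concrete, here is a sketch of building such a jmp. One detail worth stating as an assumption of this example: in 64-bit mode the four bytes after ff 25 are a displacement relative to the end of the instruction (RIP-relative), while in 32-bit mode they are the absolute address of the pointer slot, as written in the listing above. The function name encodeJmpIndirect is invented for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// x64 `jmp qword ptr [rip + disp32]` (FF 25). `instrAddr` is where the 6-byte
// instruction will live and `slotAddr` is where the 8-byte destination
// pointer is stored; disp32 is measured from the end of the instruction.
inline std::array<std::uint8_t, 6> encodeJmpIndirect(std::uint64_t instrAddr,
                                                     std::uint64_t slotAddr) {
    const std::int64_t delta =
        static_cast<std::int64_t>(slotAddr) - static_cast<std::int64_t>(instrAddr + 6);
    assert(delta >= INT32_MIN && delta <= INT32_MAX && "slot out of disp32 range");
    const auto disp = static_cast<std::int32_t>(delta);
    std::array<std::uint8_t, 6> out{0xFF, 0x25, 0, 0, 0, 0};
    std::memcpy(out.data() + 2, &disp, sizeof disp);  // little-endian host assumed
    return out;
}
```

A displacement of zero means the pointer slot sits directly after the instruction; putting the slot at the end of the trampoline just means a larger displacement.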

Malformed Prologue on Errors

There are various errors that can cause a hook to fail midway through modifying the assembly. An allocation could fail, the disassembler could hit a bad instruction, we might fail to resolve a jmp, etc. If one of these cases was hit in V1, the assembly would be left in a partially overwritten state and it would be up to the user to fix it. This is bad design. In V2, all of the hooking logic operates on a cached byte buffer of the instructions. When writes occur, they go to the buffer (one small buffer per instruction). Only at the end of the hooking operation, once we are reasonably sure all is well, are these byte buffers actually written and the original assembly modified. As an added bonus, the features needed to do this were upstreamed to Capstone 'next', so unlike V1, PolyHook no longer requires a fork of Capstone to work properly.
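The stage-then-commit idea can be illustrated with a tiny helper. This is a simplified stand-in, not the V2 implementation: StagedPatch is a hypothetical class, a std::vector plays the role of the target function's bytes, and the only failure modelled is an out-of-range write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical staging helper: edits are queued on the side and only applied
// to the live bytes if every staging step succeeded.
class StagedPatch {
public:
    explicit StagedPatch(std::vector<std::uint8_t>& live) : live_(live) {}

    // Queue `bytes` to be written at `offset`; returns false if out of range.
    bool stage(std::size_t offset, std::vector<std::uint8_t> bytes) {
        if (offset > live_.size() || bytes.size() > live_.size() - offset) {
            failed_ = true;
            return false;
        }
        pending_.emplace_back(offset, std::move(bytes));
        return true;
    }

    // Apply everything, or nothing at all if any stage() call failed.
    bool commit() {
        if (failed_) return false;
        for (const auto& [offset, bytes] : pending_) {
            assert(offset + bytes.size() <= live_.size());
            for (std::size_t i = 0; i < bytes.size(); ++i) live_[offset + i] = bytes[i];
        }
        pending_.clear();
        return true;
    }

private:
    std::vector<std::uint8_t>& live_;
    std::vector<std::pair<std::size_t, std::vector<std::uint8_t>>> pending_;
    bool failed_ = false;
};
```

If any stage() call fails, commit() refuses to write and the original bytes stay exactly as they were, which is the property V1 was missing.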

Trampoline Creation Race Condition

The API in V1 was meant to be simple. You call setup, then hook, then a method to get the allocated trampoline to call the original:

Detour detour; 
detour.setup(&hookMe, &myCallback);
detour.hook(); 
pTrampoline = detour.getOriginal();

The problem, however, was that you could only get a pointer to the allocated trampoline AFTER you had hooked the function. So it was possible for your callback to be dispatched in between the call to hook and the assignment to pTrampoline. If this happened, the callback would fire and attempt to call pTrampoline, which still held an invalid value, and you'd crash. The allocation of the trampoline occurs inside the hook() routine, so there was no simple fix for this in V1. In V2, however, the interface was changed. The constructor now takes a pointer to your trampoline variable and fills it for you just before the hook is committed to memory. Because the trampoline variable you pass is filled before the hook overwrites the original function, you get the guarantee that your callback only fires once your trampoline variable is valid.

Detour detour(&hookMe, &myCallback, &trampoline);
detour.hook();

Vectored Exception and Vectored Continue Handlers

To implement the hooking types that throw exceptions, PolyHook needs to register an exception handler. This exception handler needs to catch the exception so that it can call the callback and resume as if the hook never threw an exception in the first place. This is done with the API:

PVOID WINAPI AddVectoredExceptionHandler(
  _In_ ULONG                       FirstHandler,
  _In_ PVECTORED_EXCEPTION_HANDLER VectoredHandler
);

It takes a pointer to a function to be called when the exception occurs, and potentially multiple hook types will generate different exceptions, but they all will be routed to the same handler. If we take a look at the MSDN remarks, the first thing it says is:

Quote:

Remarks

The handler should not call functions that acquire synchronization objects or allocate memory, because this can cause problems. Typically, the handler will simply access the exception record and return.

Now let's go look at the first line for the handler code for V1:

std::lock_guard<std::mutex> m_Lock(m_TargetMutex);

Whoops, that's undefined. V2 fixes this. There's also another interesting type of exception handler, though: a VectoredContinueHandler. A VectoredExceptionHandler is called when the exception is raised, but a VectoredContinueHandler is called once another handler has decided to return EXCEPTION_CONTINUE_EXECUTION. Turns out debuggers return this if you click play (not single stepping). This is a nice method to detect if BP hooks are being used or debuggers are attached. Here's a good post about these things.

There is also a secret magic number C++ exceptions throw which I found during development:

0xE06D7363: // this is ExceptionInfo->ExceptionRecord->ExceptionCode;
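The number is less magic than it looks: the low three bytes are the ASCII characters 'm', 's', 'c', under a 0xE0 top byte. A small sketch that spells this out (the constant and function names are invented for this example):

```cpp
#include <cassert>
#include <cstdint>

// 0xE0 in the top byte, followed by the ASCII characters 'm', 's', 'c'.
constexpr std::uint32_t kMsvcCppException =
    0xE0000000u | (std::uint32_t{'m'} << 16) | (std::uint32_t{'s'} << 8) | std::uint32_t{'c'};
static_assert(kMsvcCppException == 0xE06D7363u, "matches the code quoted above");

// Helper a vectored handler could use to ignore ordinary C++ exceptions.
inline bool isMsvcCppException(std::uint32_t exceptionCode) {
    return exceptionCode == kMsvcCppException;
}
```

A handler that only cares about breakpoint or guard-page exceptions can use a check like this to pass C++ exceptions straight through.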

BP Type and Size

When you place a hardware breakpoint, the debugger actually writes the type of breakpoint, the address to hit on, and the size to hit on into special registers on your CPU: the debug registers Dr0-Dr7. You are allowed to place up to 4 BPs per thread. Dr0-Dr3 hold the addresses you want to break on, and a few bits in Dr7 control whether they are enabled, their type, and their size. In V1, I had a bug where I didn't set the bits in Dr7 correctly. I wrote the address to hit on, and then enabled the breakpoint by writing:

switch (m_regIdx) {
case 0:
    ctx.Dr0 = (decltype(ctx.Dr0))m_fnAddress;
    break;
case 1:
    ctx.Dr1 = (decltype(ctx.Dr1))m_fnAddress;
    break;
case 2:
    ctx.Dr2 = (decltype(ctx.Dr2))m_fnAddress;
    break;
case 3:
    ctx.Dr3 = (decltype(ctx.Dr3))m_fnAddress;
    break;
}

ctx.Dr7 |= 1ULL << (2 * m_regIdx);

This tells the CPU to turn on one of the HW bp's and to hit on address m_fnAddress, but not whether to hit on read, write, or execute, and also not the size of memory it should monitor. To do that, I needed:

ctx.Dr7 &= ~(3ULL << (16 + 4 * m_regIdx)); //00b at 16-17, 20-21, 24-25, 28-29 is execute bp
ctx.Dr7 &= ~(3ULL << (18 + 4 * m_regIdx)); // size of 1 (val 0), at 18-19, 22-23, 26-27, 30-31

which sets a 1 byte breakpoint to hit on execution. For reference, here is the bit layout of Dr7 from:

https://wiki.osdev.org/CPU_Registers_x86#Debug_Registers

bit	Description
0	local DR0 enable
1	global DR0 enable
2	local DR1 enable
3	global DR1 enable
4	local DR2 enable
5	global DR2 enable
6	local DR3 enable
7	global DR3 enable
16-17	type DR0
18-19	size DR0
20-21	type DR1
22-23	size DR1
24-25	type DR2
26-27	size DR2
28-29	type DR3
30-31	size DR3

Quote:

00b condition means execution break, 01b means a write watchpoint, and 11b means an R/W watchpoint. 10b is reserved for I/O R/W (unsupported).
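Putting the enable, type, and size bits together, the Dr7 update can be written as a pure function on the register value. This is a sketch with invented names (BpType, BpLen, armBreakpoint); the type values come from the quote above, and the size encodings (00b = 1 byte, 01b = 2 bytes, 11b = 4 bytes, 10b = 8 bytes) are the ones documented by Intel rather than something stated in this article.

```cpp
#include <cassert>
#include <cstdint>

enum class BpType : std::uint64_t { Execute = 0b00, Write = 0b01, ReadWrite = 0b11 };
enum class BpLen : std::uint64_t { One = 0b00, Two = 0b01, Eight = 0b10, Four = 0b11 };

// Return `dr7` with breakpoint slot `idx` (0-3) locally enabled and its type
// and size fields set, following the bit layout in the table above.
inline std::uint64_t armBreakpoint(std::uint64_t dr7, unsigned idx, BpType type, BpLen len) {
    assert(idx < 4);
    const unsigned typeShift = 16 + 4 * idx;
    const unsigned lenShift = 18 + 4 * idx;
    dr7 |= 1ULL << (2 * idx);                              // local enable bit
    dr7 &= ~(3ULL << typeShift);                           // clear the old type
    dr7 |= static_cast<std::uint64_t>(type) << typeShift;  // set the new type
    dr7 &= ~(3ULL << lenShift);                            // clear the old size
    dr7 |= static_cast<std::uint64_t>(len) << lenShift;    // set the new size
    return dr7;
}
```

Clearing each field before setting it matters because Dr7 may still hold stale bits from a previous breakpoint in the same slot.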

Currently, I still set the debug registers with a call to SetThreadContext from the same thread, which is undefined according to Microsoft. I'm wagering this is OK because I only set the debug registers and I've never had it fail in any of my testing, but I have not done any in-depth analysis to check whether this is truly OK.

Finding IAT Thunks Failed

In V1, the IAT hook would sometimes fail because it couldn't find the import. This was because I made the mistake of only walking my own process's IAT, and not also those of the other modules it had loaded. If you want to resolve the thunk of an entry, you have to do it more or less recursively. A process loads a few modules (what I call DLLs) and those DLLs export some entries. Those DLLs, however, ALSO have IATs and can load other things which also have... which also... you get it. And this is where my mistake was: I naively only went the first level deep in V1, so it sometimes failed to find APIs. I also used dbghelp.lib to find the IMAGE_DIRECTORY_ENTRY_IMPORT directory, which was nice but added a dependency. So the fix was to walk the PEB to find all loaded modules, and then walk the IAT of each loaded module.

The PEB stores a linked list of modules at Peb->Ldr->InLoadOrderModuleList, and you can grab an image base from there. Then, to get the IAT, you cast the image base to a DosHeader, add DosHeader->e_lfanew to the image base to reach the NTHeader, and read NTHeader->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT]. You also need to carefully check for null pointers, as some of the fields in the IAT are zeroed depending on the compiler. Full code is on GitHub.
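The header walk can be sketched without windows.h by reading the documented PE offsets directly from a byte buffer. This is an illustration only: findImportDirectory, readAt, and DataDirectory are names made up here, the image is assumed to be a byte vector rather than a mapped module, and a little-endian host is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

struct DataDirectory { std::uint32_t rva; std::uint32_t size; };

// Bounds-checked read of a trivially copyable T at byte offset `off`.
template <typename T>
inline std::optional<T> readAt(const std::vector<std::uint8_t>& img, std::size_t off) {
    if (off > img.size() || sizeof(T) > img.size() - off) return std::nullopt;
    T value;
    std::memcpy(&value, img.data() + off, sizeof(T));
    return value;
}

// Locate the import data directory (index 1) of a PE image held in `img`.
// Returns nullopt for malformed headers or when there is no import table.
inline std::optional<DataDirectory> findImportDirectory(const std::vector<std::uint8_t>& img) {
    const auto mz = readAt<std::uint16_t>(img, 0);
    if (!mz || *mz != 0x5A4D) return std::nullopt;              // "MZ"
    const auto lfanew = readAt<std::uint32_t>(img, 0x3C);       // e_lfanew
    if (!lfanew) return std::nullopt;
    const auto sig = readAt<std::uint32_t>(img, *lfanew);
    if (!sig || *sig != 0x00004550) return std::nullopt;        // "PE\0\0"
    const std::size_t opt = static_cast<std::size_t>(*lfanew) + 4 + 20;  // signature + file header
    const auto magic = readAt<std::uint16_t>(img, opt);
    if (!magic) return std::nullopt;
    std::size_t dirs;
    if (*magic == 0x10B) dirs = opt + 96;                       // PE32
    else if (*magic == 0x20B) dirs = opt + 112;                 // PE32+
    else return std::nullopt;
    const auto count = readAt<std::uint32_t>(img, dirs - 4);    // NumberOfRvaAndSizes
    if (!count || *count < 2) return std::nullopt;
    const auto dir = readAt<DataDirectory>(img, dirs + 1 * sizeof(DataDirectory));
    if (!dir || dir->rva == 0 || dir->size == 0) return std::nullopt;
    return dir;
}
```

The empty-directory case at the end corresponds to modules with no import table, like the ntdll.dll entry in the output below.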

The result is V2 can search the IAT correctly and recursively now (capped the list to show only a few APIs):

Module: PolyHook_2.exe
--DLL: KERNEL32.dll
----API: GetStdHandle
----API: IsDebuggerPresent
----API: OutputDebugStringA
----API: AddVectoredExceptionHandler
----API: RemoveVectoredExceptionHandler
----API: SetThreadStackGuarantee
----API: GetConsoleScreenBufferInfo
--DLL: MSVCP140.dll
----API: ?_Getgloballocale@locale@std@@CAPEAV_Locimp@12@XZ
----API: ?always_noconv@codecvt_base@std@@QEBA_NXZ
----API: ?tolower@?$ctype@D@std@@QEBADD@Z
----API: ?tolower@?$ctype@D@std@@QEBAPEBDPEADPEBD@Z
----API: ?_Getcat@?$ctype@D@std@@SA_KPEAPEBVfacet@locale@2@PEBV42@@Z
----API: ?in@?$codecvt@DDU_Mbstatet@@@std@@QEBAHAEAU_Mbstatet@@PEBD1AEAPEBDPEAD3AEAPEAD@Z
----API: ?out@?$codecvt@DDU_Mbstatet@@@std@@QEBAHAEAU_Mbstatet@@PEBD1AEAPEBDPEAD3AEAPEAD@Z
--DLL: VCRUNTIME140.dll
----API: strrchr
----API: _purecall
----API: __std_terminate
----API: __std_type_info_destroy_list
----API: memchr
----API: memmove
----API: strchr
--DLL: api-ms-win-crt-runtime-l1-1-0.dll
----API: _seh_filter_dll
----API: _configure_narrow_argv
----API: _initialize_narrow_environment
----API: _initialize_onexit_table
----API: _register_onexit_function
----API: _execute_onexit_table
----API: _crt_atexit
--DLL: api-ms-win-crt-heap-l1-1-0.dll
----API: _callnewh
----API: free
----API: realloc
----API: calloc
----API: _set_new_mode
----API: malloc
--DLL: api-ms-win-crt-utility-l1-1-0.dll
----API: rand
----API: srand
----API: qsort
--DLL: api-ms-win-crt-math-l1-1-0.dll
----API: _dtest
----API: __setusermatherr
----API: pow
----API: _fdtest
--DLL: api-ms-win-crt-stdio-l1-1-0.dll
----API: _set_fmode
----API: _get_stream_buffer_pointers
----API: fclose
----API: fflush
----API: fgetc
----API: fgetpos
----API: __stdio_common_vsprintf
--DLL: api-ms-win-crt-filesystem-l1-1-0.dll
----API: _lock_file
----API: _unlock_file
--DLL: api-ms-win-crt-string-l1-1-0.dll
----API: isalnum
----API: tolower
----API: strncpy
----API: strncmp
--DLL: api-ms-win-crt-time-l1-1-0.dll
----API: strftime
----API: _gmtime64_s
----API: _time64
--DLL: api-ms-win-crt-convert-l1-1-0.dll
----API: atoi
--DLL: api-ms-win-crt-locale-l1-1-0.dll
----API: _configthreadlocale
Module: ntdll.dll
[!]ERROR:PEs without import tables are unsupported
Module: KERNEL32.DLL
--DLL: api-ms-win-core-rtlsupport-l1-1-0.dll
----API: RtlVirtualUnwind
----API: RtlUnwindEx
----API: RtlRestoreContext
----API: RtlLookupFunctionEntry
----API: RtlInstallFunctionTableCallback
----API: RtlRaiseException
----API: RtlDeleteFunctionTable
--DLL: ntdll.dll
----API: RtlSizeHeap
----API: RtlLCIDToCultureName
----API: RtlUnicodeStringToInteger
----API: _wcslwr
----API: RtlGetUILanguageInfo
----API: EtwEventEnabled
----API: RtlpConvertLCIDsToCultureNames
--DLL: KERNELBASE.dll
----API: lstrlenA
----API: BaseFormatObjectAttributes
----API: GetVolumeNameForVolumeMountPointW
----API: AppContainerFreeMemory
----API: AppContainerLookupMoniker
----API: BasepNotifyTrackingService
----API: MoveFileWithProgressTransactedW
--DLL: api-ms-win-core-processthreads-l1-1-0.dll
----API: GetProcessTimes
----API: GetProcessId
----API: GetThreadId
----API: GetCurrentProcess
----API: GetCurrentProcessId
----API: GetThreadPriority
----API: GetThreadPriorityBoost
--DLL: api-ms-win-core-processthreads-l1-1-3.dll
----API: GetProcessInformation
----API: SetProcessInformation
----API: SetThreadIdealProcessor
----API: GetProcessShutdownParameters
--DLL: api-ms-win-core-processthreads-l1-1-2.dll
----API: GetThreadIOPendingFlag
----API: SetThreadInformation
----API: GetSystemTimes
----API: GetThreadInformation
----API: SetProcessPriorityBoost
----API: GetProcessPriorityBoost
--DLL: api-ms-win-core-processthreads-l1-1-1.dll
----API: GetProcessHandleCount
----API: SetProcessMitigationPolicy
----API: GetProcessMitigationPolicy
----API: SetThreadIdealProcessorEx
----API: GetThreadIdealProcessorEx
----API: GetThreadContext
----API: GetThreadTimes
--DLL: api-ms-win-core-registry-l1-1-0.dll
----API: RegLoadMUIStringW
----API: RegLoadMUIStringA
----API: RegNotifyChangeKeyValue
----API: RegLoadKeyA
----API: RegGetValueA
----API: RegFlushKey
----API: RegEnumValueW
--DLL: api-ms-win-core-heap-l1-1-0.dll
----API: HeapCreate
----API: HeapWalk
----API: HeapAlloc
----API: GetProcessHeap
----API: HeapFree
----API: HeapUnlock
----API: HeapSetInformation
--DLL: api-ms-win-core-heap-l2-1-0.dll
----API: LocalFree
--DLL: api-ms-win-core-memory-l1-1-1.dll
----API: QueryMemoryResourceNotification
----API: CreateMemoryResourceNotification
----API: GetLargePageMinimum
----API: GetProcessWorkingSetSizeEx
----API: GetSystemFileCacheSize
----API: SetProcessWorkingSetSizeEx
----API: SetSystemFileCacheSize
--DLL: api-ms-win-core-memory-l1-1-0.dll
----API: MapViewOfFileEx
----API: OpenFileMappingW
----API: MapViewOfFile
----API: CreateFileMappingW
----API: VirtualQueryEx
----API: VirtualQuery
----API: VirtualProtectEx
... AND SO ON ...

Compiler Optimization WTF moments

An optimizing compiler used to be my best friend... we've since parted ways:

The compiler may inline a function you took a function pointer to, leaving your pointer pointing to the middle of another block of code. Likely this was because the function pointer was never called, only used to get the address of the assembly to modify. Mark the function __declspec(noinline).
The compiler may completely remove a function you took a function pointer to if it's not called, leaving you with a dangling pointer to invalid memory. WTF Compiler!?! Marking it __declspec(noinline) and using lots of volatiles inside seems to fix this. Adding printf or other calls to functions with side effects also keeps this behavior at bay.
The compiler may re-order statements. Well known, but this bit me a few times. Marking things volatile fixes this... sometimes.
The compiler may remove reads and writes to unused variables or parameters. Mark everything volatile.
Release mode calls are sometimes indirected through a jmp table. Why?
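As a concrete illustration of the workarounds in this list, here is roughly what a test target hardened against the optimizer might look like. HOOK_NOINLINE is a made-up macro that maps to __declspec(noinline) on MSVC and to the equivalent attribute on GCC/Clang; none of this is taken from the PolyHook test suite.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#define HOOK_NOINLINE __declspec(noinline)
#else
#define HOOK_NOINLINE __attribute__((noinline))
#endif

// A hook target for tests. noinline asks for a real, callable body; the
// volatile accesses give the body side effects the optimizer must preserve.
HOOK_NOINLINE int hookMe(int x) {
    volatile int guard = x;  // volatile reads and writes cannot be elided
    guard = guard + 1;
    return guard;
}

// Calling through a volatile function pointer forces the pointer value to be
// loaded at runtime, so the call cannot be folded into a direct or inlined one.
inline int callThroughPointer(int x) {
    int (*volatile fn)(int) = &hookMe;
    return fn(x);
}
```

None of this is guaranteed by the language standard to survive every optimizer, so treat it as a set of habits that have worked in practice rather than a contract.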

Conclusion

Hooking is really hard, but fun.