
Hypervisor From Scratch – Part 8: How To Do Magic With Hypervisor!

Sina Karvandi

Introduction

Hi guys,

Welcome to the 8th part of Hypervisor From Scratch. If you've reached here, you've probably finished reading the 7th part, and personally, I believe the 7th part was the most challenging part to understand, so hats off, you did a great job.

The 8th part is an exciting one, as we'll see lots of real-world, practical examples of solving reverse-engineering problems with hypervisors. For example, we'll see how hidden hooks work in the presence of a hypervisor, how to create a syscall hook, and how to transfer messages from vmx root-mode to the OS (vmx non-root mode) and then to user-mode, which gives us a valuable amount of information about how the system works.

Besides some OS-related concepts, we'll also see some CPU-related topics like VPIDs and some general information about how the patches for Meltdown and Spectre work.

Event injection, the Exception Bitmap, and adding support for virtualizing a Hyper-V machine are other topics that will be discussed.

Before starting, I should give special thanks to my friend Petr Benes for his contributions to Hypervisor From Scratch; of course, Hypervisor From Scratch could never have existed without his help. Thanks also to Liran Alon for his great help in fixing the VPID problem, and to Gerhart for his in-depth knowledge of Hyper-V internals, which made Hypervisor From Scratch available for Hyper-V.

Overview

This part is divided into eight main sections :

  1. How to inject interrupts (Event) into the guest and Exception Bitmap
  2. Implementing hidden hooks using EPT
  3. Syscall hook
  4. Invalidating EPT caches using VPID
  5. Demonstrating a custom VMX Root-mode compatible message tracing mechanism and adding WPP Tracing to our Hypervisor
  6. We’ll add support to Hyper-V
  7. Fixing some previous design caveats
  8. Discussion (in this section, we discuss different questions and approaches about various topics covered in this part)

The full source code of this tutorial is available on GitHub :

[https://github.com/SinaKarvandi/Hypervisor-From-Scratch]

Table of Contents

  • Introduction
  • Overview
  • Table of Contents
  • Event Injection
    • Vectored Events
      1. Interrupts
      2. Exceptions
    • Exception Classifications
    • Event Injection Fields
    • Vectored Event Injection
    • Exception Error Codes
  • Exception Bitmap
  • Monitor Trap Flag (MTF)
  • Hidden Hooks (Simulating Hardware Debug Registers Without Any Limitation)
    • Hidden Hooks Scenarios for Read/Write and Execute
    • Implementing Hidden Hooks
    • Removing Hooks From Pages
    • An Important Note When Modifying EPT Entries
  • System-Call Hook
    • Finding Kernel Base
    • Finding SSDT and Shadow SSDT Tables
    • Get Routine Address by Syscall Number
  • Virtual Processor ID (VPID) & TLB
    • INVVPID – Invalidate Translations Based on VPID
      1. Individual-address invalidation
      2. Single-context invalidation
      3. All-contexts invalidation
      4. Single-context invalidation, retaining global translations
    • Important Notes For Using VPIDs
    • INVVPID vs. INVPCID
  • Designing A VMX Root-mode Compatible Message Tracing
    • Concepts
      1. What’s a spinlock?
      2. Test-and-Set
      3. What do we mean by “Safe”?
      4. What is DPC?
    • Challenges
    • Designing A Spinlock
    • Message Tracer Design
      1. Initialization Phase
      2. Sending Phase (Saving Buffer and adding them to pools)
      3. Reading Phase (Read buffers and send them to user-mode)
      4. Checking for new messages
      5. Sending messages to pools
      6. Receiving buffers and messages in user-mode
      7. IOCTL and managing user-mode requests
      8. User-mode notify callback
      9. Uninitialization Phase
  • WPP Tracing
  • Supporting to Hyper-V
    • Enable Nested Virtualization
    • Hyper-V’s visible behavior in nested virtualization
    • Hyper-V Hypervisor Top-Level Functional Specification (TLFS)
    • Out of Range MSRs
    • Hyper-V Hypercalls (VMCALLs)
    • Hyper-V Interface CPUID Leaves
  • Fixing Previous Design Issues
    • Fixing the problem with pre-allocated buffers
    • Avoid Intercepting Accesses to CR3
    • Restoring IDTR, GDTR, GS Base and FS Base
  • Let’s Test it!
    • View WPP Tracing Messages
    • How to test?
      1. Event Injection & Exception Bitmap Demo
      2. Hidden Hooks Demo
        • Read/Write Hooks or Hardware Debug Registers Simulation
        • Hidden Execution Hook
      3. Syscall Hook Demo
  • Discussion
  • Conclusion
  • References

Event Injection

One of the essential parts of a hypervisor is the ability to inject events (events are interrupts, exceptions, NMIs, and SMIs) as if they've arrived normally, and the capability to monitor received interrupts and exceptions.

This gives us a great ability to manage the guest operating system and a unique ability to build applications. For example, if you are developing an anti-cheat application, you can easily disable breakpoint and trap interrupts, which completely disables all the features of WinDbg or any other debugger: you're the first one notified about the breakpoint, so you can decide to abort it or pass it to the debugger.

This is just a simple example that an attacker would need to find a way around. You can also use event injection for reverse-engineering purposes, e.g., directly injecting a breakpoint into an application that uses various anti-debugging techniques to hide its code.

We can also implement some important features of our hypervisor, like hidden hooks, by relying on event injection.

Before going deep into event injection, we need to know some basic processor concepts and terms used by Intel. Most of them are derived from this post and this answer.

Intel x86 defines two overlapping categories, vectored events (interrupts vs exceptions), and exception classes (faults vs traps vs aborts).

Vectored Events

Vectored Events (interrupts and exceptions) cause the processor to jump into an interrupt handler after saving much of the processor’s state (enough such that execution can continue from that point later).

Exceptions and interrupts have an ID, called a vector, that determines which interrupt handler the processor jumps to. Interrupt handlers are described within the Interrupt Descriptor Table (IDT).

Interrupts

Interrupts occur at random times during the execution of a program, in response to signals from the hardware. System hardware uses interrupts to handle events external to the processor, such as requests to service peripheral devices. The software can also generate interrupts by executing the INT n instruction.

Exceptions

Exceptions occur when the processor detects an error condition while executing an instruction, such as division by zero. The processor identifies a variety of error conditions, including protection violations, page faults, and internal machine faults.

Exception Classifications

Exceptions are classified as faults, traps, or aborts depending on the way they are reported and whether the instruction that caused the exception can be restarted without loss of program or task continuity.

In summary: traps increment the instruction pointer (RIP), faults do not, and aborts ‘explode’.

We’ll start with the fault classification. You’ve probably heard of things called page faults (or segmentation faults if you’re from the past).

A fault is just an exception type that can be corrected and allows the processor the ability to execute some fault handler to rectify an offending operation without terminating the entire operation. When a fault occurs, the system state is reverted to an earlier state before the faulting operation occurred, and the fault handler is called. After executing the fault handler, the processor returns to the faulting instruction to execute it again. That last sentence is important because that means it redoes an instruction execution to make sure the proper results are used in the following operations. This is different from how a trap is handled.

A trap is an exception that is delivered immediately following the execution of a trapping instruction. In our hypervisor, we trap on various instructions, meaning that after the execution of an instruction – say rdtsc or rdtscp – a trap exception is reported to the processor. Once a trap exception is reported, control is passed to a trap handler, which will perform some operation(s). Following the execution of the trap handler, the processor returns to the instruction following the trapping instruction.

An abort, however, is an exception that occurs and doesn't always yield the location of the error. Aborts are commonly used for reporting hardware errors. You won't see these very often, and if you do… well, you're doing something wrong. It's important to know that all exceptions are reported on an instruction boundary – excluding aborts. An instruction boundary is quite simple: if you have the bytes 0F 31 48 C1 E2 20, which disassemble to rdtsc followed by shl rdx, 20h, then the instruction boundary would be between the bytes 31 and 48. That's because 0F 31 is the opcode for rdtsc. This way, the two instructions are separated by a boundary.

Event Injection Fields

Event injection is done using the interruption-information field of the VMCS.

The interruption information is written into the VM-entry fields of the VMCS during VM-entry; after all the guest context has been loaded, including MSRs and registers, the processor delivers the exception through the Interrupt Descriptor Table (IDT), using the vector specified in this field.

The first field to configure for event injection is the VM-entry interruption-information field (32 bits), or VM_ENTRY_INTR_INFO in the VMCS; this field provides details about the event to be injected.

The following picture shows the detail of each bit.

VM-Entry Interruption-Information
  • The vector (bits 7:0) determines which entry in the IDT is used or which other event is injected; in other words, it defines the index in the IDT of the interrupt to be injected. For example, the following command (!idt) in WinDbg shows the IDT indexes (note that the indexes are the numbers on the left).

The interruption type (bits 10:8) determines details of how the injection is performed.

In general, a VMM should use the type hardware exception for all exceptions other than the following:

  • Breakpoint exceptions (#BP): a VMM should use the type software exception.
  • Overflow exceptions (#OF): a VMM should use the type software exception.
  • Debug exceptions (#DB) that are generated by INT1: a VMM should use the type privileged software exception.

For exceptions, the deliver-error-code bit (bit 11) determines whether delivery pushes an error code on the guest stack. (We'll talk about error codes later.)

The last point is that VM entry injects an event if and only if the valid bit (bit 31) is 1. The valid bit in this field is cleared on every VM exit, meaning that when you want to inject an event, you set this bit, and the processor automatically clears it at the next VM-exit.

The second field that controls the event injection is VM-entry exception error code.

VM-entry exception error code (32 bits) or VM_ENTRY_EXCEPTION_ERROR_CODE in the VMCS: This field is used if and only if the valid bit (bit 31) and the deliver error-code bit (bit 11) are both set in the VM-entry interruption-information field.

The third field that controls the event injection is VM-entry instruction length.

VM-entry instruction length (32 bits) or VM_ENTRY_INSTRUCTION_LEN in the VMCS: For injection of events whose type is a software interrupt, software exception, or privileged software exception, this field is used to determine the value of RIP that is pushed on the stack.

All in all, these things in VMCS control the Event Injection process: VM_ENTRY_INTR_INFO, VM_ENTRY_EXCEPTION_ERROR_CODE, VM_ENTRY_INSTRUCTION_LEN.

Vectored Event Injection

If the valid bit in the VM-entry interruption-information field is 1, VM entry causes an event to be delivered (or made pending) after all components of the guest state have been loaded (including MSRs) and after the VM-execution control fields have been established.

The interruption type (which is described above) can be one of the following values.
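As a reference, this is roughly how the interruption types can be represented (the values follow the Intel SDM's interruption-information format; the enum name itself is illustrative):

// Interruption type (bits 10:8 of the interruption-information field).
typedef enum _INTERRUPT_TYPE
{
    INTERRUPT_TYPE_EXTERNAL_INTERRUPT            = 0,
    INTERRUPT_TYPE_RESERVED                      = 1,
    INTERRUPT_TYPE_NMI                           = 2,
    INTERRUPT_TYPE_HARDWARE_EXCEPTION            = 3,
    INTERRUPT_TYPE_SOFTWARE_INTERRUPT            = 4,
    INTERRUPT_TYPE_PRIVILEGED_SOFTWARE_EXCEPTION = 5,
    INTERRUPT_TYPE_SOFTWARE_EXCEPTION            = 6,
    INTERRUPT_TYPE_OTHER_EVENT                   = 7
} INTERRUPT_TYPE;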

Now it’s time to set the vector bit. The following enum is the representation of the indexes in IDT. (Look at the indexes of !idt command above).

In general, the event is delivered as if it had been generated normally, and the event is delivered using the vector in that field to select a descriptor in the IDT. Since event injection occurs after loading IDTR (IDT Register) from the guest-state area, this is the guest IDT, or in other words, the event is delivered to GUEST_IDTR_BASE and GUEST_IDTR_LIMIT.

Putting the above descriptions into the implementation, we have the following function :
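A minimal sketch of such an injection routine (the union layout mirrors the bit description above; the helper and field names are illustrative, and __vmx_vmwrite is the MSVC intrinsic for VMWRITE):

#include <ntddk.h>
#include <intrin.h>   // __vmx_vmwrite

// VM-entry interruption-information field (VMCS encoding 0x4016).
typedef union _INTERRUPT_INFO
{
    struct
    {
        UINT32 Vector : 8;           // Bits 7:0  - IDT vector
        UINT32 InterruptType : 3;    // Bits 10:8 - interruption type
        UINT32 DeliverErrorCode : 1; // Bit 11    - push an error code on the guest stack
        UINT32 Reserved : 19;        // Bits 30:12
        UINT32 Valid : 1;            // Bit 31    - injection is valid
    };
    UINT32 Flags;
} INTERRUPT_INFO;

VOID
EventInjectInterruption(UINT32 InterruptionType, UINT32 Vector,
                        BOOLEAN DeliverErrorCode, UINT32 ErrorCode)
{
    INTERRUPT_INFO Inject = {0};

    Inject.Valid            = TRUE;
    Inject.InterruptType    = InterruptionType;
    Inject.Vector           = Vector;
    Inject.DeliverErrorCode = DeliverErrorCode ? 1 : 0;

    __vmx_vmwrite(0x4016 /* VM_ENTRY_INTR_INFO */, Inject.Flags);

    if (DeliverErrorCode)
    {
        __vmx_vmwrite(0x4018 /* VM_ENTRY_EXCEPTION_ERROR_CODE */, ErrorCode);
    }
}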

As an example, if we want to inject a #BP (breakpoint) into the guest, we can use the following code:

Or if we want to inject a #GP(0) or general protection fault with error code 0 then we use the following code:
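As a sketch, building on the EventInjectInterruption routine above (the interruption-type values follow the Intel SDM):

// #BP (vector 3) is injected as a software exception; no error code.
VOID EventInjectBreakpoint()
{
    EventInjectInterruption(INTERRUPT_TYPE_SOFTWARE_EXCEPTION, 3 /* #BP */, FALSE, 0);

    // For software exceptions, the VM-entry instruction length is used to
    // compute the RIP pushed on the stack (1 byte for a 0xCC breakpoint).
    __vmx_vmwrite(0x401A /* VM_ENTRY_INSTRUCTION_LEN */, 1);
}

// #GP (vector 13) is injected as a hardware exception with error code 0.
VOID EventInjectGeneralProtection()
{
    EventInjectInterruption(INTERRUPT_TYPE_HARDWARE_EXCEPTION, 13 /* #GP */, TRUE, 0);
}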

You can write functions for other types of interrupts and exceptions. The only thing that you should consider is the InterruptionType, which is always hardware exception except for #DB, #BP, and #OF, which were discussed above.

Exception Error Codes

You might notice that we used VM_ENTRY_EXCEPTION_ERROR_CODE in the VMCS and the 11th bit of the interruption-information field, and for some exceptions we disabled them while for some others we set them to a specific value, so what are these error codes?

Some exceptions will push a 32-bit “error code” on to the top of the stack, which provides additional information about the error. This value must be pulled from the stack before returning control back to the currently running program. (i.e., before calling IRET for returning from interrupt).

The fact that the error code must be pulled from the stack makes event injection more complicated, as we have to make sure whether Windows tries to pull an error code from the stack or not. It becomes an error if we push something onto the stack that Windows doesn't expect to pull later, or if we don't push anything while Windows thinks there is something on the stack that needs to be pulled.

The following table shows some of these exceptions with the presence or absence of Error code, this table is derived from Intel SDM, Volume 1, CHAPTER 6 (Table 6-1. Exceptions and Interrupts).

Name | Vector nr. | Type | Mnemonic | Error code?
Divide-by-zero Error | 0 (0x0) | Fault | #DE | No
Debug | 1 (0x1) | Fault/Trap | #DB | No
Non-maskable Interrupt | 2 (0x2) | Interrupt | - | No
Breakpoint | 3 (0x3) | Trap | #BP | No
Overflow | 4 (0x4) | Trap | #OF | No
Bound Range Exceeded | 5 (0x5) | Fault | #BR | No
Invalid Opcode | 6 (0x6) | Fault | #UD | No
Device Not Available | 7 (0x7) | Fault | #NM | No
Double Fault | 8 (0x8) | Abort | #DF | Yes (Zero)
Coprocessor Segment Overrun | 9 (0x9) | Fault | - | No
Invalid TSS | 10 (0xA) | Fault | #TS | Yes
Segment Not Present | 11 (0xB) | Fault | #NP | Yes
Stack-Segment Fault | 12 (0xC) | Fault | #SS | Yes
General Protection Fault | 13 (0xD) | Fault | #GP | Yes
Page Fault | 14 (0xE) | Fault | #PF | Yes
Reserved | 15 (0xF) | - | - | No
x87 Floating-Point Exception | 16 (0x10) | Fault | #MF | No
Alignment Check | 17 (0x11) | Fault | #AC | Yes
Machine Check | 18 (0x12) | Abort | #MC | No
SIMD Floating-Point Exception | 19 (0x13) | Fault | #XM/#XF | No
Virtualization Exception | 20 (0x14) | Fault | #VE | No
Reserved | 21-29 (0x15-0x1D) | - | - | No
Security Exception | 30 (0x1E) | - | #SX | Yes
Reserved | 31 (0x1F) | - | - | No
Triple Fault | - | - | - | No
FPU Error Interrupt | IRQ 13 | Interrupt | #FERR | No

Now that we've learned how to create new events, it's time to see how to monitor system interrupts.

Exception Bitmap

If you remember from MSR Bitmaps, we have a mask for each MSR that shows whether the read or write on that MSR should cause a vm-exit or not.

The monitoring of exceptions uses the same method, which means that a simple mask governs it. This mask is EXCEPTION_BITMAP in VMCS.

The exception bitmap is a 32-bit field that contains one bit for each exception. When an exception occurs, its vector is used to select a bit in this field. If the bit is 1, the exception causes a VM exit. If the bit is 0, the exception is delivered normally through the IDT.

Now it’s up to you to decide whether you want to inject that exception back to the guest or change the state or whatever you want to do.

For example, if you set the 3rd bit of the EXCEPTION_BITMAP, then whenever a breakpoint occurs somewhere (both user-mode and kernel-mode), a vm-exit with EXIT_REASON_EXCEPTION_NMI (exit reason == 0) occurs.

Now we can change the state of the program and then resume the guest. Remember that resuming the guest doesn't cause the exception to be delivered to the guest; we have to inject an event manually if we want the guest to process the event normally. For example, we can use the function "EventInjectBreakpoint", as mentioned earlier, to inject the exception back into the guest.

The last question is how we can find the index of the exception that occurred. You know, we might set the exception bitmap for multiple exceptions, so we have to know the exact reason why this vm-exit happened or, more clearly, what exception caused this vm-exit.

The following VMCS fields report the details of the event:

  • VM_EXIT_INTR_INFO
  • VM_EXIT_INTR_ERROR_CODE

VM_EXIT_INTR_INFO can be interpreted using the following structure:
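A sketch of that structure (the bit layout follows the Intel SDM's VM-exit interruption-information format; the names are illustrative):

typedef union _VMEXIT_INTERRUPT_INFO
{
    struct
    {
        UINT32 Vector : 8;           // Bits 7:0  - vector of the event
        UINT32 InterruptionType : 3; // Bits 10:8 - interruption type
        UINT32 ErrorCodeValid : 1;   // Bit 11    - an error code was delivered
        UINT32 NmiUnblocking : 1;    // Bit 12    - NMI unblocking due to IRET
        UINT32 Reserved : 18;        // Bits 30:13
        UINT32 Valid : 1;            // Bit 31    - field is valid
    };
    UINT32 Flags;
} VMEXIT_INTERRUPT_INFO;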

We can read the details using the vmread instruction; for example, the following code shows how we can detect whether a breakpoint (0xcc) occurred.
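For instance, a sketch of that check in the EXIT_REASON_EXCEPTION_NMI handler, assuming the structure above (__vmx_vmread is the MSVC intrinsic for VMREAD):

VMEXIT_INTERRUPT_INFO InterruptExit = {0};
size_t                InterruptInfo = 0;

__vmx_vmread(0x4404 /* VM_EXIT_INTR_INFO */, &InterruptInfo);
InterruptExit.Flags = (UINT32)InterruptInfo;

if (InterruptExit.Valid && InterruptExit.Vector == 3 /* #BP */)
{
    // A breakpoint (0xcc) caused this vm-exit; handle it here or
    // re-inject it into the guest with EventInjectBreakpoint().
}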

If we want to re-inject an exception that comes with an error code (see the above table), then the error code can be read using VM_EXIT_INTR_ERROR_CODE in VMCS. After that, write the error code to VM_ENTRY_EXCEPTION_ERROR_CODE and enable the deliver-error-code of VM_ENTRY_INTR_INFO to make sure that re-injection is without any flaw.

Also, keep in mind that page faults are treated differently; you can read the Intel SDM for more information.

But wait! Have you noticed that the exception bitmap is just a 32-bit field in the VMCS, while we have up to 256 interrupts in the IDT?!

If you’re curious about this question you can read its answer in Discussion section.

Monitor Trap Flag (MTF)

Monitor Trap Flag or MTF is a feature that works exactly like Trap Flag in r/eflags except it’s invisible to the guest.

Whenever you set this flag on CPU_BASED_VM_EXEC_CONTROL, after VMRESUME, the processor executes one instruction then a vm-exit occurs.

We have to clear this flag; otherwise, each instruction causes a vm-exit.

The following function is responsible for setting and unsetting MTF.
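A sketch of such a helper; MTF is bit 27 of the primary processor-based VM-execution controls (CPU_BASED_VM_EXEC_CONTROL, VMCS encoding 0x4002), and the function name is illustrative:

// Set or unset the Monitor Trap Flag in the primary processor-based controls.
VOID
HvSetMonitorTrapFlag(BOOLEAN Set)
{
    size_t CpuBasedVmExecControls = 0;

    __vmx_vmread(0x4002 /* CPU_BASED_VM_EXEC_CONTROL */, &CpuBasedVmExecControls);

    if (Set)
    {
        CpuBasedVmExecControls |= (1 << 27);   // Monitor Trap Flag
    }
    else
    {
        CpuBasedVmExecControls &= ~(1 << 27);
    }

    __vmx_vmwrite(0x4002 /* CPU_BASED_VM_EXEC_CONTROL */, CpuBasedVmExecControls);
}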

Setting MTF leads to a vm-exit with exit reason (EXIT_REASON_MONITOR_TRAP_FLAG), we unset the MTF in the vm-exit handler.

MTF is essential in implementing hidden hooks, more details about MtfEptHookRestorePoint later in the hidden hooks section.

Here’s the MTF vm-exit handler.

Hidden Hooks

(Simulating Hardware Debug Registers Without Any Limitation)

Have you ever used hardware debug registers?!

The debug registers allow researchers and programmers to selectively enable various debug conditions (read, write, execute) associated with a set of four debug addresses without any change in program instructions.

As you know, we can set up to 4 locations to these hardware registers, and it’s the worst limitation for these registers.

So what if we have a structure (let's say _EPROCESS) and we want to see which functions in Windows read or write to this structure?

It's not possible with the current debug registers, but we can use EPT to the rescue!

Hidden Hooks Scenarios for Read/Write and Execute

We have two strategies for hidden hooks, one for Read/Write and one for Execute.

For Read/Write,

we unset read or write or both (based on what the user wants) in the EPT entry corresponding to the address.

This means that before a read or write, a vm-exit occurs, and an EPT Violation will notify us. In the EPT Violation handler, we log the address that was read or written, then we find the entry in the EPT table and set both read and write (meaning that any read or write to the page is allowed) and also set the MTF flag.

The VMM resumes, one instruction executes (in other words, the read or write is performed), and then an MTF vm-exit occurs. In the MTF vm-exit handler, we unset the read and write access again so that any future access to that page causes an EPT Violation.

Note that all of the above scenarios happen on one core. Each core has a separate TLB and a separate Monitor Trap Flag.

For Execute,

For execution, we use a capability in Intel processors called execute-only.

Execute-only means that we can have a page with execute access enabled while read and write access is disabled.

If the user wants an execution hook, then we find the entry in the EPT table, unset read and write access, and set execute access. Then we create a copy of the original page (Page A) somewhere else (Page B) and modify the copied page (Page B) with an absolute jump to the hook function.

Now, each time any instruction attempts to execute our function, the absolute jump is performed, and our hook function is called. Each time any instruction tries to read or write that location, an EPT Violation occurs, as we unset read and write access on that page, so we can swap in the original page (Page A) and also set the monitor trap flag to restore the hook after executing one instruction.

Wasn’t it easy ? Review it one more time if you didn’t understand.

You can also think about different methods; for example, DdiMon creates a copy of the page and modifies the hook location by placing a one-byte breakpoint (0xcc) there. It then intercepts each breakpoint (using the Exception Bitmap) and swaps in the original page. This method is much simpler to implement and more reliable, but it causes a vm-exit for each hook, so it's slower; the first method, using EPT hooks, never causes a vm-exit for execution.

Vm-exits for Read and Write hooks are unavoidable.

The execution hook for this part is derived from Gbps hv.

Let’s dig into implementation.

Implementing Hidden Hooks

For hooking functions, first, we split the page into 4 KB entries, as described in the previous part. Then we find the entry and read it. We want to save the details of a hooked page so we can use them later. For read/write hooks, we unset read or write or both, while for execution hooks, we unset read/write access, set execute access, copy the page contents into a new page, and swap the entry's physical address with the second page's physical address (the fake page's physical address).

Then we build a trampoline (explained later), decide how to invalidate the TLB based on the vmx state (vmx root or vmx non-root), and finally add the hook details to the HookedPagesList.

Now we need a function that creates another page and patches the original page (Page A) with an absolute jump (trampoline) that jumps to another page (Page B).

In (Page B), we will jump to the hooked function. This function also copies the bytes that are patched into (Page B) and saves the original function address so the caller can return back to the original page from (Page B).

This is a simple inline hook that we use LDE (LDE64x64) as the detour function.

For creating a simple absolute jump we use the following function.
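One common way to build such a jump, shown here as a sketch (not necessarily the exact byte sequence used in the project), is an indirect jmp qword ptr [rip+0] followed by the 8-byte target:

// Writes a 14-byte absolute jump at TargetBuffer:
//   FF 25 00 00 00 00      jmp qword ptr [rip+0]
//   <8-byte absolute target address>
VOID
HookWriteAbsoluteJump(PCHAR TargetBuffer, SIZE_T TargetAddress)
{
    TargetBuffer[0] = (CHAR)0xFF;
    TargetBuffer[1] = (CHAR)0x25;
    *(UINT32 *)&TargetBuffer[2] = 0;             // rip-relative displacement = 0
    *(UINT64 *)&TargetBuffer[6] = TargetAddress; // jump destination
}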

In the case of EPT Violations, first, we find the details of the physical address that caused this vm-exit. Then we call EptHandleHookedPage to create a log with the details, and then we set MTF to restore the hooked state after executing one instruction.

Each time an EPT Violation occurs, we check whether it was caused by a read access, write access, or execute access violation and log the GUEST_RIP; then we restore the initial flags (all read, write, and execute allowed).

That's it! We have working hidden hooks.

Removing Hooks From Pages

Removing hooks from pages is essential for us for two reasons: first, sometimes we need to disable the hooks, and second, when we want to turn off the hypervisor, we have to remove all the hooks; otherwise, we might encounter strange behavior.

Removing hooks is simple, as we saved the details, including the original entries, in the PageHookList; we have to find the entries in this list, broadcast to all processors so they update their TLBs, and also remove the entries.

The following function is for this purpose.

In vmx-root, we also search for the specific hook and use EptSetPML1AndInvalidateTLB to return that entry to the initial state, which is previously saved in OriginalEntry.

If we want to unhook all the pages, then we use another VMCALL; there is no need to iterate through the list here, as all of the hooks must be removed, so we just broadcast it to all the cores.

In vmx-root we just iterate through the list and restore them to the initial state.

An Important Note When Modifying EPT Entries

One interesting thing that I encountered while testing my driver on a multi-core system was the fact that EPT entries should be modified in one instruction.

For example, if you change the access bits of an EPT entry bit by bit, another core might access the page table after one access bit has changed but before the next one is applied; this sometimes leads to an EPT Misconfiguration, and sometimes you simply don't get the desired behavior.

For example the following method for modifying EPT entries is wrong!

But the following code is correct. (Applying changes in one instruction instantly).
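A sketch of the difference, with an illustrative (simplified) EPT PML1 entry layout:

// Simplified 4-KB EPT entry - only the bits we touch here.
typedef union _EPT_PML1_ENTRY
{
    struct
    {
        UINT64 ReadAccess : 1;
        UINT64 WriteAccess : 1;
        UINT64 ExecuteAccess : 1;
        UINT64 Reserved : 61;
    };
    UINT64 AsUInt;
} EPT_PML1_ENTRY, *PEPT_PML1_ENTRY;

VOID
EptSetReadWrite(PEPT_PML1_ENTRY EntryAddress)
{
    // Wrong: two separate stores; another core may observe the entry
    // between them and hit an EPT Misconfiguration.
    //   EntryAddress->ReadAccess  = 1;
    //   EntryAddress->WriteAccess = 1;

    // Correct: prepare the new value locally, then publish it with a
    // single aligned 64-bit store.
    EPT_PML1_ENTRY NewEntry = *EntryAddress;
    NewEntry.ReadAccess  = 1;
    NewEntry.WriteAccess = 1;
    EntryAddress->AsUInt = NewEntry.AsUInt;
}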

This is why we have the following function, which acquires a spinlock to make sure that only one entry is modified at a time and then invalidates that core's TLB.

The above function solves the problem of simultaneously modifying the EPT table, as we have one EPT table for all cores.

System-Call Hook

When it comes to hypervisors, we have different options for hooking system-calls. Each of these methods has its own advantages and disadvantages.

Let’s review some of the methods, that we can use to hook system-calls.

The first method is hooking MSR 0xc0000082 (LSTAR). This MSR is the kernel entry point for dispatching system-calls. Each time an instruction like syscall is executed in user-mode, the processor automatically switches to kernel-mode and runs the code at the address stored in this MSR. In Windows, the address of KiSystemCall64 is stored in this MSR.

This means that each time an application needs to invoke a system-call, it executes a syscall, and this function is then responsible for finding the corresponding entry in the SSDT and calling it. In short, the SSDT is a table in Windows that stores pointers to Windows functions, indexed by system-call number. All SSDT entries and the LSTAR MSR are under the control of PatchGuard.

This brings us three possibilities!

First, we can change the LSTAR MSR to point to our custom function, and to make it PatchGuard-compatible, we can set the MSR Bitmap so that if any kernel routine wants to read this MSR, a vm-exit occurs and we can change the result. Instead of revealing our custom handler, we show KiSystemCall64, and PatchGuard will never know that this is a fake MSR.

Hooking the LSTAR MSR is complicated, and the updates for Meltdown make it even more complicated. On a post-Meltdown system, LSTAR points to KiSystemCall64Shadow, which involves changing CR3 and executing the KPTI-related instructions of the Meltdown mitigation. It's not a good idea to hook LSTAR, as we have difficulties with pre-Meltdown and post-Meltdown mitigations, and also, because the system state changes at this point, we can't hook anything in the kernel, as the kernel is not mapped in that CR3.

Hyperbone uses this method (even though it was not updated for post-Meltdown systems at the time of writing this article).

The second option is finding the SSDT tables and changing their entries to point to our custom functions; each time PatchGuard tries to audit these entries, we can show it the unpatched listings. The only thing that we should keep in mind is to find where KiSystemCall64 tries to read that location and save that location somewhere, so we can tell whether the function that tries to read it is the syscall dispatcher or some other function (and probably PatchGuard).

Implementing this method is not super-fast, as we need to unset EPT read access for the SSDT entry, and each time a read happens, a vm-exit occurs; so we have one vm-exit for each syscall, and that makes the computer slow!

The third option is finding the functions in the SSDT entries and putting a hidden hook on the functions that we need to hook. This way, we can catch a custom list of functions, because I think hooking all system-calls is stupid!

We implement the third option in this part.

Another possible way is Syscall Hooking Via Extended Feature Enable Register (EFER), as described here. This method is based on disabling Syscall Enable (or SCE bit) of the EFER MSR; hence each time a Syscall is executed, a #UD exception is generated by the processor, and we can intercept #UD by using Exception Bitmap (described above) to handle these syscalls.

Again it’s not a good idea because it leads to a vm-exit for each syscall; thus, it’s substantially slow but usable for experimental purposes.

Also, there might be other options. Don't hesitate to leave a comment on this post and describe it if you know one!

Finding Kernel Base

To find the SSDT, we need to find nt!KeServiceDescriptorTable and nt!KeServiceDescriptorTableShadow. These tables are exported on x86 systems but not on x64. This makes things much more complicated, as the routine to find these tables might change in future versions of Windows; thus, our syscall hooker might have problems in future versions.

First of all, we need to find the base address of ntoskrnl and its image size. This is done by using ZwQuerySystemInformation; first, we find this function by using MmGetSystemRoutineAddress.

Then we allocate memory to receive the details from Windows and find the base address and module size.

Update 2: You can also use RtlPcToFileHeader instead of above method:
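A sketch of that approach: RtlPcToFileHeader returns the base of the image containing a given address, so passing the address of any routine inside ntoskrnl yields the kernel base (ZwClose is used here just as an arbitrary exported nt routine):

#include <ntddk.h>

PVOID
GetKernelBase(VOID)
{
    PVOID          KernelBase  = NULL;
    UNICODE_STRING RoutineName = RTL_CONSTANT_STRING(L"ZwClose");
    PVOID          NtRoutine   = MmGetSystemRoutineAddress(&RoutineName);

    if (NtRoutine != NULL)
    {
        // Returns the base of the image (ntoskrnl) that contains NtRoutine.
        RtlPcToFileHeader(NtRoutine, &KernelBase);
    }

    return KernelBase;
}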

Finding SSDT and Shadow SSDT Tables

Now that we have the base address ntoskrnl we can search for this pattern to find nt!KeServiceDescriptorTableShadow.

nt!KeServiceDescriptorTableShadow contains both nt!KiServiceTable and win32k!W32pServiceTable, which are the SSDTs of the syscall functions for both NT syscalls and Win32k syscalls.

Note that nt!KeServiceDescriptorTable only contains nt!KiServiceTable; it doesn't provide win32k!W32pServiceTable.

Get Routine Address by Syscall Number

After finding the NT Syscall Table and Win32k Syscall Table, now it’s time to translate Syscall Numbers to its corresponding address.

The following formula converts API Number to function address.

Keep in mind that NT syscalls start from 0x0, but Win32k syscalls start from 0x1000, so as we compute indexes based on the start of the table, we should subtract 0x1000 from the Win32k syscall numbers.

All in all, we have the following function.
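A sketch of that translation, assuming the bases of nt!KiServiceTable and win32k!W32pServiceTable were found earlier (on x64, each SSDT entry is a 32-bit value whose upper bits hold the routine's offset relative to the table base, hence the >> 4); the variable names here are assumptions:

// NtTableBase / Win32kTableBase are assumed to hold nt!KiServiceTable and
// win32k!W32pServiceTable, found in the previous step.
PVOID
SyscallHookGetFunctionAddress(INT32 ApiNumber, BOOLEAN GetFromWin32k)
{
    PLONG ServiceTableBase;
    LONG  Offset;

    if (!GetFromWin32k)
    {
        // NT syscalls start from 0x0.
        ServiceTableBase = (PLONG)NtTableBase;
        Offset           = ServiceTableBase[ApiNumber];
    }
    else
    {
        // Win32k syscalls start from 0x1000, so subtract the base index.
        ServiceTableBase = (PLONG)Win32kTableBase;
        Offset           = ServiceTableBase[ApiNumber - 0x1000];
    }

    // RoutineAddress = TableBase + (TableEntry >> 4)
    return (PVOID)((PUCHAR)ServiceTableBase + (Offset >> 4));
}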

Now that we have the address of the routine that we want, it's time to put a hidden hook on that function; we also need the function prototypes so that we can read the arguments appropriately.

The syscall hook example is demonstrated later in the (How to test?) section.


Virtual Processor ID (VPID) & TLB

Intel's explanation of VPIDs is vague, so I found a great link that explains them in a much more straightforward way; hence, it's better to read the details below instead of starting with the SDM.

The translation lookaside buffer (TLB) is a high-speed memory page cache for virtual-to-physical address translation. It follows the locality principle to avoid time-consuming lookups for recently used pages.

Host mappings are not coherent with the guest and vice versa. Each guest has its own address space, and the mapping table cannot be re-used in another guest (or in the host). Therefore, first-generation VMX implementations like Intel Core 2 flush the TLB on each VM-entry (resume) and VM-exit. But flushing the TLB is a show-stopper; it is one of the most critical components in a modern CPU.

Intel engineers started to think about that. With Intel Nehalem, TLB entries changed by introducing a Virtual Processor ID, so each TLB entry is tagged with this ID. The CPU does not specify VPIDs; the hypervisor allocates them, with the host using VPID 0. Starting with Intel Nehalem, the TLB does not have to be flushed. When a process tries to access a mapping where the current VPID does not match the TLB entry's VPID, a standard TLB miss occurs. Some Intel numbers show a latency performance gain of 40% for a VM round-trip transition compared to Merom, an Intel Core 2.

Imagine you have two or more VMs:

  • If you enable VPIDs, you don't have to worry that VM1 accidentally fetches cached memory of VM2 (or even of the hypervisor itself)
  • If you don't enable VPIDs, the CPU assigns VPID=0 to all operations (VMX root & VMX non-root) and flushes the TLB on each transition for you

A logical processor may tag some cached information with a 16-bit VPID.

The VPID is 0000H in the following situations:

  • Outside VMX operation. (e.g System Management Mode (SMM)).
  • VMX root operation
  • VMX non-root operation when the “enable VPID” VM-execution control is 0

INVVPID – Invalidate Translations Based on VPID

In order to support VPIDs, we have to add CPU_BASED_CTL2_ENABLE_VPID to Secondary Processor-Based VM-Execution Controls.

The next step is to set a 16-bit value to VMCS’s VIRTUAL_PROCESSOR_ID field using VMWRITE instruction. This value is used as an index for the current VMCS on this core so our current VMCS’s VPID is 1.

Also, as described above, 0 has special meaning and should not be used.

INVVPID (instruction) invalidates mappings in the translation lookaside buffers (TLBs) and paging-structure caches based on the virtual processor identifier (VPID).

For the INVVPID there are 4 types that currently supported by the processors which are reported in the IA32_VMX_EPT_VPID_CAP MSR.

The enumeration for these types are :
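A sketch of that enumeration (the type values are defined by the Intel SDM):

typedef enum _INVVPID_TYPE
{
    INVVPID_INDIVIDUAL_ADDRESS            = 0,
    INVVPID_SINGLE_CONTEXT                = 1,
    INVVPID_ALL_CONTEXTS                  = 2,
    INVVPID_SINGLE_CONTEXT_RETAIN_GLOBALS = 3
} INVVPID_TYPE;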

I’ll describe these types in detail later.

For the implementation of INVVPID we use an assembly function like this (which executes invvpid from the RCX and RDX for x64 fast calling convention) :

and then, a general purpose function for calling this assembly function :

For INVVPID, there is a descriptor defined below.

INVVPID Descriptor

This structure defined like this :
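A sketch of the descriptor and a C wrapper around it, assuming an AsmInvvpid assembly stub (hypothetical name) that executes invvpid rcx, oword ptr [rdx]:

// INVVPID descriptor (128 bits): VPID in bits 15:0, bits 63:16 reserved,
// linear address in bits 127:64.
typedef struct _INVVPID_DESCRIPTOR
{
    UINT64 Vpid : 16;
    UINT64 Reserved : 48;
    UINT64 LinearAddress;
} INVVPID_DESCRIPTOR, *PINVVPID_DESCRIPTOR;

// Assembly stub: RCX = invalidation type, RDX = pointer to the descriptor.
extern void AsmInvvpid(UINT64 Type, PINVVPID_DESCRIPTOR Descriptor);

VOID
Invvpid(INVVPID_TYPE Type, UINT16 Vpid, UINT64 LinearAddress)
{
    INVVPID_DESCRIPTOR Descriptor = {0};

    Descriptor.Vpid          = Vpid;
    Descriptor.LinearAddress = LinearAddress;

    AsmInvvpid((UINT64)Type, &Descriptor);
}

// Convenience helpers used later in this part.
VOID InvvpidSingleContext(UINT16 Vpid) { Invvpid(INVVPID_SINGLE_CONTEXT, Vpid, 0); }
VOID InvvpidAllContexts(VOID)          { Invvpid(INVVPID_ALL_CONTEXTS, 0, 0); }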

The types of INVVPID are described below:

  • Individual-address invalidation: If the INVVPID type is 0, the logical processor invalidates mappings for the linear address, and VPID specified in the INVVPID descriptor. In some cases, it may invalidate mappings for other linear addresses (or other VPIDs) as well.
  • Single-context invalidation: If the INVVPID type is 1, the logical processor invalidates all mappings tagged with the VPID specified in the INVVPID descriptor. In some cases, it may invalidate mappings for other VPIDs as well.
  • All-contexts invalidation: If the INVVPID type is 2, the logical processor invalidates all mappings tagged with all VPIDs except VPID 0000H. In some cases, it may invalidate translations with VPID 0000H as well.
  • Single-context invalidation, retaining global translations: If the INVVPID type is 3, the logical processor invalidates all mappings tagged with the VPID specified in the INVVPID descriptor except global translations. In some cases, it may invalidate global translations (and mappings with other VPIDs) as well. See the “Caching Translation Information” section in Chapter 4 of the IA-32 Intel Architecture Software Developer’s Manual, Volumes 3A for information about global translations.

You might be wondering how VPIDs can be used in the hypervisor. We can use them instead of INVEPT, but generally, they don't have any particular usage for us; I describe this more in the Discussion section. That said, VPIDs will be used in implementing special features, as they're more flexible than INVEPT, especially when we have multiple VMCSs (EPTPs). (Can you think of some of them?)

Important Notes For Using VPIDs

There are some important things that you should know when using VPIDs.

Enabling VPIDs have a side-effect of not flushing TLB on VMEntry/VMExit. You should manually flush guest TLB entries if required (By using INVEPT/INVVPID). These issues might be hidden when VPID is disabled.

When VPID is disabled, VMEntry flushes the entire TLB. Thus, the hypervisor doesn’t need to explicitly invalidate TLB entries populated by the guest when performing an operation that should invalidate them (e.g., Modifying an EPT entry). When VPID is enabled, INVEPT/INVVPID should be used.

An easy way to find out whether the issue you have is indeed one of these is to execute a global-context INVEPT before every VM-entry to flush the entire TLB while still keeping VPID enabled. If it now works, you should check where you are missing an INVEPT execution.

In my experience, if you just enable VPIDs without doing anything else, processes start to crash one by one, and eventually the kernel crashes, and this is because we didn't invalidate the TLB.

In order to solve the problem of crashing every process, we have to invalidate the TLB in the case of a Mov to CR3; thus, whenever a vm-exit occurs with reason == EXIT_REASON_CR_ACCESS (28), if it's a Mov to CR3, we have to invalidate the TLB (INVEPT or INVVPID [look at Update 1 for more details]).

So we edit the code like this:
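A sketch of the relevant piece of the CR-access handler, reflecting the Update 1 fix (INVVPID instead of INVEPT) and the bit-63 note discussed below; the helper names are illustrative:

// Handle "mov cr3, <reg>" after an EXIT_REASON_CR_ACCESS vm-exit.
// NewCr3Value is the value taken from the guest register operand.
VOID
HandleMovToCr3(UINT64 NewCr3Value)
{
    // Mask out bit 63 (the PCID "no-flush" bit); the processor does not
    // store it in CR3, and writing it to GUEST_CR3 crashes modern Win10.
    __vmx_vmwrite(0x6802 /* GUEST_CR3 */, NewCr3Value & ~(1ULL << 63));

    // Emulate the TLB flush that the CR3 write performs; single-context
    // is enough here because we use a single VPID (1) per VMCS.
    InvvpidSingleContext(1);
}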

Also, note that as we have a single EPTP for all cores then it’s enough to invalidate single-context otherwise we have to invalidate all-contexts.

Update 1 : As Satoshi Tanda mentioned,

The CR3 handler should use INVVPID instead of INVEPT, because INVEPT invalidates more than needed. We want to invalidate caches of GVA -> HPA (combined mappings), and both instructions do this. This is why INVEPT works too, but INVEPT also invalidates caches of GPA -> HPA (guest-physical mappings), which are not impacted by the guest CR3 change and can be kept without invalidation.

The general guideline is, INVVPID when TLB flush emulation is needed, and INVEPT when EPT entries are changed. You can find more info on those instructions and cache types in :

  • 28.3.1 Information That May Be Cached
  • 28.3.3.3 Guidelines for Use of the INVVPID Instruction.

so instead of InveptSingleContext we used InvvpidSingleContext.

Honestly, we had some misunderstandings about handling CR3 vm-exits; even though the above code works fine, it generally has some performance penalties. I'll explain these performance problems in the "Fixing Previous Design Issues" section.

You might also ask why we avoid writing the 63rd bit of the CR3.

Bit 63 of CR3 is a new bit that is part of the PCID feature. It allows OS to change CR3 value without invalidating all TLB entries (tagged with the same EP4TA and VPID) besides those marked with global-bit.

EP4TA is the value of bits 51:12 of EPTP.

E.g. Windows KVA Shadowing and Linux KPTI signal this bit on CR3 mov that changes PCID between userspace PCID and kernel space PCID on user and kernel transitions.

We should not write bit 63 of CR3 in the mov reg, cr3 emulation, because the processor does not write this bit, and an attempt to write it will cause a crash on modern Win10.

INVVPID vs. INVPCID

INVPCID is not really relevant to hypervisor but in the case, if you want to know, INVPCID invalidates mappings in the translation lookaside buffers (TLBs) and paging-structure caches based on the process-context identifier (PCID).

So it's like INVVPID, with the difference that it's not specific to the hypervisor. It also has its own particular contexts (currently 3); you can read more here, but generally keep in mind that, to reduce that overhead, a feature called Process-Context ID (PCID) was introduced with Intel's Westmere architecture, and a related instruction, INVPCID (invalidate PCID), with Haswell. With PCID enabled, the way the TLB is used and flushed changes. First, the TLB tags each entry with the PCID of the process that owns the entry. This allows two different mappings from the same virtual address to be stored in the TLB as long as they have different PCIDs. Second, with PCID enabled, switching from one set of page tables to another doesn't flush the TLB anymore. Since each process can only use TLB entries that have the right PCID, there's no need to flush the TLB each time.

This behavior is used in Meltdown mitigation to avoid wiping out the entire TLB for the processors that support PCID.

Designing A VMX Root-mode Compatible Message Tracing

Without any doubt, one of the hardest parts of designing a hypervisor is sending a message from vmx root-mode to vmx non-root mode. This is because you have lots of limitations, like not being able to safely access paged buffers, and of course, most of the NT functions are not (ANY IRQL) compatible, as they might access buffers that reside in paged pool.

And it doesn't end there; there are plenty of other limitations to deal with.

This section is inspired by Chapter 6: Kernel Mechanisms (High IRQL Synchronization) from the Windows Kernel Programming book by Pavel Yosifovich which is a really amazing book if you want to start with kernel programming.

Concepts

This section describes some of the Operating System concepts, you should know before starting.

What’s a spinlock?

A spinlock is a bit in memory that provides atomic test-and-modify operations. When a CPU tries to acquire a spinlock and it's not currently free, the CPU keeps spinning on the spinlock, busy-waiting for it to be released by another CPU, meaning that it keeps checking until the thread that acquired it first releases it.

Test-and-Set

You probably read about Test and Set in university. Still, in case you didn’t, in computer science, the test-and-set instruction is an instruction used to write 1 (set) to a memory location and return its old value as a single atomic (i.e., non-interruptible) operation. If multiple processes may access the same memory location, and if a process is currently performing a test-and-set, no other process may begin another test-and-set until the first process’s test-and-set is finished.

What do we mean by “Safe”?

The word "safe" is used a lot in hypervisors. By "safe," we mean something that works all the time and won't cause a system crash or system halt. It's because it's so tricky to manage code in vmx root-mode; after all, interrupts are masked (disabled) there, and transferring a buffer from vmx root-mode to vmx non-root mode needs extra effort, so we should be cautious and avoid executing certain APIs in order to be safe.

What is DPC?

Deferred Procedure Call (DPC) is a Windows mechanism that allows high-priority tasks (e.g., an interrupt handler) to defer required but lower-priority tasks for later execution. This permits device drivers and other low-level event consumers to perform the high-priority part of their processing quickly and schedule non-critical additional processing for execution at a lower priority.

DPCs are implemented by DPC objects which are created and initialized by the kernel when a device driver or some other kernel-mode program issues requests for DPC. The DPC request is then added to the end of a DPC queue. Each processor has a separate DPC queue. DPCs have three priority levels: low, medium, and high. By default, all DPCs are set to medium priority. When Windows drops to an IRQL of Dispatch/DPC level, it checks the DPC queue for any pending DPCs and executes them until the queue is empty or some other interrupt with a higher IRQL occurs.

This is the description of DPCs from MSDN:

Because ISRs must execute as quickly as possible, drivers must usually postpone the completion of servicing an interrupt until after the ISR returns. Therefore, the system provides support for deferred procedure calls (DPCs), which can be queued from ISRs and which are executed at a later time and at a lower IRQL than the ISR.

There are two posts about DPCs here and here, you can read them for more information.

Challenges

For example, vmx root-mode is not a HIGH_IRQL interrupt (we discuss this in the Discussion section), but as it disables all interrupts, we can think of it as a HIGH_IRQL state. The problem is that most of the synchronization functions are designed to work at IRQLs below DISPATCH_LEVEL.

Why is it problematic? Imagine you have a one-core processor, and your function requires a spinlock (let say it’s merely a buffer that needs to be accessed). The function raises the IRQL to DISPATCH_LEVEL. Now the Windows Scheduler can’t interrupt the function until it releases the spinlock and lowers the IRQL to PASSIVE_LEVEL or APC_LEVEL. During the execution of the function, a vm-exit occurs; thus, we’re in vmx root-mode now. It’s because, as I told you, vm-exit happens as if it’s a HIGH_IRQL interrupt.

Now, what if we want to access that buffer in vmx root mode? Two scenarios might occur.

  • We wait on a spinlock that was previously acquired by a thread in vmx non-root mode, and thus we have to wait forever: a deadlock occurs.
  • We enter the function without looking at the lock (while another thread has entered the function at the same time), so it results in a corrupted buffer and invalid data.

The other limitation comes from the Windows design: putting a thread into a waiting state cannot be done at IRQL DISPATCH_LEVEL or higher. That's because, in Windows, when you acquire a spinlock, it raises the IRQL to 2 (DISPATCH_LEVEL) if it's not already there, acquires the spinlock, performs the work, and finally releases the spinlock and lowers the IRQL back.

If you look at functions like KeAcquireSpinLock and KeReleaseSpinLock, they take an IRQL in their arguments. First, KeAcquireSpinLock saves the current IRQL to the parameter supplied by the user, then raises the IRQL to DISPATCH_LEVEL and sets a bit. When the function has finished its work with the shared data, it calls KeReleaseSpinLock and passes that old IRQL parameter, so this function unsets the bit and restores the old IRQL (lowers the IRQL).

Windows has 4 kinds of Spinlocks,

  1. KeAcquireSpinLock – KeReleaseSpinLock : This pair can be called at IRQL <= DISPATCH_LEVEL.
  2. KeAcquireSpinLockAtDpcLevel – KeReleaseSpinLockFromDpcLevel : This pair can be called at IRQL = DISPATCH_LEVEL only; it's more optimized if you are already at IRQL 2, as it doesn't save the old IRQL, and it's specially designed to work in DPC routines.
  3. KeAcquireInterruptSpinLock – KeReleaseInterruptSpinLock : This pair is used for hardware, e.g., in an Interrupt Service Routine (ISR), or by drivers with an interrupt source.
  4. ExInterlockedXxx : These functions raise the IRQL to HIGH_LEVEL and perform their task; they don't need a release function, as nothing can interrupt us at HIGH_LEVEL.

But unfortunately, things are more complicated when it comes to vmx root-mode. We don't have IRQLs in vmx root-mode; it's an operating-system concept, so we can't use any of the above functions, and things get even worse if we want to use our message tracing mechanism between multiple cores!

For these reasons, we have to design our custom spinlock.

Designing A Spinlock

Designing a spinlock for a multi-core system by its nature needs hardware support for atomic operations, meaning that the hardware (most of the time, the processor) should guarantee that an operation is performed by only one logical (hyper-threaded) core at a time and is non-interruptible.

There is an article here that describes different kinds of spinlock with different optimizations, also it’s implemented here.

The design of this mechanism in the processor is beyond the scope of this article. We simply use an intrinsic function provided by Windows called “_interlockedbittestandset“.

This makes our implementation super simple. We just need to use the following function, and it’s the responsibility of the processor to take care of everything.

Update 2: We should use volatile keyword in parameters too, otherwise it’s like un-volatiling.

Now we need to spin! If the above function was not successful, then we have to keep CPU checking to see when another processor releases the lock.

Update 2: We should use volatile keyword in parameters too, otherwise it’s like un-volatiling.

If you wonder what is the _mm_pause() then it’s equal to PAUSE instruction in x86.

Pause instruction is commonly used in the loop of testing spinlock, when some other thread owns the spinlock, to mitigate the tight loop.

PAUSE notifies the CPU that this is a spinlock wait loop, so memory and cache accesses may be optimized. See also pause instruction in x86 for some more details about avoiding the memory-order mis-speculation when leaving the spin-loop. PAUSE may stop CPU for some time to save power. Older CPUs decode it as REP NOP, so you don’t have to check if it’s supported. Older CPUs will simply do nothing (NOP) as fast as possible.

For releasing the lock, there is nothing special to do, so simply unset it without caring for any other processor as there is no other processor that wants to unset it.

Update 2: We should use volatile keyword in parameters too, otherwise it’s like un-volatiling.

The last step is to use a volatile variable as the lock.

The “volatile” keyword tells the compiler that the value of the variable may change at any time without any action being taken by the code the compiler finds nearby. The implications of this are quite serious. There are lots of examples here if you have a problem with understanding “volatile“.
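Putting the pieces of this section together, a sketch of the three routines (the volatile qualifier on the parameter reflects the Update 2 notes above):

#include <ntddk.h>
#include <intrin.h>

// Try to atomically set bit 0 of the lock; returns TRUE if we acquired it
// (i.e., the bit was previously 0).
BOOLEAN
SpinlockTryLock(volatile LONG * Lock)
{
    return (!(*Lock)) && (!_interlockedbittestandset(Lock, 0));
}

// Spin (with PAUSE) until the lock is acquired.
void
SpinlockLock(volatile LONG * Lock)
{
    while (!SpinlockTryLock(Lock))
    {
        _mm_pause();   // hint that this is a spin-wait loop
    }
}

// Release the lock; a plain store is enough, as only the owner unsets it.
void
SpinlockUnlock(volatile LONG * Lock)
{
    *Lock = 0;
}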

Message Tracer Design

To solve the deadlock challenge described above, I created two message pools for saving messages. The first pool is designed to be used as storage for vmx non-root messages (buffers), and the second pool is used for vmx root messages.

We have the following structure that describes the state of each of these two pools.

Generally, we'll save the buffers as illustrated below; each chunk of a message comes with a BUFFER_HEADER that describes that chunk.

Other information about the buffer, like the current index to write and the current index to send, is saved in the above structure.

The BUFFER_HEADER is defined like this,
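A sketch of that header (the field names are illustrative):

// Header placed at the start of each saved chunk in the pool.
typedef struct _BUFFER_HEADER
{
    UINT32  OperationNumber; // the intention (and structure) of the buffer
    UINT32  BufferLength;    // used length of this chunk
    BOOLEAN Valid;           // TRUE until the chunk is read (sent)
} BUFFER_HEADER, *PBUFFER_HEADER;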

We save the used length of the chunk and a bit which determines whether we have sent it before or not.

The Operation Number is a number which will be sent to user-mode to show the type of the buffer that came from the kernel. In other words, it's a number that indicates the intention (and structure) of the buffer, so the user-mode application will know what to do with it.

The following Operation Numbers are currently defined :
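As an illustration only (the names and values here are placeholders; see the project source for the real definitions):

#define OPERATION_LOG_INFO_MESSAGE          1
#define OPERATION_LOG_WARNING_MESSAGE       2
#define OPERATION_LOG_ERROR_MESSAGE         3
#define OPERATION_LOG_NON_IMMEDIATE_MESSAGE 4   // accumulated (non-immediate) messages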

Each of them indicates a different type of message, and the last one shows that a bunch of buffers has been accumulated in this buffer. This message tracing is designed to send any kind of buffer from both vmx root and the OS to user-mode, so it's not limited just to sending messages; we can send buffers with custom structures and different Operation Numbers.

The last thing about our message tracing is that it can be configured with the following constants; you can change them in order to get better performance for your particular use.

You can configure things like the maximum number of chunks in a buffer and also the size of each chunk. Setting the above variables matters because, if there is no thread to consume (read) these chunks and the pools are full, the previous unread buffers get replaced. Hence, if you can't consume the pools frequently, it's better to specify a higher number for MaximumPacketsCapacity so that you won't lose anything.

Initialization Phase

In the initialization phase, we allocate space for the above structure (2 times, one for vmx non-root and one for vmx-root) and then allocate the buffers to be used as the storage for saving our messages.

We have to zero them all and also call KeInitializeSpinLock to initialize the spinlock. We use this spinlock only for vmx non-root, and this function makes sure that the value of the lock is unset. We do the same for our custom spinlock (VmxRootLoggingLock): just unset it.

You might ask what "BufferLockForNonImmMessage" is; it's another lock that we will use as an optimization (see later).

All in all, we have the following code.

Sending Phase (Saving Buffer and adding them to pools)

In a regular Windows routine, generally, we shouldn't be at an IRQL higher than DISPATCH_LEVEL. There is no case where our log manager needs to be used at higher IRQLs, so we don't care about them; thus, we have two different approaches here. First, we acquire the lock (spinlock) using KeAcquireSpinLock in vmx non-root, as it's the Windows-optimized way to acquire a lock, and for vmx root-mode, we acquire the lock using our previously designed spinlock.

As I told you above, we want to fix the problem that a vm-exit might occur while we have acquired a lock, so it's not possible to use the same spinlock, as a deadlock might happen.

Now we have to see whether we are operating from vmx non-root or vmx root; based on this condition, we select our lock and the index of the buffer that we want to put our message into.

I'm not going to explain each step, as it's easy; it's just managing buffers and copying data from one buffer to another, and the code is well commented, so you can read it. Instead, I'll explain the tricky parts of our message tracing.

After creating a header for our new message buffer, we will copy the bytes and change the information about buffer’s indexes. The last step here is to see whether any thread is waiting to receive our message or not.

If there is no thread waiting for our message, then there is nothing more to do here, but if there is a thread in the IRP Pending state (I'll explain it later), then we use KeInsertQueueDpc so that our DPC is added to the DPC queue, which will subsequently be executed by Windows at IRQL == DISPATCH_LEVEL.

This means that our callback function will be executed by Windows later, and of course, Windows executes our function in vmx non-root mode, so it's safe. I'll describe this callback and how we create a DPC later.

Finally, we have to release the locks so that other threads can enter.

Reading Phase (Read buffers and send them to user-mode)

It’s time to read the previously filled buffer! The fact that we add a DPC in the previous function “LogSendBuffer” shows that the “LogReadBuffer” is executed in vmx non-root mode so we can freely use most of the APIs (not all of them).

Theoretically, we have a problem here: if we want to read a buffer from the vmx root-mode pool, then it might cause a deadlock, as we acquire a vmx root-mode lock while a vm-exit might occur; hence, we would spin on this lock in vmx root-mode forever. But practically, there is no deadlock here. Can you guess why?

It's because our LogReadBuffer executes at DISPATCH_LEVEL, so the Windows scheduler won't interrupt us, and our function is executed without any interruption; besides, we're not doing anything fancy here. I mean, we're not performing anything (like CPUID) that causes a vm-exit in our code, so practically there is nothing to cause a deadlock here, but we should keep in mind that we're not allowed to run code that causes a vm-exit.

We compute the header address based on previous information and also set the valid bit to zero so that it shows that this buffer is previously used.

Then we copy the buffer into the buffer specified in the arguments and also put the Operation Number at the top of the target buffer so that future functions will know the intention of this buffer. We can also use DbgPrint to show the messages in the kernel debugger. Using DbgPrint at DISPATCH_LEVEL (vmx non-root mode) is safe. We might need to use DbgPrint multiple times, as this function has a maximum of 512 bytes by default; even though you can change this limit, we assume the default size is selected.

Finally, we have to reset some of the information regarding buffer, clear the buffer messages (it’s not necessary to zero the buffer, but for making debug process easier, I prefer to zero the buffer), and release the locks.

Checking for new messages

Checking for the new message is simple; we just need to check the current message index based on previous information and see if its header is valid or not. If it’s valid then it shows that we have a new message, but if it’s not valid, then some function reads the message previously, and there is no new message.

For checking the new message, we even don’t need to acquire a lock because basically we don’t write anything and in our case reading doesn’t need a lock.

Sending messages to pools

Previously, we saw how to save (send) buffers and how to read them. Each message is a buffer of strings, so finally, we have to use "LogSendBuffer" to send our buffer, but we need to take extra care to send a well-formed message.

va_start and va_end are used to support multiple arguments to one function, e.g like DbgPrint or printf.

You can use a combination of KeQuerySystemTime, ExSystemTimeToLocalTime, and RtlTimeToTimeFields to get the current system time (see the example) then putting them together with sprintf_s.
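For example, a sketch of building such a time prefix with those documented APIs:

LARGE_INTEGER SystemTime, LocalTime;
TIME_FIELDS   TimeFields;
CHAR          TimeBuffer[32] = {0};

KeQuerySystemTime(&SystemTime);
ExSystemTimeToLocalTime(&SystemTime, &LocalTime);
RtlTimeToTimeFields(&LocalTime, &TimeFields);

// e.g., "13:45:07.123"
sprintf_s(TimeBuffer, sizeof(TimeBuffer), "%02hd:%02hd:%02hd.%03hd",
          TimeFields.Hour, TimeFields.Minute, TimeFields.Second,
          TimeFields.Milliseconds);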

There is a particular reason why we use the sprintf-like function instead of RtlString* functions; the reason is described in the Discussion section. The next step is computing length using strnlen_s.

Finally, we have a vital optimization here; logically we create two kinds of messages, one called “Immediate Message” which we will directly send it into the pool and another type is “Non-Immediate Message” which we gather the messages in another buffer and append new messages in that buffer until its capacity is full (we shouldn’t pass the PacketChunkSize limit).

Using this way, we don’t send each message to the user-mode separately but instead, we send multiple messages in one buffer to the user-mode. We will gain visible performance improvement. For example with a configuration with PacketChunkSize == 1000 bytes we send 6 messages on a buffer (it’s average basically it depends on each message size) because you probably know that CPU has to do a lot to change its state from kernel-mode to user-mode and also creating new IRP Packet is a heavy task.

You can also change the configuration, e.g., increase the PacketChunkSize so that more messages will hold on the temporary buffer, but generally, it delays the time you see the message.

Also, we work on a buffer so we need another spinlock here.

Putting it all together we have the following code :
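(The sketch below is a trimmed-down approximation rather than the exact code from the repository; LogSendBuffer, the non-immediate buffer globals, and PacketChunkSize are the pieces described above, with their exact declarations assumed.)

    // Assumed accumulation state for non-immediate messages (illustrative).
    extern CHAR          NonImmediateBuffer[];
    extern UINT32        NonImmediateBufferUsed;
    extern volatile LONG NonImmediateBufferLock;

    // Sketch: format a message (time stamp + user text) and either send it
    // immediately or append it to the temporary "non-immediate" buffer.
    BOOLEAN LogSendMessageToQueue(UINT32 OperationCode, BOOLEAN IsImmediate, const char *Fmt, ...)
    {
        char          Body[PacketChunkSize];
        char          Message[PacketChunkSize + 32];
        LARGE_INTEGER SystemTime, LocalTime;
        TIME_FIELDS   TimeFields;
        va_list       Args;
        size_t        Length;

        // Format the variable arguments (this is what va_start/va_end are for).
        va_start(Args, Fmt);
        vsprintf_s(Body, sizeof(Body), Fmt, Args);
        va_end(Args);

        // Prefix the message with the current local time.
        KeQuerySystemTime(&SystemTime);
        ExSystemTimeToLocalTime(&SystemTime, &LocalTime);
        RtlTimeToTimeFields(&LocalTime, &TimeFields);
        sprintf_s(Message, sizeof(Message), "(%02hd:%02hd:%02hd.%03hd) %s",
                  TimeFields.Hour, TimeFields.Minute, TimeFields.Second,
                  TimeFields.Milliseconds, Body);

        Length = strnlen_s(Message, sizeof(Message));

        if (IsImmediate)
        {
            // Immediate message: goes straight into the pool.
            return LogSendBuffer(OperationCode, Message, (UINT32)Length);
        }

        // Non-immediate message: append under its own spinlock and flush the
        // accumulated buffer once PacketChunkSize would be exceeded.
        SpinlockLock(&NonImmediateBufferLock);

        if (NonImmediateBufferUsed + Length >= PacketChunkSize)
        {
            LogSendBuffer(OperationCode, NonImmediateBuffer, NonImmediateBufferUsed);
            NonImmediateBufferUsed = 0;
        }
        RtlCopyMemory(NonImmediateBuffer + NonImmediateBufferUsed, Message, Length);
        NonImmediateBufferUsed += (UINT32)Length;

        SpinlockUnlock(&NonImmediateBufferLock);
        return TRUE;
    }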

Receiving buffers and messages in user-mode

Receiving buffers in user-mode is done by using an IOCTL. First, we create another thread in our user-mode application. This thread is responsible for bringing the kernel-mode buffers to user-mode and then acting based on the Operation Number.

This thread executes the following function. We use IRP Pending for transferring data from kernel-mode to user-mode. IRP Pending is primarily used for transferring packets: you send an IRP packet to the kernel, and the kernel marks this packet as pending. Whenever a buffer is ready to be sent to user-mode, the kernel completes the IRP request, the IOCTL call returns to user-mode, and execution continues.

It's somewhat like waiting on an object. We could also use events in Windows and trigger the event whenever a buffer is available, but IRP Pending is better, as it is designed for the purpose of sending messages to user-mode.

What we have to do is allocate a buffer for the kernel-mode data and use DeviceIoControl to request a packet. When the packet from the kernel is received, we process it and switch on the Operation Number.
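A minimal user-mode sketch of such a thread might look like the following; the IOCTL code, the buffer size, and the way the device handle is obtained are placeholders, not the project's actual definitions:

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    // Placeholders: the real IOCTL code and sizes come from the shared header.
    #define IOCTL_REGISTER_EVENT  CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
    #define USERMODE_BUFFER_SIZE  (sizeof(UINT32) + 1000)   // Operation Number + PacketChunkSize

    DWORD WINAPI ThreadReadKernelBuffers(LPVOID Param)
    {
        HANDLE DeviceHandle    = (HANDLE)Param;  // handle opened with CreateFile on the driver's device
        CHAR   Buffer[USERMODE_BUFFER_SIZE];
        DWORD  ReturnedLength  = 0;
        UINT32 OperationNumber = 0;

        while (TRUE)
        {
            // Blocks until the kernel completes the pending IRP with messages.
            if (!DeviceIoControl(DeviceHandle, IOCTL_REGISTER_EVENT, NULL, 0,
                                 Buffer, sizeof(Buffer), &ReturnedLength, NULL))
            {
                break;
            }

            // The Operation Number sits at the top of the returned buffer.
            OperationNumber = *(UINT32 *)Buffer;

            switch (OperationNumber)
            {
            default:
                // The message text starts right after the Operation Number.
                printf("%s", Buffer + sizeof(UINT32));
                break;
            }
        }
        return 0;
    }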

IOCTL and managing user-mode requests

When the IOCTL arrives on the kernel side, DrvDispatchIoControl from the major functions is called. This function retrieves a pointer to the caller's I/O stack location in the specified IRP.

From the IRP stack location, we can read the IOCTL code and the buffer addresses; then we perform the necessary checks and pass the arguments to LogRegisterIrpBasedNotification.

To register an IRP notification, first, we check whether any other thread is already pending by checking GlobalNotifyRecord. If there is such a thread, we complete the IRP and return to user-mode, because in our design we ignore multiple threads requesting buffers, meaning that only one thread can read the kernel-mode buffers.

Second, we initialize a custom structure that describes the state. The following structure is responsible for saving the Type, the DPC object, and the target buffer.
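(Its layout is approximately the following; the field names are my paraphrase of the structure in the project's logging header, not the exact definition.)

    // Sketch of the notify-record state: how the request was registered, the
    // DPC object used to complete it later, and where the pending IRP lives.
    typedef enum _NOTIFY_TYPE {
        IRP_BASED,
        EVENT_BASED
    } NOTIFY_TYPE;

    typedef struct _NOTIFY_RECORD {
        NOTIFY_TYPE Type;         // how user-mode asked to be notified
        KDPC        Dpc;          // queued when a new message arrives
        PIRP        PendingIrp;   // the IRP we marked as pending
    } NOTIFY_RECORD, *PNOTIFY_RECORD;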

In order to fill this structure, we initialize a DPC object by calling KeInitializeDpc; this function takes the callback function that should be called later (LogNotifyUsermodeCallback) and the parameter that will be passed to it (NotifyRecord).

We first check the vmx non-root pools to see if anything new is available; otherwise, we check the vmx root-mode buffer. This precedence is because the vmx non-root buffers are more important. After all, we spend most of the time in vmx root-mode, so we might see thousands of messages from vmx root-mode while we have far fewer messages from vmx non-root. If we checked the vmx root-mode message buffer first, we might lose some messages from vmx non-root or never find time to process them.

If any new message is available, then we directly add a DPC to the queue (KeInsertQueueDpc).

If there isn't any new message available, then we simply save our notify record for future use, mark the IRP as pending using IoMarkIrpPending, and return STATUS_PENDING.
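A condensed sketch of this registration logic, under the assumptions above (GlobalNotifyRecord, LogCheckForNewMessage, and the NOTIFY_RECORD fields are illustrative stand-ins for the real declarations):

    // Sketch of registering an IRP-based notification (error handling trimmed).
    NTSTATUS LogRegisterIrpBasedNotification(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PNOTIFY_RECORD NotifyRecord;

        UNREFERENCED_PARAMETER(DeviceObject);

        // Only one reader thread is supported in this design.
        if (GlobalNotifyRecord != NULL)
        {
            Irp->IoStatus.Status      = STATUS_SUCCESS;
            Irp->IoStatus.Information = 0;
            IoCompleteRequest(Irp, IO_NO_INCREMENT);
            return STATUS_SUCCESS;
        }

        NotifyRecord = ExAllocatePoolWithTag(NonPagedPool, sizeof(NOTIFY_RECORD), 'gvoL');
        if (NotifyRecord == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        NotifyRecord->Type       = IRP_BASED;
        NotifyRecord->PendingIrp = Irp;

        // The DPC will later call LogNotifyUsermodeCallback with NotifyRecord as context.
        KeInitializeDpc(&NotifyRecord->Dpc, LogNotifyUsermodeCallback, NotifyRecord);

        // The IRP stays pending until the DPC completes it.
        IoMarkIrpPending(Irp);

        // vmx non-root messages are checked first, then the vmx root pool.
        if (LogCheckForNewMessage(FALSE) || LogCheckForNewMessage(TRUE))
        {
            // Something is already waiting, process it right away.
            KeInsertQueueDpc(&NotifyRecord->Dpc, NotifyRecord, NULL);
        }
        else
        {
            // Nothing yet: remember the record for LogSendBuffer to use later.
            GlobalNotifyRecord = NotifyRecord;
        }

        return STATUS_PENDING;
    }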

Usermode notify callback

As you can see in the above code, we add DPCs to the queue in two functions (LogRegisterIrpBasedNotification and LogSendBuffer). This way, we won't miss anything, and everything is processed as soon as a message is generated. For example, if there is a thread waiting for a message, then LogSendBuffer notifies it about the new message; if there isn't any thread waiting, LogSendBuffer can't do anything, and as soon as a new thread comes to the kernel, it checks for new messages. Think about it one more time. It's beautiful.

Now it's time to read the packets from the kernel pools and send them to user-mode.

When LogNotifyUsermodeCallback is called, we are sure that we're at DISPATCH_LEVEL and in vmx non-root mode.

In this function, we check whether the parameters sent to the kernel are valid, because they are provided by user-mode. For example, we check the IRP stack's Parameters.DeviceIoControl.InputBufferLength and Parameters.DeviceIoControl.OutputBufferLength to make sure they are not null, and we check whether the SystemBuffer is null or not.

Then we call LogReadBuffer with the user-mode buffer, so this function fills the user-mode buffer and adds the Operation Number in a suitable place. Also, Irp->IoStatus.Information provides the buffer length to user-mode.

The last step here is to complete the IRP so that the I/O Manager sends the results to user-mode and the thread can continue its normal life.

The reasons why we access the user-mode buffer in the context of an arbitrary process (because DPCs might run in a random user-mode process context) and why we use DPCs rather than something like APCs are discussed in the Discussion section.

The following code demonstrates what we talked about above.
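(The version below is a simplified approximation of LogNotifyUsermodeCallback; the validation and field names are trimmed compared to the real code, and LogReadBuffer/NOTIFY_RECORD are the sketched pieces from earlier.)

    // Sketch of the DPC routine (runs at DISPATCH_LEVEL, vmx non-root mode).
    VOID LogNotifyUsermodeCallback(PKDPC Dpc, PVOID DeferredContext, PVOID SystemArgument1, PVOID SystemArgument2)
    {
        PNOTIFY_RECORD     NotifyRecord   = (PNOTIFY_RECORD)DeferredContext;
        PIRP               Irp            = NotifyRecord->PendingIrp;
        PIO_STACK_LOCATION IrpStack       = IoGetCurrentIrpStackLocation(Irp);
        UINT32             ReturnedLength = 0;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(SystemArgument1);
        UNREFERENCED_PARAMETER(SystemArgument2);

        // The lengths and the SystemBuffer come from user-mode, so validate them.
        if (IrpStack->Parameters.DeviceIoControl.OutputBufferLength == 0 ||
            Irp->AssociatedIrp.SystemBuffer == NULL)
        {
            Irp->IoStatus.Status      = STATUS_INVALID_PARAMETER;
            Irp->IoStatus.Information = 0;
            IoCompleteRequest(Irp, IO_NO_INCREMENT);
            ExFreePoolWithTag(NotifyRecord, 'gvoL');
            return;
        }

        // vmx non-root pool first, then the vmx root pool.
        if (!LogReadBuffer(FALSE, Irp->AssociatedIrp.SystemBuffer, &ReturnedLength))
            LogReadBuffer(TRUE, Irp->AssociatedIrp.SystemBuffer, &ReturnedLength);

        // Report the used length and complete the IRP so the I/O Manager
        // returns the buffer to user-mode.
        Irp->IoStatus.Status      = STATUS_SUCCESS;
        Irp->IoStatus.Information = ReturnedLength;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);

        ExFreePoolWithTag(NotifyRecord, 'gvoL');
    }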

Uninitialization Phase

Nothing special; we just de-allocate the previously allocated buffers. Keep in mind that we should initialize the message tracer in the very first function of our driver so we can use it, and, of course, uninitialize it at the end when we don't have any messages anymore.

WPP Tracing

WPP Tracing is another mechanism provided by Windows, which can be used to trace messages from both vmx non-root and vmx root-mode and at any IRQL. It is primarily intended for debugging code during development, and it's capable of publishing events that can be consumed by applications as structured ETW events.

Logging messages with WPP software tracing is similar to using Windows event logging services. The driver logs a message ID and unformatted binary data in a log file. Subsequently, a postprocessor converts the information in the log file to a human-readable form.

In order to use WPP Tracing, first, we should configure our driver to use WPP Tracing as its message tracing mechanism by setting UseWPPTracing to TRUE. By default, it's FALSE.

Then we go to our project's properties, set Run Wpp Tracing to Yes, and also set a custom function for sending messages by setting Function To Generate Trace Messages to HypervisorTraceLevelMessage (LEVEL,FLAGS,MSG,…).

WPP Tracing Configuration

Then we need to generate a unique GUID for our driver using Visual Studio's Tools -> Create GUID, and put it into the following format.
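(The block looks roughly like the following; the GUID digits are placeholders to be replaced with the one you generated, and the flag names after WPP_DEFINE_BIT are just examples.)

    // Control GUID definition for WPP (typically in a header included by the tracing sources).
    #define WPP_CONTROL_GUIDS                                               \
        WPP_DEFINE_CONTROL_GUID(                                            \
            HypervisorFromScratchTraceGuid,                                 \
            (11111111, 2222, 3333, 4444, 555555555555),   /* your GUID */   \
            WPP_DEFINE_BIT(HVFS_LOG)          /* bit 0 = 0x00000001 */      \
            WPP_DEFINE_BIT(HVFS_LOG_WARNING)  /* bit 1 = 0x00000002 */      \
            WPP_DEFINE_BIT(HVFS_LOG_ERROR)    /* bit 2 = 0x00000004 */      \
            )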

WPP_DEFINE_BIT defines specific flags (bits) for our messages that can be used later for masking specific kinds of events.

After all the above code, we initialize WPP Tracing by adding the following code to the very beginning of the driver, e.g., in DriverEntry.
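(A minimal sketch; DriverObject and RegistryPath are the two parameters of DriverEntry.)

    NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
    {
        // Initialize WPP Tracing before anything attempts to log.
        WPP_INIT_TRACING(DriverObject, RegistryPath);

        /* ... the rest of the driver initialization ... */

        return STATUS_SUCCESS;
    }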

At last, we clean up and turn WPP Tracing off by adding the following code to the driver's unload function.
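(For example, in the unload routine; DrvUnload stands for whatever function your driver registered as DriverUnload.)

    VOID DrvUnload(PDRIVER_OBJECT DriverObject)
    {
        /* ... the rest of the cleanup ... */

        // Stop WPP Tracing; no trace calls are allowed after this point.
        WPP_CLEANUP(DriverObject);
    }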

To make things easy, I added the following code to our previous message tracing routines, which means that instead of sending the buffers into our custom message tracing buffer, we'll send them to the WPP Tracing buffer.

Also, we have to include the .tmh files. These files are auto-generated by the WPP framework and contain the required code for the trace messages. The TMH file name should be the same as the C file; for example, if we are adding trace messages in "Driver.c", then we are supposed to include "Driver.tmh". We used the WPP Tracing APIs in two files, Driver.c and Logging.c, so we have to include Driver.tmh and Logging.tmh; there is no need for these includes in the other project files, as we gathered everything in one place.

WPP Tracing is now complete! In order to see the messages in user-mode, we have to use another application, e.g., traceview.

Personally, I prefer my custom message tracing, as WPP Tracing needs another application to parse the .pdb file (or other files) to show the messages, and I didn't find any good example of parsing the messages in an application without using another app.

You can see the results of WPP Tracing later in Let’s Test it! section.

Supporting Hyper-V

As I told you in the previous parts, testing and building a hypervisor for Hyper-V needs extra consideration and a few more lines of code to support Hyper-V nested virtualization.

At the time of writing this part, Hyper-V and VMware Workstation are incompatible with each other, which means that if you run Hyper-V, you can't run VMware, and a message like this will appear:

VMware Workstation and Hyper-V are not compatible. Remove the Hyper-V role from the system before running VMware Workstation.

The same is true for VMware: if you run VMware, you can't run Hyper-V, and you need to execute a command and then restart your computer to switch to the other VMM.

In order to use Hyper-V, you should run the following command (as administrator) and then restart your computer.
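The usual command for this (it re-enables the Hyper-V hypervisor at boot) is:

    bcdedit /set hypervisorlaunchtype auto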

And if you want to run VMware, you can run the following command (as administrator) and restart your computer.
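Which is typically:

    bcdedit /set hypervisorlaunchtype off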

Enable Nested Virtualization

In part 1, there is a section that describes how to enable VMware's nested virtualization and test your driver. For Hyper-V, we have the exact same scenario: first, turn off the target VM, then enable nested virtualization for the target virtual machine by running the following command in PowerShell:
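The standard cmdlet for this is:

    Set-VMProcessor -VMName PutYourVmNameHere -ExposeVirtualizationExtensions $true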

Note that instead of PutYourVmNameHere, put the name of the virtual machine that you want to enable nested virtualization for.

And if you need to disable it, you can run:
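Which is the same cmdlet with the flag turned off:

    Set-VMProcessor -VMName PutYourVmNameHere -ExposeVirtualizationExtensions $false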

Now you need to attach your Hyper-V machine to a windbg debugger. There are many ways to do it; you can read here and here (I prefer using kdnet.exe).

Now that we have the testing environment, it's time to modify our hypervisor so we can support Hyper-V.

Hyper-V’s visible behavior in nested virtualization

Hyper-V has some behavior that is visible to our hypervisor, which means that we should handle some of the events that relate to us and hand the rest over to Hyper-V, as the top-level hypervisor, to manage. Confused? Let me explain it one more time.

In a nested virtualization environment, you don't directly get the vm-exits and all the other hypervisor events; instead, it's the top-level hypervisor that gets the vm-exit (in our case, Hyper-V is the top-level hypervisor). The top-level hypervisor then calls the vm-exit handler of the lower-level hypervisor (our hypervisor is the low-level hypervisor in this case). Now the lower-level hypervisor handles the vm-exit (for example, it injects an event (interrupt) to be delivered to the guest). When the vm-exit handler finishes, it executes VMRESUME, but this instruction won't go directly to the guest in vmx non-root; instead, it goes to the vm-exit handler of the top-level hypervisor, and it's the top-level hypervisor that actually performs the task (in our example, injecting the event into the guest).

So, even though our hypervisor is not the first hypervisor to receive the events, it is the first to manage them.

On the other hand, the Windows kernel is highly integrated with Hyper-V, which means that it uses lots of hypercalls (VMCALLs) and MSRs to communicate with Hyper-V, and if the Windows kernel doesn't get a valid response from Hyper-V, it crashes or halts.

As the first hypervisor to manage the vm-exits, we have to inspect the vm-exit details to see whether the vm-exit relates to us or to Hyper-V; in other words, whether it's a general vm-exit or whether it happens because Windows wants to talk to Hyper-V.

OK, let's see what we should manage and what we should not.

Hyper-V Hypervisor Top-Level Functional Specification (TLFS)

The Hyper-V Hypervisor Top-Level Functional Specification (TLFS) describes the hypervisor’s externally visible behavior to other operating system components. This specification is meant to be useful for guest operating system developers.

If you want to research Hyper-V, you have to read the documentation about Hyper-V’s TLFS here, but we just want to support Hyper-V. Hence, there is documentation (Requirements for Implementing the Microsoft Hypervisor Interface) that describes the things we should do in order to support Hyper-V. Of course, we’re not going to implement all of them to make our hypervisor work on Hyper-V.

Out of Range MSRs

In part 6, I described MSR Bitmaps. If you remember, the MSR bitmap covers MSR indexes (RCX) between 0x00000000 and 0x00001FFF and between 0xC0000000 and 0xC0001FFF. Windows uses other MSRs, from 0x40000000 to 0x400000F0, for requesting services from, or reporting things to, the hypervisor (vmx root).

You might ask why they don't use VMCALLs. Of course, they could use VMCALL, but most hypervisors do it this way: it's cheaper, it predates VMCALL, and this synthetic range is specifically designed to be used by hypervisors.

The reason why it's cheaper is the same as the discussion of why to use int 2e rather than sysenter: the cost of sending data over VMCALL, allowing it from ring 0 or ring 3 and making decisions (RDMSR doesn't need that ring check), and sending data back is greater than a simple MSR interface, and the MSR interface also works with legacy compilers and systems.

You can find the definitions of these MSRs here.

All in all, I modified our previous MSR handlers (both MSR read – RDMSR – and MSR write – WRMSR) to support the MSRs between 0x40000000 and 0x400000F0. All we have to do is execute RDMSR or WRMSR in vmx root mode.

You might ask, is it OK to run WRMSR or RDMSR with MSRs that are invalid in hardware?

The answer is no! The reason we can execute them anyway is that we're in a nested virtualization environment and it's not real vmx root mode; physically, we're in vmx non-root mode, if that makes sense.

In other words, VMware or Hyper-V or any nested virtualization environment calls our vm-exit handler in vmx non-root mode and pretends that it's vmx root mode, so executing WRMSR or RDMSR causes a real vm-exit to Hyper-V, and that's how the actual vm-exit is handled.

For example, the RDMSR handler looks like this:
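(The snippet below is a simplified approximation rather than the exact code from the repository; GUEST_REGS is the register-save structure used throughout this series, and the MSR union and range constants are illustrative.)

    // Illustrative 64-bit MSR value split into EDX:EAX.
    typedef union _MSR {
        struct { ULONG Low; ULONG High; };
        ULONG64 Content;
    } MSR;

    #define RESERVED_MSR_RANGE_LOW 0x40000000
    #define RESERVED_MSR_RANGE_HI  0x400000F0

    // Sketch of the RDMSR vm-exit handler with the Hyper-V synthetic range allowed.
    VOID HvHandleMsrRead(PGUEST_REGS GuestRegs)
    {
        MSR   Msr       = {0};
        ULONG TargetMsr = (ULONG)GuestRegs->rcx;

        // Hardware bitmap ranges plus the synthetic Hyper-V MSRs; in nested
        // virtualization this RDMSR causes a real vm-exit to Hyper-V, which
        // services the request for us.
        if ((TargetMsr <= 0x00001FFF) ||
            ((0xC0000000 <= TargetMsr) && (TargetMsr <= 0xC0001FFF)) ||
            ((RESERVED_MSR_RANGE_LOW <= TargetMsr) && (TargetMsr <= RESERVED_MSR_RANGE_HI)))
        {
            Msr.Content = __readmsr(TargetMsr);
        }

        GuestRegs->rax = Msr.Low;
        GuestRegs->rdx = Msr.High;
    }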

Same checks apply to WRMSR too.

Hyper-V Hypercalls (VMCALLs)

VMCALL is treated exactly like RDMSR and WRMSR. Even though running VMCALL in vmx root mode has a defined behavior (it invokes an SMM monitor), in our case, in a nested virtualization environment, it causes a vm-exit to Hyper-V so that Hyper-V can manage the hypercall.

Hyper-V has the following convention for its VMCALLs (hypercalls).

Hyper-V hypercall convention

As we want to use our own hypervisor VMCALLs, a quick and dirty fix for this problem is to somehow show the vm-exit handler that a particular VMCALL should be handled by our hypervisor routines; thus, we put some random hex values into R10, R11, and R12 (these registers are not used in the fastcall calling convention; you can choose other registers too), and then we check these registers in the vm-exit handler to make sure that the VMCALL relates to our hypervisor.

As some of the registers should not be changed due to the Windows x64 fastcall calling convention, we save them so we can restore them later.

Generally, The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile (caller-saved) and registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered nonvolatile (callee-saved).

For Hyper-V VMCALLs we need to adjust RCX, RDX, R8 as demonstrated in the above picture.

Finally, in the vm-exit handler, we check the VMCALL to see whether our random values are stored in those registers. If they are, we call our hypervisor's VMCALL handler; otherwise, we let Hyper-V do whatever it wants with its VMCALLs.
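Putting this together, the VMCALL dispatch in the vm-exit handler could be sketched as follows; the marker constants are arbitrary example values, and VmxVmcallHandler/AsmHypervVmcall stand for our own hypercall handler and a small assembly wrapper that re-executes VMCALL for Hyper-V:

    // Sketch: decide whether this VMCALL is ours or belongs to Hyper-V.
    if (GuestRegs->r10 == 0x48564653 &&            // example marker ('HVFS')
        GuestRegs->r11 == 0x564d43414c4c &&        // example marker ('VMCALL')
        GuestRegs->r12 == 0x4e4f485950455256)      // example marker ('NOHYPERV')
    {
        // One of our own hypercalls: RCX = VMCALL number, RDX/R8/R9 = parameters.
        GuestRegs->rax = VmxVmcallHandler(GuestRegs->rcx, GuestRegs->rdx,
                                          GuestRegs->r8, GuestRegs->r9);
    }
    else
    {
        // A Hyper-V hypercall: executing VMCALL again here (nested "vmx root",
        // physically vmx non-root) produces a real vm-exit to Hyper-V.
        GuestRegs->rax = AsmHypervVmcall(GuestRegs->rcx, GuestRegs->rdx, GuestRegs->r8);
    }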

Hyper-V Interface CPUID Leaves

The last step in supporting Hyper-V is managing CPUID leaves; here are some of the CPUID leaves that we have to manage.

Note that, based on the document I mentioned, we have to return a non-"Hv#1" value. This indicates that our hypervisor does NOT conform to the Microsoft hypervisor interface.

By the way, it works without the above modification of the CPUID leaves, but it's better to manage them based on the TLFS.

One other thing that I noticed during development on Hyper-V is that we get vm-exits because the guest executes the HLT (halt) instruction. Of course, we don't want to halt the processor, so in the case of EXIT_REASON_HLT, we simply ignore it.
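A rough sketch of both pieces is shown below; the HYPERV_CPUID_INTERFACE value 0x40000001 comes from the TLFS, while the 'HVFS' signature is just an arbitrary non-"Hv#1" answer, and the surrounding handler structure is assumed:

    #define HYPERV_CPUID_INTERFACE 0x40000001

    // Inside the CPUID vm-exit handler: run the real CPUID, then override the
    // hypervisor interface leaf so we do NOT report conformance to the
    // Microsoft hypervisor interface.
    INT32 CpuInfo[4] = {0};
    __cpuidex(CpuInfo, (INT32)GuestRegs->rax, (INT32)GuestRegs->rcx);

    if ((UINT32)GuestRegs->rax == HYPERV_CPUID_INTERFACE)
    {
        CpuInfo[0] = 'HVFS';   // anything other than "Hv#1"
    }

    GuestRegs->rax = (UINT32)CpuInfo[0];
    GuestRegs->rbx = (UINT32)CpuInfo[1];
    GuestRegs->rcx = (UINT32)CpuInfo[2];
    GuestRegs->rdx = (UINT32)CpuInfo[3];

    // And in the vm-exit reason switch: do nothing for HLT.
    // case EXIT_REASON_HLT:
    //     break;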

Finished! From now on, you can test your hypervisor on Hyper-V too : )

Fixing Previous Design Issues

In this part, we want to improve our hypervisor and fix some issues from the previous parts regarding problems and misunderstandings.

Fixing the problem with pre-allocated buffers

Our previous buffer pre-allocation had two problems:

  • It doesn't allow us to hook pages from VMX root mode, which means that every pool allocation has to start from vmx non-root mode.
  • In the process of allocation, we didn't acquire a spinlock, so the processor might interrupt us, and the next time we want to continue our execution, there is no allocation left, as we allocated pools per core.

To fix them, we need to design a global pool manager. You can see the pool manager code in "PoolManager.c" and "PoolManager.h". I'm not going to describe how it works, as it's pretty clear if you read the source code, but I'll explain the functionality of this pool manager and how you can use its functions.

In this pool manager, instead of core-specific pre-allocated buffers, we'll use global pre-allocated buffers, with ten pre-allocated buffers kept ready. Each time one of these buffers is used, we add a request to the pool manager to replace it with a new pool as soon as possible; this way, we'll never run out of pre-allocated pools.

Of course, we might run out of pre-allocated pools if ten requests arrive at the pool manager at once, but we don't need that many requests at the same time, and, of course, in between, the pool manager gets a chance to re-allocate new pools.

Here is the explanation of its functions:

Initializes the Pool Manager and pre-allocate some pools.

De-allocate all the allocated pools

The above function checks whether a new pool request is pending and, if so, allocates the pool. It should be called at PASSIVE_LEVEL (vmx non-root mode) because the allocation must be performed at a safe IRQL, and the best place to call it is the IOCTL handler, as we call it frequently and it runs at PASSIVE_LEVEL, so it's safe.

If we need a new pool to be allocated, we call this function. It stores the request somewhere in memory so that the pool can be allocated when it's safe (IRQL == PASSIVE_LEVEL).

POOL_ALLOCATION_INTENTION is an enum that describes why we need the pool. It's used because we might need pools for different purposes with different sizes, and this way we can use our pool manager for all of them without any problem.

In vmx root mode, if we need a safe pool address immediately, we call this function. It also requests a new pool if we set RequestNewPool to TRUE; thus, the next time it's safe, a replacement pool will be allocated.

Also, you can look at the code for other explanations.
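To give a feel for the interface, the prototypes look approximately like this (my paraphrase; consult PoolManager.h for the exact names, parameters, and intention values):

    // Approximate interface of the global pool manager (names are illustrative).
    typedef enum _POOL_ALLOCATION_INTENTION {
        TRACKING_HOOKED_PAGES,      // e.g., bookkeeping for hidden hooks
        EXEC_TRAMPOLINE             // e.g., trampolines for hooked functions
    } POOL_ALLOCATION_INTENTION;

    BOOLEAN PoolManagerInitialize();                 // pre-allocate the first pools (PASSIVE_LEVEL)
    VOID    PoolManagerUninitialize();               // free everything on driver unload
    BOOLEAN PoolManagerCheckAndPerformAllocation();  // allocate queued requests (PASSIVE_LEVEL, e.g., IOCTL handler)

    // Queue a request so a replacement pool is allocated the next time it is safe.
    BOOLEAN PoolManagerRequestAllocation(SIZE_T Size, UINT32 Count, POOL_ALLOCATION_INTENTION Intention);

    // Get a ready, safe pool address immediately (usable from vmx root mode);
    // optionally queue a replacement by passing RequestNewPool = TRUE.
    UINT64 PoolManagerRequestPool(POOL_ALLOCATION_INTENTION Intention, BOOLEAN RequestNewPool, UINT32 Size);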

Avoid Intercepting Accesses to CR3

One of the misunderstandings that we carried from part 5 until this part was that we intercepted CR3 accesses, because we set CR3 load-exiting and CR3 store-exiting in the CPU-Based VM-Execution Controls.

In general, it's quite unusual to intercept guest accesses to CR3 when the guest runs under EPT. It's mostly done when implementing a shadow MMU (because of the lack of EPT support in the CPU), so not intercepting CR3 accesses is the standard behavior for any hypervisor running with EPT enabled.

Intercepting CR3 accesses is always configurable; we have to clear the CPU_BASED_CR3_STORE_EXITING, CPU_BASED_CR3_LOAD_EXITING, and CPU_BASED_INVLPG_EXITING bits in the VMCS's CPU_BASED_VM_EXEC_CONTROL.

But wait, why should we clear them? We never set them!

As noted in previous parts, certain VMX controls are reserved and must be set to a specific value (0 or 1), which is determined by the processor. That's why we used the function "HvAdjustControls" and passed it an MSR (MSR_IA32_VMX_PROCBASED_CTLS, MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS) which represents these settings.

Actually, there are 3 types of settings for VMCS controls:

  • Always-flexible. These have never been reserved.
  • Default0. These are (or have been) reserved with a default setting of 0.
  • Default1. They are (or have been) reserved with a default setting of 1.

On newer processors, bit 55 of IA32_VMX_BASIC is read as 1 if any VMX controls that are default1 may be cleared to 0. This bit also reports support for the VMX capability MSRs IA32_VMX_TRUE_PINBASED_CTLS, IA32_VMX_TRUE_PROCBASED_CTLS, IA32_VMX_TRUE_EXIT_CTLS, and IA32_VMX_TRUE_ENTRY_CTLS.

So we have to check whether our CPU supports this bit; if it does, then we have to use the new IA32_VMX_TRUE_PINBASED_CTLS, IA32_VMX_TRUE_PROCBASED_CTLS, IA32_VMX_TRUE_EXIT_CTLS, and IA32_VMX_TRUE_ENTRY_CTLS instead of MSR_IA32_VMX_PROCBASED_CTLS, MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_EXIT_CTLS, and MSR_IA32_VMX_ENTRY_CTLS.

Note that MSR_IA32_VMX_PROCBASED_CTLS2 doesn’t have another version.

For this purpose, first we read the MSR_IA32_VMX_BASIC.

Then we check whether the 55th bit of MSR_IA32_VMX_BASIC is set or not. If it's set, then we pass the TRUE capability MSRs to our HvAdjustControls.
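(A sketch of that selection; HvAdjustControls and the constants are the ones referred to above, with the TRUE-MSR names assumed to be defined accordingly.)

    // Sketch: pick the capability MSR for the processor-based controls
    // depending on bit 55 of IA32_VMX_BASIC.
    ULONG64 VmxBasicMsr = __readmsr(MSR_IA32_VMX_BASIC);
    BOOLEAN TrueControlsSupported = (VmxBasicMsr >> 55) & 1;

    __vmx_vmwrite(CPU_BASED_VM_EXEC_CONTROL,
                  HvAdjustControls(CpuBasedVmExecControls,
                                   TrueControlsSupported ? MSR_IA32_VMX_TRUE_PROCBASED_CTLS
                                                         : MSR_IA32_VMX_PROCBASED_CTLS));

    // The pin-based, vm-exit, and vm-entry controls are selected the same way
    // with their respective TRUE / non-TRUE MSRs.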

This way, we gain better performance by disabling unnecessary vm-exits, as there are countless CR3 changes for each process in Windows, and the Meltdown patch doubles the number of CR3 changes. We no longer need to intercept them.

Restoring IDTR, GDTR, GS Base and FS Base

One of the things that we didn't do in the previous parts was restoring the IDTR, GDTR, GS base, and FS base when we want to turn off the hypervisor. We should restore the GDTR/IDTR when we execute VMXOFF, or PatchGuard will detect that they were left modified.

In order to restore them, before executing VMXOFF on each core, the following function is called; it takes care of everything that should be restored to avoid PatchGuard errors.

It reads GUEST_GS_BASE and GUEST_FS_BASE from the VMCS and restores them with WRMSR, and it also restores GUEST_GDTR_BASE, GUEST_GDTR_LIMIT, GUEST_IDTR_BASE, and GUEST_IDTR_LIMIT using the lgdt and lidt instructions.
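(A sketch of that restore routine is below; the AsmReloadGdtr/AsmReloadIdtr helpers stand for the project's small assembly routines that execute lgdt and lidt, and the VMCS field constants are the ones used throughout the series.)

    #define MSR_FS_BASE 0xC0000100
    #define MSR_GS_BASE 0xC0000101

    // Implemented in assembly: load a new GDTR / IDTR (lgdt / lidt).
    extern void AsmReloadGdtr(void *GdtBase, unsigned long GdtLimit);
    extern void AsmReloadIdtr(void *IdtBase, unsigned long IdtLimit);

    // Sketch: called on each core right before VMXOFF.
    VOID HvRestoreRegisters()
    {
        size_t Base  = 0;
        size_t Limit = 0;

        // FS and GS base are written back through their MSRs.
        __vmx_vmread(GUEST_FS_BASE, &Base);
        __writemsr(MSR_FS_BASE, Base);

        __vmx_vmread(GUEST_GS_BASE, &Base);
        __writemsr(MSR_GS_BASE, Base);

        // GDTR and IDTR are reloaded with lgdt / lidt in the assembly helpers.
        __vmx_vmread(GUEST_GDTR_BASE, &Base);
        __vmx_vmread(GUEST_GDTR_LIMIT, &Limit);
        AsmReloadGdtr((void *)Base, (unsigned long)Limit);

        __vmx_vmread(GUEST_IDTR_BASE, &Base);
        __vmx_vmread(GUEST_IDTR_LIMIT, &Limit);
        AsmReloadIdtr((void *)Base, (unsigned long)Limit);
    }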

The lgdt and lidt instructions themselves are executed in small assembly routines in the project's assembly file, which simply load the descriptor-table registers with the base and limit taken from the VMCS.