Windows x64 Shellcoding – Part 2

The second part of a three-part series on Windows x64 shellcoding, going from a basic Hello World program in Assembly to implementing guardrails in MSFVenom shellcode.

Part 1 – Assembly and WinAPIs – Hello World
Part 2 – Assembly to Shellcode – Stack Buffers and WinAPIs
Part 3 – Adding extra functionality to MSFVenom

Previously we covered how to write a basic Hello World program in x64 ASM, and showed the issues that we need to overcome to convert our Assembly code into Shellcode. These issues are:

Handling variables manually on the Stack
Calling Win API functions without relying on linking

Stack Strings

To demonstrate variable handling on the stack we’ll start with stack strings, then continue on to stack buffers. For stack strings we will use the MessageBoxA function.

int MessageBoxA(
  [in, optional] HWND   hWnd,
  [in, optional] LPCSTR lpText,
  [in, optional] LPCSTR lpCaption,
  [in]           UINT   uType
);

After the first section, hopefully you can now modify the code to use the new MessageBoxA function. If we call this function with 0 for all four arguments, we get an empty message box with the default title Error.

mov	rcx, 0
mov 	rdx, 0
mov	r8, 0
mov 	r9, 0
call    MessageBoxA

The second argument, passed in the rdx register, requires a pointer to the string to display in the message box. One simple way of handling this is to push the string onto the stack in 64-bit (8-byte) chunks, and then assign the address of the stack string to a register. Because x64 is little-endian and the stack grows downwards, string arguments need to be written in a rather annoying format. For example the string AAAABBBBCCCDDDD (notice the three C’s, this is intentional) would usually be read as 0x414141414242424243434344444444. The three C’s leave the final chunk one byte short, so its top byte is zero and acts as the string’s null terminator. To preserve the order of the string in memory, each 8-byte chunk has to be byte-reversed and the last chunk pushed first, in the following order.

0x44444444434343
0x4242424241414141

Another thing to remember is stack alignment. Every push decrements the stack pointer by 8 bytes, so an odd number of pushes needs an additional 8 bytes subtracted from the stack pointer to keep it 16-byte aligned. If we push two values onto the stack, the 16-byte alignment works itself out naturally. Therefore the MessageBoxA call could look something like this:

mov    rcx, 0
mov    rdx, 0x44444444434343
push   rdx
mov    rdx, 0x4242424241414141
push   rdx
mov    rdx, rsp
mov    r8, 0
mov    r9, 0
call   MessageBoxA
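
If you want to convince yourself that this push order really leaves the string in reading order, the following Python sketch models the stack as a plain byte string. It only uses the two values from the example above; stack_bytes is a made-up helper name for illustration, not part of any tool.

```python
import struct

def stack_bytes(pushed_qwords):
    """Return the bytes found at rsp after pushing the given 64-bit values in order."""
    memory = b""
    for value in pushed_qwords:
        # Each push stores the value little-endian at a lower address,
        # so the newest push ends up in front of everything pushed before it.
        memory = struct.pack("<Q", value) + memory
    return memory

# Same order as the mov/push pairs above.
layout = stack_bytes([0x44444444434343, 0x4242424241414141])
print(layout)  # b'AAAABBBBCCCDDDD\x00'
```

The trailing \x00 is the unused top byte of the seven-character chunk, which is what terminates the string for MessageBoxA.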

It’s easy to make mistakes with the stack string format. While I would recommend doing a few different strings manually to learn how it works, the Python script below handles the formatting of the bytes for you.

# run with python heck.py | tac
msg = "Oh heck! A wild message box appeared!"

for i in range(0, len(msg), 8):
    rmsg = msg[i:i+8][::-1]
    print("0x" +  ''.join(
        [
            format(ord(ch), '02x') 
            for ch in rmsg
        ]
    ))
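
As a variation on the script above, here is a sketch that prints the mov/push pairs directly in push order, so no tac is needed. It also handles one case the script above does not: when the string length is an exact multiple of 8 there is no spare zero byte in the last chunk, so it pushes a zero value first as the terminator. The function name stack_string_asm and the default register are my own choices for illustration.

```python
def stack_string_asm(text, reg="rdx"):
    """Return mov/push lines that leave `text` null-terminated at rsp."""
    data = text.encode("ascii")
    chunks = [data[i:i + 8] for i in range(0, len(data), 8)]
    # A length that is a multiple of 8 leaves no zero byte in the last chunk,
    # so an empty chunk (value 0) is added to act as the terminator.
    if len(data) % 8 == 0:
        chunks.append(b"")
    lines = []
    for chunk in reversed(chunks):  # the last chunk is pushed first
        value = int.from_bytes(chunk, "little")
        lines.append(f"mov {reg}, 0x{value:x}")
        lines.append(f"push {reg}")
    return lines

print("\n".join(stack_string_asm("Oh heck! A wild message box appeared!")))
```

The first line printed is mov rdx, 0x2164657261, i.e. the tail of the message, which is the first value that needs to be pushed.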

Putting this together, we get the following ASM code. Note the single sub rsp, 8 instruction because of the odd number of push instructions.

start:
	sub rsp, 40 ; reserve shadow space

	mov rdx, 0x2164657261
	push rdx
	mov rdx, 0x6570706120786f62
	push rdx
	mov rdx, 0x206567617373656d
	push rdx
	mov rdx, 0x20646c6977204120
	push rdx
	mov rdx, 0x216b63656820684f
	push rdx
	
	mov rdx, rsp
	mov rcx, 0
	mov r8, 0
	mov r9, 0
	sub rsp, 8
	call MessageBoxA
		
	xor rcx, rcx
	call ExitProcess

Stack Buffers

Now onto something a little bit more complicated, Stack Buffers. After learning and reading a few tutorials on ASM, I was a bit disappointed as the functionality I was implementing was just message boxes or similar style functions. So to do something a little more interesting, let’s use GetComputerNameA to get the hostname of our machine. A brief look at the documentation and this function initially looks very simple to implement.

BOOL GetComputerNameA(
  [out]     LPSTR   lpBuffer,
  [in, out] LPDWORD nSize
);

Perhaps we could use a stack string, but that just feels messy. What we need instead is a stack buffer: an area of memory reserved on the stack that the function can write the hostname to. Starting with the following template, similar to the Hello World program from Part 1, we need to create a stack buffer, and then call GetComputerNameA.

start:
	sub 	rsp, 40 ; reserve shadow space
	
	mov 	rcx, -11
	call 	GetStdHandle
	mov	rbx, rax ; store into rbx for later

	; TODO: create stack buffer
	; TODO: call GetComputerNameA

	mov 	rdx, <stackBuffer>
	mov 	rcx, rbx ; store stdout handle to rcx
	mov 	r8, 6 ; length of my hostname
	mov 	r9, 0
	mov 	qword [rsp+32], 0
	call 	WriteFile

	xor rcx, rcx
	call ExitProcess

Another problem we are now faced with is that we’re handling much more data in our registers and on the stack. We need to make sure that setting up our stack buffer and calling GetComputerNameA doesn’t destroy our reference to the output handle from GetStdHandle, so we store the handle in a register that is preserved across function calls, i.e. rbx.

Next we create a stack buffer, by which I mean a space reserved on the stack for us to store our computer name. We do this by first decrementing the stack pointer by however much space we wish to use to store our variable. In this example I will use 40, so the new stack pointer is 40 bytes below the old one. This address is then copied into another register that will be used to point to our stack buffer, e.g. mov rcx, rsp. We then add 40 back to rcx to get the pointer we will use as the start of our buffer, and finally store rcx into r12, another register that is preserved across calls, because after we call GetComputerNameA our reference in rcx will likely be destroyed. The code for creating a stack buffer is below.

	sub 	rsp, 40 ; create space for stack buffer
	mov 	rcx, rsp
	add 	rcx, 40 ; get stack buffer pointer
	mov 	r12, rcx ; backup stack buffer

There are much nicer and more efficient ways of handling this, but I think it’s easier to explain for the first time when being as excessive as possible.

If you experiment with different values for the stack buffer you may run into some strange issues. Remember, alignment is always a priority here. Decrementing the stack pointer by 40 means that we need to move the stack by another 8 bytes to preserve 16-byte alignment. We do this with a push instruction later on, so it is handled for us and we don’t need an additional sub rsp, 8 instruction.

If we used 80 instead of 40, however, 80 is already a multiple of 16 (80 mod 16 == 0), so on its own it wouldn’t disturb alignment. But our future code (as discussed previously and shown below) will include a push, so if using 80 as our stack buffer size we would need to add an additional sub rsp, 8 instruction, or rewrite our push instruction as something else, to ensure the stack is 16-byte aligned.
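
The arithmetic behind those two cases can be checked with a few lines of Python. This is only a sketch, and it assumes rsp is 16-byte aligned before the buffer is reserved (which should be the case in the template above after the initial sub rsp, 40). The name padding_needed is made up for this example.

```python
def padding_needed(reserved_bytes, pushes):
    """Extra bytes to subtract from rsp so it is a multiple of 16 at the next call.

    Assumes rsp was 16-byte aligned before reserving anything.
    """
    moved = reserved_bytes + 8 * pushes  # every push moves rsp by 8 bytes
    return -moved % 16

for size in (40, 80):
    print(size, padding_needed(size, pushes=1))
# 40 0
# 80 8
```

The same helper also explains the earlier stack string example: five pushes with nothing else reserved gives padding_needed(0, 5) == 8, which is the single sub rsp, 8 before the MessageBoxA call.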

Anyway, our GetComputerNameA call requires two arguments: rcx as a pointer to the stack buffer, which we have just set up, and rdx as a pointer to a DWORD that holds the buffer size going in and receives the length of the computer name coming out. My computer name for context is jwin11. Also note how we’re handling the rdx argument, this is what I was referring to earlier. This is why we don’t need to align the stack manually if using a stack buffer size of 40: the stack is re-aligned by the push instruction.

	mov	rax, 0xff
	push	rax
	mov 	rdx, rsp
	call 	GetComputerNameA

Usually, as seen with other functions e.g. GetStdHandle, the return value is stored in rax. While the return value from GetComputerNameA is still stored in rax, that value is a boolean indicating whether the function call succeeded, which we don’t really care about. The value we care about is stored in our stack buffer. So rcx, right? No, as I mentioned previously, our rcx value is likely clobbered and our pointer to the stack buffer is gone. This is the reason why we stored it in r12 before the GetComputerNameA call.

So now we can reinstate our stack buffer address from r12 but now put it into rdx, as the WriteFile call requires that the second argument be the string that is printed. The final code for using Stack buffers with GetComputerNameA should look something like this.

start:
	sub 	rsp, 40 ; reserve shadow space
	
	mov 	rcx, -11
	call 	GetStdHandle
	mov 	rbx, rax ; store handle in rbx

	sub 	rsp, 40 ; create space for stack buffer
	mov 	rcx, rsp
	add 	rcx, 40 ; get stack buffer pointer
	mov 	r12, rcx ; backup stack buffer

	mov 	rax, 0xff
	push 	rax
	mov 	rdx, rsp
	call 	GetComputerNameA

	mov 	rdx, r12
	mov 	rcx, rbx
	mov 	r8, 6
	mov 	r9, 0
	mov 	qword [rsp+32], 0
	call 	WriteFile

	xor 	rcx, rcx
	call 	ExitProcess

Now we can remove the .data section from our ASM, and move onto the final blocker for getting our shellcode working.

Walking the Process Environment Block

I learnt the following technique from the following links, I highly recommend reviewing these before continuing, especially the Nytro Security blog as I will be extending Nytro’s code in the remainder of this tutorial.

So the situation we’re in is that we want to call Windows API functions, but we are not able to link to functions during compilation and we don’t know where in memory the functions exist.

Although I just said we don’t know where Windows APIs are, there are two interesting Windows API functions we would like to find:

LoadLibraryA
GetProcAddress

The reason we want these functions specifically, is that LoadLibraryA allows us to load a module of our choosing, e.g. User32.dll into the address space of our process, and GetProcAddress allows us to find specific functions within the imported module. For example the logic may look like this

GetProcAddress(LoadLibraryA("User32.dll"), "MessageBoxA")

So we need to find these two functions, which both exist in Kernel32.dll. To do this, we can walk the Process Environment Block (PEB). Simply put, the PEB is unique to each process and stores information about the process’s execution environment. As described further in the links above, we can walk the PEB to locate the address of Kernel32.dll and therefore our two required functions.

TEB->PEB->Ldr->InMemoryOrderModuleList->currentProgram->ntdll->kernel32.DllBase

This is possible because InMemoryOrderModuleList is a doubly linked list data structure that contains linked LDR_DATA_TABLE_ENTRY structures, which each contain the base address of DLL files. Each of these data structures also contains the name of the DLL in the data structure. So with this information we can walk the PEB to find the addresses of the DLLs we’re searching for. Combine that with our Kernel32 functions from earlier, and we have access to Win APIs.
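
To make the list walk a bit more concrete, here is a small Python model of it. It is only a sketch of the logic: the Module class is a heavily simplified stand-in for LDR_DATA_TABLE_ENTRY, and the load order and base addresses are invented for illustration. In real shellcode each step is a raw pointer read rather than an attribute access.

```python
from dataclasses import dataclass

@dataclass
class Module:
    """Simplified stand-in for LDR_DATA_TABLE_ENTRY (only the fields used here)."""
    name: str                 # plays the role of BaseDllName
    dll_base: int             # plays the role of DllBase
    flink: "Module" = None    # next entry in the list

def build_list(modules):
    """Link the modules into a circular list behind a head node."""
    head = Module("<list head>", 0)
    prev = head
    for module in modules:
        prev.flink = module
        prev = module
    prev.flink = head         # the last entry points back at the head
    return head

def find_module(head, wanted):
    """Follow flink pointers until the wanted DLL name is found."""
    entry = head.flink
    while entry is not head:  # arriving back at the head means we saw everything
        if entry.name.lower() == wanted.lower():
            return entry.dll_base
        entry = entry.flink
    return None

# Hypothetical load order and base addresses, for illustration only.
head = build_list([
    Module("show_hostname.exe", 0x7FF600000000),
    Module("ntdll.dll", 0x7FFB10000000),
    Module("KERNEL32.DLL", 0x7FFB0F000000),
])
print(hex(find_module(head, "kernel32.dll")))
```

Comparing names case-insensitively is a choice made for this sketch. Shellcode that relies on the fixed order instead (program, ntdll, kernel32) simply follows the next pointer twice and skips the name check.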

So assuming that you’ve read and implemented Nytro’s code to get access to specific Win API functions, we can now extend it with the use of stack buffers to call GetComputerNameA and then print the value out in MessageBoxA.

Writing Shellcode

We’re going to implement the following seven areas using Nytro’s code as a base.

Get Kernel32.dll address
Get GetComputerNameA address
Create a stack buffer for our hostname
Call GetComputerNameA
Get User32.dll address
Get MessageBoxA address
Call MessageBoxA with our stack buffer

Nytro’s code already keeps a backup of the Kernel32 base address in rbx, so we already have that address. Next we need the address of GetComputerNameA. GetProcAddress is already stored in rdi, so we can make some minor modifications to the ASM used to call GetProcAddress to get the address of GetComputerNameA.

xor rcx, rcx                  
push rcx                      
mov rcx, 0x41656d614e726574
push rcx
mov rcx, 0x75706d6f43746547
push rcx
mov rdx, rsp                  ; GetComputerNameA
mov rcx, rbx                  ; kernel32.dll base address
sub rsp, 0x28                 
call rdi                      ; Call GetProcAddress
add rsp, 0x28                 
add rsp, 0x18                  
mov r15, rax                  ; GetComputerNameA in r15

Next we can create our stack buffer as shown earlier. Once again we use 40, because the GetComputerNameA call afterwards uses a push instruction that restores the alignment.

sub rsp, 40
mov r14, rsp
add r14, 40

mov rcx, r14
mov rdx, 0xff
push rdx
mov rdx, rsp
call r15

Nytro’s blog already includes a code block for fetching the address of User32.dll, so we can execute that block without any modifications. After getting the address of User32.dll we can then call GetProcAddress with the argument for MessageBoxA.

xor rcx, rcx                  
push rcx                      
mov rcx, 0x41786f
push rcx
mov rcx, 0x426567617373654d
push rcx
mov rdx, rsp                  ; MessageBoxA
mov rcx, r15                  ; User32.dll base address
sub rsp, 0x28                 
call rdi                      ; Call GetProcAddress
add rsp, 0x28                 
add rsp, 0x18                 
mov r15, rax                  ; Store MessageBoxA into r15

Finally, we take our stack buffer pointer, which is stored in the preserved register r14, and move it into rdx for the MessageBoxA call.

mov rcx, 0   
mov rdx, r14
mov r8, 0
mov r9, 0
call r15      ; MessageBoxA

Then add an ExitProcess call at the end, compile and we should have our executable successfully printing a message box with our hostname as the message.

Now that we have removed the .data section and external functions, we can go back to objdump and extract our shellcode. The following one-liner can help us extract the shellcode by running it against the compiled executable. Source.

objdump show_hostname.exe -d | grep '[a-f0-9]:' | grep -v 'file' | cut -f2 -d: | cut -f1-7 -d' ' | tr -s ' ' | tr '\t' ' ' | sed 's/ $//g' | sed 's/ /\\x/g' | paste -d '' -s
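
If the shell pipeline feels fragile, the same formatting step can be sketched in Python. This assumes you already have the raw bytes of the code section in a file (for example dumped with something like objcopy -O binary -j .text, which I am assuming here rather than taking from the original workflow). The function name to_c_string is made up for this example.

```python
def to_c_string(code):
    """Format raw machine-code bytes as the body of a C string literal."""
    return "".join(f"\\x{byte:02x}" for byte in code)

# Tiny stand-in for real shellcode: xor rcx, rcx followed by ret.
sample = bytes.fromhex("4831c9c3")
print(to_c_string(sample))  # \x48\x31\xc9\xc3
print(len(sample), "bytes")
```

For a real payload you would read the bytes from the dumped file with open(path, "rb").read() and paste the printed string into the runner.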

Next we can add this into a basic shellcode runner, for example injecting into a local process by using the first example here, also provided below.

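#include <windows.h>
#include <string.h>

int main()
{
    // Paste the \x-escaped bytes extracted with objdump here.
    unsigned char shellcode[] = "";

    // Allocate memory that is readable, writable and executable.
    void *exec = VirtualAlloc(
        0,
        sizeof shellcode,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE
    );
    if (exec == NULL)
        return 1;

    // Copy the shellcode into the allocation and jump to it.
    memcpy(exec, shellcode, sizeof shellcode);
    ((void(*)())exec)();
    return 0;
}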

Next we can compile the final file containing our shellcode.

"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat"

"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.37.32822\bin\Hostx64\x64\cl.exe" payload.cpp

After compiling, we can then run our final payload with our newly created custom shellcode.

In the next and final part of this series on shellcode development, we will walk through writing shellcode that is actually useful and serves an interesting purpose by implementing custom guardrails into MSFVenom shellcode.

Windows x64 Shellcoding – Part 3

