Writing ultra-small Windows executables

(April 17, 2017)

How small can a valid and useful Win32 executable be? There are already a few tutorials on this topic, but they either no longer work on modern Windows versions or only cover the most basic »do nothing but return zero« program. The goal here is to do something genuinely useful: a console application that writes the contents of the Windows clipboard to standard output.

The target application

The target application scenario is a little more than a simple »Hello World« application, but it is still basic Win32 API 101: only a few calls to kernel32.dll and user32.dll functions and very little algorithmic stuff in between. Still, it’s undeniably useful, e.g. when you want to filter the contents of some text in an editor through (Win32 versions of) sed, grep, tr, cut, awk or similar command-line tools. Or (and this is what I use very frequently) to quickly change the directory in the console to something else you happen to have open in Explorer. Besides copying the path from Explorer, you would normally need to type »cd /d « in the command prompt window, followed by a right-click, and Enter. (The awkward /d parameter is important, otherwise the current drive letter wouldn’t be changed as well.) That’s cumbersome; I use a small batch file (called fcd.cmd in my case) which resides in a directory somewhere in the PATH and does this automatically:

@for /f "usebackq tokens=*" %%a in (`getclip`) do @cd /d %%a

It calls the getclip.exe helper program to get the clipboard’s contents and crafts a cd /d command with it (using cmd.exe’s byzantine syntax to make the backticks work their usual magic). I had this helper program already, but it was part of a GnuWin32 installation that requires a bunch of obscure DLLs to work; for my new Windows installation, I didn’t want to use this cruft again. However, there’s no simple alternative either: Windows ships with clip.exe, but this only works in the opposite direction, putting data from a pipe into the clipboard, not out of it. A quick internet search only came up with solutions that had even more ridiculous dependencies. So I decided to ~~shave a yak~~ take matters into my own hands and write my own simple, small implementation of getclip.exe. Couldn’t be that hard.

The naïve C implementation

The obvious first try is to write a C program, like this:

#include <windows.h>
#include <stdio.h>
#include <string.h>
int main(void) {
    if (!OpenClipboard(NULL)) {
        ExitProcess(1);
    }
    HANDLE hData = GetClipboardData(CF_TEXT);
    if (!hData) {
        CloseClipboard();
        ExitProcess(1);
    }
    const char *str = (const char*)GlobalLock(hData);
    if (!str) {
        CloseClipboard();
        ExitProcess(1);
    }
    fwrite((const void*)str, 1, strlen(str), stdout);
    GlobalUnlock(hData);
    CloseClipboard();
    return 0;
}

Note the use of fwrite instead of puts to avoid adding an extra newline at the end of the output. Other than that, it’s fairly basic stuff: opening the clipboard, requesting the data as plain 8-bit (ANSI) text, locking the data handle to get a pointer to it, writing the text to stdout and releasing all the resources we acquired.
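To make the difference between the two output calls concrete, here is a small portable sketch. It is not part of getclip itself, and the helper name count_written is made up for illustration. It writes a string to a temporary file once the way puts would (string plus newline) and once with fwrite, and reports how many bytes ended up in the file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes s to a temporary file and returns the number
   of bytes that ended up there, or -1 if no temporary file is available.
   With add_newline != 0 it mimics puts(), which always appends '\n';
   otherwise it uses fwrite() with an explicit length, like getclip does. */
long count_written(const char *s, int add_newline) {
    FILE *f = tmpfile();
    if (!f) return -1;
    if (add_newline) {
        fputs(s, f);
        fputc('\n', f);              /* the extra byte puts() would add */
    } else {
        fwrite(s, 1, strlen(s), f);  /* exactly strlen(s) bytes, nothing more */
    }
    long n = ftell(f);               /* current position == bytes written */
    fclose(f);
    return n;
}
```

For a clipboard path that is fed into cd /d, that one extra byte is the difference between a clean argument and one with a trailing line break.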

Using Visual Studio 2017 (configured to optimize for size) and the default Windows 8.1 SDK, this gives us an executable of 76800 bytes. Yes, that’s almost 77 kilobytes! We could get this down to 8704 bytes by linking dynamically against the C library DLL, but that’s cheating: This way, the user would require a bunch of DLLs in the multi-megabyte range to run it. We can do better than that.

That pesky C library

Looking closely at the code, it becomes obvious that this »C library tax« is not really necessary for this program: The only required library functions are fwrite and strlen, everything else is just plain Win32 API calls into kernel32.dll and user32.dll. fwrite to stdout can be trivially substituted by GetStdHandle and WriteFile, and strlen doesn’t even need a substitute because it’s inlined by the compiler anyway. So let’s just get rid of the C library altogether and link with /NODEFAULTLIB. In doing so, we lose the luxury of having a main function that has a working heap and gets the command line parsed into argc and argv, but we don’t need that anyway. We can instead make our main function be mainCRTStartup, which is the default entry point of console-mode Windows executables, and return from it by calling ExitProcess. The whole program turns into this:

#include <windows.h>
#include <string.h>
int mainCRTStartup(void) {
    if (!OpenClipboard(NULL)) {
        ExitProcess(1);
    }
    HANDLE hData = GetClipboardData(CF_TEXT);
    if (!hData) {
        CloseClipboard();
        ExitProcess(1);
    }
    const char *str = (const char*) GlobalLock(hData);
    if (!str) {
        CloseClipboard();
        ExitProcess(1);
    }
    DWORD dummy;
    WriteFile(GetStdHandle(STD_OUTPUT_HANDLE),
              (const void*)str, strlen(str), &dummy, NULL);
    GlobalUnlock(hData);
    CloseClipboard();
    ExitProcess(0);
}

That’s not too many changes, but the result is quite impressive: we’re down to 3072 bytes (3 KiB), with no dependencies other than the two mandatory DLLs! At this point, further optimization isn’t reasonable: We’re already below the page size, filesystem cluster size and (modern) disk sector size, all of which are at 4 KiB. If we shrink anything below these 4 KiB, we won’t save any memory, storage space or load time. So there, we have it – we pushed it as far as it makes sense!

But no, we won’t leave it at that! (At least, I won’t.) It might be a purely sporting exercise by now, but if we started the quest to make a getclip implementation as small as possible, we might just as well finish it! So let’s take the next step …

Assembly to the rescue

The actual code is simple enough to forego the comfort zone of C programming altogether and write it straight in assembly, like this (NASM/YASM syntax):

    global _mainCRTStartup
    extern _ExitProcess@4
    extern _OpenClipboard@4
    extern _CloseClipboard@0
    extern _GetClipboardData@4
    extern _GlobalLock@4
    extern _GlobalUnlock@4
    extern _GetStdHandle@4
    extern _WriteFile@20
    section .text
_mainCRTStartup:

    ; set up stack frame for *lpBytesWritten
    push ebp
    mov ebp, esp      ; without this, [ebp-4] below would point into the caller's frame
    sub esp, 4

    ; if (!OpenClipboard(NULL)) ExitProcess(1);
    push 0
    call _OpenClipboard@4
    or eax, eax
    jz error2

    ; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
    push 1  ; CF_TEXT
    call _GetClipboardData@4
    or eax, eax
    jz error
    push eax  ; save hData for GlobalUnlock at the end

    ; char* str = GlobalLock(hData); if (!str) fail;
    push eax
    call _GlobalLock@4
    or eax, eax
    jz error

    ; strlen(str)
    mov ecx, eax
strlen_loop:
    mov dl, [ecx]
    or dl, dl
    jz strlen_end
    inc ecx
    jmp strlen_loop
strlen_end:
    sub ecx, eax

    ; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
    push 0            ; lpOverlapped = NULL
    lea edx, [ebp-4]  ; put nBytesWritten on the stack
    push edx
    push ecx          ; nNumberOfBytesToWrite = strlen(str)
    push eax          ; lpBuffer = str
    push -11          ; hFile = ...
    call _GetStdHandle@4  ; ... GetStdHandle(STD_OUTPUT_HANDLE)
    push eax
    call _WriteFile@20

    ; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
    call _GlobalUnlock@4  ; hData is already on the stack
    call _CloseClipboard@0
    push 0
    call _ExitProcess@4

error:
    call _CloseClipboard@0
error2:
    push 1
    call _ExitProcess@4

Assembling this and linking it with Microsoft’s link.exe generates an executable of 2560 bytes. That might sound a bit disappointing (a mere 512-byte reduction for writing everything in assembly, come on!), but in fact it’s more or less expected: Code generated by a good C compiler is usually already very tight (it might even be better than my attempt; I didn’t check that though), and once it is told to omit all C library dependencies, there’s not much additional cruft left that a compiler would produce but an assembler wouldn’t.

However, a closer look at the generated executable shows lots of zeroes and all kinds of PE sections, including relocation information (which is not needed at all for non-ASLR executables) and (empty) debug information. There are no linker options (that I know of) that get rid of this, so we need to dig even deeper …

Constructing PE files by hand

The most thorough way to stop any interference from the linker, though perhaps not the easiest, is not to use one at all and to write all the PE headers and sections directly in the assembler. Unfortunately, the PE format is not very simple and full of idiosyncrasies, so it takes some effort until a working binary emerges:

bits 32
BASE      equ 0x00400000
ALIGNMENT equ 512
SECTALIGN equ 4096

%define ROUND(v, a) (((v + a - 1) / a) * a)
%define ALIGNED(v) (ROUND(v, ALIGNMENT))
%define RVA(obj) (obj - BASE)

section header progbits start=0 vstart=BASE

mz_hdr:
    dw "MZ"                       ; DOS magic
    times 0x3a db 0               ; [UNUSED] DOS header
    dd RVA(pe_hdr)                ; address of PE header

pe_hdr:
    dw "PE",0                     ; PE magic + 2 padding bytes
    dw 0x014c                     ; i386 architecture
    dw 2                          ; two sections
    dd 0                          ; [UNUSED] timestamp
    dd 0                          ; [UNUSED] symbol table pointer
    dd 0                          ; [UNUSED] symbol count
    dw OPT_HDR_SIZE               ; optional header size
    dw 0x0102                     ; characteristics: 32-bit, executable

opt_hdr:
    dw 0x010b                     ; optional header magic
    db 13,37                      ; [UNUSED] linker version
    dd ALIGNED(S_TEXT_SIZE)       ; [UNUSED] code size
    dd ALIGNED(S_IDATA_SIZE)      ; [UNUSED] size of initialized data
    dd 0                          ; [UNUSED] size of uninitialized data
    dd RVA(section..text.vstart)  ; entry point address
    dd RVA(section..text.vstart)  ; [UNUSED] base of code
    dd RVA(section..idata.vstart) ; [UNUSED] base of data
    dd BASE                       ; image base
    dd SECTALIGN                  ; section alignment
    dd ALIGNMENT                  ; file alignment
    dw 4,0                        ; [UNUSED] OS version
    dw 0,0                        ; [UNUSED] image version
    dw 4,0                        ; subsystem version
    dd 0                          ; [UNUSED] Win32 version
    dd RVA(the_end)               ; size of image
    dd ALIGNED(ALL_HDR_SIZE)      ; size of headers
    dd 0                          ; [UNUSED] checksum
    dw 3                          ; subsystem = console
    dw 0                          ; [UNUSED] DLL characteristics
    dd 0x00100000                 ; [UNUSED] maximum stack size
    dd 0x00001000                 ; initial stack size
    dd 0x00100000                 ; maximum heap size
    dd 0x00001000                 ; [UNUSED] initial heap size
    dd 0                          ; [UNUSED] loader flags
    dd 16                         ; number of data directory entries
    dd 0,0                        ; no export table
    dd RVA(import_table)          ; import table address
    dd IMPORT_TABLE_SIZE          ; import table size
    times 14 dd 0,0               ; no other entries in the data directories
OPT_HDR_SIZE equ $ - opt_hdr

sect_hdr_text:
    db ".text",0,0,0              ; section name
    dd ALIGNED(S_TEXT_SIZE)       ; virtual size
    dd RVA(section..text.vstart)  ; virtual address
    dd ALIGNED(S_TEXT_SIZE)       ; file size
    dd section..text.start        ; file position
    dd 0,0                        ; no relocations or debug info
    dw 0,0                        ; no relocations or debug info
    dd 0x60000020                 ; flags: code, readable, executable
sect_hdr_idata:
    db ".idata",0,0               ; section name
    dd ALIGNED(S_IDATA_SIZE)      ; virtual size
    dd RVA(section..idata.vstart) ; virtual address
    dd ALIGNED(S_IDATA_SIZE)      ; file size
    dd section..idata.start       ; file position
    dd 0,0                        ; no relocations or debug info
    dw 0,0                        ; no relocations or debug info
    dd 0xC0000040                 ; flags: data, readable, writeable

ALL_HDR_SIZE equ $ - $$

;;;;;;;;;;;;;;;;;;;; .text ;;;;;;;;;;;;;;;;;

section .text progbits follows=header align=ALIGNMENT vstart=BASE+SECTALIGN*1
s_text:

    ; set up stack frame for *lpBytesWritten
    push ebp
    mov ebp, esp      ; without this, [ebp-4] below would point into the caller's frame
    sub esp, 4

    ; if (!OpenClipboard(NULL)) ExitProcess(1);
    push 0
    call [OpenClipboard]
    or eax, eax
    jz error2

    ; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
    push 1  ; CF_TEXT
    call [GetClipboardData]
    or eax, eax
    jz error
    push eax  ; save hData for GlobalUnlock at the end

    ; char* str = GlobalLock(hData); if (!str) fail;
    push eax
    call [GlobalLock]
    or eax, eax
    jz error

    ; strlen(str)
    mov ecx, eax
strlen_loop:
    mov dl, [ecx]
    or dl, dl
    jz strlen_end
    inc ecx
    jmp strlen_loop
strlen_end:
    sub ecx, eax

    ; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
    push 0            ; lpOverlapped = NULL
    lea edx, [ebp-4]  ; put nBytesWritten on the stack
    push edx
    push ecx          ; nNumberOfBytesToWrite = strlen(str)
    push eax          ; lpBuffer = str
    push -11          ; hFile = ...
    call [GetStdHandle]   ; ... GetStdHandle(STD_OUTPUT_HANDLE)
    push eax
    call [WriteFile]

    ; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
    call [GlobalUnlock]  ; hData is already on the stack
    call [CloseClipboard]
    push 0
    call [ExitProcess]

error:
    call [CloseClipboard]
error2:
    push 1
    call [ExitProcess]

S_TEXT_SIZE equ $ - s_text

;;;;;;;;;;;;;;;;;;;; .idata ;;;;;;;;;;;;;;;;;

section .idata progbits follows=.text align=ALIGNMENT vstart=BASE+SECTALIGN*2
s_idata:

import_table:
    ; import of kernel32.dll
    dd 0                        ; [UNUSED] read-only IAT
    dd 0                        ; [UNUSED] timestamp
    dd 0                        ; [UNUSED] forwarder chain
    dd RVA(N_kernel32)          ; library name
    dd RVA(IAT_kernel32)        ; IAT pointer
    ; import of user32.dll
    dd 0                        ; [UNUSED] read-only IAT
    dd 0                        ; [UNUSED] timestamp
    dd 0                        ; [UNUSED] forwarder chain
    dd RVA(N_user32)            ; library name
    dd RVA(IAT_user32)          ; IAT pointer
    ; terminator (empty item)
    times 5 dd 0
IMPORT_TABLE_SIZE: equ $ - import_table

IAT_kernel32:
    ExitProcess:      dd RVA(H_ExitProcess)
    GlobalLock:       dd RVA(H_GlobalLock)
    GlobalUnlock:     dd RVA(H_GlobalUnlock)
    GetStdHandle:     dd RVA(H_GetStdHandle)
    WriteFile:        dd RVA(H_WriteFile)
    dd 0
IAT_user32:
    OpenClipboard:    dd RVA(H_OpenClipboard)
    CloseClipboard:   dd RVA(H_CloseClipboard)
    GetClipboardData: dd RVA(H_GetClipboardData)
    dd 0

                    align 4, db 0
N_kernel32:         db "kernel32.dll",0
                    align 4, db 0
N_user32:           db "user32.dll",0
                    align 2, db 0
H_OpenClipboard:    db 0,0,"OpenClipboard",0
                    align 2, db 0
H_GetClipboardData: db 0,0,"GetClipboardData",0
                    align 2, db 0
H_GlobalLock:       db 0,0,"GlobalLock",0
                    align 2, db 0
H_GetStdHandle:     db 0,0,"GetStdHandle",0
                    align 2, db 0
H_WriteFile:        db 0,0,"WriteFile",0
                    align 2, db 0
H_GlobalUnlock:     db 0,0,"GlobalUnlock",0
                    align 2, db 0
H_CloseClipboard:   db 0,0,"CloseClipboard",0
                    align 2, db 0
H_ExitProcess:      db 0,0,"ExitProcess",0

S_IDATA_SIZE equ $ - s_idata

align ALIGNMENT, db 0
the_end:

That’s a pretty standard »by the book« implementation of a PE file: Code and import tables are nicely segregated into separate sections, the sections have their default alignment, all headers are spelled out in full, and fields which are not used by the loader nevertheless have sensible values or at least the usual dummy values (i.e. zero). The only thing that’s missing is a proper DOS stub, so if anybody ever tries to run this on real DOS, it will crash and burn.

So what does it give us? The result is 1536 bytes of finest hand-crafted code. Not too bad, but not quite satisfying either. The elephant in the room is the 512-byte alignment of the sections in the file that causes a lot of empty space: Can’t we just turn that down to, like, nothing? Unfortunately, we really can’t: Windows 10’s loader insists on a file alignment of 512 bytes; any attempt to decrease it results in the message »This app can’t run on your PC«. It’s not even possible to strip the padding at the end of the last section. (WINE accepts all of that without flinching, but that’s not at all our target platform.)
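Where the 1536 bytes come from can be checked with a few lines of C. This is only a sketch of the padding arithmetic: round_up mirrors the ROUND macro from the listing, while padded_file_size and the section payload sizes used in the example are assumptions for illustration. The header size of 392 bytes follows from the listing: 64 bytes of DOS header, 24 bytes of PE header, 224 bytes of optional header and two section headers of 40 bytes each.

```c
#include <stdint.h>

/* Same arithmetic as the ROUND(v, a) macro in the listing. */
uint32_t round_up(uint32_t v, uint32_t a) {
    return ((v + a - 1) / a) * a;
}

/* Hypothetical helper: file size of a PE image whose header block and
   sections are each padded to file_align. section_sizes holds the
   unpadded payload size of every section. */
uint32_t padded_file_size(uint32_t header_size,
                          const uint32_t *section_sizes,
                          int section_count,
                          uint32_t file_align) {
    uint32_t total = round_up(header_size, file_align);
    for (int i = 0; i < section_count; i++) {
        total += round_up(section_sizes[i], file_align);
    }
    return total;
}
```

With a 392-byte header block and two sections whose payloads are each somewhere between 1 and 512 bytes, the result is 512 + 512 + 512 = 1536, no matter how tight the code itself is.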

Merging sections

Even with Windows being so uncooperative, we still have one trick up our sleeve: We can just put both the code and the import tables into a combined section. That’s not common to do (code/data separation exists for a reason), but on our quest to make the file smaller, we take what we can.

The modifications are quite small, so here’s just a diff:

@@ -34,3 +42,3 @@
     dw 0x014c                     ; i386 architecture
-    dw 2                          ; two sections
+    dw 1                          ; one section
     dd 0                          ; [UNUSED] timestamp
@@ -44,8 +52,8 @@
     db 13,37                      ; [UNUSED] linker version
-    dd ALIGNED(S_TEXT_SIZE)       ; [UNUSED] code size
-    dd ALIGNED(S_IDATA_SIZE)      ; [UNUSED] size of initialized data
+    dd ALIGNED(S_SECT_SIZE)       ; [UNUSED] code size
+    dd ALIGNED(S_SECT_SIZE)       ; [UNUSED] size of initialized data
     dd 0                          ; [UNUSED] size of uninitialized data
-    dd RVA(section..text.vstart)  ; entry point address
-    dd RVA(section..text.vstart)  ; [UNUSED] base of code
-    dd RVA(section..idata.vstart) ; [UNUSED] base of data
+    dd RVA(section.getclip.vstart); entry point address
+    dd RVA(section.getclip.vstart); [UNUSED] base of code
+    dd RVA(section.getclip.vstart); [UNUSED] base of data
     dd BASE                       ; image base
@@ -74,20 +82,11 @@
-sect_hdr_text:
-    db ".text",0,0,0              ; section name
-    dd ALIGNED(S_TEXT_SIZE)       ; virtual size
-    dd RVA(section..text.vstart)  ; virtual address
-    dd ALIGNED(S_TEXT_SIZE)       ; file size
-    dd section..text.start        ; file position
+sect_hdr:
+    db "getclip",0                ; section name
+    dd ALIGNED(S_SECT_SIZE)       ; virtual size
+    dd RVA(section.getclip.vstart); virtual address
+    dd ALIGNED(S_SECT_SIZE)       ; file size
+    dd section.getclip.start      ; file position
     dd 0,0                        ; no relocations or debug info
     dw 0,0                        ; no relocations or debug info
-    dd 0x60000020                 ; flags: code, readable, executable
+    dd 0xE0000060                 ; flags: code + data, readable, writeable, executable
-sect_hdr_idata:
-    db ".idata",0,0               ; section name
-    dd ALIGNED(S_IDATA_SIZE)      ; virtual size
-    dd RVA(section..idata.vstart) ; virtual address
-    dd ALIGNED(S_IDATA_SIZE)      ; file size
-    dd section..idata.start       ; file position
-    dd 0,0                        ; no relocations or debug info
-    dw 0,0                        ; no relocations or debug info
-    dd 0xC0000040                 ; flags: data, readable, writeable
@@ -97,4 +96,4 @@
-section .text progbits follows=header align=ALIGNMENT vstart=BASE+SECTALIGN*1
-s_text:
+section getclip progbits follows=header align=ALIGNMENT vstart=BASE+SECTALIGN*1
+the_section:
@@ -157,9 +156,5 @@
-S_TEXT_SIZE equ $ - s_text
-
 ;;;;;;;;;;;;;;;;;;;; .idata ;;;;;;;;;;;;;;;;;

-section .idata progbits follows=.text align=ALIGNMENT vstart=BASE+SECTALIGN*2
-s_idata:
-
+    align 4, ret
@@ -215,3 +210,3 @@
-S_IDATA_SIZE equ $ - s_idata
+S_SECT_SIZE equ $ - the_section

The result is (predictably) 1024 bytes, i.e. exactly 1 KiB. Within the constraints of the Windows loader, it’s not possible to go below that: We need at least one »pseudo-section« for the header and one section for actual code and data, and both of them need to be at least a full 512 bytes.

Going sectionless

As this whole section business works against us, can we possibly live without it? Windows will load at least the header part of the executable into memory anyway, and if we sneak the actual code and import table data into there, we should be fine. In fact, this used to work in the past, but at least Windows 10 version 1703 (and very likely earlier versions, too) simply ignores import tables that are not contained in a section. As a result, the pointers to the function names in the Import Address Table are not replaced by the functions’ entry point addresses: the program will load just fine, but it will crash shortly thereafter when it tries to call the first API function.

So if we want to go down the »sectionless PE« route, we need to find an alternative way to load our imports. But how can we do that? Even LoadLibrary and GetProcAddress would need to be imported from kernel32.dll somehow … or do they? In fact, kernel32.dll (and ntdll.dll) are already loaded, by default, by Windows’ PE loader! We just need to find the addresses somehow. This can be done with some pointer chasing: The FS selector points to the Thread Environment Block (TEB), which contains a pointer to the Process Environment Block (PEB), which contains a pointer to the PE loader data, which contains a doubly-linked circular list of loader data tables for each loaded DLL, which contain a pointer to the DLL’s base address. Phew. But as complicated as that sounds, it’s just six simple MOV instructions. The complex part is what comes after that.
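The shape of that pointer chase can be sketched in portable C. The structs below are simplified stand-ins invented for this example, not the real Windows definitions: only the fields needed for the walk are modeled, and the real offsets differ. What the sketch does show is the one non-obvious detail: the list links sit in the middle of each entry, so the code has to step back from the link to the start of the entry before it can read the base address.

```c
#include <stddef.h>

/* Simplified stand-in for a doubly-linked circular list node. */
typedef struct list_entry {
    struct list_entry *flink;   /* next node */
    struct list_entry *blink;   /* previous node */
} list_entry;

/* Simplified stand-in for one loader data table entry (one loaded DLL). */
typedef struct {
    list_entry in_load_order;
    list_entry in_memory_order; /* the list that gets followed here */
    void *dll_base;
} module_entry;

/* Simplified stand-in for the loader data block holding the list head. */
typedef struct {
    list_entry in_memory_order_head;
} loader_data;

/* Returns the base address of the n-th module (0-based) in memory order,
   or NULL if the list has fewer entries than that. */
void *nth_module_base(loader_data *ldr, int n) {
    list_entry *head = &ldr->in_memory_order_head;
    for (list_entry *cur = head->flink; cur != head; cur = cur->flink) {
        if (n-- == 0) {
            /* step back from the embedded link to the start of the entry */
            module_entry *m = (module_entry *)
                ((char *)cur - offsetof(module_entry, in_memory_order));
            return m->dll_base;
        }
    }
    return NULL;   /* wrapped around to the list head: not that many modules */
}
```

In this model, nth_module_base(ldr, 2) corresponds to »follow the list to the third entry and read its base address«. Unlike the sketch, real code should not rely on kernel32.dll sitting at a fixed position.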

Because right now, we have a pointer to the base address of a DLL that’s supposed to be kernel32.dll. But we need function pointers, not DLL base addresses, and we can’t just call GetProcAddress yet (because we don’t know its address). The only thing we can do is re-implement GetProcAddress by parsing the PE header, looking for the export tables, searching these for the desired function name, and using the ultra-complicated three-step lookup procedure (that doesn’t even work as intended; I got consistent off-by-one errors when implementing it according to the spec) to get the actual address. That’s a lot of code, but there’s no way around that.
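Written in C, the three-step lookup might look roughly like the following sketch. It works on a PE32 image as it is mapped in memory (so RVAs can be used directly as offsets), returns the function’s RVA instead of an absolute address, and ignores forwarded exports. The function and helper names are made up for this example; the field offsets are the ones used in the assembly listing further down.

```c
#include <stdint.h>
#include <string.h>

/* Read little-endian values at a given offset of the mapped image. */
static uint32_t rd32(const uint8_t *img, uint32_t off) {
    return (uint32_t)img[off] | ((uint32_t)img[off + 1] << 8) |
           ((uint32_t)img[off + 2] << 16) | ((uint32_t)img[off + 3] << 24);
}
static uint16_t rd16(const uint8_t *img, uint32_t off) {
    return (uint16_t)(img[off] | (img[off + 1] << 8));
}

/* Returns the RVA of the exported function `name`, or 0 if not found. */
uint32_t find_export_rva(const uint8_t *img, const char *name) {
    uint32_t pe = rd32(img, 0x3C);                 /* offset of the PE header */
    if (img[pe] != 'P' || img[pe + 1] != 'E') return 0;
    if (rd32(img, pe + 0x74) == 0) return 0;       /* no data directories */
    uint32_t exp = rd32(img, pe + 0x78);           /* export table RVA */
    if (exp == 0) return 0;

    uint32_t count     = rd32(img, exp + 0x18);    /* number of named functions */
    uint32_t functions = rd32(img, exp + 0x1C);    /* function address table */
    uint32_t names     = rd32(img, exp + 0x20);    /* address-of-names list */
    uint32_t ordinals  = rd32(img, exp + 0x24);    /* ordinal table */

    for (uint32_t i = 0; i < count; i++) {
        /* step 1: find the index of the name in the name list */
        const char *candidate = (const char *)(img + rd32(img, names + 4 * i));
        if (strcmp(candidate, name) != 0) continue;
        /* step 2: translate the name index into a function index; the value
           is used as-is, the ordinal base is NOT subtracted here */
        uint16_t index = rd16(img, ordinals + 2 * i);
        /* step 3: look up the function's RVA */
        return rd32(img, functions + 4 * index);
    }
    return 0;
}
```

Adding the DLL’s base address to the returned RVA gives the callable address, which is what the final add in the assembly version does.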

Having implemented a poor man’s GetProcAddress, note that we no longer need the real thing: We can directly look for LoadLibrary in the loaded DLLs (one of which is always kernel32.dll), load user32.dll with it and then use our own look-up function for all other required API calls as well. In fact, I went so far as to have a wrapper function that takes the base address of a DLL and the function name, looks the function up and calls it directly.

One nice side-effect of going sectionless is that Windows now allows us to set the file alignment to an arbitrarily low value, because it isn’t really interested in any alignment stuff in this case. (It does check that the section alignment is equal to the file alignment, but that’s fine with us.)

There is one additional pitfall on Windows 7 64-bit (I believe I didn’t see this on 32-bit Windows 7, but I’m not sure). It seems that its loader does not fully ignore the section table as it ought to: if the DWORD that would hold the file offset of the first section is negative, the executable can’t be run. In effect, this means that the byte at offset 23 (decimal) after the optional header must not be 0x80 or greater. That’s quite a restriction, because we’re going to put code there and we don’t want to juggle the instructions around until we have found an arrangement that works! Fortunately, we can circumvent this: The »optional header size« field does not really store the size of the optional header. That header has a fixed size after all, determined only by the number of data directory entries, which is stored explicitly. What the »optional header size« field actually encodes is the offset of the section table, relative to the optional header’s start. So we simply need to choose a value such that the DWORD at offset [optional header start + optional header size + 20] is guaranteed to be less than 0x80000000. One good candidate is the »image base« field, which defaults to 0x400000 and is located at offset 28 inside the optional header. So we put down 8 as the optional header size and we’re set!
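The offset arithmetic behind that choice fits into a few lines of C. The constant and function names are invented for this sketch; the numbers 20 and 28 are the field offsets mentioned above.

```c
#include <stdint.h>

enum {
    OPT_IMAGE_BASE_OFFSET       = 28, /* »image base« inside the optional header */
    SECT_FILE_POSITION_OFFSET   = 20  /* file offset field inside a section header */
};

/* Offset (relative to the optional header's start) of the DWORD that the
   loader treats as the first section's file offset, given the value stored
   in the »optional header size« field. */
uint32_t first_section_file_pos_offset(uint16_t optional_header_size) {
    return (uint32_t)optional_header_size + SECT_FILE_POSITION_OFFSET;
}

/* The quirk described above: the DWORD must not look negative. */
int file_pos_is_acceptable(uint32_t dword_value) {
    return dword_value < 0x80000000u;
}
```

With an »optional header size« of 8, the checked DWORD lands at offset 28, which is exactly the »image base« field; its value 0x00400000 passes the test regardless of what code follows the header.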

bits 32
BASE      equ 0x00400000
ALIGNMENT equ 4
SECTALIGN equ 4

%define ROUND(v, a) (((v + a - 1) / a) * a)
%define ALIGNED(v) (ROUND(v, ALIGNMENT))
%define RVA(obj) (obj - BASE)

org BASE

mz_hdr:
    dw "MZ"                       ; DOS magic
    times 0x3a db 0               ; [UNUSED] DOS header
    dd RVA(pe_hdr)                ; address of PE header

pe_hdr:
    dw "PE",0                     ; PE magic + 2 padding bytes
    dw 0x014c                     ; i386 architecture
    dw 0                          ; no sections
    dd 0                          ; [UNUSED] timestamp
    dd 0                          ; [UNUSED] symbol table pointer
    dd 0                          ; [UNUSED] symbol count
    dw 8                          ; optional header size
    dw 0x0102                     ; characteristics: 32-bit, executable

opt_hdr:
    dw 0x010b                     ; optional header magic
    db 13,37                      ; [UNUSED] linker version
    dd RVA(the_end)               ; [UNUSED] code size
    dd RVA(the_end)               ; [UNUSED] size of initialized data
    dd 0                          ; [UNUSED] size of uninitialized data
    dd RVA(main)                  ; entry point address
    dd RVA(main)                  ; [UNUSED] base of code
    dd RVA(main)                  ; [UNUSED] base of data
    dd BASE                       ; image base
    dd SECTALIGN                  ; section alignment
    dd ALIGNMENT                  ; file alignment
    dw 4,0                        ; [UNUSED] OS version
    dw 0,0                        ; [UNUSED] image version
    dw 4,0                        ; subsystem version
    dd 0                          ; [UNUSED] Win32 version
    dd RVA(the_end)               ; size of image
    dd ALIGNED(ALL_HDR_SIZE)      ; size of headers
    dd 0                          ; [UNUSED] checksum
    dw 3                          ; subsystem = console
    dw 0                          ; [UNUSED] DLL characteristics
    dd 0x00100000                 ; [UNUSED] maximum stack size
    dd 0x00001000                 ; initial stack size
    dd 0x00100000                 ; maximum heap size
    dd 0x00001000                 ; [UNUSED] initial heap size
    dd 0                          ; [UNUSED] loader flags
    dd 16                         ; number of data directory entries
    times 16 dd 0,0               ; no entries in the data directories
OPT_HDR_SIZE equ $ - opt_hdr
ALL_HDR_SIZE equ $ - $$

;;;;;;;;;;;;;;;;;;;; .text ;;;;;;;;;;;;;;;;;

main:
    ; set up stack frame for local variables
    ; (ENTER 12,0 = push ebp / mov ebp, esp / sub esp, 12, encoded in 4 bytes)
    %define DummyVar      ebp-4
    %define kernel32base  ebp-8
    %define user32base    ebp-12
    enter 12, 0

    ; locate the loader data tables where the loaded DLLs are managed
    mov eax, [fs:0x30]     ; get PEB pointer from TEB
    mov eax, [eax+0x0C]    ; get PEB_LDR_DATA pointer from PEB
    mov eax, [eax+0x14]    ; go to first LDR_DATA_TABLE_ENTRY
    mov eax, [eax]         ; move two entries further, because the
    mov eax, [eax]         ; third is typically kernel32.dll
try_next_lib:
    push eax               ; save LDR_DATA_TABLE_ENTRY pointer
    mov ebx, [eax+0x10]    ; load base address of the library
    mov esi, N_LoadLibrary
    call find_import       ; load LoadLibrary from there (if present)
    or eax, eax            ; found?
    jnz kernel32_found
    pop eax                ; restore LDR_DATA_TABLE_ENTRY pointer
    mov eax, [eax]         ; go to next LDR_DATA_TABLE_ENTRY
    jmp try_next_lib


find_import:  ; FUNCTION that finds procedure [esi] in library at base [ebx]
    mov edx, [ebx+0x3c]    ; get PE header pointer (w/ RVA translation)
    add edx, ebx
    cmp word [edx], "PE"   ; is it a PE header?
    jne find_import_fail
    mov eax, [edx+0x74]    ; check if data directory is present
    or eax, eax
    jz find_import_fail
    mov edx, [edx+0x78]    ; get export table pointer RVA
    or edx, edx            ; check if export table is present
    jz find_import_fail
    add edx, ebx           ; get absolute address of export table
    push edx               ; store the export table address for later
    mov ecx, [edx+0x18]    ; ecx = number of named functions
    mov edx, [edx+0x20]    ; edx = address-of-names list (w/ RVA translation)
    add edx, ebx
name_loop:
    dec ecx                ; pre-decrement counter and check if we're done
    js find_import_fail1
    push esi               ; store the desired function name's pointer (we will clobber it)
    mov edi, [edx]         ; load function name (w/ RVA translation)
    add edi, ebx
cmp_loop:
    lodsb                  ; load a byte of the two strings into AL, AH
    mov ah, [edi]          ; and increase the pointers
    inc edi
    cmp al, ah             ; identical bytes?
    jne next_name          ; if not, this is not the correct name
    or al, al              ; zero byte reached?
    jnz cmp_loop           ; if not, we need to compare more
    ; if we arrive here, we have a match!
    pop esi                ; restore the name pointer (though we don't use it any longer)
    pop edx                ; restore the export table address
    sub ecx, [edx+0x18]    ; turn the negative counter ECX into a positive one
    neg ecx
    dec ecx
    mov eax, [edx+0x24]    ; get address of ordinal table (w/ RVA translation)
    add eax, ebx
    movzx ecx, word [eax+ecx*2]  ; load ordinal from table
    ;sub ecx, [edx+0x10]    ; subtract ordinal base
    mov eax, [edx+0x1C]    ; get address of function address table (w/ RVA translation)
    add eax, ebx
    mov eax, [eax+ecx*4]   ; load function address (w/ RVA translation)
    add eax, ebx
    ret
next_name:
    pop esi                ; restore the name pointer
    add edx, 4             ; advance to next list item
    jmp name_loop
find_import_fail1:
    pop eax                ; we still had one dword on the stack
find_import_fail:
    xor eax, eax
    ret


call_import:   ; FUNCTION that finds procedure [esi] in library at base [ebx] and calls it
    call find_import
    or eax, eax            ; found?
    jz critical_error      ; if not, we're screwed
    jmp eax                ; but if so, call the function


    ; back to the main program ...
kernel32_found:
    ; we found kernel32 (ebx) and LoadLibraryA (eax), so we can load user32.dll
    mov [kernel32base], ebx  ; store kernel32's base address
    push N_user32
    call eax               ; call LoadLibraryA
    or eax, eax            ; check the result
    jz error2
    mov [user32base], eax  ; store user32's base address

    ; if (!OpenClipboard(NULL)) ExitProcess(1);
    push 0
    mov ebx, eax           ; user32 base address was still in eax
    mov esi, N_OpenClipboard
    call call_import
    or eax, eax
    jz error2

    ; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
    push 1  ; CF_TEXT
;    mov ebx, [user32base]
    mov esi, N_GetClipboardData
    call call_import
    or eax, eax
    jz error
    push eax  ; save hData for GlobalUnlock at the end

    ; char* str = GlobalLock(hData); if (!str) fail;
    push eax
    mov ebx, [kernel32base]
    mov esi, N_GlobalLock
    call call_import
    or eax, eax
    jz error

    ; strlen(str)
    mov ecx, eax
strlen_loop:
    mov dl, [ecx]
    or dl, dl
    jz strlen_end
    inc ecx
    jmp strlen_loop
strlen_end:
    sub ecx, eax

    ; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
    push 0              ; lpOverlapped = NULL
    lea edx, [DummyVar] ; lpBytesWritten
    push edx
    push ecx            ; nNumberOfBytesToWrite = strlen(str)
    push eax            ; lpBuffer = str
    push -11            ; hFile = ...
;    mov ebx, [kernel32base]
    mov esi, N_GetStdHandle
    call call_import    ;     ... GetStdHandle(STD_OUTPUT_HANDLE)
    push eax
;    mov ebx, [kernel32base]
    mov esi, N_WriteFile
    call call_import

    ; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
;    mov ebx, [kernel32base]
    mov esi, N_GlobalUnlock
    call call_import    ; hData is already on the stack

    mov ebx, [user32base]
    mov esi, N_CloseClipboard
    call call_import

    push 0
    jmp exit

error:
    mov ebx, [user32base]
    mov esi, N_CloseClipboard
    call call_import
error2:
    push 1
exit:
    mov ebx, [kernel32base]
    mov esi, N_ExitProcess
    jmp call_import
critical_error:
    ret

N_user32:           db "user32.dll",0
N_LoadLibrary:      db "LoadLibraryA", 0
N_OpenClipboard:    db "OpenClipboard",0
N_GetClipboardData: db "GetClipboardData",0
N_GlobalLock:       db "GlobalLock",0
N_GetStdHandle:     db "GetStdHandle",0
N_WriteFile:        db "WriteFile",0
N_GlobalUnlock:     db "GlobalUnlock",0
N_CloseClipboard:   db "CloseClipboard",0
N_ExitProcess:      db "ExitProcess",0

align ALIGNMENT, db 0
the_end:

That’s quite a lot of work, but at least we can save another 25% and get down to 768 bytes. This comes at the expense of runtime performance, though, because our homegrown GetProcAddress implementation is not nearly as efficient as Windows’ original one: We simply scan all function names (of which there are over 1600 in kernel32.dll), while the proper loader uses binary search to speed things up. But we’re talking about a few hundred microseconds here; loading and running an executable at all takes an order of magnitude more time than that.
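
To make the lookup more tangible, here is a minimal C sketch of the same linear scan over three parallel tables. The export_table struct, the demo_* arrays and all values in them are made up for illustration and do not come from a real DLL; only the order of the steps (name index, then ordinal table, then function address table) mirrors find_import.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock of the three parallel export tables (hypothetical data, not a real DLL). */
typedef struct {
    const char *const *names;      /* address-of-names list, sorted alphabetically */
    const uint16_t    *ordinals;   /* name ordinal table, same order as names */
    const uint32_t    *functions;  /* function address table, indexed by ordinal */
    size_t             name_count; /* number of named exports */
} export_table;

/* Linear scan like find_import: returns the function "address", or 0 if not found. */
static uint32_t find_export(const export_table *t, const char *wanted)
{
    assert(t != NULL && wanted != NULL);
    for (size_t i = 0; i < t->name_count; i++) {
        if (strcmp(t->names[i], wanted) == 0) {
            /* The value from the ordinal table is used as the index directly,
               without subtracting the ordinal base (as in the assembly above). */
            return t->functions[t->ordinals[i]];
        }
    }
    return 0;
}

/* Illustrative demo data: three names whose ordinals are deliberately shuffled. */
static const char *const demo_names[]     = { "GlobalLock", "LoadLibraryA", "WriteFile" };
static const uint16_t    demo_ordinals[]  = { 2, 0, 1 };
static const uint32_t    demo_functions[] = { 0x1000, 0x2000, 0x3000 };
static const export_table demo_table = { demo_names, demo_ordinals, demo_functions, 3 };
```

With the demo data, find_export(&demo_table, "LoadLibraryA") yields 0x1000, and an unknown name yields 0. Swapping the loop for a binary search would be possible in principle, but it would presumably cost more code bytes than the few microseconds are worth here.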

Import by hash

Of the 768 bytes in the sectionless version, 118 bytes (15%!) are spent on function names. That seems a little excessive, doesn’t it? After all, we’re not really interested in the names themselves, we just use them to find the functions’ addresses. As a first try, we could limit the length of the stored strings by only comparing the first, say, 7 characters. We won’t be able to discern LoadLibraryA from its Unicode cousin LoadLibraryW this way, but since the names are guaranteed to be alphabetically sorted in export tables, we would hit LoadLibraryA first anyway. However, we can’t use fewer than 7 significant bytes, because otherwise e.g. GlobalLock would be too unspecific and we would get GlobalAddAtomA instead.
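
The effect of the cut-off length can be tried out with a small C sketch. The export_names array is only a hand-picked excerpt (the real table has many more entries in between), and find_by_prefix is a hypothetical helper, not part of the assembly source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Small alphabetically sorted excerpt of kernel32.dll export names
   (illustrative only; the real table has many more entries in between). */
static const char *const export_names[] = {
    "GlobalAddAtomA", "GlobalAlloc", "GlobalLock", "GlobalUnlock",
    "LoadLibraryA", "LoadLibraryW",
};
#define EXPORT_COUNT (sizeof export_names / sizeof export_names[0])

/* Returns the first export whose first `significant` characters match `wanted`,
   or NULL if none does. */
static const char *find_by_prefix(const char *wanted, size_t significant)
{
    assert(wanted != NULL && significant > 0);
    for (size_t i = 0; i < EXPORT_COUNT; i++) {
        if (strncmp(export_names[i], wanted, significant) == 0)
            return export_names[i];
    }
    return NULL;
}
```

With 7 significant characters, find_by_prefix("GlobalLock", 7) returns GlobalLock and find_by_prefix("LoadLibraryA", 7) returns LoadLibraryA. With only 6, the first call already stops at GlobalAddAtomA.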

But 7 bytes per import is still quite some data, and the whole approach is a forward compatibility timebomb, because future versions of Windows could add new functions to our two DLLs with catastrophic effect. So, truncating names is not the best path to follow. However, there’s a much more powerful alternative: Hashing! As said, we’re not interested in the names, not even in parts of them. A machine-readable mapping that can uniquely identify the proper function name without actually knowing it is sufficient; bonus points if it’s easy to compute. (For our purposes, we don’t need a cryptographically strong hash or anything fancy, we just want to tell a few function names apart!)

Long story short, such mappings exist. In our example, we’ll use a simple »rotate-and-xor« hash. The algorithm uses a 32-bit accumulator register. For each character of the function name, two operations are performed (in any order): The character’s ASCII code is XOR’ed into the register (addition would be possible as well), and the register is rotated by a fixed (and ideally prime) number of bits. This can be computed in two x86 instructions per character, and is able to map all names of the two DLLs in question (and also various others I tested with) into 32-bit hashes without any collisions. Another nice property is that the hash can be computed in reverse: We can store the start value of the accumulator, and a match is detected when after processing all characters of a function name, the accumulator becomes zero. (We could live without that, but it simplifies the implementation a tiny bit.)
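
For those who want to compute such start values themselves, here is a minimal C sketch. The function names hash_run and hash_start are my own labels; the rotation amount of 7 and the fact that the terminating zero byte is processed as well are taken from the assembly loop shown in the diff below.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t rol32(uint32_t v, unsigned n) { return (v << n) | (v >> (32 - n)); }
static uint32_t ror32(uint32_t v, unsigned n) { return (v >> n) | (v << (32 - n)); }

/* Forward pass, as done at runtime: XOR each byte (including the terminating
   zero) into the accumulator, then rotate left by 7. A result of 0 means "match". */
static uint32_t hash_run(uint32_t acc, const char *name)
{
    assert(name != NULL);
    const unsigned char *p = (const unsigned char *)name;
    unsigned char c;
    do {
        c = *p++;
        acc = rol32(acc ^ c, 7);
    } while (c != 0);
    return acc;
}

/* Build-time pass: undo the steps back to front, starting from 0,
   to get the value that has to be loaded into the accumulator. */
static uint32_t hash_start(const char *name)
{
    assert(name != NULL);
    uint32_t acc = 0;
    size_t i = strlen(name) + 1;   /* include the terminating zero byte */
    while (i-- > 0)
        acc = ror32(acc, 7) ^ (unsigned char)name[i];
    return acc;
}
```

hash_start("LoadLibraryA") gives 0x01364564, and feeding that value together with the same name into hash_run ends at zero, while any other name leaves a non-zero remainder.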

This modification can be applied to the existing implementation quite easily, so here’s again just a diff:

@@ -95,5 +89,5 @@
     mov ebx, [eax+0x10]    ; load base address of the library
-    mov esi, N_LoadLibrary
+    mov esi, 0x01364564    ; hash of "LoadLibraryA"
     call find_import       ; load LoadLibrary from there (if present)
@@ -123,15 +117,16 @@
 cmp_loop:
-    lodsb                  ; load a byte of the two strings into AL, AH
-    mov ah, [edi]          ; and increase the pointers
-    inc edi
-    cmp al, ah             ; identical bytes?
-    jne next_name          ; if not, this is not the correct name
-    or al, al              ; zero byte reached?
-    jnz cmp_loop           ; if not, we need to compare more
+    movzx eax, byte [edi]  ; load a byte of the name ...
+    inc edi                ; ... and advance the pointer
+    xor esi, eax           ; apply xor-and-rotate
+    rol esi, 7
+    or eax, eax            ; last byte?
+    jnz cmp_loop           ; if not, process another byte
+    or esi, esi            ; result hash match?
+    jnz next_name          ; if not, this is not the correct name
     ; if we arrive here, we have a match!
@@ -180,5 +175,5 @@
     push 0
     mov ebx, eax           ; user32 base address was still in eax
-    mov esi, N_OpenClipboard
+    mov esi, 0xFC7956AD    ; hash of "OpenClipboard"
     call call_import
@@ -188,5 +183,5 @@
     push 1  ; CF_TEXT
 ;    mov ebx, [user32base]
-    mov esi, N_GetClipboardData
+    mov esi, 0x0C473D74    ; hash of "GetClipboardData"
     call call_import
     or eax, eax
@@ -197,5 +192,5 @@
     mov ebx, [kernel32base]
-    mov esi, N_GlobalLock
+    mov esi, 0x4A88F58C    ; hash of "GlobalLock"
     call call_import
@@ -221,18 +216,18 @@
 ;    mov ebx, [kernel32base]
-    mov esi, N_GetStdHandle
+    mov esi, 0xEACA71C2 ; hash of "GetStdHandle"
     call call_import    ;     ... GetStdHandle(STD_OUTPUT_HANDLE)
     push eax
 ;    mov ebx, [kernel32base]
-    mov esi, N_WriteFile
+    mov esi, 0x3FD1C30F ; hash of "WriteFile"
     call call_import

     ; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
 ;    mov ebx, [kernel32base]
-    mov esi, N_GlobalUnlock
+    mov esi, 0xC3907A85 ; hash of "GlobalUnlock"
     call call_import    ; hData is already on the stack

     mov ebx, [user32base]
-    mov esi, N_CloseClipboard
+    mov esi, 0x1D84425E ; hash of "CloseClipboard"
     call call_import
@@ -242,5 +237,5 @@
 error:
     mov ebx, [user32base]
-    mov esi, N_CloseClipboard
+    mov esi, 0x1D84425E ; hash of "CloseClipboard"
     call call_import
@@ -248,5 +243,5 @@
 exit:
     mov ebx, [kernel32base]
-    mov esi, N_ExitProcess
+    mov esi, 0x665640AC ; hash of "ExitProcess"
     jmp call_import
 critical_error:
@@ -254,13 +249,4 @@
 N_user32:           db "user32.dll",0
-N_LoadLibrary:      db "LoadLibraryA", 0
-N_OpenClipboard:    db "OpenClipboard",0
-N_GetClipboardData: db "GetClipboardData",0
-N_GlobalLock:       db "GlobalLock",0
-N_GetStdHandle:     db "GetStdHandle",0
-N_WriteFile:        db "WriteFile",0
-N_GlobalUnlock:     db "GlobalUnlock",0
-N_CloseClipboard:   db "CloseClipboard",0
-N_ExitProcess:      db "ExitProcess",0

The result is 656 bytes, 112 bytes less than the version without import-by-hash. It’s not quite the optimal amount of savings (which would be 118 bytes, the size of the name strings) because the comparison grew a little bit, but still quite an impressive result.

Header trickery

Before our short excursion into the land of hashes, we worked hard on bypassing the alignment limits, but still there’s a lot of space spent in the PE headers. One trivial thing is to remove the data directory as we don’t even have table-based imports by now. But that’s not all: Fortunately, there are many fields in the headers that aren’t evaluated by the Windows loader where we can put other stuff in. The largest part of this is the 64-byte DOS header at the beginning, of which only the first two bytes (the »MZ« signature) and the last four bytes (the address of the PE header) are important. We can actually move (»collapse«) the PE header inside the DOS header, all the way until offset 4 (which is the minimum alignment requirement). In this case, the PE header location field of the DOS header coincides with the section alignment field of the PE header, so we get a section (and file) alignment of 4 – perfect!
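
The overlap is plain offset arithmetic, which the following C sketch spells out. The constant and function names are my own; the sizes are those of the PE32 layout (4-byte signature, 20-byte COFF header, section alignment at offset 32 of the optional header).

```c
#include <assert.h>
#include <stdint.h>

enum {
    DOS_E_LFANEW_OFFSET  = 0x3C, /* last dword of the 64-byte DOS header */
    PE_SIGNATURE_SIZE    = 4,    /* "PE\0\0" */
    COFF_HEADER_SIZE     = 20,   /* machine ... characteristics */
    OPT_SECTALIGN_OFFSET = 32,   /* SectionAlignment within the PE32 optional header */
    OPT_FILEALIGN_OFFSET = 36    /* FileAlignment, directly behind it */
};

/* File offset of an optional header field when the PE header starts at pe_offset. */
static uint32_t opt_field_offset(uint32_t pe_offset, uint32_t field_offset)
{
    assert(pe_offset % 4 == 0);  /* the PE header must be dword-aligned */
    return pe_offset + PE_SIGNATURE_SIZE + COFF_HEADER_SIZE + field_offset;
}
```

opt_field_offset(4, OPT_SECTALIGN_OFFSET) evaluates to 0x3C, so the dword stored there is read both as the PE header location and as the section alignment, and both have to be 4. The file alignment follows at 0x40, just past the end of the DOS header.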

Runs of other unused fields in the header can be used to put the last remaining string (»user32.dll«) and even code into. The latter is a bit complicated, because the code sequence must fit into the slot of unused fields, and if you’re unlucky, it might grow when moving into the header if a jump that used to be short (2 bytes, 8-bit displacement) has to be turned into a near jump (5 bytes, 32-bit displacement) because the distance between jump site and target has become too large. I didn’t manage to fit a lot of code into the headers, but at least there’s something.
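
How much a jump costs can be estimated up front. The C sketch below assumes that the assembler picks the shortest encoding of an unconditional jmp in 32-bit code (EB with an 8-bit displacement, otherwise E9 with a 32-bit displacement); jmp_size is a hypothetical helper for illustration, not something the assembler exposes.

```c
#include <stdint.h>

enum { SHORT_JMP_SIZE = 2, NEAR_JMP_SIZE = 5 };  /* EB rel8 / E9 rel32 */

/* Size in bytes of an unconditional relative jmp located at address `from`
   that goes to `target`, using the short form whenever the displacement fits. */
static int jmp_size(uint32_t from, uint32_t target)
{
    /* The displacement is counted from the end of the jump instruction. */
    int64_t disp = (int64_t)target - ((int64_t)from + SHORT_JMP_SIZE);
    if (disp >= -128 && disp <= 127)
        return SHORT_JMP_SIZE;
    return NEAR_JMP_SIZE;
}
```

A jump from 0x1000 to 0x1010 fits into 2 bytes, whereas a jump from 0x1200 back to 0x1000 needs 5. Three extra bytes are enough to make a snippet overflow an 8-byte header slot.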

The following dump is what the headers now look like. The main part is the same, except that the blocks that have been moved into the headers (N_user32, next_name and parts of main) are now obviously gone:

mz_hdr:
    dw "MZ"                       ; DOS magic
    dw "kj"                       ; filler to align the PE header

pe_hdr:
    dw "PE",0                     ; PE magic + 2 padding bytes
    dw 0x014c                     ; i386 architecture
    dw 0                          ; no sections
N_user32: db "user32.dll",0,0  ; 12 bytes of data collapsed into the header
   ;dd 0                          ; [UNUSED-12] timestamp
   ;dd 0                          ; [UNUSED] symbol table pointer
   ;dd 0                          ; [UNUSED] symbol count
    dw 8                          ; optional header size
    dw 0x0102                     ; characteristics: 32-bit, executable

opt_hdr:
    dw 0x010b                     ; optional header magic
main_part_1:  ; 12 bytes of main entry point + 2 bytes of jump
    mov eax, [fs:0x30]     ; get PEB pointer from TEB
    mov eax, [eax+0x0C]    ; get PEB_LDR_DATA pointer from PEB
    mov eax, [eax+0x14]    ; go to first LDR_DATA_TABLE_ENTRY
    jmp main_part_2
    align 4, db 0
   ;db 13,37                      ; [UNUSED-14] linker version
   ;dd RVA(the_end)               ; [UNUSED] code size
   ;dd RVA(the_end)               ; [UNUSED] size of initialized data
   ;dd 0                          ; [UNUSED] size of uninitialized data
    dd RVA(main_part_1)           ; entry point address
main_part_2:  ; another 6 bytes of code + 2 bytes of jump
    ; set up stack frame for local variables
    push ebp
    %define DummyVar      ebp-4
    %define kernel32base  ebp-8
    %define user32base    ebp-12
    sub esp, 12
    mov eax, [eax]         ; go to where ntdll.dll typically is
    jmp main_part_3
    align 4, db 0
   ;dd RVA(main)                  ; [UNUSED-8] base of code
   ;dd RVA(main)                  ; [UNUSED] base of data
    dd BASE                       ; image base
    dd SECTALIGN                  ; section alignment (collapsed with the
                                  ; PE header offset in the DOS header)
    dd ALIGNMENT                  ; file alignment
next_name:  ; we interrupt again for a few bytes of code from the loader
    pop esi                ; restore the hash (the comparison clobbered it)
    add edx, 4             ; advance to next list item
    jmp name_loop
    align 4, db 0
   ;dw 4,0                        ; [UNUSED-8] OS version
   ;dw 0,0                        ; [UNUSED] image version
    dw 4,0                        ; subsystem version
    dd 0                          ; [UNUSED-4] Win32 version
    dd RVA(the_end)               ; size of image
    dd RVA(opt_hdr)               ; size of headers (must be small enough
                                  ; so that entry point inside header is accepted)
    dd 0                          ; [UNUSED-4] checksum
    dw 3                          ; subsystem = console
    dw 0                          ; [UNUSED-2] DLL characteristics
    dd 0x00100000                 ; maximum stack size
    dd 0x00001000                 ; initial stack size
    dd 0x00100000                 ; maximum heap size
    dd 0x00001000                 ; initial heap size
    dd 0                          ; [UNUSED-4] loader flags
    dd 0                          ; number of data directory entries (= none!)
OPT_HDR_SIZE equ $ - opt_hdr
ALL_HDR_SIZE equ $ - $$

;;;;;;;;;;;;;;;;;;;; .text ;;;;;;;;;;;;;;;;;

main_part_3:
    mov eax, [eax]         ; go to where kernel32.dll typically is
try_next_lib:
; (from here on, not much has changed)

With this, we’re at 436 bytes, a whopping 33% less than before! The downside is that the header declarations in the source code have become quite unreadable, and that we’re no longer forward compatible: A future version of Windows might decide that the OS version listed in the EXE file is now totally relevant and may thus not want to execute files made for version »33630.1068«.

Unsafe optimizations

All along the way, we were cautious not to remove any checks and clean exits in case of failure. But we’re already relying on a few details of the PE loader that are unlikely to change soon, but are not carved into stone either. So why not go full YOLO and strip off all the safety nets? We could assume that …

… kernel32.dll always is the third image loaded (after our own executable and ntdll.dll).
… the kernel32.dll image is a proper PE image with all headers and directory entries in their usual places.
… all imported functions actually exist.
… uninitialization (GlobalUnlock, CloseClipboard) is not necessary, because the system cleans up our mess anyway when the process exits.
… GlobalLock is a no-operation that can be omitted completely, because the HGLOBAL that is returned by GetClipboardData is already a bona fide pointer.

This allows us to rip out a good chunk of code. For example, we don’t need to separate find_import and call_import any longer, because we’ll no longer check whether a function exists; if we want to look up a function, we’re always going to call it as well. Furthermore, the order of the loader and main code has been shuffled around a bit to make jumps as short as possible, and the code snippets used to fill the unused header fields are slightly different:

bits 32
BASE      equ 0x00400000
ALIGNMENT equ 4
SECTALIGN equ 4

%define ROUND(v, a) (((v + a - 1) / a) * a)
%define ALIGNED(v) (ROUND(v, ALIGNMENT))
%define RVA(obj) (obj - BASE)

org BASE

mz_hdr:
    dw "MZ"                       ; DOS magic
    dw "kj"                       ; filler to align the PE header

pe_hdr:
    dw "PE",0                     ; PE magic + 2 padding bytes
    dw 0x014c                     ; i386 architecture
    dw 0                          ; no sections
N_user32: db "user32.dll",0,0  ; 12 bytes of data collapsed into the header
   ;dd 0                          ; [UNUSED-12] timestamp
   ;dd 0                          ; [UNUSED] symbol table pointer
   ;dd 0                          ; [UNUSED] symbol count
    dw 8                          ; optional header size
    dw 0x0102                     ; characteristics: 32-bit, executable

opt_hdr:
    dw 0x010b                     ; optional header magic
main_part_1:  ; 12 bytes of main entry point + 2 bytes of jump
    mov eax, [fs:0x30]     ; get PEB pointer from TEB
    mov eax, [eax+0x0C]    ; get PEB_LDR_DATA pointer from PEB
    mov eax, [eax+0x14]    ; go to first LDR_DATA_TABLE_ENTRY
    jmp main_part_2
    align 4, db 0
   ;db 13,37                      ; [UNUSED-14] linker version
   ;dd RVA(the_end)               ; [UNUSED] code size
   ;dd RVA(the_end)               ; [UNUSED] size of initialized data
   ;dd 0                          ; [UNUSED] size of uninitialized data
    dd RVA(main_part_1)           ; entry point address
main_part_2:  ; another 6 bytes of code + 2 bytes of jump
    ; set up stack frame for local variables
    push ebp
    %define DummyVar      ebp-4
    %define kernel32base  ebp-8
    %define user32base    ebp-12
    sub esp, 12
    mov eax, [eax]         ; go to where ntdll.dll typically is
    jmp main_part_3
    align 4, db 0
   ;dd RVA(main)                  ; [UNUSED-8] base of code
   ;dd RVA(main)                  ; [UNUSED] base of data
    dd BASE                       ; image base
    dd SECTALIGN                  ; section alignment (collapsed with the
                                  ; PE header offset in the DOS header)
    dd ALIGNMENT                  ; file alignment
main_part_3:  ; another 5 bytes of code + 2 bytes of jump
    mov eax, [eax]         ; go to where kernel32.dll typically is
    mov ebx, [eax+0x10]    ; load base address of the library
    jmp main_part_4
    align 4, db 0
   ;dw 4,0                        ; [UNUSED-8] OS version
   ;dw 0,0                        ; [UNUSED] image version
    dw 4,0                        ; subsystem version
    dd 0                          ; [UNUSED-4] Win32 version
    dd RVA(the_end)               ; size of image
    dd RVA(opt_hdr)               ; size of headers (must be small enough
                                  ; so that entry point inside header is accepted)
    dd 0                          ; [UNUSED-4] checksum
    dw 3                          ; subsystem = console
    dw 0                          ; [UNUSED-2] DLL characteristics
    dd 0x00100000                 ; maximum stack size
    dd 0x00001000                 ; initial stack size
    dd 0x00100000                 ; maximum heap size
    dd 0x00001000                 ; initial heap size
    dd 0                          ; [UNUSED-4] loader flags
    dd 0                          ; number of data directory entries (= none!)
OPT_HDR_SIZE equ $ - opt_hdr
ALL_HDR_SIZE equ $ - $$

main_part_4:
    mov [kernel32base], ebx  ; store kernel32's base address
    mov esi, 0x01364564    ; hash of "LoadLibraryA"
    push N_user32          ; we want to load user32.dll
    call call_import       ; call LoadLibraryA
    mov [user32base], eax  ; store user32's base address

    ; if (!OpenClipboard(NULL)) ExitProcess(1);
    push 0
    mov ebx, eax           ; user32 base address was still in eax
    mov esi, 0xFC7956AD    ; hash of "OpenClipboard"
    call call_import
    or eax, eax
    jz error

    ; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
    push 1  ; CF_TEXT
;    mov ebx, [user32base]
    mov esi, 0x0C473D74    ; hash of "GetClipboardData"
    call call_import
    or eax, eax
    jz error

    ; strlen(str)
    mov ecx, eax
strlen_loop:
    mov dl, [ecx]
    or dl, dl
    jz strlen_end
    inc ecx
    jmp strlen_loop
strlen_end:
    sub ecx, eax

    ; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
    push 0              ; lpOverlapped = NULL
    lea edx, [DummyVar] ; lpBytesWritten
    push edx
    push ecx            ; nNumberOfBytesToWrite = strlen(str)
    push eax            ; lpBuffer = str
    push -11            ; hFile = ...
    mov ebx, [kernel32base]
    mov esi, 0xEACA71C2 ; hash of "GetStdHandle"
    call call_import    ;     ... GetStdHandle(STD_OUTPUT_HANDLE)
    push eax
;    mov ebx, [kernel32base]
    mov esi, 0x3FD1C30F ; hash of "WriteFile"
    call call_import

    ; ExitProcess(0);
    push 0
    jmp exit
error:
    push 1
exit:
    mov ebx, [kernel32base]
    mov esi, 0x665640AC ; hash of "ExitProcess"
    ; fall-through into call_import


call_import:  ; FUNCTION that calls procedure [esi] in library at base [ebx]
    mov edx, [ebx+0x3c]    ; get PE header pointer (w/ RVA translation)
    add edx, ebx
    mov edx, [edx+0x78]    ; get export table pointer RVA (w/ RVA translation)
    add edx, ebx
    push edx               ; store the export table address for later
    mov ecx, [edx+0x18]    ; ecx = number of named functions
    mov edx, [edx+0x20]    ; edx = address-of-names list (w/ RVA translation)
    add edx, ebx
name_loop:
    push esi               ; store the desired function name's hash (we will clobber it)
    mov edi, [edx]         ; load function name (w/ RVA translation)
    add edi, ebx
cmp_loop:
    movzx eax, byte [edi]  ; load a byte of the name ...
    inc edi                ; ... and advance the pointer
    xor esi, eax           ; apply xor-and-rotate
    rol esi, 7
    or eax, eax            ; last byte?
    jnz cmp_loop           ; if not, process another byte
    or esi, esi            ; result hash match?
    jnz next_name          ; if not, this is not the correct name
    ; if we arrive here, we have a match!
    pop esi                ; restore the hash (though we don't use it any longer)
    pop edx                ; restore the export table address
    sub ecx, [edx+0x18]    ; turn the negative counter ECX into a positive one
    neg ecx
    mov eax, [edx+0x24]    ; get address of ordinal table (w/ RVA translation)
    add eax, ebx
    movzx ecx, word [eax+ecx*2]  ; load ordinal from table
    ;sub ecx, [edx+0x10]    ; subtract ordinal base
    mov eax, [edx+0x1C]    ; get address of function address table (w/ RVA translation)
    add eax, ebx
    mov eax, [eax+ecx*4]   ; load function address (w/ RVA translation)
    add eax, ebx
    jmp eax                ; jump to the target function
next_name:
    pop esi                ; restore the hash for the next comparison
    add edx, 4             ; advance to next list item
    dec ecx                ; decrease counter
    jmp name_loop

align ALIGNMENT, db 0
the_end:

The final result with this is 316 bytes, another 27% less than before!

Conclusion

This concludes our journey into size optimization. At this point, we’re 240 times smaller than the naïve first C implementation, and even if we consider our first serious optimization step (the C implementation without C library) as the starting point, we’re still almost 10 times smaller. But admittedly, the amount of effort necessary for this is extremely high and hardly justified ;)

You can download all the source files of this little experiment if you’re interested.

I’m not going to claim that my implementation is the smallest possible, most efficient or best-on-any-other-axis one. I’m not a seasoned sizecoder at that low level (usually I stop at the »get rid of the C library« step). What also concerns me is that I had to implement the export table parser differently from all documentation I could find on the subject (including Microsoft’s official PE specification) by not subtracting the base ordinal from the value in the name ordinal table to get the function address table index. So if you have any explanations or improvement ideas, let me know.

Update (2017-09-09): As a commenter pointed out, some of the executables didn’t run on Windows 7 x64. I figured out what the issue was and updated the post and the download file accordingly – see the last paragraph before the code sample in the »going sectionless« chapter for details.

Posted in Computer Fun, Hacks | 11 Comments ...

11 Responses to »Writing ultra-small Windows executables«

widge (2017-09-09 19:07)

Thank you for the enjoyable read.

I have tried the ASM files on my PC with Windows 7 x64.

getclip_pe_v1_2sections.asm would not assemble with nasm-2.13.01-win32 so I switched to yasm-1.3.0-win32 which seems to be what you are using.

All of them would run except three:
– getclip_pe_v2_1section detected by Avira Antivir as TR/Crypt.XPACK.Gen and put in quarantine
– getclip_pe_v5_collapse, The application was unable to start correctly (0xc0000005)
– getclip_pe_v6_unsafe, The application was unable to start correctly (0xc0000005)

KeyJ (2017-09-09 23:56)

widge: Thanks for the information! I dug into the crashing issue (quite a rabbit hole, I tell ya!) and fixed it.
Regarding the anti-virus warning, there’s not much we can do about that. Anti-virus software is inherently broken and just loves to interfere with all kinds of size-coding :(

Tony Walker (2020-02-03 08:18)

KeyJ,
This is such a wonderful article.
Nothing like a good yak shaving :-)
Thank you – I learned a lot

Gregory Morse (2020-04-16 23:58)

Your C library code calls ExitProcess(1) twice instead of returning 1 from main which is a slight inconsistency. Also the assembly function calls ExitProcess twice instead of jump to a label for it which would save 5-2 or 3 bytes. More hand tricks could reduce even a few more bytes off the asm though I realize the focus here was the PE limits.

Most of the games with the LoadLibrary and GetProcAddress lookups are too dangerous for anything but playing around unless regressing them on all Windows versions including 32/64 bit x86 and Itanium, 7, 8 and 10 or more like XP, etc while constantly monitoring 10 updates. Also you could simplify hashing to 16 bits or any bizarre tricks that seem to always work. The same for some of the really unusual header compression tricks.

Basically you have created the closest as possible to COM files for Windows almost and if this could work on every flavor of Windows starting from 95 it would at least be neat as you could start coding as soon as the 2 key kernel32 functions become available.

KeyJ (2020-04-18 22:58)

You’re right, mov eax, returncode; ret should do the trick too. I heard somewhere that ExitProcess is mandatory though, so I kept it without thinking twice. Maybe I should revisit that, including re-testing on the relevant platforms — speaking of which, I wouldn’t count anything older than Windows 7 as a target, and even that is debatable by now. Same goes for Itanium; I’d rather see WINE as a valid target for execution than that ;)

All that being said: No, this isn’t something that I would recommend to use for any productive purposes. If you’re concerned with any kind of compatibility (forwards, backwards, sidewards), it goes without saying that you should stop at the point where any of the PE header fields are misused for anything.

Onelio (2021-03-28 22:46)

I was searching for this very thing.
Thank you very much! It helped me a lot to understand how the format works and why section alignment is so important!

stasoid (2021-06-19 09:39)

268-byter with imports for win10

KeyJ (2021-06-19 21:40)

stasoid: Wow, that’s interesting. I’m pretty sure that I tried a section alignment of 4 when I did this research four years ago, and it didn’t work. Good to know that it does now!
Import by hash still has its merits though, because import by name is larger than any of the alternatives and import by ordinal is notoriously fragile (a system update may shuffle around the ordinals, and you’re screwed).

Samir Ribić (2023-01-18 10:31)

It seems that Windows 11 allows sections aligned to less than 512 bytes but they must not be .data or .bss. Without resorting to assembly language we can get 1056 bytes.

I will take your second version in C, but instead Microsoft C, we can try gcc . I have used mingw64 version 4.9.2, which is bundled with DevCpp IDE (of course, I have prepared path environment variable to call it from the command line).
After calling

gcc -m32 -mconsole -nostdlib -nostartfiles -Os -Wall -s getclip.c -lkernel32 -luser32 -Wl,-e_mainCRTStartup,--section-alignment,16,-file-alignment,16 -o getclip

getclip.exe is now 1056 bytes and I have checked it on Windows 11 (64 bit), Windows 10 (64 bit) and Windows XP (32 bit). It contains 3 sections: .text, .rdata and .idata, and imports only kernel32.dll and user32.dll

However, if you just move DWORD dummy declaration outside mainCRTStartup into global scope, the executable is no more compatible with Windows 10, because such action creates .bss section.

KeyJ (2023-01-18 10:54)

Samir Ribić: Thanks a lot for your research! I didn’t know that Windows makes such a difference based on the writeability of a section, but with your description, it makes perfect sense.

anzhel (2024-08-09 17:14)

This is a very cool article! Thanks!!
