Writing ultra-small Windows executables
How small can a valid and useful Win32 executable be? There are already a few tutorials on this topic, but they either no longer work on modern Windows versions or only cover the most basic »do nothing but return zero« program. The goal here is to do something genuinely useful: a console application that writes the contents of the Windows clipboard to standard output.
The target application
The target application is a little more than a simple »Hello World« program, but it is still very basic Win32 API 101, with only a few calls to kernel32.dll and user32.dll functions and very little algorithmic work in between. Still, it’s undeniably useful, e.g. when you want to filter some text from an editor through (Win32 versions of) sed, grep, tr, cut, awk or similar command-line tools. Or, and this is what I use it for most often, to quickly change the console’s current directory to one you happen to have open in Explorer. Besides copying the path from Explorer, you would normally need to type »cd /d « in the command prompt window, followed by a right-click and Enter. (The awkward /d parameter is important; without it, the current drive letter would not be changed along with the directory.) That’s cumbersome, so I use a small batch file (called fcd.cmd in my case) which resides in a directory somewhere in the PATH and does this automatically:
@for /f "usebackq tokens=*" %%a in (`getclip`) do @cd /d %%a
It calls the getclip.exe helper program to get the clipboard’s contents and crafts a cd /d command with it (using cmd.exe’s byzantine syntax to make the backticks work their usual magic). I already had such a helper program, but it was part of a GnuWin32 installation that requires a bunch of obscure DLLs to work; for my new Windows installation, I didn’t want to use this cruft again. However, there’s no simple alternative either: Windows ships with clip.exe, but this only works in the opposite direction, putting data from a pipe into the clipboard, not getting it out. A quick internet search only turned up solutions that had even more ridiculous dependencies. So I decided to shave a yak, er, take matters into my own hands and write my own simple, small implementation of getclip.exe. Couldn’t be that hard.
The naïve C implementation
The obvious first try is to write a C program, like this:
    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (!OpenClipboard(NULL)) {
            ExitProcess(1);
        }
        HANDLE hData = GetClipboardData(CF_TEXT);
        if (!hData) {
            CloseClipboard();
            ExitProcess(1);
        }
        const char *str = (const char*)GlobalLock(hData);
        if (!str) {
            CloseClipboard();
            ExitProcess(1);
        }
        fwrite((const void*)str, 1, strlen(str), stdout);
        GlobalUnlock(hData);
        CloseClipboard();
        return 0;
    }
Note the use of fwrite instead of puts to avoid appending an extra newline to the output. Other than that, it’s fairly basic stuff: opening the clipboard, requesting the data as plain ANSI text (CF_TEXT), locking the data handle to obtain a pointer to the text, writing it to stdout and releasing all the resources we acquired.
Using Visual Studio 2017 (configured to optimize for size) and the default Windows 8.1 SDK, this gives us an executable of 76800 bytes. Yes, that’s almost 77 kilobytes! We could get this down to 8704 bytes by linking dynamically against the C library DLL, but that’s cheating: This way, the user would require a bunch of DLLs in the multi-megabyte range to run it. We can do better than that.
That pesky C library
Looking closely at the code, it becomes obvious that this »C library tax« is not really necessary for this program: The only required library functions are fwrite and strlen; everything else is just plain Win32 API calls into kernel32.dll and user32.dll. fwrite to stdout can be trivially replaced by GetStdHandle and WriteFile, and strlen doesn’t even need a substitute because the compiler inlines it anyway. So let’s just get rid of the C library altogether and link with /NODEFAULTLIB. In doing so, we lose the luxury of a main function that has a working heap and gets the command line parsed into argc and argv, but we don’t need that anyway. We can instead name our main function mainCRTStartup, which is the default entry point of console-mode Windows executables, and leave it by calling ExitProcess. The whole program turns into this:
    #include <windows.h>
    #include <string.h>

    int mainCRTStartup(void) {
        if (!OpenClipboard(NULL)) {
            ExitProcess(1);
        }
        HANDLE hData = GetClipboardData(CF_TEXT);
        if (!hData) {
            CloseClipboard();
            ExitProcess(1);
        }
        const char *str = (const char*) GlobalLock(hData);
        if (!str) {
            CloseClipboard();
            ExitProcess(1);
        }
        DWORD dummy;
        WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), (const void*)str, strlen(str), &dummy, NULL);
        GlobalUnlock(hData);
        CloseClipboard();
        ExitProcess(0);
    }
That’s not too many changes, but the result is quite impressive: we’re down to 3072 bytes (3 KiB), with no dependencies other than the two mandatory DLLs! At this point, further optimization isn’t reasonable: We’re already below the page size, filesystem cluster size and (modern) disk sector size, all of which are at 4 KiB. If we shrink anything below these 4 KiB, we won’t save any memory, storage space or load time. So there, we have it – we pushed it as far as it makes sense!
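The »no further savings below 4 KiB« argument is plain rounding arithmetic, sketched here in C. The helper name `allocated_size` is made up for this illustration, and the sketch assumes a file system that hands out whole 4096-byte clusters; real cluster sizes depend on how the volume was formatted.

```c
#include <assert.h>
#include <stdint.h>

/* Round `size` up to the next multiple of `unit` (the cluster or page size).
   Hypothetical helper; it only models "space is handed out in whole units". */
uint32_t allocated_size(uint32_t size, uint32_t unit)
{
    assert(unit != 0);
    return ((size + unit - 1) / unit) * unit;
}
```

Under that assumption, the 3072-byte executable and any smaller variant all occupy one 4096-byte unit, while the original 76800-byte build needed 19 of them.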
But no, we won’t leave it at that! (At least, I won’t.) It may be a purely sporting exercise by now, but if we started the quest to make a getclip implementation as small as possible, we might just as well finish it! So let’s take the next step …
Assembly to the rescue
The actual code is simple enough to forego the comfort zone of C programming altogether and write it straight in assembly, like this (NASM/YASM syntax):
    global _mainCRTStartup

    extern _ExitProcess@4
    extern _OpenClipboard@4
    extern _CloseClipboard@0
    extern _GetClipboardData@4
    extern _GlobalLock@4
    extern _GlobalUnlock@4
    extern _GetStdHandle@4
    extern _WriteFile@20

    section .text

    _mainCRTStartup:
        ; set up stack frame for *lpBytesWritten
        push ebp
        sub esp, 4

        ; if (!OpenClipboard(NULL)) ExitProcess(1);
        push 0
        call _OpenClipboard@4
        or eax, eax
        jz error2

        ; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
        push 1  ; CF_TEXT
        call _GetClipboardData@4
        or eax, eax
        jz error
        push eax  ; save hData for GlobalUnlock at the end

        ; char* str = GlobalLock(hData); if (!str) fail;
        push eax
        call _GlobalLock@4
        or eax, eax
        jz error

        ; strlen(str)
        mov ecx, eax
    strlen_loop:
        mov dl, [ecx]
        or dl, dl
        jz strlen_end
        inc ecx
        jmp strlen_loop
    strlen_end:
        sub ecx, eax

        ; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
        push 0  ; lpOverlapped = NULL
        lea edx, [ebp-4]  ; put nBytesWritten on the stack
        push edx
        push ecx  ; nNumberOfBytesToWrite = strlen(str)
        push eax  ; lpBuffer = str
        push -11  ; hFile = ...
        call _GetStdHandle@4  ; ... GetStdHandle(STD_OUTPUT_HANDLE)
        push eax
        call _WriteFile@20

        ; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
        call _GlobalUnlock@4  ; hData is already on the stack
        call _CloseClipboard@0
        push 0
        call _ExitProcess@4

    error:
        call _CloseClipboard@0
    error2:
        push 1
        call _ExitProcess@4
Assembling this and linking it with Microsoft’s link.exe generates an executable of 2560 bytes. That might sound a bit disappointing (a mere 512 bytes saved for writing everything in assembly, come on!), but in fact it’s more or less expected: Code generated by a good C compiler is usually already very tight (it might even be better than my attempt; I didn’t check that, though), and once it is told to omit all C library dependencies, there’s not much additional cruft left that a compiler would produce but an assembler wouldn’t.
However, a closer look at the generated executable reveals lots of zeroes and all kinds of PE sections, including relocation information (which is not needed at all for non-ASLR executables) and (empty) debug information. There are no linker options (that I know of) that get rid of this, so we need to dig even deeper …
Constructing PE files by hand
Perhaps not the easiest, but certainly the most thorough way to stop any interference from the linker is not to use one at all and to write all the PE headers and sections directly in the assembler. Unfortunately, the PE format is not very simple and full of idiosyncrasies, so it takes some effort until a working binary emerges:
    bits 32

    BASE equ 0x00400000
    ALIGNMENT equ 512
    SECTALIGN equ 4096

    %define ROUND(v, a) (((v + a - 1) / a) * a)
    %define ALIGNED(v) (ROUND(v, ALIGNMENT))
    %define RVA(obj) (obj - BASE)

    section header progbits start=0 vstart=BASE

    mz_hdr:
        dw "MZ"  ; DOS magic
        times 0x3a db 0  ; [UNUSED] DOS header
        dd RVA(pe_hdr)  ; address of PE header

    pe_hdr:
        dw "PE",0  ; PE magic + 2 padding bytes
        dw 0x014c  ; i386 architecture
        dw 2  ; two sections
        dd 0  ; [UNUSED] timestamp
        dd 0  ; [UNUSED] symbol table pointer
        dd 0  ; [UNUSED] symbol count
        dw OPT_HDR_SIZE  ; optional header size
        dw 0x0102  ; characteristics: 32-bit, executable

    opt_hdr:
        dw 0x010b  ; optional header magic
        db 13,37  ; [UNUSED] linker version
        dd ALIGNED(S_TEXT_SIZE)  ; [UNUSED] code size
        dd ALIGNED(S_IDATA_SIZE)  ; [UNUSED] size of initialized data
        dd 0  ; [UNUSED] size of uninitialized data
        dd RVA(section..text.vstart)  ; entry point address
        dd RVA(section..text.vstart)  ; [UNUSED] base of code
        dd RVA(section..idata.vstart)  ; [UNUSED] base of data
        dd BASE  ; image base
        dd SECTALIGN  ; section alignment
        dd ALIGNMENT  ; file alignment
        dw 4,0  ; [UNUSED] OS version
        dw 0,0  ; [UNUSED] image version
        dw 4,0  ; subsystem version
        dd 0  ; [UNUSED] Win32 version
        dd RVA(the_end)  ; size of image
        dd ALIGNED(ALL_HDR_SIZE)  ; size of headers
        dd 0  ; [UNUSED] checksum
        dw 3  ; subsystem = console
        dw 0  ; [UNUSED] DLL characteristics
        dd 0x00100000  ; [UNUSED] maximum stack size
        dd 0x00001000  ; initial stack size
        dd 0x00100000  ; maximum heap size
        dd 0x00001000  ; [UNUSED] initial heap size
        dd 0  ; [UNUSED] loader flags
        dd 16  ; number of data directory entries
        dd 0,0  ; no export table
        dd RVA(import_table)  ; import table address
        dd IMPORT_TABLE_SIZE  ; import table size
        times 14 dd 0,0  ; no other entries in the data directories
    OPT_HDR_SIZE equ $ - opt_hdr

    sect_hdr_text:
        db ".text",0,0,0  ; section name
        dd ALIGNED(S_TEXT_SIZE)  ; virtual size
        dd RVA(section..text.vstart)  ; virtual address
        dd ALIGNED(S_TEXT_SIZE)  ; file size
        dd section..text.start  ; file position
        dd 0,0  ; no relocations or debug info
        dw 0,0  ; no relocations or debug info
        dd 0x60000020  ; flags: code, readable, executable

    sect_hdr_idata:
        db ".idata",0,0  ; section name
        dd ALIGNED(S_IDATA_SIZE)  ; virtual size
        dd RVA(section..idata.vstart)  ; virtual address
        dd ALIGNED(S_IDATA_SIZE)  ; file size
        dd section..idata.start  ; file position
        dd 0,0  ; no relocations or debug info
        dw 0,0  ; no relocations or debug info
        dd 0xC0000040  ; flags: data, readable, writeable

    ALL_HDR_SIZE equ $ - $$

    ;;;;;;;;;;;;;;;;;;;; .text ;;;;;;;;;;;;;;;;;

    section .text progbits follows=header align=ALIGNMENT vstart=BASE+SECTALIGN*1
    s_text:
        ; set up stack frame for *lpBytesWritten
        push ebp
        sub esp, 4

        ; if (!OpenClipboard(NULL)) ExitProcess(1);
        push 0
        call [OpenClipboard]
        or eax, eax
        jz error2

        ; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
        push 1  ; CF_TEXT
        call [GetClipboardData]
        or eax, eax
        jz error
        push eax  ; save hData for GlobalUnlock at the end

        ; char* str = GlobalLock(hData); if (!str) fail;
        push eax
        call [GlobalLock]
        or eax, eax
        jz error

        ; strlen(str)
        mov ecx, eax
    strlen_loop:
        mov dl, [ecx]
        or dl, dl
        jz strlen_end
        inc ecx
        jmp strlen_loop
    strlen_end:
        sub ecx, eax

        ; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
        push 0  ; lpOverlapped = NULL
        lea edx, [ebp-4]  ; put nBytesWritten on the stack
        push edx
        push ecx  ; nNumberOfBytesToWrite = strlen(str)
        push eax  ; lpBuffer = str
        push -11  ; hFile = ...
        call [GetStdHandle]  ; ... GetStdHandle(STD_OUTPUT_HANDLE)
        push eax
        call [WriteFile]

        ; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
        call [GlobalUnlock]  ; hData is already on the stack
        call [CloseClipboard]
        push 0
        call [ExitProcess]

    error:
        call [CloseClipboard]
    error2:
        push 1
        call [ExitProcess]

    S_TEXT_SIZE equ $ - s_text

    ;;;;;;;;;;;;;;;;;;;; .idata ;;;;;;;;;;;;;;;;;

    section .idata progbits follows=.text align=ALIGNMENT vstart=BASE+SECTALIGN*2
    s_idata:

    import_table:
        ; import of kernel32.dll
        dd 0  ; [UNUSED] read-only IAT
        dd 0  ; [UNUSED] timestamp
        dd 0  ; [UNUSED] forwarder chain
        dd RVA(N_kernel32)  ; library name
        dd RVA(IAT_kernel32)  ; IAT pointer
        ; import of user32.dll
        dd 0  ; [UNUSED] read-only IAT
        dd 0  ; [UNUSED] timestamp
        dd 0  ; [UNUSED] forwarder chain
        dd RVA(N_user32)  ; library name
        dd RVA(IAT_user32)  ; IAT pointer
        ; terminator (empty item)
        times 5 dd 0
    IMPORT_TABLE_SIZE: equ $ - import_table

    IAT_kernel32:
    ExitProcess: dd RVA(H_ExitProcess)
    GlobalLock: dd RVA(H_GlobalLock)
    GlobalUnlock: dd RVA(H_GlobalUnlock)
    GetStdHandle: dd RVA(H_GetStdHandle)
    WriteFile: dd RVA(H_WriteFile)
        dd 0

    IAT_user32:
    OpenClipboard: dd RVA(H_OpenClipboard)
    CloseClipboard: dd RVA(H_CloseClipboard)
    GetClipboardData: dd RVA(H_GetClipboardData)
        dd 0

    align 4, db 0
    N_kernel32: db "kernel32.dll",0
    align 4, db 0
    N_user32: db "user32.dll",0
    align 2, db 0
    H_OpenClipboard: db 0,0,"OpenClipboard",0
    align 2, db 0
    H_GetClipboardData: db 0,0,"GetClipboardData",0
    align 2, db 0
    H_GlobalLock: db 0,0,"GlobalLock",0
    align 2, db 0
    H_GetStdHandle: db 0,0,"GetStdHandle",0
    align 2, db 0
    H_WriteFile: db 0,0,"WriteFile",0
    align 2, db 0
    H_GlobalUnlock: db 0,0,"GlobalUnlock",0
    align 2, db 0
    H_CloseClipboard: db 0,0,"CloseClipboard",0
    align 2, db 0
    H_ExitProcess: db 0,0,"ExitProcess",0

    S_IDATA_SIZE equ $ - s_idata

    align ALIGNMENT, db 0
    the_end:
That’s a pretty standard »by the book« implementation of a PE file: Code and import tables are nicely segregated into separate sections, the sections have their default alignment, all headers are spelled out in full, and fields which are not used by the loader nevertheless have sensible values or at least the usual dummy values (i.e. zero). The only thing that’s missing is a proper DOS stub, so if anybody ever tries to run this on real DOS, it will crash and burn.
So what does it give us? The result is 1536 bytes of finest hand-crafted code. Not too bad, but not quite satisfying either. The elephant in the room is the 512-byte alignment of the sections in the file that causes a lot of empty space: Can’t we just turn that down to, like, nothing? Unfortunately, we really can’t: Windows 10’s loader insists on a file alignment of 512 bytes; any attempt to decrease it results in the message »This app can’t be executed on this PC«. It’s not even possible to strip the padding at the end of the last section. (WINE accepts all of that without flinching, but that’s not at all our target platform.)
Merging sections
Even with Windows being so uncooperative, we still have one trick up our sleeve: We can just put both the code and the import tables into a single combined section. That’s not commonly done (code/data separation exists for a reason), but on our quest to make the file smaller, we take what we can get.
The modifications are quite small, so here’s just a diff:
    @@ -34,3 +42,3 @@
         dw 0x014c  ; i386 architecture
    -    dw 2  ; two sections
    +    dw 1  ; one section
         dd 0  ; [UNUSED] timestamp
    @@ -44,8 +52,8 @@
         db 13,37  ; [UNUSED] linker version
    -    dd ALIGNED(S_TEXT_SIZE)  ; [UNUSED] code size
    -    dd ALIGNED(S_IDATA_SIZE)  ; [UNUSED] size of initialized data
    +    dd ALIGNED(S_SECT_SIZE)  ; [UNUSED] code size
    +    dd ALIGNED(S_SECT_SIZE)  ; [UNUSED] size of initialized data
         dd 0  ; [UNUSED] size of uninitialized data
    -    dd RVA(section..text.vstart)  ; entry point address
    -    dd RVA(section..text.vstart)  ; [UNUSED] base of code
    -    dd RVA(section..idata.vstart)  ; [UNUSED] base of data
    +    dd RVA(section.getclip.vstart); entry point address
    +    dd RVA(section.getclip.vstart); [UNUSED] base of code
    +    dd RVA(section.getclip.vstart); [UNUSED] base of data
         dd BASE  ; image base
    @@ -74,20 +82,11 @@
    -sect_hdr_text:
    -    db ".text",0,0,0  ; section name
    -    dd ALIGNED(S_TEXT_SIZE)  ; virtual size
    -    dd RVA(section..text.vstart)  ; virtual address
    -    dd ALIGNED(S_TEXT_SIZE)  ; file size
    -    dd section..text.start  ; file position
    +sect_hdr:
    +    db "getclip",0  ; section name
    +    dd ALIGNED(S_SECT_SIZE)  ; virtual size
    +    dd RVA(section.getclip.vstart); virtual address
    +    dd ALIGNED(S_SECT_SIZE)  ; file size
    +    dd section.getclip.start  ; file position
         dd 0,0  ; no relocations or debug info
         dw 0,0  ; no relocations or debug info
    -    dd 0x60000020  ; flags: code, readable, executable
    +    dd 0xE0000060  ; flags: code + data, readable, writeable, executable
    -sect_hdr_idata:
    -    db ".idata",0,0  ; section name
    -    dd ALIGNED(S_IDATA_SIZE)  ; virtual size
    -    dd RVA(section..idata.vstart)  ; virtual address
    -    dd ALIGNED(S_IDATA_SIZE)  ; file size
    -    dd section..idata.start  ; file position
    -    dd 0,0  ; no relocations or debug info
    -    dw 0,0  ; no relocations or debug info
    -    dd 0xC0000040  ; flags: data, readable, writeable
    @@ -97,4 +96,4 @@
    -section .text progbits follows=header align=ALIGNMENT vstart=BASE+SECTALIGN*1
    -s_text:
    +section getclip progbits follows=header align=ALIGNMENT vstart=BASE+SECTALIGN*1
    +the_section:
    @@ -157,9 +156,5 @@
    -S_TEXT_SIZE equ $ - s_text
    -;;;;;;;;;;;;;;;;;;;; .idata ;;;;;;;;;;;;;;;;;
    -section .idata progbits follows=.text align=ALIGNMENT vstart=BASE+SECTALIGN*2
    -s_idata:
    -
    +    align 4, ret
    @@ -215,3 +210,3 @@
    -S_IDATA_SIZE equ $ - s_idata
    +S_SECT_SIZE equ $ - the_section
The result is (predictably) 1024 bytes, i.e. exactly 1 KiB. Within the constraints of the Windows loader, it’s not possible to go below that: We need at least one »pseudo-section« for the header and one section for actual code and data, and both of them need to be at least a full 512 bytes.
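The 1536 and 1024 byte figures follow from the file alignment alone, as the following C sketch illustrates. The header size is derived from the fields spelled out in the listing (64-byte DOS header, 4-byte signature, 20-byte file header, 224-byte optional header, 40 bytes per section header); the function names and the raw section sizes used in the example are assumptions, since any section content between 1 and 512 bytes gives the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round v up to a multiple of a, like the ROUND() macro in the listing. */
static uint32_t round_up(uint32_t v, uint32_t a)
{
    assert(a != 0);
    return ((v + a - 1) / a) * a;
}

/* Unpadded size of all PE32 headers for a given number of sections:
   DOS header + "PE\0\0" + file header + optional header with
   16 data directories + one 40-byte entry per section. */
uint32_t header_bytes(uint32_t num_sections)
{
    return 64 + 4 + 20 + 224 + 40 * num_sections;
}

/* File size when the headers and every section are padded to file_align. */
uint32_t pe_file_size(const uint32_t *section_bytes, size_t num_sections,
                      uint32_t file_align)
{
    uint32_t total = round_up(header_bytes((uint32_t)num_sections), file_align);
    for (size_t i = 0; i < num_sections; i++)
        total += round_up(section_bytes[i], file_align);
    return total;
}
```

With two sections of, say, 120 and 300 bytes this gives 3 × 512 = 1536 bytes; merging them into one 420-byte section gives 2 × 512 = 1024 bytes.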
Going sectionless
As this whole section business works against us, can we possibly live without it? Windows will load at least the header part of the executable into memory anyway, and if we sneak the actual code and import table data in there, we should be fine. In fact, this used to work in the past, but at least Windows 10 version 1703 (and very likely earlier versions too) simply ignores import tables that are not contained in a section. As a result, the pointers to the function names in the Import Address Table are not replaced by the functions’ entry point addresses: the program will load just fine, but it will crash shortly thereafter when it tries to call the first API function.
So if we want to go down the »sectionless PE« route, we need to find an alternative way to load our imports. But how can we do that? Even LoadLibrary and GetProcAddress would need to be imported from kernel32.dll somehow … or would they? In fact, kernel32.dll (and ntdll.dll) are already loaded, by default, by Windows’ PE loader! We just need to find their addresses somehow. This can be done with some pointer chasing: The FS selector points to the Thread Environment Block (TEB), which contains a pointer to the Process Environment Block (PEB), which contains a pointer to the PE loader data, which contains a doubly-linked circular list of loader data table entries, one for each loaded DLL, each of which contains the DLL’s base address. Phew. But as complicated as that sounds, it’s just six simple MOV instructions. The complex part is what comes after that.
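The list walk at the end of that pointer chain can be sketched in portable C. This is a simplified model, not the real Windows layout: `list_link`, `module_entry` and `nth_module_base` are hypothetical names, and the structures only mimic the idea that each loader entry embeds a pair of list pointers and carries the module's base address.

```c
#include <assert.h>
#include <stddef.h>

/* Link pair of a circular doubly-linked list (models LIST_ENTRY). */
struct list_link {
    struct list_link *next, *prev;
};

/* Simplified stand-in for one loader data table entry. */
struct module_entry {
    struct list_link link;   /* embedded list pointers */
    void *dll_base;          /* where the module is mapped */
};

/* Recover the entry that contains a given link. */
static struct module_entry *entry_from_link(struct list_link *l)
{
    return (struct module_entry *)((char *)l - offsetof(struct module_entry, link));
}

/* Base address of the n-th module (0-based), or NULL if the list is shorter.
   The walk ends when it arrives back at the list head. */
void *nth_module_base(struct list_link *head, size_t n)
{
    assert(head != NULL);
    for (struct list_link *l = head->next; l != head; l = l->next) {
        if (n-- == 0)
            return entry_from_link(l)->dll_base;
    }
    return NULL;
}
```

In this model, asking for entry number 2 corresponds to following the head pointer and then two further links, which is where kernel32.dll typically sits according to the assembly listing.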
Because right now, we have the base address of a DLL that’s supposed to be kernel32.dll. But we need function pointers, not DLL base addresses, and we can’t just call GetProcAddress yet (because we don’t know its address). The only thing we can do is re-implement GetProcAddress by parsing the PE header, locating the export tables, searching them for the desired function name, and using the ultra-complicated three-step lookup procedure (that doesn’t even work as documented; I got consistent off-by-one errors when implementing it according to the spec) to get the actual address. That’s a lot of code, but there’s no way around it.
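The three-step lookup (name table, ordinal table, address table) can be sketched in C on a simplified, pointer-based model of the export directory. The `export_dir` structure and `lookup_export` are hypothetical stand-ins; a real export directory stores RVAs rather than pointers. One plausible explanation for the off-by-one errors is that the entries of the ordinal table are already zero-based indices into the address table, so subtracting the ordinal base (usually 1) a second time lands on the neighbouring function; the sketch assumes exactly that.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of a PE export directory (pointers instead of RVAs). */
struct export_dir {
    uint32_t ordinal_base;          /* only relevant for import-by-ordinal */
    uint32_t num_functions;         /* entries in `functions` */
    uint32_t num_names;             /* entries in `names` and `name_ordinals` */
    const uint32_t *functions;      /* export address table (RVAs) */
    const char *const *names;       /* name pointer table */
    const uint16_t *name_ordinals;  /* ordinal table, parallel to `names` */
};

/* Return the RVA of the export called `wanted`, or 0 if there is none. */
uint32_t lookup_export(const struct export_dir *dir, const char *wanted)
{
    assert(dir != NULL && wanted != NULL);
    for (uint32_t i = 0; i < dir->num_names; i++) {
        if (strcmp(dir->names[i], wanted) != 0)
            continue;                                /* step 1: find the name */
        uint16_t index = dir->name_ordinals[i];      /* step 2: unbiased index */
        if (index >= dir->num_functions)
            return 0;
        return dir->functions[index];                /* step 3: fetch the RVA */
    }
    return 0;
}
```

The caller would add the module's base address to the returned RVA to obtain a callable pointer.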
Having implemented a poor man’s GetProcAddress, note that we no longer need the real thing: We can directly look for LoadLibrary in the loaded DLLs (one of which is always kernel32.dll), load user32.dll with it and then use our own look-up function for all other required API calls as well. In fact, I went so far as to write a wrapper function that takes the base address of a DLL and the function name, looks the function up and calls it directly.
One nice side-effect of going sectionless is that Windows now allows us to set the file alignment to an arbitrarily low value, because it isn’t really interested in any alignment stuff in this case. (It does check that the section alignment is equal to the file alignment, but that’s fine with us.)
There is one additional pitfall on Windows 7 64-bit (I believe I didn’t see this on 32-bit Windows 7, but I’m not sure). It seems that its loader does not fully ignore the section table as it ought to: if the DWORD that would hold the file offset of the first section is negative, the executable can’t be run. In effect, this means that the byte at offset 23 (decimal) after the optional header must not be 0x80 or greater. That’s quite a restriction, because we’re going to put code there and we don’t want to juggle instructions around until we have found an arrangement that works! Fortunately, we can circumvent this: The »optional header size« field does not really store the size of the optional header; the optional header has a fixed size after all, determined only by the number of data directory entries, which is stored explicitly. No, what the »optional header size« field actually encodes is the offset of the section table, relative to the optional header’s start. So we simply need to choose a value such that the DWORD at offset [optional header start + optional header size + 20] is guaranteed to be less than 0x80000000. One good candidate is the »image base« field, which defaults to 0x400000 and is located at offset 28 inside the optional header, so we put down 8 as the optional header size and we’re set!
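The value 8 can be double-checked with a little offset arithmetic. The C sketch below models the start of the PE32 optional header and one section table entry as structs (field names are paraphrased for this illustration) and computes the fake »optional header size« that makes the first section's file-offset field land on a chosen optional header field; `fake_optional_header_size` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Leading fields of the PE32 optional header, up to the image base. */
struct opt_header_prefix {
    uint16_t magic;
    uint8_t  linker_major, linker_minor;
    uint32_t size_of_code;
    uint32_t size_of_initialized_data;
    uint32_t size_of_uninitialized_data;
    uint32_t entry_point;
    uint32_t base_of_code;
    uint32_t base_of_data;
    uint32_t image_base;
};

/* One 40-byte section table entry. */
struct section_header {
    uint8_t  name[8];
    uint32_t virtual_size;
    uint32_t virtual_address;
    uint32_t size_of_raw_data;
    uint32_t pointer_to_raw_data;     /* the DWORD that must not be negative */
    uint32_t pointer_to_relocations;
    uint32_t pointer_to_linenumbers;
    uint16_t number_of_relocations;
    uint16_t number_of_linenumbers;
    uint32_t characteristics;
};

/* "Optional header size" that makes the first section's file-offset field
   overlap the optional header field located at byte offset field_offset. */
uint32_t fake_optional_header_size(uint32_t field_offset)
{
    uint32_t raw_ptr = (uint32_t)offsetof(struct section_header, pointer_to_raw_data);
    assert(field_offset >= raw_ptr);
    return field_offset - raw_ptr;
}
```

The file-offset field sits 20 bytes into a section header and the image base 28 bytes into the optional header, so the difference is the 8 used in the listing.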
    bits 32

    BASE equ 0x00400000
    ALIGNMENT equ 4
    SECTALIGN equ 4

    %define ROUND(v, a) (((v + a - 1) / a) * a)
    %define ALIGNED(v) (ROUND(v, ALIGNMENT))
    %define RVA(obj) (obj - BASE)

    org BASE

    mz_hdr:
        dw "MZ"  ; DOS magic
        times 0x3a db 0  ; [UNUSED] DOS header
        dd RVA(pe_hdr)  ; address of PE header

    pe_hdr:
        dw "PE",0  ; PE magic + 2 padding bytes
        dw 0x014c  ; i386 architecture
        dw 0  ; no sections
        dd 0  ; [UNUSED] timestamp
        dd 0  ; [UNUSED] symbol table pointer
        dd 0  ; [UNUSED] symbol count
        dw 8  ; optional header size
        dw 0x0102  ; characteristics: 32-bit, executable

    opt_hdr:
        dw 0x010b  ; optional header magic
        db 13,37  ; [UNUSED] linker version
        dd RVA(the_end)  ; [UNUSED] code size
        dd RVA(the_end)  ; [UNUSED] size of initialized data
        dd 0  ; [UNUSED] size of uninitialized data
        dd RVA(main)  ; entry point address
        dd RVA(main)  ; [UNUSED] base of code
        dd RVA(main)  ; [UNUSED] base of data
        dd BASE  ; image base
        dd SECTALIGN  ; section alignment
        dd ALIGNMENT  ; file alignment
        dw 4,0  ; [UNUSED] OS version
        dw 0,0  ; [UNUSED] image version
        dw 4,0  ; subsystem version
        dd 0  ; [UNUSED] Win32 version
        dd RVA(the_end)  ; size of image
        dd ALIGNED(ALL_HDR_SIZE)  ; size of headers
        dd 0  ; [UNUSED] checksum
        dw 3  ; subsystem = console
        dw 0  ; [UNUSED] DLL characteristics
        dd 0x00100000  ; [UNUSED] maximum stack size
        dd 0x00001000  ; initial stack size
        dd 0x00100000  ; maximum heap size
        dd 0x00001000  ; [UNUSED] initial heap size
        dd 0  ; [UNUSED] loader flags
        dd 16  ; number of data directory entries
        times 16 dd 0,0  ; no entries in the data directories
    OPT_HDR_SIZE equ $ - opt_hdr
    ALL_HDR_SIZE equ $ - $$

    ;;;;;;;;;;;;;;;;;;;; .text ;;;;;;;;;;;;;;;;;

    main:
        ; set up stack frame for local variables
        push ebp
    %define DummyVar ebp-4
    %define kernel32base ebp-8
    %define user32base ebp-12
        sub esp, 12

        ; locate the loader data tables where the loaded DLLs are managed
        mov eax, [fs:0x30]  ; get PEB pointer from TEB
        mov eax, [eax+0x0C]  ; get PEB_LDR_DATA pointer from PEB
        mov eax, [eax+0x14]  ; go to first LDR_DATA_TABLE_ENTRY
        mov eax, [eax]  ; move two entries further, because the
        mov eax, [eax]  ; third is typically kernel32.dll

    try_next_lib:
        push eax  ; save LDR_DATA_TABLE_ENTRY pointer
        mov ebx, [eax+0x10]  ; load base address of the library
        mov esi, N_LoadLibrary
        call find_import  ; load LoadLibrary from there (if present)
        or eax, eax  ; found?
        jnz kernel32_found
        pop eax  ; restore LDR_DATA_TABLE_ENTRY pointer
        mov eax, [eax]  ; go to next LDR_DATA_TABLE_ENTRY
        jmp try_next_lib

    find_import:  ; FUNCTION that finds procedure [esi] in library at base [ebx]
        mov edx, [ebx+0x3c]  ; get PE header pointer (w/ RVA translation)
        add edx, ebx
        cmp word [edx], "PE"  ; is it a PE header?
        jne find_import_fail
        mov eax, [edx+0x74]  ; check if data directory is present
        or eax, eax
        jz find_import_fail
        mov edx, [edx+0x78]  ; get export table pointer RVA
        or edx, edx  ; check if export table is present
        jz find_import_fail
        add edx, ebx  ; get absolute address of export table
        push edx  ; store the export table address for later
        mov ecx, [edx+0x18]  ; ecx = number of named functions
        mov edx, [edx+0x20]  ; edx = address-of-names list (w/ RVA translation)
        add edx, ebx
    name_loop:
        dec ecx  ; pre-decrement counter and check if we're done
        js find_import_fail1
        push esi  ; store the desired function name's pointer (we will clobber it)
        mov edi, [edx]  ; load function name (w/ RVA translation)
        add edi, ebx
    cmp_loop:
        lodsb  ; load a byte of the two strings into AL, AH
        mov ah, [edi]  ; and increase the pointers
        inc edi
        cmp al, ah  ; identical bytes?
        jne next_name  ; if not, this is not the correct name
        or al, al  ; zero byte reached?
        jnz cmp_loop  ; if not, we need to compare more
        ; if we arrive here, we have a match!
        pop esi  ; restore the name pointer (though we don't use it any longer)
        pop edx  ; restore the export table address
        sub ecx, [edx+0x18]  ; turn the negative counter ECX into a positive one
        neg ecx
        dec ecx
        mov eax, [edx+0x24]  ; get address of ordinal table (w/ RVA translation)
        add eax, ebx
        movzx ecx, word [eax+ecx*2]  ; load ordinal from table
        ;sub ecx, [edx+0x10]  ; subtract ordinal base
        mov eax, [edx+0x1C]  ; get address of function address table (w/ RVA translation)
        add eax, ebx
        mov eax, [eax+ecx*4]  ; load function address (w/ RVA translation)
        add eax, ebx
        ret
    next_name:
        pop esi  ; restore the name pointer
        add edx, 4  ; advance to next list item
        jmp name_loop
    find_import_fail1:
        pop eax  ; we still had one dword on the stack
    find_import_fail:
        xor eax, eax
        ret

    call_import:  ; FUNCTION that finds procedure [esi] in library at base [ebx] and calls it
        call find_import
        or eax, eax  ; found?
        jz critical_error  ; if not, we're screwed
        jmp eax  ; but if so, call the function

    ; back to the main program ...
    kernel32_found:  ; we found kernel32 (ebx) and LoadLibraryA (eax), so we can load user32.dll
        mov [kernel32base], ebx  ; store kernel32's base address
        push N_user32
        call eax  ; call LoadLibraryA
        or eax, eax  ; check the result
        jz error2
        mov [user32base], eax  ; store user32's base address

        ; if (!OpenClipboard(NULL)) ExitProcess(1);
        push 0
        mov ebx, eax  ; user32 base address was still in eax
        mov esi, N_OpenClipboard
        call call_import
        or eax, eax
        jz error2

        ; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
        push 1  ; CF_TEXT
        ; mov ebx, [user32base]
        mov esi, N_GetClipboardData
        call call_import
        or eax, eax
        jz error
        push eax  ; save hData for GlobalUnlock at the end

        ; char* str = GlobalLock(hData); if (!str) fail;
        push eax
        mov ebx, [kernel32base]
        mov esi, N_GlobalLock
        call call_import
        or eax, eax
        jz error

        ; strlen(str)
        mov ecx, eax
    strlen_loop:
        mov dl, [ecx]
        or dl, dl
        jz strlen_end
        inc ecx
        jmp strlen_loop
    strlen_end:
        sub ecx, eax

        ; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
        push 0  ; lpOverlapped = NULL
        lea edx, [DummyVar]  ; lpBytesWritten
        push edx
        push ecx  ; nNumberOfBytesToWrite = strlen(str)
        push eax  ; lpBuffer = str
        push -11  ; hFile = ...
        ; mov ebx, [kernel32base]
        mov esi, N_GetStdHandle
        call call_import  ; ... GetStdHandle(STD_OUTPUT_HANDLE)
        push eax
        ; mov ebx, [kernel32base]
        mov esi, N_WriteFile
        call call_import

        ; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
        ; mov ebx, [kernel32base]
        mov esi, N_GlobalUnlock
        call call_import  ; hData is already on the stack
        mov ebx, [user32base]
        mov esi, N_CloseClipboard
        call call_import
        push 0
        jmp exit

    error:
        mov ebx, [user32base]
        mov esi, N_CloseClipboard
        call call_import
    error2:
        push 1
    exit:
        mov ebx, [kernel32base]
        mov esi, N_ExitProcess
        jmp call_import

    critical_error:
        ret

    N_user32: db "user32.dll",0
    N_LoadLibrary: db "LoadLibraryA", 0
    N_OpenClipboard: db "OpenClipboard",0
    N_GetClipboardData: db "GetClipboardData",0
    N_GlobalLock: db "GlobalLock",0
    N_GetStdHandle: db "GetStdHandle",0
    N_WriteFile: db "WriteFile",0
    N_GlobalUnlock: db "GlobalUnlock",0
    N_CloseClipboard: db "CloseClipboard",0
    N_ExitProcess: db "ExitProcess",0

    align ALIGNMENT, db 0
    the_end:
That’s quite a lot of work, but at least we can save another 25% and get down to 768 bytes. This comes at the expense of runtime performance, though, because our homegrown GetProcAddress implementation is not nearly as efficient as Windows’ original one: We simply scan all function names (of which there are over 1600 in kernel32.dll), while the proper loader uses binary search to speed things up. But we’re talking about a few hundred microseconds here; loading and running an executable at all takes an order of magnitude longer than that.
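For comparison, here is a minimal C sketch of the binary search that a sorted name table makes possible. It is only an illustration of the technique, not the Windows loader's actual code; `find_name_sorted` is a hypothetical name and the table is assumed to be sorted in `strcmp` order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Binary search in a name table sorted with strcmp(); returns the index of
   `wanted` or -1. Needs roughly log2(count) string comparisons, i.e. about
   11 for 1600 names instead of up to 1600 for a linear scan. */
long find_name_sorted(const char *const *names, size_t count, const char *wanted)
{
    size_t lo = 0, hi = count;              /* search window is [lo, hi) */
    assert(names != NULL && wanted != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(wanted, names[mid]);
        if (cmp == 0)
            return (long)mid;
        if (cmp < 0)
            hi = mid;                       /* continue in the lower half */
        else
            lo = mid + 1;                   /* continue in the upper half */
    }
    return -1;
}
```

The price is more code than the linear scan, which is exactly what a size-optimized executable wants to avoid.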
Import by hash
Of the 768 bytes in the sectionless version, 118 bytes (15%!) are spent on function names. That seems a little excessive, doesn’t it? After all, we’re not really interested in the names themselves, we just use them to find the functions’ addresses. As a first try, we could limit the length of the stored strings by only comparing the first, say, 7 characters. We won’t be able to tell LoadLibraryA from its Unicode cousin LoadLibraryW this way, but since the names are guaranteed to be alphabetically sorted in export tables, we would hit LoadLibraryA first anyway. However, we can’t use fewer than 7 significant bytes, because otherwise e.g. GlobalLock would be too unspecific and we would get GlobalAddAtomA instead.
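The effect of the prefix length can be tried out with a small C sketch. `find_by_prefix` is a hypothetical helper and the four-entry table is a toy excerpt, not the real kernel32.dll export list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the index of the first name whose first prefix_len bytes equal
   those of `wanted`, or -1. The scan runs front to back, so in a sorted
   table the alphabetically first candidate wins. */
long find_by_prefix(const char *const *names, size_t count,
                    const char *wanted, size_t prefix_len)
{
    assert(names != NULL && wanted != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strncmp(names[i], wanted, prefix_len) == 0)
            return (long)i;
    }
    return -1;
}
```

On the toy table { "GlobalAddAtomA", "GlobalLock", "LoadLibraryA", "LoadLibraryW" }, looking for "GlobalLock" with 6 significant bytes returns the GlobalAddAtomA entry, while 7 bytes find the right one; "LoadLibraryW" with 7 bytes resolves to the LoadLibraryA entry.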
But 7 bytes per import is still quite a lot of data, and the whole approach is a forward-compatibility time bomb, because future versions of Windows could add new functions to our two DLLs with catastrophic effect. So truncating names is not the best path to follow. However, there’s a much more powerful alternative: Hashing! As said, we’re not interested in the names, not even in parts of them. A machine-readable mapping that can uniquely identify the proper function name without actually knowing it is sufficient; bonus points if it’s easy to compute. (For our purposes, we don’t need a cryptographically strong hash or anything fancy, we just want to tell a few function names apart!)
Long story short, such mappings exist. In our example, we’ll use a simple »rotate-and-xor« hash. The algorithm uses a 32-bit accumulator register. For each character of the function name, two operations are performed (in any order): The character’s ASCII code is XOR’ed into the register (addition would be possible as well), and the register is rotated by a fixed (and ideally prime) number of bits. This can be computed in two x86 instructions per character, and is able to map all names of the two DLLs in question (and also various others I tested with) into 32-bit hashes without any collisions. Another nice property is that the hash can be computed in reverse: We can store the start value of the accumulator, and a match is detected when after processing all characters of a function name, the accumulator becomes zero. (We could live without that, but it simplifies the implementation a tiny bit.)
This modification can be applied to the existing implementation quite easily, so here’s again just a diff:
    @@ -95,5 +89,5 @@
         mov ebx, [eax+0x10]  ; load base address of the library
    -    mov esi, N_LoadLibrary
    +    mov esi, 0x01364564  ; hash of "LoadLibraryA"
         call find_import  ; load LoadLibrary from there (if present)
    @@ -123,15 +117,16 @@
     cmp_loop:
    -    lodsb  ; load a byte of the two strings into AL, AH
    -    mov ah, [edi]  ; and increase the pointers
    -    inc edi
    -    cmp al, ah  ; identical bytes?
    -    jne next_name  ; if not, this is not the correct name
    -    or al, al  ; zero byte reached?
    -    jnz cmp_loop  ; if not, we need to compare more
    +    movzx eax, byte [edi]  ; load a byte of the name ...
    +    inc edi  ; ... and advance the pointer
    +    xor esi, eax  ; apply xor-and-rotate
    +    rol esi, 7
    +    or eax, eax  ; last byte?
    +    jnz cmp_loop  ; if not, process another byte
    +    or esi, esi  ; result hash match?
    +    jnz next_name  ; if not, this is not the correct name
         ; if we arrive here, we have a match!
    @@ -180,5 +175,5 @@
         push 0
         mov ebx, eax  ; user32 base address was still in eax
    -    mov esi, N_OpenClipboard
    +    mov esi, 0xFC7956AD  ; hash of "OpenClipboard"
         call call_import
    @@ -188,5 +183,5 @@
         push 1  ; CF_TEXT
         ; mov ebx, [user32base]
    -    mov esi, N_GetClipboardData
    +    mov esi, 0x0C473D74  ; hash of "GetClipboardData"
         call call_import
         or eax, eax
    @@ -197,5 +192,5 @@
         mov ebx, [kernel32base]
    -    mov esi, N_GlobalLock
    +    mov esi, 0x4A88F58C  ; hash of "GlobalLock"
         call call_import
    @@ -221,18 +216,18 @@
         ; mov ebx, [kernel32base]
    -    mov esi, N_GetStdHandle
    +    mov esi, 0xEACA71C2  ; hash of "GetStdHandle"
         call call_import  ; ... GetStdHandle(STD_OUTPUT_HANDLE)
         push eax
         ; mov ebx, [kernel32base]
    -    mov esi, N_WriteFile
    +    mov esi, 0x3FD1C30F  ; hash of "WriteFile"
         call call_import
         ; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
         ; mov ebx, [kernel32base]
    -    mov esi, N_GlobalUnlock
    +    mov esi, 0xC3907A85  ; hash of "GlobalUnlock"
         call call_import  ; hData is already on the stack
         mov ebx, [user32base]
    -    mov esi, N_CloseClipboard
    +    mov esi, 0x1D84425E  ; hash of "CloseClipboard"
         call call_import
    @@ -242,5 +237,5 @@
     error:
         mov ebx, [user32base]
    -    mov esi, N_CloseClipboard
    +    mov esi, 0x1D84425E  ; hash of "CloseClipboard"
         call call_import
    @@ -248,5 +243,5 @@
     exit:
         mov ebx, [kernel32base]
    -    mov esi, N_ExitProcess
    +    mov esi, 0x665640AC  ; hash of "ExitProcess"
         jmp call_import
     critical_error:
    @@ -254,13 +249,4 @@
     N_user32: db "user32.dll",0
    -N_LoadLibrary: db "LoadLibraryA", 0
    -N_OpenClipboard: db "OpenClipboard",0
    -N_GetClipboardData: db "GetClipboardData",0
    -N_GlobalLock: db "GlobalLock",0
    -N_GetStdHandle: db "GetStdHandle",0
    -N_WriteFile: db "WriteFile",0
    -N_GlobalUnlock: db "GlobalUnlock",0
    -N_CloseClipboard: db "CloseClipboard",0
    -N_ExitProcess: db "ExitProcess",0
The result is 656 bytes, 112 bytes less than the version without import-by-hash. It’s not quite the optimal amount of savings (which would be 118 bytes, the size of the name strings) because the comparison grew a little bit, but still quite an impressive result.
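The 118-byte figure can be checked with a few lines of Python. This is just a back-of-the-envelope sketch: it counts each of the nine name strings removed in the diff plus its terminating zero byte, and derives the previous file size from the two numbers quoted above.

```python
# The nine name strings that the diff removes from the data area.
REMOVED_NAMES = [
    "LoadLibraryA", "OpenClipboard", "GetClipboardData", "GlobalLock",
    "GetStdHandle", "WriteFile", "GlobalUnlock", "CloseClipboard",
    "ExitProcess",
]

# Each string costs its length plus one terminating zero byte.
string_bytes = sum(len(name) + 1 for name in REMOVED_NAMES)

new_size = 656                      # size quoted for the import-by-hash version
saved = 112                         # savings quoted in the text
old_size = new_size + saved         # implied size of the previous version
code_growth = string_bytes - saved  # bytes the longer comparison loop costs

if __name__ == "__main__":
    print("name strings:", string_bytes, "bytes")
    print("previous size:", old_size, "bytes")
    print("comparison loop grew by:", code_growth, "bytes")
```

The `mov esi, ...` instructions themselves do not change in size, since a 32-bit hash takes the same four bytes as the 32-bit string address it replaces.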
Header trickery
Before our short excursion into the land of hashes, we worked hard on bypassing the alignment limits, but still there’s a lot of space spent in the PE headers. One trivial thing is to remove the data directory, as we don’t even have table-based imports by now. But that’s not all: Fortunately, there are many fields in the headers that aren’t evaluated by the Windows loader where we can put other stuff in. The largest part of this is the 64-byte DOS header at the beginning, of which only the first two bytes (the »MZ« signature) and the last four bytes (the file offset of the PE header) are important. We can actually move (»collapse«) the PE header inside the DOS header, all the way down to offset 4 (which is the minimum alignment requirement). In this case, the PE header location field of the DOS header coincides with the section alignment field of the PE header, so we get a section (and file) alignment of 4 – perfect!
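Why offset 4 of all places? The following Python sketch redoes the offset arithmetic. The constant names are made up for illustration; the numbers are the standard PE32 layout (4-byte signature, 20-byte COFF file header, section alignment at offset 32 and file alignment at offset 36 of the optional header).

```python
E_LFANEW_OFFSET = 0x3C      # DOS header field holding the PE header's file offset
PE_SIGNATURE_SIZE = 4       # "PE\0\0"
COFF_HEADER_SIZE = 20       # machine, section count, timestamp, ...
OPT_SECTION_ALIGNMENT = 32  # offset of SectionAlignment in the PE32 optional header
OPT_FILE_ALIGNMENT = 36     # offset of FileAlignment in the PE32 optional header


def optional_field_offset(pe_offset, field_offset):
    """File offset of an optional-header field when the PE header starts at pe_offset."""
    return pe_offset + PE_SIGNATURE_SIZE + COFF_HEADER_SIZE + field_offset


# Which 4-byte-aligned PE header positions inside the 64-byte DOS header make
# the section alignment field land exactly on the PE header pointer?
overlapping = [
    pe_offset
    for pe_offset in range(4, 0x40, 4)
    if optional_field_offset(pe_offset, OPT_SECTION_ALIGNMENT) == E_LFANEW_OFFSET
]

if __name__ == "__main__":
    print("PE header offsets that overlap e_lfanew:", overlapping)
```

With the PE header at offset 4, the dword at 0x3C has to be 4 to point at the PE header, and that same dword is read as the section alignment. The file alignment is the next dword at 0x40, just past the DOS header, so it can be set freely.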
Runs of other unused fields in the header can be used to put the last remaining string (»user32.dll«) and even code into. The latter is a bit complicated, because the code sequence must fit into the slot of unused fields, and if you’re unlucky, it might grow when moving into the header if a jump that used to be a short one (with an 8-bit displacement) is turned into a longer near jump (with a 32-bit displacement) because the distance between jump site and target has become too large. I didn’t manage to fit a lot of code into the headers, but at least there’s something.
The following dump is what the headers now look like. The main part is the same, except that the blocks that have been moved into the headers (N_user32, next_name and parts of main) are now obviously gone:
mz_hdr:
    dw "MZ"                     ; DOS magic
    dw "kj"                     ; filler to align the PE header
pe_hdr:
    dw "PE",0                   ; PE magic + 2 padding bytes
    dw 0x014c                   ; i386 architecture
    dw 0                        ; no sections
N_user32:
    db "user32.dll",0,0         ; 12 bytes of data collapsed into the header
    ;dd 0                       ; [UNUSED-12] timestamp
    ;dd 0                       ; [UNUSED] symbol table pointer
    ;dd 0                       ; [UNUSED] symbol count
    dw 8                        ; optional header size
    dw 0x0102                   ; characteristics: 32-bit, executable
opt_hdr:
    dw 0x010b                   ; optional header magic
main_part_1:                    ; 12 bytes of main entry point + 2 bytes of jump
    mov eax, [fs:0x30]          ; get PEB pointer from TEB
    mov eax, [eax+0x0C]         ; get PEB_LDR_DATA pointer from PEB
    mov eax, [eax+0x14]         ; go to first LDR_DATA_TABLE_ENTRY
    jmp main_part_2
    align 4, db 0
    ;db 13,37                   ; [UNUSED-14] linker version
    ;dd RVA(the_end)            ; [UNUSED] code size
    ;dd RVA(the_end)            ; [UNUSED] size of initialized data
    ;dd 0                       ; [UNUSED] size of uninitialized data
    dd RVA(main_part_1)         ; entry point address
main_part_2:                    ; another 6 bytes of code + 2 bytes of jump
    ; set up stack frame for local variables
    push ebp
%define DummyVar ebp-4
%define kernel32base ebp-8
%define user32base ebp-12
    sub esp, 12
    mov eax, [eax]              ; go to where ntdll.dll typically is
    jmp main_part_3
    align 4, db 0
    ;dd RVA(main)               ; [UNUSED-8] base of code
    ;dd RVA(main)               ; [UNUSED] base of data
    dd BASE                     ; image base
    dd SECTALIGN                ; section alignment (collapsed with the
                                ; PE header offset in the DOS header)
    dd ALIGNMENT                ; file alignment
next_name:                      ; we interrupt again for a few bytes of code from the loader
    pop esi                     ; restore the name pointer
    add edx, 4                  ; advance to next list item
    jmp name_loop
    align 4, db 0
    ;dw 4,0                     ; [UNUSED-8] OS version
    ;dw 0,0                     ; [UNUSED] image version
    dw 4,0                      ; subsystem version
    dd 0                        ; [UNUSED-4] Win32 version
    dd RVA(the_end)             ; size of image
    dd RVA(opt_hdr)             ; size of headers (must be small enough
                                ; so that entry point inside header is accepted)
    dd 0                        ; [UNUSED-4] checksum
    dw 3                        ; subsystem = console
    dw 0                        ; [UNUSED-6] DLL characteristics
    dd 0x00100000               ; maximum stack size
    dd 0x00001000               ; initial stack size
    dd 0x00100000               ; maximum heap size
    dd 0x00001000               ; initial heap size
    dd 0                        ; [UNUSED-4] loader flags
    dd 0                        ; number of data directory entries (= none!)
OPT_HDR_SIZE equ $ - opt_hdr
ALL_HDR_SIZE equ $ - $$

;;;;;;;;;;;;;;;;;;;; .text ;;;;;;;;;;;;;;;;;

main_part_3:
    mov eax, [eax]              ; go to where kernel32.dll typically is
try_next_lib:
    ; (from here on, not much has changed)
With this, we’re at 436 bytes, a whopping 33% less than before! The downside is that the header declarations in the source code become quite unreadable by now, and that we’re no longer forward compatible: A future version of Windows might decide that the OS version listed in the EXE file is now totally relevant and may thus not want to execute files made for version »33630.1068«.
Unsafe optimizations
All along the way, we were cautious not to remove any checks and clean exits in case of failure. But we’re already relying on a few details of the PE loader that are unlikely to change soon, but are not carved into stone either. So why not go full YOLO and strip off all the safety nets? We could assume that …
- … kernel32.dll always is the third image loaded (after our own executable and ntdll.dll).
- … the kernel32.dll image is a proper PE image with all headers and directory items in their usual places.
- … all imported functions actually exist.
- … uninitialization (GlobalUnlock, CloseClipboard) is not necessary, because the system cleans up our mess anyway when the process exits.
- … GlobalLock is a no-operation that can be omitted completely, because the HGLOBAL that is returned by GetClipboardData is already a bona fide pointer.
This allows us to rip out a good chunk of code. For example, we don’t need to separate find_import and call_import any longer, because we’ll no longer check whether a function exists; if we want to look up a function, we’re always going to call it as well. Furthermore, the order of the loader and main code has been shuffled around a bit as well to make jumps as short as possible, and the code snippets used to fill the unused header fields are slightly different ones:
bits 32
BASE      equ 0x00400000
ALIGNMENT equ 4
SECTALIGN equ 4

%define ROUND(v, a) (((v + a - 1) / a) * a)
%define ALIGNED(v) (ROUND(v, ALIGNMENT))
%define RVA(obj) (obj - BASE)

org BASE

mz_hdr:
    dw "MZ"                     ; DOS magic
    dw "kj"                     ; filler to align the PE header
pe_hdr:
    dw "PE",0                   ; PE magic + 2 padding bytes
    dw 0x014c                   ; i386 architecture
    dw 0                        ; no sections
N_user32:
    db "user32.dll",0,0         ; 12 bytes of data collapsed into the header
    ;dd 0                       ; [UNUSED-12] timestamp
    ;dd 0                       ; [UNUSED] symbol table pointer
    ;dd 0                       ; [UNUSED] symbol count
    dw 8                        ; optional header size
    dw 0x0102                   ; characteristics: 32-bit, executable
opt_hdr:
    dw 0x010b                   ; optional header magic
main_part_1:                    ; 12 bytes of main entry point + 2 bytes of jump
    mov eax, [fs:0x30]          ; get PEB pointer from TEB
    mov eax, [eax+0x0C]         ; get PEB_LDR_DATA pointer from PEB
    mov eax, [eax+0x14]         ; go to first LDR_DATA_TABLE_ENTRY
    jmp main_part_2
    align 4, db 0
    ;db 13,37                   ; [UNUSED-14] linker version
    ;dd RVA(the_end)            ; [UNUSED] code size
    ;dd RVA(the_end)            ; [UNUSED] size of initialized data
    ;dd 0                       ; [UNUSED] size of uninitialized data
    dd RVA(main_part_1)         ; entry point address
main_part_2:                    ; another 6 bytes of code + 2 bytes of jump
    ; set up stack frame for local variables
    push ebp
%define DummyVar ebp-4
%define kernel32base ebp-8
%define user32base ebp-12
    sub esp, 12
    mov eax, [eax]              ; go to where ntdll.dll typically is
    jmp main_part_3
    align 4, db 0
    ;dd RVA(main)               ; [UNUSED-8] base of code
    ;dd RVA(main)               ; [UNUSED] base of data
    dd BASE                     ; image base
    dd SECTALIGN                ; section alignment (collapsed with the
                                ; PE header offset in the DOS header)
    dd ALIGNMENT                ; file alignment
main_part_3:                    ; another 5 bytes of code + 2 bytes of jump
    mov eax, [eax]              ; go to where kernel32.dll typically is
    mov ebx, [eax+0x10]         ; load base address of the library
    jmp main_part_4
    align 4, db 0
    ;dw 4,0                     ; [UNUSED-8] OS version
    ;dw 0,0                     ; [UNUSED] image version
    dw 4,0                      ; subsystem version
    dd 0                        ; [UNUSED-4] Win32 version
    dd RVA(the_end)             ; size of image
    dd RVA(opt_hdr)             ; size of headers (must be small enough
                                ; so that entry point inside header is accepted)
    dd 0                        ; [UNUSED-4] checksum
    dw 3                        ; subsystem = console
    dw 0                        ; [UNUSED-2] DLL characteristics
    dd 0x00100000               ; maximum stack size
    dd 0x00001000               ; initial stack size
    dd 0x00100000               ; maximum heap size
    dd 0x00001000               ; initial heap size
    dd 0                        ; [UNUSED-4] loader flags
    dd 0                        ; number of data directory entries (= none!)
OPT_HDR_SIZE equ $ - opt_hdr
ALL_HDR_SIZE equ $ - $$

main_part_4:
    mov [kernel32base], ebx     ; store kernel32's base address
    mov esi, 0x01364564         ; hash of "LoadLibraryA"
    push N_user32               ; we want to load user32.dll
    call call_import            ; call LoadLibraryA
    mov [user32base], eax       ; store user32's base address

    ; if (!OpenClipboard(NULL)) ExitProcess(1);
    push 0
    mov ebx, eax                ; user32 base address was still in eax
    mov esi, 0xFC7956AD         ; hash of "OpenClipboard"
    call call_import
    or eax, eax
    jz error

    ; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
    push 1                      ; CF_TEXT
    ; mov ebx, [user32base]
    mov esi, 0x0C473D74         ; hash of "GetClipboardData"
    call call_import
    or eax, eax
    jz error

    ; strlen(str)
    mov ecx, eax
strlen_loop:
    mov dl, [ecx]
    or dl, dl
    jz strlen_end
    inc ecx
    jmp strlen_loop
strlen_end:
    sub ecx, eax

    ; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
    push 0                      ; lpOverlapped = NULL
    lea edx, [DummyVar]         ; lpBytesWritten
    push edx
    push ecx                    ; nNumberOfBytesToWrite = strlen(str)
    push eax                    ; lpBuffer = str
    push -11                    ; hFile = ...
    mov ebx, [kernel32base]
    mov esi, 0xEACA71C2         ; hash of "GetStdHandle"
    call call_import            ; ... GetStdHandle(STD_OUTPUT_HANDLE)
    push eax
    ; mov ebx, [kernel32base]
    mov esi, 0x3FD1C30F         ; hash of "WriteFile"
    call call_import

    ; ExitProcess(0);
    push 0
    jmp exit
error:
    push 1
exit:
    mov ebx, [kernel32base]
    mov esi, 0x665640AC         ; hash of "ExitProcess"
    ; fall-through into call_import

call_import: ; FUNCTION that calls procedure [esi] in library at base [ebx]
    mov edx, [ebx+0x3c]         ; get PE header pointer (w/ RVA translation)
    add edx, ebx
    mov edx, [edx+0x78]         ; get export table pointer RVA (w/ RVA translation)
    add edx, ebx
    push edx                    ; store the export table address for later
    mov ecx, [edx+0x18]         ; ecx = number of named functions
    mov edx, [edx+0x20]         ; edx = address-of-names list (w/ RVA translation)
    add edx, ebx
name_loop:
    push esi                    ; store the desired function name's hash (we will clobber it)
    mov edi, [edx]              ; load function name (w/ RVA translation)
    add edi, ebx
cmp_loop:
    movzx eax, byte [edi]       ; load a byte of the name ...
    inc edi                     ; ... and advance the pointer
    xor esi, eax                ; apply xor-and-rotate
    rol esi, 7
    or eax, eax                 ; last byte?
    jnz cmp_loop                ; if not, process another byte
    or esi, esi                 ; result hash match?
    jnz next_name               ; if not, this is not the correct name
    ; if we arrive here, we have a match!
    pop esi                     ; restore the name pointer (though we don't use it any longer)
    pop edx                     ; restore the export table address
    sub ecx, [edx+0x18]         ; turn the negative counter ECX into a positive one
    neg ecx
    mov eax, [edx+0x24]         ; get address of ordinal table (w/ RVA translation)
    add eax, ebx
    movzx ecx, word [eax+ecx*2] ; load ordinal from table
    ;sub ecx, [edx+0x10]        ; subtract ordinal base
    mov eax, [edx+0x1C]         ; get address of function address table (w/ RVA translation)
    add eax, ebx
    mov eax, [eax+ecx*4]        ; load function address (w/ RVA translation)
    add eax, ebx
    jmp eax                     ; jump to the target function
next_name:
    pop esi                     ; restore the name pointer
    add edx, 4                  ; advance to next list item
    dec ecx                     ; decrease counter
    jmp name_loop

align ALIGNMENT, db 0
the_end:
The final result with this is 316 bytes, another 27% less than before!
Conclusion
This concludes our journey into size optimization. At this point, we’re 240 times smaller than the naïve first C implementation, and even if we consider our first serious optimization step (the C implementation without C library) as the starting point, we’re still almost 10 times smaller. But admittedly, the amount of effort necessary for this is extremely high and hardly justified ;)
You can download all the source files of this little experiment if you’re interested.
I’m not going to claim that my implementation is the smallest possible, most efficient or best-on-any-other-axis one. I’m not a seasoned sizecoder at that low level (usually I stop at the »get rid of the C library« step). What also concerns me is that I had to implement the export table parser differently from all documentation I could find on the subject (including Microsoft’s official PE specification) by not subtracting the base ordinal from the value in the name ordinal table to get the function address table index. So if you have any explanations or improvement ideas, let me know.
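For readers who want to reason about the ordinal question, here is a toy Python model of the lookup as call_import performs it. All table contents are invented for illustration, and `resolve` is a hypothetical helper. The sketch assumes (as the working assembly code suggests, but without making any claim about the specification) that the name ordinal table already holds zero-based indices into the function address table.

```python
def resolve(name, names, name_ordinals, functions):
    """Look up an export the way call_import does:
    name -> position in the name table -> entry of the name ordinal table
    -> entry of the function address table (no ordinal base subtracted)."""
    position = names.index(name)          # linear search through the name table
    table_index = name_ordinals[position] # 16-bit value, used as-is
    return functions[table_index]


# Invented example tables; real DLLs have hundreds of entries.
ORDINAL_BASE = 5                          # hypothetical ordinal base of this toy DLL
FUNCTIONS = [0x1000, 0x1010, 0x1020]      # function address table (RVAs)
NAMES = ["Alpha", "Beta", "Gamma"]        # name table
NAME_ORDINALS = [2, 0, 1]                 # parallel to NAMES

if __name__ == "__main__":
    for name in NAMES:
        rva = resolve(name, NAMES, NAME_ORDINALS, FUNCTIONS)
        print(f"{name}: RVA 0x{rva:04X}")
```

In this model, the ordinal base would only matter when going from an externally visible ordinal number to a table index (ordinal minus base); with a base of 5, subtracting it from the values above would index outside the table.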
Update (2017-09-09): As a commenter pointed out, some of the executables didn’t run on Windows 7 x64. I figured out what’s the issue and updated the post and the download file accordingly – see the last paragraph before the code sample in the »going sectionless« chapter for details.
Thank you for the enjoyable read.
I have tried the ASM files on my PC with Windows 7 x64.
getclip_pe_v1_2sections.asm would not assemble with nasm-2.13.01-win32 so I switched to yasm-1.3.0-win32 which seems to be what you are using.
All of them would run except three:
– getclip_pe_v2_1section detected by Avira Antivir as TR/Crypt.XPACK.Gen and put in quarantine
– getclip_pe_v5_collapse, The application was unable to start correctly (0xc0000005)
– getclip_pe_v6_unsafe, The application was unable to start correctly (0xc0000005)
widge: Thanks for the information! I dug into the crashing issue (quite a rabbit hole, I tell ya!) and fixed it.
Regarding the anti-virus warning, there’s not much we can do about that. Anti-virus software is inherently broken and just loves to interfere with all kinds of size-coding :(
KeyJ,
This is such a wonderful article.
Nothing like a good yak shaving :-)
Thank you – I learned a lot
Your C library code calls ExitProcess(1) twice instead of returning 1 from main which is a slight inconsistency. Also the assembly function calls ExitProcess twice instead of jump to a label for it which would save 5-2 or 3 bytes. More hand tricks could reduce even a few more bytes off the asm though I realize the focus here was the PE limits.
Most of the games with the LoadLibrary and GetProcAddress lookups are too dangerous for anything but playing around unless regressing them on all Windows versions including 32/64 bit x86 and Itanium, 7, 8 and 10 or more like XP, etc while constantly monitoring 10 updates. Also you could simplify hashing to 16 bits or any bizarre tricks that seem to always work. The same for some of the really unusual header compression tricks.
Basically you have created the closest as possible to COM files for Windows almost and if this could work on every flavor of Windows starting from 95 it would at least be neat as you could start coding as soon as the 2 key kernel32 functions become available.
You’re right, mov eax, returncode; ret should do the trick too. I heard somewhere that ExitProcess is mandatory though, so I kept it without thinking twice. Maybe I should revisit that, including re-testing on the relevant platforms. Speaking of which, I wouldn’t count anything older than Windows 7 as a target, and even that is debatable by now. Same goes for Itanium; I’d rather see WINE as a valid target for execution than that ;)
All that being said: No, this isn’t something that I would recommend to use for any productive purposes. If you’re concerned with any kind of compatibility (forwards, backwards, sidewards), it goes without saying that you should stop at the point where any of the PE header fields are misused for anything.
I was searching for this very thing.
Thank you very much! It helped me a lot to understand how the format works and why section alignment is so important!
268-byter with imports for win10
stasoid: Wow, that’s interesting. I’m pretty sure that I tried a section alignment of 4 when I did this research four years ago, and it didn’t work. Good to know that it does now!
Import by hash still has its merits though, because import by name is larger than any of the alternatives and import by ordinal is notoriously fragile (a system update may shuffle around the ordinals, and you’re screwed).
It seems that Windows 11 allows sections aligned to less than 512 bytes but they must not be .data or .bss. Without resorting to assembly language we can get 1056 bytes.
I will take your second version in C, but instead of Microsoft C, we can try gcc. I have used mingw64 version 4.9.2, which is bundled with the DevCpp IDE (of course, I have set up the PATH environment variable so that it can be called from the command line).
After calling
gcc -m32 -mconsole -nostdlib -nostartfiles -Os -Wall -s getclip.c -lkernel32 -luser32 -Wl,-e_mainCRTStartup,--section-alignment,16,-file-alignment,16 -o getclip
getclip.exe is now 1056 bytes and I have checked it on Windows 11 (64 bit), Windows 10 (64 bit) and Windows XP (32 bit). It contains 3 sections: .text, .rdata and .idata, and imports only kernel32.dll and user32.dll
However, if you just move DWORD dummy declaration outside mainCRTStartup into global scope, the executable is no more compatible with Windows 10, because such action creates .bss section.
Samir Ribić: Thanks a lot for your research! I didn’t know that Windows makes such a difference based on the writeability of a section, but with your description, it makes perfect sense.
This is a very cool article! Thanks!!