User-space stacks can grow to considerable sizes - megabytes and more. The size of the stack is not really a limitation there. However, it can still become a problem for low-level environments such as kernel space, firmware, and embedded OSes. Such targets typically come with tiny, fixed-size stacks - not more than a few pages of memory. This post explores a last-resort technique to bypass such a limitation.
Table of Contents
- Act 1:
regexin Linux kernel - Act 2: Proof of Concept
- Act 3: Rock & Roll
- Limitations
- Finale: Stackaroo
- Conclusion
- Bonus: Recursive variant
Act 1: regex in Linux kernel
The inspiration to write this post came from doing research on adapting common Rust crates such as regex to run as part of a Linux kernel module.
The issue I faced was substantial stack usage, causing general protection faults in address ranges close to the rsp.
It turned out the regex crate expected a bit more stack than kernel space could offer, which is 16kB on x64.
This problem can be simulated from user space by applying the stack limit using ulimit -s and running a sample program.
Through trial and error, I found that the regex crate remains stable across multiple runs with a stack size of approximately 40 kB.
use regex::Regex;
fn main() {
let re = Regex::new(r"^[a-zA-Z]+$").unwrap(); // Only alphabetic words
assert!(re.is_match("foo"));
}
$ ulimit -s
8192 # kB
$ ./regex-test ; echo $?
0
# Limit stack to 16kB
$ ulimit -s 16
$ ./regex-test ; echo $?
Segmentation fault (core dumped)
139
But the question remains - how to overcome this limitation in an environment where the stack size cannot be easily increased?
Act 2: Proof of Concept
The bypass will be performed within our program. The goal is to:
- Preserve the current
rspin a static variable. - Swap the
rspto a user-controlled buffer (backed by global memory in this example, though it could also be heap-backed). - Call a callback on the new “stack”.
- Restore the original stack pointer after the callback returns.
#include <stdio.h>
void my_callback(void* arg) {
printf("In callback with arg: %p\n", arg);
}
__attribute__((noinline)) void switch_stack(
void (*callback)(void*),
void* arg
) {
static void* OLD_SP = NULL;
static void(*CALLBACK_FN)(void*) = NULL;
static void* CALLBACK_ARG = NULL;
static char AUX_STACK[1 << 20] __attribute__((aligned(16))); // 1MB, 16-byte aligned
static void* NEW_SP = AUX_STACK + sizeof(AUX_STACK) - 8; // Ensure it's 16-byte aligned after the call instruction
CALLBACK_FN = callback;
CALLBACK_ARG = arg;
// Save rsp to OLD_SP and switch to NEW_SP
__asm__ volatile (
"mov %%rsp, %0\n"
"mov %1, %%rsp\n"
: "=m"(OLD_SP)
: "r"(NEW_SP)
);
// Call the callback function
CALLBACK_FN(CALLBACK_ARG);
// Restore old rsp
__asm__ volatile (
"mov %0, %%rsp\n"
:
: "m"(OLD_SP)
);
}
int main() {
switch_stack(my_callback, (void*)0xDEADBEEF);
return 0;
}
The switch_stack function accepts a pointer to a callback that will be called on the new stack and a void* argument.
The function is marked as noinline to prevent it from being merged into the caller’s frame.
This gives it its own predictable frame layout, making it easier to debug - though it should work as expected even when inlined.
__attribute__((noinline)) void switch_stack(
void (*callback)(void*),
void* arg
)
The function defines a set of static variables to hold the context after the stack switch.
We want to avoid local variables as much as possible.
If the compiler emits an instruction referencing local stack data (e.g., mov rax, [rsp+0x08]) after rsp is swapped, the program will likely crash.
OLD_SP- stores the original stack pointer so we can restore it after the callback completesCALLBACK_FN- holds the callback function pointer in static storageCALLBACK_ARG- holds the callback argument in static storageAUX_STACK- a static 1MB array serving as backing storage for the new stack, with 16-byte alignment required by x64 ABINEW_SP- initialized to point to the 16-byte aligned end ofAUX_STACKsince the stack grows downward. Must subtract 8 to keep it 16-byte aligned after thecallinstruction pushes the return address onto the stack.
static void* OLD_SP = NULL;
static void(*CALLBACK_FN)(void*) = NULL;
static void* CALLBACK_ARG = NULL;
static char AUX_STACK[1 << 20] __attribute__((aligned(16))); // 1MB, 16-byte aligned
static void* NEW_SP = AUX_STACK + sizeof(AUX_STACK) - 8; // Ensure it's 16-byte aligned after the call instruction
Before performing the stack swap, we copy the arguments to static storage.
After rsp is swapped, we cannot safely access any existing local variables.
If the local variable gets referenced using rsp-relative addressing, the program will likely crash.
However, if locals are accessed through rbp-relative instructions (e.g. mov rax, [rbp+0x8]), it will actually keep working fine as the rbp was left untouched after the swap.
In practice, rbp-relative addressing is used in unoptimized builds, so it’s just better to avoid referencing locals after the swap altogether.
CALLBACK_FN = callback;
CALLBACK_ARG = arg;
__asm__ volatile (
"mov %%rsp, %0\n" // Store rsp in OLD_SP
"mov %1, %%rsp\n" // Load NEW_SP into rsp
: "=m"(OLD_SP)
: "r"(NEW_SP)
);
To minimize inline assembly and maintain portability, we call the callback using C code. This avoids the complexity of dealing with ABI-related calling conventions.
CALLBACK_FN(CALLBACK_ARG);
After the stack is restored, it would be safe again to access switch_stack local variables - if there were any.
__asm__ volatile (
"mov %0, %%rsp\n"
:
: "m"(OLD_SP)
);
The program appears to work as expected:
# Unoptimized build
$ gcc main.c
./a.out
In callback with arg: 0xdeadbeef
# Optimized build
$ gcc -O3 main.c
$ ./a.out
In callback with arg: 0xdeadbeef
The compiled function doesn’t emit any instructions accessing [rsp+offset] after the stack swap, confirming the static variable approach works as intended. All memory accesses use rip-relative addressing ([rip+offset]), which references global/static variables independently of the stack pointer.
# UNOPTIMIZED
0000000000001177 <switch_stack>:
1177: f3 0f 1e fa endbr64
117b: 55 push rbp
117c: 48 89 e5 mov rbp,rsp
117f: 48 83 ec 10 sub rsp,0x10
1183: 48 89 7d f8 mov QWORD PTR [rbp-0x8],rdi
1187: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
118b: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
118f: 48 89 05 aa 2e 00 00 mov QWORD PTR [rip+0x2eaa],rax # 4040 <CALLBACK_FN.4>
1196: 48 8b 45 f0 mov rax,QWORD PTR [rbp-0x10]
119a: 48 89 05 a7 2e 00 00 mov QWORD PTR [rip+0x2ea7],rax # 4048 <CALLBACK_ARG.3>
11a1: 48 8b 05 68 2e 00 00 mov rax,QWORD PTR [rip+0x2e68] # 4010 <NEW_SP.2>
11a8: 48 89 25 a1 2e 00 00 mov QWORD PTR [rip+0x2ea1],rsp # 4050 <OLD_SP.1>
11af: 48 89 c4 mov rsp,rax
11b2: 48 8b 15 87 2e 00 00 mov rdx,QWORD PTR [rip+0x2e87] # 4040 <CALLBACK_FN.4>
11b9: 48 8b 05 88 2e 00 00 mov rax,QWORD PTR [rip+0x2e88] # 4048 <CALLBACK_ARG.3>
11c0: 48 89 c7 mov rdi,rax
11c3: ff d2 call rdx
11c5: 48 8b 25 84 2e 00 00 mov rsp,QWORD PTR [rip+0x2e84] # 4050 <OLD_SP.1>
11cc: 90 nop
11cd: c9 leave
11ce: c3 ret
# O3
0000000000001190 <switch_stack>:
1190: f3 0f 1e fa endbr64
1194: 48 89 f8 mov rax,rdi
1197: 48 83 ec 08 sub rsp,0x8
119b: 48 89 f7 mov rdi,rsi
119e: 48 89 35 a3 2e 10 00 mov QWORD PTR [rip+0x102ea3],rsi # 104048 <CALLBACK_ARG.3>
11a5: 48 89 05 a4 2e 10 00 mov QWORD PTR [rip+0x102ea4],rax # 104050 <CALLBACK_FN.4>
11ac: 48 8d 15 8d 2e 10 00 lea rdx,[rip+0x102e8d] # 104040 <OLD_SP.1>
11b3: 48 89 25 86 2e 10 00 mov QWORD PTR [rip+0x102e86],rsp # 104040 <OLD_SP.1>
11ba: 48 89 d4 mov rsp,rdx
11bd: ff d0 call rax
11bf: 48 8b 25 7a 2e 10 00 mov rsp,QWORD PTR [rip+0x102e7a] # 104040 <OLD_SP.1>
11c6: 48 83 c4 08 add rsp,0x8
11ca: c3 ret
Act 3: Rock & Roll
So, will this technique actually work in the Linux kernel? After all, this is the place where the stack limitations bit me in the butt.
This module uses the stack excessively to demonstrate the technique under real constraints.
The stack_heavy_callback function allocates 512 kB on the stack - exceeding the kernel’s 16 kB limit by a large margin.
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Dawid Macek");
MODULE_DESCRIPTION("A simple Hello World Linux kernel module");
MODULE_VERSION("1.0");
#define AUX_STACK_SIZE (1 << 20) // 1MB
#define BIG_BUFFER_SIZE (1 << 19) // 512kB
noinline void switch_stack(
void (*callback)(void*),
void* arg
) { /* CUT */ };
static void stack_heavy_callback(void* arg) {
int i;
// 512kB on the stack. Will surely crash the kernel.
volatile char large_stack_array[BIG_BUFFER_SIZE];
// Use the array to avoid optimization.
for (i = 0; i < BIG_BUFFER_SIZE; ++i) {
large_stack_array[i] = i % 256;
}
printk(KERN_INFO "Inside stack heavy callback.\n");
}
static int __init hello_init(void)
{
printk(KERN_INFO "Module loaded.\n");
switch_stack(stack_heavy_callback, NULL);
// stack_heavy_callback(NULL);
printk(KERN_INFO "After stack swap.\n");
return 0;
}
static void __exit hello_exit(void)
{
printk(KERN_INFO "Module unloaded.\n");
}
module_init(hello_init);
module_exit(hello_exit);
When compiling, the alarm bells are going off, but we aren’t stopping. Safety? Always off.
$ make
make -C /lib/modules/5.15.0-91-generic/build M=/vagrant/linux modules
make[1]: Entering directory '/usr/src/linux-headers-5.15.0-91-generic'
warning: the compiler differs from the one used to build the kernel
The kernel was built by: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
You are using: gcc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
CC [M] /vagrant/linux/hello.o
/vagrant/linux/hello.c: In function ‘stack_heavy_callback’:
/vagrant/linux/hello.c:22:1: warning: the frame size of 524296 bytes is larger than 1024 bytes [-Wframe-larger-than=]
22 | }
| ^
MODPOST /vagrant/linux/Module.symvers
CC [M] /vagrant/linux/hello.mod.o
LD [M] /vagrant/linux/hello.ko
BTF [M] /vagrant/linux/hello.ko
Skipping BTF generation for /vagrant/linux/hello.ko due to unavailability of vmlinux
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-91-generic
Against all odds, the module loads and executes without crashing. The stack switch successfully provides enough space for the callback to complete its work.
$ sudo dmesg --follow
[ 1325.550820] Module loaded.
[ 1325.551042] Inside stack heavy callback.
[ 1325.551042] After stack swap.
Without the stack swap
If stack_heavy_callback is called directly without the stack switch, we can expect either an oops or a complete kernel panic.
In this case, we got an oops.
Looking at the page fault address and rsp, it’s clear the code attempted to access memory outside the allocated stack bounds.
[ 1658.348625] BUG: unable to handle page fault for address: ffffa40d83602000
[ 1658.348831] #PF: supervisor write access in kernel mode
[ 1658.348977] #PF: error_code(0x0002) - not-present page
[ 1658.349116] PGD 100000067 P4D 100000067 PUD 1001e3067 PMD 2d37f067 PTE 0
[ 1658.349323] Oops: 0002 [#1] SMP NOPTI
[ 1658.349460] CPU: 3 PID: 9642 Comm: insmod Tainted: G OE 5.15.0-91-generic #101-Ubuntu
[ 1658.349590] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 09/01/2023
[ 1658.349749] RIP: 0010:stack_heavy_callback.constprop.0+0x3f/0x81 [hello]
[ 1658.349902] Code: 65 48 8b 04 25 28 00 00 00 48 89 45 e8 31 c0 4c 63 e3 49 81 fc ff ff 07 00 76 0f 4c 89 e6 48 c7 c7 00 60 cd c0 e8 b1 ed 7b d0 <42> 88 9c 25 e8 ff f7 ff ff c3 81 fb 00 00 08 00 75 d3 48 c7 c7 58
[ 1658.350171] RSP: 0018:ffffa40d835a7c28 EFLAGS: 00010293
[ 1658.350307] RAX: 0000000000000000 RBX: 000000000005a3d8 RCX: 0000000000000000
[ 1658.350442] RDX: 0000000000000000 RSI: ffff8a3ec3da0580 RDI: ffff8a3ec3da0580
[ 1658.350578] RBP: ffffa40d83627c40 R08: 0000000000000003 R09: fffffffffffcb188
[ 1658.350713] R10: 0000000000000012 R11: 0000000000000001 R12: 000000000005a3d8
[ 1658.350849] R13: ffff8a3dc8630ad0 R14: 0000000000000000 R15: ffffffffc0cd6040
[ 1658.350984] FS: 00007f5fef75d000(0000) GS:ffff8a3ec3d80000(0000) knlGS:0000000000000000
[ 1658.351120] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1658.351291] CR2: ffffa40d83602000 CR3: 000000000200e000 CR4: 00000000003506e0
The culprit instruction, corresponding to the large_stack_array[i] = i % 256; attempts to write to a stack address that doesn’t exist.
3f: 42 88 9c 25 e8 ff f7 mov BYTE PTR [rbp+r12*1-0x80018],bl
Limitations
This technique comes with severe caveats:
- Stack unwinding breaks: If the callback panics or throws an exception, the unwinding mechanism won’t be able to traverse back to the original stack properly, likely crashing the program. I’ve had some success with catching panics on Linux-based platforms, but it doesn’t seem to work on Windows.
- Debuggers get confused: Debugging tools will be confused by the non-standard stack layout.
- No recursive swaps: You cannot swap stacks within an already stack-swapped function - the static variables are already in use. Recursive swaps are possible, but require extensive inline assembly, reducing portability.
- Possibly breaks exploit-prevention solutions?: If I were developing one, I’d be validating if the
rspfalls into an expected memory ranges.
Despite these limitations, the technique remains valuable in constrained environments where other solutions aren’t feasible.
Finale: Stackaroo
After successfully deploying this technique in both user space and a kernel module, I packaged it into a reusable Rust library - stackaroo. The library provides a clean API supporting x64 and ARM64 architectures and comes with a C FFI layer. You can learn more reading the docs.
Let’s see how to reduce stack usage of the regex-test example from Act 1.
Some modifications were needed, but the code shouldn’t be that hard to follow.
The callout returns the matching result via an in-out argument:
use regex::Regex;
use stackaroo::swap_to_heap;
fn is_match(input: &str) -> bool {
struct Args {
input: String,
matched: bool,
}
// This runs on a new stack.
fn callout(args: &mut Args) {
let Ok(re) = Regex::new(r"^[a-zA-Z]+$") else {
args.matched = false;
return;
};
args.matched = re.is_match(&args.input);
}
unsafe {
let mut args = Args {
input: input.to_string(),
matched: false,
};
// Swap rsp to 1MB of heap-backed memory.
swap_to_heap(callout, Some(&mut args), 1 << 20).expect("Failed to swap stack.");
args.matched
}
}
fn main() {
assert!(is_match("foo"));
}
Testing it with the limited stack proves the program remains stable, not crashing anymore.
$ ulimit -s
8192 # kB
$ ./regex-test ; echo $?
0
$ ulimit -s 16 # kB
$ ./regex-test ; echo $?
0
Conclusion
Stack swapping might be a last-resort technique for environments where the given stack is insufficient and cannot be easily expanded. While it successfully bypasses stack size limitations in constrained contexts like kernel modules, embedded systems, or firmware, it’s not without significant trade-offs.
Bonus: Recursive variant
This variant avoids static variables entirely by storing the old stack pointer directly on the new stack, enabling recursive stack swaps.
__attribute__((noinline)) void switch_stack(
void (*callback)(void*),
void* arg,
void* stack_buf,
uintptr_t stack_size
) {
// Switch to new stack
__asm__ volatile (
// Calculate top of new stack (stack_buf + stack_size)
"lea (%[buf], %[size]), %%r11\n"
// Align to 16 bytes
"and $-16, %%r11\n"
// Save current rsp to r10
"mov %%rsp, %%r10\n"
// Switch to new stack
"mov %%r11, %%rsp\n"
// Push old rsp onto the new stack
"push %%r10\n"
:
: [buf] "r"(stack_buf),
[size] "r"(stack_size)
: "r10", "r11", "memory"
);
// Call the callback
callback(arg);
// Restore old stack
__asm__ volatile (
"pop %%rsp\n"
:
:
: "memory"
);
}