awful: pre-grow propolis-server heap to avoid #1008 (#1032)

Merged
iximeow merged 3 commits into oxidecomputer:master from iximeow:rancid
Feb 6, 2026
@iximeow iximeow commented Feb 5, 2026

The gory details are in that issue (#1008), but for VMs with large address spaces it is relatively easy for a guest picking random pages to cause long streams of page faults as Propolis does I/O against those pages. The faults then starve out anything that would change the address space, most importantly brk() and friends, which may need to grow the heap to support an allocation made as part of device operations.

At that point, the device will be (partially) stuck. Bad enough. Then the guest OS may notice the situation and try to restart the device. To do this, a vCPU will do some kind of access to the stuck device, which may be stuck in a way that the vCPU becomes blocked on the device. That vCPU won't be responsive to interrupts and from the guest perspective the whole machine is extremely broken.

We're not immediately sure how to untangle the faults or the address space (AS) lock bits, so for the time being we can at least try to avoid calling brk() at runtime by growing the heap up front, to a size that is probably enough to serve real needs.

(giving this a quick run on mb-0 before making it a real PR)
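
A minimal sketch of the pre-grow idea described above, for readers who want to see its shape in isolation. The function name `pregrow_heap` and the constant `SLACK` are hypothetical; only the "zeroed `Vec` plus 16 MiB, then drop" pattern comes from the diff quoted later in this thread. Whether the heap actually stays grown after the drop depends on the allocator in use, so treat this as an illustration rather than a guarantee.

```rust
use std::hint::black_box;

/// One mebibyte.
const MB: usize = 1024 * 1024;

/// Slack added on top of the estimate (hypothetical name; the 16 MiB
/// figure mirrors the diff quoted in this thread).
const SLACK: usize = 16 * MB;

/// Allocate a zeroed "balloon" of `wanted_heap + SLACK` bytes and free it
/// right away, so that any heap growth happens now instead of later while
/// devices are doing I/O. Returns the number of bytes briefly held.
fn pregrow_heap(wanted_heap: usize) -> usize {
    let balloon = vec![0u8; wanted_heap + SLACK];
    // black_box keeps the optimizer from eliding the allocation entirely.
    let balloon = black_box(balloon);
    let held = balloon.len();
    drop(balloon);
    held
}
```

Calling `pregrow_heap(wanted_heap)` once during startup, before any device work begins, is the intended use in this sketch.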

@hawkw hawkw left a comment

Okay, yeah, I agree that this is "awful", but...if the problem goes away, it's clearly less worse than not doing it!

Comment on lines +516 to +517
// (see propolis::block::crucible::Crucible::WORKER_COUNT)
*wanted_heap += 8 * PER_WORKER_HEAP;
Member

nit, sorry: could we perhaps import that WORKER_COUNT constant here?

Member Author

I'd thought about it, but I think we really should do the // TODO there and make it tunable. then the number here would be that tunable, defaulted to DEFAULT_WORKER_COUNT (also 8)! and be basically the same as nworkers for the file backend.
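
One way the tunable described above could look, as a hedged sketch only. `BackendOpts`, `worker_count`, and `add_backend_heap` are hypothetical names, and the value of `PER_WORKER_HEAP` is a placeholder because the real one is not shown in this thread. `DEFAULT_WORKER_COUNT = 8` follows the comment above.

```rust
use std::num::NonZeroUsize;

/// Default number of workers, per the discussion above.
const DEFAULT_WORKER_COUNT: usize = 8;

/// Placeholder per-worker heap budget (assumption: the real value is not
/// shown in this thread).
const PER_WORKER_HEAP: usize = 8 * 1024 * 1024;

/// Hypothetical backend options with an optional worker-count override.
#[derive(Default)]
struct BackendOpts {
    workers: Option<NonZeroUsize>,
}

impl BackendOpts {
    /// The configured worker count, or the default when unset.
    fn worker_count(&self) -> usize {
        self.workers
            .map_or(DEFAULT_WORKER_COUNT, NonZeroUsize::get)
    }
}

/// Add this backend's share to the running heap estimate, using the same
/// number that would size the worker pool.
fn add_backend_heap(wanted_heap: &mut usize, opts: &BackendOpts) {
    *wanted_heap += opts.worker_count() * PER_WORKER_HEAP;
}
```

With this shape, the heap estimate and the worker pool read the same setting, so the two cannot drift apart the way a hard-coded `8` could.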

Member

fine with me!

Comment on lines +675 to +677
// 64 * 1K is a wild over-estimate while we support 1-15 queues
// across virtio-block and nvme.
wanted_heap += 64 * 1024;
Member

64k ought to be enough for anybody?

Comment on lines +669 to +670
let balloon = vec![0u8; wanted_heap + 16 * propolis::common::MB];
std::mem::drop(balloon);
Member

question: rather than allocating a balloon, zero-initializing it, dropping it, and hoping that it correctly sets the max heap size such that the allocator won't try to brk to get more heap despite us having allocated a wanted_heap and change...why not just call brk ourselves here?

Member Author

we know in practice that growing the heap ended up at brk(), but the allocator might not use brk/sbrk to actually manage the storage (there's a mmap backend for umem for example which we're not using today and probably won't tomorrow, but..)

so there's still the iffy bit that maybe the allocations have weird alignment constraints, or the allocator gets fragmented, such that 320 contiguous MiB or whatever doesn't cut it later on. to be more confident about that I think we'd want the kind of pooled buffers you were talking about earlier, but also I really really want to fix the OS so we don't need this or the file backend buffers lol
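
For context, the "pooled buffers" alternative mentioned above could be sketched roughly as follows. This is not code from Propolis: `BufferPool` and its methods are hypothetical, and the sketch only illustrates the general technique of allocating every buffer up front and recycling it, so that steady-state I/O never asks the allocator for more memory.

```rust
use std::sync::Mutex;

/// Hypothetical fixed-size buffer pool: all buffers are allocated once at
/// construction and recycled afterwards.
struct BufferPool {
    buf_len: usize,
    free: Mutex<Vec<Vec<u8>>>,
}

impl BufferPool {
    /// Allocate `count` zeroed buffers of `buf_len` bytes each.
    fn new(count: usize, buf_len: usize) -> Self {
        let free = (0..count).map(|_| vec![0u8; buf_len]).collect();
        Self { buf_len, free: Mutex::new(free) }
    }

    /// Take a buffer, or `None` if the pool is exhausted. Callers would
    /// wait or apply backpressure rather than allocate a new buffer.
    fn acquire(&self) -> Option<Vec<u8>> {
        self.free.lock().unwrap().pop()
    }

    /// Return a buffer. Buffers of the wrong length are dropped so the
    /// pool only ever hands out `buf_len`-sized buffers.
    fn release(&self, mut buf: Vec<u8>) {
        if buf.len() == self.buf_len {
            buf.fill(0);
            self.free.lock().unwrap().push(buf);
        }
    }

    /// Number of buffers currently available.
    fn available(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}
```

The trade-off is that an exhausted pool has to make callers wait, which is the price of never growing the heap on the I/O path.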

Member

agreed wholeheartedly with...all of this.

@iximeow iximeow marked this pull request as ready for review February 5, 2026 21:12
hawkw commented Feb 5, 2026

specifically, it seems it was waiting for a login prompt: https://buildomat.eng.oxide.computer/wg/0/artefact/01KGQRWSKSJKNN9QXY8PPY27EG/lo4zfMXju35PmPUhejEF21RN5YPMZAMVs5qcXgIyXfJs0hEJ/01KGQRXWP544KACC9ZPBCCJQ84/01KGQTREWP1ZF5QAFJDN7PFY4M/phd-runner.log?format=x-bunyan#L1290

let's see what the serial output from the guest has to say:

BdsDxe: loading Boot0001 "UEFI " from PciRoot(0x0)/Pci(0x4,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)
BdsDxe: starting Boot0001 "UEFI " from PciRoot(0x0)/Pci(0x4,0x0)/NVMe(0x1,00-00-00-00-00-00-00-00)
Welcome to GRUB!

GNU GRUB  version 2.06

      Use the ^ and v keys to select which entry is highlighted.
      Press enter to boot the selected OS, `e' to edit the commands
      before booting or `c' for a command-line.
*Linux virt
The highlighted entry will be executed automatically in 1s.
The highlighted entry will be executed automatically in 0s.
Booting `Linux virt'



   OpenRC 0.44.10 is starting up Linux 5.15.41-0-virt (x86_64)

 * /proc is already mounted
 * Mounting /run ... * /run/openrc: creating directory
 * /run/lock: creating directory
 * /run/lock: correcting owner
 * Caching service dependencies ... [ ok ]
 * Remounting devtmpfs on /dev ... [ ok ]
 * Mounting /dev/mqueue ... [ ok ]
 * Mounting modloop  ... * Verifying modloop
 [ ok ]
 * Mounting security filesystem ... [ ok ]
 * Mounting debug filesystem ... [ ok ]
 * Mounting persistent storage (pstore) filesystem ... [ ok ]
 * Mounting efivarfs filesystem ... [ ok ]
 * Starting busybox mdev ... [ ok ]
 * Loading hardware drivers ... [ ok ]
 * Loading modules ... [ ok ]
 * Setting system clock using the hardware clock [UTC] ... [ ok ]
 * Checking local filesystems  ... [ ok ]
 * Remounting filesystems ... [ ok ]
 * Mounting local filesystems ... [ ok ]
 * Configuring kernel parameters ... [ ok ]
 * Migrating /var/lock to /run/lock ... [ ok ]
 * Creating user login records ... [ ok ]
 * Cleaning /tmp directory ... [ ok ]
 * Setting hostname ... [ ok ]
 * Starting busybox syslog ... [ ok ]
 * Starting firstboot ... [ ok ]

Welcome to Alpine Linux 3.16
Kernel 5.15.41-0-virt on an x86_64 (/dev/ttyS0)

localhost login: root
Welcome to Alpine!

The Alpine Wiki contains a large amount of how-to guides and general
information about administrating Alpine systems.
See <http://wiki.alpinelinux.org/>.

You can setup the system with the command: setup-alpine

You may change this message by editing /etc/motd.

localhost:~# stty -F `tty` cols 9999
localhost:~# sudo lspci -vvx
-ash: sudo: not found
localhost:~# sudo lshw -notime
-ash: sudo: not found

LMAO WHAT

@AlejandroME AlejandroME added this to the 18 milestone Feb 5, 2026
iximeow commented Feb 5, 2026

well, I took some notes in #1035, reran phd-run and it passed.. really wish we had a core of propolis that was stuck, but lacking that I'll see about getting some other debugging-related things in separately..

@hawkw hawkw left a comment

great sure whatever it's awful but it works
@iximeow iximeow merged commit 47f0c4c into oxidecomputer:master Feb 6, 2026
12 checks passed