awful: pre-grow propolis-server heap to avoid #1008 (#1032)

iximeow merged 3 commits into oxidecomputer:master from
Conversation
the gory details are in that issue, but for VMs with large address spaces it is relatively easy for a guest picking random pages to cause long streams of page faults as Propolis does I/O against those pages. The faults then starve out anything that would change the address space, most importantly `brk()` and friends, which may need to grow the heap to support an allocation made as part of device operations.

At that point, the device will be (partially) stuck. Bad enough. Then the guest OS may notice the situation and try to restart the device. To do this, a vCPU will do some kind of access to the stuck device, which may be stuck in a way that the vCPU becomes blocked on the device. That vCPU won't be responsive to interrupts, and from the guest perspective the whole machine is extremely broken.

We're not immediately sure how to untangle the faults or AS lock bits, so for the time being we can at least try to not brk() at runtime by growing the heap probably-enough to serve real needs.
hawkw
left a comment
Okay, yeah, I agree that this is "awful", but...if the problem goes away, it's clearly less worse than not doing it!
```rust
// (see propolis::block::crucible::Crucible::WORKER_COUNT)
*wanted_heap += 8 * PER_WORKER_HEAP;
```
nit, sorry: could we perhaps import that WORKER_COUNT constant here?
I'd thought about it, but I think we really should do the // TODO there and make it tunable. then the number here would be that tunable, defaulted to DEFAULT_WORKER_COUNT (also 8)! and be basically the same as nworkers for the file backend.
```rust
// 64 * 1K is a wild over-estimate while we support 1-15 queues
// across virtio-block and nvme.
wanted_heap += 64 * 1024;
```
64k ought to be enough for anybody?
```rust
let balloon = vec![0u8; wanted_heap + 16 * propolis::common::MB];
std::mem::drop(balloon);
```
question: rather than allocating a balloon, zero-initializing it, dropping it, and hoping that it correctly sets the max heap size such that the allocator won't try to brk to get more heap despite us having allocated a wanted_heap and change...why not just call brk ourselves here?
we know in practice that growing the heap ended up at brk(), but the allocator might not use brk/sbrk to actually manage the storage (there's a mmap backend for umem for example which we're not using today and probably won't tomorrow, but..)
so there's still the iffy bit that maybe the allocations have weird alignment constraints, or the allocator gets fragmented, such that 320 contiguous MiB or whatever doesn't cut it later on. to be more confident about that I think we'd want the kind of pooled buffers you were talking about earlier, but also I really really want to fix the OS so we don't need this or the file backend buffers lol
agreed wholeheartedly with...all of this.
huh, seems …

specifically, it seems it was waiting for a login prompt: https://buildomat.eng.oxide.computer/wg/0/artefact/01KGQRWSKSJKNN9QXY8PPY27EG/lo4zfMXju35PmPUhejEF21RN5YPMZAMVs5qcXgIyXfJs0hEJ/01KGQRXWP544KACC9ZPBCCJQ84/01KGQTREWP1ZF5QAFJDN7PFY4M/phd-runner.log?format=x-bunyan#L1290 let's see what the serial output from the guest has to say: LMAO WHAT

well, I took some notes in #1035, reran …
hawkw
left a comment
great sure whatever it's awful but it works
(giving this a quick run on mb-0 before making it a real PR)