Show HN: Sub-millisecond VM sandboxes using CoW memory forking

306 points - last Tuesday at 1:43 PM


I wanted to see how fast an isolated code sandbox could start if I never had to boot a fresh VM.

So instead of launching a new microVM per execution, I boot Firecracker once with Python and numpy already loaded, then snapshot the full VM state. Every execution after that creates a new KVM VM backed by a `MAP_PRIVATE` mapping of the snapshot memory, so Linux gives me copy-on-write pages automatically.

That means each sandbox starts from an already-running Python process inside a real VM, runs the code, and exits.

These are real KVM VMs, not containers: separate guest kernel, separate guest memory, separate page tables. When a VM writes to memory, it gets a private copy of that page.
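
The copy-on-write behaviour described above can be tried on any Linux box without KVM. The following is a toy sketch in Python, not the project's Rust code: a temporary file stands in for the snapshot's guest-memory file, and `fork_view` and `demo` are names made up for this illustration. Two `MAP_PRIVATE` mappings of the same file play the role of two forks; a write through one is visible neither to the other nor to the file.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def fork_view(fd: int, length: int) -> mmap.mmap:
    """Return a private, copy-on-write view of the snapshot file (hypothetical helper)."""
    return mmap.mmap(
        fd,
        length,
        flags=mmap.MAP_PRIVATE,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )


def demo() -> dict:
    # Stand-in for the snapshot memory file: four pages filled with b"A".
    with tempfile.NamedTemporaryFile() as snap:
        snap.write(b"A" * (4 * PAGE))
        snap.flush()

        # A read-only descriptor is enough: private mappings never write back.
        fd = os.open(snap.name, os.O_RDONLY)
        try:
            vm1 = fork_view(fd, 4 * PAGE)
            vm2 = fork_view(fd, 4 * PAGE)

            # The first write to page 0 makes the kernel give vm1 its own copy.
            vm1[0:5] = b"hello"

            result = {"vm1": bytes(vm1[0:5]), "vm2": bytes(vm2[0:5])}
            vm1.close()
            vm2.close()
        finally:
            os.close(fd)

        # The backing file is untouched by the private write.
        with open(snap.name, "rb") as f:
            result["file"] = f.read(5)
        return result


if __name__ == "__main__":
    print(demo())
```

In a real fork the mapping would additionally be registered with KVM as guest memory, which this sketch does not attempt.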

The hard part was not CoW itself. The hard part was resuming the snapshotted VM correctly.

Rust, Apache 2.0.

Source
  • cperciva

    last Wednesday at 2:14 AM

    Don't forget about entropy! You've just created two identical copies of all of your random number generators, which could be very very bad for security.

    The firecracker team wrote a very good paper about addressing this when they added snapshot support.

      • adammiribyan

        last Wednesday at 5:55 AM

        Good callout. We seed entropy before snapshot to unblock getrandom(), but forks still share CSPRNG state. The proper fix per Firecracker’s docs is RNDADDENTROPY + RNDRESEEDCRNG after each fork, plus reseeding userspace PRNGs like numpy separately. On the roadmap. https://github.com/firecracker-microvm/firecracker/blob/main...
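
A minimal sketch of the userspace half of that problem, using only Python's standard `random` module as a stand-in for numpy's generator. The helper names (`fork_from`, `reseed`) are hypothetical, and copying the generator state merely simulates what a memory snapshot does. Note the assumption baked into `reseed`: inside a real fork, `os.urandom` is only trustworthy once the guest kernel's CSPRNG has itself been reseeded, which this sketch does not do.

```python
import os
import random


def fork_from(state) -> random.Random:
    """Simulate a fork: a new generator with the template's exact state."""
    rng = random.Random()
    rng.setstate(state)
    return rng


def reseed(rng: random.Random) -> None:
    """Give this fork fresh entropy (assumes the kernel CSPRNG is already reseeded)."""
    rng.seed(os.urandom(32))


def demo() -> dict:
    template = random.Random(1234)
    state = template.getstate()  # what the snapshot captures

    a, b = fork_from(state), fork_from(state)
    same_before = a.getrandbits(64) == b.getrandbits(64)

    reseed(a)
    reseed(b)
    same_after = a.getrandbits(64) == b.getrandbits(64)
    return {"same_before": same_before, "same_after": same_after}


if __name__ == "__main__":
    print(demo())
```

Before reseeding, both forks produce the same "random" value; afterwards they diverge.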

      • Retr0id

        last Wednesday at 4:26 AM

        I suppose it'd be easy enough to re-seed RNGs, but re-relocating ASLR sounds like a pain. (Although I suppose for Python that doesn't matter)

          • hinkley

            last Wednesday at 4:53 AM

            Off the cuff, the first step on ASLR is to not publish your images and to rotate your snapshots regularly.

            The old fastCGI trick is to buffer the forking by idling half a dozen or ten copies of the process and initializing new instances in the background while the existing pool is servicing new requests. By my count we are reinventing fastCGI for at least the fourth time.

            Long running tasks are less sensitive to the startup delays because we care a lot about a 4 second task taking an extra five seconds and we care much less about a 1 minute task taking 1:05. It amortizes out even in Little’s Law.

            • cperciva

              last Wednesday at 5:46 AM

              Re-seeding is easy. The hard parts are (a) finding everything which needs to be reseeded -- not just explicit RNGs but also things like keys used to pick outgoing port numbers in a pseudorandom order -- and (b) making sure that all the relevant code becomes aware that it was just forked -- not necessarily trivial given that there's no standard "you just got restarted from a snapshot" signal in UNIX.

                • Intermernet

                  last Wednesday at 10:59 AM

                  I would have thought that in the days of containers, we'd have better tooling around this. Sounds like a goldmine for vuln research!

                  • aa-jv

                    last Wednesday at 1:28 PM

                    Isn't this what -HUP is supposed to be for in the first place? Maybe a -STOP/-HUP/-HUP situation?

                      • treyd

                        last Wednesday at 5:59 PM

                        HUP is short for "hangup" which was supposed to be sent when the tty controlling the session the process is in hung up.

        • injidup

          last Wednesday at 7:19 AM

          It's so frustrating seeing all this sandbox tooling pop up for Linux while Windows is soooooo far behind. I mean Windows Sandbox ( https://learn.microsoft.com/en-us/windows/security/applicati... ) doesn't even have customizable networking whitelists. You can turn networking on or off but that's about as fine-grained as it gets. So all of us still having to write desktop Windows stuff are left without a good method of easily putting our agents in a blast-proof box.

            • benterix

              last Wednesday at 9:45 AM

              I don't mean to turn this into a religious war, but honestly, I sometimes wonder what would be the net benefit for humanity if Windows slowly disappeared. And I'm saying this as someone who appreciates the good stuff done by Microsoft in the past (windows 9* UI, decades-long support for Win32 APIs etc.).

                • lionkor

                  last Wednesday at 10:07 AM

                  Why doesn't Microsoft just take their incredible, human-replacing, AGI level AI's, and just port all their code to a Linux kernel instead of the NT kernel?

                  Oh right, because that's not in the training set.

                    • Intermernet

                      last Wednesday at 11:02 AM

                      The NT kernel is actually pretty amazing. You can even run a pretty solid Windows version if you want to sail the high seas. LTSC and masgrave will get you most of the way there.

                      • wongarsu

                        last Wednesday at 10:38 AM

                        The NT kernel is by far the best part of Windows. It's everything on top of it that has turned to crap

                • CTDOCodebases

                  last Wednesday at 7:49 AM

                  Web browsers don't even work properly in Windows Sandbox. There is a bug that hasn't been patched in over a year whereby web browsers can't use the GPU to render a page so all it displays is a white page. Users have to create a configuration file that turns off vGPU and launch Windows Sandbox from that.

                  • BornaP

                    last Wednesday at 11:59 AM

                    Feel you. That's why we're actively working on Windows and macOS sandbox support at Daytona, with proper isolation, agent tools, dynamic resizing, etc.; not just "networking on/off" level controls.

                    If you're building agents on Windows and want to give it a spin, reach out for early access.

                      • injidup

                        last Wednesday at 2:29 PM

                        Our stack is msvc / cmake / ninja / incredibuild. Can you support such things?

                  • RyleHisk

                    yesterday at 11:57 AM

                    You can run WSL on Windows — then you've got access to all the Linux sandbox tools.

                    • ddtaylor

                      last Wednesday at 10:26 AM

                      I guess get busy contacting Microsoft or get busy using Open Source software instead.

                      • ivan_burazin

                        last Wednesday at 12:03 PM

                        Can give you access to win sandboxes on Daytona, just fill this in here and lmk! https://www.daytona.io/docs/en/computer-use/

                    • BornaP

                      last Wednesday at 12:05 PM

                      Really impressive work. Sub-millisecond cold starts via CoW forking is a pretty clever approach.

                      The tricky part we keep running into with agent sandboxes is that code execution is just one piece, because agents also need file system access, memory, git, a pty, and a bunch of other tools all wired up and isolated together. That's where things get hairy fast.

                      • crawshaw

                        last Wednesday at 1:39 AM

                        Nice to see this work! I experimented with this for exe.dev before we launched. The VM itself worked really well, but there was a lot of setup to get the networking functioning. And in the end, our target is use cases that don't mind a ~1-second startup time, which meant doing a clean systemd start each time was easier.

                        That said, I have seen several use cases where people want a VM for something minimal, like a python interpreter, and this is absolutely the sort of approach they should be using. Lot of promise here, excited to see how far you can push it!

                          • hrmtst93837

                            last Wednesday at 9:34 AM

                            The thing people tend to gloss over is how CoW shines until you need to update the base image, then you start playing whack-a-mole with stale memory and hotpatching. Snapshots give you a magic boot, but god help you when you need to roll out a security fix to hundreds of forks with divergent state.

                            Fast startup is nice. If the workload is "run plain Python on a trusted codebase" you win, but once it gets hairier the maintenance overhead sends you straight back to yak shaving.

                              • smj-edison

                                last Wednesday at 1:58 PM

                                Wouldn't you need to restart a process anyways if there's a security update? Sounds like you'd just need to kill all the VMs, start up the base again, and fork (but what do I know).

                                • crawshaw

                                  last Wednesday at 3:03 PM

                                  That is very true. We use copy on write for exe.dev base images right now, and are accumulating a lot of storage because of version drift.

                                  We believe the fix here is to mount the base image as a read-only block device, then mount a read-write block device overlay. We have not rolled it out yet because there are some edge cases we are working through, and we convinced ourselves we could rework images after the fact onto a base image.

                                  Right now our big win from copy-on-write is cloning VMs. You can `ssh exe.dev cp curvm newvm` in about a second to split your computer into a new one. It enables a lot of great workflows.

                              • indigodaddy

                                last Wednesday at 2:07 AM

                                simonw seems like he's always wanting what you describe, maybe more for wasm though

                                  • edunteman

                                    last Wednesday at 5:35 AM

                                    I’ve been a big fan of “what’s the thinnest this could be” interpretations of sandboxes. This is a great example of that. On the other end of the spectrum there’s just-bash from the Vercel folks.

                                      • adammiribyan

                                        last Wednesday at 6:06 AM

                                        Exactly -- they skip the OS, we make it free to clone.

                            • skwuwu

                              last Tuesday at 3:32 PM

                              I noticed that you implemented a high-performance VM fork. However, to me, it seems like a general-purpose KVM project. Is there a reason why you say it is specialized for running AI agents?

                                • adammiribyan

                                  last Tuesday at 4:12 PM

                                  Fair question. The fork engine itself is general purpose -- you could use it for anything that needs fast isolated execution. We say 'AI agents' because that's where the demand is right now. Every agent framework (LangChain, CrewAI, OpenAI Assistants) needs sandboxed code execution as a tool call, and the existing options (E2B, Daytona, Modal) all boot or restore a VM/container per execution. At sub-millisecond fork times, you can do things that aren't practical with 100-200ms startup: speculative parallel execution (fork 10 VMs, try 10 approaches, keep the best), treating code execution like a function call instead of an infrastructure decision, etc.
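
The "fork 10 VMs, try 10 approaches, keep the best" pattern can be sketched as follows. Everything here is illustrative: `run_in_fork` is a hypothetical stand-in that just calls the candidate locally, where a real client would fork a sandbox, run the code in it, and return a score. It is not the project's API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple


def run_in_fork(candidate: Callable[[], float]) -> float:
    """HYPOTHETICAL: pretend to fork a sandbox, run the candidate, return its score."""
    return candidate()


def best_of(candidates: Sequence[Callable[[], float]]) -> Tuple[int, float]:
    """Run every candidate concurrently and return (index, score) of the best one."""
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        # pool.map preserves input order, so scores[i] belongs to candidates[i].
        scores = list(pool.map(run_in_fork, candidates))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    # Ten toy "approaches"; the one with k == 3 scores highest.
    approaches = [lambda k=k: -abs(k - 3.0) for k in range(10)]
    print(best_of(approaches))
```

The point of the pattern is that each attempt starts from the same warm state and is thrown away afterwards, so the cost per attempt is dominated by the work itself rather than by sandbox startup.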

                                    • shayonj

                                      last Wednesday at 1:34 PM

                                      > you can do things that aren't practical with 100-200ms startup: speculative parallel execution (fork 10 VMs, try 10 approaches, keep the best), treating code execution like a function call instead of an infrastructure decision, etc.

                                      i am not following, why isn't it practical?

                                        • wamatt

                                          last Wednesday at 2:15 PM

                                          Off the top of my head, trading or realtime voice come to mind. Probably plenty of other domains could benefit.

                              • wang_cong

                                today at 12:48 AM

                                No networking inside forks? This is not usable.

                                • vmg12

                                  last Wednesday at 1:12 AM

                                  Does it only work with that specific version of firecracker and only with vms with 1 vcpu?

                                  More than the sub ms startup time the 258kb of ram per VM is huge.

                                    • adammiribyan

                                      last Wednesday at 6:56 AM

                                      1 vCPU per fork currently. Multi-vCPU is doable (per-vCPU state restore in a loop) but would multiply fork time.

                                      On Firecracker version: tested with v1.12, but the vmstate parser auto-detects offsets rather than hardcoding them, so it should work across versions.

                                  • buckle8017

                                    last Wednesday at 2:04 AM

                                    This is how android processes work, but it's a security problem breaking some ASLR type things.

                                    • deivid

                                      last Wednesday at 8:34 AM

                                      Niiiiiice, I've been working on something like this, but reducing linux boot time instead of snapshot restore time; obviously my solution doesn't work for heavy runtimes

                                      • indigodaddy

                                        last Wednesday at 2:05 AM

                                        Your write-up made me think of:

                                        https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...

                                        Are there parallels?

                                          • CompuIves

                                            last Wednesday at 2:41 PM

                                            I think this is very similar! Really cool to see.

                                            The first version we launched used the exact same approach (MAP_PRIVATE). Later on, we bypassed the file system by using shared memory and using userfaultfd because ultimately the NVMe became the bottleneck (https://codesandbox.io/blog/cloning-microvms-using-userfault... and https://codesandbox.io/blog/how-we-scale-our-microvm-infrast...).

                                              • adammiribyan

                                                yesterday at 9:13 AM

                                                Glad to see the approach validated at scale! I hadn't seen your blog posts until they were linked here, going to dig into the userfaultfd path. Would love to chat if you're open to it.

                                        • aftbit

                                          last Wednesday at 2:35 PM

                                          Is this a service or a library? The README has curl and an API key... Can I run this myself on my own hardware?

                                            • adammiribyan

                                              yesterday at 9:04 AM

                                              Both. The engine is open source. You can self-host it on any Linux box with KVM. There's also a live API you can hit right now (curl example in the README). Building the managed service for teams that don't want to run their own infra.

                                          • jauntywundrkind

                                            last Tuesday at 9:32 PM

                                            I keep so so so many opencode windows going. I wish I had bought a better SSD, because I have so much swap space to support it all.

                                            I keep thinking I need to see if CRIU (checkpoint restore in userspace) is going to work here, so I can put work down for a longer time and be able to close & restore instances sort of on-demand.

                                            I don't really love the idea of using VMs more, but I super love this project. Heck yes forking our processes/VMs.

                                              • indigodaddy

                                                last Wednesday at 2:13 AM

                                                You could throw this on a VPS or server and it could help in that regard: (disclaimer, my thing)

                                                https://GitHub.com/jgbrwn/vibebin

                                                • adammiribyan

                                                  last Tuesday at 10:53 PM

                                                  CRIU is great for save/restore. The nice thing about CoW forking is it's cheap branching, not just checkpointing. You can clone a running state thousands of times at a few hundred KB each.

                                              • indigodaddy

                                                last Wednesday at 2:08 AM

                                                Does this need passthrough or might we be able to leverage PVM with it on a passthrough-less cloud VM/VPS?

                                                  • dizhn

                                                    last Wednesday at 7:34 AM

                                                    I am not sure exactly what you are asking but firecracker does need access to /dev/kvm so nesting needs to be enabled on the VM.

                                                • diptanu

                                                  last Wednesday at 1:45 AM

                                                  The tricky part of doing this in production is cloning sandboxes across nodes. You would have to snapshot the resident memory, file system (or a CoW layer on top of the rootfs), move the data across nodes, etc.

                                                    • Rygian

                                                      last Wednesday at 10:12 AM

                                                      If each node has its own warmed-up VM waiting from startup, there's no need to clone across nodes.

                                                      • indigodaddy

                                                        last Wednesday at 2:10 AM

                                                        Is this relevant?

                                                        https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...

                                                        • adammiribyan

                                                          last Wednesday at 6:00 AM

                                                          Agreed, cross-node is the hard next step. For now single-node density gets you surprisingly far. 1000 concurrent sandboxes on one $50 box. When we need multi-node, userfaultfd with remote page fetch is the likely path.

                                                            • shayonj

                                                              last Wednesday at 1:00 PM

                                                              Cool project. +1 on userfaultfd for the multi-node path. Wrote about how uffd-based on-demand restore works with respect to my Cloud Hypervisor change [1] if you are curious.

                                                              I think the main things to watch are fault storms at resume (all vCPUs hitting missing pages at once) and handler throughput if you're serving pages over the network instead of local mmap. I think it's less likely to happen when you fork a brand new VM vs. say a VM that has been doing things for 5 mins.

                                                              Also interestingly, Cloud Hypervisor couldn't use MAP_PRIVATE for this because it breaks VFIO/vhost-user bindings. Firecracker's simpler device model is nice for cases like this.

                                                              [1] https://www.shayon.dev/post/2026/65/linux-page-faults-mmap-a...

                                                                • adammiribyan

                                                                  yesterday at 9:06 AM

                                                                  Great writeup, bookmarked. The fault storm point is interesting -- our forks are short-lived (execute and discard) so the working set is small, but for longer-running sandboxes that would absolutely be a problem.

                                                        • latortuga

                                                          last Wednesday at 2:53 AM

                                                          Similar to sprites.dev?

                                                          • polskibus

                                                            last Wednesday at 2:54 PM

                                                            Is it possible to run minikube inside? I’d love to use it for ephemeral clusters for testing.

                                                            • huksley

                                                              last Wednesday at 7:20 PM

                                                              Any plans to offer self-hosted / open-source version?

                                                              • quickrefio

                                                                last Wednesday at 3:29 PM

                                                                Feels like fork() but for VMs—very cool.

                                                                • yagizdagabak

                                                                  last Tuesday at 10:14 PM

                                                                  Cool approach. Are you guys planning on creating a managed version?

                                                                    • adammiribyan

                                                                      last Tuesday at 10:47 PM

                                                                      The API in the readme is live right now -- you can curl it. Plan is multi-region, custom templates with your own dependencies, and usage-based pricing. Email in my profile if you want early access.

                                                                      • adammiribyan

                                                                        last Tuesday at 10:43 PM

                                                                        Thanks! Yes, there's going to be a managed version.

                                                                    • theredsix

                                                                      last Wednesday at 3:51 PM

                                                                      super clever and awesome!

                                                                      • aa-jv

                                                                        last Wednesday at 1:26 PM

                                                                        Nice.

                                                                        Now I want the ability to freeze the VM cryogenically and move it to another machine automagically, defrosting and running as seamlessly as possible.

                                                                        I know this is gonna happen soon enough, I've been waiting since the death of TandemOS for just this feature ..

                                                                        • handfuloflight

                                                                          last Wednesday at 12:45 AM

                                                                          Can you run this in another sandbox? Not sure why you'd want to... but can you?

                                                                            • Teknoman117

                                                                              last Wednesday at 1:14 AM

                                                                              Nested page tables / nested virtualization made it to consumer CPUs about a decade ago, so yes :)

                                                                              • wmf

                                                                                last Wednesday at 1:00 AM

                                                                                It's pretty common to run VMs within containers so an attacker has to escape twice. You can probably disable 99% of system calls.

                                                                                • jauntywundrkind

                                                                                  last Wednesday at 1:14 AM

                                                                                  Mods: can we merge with https://news.ycombinator.com/item?id=47412812?

                                                                                    • tomhow

                                                                                      last Wednesday at 7:49 AM

                                                                                      Done, thanks!

                                                                                          • wei03288

                                                                                            last Wednesday at 6:28 AM

                                                                                            [flagged]

                                                                                              • adammiribyan

                                                                                                last Wednesday at 6:48 AM

                                                                                                On tail latency: KVM VM creation is 99.5% of the fork cost - create_vm, create_irq_chip, create_vcpu, and restoring CPU state. The CoW mmap is ~4 microseconds regardless of load. P99 at 1000 concurrent is 1.3ms. The mmap CoW page faults during execution are handled transparently by the host kernel and don't contribute to fork latency.

                                                                                                On snapshot staleness: yes, forks inherit all internal state including RNG seeds. For dependency updates you rebuild the template (~15s). No incremental update - full re-snapshot, similar to rebuilding a Docker image.

                                                                                                On the memory number: 265KB is the fork overhead before any code runs. Under real workloads we measured 3.5MB for a trivial print(), ~27MB for numpy operations. But 93% of pages stay shared across forks via CoW. We measured 100 VMs each running numpy sharing 2.4GB of read-only pages with only 1.75MB private per VM. So the real comparison to E2B's ~128MB is more like 3-27MB depending on workload, with most of the runtime memory shared.
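
A quick arithmetic check on the figures quoted above, as a sketch under stated assumptions. It reads the 2.4GB as the sum of shared mappings across all 100 VMs (so about 24MB per VM), treats GB as decimal, and assumes the ~4 microsecond mmap is the entire non-KVM 0.5% of fork cost. None of these readings is confirmed by the comment; the function names are made up for the illustration.

```python
def fork_memory_summary(
    vms: int = 100,
    shared_total_mb: float = 2400.0,   # "2.4GB", ASSUMED to be summed over all VMs
    private_per_vm_mb: float = 1.75,   # private (CoW-copied) memory per VM
) -> dict:
    shared_per_vm_mb = shared_total_mb / vms
    shared_fraction = shared_per_vm_mb / (shared_per_vm_mb + private_per_vm_mb)
    return {
        "shared_per_vm_mb": shared_per_vm_mb,          # 24.0
        "shared_fraction": shared_fraction,            # ~0.93, matching "93% of pages"
        "private_total_mb": vms * private_per_vm_mb,   # 175.0 across the whole fleet
    }


def implied_fork_time_us(mmap_us: float = 4.0, kvm_share: float = 0.995) -> float:
    """Total fork time if the mmap is ALL of the non-KVM share (an assumption)."""
    return mmap_us / (1.0 - kvm_share)


if __name__ == "__main__":
    print(fork_memory_summary())
    print(implied_fork_time_us())  # roughly 800 microseconds
```

Under those assumptions the numbers are self-consistent: about 24MB shared plus 1.75MB private per VM gives the quoted 93% sharing, and a 4 microsecond mmap at 0.5% of the total implies a fork time of roughly 0.8ms.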

                                                                                                  • user_7832

                                                                                                    last Wednesday at 11:20 AM

                                                                                                    Just a heads-up, you're almost certainly replying to a bot. For some reason there's a ton of them in this post.