AWS Lambda MicroVMs: run untrusted code with VM-level isolation


Let me put you in a situation. You need to run a piece of code you did not write. Maybe it is a script your user pasted into your platform, maybe it is a snippet an AI agent just generated and wants to execute. And then comes the question that keeps anyone building multi-tenant systems up at night: how do I run this without handing a stranger the keys to the house?

Until last week you had three paths, each with a catch. A VM gives you strong isolation but takes minutes to boot. A container starts in seconds but shares a kernel with its host, so running untrusted code there takes a pile of hardening. And the Lambda Function was built for short request-response, not for a session that has to keep live state between one interaction and the next (externalizing it to DynamoDB stores the data, not the live runtime: the running process, the loaded packages, the memory). In the end you chose between performance and isolation. There was no way around it. Until now.

Container, VM, or Lambda: the trade-off none of them solved alone

This pattern got common: AI coding assistants, interactive code environments, analytics, vulnerability scanners, game servers running player scripts. They all need the same thing: give each user their own environment to run code the team did not write, safely and without lag.

The knot is that real isolation and low latency pull in opposite directions. From a security angle you want a hard boundary between tenants (the Security pillar of the Well-Architected Framework: isolate what is not trusted). From an experience angle you want that environment up the instant the user shows up. Reconciling the two was the expensive work.

And there is a nice irony in this story. We spent years learning to build stateless apps, and now state is a requirement again.

The solution to the future was hiding in the past.

That is a line a friend dropped in a conversation, and it has not left my head since. Ever felt that way? Because I have. And it is roughly what Lambda MicroVMs does: it brings state back, without handing you the weight of a full VM.

What Lambda MicroVMs is

Lambda MicroVMs is a new primitive inside Lambda, built exactly for that gap. Each MicroVM gives a single user or session its own isolated environment that boots fast, keeps memory and disk for the whole session, and pauses to a low cost when the user steps away.

The magic comes from Firecracker, the same lightweight virtualization that already runs over 15 trillion Lambda invocations a month. This is not raw new tech, it is the mature foundation of Lambda itself, exposed in a new way.

The model is image-then-launch:

You build the image once (AWS runs your Dockerfile, initializes the app, and takes a snapshot of memory and disk). After that, every MicroVM you launch resumes from that snapshot instead of cold-booting. That is why launch and resume are near-instant, even for a multi-gigabyte session.

What it is actually for (with examples you will recognize)

The main cue: this only enters the picture if you are building a platform that runs third-party code. If your app does not execute outside code, you do not need it. It is a building block for people who build that kind of product:

Replit, CodeSandbox, "VS Code in the browser": the user types code in the browser and it runs isolated, per user, holding state while the tab is open. That "runs isolated" is the MicroVM.
Code interpreter (like ChatGPT's or Claude's): you ask "plot this CSV", the AI writes Python and runs it to answer you. The runtime that executes that generated code, isolated per conversation, is the use case.
CI/CD runner (and relatives): a job runs the code of a Pull Request that may come from any stranger's fork, untrusted by definition, so you want an isolated, disposable runner per job. Same family: a scanner that runs a suspicious binary, a coding-interview platform (the candidate's code runs isolated), an AI agent that runs shell commands.

The thread tying it all together: each user, session, or job needs its own isolated environment, and the code running there is not code you wrote. That is the cue to use a MicroVM instead of a Lambda Function.

Lambda Function or Lambda MicroVM?

They do not compete, they complete each other. The official comparison:

	Lambda Functions	Lambda MicroVMs
Best for	request-response or event-driven (APIs, data processing, automation)	persistent environments running user or AI-produced untrusted code
Programming model	function handler invoked in a supported runtime	any application: run your own binaries, listen on ports, use Linux OS capabilities
Duration	up to 15 min per invocation; multi-step workflows up to a year with Lambda Durable Functions	up to 8 hours per session; suspend and resume across sessions
Runtime	service-provided runtimes (or customer-provided)	customer-provided MicroVM images
Inbound networking	direct invocations or event-source integrations; response streaming	inbound access to any port using OSI Layer 7 protocols
Concurrency	one request per execution environment at a time	multiple concurrent connections per MicroVM
Environment state	warm starts may reuse the environment, but state may not persist across invocations	memory and disk state preserved on suspend, restored on resume
Scaling	automatic: Lambda creates and destroys environments in response to traffic	developer-controlled: you create, suspend, resume, and terminate via API
Lifecycle	fully managed by Lambda	developer-controlled, with optional idle policies
Pricing	per-request + GB-seconds	per-second of compute while running + snapshot storage while suspended

The most common confusion: people assume a MicroVM has the same duration limit as a Lambda Function. The startup is similar (both resume from a snapshot), but a Function dies at 15 minutes while a MicroVM holds a session for up to 8 hours with state intact. The real design: your app keeps Lambda Functions for the event-driven backbone, and calls MicroVMs only for the steps that need to run untrusted code in isolation.

How it works in practice: from endpoint to orchestration

Three things that trip people up at first, together.

The endpoint has a status. When you call run-microvm, you get an ID and a dedicated HTTPS endpoint for that MicroVM. But it is not ready instantly: it goes through states, from launch to RUNNING (about 2 seconds), and when idle it moves to suspended, coming back on resume. The endpoint is per MicroVM, per session.
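To make the status idea concrete, here is a minimal sketch of that lifecycle as a small state machine. Only RUNNING is a state name taken from the text; the other labels and the transition table are my reading of the description, not the service's official state list.

```python
from enum import Enum


class MicrovmState(Enum):
    # Only RUNNING is named in the text; the other labels are illustrative.
    LAUNCHING = "LAUNCHING"
    RUNNING = "RUNNING"
    SUSPENDED = "SUSPENDED"
    TERMINATED = "TERMINATED"


# Transitions as described in prose: launch -> running, idle -> suspended,
# resume -> running, and terminate from any live state.
ALLOWED = {
    MicrovmState.LAUNCHING: {MicrovmState.RUNNING, MicrovmState.TERMINATED},
    MicrovmState.RUNNING: {MicrovmState.SUSPENDED, MicrovmState.TERMINATED},
    MicrovmState.SUSPENDED: {MicrovmState.RUNNING, MicrovmState.TERMINATED},
    MicrovmState.TERMINATED: set(),
}


def can_transition(current: MicrovmState, target: MicrovmState) -> bool:
    """Return True if this sketch allows moving from current to target."""
    return target in ALLOWED[current]
```

The practical point: whatever routes traffic to the endpoint should check the state first instead of assuming the MicroVM is ready the moment the launch call returns.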

One image, many MicroVMs. You build the image once (create-microvm-image) and each MicroVM is a run-microvm. Want two? Call it twice, and you get two independent instances. Idle behavior is governed by the idle-policy: maxIdleDurationSeconds (suspend after X idle) and autoResumeEnabled (the next request wakes the MicroVM on its own, in about 1s, no manual restart). When you are done, terminate-microvm releases everything.
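The two idle-policy settings can be pictured as a tiny decision model. This is a local sketch, not the service's implementation: the field names mirror maxIdleDurationSeconds and autoResumeEnabled from the text, while the function names and the "idle for at least N seconds" comparison are assumptions of mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdlePolicy:
    # Field names mirror the two settings named in the text.
    max_idle_duration_seconds: int
    auto_resume_enabled: bool


def should_suspend(policy: IdlePolicy, last_request_at: float, now: float) -> bool:
    """True once the MicroVM has been idle for at least the configured duration."""
    return (now - last_request_at) >= policy.max_idle_duration_seconds


def on_request_while_suspended(policy: IdlePolicy) -> str:
    """What has to happen when a request arrives at a suspended MicroVM."""
    return "auto-resume" if policy.auto_resume_enabled else "resume-manually"
```

A short idle window cuts cost sooner but makes more requests pay the resume delay; a long one does the opposite. That trade-off is the whole reason the policy is worth choosing deliberately.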

You become the orchestrator. Since the endpoint is per session, something has to decide when to launch and where to route. Typically a Lambda Function in the backbone does it: it keeps a session -> MicroVM map (a store like DynamoDB in production), calls RunMicrovm on a user's first access, stores the ID and endpoint, mints a short-lived token with CreateMicrovmAuthToken, and proxies the request to the MicroVM's endpoint with the X-aws-proxy-auth header. If the instance is suspended and autoResume is on, the request itself wakes it. Add a routine to terminate orphan MicroVMs and you have the skeleton. The backbone code is in the next post in the series. And do not confuse this with Step Functions: MicroVM is the execution environment, Step Functions is an orchestrator, different layers.
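Here is a rough sketch of that routing logic, kept independent of any SDK. MicrovmApi is a hypothetical wrapper I made up to stand in for the RunMicrovm and CreateMicrovmAuthToken calls (their real request and response shapes are not shown in this post), the in-memory dict stands in for DynamoDB, and FakeApi exists only so the sketch runs without AWS access. The X-aws-proxy-auth header name comes from the text.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Microvm:
    microvm_id: str
    endpoint: str


class MicrovmApi(Protocol):
    """Hypothetical wrapper over RunMicrovm and CreateMicrovmAuthToken."""

    def run_microvm(self, image_id: str) -> Microvm: ...

    def create_auth_token(self, microvm_id: str) -> str: ...


class SessionRouter:
    def __init__(self, api: MicrovmApi, image_id: str) -> None:
        self._api = api
        self._image_id = image_id
        # In production this map would live in a store such as DynamoDB.
        self._sessions: dict[str, Microvm] = {}

    def route(self, session_id: str) -> tuple[str, dict[str, str]]:
        """Return the endpoint and headers to proxy this session's request with."""
        vm = self._sessions.get(session_id)
        if vm is None:
            # First access for this session: launch a dedicated MicroVM.
            vm = self._api.run_microvm(self._image_id)
            self._sessions[session_id] = vm
        # Mint a fresh short-lived token for every proxied request.
        token = self._api.create_auth_token(vm.microvm_id)
        return vm.endpoint, {"X-aws-proxy-auth": token}


class FakeApi:
    """In-memory stand-in so the sketch runs without AWS access."""

    def __init__(self) -> None:
        self.launched = 0

    def run_microvm(self, image_id: str) -> Microvm:
        self.launched += 1
        vm_id = f"mvm-{self.launched}"
        return Microvm(vm_id, f"https://{vm_id}.example.invalid")

    def create_auth_token(self, microvm_id: str) -> str:
        return f"token-for-{microvm_id}"
```

Calling route("alice") twice returns the same endpoint and launches only once, while route("bob") gets its own MicroVM. The orphan cleanup is left out on purpose; it would be a separate sweep over the same map.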

Cost, limits, and what is still missing

Cost is a decision, not a detail. Werner Vogels keeps hammering in the Frugal Architect that cost is an architecture requirement, not a number you discover on the bill. The suspend is exactly that in practice: you pay a lot for VM-level isolation, but only while the user is active. When they leave, the MicroVM suspends and the cost drops, with no loss of state. Designing your idle-policy on purpose is a cost decision. The model, from the official table: you pay per second of compute while it runs, and only snapshot storage while it is suspended. Unit prices are on the Lambda pricing page.
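A back-of-the-envelope version of that pricing model, so you can plug in your own numbers. Everything here is a placeholder: the function and parameter names are mine, the prices are inputs you take from the pricing page, and billing snapshot storage per GB-second is an assumption made to keep the arithmetic simple.

```python
def session_cost(
    running_seconds: float,
    suspended_seconds: float,
    compute_price_per_second: float,
    snapshot_gb: float,
    storage_price_per_gb_second: float,
) -> float:
    """Compute charge while running plus snapshot storage while suspended."""
    compute = running_seconds * compute_price_per_second
    storage = suspended_seconds * snapshot_gb * storage_price_per_gb_second
    return compute + storage
```

Comparing session_cost(3600, 0, ...) with session_cost(600, 3000, ...) for the same hour shows what an idle-policy is worth for a user who is active only ten minutes of it.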

Limits: ARM64, up to 16 vCPUs, 32 GB of memory, and 32 GB of disk per MicroVM, and up to 8 hours of total runtime. Provisioning is flexible: you set a baseline and burst up to 4x at peak, paying the baseline while it runs.
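A quick way to sanity-check a workload against those numbers before committing to the design. The limit values are the ones quoted above; SessionPlan and both function names are hypothetical, and treating the burst as a flat 4x multiplier on the baseline (without modelling how it interacts with the per-MicroVM maximum) is an assumption.

```python
from dataclasses import dataclass

# Per-MicroVM limits as quoted in the text.
REQUIRED_ARCH = "arm64"
MAX_VCPUS = 16
MAX_MEMORY_GB = 32
MAX_DISK_GB = 32
MAX_RUNTIME_HOURS = 8
BURST_FACTOR = 4


@dataclass(frozen=True)
class SessionPlan:
    """Hypothetical shape for what one session is expected to need."""

    architecture: str
    vcpus: int
    memory_gb: int
    disk_gb: int
    runtime_hours: float


def limit_violations(plan: SessionPlan) -> list[str]:
    """Return the limits the plan exceeds (an empty list means it fits)."""
    problems: list[str] = []
    if plan.architecture.lower() != REQUIRED_ARCH:
        problems.append(f"architecture must be {REQUIRED_ARCH}")
    if plan.vcpus > MAX_VCPUS:
        problems.append(f"vcpus above {MAX_VCPUS}")
    if plan.memory_gb > MAX_MEMORY_GB:
        problems.append(f"memory above {MAX_MEMORY_GB} GB")
    if plan.disk_gb > MAX_DISK_GB:
        problems.append(f"disk above {MAX_DISK_GB} GB")
    if plan.runtime_hours > MAX_RUNTIME_HOURS:
        problems.append(f"runtime above {MAX_RUNTIME_HOURS} hours")
    return problems


def burst_ceiling(baseline: float) -> float:
    """Peak capacity for a given baseline, assuming a flat 4x multiplier."""
    return baseline * BURST_FACTOR
```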

IaC: you can use the console, CloudFormation, and CDK.

Why Dockerfile + zip, and not a prebuilt ECR image? Aidan Steele dug into it: Lambda builds two copies of the image, one for Graviton 3 and one for Graviton 4, so it needs the source to recompile. The base comes from ECR Public, but pushing your own prebuilt image from a private ECR as the artifact is not the path. One thing that confuses people coming from containers: ECR does not leave your life. You do not deliver the MicroVM image via ECR, but inside the running MicroVM you can run Docker and docker pull your private ECR images at runtime. ECR is for consumption inside, not for delivering the image itself.

Networking and region: inbound traffic on configurable ports (HTTP/2, gRPC, WebSockets), service-provided JWE auth, outbound to the internet or your VPC. And it is available so far only in US East (N. Virginia, Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo).

When NOT to use it

If the workload is short request-response with no state, it stays a Lambda Function. A MicroVM there is a cannon for a mosquito. And if you just need more than 15 minutes with your own (trusted) code, a MicroVM is also overkill: for a long job, look at Fargate; for a multi-step workflow, Lambda Durable Functions (up to a year, as the table shows). MicroVMs are for when the differentiator is isolating untrusted code, not just going past 15 minutes.

There is also a gotcha AWS itself flags, and it rhymes with the determinism conversation: since the MicroVM boots from a pre-initialized snapshot (the equivalent of Lambda SnapStart, as Aidan Steele confirmed by testing), apps that generate unique content, open connections, or load ephemeral data at init may diverge. The snapshot froze a moment; whatever needs to be fresh per session cannot be frozen along with it. The fix has a name: lifecycle hooks to re-initialize randomness when each MicroVM is created. Map that out before assuming it just works.
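The snapshot problem is easy to reproduce locally. In this sketch a deep copy plays the role of "resume from snapshot", and on_microvm_created is a hypothetical hook name: the real lifecycle hook interface is not shown in this post, so treat this as an illustration of the idea, not of the API.

```python
import copy
import os
import random
import uuid


class App:
    def __init__(self) -> None:
        # Runs once at image build; this state ends up inside the snapshot.
        self.rng = random.Random()
        self.session_id = uuid.uuid4().hex

    def on_microvm_created(self) -> None:
        """Hypothetical lifecycle hook: refresh what must be unique per MicroVM."""
        self.rng.seed(os.urandom(32))
        self.session_id = uuid.uuid4().hex


def launch_from_snapshot(snapshot: App, run_hook: bool) -> App:
    """Simulate resuming from a snapshot by deep-copying the initialized app."""
    vm = copy.deepcopy(snapshot)
    if run_hook:
        vm.on_microvm_created()
    return vm
```

Two launches without the hook share the same session_id and produce the same "random" sequence; with the hook, each one diverges as it should. The same reasoning applies to connections opened at init: reopen them in the hook instead of trusting the frozen ones.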

Does it kill the container? No, and the reason is even better.

The hype of the week is "containers are obsolete." They are not. Quite the opposite: Aidan Steele tested it and you can run Docker inside a MicroVM, with OS capabilities enabled. So the MicroVM does not kill the container. It adds a stronger isolation boundary and still runs containers inside. The honest cut is different: there is one specific spot, running untrusted code in isolation, where you will no longer want to harden a container by hand. There the MicroVM wins. Everywhere else, the container is still king.

The details the docs leave out

Aidan Steele spent launch day poking at the service and found some really interesting things that are not in the official docs. I read it and figured it was worth bringing here:

You can get a shell into the MicroVM, via the CreateMicrovmShellAuthToken API, with pty as a first-class citizen (Lambda Functions do not have it). Gold for IDE and coding-agent use cases.
Outbound UDP is blocked by default and DNS is a local stub, so DNS inside a container falls back to 8.8.8.8 and fails. The fix is to run with Lambda's DNS: docker run --dns 169.254.169.253, or go via VPC.
Lambda network connectors: a reified VPC config (subnets, security groups, an IAM role for the ENI) with its own lifecycle. The network team creates it, the developer just consumes it.
Performance (his tests): image build 2-3 min; RunMicrovm to RUNNING about 2s, plus 2s to serve; suspend and resume about 1s each.
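If your orchestration code launches containers inside the MicroVM, the DNS workaround from the list above can be baked into the command it builds. A small sketch: docker_run_command is a helper name I made up, while the --dns flag and the 169.254.169.253 address are the ones quoted above.

```python
import shlex

# Resolver address quoted in the text for Lambda's DNS.
LAMBDA_DNS = "169.254.169.253"


def docker_run_command(image: str, *args: str, dns: str = LAMBDA_DNS) -> list[str]:
    """Build a `docker run` argv that points the container at the given resolver."""
    return ["docker", "run", "--dns", dns, image, *args]


def as_shell(argv: list[str]) -> str:
    """Render the argv as a copy-pasteable shell line."""
    return shlex.join(argv)
```

The list form can be passed straight to subprocess.run, which avoids shell quoting problems when the image name or arguments come from user input.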

What you take away

Lambda MicroVMs fills a real gap: VM-level isolation with near-instant launch and per-session state, which no single service delivered together.
It does not replace the Lambda Function, it complements it. Function in the backbone, MicroVM for the untrusted code.
The idle suspend is a deliberate cost lever: design your idle-policy on purpose.
Before locking in architecture: check the region (no São Paulo yet), the limits (ARM64, 16 vCPU, 32 GB, 8h), and the snapshot caveat.

This post was the map. In the next one in the series I actually spin up a MicroVM and we prove the isolation in practice, launching two MicroVMs and testing whether one can reach the other, with the repo on GitHub for you to run along.

Got a case where you run user or AI code that today is duct-taped onto a container or a hand-rolled VM? Does this primitive fit? Drop a like, share it with whoever is building a multi-tenant platform, and let's talk. Cheers! =D

Originally published on willpeixoto.dev.
