Hey HN, we're Jie Shen, Charles, Andreas, and Shaocheng. We built Chamber (https://usechamber.io), an AI agent that manages GPU infrastructure for you. You talk to it wherever your team already works, and it handles things like provisioning clusters, diagnosing failed jobs, and managing workloads. Demo: https://www.youtube.com/watch?v=xdqh2C_hif4
We all worked on GPU infrastructure at Amazon. Between us we've spent years on this problem: monitoring GPU fleets, debugging failures at scale, building the tooling around it. After leaving we talked to a bunch of AI teams and kept hearing the same things. Platform engineers spend half their time just keeping things running: building dashboards, writing scheduling configs, answering "when will my job start?" all day. Researchers lose hours when a training run fails, because figuring out why means digging through Kubernetes events, node logs, and GPU metrics in totally separate tools. Pretty much everyone had stitched together Prometheus, Grafana, Kubernetes scheduling policies, and a bunch of homegrown scripts, and was spending as much time maintaining that stack as actually using it.
The thing we kept noticing is that most of this work follows patterns. Triage the failure, correlate a few signals, figure out what to do about it. If you had a platform with structured access to the full state of a GPU environment, you could have an agent do that work for you.
So that's what we built. Chamber is a control plane that keeps a live model of your GPU fleet: nodes, workloads, team structure, cluster health. Every operation it supports is exposed as a tool the agent can call: inspecting node health, reading cluster topology, managing workload lifecycle, adjusting resource configs, provisioning infrastructure. These are structured operations with validation and rollback, not just raw shell commands. When we add new capabilities to the platform, they automatically become things the agent can do too.
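To make "operations as tools" concrete, here's a simplified sketch of the pattern. This is illustrative, not Chamber's actual API — the names and shapes are invented — but it shows why registering a platform operation once gives the agent validation and rollback for free:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Operation:
    """A platform operation exposed to the agent as a callable tool."""
    name: str
    validate: Callable[[dict], None]  # raises on bad arguments
    execute: Callable[[dict], Any]
    rollback: Callable[[dict], None]  # undoes partial effects

class ToolRegistry:
    def __init__(self) -> None:
        self._ops: dict[str, Operation] = {}

    def register(self, op: Operation) -> None:
        # Registering a new platform capability automatically
        # makes it a tool the agent can call.
        self._ops[op.name] = op

    def call(self, name: str, args: dict) -> Any:
        op = self._ops[name]
        op.validate(args)  # structured validation, not a raw shell command
        try:
            return op.execute(args)
        except Exception:
            op.rollback(args)  # roll back on failure before re-raising
            raise

# Example: a hypothetical cordon-node operation backed by an in-memory set.
cordoned: set[str] = set()

def _validate_cordon(args: dict) -> None:
    if "node" not in args:
        raise ValueError("missing 'node'")

registry = ToolRegistry()
registry.register(Operation(
    name="cordon_node",
    validate=_validate_cordon,
    execute=lambda a: cordoned.add(a["node"]),
    rollback=lambda a: cordoned.discard(a["node"]),
))

registry.call("cordon_node", {"node": "gpu-17"})
```

The point of the indirection: the agent only ever sees named, validated operations, so a hallucinated or malformed call fails at the validate step instead of reaching the cluster.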
We spent a lot of time on safety because we've seen what happens when infrastructure automation goes wrong. A wrong call can kill a multi-day training run or cascade across a cluster. So the agent has graduated autonomy. Routine stuff it handles on its own: diagnosing a failed job, resubmitting with corrected resources, cordoning a bad node. But anything that touches other teams' workloads or production jobs needs human approval first. Every action gets logged with what the agent saw, why it acted, and what it changed.
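Here's a minimal sketch of what graduated autonomy looks like in practice. The policy below is invented for illustration (the real rules are richer), but it captures the shape: classify each action's blast radius, gate the risky ones behind approval, and log everything either way:

```python
from enum import Enum

class Risk(Enum):
    ROUTINE = "routine"  # agent acts on its own
    GUARDED = "guarded"  # requires human approval first

def classify(action: dict) -> Risk:
    # Hypothetical policy: anything touching production or another
    # team's workload is guarded; everything else is routine.
    if action.get("production") or action["owner"] != action["requested_by"]:
        return Risk.GUARDED
    return Risk.ROUTINE

audit_log: list[dict] = []

def perform(action: dict, approved: bool = False) -> str:
    risk = classify(action)
    entry = {**action, "risk": risk.value}
    if risk is Risk.GUARDED and not approved:
        entry["status"] = "pending_approval"
        audit_log.append(entry)  # logged even when nothing executes
        return "pending"
    entry["status"] = "executed"
    audit_log.append(entry)
    return "done"

# Resubmitting your own failed job is routine; killing another
# team's job waits for a human.
perform({"name": "resubmit_job", "owner": "team-a",
         "requested_by": "team-a", "production": False})
perform({"name": "kill_job", "owner": "team-b",
         "requested_by": "team-a", "production": False})
```

Because every path appends to the audit log, "what the agent saw, why it acted, and what it changed" is a query over one append-only record rather than a forensic exercise.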
The platform underneath is really what makes the diagnosis work. When the agent investigates a failure, it queries GPU state, workload history, node health timelines, and cluster topology. That's the difference between "your job OOMed" and "your job OOMed because the batch size exceeded available VRAM on this node, here's a corrected config." Different root causes get different fixes.
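As a toy version of that correlation step (numbers and field names invented, real diagnosis joins many more signals): join the failed job's peak memory with the node's VRAM to separate "config is wrong" from "something leaked," and compute a concrete fix for the first case:

```python
def diagnose(job: dict, node: dict) -> str:
    """Correlate a job's OOM with node VRAM to pick a root cause."""
    if job["exit_reason"] != "OOM":
        return "not an OOM failure"
    # Rough per-sample memory estimate from the job's own telemetry.
    per_sample_gb = job["peak_mem_gb"] / job["batch_size"]
    fits = int(node["vram_gb"] // per_sample_gb)
    if fits >= job["batch_size"]:
        # The requested batch should have fit: suspect a leak instead.
        return "OOM despite sufficient VRAM: check for a memory leak"
    return (f"batch size {job['batch_size']} needs "
            f"{job['peak_mem_gb']:.0f} GB but node has {node['vram_gb']} GB; "
            f"try batch_size={fits}")

# A 96 GB peak at batch 64 on an 80 GB node gets an actionable fix,
# not just "your job OOMed".
print(diagnose({"exit_reason": "OOM", "peak_mem_gb": 96, "batch_size": 64},
               {"vram_gb": 80}))
```

Same symptom, two different fixes: one returns a corrected config, the other points at the code. That branching is only possible because the GPU metrics and the workload history live behind the same query surface.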
One thing that surprised us, even coming from Amazon where we'd seen large GPU fleets: most teams we talk to can't even tell you how many GPUs are in use right now. The monitoring just doesn't exist. They're flying blind on their most expensive hardware.
We've launched with a few early customers and are onboarding new teams. Pricing is still in flux: we're evaluating models like per-GPU-under-management and tiered plans, and we'll publish transparent pricing once we've validated what works for customers. In the meantime, we know "contact us" isn't ideal.
Would love to hear from anyone running GPU clusters. What's the most tedious part of your setup? What would you actually trust an agent to do? What's off limits? Looking forward to feedback!