Tiled Hacker news on React Router

Tell HN: DigitalOcean's managed services broke each other after update

76 points - 01/13/2026

Yesterday my production app went down. The cause? DigitalOcean's managed PostgreSQL update broke private VPC connectivity to their managed Kubernetes.

Public endpoint worked. Private endpoint timed out. Root cause: a Cilium bug (#34503) where ARP entries go stale after infrastructure changes.

DO support responded relatively quickly (<12hrs). Their fix? Deploy a DaemonSet from a random GitHub user to ping stale ARP entries every 10 seconds. The upstream Cilium fix is merged but not yet deployed to DOKS. No ETA.

I chose managed services specifically to avoid ops emergencies. We're a tiny startup paying the premium so someone else handles this. Instead, I spent late night hours debugging VPC routing issues in a networking layer I don't control.

HN's usual advice is "just use managed services, focus on the business." Generally good advice. But managed doesn't mean worry-free, it means trading your failure modes for the vendor's failure modes. You're not choosing between problems and no problems. You're choosing between problems you control and (fewer?) problems you don't.

Still using DO. Still using managed services. Just with fewer illusions about what "managed" means.

ebiederm
01/13/2026
I don't know if this is realistic, but as a general rule, if I were contracting with someone so that my business would have higher reliability, I would ask for a service level agreement with an agreed-upon amount the vendor will pay you for every unit of time their service is not up.
At least then your pain is their pain, and they are incentivized to prevent problems and fix them quickly.
calvinmorrison
01/13/2026
At my work we pay a boring, regional VPS host that is not fancy. In fact it's maybe a few levels above "your 2000s web host, with a LAMP stack, an FTP login and a bad admin panel". Just a bit above that.
However, they ALWAYS pick up the phone on the 3rd ring with a capable, on-call Linux sysadmin with good general DB, services, networking, DNS, email knowledge.
cadamsdotcom
01/13/2026
100% uptime is impossible, of course; a 100% reliable service would survive the next ice age.
But reliability at the holy grails of 4 and 5 nines (99.99%, 99.999% uptime) means ever greater investment - geographically dispersing your service, distributed systems, dealing with clock drift, multi master, eventual consistency, replication, sharding.. it’s a long list.
Questions to ask: could you do better yourself - with the resources you have? Is it worth the investment of a migration to get there? What's the payoff period for that extra sliver of uptime? Will it cost you in focus over the longer term? Is the extra uptime worth all those costs?
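To put numbers on those "nines": a rough sketch of the downtime each availability target allows per year. The function name and the 365-day year are assumptions made for illustration, not anything from a vendor SLA.

```python
# Downtime budget per year for a given availability target.
# Assumes a 365-day year (leap years ignored) for simplicity.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
# 99.9%   -> 525.60 min/year (about 8.8 hours)
# 99.99%  -> 52.56 min/year
# 99.999% -> 5.26 min/year
```

By this arithmetic, a single 12-hour outage uses more than a full year's budget at three nines.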
itake
01/13/2026
I just had a 12hr outage due to flyio's quick and easy postgres minor patch update cooking my database.
I ended up downloading the entire volume, setting up my own docker container locally, exporting it, creating a new cluster (on the latest major patch).
Lost most of my day yesterday
AlbinoDrought
01/13/2026
Since this is about DO managed Postgres: if you're using it with replicas, they use async replication and RPO can be greater than 15 minutes. Since failover is triggered during upgrades, there ends up being a lot of periods where you can lose multiple minutes of committed data.
kevin_nisbet
01/13/2026
> I chose managed services specifically to avoid ops emergencies. We're a tiny startup paying the premium so someone else handles this. Instead, I spent late night hours debugging VPC routing issues in a networking layer I don't control.
This happens with managed services and I understand the frustration, but vendors are just as fallible as the rest of us and are going to have wonky behaviour and outages, regardless of the stability they advertise. This is always part of build vs buy; buy doesn't always guarantee a friction-free result.
It happens with the big cloud providers as well, I've spent hours with AWS chasing why some VMs are missing routing table entries inside the VPC, or on GCP we had to just ban a class of VMs because the packet processing was so bad we couldn't even get a file copy to complete between VMs.
mmh0000
01/13/2026
> I chose managed services specifically to avoid ops emergencies
You may not be spending enough time on HN reading all the horror stories =P
The benefit of a managed service isn't that it doesn't go down; though it probably goes down less than something you self-manage, unless you're a full-time SRE with the experience to back it.
The benefit of a managed service is you say: "It's not my problem, I opened a ticket, now I'm going to get lunch, hope it's back up soon."
lep_qq
01/13/2026
This resonates. We run a similar setup (managed K8s + managed DBs) and hit a comparable issue last year with a cloud provider's CNI update that broke pod-to-service networking for 6 hours. The irony is that "managed" services often abstract away the problems you can fix (config, scaling, backups) while exposing you to problems you can't fix (vendor infrastructure bugs, dependency conflicts between their managed components).
What helped us:
Redundancy across failure domains: We now run critical stateful workloads with connection pooling that can failover between private and public endpoints. Yes, it's more complexity, but it's complexity we control.
Synthetic monitoring for managed services: We probe not just our app, but also the managed service endpoints from multiple network paths. Catches these "infrastructure layer" failures faster.
Backup connectivity paths: For managed DBs, we keep both private VPC and public (firewalled) endpoints configured. If one breaks, we can switch in minutes via config.
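A minimal sketch of the probe-and-fall-back idea described above, assuming a plain TCP connect is an acceptable health signal. The helper names `is_reachable` and `pick_endpoint` and the hostnames in the usage comment are hypothetical. A successful TCP handshake only shows the network path is open, not that the database is healthy, so a real check would probably also run a trivial query.

```python
import socket
from typing import Optional, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


def pick_endpoint(
    endpoints: Sequence[Endpoint], timeout: float = 2.0
) -> Optional[Endpoint]:
    """Return the first reachable endpoint in preference order, or None."""
    for host, port in endpoints:
        if is_reachable(host, port, timeout):
            return (host, port)
    return None


# Example usage (hypothetical hostnames; private endpoint preferred):
#   candidates = [
#       ("db-private.example.internal", 5432),
#       ("db-public.example.com", 5432),
#   ]
#   endpoint = pick_endpoint(candidates)
```

Run on a schedule from more than one network path, the same check doubles as a basic synthetic monitor: a private endpoint that times out while the public one still answers is the failure pattern from the original post.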
The DaemonSet workaround is... alarming. It's essentially asking you to run production-critical infrastructure code from an untrusted source because their managed platform has a known bug with no ETA. Your point about trading failure modes is spot on. Managed services are still worth it for small teams, but the value prop is "fewer incidents" not "no incidents," and when they do happen, your MTTR is now bounded by vendor response time instead of your team's skills. Did DO at least provide the DaemonSet from an official source, or was it literally "here's a random GitHub link"?
hdjrudni
01/13/2026
Oof. I have a very similar setup except I'm using their managed MySQL instead of PostgreSQL. It appears I wasn't hit.
Same thought as you.. I just didn't want to figure out and manage MySQL-with-failover myself so I switched to their managed solution a year or two ago and my bill went up like 300% or more (was running fine on a ~$12 or maybe $24 droplet + $5 volume but now costs, I don't even remember, $150 or so).
yellow_lead
01/13/2026
Try a different managed service. We've been using Render for a year with no DB outages. Although, we have gone down with Cloudflare several times.
As far as dbs go, I believe Amazon RDS is quite reliable. I think Render uses it under the hood.
You could also consider AWS ECS directly with RDS.
mystraline
01/13/2026
I know it's not quite the same, but I've been moving some of my personal services off of Docker, and back to a full VM.
I find fewer things that can go wrong with VMs. I can log and monitor them better, and increase resources as I see what's going on per machine.
Docker was smearing all the machines together. For early testing, it's great due to speed of redeploy and cleaning state. But once you want to start tuning, Docker is pretty hard to get right.
Maybe I'm not a great systems engineer. But I do like my lower complexity systems. 1 service per machine is, in my opinion, easier to get right.
cosmin800
01/13/2026
Lower prices come with a cost. I am not a fan of AWS but they have higher reliability.
solaris2007
01/13/2026
AWS designs and implements their foundational services holistically. I can understand that the services "higher up the stack" may not feel this way to AWS customers sometimes. However, the foundation of VPCs, EC2, EBS and S3, are very strong.
If the word "production" is supposed to really mean something to you, move your workload to Google Cloud, or move it to AWS, or on https://cast.ai
Disclaimer: I have no commercial affiliation with Cast AI.
sfifs
01/13/2026
Oh, I've run into exactly the same issue on my personal cluster and I had no clue what the issue was. Is this solvable?
atmosx
01/13/2026
“Welcome to the real world Neo!”
“There is no cloud, it’s just somebody else’s computer”
etc etc…
dfajgljsldkjag
01/13/2026
[flagged]
sethops1
01/13/2026
Obligatory: do you actually need Kubernetes? I struggle to imagine any tiny startup that does.
