Tell HN: DigitalOcean's managed services broke each other after update

Published 4 hours ago
Source: hnrss.org

Yesterday my production app went down. The cause? DigitalOcean's managed PostgreSQL update broke private VPC connectivity to their managed Kubernetes.

Public endpoint worked. Private endpoint timed out. Root cause: a Cilium bug (#34503) where ARP entries go stale after infrastructure changes.

DO support responded relatively quickly (<12hrs). Their fix? Deploy a DaemonSet from a random GitHub user to ping stale ARP entries every 10 seconds. The upstream Cilium fix is merged but not yet deployed to DOKS. No ETA.

I chose managed services specifically to avoid ops emergencies. We're a tiny startup paying the premium so someone else handles this. Instead, I spent late night hours debugging VPC routing issues in a networking layer I don't control.

HN's usual advice is "just use managed services, focus on the business." Generally good advice. But managed doesn't mean worry-free, it means trading your failure modes for the vendor's failure modes. You're not choosing between problems and no problems. You're choosing between problems you control and (fewer?) problems you don't.

Still using DO. Still using managed services. Just with fewer illusions about what "managed" means.


Comments URL: https://news.ycombinator.com/item?id=46596075

Points: 46

# Comments: 24