
Bluesky April 2026 Outage Post-Mortem

109 points - today at 3:51 PM

  • threecheese

    today at 4:08 PM

    > What I had missed is that we deployed a new internal service last week that sent less than three GetPostRecord requests per second, but it did sometimes send batches of 15-20 thousand URIs at a time. Typically, we'd probably be doing between 1-50 post lookups per request.

    That'll do it.

      • 98codes

        today at 4:45 PM

        Ahh, the three relevant numbers in development: 0, 1, and infinity.

        • jandrese

          today at 6:58 PM

          The incredible part about this is that, because their backend is all TCP/IP, they were literally exhausting the ports by leaving all 65k of them in TIME_WAIT, and the workaround was to start randomizing the localhost address to give them another trillion ports or so.
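To make the "another trillion ports" figure concrete: 127.0.0.0/8 holds 2^24 addresses, and each source address gets its own port space toward a given destination. The Go sketch below is only an illustration of that idea, not the code from the post-mortem. `randomLoopback`, `loopbackTuples`, and `dialerFrom` are hypothetical names, and it assumes a Linux-style host where every 127.x.y.z address is usable on the loopback interface.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
)

// randomLoopback returns a random address inside 127.0.0.0/8.
// The last octet is kept in 1..254 to stay clear of .0 and .255.
func randomLoopback() net.IP {
	return net.IPv4(127, byte(rand.Intn(256)), byte(rand.Intn(256)), byte(1+rand.Intn(254)))
}

// loopbackTuples estimates how many (source IP, source port) pairs
// exist toward one destination if all of 127.0.0.0/8 can be used.
func loopbackTuples(portsPerIP uint64) uint64 {
	const addrs = 1 << 24 // number of addresses in 127.0.0.0/8
	return addrs * portsPerIP
}

// dialerFrom builds a dialer that binds outgoing connections to src.
// Port 0 (the zero value) lets the kernel pick an ephemeral port.
func dialerFrom(src net.IP) *net.Dialer {
	return &net.Dialer{LocalAddr: &net.TCPAddr{IP: src}}
}

func main() {
	d := dialerFrom(randomLoopback())
	fmt.Println("would dial from", d.LocalAddr)
	fmt.Println(loopbackTuples(65535)) // 1099494850560
}
```

In practice the usable ephemeral port range is usually smaller than 65535, so the real figure is lower, but it stays in "roughly a trillion" territory.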

          • bombcar

            today at 4:31 PM

            Zero, one, many, many thousands.

            • LoganDark

              today at 7:03 PM

              And then they fix the issue by using multiple localhost IPs rather than, perhaps, not sending 15-20 thousand URIs at a time
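One way to read the suggestion above is to cap the batch size on the caller side. The Go sketch below is a minimal illustration under assumptions: the `chunk` helper and the placeholder URIs are hypothetical, and the batch size of 50 is borrowed from the "1-50 post lookups per request" range quoted upthread, not from any documented limit.

```go
package main

import "fmt"

// chunk splits items into consecutive slices of at most size elements.
// It returns nil when size is not positive.
func chunk(items []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for len(items) > size {
		// The three-index slice caps capacity so a later append on one
		// batch cannot overwrite the start of the next batch.
		out = append(out, items[:size:size])
		items = items[size:]
	}
	if len(items) > 0 {
		out = append(out, items)
	}
	return out
}

func main() {
	uris := make([]string, 20000)
	for i := range uris {
		uris[i] = fmt.Sprintf("at://example/post/%d", i) // placeholder URIs
	}
	batches := chunk(uris, 50)
	fmt.Println(len(batches)) // 400
}
```

Each batch could then be sent as its own request, ideally with a limit on how many are in flight at once.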

                • odo1242

                  today at 7:17 PM

                  They mentioned it was a temporary fix that they removed after finding and fixing the true root cause, though.

              • htx80nerd

                today at 6:28 PM

                less than ideal if I had to be frank.

            • tapoxi

              today at 6:38 PM

              I don't really understand this architecture, but I thought Bluesky was distributed like Mastodon? How can it have an outage?

                • pfraze

                  today at 6:42 PM

                  This writeup is useful for backend engineers: https://atproto.com/articles/atproto-for-distsys-engineers

                  The simple answer is that atproto works like the web & search engines, where the apps aggregate from the distributed accounts. So the proper analogy here would be like yahoo going down in 1999.

                    • tapoxi

                      today at 6:56 PM

                      This is a fantastic write-up, thanks for sharing!

                      • isodev

                        today at 6:48 PM

                        Google and MSN Search were already available at this time. Also websites used to publish webrings and there was IRC and forums to ask people about things.

                    • isodev

                      today at 6:44 PM

                      It's more of a concept of a plan for being distributed. I even went through the trouble of hosting my own PDS and still, I was unable to use the service during the outage.

                      • Retr0id

                        today at 6:42 PM

                        Mastodon infra can have outages, too.

                          • tapoxi

                            today at 6:49 PM

                            It's just confined to one instance if it goes down, not all of Mastodon.

                        • direwolf20

                          today at 8:09 PM

                          It's not really distributed. It's a centralised service that pulls some parts of 0.01% of user profiles from their own servers.

                          • LoganDark

                            today at 7:04 PM

                            A web interface and home server can have an outage. Bluesky is just a web interface and home server.

                        • opem

                          today at 8:36 PM

                          At least they aren't hiding it and are being transparent about it, unlike the big tech corps with their so-called SLAs.

                          • streetfighter64

                            today at 8:50 PM

                            > They represent real user-facing downtime

                            Off-topic, but "real" feels like the new "delve". Is there such a thing as "fake" or "virtual" downtime, or why do people feel the need to specify that all manner of things are "real" nowadays?

                            • goekjclo

                              today at 5:33 PM

                              > The timing of these log spikes lined up with drops in user-facing traffic, which makes sense. Our data plane heavily uses memcached to keep load off our main Scylla database, and if we're exhausting ports, that's a huge problem.

                              I expect this is common.

                              • mwkaufma

                                today at 6:48 PM

                                Tell us more about this buggy "new internal service" that's scraping batch data :P

                                • pembrook

                                  today at 7:12 PM

                                  Distributed social media goes down? hrmmm.

                                  Email and the internet don't have "downtime." Certain key infra providers do of course. ISPs can go down. DNS providers can go down. But the internet and email itself can't go down absent a global electricity outage.

                                  You haven't built a decentralized network until you reach that standard imo. Otherwise it's just "distributed protocol" cosplay. Nice costume. Kind of like how everybody has been amnesia'd into thinking Obsidian is open source when it really isn't.

                                    • iAMkenough

                                      today at 7:14 PM

                                      Bluesky is a provider. Blacksky didn't go down.

                                        • pembrook

                                          today at 7:24 PM

                                          Is there anything running on Blacksky other than Bluesky with more than say, 100 active users?

                                          AOL never even got to that level of dominance in the internet 1.0 era.

                                          The point is it's not a distributed network if one node is 99.9% of all traffic.

                                  • jonstaab

                                    today at 6:00 PM

                                    nostr never goes down

                                      • jandrese

                                        today at 6:50 PM

                                        If nostr went down would people even notice?

                                          • nout

                                            today at 7:27 PM

                                            If any major nostr relay goes down, no one notices. That has happened many times, the network is very resilient to that.

                                            • jonstaab

                                              today at 6:53 PM

                                              probably not

                                          • pfraze

                                            today at 6:02 PM

                                            All support to other decentralizers but nothing never goes down.

                                              • nout

                                                today at 7:25 PM

                                                The comparison here is to something like TCP/IP. TCP/IP never goes down. TCP/IP is a protocol, the servers may go down and cause disruption, but the protocol doesn't really have the ability to "go down". Nostr is also a protocol. The communication on top of Nostr is pretty resilient compared to other solutions though, so that's the main highlight here.

                                                If tens of servers go down, then some people may start noticing a bit of inconvenience. If hundreds of servers go down, then some people may need to coordinate out of band on what relays to use, but it still generally speaking works ok.

                                                • jonstaab

                                                  today at 6:26 PM

                                                  1000x redundancy makes it vanishingly unlikely. Although I know we're due for a pole shift so all bets are off I suppose.

                                                    • numpad0

                                                      today at 7:56 PM

                                                      Wasn't aware there are ~2k relays now. Has the inter-relay sharing situation improved?

                                                      When I tried it a long time ago, the idea was just a transposed Mastodon model where the client would multi-post to a dozen different servers (relays) automatically, hoping that the post would be available in at least one relay shared between the user and their followers. That didn't seem to scale well.

                                                        • jonstaab

                                                          today at 8:31 PM

                                                          Getting clients to do the right thing is like herding cats, but there has been some progress. Early 2023 Mike Dilger came up with the "gossip model" (renamed "outbox model" for obvious reasons). Here's my write-up: https://habla.news/hodlbod/8YjqXm4SKY-TauwjOfLXS

                                                          The basic idea is that for microblogging use cases users advertise which relays their content is stored on, which clients follow (this implies that there are less-decentralized indexes that hold these pointers, but it does help distribute content to aligned relays instead of blasting content everywhere).

                                                          Also, relays aside, one key difference vs ActivityPub is that no third party owns your identity, which means you can move from one relay to another freely, which is not true on Mastodon.
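The "users advertise relays, clients follow" idea can be sketched as a simple inversion of a lookup table. This Go example is a rough illustration only: `planFetches`, the author names, and the relay URLs are all hypothetical, and it does not model how real Nostr clients discover or store the advertised relay lists.

```go
package main

import (
	"fmt"
	"sort"
)

// planFetches inverts "author -> advertised relays" into
// "relay -> followed authors to request there", so a client only
// contacts relays that some followed author has advertised.
func planFetches(advertised map[string][]string, follows []string) map[string][]string {
	plan := make(map[string][]string)
	for _, author := range follows {
		for _, relay := range advertised[author] {
			plan[relay] = append(plan[relay], author)
		}
	}
	// Sort author lists so the plan is stable across runs.
	for relay := range plan {
		sort.Strings(plan[relay])
	}
	return plan
}

func main() {
	advertised := map[string][]string{
		"alice": {"wss://r1.example", "wss://r2.example"},
		"bob":   {"wss://r2.example"},
		"carol": {"wss://r3.example"}, // not followed, so r3 is never contacted
	}
	plan := planFetches(advertised, []string{"bob", "alice"})
	fmt.Println(len(plan), plan["wss://r2.example"])
}
```

The point of the design is that the set of relays a client talks to grows with the relays its follows actually use, not with the total number of relays in existence.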

                                          • gsibble

                                            today at 6:43 PM

                                            Did all 3 users notice?

                                              • ffsm8

                                                today at 7:07 PM

                                                Naw, only one did. Turns out the other two were his sock accounts he used to upvote and comment on his own content.

                                                Okay, nuff trolling for today

                                                • dogemaster2027

                                                  today at 7:11 PM

                                                  [dead]

                                              • electrondood

                                                today at 6:21 PM

                                                Great write up... curious about the RCA. Thanks!

                                                • rvz

                                                  today at 5:45 PM

                                                  Thank you for the post mortem on this outage.

                                                  • dogemaster2027

                                                    today at 6:57 PM

                                                    [dead]

                                                    • templar_snow

                                                      today at 5:08 PM

                                                      [flagged]

                                                        • lavela

                                                          today at 5:14 PM

                                                          Why?

                                                      • jmclnx

                                                        today at 5:38 PM

                                                        Light blue on a dark blue background. That is a new one; I have seen grey text on light grey, but blue on blue?

                                                        The article does work in lynx, at least I can read it.

                                                        • drewg123

                                                          today at 7:02 PM

                                                          Golang's use of a potentially unbounded number of threads is just insane. I used to be fairly bullish on golang, but this, combined with the fact that it's garbage collected, makes me feel it's just unsuitable for production use.

                                                            • floating-io

                                                              today at 7:38 PM

                                                              You can have this problem with any kind of thread -- including OS threads -- if you do an unbounded spawn loop. Go is hardly unique in this.

                                                              Goroutines are actually better AFAIK because they distribute work on a thread pool that can be much smaller than the number of active goroutines.

                                                              If my quick skim created a correct understanding, then the problem here looks more like architecture. Put simply: does the memcached client really require a new TCP connection for every lookup? I would think you would pool those connections just like you would a typical database and keep them around for approximately forever. Then they wouldn't have spammed memcache with so many connections in the first place...

                                                              (edit: ah, it looks like they do use a pool, but perhaps the pool does not have a bounded upper size, which is its own kind of fail.)
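For what a pool with a bounded upper size might look like, here is a minimal Go sketch. It is not the memcached client Bluesky uses: `Pool`, `NewPool`, `Get`, `Put`, `pipeDial`, and `shortWait` are hypothetical names, and `net.Pipe` stands in for a real connection so the example runs without a server. A real pool would also need a way to discard broken connections and release their slot.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// ErrPoolTimeout is returned when no connection frees up in time.
var ErrPoolTimeout = errors.New("pool: timed out waiting for a connection")

// shortWait is an arbitrary timeout used by the demo in main.
const shortWait = 10 * time.Millisecond

// Pool never lets more than cap(slots) connections exist at once.
type Pool struct {
	slots chan struct{} // one token per connection ever dialed
	idle  chan net.Conn // connections waiting to be reused
	dial  func() (net.Conn, error)
}

func NewPool(max int, dial func() (net.Conn, error)) *Pool {
	return &Pool{
		slots: make(chan struct{}, max),
		idle:  make(chan net.Conn, max),
		dial:  dial,
	}
}

// Get reuses an idle connection, dials a new one if the cap allows,
// or waits up to timeout for another caller to Put one back.
func (p *Pool) Get(timeout time.Duration) (net.Conn, error) {
	select {
	case c := <-p.idle: // prefer reuse over dialing
		return c, nil
	default:
	}
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case c := <-p.idle:
		return c, nil
	case p.slots <- struct{}{}:
		c, err := p.dial()
		if err != nil {
			<-p.slots // give the slot back on failure
			return nil, err
		}
		return c, nil
	case <-timer.C:
		return nil, ErrPoolTimeout
	}
}

// Put returns a connection for reuse. It cannot block because idle
// has room for every connection the pool is allowed to create.
func (p *Pool) Put(c net.Conn) { p.idle <- c }

// pipeDial is an in-memory stand-in for dialing a cache server.
func pipeDial() (net.Conn, error) {
	c, _ := net.Pipe()
	return c, nil
}

func main() {
	p := NewPool(2, pipeDial)
	a, _ := p.Get(shortWait)
	_, _ = p.Get(shortWait)
	_, err := p.Get(shortWait)
	fmt.Println(err) // third Get fails: the cap of 2 is reached
	p.Put(a)
	c, err := p.Get(shortWait)
	fmt.Println(c == a, err) // the returned connection is reused
}
```

With a cap like this, a burst of lookups turns into callers waiting (or timing out) at `Get` instead of thousands of new sockets, which is the trade-off the comment above is pointing at.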

                                                                • slopinthebag

                                                                  today at 8:17 PM

                                                                  Rust's async doesn't have this issue. Or at least, it's the same issue as malloc in an unbounded loop, but that's a more general issue not related to async or threading.

                                                                  15-20 thousand futures would be trivial. 15-20 thousand goroutines, definitely not.

                                                              • tombert

                                                                today at 7:33 PM

                                                                Why does garbage collection make it unsuitable for production use? A lot of production software is written in garbage collected languages like Java. Pretty much the entire backend for iTunes/Apple Music is written in Java, and it's not doing any kind of fancy bump allocator tricks to avoid garbage. In my mind, kind of hard to argue that Apple Music is not "production use".

                                                                There are certainly plenty of projects where garbage collection is too slow, but I don't know that they're the majority, and more people would likely prefer memory safety by default.

                                                                  • slopinthebag

                                                                    today at 8:20 PM

                                                                    Everything is understood by comparison. Unsuitable for production use, compared to what is the more apt question.
