Bluesky April 2026 Outage Post-Mortem

(pckt.blog)

78 points | by jcalabro 2 hours ago

10 comments

  • mwkaufma 3 minutes ago
    Tell us more about this buggy "new internal service" that's scraping batch data :P
  • threecheese 2 hours ago
    > What I had missed is that we deployed a new internal service last week that sent less than three GetPostRecord requests per second, but it did sometimes send batches of 15-20 thousand URIs at a time. Typically, we'd probably be doing between 1-50 post lookups per request.

    That’ll do it.

    • 98codes 2 hours ago
      Ahh, the three relevant numbers in development: 0, 1, and infinity.
    • bombcar 2 hours ago
      Zero, one, many, many thousands.
    • htx80nerd 23 minutes ago
      Less than ideal, if I had to be frank.
  • tapoxi 12 minutes ago
    I don't really understand this architecture, but I thought Bluesky was distributed like Mastodon? How can it have an outage?
    • pfraze 8 minutes ago
      This writeup is useful for backend engineers: https://atproto.com/articles/atproto-for-distsys-engineers

      The simple answer is that atproto works like the web & search engines, where the apps aggregate from the distributed accounts. So the proper analogy here would be like yahoo going down in 1999.

      • isodev 3 minutes ago
        Google and MSN Search were already available at that time. Also, websites used to publish webrings, and there were IRC and forums to ask people about things.
    • isodev 7 minutes ago
      It’s more of a concept of a plan for being distributed. I even went to the trouble of hosting my own PDS and still I was unable to use the service during the outage.
    • Retr0id 8 minutes ago
      Mastodon infra can have outages, too.
      • tapoxi 1 minute ago
        It's just confined to one instance if it goes down, not all of Mastodon.
  • goekjclo 1 hour ago
    > The timing of these log spikes lined up with drops in user-facing traffic, which makes sense. Our data plane heavily uses memcached to keep load off our main Scylla database, and if we're exhausting ports, that's a huge problem.

    I expect this is common.

  • electrondood 29 minutes ago
    Great write up... curious about the RCA. Thanks!
  • rvz 1 hour ago
    Thank you for the post mortem on this outage.
  • jonstaab 50 minutes ago
    nostr never goes down
    • jandrese 0 minutes ago
      If nostr went down would people even notice?
    • pfraze 48 minutes ago
      All support to other decentralizers, but nothing "never goes down."
      • jonstaab 24 minutes ago
        1000x redundancy makes it vanishingly unlikely. Although I know we're due for a pole shift so all bets are off I suppose.
  • jmclnx 1 hour ago
    Light blue on a dark blue background. That is a new one; I have seen grey text on light grey, but blue on blue?

    The article does work in lynx, at least I can read it.

  • gsibble 7 minutes ago
    Did all 3 users notice?