Hi All,

As some of you may have realised, the planned upgrade sort of crashed everything, and we had our longest period of downtime since the site began.

This is partly because I had to go to sleep (thanks to a newborn and a job).

The good news is that the backup process worked! We’ve restored to seconds before the upgrade took the site offline.

The bad news is that federation is likely to be… wonky… for a little while. The site may also go up and down while I undo some of the fixes I tried.

Ultimately the issue came down to the upgrade failing (I am not sure why - will be digging into this now that the priority is no longer getting the site up) and then the containers not talking to each other, so the UI wouldn’t talk to lemmy, and lemmy wouldn’t talk to the database.

I rebuilt the containers, restored the backup, restarted everything, and it’s all come back up (admittedly not perfect right now).
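
For anyone curious what checking that chain might look like, here is a minimal sketch, not the procedure actually used during the outage. The host names and ports (`lemmy` on 8536, `postgres` on 5432) are assumptions based on a typical Docker Compose setup, and `first_broken_hop` is a hypothetical helper. A TCP connect only shows that a port is reachable from wherever the script runs, so it would need to run inside the relevant container network to mean anything.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Hop = Tuple[str, str, int]  # (label, host, port)

# Assumed service names and ports; adjust to the real compose file.
DEFAULT_HOPS: Sequence[Hop] = (
    ("ui -> lemmy", "lemmy", 8536),
    ("lemmy -> database", "postgres", 5432),
)


def first_broken_hop(
    hops: Sequence[Hop],
    connect: Callable = socket.create_connection,
    timeout: float = 3.0,
) -> Optional[str]:
    """Return the label of the first hop that refuses a TCP connection.

    Returns None when every hop accepts a connection.
    """
    for label, host, port in hops:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            return label
        conn.close()
    return None


if __name__ == "__main__":
    broken = first_broken_hop(DEFAULT_HOPS)
    print("all hops reachable" if broken is None else f"broken hop: {broken}")
```

The `connect` parameter exists only so the check can be exercised without real containers.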

Importantly, I want to issue an apology. This isn’t what I want for Lemmy.zip, and I should’ve handled it way better. I’m always learning but this took way longer than it should’ve, and while I take some solace in the fact the backup process worked and has been proven to work in production, the delay in being able to get this back up is entirely my fault and frankly unacceptable.

I’ll be working to document this outage, the steps it took to get it back up, and some form of repeatable plan so a repair can be replicated in the future if I’m not available.

In terms of upgrading to 0.19.11 - I will have to try again soon as it’s got some security fixes we desperately need to implement.

Thanks

Demigodrick

  • tigeruppercut@lemmy.zip
    English · 59 upvotes · 24 days ago

    Importantly, I want to issue an apology

    Way I see it, family and mental health always come before internet randos. Thanks for working hard for everyone.

    • Demigodrick@lemmy.zip (OP, M)
      English · 11 upvotes · 23 days ago

      Lots of Internet randos have been very nice and supportive, so I feel a debt to the community to make this place the best it can be.

      But thank you ❤️

  • Demigodrick@lemmy.zip (OP, M)
    English · 46 upvotes · 24 days ago

    I will try and reply to each comment - but you’ve all been really kind and that means so much ❤️

    If you’re interested, this graph will show you how far behind we are. We should eventually catch up, but things will likely be very delayed for up to 12 hours.

    The status page did not work as expected - and I’ll try and link a few more places where I post updates. If you haven’t yet, definitely join the Matrix space and you’ll get minute-by-minute panic updates 🫠

    • shortwavesurfer@lemmy.zip
      English · 7 upvotes · 24 days ago

      That graph is really kind of neat, but it seems to only be synchronizing with a single instance at a time from what I can tell. I saw the world line has dropped significantly, but the other lines don’t look like they’ve fallen yet.

      • Demigodrick@lemmy.zip (OP, M)
        English · 12 upvotes · 24 days ago

        Yes, the lemmy.world admins kindly manually reset the timer for their instance so it started updating straight away!

        If an instance goes down, other instances gradually back off, retrying activities less and less often so as not to waste effort sending them to a dead instance.

        You can use this tool to see this info. It links lemmy.world but you can search for any instance, and then look up lemmy.zip either under failed or lagging instances. You’ll see on the far right the “next send try” time and date. Looks like a lot will try again around 9pm (although I’m not entirely sure on the timezone there) - so over the next few hours instances will send another try, see that lemmy.zip is back up, and then start federation with us again :)
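
As a rough illustration of the back-off described above, here is a minimal sketch of a doubling retry delay with a cap. This is not Lemmy's actual implementation: the base delay, the cap and the function names are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed values for illustration only; Lemmy's real schedule may differ.
BASE_DELAY_SECONDS = 60              # wait after the first failed send
MAX_DELAY_SECONDS = 24 * 60 * 60     # never wait longer than a day


def retry_delay(failures: int) -> timedelta:
    """Wait time before the next send after `failures` consecutive failures."""
    if failures <= 0:
        return timedelta(0)
    # Double the wait on every failure, then clamp it to the cap.
    seconds = min(BASE_DELAY_SECONDS * 2 ** (failures - 1), MAX_DELAY_SECONDS)
    return timedelta(seconds=seconds)


def next_send_try(last_attempt: datetime, failures: int) -> datetime:
    """Timestamp of the next attempt, like a 'next send try' column."""
    return last_attempt + retry_delay(failures)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for n in range(1, 13):
        print(n, retry_delay(n), next_send_try(now, n).isoformat())
```

With a schedule like this, an instance that was down for many hours is only retried occasionally, which would explain why most senders notice the recovery a few hours later rather than immediately.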

        • shortwavesurfer@lemmy.zip
          English · 5 upvotes · 24 days ago

          It’s cool to see that there’s logic built-in that keeps instances from sending Federation requests to dead instances. But when an instance comes back online, they will re-synchronize themselves. An instance may drop out of the Federation, but when it comes back, it will get everything it missed. Eventually.

  • Omgboom@lemmy.zip
    English · 41 upvotes · edited · 24 days ago

    I’ve been there. But it is my honor to bestow upon you this award to commemorate the accomplishment

    • locuester@lemmy.zip
      English · 19 upvotes · 24 days ago

      Ah yes. I still wear my 25 year old “deleted a prod database” badge with honor

      • SwizzleStick@lemmy.zip
        English · 17 upvotes · 24 days ago

        It’s a bittersweet honour to have. My personal fail was being too cocky updating a ‘handful’ of product descriptions.

        (15398 row(s) affected)

  • Debs@lemmy.zip
    English · 29 upvotes · 24 days ago

    Thanks for all your hard work. We missed .zip while it was gone.

  • Blablablabum@lemmy.zip
    English · 28 upvotes · edited · 24 days ago

    Thanks for the update.

    I was a bit worried for your mental health as the hours of downtime continued :)

    Awesome that the backup restore procedure worked that well.

    One thing I have been wondering is why status.lemmy.zip stayed all green during all of this.

  • 0x0@lemmy.zip
    English · 17 upvotes · 23 days ago

    entirely my fault and frankly unacceptable.

    You’re providing a service out of your own time, pocket and energy; you don’t owe anyone.
    It’s the other way around, we owe you.

    So thank you.
    Learn from your mistakes and carry on. 👍

  • Blaze@lemmy.zip
    English · 17 upvotes · 24 days ago

    Thank you for this post. Don’t be so harsh on yourself, everyone can make a mistake!

    Good to see Lemmy.zip back up!

  • LiveLM@lemmy.zip
    English · 16 upvotes · edited · 23 days ago

    Dude, you’re being wayyyy too harsh on yourself!
    You run this awesome instance for free while caring for a newborn, you don’t owe anybody nothing.
    Forget the delay, forget apologies and “unacceptable”. Real life comes before social media, don’t beat yourself up for the outage.

    People who can’t stand downtime should practice personal redundancy by creating backup accounts on other instances ;)

    • Demigodrick@lemmy.zip (OP, M)
      English · 4 upvotes · 23 days ago

      Thanks, I appreciate the kind words. I always feel guilty when it doesn’t work. You have all put your trust in me to run an instance, so when it goes wrong I deffo feel the need to make it right.

  • rumba@lemmy.zip
    English · 15 upvotes · 24 days ago

    Dude, keeping this running with a job and a newborn? You’re headed for sainthood.

    If you don’t have one, you could start an out of band chat during updates, just in case you need some eyes on things or just some moral support. I’m sure we have at least a few subject matter experts around if you can stand us :)

  • FryHyde@lemmy.zip
    English · 13 upvotes · 23 days ago

    As a new parent myself, I’m stunned you managed to find the time to restore it at all. Good on ya, fella!

    • Demigodrick@lemmy.zip (OP, M)
      English · 2 upvotes · 23 days ago

      It’s quite an experience isn’t it! If I time it right, there’s a two hour window at the moment where I can guarantee myself a break. Usually it’s sleep, but I squeezed in a backup restore instead 🤣

  • Possibly linux@lemmy.zip
    English · 13 upvotes · 23 days ago

    No need to apologize as you have been doing a stellar job. Your family needs to always take priority no matter what. I don’t care if it is down for a week as your health and kid are far more important.

    One thing I will say is that I think Lemmy.zip could really benefit from an external way of communicating announcements. It doesn’t need to be complicated and you could reuse your existing Mastodon account to post updates when things go wrong. It could also allow users to give advice on how to fix issues.

    • Demigodrick@lemmy.zip (OP, M)
      English · 5 upvotes · 23 days ago

      Thanks, yes I agree. I’ll likely be adding something to Mastodon, and I’m planning to look at alternative status pages as this one failed the one time it was really needed.

  • GeekFTW@lemmy.zip
    English · 13 upvotes · 24 days ago

    Things gotta fuck up sometimes, tis how we figure shit out and learn things! You got this.

  • TacoEvent@lemmy.zip
    English · 11 upvotes · 24 days ago

    I appreciate the transparency and frankly couldn’t ask for more. Shit happens and this is a one-person operation. Thanks for all your effort!