Well, hello again. I have just learned that the host that recently had both NVMe drives fail during drive replacement now has new problems: the filesystem reports permanent data errors affecting the databases of both the Matrix server and the Telegram bridge.

I have just rented a new machine and am about to restore the database snapshot of July 26, just in case. All the troubleshooting over the recent days was very exhausting; however, I will try to do this, or at least prepare it, within the upcoming hours.

Update

After a rescan the errors have gone away; however, the drives logged errors too. The question now is whether the data integrity can still be trusted.

Status August 1st

Well … good question … optimizations were made last night, the restore was successful and … we are back to debugging outgoing federation :(


The new hardware will also be a bit more powerful … and yes, I have not forgotten that I wanted to update that database. It’s just that I was busy debugging federation problems.

References

  • MilanOPMA

    Thank you :) Well, I am not sure if there was something to fight over, except maybe some sort of refund … for now it seems to be fine on the new machine. And yes, I am from Germany; however, I think it's a Helsinki DC from Hetzner.

    • Haui

      You’re very welcome. Hetzner is generally a good host afaik. It does depend on the configuration, I suppose. Are you using the shared VPS or something else? If the storage is guaranteed (as in not custom hardware), they are technically responsible for its condition. A host I'm working with (also located at Hetzner, but in Falkenstein) does 2 backups a day, which also prevents having to revert far back.

      • MilanOPMA

        On Hetzner it's all dedicated servers – out goes an AX51-NVMe, in comes an AX102. They tried a connector cable swap to bring the NVMe(s) back to life; I was wondering if this could have something to do with the SMART errors logged and the temporary zpool errors. However, I think the CPU upgrade is at least very welcomed by the Matrix server 😅

        • Haui

          Hm. In that case I'm not sure what their obligations are. It’s very rare that I hear of NVMe drives downright failing.

          If your SMART error rates start going up, that is a clear indicator that something is gonna happen. I have a graph on my server showing the error rates. Actually, there is a „bad sectors“ or „reallocated sectors“ reading that should be more telling. Once those go up, it's critical, I think.

          I didn’t even know you also ran a Matrix server. I recently started looking into Matrix but I can't really say anything yet. Is it federated as well? Or do you need to make a new account for each one?

          • MilanOPMA

            Yes, it is federated – however since there is no SSO on the Lemmy instance, you need to make a new account. Like you need to make new accounts between email providers. :) However it is a different federation protocol: Matrix vs ActivityPub. For more cool stuff, check out https://tchncs.de :3

            • Haui

              Cool! Thanks! I will check it out.

          • MilanOPMA

            Dang, the old host was deleted from the monitoring. However, looking at at least one SMART report from my emails, there were no errors logged before the drives gave up on life during replacement. They just had a ton read/written and the used counter at 255% (even though r/w and age were not equal; it's weird, and one reason why I wanted to have at least one replaced in the first place). This is the one that had more:

            SMART/Health Information (NVMe Log 0x02)
            Critical Warning:                   0x04
            Temperature:                        53 Celsius
            Available Spare:                    98%
            Available Spare Threshold:          10%
            Percentage Used:                    255%
            Data Units Read:                    7,636,639,249 [3.90 PB]
            Data Units Written:                 2,980,551,083 [1.52 PB]
            Host Read Commands:                 87,676,174,127
            Host Write Commands:                28,741,297,023
            Controller Busy Time:               705,842
            Power Cycles:                       7
            Power On Hours:                     17,437
            Unsafe Shutdowns:                   1
            Media and Data Integrity Errors:    0
            Error Information Log Entries:      0
            Warning  Comp. Temperature Time:    0
            Critical Comp. Temperature Time:    0
            Temperature Sensor 1:               53 Celsius
            Temperature Sensor 2:               64 Celsius
            
            Error Information (NVMe Log 0x01, 16 of 64 entries)
            No Errors Logged
            

            The new ones now, where the zpool errors happened, look like this:

            SMART/Health Information (NVMe Log 0x02)
            Critical Warning:                   0x00
            Temperature:                        24 Celsius
            Available Spare:                    100%
            Available Spare Threshold:          5%
            Percentage Used:                    3%
            Data Units Read:                    122,135,021 [62.5 TB]
            Data Units Written:                 31,620,076 [16.1 TB]
            Host Read Commands:                 1,014,224,069
            Host Write Commands:                231,627,064
            Controller Busy Time:               3,909
            Power Cycles:                       2
            Power On Hours:                     117
            Unsafe Shutdowns:                   0
            Media and Data Integrity Errors:    0
            Error Information Log Entries:      4
            Warning  Comp. Temperature Time:    0
            Critical Comp. Temperature Time:    0
            Temperature Sensor 1:               24 Celsius
            
            Error Information (NVMe Log 0x01, 16 of 256 entries)
            Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
              0          4     0  0x0000  0x8004  0x000            0     0     -
            
            SMART/Health Information (NVMe Log 0x02)
            Critical Warning:                   0x00
            Temperature:                        24 Celsius
            Available Spare:                    100%
            Available Spare Threshold:          5%
            Percentage Used:                    2%
            Data Units Read:                    153,193,333 [78.4 TB]
            Data Units Written:                 29,787,075 [15.2 TB]
            Host Read Commands:                 1,262,977,843
            Host Write Commands:                230,135,280
            Controller Busy Time:               4,804
            Power Cycles:                       11
            Power On Hours:                     119
            Unsafe Shutdowns:                   5
            Media and Data Integrity Errors:    0
            Error Information Log Entries:      14
            Warning  Comp. Temperature Time:    0
            Critical Comp. Temperature Time:    0
            Temperature Sensor 1:               24 Celsius
            
            Error Information (NVMe Log 0x01, 16 of 256 entries)
            Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
              0         14     0  0x100d  0x8004  0x000            0     0     -
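
A side note on the `0x8004` status in the two error log entries above: it can be decoded by hand. This is a minimal sketch, assuming smartctl prints the raw 16-bit completion status word with the phase tag in bit 0 (newer smartctl versions print a text description instead). The function name and the small lookup table are made up for illustration; only the first three generic status codes are listed.

```python
# Decode a 16-bit NVMe completion status word.
# ASSUMPTION: bit 0 is the phase tag, as in the raw value older smartctl prints.
GENERIC_STATUS_CODES = {
    0x00: "Successful Completion",
    0x01: "Invalid Command Opcode",
    0x02: "Invalid Field in Command",
}

def decode_status(raw: int) -> dict:
    """Split the status word into its fields (hypothetical helper)."""
    status_code = (raw >> 1) & 0xFF      # bits 1-8
    status_code_type = (raw >> 9) & 0x7  # bits 9-11, 0 = generic command status
    if status_code_type == 0:
        meaning = GENERIC_STATUS_CODES.get(status_code, "unknown generic status")
    else:
        meaning = "non-generic status (not decoded here)"
    return {
        "phase_tag": raw & 0x1,
        "status_code": status_code,
        "status_code_type": status_code_type,
        "more": bool((raw >> 14) & 0x1),
        "do_not_retry": bool((raw >> 15) & 0x1),
        "meaning": meaning,
    }

decoded = decode_status(0x8004)
print(decoded)
```

Under that assumption the entries are rejected admin commands (SQId 0 is the admin queue) rather than media errors, which would fit "Media and Data Integrity Errors: 0". Whether they are related to the zpool errors is not something this output can show.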
            
            • Haui

              You're not telling me you're reading 62 TB in 117 hours, right? Right? xD The old ones were even in the petabytes.

              Those numbers are just insane. I have worked with AI training and storage. I have never seen such numbers.

              Well, I suppose that NVMe was very much EOL. Now I understand the behavior. This many operations in such a short time will put serious strain on your system. No wonder parts can give up. Are you using a RAID config? Sorry if you already mentioned it.
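
For anyone who wants to check the "62 TB in 117 hours" figure: NVMe SMART counts data units of 1000 blocks of 512 bytes each, so the totals and the average throughput can be recomputed from the numbers in the snippets above. A minimal sketch; the helper names are made up for illustration.

```python
# One NVMe "data unit" in the SMART log is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512

def data_units_to_tb(units: int) -> float:
    """Convert NVMe data units to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12

def average_mb_per_s(units: int, power_on_hours: int) -> float:
    """Average throughput in MB/s over the whole power-on time."""
    seconds = power_on_hours * 3600
    return units * BYTES_PER_DATA_UNIT / 1e6 / seconds

# Values copied from the SMART output in this thread.
new_read_tb = data_units_to_tb(122_135_021)            # new drive, 117 h
new_read_rate = average_mb_per_s(122_135_021, 117)
old_read_pb = data_units_to_tb(7_636_639_249) / 1000   # old drive, 17,437 h
old_read_rate = average_mb_per_s(7_636_639_249, 17_437)

print(f"new drive: {new_read_tb:.1f} TB read, {new_read_rate:.0f} MB/s on average")
print(f"old drive: {old_read_pb:.2f} PB read, {old_read_rate:.0f} MB/s on average")
```

The new drive works out to roughly 150 MB/s of reads sustained over every one of its 117 power-on hours. That is within what an NVMe drive can do, but it only fits a restore, a scrub and some rsync if the drive was busy for much of that time or was not factory-new on arrival. The latter is a guess, not something the output proves.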

              • MilanOPMA

                I am not sure about those numbers on the new ones … it was one DB restore and a few hours of uptime … a scrub … then I rsynced some stuff over, and since then the thing has been idle 🤷

                Sample of the currently active system … I think at time of arrival it was 2+ TB written or something:

                SMART/Health Information (NVMe Log 0x02)
                Critical Warning:                   0x00
                Temperature:                        37 Celsius
                Available Spare:                    100%
                Available Spare Threshold:          10%
                Percentage Used:                    0%
                Data Units Read:                    88,116,921 [45.1 TB]
                Data Units Written:                 43,968,235 [22.5 TB]
                Host Read Commands:                 689,015,212
                Host Write Commands:                409,762,513
                Controller Busy Time:               1,477
                Power Cycles:                       4
                Power On Hours:                     248
                Unsafe Shutdowns:                   0
                Media and Data Integrity Errors:    0
                Error Information Log Entries:      0
                Warning  Comp. Temperature Time:    0
                Critical Comp. Temperature Time:    0
                Temperature Sensor 1:               37 Celsius
                Temperature Sensor 2:               46 Celsius
                
                Error Information (NVMe Log 0x01, 16 of 64 entries)
                No Errors Logged
                
                • Haui

                  I might now understand what happened to your nvme (just a guess):

                  SSDs have „spare“ sectors, not available to you until the old ones are used up. Then the new ones get cycled in.

                  The other info said: no spare available, usage 250%

                  I have read about this I think. If the spare sectors run out and the drive starts to get smaller and smaller, the system will fill it up to its old capacity and overwrite data, thus corrupting itself.

                  That's what happens to phony SSDs that get sold as TB drives but are 250 GB USB drives instead. As long as you only fill 250, you will not notice that something is wrong. Once you go above, you start losing data.

                  Not totally sure it works that way in SSDs, but I'm somewhat sure this 250% usage is an indicator of a run-down SSD.

                  And I still think it is pure negligence of Hetzner to not have swapped them out when they were due.

                  Didn’t they run in RAID 1 or something? Usually, if a drive fails, the second one should hold.

                  • MilanOPMA

                    I am a bit confused now … the spare was 98%, as you can read in my snippet above … where does it say “no spare available”? I think it is on me to request a swap, and that's what I did, as the one with slightly less wear also reported 255% used – which afaik is an approximate estimate of the lifetime used, based on r/w cycles (not sure about all factors).

                    The one the hoster left in for me to play with said no:

                    [Wed Jul 26 19:19:10 2023] nvme nvme1: I/O 9 QID 0 timeout, disable controller
                    [Wed Jul 26 19:19:10 2023] nvme nvme1: Device shutdown incomplete; abort shutdown
                    [Wed Jul 26 19:19:10 2023] nvme nvme1: Removing after probe failure status: -4
                    

                    Tried multiple kernel flags and stuff but couldn’t get past that error. It would have been interesting to have the hoster ship the thing to me (and maybe that would have been a long enough cooldown to get the thing working again), but I assume that would have been expensive from Helsinki.
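
To make the "spare vs. used" confusion above a bit more concrete, here is a minimal sketch of how the three health fields from the old drive's snippet can be read. The bit meanings follow the NVMe SMART / Health log definition as far as I know it; the function names are made up for illustration and the wording of each bit is shortened.

```python
# Bits of the NVMe SMART "Critical Warning" byte (descriptions shortened).
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media in read-only mode",
    4: "volatile memory backup device failed",
}

def decode_critical_warning(value: int) -> list[str]:
    """Return the descriptions of all warning bits set in value."""
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & (1 << bit)]

def spare_below_threshold(spare_pct: int, threshold_pct: int) -> bool:
    """Spare blocks only become a warning once they drop below the threshold."""
    return spare_pct < threshold_pct

def endurance_consumed(percentage_used: int) -> bool:
    """100 or more means the vendor's endurance estimate is used up.
    The field saturates at 255, so 255 really means 'more than 254'."""
    return percentage_used >= 100

# Values from the old drive's SMART output in this thread.
warnings = decode_critical_warning(0x04)
spare_low = spare_below_threshold(98, 10)
worn_out = endurance_consumed(255)
print(warnings, spare_low, worn_out)
```

Read this way, the old drive still had plenty of spare blocks (98% against a 10% threshold), but its rated endurance was used up and it flagged degraded reliability. Percentage Used is a vendor estimate and does not by itself mean the drive has failed.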