• ramenshaman@lemmy.world
    12 hours ago

    Trying to reproduce a bug that consistently only shows up the first time we boot up our machines that day. Currently testing to see how long the machine needs to be off in order for the bug to show up again. I’m at an hour now.
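
    One way to keep track of this kind of off-time experiment is to record each trial and let a small helper suggest the next wait. This is only a sketch: the trial data, the function names, and the assumption that the bug depends purely on off-time are all hypothetical.

```python
from typing import List, Optional, Tuple

Trial = Tuple[float, bool]  # (minutes powered off, bug seen on next boot)


def off_time_bounds(trials: List[Trial]) -> Tuple[Optional[float], Optional[float]]:
    """Return (longest off-time without the bug, shortest off-time with it)."""
    clean = [minutes for minutes, bug_seen in trials if not bug_seen]
    buggy = [minutes for minutes, bug_seen in trials if bug_seen]
    return (max(clean) if clean else None, min(buggy) if buggy else None)


def next_off_time(trials: List[Trial], default: float = 60.0) -> float:
    """Suggest the next off-time to try, in minutes."""
    lower, upper = off_time_bounds(trials)
    if upper is None:
        # The bug has not reproduced yet: wait twice as long as the longest clean trial.
        return default if lower is None else lower * 2
    if lower is None:
        # Every trial showed the bug: try a shorter wait.
        return upper / 2
    # Both bounds known: bisect. If lower > upper, the trials contradict the
    # "purely time-based" assumption and the result is not meaningful.
    return (lower + upper) / 2


# Hypothetical trials: clean after 5 and 60 minutes off, bug after 16 hours off.
trials = [(5.0, False), (60.0, False), (960.0, True)]
print(off_time_bounds(trials), next_off_time(trials))
```

    If a clean trial ever turns out longer than a buggy one, that is a hint the trigger is something other than how long the machine was off.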

    • Buddahriffic@lemmy.world
      11 hours ago

      Another angle to try is to set the date one day ahead and see if the bug shows up then. Might need to disconnect from network and set it in the BIOS for the test to work properly.

      I could be wrong, but I figure after being off for an hour, all capacitors should have discharged by then, so it’s probably not based on how long the hardware has been unpowered.

      Though one other angle I just thought of, if you have something that runs periodically, maybe the bug is related to that period being missed once or n times. Or it could be related to something that is meant to wake the computer to run some job and then go back to sleep but instead just sets it in a bad state.

      • ramenshaman@lemmy.world
        11 hours ago

        The date/time aspect is an interesting thought. For a bit more context, this machine is a Raspberry Pi connected to several other devices, some via USB and some via a CAN network. The system gets powered on manually, the user performs a task, then shuts it down until they need it again. We only use the date/time for logging. The system is connected to our wifi at our facility, but once we ship it, it will likely never be connected to the internet again, except maybe when we’re servicing it and updating code. I don’t think the Pi has an RTC. I don’t really see how the date/time could be causing the issue I’m seeing (which seems to be lag in communication with the devices on the CAN network), but I guess stranger things have happened.
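
        To put numbers on the CAN lag, one option is to timestamp each received frame and flag unusually long gaps. This is a rough sketch, not the actual setup: `receive` is a hypothetical stand-in for whatever call blocks until the next frame arrives, and the threshold is made up. It uses `time.monotonic()` rather than the wall clock, on the assumption that a Pi without an RTC may see its system time jump once it syncs.

```python
import time
from typing import Callable, Iterable, List, Tuple


def record_arrivals(receive: Callable[[], object], count: int) -> List[float]:
    """Call receive() count times and record a monotonic timestamp after each frame."""
    stamps = []
    for _ in range(count):
        receive()  # hypothetical: blocks until the next CAN frame arrives
        stamps.append(time.monotonic())
    return stamps


def find_gaps(timestamps: Iterable[float], threshold_s: float) -> List[Tuple[int, float]]:
    """Return (index, gap) for each frame arriving more than threshold_s after the previous one."""
    gaps = []
    previous = None
    for index, stamp in enumerate(timestamps):
        if previous is not None and stamp - previous > threshold_s:
            gaps.append((index, stamp - previous))
        previous = stamp
    return gaps


# Made-up arrival times in seconds: the fourth frame is a full second late.
print(find_gaps([0.0, 0.25, 0.5, 1.5, 1.75], threshold_s=0.5))
```

        Logging the gaps on the first boot of the day and again on a later boot would show whether the lag is actually larger on the first one.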

        • Buddahriffic@lemmy.world
          10 hours ago

          Ah, that’s interesting. If you can swap the devices from one Pi to another, try powering it all up on machine A, then swap the devices to machine B and power that on. That might tell you whether the issue is on the Pi side or with the devices.

          Is latency higher on the first boot than on subsequent ones? I’d be looking into race conditions if you’re seeing a bit of lag cascade out into bigger problems. Race conditions are the worst, especially when the race most often goes the right way and just occasionally goes the wrong way. Though you can force the wrong way by adding delays in your code, if you have an idea of where the race is happening.
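
          As a toy illustration of forcing a race with an added delay: two threads do an unprotected read-modify-write on a shared counter, and a sleep placed between the read and the write widens the race window so the lost update should show up on essentially every run. Everything here (the names, the delay value) is invented for the example and has nothing to do with the actual CAN code.

```python
import threading
import time
from typing import Optional


class Counter:
    def __init__(self) -> None:
        self.value = 0


def unsafe_increment(counter: Counter, delay_s: float) -> None:
    """Read, optionally stall inside the race window, then write back."""
    current = counter.value
    if delay_s > 0:
        time.sleep(delay_s)  # injected delay: lets the other thread read the stale value
    counter.value = current + 1


def increment(counter: Counter, delay_s: float, lock: Optional[threading.Lock]) -> None:
    if lock is None:
        unsafe_increment(counter, delay_s)
    else:
        with lock:  # the lock makes the read-modify-write atomic
            unsafe_increment(counter, delay_s)


def run(delay_s: float, use_lock: bool) -> int:
    """Run two incrementing threads and return the final counter value."""
    counter = Counter()
    lock = threading.Lock() if use_lock else None
    threads = [
        threading.Thread(target=increment, args=(counter, delay_s, lock))
        for _ in range(2)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


print("with delay, no lock:", run(0.2, use_lock=False))
print("with delay and lock:", run(0.2, use_lock=True))
```

          Without the lock, both threads read the same starting value during the sleep and one increment is lost. With the lock, both increments land even with the delay in place, which is a decent way to check that a fix really closes the window.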

          • ramenshaman@lemmy.world
            9 hours ago

            We have 3 theoretically identical systems here and this same issue occurs on 2 of them. The 3rd one… has bigger issues right now. That would be interesting to see what happens if I swap the Pis around but I’d give it >95% chance the same thing happens.

            • Buddahriffic@lemmy.world
              5 hours ago

              The important bit is to power one on first before the swap. Then you’ll have one setup where the Pi was recently powered on and another setup where the connected devices were recently powered on. You might see the issue on only one of the setups, at which point you can say whether it’s the Pi being off for a while or the devices being off for a while that triggers the issue.

              • ramenshaman@lemmy.world
                5 hours ago

                Good point. I disabled the internet on both systems so when I come in on Monday hopefully I can confirm whether or not the date/time aspect is a problem. I’ll try this as well.

        • skuzz
          10 hours ago

          Oh, apologies for my suggestion before seeing this comment hahaha!

          CAN devices I have limited experience with, but I know at least in the automotive industry, vehicles often have various CAN devices that have various sleep states. Like, shut car off, it holds brake system for a few minutes and then unlocks the brakes and that ECU shuts down. Later on, an emissions ECU may run a self-diagnostic. After a few days being powered off, the security ECU goes into low power and turns off wireless doorlocks. After the voltage drops too low, the ECU in the head unit ostensibly shuts down, and the next time the car is started, the head unit has to do a cold-reboot and takes a fortnight.

          Could be one of those CAN devices takes some time to get into the “off-adjacent” state to manifest the bug?

          • ramenshaman@lemmy.world
            9 hours ago

            Largely for safety reasons, anytime the system is turned off power is instantly cut to the entire system. All of our CAN devices boot up much faster than the Pi does. Once the Pi boots, it sets up CAN communication.

    • skuzz
      10 hours ago

      Could the time delay in being able to reproduce it relate to some piece of code that has a timeout (thinking login timeout, cookie expiration, auth timeout, that sort of thing)? Or likewise, does the computer in question have multiple shutdown phases, like how many computers today “sleep” to RAM, and then an hour later sleep to disk in a more hibernatey fashion and fully power off? (Or some weirdness like how Windows shutdown now is ostensibly a hibernate, but a reboot is actually a full “power down, power up” without shutting off power.)

      I like @Buddahriffic@lemmy.world’s take on it being wall-clock-based. I once had a bug with some software that would just go belly-up on certain days for no reason whatsoever in a datacenter 2000 miles away. After having worked on some bare metal servers in the past and learned all about thermal issues firsthand, I checked the weather in that region. It only seemed to happen on extremely hot summer days, at the day’s temperature peak. Turns out the datacenter vendor had a cooling problem in that section of the DC and they were unaware of it…

      Crazy sometimes how bugs manifest.