• Fisch
    3 months ago

    That would actually be insane. Right now, I still need my GPU and about 8-10 GB of VRAM to run a 7B model though, so I don’t know how that’s supposed to work on a phone. Still, being able to run a model that’s as good as a 70B model but with the speed and memory usage of a 7B model would be huge.

    • JackGreenEarth@lemm.ee
      3 months ago

      I only need ~4 GB of RAM/VRAM for a 7B model, and my GPU only has 6 GB of VRAM anyway. Either 7B models are smaller than you think, or you have a very inefficient setup.

      • Fisch
        3 months ago

        That’s weird, maybe I actually am doing something wrong. Is it because I’m using GGUF models maybe?

        • Mike1576218@lemmy.ml
          3 months ago

          Llama 2 GGUF with 2-bit quantisation only needs ~5 GB of VRAM; 8-bit needs >9 GB. Anything in between is possible. There are even 1.5-bit and 1-bit options (not GGUF, AFAIK). Generally, fewer bits means worse results though.
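
As a rough sanity check on numbers like these, the weights alone take about (parameter count × bits per weight / 8) bytes. Here is a minimal sketch of that arithmetic. The function name is made up for illustration, and it deliberately ignores the KV cache, activations, and runtime overhead, so actual VRAM usage will be higher. Real quant formats also tend to store somewhat more than their nominal bit width per weight, so treat the output as a lower bound, not a measurement.

```python
def estimate_weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    Hypothetical helper: ignores KV cache, activations and runtime overhead.
    """
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Lower-bound weight sizes for a 7B model at a few nominal bit widths.
for bits in (2, 4, 6, 8, 16):
    print(f"7B @ {bits}-bit: ~{estimate_weight_memory_gb(7, bits):.2f} GB")
```

By this estimate a 7B model at a nominal 4 bits is about 3.5 GB of weights and at 6 bits about 5.25 GB, which would be in line with the ~4 GB figure mentioned upthread once some overhead is added.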

          • Fisch
            3 months ago

            Yeah, I usually take the 6-bit quants; I didn’t know the difference was that big. That’s probably why though. Unfortunately, almost all Llama 3 models are either 8B or 70B, so there isn’t really anything in between. I find Llama 3 models to be noticeably better than Llama 2 models, otherwise I would have tried bigger models with lower quants.

    • Chrobin
      3 months ago

      I have never worked on machine learning. What does the B stand for? Billion? Bytes?