Recently years have witnessed a rapid development of large language models (LLMs). Despite the strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization meanwhile decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. Code will be made available at this https URL.

  • rufus

    I don’t know how other people feel about this, but I would like it if you gave a bit more context on links like this. Just so I can decide if I want to read that paper (or click the link) or not. I’m not really an expert, so this paragraph doesn’t help me contextualize anything.

    For example, in this case I remember skimming through “QLORA: Efficient Finetuning of Quantized LLMs” from May. But it needs to be dumbed down a bit so I can figure out what the new achievement is. Or link a news report that contextualizes it, rather than just throwing the paper at me.

    Feel free to ignore my comment if everyone else is an expert. I’m just saying because it’s kind of time-consuming to click on the link and read the abstract and conclusion and look up a few things just to understand what we’re talking about. Once we’re discussing several papers a week and I’m just a hobbyist, I’d like the link to come along with a summary and some context.

      • rufus

        Thank you very much for your explanation. I can understand that one. This is exactly the important difference. In my words it’d be: they figured out a way to improve on the maths, making the calculations faster (by reducing the dimensionality of an important matrix multiplication).

        But there is another important aspect to it. They keep the quantized property after the fine-tuning, which QLoRA doesn’t. That makes it a bit more precise than doing another (lossy) quantization after the fact.
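
        To illustrate what I mean by the extra lossy step, here’s a toy numpy sketch I made up myself (the per-column 4-bit scheme, the sizes and the variable names are just my assumptions, nothing from either paper):

        ```python
        import numpy as np

        rng = np.random.default_rng(0)

        def quantize_4bit(w):
            """Toy per-column asymmetric 4-bit quantization (illustrative only)."""
            lo = w.min(axis=0)
            scale = (w.max(axis=0) - lo) / 15.0
            q = np.round((w - lo) / scale).clip(0, 15)   # INT4 codes
            return q, scale, lo

        def dequantize(q, scale, lo):
            return q * scale + lo

        W = rng.normal(size=(64, 64)).astype(np.float32)        # pretrained weight
        A = 0.1 * rng.normal(size=(64, 8)).astype(np.float32)   # LoRA factors
        B = 0.1 * rng.normal(size=(8, 64)).astype(np.float32)
        delta = A @ B                                            # learned update

        W_q = dequantize(*quantize_4bit(W))                      # 4-bit base model

        # QLoRA-style deployment: merge the full-precision delta into the weight,
        # then quantize again. The second rounding throws part of the update away.
        W_merged = dequantize(*quantize_4bit(W_q + delta))

        print("re-quantization error:", np.abs(W_merged - (W_q + delta)).mean())
        ```

        It prints a nonzero error, which is what I mean by the second quantization being lossy.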

        Your explanation got me on track to figure it out. Thanks. I wrote another longer reply to noneabove1182. I’m not going to repeat everything, but I think I’m satisfied now.

    • noneabove1182@sh.itjust.works (OP)

      Sure, I can try to add a couple lines on top of the abstract just to give a super brief synopsis

      In this case it would be something like:

      This paper discusses a new technique in which we can create a LoRA for an already quantized model. This is distinct from QLoRA, which quantizes the full model on the fly to create a quantized LoRA. With this approach you can take your small model and work with it as is, saving a ton of resources and speeding up the process massively.

      • rufus

        I’m sorry, today’s not my day… I still don’t get it. Did you write that summary, or is the paragraph/synopsis AI-generated?

          • rufus

            Thanks. Whoo boy, are there parts I didn’t understand. Sorry for asking whether you did the summary yourself. I really couldn’t tell if it was my inability to follow what you said… or if that paragraph just didn’t make any sense because it was written by ChatGPT. So I just asked. (I still think it’s not quite right. Both methods quantize the model in the same way (contrary to what you said). The thing is, QA-LoRA is able to keep it quantized during the whole process and QLoRA is not.)

            I think my knowledge level is about undergraduate computer science, without having attended any course on machine learning. But I like to read those papers and play with the stuff.

            I think I got it now. The important explanation is in 3.2 and 3.3:

            • “In brief, the only benefit brought by QLoRA is the reduced memory cost for fine-tuning.”
            • “achieving the second goal lies in that W˜ (i.e., the quantized W) and s·AB can be merged without using high-precision numbers (e.g., FP16).”
            • “this is impossible in the original setting, i.e., W is quantized into W˜ in a column-wise manner while both A and B are unconstrained.”

            And then they invent the maths, grouping the operations and numbers in a way that makes it possible. As a result, they reduce the complexity of an important part of the calculation, keep it within the same datatype, and successfully avoid another lossy quantization step.
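
            If I translate that understanding into code, it’s roughly the following. This is only a toy numpy sketch of the group-wise idea as I picture it; the shapes, the average pooling of the input, and folding the adapter into the zero points are my own reconstruction, not the authors’ implementation:

            ```python
            import numpy as np

            rng = np.random.default_rng(0)
            D_in, D_out, r, L = 64, 32, 8, 16      # L groups along the input dimension
            g = D_in // L                           # group size

            W = rng.normal(size=(D_in, D_out)).astype(np.float32)

            # Group-wise asymmetric 4-bit quantization: one (scale, zero) pair per
            # (input group, output column) instead of one per column.
            Wg = W.reshape(L, g, D_out)
            lo = Wg.min(axis=1, keepdims=True)
            scale = (Wg.max(axis=1, keepdims=True) - lo) / 15.0
            q = np.round((Wg - lo) / scale).clip(0, 15)        # INT4 codes
            W_q = (q * scale + lo).reshape(D_in, D_out)        # dequantized view

            # QA-LoRA-style adapter: A acts on the group-averaged input, so its
            # contribution is the same for every input inside a group.
            A = 0.1 * rng.normal(size=(L, r)).astype(np.float32)
            B = 0.1 * rng.normal(size=(r, D_out)).astype(np.float32)
            s = 1.0

            x = rng.normal(size=(D_in,)).astype(np.float32)
            x_pooled = x.reshape(L, g).mean(axis=1)            # average pooling per group

            y_finetuned = x @ W_q + s * (x_pooled @ A @ B)

            # Merge: the adapter adds a constant per (group, column), so it can be
            # absorbed into the per-group zero points. The INT4 codes never change.
            lo_merged = lo + (s / g) * (A @ B).reshape(L, 1, D_out)
            W_merged = (q * scale + lo_merged).reshape(D_in, D_out)

            print(np.allclose(x @ W_merged, y_finetuned))      # True
            ```

            At least in this toy version the merged model gives the same output as the fine-tuned one (up to float error) while the weights stay on the 4-bit grid, which matches my reading of why no second quantization step is needed.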

            I hope I’m right. Because I had to work hard and read it several times to get to this point.

            I have one serious question left:

            I don’t have a solid understanding of the consequences of using quantized numbers in (fine-)tuning. I always thought we needed a tiny learning rate for the gradient descent in machine learning. With more “coarse” quantized numbers, I thought that limits your ability to do small adjustments. Isn’t that how discrete numbers work? And 4 bits isn’t much, it’s just 16 different values. I don’t get why the loss function converges under these circumstances. I’d expect the error to become significant at some point, making the process unstable. But obviously it is possible.

            And I have so many questions regarding the wording of this specific paper. The paper is a bit weird to me. Every few sentences, I get lost. You don’t need to read the rest of this comment. It’s more or less just my incomprehension of why I cannot understand the authors of that specific paper.

            I completely don’t get why they start their abstract with the words: “Recently years have witnessed a rapid development of large language models (LLMs).”

            No shit Sherlock? Who are they expecting to read this? Did they need to come up with a few more words for their abstract for some reason? And why a grammar/spelling mistake in the very first word of a paper?

            While having a look at their bibliography, I noticed it’s fairly common to open your paper with a stupid sentence like “Recent advances in … have …”. But usually it’s a bit more focused than referencing what broadly happened in the world in the last 5 years.

            And then in the introduction they just drop random papers. I don’t understand the connection to “Language Models are Few-Shot Learners” and “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model”. But I guess I’m not a scientist and never wrote a paper myself.

            Then they introduce the concepts of parameter-efficient fine-tuning (in this case LoRA) and quantization. I just skimmed through the QLoRA paper again, and I can understand those explanations (section “2 Background”) way better.

            And then they drop what I’d expect are the two most influential papers for this subject: “LLM-QAT: Data-Free Quantization Aware Training for Large Language Models” (arXiv:2305.17888) and “QLoRA: Efficient Finetuning of Quantized LLMs” (arXiv:2305.14314).

            And only then do they say: “To the best of our knowledge, the most related work is QLoRA (Dettmers et al., 2023a)”. So they completely left me confused until the very end of section 2. I’m not an expert in this field, and my memory is bad. So upon first reading the paper, I kept asking myself whether I remembered correctly that I’d already read something like this, or whether it was some incoherent ramblings on Reddit back then.

            And why are they constantly comparing it to LoRA, as if QLoRA weren’t the leading paper to compare oneself against? They don’t even mention it in “Related Work” until the very end, in the only paragraph that has no bold text in it, which I probably didn’t read while skimming through the paper the first three times. Isn’t the way science works to state the current state of research and then say how you were able to improve on it? Why reference some arbitrary earlier point in the history of the field instead of the current state?

            I’m not able to follow the maths. But the diagram “Figure 2” looks like QLoRA and QA-LoRA are almost the same, except for the magic with the D_in and L that makes it take fewer multiplications and directly output INT4 weights. This was the figure that led me to the conclusion that I needed to dig further.

            I’m sorry for rambling on and on. Most of the time I can get something out of these papers, despite not being educated properly in machine learning. This was one of the few papers that left me completely without a clue and dumbfounded. Guess my brain is wired differently than the authors’ brains.

            I still don’t get why they explain things the way they do. For example, the abstract is a few sentences of unnecessary stuff every 8-year-old knows. And then they immediately leap into super specific math details of one aspect, with “the imbalanced degrees of freedom of quantization”, and talk about their “group-wise operators which increase the degree of freedom of quantization meanwhile decreasing that of adaptation”. Which I find ridiculous. It is why their proposed approach mathematically works, but they haven’t yet explained what their approach even is, or what they’re trying to do in the first place and why this is important. Sure: “edge devices”. But there’s quite a bit missing. If you can infer how low-rank adaptation is connected to edge devices, you don’t really need them to teach you the importance of LLMs. Then they say how many lines of code they need, which is a bit out of place. And then they state their achievements over LoRA, which I guess is fine. So is mentioning they applied it and are releasing the code.

            • noneabove1182@sh.itjust.works (OP)

              The abstract is meant to pull in random readers, so it’s understandable they’d lay a bit of foundation about what the paper will be about, even if it seems rather simple and unnecessarily wordy

              LoRA is still considered the gold standard in efficient fine-tuning, so that’s why a lot of comparisons are made to it instead of QLoRA, which is more of a hacky approach. They both have their advantages, but are pretty distinct.

              Another thing worth pointing out is that 4-bit is not actually just converting all 16-bit weights into 4 bits (at least, not in GPTQ style). They also save a quantization factor, so there’s more information that can be retrieved from the final quantization than just “multiply everything by 4”.
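
              Roughly like this (a toy numpy example of the general scale-plus-zero-point idea, not actual GPTQ code, and the numbers are made up):

              ```python
              import numpy as np

              w = np.array([0.012, -0.034, 0.055, -0.021], dtype=np.float32)  # one tiny group

              # Naively casting these small values to 4-bit integers would wipe them out.
              # Instead, the group stores a scale (and a zero point), so the 16 available
              # codes are spread over this group's own range.
              lo = w.min()
              scale = (w.max() - lo) / 15.0
              codes = np.round((w - lo) / scale).astype(np.int8)   # the stored 4-bit codes
              recovered = codes * scale + lo                        # dequantized at runtime

              print(codes)       # [ 8  0 15  2]
              print(recovered)   # close to the original weights
              ```

              So even though there are only 16 codes, each group gets its own mapping from codes back to real values.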

              QA-LoRA vs QLoRA: I think my distinction is the same as what you said; it’s just about the starting and ending state. QLoRA, though, also introduced a lot of other techniques, like double quantization, the NormalFloat datatype, and paged optimizers, to make it work.

              It’s also worth pointing out that not understanding it has nothing to do with intellect; it’s just how much foundational knowledge you have. I don’t understand most of the math, but I’ve read enough of the papers to understand to some degree what’s going on.

              The one thing I can’t quite figure out is this: I know QLoRA is competitive with a LoRA because it trains more layers of the transformer than a LoRA does, but I don’t see any specific mention of QA-LoRA following that same method, which I would think is needed to maintain the quality.

              Overall you’re right though, this paper is a bit on the weaker side. That said, if it works then it works, and it’s a pretty decent discovery, but the paper alone does not guarantee that.