From the abstract: “Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.”

This would allow larger models to run on limited resources. However, it isn’t a quantization method you can convert existing models to after the fact. It seems models need to be trained from scratch this way, and so far the authors have only gone as far as 3B parameters. The paper isn’t that long, and it seems they didn’t release the models. It builds on the BitNet paper from October 2023.

“the matrix multiplication of BitNet only involves integer addition, which saves orders of energy cost for LLMs.” (no floating point matrix multiplication necessary)
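
To illustrate why ternary weights remove the multiplications, here is a minimal sketch in plain Python. It is not the paper's kernel: the function name `ternary_matvec` and the integer input vector are my own assumptions for illustration. With weights limited to -1, 0 and 1, each output is just a sum of some inputs minus a sum of others.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, 1}) by a vector x
    using only additions and subtractions (hypothetical helper)."""
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # weight +1: add the input
            elif w == -1:
                acc -= xi      # weight -1: subtract the input
            elif w != 0:
                raise ValueError("weights must be -1, 0 or 1")
            # weight 0: the input is skipped entirely
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [3, 5, 2]  # assumed integer activations
print(ternary_matvec(W, x))  # [1, 4]
```

The result matches an ordinary dot product per row, but no multiplication is ever executed.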

“1-bit LLMs have a much lower memory footprint from both a capacity and bandwidth standpoint”
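
A rough back-of-the-envelope calculation shows where the "1.58" comes from and what it could mean for size. This is only a sketch under my own assumptions: ideal packing of the ternary weights, weights only, and a hypothetical helper `weight_bytes`. Real savings will presumably be smaller, since not everything in a model is stored as ternary weights.

```python
import math


def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params weights at the given bit width
    (hypothetical helper, ignores any packing overhead)."""
    return n_params * bits_per_weight / 8


# A value with three possible states carries log2(3) bits of information.
bits_ternary = math.log2(3)

n_params = 3_000_000_000  # the 3B model size mentioned above
fp16 = weight_bytes(n_params, 16)
ternary = weight_bytes(n_params, bits_ternary)

print(f"{bits_ternary:.2f} bits per ternary weight")  # 1.58 bits per ternary weight
print(f"FP16: {fp16 / 1e9:.1f} GB, ternary: {ternary / 1e9:.2f} GB")  # FP16: 6.0 GB, ternary: 0.59 GB
```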

Update: an additional FAQ has been published.

  • rufusOP
    9 months ago

    I can’t find that mention of “8-bit models” anywhere in the paper. Skimming it again, I only see references and comparisons to FP16.

    I know these discussions from llama.cpp and ggml quantization. With that you can quantize a model more and more and it becomes worse the lower the precision gets. You can counter that by using a larger model that was more “intelligent” in the first place… With that you can calculate the sweet spot and what gives you the best quality at a certain compute cost or size… A more degraded bigger model, or a less degraded smaller model…

    But we don’t have different quantization levels here, just one. And it’s also difficult to compare, as with ggml you take the same model and quantize it to different levels… We also don’t have that here, you can’t take an existing model with this approach and quantize it and compare it to another… You have to train a new model from scratch. And then it’s a different model.

    I can’t find a good analogy here… Maybe it’s a bit like asking if the file size of a JPEG image is more important than the resolution… It’s kind of the wrong question. You can compare different compression levels of the JPEG image, or compare the size of the JPEG to a BMP file… It’s really not a good analogy, but a BMP file with 20 times the size looks exactly like a smaller JPEG file on the screen. And you can also have a 7B parameter LLM give better answers than a poor (or older) 13B model. It’s neither just parameter count nor precision alone.

    So if they say they can make do with less than a third of the RAM and compute time and simultaneously score a tiny bit higher in the benchmarks, I don’t see a tradeoff here.

    Generally speaking you can ask the question: What delivers the best results at a given compute cost? Or the other way around: What has the lowest cost to arrive at a certain point? But this is kind of a different technique: same parameter count, same results, but significantly lower computing cost on inference.

    (And reading all the speculation elsewhere: There might be a different tradeoff. The authors didn’t talk about training and just made very small models. A more complex and expensive training process could be a tradeoff.)