From the abstract: “Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.”

This would allow running larger models on limited hardware. However, it isn’t a quantization method you can apply to existing models after the fact; models apparently need to be trained from scratch this way, and so far the authors have only gone up to 3B parameters. The paper is fairly short, and it seems they didn’t release the models. It builds on the BitNet paper from October 2023.

“the matrix multiplication of BitNet only involves integer addition, which saves orders of energy cost for LLMs.” (no floating point matrix multiplication necessary)
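To illustrate why ternary weights remove the multiplications, here’s a minimal NumPy sketch (my own illustration, not the paper’s actual kernel): each weight in {-1, 0, 1} either adds the activation, subtracts it, or skips it, so a dot product reduces to additions.

```python
import numpy as np

def ternary_matvec(W, x):
    # W: (out, in) matrix with entries in {-1, 0, 1}; x: input vector.
    # No multiplications: sum the activations where the weight is +1,
    # subtract those where it is -1, ignore the zeros.
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        row = W[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

W = np.array([[1, -1, 0],
              [0,  1, 1]])
x = np.array([3.0, 2.0, 5.0])
print(ternary_matvec(W, x))                       # [1. 7.]
print(np.allclose(ternary_matvec(W, x), W @ x))   # True
```

A real kernel would of course pack the ternary weights densely and use integer accumulation, but the point stands: the inner loop is pure addition/subtraction.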

“1-bit LLMs have a much lower memory footprint from both a capacity and bandwidth standpoint”

Edit: additional FAQ published

  • cum@lemmy.cafe · 9 months ago

    This is sick. Would this lead to better offline LLMs on mobile?

    • rufusOP · 9 months ago (edited)

      I think we’re already getting there. Lots of newer phones include AI accelerators, and all the companies are advertising AI. I don’t think those chips were made to run LLMs, but anyway: Llama.cpp already runs on phones, and the limiting factor seems to be RAM. I’ve tried Microsoft’s “phi-2”, quantized and on slow hardware, and it’s surprisingly capable for such a small model. Something like a ternary model would significantly cut down on the amount of RAM being used, which allows loading larger models while also making inference faster, everywhere. So I’d say yes. And it would also let me load a more intelligent model on my PC.

      I think doing away with matrix multiplications is also a big deal, but it has few consequences as of today. You’d first need to re-design the chips to take advantage of it, and local inference is typically limited by memory bandwidth, not multiplication speed. At least as far as I understand.

      I’d say if this holds up, it allows for a big improvement in parameter count for all kinds of use cases. But I’ve also come to the conclusion that there might be a caveat: maybe the training is prohibitively expensive. I don’t really know; at this point there is too much speculation going on, and I’m not really an expert.

        • cum@lemmy.cafe · 9 months ago

        Yeah, I knew AI chips were becoming more common, but this is a really good write-up, thanks!