🇦🇺𝕄𝕦𝕟𝕥𝕖𝕕𝕔𝕣𝕠𝕔𝕕𝕚𝕝𝕖@lemm.ee to LocalLLaMA@sh.itjust.works · English · 1 day ago
How much GPU do I need to run a 90B model?
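As a rough baseline for the question: weight memory scales with parameter count times bytes per weight, plus overhead for the KV cache and activations. A back-of-the-envelope sketch (the bytes-per-weight values are approximate and the ~20% overhead factor is an assumption, not a measurement):

```python
# Rough VRAM estimate for a dense 90B-parameter model.
# Bytes-per-weight values are approximate; the 20% overhead for
# KV cache and activations is an assumption.
PARAMS = 90e9

for name, bytes_per_weight in [("FP16", 2.0), ("Q8_0", 1.0), ("Q4 (approx)", 0.5)]:
    weights_gb = PARAMS * bytes_per_weight / 1e9
    total_gb = weights_gb * 1.2  # assumed ~20% overhead
    print(f"{name}: ~{weights_gb:.0f} GB weights, ~{total_gb:.0f} GB total")
```

Even at 4-bit quantization a dense 90B model wants on the order of 50 GB of memory, which is why the layer-offloading approaches in the replies below come up.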
Sylovik@lemmy.world · 14 hours ago
In the case of LLMs you should look at AirLLM. I don't think there are convenient integrations with local chat tools yet, but an issue has already been opened with Ollama.
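For reference, AirLLM runs large models on a small GPU by streaming one transformer layer at a time through VRAM. A minimal sketch based on the project's README (the model ID is a placeholder, and the exact API may differ between versions):

```python
# AirLLM sketch: layers are loaded through the GPU one at a time, so a
# 70B+ model can run in a few GB of VRAM (slowly). Model ID is a placeholder.
from airllm import AutoModel

model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_tokens = model.tokenizer(
    ["What is the capital of France?"],
    return_tensors="pt",
    truncation=True,
    max_length=128,
)

generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(generation_output.sequences[0]))
```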
red@lemmy.zip · 5 hours ago
This is useless; llama.cpp already does what AirLLM does (offloading to CPU), but it's actually faster. So just use Ollama.
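For comparison, llama.cpp's partial offload keeps as many layers as fit in VRAM on the GPU and runs the rest on the CPU. A sketch using the llama-cpp-python bindings (the model path and layer count are placeholders):

```python
# Partial GPU offload with llama.cpp via the llama-cpp-python bindings.
# Model path and n_gpu_layers are placeholders; tune the layer count to
# whatever fits in your VRAM (-1 offloads every layer).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-90b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,  # layers kept in VRAM; the rest run on CPU
    n_ctx=4096,
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```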
🇦🇺𝕄𝕦𝕟𝕥𝕖𝕕𝕔𝕣𝕠𝕔𝕕𝕚𝕝𝕖@lemm.ee (OP) · 13 hours ago
That looks like exactly the sort of thing I want. Is there any existing solution to get it to behave like an Ollama instance? (I have a bunch of services pointed at an Ollama instance running in Docker.)
Sylovik@lemmy.world · 5 hours ago
You may try Harbor. The description claims to provide an OpenAI-compatible API.
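If the backend exposes an OpenAI-compatible API (as Harbor's description claims, and as Ollama itself does under /v1), existing services can usually be repointed by swapping the base URL. A sketch with the official openai Python client (the port and model tag are assumptions for your setup):

```python
# Pointing the standard OpenAI client at a local OpenAI-compatible server.
# Base URL and model tag are assumptions; most local servers ignore the key.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="not-needed",
)

resp = client.chat.completions.create(
    model="llama3.1:70b",  # placeholder model tag
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```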