Ive been playing koboldai horde but the queue annoys me. I want a nsfw ai for playing on tavernai chat

  • tal@lemmy.today
    link
    fedilink
    English
    arrow-up
    6
    ·
    6 months ago

    koboldai horde

    I mean, you can run KoboldAI locally.

    I don’t know whether you’d consider that sufficiently fast. But if you’re already using that and happy with it, it’s probably what I’d try first.

  • Fisch
    link
    fedilink
    English
    arrow-up
    2
    ·
    6 months ago

    There’s a fork of text-generation-webui with HIP support, you should use that

  • j4k3@lemmy.world
    link
    fedilink
    English
    arrow-up
    2
    ·
    6 months ago

    The 7600 is the 16GB? I can’t say for AMD but a 16 GB 3080Ti can run a whole lot of something. I don’t do Kobold because building it was too much of a headache of dependencies. I don’t do silly tavern either because I prefer more control and versatility.

    I’m using an 18 core 12th gen with 64GB of sysmem and mostly use llama.cpp so that I can split the load between CPU and GPU. I wrote a little command line function that polls nvidia-smi and parses the GPU memory to tell me exactly how much I have used and what I have left over. That runs every 5 seconds in the terminal and displays the metrics on the title bar. Knowing exactly how much RAM you’re using in the GPU and dialing in the settings with new models makes a big difference. The various models have very different requirements and settings optimisation potential.

    I run an 8×7B quantized model at 5 bits most of the time. It takes around 50GB to initially load, but runs like a 13B after that and is quire light weight.

    I’m somewhat limited when it comes to training LoRA’s. Like I can only do 7-8B model stuff in that space, but with a GGUF I can run up to a 70B. I wish I had more than 64 GB of system memory though. At 96 or 128 I could run some of the 120B models. Command R is pretty popular and powerful, but I can’t load that one.

    The 16 GB can run something like moistral 11B in transformers and 4-bit using bits and bites too.

    • projectmoon@lemm.ee
      link
      fedilink
      English
      arrow-up
      1
      ·
      5 months ago

      How much speed are you actually getting on Mixtral (I assume that’s the 8x7b). I have 64 GB of RAM and an AMD RX 6800 XT with 16 GB of VRAM. I get like 4 tokens per second with Q5_K_M quant.

  • projectmoon@lemm.ee
    link
    fedilink
    English
    arrow-up
    1
    ·
    6 months ago

    Install ollama. It has ROCm support (on Linux at least). Then hook it up to your favorite client. It has its own API and an openai compatible one.