An Update on Generative AI
As I have previously posted, I put together a homelab PC build to explore what I can do with self-hosted models.
Self-hosting is looking more feasible as time goes on. I noted the ‘llamafile’ project in an earlier post; it moves AI text generation to ‘it just works’ status (at least as far as running the program is concerned; all the usual caveats about the quality of what it produces remain).
I also noted that the list of things I could do with my homelab was extensive and growing. Here it is again, to show where I am now:
– Text-to-image (SD)
– Image-to-image (SD)
– Image description (SD, llamafile)
– Face recognition (PyTorch)
– Speech to text (Whisper.cpp)
– Text to speech (Coqui TTS)
The release of the llamafile project spurred me to do some more work on my homelab system. While Stable Diffusion was able to recognize and use the Tesla P40, llamafile refused to do so; it would run, but only in CPU inference mode. With some internet searching, I found a few people saying they were using P40-based systems with llamafile with good results. Unfortunately, they didn’t go into enough detail for me to figure out what their configurations had that mine did not.
I tried removing and reinstalling the Nvidia drivers and the rest of the software stack. That didn’t work. But while hunting for other information, I ran across the Nvidia Container Toolkit. Starting from a clean install of Ubuntu 22.04, I was deliberate about what I added; build-essential, cmake, clang, docker, and docker-compose were among the packages. Invoking ‘ubuntu-drivers devices’ showed the P40 and the variety of drivers to select from, with nvidia-driver-545 marked as ‘recommended’.

With that installed, I got on with the particulars of the Nvidia Container Toolkit installation. I ended up pulling several different Nvidia runtime container images, each of which includes the nvidia-smi utility. Running nvidia-smi via each container showed whether the hardware was accessible using that image’s version of the CUDA toolkit, and I determined this way that CUDA toolkit 12.3.1 was the latest one that showed the P40 as accessible. Because I did this with containers, I didn’t end up with a pile of CUDA toolkit installs and removals on the host system; once I knew what worked, I installed CUDA 12.3.1 on the host.
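For reference, the per-container check boiled down to something like the following. The image tag here is one of the nvidia/cuda tags published on Docker Hub and is shown as an example; the exact images I pulled may have differed.

    # Run nvidia-smi inside a CUDA 12.3.1 container; requires the Nvidia Container Toolkit
    docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi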
After all that, I finally got llamafile to come up and load a model entirely in GPU VRAM. I first used a tiny model intended to be capable of running on low-resource machines like a Raspberry Pi, and after it provided a completion for my first prompt, the reported throughput was a pretty stunning 82 tokens/second. I quickly determined that for the model I really wanted to run, the ‘mixture-of-experts’ Mixtral-8x7b-instruct-Q5-K-M, only about two-thirds of the layers would fit in GPU VRAM. Let’s just call it Mixtral. That model is built on the Mistral AI 7b architecture (the 7b indicates roughly 7 billion parameters), with eight ‘expert’ sub-networks per layer and a router that picks a couple of them for each token. Mixtral has another big feature: it is multilingual, handling English, French, German, Italian, and Spanish. There is more English text in its training corpus, so that is the language it handles best, but so far multilingual LLMs have been pretty scarce.
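For anyone following along, getting everything onto the GPU came down to the ‘-ngl’ (GPU layers) option that llamafile inherits from llama.cpp. A sketch of the sort of invocation I used, with the model filename as a placeholder and the caveat that exact flags can vary by llamafile version:

    # Offload all layers to the GPU and serve on the local network
    ./llamafile -m tinyllama-1.1b-chat.Q4_K_M.gguf -ngl 999 --host 0.0.0.0 --port 8080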
There is some synchronicity here. Llamafile 0.6 was released a couple of weeks ago, and one of its new features is the ability to load layers across multiple GPUs. I assume there is a performance penalty if the GPUs don’t have something like Nvidia’s NVLink interconnect, but the fact that a large LLM could be loaded that way had me ready to try. I had been doing the new build on the first AMD motherboard I got for this project, but I shifted the NVMe SSD to the newer motherboard, dropped a second Nvidia Tesla P40 into the system, and booted up. The hardware differences made that first boot a bit longer than usual, but once the system was up, I was able to run llamafile with the Mixtral model, and it all fit in GPU VRAM split across the two Tesla P40s. I’ve run many prompts since then, and the average performance is right around 20 tokens/s, where CPU-only inference varied between 2 and 6.
With Mixtral running under llamafile in server mode, I can use the llama.cpp web interface from any machine on my network. (And with ZeroTier, my network isn’t limited to just when I’m at home, at least as long as my link is up.) I can also send completion requests via the API; there’s a quick example after the list below. It’s taken months, but I can finally add a couple more items to the capabilities list:
– Self-hosted LLM inference at ~20 t/s (llamafile w Mixtral)
– Vector embedding (Python w sentence-transformers)
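The API part is just the llama.cpp server’s HTTP completion endpoint. A request from another machine on the network looks roughly like this (the host name and prompt are placeholders):

    # Ask the llamafile server for a completion over the network
    curl http://homelab:8080/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Translate to French: The server is running.", "n_predict": 64}'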
Those two new capabilities greatly expand the potential utility of the system. Given Mixtral’s multilingual capability, I am gearing up to do first-pass automated translations of websites I manage. I am also trying to scope out a project for semantic search across both the websites and other materials I have on hand.
Two P40s give a total of 48GB of VRAM, and getting Mixtral loaded takes about 39GB of that. Llamafile has a handy ‘-ts’ (tensor split) parameter that takes a list of integers setting the proportion of layers to put on each GPU in sequence. Since Stable Diffusion doesn’t split its own model loading, I set ‘-ts’ to put more of Mixtral’s layers on the second GPU, which left enough VRAM on the first for Stable Diffusion to load a model. I will likely have to stop the Mixtral server when I want to run additional image-related processing, like image resizing or ‘DeOldify’ (which colorizes black-and-white images automatically). Or rethink the homelab system again to get more VRAM.
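To make that concrete, the launch I’ve settled on for now looks something like the following; the split ratio and model filename are illustrative (a ‘-ts 2,3’ split puts roughly 40% of the layers on the first P40 and 60% on the second).

    # Split Mixtral's layers unevenly across the two P40s, leaving room on GPU 0 for Stable Diffusion
    ./llamafile -m mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf -ngl 999 -ts 2,3 --host 0.0.0.0 --port 8080
    # Check how much VRAM each card has left
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv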
There are more capable GPUs available. But I am coming to the conclusion that getting used Nvidia Tesla P40s is a pretty good way to get an excellent set of usable capabilities without a huge up-front outlay of cash.