Back around January 6th, 2023, I woke up to the “generative AI” (GenAI) phenomenon. The occasional mentions of “ChatGPT” were things I was classing as “chatbot technology” and not really paying attention to because of that. But that was a category error, and I have Tim Schmidt of Replimat to thank for that. Tim told me about how he was using OpenAI’s ChatGPT to assist his programming efforts in building a new 3D printing firmware and software tools stack in Rust. I also came across a thread on Twitter by @svpino on “11 ways I use ChatGPT every day to improve my coding” (IIRC), laying out the various programming tasks that ChatGPT could handle productively.
Within a few days, I had already blown through about $13 of the $18 in API credit OpenAI bestows on new users of its GPT-powered API. And it blew my mind. Even using GPT-3 was a revelation. As I noted in those first days, the GenAI phenomenon is going to democratize expertise like bullets democratized soldiers. This is likely to be a pivotal moment in technological advances, comparable to electrification or the spread of the internet, but it is going to happen much, much faster than either of those.
And I wanted some of that for myself. As in, how can I get this working on my own local gear?
The short answer is that I can’t, and virtually no one can. At least in the broad sense of starting from scratch, it is a non-starter. The word “training” has a specific meaning in GenAI: it is the initial phase in which the weights of the underlying neural networks are set and tuned, and that only happens with the application of massive GPU compute capability coupled with astounding amounts of time and energy. There is no home “starting from zero” approach I know of yet that gets one anywhere near the sort of capabilities that drew my attention early in the year.
However, one can usefully do certain things at home related to GenAI. And one I latched onto as an idea was to run the Stable Diffusion (SD) GenAI on local hardware. Stable Diffusion is a model trained on a large sample of tagged images, such that one way of generating novel images is to provide the model with a text “prompt”, and the model responds with an image as a “completion”. There are mentions online about how using NVIDIA GPUs with sufficient VRAM helps make SD image generation go faster. There are fewer mentions of how even older NVIDIA datacenter GPUs, which are heavily discounted in the used market, can be applied to SD operation. I will intersperse some output from my homelab setup through the following description of setting it up.
I got a bit excited about one such message, which seemed to indicate that the NVIDIA Tesla P40 datacenter GPU could be used for SD, located one on eBay for about $100, and bought it. That was, it turned out, slightly premature. I was thinking that it could simply be added to an AMD 8-core machine I have, but I was wrong. I did eventually get it working, but it was a longer, harder road than I thought I was taking at the start. I’m going to outline what I now have for a homelab box so that others can evaluate for themselves whether to do this, or take the easier (but more expensive) route of simply buying a consumer-grade NVIDIA GPU for their system.
First, let me say that while people are doing some parts of GenAI using other approaches (CPU-only inference is possible for some models, as is using non-NVIDIA GPU compute and memory), the vast majority of effort and the most efficient results are happening on NVIDIA GPUs. There are several things I’ve learned. NVIDIA has made GPUs for a long time, and the products have changed over time. NVIDIA’s computational capability is bound up in its CUDA technology, and GPUs carry varying numbers of CUDA cores. The actual implementation of the CUDA cores changes over time, too, so older GPUs may not be compatible with current code.

The GPU also provides memory, called VRAM to distinguish it from system RAM. The amount of VRAM limits what can be done in a GenAI project in a couple of ways. Using a GenAI model is called “inference”, and loading a model for inference is straightforward: you need at least as much free VRAM as the size of the model weights you are loading. Another process adds information to a previously-trained model, and this is called “fine-tuning”. The current common approach to fine-tuning requires roughly four times the size of the model weights in VRAM.

A common basis for homelab experimentation is EleutherAI’s GPT-J six-billion-parameter large language model (LLM), and IIRC it requires 16GB of VRAM for inference and 64GB of VRAM if you want to apply fine-tuning. GPUs supplying 16GB of VRAM on one card are expensive when they appear on the market. Getting to 64GB of VRAM generally means adding two or more GPUs to a system, and at that point it also becomes an issue whether the memory can be efficiently accessed across multiple GPUs. NVIDIA has a feature on some GPUs called NVLINK to make memory pooling efficient and fast. NVLINK, of course, adds more cost to the GPU.
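The VRAM arithmetic above can be sketched as a rule of thumb. The 2-bytes-per-parameter (fp16) figure and the 4× fine-tuning multiplier here are rough assumptions, not measurements; real requirements add activation and framework overhead on top of the weights, which is how an ~11GB weights file ends up wanting a 16GB card.

```python
# Rough rule-of-thumb VRAM estimator. Assumptions: fp16 weights at
# 2 bytes per parameter, and fine-tuning needing ~4x the weights size
# (gradients plus optimizer state). Overhead beyond the weights is
# deliberately ignored, so treat these as floors, not exact figures.

def vram_estimate_gb(params_billions: float, bytes_per_param: int = 2) -> dict:
    """Return rough VRAM floors in GB for inference and fine-tuning."""
    weights_gb = params_billions * 1e9 * bytes_per_param / 1024**3
    return {
        "inference_gb": round(weights_gb, 1),      # just loading the weights
        "fine_tune_gb": round(4 * weights_gb, 1),  # ~4x rule of thumb
    }

# GPT-J 6B in fp16: ~11 GB of weights alone, so a 16 GB card is a
# plausible floor for inference once overhead is added, and fine-tuning
# lands in the 45-64 GB neighborhood quoted above.
print(vram_estimate_gb(6))  # -> {'inference_gb': 11.2, 'fine_tune_gb': 44.7}
```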
What about the Tesla P40 I said I got? It comes with 24GB of VRAM (good), but it is a bit older and has CUDA cores at compute capability 6.1 (not so good). It does not have NVLINK. There was a set of benchmark tests I saw comparing it to the NVIDIA RTX 4090, a current high-end consumer-grade GPU that also has 24GB of VRAM. The RTX 4090 has a higher clock speed on its VRAM, so it is inherently faster on that score, and its CUDA cores are compute capability 8.9 instead of the P40’s 6.1. The benchmarks show the 4090 is the clearly superior GPU, sometimes by more than double the P40’s performance and sometimes a bit less, but roughly twice as fast overall. Remember I said I got my P40 for $100 used? The RTX 4090 is in limited supply new right now, and goes in various forms for between about $1700 and $3000 each. For a highly experimental introduction to homelab machine learning operations, that is a really steep ante. Even with a single RTX 4090, doing more than just inference with LLMs does not look to be currently feasible (but things change quickly, so this may not age well at all).
The Tesla P40 obviously has some drawbacks, and I haven’t even finished listing them. It isn’t going to work with any code that requires CUDA compute capability above 6.1. It doesn’t provide any way to pool VRAM like NVLINK, so even just adding more P40s to get more VRAM looks like a gamble. And the slower memory means it isn’t as performant as newer GPUs can be.

Now, a few more practical limitations. The P40 is a two-slot datacenter card with passive cooling. The server rack-mount systems it is designed for have high-volume forced-air cooling; a consumer box does not. One has to provide auxiliary cooling to keep it in its normal operating temperature range. The P40 draws 250W at peak usage, and one’s system PSU needs to be beefy enough to cope with that. The P40 needs an 8-pin CPU power connection from the PSU to function. No, it does not take a PCI-E power connection, and you are likely to fry the board if you hook it up with one. The P40 has to go in a full PCI Express x16 slot, and it will take up the next slot over, too; you can’t put it in a riser card slot and expect it to work. Finally, the motherboard and CPU combination have to support “Resizable BAR” (addressing beyond 4GB) or the P40 will not even be recognized in the system.
On cooling the P40, I made the mistake of ordering a straight-line 3D-printed fan shroud. With it attached to the end of the P40 and the 97x94x33mm blower fan mounted, I was looking at having to grind out part of the full tower case I had for this just to fit the assembly in. I bought a second 3D-printed fan shroud with a folded path, so the blower sits next to the card and the shroud adds less than 2″ to the overall length of the combination. It does make the card thicker at that end, though.
So here are my build components. Several of these parts were contributed by Tim Schmidt, but I will list current pricing from Amazon or eBay for reference.
- NVIDIA P40 GPU (~$100 used)
- ASUS Prime B450M-A II motherboard (~$80)
- AMD Ryzen 7 1700X 8-core CPU (~$66)
- CPU cooler (~$50)
- Full tower case (~$100)
- Corsair CX750M PSU (~$50)
- (4) 16GB DDR4 RAM (~$180)
- 500GB NVME SSD (~$35)
- GTX 750 ti GPU (~$50)
- PSU -> 8-pin CPU cable (~$16)
- PSU -> Molex cable (~$12)
- Molex to four-pin fan cable adapter (~$12)
- PCI-e riser cards (~$16, using one of six)
- 3D-printed folded-path fan shroud (~$20)
- 97x94x33mm 12VDC blower fan (~$9)
- Epoxy casting material (~$3)
The 3D-printed fan shroud I got was not an exact fit to the P40, hence the epoxy casting material to add some more flange to the part. That’s a fractional cost listed; the full epoxy casting kit was more like $30, but I already had it on hand. I might have been able to use something cheaper like hot-melt glue, but I’d be a bit nervous about introducing a lot of heat to a 3D-printed part.
The motherboard works OK, but I am frustrated by it. It was among the cheaper of the options I looked at. It has just one PCI Express x16 slot, so that has to be dedicated to the P40. It supports CPUs with onboard graphics, but the CPU I am using does not have that, so I needed a separate GPU. I had thought perhaps I could use a USB graphics interface, but that is not the case: this motherboard will not boot without a functional GPU in the system, and the P40, being a datacenter GPU, has no video out.

Not only does the motherboard insist on a GPU, it also appears to have very particular demands on the cable one uses between the GPU and a monitor. I had a period where the system would not boot, and after trying literally everything else, I changed the DVI cable I was using, and it booted. I cannot recommend this motherboard based on this behavior. If I want to run without a GPU or even a monitor, I don’t want my motherboard to go on strike over it.

The other issue is that ‘Resizable BAR’ must be enabled and the ‘CSM’ compatibility mode must be off for the P40 to be recognized, and this motherboard refused to boot with CSM off when I tried any non-NVIDIA GPU. I had to pull the only NVIDIA GPU in the house from another system to get this to go. That NVIDIA GPU sits in a riser card, since the P40 occupies the only PCI Express x16 slot. I have not dealt with bracketing for it yet, so I have to be careful if I plan to move the box around.

All in all, I’d recommend a different motherboard: one with multiple PCI Express x16 slots spaced for two-slot GPUs, and which either has onboard graphics or is paired with a CPU that has graphics built in. That would have saved me some headaches. Other than that, the ASUS build quality seems to be there, but the particulars of its firmware make it less useful than it could be.
I am running Ubuntu 22.04 LTS Desktop on this system. I followed a Github Gist for getting the appropriate drivers in the system. I am using Mambaforge for Python environments, setting up the minimal Mambaforge install and then adding environments as needed.
Once all the software and drivers are loaded, one can run “lshw -C display” and see the datacenter GPU appear in the results.
I got the AUTOMATIC1111 Stable Diffusion WebUI software. This actually installs with a single command-line invocation after the dependencies are installed. It is slick. The web application has plenty of features, and can also be run to provide an API. It has an extension manager that mostly just works. There was one extension I tried to install that gave runtime errors because it was trying to run Windows-platform code instead of Linux, but that’s the only glitch I have encountered.
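Since the WebUI can also serve an API, image generation can be driven from a short script. A minimal sketch, assuming the WebUI was started with its `--api` flag and is listening on the default local port; the `/sdapi/v1/txt2img` endpoint and payload fields follow the project’s API documentation, while the host address and output filename are just placeholders for my setup.

```python
# Minimal sketch of calling the AUTOMATIC1111 WebUI API from Python.
# Assumes the WebUI is running locally with --api; host/port and the
# output filename are placeholders.
import base64
import json
from urllib import request

def build_payload(prompt: str, steps: int = 20, size: int = 512) -> dict:
    """Assemble the minimal JSON body for a txt2img request."""
    return {"prompt": prompt, "steps": steps, "width": size, "height": size}

def txt2img(prompt: str, url: str = "http://127.0.0.1:7860") -> bytes:
    """POST a prompt to the WebUI and return the first image as PNG bytes."""
    body = json.dumps(build_payload(prompt)).encode()
    req = request.Request(url + "/sdapi/v1/txt2img", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        result = json.load(resp)
    # The API returns a list of base64-encoded images.
    return base64.b64decode(result["images"][0])

if __name__ == "__main__":
    with open("rabbit.png", "wb") as f:
        f.write(txt2img("rabbit, style of Margaret Bourke-White"))
```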
(Prompting for ‘rabbit, style of Margaret Bourke-White’)
The extensions include Riffusion, the music generation GenAI that is based on Stable Diffusion. Unfortunately, what appears in the web app only seems to be converting images in a folder to music rather than the “text to music” process I was hoping to try out.
(Prompting for ‘rabbit, style of Diane Arbus’)
I also set up a systemd entry for the WebUI. With a couple of changes in the code, I now have it set so that I can run the WebUI app from any wifi-connected machine in the house with a compatible browser.
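For reference, a unit file for this can be quite short. The sketch below is hypothetical: the install path and user name are placeholders, and while the `--listen` (bind to all interfaces, for LAN access) and `--api` flags come from the WebUI’s documentation, my actual setup used code changes rather than exactly this configuration.

```ini
# Hypothetical /etc/systemd/system/sd-webui.service; paths and user
# are placeholders for a typical AUTOMATIC1111 install.
[Unit]
Description=Stable Diffusion WebUI
After=network.target

[Service]
User=sduser
WorkingDirectory=/home/sduser/stable-diffusion-webui
# --listen makes the app reachable from other machines on the LAN;
# --api exposes the HTTP API alongside the web interface.
ExecStart=/home/sduser/stable-diffusion-webui/webui.sh --listen --api
Restart=on-failure

[Install]
WantedBy=multi-user.target
```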
What about performance? I can do text-to-image generation with it at 512×512 pixels in about 10.5 s for a 32-iteration process. An example:
The default number of iterations is 20, and my homelab can generate a 512×512 image at 20 iterations in under 5s.
I have tried a 2048×2048 image generation, and while it works (the advantage of 24GB of VRAM), it takes between 12 and 15 minutes of processing. The WebUI has the R-ESRGAN image-upscaling GenAI (it might be an extension, I don’t recall specifically), so the usual way to do things probably should be to generate images at 512×512 pixels or smaller, then apply upscaling to promising ones. I have been doing rather a lot of generation at 480×384 pixels, which corresponds to the 10×8 ‘perfect’ format. Given that the GenAI is self-hosted, I don’t have rate limits or other such restrictions on usage. For everything I have tried so far, the P40 has delivered.
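The generate-small-then-upscale workflow can also be scripted against the WebUI’s “extras” endpoint. A sketch under assumptions: the `/sdapi/v1/extra-single-image` endpoint and its field names follow the WebUI’s API documentation, but the upscaler name string and the host address should be checked against what your install actually exposes (the WebUI lists them on its `/docs` page).

```python
# Sketch of upscaling an already-generated image through the WebUI
# "extras" API. Host/port and the upscaler name are assumptions about
# a local AUTOMATIC1111 install.
import base64
import json
from urllib import request

API = "http://127.0.0.1:7860/sdapi/v1/extra-single-image"

def build_upscale_payload(png_bytes: bytes, factor: int = 4,
                          upscaler: str = "R-ESRGAN 4x+") -> dict:
    """Assemble the request body: image as base64, 4x resize by default."""
    return {
        "image": base64.b64encode(png_bytes).decode(),
        "upscaling_resize": factor,   # 512x512 -> 2048x2048 at 4x
        "upscaler_1": upscaler,
    }

def upscale(png_bytes: bytes) -> bytes:
    """Round-trip an image through the WebUI upscaler, returning PNG bytes."""
    body = json.dumps(build_upscale_payload(png_bytes)).encode()
    req = request.Request(API, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return base64.b64decode(json.load(resp)["image"])
```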
Why is this capability of any more use than satisfying idle curiosity? Well, I have had to learn a fair amount about the practical side of GenAI through this exercise. Plus, there are a number of capabilities in the Stable Diffusion-based GenAI stack that I want to take advantage of. One of those is the ability to feed in an image and get back an automatically-generated “prompt”, which is essentially a set of tags describing the content of the image. I have hundreds of thousands of photographs, so organizing and searching those has always been an issue. So one project is going to be having the system crawl through the file server, finding the tags for every image there, and storing that information for use. That will take a while, but it should open up a lot of capability for relevant search of images. There are also a variety of projects out there concerning automated image repair, which, given my recent emphasis on archiving and sharing family photos, is going to be huge. The whole ‘img2img’ pipeline in Stable Diffusion is both vastly complex and full of possibilities for modifying existing images, and I don’t even know the full extent of it yet.
Given that Riffusion is based on Stable Diffusion, I am holding out hope that I can get the ‘text to music’ version set up locally somehow. I am still looking at that.
The ‘text to video’ pipeline is conceptually appealing, but the couple of times I have tried it, the result was a tiny video that nonetheless carries an irritating, prominent watermark.
At the moment, I am exploring the ‘text to image’ pipeline the most, getting a feel for how it works and its limitations. There are loads of possibilities for using this for generating illustrations for websites, presentations, and possibly books. There are certain aspects of it I am finding fascinating, mostly around the use of “style of X”, where X is the name of some artist. That works much better for some values of X than others, and reveals something about the internal workings of the process along the way.
The system build came with a pretty steep learning curve, but I now have a tool whose operation is going to take a fair amount of time to master. It will probably change even as I am getting up to speed with it.
Update 2023-04-17: The ‘nvidia-smi’ command shows instantaneous power consumption, and the NVIDIA Tesla P40 card appears to draw 51W at idle. The GTX 750 ti, by contrast, draws 1W at idle. The command also shows card temperature, and the P40 is at 31 C at idle. I ran a 2048×2048 ‘txt2img’ operation, which took 6 minutes 40 s, and over that time the temperature climbed to 80 C and the power usage fluctuated mainly between 180W and 236W, with at least one spike to 266W (despite the reported cap of 250W). I think NVIDIA has a thermal shutoff at something like 85 C, so my cooling solution looks a bit marginal. I may need to add a fan or two aimed at the P40 backplate. This will become more consequential as we get into warmer weather here.
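The spot checks above can be automated, since `nvidia-smi` has a scriptable query mode that emits CSV. A small sketch: the `--query-gpu` fields and `--format` options used here are documented `nvidia-smi` flags, and the parsing assumes the `csv,noheader,nounits` output shape.

```python
# Poll GPU power draw and temperature via nvidia-smi's CSV query mode.
# The query fields are documented nvidia-smi options; parsing assumes
# the csv,noheader,nounits format requested below.
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=name,power.draw,temperature.gpu",
         "--format=csv,noheader,nounits"]

def parse_gpu_line(line: str) -> dict:
    """Turn one CSV line, e.g. 'Tesla P40, 51.03, 31', into a dict."""
    name, power, temp = [field.strip() for field in line.split(",")]
    return {"name": name, "power_w": float(power), "temp_c": int(temp)}

def gpu_readings() -> list:
    """Return one reading per installed NVIDIA GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_line(l) for l in out.stdout.splitlines() if l.strip()]
```

Run in a loop during a long `txt2img` job, this makes it easy to log how close the P40 gets to its thermal limit.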
Also, I asked ChatGPT to write a Python program to walk a directory tree and query the “CLIP Interrogate” function of the Stable Diffusion API to obtain prompts that describe the content of images found in the directory tree, and to store file names and prompts in a SQLite database. It provided code that was very close to working as written.
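A sketch along the lines of that program, not the ChatGPT output itself: the `/sdapi/v1/interrogate` endpoint and its `image`/`model` fields follow the WebUI’s API documentation, while the host address, database filename, and schema are assumptions of mine.

```python
# Walk a directory tree, caption each image via the WebUI's CLIP
# interrogate endpoint, and record path -> prompt in SQLite. Host,
# DB filename, and schema are placeholder choices.
import base64
import json
import os
import sqlite3
from urllib import request

API = "http://127.0.0.1:7860/sdapi/v1/interrogate"
EXTS = {".jpg", ".jpeg", ".png"}

def interrogate(path: str) -> str:
    """Return the CLIP-generated prompt for one image file."""
    with open(path, "rb") as f:
        img = base64.b64encode(f.read()).decode()
    body = json.dumps({"image": img, "model": "clip"}).encode()
    req = request.Request(API, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["caption"]

def open_db(db_path: str) -> sqlite3.Connection:
    """Open (or create) the prompts database."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS prompts "
                "(path TEXT PRIMARY KEY, prompt TEXT)")
    return con

def crawl(root: str, db_path: str = "prompts.db") -> None:
    """Caption every image under root and store the results."""
    con = open_db(db_path)
    for dirpath, _, files in os.walk(root):
        for name in files:
            if os.path.splitext(name)[1].lower() in EXTS:
                full = os.path.join(dirpath, name)
                con.execute("INSERT OR REPLACE INTO prompts VALUES (?, ?)",
                            (full, interrogate(full)))
                con.commit()  # commit per image so a crash loses little
```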