Why?
For a long time I've been meaning to do something like this, a place for me to just collect all my experiments. There are a lot of new things every day and I barely get to try 1% of them. And the 1% I do try, I forget about and move on to the next ones. I've tried creating GitHub repositories etc., but in complete honesty, the only usable outcome of these experiments is some intuition about the concept in the worst case and a learning I'll carry with me in the best case. So, long story short, these notes are just a thought log of things I've tried, experimented with, learned from and just found interesting.
Detour: No local machines
The setup itself has an interesting story. When I was looking into setting up a simple, no-frills HTML page, the laziness dawned on me. Who writes a full static HTML webpage anyway? As much as I love to type, it's still boring. And of course, the LLMs are here, so I just wanted to find the right framework, use an LLM to build it, and have it running in 30 minutes so that I could do some actual work. And there I hit the first roadblock: I didn't want to use ChatGPT on the web cause it just feels like a text chatbot; I love using CLI-based LLM interfaces since they have context that is hard to fill in with plain text. At work I love using Amazon Q, it has a very slick and perfectly minimalist CLI interface, but I have never set up a CLI-based tool on my local machine (I always end up using Cursor/Copilot in VS Code, but I always hate that they need me to use the mouse and jump between places). And so began the first hunt: which LLM?
Option 1: Codex
I pay for ChatGPT Plus and I saw the recent announcement that Codex is being offered to Plus users as well. So I thought I'd give that a try, but the moment I installed it, I just did not like it. The whole CLI interface was very confusing and threw me off: the semi-transparent text below the prompt box, the constant helper text above the prompt, and, not to mention, the whole thing was pretty slow (maybe just a bad day). It just didn't stick, so I thought I'd try the other options.
Option 2: Gemini CLI
I also have the Google Pro subscription and of course they have a CLI too, the Gemini CLI, so I installed that. All this was on my personal Mac of course, and the moment I ran it, I got the Mac notification that some program was trying to access my Home folder. In all fairness, it was just where I invoked the tool, but for some reason it set me off; I just did not like the idea of these tools having shell-level access to my personal machine. The web is unsafe as it is, but my disk, that's a little too far. And that sent me off on a whole other tangent.
[A little backstory: just a month before this, I built a PC for myself after so long. I went all out and got an AMD 9800X3D CPU and a 5070 Ti 16 GB overclocked GPU. Perfect build for me to be honest.]
The Solution
I decided it was time to put the bigger machine to use. I've been playing video games for so long, but I wanted to get full value out of that GPU, and now seemed like a good time to start. When I bought it, the original idea was to run inference locally anyway. I tried dual-booting etc. in the past, and frankly, I don't need that much power cause all I'm doing is typing out code for the most part. And the CPU is super beefy, so it's a breeze regardless. So I ended up creating a VM, and obviously the first thing I wanted to do was install Ollama in Ubuntu (I went for Ubuntu cause it's ready to go and I don't really want to get lost in the Linux weeds; I just want something solid that lasts, and this was as good as the others). And then I checked the hardware specs and realized that the Hyper-V setup does not allow for GPU sharing (more on how I verified that below). So I began exploring that and found two main options:
PCIe Passthrough
TL;DR: not possible with Hyper-V. This mode basically re-routes the GPU to the guest machine as if it were a physically plugged-in device. Since Hyper-V sits under the Windows host, I was originally excited about this, but then I realized that the only real way of getting this to work is a bare-metal, level 1 hypervisor setup rather than the Hyper-V-under-my-Windows-desktop arrangement I have. And honestly, this would be the preferred setup. In an ideal world, I'd have a lightweight installation of Linux, with the integrated GPU serving as the primary display source. There I could have 2 VMs, one Windows and one Linux. Each of them would use passthrough to get dedicated access to the GPU. That way I could use it for gaming / LLM stuff depending on which of the VMs I chose to use. That would have been the cleanest way to do it, but I already have a ton of downloaded things on the Windows machine, and although the GPU is capable, it's nowhere near the giant models which I have subscriptions to, so all in all, not the right thing at the moment, I felt. This can still be done on commercial machines though, and this kind soul even simplified the process: Easy GPU PV
GPU Sharing / Para-virtualization
This was quite a bit of low-level hacking and it was a very fun rabbit hole to dive into. The idea is simple (I am most likely wrong in the sentences which follow, but this is just my mental model based on the things I've seen and the way others got it to work): in a typical hypervisor-based setup, the CPU is an easily shareable device (relatively), because fundamentally a CPU is stateless; all the state is in the RAM, and the RAM allocation can be done cleanly to separate the working memory of the host and the guest OSes. Even in the case of mapped memory, it's still a clean separation. There is no concept of stateful execution when it comes to the CPU (i.e., the instruction itself just works with inputs and outputs which are stored separately). So for a typical setup, CPU sharing is pretty clean and works out of the box. A GPU on the other hand is a different beast: it's an integrated unit, i.e., the working memory is very tightly bound to the execution unit. And the driver support is not universal (ref: the 50 series is barely supported on Linux at the moment). So the way people got it to work (ingenious, to be honest) goes something like this:
- We first create a mapped section in the GPU memory, effectively creating a partition
- We then copy the exact version of the drivers etc. from the host to the guest so that they are both using the same driver logic
- We then stitch it up so that both operating systems see the GPU, but the mapping works such that each operating system's GPU-specific data is stored in its corresponding mapped portion of the memory

This is hard enough between homogeneous host-guest combinations (most people barely get it to work, and all the write-ups online are pretty much Windows-Windows), but for Windows-Linux it is even worse because of the difference between the open-source reverse-engineered drivers and the official drivers. The online community was super crazy; they played Windows and Nvidia against each other, so to speak. In one interesting post, the author basically pulled the appropriate parts from the WSLg kernel and monkey-patched them into the guest Linux distribution to get it to work. That's amazing, because the only other people who would want Nvidia support in a Linux guest would be Microsoft (for the WSL project), and their driver implementation is the closest thing to pull off this hack (the even more amazing part is that all this is a couple thousand lines of C++ that power the whole stack upwards). But this is also extremely finicky, as one can imagine, and barely functional. Not to mention that any driver update (Nvidia seemingly ships 2 per month) could kill the whole thing. And the other part is, because of the whole memory-mapping style setup, the memory is now always divided; it's like partitioning a disk. So if I mapped, let's say, 2 GB out of the VRAM, that's lost even if the VM is actually turned off. And I did not want to lose any performance to this sort of hacky setup (my understanding of this partitioning is blurrier than the rest). So the conclusion was to find a different means, a topic for the future (to not leave it completely hanging: I could just host an Ollama server in Windows and then access that endpoint over a subnet from the VM, and that basically gets me as far as I want to go; a rough sketch of that is below).
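To make that fallback a bit more concrete, here's a minimal sketch of what it looks like from the VM side, assuming Ollama is running on the Windows host and is reachable over the Hyper-V virtual switch. The host IP and model name are placeholders for whatever a given setup actually uses:

```python
import json
import urllib.request

# Address of the Windows host as seen from the Hyper-V guest's subnet.
# This IP is a placeholder -- in practice it's whatever `ipconfig` reports
# for the vEthernet (Default Switch) adapter on the host. 11434 is Ollama's
# default port; the server has to be told to bind beyond localhost
# (e.g. OLLAMA_HOST=0.0.0.0) for the VM to reach it.
OLLAMA_URL = "http://172.20.0.1:11434/api/generate"


def ask(prompt: str, model: str = "llama3") -> str:
    """Send a single, non-streaming generate request to the Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(ask("Say hi from the other side of the virtual switch."))
```

The only host-side prerequisites are telling Ollama to listen beyond localhost and letting the port through the Windows firewall; the VM never needs to see the GPU at all.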
I thought this article was super helpful to see the whole flow and the author's attempts: GPU acceleration with Hyper-V
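As a side note, confirming that the default Hyper-V guest genuinely can't see the card is nothing fancy. Here's a rough sketch of the kind of check I mean, assuming a stock Ubuntu guest with pciutils installed (nothing here is specific to my setup):

```python
import shutil
import subprocess


def check_gpu_visibility() -> None:
    """Rough check of what GPU, if any, this guest OS can actually see."""
    # lspci lists the PCI devices exposed to the guest; in a default Hyper-V
    # Ubuntu VM there is no NVIDIA entry here, only the Hyper-V virtual
    # display adapter.
    pci = subprocess.run(["lspci"], capture_output=True, text=True).stdout
    display = [line for line in pci.splitlines()
               if "VGA" in line or "3D controller" in line]
    print("Display devices visible to the guest:")
    print("\n".join(display) if display else "  (none)")

    # nvidia-smi only works if an NVIDIA GPU is exposed to this OS and the
    # driver is installed -- which, out of the box, it isn't.
    if shutil.which("nvidia-smi"):
        subprocess.run(["nvidia-smi"])
    else:
        print("nvidia-smi not found: no NVIDIA driver/GPU in this guest")


if __name__ == "__main__":
    check_gpu_visibility()
```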
Result
I ended up doing the setup with a simple Hyper-V VM that runs Ubuntu 22.04. I installed the Gemini CLI with no complaints (I honestly gave up on Codex, not that it was bad, but I just wanted to get this done), and along the way also played around with Hugo, Jekyll and Astro for the static site generation, but landed on 11ty cause I just like markdown and I liked how simple it was. So I fed some initial documentation into the context for Gemini, pulled some stuff from my old landing page, asked it to write the layout and the boilerplate (the whole point of this long post), and then, finally, a day after I originally set out, here I am, writing it all down and realizing that I could have finished this in 30 minutes and under 100 words if I had just accepted the prompt on the Mac. I like this version better though.