How I Got Ollama Running on My Intel Arc GPU – A Step-by-Step Journey
Hey everyone, Daniel here from Tiger Triangle Technologies! I just uploaded a new YouTube video that I’m really excited about, and I wanted to share the experience with you in this blog post. If you’ve ever tried running Ollama on an Intel Arc graphics card and hit a wall because it seems to favor Nvidia or newer AMD cards, this one’s for you. I’ve got some great news: there’s a solution called IPEX-LLM that saved the day for me, and I’m going to walk you through how I made it work.
The Problem – and the Promise
So, maybe you’ve installed Ollama, a fantastic tool for running local large language models (LLMs). But when you fired it up with an Intel Arc graphics card, you quickly realized it wasn’t playing nice: Ollama’s GPU acceleration targets Nvidia and newer AMD cards, leaving Intel users like us stuck on the CPU. I recently stumbled across IPEX-LLM, an acceleration library designed to optimize LLM inference on Intel hardware. It sounded like the perfect fix, and I couldn’t wait to dive in and test it out.
Step 1: Setting the Stage with IPEX-LLM
First things first, I needed to get IPEX-LLM up and running. The official documentation pointed me to a handy portable Ollama zip file that bundles everything I’d need, no manual installation required. I’m all about keeping things straightforward, so this was my go-to option. My setup? A ZenBook Duo running Windows 11 with an Intel Core Ultra 9 185H processor and an integrated Intel Arc GPU (128 MB of dedicated VRAM plus roughly 18 GB of shared system memory). Not a high-end discrete GPU, but I was determined to make it work.
The prerequisites were simple: an up-to-date driver for my Intel hardware. The recommended version was 32.0.101.6078, and while my driver was a slightly newer minor revision, I figured it’d be close enough to give it a shot. With that checked off, I downloaded the portable zip from a nightly build link—pre-release, sure, but the community feedback (think thumbs-ups and party poppers) convinced me it was the way to go.
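By the way, if you’d rather check your driver version from the command line than dig through Device Manager or Intel’s software, something like this still works on Windows 11 (wmic is deprecated but still ships with it):

```bat
:: Print each display adapter's name and driver version
wmic path win32_videocontroller get name,driverversion
```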
Step 2: Getting Ollama Running
After extracting the zip into a folder on my C drive, I found a batch file called start-ollama. This little gem kicks off the Ollama server with IPEX-LLM integration. I ran it, and the command line popped up with “IPEX-LLM Ollama Serve” in the title—promising so far! Since it was my first time, I needed a model to test with. I went with Llama 3.1 (8 billion parameters), pulling it down with ollama pull llama3.1. My internet isn’t lightning-fast, so I let it download while I grabbed a coffee.
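If you want to follow along, the whole dance looks roughly like this from a command prompt. Heads up: the extract folder below is just a placeholder for wherever you unzipped the files.

```bat
:: Example path only: change to wherever you extracted the portable zip
cd C:\ipex-llm-ollama

:: Kick off the Ollama server with IPEX-LLM integration
start-ollama.bat

:: Then, from a second command prompt in the same folder,
:: pull down the Llama 3.1 8B model
ollama pull llama3.1
```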
Once it was ready, I launched it with ollama run llama3.1 --verbose. The verbose flag let me see detailed timings, and I was thrilled to spot “Found one SYCL device” in the output—proof that my GPU was being recognized. To put it to the test, I threw a prompt at it: “Name the top 10 movies with at least one software developer as a main character.” As it churned away, I kept an eye on my GPU and CPU utilization... and wasn't disappointed.
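For reference, here’s the run command, with the verbose flag that prints timing stats at the end of each response:

```bat
:: Launch an interactive chat session with per-response timing stats
ollama run llama3.1 --verbose
```

Once the interactive prompt appears, paste in your question and keep Task Manager open to watch the GPU and CPU graphs while it generates.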
Step 3: Testing the Waters
The results were in, and I couldn’t believe my eyes. GPU utilization spiked to over 90% while the CPU barely broke a sweat, and Llama 3.1 churned out the response at nearly 12 tokens per second. Not stellar, but a very comfortable pace for a local chatbot, and for an integrated GPU that felt like a win. The list itself included classics like The Social Network, Hackers, and Pirates of Silicon Valley; subjective, sure, but a sensible mix of popularity and tech relevance.
But I didn’t stop there. Digging back into the start-ollama batch file, I noticed a commented-out line that, once uncommented, makes the GPU runtime submit commands immediately instead of batching them. Could it boost performance? I gave it a try, but on my setup it didn’t move the needle much. Your mileage may vary, so it’s worth experimenting with!
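For the curious, on my copy the line in question was an environment variable along these lines. I’m going from memory here, so treat the exact name as an assumption and check your own batch file; it can differ between releases:

```bat
:: Uncomment in the batch file to submit GPU commands immediately
:: instead of batching them (exact variable name may vary by release)
set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```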
Step 4: GPU vs. CPU Showdown
To really see the difference IPEX-LLM makes, I tweaked the OLLAMA_NUM_GPU environment variable from 999 (offload all model layers to the GPU) to 0 (force CPU-only inference) and ran the same movie prompt again. The result? GPU usage dropped to around zero, CPU hovered around 32%, and the eval rate plummeted to 5.46 tokens per second, less than half the speed I got with the GPU. It was a stark reminder of how much Intel’s acceleration library brings to the table.
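If you want to reproduce that CPU-only run, it’s a one-line change. A minimal sketch, assuming you set the variable in the start-ollama batch file like I did:

```bat
:: 999 offloads all model layers to the GPU; 0 forces CPU-only inference
set OLLAMA_NUM_GPU=0
```

Restart the server after changing it, rerun the prompt, and set it back to 999 when you’re done comparing.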
Wrapping Up
By the end of this experiment, I was sold. Getting Ollama running on my Intel Arc GPU with IPEX-LLM wasn’t just possible; it was a game-changer. The portable zip made setup a breeze, and the performance boost from GPU acceleration was undeniable. Sure, there’s room to explore, like maybe testing the NPU (neural processing unit) on my ZenBook down the road, but for now I’m thrilled with these results.
If you’ve got an Intel Arc GPU and want to run Ollama locally, check out my video for the full step-by-step walkthrough. It’s got all the nitty-gritty details, from downloading the zip to tweaking settings and watching those utilization metrics climb. Thanks for joining me on this tech adventure—I appreciate you sticking around! Until next time, stay curious and take care.
What do you think—have you tried running Ollama on unconventional hardware? Let me know in the comments on YouTube or drop me a line. Catch you in the next one!