Tuesday, November 12, 2024

The Case of the Slow Chatbot - and how to make it lightning fast

It was a brisk afternoon at Tiger Triangle Technologies, and the hum of computation filled the air like the bustle of a London thoroughfare. I had scarcely finished my previous experiment—indeed, quite a robust affair with Blazor web applications—when a most peculiar challenge presented itself: the elusive matter of optimizing the performance of Ollama, the tool I use to run large language models locally.

Seated at my desk, surrounded by stacks of benchmark records, diagnostics, and freshly sharpened code, I felt a thrill much akin to that of a detective on the brink of discovery. “Ah, here we are,” I muttered. “Today, we’re diving deep into the mysteries of local model performance.”

The challenge was one fit for a connoisseur of computational intrigues, a truly fundamental one, if you will, in the realm of Ollama’s capabilities. For those as intent on maximizing Ollama’s prowess as I was, today’s expedition would surely reveal insights worth their weight in bytes.

Scene 1: A Document of Tracking

Our journey began with a masterful stroke of methodology: a document, meticulously prepared, like a roadmap through the murky forests of performance metrics. “A document, Watson,” I remarked, speaking to my notepad, which was, in this case, standing in as my loyal companion. “With such a record, we can track each trial and improvement, ensuring that none of Ollama’s quirks evade our scrutiny.”

This document would be the yardstick against which our findings—model comparisons, resource utilization, and timings—could be measured. By journey’s end, we would uncover which model best suited my system.

“We start, as all wise scientists do, with an examination of our tools,” I said aloud, typing in ollama -v to reveal my Ollama version: 0.3.11. With a nod of approval, I commenced testing our first subject, Llama 2.
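
For any reader wishing to perform the same check, the incantation is brief. The exact wording of the reply may differ a little from one release to the next, but it looks roughly like this:

    ollama -v
    ollama version is 0.3.11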

Scene 2: The Tale of Llama 2

I input ollama run llama2, being sure to append the --verbose flag to glean every last bit of timing information. "Tell me about AMD," I murmured to the machine, prompting the model to share insights about processors and graphics cards. Yet the results were lackluster: the CPU utilization soared to a considerable 72%, while the GPU remained nearly idle—a paltry 9%. I could not hide my disappointment.
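
For the record, here is the command as I ran it, together with the general shape of the timing summary that the verbose flag prints once the model falls silent. The field names are as I recall them from Ollama's output; treat the exact layout as approximate, and the dots as placeholders for your own figures.

    ollama run llama2 --verbose
    >>> Tell me about AMD
    ...
    total duration:       ...
    load duration:        ...
    prompt eval count:    ... token(s)
    prompt eval rate:     ... tokens/s
    eval count:           ... token(s)
    eval rate:            ... tokens/s

The figure of chief interest to our case is the eval rate, which measures how quickly the model composes its answer.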

“Curious,” I mused, watching the screen. “Ollama claims support for AMD graphics cards, yet the GPU slumbers while the CPU toils to the bone. Perhaps there is more to this tale.” The verbose output confirmed what I feared: the model was defaulting to the CPU. This was not ideal, for our case demanded optimal GPU performance.
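
Should you wish to reproduce the diagnosis on your own machine, Ollama's ps command offers a quick confession: its PROCESSOR column reports where a loaded model is resident. The layout below is illustrative rather than a verbatim transcript of my session.

    ollama ps
    NAME            ID     SIZE    PROCESSOR    UNTIL
    llama2:latest   ...    ...     100% CPU     4 minutes from now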

Upon further examination, I unearthed a promising lead: a workaround developed by a certain online persona, known only by the moniker "LikeLoveWant." They had crafted a modified version of Ollama, explicitly designed for AMD. "Ah, Watson," I remarked to the notepad, "A rare gem hidden among the multitude! A shoutout indeed to this clever soul."

Scene 3: A Deft Adjustment

I installed this workaround and was primed to test Llama 2 again. However, another barrier appeared, one set by my own hand. The environment variable, HIP_VISIBLE_DEVICES, had been configured to limit Ollama to the CPU. “Ah, a self-imposed restriction, no less!” I said with amusement, promptly removing it to unlock the GPU’s potential.
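
For those following along at home, the variable can be inspected and cleared from a terminal along these lines. The commands are illustrative; if the variable was set persistently rather than per-session, remove it from the system's environment-variable settings and restart Ollama so the change is picked up.

    :: Windows Command Prompt: inspect, then clear for the current session
    echo %HIP_VISIBLE_DEVICES%
    set HIP_VISIBLE_DEVICES=

    # Linux or macOS shell: inspect, then clear
    echo $HIP_VISIBLE_DEVICES
    unset HIP_VISIBLE_DEVICES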

After restarting Ollama, I issued the command once more: ollama run llama2. This time, the model responded with vigor. The GPU surged to a splendid 99%, while the CPU’s strain dropped to a mere 22%. The results were invigorating—21.44 tokens per second, a grand improvement over our initial 6.28 tokens per second.

“Yet,” I murmured, eyes narrowing, “while improved, it is not the swiftest. There must be another layer, another model that performs better.”

Scene 4: Enter Llama 3

The next act brought forth Llama 3, reputed to be formidable but known also to be weighty in computational demands. For the first test, I forced it onto the CPU alone, using our previous environment variable technique. It performed admirably in its own way, though it lagged behind its predecessor at a rate of 5.8 tokens per second.
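
The technique, for completeness: pointing the variable at a device index that does not exist hides the GPU from ROCm, and Ollama then falls back to the CPU. A value such as -1 is the usual trick (I believe ROCm honours it in much the same spirit as CUDA_VISIBLE_DEVICES), though you should verify the behaviour against your own setup before trusting the numbers.

    set HIP_VISIBLE_DEVICES=-1       :: Windows Command Prompt
    export HIP_VISIBLE_DEVICES=-1    # Linux or macOS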

But when set to use the GPU, Llama 3 was far from optimized; though the GPU utilization was high, its speed dropped to a dismal 2.89 tokens per second, taking a full four and a half minutes to complete its task. “An inefficient transfer of data, no doubt,” I surmised. “A bottleneck, Watson! Our GPU’s strength is but wasted on Llama 3, and this inefficiency could be its undoing.”
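
Though my notes do not pin down the exact cause, the likeliest suspect with a heavier model is memory: if the model cannot fit entirely within the card's VRAM, Ollama splits its layers between the GPU and system memory, and the ceaseless shuttling of data between the two devours any advantage the GPU might have offered. The same ps command from earlier would betray such a split in its PROCESSOR column; the figures below are illustrative, not a transcript of my session.

    ollama ps
    NAME            ID     SIZE    PROCESSOR          UNTIL
    llama3:latest   ...    ...     30%/70% CPU/GPU    4 minutes from now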

Scene 5: The Mystery Model Unveiled

With a flourish, I revealed our final contestant: the Mystery Model, better known in certain circles as Llama 3.2. “This lightweight model is rumored to be a marvel on constrained systems, optimized for edge devices,” I mused. “Let us see if it can handle the formidable demands of our quest.”

Our first trial was performed on the CPU, yielding a respectable 12.42 tokens per second. But it was when the model was unleashed upon the GPU that it truly shone. I watched in awe as the metrics displayed an astonishing generation rate of 94.26 tokens per second—a revelation! The once-sluggish system now processed language with vigor, its verbose summary reporting a prompt evaluation rate of 1,615 tokens per second as it ingested my question.
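
The command, should anyone wish to retrace my steps, was given exactly as before, with the verbose flag once again supplying the timings quoted above:

    ollama run llama3.2 --verbose
    >>> Tell me about AMD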

“This, my dear Watson, is the way,” I said in satisfaction, recording my final measurements. The GPU was fully engaged, the CPU played its supporting role gracefully, and the model ran with all the elegance of a virtuoso.


Scene 6: Our Findings in Conclusion

The records were assembled, the findings meticulously transcribed. It was evident: while Llama 2 had shown promise, it was Llama 3.2 that proved the swiftest on this particular system, excelling in GPU utilization and blazing speed. Llama 3, despite its impressive reputation, could not keep pace on my current configuration.
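
Set side by side, the measurements told the story plainly (the Llama 2 figure on the left being the original, misconfigured CPU-bound run):

    Model        CPU-bound run    GPU run
    Llama 2      6.28 tok/s       21.44 tok/s
    Llama 3      5.8 tok/s        2.89 tok/s
    Llama 3.2    12.42 tok/s      94.26 tok/s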

“There you have it,” I concluded, glancing over the document one last time. “A smaller model, such as Llama 3.2 or Phi, might be the most fitting for systems like mine. For those who seek precision, a larger model might serve better—but for me, speed and efficiency are paramount.”

I gathered my notes, marveling at the journey through models and metrics. “For those who embark on a similar quest,” I advised, “the process remains much the same: measure, compare, and optimize. Your system’s peculiarities, like clues in a case, will guide you to the right model.”

And with a final nod to the intricacies of code and computation, I signed off, already pondering what my next case might be in this grand pursuit of technological mastery.



Thank you for joining me on this journey. Share your findings, should you undertake a similar investigation; I’d be most intrigued to learn what you discover. Until our next adventure, I bid you adieu.