llamafile is the new best way to run a LLM on your own computer

I lifted the title directly from Simon Willison’s post: llamafile is the new best way to run a LLM on your own computer, even though I don’t have the expertise to know if it’s the BEST way. I can tell you it’s a damn easy way, even on Windows! Simon’s explanation works on a Mac, but you can also run this on Windows with the following minuscule changes.

  1. Download the 4.26GB llamafile-server-0.1-llava-v1.5-7b-q4 file from Justine’s repository on Hugging Face.
  2. Open Command Prompt and navigate to the location of your downloaded file.
  3. Type the filename, press Enter, and wait for your default browser to open http://127.0.0.1:8080/
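Those steps look roughly like this in Command Prompt. This is a sketch, not a transcript: the download location and exact filename are assumptions (adjust them to wherever your browser saved the file), and the llamafile project’s documentation notes that Windows will only execute files whose name ends in .exe, so a rename may be needed first.

```shell
:: Go to wherever the file was downloaded (Downloads is an assumption)
cd %USERPROFILE%\Downloads

:: Windows only runs files with an .exe extension, so rename if needed
ren llamafile-server-0.1-llava-v1.5-7b-q4 llamafile-server-0.1-llava-v1.5-7b-q4.exe

:: Run it; a local web server starts and your browser opens http://127.0.0.1:8080/
llamafile-server-0.1-llava-v1.5-7b-q4.exe
```

When it’s running, leave the Command Prompt window open; closing it stops the server.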

*Screenshot: the llamafile running in Command Prompt*

Simon reports that on his M2 Mac he’s seeing 55 tokens per second. On my brand-spanking-new Dell office machine running Windows 10, I get 5.5 tokens per second 🙁 I was getting ready to grump all over Windows here, but on my 2-year-old M1 MacBook Air it barely runs at all, clocking in at 0.35 tokens per second. And now my MacBook Air has crashed – maybe don’t try running it on yours, or do, and tell me why mine sucked so hard! 🙂