LM Studio
2025-12-18

References:

https://lmstudio.ai

What is LM Studio?#

LM Studio is a cross-platform desktop application that lets you run and interact with LLMs (Large Language Models) locally on your machine. It provides a fairly user-friendly interface for downloading and chatting with various LLMs without requiring extensive technical knowledge. Since it runs locally, you can use it offline and your data never leaves your machine.

To run LM Studio, you need a compatible CPU. It also supports both GPU and NPU acceleration for better performance. However, when I tried to run it on my Intel Core Ultra’s NPU, it didn’t work at all; a quick search suggests that NPU support is currently limited to Ryzen AI chips.

Benefits of a local LLM#

There are a few nice advantages to running an LLM locally:

  • No cloud service subscription: as it’s all local, you don’t need to subscribe to any cloud service or pay for API calls
  • Unlimited usage: you will not run into any usage limits imposed by cloud providers for free services
  • Privacy: your data does not leave your device

Their (TL;DR) privacy policy on their website states:

None of your messages, chat histories, and documents are ever transmitted from your system - everything is saved locally on your device by default.

Read more about their privacy policy on their website: https://lmstudio.ai/app-privacy

Installation#

LM Studio supports Windows, macOS and Linux.
You can grab the latest installer from their official website: https://lmstudio.ai/download.

If you’re using Windows, you can also install it via winget:

winget install ElementLabs.LMStudio
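On macOS, a Homebrew cask also appears to be available (the cask name below is an assumption on my part, so verify it with brew search lm-studio first):

brew install --cask lm-studio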

Configuration#

Before downloading any models, you might want to change the default storage location for models (as these can be quite large!).
You can change this by clicking the Models icon on the left sidebar and changing the Models Directory at the top:

LM Studio Change Storage Location

Downloading models#

To browse, search and download LLMs, you can hit Ctrl+Shift+M to open the model manager:

LM Studio Model Download

Loading a model#

Now that we’ve downloaded a model, we can go back to the Chat tab and load it by clicking Select a model to load at the top of the screen or by pressing Ctrl+L.

LM Studio Load Model

Note that it might be worth ticking the Manually choose model load parameters toggle at the bottom of the model selection dialogue. This allows you to tweak a lot of parameters which can improve performance, depending on your hardware or requirements. For instance, we can specify the context length for our prompt and various settings (such as GPU acceleration, CPU pool size, memory optimizations, etc.):

LM Studio Load Model Advanced

Chatting with the model#

Once the model is loaded, you can start chatting by typing your prompt in the text box at the bottom of the screen and hitting Enter:

LM Studio New Chat

Happy to see this French model speaks French:

LM Studio Chat

As we chat with the model, we can also adjust miscellaneous settings, such as temperature (creativity), response length, etc:

LM Studio Chat Configuration

Performance#

To be fair, I’ve only played around with rather small models on a fairly high-end laptop so far. In my experience though, the performance has been very good, especially with GPU acceleration. Offloading the whole model to the GPU sped things up a fair bit compared to CPU-only inference, and the GPU appeared to be more or less fully utilized, so the hardware was being put to good use:

LM Studio GPU usage

That said, even in CPU-only mode, the performance was still very acceptable. The response was a bit slower but still within reasonable limits. The CPU usage was sitting at ~50% on a 16-core CPU (24 logical cores).

LM Studio CPU usage

The model used in the screenshots above was mistralai/ministral-3-3b (a roughly 3 GB model) with a 4K context length.

Developer API#

In addition to the desktop application, LM Studio provides a local REST API and web server that can be used to integrate LLMs into your own apps. This can quickly be enabled by clicking the Developer icon in the left sidebar and toggling the server switch in the top-left corner:

LM Studio Start Server

Once the server is running, we can interact with it via HTTP requests.
The default server address is http://localhost:1234:

LM Studio Server Running
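Since the server exposes an OpenAI-compatible API, a quick way to verify that it’s reachable (and to see which models it can serve) is to query the /v1/models endpoint. Here’s a small sketch, assuming the default port shown above:

// Query LM Studio's OpenAI-compatible /v1/models endpoint to list available models.
async function listModels() {
    const response = await fetch("http://localhost:1234/v1/models");
    if (!response.ok) throw new Error(`HTTP error! Status: ${response.status}`);

    const data = await response.json();
    // Each entry has an "id" that can be used as the "model" field in chat requests.
    return data.data.map((m) => m.id);
}

listModels()
    .then((ids) => console.log("Available models:", ids))
    .catch((err) => console.error("Could not reach the LM Studio server:", err));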

You can send POST requests to the /v1/chat/completions endpoint with a JSON payload containing your prompt and other parameters. Of course, I used LM Studio itself to write some JavaScript code to demonstrate this. Here’s the example code:

async function fetchLMResponse(prompt, model = "mistralai/ministral-3-3b") {
    const url = `http://localhost:1234/v1/chat/completions`;
    const headers = { 'Content-Type': 'application/json' };

    const payload = {
        model: model,
        messages: [
            { role: "system", content: "You must respond in a very robotic, mechanical, and precise manner. Use short sentences, no emotions, and speak like a computer program." }, // Setting the tone
            { role: "user", content: prompt } // User prompt
        ],
        max_tokens: 100,
        temperature: 0.7,
    };

    try {
        const response = await fetch(url, {
            method: 'POST',
            headers,
            body: JSON.stringify(payload),
        });

        if (!response.ok) throw new Error(`HTTP error! Status: ${response.status}`);

        const data = await response.json();
        return data.choices[0].message.content;
    } catch (error) {
        console.error("LM Studio API request failed:", error);
        throw error;
    }
}

(async () => {
    try {
        const result = await fetchLMResponse("Tell me a joke about AI.");
        console.log(result);
    } catch (err) {
        console.error(err);
    }
})();

The important parts to take note of in this code are of course the API URL and the JSON payload, which contains:

  • model: I believe the API should support running multiple models simultaneously, so we need to pass in the one we want to use
  • messages: we pass two messages here, one system role, to set the tone and a user role containing the actual prompt
  • max_tokens: maximum length of the response
  • temperature: creativity of the response

When we run this code, we get the following response from the model:

LM Studio API Response

In the second call to the API I also included the system role, and we can see it adds a more robotic tone to the response as requested (e.g. “Debugging complete.”).
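The endpoint also supports streaming (OpenAI-compatible server-sent events), which lets you print tokens as they are generated instead of waiting for the full reply. Below is a minimal sketch of that, assuming Node.js 18+ and the same model and default port as before; the line-by-line SSE parsing is deliberately naive:

// Stream a chat completion token-by-token by setting stream: true in the payload.
async function streamLMResponse(prompt, model = "mistralai/ministral-3-3b") {
    const response = await fetch("http://localhost:1234/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
            model,
            messages: [{ role: "user", content: prompt }],
            stream: true, // ask for server-sent events instead of a single JSON body
        }),
    });

    if (!response.ok) throw new Error(`HTTP error! Status: ${response.status}`);

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // Each SSE line looks like "data: {...}" (or "data: [DONE]" at the end).
        // NOTE: this assumes every chunk contains whole lines, which is good enough for a demo.
        for (const line of decoder.decode(value).split("\n")) {
            const trimmed = line.trim();
            if (!trimmed.startsWith("data:")) continue;

            const payload = trimmed.slice(5).trim();
            if (payload === "[DONE]") return;

            const delta = JSON.parse(payload).choices[0]?.delta?.content;
            if (delta) process.stdout.write(delta); // print tokens as they arrive
        }
    }
}

streamLMResponse("Tell me a joke about AI.").catch(console.error);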

Read more about the API here: https://lmstudio.ai/docs/developer/rest/endpoints

Note that they also provide SDKs for both Python and TypeScript: https://lmstudio.ai/docs/developer
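For comparison, the equivalent call through the TypeScript SDK is quite a bit shorter. The snippet below is a rough sketch based on the @lmstudio/sdk package; method names can change between SDK versions, so double-check against the docs linked above:

import { LMStudioClient } from "@lmstudio/sdk";

// Connects to the local LM Studio instance (the API server must be running).
const client = new LMStudioClient();

// Get a handle to the model (loading it if necessary) and ask for a single response.
const model = await client.llm.model("mistralai/ministral-3-3b");
const result = await model.respond("Tell me a joke about AI.");

console.log(result.content);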

CLI#

The app also comes with a CLI that can be used to load models and interact (chat) with them from the command line:

LM Studio CLI
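Some of the more useful commands are listed below. This is a rough cheat sheet rather than an exhaustive reference, so check lms --help for the exact syntax on your version:

lms ls                                # list the models downloaded on this machine
lms get mistralai/ministral-3-3b      # download a model from the command line
lms load mistralai/ministral-3-3b     # load a model into memory
lms server start                      # start the local API server (same one as the Developer tab)
lms server stop                       # stop the server again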

Read more about the CLI here: https://lmstudio.ai/docs/cli
