Deep Learning

Run LLMs Locally with Continue VS Code Extension

February 7, 2025
14 min read

Introduction

Artificial intelligence (AI) has become an indispensable tool for developers. Large Language Models (LLMs) like GPT-4o have revolutionized coding by assisting in everything from autocompleting code to generating entire functions. While cloud-based AI services have been the norm, there's a growing shift toward running LLMs locally. This move offers developers more control, faster responses, and enhanced security. With the advent of advanced open-source models like Deepseek R1 Distill Qwen 32B, now is the perfect time to consider upgrading your hardware and embracing local AI models.

The Value of Local LLMs for Developers

Locally deploying your own LLM proves advantageous for developers looking for faster real-time assistance, privacy, and security. While traditional AI chatbot services are an option, they are not fully integrated into the development workflow. Open-source models running locally can remedy the disconnect between workflow and assistant, improving productivity and broadening use cases.

  • Faster, Real-Time Assistance: Running LLMs locally eliminates the network latency associated with cloud-based services. When the model is on your machine, responses arrive almost instantly, allowing for a smoother, more efficient coding experience and significantly faster development cycles.
  • Privacy and Security Advantages: Using cloud-based AI services often involves sending your code, which may contain sensitive information, to third-party servers. Running LLMs locally ensures that your code stays on your machine, providing an added layer of security that is crucial for developers working on proprietary or confidential projects where data breaches could have severe consequences.
  • Increased Flexibility and Customization: Local LLMs give you the freedom to customize models to suit your specific needs. You can fine-tune models, control updates, and choose which models to deploy.

Introducing Deepseek R1 Distill Qwen 32B

Deepseek R1 Distill Qwen 32B is the latest open-source coding-capable model distilled from Deepseek R1, the model that rivaled or outperformed other popular AI models in reasoning and accuracy. The Qwen-based distillations emphasize coding capabilities while maintaining strong performance in mathematics and general tasks. Deepseek R1 Distill Qwen 32B includes:

  • Multi-Language Support: Support for over 92 programming languages.
  • Extensively Trained and Distilled: A 32-billion-parameter model distilled from one of the most efficient and capable reasoning models available.
  • Open-Source: Released under the Apache 2.0 license, encouraging widespread adoption and customization.
  • IDE Integration: Compatible with popular integrated development environments (IDEs) and code editors.
  • Robust Security: Ensures data privacy and integrity, crucial for local development.

Deepseek R1 Distill Qwen 32B excels in code generation and completion, code reasoning, as well as mathematical reasoning and problem-solving. Open-source models continue to advance due to their collaborative nature, pushing the boundaries of local LLM agents. The Deepseek R1 Distill family also includes 14B and 7B variants for lighter-weight deployments, as well as a 70B model distilled from Llama 70B that offers even more capability.

We focus on Deepseek R1 Distill Qwen 32B as a professional-level option for those with reasonably high-performance GPUs, delivering strong quality and capability without the footprint of the full model.

How LLMs Enhance the Developer Workflow

LLMs are becoming more efficient, with smaller model sizes that don't compromise performance. Innovations in model optimization mean that powerful LLMs like Deepseek R1 Distill Qwen 32B are increasingly accessible to developers without top-tier hardware. If your hardware isn’t powerful enough to deliver fast responses, the smaller Deepseek R1 Distill Qwen 14B and 7B options can serve as less compute-intensive alternatives.

  • Code Summarization: Understanding complex codebases or unfamiliar code can be time-consuming. Our model powered by Deepseek R1 Distill Qwen 32B can quickly summarize large blocks of code, helping you grasp functionality without reading every line.
    • Example: Highlight a function, ask the model for a summary, and receive a concise explanation of what the code does.
  • Code Augmentation: The model can assist in inserting debug statements or console logs throughout your code. This automation helps in tracking data flow and identifying bugs more efficiently.
    • Example: Request the model to add debug statements to a specific function. It will insert appropriate logs at critical points, saving you from manual insertion.
  • Unit Test Generation: Writing unit tests is essential but often tedious. We can generate unit tests based on your code, ensuring better coverage and reliability.
    • Example: Provide a function to the model and ask it to generate unit tests. You'll receive test cases that you can integrate directly into your testing framework (see the sketch after this list).
  • Autocomplete: Autocomplete powered by Deepseek R1 Distill Qwen 32B goes beyond simple code suggestions. It understands context and can predict complex code structures, making your coding process faster and more efficient.
    • Example: As you start typing a function, the model predicts the entire code block, including parameters and return statements, allowing you to code at lightning speed.
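
To make the unit-test example above concrete, here is a minimal, hypothetical sketch: a small Python helper a developer might highlight in the editor, followed by the kind of pytest cases the assistant could produce. This is illustrative only, not actual model output.

# slugify.py -- a small helper a developer might hand to the assistant (hypothetical example)
import re

def slugify(text: str) -> str:
    """Convert arbitrary text into a lowercase, hyphen-separated URL slug."""
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)   # collapse non-alphanumeric runs into hyphens
    return text.strip("-")

# test_slugify.py -- the kind of pytest cases the model could generate
import pytest  # requires: pip install pytest

@pytest.mark.parametrize("raw, expected", [
    ("Hello, World!", "hello-world"),
    ("  Already-clean  ", "already-clean"),
    ("Multiple   spaces & symbols!!", "multiple-spaces-symbols"),
    ("", ""),
])
def test_slugify(raw, expected):
    assert slugify(raw) == expected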

The potential applications of LLMs are expanding. Future use cases may include:

  • Documentation Generation: Automatically create documentation based on your code.
  • Project Management: Use AI to estimate timelines or suggest task prioritization.
  • Codebase Analysis: Identify potential bottlenecks or security vulnerabilities.

Embracing LLMs now positions you at the forefront of these forthcoming advancements.

Why You Should Consider Running LLMs Locally

Running local LLMs boils down to cost, flexibility, and security.

While cloud-based models often require subscription fees or pay-per-use costs, running LLMs locally can be more cost-effective in the long run. The initial investment in hardware might seem steep, but it pays off by eliminating ongoing service fees (a rough break-even sketch follows the comparison table below).

|             | Cloud                                       | Local                                    |
| Cost        | Recurring/ongoing cost                      | One-time hardware investment             |
| Latency     | Dependent on network                        | Dependent on hardware                    |
| Networking  | Dependent on internet connectivity          | LLMs run locally on the machine          |
| Flexibility | Less control and reliance on cloud services | Full control to augment your environment |
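
As a rough illustration of that trade-off, a quick back-of-the-envelope calculation shows when a one-time hardware purchase overtakes recurring fees. Every figure below is an assumption for illustration, not a quote:

# Back-of-the-envelope break-even estimate (all figures are illustrative assumptions)
hardware_cost = 4000.0         # hypothetical one-time workstation upgrade, USD
cloud_cost_per_seat = 20.0     # hypothetical monthly subscription per developer, USD
seats = 10                     # developers sharing the local machine

monthly_cloud_spend = cloud_cost_per_seat * seats          # $200/month in this example
break_even_months = hardware_cost / monthly_cloud_spend    # ~20 months in this example
print(f"Cloud spend: ${monthly_cloud_spend:.0f}/month; hardware pays for itself after ~{break_even_months:.0f} months")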

To run LLMs like Deepseek R1 Distill Qwen 32B efficiently, a robust system is essential. Investing in such hardware ensures that you can run even the most demanding models without hiccups. Key requirements include:

  • GPUs: We recommend GPUs with larger memory capacity to handle model computations for the best performance and speed; the NVIDIA RTX 4090 (24GB) is an excellent choice. A GPU with a minimum of 8GB can still perform admirably (especially with the smaller distilled models), but outputs will take longer to generate (see the rough memory estimate after this list).
  • RAM: More CPU RAM doesn’t hurt, especially if your other workloads require it. We recommend a workstation-class amount of RAM, 32GB or more, for smooth multitasking.
  • Storage: When running AI models, especially LLMs, storage speed drastically affects load times. Opt for fast NVMe SSDs for quicker data access and model loading.
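
As a rough rule of thumb for sizing the GPU (the numbers below are estimates; actual usage depends on quantization, context length, and the runtime), you can approximate the VRAM a quantized model needs from its parameter count:

# Rough VRAM estimate for a quantized LLM (illustrative; real usage varies by runtime and context length)
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8   # raw weight storage
    return weight_bytes * overhead / 1e9                        # add ~20% for KV cache and buffers

# Deepseek R1 Distill Qwen at roughly 4.5 bits/weight (q4_K_M-style quantization)
print(f"32B @ 4-bit: ~{estimate_vram_gb(32, 4.5):.0f} GB")   # roughly 22 GB, fits a 24GB RTX 4090
print(f"14B @ 4-bit: ~{estimate_vram_gb(14, 4.5):.0f} GB")   # roughly 9-10 GB
print(f"7B  @ 4-bit: ~{estimate_vram_gb(7, 4.5):.0f} GB")    # roughly 5 GB, fits an 8GB card

This is why the 4-bit 32B build pairs well with a 24GB card, while the 7B and 14B distills fit comfortably on smaller GPUs.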

Not ready to invest heavily in hardware? A hybrid approach allows you to run models like Deepseek R1 Distill Qwen 32B locally for most tasks and rely on cloud-based services for more intensive operations. This flexibility ensures you can still benefit from local LLMs without fully committing to new hardware immediately.

How to Run Deepseek R1 Distill Qwen 32B Locally

We will walk through deploying the coding model in a Docker container running Ollama, then connect it to VS Code through the Continue extension for easy, ongoing access.

1. Set Up Your Local Environment with Docker and Ollama

Docker is essential for running applications in isolated environments, ensuring consistent performance across different setups.

  1. Download Docker: Visit the Docker website and download the version compatible with your OS.
  2. Install Docker: Follow the installation prompts specific to your operating system.
  3. Verify Installation: Open a terminal and run docker --version to ensure it's installed correctly.

Ollama is the platform that lets you run and manage open-source LLMs locally, simplifying the process of downloading and configuring models.

Pull, Install, and Run the Ollama Docker Container:

This command pulls the latest Ollama image, runs it in the background, and sets the container to restart automatically whenever Docker restarts.

# Pulls the latest Ollama image and runs it in the background
# (optional: add --gpus=all to use an NVIDIA GPU inside the container; requires the NVIDIA Container Toolkit)
docker run -d --restart always -p 11434:11434 ollama/ollama

# Verifies Ollama is running
curl http://localhost:11434/api/version

2. Download and Run Deepseek R1 Distill Qwen 32B with Ollama

With Ollama running in a container, we will walk you through downloading, verifying, and testing the Deepseek R1 Distill Qwen 32B model.

1. Identify Your Ollama Container: First, use the following command to list all running Docker containers and locate the Ollama container's ID or name.

# List all currently running Docker Containers.
docker ps

# Locate the container running Ollama & note the name and ID

2. Download the Deepseek R1 Distill Qwen 32B Model: Use the docker exec command to download the DeepSeek-R1-Distill-Qwen-32B model. Replace YOUR_CONTAINER_ID_OR_NAME with the actual ID or name you found in the previous step, then list the installed models to verify that the download completed and that you have the necessary permissions.

docker exec -it YOUR_CONTAINER_ID_OR_NAME ollama pull deepseek-r1:32b-qwen-distill-q4_K_M
docker exec -it YOUR_CONTAINER_ID_OR_NAME ollama list

3. Test the Model: The downloaded model should appear in the list from the previous step. Finally, test Deepseek R1 Distill Qwen 32B by generating a sample output.

docker exec -it YOUR_CONTAINER_ID_OR_NAME ollama run deepseek-r1:32b-qwen-distill-q4_K_M "Write a function to reverse a string in Python."
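
Ollama also exposes an HTTP API on port 11434, so you can call the model from scripts as well as from the CLI. A minimal sketch using Python's requests library (the prompt here is just an example):

# Minimal sketch: call the local Ollama HTTP API instead of the CLI
import requests  # pip install requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:32b-qwen-distill-q4_K_M",
        "prompt": "Write a function to reverse a string in Python.",
        "stream": False,          # return one JSON object instead of a token stream
    },
    timeout=300,                  # large models can take a while on first load
)
response.raise_for_status()
print(response.json()["response"])   # the generated text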

Configuring the Continue Extension for VS Code

Continue is a VS Code extension that integrates LLMs into your coding workflow, offering features like chat assistance and advanced autocomplete. Here is how we can implement the Continue Extension:

1. Install Continue Extension

  1. Open VS Code.
  2. Go to the Extensions view (Ctrl+Shift+X or Cmd+Shift+X on Mac).
  3. Search for "Continue" and click "Install."

2. Configure Continue to Use Ollama and DeepSeek-R1-Distill-Qwen-32B

  1. Open the extension settings.
  2. Edit the config.json to include the model.
// filename: config.json
{
  "models": [
    {
      "title": "deepseek-r1-distill-qwen-32b",
      "model": "deepseek-r1:32b-qwen-distill-q4_K_M",
      "provider": "ollama",
      "apiBase": "http://localhost:11434",
      "systemMessage": "You are an expert software developer. You give helpful and concise responses. You use TypeScript and React with Next.js 14. You prefer arrow functions and a more functional programming style."
    }
  ]
}

3. Verify Integration:

  1. Open the Continue window.
  2. Use the Continue chat feature to ensure it's working correctly.

Hybrid Approach: Hosted Models for Developers Without GPUs

If your local setup lacks the necessary hardware to run powerful LLMs like DeepSeek-R1-Distill-Qwen-32B, you can still leverage a hybrid approach by using hosted models. This allows you to benefit from advanced coding assistance without investing in dedicated high-end GPUs.

Several companies host free LLMs, providing access via APIs. Here's how to get API keys and integrate them with the VS Code Continue extension.

Hosted Models on GitHub

GitHub provides access to a variety of remotely hosted LLMs, including coding-specific models. This is a great starting point if you're looking to explore AI-assisted coding without a local setup.

  1. Navigate to the Models Page: Visit the GitHub Marketplace and select the "Models" section from the navigation menu. This will lead you to a catalog of available hosted models.
  2. Select a Model: Browse the catalog to find a model that fits your requirements. For example, OpenAI’s GPT-4o mini is a popular choice for coding assistance.
  3. Obtain an API Key: On the model's page, look for an "API Key" option. Click it to generate your key, which you’ll use to integrate the model into your workflow (a quick test call is sketched after this list).
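
To confirm the key works before wiring it into Continue, you can make a quick OpenAI-style chat completion request. This is a minimal sketch, assuming the GitHub-hosted models are served from the Azure AI inference endpoint used later in this article (https://models.inference.ai.azure.com) and that your key is exported as GITHUB_TOKEN; check the model page for the exact endpoint and model name.

# Minimal sketch: test a GitHub-hosted model with an OpenAI-style chat request
# Assumes the endpoint below (also used in the Continue config later) and a key in GITHUB_TOKEN.
import os
import requests  # pip install requests

resp = requests.post(
    "https://models.inference.ai.azure.com/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Write a one-line docstring for a string-reversal function."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])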

Codestral Hosted on Mistral

Codestral, another free model hosted by Mistral, is tailored for coding tasks, especially autocomplete. Here’s how you can access it:

  1. Sign in to the Mistral Console: Visit the Mistral Console and navigate to the Codestral section.
  2. Get the API Key: Locate the section for Codestral and generate your API key. Copy it for later use (a quick key check is sketched below).
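
To verify the key, you can send a small chat request to Mistral's Codestral endpoint. This is a minimal sketch, assuming the dedicated codestral.mistral.ai base URL and a key exported as CODESTRAL_API_KEY; consult Mistral's documentation for the current endpoint and model name.

# Minimal sketch: verify a Codestral API key with a small chat completion request
# Assumes the codestral.mistral.ai endpoint and a key in CODESTRAL_API_KEY (check Mistral's docs).
import os
import requests  # pip install requests

resp = requests.post(
    "https://codestral.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['CODESTRAL_API_KEY']}"},
    json={
        "model": "codestral-latest",
        "messages": [{"role": "user", "content": "Complete this Python function: def reverse_string(s):"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])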

Integrating Hosted Models with the Continue Extension

With API keys from GitHub or Mistral, you can integrate hosted models into the Continue extension in VS Code for chat and autocomplete features.

  1. Open the config.json File: In VS Code, access the Continue extension settings and locate the config.json file.
  2. Add API Keys: Insert your API keys into the models array within config.json. Below is an example configuration:
  3. Save and Reload: Save the file and reload the Continue extension to apply the changes.
{
  "tabAutocompleteModel": {
    "apiKey": "12345678901234567890",
    "title": "Codestral",
    "model": "codestral-latest",
    "provider": "mistral"
  },
  "models": [
    {
      "title": "deepseek-r1-distill-qwen-32b",
      "model": "deepseek-r1:32b-qwen-distill-q4_K_M",
      "provider": "ollama",
      "apiBase": "http://localhost:11434",
      "systemMessage": "You are an expert software developer. You give helpful and concise responses. You use TypeScript and React with Next.js 14. You prefer arrow functions and a more functional programming style."
    },
    {
      "apiKey": "12345678901234567890",
      "title": "Codestral",
      "model": "codestral-latest",
      "provider": "mistral"
    },
    {
      "apiKey": "12345678901234567890",
      "engine": "anything",
      "apiBase": "https://models.inference.ai.azure.com",
      "apiType": "azure",
      "model": "gpt-4o-mini",
      "title": "gpt-4o-mini",
      "systemMessage": "You are an expert software developer. You give helpful and concise responses.",
      "provider": "azure"
    }
  ]
}
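
In this configuration, the lightweight hosted Codestral model handles tab autocomplete (via tabAutocompleteModel), while chat requests can go to the local Deepseek R1 Distill Qwen 32B model or to either hosted model, so you can mix local and hosted providers in a single Continue setup and switch between them from Continue's model selector.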


By integrating Deepseek R1 Distill Qwen 32B into your local development environment, you're not just adopting a new tool—you're embracing the future of AI-assisted coding. With its exceptional performance, open-source accessibility, and advanced capabilities, Deepseek R1 Distill Qwen 32B stands out as a leading choice for developers looking to enhance their productivity and code quality.

Conclusion

Running Large Language Models like Deepseek R1 Distill Qwen 32B locally is more than just a trend—it's a powerful shift in how developers can leverage AI to enhance their coding workflows. By bringing models onto your machine, you gain speed, security, and control that cloud-based services can't match. With tools like Docker, Ollama, and the Continue extension for VS Code, setting up and utilizing local LLMs is more accessible than ever.

Deepseek R1 features a wide variety of distilled models with parameter counts starting at as little as 1.5B, so you can run a Deepseek R1 distill right now on almost any modern GPU. For increased speed and performance, configure high-performance GPU workstations with Exxact featuring exceptional high-VRAM GPUs like the RTX 5090, RTX 6000 Ada, or even the H100. The more VRAM and compute your GPU has, the larger the model you can run and the faster it can generate outputs.
