Local LLM for VS Code Copilot
From Master of Neuroscience Wiki
Revision as of 16:50, 9 December 2025
In VS Code (Insiders) we can add a custom LLM via the "OpenAI-compatible API" configuration.
Adding a custom model to Copilot in VS Code Insiders is easy. Getting it to do agent work is really hard.
Here is an example: Qwen3 Coder.
VS Code JSON
VS Code JSON
"github.copilot.chat.customOAIModels": {
"Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8": {
"name": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
"url": "http://gate0.neuro.uni-bremen.de:8000/v1",
"toolCalling": true,
"vision": false,
"thinking": true,
"maxInputTokens": 256000,
"maxOutputTokens": 8192,
"requiresAPIKey": false
  }
}
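The entry above registers the endpoint with Copilot; any OpenAI-compatible client then sends a payload of roughly this shape to `<url>/chat/completions`. A minimal sketch (the prompt text is made up; the model name and token limit come from the config above):

```python
import json

# Minimal chat-completions payload. The model ID matches the
# customOAIModels entry above; max_tokens must stay at or below
# the configured maxOutputTokens (8192).
payload = {
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    "messages": [
        {"role": "user", "content": "Write a hello-world in C."},
    ],
    "max_tokens": 256,
}

# This is the JSON body an OpenAI-compatible client POSTs to the endpoint.
print(json.dumps(payload, indent=2))
```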
Ollama is not a good choice for serving an agent model. The problem is the format in which the model emits its agent (tool) calls: there are at least three styles, namely XML, JSON, and something stranger. Qwen3 Coder falls into the "stranger" category.
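For illustration, here is a sketch of how such a non-JSON tool call might be parsed into a structured form. The tag layout in the sample is an assumption modeled on Qwen3 Coder's XML-ish output, not an exact specification; in production, vLLM's `--tool-call-parser qwen3_coder` (configured below) does this job.

```python
import re

# Hypothetical sample of a Qwen3-Coder-style tool call (assumed layout,
# for illustration only): an XML-ish wrapper that is neither plain XML
# nor JSON, which is why generic parsers choke on it.
sample = """<tool_call>
<function=get_weather>
<parameter=city>
Bremen
</parameter>
</function>
</tool_call>"""

def parse_tool_call(text: str) -> dict:
    """Extract the function name and parameters from a tool-call block."""
    name = re.search(r"<function=([^>\n]+)>", text).group(1)
    params = {
        m.group(1): m.group(2).strip()
        for m in re.finditer(
            r"<parameter=([^>\n]+)>\n(.*?)\n</parameter>", text, re.S
        )
    }
    return {"name": name, "arguments": params}

print(parse_tool_call(sample))
# → {'name': 'get_weather', 'arguments': {'city': 'Bremen'}}
```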
vllm-qwen-coder.service
[Unit]
Description=vLLM Qwen3 Coder Service
After=network.target
[Service]
Type=simple
User=ollama
Group=ollama
WorkingDirectory=/ollama_coder/vllm-project
Environment="VLLM_LOGGING_LEVEL=INFO"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="HF_HOME=/ollama_coder/huggingface_cache"
ExecStart=/ollama_coder/vllm-project/.venv/bin/vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
--port 8000 \
--host 0.0.0.0 \
--dtype auto \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Install and activate it via:
cp -f vllm-qwen-coder.service /etc/systemd/system/vllm-qwen-coder.service
systemctl daemon-reload
systemctl enable vllm-qwen-coder
systemctl start vllm-qwen-coder
systemctl status vllm-qwen-coder
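Once the service is up, a quick way to confirm the endpoint answers is to list its models. A small sketch using only the standard library; the URL is the one from the VS Code config above and is only reachable inside that network:

```python
import json
import urllib.request

def list_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query an OpenAI-compatible /v1/models endpoint and return model IDs."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        data = json.load(resp)
    return [m["id"] for m in data.get("data", [])]

# Example (only works inside the network where the server runs):
# print(list_models("http://gate0.neuro.uni-bremen.de:8000/v1"))
```

A healthy vLLM instance should return the served model ID, here `Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8`.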
This looks harmless, but note the two tool-call flags (--enable-auto-tool-choice and --tool-call-parser qwen3_coder): without them, the model's tool calls are not parsed and agent mode will not work.
How to install vLLM
We need uv: https://github.com/astral-sh/uv
cd /ollama_coder
uv init vllm-project
Put the pyproject.toml shown below into /ollama_coder/vllm-project, then run:
uv sync
pyproject.toml
[project]
name = "vllm-project"
version = "0.1.0"
description = "A project using vLLM for LLM inference"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
"vllm>=0.12.0",
]
[project.optional-dependencies]
dev = [
"pytest>=8.0.0",
"black>=24.0.0",
"ruff>=0.1.0",
]
[tool.uv]
dev-dependencies = [
"pytest>=8.0.0",
"black>=24.0.0",
"ruff>=0.1.0",
]