Local LLM for VS code copilot

In the configuration of VS Code (Insiders) we can add a custom LLM via the "OpenAI compatible API".

Adding a custom model to Copilot in VS Code Insiders is easy. Getting it to do agent work is really hard.

Here is an example: Qwen3 Coder.

Also change the directories if you need to.

VS Code JSON

"github.copilot.chat.customOAIModels": {
    "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8": {
      "name": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
      "url": "http://gate0.neuro.uni-bremen.de:8000/v1",
      "toolCalling": true,
      "vision": false,
      "thinking": true,
      "maxInputTokens": 256000,
      "maxOutputTokens": 8192,      
      "requiresAPIKey": false
    }
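
Before pointing Copilot at the server, it is worth checking that the endpoint really speaks the OpenAI-compatible protocol. A minimal check with curl (the URL is the one from the JSON above; no API key is needed, matching "requiresAPIKey": false):

curl http://gate0.neuro.uni-bremen.de:8000/v1/models

curl http://gate0.neuro.uni-bremen.de:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
          "messages": [{"role": "user", "content": "Say hello in one word."}],
          "max_tokens": 16
        }'

The first call lists the models the server exposes; the id must match the "name" used in the VS Code settings.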

Ollama is not a good choice for an agent model. This has to do with the format in which the model delivers its agent (tool-call) commands.

There are at least three modes: XML, JSON, and something strange.

Qwen3 Coder falls into the category "strange".
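
To make the difference concrete: Hermes-style models wrap a JSON object in <tool_call> tags, roughly like this (illustrative only; get_weather is a made-up tool and the exact template depends on the model):

<tool_call>
{"name": "get_weather", "arguments": {"city": "Bremen"}}
</tool_call>

Qwen3 Coder instead emits an XML-like structure with <function=...> and <parameter=...> tags, which is why vLLM ships a dedicated qwen3_coder parser for it (see the --tool-call-parser flag in the service file below).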

vllm-qwen-coder.service

[Unit]
Description=vLLM Qwen3 Coder Service
After=network.target

[Service]
Type=simple
User=ollama
Group=ollama
WorkingDirectory=/ollama_coder/vllm-project
Environment="VLLM_LOGGING_LEVEL=INFO"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="HF_HOME=/ollama_coder/huggingface_cache"
ExecStart=/ollama_coder/vllm-project/.venv/bin/vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype auto \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Environment="HF_HOME=/ollama_coder/huggingface_cache" tell the tool where to store the model data (HF -> huggingface). To be installed and activated via

cp -f vllm-qwen-coder.service /etc/systemd/system/vllm-qwen-coder.service
systemctl daemon-reload
systemctl enable vllm-qwen-coder
systemctl start vllm-qwen-coder

systemctl status vllm-qwen-coder
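
Since the unit sends its output to the journal (StandardOutput=journal / StandardError=journal above), startup problems such as a failed model download or an out-of-memory crash are easiest to spot with:

journalctl -u vllm-qwen-coder -f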

This looks harmless, but two of the options deserve a closer look (see https://docs.vllm.ai/en/latest/features/tool_calling/#none-function-calling):

  • --enable-auto-tool-choice -- mandatory. It tells vLLM that you want to allow the model to generate its own tool calls when it deems appropriate.
  • --tool-parser-plugin -- optional. A tool parser plugin used to register user-defined tool parsers into vLLM; the registered parser name can then be specified with --tool-call-parser.

Picking the wrong parser can ruin your day. Look under https://docs.vllm.ai/en/latest/features/tool_calling/#hermes-models-hermes for the correct --tool-call-parser value for your model; for Qwen3 Coder it is qwen3_coder, as used in the service file above.
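
A quick way to check that the parser does its job is to send a request with a dummy tool definition and see whether vLLM returns a structured tool call instead of plain text (get_weather is again just a made-up probe):

curl http://gate0.neuro.uni-bremen.de:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
          "messages": [{"role": "user", "content": "What is the weather in Bremen?"}],
          "tools": [{
            "type": "function",
            "function": {
              "name": "get_weather",
              "description": "Get the current weather for a city.",
              "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
              }
            }
          }]
        }'

With a working parser, the assistant message in the response contains a "tool_calls" entry; with the wrong parser the call typically ends up as raw text in "content".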

We also need to respect this equation:

maxInputTokens = max-model-len - maxOutputTokens - safety_buffer
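
Plugging in the numbers used above: with max-model-len = 262144 and maxOutputTokens = 8192, maxInputTokens can be at most 262144 - 8192 = 253952 before any safety buffer. The 256000 in the example JSON is slightly above that, so a value of 253952 or lower is on the safe side.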

How to install vLLM

We need uv: https://github.com/astral-sh/uv
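
If uv is not installed yet, the standalone installer from the project page above is one way to get it:

curl -LsSf https://astral.sh/uv/install.sh | sh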

cd /ollama_coder
uv init vllm-project

Copy the pyproject.toml (shown below) into /ollama_coder/vllm-project and run:

uv sync

pyproject.toml

[project]
name = "vllm-project"
version = "0.1.0"
description = "A project using vLLM for LLM inference"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "vllm>=0.12.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]

[tool.uv]
dev-dependencies = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]
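
After uv sync, the vllm executable ends up at /ollama_coder/vllm-project/.venv/bin/vllm, which is exactly the path used in ExecStart above. A quick sanity check before enabling the service:

/ollama_coder/vllm-project/.venv/bin/vllm --help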