Local LLM for VS code copilot
In [[Config VS Code (Insiders)]] we see how we can add a custom LLM via the "OpenAI compatible API".
Adding a custom model to Copilot in VS Code Insiders is easy. Letting it do agent work is really hard.
Here is an example using Qwen3 Coder.
Also, change the directories if you need to.
=== VS Code JSON ===
<syntaxhighlight lang="json">"github.copilot.chat.customOAIModels": { | |||
"Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8": { | |||
"name": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8", | |||
"url": "http://gate0.neuro.uni-bremen.de:8000/v1", | |||
"toolCalling": true, | |||
"vision": false, | |||
"thinking": true, | |||
"maxInputTokens": 256000, | |||
"maxOutputTokens": 8192, | |||
"requiresAPIKey": false | |||
}</syntaxhighlight>Ollama is not a good choice for an agent model. This has something to do with the format the agent commands are delivered by the model. | |||
There are at least three modes: XML, JSON, and something strange.
Qwen3 Coder falls into the "strange" category.
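To see which format your model actually emits, you can send a request with a tool definition straight to the OpenAI-compatible endpoint and look at the raw response. This is only a sketch: it assumes the vLLM server configured below is already running on gate0.neuro.uni-bremen.de:8000, and <code>get_weather</code> is a made-up example tool.
<syntaxhighlight lang="bash">
# Probe the tool-call format of the served model.
curl -s http://gate0.neuro.uni-bremen.de:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Bremen?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }' | python3 -m json.tool
</syntaxhighlight>
If tool calling is wired up correctly, the call shows up as structured JSON under <code>choices[0].message.tool_calls</code>; with the wrong parser it tends to leak into <code>content</code> as plain text.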
<big>You '''really really''' want to create a new custom agent setting in VS Code Insiders Copilot. You can take inspiration from the settings of other models:</big>
<big>https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/main/VSCode%20Agent/gpt-5-mini.txt</big>
=== vllm-qwen-coder.service ===
<syntaxhighlight lang="bash">[Unit] | |||
Description=vLLM Qwen3 Coder Service | |||
After=network.target | |||
[Service] | |||
Type=simple | |||
User=ollama | |||
Group=ollama | |||
WorkingDirectory=/ollama_coder/vllm-project | |||
Environment="VLLM_LOGGING_LEVEL=INFO" | |||
Environment="CUDA_VISIBLE_DEVICES=0" | |||
Environment="HF_HOME=/ollama_coder/huggingface_cache" | |||
ExecStart=/ollama_coder/vllm-project/.venv/bin/vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \ | |||
--port 8000 \ | |||
--host 0.0.0.0 \ | |||
--dtype auto \ | |||
--max-model-len 262144 \ | |||
--gpu-memory-utilization 0.85 \ | |||
--enable-auto-tool-choice \ | |||
--tool-call-parser qwen3_coder | |||
Restart=on-failure | |||
RestartSec=10 | |||
StandardOutput=journal | |||
StandardError=journal | |||
[Install] | |||
WantedBy=multi-user.target</syntaxhighlight><code>Environment="HF_HOME=/ollama_coder/huggingface_cache"</code> tells vLLM where to store the downloaded model data (HF = Hugging Face).
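If you want to pre-fetch the weights into that cache before the first service start (instead of letting vLLM download them on boot), something along these lines should work; it assumes the <code>huggingface-cli</code> tool from <code>huggingface_hub</code> is installed:
<syntaxhighlight lang="bash">
# Download the model into the same cache directory the service uses.
HF_HOME=/ollama_coder/huggingface_cache \
  huggingface-cli download Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
</syntaxhighlight>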
To be installed and activated via:<syntaxhighlight lang="bash">
cp -f vllm-qwen-coder.service /etc/systemd/system/vllm-qwen-coder.service
systemctl daemon-reload
systemctl enable vllm-qwen-coder
systemctl start vllm-qwen-coder
systemctl status vllm-qwen-coder
</syntaxhighlight>This looks harmless, but note (see https://docs.vllm.ai/en/latest/features/tool_calling/#none-function-calling):
* <code>--enable-auto-tool-choice</code> -- '''mandatory'''. It tells vLLM that the model is allowed to generate its own tool calls when it deems appropriate.
* <code>--tool-parser-plugin</code> -- '''optional'''. A tool parser plugin used to register user-defined tool parsers into vLLM; the registered parser name can then be specified via <code>--tool-call-parser</code>.
Getting <code>--tool-call-parser</code> wrong can ruin your day. Look under https://docs.vllm.ai/en/latest/features/tool_calling/#hermes-models-hermes to find the correct value for your model.
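To check that the model loaded and that the <code>qwen3_coder</code> tool parser was actually enabled, follow the service logs:
<syntaxhighlight lang="bash">
# Follow the vLLM service logs; watch for the model load and tool-parser messages.
journalctl -u vllm-qwen-coder -f
</syntaxhighlight>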
==== maxInputTokens ====
We also need to follow this equation:<syntaxhighlight>maxInputTokens = max-model-len - maxOutputTokens - safety_buffer</syntaxhighlight>
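A worked example with the numbers used on this page; the 2048-token safety buffer is an arbitrary choice, not something prescribed by vLLM:
<syntaxhighlight lang="bash">
# max-model-len - maxOutputTokens - safety_buffer
echo $((262144 - 8192 - 2048))   # -> 251904, an upper bound for maxInputTokens
</syntaxhighlight>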
==== max-model-len ====
<code>max-model-len</code> comes from the model page (256k = 256 * 1024 = 262144):
https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
==== maxOutputTokens ====
Claude says:
[->]
'''TL;DR''': <code>maxOutputTokens</code> is a policy decision you make, not a model specification. Pick based on what length of responses you typically need. 8192 is a solid middle ground for coding.
'''For your setup:'''
Your '''8192''' is a reasonable choice because:
* Coding tasks often need longer outputs (full functions, files)
* It's ~3% of your total context (262K), leaving 97% for input
* It balances completeness vs. generation speed
* '''Your choice based on use case''':
** Short code completions: 512-2048 tokens
** Medium functions/explanations: 2048-4096 tokens
** Full files/long responses: 4096-8192 tokens
** Very long generation: 8192-16384 tokens
* '''Practical constraints''':
** '''Generation time''': More tokens = longer wait
** '''API timeout limits''': Some clients time out on long requests
** '''Cost considerations''': If using paid APIs
** '''User experience''': Longer isn't always better
[<-]
== How to install vLLM ==
We need uv: https://github.com/astral-sh/uv<syntaxhighlight lang="bash">
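# uv itself can be installed with the upstream installer from astral.sh
# (skip this step if uv is already on the machine):
curl -LsSf https://astral.sh/uv/install.sh | sh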
cd /ollama_coder
uv init vllm-project
</syntaxhighlight>Copy <code>pyproject.toml</code> into /ollama_coder/vllm-project, then run:<syntaxhighlight lang="bash">
uv sync
</syntaxhighlight>
=== pyproject.toml ===
<syntaxhighlight lang="toml"> | |||
[project] | |||
name = "vllm-project" | |||
version = "0.1.0" | |||
description = "A project using vLLM for LLM inference" | |||
readme = "README.md" | |||
requires-python = ">=3.10" | |||
dependencies = [ | |||
"vllm>=0.12.0", | |||
] | |||
[project.optional-dependencies] | |||
dev = [ | |||
"pytest>=8.0.0", | |||
"black>=24.0.0", | |||
"ruff>=0.1.0", | |||
] | |||
[tool.uv] | |||
dev-dependencies = [ | |||
"pytest>=8.0.0", | |||
"black>=24.0.0", | |||
"ruff>=0.1.0", | |||
] | |||
</syntaxhighlight> | |||
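Once the sync has finished, a quick sanity check that vLLM is importable from the uv-managed venv (this only prints the version, it does not start a server):
<syntaxhighlight lang="bash">
# Verify the vLLM install inside the project venv before wiring up the systemd service.
cd /ollama_coder/vllm-project
uv run python -c "import vllm; print(vllm.__version__)"
</syntaxhighlight>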