Local LLM for VS Code Copilot
In Config VS Code (Insiders) we have already seen how to add a custom LLM via the "OpenAI compatible API".
Adding a custom model to Copilot in VS Code Insiders is easy. Getting it to do agent work is really hard.
Here is an example: Qwen3 Coder.
Also, change the directories if you need to.
VS Code JSON
"github.copilot.chat.customOAIModels": {
"Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8": {
"name": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
"url": "http://gate0.neuro.uni-bremen.de:8000/v1",
"toolCalling": true,
"vision": false,
"thinking": true,
"maxInputTokens": 256000,
"maxOutputTokens": 8192,
"requiresAPIKey": false
}
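Before wiring this into Copilot it is worth checking that the endpoint in "url" actually answers on the standard OpenAI-compatible route. A quick sanity check (it assumes the vLLM service described below is already running and reachable from your machine):

curl http://gate0.neuro.uni-bremen.de:8000/v1/models

The returned model id should match the key used above, i.e. Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8.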
Ollama is not a good choice for serving an agent model. This has to do with the format in which the model delivers its agent (tool) calls.
There are at least three modes: XML, JSON, and something strange.
Qwen3 Coder falls into the category "strange".
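To give a rough idea of what these modes look like (a simplified sketch with a made-up tool read_file and parameter path, not exact model output): pure JSON models emit an OpenAI-style tool call as a JSON object, Hermes-style models wrap a JSON object in an XML tag, and Qwen3 Coder uses its own XML-like function/parameter syntax, which is why vLLM ships a dedicated qwen3_coder parser.

Hermes-style (JSON inside an XML tag):
<tool_call>
{"name": "read_file", "arguments": {"path": "main.py"}}
</tool_call>

Qwen3 Coder style (roughly):
<tool_call>
<function=read_file>
<parameter=path>
main.py
</parameter>
</function>
</tool_call>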
You really, really want to create a custom agent setting for this model in VS Code Insiders Copilot. You can take inspiration from the agent settings of other models.
vllm-qwen-coder.service
[Unit]
Description=vLLM Qwen3 Coder Service
After=network.target
[Service]
Type=simple
User=ollama
Group=ollama
WorkingDirectory=/ollama_coder/vllm-project
Environment="VLLM_LOGGING_LEVEL=INFO"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="HF_HOME=/ollama_coder/huggingface_cache"
ExecStart=/ollama_coder/vllm-project/.venv/bin/vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype auto \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Environment="HF_HOME=/ollama_coder/huggingface_cache" tell the tool where to store the model data (HF -> huggingface). To be installed and activated via
cp -f vllm-qwen-coder.service /etc/systemd/system/vllm-qwen-coder.service
systemctl daemon-reload
systemctl enable vllm-qwen-coder
systemctl start vllm-qwen-coder
systemctl status vllm-qwen-coder
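If the service does not come up cleanly, the vLLM output goes to the journal (StandardOutput=journal above). Two quick checks; the second assumes vLLM's /health endpoint, which should return HTTP 200 once the model is loaded:

journalctl -u vllm-qwen-coder -f
curl -i http://gate0.neuro.uni-bremen.de:8000/health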
This looks harmless, but (see https://docs.vllm.ai/en/latest/features/tool_calling/#none-function-calling):
--enable-auto-tool-choice is mandatory for auto tool choice: it tells vLLM that you want to enable the model to generate its own tool calls when it deems appropriate.
--tool-parser-plugin is an optional tool parser plugin used to register user-defined tool parsers into vLLM; the registered tool parser name can then be specified in --tool-call-parser.
Picking the wrong --tool-call-parser can ruin your day. Look under https://docs.vllm.ai/en/latest/features/tool_calling/#hermes-models-hermes for the correct value for your model.
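For contrast with the qwen3_coder parser used above: according to the linked vLLM docs, a Hermes-format model (for example the non-Coder Qwen3 models) would be served with roughly these flags instead:

--enable-auto-tool-choice \
--tool-call-parser hermes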
maxInputTokens
We also need to follow this equation:
maxInputTokens = max-model-len - maxOutputTokens - safety_buffer
max-model-len comes from the model page (256k = 256 * 1024 = 262144):
https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
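Plugged in for this model: 262144 - 8192 = 253952 tokens remain for input before subtracting any safety buffer (the buffer size is a choice, not fixed here). The 256000 in the settings above is a round 256k and sits slightly above that budget; if you run into context-length errors, lower maxInputTokens to 253952 or below.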
maxOutputTokens
Claude says:
[->]
TL;DR: maxOutputTokens is a policy decision you make, not a model specification. Pick based on what length of responses you typically need. 8192 is a solid middle ground for coding.
For your setup:
Your 8192 is a reasonable choice because:
- Coding tasks often need longer outputs (full functions, files)
- It's ~3% of your total context (262K), leaving 97% for input
- It balances completeness vs. generation speed
Your choice based on use case:
- Short code completions: 512-2048 tokens
- Medium functions/explanations: 2048-4096 tokens
- Full files/long responses: 4096-8192 tokens
- Very long generation: 8192-16384 tokens
Practical constraints:
- Generation time: More tokens = longer wait
- API timeout limits: Some clients timeout on long requests
- Cost considerations: If using paid APIs
- User experience: Longer isn't always better
[<-]
How to install vLLM
We need uv: https://github.com/astral-sh/uv
cd /ollama_coder
uv init vllm-project
Copy the pyproject.toml shown below into /ollama_coder/vllm-project, then, inside that directory:
cd vllm-project
uv sync
pyproject.toml
[project]
name = "vllm-project"
version = "0.1.0"
description = "A project using vLLM for LLM inference"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "vllm>=0.12.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]

[tool.uv]
dev-dependencies = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]
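Once uv sync has finished, a quick sanity check that the environment is usable (this assumes the default .venv that uv creates inside the project directory, which is also the path used in the service file above):

cd /ollama_coder/vllm-project
uv run python -c "import vllm; print(vllm.__version__)"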