Local LLM for VS code copilot
In [[Config VS Code (Insiders)]] we see how we can add a custom LLM via the "OpenAI compatible API".
Adding a custom model to Copilot in VS Code Insiders is easy. Letting it do agent work is really hard.
Here is an example using Qwen3 Coder.
Also, change the directories if you need to.
=== VS Code JSON ===
<syntaxhighlight lang="json">"github.copilot.chat.customOAIModels": { | |||
"Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8": { | |||
"name": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8", | |||
"url": "http://gate0.neuro.uni-bremen.de:8000/v1", | |||
"toolCalling": true, | |||
"vision": false, | |||
"thinking": true, | |||
"maxInputTokens": 256000, | |||
"maxOutputTokens": 8192, | |||
"requiresAPIKey": false | |||
}</syntaxhighlight>Ollama is not a good choice for an agent model. This has something to do with the format the agent commands are delivered by the model. | |||
There are at least three modes: XML, JSON, and something strange.
Qwen3 Coder falls into the "strange" category.
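To see which format your model actually emits, you can send a request with a tool definition straight to the OpenAI-compatible endpoint and look at the raw response. This is only a sketch: it assumes the vLLM server configured below is already running on gate0.neuro.uni-bremen.de:8000, and <code>get_weather</code> is a made-up example tool.
<syntaxhighlight lang="bash">
# Probe the tool-call format of the served model.
curl -s http://gate0.neuro.uni-bremen.de:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Bremen?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }' | python3 -m json.tool
</syntaxhighlight>
If tool calling is wired up correctly, the call shows up as structured JSON under <code>choices[0].message.tool_calls</code>; with the wrong parser it tends to leak into <code>content</code> as plain text.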
<big>You '''really really''' want to create a new custom agent setting in VS Code Insiders Copilot. You can take inspiration from the settings of other models:</big>
<big>https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/main/VSCode%20Agent/gpt-5-mini.txt</big>
=== vllm-qwen-coder.service ===
<syntaxhighlight lang="bash">[Unit] | |||
Description=vLLM Qwen3 Coder Service | |||
After=network.target | |||
[Service] | |||
Type=simple | |||
User=ollama | |||
Group=ollama | |||
WorkingDirectory=/ollama_coder/vllm-project | |||
Environment="VLLM_LOGGING_LEVEL=INFO" | |||
Environment="CUDA_VISIBLE_DEVICES=0" | |||
Environment="HF_HOME=/ollama_coder/huggingface_cache" | |||
ExecStart=/ollama_coder/vllm-project/.venv/bin/vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \ | |||
--port 8000 \ | |||
--host 0.0.0.0 \ | |||
--dtype auto \ | |||
--max-model-len 262144 \ | |||
--gpu-memory-utilization 0.85 \ | |||
--enable-auto-tool-choice \ | |||
--tool-call-parser qwen3_coder | |||
Restart=on-failure | |||
RestartSec=10 | |||
StandardOutput=journal | |||
StandardError=journal | |||
[Install] | |||
WantedBy=multi-user.target</syntaxhighlight><code>Environment="HF_HOME=/ollama_coder/huggingface_cache"</code> tells vLLM where to store the downloaded model data (HF = Hugging Face).
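If you want to pre-fetch the weights into that cache before the first service start (instead of letting vLLM download them on boot), something along these lines should work; it assumes the <code>huggingface-cli</code> tool from <code>huggingface_hub</code> is installed:
<syntaxhighlight lang="bash">
# Download the model into the same cache directory the service uses.
HF_HOME=/ollama_coder/huggingface_cache \
  huggingface-cli download Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
</syntaxhighlight>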
To be installed and activated via:<syntaxhighlight lang="bash">
cp -f vllm-qwen-coder.service /etc/systemd/system/vllm-qwen-coder.service
systemctl daemon-reload
systemctl enable vllm-qwen-coder
systemctl start vllm-qwen-coder
systemctl status vllm-qwen-coder
</syntaxhighlight>This looks harmless, but note (see https://docs.vllm.ai/en/latest/features/tool_calling/#none-function-calling):
* <code>--enable-auto-tool-choice</code> -- '''mandatory'''. It tells vLLM that the model is allowed to generate its own tool calls when it deems appropriate.
* <code>--tool-parser-plugin</code> -- '''optional'''. A tool parser plugin used to register user-defined tool parsers into vLLM; the registered parser name can then be specified via <code>--tool-call-parser</code>.
Getting <code>--tool-call-parser</code> wrong can ruin your day. Look under https://docs.vllm.ai/en/latest/features/tool_calling/#hermes-models-hermes to find the correct value for your model.
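To check that the model loaded and that the <code>qwen3_coder</code> tool parser was actually enabled, follow the service logs:
<syntaxhighlight lang="bash">
# Follow the vLLM service logs; watch for the model load and tool-parser messages.
journalctl -u vllm-qwen-coder -f
</syntaxhighlight>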
==== maxInputTokens ====
We also need to follow this equation:<syntaxhighlight>maxInputTokens = max-model-len - maxOutputTokens - safety_buffer</syntaxhighlight>
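A worked example with the numbers used on this page; the 2048-token safety buffer is an arbitrary choice, not something prescribed by vLLM:
<syntaxhighlight lang="bash">
# max-model-len - maxOutputTokens - safety_buffer
echo $((262144 - 8192 - 2048))   # -> 251904, an upper bound for maxInputTokens
</syntaxhighlight>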
==== max-model-len ====
<code>max-model-len</code> comes from the model page (256k = 256 * 1024 = 262144):
https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
==== maxOutputTokens ====
Claude says:
[->]
'''TL;DR''': <code>maxOutputTokens</code> is a policy decision you make, not a model specification. Pick based on what length of responses you typically need. 8192 is a solid middle ground for coding.
'''For your setup:'''
Your '''8192''' is a reasonable choice because:
* Coding tasks often need longer outputs (full functions, files)
* It's ~3% of your total context (262K), leaving 97% for input
* It balances completeness vs. generation speed
* '''Your choice based on use case''':
** Short code completions: 512-2048 tokens
** Medium functions/explanations: 2048-4096 tokens
** Full files/long responses: 4096-8192 tokens
** Very long generation: 8192-16384 tokens
* '''Practical constraints''':
** '''Generation time''': More tokens = longer wait
** '''API timeout limits''': Some clients time out on long requests
** '''Cost considerations''': If using paid APIs
** '''User experience''': Longer isn't always better
[<-]
== How to install vLLM ==
We need uv: https://github.com/astral-sh/uv<syntaxhighlight lang="bash">
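# uv itself can be installed with the upstream installer from astral.sh
# (skip this step if uv is already on the machine):
curl -LsSf https://astral.sh/uv/install.sh | sh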
cd /ollama_coder
uv init vllm-project
</syntaxhighlight>Copy <code>pyproject.toml</code> into /ollama_coder/vllm-project, then run:<syntaxhighlight lang="bash">
uv sync
</syntaxhighlight>
=== pyproject.toml ===
<syntaxhighlight lang="toml"> | |||
[project] | |||
name = "vllm-project" | |||
version = "0.1.0" | |||
description = "A project using vLLM for LLM inference" | |||
readme = "README.md" | |||
requires-python = ">=3.10" | |||
dependencies = [ | |||
"vllm>=0.12.0", | |||
] | |||
[project.optional-dependencies] | |||
dev = [ | |||
"pytest>=8.0.0", | |||
"black>=24.0.0", | |||
"ruff>=0.1.0", | |||
] | |||
[tool.uv] | |||
dev-dependencies = [ | |||
"pytest>=8.0.0", | |||
"black>=24.0.0", | |||
"ruff>=0.1.0", | |||
] | |||
</syntaxhighlight> | |||
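Once the sync has finished, a quick sanity check that vLLM is importable from the uv-managed venv (this only prints the version, it does not start a server):
<syntaxhighlight lang="bash">
# Verify the vLLM install inside the project venv before wiring up the systemd service.
cd /ollama_coder/vllm-project
uv run python -c "import vllm; print(vllm.__version__)"
</syntaxhighlight>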