Local LLM for VS code copilot

In Config VS Code (Insiders) we see how to add a custom LLM via the "OpenAI compatible API".

Adding a custom model to Copilot in VS Code Insiders is easy. Getting it to do agent stuff is really hard.

Here is an example: Qwen3 Coder.

Also change the directories below if you need to.

=== VS Code JSON ===

"github.copilot.chat.customOAIModels": {
    "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8": {
      "name": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
      "url": "http://gate0.neuro.uni-bremen.de:8000/v1",
      "toolCalling": true,
      "vision": false,
      "thinking": true,
      "maxInputTokens": 256000,
      "maxOutputTokens": 8192,      
      "requiresAPIKey": false
    }
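
Before wiring the model into Copilot, it helps to check that the endpoint in <code>"url"</code> is actually reachable. A minimal sanity check, assuming the vLLM server described further down is already running on gate0:

<syntaxhighlight lang="bash">
# List the models served by the OpenAI-compatible endpoint.
# The model name configured above must appear in the returned "data" array.
curl -s http://gate0.neuro.uni-bremen.de:8000/v1/models
</syntaxhighlight>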

Ollama is not a good choice for serving an agent model. This has to do with the format in which the model delivers its agent (tool) commands.

There are at least three modes: XML, JSON, and something strange.

Qwen3 Coder falls into the "strange" category.

<big>You '''really really''' want to create a new custom agent setting in VS Code Insiders Copilot. You can take inspiration from other models' settings:</big>

https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools/blob/main/VSCode%20Agent/gpt-5-mini.txt

=== vllm-qwen-coder.service ===

<syntaxhighlight lang="ini">
[Unit]
Description=vLLM Qwen3 Coder Service
After=network.target

[Service]
Type=simple
User=ollama
Group=ollama
WorkingDirectory=/ollama_coder/vllm-project
Environment="VLLM_LOGGING_LEVEL=INFO"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="HF_HOME=/ollama_coder/huggingface_cache"
ExecStart=/ollama_coder/vllm-project/.venv/bin/vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype auto \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
</syntaxhighlight>

Environment="HF_HOME=/ollama_coder/huggingface_cache" tell the tool where to store the model data (HF -> huggingface). To be installed and activated via

<syntaxhighlight lang="bash">
cp -f vllm-qwen-coder.service /etc/systemd/system/vllm-qwen-coder.service
systemctl daemon-reload
systemctl enable vllm-qwen-coder
systemctl start vllm-qwen-coder

systemctl status vllm-qwen-coder
</syntaxhighlight>
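
If the service does not come up, the vLLM output goes to the journal (see StandardOutput/StandardError above). A quick way to follow it:

<syntaxhighlight lang="bash">
# Follow the vLLM server log live; model downloads and CUDA errors show up here.
journalctl -u vllm-qwen-coder -f
</syntaxhighlight>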

This looks harmless, but note (https://docs.vllm.ai/en/latest/features/tool_calling/#none-function-calling):

* <code>--enable-auto-tool-choice</code> -- '''mandatory'''. Enables auto tool choice: it tells vLLM that you want to let the model generate its own tool calls when it deems appropriate.
* <code>--tool-parser-plugin</code> -- '''optional'''. A tool parser plugin used to register user-defined tool parsers into vLLM; the registered tool parser name can then be specified in <code>--tool-call-parser</code>.

Choosing the wrong tool parser can ruin your day. Look under https://docs.vllm.ai/en/latest/features/tool_calling/#hermes-models-hermes to find the correct <code>--tool-call-parser</code> value for your model.
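
Once the server is up with <code>--enable-auto-tool-choice</code> and the matching <code>--tool-call-parser</code>, you can check that tool calls are parsed correctly without going through VS Code at all. A sketch using the standard OpenAI chat-completions request format; the <code>get_weather</code> tool is made up purely for this test:

<syntaxhighlight lang="bash">
# Send a request with one dummy tool definition. If the parser works,
# the assistant message in the response contains a structured "tool_calls"
# entry instead of raw text in the model's own call format.
curl -s http://gate0.neuro.uni-bremen.de:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    "messages": [{"role": "user", "content": "What is the weather in Bremen?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
</syntaxhighlight>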

==== maxInputTokens ====

We also need to follow this equation:

<syntaxhighlight>
maxInputTokens = max-model-len - maxOutputTokens - safety_buffer
</syntaxhighlight>
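
As a concrete check with the values used here (the 2048-token safety buffer is just an assumed example value, pick your own):

<syntaxhighlight lang="bash">
# max-model-len - maxOutputTokens - safety_buffer
echo $(( 262144 - 8192 - 2048 ))   # -> 251904, an upper bound for maxInputTokens
</syntaxhighlight>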

==== max-model-len ====

max-model-len comes from the model page (256k = 256 * 1024 = 262144):

https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
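
You can also read the value back from the running server; recent vLLM versions should include a <code>max_model_len</code> field in the model card returned by <code>/v1/models</code> (treat the exact field name as an assumption and check the raw response if the grep finds nothing):

<syntaxhighlight lang="bash">
# Extract the configured context size from the served model card.
curl -s http://gate0.neuro.uni-bremen.de:8000/v1/models | grep -o '"max_model_len": *[0-9]*'
</syntaxhighlight>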

==== maxOutputTokens ====

Claude says:

[->]

'''TL;DR''': <code>maxOutputTokens</code> is a policy decision you make, not a model specification. Pick based on what length of responses you typically need. 8192 is a solid middle ground for coding.

'''For your setup:'''

Your '''8192''' is a reasonable choice because:

* Coding tasks often need longer outputs (full functions, files)
* It's ~3% of your total context (262K), leaving 97% for input
* It balances completeness vs. generation speed

* '''Your choice based on use case''':
** Short code completions: 512-2048 tokens
** Medium functions/explanations: 2048-4096 tokens
** Full files/long responses: 4096-8192 tokens
** Very long generation: 8192-16384 tokens

* '''Practical constraints''':
** '''Generation time''': more tokens = longer wait
** '''API timeout limits''': some clients time out on long requests
** '''Cost considerations''': if using paid APIs
** '''User experience''': longer isn't always better

[<-]

== How to install vLLM ==

We need uv: https://github.com/astral-sh/uv

<syntaxhighlight lang="bash">
cd /ollama_coder
uv init vllm-project
</syntaxhighlight>

Copy the pyproject.toml below into /ollama_coder/vllm-project, then run:

<syntaxhighlight lang="bash">
uv sync
</syntaxhighlight>
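
A quick way to confirm that vLLM actually ended up in the project venv (just a sanity check, nothing project-specific):

<syntaxhighlight lang="bash">
# Import vLLM from the project environment and print its version.
cd /ollama_coder/vllm-project
uv run python -c "import vllm; print(vllm.__version__)"
</syntaxhighlight>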

=== pyproject.toml ===

<syntaxhighlight lang="toml">
[project]
name = "vllm-project"
version = "0.1.0"
description = "A project using vLLM for LLM inference"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "vllm>=0.12.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]

[tool.uv]
dev-dependencies = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]
</syntaxhighlight>
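
Before (or instead of) installing the systemd unit, you can start the server once by hand from the project directory, with the same flags as in ExecStart above, so errors show up directly in the terminal:

<syntaxhighlight lang="bash">
# Manual test run; stop with Ctrl+C. Flags mirror the systemd unit above.
cd /ollama_coder/vllm-project
HF_HOME=/ollama_coder/huggingface_cache CUDA_VISIBLE_DEVICES=0 \
  uv run vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype auto \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
</syntaxhighlight>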