Local LLM for VS code copilot

In the configuration of VS Code (Insiders) we can add a custom LLM via the "OpenAI compatible API".

Adding a custom model to Copilot in VS Code Insiders is easy. Getting it to do agent work is really hard.

Here is an example: Qwen3 Coder.

Also change the directories if you need to.

VS Code JSON

"github.copilot.chat.customOAIModels": {
    "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8": {
      "name": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
      "url": "http://gate0.neuro.uni-bremen.de:8000/v1",
      "toolCalling": true,
      "vision": false,
      "thinking": true,
      "maxInputTokens": 256000,
      "maxOutputTokens": 8192,      
      "requiresAPIKey": false
    }
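
Before pointing Copilot at the server, it is worth checking that the endpoint really speaks the OpenAI-compatible protocol. A minimal check with curl (the URL is the one from the JSON above; no API key is needed, matching "requiresAPIKey": false):

curl http://gate0.neuro.uni-bremen.de:8000/v1/models

curl http://gate0.neuro.uni-bremen.de:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
          "messages": [{"role": "user", "content": "Say hello in one word."}],
          "max_tokens": 16
        }'

The first call lists the models the server exposes; the id must match the "name" used in the VS Code settings.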

Ollama is not a good choice for an agent model. This has to do with the format in which the model delivers its agent (tool-call) commands.

There are at least three modes: XML, JSON, and something strange.

Qwen3 Coder falls into the category "strange".
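
To make the difference concrete: Hermes-style models wrap a JSON object in <tool_call> tags, roughly like this (illustrative only; get_weather is a made-up tool and the exact template depends on the model):

<tool_call>
{"name": "get_weather", "arguments": {"city": "Bremen"}}
</tool_call>

Qwen3 Coder instead emits an XML-like structure with <function=...> and <parameter=...> tags, which is why vLLM ships a dedicated qwen3_coder parser for it (see the --tool-call-parser flag in the service file below).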

vllm-qwen-coder.service

[Unit]
Description=vLLM Qwen3 Coder Service
After=network.target

[Service]
Type=simple
User=ollama
Group=ollama
WorkingDirectory=/ollama_coder/vllm-project
Environment="VLLM_LOGGING_LEVEL=INFO"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="HF_HOME=/ollama_coder/huggingface_cache"
ExecStart=/ollama_coder/vllm-project/.venv/bin/vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype auto \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Environment="HF_HOME=/ollama_coder/huggingface_cache" tell the tool where to store the model data (HF -> huggingface). To be installed and activated via

cp -f vllm-qwen-coder.service /etc/systemd/system/vllm-qwen-coder.service
systemctl daemon-reload
systemctl enable vllm-qwen-coder
systemctl start vllm-qwen-coder

systemctl status vllm-qwen-coder
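
Since the unit sends its output to the journal (StandardOutput=journal / StandardError=journal above), startup problems such as a failed model download or an out-of-memory crash are easiest to spot with:

journalctl -u vllm-qwen-coder -f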

This looks harmless, but two of the options deserve a closer look (see https://docs.vllm.ai/en/latest/features/tool_calling/#none-function-calling):

  • --enable-auto-tool-choice -- mandatory. It tells vLLM that you want to allow the model to generate its own tool calls when it deems appropriate.
  • --tool-parser-plugin -- optional. A tool parser plugin used to register user-defined tool parsers into vLLM; the registered parser name can then be specified with --tool-call-parser.

Picking the wrong parser can ruin your day. Look under https://docs.vllm.ai/en/latest/features/tool_calling/#hermes-models-hermes for the correct --tool-call-parser value for your model; for Qwen3 Coder it is qwen3_coder, as used in the service file above.
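
A quick way to check that the parser does its job is to send a request with a dummy tool definition and see whether vLLM returns a structured tool call instead of plain text (get_weather is again just a made-up probe):

curl http://gate0.neuro.uni-bremen.de:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
          "messages": [{"role": "user", "content": "What is the weather in Bremen?"}],
          "tools": [{
            "type": "function",
            "function": {
              "name": "get_weather",
              "description": "Get the current weather for a city.",
              "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
              }
            }
          }]
        }'

With a working parser, the assistant message in the response contains a "tool_calls" entry; with the wrong parser the call typically ends up as raw text in "content".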

We also need to respect this equation:

maxInputTokens = max-model-len - maxOutputTokens - safety_buffer
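
Plugging in the numbers used above: with max-model-len = 262144 and maxOutputTokens = 8192, maxInputTokens can be at most 262144 - 8192 = 253952 before any safety buffer. The 256000 in the example JSON is slightly above that, so a value of 253952 or lower is on the safe side.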

How to install vLLM

We need uv: https://github.com/astral-sh/uv
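
If uv is not installed yet, the standalone installer from the project page above is one way to get it:

curl -LsSf https://astral.sh/uv/install.sh | sh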

cd /ollama_coder
uv init vllm-project

Copy the pyproject.toml (shown below) into /ollama_coder/vllm-project and run:

uv sync

pyproject.toml

[project]
name = "vllm-project"
version = "0.1.0"
description = "A project using vLLM for LLM inference"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "vllm>=0.12.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]

[tool.uv]
dev-dependencies = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]
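
After uv sync, the vllm executable ends up at /ollama_coder/vllm-project/.venv/bin/vllm, which is exactly the path used in ExecStart above. A quick sanity check before enabling the service:

/ollama_coder/vllm-project/.venv/bin/vllm --help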