Local LLM for VS code copilot

In Config VS Code (Insiders) we see how to add a custom LLM via the "OpenAI compatible API".

Adding a custom model to Copilot in VS Code Insiders is easy. Getting it to do agent work is really hard.

Here is an example: Qwen3 Coder.

Also change the directories if you need to.

VS Code JSON

"github.copilot.chat.customOAIModels": {
    "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8": {
      "name": "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
      "url": "http://gate0.neuro.uni-bremen.de:8000/v1",
      "toolCalling": true,
      "vision": false,
      "thinking": true,
      "maxInputTokens": 256000,
      "maxOutputTokens": 8192,      
      "requiresAPIKey": false
    }
}
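
Once the vLLM service described below is running, it is worth checking that the URL and the model name in this JSON block match what the server actually reports. A minimal sketch, assuming the openai Python package is available (it is not part of this setup; install it e.g. with uv add openai):

from openai import OpenAI

# Point the official OpenAI client at the vLLM server. vLLM does not check the
# API key here, but the client insists on a non-empty string.
client = OpenAI(
    base_url="http://gate0.neuro.uni-bremen.de:8000/v1",
    api_key="not-needed",
)

# The model id listed here must match the name used in
# "github.copilot.chat.customOAIModels", otherwise Copilot requests will fail.
for model in client.models.list().data:
    print(model.id)

# Plain chat completion as a smoke test.
reply = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    messages=[{"role": "user", "content": "Say hello in one word."}],
    max_tokens=16,
)
print(reply.choices[0].message.content)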

Ollama is not a good choice for an agent model. This has to do with the format in which the model delivers its agent/tool-call commands.

There are at least three modes: XML, JSON, and something strange.

Qwen3 Coder falls into the "strange" category.
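
Whether that "strange" format actually gets translated into proper tool calls is easy to test from outside: send a request that offers the model a tool and check that the answer comes back as structured tool_calls rather than as plain text. A minimal sketch against the OpenAI-compatible API; the get_weather tool is a made-up example, not part of this setup:

from openai import OpenAI

client = OpenAI(
    base_url="http://gate0.neuro.uni-bremen.de:8000/v1",
    api_key="not-needed",
)

# Throwaway tool definition, purely to exercise the tool-call parser.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

reply = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8",
    messages=[{"role": "user", "content": "What is the weather in Bremen?"}],
    tools=tools,
)

msg = reply.choices[0].message
# With --enable-auto-tool-choice and the matching --tool-call-parser (see the
# service file below), the call arrives here as structured data instead of
# ending up as text in msg.content.
if msg.tool_calls:
    for call in msg.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print("No tool call parsed; raw content:", msg.content)

If the parser does not match the model, the model's own tool-call markup typically shows up as plain text in the content field instead.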

vllm-qwen-coder.service

[Unit]
Description=vLLM Qwen3 Coder Service
After=network.target

[Service]
Type=simple
User=ollama
Group=ollama
WorkingDirectory=/ollama_coder/vllm-project
Environment="VLLM_LOGGING_LEVEL=INFO"
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="HF_HOME=/ollama_coder/huggingface_cache"
ExecStart=/ollama_coder/vllm-project/.venv/bin/vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
    --port 8000 \
    --host 0.0.0.0 \
    --dtype auto \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

Environment="HF_HOME=/ollama_coder/huggingface_cache" tell the tool where to store the model data (HF -> huggingface). To be installed and activated via

cp -f vllm-qwen-coder.service /etc/systemd/system/vllm-qwen-coder.service
systemctl daemon-reload
systemctl enable vllm-qwen-coder
systemctl start vllm-qwen-coder

systemctl status vllm-qwen-coder
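
A quick way to see that the service actually came up, beyond systemctl status, is to probe the HTTP side; the vLLM OpenAI-compatible server exposes a /health route. A minimal sketch using only the Python standard library:

import urllib.request

BASE = "http://gate0.neuro.uni-bremen.de:8000"

# /health answers with HTTP 200 once the engine has loaded the model.
with urllib.request.urlopen(f"{BASE}/health", timeout=10) as resp:
    print("health:", resp.status)

# /v1/models lists the model ids the server is serving.
with urllib.request.urlopen(f"{BASE}/v1/models", timeout=10) as resp:
    print(resp.read().decode())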

The vllm serve command line in the service file looks harmless, but note the tool-calling flags (https://docs.vllm.ai/en/latest/features/tool_calling/#none-function-calling):

  • --enable-auto-tool-choice -- mandatory for auto tool choice. It tells vLLM that you want to enable the model to generate its own tool calls when it deems appropriate.
  • --tool-parser-plugin -- optional tool parser plugin used to register user-defined tool parsers into vLLM; the registered tool parser name can be specified in --tool-call-parser.

Choosing the wrong tool parser can ruin your day. Look under https://docs.vllm.ai/en/latest/features/tool_calling/#hermes-models-hermes for the correct --tool-call-parser value for your model.

Also we need to follow this equation:

maxInputTokens = max-model-len - maxOutputTokens - safety_buffer

max-model-len is from the model page (256k = 256 * 1024 = 262144):

https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8
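
Plugged in for this model, a minimal sketch of the arithmetic; the safety_buffer of 2048 tokens is an arbitrary choice, not a value prescribed by vLLM or Copilot:

# Token budget for Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 as served above.
max_model_len = 256 * 1024      # 262144, matches --max-model-len in the service file
max_output_tokens = 8192        # "maxOutputTokens" in the VS Code config
safety_buffer = 2048            # arbitrary headroom

max_input_tokens = max_model_len - max_output_tokens - safety_buffer
print(max_input_tokens)         # 251904 -> upper bound for "maxInputTokens"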

== How to install vLLM ==

We need uv: https://github.com/astral-sh/uv

cd /ollama_coder
uv init vllm-project

Copy pyproject.toml into /ollama_coder/vllm-project

uv sync

pyproject.toml

[project]
name = "vllm-project"
version = "0.1.0"
description = "A project using vLLM for LLM inference"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "vllm>=0.12.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]

[tool.uv]
dev-dependencies = [
    "pytest>=8.0.0",
    "black>=24.0.0",
    "ruff>=0.1.0",
]
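
Once uv sync has finished, a quick sanity check of the environment can be run with uv run python from inside /ollama_coder/vllm-project; a minimal sketch:

# Confirm that the locked environment provides a recent vLLM and can see the GPU.
import torch
import vllm

print("vLLM version:", vllm.__version__)        # should be >= 0.12.0 per pyproject.toml
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))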