
Hermes Agent integration with custom endpoints

Hermes Agent supports integration with custom endpoints, enabling users to connect locally hosted or self-deployed Large Language Models (LLMs) via OpenAI-compatible APIs.^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md] This capability gives users greater control over the underlying model, improves privacy, and enables the use of open-source models such as Qwen 3.6 27B within the Hermes workflow^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md].

Configuration Methods

There are two primary methods to configure Hermes Agent to use a custom endpoint: interactive configuration and direct file editing.

Interactive Configuration

Users can initiate the configuration process via the command line interface.^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md]

hermes model

When prompted, select "custom endpoint". You will then need to provide the following details:

  • Base URL: The address of your local server (e.g., http://localhost:8000/v1).
  • Model: The specific model identifier (e.g., Qwen/Qwen3-27B).
  • API Key: For local deployments, this field can usually be left empty^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md].

Configuration File

For persistent settings, you can edit the Hermes configuration file directly at ~/.hermes/config.yaml^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md].

provider:
  base_url: http://localhost:8000/v1
  default_model: Qwen/Qwen3-27B
  # Explicitly set context limit to avoid default value restrictions
  max_context_tokens: 32768

Key Configuration Parameters

When integrating with custom endpoints, especially locally hosted models, certain parameters are critical for optimal performance^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md]:

  • base_url: This must point to the OpenAI-compatible endpoint of your inference server (e.g., vLLM); a quick way to verify it is shown after this list.
  • default_model: The model name must match the one loaded in your inference server.
  • max_context_tokens: It is recommended to set this value explicitly. If left unset, Hermes may fall back to a smaller default window, underutilizing models optimized for long contexts (e.g., 32k tokens)^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md].
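
Before wiring Hermes to the endpoint, it can help to confirm that the server actually responds at the configured base_url. A minimal check, assuming an OpenAI-compatible server such as vLLM running on port 8000 as in the examples in this article, is to query the standard model listing route:

curl http://localhost:8000/v1/models

The response should be a JSON object whose data array contains the loaded model identifier (e.g., Qwen/Qwen3-27B); that same identifier is what belongs in default_model.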

Agent Inheritance

In Hermes Agent, the model configuration established for a parent Agent is automatically inherited by any child Agents^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md]. This means that configuring the custom endpoint at the root level ensures consistent behavior across the entire agent hierarchy without the need for repetitive setup.

Integration Example: vLLM and Qwen 3.6 27B

A common use case for custom endpoints is integrating with vLLM, a high-throughput serving engine, using a model like Qwen 3.6 27B^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md].

  1. Start the vLLM Server: Launch the server with the correct parameters, particularly --enable-auto-tool-choice and the appropriate --tool-call-parser (e.g., qwen3), so that the model actually executes tools rather than merely describing them^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md].

    vllm serve Qwen/Qwen3-27B \
      --port 8000 \
      --tensor-parallel-size 1 \
      --max-model-len 32768 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3
    
  2. Configure Hermes: Update the ~/.hermes/config.yaml to point to the vLLM server, setting max_context_tokens to 32768 to utilize the model's full capacity^[001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md].
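
Putting both steps together, the resulting ~/.hermes/config.yaml mirrors the example from the Configuration File section, with the port and model name matching the vllm serve command above:

provider:
  base_url: http://localhost:8000/v1
  default_model: Qwen/Qwen3-27B
  max_context_tokens: 32768

Note that max_context_tokens matches the --max-model-len value passed to vLLM, so Hermes will not request more context than the server can actually provide.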

Sources

  • 001-TODO__Qwen_3.6_27B_—_面向_Coding_Agent_的开源模型.md