Z.AI Thinking Control

Z.AI GLM models can generate an internal reasoning chain (reasoning_content) before producing a response. This improves answer quality, but reasoning tokens are billed as output, consume the shared max_tokens budget, and increase latency.

On LLMTR, thinking is OFF by default (opt-in). A plain request spends no reasoning tokens, so short answers never cost more than you expect. Thinking only runs when you explicitly ask for it:

Model slug suffix — zai/glm-5.1:think (enable) / zai/glm-5.1:fast (disable)
Body field — { "reasoning": true } (enable) / { "reasoning": false } (disable)

When neither a suffix nor the reasoning body field is given, the gateway forwards thinking: { "type": "disabled" } to the Z.AI API.

Which models support thinking control?

Model	LLMTR default	Can be enabled explicitly?
GLM-5.1, GLM-5, GLM-5-Turbo	Off (opt-in)	Yes
GLM-5V-Turbo	Off (opt-in)	Yes
GLM-4.7, GLM-4.7-FlashX	Off (opt-in)	Yes
GLM-4.6, GLM-4.6V, GLM-4.6V-FlashX	Off (opt-in)	Yes
GLM-4.5, GLM-4.5-X, GLM-4.5-Air, GLM-4.5-AirX, GLM-4.5V	Off (opt-in)	Yes
GLM-OCR, GLM-4-32B-0414-128K	—	No

Behaviour is identical across every thinking-capable GLM model: the model produces a reasoning chain when :think or reasoning: true is sent, and answers directly otherwise.

Disabling thinking via slug suffix

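The :fast suffix disables thinking. Because thinking is already off by default, :fast mainly makes that choice explicit in the request. Use it for latency-sensitive requests or when max_tokens is constrained.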

curl https://llmtr.com/v1/chat/completions \
  -H "Authorization: Bearer llmtr-your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai/glm-5.1:fast",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

The :think suffix explicitly enables thinking (required for deep analysis, since the default is off):

"model": "zai/glm-4.5-air:think"

Controlling via body field

Setting reasoning: false disables thinking; setting reasoning: true enables it. When both a suffix and the body field are provided, the body field takes precedence.

curl https://llmtr.com/v1/chat/completions \
  -H "Authorization: Bearer llmtr-your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai/glm-5.1",
    "reasoning": false,
    "messages": [
      {"role": "user", "content": "Quick response please"}
    ]
  }'
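The precedence rules above (body field over suffix, and off when neither is given) can be sketched as a small resolver. This is only an illustration of the documented behaviour, not the gateway's actual code: the function names are hypothetical, stripping the suffix from the forwarded model name is an assumption, and the "enabled" value is assumed by symmetry with the documented "disabled" payload.

```python
from typing import Optional, Tuple

# Suffixes described in this page and the thinking state each one requests.
SUFFIXES = {":think": True, ":fast": False}


def split_suffix(slug: str) -> Tuple[str, Optional[bool]]:
    """Split 'zai/glm-5.1:think' into ('zai/glm-5.1', True).

    Returns None as the second item when the slug has no thinking suffix.
    """
    for suffix, enabled in SUFFIXES.items():
        if slug.endswith(suffix):
            return slug[: -len(suffix)], enabled
    return slug, None


def resolve_thinking(slug: str, reasoning: Optional[bool] = None) -> dict:
    """Hypothetical sketch of how a request maps to the upstream payload."""
    base, from_suffix = split_suffix(slug)
    if reasoning is not None:
        enabled = reasoning          # body field takes precedence
    elif from_suffix is not None:
        enabled = from_suffix        # otherwise the suffix decides
    else:
        enabled = False              # default: thinking off (opt-in)
    # "disabled" is documented above; "enabled" is an assumption.
    return {
        "model": base,
        "thinking": {"type": "enabled" if enabled else "disabled"},
    }
```

For example, resolve_thinking("zai/glm-5.1:think", reasoning=False) resolves to a disabled payload, because the body field overrides the suffix.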

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://llmtr.com/v1",
    api_key="llmtr-your_key",
)

# Thinking off — fast mode
response = client.chat.completions.create(
    model="zai/glm-5.1:fast",
    messages=[{"role": "user", "content": "Write a short greeting"}],
)
print(response.choices[0].message.content)

# Thinking on — deep analysis
response = client.chat.completions.create(
    model="zai/glm-5.1:think",
    messages=[{"role": "user", "content": "Explain the time complexity of this algorithm"}],
    max_tokens=4000,
)
print(response.choices[0].message.content)
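The Python examples above use the slug suffix only. The body field can also be sent from the Python SDK through its extra_body argument, which merges additional keys into the request JSON. The helper below is a minimal sketch; build_chat_kwargs is a hypothetical name, and the final call is left commented out so the snippet runs without a network connection.

```python
from typing import Any, Dict, Optional


def build_chat_kwargs(
    model: str,
    prompt: str,
    reasoning: Optional[bool] = None,
    max_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create().

    reasoning=None leaves the body field out, so the gateway default
    (thinking off) or the slug suffix applies.
    """
    kwargs: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    if reasoning is not None:
        # extra_body is the OpenAI Python SDK's hook for non-standard fields.
        kwargs["extra_body"] = {"reasoning": reasoning}
    return kwargs


kwargs = build_chat_kwargs(
    "zai/glm-5.1",
    "Explain the time complexity of this algorithm",
    reasoning=True,
    max_tokens=4000,
)
# response = client.chat.completions.create(**kwargs)
```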

JavaScript (OpenAI SDK)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://llmtr.com/v1",
  apiKey: process.env.LLMTR_API_KEY,
});

// Thinking disabled
const fast = await client.chat.completions.create({
  model: "zai/glm-4.7:fast",
  messages: [{ role: "user", content: "Hello" }],
});

// Thinking enabled via body field
const deep = await client.chat.completions.create({
  model: "zai/glm-4.5-air",
  messages: [{ role: "user", content: "Review this code and list any bugs" }],
  // The Node SDK has no extra_body option; extra fields go at the top level of the request.
  reasoning: true,
  max_tokens: 4000,
});

max_tokens and thinking

Thinking tokens count against the max_tokens budget. With thinking enabled and a low max_tokens value, the model may exhaust the budget during its reasoning chain and return an empty response.

Recommended minimum max_tokens values:

Scenario	Recommended minimum
Thinking on, simple question	1500
Thinking on, complex question	4000+
Thinking off (:fast)	256

With thinking disabled, no reasoning tokens are spent, so max_tokens only needs to cover the visible answer.
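The recommendations in the table can be applied in client code, together with a check for the empty-response case described above. This is a sketch under stated assumptions: both function names are hypothetical, and it assumes the gateway reports the standard OpenAI finish_reason value "length" when the budget runs out, which this page does not confirm.

```python
def recommended_min_max_tokens(thinking: bool, complex_question: bool = False) -> int:
    """Return the minimum max_tokens from the table above."""
    if not thinking:
        return 256
    return 4000 if complex_question else 1500


def budget_exhausted(choice: dict) -> bool:
    """Detect a choice whose budget was used up before any answer text.

    `choice` is one entry of the response's "choices" list as a plain dict.
    Assumes finish_reason == "length" signals a max_tokens cutoff.
    """
    message = choice.get("message") or {}
    content = message.get("content") or ""
    return choice.get("finish_reason") == "length" and not content.strip()
```

A caller that sees budget_exhausted return True could retry with a larger max_tokens or with thinking turned off.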

Billing

Reasoning tokens are billed as output tokens. To reduce cost, leave thinking off (the default) or turn it off explicitly with the :fast suffix or reasoning: false. See the Billing page for details.
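Because reasoning tokens are billed as output, the effect on cost can be estimated with simple arithmetic. The sketch below is illustrative only: output_cost is a hypothetical helper, and the price and token counts are placeholders, not LLMTR rates or measured values.

```python
def output_cost(visible_tokens: int, reasoning_tokens: int,
                price_per_million_output: float) -> float:
    """Estimate output cost when reasoning tokens are billed as output.

    All inputs are caller-supplied; nothing here reflects real pricing.
    """
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens * price_per_million_output / 1_000_000


# Placeholder numbers: the same 200-token answer with and without a
# 1300-token reasoning chain, at a made-up price of 2.0 per million tokens.
with_thinking = output_cost(200, 1300, 2.0)
without_thinking = output_cost(200, 0, 2.0)
```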