
Z.AI Thinking Control

Z.AI GLM models can generate an internal reasoning chain (reasoning_content) before producing a response. This improves answer quality, but reasoning tokens are billed as output, consume the shared max_tokens budget, and increase latency.

On LLMTR, thinking is OFF by default (opt-in). A plain request spends no reasoning tokens, so short answers never cost more than you expect. Thinking only runs when you explicitly ask for it:

  1. Model slug suffix: zai/glm-5.1:think (enable) / zai/glm-5.1:fast (disable)
  2. Body field: { "reasoning": true } (enable) / { "reasoning": false } (disable)

When neither a suffix nor the body field is given, the gateway forwards thinking: { "type": "disabled" } to the Z.AI API.

| Model | LLMTR default | Explicitly enableable? |
| --- | --- | --- |
| GLM-5.1, GLM-5, GLM-5-Turbo | Off (opt-in) | Yes |
| GLM-5V-Turbo | Off (opt-in) | Yes |
| GLM-4.7, GLM-4.7-FlashX | Off (opt-in) | Yes |
| GLM-4.6, GLM-4.6V, GLM-4.6V-FlashX | Off (opt-in) | Yes |
| GLM-4.5, GLM-4.5-X, GLM-4.5-Air, GLM-4.5-AirX, GLM-4.5V | Off (opt-in) | Yes |
| GLM-OCR, GLM-4-32B-0414-128K | | No |
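
For client-side validation, the table can be encoded as a small lookup. This is a minimal sketch that assumes your code tracks the display names exactly as listed here; the API slugs (such as zai/glm-5.1) may be spelled differently, and `can_enable_thinking` is a hypothetical helper, not part of LLMTR.

```python
# Models the table lists as thinking-capable (display names, not API slugs).
THINKING_CAPABLE = {
    "GLM-5.1", "GLM-5", "GLM-5-Turbo",
    "GLM-5V-Turbo",
    "GLM-4.7", "GLM-4.7-FlashX",
    "GLM-4.6", "GLM-4.6V", "GLM-4.6V-FlashX",
    "GLM-4.5", "GLM-4.5-X", "GLM-4.5-Air", "GLM-4.5-AirX", "GLM-4.5V",
}

# Models the table lists as not enableable.
NON_THINKING = {"GLM-OCR", "GLM-4-32B-0414-128K"}

_CAPABLE_LOWER = {name.lower() for name in THINKING_CAPABLE}


def can_enable_thinking(model_name: str) -> bool:
    """Return True if the table lists the model as thinking-capable.

    Hypothetical helper: the comparison is case-insensitive and uses
    display names only.
    """
    return model_name.strip().lower() in _CAPABLE_LOWER
```

A check like this lets a client avoid sending :think or reasoning: true to a model that cannot use it.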

Behaviour is identical across every thinking-capable GLM model: the model produces a reasoning chain when :think or reasoning: true is sent, and answers directly otherwise.

The :fast suffix disables thinking. Use it for latency-sensitive requests or when max_tokens is constrained.

curl https://llmtr.com/v1/chat/completions \
  -H "Authorization: Bearer llmtr-your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai/glm-5.1:fast",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

The :think suffix explicitly enables thinking (required for deep analysis, since the default is off):

"model": "zai/glm-4.5-air:think"

The reasoning body field works the same way: reasoning: false disables thinking and reasoning: true enables it. When both a suffix and the body field are provided, the body field takes precedence.
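
The precedence rules on this page (body field, then suffix, then the off-by-default fallback) can be sketched as a small resolver. This is only an illustration of the documented behaviour, not the gateway's actual code: `resolve_thinking` is a hypothetical name, the "enabled" payload value is assumed by symmetry with the documented "disabled" value, and stripping the suffix before forwarding is also an assumption.

```python
from typing import Optional, Tuple


def resolve_thinking(model: str, reasoning: Optional[bool] = None) -> Tuple[str, dict]:
    """Return (base_model, thinking_payload) for a request.

    Precedence: body field > slug suffix > default (off).
    """
    base, sep, suffix = model.rpartition(":")
    if sep and suffix in ("think", "fast"):
        from_suffix: Optional[bool] = suffix == "think"
    else:
        # No recognised suffix: keep the slug untouched.
        base, from_suffix = model, None

    if reasoning is not None:
        enabled = reasoning        # body field wins
    elif from_suffix is not None:
        enabled = from_suffix      # then the suffix
    else:
        enabled = False            # LLMTR default: thinking off

    # "disabled" is documented on this page; "enabled" is assumed.
    return base, {"type": "enabled" if enabled else "disabled"}
```

For instance, `resolve_thinking("zai/glm-5.1:think", reasoning=False)` resolves to thinking disabled, because the body field overrides the suffix.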

curl https://llmtr.com/v1/chat/completions \
  -H "Authorization: Bearer llmtr-your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai/glm-5.1",
    "reasoning": false,
    "messages": [
      {"role": "user", "content": "Quick response please"}
    ]
  }'

Python (OpenAI SDK):

from openai import OpenAI

client = OpenAI(
    base_url="https://llmtr.com/v1",
    api_key="llmtr-your_key",
)

# Thinking off (fast mode)
response = client.chat.completions.create(
    model="zai/glm-5.1:fast",
    messages=[{"role": "user", "content": "Write a short greeting"}],
)
print(response.choices[0].message.content)

# Thinking on (deep analysis)
response = client.chat.completions.create(
    model="zai/glm-5.1:think",
    messages=[{"role": "user", "content": "Explain the time complexity of this algorithm"}],
    max_tokens=4000,
)
print(response.choices[0].message.content)

Node.js (OpenAI SDK):

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://llmtr.com/v1",
  apiKey: process.env.LLMTR_API_KEY,
});

// Thinking disabled
const fast = await client.chat.completions.create({
  model: "zai/glm-4.7:fast",
  messages: [{ role: "user", content: "Hello" }],
});

// Thinking enabled via body field.
// The Node SDK has no extra_body option (that is Python-only);
// unknown fields placed directly in the params are sent as-is.
const deep = await client.chat.completions.create({
  model: "zai/glm-4.5-air",
  messages: [{ role: "user", content: "Review this code and list any bugs" }],
  // @ts-expect-error gateway-specific field, not in the SDK types
  reasoning: true,
  max_tokens: 4000,
});
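
The Python examples above toggle thinking only through the model suffix. To send the reasoning body field from Python instead, the OpenAI SDK's `extra_body` argument can carry it. The sketch below only builds the keyword arguments, so it runs without a network call; `build_chat_kwargs` is a hypothetical helper introduced for illustration.

```python
from typing import Any, Dict, Optional


def build_chat_kwargs(
    model: str,
    prompt: str,
    reasoning: Optional[bool] = None,
    max_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Build kwargs for client.chat.completions.create (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if reasoning is not None:
        # extra_body merges additional keys into the JSON request body.
        kwargs["extra_body"] = {"reasoning": reasoning}
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    return kwargs


# Usage (needs the openai package and a valid key):
# response = client.chat.completions.create(
#     **build_chat_kwargs("zai/glm-4.5-air", "Review this code",
#                         reasoning=True, max_tokens=4000)
# )
```

Leaving `reasoning` as None omits the field entirely, so the suffix (or the off-by-default behaviour) decides.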

Thinking tokens count against the max_tokens budget. With thinking enabled and a low max_tokens value, the model may exhaust the budget during its reasoning chain and return an empty response.
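
A client can detect this failure mode and retry with a larger budget. The check below is a sketch that assumes the gateway follows the usual OpenAI-compatible convention of reporting finish_reason "length" when max_tokens is hit; that is not confirmed on this page. It works on a plain dict (for example, the result of calling `model_dump()` on a choice object from the Python SDK), and `reasoning_exhausted_budget` is a hypothetical name.

```python
def reasoning_exhausted_budget(choice: dict) -> bool:
    """Return True if the answer is empty and generation stopped at the token limit.

    Assumes an OpenAI-style choice dict with "finish_reason" and "message".
    """
    message = choice.get("message") or {}
    content = (message.get("content") or "").strip()
    return choice.get("finish_reason") == "length" and not content
```

When this returns True, retrying with a higher max_tokens or with thinking disabled is the natural next step.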

Recommended minimum max_tokens values:

| Scenario | Recommended minimum |
| --- | --- |
| Thinking on, simple question | 1,500 |
| Thinking on, complex question | 4,000+ |
| Thinking off (:fast) | 256 |

With thinking disabled, no reasoning tokens are spent, so max_tokens only needs to cover the visible answer.
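
The recommended minimums can be applied as a floor on whatever max_tokens the caller asks for. A minimal sketch, with `effective_max_tokens` as a hypothetical helper; deciding whether a question counts as "complex" is left to the caller.

```python
def effective_max_tokens(requested: int, thinking: bool, complex_question: bool = False) -> int:
    """Raise max_tokens to the recommended minimum for the scenario.

    Floors come from the recommendations on this page:
    thinking on + simple = 1500, thinking on + complex = 4000, thinking off = 256.
    """
    if not thinking:
        floor = 256
    elif complex_question:
        floor = 4000
    else:
        floor = 1500
    return max(requested, floor)
```

Values above the floor pass through unchanged, so the helper never lowers a budget the caller chose deliberately.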

Reasoning tokens are billed as output tokens. To reduce cost, disable thinking with the :fast suffix. See the Billing page for details.
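
To see how reasoning tokens affect the bill, the output cost can be estimated as (answer tokens + reasoning tokens) times the output price. The sketch below uses a purely hypothetical price argument; real prices are on the Billing page, and `output_cost` is an illustrative name only.

```python
def output_cost(answer_tokens: int, reasoning_tokens: int, price_per_million_output: float) -> float:
    """Estimate output cost; reasoning tokens are billed at the output rate."""
    billed_tokens = answer_tokens + reasoning_tokens
    return billed_tokens * price_per_million_output / 1_000_000


# Hypothetical price of 2.0 per million output tokens (NOT a real LLMTR price).
with_thinking = output_cost(answer_tokens=200, reasoning_tokens=1800, price_per_million_output=2.0)
without_thinking = output_cost(answer_tokens=200, reasoning_tokens=0, price_per_million_output=2.0)
```

In this made-up case the same 200-token answer is billed for ten times as many output tokens with thinking on, which is why :fast is the cheaper choice for short replies.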