Z.AI Thinking Control
Z.AI GLM models can generate an internal reasoning chain (reasoning_content) before producing a response.
This improves answer quality, but reasoning tokens are billed as output, consume the shared max_tokens
budget, and increase latency.
On LLMTR, thinking is OFF by default (opt-in). A plain request spends no reasoning tokens, so short answers never cost more than you expect. Thinking only runs when you explicitly ask for it:
- Model slug suffix —
zai/glm-5.1:think(enable) /zai/glm-5.1:fast(disable) - Body field —
{ "reasoning": true }(enable) /{ "reasoning": false }(disable)
When neither suffix nor body is given, the gateway forwards thinking: { "type": "disabled" } to the Z.AI API.
Which models support thinking control?
Section titled “Which models support thinking control?”| Model | LLMTR default | Explicitly enableable? |
|---|---|---|
| GLM-5.1, GLM-5, GLM-5-Turbo | Off (opt-in) | Yes |
| GLM-5V-Turbo | Off (opt-in) | Yes |
| GLM-4.7, GLM-4.7-FlashX | Off (opt-in) | Yes |
| GLM-4.6, GLM-4.6V, GLM-4.6V-FlashX | Off (opt-in) | Yes |
| GLM-4.5, GLM-4.5-X, GLM-4.5-Air, GLM-4.5-AirX, GLM-4.5V | Off (opt-in) | Yes |
| GLM-OCR, GLM-4-32B-0414-128K | — | No |
Behaviour is identical across every thinking-capable GLM model: the model produces a reasoning chain when
:think or reasoning: true is sent, and answers directly otherwise.
Disabling thinking via slug suffix
Section titled “Disabling thinking via slug suffix”The :fast suffix disables thinking. Use it for latency-sensitive requests or when max_tokens is constrained.
curl https://llmtr.com/v1/chat/completions \ -H "Authorization: Bearer llmtr-your_key" \ -H "Content-Type: application/json" \ -d '{ "model": "zai/glm-5.1:fast", "messages": [ {"role": "user", "content": "Hello"} ] }'The :think suffix explicitly enables thinking (required for deep analysis, since the default is off):
"model": "zai/glm-4.5-air:think"Controlling via body field
Section titled “Controlling via body field”reasoning: false disables thinking, reasoning: true enables it.
When both suffix and body field are provided, the body field takes precedence.
curl https://llmtr.com/v1/chat/completions \ -H "Authorization: Bearer llmtr-your_key" \ -H "Content-Type: application/json" \ -d '{ "model": "zai/glm-5.1", "reasoning": false, "messages": [ {"role": "user", "content": "Quick response please"} ] }'Python (OpenAI SDK)
Section titled “Python (OpenAI SDK)”from openai import OpenAI
client = OpenAI( base_url="https://llmtr.com/v1", api_key="llmtr-your_key",)
# Thinking off — fast moderesponse = client.chat.completions.create( model="zai/glm-5.1:fast", messages=[{"role": "user", "content": "Write a short greeting"}],)print(response.choices[0].message.content)
# Thinking on — deep analysisresponse = client.chat.completions.create( model="zai/glm-5.1:think", messages=[{"role": "user", "content": "Explain the time complexity of this algorithm"}], max_tokens=4000,)print(response.choices[0].message.content)JavaScript (OpenAI SDK)
Section titled “JavaScript (OpenAI SDK)”import OpenAI from "openai";
const client = new OpenAI({ baseURL: "https://llmtr.com/v1", apiKey: process.env.LLMTR_API_KEY,});
// Thinking disabledconst fast = await client.chat.completions.create({ model: "zai/glm-4.7:fast", messages: [{ role: "user", content: "Hello" }],});
// Thinking enabled via body fieldconst deep = await client.chat.completions.create({ model: "zai/glm-4.5-air", messages: [{ role: "user", content: "Review this code and list any bugs" }], extra_body: { reasoning: true }, max_tokens: 4000,});max_tokens and thinking
Section titled “max_tokens and thinking”Thinking tokens count against the max_tokens budget. With thinking enabled and a low max_tokens value,
the model may exhaust the budget during its reasoning chain and return an empty response.
Recommended minimum max_tokens values:
| Scenario | Recommended minimum |
|---|---|
| Thinking on, simple question | 1 500 |
| Thinking on, complex question | 4 000+ |
Thinking off (:fast) | 256 |
With thinking disabled no reasoning tokens are spent; standard token counts are sufficient.
Billing
Section titled “Billing”Reasoning tokens are billed as output tokens.
To reduce cost, disable thinking with the :fast suffix.
See the Billing page for details.