Claude API Token Limits: Handling Errors in Production
The Claude API enforces two separate limits: a per-request context limit (how many tokens a single call can contain) and a rate limit (tokens per minute across all calls). Exceeding the first returns a 400 error; exceeding the second returns a 429. The fixes differ: per-request limits call for prompt compression, rate limits call for backoff and batching.
When building with the Claude API, you will encounter token limit errors in two distinct forms. The first is a 400 error with a message about exceeding the model context window — you are sending too many tokens in a single request. The second is a 429 error indicating you have exceeded your rate limit — too many tokens per minute across all requests. Each requires a different fix.
Per-request context window limits by model
| Model | Context window (input) | Max output tokens |
|---|---|---|
| claude-opus-4-6 | 1,048,576 tokens | 32,000 tokens |
| claude-sonnet-4-6 | 1,048,576 tokens | 16,000 tokens |
| claude-haiku-4-5 | 200,000 tokens | 8,192 tokens |
How to fix 400 context limit errors
A 400 error means your prompt plus conversation history exceeds the model's context window. The most reliable fix is compressing what you send before it reaches the API. For text content, the Token Limits REST API accepts raw text and returns compressed output — call it before sending to Claude. For code and logs, strip timestamps, blank lines, and repeated paths before including them in prompts.
- ✓ Compress file contents and logs via POST https://tokenlimits.app/api/compress before including them in prompts
- ✓ Truncate conversation history to the last N turns instead of sending all history
- ✓ Use rolling window summaries — summarize old turns with Haiku, keep only the summary
- ✓ Split large documents into chunks and process them sequentially instead of all at once
- ✓ Use claude-haiku-4-5 for chunked pre-processing steps where full reasoning is not needed
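The log-stripping and history-truncation steps above can be sketched in a few lines of Python. The regex and helper names here are illustrative, not part of any SDK:

```python
import re

# Matches a leading ISO-style timestamp such as "2025-01-02 10:00:00"
# or "2025-01-02T10:00:00.123Z" at the start of a log line.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*\s*")

def strip_log_noise(log: str) -> str:
    """Drop timestamps, blank lines, and consecutive duplicate lines."""
    kept: list[str] = []
    for line in log.splitlines():
        line = TIMESTAMP.sub("", line).rstrip()
        if line and (not kept or line != kept[-1]):
            kept.append(line)
    return "\n".join(kept)

def truncate_history(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep only the most recent turns before sending to the API."""
    return messages if len(messages) <= max_turns else messages[-max_turns:]
```

Run `strip_log_noise` on tool outputs and `truncate_history` on the messages array just before the API call, so the savings apply to every request.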
How to fix 429 rate limit errors
Rate limits are measured in tokens per minute (TPM) and requests per minute (RPM). The limits vary by tier and model — check your Anthropic console for your current tier limits. When you hit a 429, the response includes a Retry-After header. Implement exponential backoff starting at the value in that header.
- Read the Retry-After header from the 429 response
- Wait that many seconds before retrying (do not hammer the API)
- Implement exponential backoff: double the wait on each subsequent 429
- Add jitter (random offset) to avoid synchronized retries across multiple workers
- Queue requests and process at a controlled rate if you need sustained high throughput
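The retry schedule described above can be sketched as a small Python helper. The function name and the 1-second jitter range are choices for this example, not a prescribed API:

```python
import random

def backoff_delays(retry_after: float, max_retries: int = 5) -> list[float]:
    """Build the wait schedule for successive 429s: start at the
    Retry-After value, double on each retry, and add up to 1s of
    random jitter so parallel workers do not retry in lockstep."""
    delays: list[float] = []
    wait = retry_after
    for _ in range(max_retries):
        delays.append(wait + random.uniform(0.0, 1.0))
        wait *= 2
    return delays
```

In a real client you would `time.sleep(delay)` before each retry and surface the error once the schedule is exhausted.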
AWS Bedrock and Claude token limits
AWS Bedrock enforces its own token quotas separate from Anthropic's direct API. The default Bedrock quota for Claude models is lower than the direct API and varies by region. You can request a quota increase through the AWS Service Quotas console. Bedrock returns a ThrottlingException (HTTP 429) when limits are hit — handle it the same way as the direct API 429 with exponential backoff.
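Since Bedrock surfaces throttling as a named exception rather than a bare status code, a small helper can centralize the retry decision. This is a sketch: the function name and the 60-second cap are choices for this example, and with boto3 the error code would come from `err.response["Error"]["Code"]`:

```python
# Error codes worth retrying; Bedrock reports throttling as "ThrottlingException".
RETRYABLE = {"ThrottlingException"}

def retry_delay(error_code: str, attempt: int, max_attempts: int = 5):
    """Return seconds to wait before retry number `attempt` (0-based),
    or None to give up. Exponential backoff: 1s, 2s, 4s, ... capped at 60s."""
    if error_code not in RETRYABLE or attempt >= max_attempts:
        return None
    return min(float(2 ** attempt), 60.0)
```

The same helper works for the direct API if you map HTTP 429 onto a retryable code before calling it.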
Using the Token Limits compression API in your app
Token Limits exposes a REST compression endpoint you can call from any language. POST your raw text to https://tokenlimits.app/api/compress with your license key in the request body. The response is the compressed text. Include this in your prompt pipeline before sending to Claude. The API handles up to 100 requests per minute.
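A stdlib-only Python sketch of that call follows. The JSON field names (`text`, `license_key`) are assumptions for illustration, so check the Token Limits docs for the exact request schema:

```python
import json
import urllib.request

COMPRESS_URL = "https://tokenlimits.app/api/compress"

def build_compress_request(text: str, license_key: str) -> urllib.request.Request:
    # Field names are assumed; adjust to match the actual API schema.
    body = json.dumps({"text": text, "license_key": license_key}).encode("utf-8")
    return urllib.request.Request(
        COMPRESS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def compress(text: str, license_key: str, timeout: float = 10.0) -> str:
    """POST raw text and return the compressed text from the response body."""
    req = build_compress_request(text, license_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Call `compress()` on each large input in your prompt pipeline, and stay under the endpoint's 100 requests/minute limit by batching inputs where possible.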
Compress Claude API prompts before they hit token limits
Token Limits REST API compresses prompts and tool outputs by 60-80%. Call it before sending to Claude to reduce token usage and avoid 400/429 errors. Free trial, 50 requests.
FAQ
What is the Claude API token limit per request?
Claude Opus 4.6 and Sonnet 4.6 accept up to 1,048,576 input tokens per request. Claude Haiku 4.5 accepts up to 200,000. Output is limited to 32,000 tokens for Opus 4.6 and 16,000 for Sonnet 4.6.
What does a Claude API 429 error mean?
You have exceeded your tokens-per-minute or requests-per-minute rate limit. The response includes a Retry-After header. Wait that many seconds before retrying, and implement exponential backoff for sustained high load.
How do I reduce Claude API token usage?
Compress inputs before sending: strip timestamps, blank lines, repeated content, and verbose formatting. Use the Token Limits REST API to automate this. Also truncate conversation history and use cheaper models (Haiku) for pre-processing steps.
Do AWS Bedrock token limits differ from the direct Claude API?
Yes. AWS Bedrock has its own quota system per region, separate from Anthropic's direct API. Default Bedrock quotas are often lower. Request increases through the AWS Service Quotas console if you need higher throughput.
What is the best way to handle long documents in the Claude API?
For documents that exceed 100k tokens, compress first (60-80% savings), then chunk and process sequentially if still too large. Use Haiku to pre-summarize chunks, then pass summaries to Opus or Sonnet for final reasoning.
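That chunk-then-summarize pipeline starts with a splitter. A minimal paragraph-boundary version is sketched below; the ~4 characters per token figure is a rough heuristic, not an exact tokenizer:

```python
def chunk_text(text: str, max_chars: int = 400_000) -> list[str]:
    """Split text on paragraph boundaries so each chunk stays under
    max_chars. At ~4 chars/token, 400k chars is on the order of
    100k tokens."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in text.split("\n\n"):
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2  # +2 for the paragraph separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk would then be summarized with claude-haiku-4-5, and the concatenated summaries passed to Opus or Sonnet for the final reasoning pass.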