Overview
What Is CloudRift Inference?
CloudRift Inference lets you run large-language-model (LLM) and general AI inference on demand, powered by high-performance AMD and NVIDIA GPUs.
- Pay-as-you-go – billing per million tokens, no reserved capacity needed
- Low-latency endpoints ready for production workloads
Quick Start (REST API)
CloudRift endpoints are OpenAI-compatible – you can drop them into any OpenAI client by changing the base URL and model name.
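For example, here is a minimal Python sketch using the official openai package pointed at the CloudRift base URL. The endpoint and model name are taken from the curl example below; the RIFT_API_KEY environment variable is an illustrative assumption.

from openai import OpenAI
import os

# Point the standard OpenAI client at the CloudRift endpoint.
client = OpenAI(
    base_url="https://inference.cloudrift.ai/v1",
    api_key=os.environ["RIFT_API_KEY"],  # your CloudRift API token
)

response = client.chat.completions.create(
    model="llama4:maverick",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
)
print(response.choices[0].message.content)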
Create an API Token
- Sign in → APIs → Generate Token
- Copy and store the token securely; it cannot be shown again later.
Call the /v1/chat/completions Endpoint
curl -X POST https://inference.cloudrift.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_RIFT_API_KEY" \
  -d '{
    "model": "llama4:maverick",
    "messages": [
      {"role": "user", "content": "What is the meaning of life?"}
    ],
    "stream": true
  }'
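Because the request sets "stream": true, an OpenAI-compatible endpoint returns the reply incrementally as Server-Sent Events: one data: line per delta, terminated by data: [DONE]. The exact fields follow the OpenAI chunk format and may vary; a typical fragment looks roughly like:

data: {"choices":[{"index":0,"delta":{"content":"The"}}]}
data: {"choices":[{"index":0,"delta":{"content":" meaning"}}]}
data: [DONE]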
Supported Parameters
The API supports the same request-body fields as OpenAI’s chat/completions specification (e.g. temperature, top_p, max_tokens, stream).
For the full list see the OpenAI API reference.
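As an illustration, a request that sets a few of these fields might look like the following sketch (same assumed Python client as in the Quick Start; the parameter values are arbitrary):

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.cloudrift.ai/v1",
    api_key=os.environ["RIFT_API_KEY"],  # illustrative token variable
)

response = client.chat.completions.create(
    model="llama4:maverick",
    messages=[{"role": "user", "content": "Summarize SSE in one sentence."}],
    temperature=0.7,  # sampling temperature
    top_p=0.9,        # nucleus-sampling cutoff
    max_tokens=256,   # cap on generated tokens
)
print(response.choices[0].message.content)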
Models & Pricing
Token pricing and context limits are listed on the live Models & Pricing page.
Prices may change as we add new checkpoints or hardware generations; check that page for the latest rates.
FAQ
Do you support server-side streaming?
Yes. Send "stream": true; the response is delivered as text/event-stream.
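If you use the openai Python SDK, it parses the event stream for you. A minimal sketch, with the same assumptions as the Quick Start example:

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.cloudrift.ai/v1",
    api_key=os.environ["RIFT_API_KEY"],  # illustrative token variable
)

# stream=True yields chunks as the server emits them.
stream = client.chat.completions.create(
    model="llama4:maverick",
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()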
What latency should I expect?
Typical first-token latency is ≈ 120 ms on current 8B models; larger checkpoints take longer.