Overview

What Is CloudRift Inference?

CloudRift Inference lets you run large-language-model (LLM) and general AI inference on demand, powered by high-performance AMD and NVIDIA GPUs.

  • Pay-as-you-go – billing per million tokens, no reserved capacity needed
  • Low-latency endpoints ready for production workloads

Quick Start (REST API)

CloudRift endpoints are OpenAI-compatible – you can drop them into any OpenAI client by changing the base URL and model name.

  1. Create an API Token
    – Sign in → APIs → Generate Token
    – Copy and store the token securely, as it cannot be shown again later.

  2. Call the /v1/chat/completions endpoint

curl -X POST https://inference.cloudrift.ai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_RIFT_API_KEY" \
  -d '{
    "model": "llama4:maverick",
    "messages": [
      {"role": "user", "content": "What is the meaning of life?"}
    ],
    "stream": true
  }'
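
Because the endpoints are OpenAI-compatible, the same request can also be sent with the official openai Python client by pointing it at the CloudRift base URL. The sketch below assumes https://inference.cloudrift.ai/v1 as the base URL (derived from the curl example above) and that your token is stored in a RIFT_API_KEY environment variable; the variable name is illustrative, not part of the documented API.

import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://inference.cloudrift.ai/v1",   # assumed base URL, per the curl example
    api_key=os.environ["RIFT_API_KEY"],             # assumed environment variable holding your token
)

response = client.chat.completions.create(
    model="llama4:maverick",
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
)
print(response.choices[0].message.content)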

Supported Parameters

The API supports the same request-body fields as OpenAI’s chat/completions specification (e.g. temperature, top_p, max_tokens, stream).
For the full list see the OpenAI API reference.
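
As an illustration, a request that tunes sampling might look like the sketch below; the parameter values are arbitrary, and the client setup repeats the assumptions from the Quick Start sketch.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.cloudrift.ai/v1",   # assumed base URL
    api_key=os.environ["RIFT_API_KEY"],             # assumed environment variable
)

# Illustrative values; the accepted fields mirror OpenAI's chat/completions spec.
response = client.chat.completions.create(
    model="llama4:maverick",
    messages=[{"role": "user", "content": "Explain pay-as-you-go billing in one sentence."}],
    temperature=0.7,   # higher values give more varied sampling
    top_p=0.9,         # nucleus-sampling cutoff
    max_tokens=256,    # cap on generated tokens
)
print(response.choices[0].message.content)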


Models & Pricing

Token pricing and context limits are listed on the live Models & Pricing page.
Prices may change as we add new checkpoints or hardware generations; check that page for the latest rates.


FAQ

Do you support server-side streaming?

Yes. Set "stream": true in the request body; the response is delivered as text/event-stream (server-sent events).
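
With the openai Python client used in the sketches above, the event stream can be consumed chunk by chunk as shown below (same base-URL and API-key assumptions as before).

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.cloudrift.ai/v1",   # assumed base URL
    api_key=os.environ["RIFT_API_KEY"],             # assumed environment variable
)

stream = client.chat.completions.create(
    model="llama4:maverick",
    messages=[{"role": "user", "content": "Tell a one-line joke."}],
    stream=True,  # request server-side streaming
)
for chunk in stream:
    # Each server-sent event carries an incremental delta; print tokens as they arrive.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()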

What latency should I expect?

Typical first-token latency is ≈ 120 ms for current 8B-parameter models; larger checkpoints take longer.