The best price for scaling your Azure OpenAI applications

Volume Discount Calculator

Enter your monthly consumption to see your discount and potential savings*.

Model                   Input price** (per 1M tokens)   Output price** (per 1M tokens)
GPT-4o                  $4.38                           $13.13
GPT-4o-mini             $4.38                           $13.13
GPT-4-Turbo             $8.75                           $26.25
GPT-3.5-Turbo           $0.42                           $1.27
Llama 3.2 1B            $4.38                           $13.13
Llama 3.2 3B            $4.38                           $13.13
Llama 3.2 11B Vision    $4.38                           $13.13
Mixtral 8.7b            $4.38                           $13.13
Gemma 2.9b              $4.38                           $13.13
*Potential savings include additional average savings of 20% from CogCache serving cached responses. Actual savings may vary by use case.
**Effective price per 1M tokens
Explore Pricing

Discount tiers by monthly spend. The same tiers apply to every model: GPT-4o, GPT-4o-mini, GPT-4-Turbo, GPT-3.5-Turbo, Llama 3.1 70b, Llama 3.1 8b, Llama 3.2 1B, Llama 3.2 3B, Llama 3.2 11B Vision, Llama 3 70B, Llama 3 8B, Mixtral 8.7b, and Gemma 2.9b.

From        To          Discount
$0          $2,500      25%
$2,500      $5,000      27.5%
$5,000      $12,500     30%
$12,500     $25,000     32.5%
$25,000     Unlimited   40%

Cognitive LLM Caching

Where AI goes to scale
  • Save up to 50% on your inference costs
  • Drive up to 100x faster response times
  • Access the latest frontier models with no capacity limits
  • Gain full LLM control and alignment
  • Keep your AI traffic completely secure with Private Endpoints
Get started with $20 of free tokens

Single pane of glass for all your inference workloads

Get started in minutes

No code changes needed: simply swap your endpoint to CogCache and start your acceleration and savings journey.

Quickstart Guide >
Python (OpenAI SDK)

from openai import OpenAI

COGCACHE_LLM_MODEL = ""  # the model of choice
COGCACHE_API_KEY = ""  # the generated CogCache API key

client = OpenAI(
    base_url = "https://proxy-api.cogcache.com/v1/",
    api_key = COGCACHE_API_KEY,  # not needed if the key is already set via an environment variable
    default_headers = {
        "Authorization": f"Bearer {COGCACHE_API_KEY}",
    },
)

response = client.chat.completions.create(
    model = COGCACHE_LLM_MODEL,
    stream = True,
    messages = [
        {
            "role": "system",
            "content": "Assistant is a large language model trained by OpenAI.",
        },
        {
           "role": "user",
           "content": "Write a blog post about Generative AI"
        },
    ],
)

for chunk in response:
    print(chunk)
Python (LangChain)

from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

COGCACHE_LLM_MODEL = ""  # the model of choice
COGCACHE_API_KEY = ""  # the generated CogCache API key

model = ChatOpenAI(
    base_url = "https://proxy-api.cogcache.com/v1/",
    model = COGCACHE_LLM_MODEL,
    openai_api_key = COGCACHE_API_KEY,
    default_headers = {
        "Authorization": f"Bearer {COGCACHE_API_KEY}"
    },
)

response = model.stream(
    [
        SystemMessage(content="Assistant is a large language model trained by OpenAI."),
        HumanMessage(content="Write a blog post about Generative AI"),
    ],
)

for chunk in response:
    print(chunk.content)
cURL

curl --location 'https://proxy-api.cogcache.com/v1/chat/completions' \
--header 'Authorization: Bearer {COGCACHE_API_KEY}' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {
            "role": "system",
            "content": "Assistant is a large language model trained by OpenAI."
        },
        {
            "role": "user",
            "content": "Write a blog post about Generative AI"
        }
    ],
    "model": "COGCACHE_LLM_MODEL”,
    "stream": true
}'
Node.js (OpenAI SDK)

import OpenAI from "openai";

const COGCACHE_LLM_MODEL = ""; // the model of choice
const COGCACHE_API_KEY = ""; // the generated CogCache API key

const openai = new OpenAI({
  baseURL: "https://proxy-api.cogcache.com/v1/",
  apiKey: COGCACHE_API_KEY,
  defaultHeaders: {
    Authorization: `Bearer ${COGCACHE_API_KEY}`,
  },
});

async function main() {
  try {
    const response = await openai.chat.completions.create({
      messages: [
        {
          role: "system",
          content: "Assistant is a large language model trained by OpenAI.",
        },
        { 
          role: "user", 
          content: "Write a blog post about Generative AI" 
        },
      ],
      model: COGCACHE_LLM_MODEL,
      stream: true,
    });

    // Check if response is async iterable
    if (response[Symbol.asyncIterator]) {
      for await (const chunk of response) {        
        const text = chunk.choices[0]?.delta?.content; // Adjust based on actual response structure
        if (text) {
          console.log(text);
        }
      }
    } else {
      console.log('Response is not an async iterable');
    }

  } catch (error) {
    console.error('An error occurred:', error);
  }
}

main();
Node.js (LangChain)

import { ChatOpenAI } from "@langchain/openai";

const COGCACHE_LLM_MODEL = ""; // the model of choice
const COGCACHE_API_KEY = ""; // the generated CogCache API key

const model = new ChatOpenAI({
  configuration: {
    baseURL: "https://proxy-api.cogcache.com/v1/",
    baseOptions: {
      headers: {
        "Authorization": `Bearer ${COGCACHE_API_KEY}`,
      },
    },
  },
  apiKey: COGCACHE_API_KEY,
  model: COGCACHE_LLM_MODEL,
  streaming: true,
});

(async () => {
  const response = await model.invoke(
    [
       ["system", "Assistant is a large language model trained by OpenAI."],
       ["user", "Write a blog post about Generative AI"] ,
    ],
    {
      callbacks: [
        {
          handleLLMNewToken: async (token) => {
            console.log("handleLLMNewToken", token);
          },
          handleLLMEnd: async (output) => {
            console.log("handleLLMEnd", output);
          },
          handleChatModelStart: (...chat) => {
            console.log("handleChatModelStart", chat);
          },
        },
      ],
    }
  );
})().catch((err) => {
  console.error("Something went wrong:", err);
});

Harness the power of Cognitive Caching

Say hello to the lowest cost on the market for inference tokens.

CogCache works as a proxy between your AI applications and your LLMs, accelerating content generation through caching results, cutting costs and speeding responses by eliminating the need to consume tokens on previously generated content.
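To make the caching idea concrete, here is a minimal, illustrative sketch (not CogCache's actual implementation): requests are keyed on the model plus the message history, and a repeated request is answered from the cache instead of consuming tokens on the LLM. The in-memory store and the call_llm helper below are hypothetical placeholders.

import hashlib
import json

# Hypothetical in-memory store; a real caching proxy would use a persistent, shared cache.
_cache: dict[str, str] = {}

def _cache_key(model: str, messages: list) -> str:
    """Key a request on the model plus the full message history."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(model: str, messages: list, call_llm) -> str:
    """Serve a repeated request from the cache; otherwise call the LLM and store the result."""
    key = _cache_key(model, messages)
    if key in _cache:
        return _cache[key]  # cache hit: no tokens consumed, near-instant response
    result = call_llm(model, messages)  # cache miss: forward the request to the upstream LLM
    _cache[key] = result
    return result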

Do more with less

CogCache dramatically reduces inference costs while keeping your GenAI solutions blazing fast, aligned and transparent.

Stay in control

Directly control, refine and edit the output of generative AI applications with a self-healing cache that mitigates misaligned responses.

Deploy in minutes

With one line of code you can equip your team to control the entire GenAI lifecycle: from rapid deployment to real-time governance to continuous optimization.

How it Works

Query Volume Tracking

We monitor the total number of queries per day to our AI models.

Cache Yield

Our system identifies and caches repetitive queries, serving them from the cache instead of the LLM.

LLM Query Management

This reduction in direct LLM calls minimizes the computational load, leading to lower operational costs.

Aggregate Savings

Over time, the cumulative savings from reduced LLM queries add up, driving down overall costs and improving efficiency.
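
As a rough illustration of how cache yield turns into aggregate savings, the sketch below uses made-up example numbers (the query volume, hit rate, and per-query cost are assumptions, not CogCache figures):

# Illustrative arithmetic only; all numbers below are example assumptions.
queries_per_day = 100_000
cache_hit_rate = 0.20        # fraction of queries served from the cache
cost_per_llm_query = 0.002   # dollars per query that actually reaches the LLM

llm_queries = queries_per_day * (1 - cache_hit_rate)
cost_without_cache = queries_per_day * cost_per_llm_query
cost_with_cache = llm_queries * cost_per_llm_query

print(f"LLM queries avoided per day: {queries_per_day - llm_queries:,.0f}")  # 20,000
print(f"Daily savings: ${cost_without_cache - cost_with_cache:,.2f}")        # $40.00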

Slash Your GenAI Carbon Footprint by up to 50%

Reduce Energy Consumption

Lower your energy usage and costs by up to 50% with our innovative Cognitive Caching technology. Scale your conversational AI without escalating its environmental impact.

Accelerate AI Responses

Experience 100x faster interactions without the need for energy-intensive operations. Enable your users to get quicker, more efficient responses.

A Sustainable Future

Cognitive Caching is more than a quick fix; it's a paradigm shift. Lead the way in sustainable tech innovation and create a positive impact on our planet.

Reduce Costs and Carbon Footprint

Save up to 50% on your LLM costs with our reserved capacity and cut your carbon footprint by over 50%, making your AI operations more sustainable and cost-effective.

Boost Performance

Experience lightning-fast, predictable performance with response times accelerated by up to 100x, ensuring smooth and efficient operations of your LLMs via Cognitive Caching.

Drive Control and Alignment

Maintain complete oversight on all LLM text generated, ensuring alignment and grounding of responses to uphold your brand integrity and comply with governance requirements.

Full-stack LLM Observability

Gain real-time insights, track key performance metrics and view all the logged requests for easy debugging.

Seamless Integration, Endless scalability.

Fast, Safe and Cost-effective.

Instant Implementation

Switch your code endpoints with the supplied key, and you're set.

Resolution Engine

Ensure every interaction with your AI content is traceable and secure.

Multilingual Support

Supports multiple languages, expanding your global reach.

Data Integrations

Integrates effortlessly with your existing business systems.

Guaranteed Capacity

CogCache ensures availability of Azure OpenAI tokens thanks to our reserved capacity.

Predictability

Eliminate hallucinations and guarantee accuracy in your prompt responses.

Security

CogCache acts like a firewall for your LLM, blocking prompt injections and any attempts to jailbreak it.

Savings

Slash your Azure OpenAI costs by up to 50% with volume discounting and cognitive caching.

Private Endpoints

Secure, direct access through your private network - ensuring your AI traffic never touches the public internet.

Before / After

Standard AI Challenges → COGCACHE ACTIVATED
  • Unpredictable and slow LLM response times → Hyper-Fast Cache Retrieval.
  • Stochastic results yielding different responses every time → Self-Healing Cache.
  • AI grounding issues are impossible to detect and address → Asynchronous Risk & Quality Scoring.
  • AI safety risks, biased and unaligned responses → Temporal Relevance Tracking.
  • Lack of explainability, accountability & transparency → Full Workflow Interface for Your Responsible AI Teams.
  • No cost-effective way to consume tokens for repeated prompts → DCAI and DCAI Amendment Updates.
  • No easy way to monitor token consumption.
  • Hard to understand and predict Azure OpenAI response patterns.

FAQ

What is a token?

You can think of tokens as pieces of words used for natural language processing. For English text, 1 token is approximately 4 characters or 0.75 words.
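
As a quick way to apply that rule of thumb in code, here is a rough estimator (the 4-characters-per-token ratio is only an approximation; exact counts depend on the model's tokenizer):

def estimate_tokens(text: str) -> int:
    """Rough estimate for English text: ~4 characters (or ~0.75 words) per token."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Write a blog post about Generative AI"))  # prints 9 (a rough estimate)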

What Is LLM Response Caching?

LLM response caching is a method for optimizing the efficiency and performance of large language models (LLMs) like those from OpenAI and Azure OpenAI. By caching LLM responses, frequently generated content can be stored and reused, which leads to significant LLM cost reduction, lower AI carbon footprint, and faster response times. This approach, known as generative AI caching or cognitive caching, supports multilingual AI needs and enhances model observability. Caching LLM responses is a strategic way to ensure smoother operations, efficient resource use, and sustainable AI solutions.

What are the different models offered?

We're currently offering GPT-4o, GPT-4o-mini, GPT-4-Turbo, GPT-3.5-Turbo, Llama 3.1 70b, Llama 3.1 8b, Llama 3.2 1B, Llama 3.2 3B, Llama 3.2 11B Vision, Mixtral 8.7b, Gemma 7b, and Gemma 2.9b.

How is pricing structured?

Pricing is tiered based on monthly spend, with discounts increasing as spend increases.
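
As a sketch of how those tiers (taken from the pricing table above) could map a monthly spend to a discount, assuming the discount for the tier your total spend falls into applies to the whole month (the page does not state whether tiers are marginal, so treat this as an assumption):

# Tier lower bounds and discounts from the pricing table above.
TIERS = [(25_000, 0.40), (12_500, 0.325), (5_000, 0.30), (2_500, 0.275), (0, 0.25)]

def discount_for_spend(monthly_spend: float) -> float:
    """Return the discount for the tier that monthly_spend falls into (assumed non-marginal)."""
    for lower_bound, discount in TIERS:
        if monthly_spend >= lower_bound:
            return discount
    return 0.25

print(discount_for_spend(7_000))   # 0.3
print(discount_for_spend(30_000))  # 0.4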

What are the base token prices for each model?

Base prices include a built-in 25% discount off market prices, and the discount increases as your spend increases.

What discounts are available?

Discounts range from 25% to 40%, depending on the monthly spend and the specific model used.

What does "Potential savings" mean?

Potential savings combine the listed price discount with an additional average 20% savings from CogCache serving cached responses. Actual savings derived from cognitive caching can be lower or higher, depending on the use case.
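
One plausible way to read that combination, purely as an illustration (it assumes the ~20% average caching savings apply on top of the already-discounted spend, which is an assumption rather than a stated formula):

# Illustrative arithmetic, not an official savings formula.
list_price_spend = 25_000   # example monthly spend at list prices
tier_discount = 0.25        # listed price discount for this tier
caching_savings = 0.20      # average additional savings from cached responses

spend_after_discount = list_price_spend * (1 - tier_discount)        # 18,750
spend_after_caching = spend_after_discount * (1 - caching_savings)   # 15,000
total_savings = list_price_spend - spend_after_caching               # 10,000
print(f"Effective savings: {total_savings / list_price_spend:.0%}")  # Effective savings: 40%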

Is there a maximum discount?

The maximum listed price discount is 40% at the highest spend tier.

Are there any spending limits?

If your monthly spend is over $50,000, contact our sales team.

Is a credit card needed to use CogCache?

No, a credit card is not required to use CogCache. You can start using CogCache immediately with a $20 credit.

What happens when your credits run out?

Credits are automatically refilled based on the limits you set.

What are private endpoints?

Private Endpoints create a secure, dedicated network interface that connects your AI application directly to CogCache through Azure's private backbone. This allows you to access CogCache's capabilities through your private virtual networks without exposing traffic to the public internet, ensuring enhanced security and compliance for sensitive AI workloads.
