The best price for scaling your Azure OpenAI applications

Volume Discount Calculator

Potential Savings*: $2500

Discount: 25%

Monthly consumption

$25000

Savings : $2500

Potential Savings* : 20%

20%

Model

GPT-4o

GPT-4o-mini

GPT-4-Turbo

GPT-3.5-Turbo

Llama 3.2 1B

Llama 3.2 3B

Llama 3.2 11B Vision

Mixtral 8.7b

Gemma 2.9b

TM1 (Touchcast)

Input price/1M tokens

Input price**

$4.38

/1M tokens

$4.38

/1M tokens

$8.75

/1M tokens

$0.42

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

Output price/1M tokens

Output price**

$13.13

/1M tokens

$13.13

/1M tokens

$26.25

/1M tokens

$1.27

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

*Potential savings include additional average savings of 20% from CogCache serving cached responses. Actual savings may vary by use case.

**Effective price per 1M token

Explore Pricing

Model

GPT-4o

From

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

Discount

25%

27.5%

30%

32.5%

40%

GPT-4o-mini

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

GPT-4-Turbo

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

GPT-3.5-Turbo

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

Llama 3.1 70b

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

Llama 3.1 8b

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

Llama 3.2 1B

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

Llama 3.2 3B

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

Llama 3.2 11B Vision

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

Llama 3 70B

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

Llama 3 8B

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

Mixtral 8.7b

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

Gemma 2.9b

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

TM1 (Touchcast)

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000

$12,500

$25,000

Unlimited

25%

27.5%

30%

32.5%

40%

MODEL PROVIDERS:

INFERENCE PROVIDERS:

MODEL PROVIDERS:

INFERENCE PROVIDERS:

Start for free

Contact Sales

MODEL PROVIDERS:

INFERENCE PROVIDERS:

Get started  in minutes

No need for code changes, simply swap your end-point to CogCache and start your acceleration and savings journey.

Quickstart Guide >

 
from openai import OpenAI

COGCACHE_LLM_MODEL = ""  # the model of choice
COGCACHE_API_KEY = ""  # the generated CogCache API key

client = OpenAI(
    base_url = "https://proxy-api.cogcache.com/v1/",
    api_key = COGCACHE_API_KEY, # this is not needed here, if it's already set via environment variables
    default_headers = {
        "Authorization": f"Bearer {COGCACHE_API_KEY}",
    },
)

response = client.chat.completions.create(
    model = COGCACHE_LLM_MODEL,
    stream = True,
    messages = [
        {
            "role": "system",
            "content": "Assistant is a large language model trained by OpenAI.",
        },
        {
           "role": "user",
           "content": "Write a blog post about Generative AI"
        },
    ],
)

for chunk in response:
    print(chunk)

 
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

COGCACHE_LLM_MODEL = ""  # the model of choice
COGCACHE_API_KEY = ""  # the generated CogCache API key

model = ChatOpenAI(
    base_url = "https://proxy-api.cogcache.com/v1/",
    model = COGCACHE_LLM_MODEL,
    openai_api_key = COGCACHE_API_KEY,
    default_headers = {
        "Authorization": f"Bearer {COGCACHE_API_KEY}"
    },
)

response = model.stream(
    [
        SystemMessage(content="Assistant is a large language model trained by OpenAI."),
        HumanMessage(content="Write a blog post about Generative AI"),
    ],
)

for chunk in response:
    print(chunk.content)

 
curl --location 'https://proxy-api.cogcache.com/v1/chat/completions' \
--header 'Authorization: Bearer {COGCACHE_API_KEY}' \
--header 'Content-Type: application/json' \
--data '{
    "messages": [
        {
            "role": "system",
            "content": "Assistant is a large language model trained by OpenAI."
        },
        {
            "role": "user",
            "content": "Write a blog post about Generative AI"
        }
    ],
    "model": "COGCACHE_LLM_MODEL”,
    "stream": true
}'

 
import OpenAI from "openai";

const COGCACHE_LLM_MODEL = ""; // the model of choice
const COGCACHE_API_KEY = ""; // the generated CogCache API key

const openai = new OpenAI({
  baseURL: "https://proxy-api.cogcache.com/v1/",
  apiKey: COGCACHE_API_KEY,
  defaultHeaders: {
    Authorization: `Bearer ${COGCACHE_API_KEY}`,
  },
});

async function main() {
  try {
    const response = await openai.chat.completions.create({
      messages: [
        {
          role: "system",
          content: "Assistant is a large language model trained by OpenAI.",
        },
        { 
          role: "user", 
          content: "Write a blog post about Generative AI" 
        },
      ],
      model: COGCACHE_LLM_MODEL,
      stream: true,
    });

    // Check if response is async iterable
    if (response[Symbol.asyncIterator]) {
      for await (const chunk of response) {        
        const text = chunk.choices[0]?.delta?.content; // Adjust based on actual response structure
        if (text) {
          console.log(text);
        }
      }
    } else {
      console.log('Response is not an async iterable');
    }

  } catch (error) {
    console.error('An error occurred:', error);
  }
}

main();

 
import { ChatOpenAI } from "@langchain/openai";

const COGCACHE_LLM_MODEL = ""; // the model of choice
const COGCACHE_API_KEY = ""; // the generated CogCache API key

const model = new ChatOpenAI({
  configuration: {
    baseURL: "https://proxy-api.cogcache.com/v1/",
    baseOptions: {
      headers: {
        "Authorization": `Bearer ${COGCACHE_API_KEY}`,
      },
    },
  },
  apiKey: COGCACHE_API_KEY,
  model: COGCACHE_LLM_MODEL,
  streaming: true,
});

(async () => {
  const response = await model.invoke(
    [
       ["system", "Assistant is a large language model trained by OpenAI."],
       ["user", "Write a blog post about Generative AI"] ,
    ],
    {
      callbacks: [
        {
          handleLLMNewToken: async (token) => {
            console.log("handleLLMNewToken", token);
          },
          handleLLMEnd: async (output) => {
            console.log("handleLLMEnd", output);
          },
          handleChatModelStart: (...chat) => {
            console.log("handleChatModelStart", chat);
          },
        },
      ],
    }
  );
})().catch((err) => {
  console.error("Something went wrong:", err);
});

Harness the power of Cognitive Caching

Say hello to the lowest cost on the market for inference tokens.

CogCache works as a proxy between your AI applications and your LLMs, accelerating content generation through caching results, cutting costs and speeding responses by eliminating the need to consume tokens on previously generated content.

Get Started

Start for free

Do more with less

CogCache dramatically reduces inference costs while keeping your GenAI solutions blazing fast, aligned and transparent.

Stay in control

Directly control, refine and edit the output of generative AI applications with a self-healing cache that mitigates misaligned responses.

Deploy in minutes

With one line of code you can equip your team to control the entire GenAI lifecycle — from  rapid deployment to real-time governance to continuous optimization.

How it Works

Query Volume Tracking

We monitor the total number of queries per day to our AI models.

Cache Yield

Our system identifies and caches repetitive queries, addressing them from the cache instead of the LLM.

LLM Query Management

This reduction in direct LLM calls minimizes the computational load, leading to lower operational costs.

Aggregate Savings

Over time, the cumulative savings from reduced LLM queries add up, driving down overall costs and improving efficiency.

Slash Your GenAI Carbon
Footprint by up to 50%

Slash Your GenAI Carbon Footprint by up to 50%

Reduce Energy Consumption

Lower your energy usage  and costs by up to 50% with our innovative Cognitive Caching technology. Scale your conversational AI without escalating its environmental impact.

Accelerate AI Responses

Experience 100x faster interactions without the need for energy-intensive operations. Enable your users to get quicker, more efficient responses.

A Sustainable Future

Cognitive Caching is more than a quick fix, it's a paradigm shift. Lead the way in sustainable tech innovation and create a positive impact on our planet.

Reduce Costs and Carbon Footprint

Save up to 50% on your LLM costs with our reserved capacity and cut your carbon footprint by over 50%, making your AI operations more sustainable and cost-effective.

Boost Performance

Experience lightning-fast, predictable performance with response times accelerated by up to 100x, ensuring smooth and efficient operations of your LLMs via Cognitive Caching.

Drive Control and Alignment

Maintain complete oversight on all LLM text generated, ensuring alignment and grounding of responses to uphold your brand integrity and comply with governance requirements.

Full-stack LLM Observability

Gain real-time insights, track performance key metrics and view all the logged requests for easy debugging.

Seamless Integration, Endless scalability.

Fast, Safe and Cost-effective.

Instant Implementation

Switch your code endpoints
with the supplied key, and
you're set.

Resolution Engine

Ensure every interaction with
your AI content is traceable
and secure.

Multilingual Support

Supports multiple languages,
expanding your global reach.

Data Integrations

Integrates effortlessly with
your existing business systems.

Guaranteed Capacity

CogCache ensures availability
of Azure OpenAI tokens thanks
to our reserved capacity.

Predictability

Eliminate hallucinations and
guarantee accuracy in your
prompt responses.

Security

CogCache acts like a firewall for your LLM, blocking prompt njections and any attempts to jailbreak it.

Savings

Slash your Azure OpenAI costs by up to 50% with volume discounting and cognitive caching.

Private Endpoints

Secure, direct access through your private network - ensuring your AI traffic never touches the public internet.

Before / After

Standard AI Challenges

COGCACHE ACTIVATED

Hyper-Fast Cache Retrieval.

Unpredictable and slow LLM response times.

Self-Healing Cache.

Stochastic results yielding different responses every time.

Asynchronous Risk & Quality Scoring.

AI grounding issues are impossible to detect and address.

Temporal Relevance Tracking.

AI safety risks, biased and unaligned responses. Relevance Tracking.

Full Workflow Interface for Your Responsible AI Teams.

Lack of explainability, accountability & transparency.

DCAI and DCAI Amendment Updates.

No cost-effective way to consume tokens for repeated prompts.

No easy way to monitor token consumption.

Hard to understand and predict Azure OpenAI response patterns.

FAQ

What is a token?

You can think of tokens as pieces of words used for natural language processing. For English text, 1 token is approximately 4 characters or 0.75 words.

What Is LLM Response Caching?

LLM response caching is a method for optimizing the efficiency and performance of large language models (LLMs) like those from OpenAI and Azure OpenAI. By caching LLM responses, frequently generated content can be stored and reused, which leads to significant LLM cost reduction, lower AI carbon footprint, and faster response times. This approach, known as generative AI caching or cognitive caching, supports multilingual AI needs and enhances model observability. Caching LLM responses is a strategic way to ensure smoother operations, efficient resource use, and sustainable AI solutions.

What are the different models offered?

We're currently offering GPT-4o, GPT-4o-mini, GPT-4-Turbo, GPT-3.5-Turbo, Llama 3.1 70b, Llama 3.1 8b, Llama 3.2 1B, Llama 3.2 3B, Llama 3.2 11B Vision, Mixtral 8.7b, Gemma 7b, Gemma 2.9b, and TM1 (Touchcast)

How is pricing structured?

Pricing is tiered based on monthly spend, with discounts increasing as spend increases.

What are the base token prices for each model?

Base prices start with a built-in 25% discount to market prices, and increase as spend increases.

What discounts are available?

Discounts range from 25% to 40%, depending on the monthly spend and the specific model used.

What does "Potential savings" mean?

Potential savings combine the listed price discount with an additional average 20% savings from CogCache serving cached responses. Actual savings derived from cognitive caching can be lower or higher, depending on the use case.

Is there a maximum discount?

The maximum listed price discount is 40% at the highest spend tier.

Are there any spending limits?

If your monthly spend is over $50,000, contact our sales team.

Is a credit card needed to use CogCache?

No, a credit card is not required to use CogCache. You can start using CogCache immediately with a $20 credit.

What happens when your credits run out?

Credits autofill based on the limits you set.

What are private endpoints?

Private Endpoints create a secure, dedicated network interface that connects your AI application directly to CogCache through Azure's private backbone. This allows you to access CogCache's capabilities through you private virtual networks without exposing traffic to the public internet, ensuring enhanced security and compliance for sensitive AI workloads.

Join the Waitlist

First Name

Last Name

Company name

Country/Region

Business Email

Job title

Azure OpenAI

Don't have an Azure Marketplace account?

Sign up for the self-service waitlist.

Thank you!
Your request has been received!

You are already on the waitlist

Volume Discount Calculator

Monthly consumption

$0

$25000

GPT-4o

GPT-4o-mini

GPT-4-Turbo

GPT-3.5-Turbo

Llama 3.2 1B

Llama 3.2 3B

Llama 3.2 11B Vision

Mixtral 8.7b

Gemma 2.9b

TM1 (Touchcast)

$4.38

/1M tokens

$4.38

/1M tokens

$8.75

/1M tokens

$0.42

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

$4.38

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

$26.25

/1M tokens

$1.27

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

$13.13

/1M tokens

12.5%

12.5%

12.5%

15%

12.5%

12.5%

12.5%

12.5%

12.5%

32.5%

32.5%

32.5%

35%

32.5%

32.5%

32.5%

32.5%

32.5%

GPT-4o

$0

$2,500

$5,000

$12,500

$25,000

$2,500

$5,000