Llama 2 is a family of state-of-the-art open-access large language models released by Meta. The Llama 2 release introduces a family of pretrained and fine-tuned LLMs ranging in scale from 7B to 70B parameters (7B, 13B, 70B). The pretrained models come with significant improvements over the Llama 1 models, including being trained on 40% more tokens, having a much longer context length (4k tokens 🤯), and using grouped-query attention for fast inference of the 70B model. To deploy Llama 2, use one of the identifiers below:

  • Llama2 7b: llamav2-7b
  • Llama2 7b Chat: llamav2-7b-chat

Here’s an example of how to call the deployed endpoint:

Request Parameters

  curl --location --request POST 'https://run.cerebrium.ai/llamav2-7b-chat-webhook/predict' \
      --header 'Authorization: <API_KEY>' \
      --header 'Content-Type: application/json' \
      --data-raw '{
        "prompt": "Hey! How are you doing?",
        "max_length": 100,
        "temperature": 0.9,
        "top_p": 1.0,
        "top_k": 10,
        "num_return_sequences": 1,
        "repetition_penalty": 2.0
      }'
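The same request can be made from Python. Here is a minimal sketch using only the standard library; the URL and the `<API_KEY>` placeholder are taken from the curl example above, and `build_request` is a hypothetical helper, not part of any SDK:

```python
import json
import urllib.request

API_KEY = "<API_KEY>"  # your key from the Cerebrium dashboard
URL = "https://run.cerebrium.ai/llamav2-7b-chat-webhook/predict"

def build_request(prompt, **params):
    """Build the POST request for the /predict endpoint."""
    payload = {"prompt": prompt, **params}
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("Hey! How are you doing?", max_length=100, temperature=0.9)
# with urllib.request.urlopen(req) as resp:   # uncomment to actually send
#     print(json.loads(resp.read()))
```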
Authorization (required)
string

This is the Cerebrium API key used to authenticate your request. You can get it from your Cerebrium dashboard.

prompt (required)
string

The prompt you would like Llama 2 to process.

max_length
int

The maximum number of tokens the model will generate. The default is 100.

temperature
float

The value used to control the randomness in the model's predictions. Higher values result in more random outputs, while lower values make the output more deterministic. The default is 1.0.
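Concretely, temperature rescales the model's logits before the softmax. A toy sketch (the logit values are illustrative, not taken from the model):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax.
    T > 1 flattens the distribution (more random); T < 1 sharpens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))   # peaked distribution
print(softmax_with_temperature(logits, 10.0))  # near-uniform distribution
```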

top_p
float

Also known as nucleus sampling, this parameter controls randomness by restricting the model to the smallest set of highest-probability tokens whose cumulative probability adds up to top_p. The default is 0.0.
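A toy sketch of how nucleus sampling narrows the candidate set; `top_p_filter` is a hypothetical helper and the probabilities below are illustrative:

```python
def top_p_filter(probs, top_p):
    """Return the indices of the smallest set of highest-probability
    tokens whose cumulative probability reaches top_p."""
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

probs = [0.6, 0.25, 0.1, 0.05]  # toy next-token distribution
print(top_p_filter(probs, 0.8))  # tokens 0 and 1 cover >= 0.8 of the mass
```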

top_k
int

The maximum number of highest probability vocab tokens considered for each step during the generation of sequences. Reducing top_k will limit the number of output possibilities, resulting in more deterministic outputs. The default is 50.
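By contrast with top_p, top_k simply caps the number of candidates. A toy sketch with a hypothetical `top_k_filter` helper and illustrative probabilities:

```python
def top_k_filter(probs, k):
    """Return the indices of the k most probable tokens; only these
    remain candidates for the next sampling step."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

probs = [0.05, 0.5, 0.3, 0.15]
print(top_k_filter(probs, 2))  # → [1, 2]
```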

num_return_sequences
int

The number of independently computed sequences to return. If set to more than 1, that many sequences are generated, each computed independently. The default is 1.

repetition_penalty
float

A parameter used in text generation to penalize repeated words or tokens in the generated text. This is a number greater than or equal to 1.
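One common way to apply such a penalty (the scheme introduced in the CTRL paper and used by Hugging Face transformers; the logit values and token ids below are illustrative) divides positive logits of already-generated tokens by the penalty and multiplies negative ones:

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Penalize tokens that have already been generated: divide positive
    logits by the penalty, multiply negative ones, so repeats become
    less likely at the next sampling step."""
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [2.0, -1.0, 0.5]          # toy logits for a 3-token vocabulary
print(apply_repetition_penalty(logits, [0, 1], 2.0))  # → [1.0, -2.0, 0.5]
```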

Example Response

{
  "run_id": "dc8f23ab-7237-42dc-b6cf-430abdbba8f7",
  "run_time_ms": 10077.8913497924805,
  "message": "Ran successfully",
  "result": [
    {
      "generated_text": "Hey! How are you doing? I'm doing great! *hugs* So, what's new with you? "
    }
  ]
}

Response Parameters

run_id (required)
string

A unique identifier for the run that you can use to associate prompts with webhook endpoints.

run_time_ms (required)
float

The amount of time in milliseconds it took to run your function. This is what you will be billed for.

message (required)
string

Whether or not the request was successful.

result (required)
array

The result generated by Llama 2.
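The generated text sits inside the result array. A minimal sketch of pulling it out of a raw response body (the JSON below is a hypothetical copy mirroring the example response above):

```python
import json

# Hypothetical raw response body, mirroring the example above.
raw = """{
  "run_id": "dc8f23ab-7237-42dc-b6cf-430abdbba8f7",
  "run_time_ms": 10077.89,
  "message": "Ran successfully",
  "result": [
    {"generated_text": "Hey! How are you doing? I'm doing great!"}
  ]
}"""

response = json.loads(raw)
# Each entry in result carries one generated sequence.
texts = [item["generated_text"] for item in response["result"]]
print(texts[0])
```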