Llama 2
Llama 2 is a family of state-of-the-art open-access large language models released by Meta. The Llama 2 release introduces a family of pretrained and fine-tuned LLMs ranging in scale from 7B to 70B parameters (7B, 13B, 70B). The pretrained models come with significant improvements over the Llama 1 models, including training on 40% more tokens, a much longer context length (4k tokens 🤯), and grouped-query attention for fast inference of the 70B model. To deploy Llama 2, use one of the identifiers below:
- Llama 2 7B: `llamav2-7b`
- Llama 2 7B Chat: `llamav2-7b-chat`
Here’s an example of how to call the deployed endpoint:
```bash
curl --location --request POST 'https://run.cerebrium.ai/llamav2-7b-chat-webhook/predict' \
--header 'Authorization: <API_KEY>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "prompt": "Hey! How are you doing?",
    "max_length": 100,
    "temperature": 0.9,
    "top_p": 1.0,
    "top_k": 10,
    "num_return_sequences": 1,
    "repetition_penalty": 2.0
}'
```
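If you are calling the endpoint from Python, here is a minimal sketch using the `requests` library. The URL, headers, and body mirror the curl example above; `<API_KEY>` is a placeholder for your own Cerebrium key.

```python
# Minimal sketch: the same request as the curl example, via Python's
# `requests` library. Replace <API_KEY> with your Cerebrium API key.
import requests

url = "https://run.cerebrium.ai/llamav2-7b-chat-webhook/predict"
headers = {
    "Authorization": "<API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "prompt": "Hey! How are you doing?",
    "max_length": 100,
    "temperature": 0.9,
    "top_p": 1.0,
    "top_k": 10,
    "num_return_sequences": 1,
    "repetition_penalty": 2.0,
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["result"][0]["generated_text"])
```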
Request Parameters
- `Authorization` (header): The Cerebrium API key used to authenticate your request. You can get it from your Cerebrium dashboard.
- `prompt`: The prompt you would like Llama 2 to process.
- `max_length`: The maximum number of tokens the model will generate. The default is 100.
- `temperature`: Controls the randomness of the model’s predictions. Higher values produce more random outputs, while lower values make the output more deterministic. The default is 1.0.
- `top_p`: Also known as nucleus sampling, this controls randomness by restricting the model to the smallest set of highest-probability tokens whose cumulative probability adds up to top_p. The default is 0.0.
- `top_k`: The number of highest-probability vocabulary tokens considered at each step during generation. Reducing top_k limits the candidate pool, resulting in more deterministic outputs. The default is 50.
- `num_return_sequences`: The number of independently computed sequences to return. If set to more than 1, that many sequences are generated, each one computed independently. The default is 1.
- `repetition_penalty`: Penalizes repeated words or tokens in the generated text. This is a number greater than or equal to 1, where 1 applies no penalty.
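To make the interaction between top_k and top_p concrete, here is an illustrative sketch (not the actual model code, just the filtering idea) of how the two settings narrow the candidate tokens before one is sampled:

```python
# Illustrative sketch: how top_k and top_p filtering narrow the set of
# candidate next tokens before sampling. Probabilities here are made up.
def filter_candidates(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    # top_k: keep only the k highest-probability tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # top_p (nucleus sampling): keep the smallest prefix of that ranking
    # whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving candidates form a distribution again.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

next_token_probs = {"great": 0.5, "fine": 0.3, "tired": 0.15, "purple": 0.05}
print(filter_candidates(next_token_probs, top_k=3, top_p=0.8))
# top_k drops "purple"; top_p stops once "great" + "fine" reach 0.8,
# so the model samples between "great" (0.625) and "fine" (0.375).
```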
A successful call returns a response like the following:

```json
{
    "run_id": "dc8f23ab-7237-42dc-b6cf-430abdbba8f7",
    "run_time_ms": 10077.8913497924805,
    "message": "Ran successfully",
    "result": [
        {
            "generated_text": "Hey! How are you doing? I'm doing great! *hugs* So, what's new with you? "
        }
    ]
}
```
Response Parameters
- `run_id`: A unique identifier for the run that you can use to associate prompts with webhook endpoints.
- `run_time_ms`: The time in milliseconds it took to run your function. This is what you will be billed for.
- `message`: Whether or not the request was successful.
- `result`: The output generated by Llama 2.
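Because `result` is a list, a request with `num_return_sequences` greater than 1 returns one entry per generated sequence. Here is a sketch of unpacking the response body; the field names follow the example response above, and `data` stands in for `response.json()` from the earlier request example:

```python
# Sketch: unpacking the response body. `data` mirrors the example
# response above; in practice it would come from response.json().
data = {
    "run_id": "dc8f23ab-7237-42dc-b6cf-430abdbba8f7",
    "run_time_ms": 10077.89,
    "message": "Ran successfully",
    "result": [{"generated_text": "Hey! How are you doing? I'm doing great!"}],
}

print(f"run {data['run_id']} took {data['run_time_ms']:.0f} ms")
# `result` holds one entry per independently generated sequence.
for i, item in enumerate(data["result"]):
    print(f"sequence {i}: {item['generated_text']}")
```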