Dataset Curation
Cerebrium’s fine-tuning functionality is in public beta, and we are adding more functionality each week! If you run into any issues or have an urgent requirement, please reach out to support.
The fine-tuning library has been built to leverage Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA), reducing the number of trainable parameters by >99.9% while delivering results comparable to full fine-tuning.
Because the number of trainable parameters is so much lower, your dataset does not need to exceed 1000 examples; much smaller datasets of 200-500 examples are often more than sufficient to achieve good results. However, no matter the size of your dataset, it is always best to use a diverse set of training examples so that your dataset has good variation and covers a wide variety of potential inputs.
By default, we provide two question-answering templates, adapted from the Alpaca-LoRA prompt format. You can use these templates or create your own. For more information on templating, see this page.
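For intuition, here is a minimal sketch of what an Alpaca-style template looks like when rendered with the “prompt” and “context” fields described below. The exact templates Cerebrium applies may differ, so treat this as illustrative only:

```python
# Illustrative Alpaca-style templates. The exact templates Cerebrium uses
# may differ; see the templating page for the authoritative format.
WITH_CONTEXT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{prompt}\n\n"
    "### Input:\n{context}\n\n"
    "### Response:\n"
)
NO_CONTEXT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{prompt}\n\n"
    "### Response:\n"
)

def render(record: dict) -> str:
    """Pick a template based on whether the record carries context."""
    template = WITH_CONTEXT if record.get("context") else NO_CONTEXT
    return template.format(**record)
```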
The dataset is a JSON or JSONL file in which each record contains a “prompt” parameter, a “completion” parameter, and, if needed, a “context” parameter.
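Before uploading, it is worth verifying that every record parses and carries the required fields. Below is a minimal sketch; the filename and helper function are illustrative, not part of the Cerebrium API:

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}  # "context" is optional

def load_records(path: str) -> list[dict]:
    """Load a .json (list of objects) or .jsonl (one object per line) dataset."""
    with open(path) as f:
        if path.endswith(".jsonl"):
            records = [json.loads(line) for line in f if line.strip()]
        else:
            records = json.load(f)
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
    return records

records = load_records("dataset.json")  # illustrative filename
print(f"loaded {len(records)} valid records")
```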
For your convenience, an example dataset has been provided for testing here.
Below are some helpful tips to take into consideration when preparing your datasets for fine-tuning:
1. Data Quality:
Before fine-tuning models, ensure your data is of high quality. Clean the data by removing outliers and duplicates, handling missing values appropriately, and discarding anything irrelevant (a minimal sketch of such a cleaning pass is shown after these tips). It is best to use a tool such as NomicAI for this so you can see a map of your data and easily remove records that are not relevant to your use case.
It is also important to keep your dataset diverse and balanced. For example, make sure there are a similar number of positive outcomes as negative outcomes, a similar number of female and male records, and so on.
2. Use Case Understanding:
The approach for fine-tuning should align with the specifics of your use case. Different problems require varying degrees of precision, recall, speed, etc. For example, if the model is being used for a task that requires factual accuracy, then the data used to fine-tune the model should be carefully curated to ensure that it is accurate.
3. Data Structure:
Consider the structure of your data. Sequential data should be formatted differently than non-sequential data. Also, understand which features of your data are most relevant to your model.
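As promised above, here is a minimal sketch of a cleaning pass over the prompt/completion records, covering duplicate removal and missing values. It is a starting point under simple assumptions, not a substitute for dedicated tooling such as NomicAI, which can also surface near-duplicates and off-topic clusters:

```python
def clean_records(records: list[dict]) -> list[dict]:
    """Drop records with missing/empty fields and exact duplicates."""
    seen: set[tuple[str, str]] = set()
    cleaned = []
    for record in records:
        prompt = (record.get("prompt") or "").strip()
        completion = (record.get("completion") or "").strip()
        if not prompt or not completion:
            continue  # handle missing values: here we simply drop the record
        key = (prompt, completion)
        if key in seen:
            continue  # remove exact duplicates
        seen.add(key)
        cleaned.append(record)
    return cleaned
```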
While the initial curation of data is a difficult and time-consuming process, it is just as important to continuously monitor your model and whether the predictions/generations it returns to your users are good or bad. In doing this you can fine-tune your model further, provide better generations, and cover more edge cases.
Example Dataset
Below is an example of a short JSON dataset that can be used for fine-tuning:
[
  {
    "prompt": "What is Cerebrium?",
    "completion": "Cerebrium is an AWS Sagemaker alternative providing all the features you need to quickly build an ML product.",
    "source": "https://docs.cerebrium.ai/introduction"
  },
  {
    "prompt": "What is the goal of Cerebrium?",
    "completion": "The goal of Cerebrium is to help companies create value through machine learning as quickly and as painlessly as possible by abstracting away a lot of the complexity and mundane infrastructure setup.",
    "source": "https://docs.cerebrium.ai/introduction"
  },
  {
    "prompt": "How can I send feedback to Cerebrium?",
    "completion": "You can send feedback requests to Cerebrium at support@cerebrium.ai.",
    "source": "https://docs.cerebrium.ai/introduction"
  },
  {
    "prompt": "Where can I start a project and deploy my first model?",
    "completion": "You can start a project and deploy your first model at https://dashboard.cerebrium.ai.",
    "source": "https://docs.cerebrium.ai/introduction"
  },
  {
    "prompt": "How do I set up a Cerebrium account?",
    "completion": "To set up a Cerebrium account, follow these steps: 1. Go to [dashboard.cerebrium.ai](https://dashboard.cerebrium.ai) 2. Signup or Login 3. Navigate to the API Keys page 4. You should see an API key with the source 'Cerebrium'. Click the eye icon to display it. It will be in the format: c_api_key-xxx",
    "source": "https://docs.cerebrium.ai/quickstarts/xgboost"
  },
  {
    "prompt": "How do I install the Cerebrium framework?",
    "completion": "To install the Cerebrium framework, run the following command in your notebook or terminal: `pip install cerebrium`",
    "source": "https://docs.cerebrium.ai/quickstarts/xgboost"
  },
  {
    "prompt": "How do I create and save an XGBoost model?",
    "completion": "To create and save an XGBoost model, use the following code snippet: ```from sklearn.datasets import load_iris from xgboost import XGBClassifier iris = load_iris() X, y = iris.data, iris.target xgb = XGBClassifier() xgb.fit(X, y) xgb.save_model('iris.json')``` This code creates a simple XGBoost classifier for the Iris dataset and saves the model to a JSON file.",
    "source": "https://docs.cerebrium.ai/quickstarts/xgboost"
  }
]
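If you prefer to supply the dataset as JSONL (one record per line, as mentioned earlier), converting the JSON array above takes only a few lines; the filenames here are illustrative:

```python
import json

# Convert the JSON array above into JSONL, one record per line.
with open("dataset.json") as src, open("dataset.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")
```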