In this tutorial, we will build a question-answering bot that can answer questions about the contents of a YouTube video, recreating the application built here by @m_morzywolek.

To see the final implementation, you can view it here.

Basic Setup

It is important to note that developing models using Cerebrium should feel identical to developing on a virtual machine or Google Colab - so converting an existing notebook or script should be very easy! Please make sure you have the Cerebrium package installed and have logged in. If not, please take a look at our docs here.
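
In case you still need to do that, the two commands below should be all that is required (a minimal sketch - check the docs linked above if your CLI version differs):

pip install --upgrade cerebrium
cerebrium login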

First we create our project:

cerebrium init langchain-QA

We need certain Python packages to implement this project. Let's add them to the [cerebrium.dependencies.pip] section of our cerebrium.toml file:

pytube = "latest" # For audio downloading
langchain = "latest"
faiss-gpu = "latest"
ffmpeg = "latest"
openai-whisper = "latest"
transformers = ">=4.35.0"
sentence_transformers = ">=2.2.0"

To use Whisper, we also have to install ffmpeg and a few other Linux packages. We define these in the [cerebrium.dependencies.apt] section - this is where all Linux-based packages are installed:

[cerebrium.dependencies.apt]
ffmpeg = "latest"
"libopenblas-base" = "latest"
"libomp-dev" = "latest"

Our main.py file will contain our main Python code. This is a relatively simple implementation, so we can do everything in one file. We would like a user to send in a link to a YouTube video along with a question, and we will return the answer as well as the time segment the answer came from. So let us define our request object.

from pydantic import BaseModel

class Item(BaseModel):
    url: str
    question: str

Above, we use Pydantic as our data validation library. Because of the way we have defined the BaseModel, "url" and "question" are required parameters, so if either is missing from the request, the user will automatically receive an error message.
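
As a quick illustration of this validation (a hypothetical snippet, not part of the deployed code), instantiating Item without one of the required fields raises a ValidationError:

from pydantic import ValidationError

try:
    Item(url="https://www.youtube.com/watch?v=example")  # "question" is missing
except ValidationError as e:
    print(e)  # reports that the "question" field is required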

Convert Video to Text

Below, we use the Whisper model from OpenAI to convert the video's audio to text. We then split the text into phrase segments with their respective timings, so we know exactly where in the video our model got the answer from.

import pytube
from datetime import datetime
import whisper

model = whisper.load_model("small")

def store_segments(segments):
    texts = []
    start_times = []

    for segment in segments:
        text = segment["text"]
        start = segment["start"]

        # Convert the starting time to a datetime object
        start_datetime = datetime.fromtimestamp(start)

        # Format the starting time as a string in the format "00:00:00"
        formatted_start_time = start_datetime.strftime("%H:%M:%S")

        # Collect the segment text and its formatted start time
        texts.append(text)
        start_times.append(formatted_start_time)

    return texts, start_times

def predict(item, run_id, logger):
    item = Item(**item)

    video = pytube.YouTube(item.url)
    audio = video.streams.get_audio_only()
    fn = audio.download(output_path="/models/content/", filename=f"{video.title}.mp4")

    transcription = model.transcribe(f"/models/content/{video.title}.mp4")
    res = transcription["segments"]

    texts, start_times = store_segments(res)
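
Before moving on, here is a quick, hypothetical sanity check of store_segments with two fake Whisper segments. Note that datetime.fromtimestamp interprets the offset as an epoch timestamp, so the formatted times below assume the machine runs in UTC:

segments = [
    {"text": "Steve Jobs founded Apple in a garage.", "start": 12.0},
    {"text": "He was twenty years old at the time.", "start": 95.0},
]
texts, start_times = store_segments(segments)
print(texts)        # ['Steve Jobs founded Apple in a garage.', 'He was twenty years old at the time.']
print(start_times)  # ['00:00:12', '00:01:35'] on a UTC machine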

Langchain Implementation

Below, we use Langchain to combine a vector store, in which we store all the video segments from above, with an LLM hosted on Cerebrium in order to generate answers.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from langchain.chains import VectorDBQAWithSourcesChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import CerebriumAI
import openai
import faiss

from sentence_transformers import SentenceTransformer

# Load the embedding model up front; this is the same model HuggingFaceEmbeddings uses by default
sentenceTransformer = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def create_embeddings(texts, start_times):
    text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
    docs = []
    metadatas = []
    for i, d in enumerate(texts):
        splits = text_splitter.split_text(d)
        docs.extend(splits)
        metadatas.extend([{"source": start_times[i]}] * len(splits))
    return metadatas, docs

# Add the following to your predict function, after the line texts, start_times = store_segments(res)
    metadatas, docs = create_embeddings(texts, start_times)
    embeddings = HuggingFaceEmbeddings()
    store = FAISS.from_texts(docs, embeddings, metadatas=metadatas)
    faiss.write_index(store.index, "docs.index")
    llm = CerebriumAI(endpoint_url="<ENDPOINT-URL-OF-DEPLOYED-LLM>")  # placeholder: the endpoint of an LLM deployed on Cerebrium
    chain = VectorDBQAWithSourcesChain.from_llm(llm=llm, vectorstore=store)

    result = chain({"question": item.question})

    return {"result": result}

Above, we chunk our text segments and store them in a FAISS vector store. To create the embeddings, we use an open-source model from Hugging Face. The reason it is better to run Whisper and the embeddings model on the same machine is that you save two round trips of network requests, which saves roughly 600ms or more on average.
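
As a quick way to see what the vector store is doing (a hypothetical snippet, not part of the deployed code), you can run a similarity search directly against the FAISS store and inspect which segments, with their timestamps, would be passed to the LLM:

# Inspect the top-matching transcript chunks for a query
docs_and_scores = store.similarity_search_with_score("When did he start Apple?", k=2)
for doc, score in docs_and_scores:
    # doc.metadata["source"] holds the segment's start time, e.g. "00:01:35"
    print(doc.metadata["source"], round(score, 3), doc.page_content[:80])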

We then use Langchain to query the LLM deployed on a Cerebrium endpoint to answer the question. Lastly, we return the results.


Your cerebrium.toml file is where you set your compute and environment requirements. Please make sure that the GPU you specify is an AMPERE_A5000, and that you have enough memory (RAM) on your instance to run the models. Your cerebrium.toml file should look like this:

[cerebrium.build]
predict_data = "{\"prompt\": \"Here is some example predict data for your cerebrium.toml which will be used to test your predict function on build.\"}"
force_rebuild = false
disable_animation = false
log_level = "INFO"
disable_deployment_confirmation = false

[cerebrium.deployment]
name = "langchain-qa"
python_version = "3.10"
include = "[./*,]"
exclude = "[./.*, ./__*]"

[cerebrium.hardware]
gpu = "AMPERE_A5000"
cpu = 2
memory = 16.0
gpu_count = 1

[cerebrium.scaling]
min_replicas = 0
cooldown = 60

[cerebrium.dependencies.apt]
ffmpeg = "latest"
"libopenblas-base" = "latest"
"libomp-dev" = "latest"

[cerebrium.dependencies.pip]
pytube = "latest" # For audio downloading
langchain = "latest"
faiss-gpu = "latest"
ffmpeg = "latest"
openai-whisper = "latest"
transformers = ">=4.35.0"
sentence_transformers = ">=2.2.0"


To deploy the model, use the following command:

cerebrium deploy

Once deployed, we can make the following request:

curl --location --request POST '' \
--header 'Authorization: <JWT-TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "",
    "question": "How old was Steve Jobs when started Apple?"

We then get the following results:

  "run_id": "8959bfaa-f6c1-4445-980c-1ab469a4b878",
  "message": "success",
  "result": {
    "result": {
      "question": "How old was Steve Jobs when started Apple?",
      "answer": "20",
      "sources": ""
  "run_time_ms": 72109.55119132996