In this tutorial, we will build a question-answering bot that answers questions about a YouTube video. It recreates the application built here by @m_morzywolek.

You can view the final implementation here.

Create Cerebrium Account

Before building, you need to set up a Cerebrium account. This is as simple as starting a new Project in Cerebrium and copying the API key. This will be used to authenticate all calls for this project.

Create a project

  1. Go to
  2. Sign up or log in
  3. Navigate to the API Keys page
  4. You will need your private API key for deployments. Click the copy button to copy it to your clipboard


Basic Setup

Developing models on Cerebrium should feel identical to developing on a virtual machine or Google Colab, so converting existing code should be very easy!

Let us create our requirements.txt file and add the following packages:

pytube # For audio downloading

To use Whisper, we also have to install ffmpeg and a few other Linux packages. These are defined in a separate file, pkglist.txt, which lists all Linux-based packages to install.


To start, we need to create a file which will contain our main Python code. This is a relatively simple implementation, so we can do everything in one file. We would like a user to send in a link to a YouTube video together with a question, and to receive the answer as well as the time segment of the video that the answer came from. So let us define our request object.

from pydantic import BaseModel

class Item(BaseModel):
    url: str
    question: str

Above, we use Pydantic as our data validation library. BaseModel is also where Cerebrium keeps some default parameters, such as “webhook_url”, that you can use for long-running tasks; you do not need that functionality for this tutorial. Because of the way we have defined the model, “url” and “question” are required parameters, so if either is missing from the request, the user will automatically receive an error message.
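To see the required-field behaviour in isolation, the following minimal sketch validates two sample payloads outside of Cerebrium. The helper name `missing_fields` and the payload values are made up for illustration, and the exact error wording depends on the installed Pydantic version.

```python
from pydantic import BaseModel, ValidationError


class Item(BaseModel):
    url: str
    question: str


def missing_fields(payload: dict) -> list:
    """Return the names of fields that fail validation (empty list if the payload is valid)."""
    try:
        Item(**payload)
    except ValidationError as err:
        # Each error entry carries a "loc" tuple naming the offending field
        return [str(e["loc"][0]) for e in err.errors()]
    return []


# Illustrative payloads (the URL is a placeholder, not a real video)
valid = {"url": "https://example.com/video", "question": "What is discussed?"}
incomplete = {"url": "https://example.com/video"}

print(missing_fields(valid))       # no validation errors
print(missing_fields(incomplete))  # "question" is reported as missing
```

Calling `Item(**item)` inside `predict` triggers the same check, which is why a request without one of the two fields is rejected before any model runs.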

Convert Video to text

Below, we will use the Whisper model from OpenAI to convert the video audio to text. We will then split the text into phrase segments with their respective timings, so we know exactly where in the video our model got the answer from.

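import pytube
import whisper
from datetime import datetime, timezone

# Load the Whisper model once at import time so it is reused across requests
model = whisper.load_model("small")

def store_segments(segments):
    texts = []
    start_times = []

    for segment in segments:
        text = segment["text"]
        start = segment["start"]

        # Convert the starting offset (in seconds) to a datetime object.
        # Using UTC keeps the local timezone from shifting the result.
        start_datetime = datetime.fromtimestamp(start, tz=timezone.utc)

        # Format the starting time as a string in the format "00:00:00"
        formatted_start_time = start_datetime.strftime("%H:%M:%S")

        # Collect the text and its start time at the same index
        texts.append(text)
        start_times.append(formatted_start_time)

    return texts, start_times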

def predict(item, run_id, logger):
    item = Item(**item)

    video = pytube.YouTube(item.url)
    audio = video.streams.get_audio_only()
    fn = audio.download(output_path="/models/content/", filename=f"{video.title}.mp4")

    transcription = model.transcribe(f"/models/content/{video.title}.mp4")
    res = transcription["segments"]

    texts, start_times = store_segments(res)
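The segment handling can be tried without downloading a video or loading Whisper. The sketch below runs on a hand-written stand-in for `transcription["segments"]` (real segments contain more keys than `start` and `text`). The helper `format_offset` is a hypothetical name, and it uses integer arithmetic rather than `datetime`, which keeps the result independent of the machine's timezone.

```python
def format_offset(seconds: float) -> str:
    """Format an offset in seconds as HH:MM:SS (fractions of a second are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Made-up sample data, not output from a real transcription
sample_segments = [
    {"start": 0.0, "text": " First sentence of the video."},
    {"start": 75.4, "text": " A sentence a little over a minute in."},
    {"start": 3725.9, "text": " A sentence more than an hour in."},
]

texts = [segment["text"].strip() for segment in sample_segments]
start_times = [format_offset(segment["start"]) for segment in sample_segments]

# texts[i] and start_times[i] always describe the same segment
for start, text in zip(start_times, texts):
    print(start, text)
```

Keeping the two lists index-aligned is what later allows each chunk of text to be tagged with the timestamp it came from.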

Langchain Implementation

Below, we will implement Langchain to use a vectorstore, where we will store all our video segments above, with an LLM, Flan-T5, hosted on Cerebrium in order to generate answers.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from langchain.chains import VectorDBQAWithSourcesChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import CerebriumAI
import openai
import faiss
import os
from sentence_transformers import SentenceTransformer

sentenceTransformer = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
os.environ["CEREBRIUMAI_API_KEY"] = "private-XXXXXXXXXXXX"

def create_embeddings(texts, start_times):
    text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
    docs = []
    metadatas = []
    for i, d in enumerate(texts):
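        splits = text_splitter.split_text(d)
        docs.extend(splits)
        # Repeat the segment's start time once per chunk so docs and metadatas stay aligned
        metadatas.extend([{"source": start_times[i]}] * len(splits))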
    return metadatas, docs

#add following to your predict function after texts, start_times = store_segments(res)
    metadatas, docs = create_embeddings(texts, start_times)
    embeddings = HuggingFaceEmbeddings()
    store = FAISS.from_texts(docs, embeddings, metadatas=metadatas)
    faiss.write_index(store.index, "docs.index")
    llm = CerebriumAI(endpoint_url="")  # set this to the endpoint URL of your deployed Flan-T5 model
    chain = VectorDBQAWithSourcesChain.from_llm(llm=llm, vectorstore=store)

    result = chain({"question": item.question})

    return {"result": result}

Above, we chunk our text segments and store them in a FAISS vector store. To create the embeddings, we use an open-source model from Hugging Face. Running Whisper and the embeddings model on the same machine saves two network round trips, which saves more than 600ms on average.
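The chunk-and-retrieve idea can be illustrated without FAISS or a real embeddings model. The following is a toy sketch only: `split_text` is a simplified word-based stand-in for `CharacterTextSplitter`, `embed` is a bag-of-words counter rather than a Hugging Face model, and the transcript lines and timestamps are invented. What it shows is how each chunk keeps the start time of the segment it came from, so the best-matching chunk can be traced back to a position in the video.

```python
import math
from collections import Counter


def split_text(text: str, chunk_size: int) -> list:
    """Greedy word-based splitter: a simplified stand-in for CharacterTextSplitter."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > chunk_size and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; the tutorial uses a Hugging Face model instead."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented transcript segments and their start times
texts = ["the company was started in a garage", "the keynote introduced a new phone"]
start_times = ["00:00:05", "00:12:30"]

docs, metadatas = [], []
for i, text in enumerate(texts):
    splits = split_text(text, chunk_size=20)
    docs.extend(splits)
    # One metadata entry per chunk keeps docs and metadatas the same length
    metadatas.extend([{"source": start_times[i]}] * len(splits))

# Pick the chunk most similar to the question and report where it came from
query = embed("where was the company started")
best = max(range(len(docs)), key=lambda j: cosine(query, embed(docs[j])))
print(docs[best], metadatas[best]["source"])
```

A vector store such as FAISS performs the same nearest-neighbour lookup, but over dense embeddings and with an index that scales to many more chunks.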

We then integrate Langchain with a Cerebrium deployed endpoint to answer questions. Lastly, we return the results.

To deploy the model to an AMPERE_A5000, we use the following command:

cerebrium deploy langchain --hardware AMPERE_A5000 --api-key private-XXXXXXXXXXXXX

Once deployed, we can make the following request:

curl --location --request POST '' \
--header 'Authorization: public-XXXXXXXXXXXX' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "",
    "question": "How old was Steve Jobs when started Apple?"
}'
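The same request can be assembled in Python with only the standard library. This is a sketch under assumptions: `ENDPOINT_URL` and the video URL are hypothetical placeholders (the real values are not shown in this tutorial), and `build_request` is an illustrative helper name. The request is built but not sent, so the snippet runs without network access.

```python
import json
import urllib.request

# Placeholders: substitute the endpoint URL of your deployment and your own public API key
ENDPOINT_URL = "https://example.com/your-endpoint/predict"  # hypothetical URL
API_KEY = "public-XXXXXXXXXXXX"


def build_request(video_url: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl call."""
    body = json.dumps({"url": video_url, "question": question}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        headers={"Authorization": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )


request = build_request("https://example.com/video", "How old was Steve Jobs when started Apple?")
print(request.get_method(), request.full_url)

# To send it, uncomment the lines below; transcription can take a while, hence the generous timeout
# with urllib.request.urlopen(request, timeout=300) as response:
#     print(json.loads(response.read()))
```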

We then get the following results:

{
  "run_id": "8959bfaa-f6c1-4445-980c-1ab469a4b878",
  "message": "success",
  "result": {
    "result": {
      "question": "How old was Steve Jobs when started Apple?",
      "answer": "20",
      "sources": ""
    }
  },
  "run_time_ms": 72109.55119132996
}
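A client will usually want just the answer and its source. The sketch below assumes the response is a JSON object with the nesting shown in the output above (`result` inside `result`) and that any `message` other than "success" indicates a failure; both are assumptions based on this single sample. `summarize` is a hypothetical helper name.

```python
import json

# Sample response body, assuming the nesting shown in the output above
raw_response = """
{
  "run_id": "8959bfaa-f6c1-4445-980c-1ab469a4b878",
  "message": "success",
  "result": {"result": {"question": "How old was Steve Jobs when started Apple?", "answer": "20", "sources": ""}},
  "run_time_ms": 72109.55119132996
}
"""


def summarize(response: dict) -> str:
    """Return a one-line summary, or raise if the call did not succeed."""
    if response.get("message") != "success":
        raise RuntimeError(f"Request failed: {response.get('message')}")
    inner = response["result"]["result"]
    # The sample response has an empty "sources" field, so fall back to a note
    sources = inner["sources"] or "no source returned"
    seconds = response["run_time_ms"] / 1000
    return f"{inner['answer']} (sources: {sources}, took {seconds:.1f}s)"


print(summarize(json.loads(raw_response)))
```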