Building a ChatBot with Llama, Vicuna, FastChat and Streamlit

A step-by-step guide to building and deploying a chatbot with Llama, Vicuna, FastChat and Streamlit

In this blog post, we will walk you through the process of building a chatbot using Llama, Vicuna and FastChat. Llama is a foundational large language model released by Meta. The Vicuna model was created by fine-tuning Llama on user-shared conversations collected from ShareGPT. FastChat is an open platform for training, serving, and evaluating chatbots based on large language models.

Note: the PyPI package name for FastChat is fschat. There is also a fastchat package on PyPI, which is unrelated.

By the end of this blog post, you will have a working chatbot that you can use to interact with users.

It’s worth noting that FastChat also has a web-based interface for serving the chat. In this post we’ve opted to skip it since it’s more complicated: it requires running a worker, a controller, and a Gradio web server.

The code for this chatbot is hosted on GitHub. You can also run it directly in Saturn Cloud:

Run in Saturn Cloud

Driving the interactive chat loop

We create the following class, StreamlitChatLoop, to drive the model and hold all necessary state.
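For reference, here is a minimal sketch of the class skeleton and the imports the snippets below rely on. The constructor is our assumption (the rest of the code only needs the model path and a system prompt), and the import paths match recent fschat releases, so check them against your installed version:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from fastchat.conversation import Conversation, SeparatorStyle
from fastchat.serve.inference import generate_stream


class StreamlitChatLoop:
    def __init__(self, model_path: str):
        # Hypothetical constructor: hold the model path and the system
        # prompt, then initialize an empty conversation.
        self.model_path = model_path
        self.model = None
        self.tokenizer = None
        self.system_prompt = (
            "A chat between a curious user and an artificial intelligence "
            "assistant. The assistant gives helpful, detailed, and polite "
            "answers to the user's questions."
        )
        self.clear_conversation()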

The load_models method is responsible for using the Hugging Face transformers interface to load the appropriate Vicuna model weights.

    def load_models(self):
        # Load Vicuna in 8-bit to roughly halve GPU memory use;
        # device_map="auto" places the weights across the available devices.
        model = LlamaForCausalLM.from_pretrained(
            self.model_path,
            load_in_8bit=True,
            torch_dtype=torch.float16,
            device_map="auto",
        )
        tokenizer = LlamaTokenizer.from_pretrained(self.model_path)
        self.model = model
        self.tokenizer = tokenizer

FastChat provides a Conversation object that assembles the actual prompt fed into Vicuna. We create a clear_conversation method so that we can wipe all conversation state and restart the chat from the beginning at any time.

    def clear_conversation(self):
        # Reset the chat so only the system prompt remains, with an
        # empty message history.
        self.conv: Conversation = Conversation(
            name='vicuna_v1.1',
            system=self.system_prompt,
            roles=['USER', 'ASSISTANT'],
            messages=[],
            offset=0,
            sep_style=SeparatorStyle.ADD_COLON_TWO,  # "ROLE: message" turns
            sep=' ',      # separator after user turns
            sep2='</s>',  # end-of-sequence separator after assistant turns
            stop_str=None,
            stop_token_ids=None
        )
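To make the separator settings concrete, here is a hedged example that uses the same Conversation parameters, appends one exchange, and prints the prompt it produces:

conv = Conversation(
    name='vicuna_v1.1',
    system="A chat between a curious user and an artificial intelligence assistant.",
    roles=['USER', 'ASSISTANT'],
    messages=[],
    offset=0,
    sep_style=SeparatorStyle.ADD_COLON_TWO,
    sep=' ',
    sep2='</s>',
    stop_str=None,
    stop_token_ids=None,
)
conv.append_message('USER', 'Hello!')
conv.append_message('ASSISTANT', None)  # None leaves the assistant slot open
print(conv.get_prompt())
# A chat between a curious user and an artificial intelligence assistant. USER: Hello! ASSISTANT:

With ADD_COLON_TWO, sep (a space) follows user turns and sep2 ('</s>') follows completed assistant turns, which is why '</s>' appears between exchanges in the prompts shown later in this post.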

The last two methods take user input and load it into the conversation object, which combines the system prompt with the history of past chat messages to produce the prompt for the next Vicuna response.

    def take_user_input(self, user_input: str):
        conv = self.conv
        conv.append_message(conv.roles[0], user_input)
        # Appending None for the assistant makes the prompt end with
        # "ASSISTANT:", leaving a slot for the model's reply to fill in.
        conv.append_message(conv.roles[1], None)

    def loop(self):
        prompt = self.conv.get_prompt()
        gen_params = {
            "model": self.model_path,
            "prompt": prompt,
            "temperature": 0.7,
            "top_p": 0.9,
            "repetition_penalty": 1.0,
            "max_new_tokens": 512,
            "stop": self.conv.stop_str,
            "stop_token_ids": self.conv.stop_token_ids,
            "echo": False,
        }
        # generate_stream yields progressively longer snapshots of the reply
        output_stream = generate_stream(self.model, self.tokenizer, gen_params, 0)
        output_text = ""
        for outputs in output_stream:
            output_text = outputs["text"].strip()
            yield output_text
        # Store the completed reply in the open assistant slot so it becomes
        # part of the history for the next turn.
        self.conv.messages[-1][-1] = output_text
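Putting the pieces together, a quick smoke test outside of Streamlit might look like the following (the model path is hypothetical):

chat = StreamlitChatLoop('/path/to/vicuna-7b')  # hypothetical path
chat.load_models()
chat.take_user_input('Say hello in one sentence.')
reply = ''
for partial in chat.loop():
    reply = partial  # each yield is a longer snapshot of the reply
print(reply)               # the completed assistant message
print(chat.conv.messages)  # the history now includes that reply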

An example conversation

This can be hard to follow just by reading the code. Seeing the prompts that are fed into the model at different stages makes it much clearer what’s going on.

I’m starting out the conversation with the following prompt:

write me an introduction paragraph for a blog post titled "Building a ChatBot with Llama, Vicuna, FastChat and Streamlit"

That message gets loaded into the FastChat Conversation object, and turned into the following prompt:

        A chat between a curious user and an artificial intelligence assistant.
        The assistant gives helpful, detailed, and polite answers to the user's questions.

        USER: write me an introduction paragraph for a blog post titled "Building a ChatBot with Llama, Vicuna, FastChat and Streamlit" ASSISTANT:

The system prompt at the top sets the context for the conversation and instructs Vicuna on what tone to take with the user.

This prompt causes Vicuna to generate a stream of tokens. The generate_stream function from FastChat keeps track of the output state and yields streaming updates for the message:

Welcome
Welcome to our
Welcome to our latest blog
Welcome to our latest blog post!
Welcome to our latest blog post! In this
Welcome to our latest blog post! In this article,
Welcome to our latest blog post! In this article, we'
Welcome to our latest blog post! In this article, we'll be
Welcome to our latest blog post! In this article, we'll be diving
Welcome to our latest blog post! In this article, we'll be diving into the
Welcome to our latest blog post! In this article, we'll be diving into the world of
Welcome to our latest blog post! In this article, we'll be diving into the world of chatbot
Welcome to our latest blog post! In this article, we'll be diving into the world of chatbot development and
Welcome to our latest blog post! In this article, we'll be diving into the world of chatbot development and exploring

With the eventual final output:

Welcome to our latest blog post! In this article, we'll be diving into the world of chatbot development and exploring how to build a chatbot using Llama, Vicuna, FastChat, and Streamlit. These tools are some of the most popular and versatile options available for creating chatbots, and in this post, we'll show you how to harness their power to create a chatbot that is both functional and visually appealing. Whether you're a seasoned developer or just starting out, this post is sure to provide valuable insights and techniques for building your own chatbot. So, let's get started!

When I send a subsequent message to Vicuna:

Can you be less cheerful and more serious?

The FastChat conversation object aggregates the conversation history, and generates the following prompt:

        A chat between a curious user and an artificial intelligence assistant.
        The assistant gives helpful, detailed, and polite answers to the user's questions.

        USER: write me an introduction paragraph for a blog post titled "Building a ChatBot with Llama, Vicuna, FastChat and Streamlit" ASSISTANT: Welcome to our latest blog post! In this article, we'll be diving into the world of chatbot development and exploring how to build a chatbot using Llama, Vicuna, FastChat, and Streamlit. These tools are some of the most popular and versatile options available for creating chatbots, and in this post, we'll show you how to harness their power to create a chatbot that is both functional and visually appealing. Whether you're a seasoned developer or just starting out, this post is sure to provide valuable insights and techniques for building your own chatbot. So, let's get started!</s>USER: Can you be less cheerful and more serious? ASSISTANT:

Interacting with the model using Streamlit

Streamlit is a convenient way to build ML applications. Streamlit isn’t as full-featured as traditional GUI toolkits, but that makes it much simpler to use. You can think of a Streamlit application as a Python script that re-executes from top to bottom after every user interaction, rendering content along the way.
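As a tiny illustration of this execution model (separate from the chatbot), the script below reruns from top to bottom on every click, with st.session_state carrying the counter across runs:

import streamlit as st

# The whole script re-executes on each interaction; session_state persists.
if 'clicks' not in st.session_state:
    st.session_state.clicks = 0

if st.button('Click me'):
    st.session_state.clicks += 1

st.write(f"The script has rerun after {st.session_state.clicks} click(s)")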

We have the following Streamlit application for the chatbot.

One of the first things we do is load the model. We wrap this in the st.cache_resource decorator so that subsequent runs reuse the already-loaded model instead of loading it again.

# Requires: import os and import streamlit as st
@st.cache_resource
def load_models():
    # MODEL_PATH can point anywhere the Vicuna weights are stored.
    chat = StreamlitChatLoop(os.getenv('MODEL_PATH', '/home/jovyan/workspace/models/vicuna-7b'))
    chat.load_models()
    return chat

chat = load_models()
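Because the function is cached, only the first run pays the load cost; later reruns reuse the same StreamlitChatLoop instance. If you ever need to force a reload, say after swapping out the model weights, one option (our addition, not part of the original app) is:

# Clears every st.cache_resource entry, so the next call to load_models()
# after the clear loads the model from scratch.
if st.button('Reload model'):
    st.cache_resource.clear()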

We also render the core UI elements on the page: a text element that will display the chat history, a button that clears the conversation, and a form with an input text box and submit button for the user to type their message.

# Placeholder element that the chat transcript will be written into
output = st.text('')

def clear_conversation():
    chat.clear_conversation()

st.button('Clear Conversation', on_click=clear_conversation)


# clear_on_submit empties the text area once the message is sent
with st.form(key='input-form', clear_on_submit=True):
    user_input = st.text_area("You:", key='input', height=100)
    submit_button = st.form_submit_button(label='Send')

Streamlit applications run from top to bottom whenever the user triggers a UI interaction. Since we don’t want to do anything unless the user has submitted a message, we use the following block to halt the script on runs where nothing was submitted:

if not submit_button:
    st.stop()
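If you also want to ignore blank submissions, a variant of this guard (our suggestion, not in the original code) is:

# Stop both when nothing was submitted and when the message is blank.
if not submit_button or not user_input.strip():
    st.stop()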

If the user has submitted a message, we pass it to the model, read the entire message history from the conversation object, and render it to the page:

chat.take_user_input(user_input)
# Render every turn so far; the open assistant slot (None) renders as "ASSISTANT: "
messages = [f'{role}: {"" if message is None else message}' for role, message in chat.conv.messages]
message_string = "\n\n".join(messages)
for text in chat.loop():
    # Re-draw the transcript with the partial reply appended as it streams in
    message = message_string + text
    message = message.replace('\n', '\n\n')
    output.write(message)

To run this in Saturn Cloud, click the link below:

Run in Saturn Cloud

Next Steps

In this post we’ve outlined how you can build a chatbot with Llama, Vicuna, FastChat and Streamlit. The repository linked in this article also includes instructions for deploying this as a service in Saturn Cloud.

In future posts, we will cover augmenting these chatbots with your own confidential and proprietary data.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.