The Chatbot with RAG

  • Time frame: 05/2024 - 07/2024
  • Project URL: Chatbot
  • Team: Noah Hartmann
  • Core Technologies/Frameworks/Tools: Python, LangChain, HuggingFace, DeepEval, Streamlit
About

As part of the elective course "Artificial Intelligence and Adaptive Systems", a chatbot application with a Retrieval Augmented Generation (RAG) pipeline was developed. It can answer questions about Taylor Swift.
A detailed report (in German) can be found here.

Retrieval Augmented Generation
Retrieval-Augmented Generation (RAG) is an architectural concept designed to improve the quality of Large Language Model (LLM) applications by utilizing custom data. This approach involves retrieving data or documents relevant to a question or task and using this information as context for the LLM.
RAG combines two key components:
  • Retrieval System: It searches for relevant external information.
  • Generation Model: It uses this retrieved information to generate accurate and contextually appropriate responses.
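To make these two components concrete, here is a minimal sketch of such a pipeline in LangChain. The embedding model, the FAISS store, the toy corpus, and the generic llm argument are illustrative assumptions, not the exact components used in this project.

```python
# Minimal RAG sketch: a retrieval system plus a generation model.
# Model names, the FAISS store and the toy corpus are illustrative only.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate

# Retrieval system: embed the documents once, then search them per question.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_texts(
    ["Taylor Swift released the album 1989 on October 27, 2014."],  # toy corpus
    embedding=embeddings,
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Generation model: the retrieved chunks are passed to the LLM as context.
prompt = PromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def answer(question: str, llm):
    # llm can be any LangChain LLM or chat model
    docs = retriever.invoke(question)                         # retrieval step
    context = "\n\n".join(doc.page_content for doc in docs)   # stuff chunks into the prompt
    return llm.invoke(prompt.format(context=context, question=question))
```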
Some challenges addressed by RAG
  • Hallucination: RAG reduces the likelihood of the LLM generating plausible but incorrect information by incorporating up-to-date and verified external data. This helps mitigate issues like sentence contradictions, factual inaccuracies, and irrelevant outputs.
  • Outdated knowledge: Traditional LLMs may provide outdated information because their knowledge is limited to their training data. RAG addresses this by giving the model access to the latest and most reliable information through the retrieval step.
  • Lack of Source Attribution: RAG allows users to trace the sources of the information provided by the LLM, thereby enhancing transparency and trustworthiness.
Why did I build a chatbot with RAG instead of fine-tuning?
Combining Fine-Tuning with RAG can significantly improve LLMs. Fine-Tuning refines models for specific domains using targeted data, while RAG enhances accuracy by integrating current external information.
For applications that do not require specialized domain knowledge (like this little chatbot), implementing a RAG architecture alone may be sufficient.

The application was implemented and evaluated using:
  • Python (v3.12)
  • LangChain (v0.2.7)
  • Hugging Face
  • Streamlit (v1.36.0)
  • DeepEval (v0.21.65)
Due to the time constraints of the project, there are many opportunities for optimization and improvement:
  • Setting Top-K and chunk size differently
  • Using a re-ranker (a small sketch follows this list)
  • Using a more advanced retriever
  • ...
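To illustrate the re-ranking idea, the sketch below scores retrieved chunks against the query with a cross-encoder and keeps only the best ones. The model name and the example chunks are assumptions; this step is not part of the current implementation.

```python
# Re-ranking sketch (not implemented in the project): a cross-encoder scores each
# retrieved chunk against the query so only the most relevant chunks reach the LLM.
from sentence_transformers import CrossEncoder

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # common re-ranking model
    scores = model.predict([(query, chunk) for chunk in chunks])  # one relevance score per chunk
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]

# Example: keep the 2 chunks most relevant to the question.
print(rerank(
    "When was the album 1989 released?",
    ["1989 was released in October 2014.",
     "Taylor Swift was born in December 1989.",
     "The Eras Tour started in 2023."],
    top_n=2,
))
```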

Screen Capture


My responsibilities

As I developed the application alone, I was responsible for all tasks.
I started by researching the RAG architecture and the tools & components needed to implement it.
The components needed and used correspond to the technologies listed in the overview above.

For evaluation, I leveraged the open-source LLM evaluation framework DeepEval and used the following metrics (a minimal usage sketch follows the list):
  • Contextual Precision Metric
  • Contextual Recall Metric
  • Contextual Relevancy Metric
  • Answer Relevancy Metric
  • Faithfulness Metric
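For illustration, a minimal DeepEval setup with these five metrics could look like the sketch below. The test case content is made up, and the metrics run with DeepEval's default judge model and thresholds, which may differ from the project's configuration.

```python
# DeepEval sketch: one illustrative test case evaluated with the five metrics above.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="When was the album 1989 released?",
    actual_output="1989 was released on October 27, 2014.",          # answer from the RAG pipeline
    expected_output="The album 1989 was released in October 2014.",  # reference answer
    retrieval_context=["Taylor Swift released 1989 on October 27, 2014."],
)

metrics = [
    ContextualPrecisionMetric(),
    ContextualRecallMetric(),
    ContextualRelevancyMetric(),
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
]

evaluate(test_cases=[test_case], metrics=metrics)  # prints per-metric scores and pass/fail
```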
With a test run of 20 cases, the system achieved a success rate of 75%. In another test run with 34 cases, the success rate dropped to 61.8%.

The two graphics indicate that the retriever is not optimally configured. There is potential for improvement in selecting relevant information for given inputs. Two areas to consider are:
  • Chunk Size of the Text Splitter: Chunks may be too large, including irrelevant information, or too small, missing important context.
  • Adjustment of Top-K: The number of returned results may not be ideal—too many results might include irrelevant information, while too few might miss relevant information.
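For reference, the sketch below shows where these two knobs live in a LangChain pipeline; the concrete values and the toy text are illustrative starting points, not the project's tuned settings.

```python
# Illustrative tuning knobs: chunk size of the text splitter and Top-K of the retriever.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

sample_text = (
    "Taylor Swift released the album 1989 on October 27, 2014. "
    "It was later re-recorded as 1989 (Taylor's Version) in 2023. "
) * 20

# Knob 1 - chunk size: larger chunks keep more surrounding context,
# smaller chunks keep retrieval focused but may cut related facts apart.
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
chunks = splitter.split_text(sample_text)

# Knob 2 - Top-K: how many chunks the retriever hands to the LLM as context.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = FAISS.from_texts(chunks, embedding=embeddings).as_retriever(
    search_kwargs={"k": 4}
)
```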
Overall, refining these aspects could enhance the retriever's performance. In the end, I deployed the application to Streamlit.

Learnings

  • Importance of Prompt Engineering: Initially, the retriever returned many irrelevant documents. I improved this by refining data preprocessing and experimenting with parameters like Top-K and threshold. However, the generated answers were still partially incorrect, and trying different LLMs and embedding models like Cohere didn't help. The solution was adjusting the prompt templates to provide better instructions to the model, leading to more relevant responses.
  • Significance of Dataset Selection: The originally chosen dataset, "Netflix Movies and TV Shows," was unsuitable for evaluation due to the lack of factual questions. Consequently, I had to find a new dataset and adjust the data preprocessing towards the end of the project.
  • Challenges in Formulating Standalone Questions: There were difficulties in creating a standalone question from the chat history and a new query. Often, comments were added instead of rephrasing the original question, inappropriate context sometimes led to unsuitable answers that affected the rephrasing, and multiple variants of a question were generated instead of a single reformulated one. I suspect that improvements in both the retriever and the prompt template are needed to optimize this process. Due to time constraints, I opted for a different approach: instead of reformulating user prompts based on chat history, the prompts are now revised by an LLM before the retrieval step to make the queries as precise as possible without considering previous context (a sketch of this rewriting step follows the list).
  • Lessons on Dataset Usage: Initially, I used a subset of the dataset, which made it easier to manage and understand the process. However, I later realized that I had worked with only a fraction of the original dataset and neglected to use the full set. When I updated the vector database with the complete dataset, the quality of the RAG pipeline decreased significantly. This suggests that the data preprocessing worked well for the small subset but was inadequate for the entire dataset.
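A rough sketch of that query-revision step is shown below. The prompt wording is illustrative, and llm and retriever stand for whichever chat model and retriever the pipeline uses.

```python
# Query-revision sketch: an LLM rewrites the raw user input into one precise,
# self-contained question before it is sent to the retriever.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

rewrite_prompt = PromptTemplate.from_template(
    "Rewrite the following user input as a single, precise question suitable for a "
    "document search. Do not add comments or alternative variants.\n\n"
    "User input: {user_input}\n\nRewritten question:"
)

def retrieve_with_rewrite(user_input: str, llm, retriever):
    rewrite_chain = rewrite_prompt | llm | StrOutputParser()
    query = rewrite_chain.invoke({"user_input": user_input})  # revised, self-contained query
    return query, retriever.invoke(query)                     # retrieve with the revised query
```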
Overall, while the application is functional, there is significant potential for improvement. This project provided a foundational understanding and highlighted many areas for future development and optimization that were not addressed due to time constraints.