The Chatbot with RAG

  • Time frame: 05/2024 - 07/2024
  • Project URL: Chatbot
  • Team: Noah Hartmann
  • Core Technologies/Frameworks/Tools: Python, LangChain, HuggingFace, DeepEval, Streamlit
About

As part of the elective course "Artificial Intelligence and Adaptive Systems", a chatbot application with a Retrieval Augmented Generation (RAG) pipeline was developed. It can answer questions about Taylor Swift.
A detailed report (in German) can be found here.

Retrieval Augmented Generation
Retrieval-Augmented Generation (RAG) is an architectural concept designed to improve the quality of Large Language Model (LLM) applications by utilizing custom data. This approach involves retrieving data or documents relevant to a question or task and using this information as context for the LLM.
RAG combines two key components:
  • Retrieval System: It searches for relevant external information.
  • Generation Model: It uses this retrieved information to generate accurate and contextually appropriate responses.
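To make these two components concrete, here is a minimal sketch of such a pipeline in LangChain. The embedding model, the FAISS store, the toy corpus, and the generic llm argument are illustrative assumptions, not the exact components used in this project.

```python
# Minimal RAG sketch: a retrieval system plus a generation model.
# Model names, the FAISS store and the toy corpus are illustrative only.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate

# Retrieval system: embed the documents once, then search them per question.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_texts(
    ["Taylor Swift released the album 1989 on October 27, 2014."],  # toy corpus
    embedding=embeddings,
)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Generation model: the retrieved chunks are passed to the LLM as context.
prompt = PromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def answer(question: str, llm):
    # llm can be any LangChain LLM or chat model
    docs = retriever.invoke(question)                         # retrieval step
    context = "\n\n".join(doc.page_content for doc in docs)   # stuff chunks into the prompt
    return llm.invoke(prompt.format(context=context, question=question))
```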
Some challenges addressed by RAG
  • Hallucination: RAG reduces the likelihood of the LLM generating plausible but incorrect information by incorporating up-to-date and verified external data. This helps mitigate issues like sentence contradictions, factual inaccuracies, and irrelevant outputs.
  • Outdated knowledge: Traditional LLMs may provide outdated information because their knowledge is limited to their training data. RAG addresses this by giving the model access to the latest and most reliable information through the retrieval step.
  • Lack of Source Attribution: RAG allows users to trace the sources of the information provided by the LLM, thereby enhancing transparency and trustworthiness.
Why did I build a chatbot with RAG instead of fine-tuning?
Combining Fine-Tuning with RAG can significantly improve LLMs. Fine-Tuning refines models for specific domains using targeted data, while RAG enhances accuracy by integrating current external information.
For applications that do not require specialized domain knowledge (like this little chatbot), implementing a RAG architecture alone may be sufficient.

The application was implemented and evaluated using:
  • Python (v3.12)
  • LangChain (v0.2.7)
  • Hugging Face
  • Streamlit (v1.36.0)
  • DeepEval (v0.21.65)
Due to the time constraints of the project, there are many opportunities for optimization and improvement:
  • Setting Top-K and chunk size differently
  • Using a re-ranker (a small sketch follows this list)
  • Using a more advanced retriever
  • ...
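To illustrate the re-ranking idea, the sketch below scores retrieved chunks against the query with a cross-encoder and keeps only the best ones. The model name and the example chunks are assumptions; this step is not part of the current implementation.

```python
# Re-ranking sketch (not implemented in the project): a cross-encoder scores each
# retrieved chunk against the query so only the most relevant chunks reach the LLM.
from sentence_transformers import CrossEncoder

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # common re-ranking model
    scores = model.predict([(query, chunk) for chunk in chunks])  # one relevance score per chunk
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]

# Example: keep the 2 chunks most relevant to the question.
print(rerank(
    "When was the album 1989 released?",
    ["1989 was released in October 2014.",
     "Taylor Swift was born in December 1989.",
     "The Eras Tour started in 2023."],
    top_n=2,
))
```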

Screen Capture


My responsibilities

As I developed the application alone, I was responsible for all tasks.
I started by researching the RAG architecture and the tools & components needed to implement it.
The components needed and used correspond to the technologies listed in the overview above.

For evaluation, I leveraged the open-source LLM evaluation framework DeepEval and used the following metrics (a minimal usage sketch follows the list):
  • Contextual Precision Metric
  • Contextual Recall Metric
  • Contextual Relevancy Metric
  • Answer Relevancy Metric
  • Faithfulness Metric
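For illustration, a minimal DeepEval setup with these five metrics could look like the sketch below. The test case content is made up, and the metrics run with DeepEval's default judge model and thresholds, which may differ from the project's configuration.

```python
# DeepEval sketch: one illustrative test case evaluated with the five metrics above.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="When was the album 1989 released?",
    actual_output="1989 was released on October 27, 2014.",          # answer from the RAG pipeline
    expected_output="The album 1989 was released in October 2014.",  # reference answer
    retrieval_context=["Taylor Swift released 1989 on October 27, 2014."],
)

metrics = [
    ContextualPrecisionMetric(),
    ContextualRecallMetric(),
    ContextualRelevancyMetric(),
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
]

evaluate(test_cases=[test_case], metrics=metrics)  # prints per-metric scores and pass/fail
```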
With a test run of 20 cases, the system achieved a success rate of 75%. In another test run with 34 cases, the success rate dropped to 61.8%.

The two graphics indicate that the retriever is not optimally configured. There is potential for improvement in selecting relevant information for given inputs. Two areas to consider are:
  • Chunk Size of the Text Splitter: Chunks may be too large, including irrelevant information, or too small, missing important context.
  • Adjustment of Top-K: The number of returned results may not be ideal—too many results might include irrelevant information, while too few might miss relevant information.
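For reference, the sketch below shows where these two knobs live in a LangChain pipeline; the concrete values and the toy text are illustrative starting points, not the project's tuned settings.

```python
# Illustrative tuning knobs: chunk size of the text splitter and Top-K of the retriever.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

sample_text = (
    "Taylor Swift released the album 1989 on October 27, 2014. "
    "It was later re-recorded as 1989 (Taylor's Version) in 2023. "
) * 20

# Knob 1 - chunk size: larger chunks keep more surrounding context,
# smaller chunks keep retrieval focused but may cut related facts apart.
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
chunks = splitter.split_text(sample_text)

# Knob 2 - Top-K: how many chunks the retriever hands to the LLM as context.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = FAISS.from_texts(chunks, embedding=embeddings).as_retriever(
    search_kwargs={"k": 4}
)
```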
Overall, refining these aspects could enhance the retriever's performance. In the end, I deployed the application to Streamlit.

Learnings

  • Importance of Prompt Engineering: Initially, the retriever returned many irrelevant documents. I improved this by refining data preprocessing and experimenting with parameters like Top-K and threshold. However, the generated answers were still partially incorrect, and trying different LLMs and embedding models like Cohere didn't help. The solution was adjusting the prompt templates to provide better instructions to the model, leading to more relevant responses.
  • Significance of Dataset Selection: The originally chosen dataset, "Netflix Movies and TV Shows," was unsuitable for evaluation due to the lack of factual questions. Consequently, I had to find a new dataset and adjust the data preprocessing towards the end of the project.
  • Challenges in Formulating Standalone Questions: There were difficulties in creating a standalone question from the chat history and a new query. Often, comments were added instead of rephrasing the original question, inappropriate context sometimes led to unsuitable answers that affected the rephrasing, and multiple variants of a question were generated instead of a single reformulated one. I suspect that improvements in both the retriever and the prompt template are needed to optimize this process. Due to time constraints, I opted for a different approach: instead of reformulating user prompts based on chat history, the prompts are now revised by an LLM before the retrieval step to make the queries as precise as possible without considering previous context (a sketch of this rewriting step follows the list).
  • Lessons on Dataset Usage: Initially, I used a subset of the dataset, which made it easier to manage and understand the process. However, I later realized that I had worked with only a fraction of the original dataset and neglected to use the full set. When I updated the vector database with the complete dataset, the quality of the RAG pipeline decreased significantly. This suggests that the data preprocessing worked well for the small subset but was inadequate for the entire dataset.
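A rough sketch of that query-revision step is shown below. The prompt wording is illustrative, and llm and retriever stand for whichever chat model and retriever the pipeline uses.

```python
# Query-revision sketch: an LLM rewrites the raw user input into one precise,
# self-contained question before it is sent to the retriever.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

rewrite_prompt = PromptTemplate.from_template(
    "Rewrite the following user input as a single, precise question suitable for a "
    "document search. Do not add comments or alternative variants.\n\n"
    "User input: {user_input}\n\nRewritten question:"
)

def retrieve_with_rewrite(user_input: str, llm, retriever):
    rewrite_chain = rewrite_prompt | llm | StrOutputParser()
    query = rewrite_chain.invoke({"user_input": user_input})  # revised, self-contained query
    return query, retriever.invoke(query)                     # retrieve with the revised query
```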
Overall, while the application is functional, there is significant potential for improvement. This project provided a foundational understanding and highlighted many areas for future development and optimization that were not addressed due to time constraints.