

Hatched by Peter Buck

Oct 15, 2023

3 min read


Large language models (LLMs) have undoubtedly revolutionized the way we interact with technology. They can answer a wide range of questions and provide valuable information in real time. However, despite their impressive capabilities, LLMs still struggle to answer certain types of queries accurately. This is where better data engineering comes into play.

One of the main issues with LLMs is the reliability of their responses. Many users have received unreliable answers to fact-finding questions, such as the monetary value of a provision or the parties involved in a transaction. To address this problem, a new wave of applications offering automated question answering has emerged in the market, aiming to improve the accuracy of LLMs on exactly these kinds of queries.

For example, ChatGPT, powered by GPT-3.5 and GPT-4, has been trained to answer fact-finding questions. Yet even the latest models leave room for improvement. When tested on a set of 100 random sentences containing references to specific terms such as "intellectual property," "material adverse effect," or "amendments," the models undershoot the correct answers by a significant margin. This highlights the need for better data engineering techniques to enhance LLM performance.
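A test like the one described above amounts to comparing model answers against a small hand-labeled set. Here is a minimal sketch of such a check; the `predictions` and `gold` examples are hypothetical stand-ins, and a real evaluation would cover the full set of sentences mentioning terms like "material adverse effect."

```python
def normalize(answer: str) -> str:
    """Lowercase and drop punctuation so trivially different phrasings
    of the same fact still compare equal."""
    return "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace()).strip()

def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that exactly match the labeled answer."""
    matches = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return matches / len(gold)

# Illustrative data only: model outputs vs. hand-labeled answers.
predictions = ["$5 million", "Acme Corp.", "Section 12(b)"]
gold        = ["$5 million", "Acme Corp",  "Section 12(a)"]
print(exact_match_accuracy(predictions, gold))  # 2 of 3 match -> ~0.667
```

Exact match is a deliberately strict metric; looser scoring (token overlap, numeric tolerance) would credit near-misses, but strict matching makes "undershooting" easy to quantify.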

One approach to improving LLMs is to incorporate organic interactions and feedback data. Meta's BlenderBot 3, for instance, was trained on real conversations and user feedback to enhance its skills and safety. By leveraging organic data, the model can learn from high-quality conversations and feedback while identifying and avoiding adversarial or toxic behavior. This helps refine its responses and makes the model more reliable.

However, training models on organic data comes with its own set of challenges. Interactions with people "in the wild" are diverse, and not all of them are helpful or conducive to learning. Some individuals intentionally try to trick the model into giving unhelpful or toxic responses. It is therefore crucial to develop techniques that let LLMs learn from helpful teachers while filtering out unhelpful or adversarial input.
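The filtering step described above can be sketched as a pre-training pass over the collected feedback. The blocklist and length heuristic below are illustrative stand-ins for the learned classifiers a production pipeline would actually use:

```python
# Hypothetical blocklist; a real system would use a trained toxicity classifier.
TOXIC_MARKERS = {"idiot", "stupid", "shut up"}

def is_helpful(feedback: dict) -> bool:
    """Keep feedback that is substantive and free of obvious toxicity."""
    text = feedback["text"].lower()
    if any(marker in text for marker in TOXIC_MARKERS):
        return False
    # Very short messages rarely contain a usable correction.
    return len(text.split()) >= 3

raw_feedback = [
    {"text": "The deal value should be $4.2M, not $4.8M."},
    {"text": "shut up"},
    {"text": "ok"},
]
clean = [f for f in raw_feedback if is_helpful(f)]
print(len(clean))  # 1 — only the substantive correction survives
```

The point is the shape of the pipeline, not the specific heuristics: filter first, then train only on what passes.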

To further progress in this direction, Meta has made the de-identified interaction data from participating users publicly available to the research community. This move aims to encourage collaboration and accelerate advancements in LLMs. Researchers can now use this dataset to explore new techniques and algorithms and contribute to the development of more reliable and accurate language models.

While data engineering plays a vital role in improving LLMs, there are also actionable steps that users can take to enhance their experience with these models. Here are three suggestions:

  1. Verify and cross-reference information: Although LLMs provide quick responses, it's always good practice to verify the information they provide. Cross-referencing answers with trusted sources ensures accuracy and helps avoid misinformation.
  2. Provide context and specific details: When asking questions, it is beneficial to provide as much context and as many specific details as possible. This helps LLMs understand the query better and increases the chances of receiving accurate and relevant responses.
  3. Report and provide feedback: If you encounter inaccurate or problematic responses from LLMs, report them to the developers. Feedback plays a crucial role in improving the models and addressing their limitations.

In conclusion, while LLMs have made significant advances in answering a wide range of questions, there are still areas where they fall short. Better data engineering techniques, such as incorporating organic interactions and feedback data, can enhance the reliability and accuracy of LLMs. Meta's release of interaction data for research purposes further encourages progress in this field. By combining the efforts of researchers and users, we can continue to push the boundaries of LLMs and unlock their full potential.
