"Fine-Tuning Embeddings for Better Similarity Search and Learning in Public"

Hatched by Glasp
Aug 22, 2023
4 min read
"Fine-Tuning Embeddings for Better Similarity Search and Learning in Public"
Introduction:
In the realm of AI, embeddings play a crucial role in various applications such as similarity search and labeling workflows. In this article, we will delve into the concept of fine-tuning embeddings for better similarity search and explore how this can benefit the labeling workflow in the Kern AI refinery. Additionally, we will discuss the idea of learning in public and its potential in enhancing knowledge flow within organizations.
Understanding Embeddings and Their Applications:
Before diving into the details, it is important to have a basic understanding of what embeddings are and how they are generated. In simple terms, embeddings are numerical representations of objects or concepts in a vector space. These representations capture the semantic meaning and relationships between different entities.
One powerful application of embeddings is similarity search, where records can be compared based on the cosine similarity of their embeddings. By fine-tuning embeddings, we aim to increase the number of records of the same class within a similarity labeling session. This can greatly enhance the efficiency and accuracy of the labeling process.
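To make the similarity-search idea concrete, here is a minimal sketch (not code from the article) that ranks records by the cosine similarity of their embedding vectors using plain NumPy; the function names are illustrative only.
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k embeddings most similar to the query."""
    # Normalize once so a single matrix product yields all cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = (embeddings / norms) @ (query / np.linalg.norm(query))
    return np.argsort(-sims)[:k]
```
In a similarity labeling session, the records returned by such a lookup are the ones presented for labeling, which is why pulling more same-class records into the top results matters.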
Fine-Tuning Language Models:
Large language models (LLMs) are widely used for tasks such as question answering, information extraction, and sentiment analysis. Their strong performance comes from suitable architectures, well-designed training procedures, and access to vast amounts of training data from the internet. However, LLMs often lack domain-specific expertise.
This is where fine-tuning comes into play. Fine-tuning is the process of adjusting a language model to better fit the domain of your data. Before fine-tuning, it is worth checking the Hugging Face model hub for existing fine-tuned models that may already suit your data. By fine-tuning language models, we can leverage their generalized abilities while incorporating domain-specific knowledge.
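As a hedged illustration of that advice, the sketch below loads an existing checkpoint with the sentence-transformers library and encodes a few texts; the all-MiniLM-L6-v2 model is just a common general-purpose choice, not one named in the article.
```python
from sentence_transformers import SentenceTransformer

# Any pre-trained (or already fine-tuned) checkpoint from the Hugging Face hub;
# "all-MiniLM-L6-v2" is a common general-purpose pick, not the article's model.
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The delivery arrived two weeks late.",
    "My package never showed up on time.",
    "Great product, works exactly as described.",
]
embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384) for this checkpoint
```
If such an off-the-shelf model already separates your classes well, fine-tuning may not be necessary at all.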
The Prerequisite of Similarity Learning:
To successfully fine-tune embeddings, a task or objective is necessary. In our case, similarity learning serves as the prerequisite for fine-tuning. Similarity, in this context, is defined by class labels. Two records are considered similar if they share the same class label, and different if they have different class labels.
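The snippet below is an illustrative sketch of that definition (with hypothetical records, not the article's data): similar/dissimilar pairs are derived purely from class labels.
```python
from itertools import combinations

# Hypothetical labeled records: (text, class_label)
records = [
    ("Package arrived damaged", "complaint"),
    ("Item never delivered", "complaint"),
    ("Fast shipping, thanks!", "praise"),
    ("Excellent customer service", "praise"),
]

pairs = []
for (text_a, label_a), (text_b, label_b) in combinations(records, 2):
    # 1 = similar (same class), 0 = dissimilar (different classes)
    pairs.append((text_a, text_b, int(label_a == label_b)))

for pair in pairs:
    print(pair)
```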
An Experiment in Fine-Tuning Embeddings:
To illustrate the process of fine-tuning embeddings, we conducted an experiment using Kern AI refinery. We selected 20,000 random records and manually labeled 261 of them; after filtering for a confidence score above 0.7, 10,854 usable records remained for the fine-tuning pipeline.
We utilized SimilarityGroupSamples, as class information was the only similarity signal available. The goal was to learn a mapping from one embedding to another, using a pre-trained LLM as the encoder and a SkipConnectionHead on top. The performance of the fine-tuned embeddings was measured with a "top_1k" metric, which counts how many of the 1,000 most similar records belong to the same class; the objective was to increase that count.
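The following PyTorch sketch illustrates both ideas by analogy; it is not the refinery or SimilarityGroupSamples implementation, and the class and function names are assumptions. A residual ("skip connection") head maps raw embeddings to fine-tuned ones, and a helper counts same-class records among a query's k nearest neighbours, in the spirit of the "top_1k" metric.
```python
import torch
import torch.nn as nn

class SkipConnectionHead(nn.Module):
    """Maps a frozen encoder embedding to a fine-tuned one via a residual layer.

    Illustrative only; the head used in the article may differ in detail.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        # Start at the identity mapping, so the raw embedding is the initial guess.
        nn.init.zeros_(self.linear.weight)
        nn.init.zeros_(self.linear.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.linear(x)  # skip connection around the learned layer


def top_k_same_class(query_idx: int, embeddings: torch.Tensor,
                     labels: torch.Tensor, k: int = 1000) -> int:
    """Count records sharing the query's class among its k nearest neighbours."""
    emb = torch.nn.functional.normalize(embeddings, dim=1)
    sims = emb @ emb[query_idx]
    sims[query_idx] = -1.0  # exclude the query itself
    top_k = sims.topk(min(k, len(sims) - 1)).indices
    return int((labels[top_k] == labels[query_idx]).sum())
```
Training such a head against a similarity objective leaves the pre-trained encoder untouched while nudging same-class records closer together in the mapped space.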
Benefits of Fine-Tuned Embeddings:
The results of our experiment indicated that even with a small labeling session of 25 records, the fine-tuned embeddings already showed improvements compared to raw embeddings. This suggests that fine-tuning can be beneficial even with limited labeled data. Moreover, the fine-tuned embeddings can also enhance the performance of classifiers trained on the same data.
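As a rough, self-contained illustration of that last point (using synthetic stand-in data, not the experiment's records), one can compare a simple classifier trained on raw versus fine-tuned embeddings:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)

# Stand-ins for real data: the "fine-tuned" embeddings are simulated as being
# slightly better separated by class than the "raw" ones.
raw_embeddings = rng.normal(size=(500, 64)) + labels[:, None] * 0.3
tuned_embeddings = rng.normal(size=(500, 64)) + labels[:, None] * 0.8

for name, X in [("raw", raw_embeddings), ("fine-tuned", tuned_embeddings)]:
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```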
Challenges and Future Directions:
While fine-tuning embeddings can bring significant improvements, there are challenges to consider. For instance, when using basic PCA, embeddings may not be well-separated in only two dimensions, making the annotation process difficult. However, ongoing research focuses on methods to fine-tune embeddings, aiming for better separation of classes in the 2D space.
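A short sketch of the kind of check involved, assuming scikit-learn and random stand-in embeddings: projecting to two dimensions with PCA and inspecting how little variance those two components retain helps explain why classes can look mixed in a 2D view.
```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384))  # stand-in for real embeddings

pca = PCA(n_components=2)
points_2d = pca.fit_transform(embeddings)  # coordinates for a 2D scatter plot

# Fraction of the original variance captured by the two plotted dimensions.
print(pca.explained_variance_ratio_.sum())
```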
Learning in Public for Knowledge Flow:
In parallel to the discussion on fine-tuning embeddings, we explore the concept of learning in public. Learning in public refers to making one's learning process transparent and sharing it with others. By doing so, individuals can receive valuable feedback, support, and improve their knowledge and skills.
In organizations, there is tremendous potential to enhance knowledge flow by adopting a "socialized knowledge management system". The Personal Knowledge Management (PKM) framework, which emphasizes individual needs and desires, can serve as a foundation for improving knowledge flows. The Seek-Sense-Share approach within PKM encourages individuals to seek knowledge, make sense of it, and share it with others, fostering a culture of continuous learning and improvement.
Conclusion and Actionable Advice:
Fine-tuning embeddings can significantly improve similarity search and labeling workflows. By leveraging the power of embeddings and domain-specific knowledge, organizations can enhance the accuracy and efficiency of their AI applications. Additionally, adopting a learning in public approach can foster knowledge flow and drive innovation within organizations.
Here are three pieces of actionable advice to consider:
1. Explore existing fine-tuned models before embarking on the fine-tuning process to leverage pre-trained models that may suit your data.
2. Implement SimilarityGroupSamples to utilize class information as a similarity measure for fine-tuning embeddings.
3. Encourage learning in public within your organization by adopting the PKM framework and promoting the Seek-Sense-Share approach for individuals to enhance knowledge flow.
By fine-tuning embeddings and embracing a culture of learning in public, organizations can unlock the full potential of their data and knowledge, leading to improved AI applications and innovation.