Hatched by Glasp
Sep 05, 2023
4 min read
To fine-tune your embeddings for better similarity search, it's crucial to first understand what embeddings are and how they are generated. Embeddings are dense vector representations of data in a lower-dimensional space that capture its semantic meaning. They are typically used in natural language processing tasks such as text classification and sentiment analysis.
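For instance, a sentence-level embedding model can turn each record's text into such a vector. Here is a minimal sketch using the sentence-transformers library (the model name and example texts are only illustrative):

```python
from sentence_transformers import SentenceTransformer

# Load a general-purpose sentence embedding model (example choice)
model = SentenceTransformer("all-MiniLM-L6-v2")

records = [
    "Wireless noise-cancelling headphones",
    "Bluetooth over-ear headset",
    "Stainless steel kitchen knife",
]

# encode() returns one dense vector per input text
embeddings = model.encode(records)
print(embeddings.shape)  # e.g. (3, 384) for this model
```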
In the context of the Kern AI refinery, embeddings play a crucial role in enhancing the labeling workflow. One tool that has been implemented is similarity search, where users can select a record and find similar records based on the cosine similarity of their embeddings. By fine-tuning the embeddings, the goal is to increase the number of records of the same class within a similarity labeling session.
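To illustrate the idea behind similarity search (a sketch, not the refinery's actual implementation), cosine similarity over normalized embeddings can be computed in a few lines of NumPy:

```python
import numpy as np

def most_similar(query_vec: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k records most similar to the query by cosine similarity."""
    # Normalize so that the dot product equals cosine similarity
    query = query_vec / np.linalg.norm(query_vec)
    matrix = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = matrix @ query
    return np.argsort(-scores)[:k]
```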
Large language models (LLMs) can solve a wide variety of tasks, including question answering, information extraction, and sentiment analysis. What makes them effective is the combination of the right architecture, a well-designed training procedure, and vast amounts of training data from the internet. However, LLMs lack domain-specific expertise, which is where fine-tuning comes into play.
Fine-tuning is the process of adjusting a language model to better fit the domain-specific data. Before fine-tuning, it's important to explore existing models in the Hugging Face model database to check if a model has already been fine-tuned on similar data. This can save time and resources. If no suitable pre-trained model is available, the fine-tuning process can be initiated.
To fine-tune the embeddings, a task needs to be defined. In this case, the task is similarity learning, where the goal is to determine the similarity between records based on their class labels. Each record consists of a title, a description, and an associated label. By selecting 20,000 records and manually labeling 261 of them, a dataset of 10,854 usable records is obtained for the fine-tuning pipeline.
Quaterion, a tool used in the Kern AI refinery, provides different options for leveraging similarity information in the fine-tuning process. These options include using a similarity score, pre-formed triplets, or similarity groups based on class labels. In this case, the class information is used as the similarity measure, and SimilarityGroupSamples are employed.
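A rough sketch of wrapping the labeled records as group samples, with the class label serving as the group id, might look like the following. The record field names ("title", "description", "label") are assumptions for illustration, and the Quaterion import path should be checked against the library's documentation:

```python
from quaterion.dataset import SimilarityGroupSample

def to_group_samples(records: list[dict], label_to_id: dict[str, int]) -> list[SimilarityGroupSample]:
    samples = []
    for record in records:
        text = f"{record['title']} {record['description']}"
        # Records sharing a group id are treated as similar during training
        samples.append(SimilarityGroupSample(obj=text, group=label_to_id[record["label"]]))
    return samples
```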
The fine-tuning process involves using a pre-trained LLM as the encoder and adding a SkipConnectionHead on top of it, which is preferable to using just a linear layer. The goal is to learn a mapping from one embedding to another: in a traditional classification task the head would have as many output features as there are classes, but here the head maps embeddings back to embeddings, so a different metric is needed to measure the success of the fine-tuning process.
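Conceptually, a skip-connection head only has to learn the residual between the original embedding and the fine-tuned one, which is an easier target than learning the full mapping from scratch. A minimal PyTorch sketch of that idea (a simplified stand-in, not Quaterion's actual SkipConnectionHead):

```python
import torch
from torch import nn

class SkipConnectionHead(nn.Module):
    """Maps an embedding to a new embedding while keeping a residual
    (skip) connection to the original encoder output."""

    def __init__(self, embedding_dim: int):
        super().__init__()
        self.linear = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        # The head only learns the *difference* from the original embedding
        return embedding + self.linear(embedding)
```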
The "top_1k" metric is used to measure the increase in the number of records of the same class within the 1000 most similar records. This metric captures the objective of the fine-tuning process. Additionally, the amount of records that need to be labeled for the fine-tuning to be beneficial is also identified.
The test data consists of 9,156 records that were not used in the training or validation steps. The results show that even with as few as 25 labeled records, the benefits of fine-tuning the embeddings are already noticeable. The fine-tuned embeddings consistently outperform the raw embeddings, indicating the effectiveness of the process.
Furthermore, fine-tuned embeddings with class information can also benefit classifiers trained on the same data. Basic principal component analysis (PCA) is often insufficient to separate the embeddings well in just two dimensions, making the annotation process challenging. As a result, efforts are being made to develop methods that fine-tune embeddings and improve the separation of classes in the 2D space.
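As a quick illustration (a sketch, not the refinery's tooling), projecting embeddings onto two dimensions with scikit-learn's PCA and coloring the points by integer-encoded class label makes it easy to see how well, or poorly, the classes separate:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_embeddings_2d(embeddings, labels):
    # Project high-dimensional embeddings onto their first two principal components
    points = PCA(n_components=2).fit_transform(embeddings)
    plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap="tab10")
    plt.title("Embeddings projected to 2D with PCA")
    plt.show()
```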
In conclusion, fine-tuning embeddings for better similarity search is a valuable technique that can enhance the labeling workflow in the Kern AI refinery. By leveraging the power of large language models and incorporating class information, the number of records of the same class within a similarity labeling session can be increased. This process not only improves similarity search but also benefits classifiers trained on the same data.
To apply this knowledge in your own projects, here are three actionable pieces of advice:
1. Before fine-tuning your embeddings, explore existing models in the Hugging Face model database to check if a suitable pre-trained model is available. This can save time and resources.
2. When fine-tuning the embeddings, consider using similarity groups based on class labels. This can provide a meaningful measure of similarity and help achieve the desired results.
3. Experiment with different approaches to separate the embeddings in the 2D space. By fine-tuning the embeddings to improve class separation, the annotation process can be made more efficient.
By following this advice, you can fine-tune your embeddings effectively and improve the similarity search in your labeling workflow.