How to Fine-Tune Your Embeddings for Better Similarity Search and the Growth Engines of BeReal


Hatched by Glasp

Sep 01, 2023




In artificial intelligence and machine learning, embeddings are representations of data points in a continuous vector space that capture the semantic meaning of those points and the relationships between them. One way to use embeddings is similarity search: you find records similar to a given record by comparing the cosine similarity of their embeddings. This can speed up the labeling process in AI workflows, since similar records often share a label.
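To make the idea of similarity search concrete, here is a minimal sketch in plain Python. The toy three-dimensional vectors and the helper names `cosine_similarity` and `most_similar` are assumptions made for illustration; real embeddings have hundreds of dimensions and are usually searched with a vector library or database rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def most_similar(query, embeddings, k=3):
    """Return the indices of the k embeddings most similar to the query."""
    scored = [(cosine_similarity(query, e), i) for i, e in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy embeddings with made-up values, for illustration only.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
nearest = most_similar([1.0, 0.05, 0.0], embeddings, k=2)
print(nearest)
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for comparing text embeddings.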

To achieve better similarity search results, it is important to fine-tune your embeddings. Fine-tuning is the process of adjusting a pre-trained language model to better fit the domain of your data. Large language models (LLMs) are trained on a wide range of tasks using vast amounts of data from the internet. While they are generally good at many tasks, they may lack domain-specific expertise. Fine-tuning allows you to tailor the model to your specific needs.

Before diving into the fine-tuning process, it's essential to understand the concept of similarity learning. To fine-tune embeddings, you need a task to solve. This task can be anything from supervised classification to unsupervised masked token prediction. In the case of similarity search, similarity is defined by the class labels assigned to each record. Records with the same class label are considered similar, while those with different labels are considered dissimilar.
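The label-based definition of similarity can be turned into training pairs. The following sketch shows one possible way to do it; the function name `build_pairs` and the example labels are hypothetical, and the experiment described in this article may have built its training data differently.

```python
from itertools import combinations

def build_pairs(labels):
    """Return (i, j, target) for every pair of records.

    target is 1 when both records share a class label (similar)
    and 0 otherwise (dissimilar).
    """
    return [
        (i, j, 1 if labels[i] == labels[j] else 0)
        for i, j in combinations(range(len(labels)), 2)
    ]

# Made-up labels for four records, for illustration only.
labels = ["sports", "politics", "sports", "tech"]
pairs = build_pairs(labels)
for i, j, target in pairs:
    print(i, j, "similar" if target else "dissimilar")
```

Note that the number of pairs grows quadratically with the number of labeled records, so larger label sets are typically sampled rather than enumerated in full.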

In a recent experiment, 20,000 records were randomly selected and loaded into the Kern AI refinery. Out of these, 261 records were manually labeled with a confidence score above 0.7. These labeled records were then used for fine-tuning the embeddings. The goal was to increase the number of records of the same class within the top 1,000 most similar records.

To fine-tune the embeddings, a pre-trained LLM was used as the encoder, and a SkipConnectionHead was added on top. This approach allows for better mapping from one embedding to another. The performance of the fine-tuned embeddings was measured using the "top_1k" metric, which captures the increase in the number of records of the same class within the top 1,000 most similar records. The results showed that even with as few as 25 labeled records, the fine-tuned embeddings outperformed the raw embeddings.
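The sketch below illustrates the two ideas in this paragraph under stated assumptions. `residual_head` is a generic skip connection (output = input + W·input), not the actual SkipConnectionHead implementation, and `same_class_in_top_k` is a hypothetical helper that counts same-class neighbours in the spirit of the "top_1k" metric, with k=1 on toy data instead of 1,000. The weights are hand-picked for the example, not trained.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def residual_head(x, weights):
    """Generic skip connection: output = x + W @ x.

    With W all zeros the head returns x unchanged, so training
    can start from the raw embedding and learn only a correction.
    """
    return [xi + sum(w * xj for w, xj in zip(row, x))
            for xi, row in zip(x, weights)]

def same_class_in_top_k(query_idx, embeddings, labels, k):
    """Count records sharing the query's label among its k nearest neighbours."""
    query = embeddings[query_idx]
    others = [i for i in range(len(embeddings)) if i != query_idx]
    others.sort(key=lambda i: cosine(query, embeddings[i]), reverse=True)
    return sum(1 for i in others[:k] if labels[i] == labels[query_idx])

# Toy 2-D embeddings (made-up): only the second dimension carries class signal.
labels = ["a", "a", "b", "b"]
raw = [[1.0, 0.3], [0.2, 0.4], [1.0, -0.1], [0.3, -0.4]]

# Hand-picked weights (NOT trained): shrink dimension 0, amplify dimension 1.
weights = [[-0.9, 0.0], [0.0, 1.0]]
tuned = [residual_head(x, weights) for x in raw]

raw_hits = same_class_in_top_k(0, raw, labels, k=1)
tuned_hits = same_class_in_top_k(0, tuned, labels, k=1)
print(raw_hits, tuned_hits)
```

In this toy setup the nearest neighbour of record 0 belongs to the other class before the head is applied and to the same class afterwards. In practice the weights would be learned from the labeled records rather than set by hand.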

The benefits of fine-tuning embeddings extend beyond just similarity search. They can also improve the performance of classifiers trained on the same data. However, it is important to note that the embeddings may not be well separated in only two dimensions, making the annotation process challenging. To overcome this, ongoing research is being conducted to fine-tune the embeddings in a way that leads to better separation of classes in a 2D space.

While fine-tuning embeddings can greatly enhance the labeling process, AI is not the only area of technology that has seen significant growth. BeReal, an anti-Instagram social media startup, has gained attention from prominent publications and experienced rapid growth through its ambassador program and TikTok.

BeReal was introduced to the world by Kevin and Alexis, who chose LinkedIn as the medium for their announcement. The rise of anti-Instagram social media startups like Poparazzi, ttyl, Clubhouse, and BeReal caught the attention of various publications. In April 2021, BeReal was featured on Product Hunt, but the founders requested that the post be taken down.

May 2021 marked a significant milestone for BeReal: its first major press coverage, in the Financial Times. The article, titled "Lessons for Big Tech from the 'anti-social' photo app", highlighted BeReal's unique approach and its potential impact on the social media landscape. This was followed by Vogue's coverage of Poparazzi and BeReal in June 2021, which further solidified their position in the new generation of social media platforms.

To fuel its growth, BeReal implemented an ambassador program that rewarded individuals for referring others to the app. Ambassadors would receive $30 for every person who downloaded the app through their referral link. If the referred person provided feedback, the incentive increased to $50. The program focused on hosting parties, partnering with student organizations and Greek houses, and securing placements in various student newspapers.

Another growth engine for BeReal was TikTok. From May 2022 to July 2022, BeReal went viral multiple times on TikTok, with nine videos from the official BeReal channel garnering over 1 million views each. This exposure on TikTok helped BeReal reach a wider audience and attract more users to the platform.

In September 2022, the significance of BeReal's growth became apparent when the founder's own mother asked them to explain the app. This anecdote highlights the increasing popularity and recognition of BeReal among different demographics.

In conclusion, fine-tuning embeddings can greatly enhance the similarity search process in AI workflows. By adjusting a pre-trained language model to better fit the domain of your data, you can improve the accuracy of similarity search results. Additionally, BeReal's growth engines, such as its ambassador program and presence on TikTok, have contributed to its success in the competitive social media landscape. The combination of these two topics showcases the continuous evolution and innovation in the fields of AI and social media platforms.

Actionable Advice:

  1. When fine-tuning embeddings, explore existing pre-trained models that may already be suitable for your data domain. This can save time and resources.
  2. Experiment with different similarity learning tasks to find the most effective approach for your specific use case. Supervised classification and unsupervised masked token prediction are good starting points.
  3. Consider leveraging social media platforms, like TikTok, to increase awareness and attract a larger user base for your product or service. Viral content can significantly impact growth and engagement.

(Note: The information in this article is for educational purposes only and does not endorse or promote any specific products or platforms. The examples used are fictional and do not represent any real-world entities or events.)
