Liftoff! How to get started with your first ML project 🚀

People who are new to the Machine Learning world often run into two recurring stumbling blocks. The first is choosing the right library to learn, which can be daunting when there are so many to pick from. Even once you’ve settled on a library and gone through some tutorials, the next issue is coming up with your first big project and scoping it properly to maximize your learning. If you’ve run into those problems, and if you're looking for a new ML library to add to your toolkit, you're in the right place!

In this post I’ll take you through some tips for going from 0 to 100 with a new library by using Sentence Transformers (ST) as an example. We'll start by understanding the basics of what ST can do, and highlight some things that make it a great library to learn. Then, I'll share my battle-tested strategy for tackling your first self-driven project. We’ll also talk about how I built my first ST-powered project, and what I learned along the way 🥳

What is Sentence Transformers?

Sentence embeddings? Semantic search? Cosine similarity?!?! 😱 Just a few short weeks ago, these terms were so confusing to me that they made my head spin. I’d heard that Sentence Transformers was a powerful and versatile library for working with language and image data and I was eager to play around with it, but I was worried that I would be out of my depth. As it turns out, I couldn’t have been more wrong!

Sentence Transformers is among the libraries that Hugging Face integrates with, where it’s described with the following:

Compute dense vector representations for sentences, paragraphs, and images

In a nutshell, Sentence Transformers answers one question: What if we could treat sentences as points in a multi-dimensional vector space? This means that ST lets you give it an arbitrary string of text (e.g., “I’m so glad I learned to code with Python!”), and it’ll transform it into a vector, such as [0.2, 0.5, 1.3, 0.9]. Another sentence, such as “Python is a great programming language.”, would be transformed into a different vector. These vectors are called “embeddings,” and they play an essential role in Machine Learning. If these two sentences were embedded with the same model, then both would coexist in the same vector space, allowing for many interesting possibilities.

What makes ST particularly useful is that, once you’ve generated some embeddings, you can use the built-in utility functions to compare how similar one sentence is to another, including synonyms! 🤯 One way to do this is with the “Cosine Similarity” function. With ST, you can skip all the pesky math and call the very handy util.cos_sim function to get a score from -1 to 1 that signifies how “similar” the embedded sentences are in the vector space they share – the bigger the score is, the more similar the sentences are!

A flowchart showing sentences being embedded with Sentence Transformers, and then compared with Cosine Similarity
After embedding sentences, we can compare them with Cosine Similarity.

Comparing sentences by similarity means that if we have a collection of sentences or paragraphs, we can quickly find the ones that match a particular search query with a process called semantic search. For some specific applications of this, see this tutorial for making a GitHub code-searcher or this other tutorial on building an FAQ engine using Sentence Transformers.

Why learn to use Sentence Transformers?

First, it offers a low-barrier way to get hands-on experience with state-of-the-art models to generate embeddings. I found that creating my own sentence embeddings was a powerful learning tool that helped strengthen my understanding of how modern models work with text, and it also got the creative juices flowing for ideation! Within a few minutes of loading up the msmarco-MiniLM-L-6-v3 model in a Jupyter notebook I’d come up with a bunch of fun project ideas just from embedding some sentences and running some of ST’s utility functions on them.

Second, Sentence Transformers is an accessible entry-point to many important ML concepts that you can branch off into. For example, you can use it to learn about clustering, model distillation, and even launch into text-to-image work with CLIP. In fact, Sentence Transformers is so versatile that it’s skyrocketed to almost 8,000 stars on GitHub, with more than 3,000 projects and packages depending on it. On top of the official docs, there’s an abundance of community-created content (look for some links at the end of this post 👀), and the library’s ubiquity has made it popular in research.

Third, embeddings are key for several industrial applications. Google searches use embeddings to match text to text and text to images; Snapchat uses them to "serve the right ad to the right user at the right time"; and Meta (Facebook) uses them for their social search. In other words, embeddings allow you to build things like chatbots, recommendation systems, zero-shot classifiers, image search, FAQ systems, and more.

On top of it all, it’s also supported with a ton of Hugging Face integrations 🤗.

Tackling your first project

So you’ve decided to check out Sentence Transformers and worked through some examples in the docs… now what? Your first self-driven project (I call these Rocket Launch projects 🚀) is a big step in your learning journey, and you’ll want to make the most of it! Here’s a little recipe that I like to follow when I’m trying out a new tool:

  1. Do a brain dump of everything you know the tool’s capable of: For Sentence Transformers this includes generating sentence embeddings, comparing sentences, retrieve and re-rank for complex search tasks, clustering, and searching for similar documents with semantic search.
  2. Reflect on some interesting data sources: There’s a huge collection of datasets on the Hugging Face Hub, or you can also consult lists like awesome-public-datasets for some inspiration. You can often find interesting data in unexpected places – your municipality, for example, may have an open data portal. You’re going to spend a decent amount of time working with your data, so you may as well pick datasets that excite you!
  3. Pick a secondary tool that you’re somewhat comfortable with: Why limit your experience to learning one tool at a time? “Distributed practice” (a.k.a. “spaced repetition”) means spreading your learning across multiple sessions, and it’s been proven to be an effective strategy for learning new material. One way to actively do this is by practicing new skills even in situations where they’re not the main learning focus. If you’ve recently picked up a new tool, this is a great opportunity to multiply your learning potential by battle-testing your skills. I recommend only including one secondary tool in your Rocket Launch projects.
  4. Ideate: Spend some time brainstorming on what different combination of the elements from the first 3 steps could look like! No idea is a bad idea, and I usually try to aim for quantity instead of stressing over quality. Before long you’ll find a few ideas that light that special spark of curiosity for you ✨

For my first Sentence Transformers project, I remembered that I had a little dataset of popular song lyrics kicking around, which I realized I could combine with ST’s semantic search functionality to create a fun playlist generator. I imagined that if I could ask a user for a text prompt (e.g. “I’m feeling wild and free!”), maybe I could find songs that had lyrics that matched the prompt! I’d also been making demos with Gradio and had recently been working on scaling up my skills with the newly-released Gradio Blocks, so for my secondary tool I decided I would make a cool Blocks-based Gradio app to showcase my project. Never pass up a chance to feed two birds with one scone 🦆🐓

Here’s what I ended up making! Keep an eye out for a future blog post where we'll break down how this was built 👀

What can you expect to learn from your first project?

Since every project is unique, your learning journey will also be unique! According to the “constructivism” theory of learning, knowledge is deeply personal and constructed by actively making connections to other knowledge we already possess. Through my Playlist Generator project, for example, I had to learn about the various pre-trained models that Sentence Transformers supports so that I could find one that matched my use-case. Since I was working with Gradio on Hugging Face Spaces, I learned about hosting my embeddings on the Hugging Face Hub and loading them into my app. To top it off, since I had a lot of lyrics to embed, I looked for ways to speed up the embedding process and even got to learn about Sentence Transformers’ Multi-Processor support.

Once you’ve gone through your first project, you’ll find that you’ll have even more ideas for things to work on! Have fun, and don’t forget to share your projects and everything you’ve learned with us over at 🤗

Further reading: