2nd Place Solution - All about text embeddings :P

#22
opened by zayedupal

Hi All,

First of all thanks a lot to @huggingface-co for hosting this competition.
It was a truly enjoyable experience to take part in, and I learned a lot.
I hope to share a more detailed post and the code in the future when I have time, but here is a summary of the steps I took and the intuitions driving them:

  • Basic Data Analysis:

    From the basic analysis, it was clear that the dataset was balanced and had no missing values. However, there were duplicates based on the movie name and synopsis (discussed later).
  • Feature selection:

  • I first used only the synopsis, and later concatenated the movie name and synopsis into a single text. Combining them increased the evaluation accuracy.
  • Removing Duplicates:

    To remove duplicates, I used a sentence-transformers model to calculate the cosine similarity between the movie name + synopsis text and the genre, and then kept the entry whose genre was most similar (see the sketch after this list).
  • My approach:

  • I tried out different pre-trained models for generating text embeddings.
  • Classified those embeddings using different models (I used scikit-learn for easy model building).
  • Finally, combined the predictions from each of those classifiers through soft voting (averaging the predicted probabilities to select the class); see the pipeline sketch after this list.
  • The model that won second place:

  • Embeddings generated using - 1. https://huggingface.co/sentence-transformers/sentence-t5-xxl, 2. https://huggingface.co/google/flan-t5-xxl, 3. https://huggingface.co/google/flan-t5-xl.
  • Model used - Logistic Regression with saga solver. All the other parameters remained at the default values specified by scikit-learn.
  • The model that had the best private score:

    Everything was similar to the previous model, except I also added https://huggingface.co/hkunlp/instructor-xl for embedding generation.
  • Learnings and Observations

  1. Simple Logistic Regression worked better than Random Forest, MLP, Decision Tree, and SVM.
  2. https://huggingface.co/spaces/mteb/leaderboard has been a huge help in choosing different models for text embedding generation.
  3. https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu and https://paperswithcode.com/sota/natural-language-inference-on-rte benchmarks also helped a lot.
  4. I used a simple 70:30 split on the training data to evaluate the models. I should have used k-fold cross-validation for better model selection (see the sketch after this list).
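
A minimal sketch of the deduplication idea described above (not the exact competition code): it assumes a pandas DataFrame with `movie_name`, `synopsis`, and `genre` columns (the column names are my guess), and uses a small sentence-transformers model as a stand-in for the larger embedding models.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Small stand-in model; any sentence-transformers model can be used here.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Assumed column names: movie_name, synopsis, genre.
train = pd.read_csv("train.csv")
train["text"] = train["movie_name"] + " - " + train["synopsis"]

# Cosine similarity between each (movie name + synopsis) text and its genre label.
text_emb = model.encode(train["text"].tolist(), convert_to_tensor=True)
genre_emb = model.encode(train["genre"].tolist(), convert_to_tensor=True)
train["similarity"] = util.cos_sim(text_emb, genre_emb).diagonal().cpu().numpy()

# Among duplicated (movie_name, synopsis) rows, keep the one whose genre is most similar.
train = (
    train.sort_values("similarity", ascending=False)
         .drop_duplicates(subset=["movie_name", "synopsis"], keep="first")
)
```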
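
And here is a hedged sketch of the embeddings -> Logistic Regression -> soft-voting pipeline. The model names below are lightweight stand-ins; the actual solution used sentence-t5-xxl, flan-t5-xxl, and flan-t5-xl, and extracting sentence embeddings from the flan-t5 checkpoints needs an extra pooling step that isn't detailed in this post.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Stand-ins for the actual embedding models (sentence-t5-xxl, flan-t5-xxl, flan-t5-xl).
embedding_models = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
]

def embed(model_name, texts):
    """Encode a list of texts into a (n_samples, dim) embedding matrix."""
    return SentenceTransformer(model_name).encode(texts, show_progress_bar=True)

def predict_with_soft_voting(train_texts, y_train, test_texts):
    """Train one Logistic Regression per embedding model and average the probabilities."""
    probas, classes = [], None
    for name in embedding_models:
        X_train, X_test = embed(name, train_texts), embed(name, test_texts)
        clf = LogisticRegression(solver="saga")  # other parameters left at sklearn defaults
        clf.fit(X_train, y_train)
        probas.append(clf.predict_proba(X_test))
        classes = clf.classes_
    avg_proba = np.mean(probas, axis=0)           # soft voting: average the predicted probabilities
    return classes[avg_proba.argmax(axis=1)]      # pick the class with the highest mean probability
```

Swapping the stand-in names for the full-size checkpoints (and batching the encoding) gives the general shape of the winning setup.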
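
For completeness, the k-fold evaluation mentioned in the learnings would look roughly like this (placeholder data stands in for the real embedding matrix and genre labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data: in practice X is the embedding matrix and y the genre labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 384))
y = rng.integers(0, 10, size=300)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(solver="saga"), X, y, cv=cv, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```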

Hope this helps.

Best,
Zayed

How much time did it take to generate the embeddings?

@aman1391 , it varied across the models, and I didn't save the runtimes. As far as I remember, flan-t5-xxl and t5-xxl took the longest time. I used paid Google Colab; for the whole competition, I used around 250 compute units in total.

Okay, btw nicely done, so much to learn from your solution 😃

Great @zayedupal – thank you so much for sharing your solution 🙏

@zayedupal , thank you. I am trying to reproduce it but without success; could you please share your code?

Sorry for the late reply.
I have uploaded my code here:
https://github.com/zayedupal/Hugging_Face_Movie_Genre_Prediction_Public/blob/main/README.md
@AkmalAzzam
@janbelke , waiting for the prize :P

Well, it has been more than 2 months, but we haven't heard about the prize @janbelke
