---
title: Reflections on Kaggle competition [Coleridge Initiative - Show US the Data]
desc: Reflecting on what worked and what didn't
published: true
date_published: 2021-12-29
tags: kaggle nlp
---

It's been several months since the [Coleridge Initiative - Show US the Data](https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data) competition ended, but I recently got in the mood to write a quick reflection about my experience. This reflection is mainly a way for me to assess what I learned, but maybe you'll also find something worthwhile.

## Competition Details 🏆

The hosts wanted to find all of the mentions of public datasets hidden in published journals and articles. When I say hidden, I mean that a paper refers to a dataset but never officially cites anything in the references section. In the hosts' own words:

> Can natural language processing find the hidden-in-plain-sight data citations? [This will] show how public data are being used in science and help the government make wiser, more transparent public investments.

Participants competed to see who could find the most mentions of datasets in roughly 8,000 publications.

As this was my first Kaggle competition, I quickly realized that it was much more nuanced and complicated than I expected. It was also a wake-up call for how little I knew about PyTorch, TensorFlow, GPUs, TPUs, and machine learning in general. I was challenged and pushed to think in new ways, and I felt that I had grown significantly by the time the competition ended.

Moreover, the final results of the competition were very surprising because my team jumped over 1400 spots from the public to the private leaderboard to finish 47th out of 1610 teams, earning us a silver medal (top 3%). If you would like to know more about why there are separate public and private leaderboards, [read this post here](https://qr.ae/pG6Xc1). I've included a plot below of public ranking vs. private ranking to show how much "shake-up" there was. My team had the 3rd-highest positive delta, meaning that only 2 other teams jumped more positions from public to private. The person who probably had the worst day went from 176 (just outside the medal range) to 1438.

To understand the figure, there are essentially 3 categories. The first is any point on the line y=x, which means the team had the exact same rank on the public and private leaderboards; the further a point gets from y=x, the bigger the difference between the two leaderboards. The second category is the teams who dropped from public to private -- the region between the line y=x and the y-axis. The final category is the teams that moved up from public to private -- the region between the line y=x and the x-axis. My team fell into this third category, practically in the bottom-right corner.

*Public vs. private leaderboard rank for the Coleridge competition*

## Why the shake-up?

> *Shake-up is the term used on Kaggle to describe the change in rank from the public to the private leaderboard.*

In general, shake-up is caused by overfitting to the public leaderboard, not having a good cross-validation method, and having a model that does not generalize well. For this competition, the huge jumps in score happened because:

1. String matching worked well on the public leaderboard but not the private one.
   - The public leaderboard contained dataset names from the training data, but the private leaderboard didn't have any.
2. Most people did not check whether their approach could find datasets that weren't mentioned in the training data.

If I'm being honest, I didn't have great cross-validation either, but I also refused to do string matching against dataset lists because it didn't seem like what the hosts wanted, and I was more interested in applying a deep learning approach.

## Best solution 💡

It was a bit annoying that my best submission actually didn't require any training whatsoever. I suppose I could chalk this up to being resourceful, but I still find it annoying considering all the time I put into other methods that ultimately scored worse. My best submission used a combination of regular expressions and a pre-trained question answering model that I pulled off the 🤗 Hugging Face Model Hub.

For the regular expressions, I came up with a few simple patterns that looked for sentences containing words indicative of a dataset, such as Study, Survey, Dataset, etc. These patterns were used to quickly narrow down the amount of text that I would run through the slow transformer model. I sampled a few different architectures (BERT, RoBERTa, ELECTRA) and ultimately chose an ELECTRA model that had been fine-tuned on SQuAD 2.0. The final step was to pass the sentences that matched the regex patterns into the question answering model as the context, alongside "What is the dataset?" as the question. Adding the question gives the model extra information that helps it extract the right span of text. This submission went from 0.335 on the public leaderboard to 0.339 on the private one.
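To make that concrete, here is a minimal sketch of the pipeline rather than my exact submission code: the cue-word list, the sentence splitting, the confidence threshold, and the `deepset/electra-base-squad2` checkpoint are all illustrative stand-ins (any extractive QA model fine-tuned on SQuAD 2.0 from the Model Hub slots in the same way).

```python
import re

from transformers import pipeline

# Words that often appear in dataset names (illustrative, not exhaustive).
CUE_WORDS = re.compile(r"\b(Study|Survey|Dataset|Database)\b")

# Stand-in checkpoint for the ELECTRA model I used; any extractive QA model
# fine-tuned on SQuAD 2.0 works the same way.
qa_model = pipeline("question-answering", model="deepset/electra-base-squad2")

def find_dataset_mentions(publication_text, threshold=0.5):
    """Return candidate dataset names found in a publication's full text."""
    mentions = set()
    # Very rough sentence split; a real pipeline would want something sturdier.
    for sentence in re.split(r"(?<=[.!?])\s+", publication_text):
        # The regex prefilter keeps the slow transformer off most of the text.
        if not CUE_WORDS.search(sentence):
            continue
        result = qa_model(
            question="What is the dataset?",  # extra signal for the model
            context=sentence,
            handle_impossible_answer=True,    # SQuAD 2.0 models may abstain
        )
        if result["answer"] and result["score"] >= threshold:
            mentions.add(result["answer"].strip())
    return mentions
```

The regex prefilter is there purely for speed: the QA model only ever sees the handful of sentences that could plausibly name a dataset.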
## Setbacks 😖

As mentioned earlier, my NER models were absolutely dreadful. I figured that identifying dataset names in passages of text was essentially token classification, so I focused 95% of my effort on building an NER model. I think the NER models struggled because the training set had over 14,000 different texts but only 130 distinct dataset labels. With a wider variety of labels it might have done better, but I think the model simply memorized those 130 names. Moreover, many spans containing dataset names were not labeled, so I was implicitly training the model to ignore other dataset names (a rough sketch of this labeling problem is at the end of the post). I ended up using my NER model for one of my two final submissions, and it went from 0.398 on the public leaderboard to 0.038 on the private leaderboard. Big oof.

## Key Takeaways 🗝️

I don't think this was a typical Kaggle competition, and it was a little surprising to see how some of the top teams relied much more on good rule-based approaches than on deep learning models. I think most Kagglers default to the biggest, most complicated transformer-based approach because they've heard so much about BERT et al. While transformers have achieved remarkable scores on numerous benchmarks, they can't do everything! I fell into this trap too, and I think my final score was mainly due to luck. 🍀
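As a postscript to the Setbacks section, here is the labeling problem in miniature. This is a hypothetical reconstruction, not my actual preprocessing code: tags are created by matching the provided dataset labels against the tokens, so any genuine dataset mention that is not on the label list stays tagged `O` and the model is trained to treat it as background text.

```python
def bio_tags(tokens, known_labels):
    """Tag tokens with B-/I-DATASET if they fall inside a known label span.

    The catch: a genuine dataset mention that is *not* in `known_labels`
    stays tagged "O", so the model learns to ignore it.
    """
    tags = ["O"] * len(tokens)
    for label in known_labels:
        label_tokens = label.split()
        width = len(label_tokens)
        for start in range(len(tokens) - width + 1):
            if tokens[start:start + width] == label_tokens:
                tags[start] = "B-DATASET"
                for i in range(start + 1, start + width):
                    tags[i] = "I-DATASET"
    return tags

tokens = ("We analyzed data from the National Education Longitudinal Study "
          "and the Baccalaureate and Beyond Longitudinal Study .").split()

# Only the first mention is in the label list, so the second one silently
# becomes "O" -- the model is taught that it is not a dataset.
print(list(zip(tokens, bio_tags(tokens, ["National Education Longitudinal Study"]))))
```

Running the snippet shows the second mention, a perfectly real dataset, tagged entirely as `O`, which is exactly the failure mode that sank my NER submission on the private leaderboard.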