# How I trained this model
## Training Data
As my training data, I used [this dataset from Kaggle](https://www.kaggle.com/hugodarwood/epirecipes), consisting of more than 20,000 recipes from [Epicurious.com](https://www.epicurious.com). I hand-labeled food mentions in a random subset of 300 recipes (981 sentences) using the awesome labeling tool [Prodigy](https://prodi.gy).
### Exploring the data
The Kaggle dataset includes many fields besides the recipes themselves, such as Title, Date, Category, Calories, Fat, Protein, Sodium, and Ingredients. In future projects, I'll use these as labels to train classification or regression models.
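To get oriented, here's a quick exploration sketch in pandas. The file name `epi_r.csv` and the exact column names are assumptions based on the Kaggle download, so adjust them to match your copy:

```python
# Quick look at the Kaggle recipe metadata. The file name and column names
# are assumptions based on the Kaggle download; adjust to match your copy.
import pandas as pd

recipes = pd.read_csv("epi_r.csv")
print(recipes.shape)                  # roughly 20,000 rows
print(recipes.columns.tolist()[:10])  # title, rating, nutrition fields, ...
print(recipes[["calories", "protein", "fat", "sodium"]].describe())
```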
#### Common foods
What kinds of foods are most commonly included in the training data? To answer this question, I counted the most common food terms across the 300 labeled recipes (a sketch of the counting code follows the tables below). The most common one-word foods were:
| Term | # Mentions|
| ----------| ----------|
| salt | 301 |
| pepper | 202 |
| oil | 122 |
| butter | 122 |
| sugar | 114 |
| sauce | 89 |
| garlic | 78 |
| dough | 69 |
| onion | 66 |
| flour | 63 |
And the most common two-word food terms were:
| Term | # Mentions|
| ----------| ----------|
| lemon juice | 47 |
| olive oil | 29 |
| lime juice | 16 |
| flour mixture | 12 |
| sesame seeds | 11 |
| egg mixture | 10 |
| pan juices | 10 |
| baking powder | 10 |
| cream cheese | 10 |
| green onions | 10 |
You can see that the recipes cover both baking and non-baking dishes and span a wide variety of cooking styles and techniques. They even include cocktail recipes and a detailed set of instructions for setting up a beachside clambake!
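For what it's worth, here's a hedged sketch of how counts like those above can be computed. It assumes the annotations were exported from Prodigy as JSONL (e.g., via `prodigy db-out`), where each line has a `text` field and a list of `spans` with character offsets; the file name is hypothetical:

```python
# Count labeled food spans from a Prodigy-style JSONL export.
import json
from collections import Counter

counts = Counter()
with open("food_annotations.jsonl") as f:  # hypothetical export file name
    for line in f:
        example = json.loads(line)
        for span in example.get("spans", []):
            term = example["text"][span["start"]:span["end"]].lower()
            counts[term] += 1

# Report the most common one-word and two-word food terms separately.
for n_words in (1, 2):
    top = [(t, c) for t, c in counts.most_common() if len(t.split()) == n_words]
    print(f"Top {n_words}-word terms:", top[:10])
```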
### Difficult labeling decisions
You'd think labeling food would be easy, but I faced some hard decisions. Is water food? Sometimes water is an ingredient, but other times it's used almost as a tool, as in "plunge asparagus into boiling water." What about phrases like "egg mixture"? Should "mixture" be considered part of the food item? Or how about the word "slices" in the sentence "Arrange slices on a platter"? In a real project, these decisions would be guided by the business use case. But here I was just training a model for fun, so my choices were tough to make and a bit arbitrary.
## Model Architecture
I fine-tuned a few different BERT-style models (BERT, DistilBERT, and RoBERTa) and achieved the best performance with RoBERTa. You can see the training code [here](https://github.com/carolmanderson/food/blob/master/notebooks/modeling/Train_BERT.ipynb).
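As a rough illustration (not the exact code in the linked notebook), here's how a RoBERTa token-classification model can be set up with Hugging Face Transformers. The IOB2 label scheme and the `roberta-base` checkpoint are assumptions:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-FOOD", "I-FOOD"]  # IOB2 tags for a single FOOD entity type
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# RoBERTa's BPE tokenizer splits words into subword pieces, so word-level
# labels must be aligned to subword tokens before training.
words = ["Lightly", "salt", "the", "vegetables", "."]
encoding = tokenizer(words, is_split_into_words=True)
print(encoding.word_ids())  # maps each subword token back to its source word
```

The subword alignment step is the main wrinkle when fine-tuning BERT-style models for token labeling; once labels are aligned, training can proceed with the `Trainer` API or a custom loop.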
A few years ago, I trained a model with a completely different architecture on the same dataset. That model had a bidirectional LSTM layer followed by a softmax output layer to predict token labels. I used pretrained GloVe embeddings as the only feature. You can see the old training code [here](https://github.com/carolmanderson/food/blob/master/notebooks/modeling/Train_basic_LSTM_model.ipynb).
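For reference, a minimal Keras sketch of that older architecture might look like this; the vocabulary size, embedding dimension, and LSTM width are placeholders, and `glove_matrix` stands in for the real pretrained GloVe vectors:

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, embed_dim, num_labels = 20000, 100, 3  # placeholder sizes
glove_matrix = np.zeros((vocab_size, embed_dim))   # stand-in for real GloVe vectors

embedding = layers.Embedding(vocab_size, embed_dim, mask_zero=True, trainable=False)
model = models.Sequential([
    layers.Input(shape=(None,)),  # variable-length sequences of token IDs
    embedding,                    # frozen pretrained embeddings as the only feature
    layers.Bidirectional(layers.LSTM(100, return_sequences=True)),
    layers.Dense(num_labels, activation="softmax"),  # per-token label probabilities
])
embedding.set_weights([glove_matrix])  # load the pretrained vectors
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```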
## Model Performance
This model achieved 95% recall and 96% precision when evaluated at the entity level (i.e., the exact beginning and end of each `FOOD` entity must match the ground-truth labels in order to count as a true positive).
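To make the entity-level criterion concrete, here's a small sketch using the `seqeval` library (my choice for illustration; the original evaluation code may differ), which counts only exact span matches:

```python
from seqeval.metrics import precision_score, recall_score

# Ground truth: "lemon juice" (two tokens) and "salt" are FOOD entities.
y_true = [["B-FOOD", "I-FOOD", "O", "B-FOOD"]]
# The prediction recovers only "lemon": a partial span, so it isn't a true positive.
y_pred = [["B-FOOD", "O", "O", "B-FOOD"]]

print(precision_score(y_true, y_pred))  # 0.5: one of two predicted spans is exact
print(recall_score(y_true, y_pred))     # 0.5: one of two true spans was found
```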
For comparison, the LSTM model that I previously trained achieved both precision and recall of 92%.
One of the errors I commonly saw with the older LSTM model involved words with multiple meanings. The word "salt", for example, can be a food, but it can also be a verb, as in "Lightly salt the vegetables." The word "butter" is similarly ambiguous: you can use butter as an ingredient, but you can also "butter the pan." The old model seemed to treat all mentions of these words as food: