---
language: en
license: mit
pipeline_tag: token-classification
tags:
- token-classification
- NER
widget:
- text: >
    Here are the ingredients to use: Tomato Paste, Sesame Oil, Cheese Cultures, 
    Ground Corn, Vegetable Oil, Brown rice, sea salt, Tomatoes, Milk, Onions,
    Egg Yolks, Lime Juice Concentrate, Corn Starch, Condensed Milk, Spices,
    Artificial Flavor, red 5, roasted coffee.
- text: >
    Beef: 250 calories per 100g Chicken: 165 calories per 100g Salmon: 206
    calories per 100g Tofu: 76 calories per 100g Lentils: 116 calories per 100g
    Carrots: 41 calories per 100g Spinach: 23 calories per 100g Apples: 52
    calories per 100g Bananas: 89 calories per 100g Oranges: 47 calories per
    100g Rice: 130 calories per 100g cooked Pasta: 131 calories per 100g cooked
    Bread: 265 calories per 100g Olive oil: 884 calories per 100g Butter: 717
    calories per 100g
---
This is a BERT model designed for Named Entity Recognition (NER) in the domain of nutrition labeling. It detects and categorizes nutritional components in text, providing a structured view of the information typically found on nutrition labels and related materials.

This model was created as a benchmark and a learning exercise in training models on augmented data.


## Training Data Description

The training data for the `sgarbi/bert-fda-nutrition-ner` model was curated from publicly available U.S. Food and Drug Administration (FDA) datasets. It originates primarily from the FoodData Central website and includes comprehensive nutritional information and labeling for a wide range of food products.

### Data Source
- **Labeling Source**: U.S. Food and Drug Administration (FDA), FoodData Central. [FDA FoodData Central](https://fdc.nal.usda.gov/download-datasets.html). The dataset includes detailed nutritional data, such as ingredient lists, nutritional values, serving sizes, and other essential label information.
- **Yelp Restaurant Reviews**: The [Yelp Review Full dataset](https://huggingface.co/datasets/yelp_review_full) from Hugging Face, augmented with Mistral 7B for general tagging, was used to enrich the model's understanding of restaurant-related nutritional mentions.
- **Amazon Food Reviews**: Similarly, the [Amazon Food Reviews dataset](https://huggingface.co/datasets/jhan21/amazon-food-reviews-dataset) from Hugging Face, augmented with Mistral 7B, broadens the model's ability to recognize and classify nutritional information in diverse food product reviews correlated with the FDA data.

### Preprocessing and Augmentation Steps
- **Extraction**: Key textual data, including nutrition facts and ingredient lists, was extracted from the FDA dataset.
- **Normalization**: All text underwent normalization for consistency, including converting to lowercase and removing redundant formatting.
- **Entity Tagging**: Significant nutritional elements were manually tagged, creating a labeled dataset for training. This includes macronutrients, vitamins, minerals, and various specific dietary components.
- **Tokenization and Formatting**: The data was tokenized and formatted to meet the BERT model's input requirements.
- **Introducing Noise**: To enhance the model's ability to handle real-world, imperfect data, deliberate noise was introduced into the training set. This included:
    - **Sentence Swaps**: Random swapping of sentences within the text to promote the model's understanding of varied sentence structures.
    - **Introducing Misspellings**: Deliberately inserting common spelling errors to train the model to recognize and correctly process misspelled words frequently encountered in real-world scenarios such as inaccurate document scans.
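The two noise-injection steps above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual training code; the `augment` helper and its parameters are assumptions.

```python
import random

def augment(text: str, seed: int = 0) -> str:
    """Sketch of the noise-injection steps described above:
    random sentence swaps plus simple character-level misspellings.
    Helper name and heuristics are illustrative, not from the training code."""
    rng = random.Random(seed)

    # Sentence swap: shuffle sentence order to vary structure.
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    rng.shuffle(sentences)

    # Misspellings: drop one interior character from ~10% of words.
    words = '. '.join(sentences).split()
    for i in rng.sample(range(len(words)), k=max(1, len(words) // 10)):
        w = words[i]
        if len(w) > 3:
            j = rng.randrange(1, len(w) - 1)
            words[i] = w[:j] + w[j + 1:]
    return ' '.join(words)
```

Fixing the random seed keeps each augmented variant reproducible across training runs.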


## Considerations

- The model was trained only on publicly available data from food product labels. No private or sensitive data was used.
- Labeling was performed by Mistral 7B-Instruct, served via mistral.ai (https://docs.mistral.ai/). The model may have hallucinated while labeling, which could result in imprecise taxonomy classifications.
- The tool only extracts nutritional entities from text; it should not be used for nutrition or health recommendations. Qualified experts should provide any nutrition advice.
- The language and phrasing on certain types of food product labels may introduce biases into the model.
- This model was created for exploring the BERT architecture and NER tasks.
  
## Label Map

```python
label_map = {
    0: 'O',
    1: 'I-VITAMINS',
    2: 'I-STIMULANTS',
    3: 'I-PROXIMATES',
    4: 'I-PROTEIN',
    5: 'I-PROBIOTICS',
    6: 'I-MINERALS',
    7: 'I-LIPIDS',
    8: 'I-FLAVORING',
    9: 'I-ENZYMES',
    10: 'I-EMULSIFIERS',
    11: 'I-DIETARYFIBER',
    12: 'I-COLORANTS',
    13: 'I-CARBOHYDRATES',
    14: 'I-ANTIOXIDANTS',
    15: 'I-ALCOHOLS',
    16: 'I-ADDITIVES',
    17: 'I-ACIDS',
    18: 'B-VITAMINS',
    19: 'B-STIMULANTS',
    20: 'B-PROXIMATES',
    21: 'B-PROTEIN',
    22: 'B-PROBIOTICS',
    23: 'B-MINERALS',
    24: 'B-LIPIDS',
    25: 'B-FLAVORING',
    26: 'B-ENZYMES',
    27: 'B-EMULSIFIERS',
    28: 'B-DIETARYFIBER',
    29: 'B-COLORANTS',
    30: 'B-CARBOHYDRATES',
    31: 'B-ANTIOXIDANTS',
    32: 'B-ALCOHOLS',
    33: 'B-ADDITIVES',
    34: 'B-ACIDS'
}
```
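When loading the model with the Hugging Face `transformers` library, this map corresponds to the config's `id2label`, and its inverse gives `label2id`. A minimal sketch (the dictionary is truncated to three entries for brevity; the full 35-entry map is listed above):

```python
# id -> BIO tag map, truncated here; see the full table above.
label_map = {0: 'O', 1: 'I-VITAMINS', 34: 'B-ACIDS'}

id2label = {i: label for i, label in label_map.items()}
label2id = {label: i for i, label in label_map.items()}
```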

Here is an example model output for the provided text:

```python

INPUT:
'Here are the ingredients to use: Tomato Paste, Sesame Oil, Cheese Cultures,  Ground Corn, Vegetable Oil, Brown rice, sea salt, Tomatoes, Milk, Onions, Egg Yolks, Lime Juice Concentrate, Corn Starch, Condensed Milk, Spices, Artificial Flavor, red 5, roasted coffee'

OUTPUT:
['CLS', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'O', 'B-PROBIOTICS', 'I-PROBIOTICS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-MINERALS', 'I-MINERALS', 'O', 'B-CARBOHYDRATES', 'O', 'B-PROXIMATES', 'O', 'B-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'I-LIPIDS', 'I-LIPIDS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-PROXIMATES', 'I-PROXIMATES', 'O', 'B-FLAVORING', 'O', 'B-FLAVORING', 'I-FLAVORING', 'O', 'B-COLORANTS', 'I-COLORANTS', 'O', 'B-STIMULANTS', 'I-STIMULANTS', 'O', 'I-STIMULANTS']
```
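The flat tag sequence above can be grouped back into entity spans by pairing each token with its label and merging `B-`/`I-` runs. A minimal sketch (the `group_entities` helper name is illustrative, not part of this model's API):

```python
def group_entities(tokens, labels):
    """Merge a BIO tag sequence into (entity_type, text) spans.
    Special tokens (e.g. 'CLS') and 'O' tags are skipped."""
    spans, current_type, current_toks = [], None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith('B-'):
            # A new entity begins; flush any span in progress.
            if current_toks:
                spans.append((current_type, ' '.join(current_toks)))
            current_type, current_toks = lab[2:], [tok]
        elif lab.startswith('I-') and current_toks and lab[2:] == current_type:
            # Continuation of the current entity.
            current_toks.append(tok)
        else:
            # 'O', a special token, or an orphan I- tag ends the span.
            if current_toks:
                spans.append((current_type, ' '.join(current_toks)))
            current_type, current_toks = None, []
    if current_toks:
        spans.append((current_type, ' '.join(current_toks)))
    return spans
```

Applied to the first few tokens of the output above, this would yield spans such as `('CARBOHYDRATES', 'tomato paste')` and `('LIPIDS', 'sesame oil')`.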

## GitHub

https://github.com/ESgarbi/bert-fda-nutrition-ner