sgarbi/bert-fda-nutrition-ner

This is a BERT model built for Named Entity Recognition (NER) on nutrition labeling text. It detects and categorizes nutritional components in text, giving a structured view of the information typically found on nutrition labels and related nutritional materials.

This model was created as a benchmark and as a learning tool for training models on augmented data.

Training Data Description

The training data for the sgarbi/bert-fda-nutrition-ner model was curated from publicly available datasets of the U.S. Food and Drug Administration (FDA). It originates primarily from the FoodData Central website and covers nutritional information and labeling for a wide range of food products.

Data Source

Labeling Source: U.S. Food and Drug Administration (FDA), FoodData Central. The dataset includes detailed nutritional data, such as ingredient lists, nutritional values, serving sizes, and other essential label information.
Yelp Restaurant Reviews: The Yelp Review Full dataset from Hugging Face, augmented with Mistral 7B for general tagging, was used to enrich the model's understanding of restaurant-related nutritional mentions.
Amazon Food Reviews: The Amazon Food Reviews dataset from Hugging Face, likewise augmented with Mistral 7B, was used to help the model recognize and classify nutritional information in food product reviews correlated with FDA data.

Preprocessing and Augmentation Steps

Extraction: Key textual data, encompassing nutritional facts and ingredient lists, were extracted from the FDA dataset.
Normalization: All text underwent normalization for consistency, including converting to lowercase and removing redundant formatting.
Entity Tagging: Significant nutritional elements were manually tagged, creating a labeled dataset for training. This includes macronutrients, vitamins, minerals, and various specific dietary components.
Tokenization and Formatting: The data was tokenized and formatted to meet the BERT model's input requirements.
Introducing Noise: To enhance the model's ability to handle real-world, imperfect data, deliberate noise was introduced into the training set. This included:
- Sentence Swaps: Random swapping of sentences within the text to promote the model's understanding of varied sentence structures.
- Introducing Misspellings: Deliberately inserting common spelling errors to train the model to recognize and correctly process misspelled words frequently encountered in real-world scenarios such as inaccurate document scans.
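
The two noise steps above can be sketched as follows. This is a minimal illustration, not the author's actual pipeline: the function names (`swap_sentences`, `misspell`, `add_noise`), the `misspell_prob` parameter, and the representation of the data as a list of `(tokens, labels)` sentence pairs are all assumptions. Misspellings are approximated here by transposing two adjacent inner characters, which keeps one label per token.

```python
import random


def swap_sentences(sentences, rng):
    """Swap two randomly chosen sentences; each sentence is a (tokens, labels) pair."""
    sentences = list(sentences)
    if len(sentences) < 2:
        return sentences
    i, j = rng.sample(range(len(sentences)), 2)
    sentences[i], sentences[j] = sentences[j], sentences[i]
    return sentences


def misspell(word, rng):
    """Transpose two adjacent inner characters to imitate a typo or scan error."""
    if len(word) < 4:
        return word  # too short to perturb without touching the first or last letter
    k = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[k], chars[k + 1] = chars[k + 1], chars[k]
    return "".join(chars)


def add_noise(sentences, misspell_prob=0.1, seed=0):
    """Apply one sentence swap, then misspell each token with probability misspell_prob."""
    rng = random.Random(seed)
    noisy = []
    for tokens, labels in swap_sentences(sentences, rng):
        tokens = [misspell(t, rng) if rng.random() < misspell_prob else t for t in tokens]
        noisy.append((tokens, labels))  # labels stay aligned: token count is unchanged
    return noisy
```

Because both operations leave the number of tokens per sentence unchanged, the entity labels remain aligned with the noisy tokens.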

Considerations

The model was trained only on publicly available data from food product labels. No private or sensitive data was used.
Labeling was performed by Mistral 7B-Instruct, served by mistral.ai (https://docs.mistral.ai/). The labeling model likely hallucinated on some examples, which could result in imprecise category assignments.
The tool only extracts nutritional entities from text; it should not be used for nutrition or health recommendations. Qualified experts should provide any nutrition advice.
The language and phrasing on certain types of food product labels may introduce biases to the model.
This model was created for exploring the BERT architecture and NER tasks.

Label Map

label_map = {
    0: 'O',
    1: 'I-VITAMINS',
    2: 'I-STIMULANTS',
    3: 'I-PROXIMATES',
    4: 'I-PROTEIN',
    5: 'I-PROBIOTICS',
    6: 'I-MINERALS',
    7: 'I-LIPIDS',
    8: 'I-FLAVORING',
    9: 'I-ENZYMES',
    10: 'I-EMULSIFIERS',
    11: 'I-DIETARYFIBER',
    12: 'I-COLORANTS',
    13: 'I-CARBOHYDRATES',
    14: 'I-ANTIOXIDANTS',
    15: 'I-ALCOHOLS',
    16: 'I-ADDITIVES',
    17: 'I-ACIDS',
    18: 'B-VITAMINS',
    19: 'B-STIMULANTS',
    20: 'B-PROXIMATES',
    21: 'B-PROTEIN',
    22: 'B-PROBIOTICS',
    23: 'B-MINERALS',
    24: 'B-LIPIDS',
    25: 'B-FLAVORING',
    26: 'B-ENZYMES',
    27: 'B-EMULSIFIERS',
    28: 'B-DIETARYFIBER',
    29: 'B-COLORANTS',
    30: 'B-CARBOHYDRATES',
    31: 'B-ANTIOXIDANTS',
    32: 'B-ALCOHOLS',
    33: 'B-ADDITIVES',
    34: 'B-ACIDS'
}
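
The mapping above follows a regular pattern: id 0 is 'O', ids 1 to 17 are the I- tags, and ids 18 to 34 are the matching B- tags in the same category order. As a sketch, the same dictionary can be rebuilt from the category list and inverted for lookups by label. The names `CATEGORIES`, `label2id`, and `ids_to_labels` are illustrative and do not come from the model repository.

```python
# Category order taken from the label map above (ids 1..17 for I-, 18..34 for B-).
CATEGORIES = [
    'VITAMINS', 'STIMULANTS', 'PROXIMATES', 'PROTEIN', 'PROBIOTICS',
    'MINERALS', 'LIPIDS', 'FLAVORING', 'ENZYMES', 'EMULSIFIERS',
    'DIETARYFIBER', 'COLORANTS', 'CARBOHYDRATES', 'ANTIOXIDANTS',
    'ALCOHOLS', 'ADDITIVES', 'ACIDS',
]

label_map = {0: 'O'}
for offset, name in enumerate(CATEGORIES, start=1):
    label_map[offset] = f'I-{name}'
    label_map[offset + len(CATEGORIES)] = f'B-{name}'

# Inverse mapping, useful when encoding training labels.
label2id = {label: idx for idx, label in label_map.items()}


def ids_to_labels(ids):
    """Convert a sequence of predicted class ids into tag strings."""
    return [label_map[i] for i in ids]
```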

Example model output for a sample input:

Input:
'Here are the ingredients to use: Tomato Paste, Sesame Oil, Cheese Cultures,  Ground Corn, Vegetable Oil, Brown rice, sea salt, Tomatoes, Milk, Onions, Egg Yolks, Lime Juice Concentrate, Corn Starch, Condensed Milk, Spices, Artificial Flavor, red 5, roasted coffee'

Output:
['CLS', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'O', 'B-PROBIOTICS', 'I-PROBIOTICS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-MINERALS', 'I-MINERALS', 'O', 'B-CARBOHYDRATES', 'O', 'B-PROXIMATES', 'O', 'B-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS', 'I-LIPIDS', 'I-LIPIDS', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-PROXIMATES', 'I-PROXIMATES', 'O', 'B-FLAVORING', 'O', 'B-FLAVORING', 'I-FLAVORING', 'O', 'B-COLORANTS', 'I-COLORANTS', 'O', 'B-STIMULANTS', 'I-STIMULANTS', 'O', 'I-STIMULANTS']
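
A tag sequence like the one above is usually easier to consume once the B-/I- tags are grouped into entity spans. The helper below is a minimal sketch of that grouping. The name `bio_to_spans` is hypothetical, and it assumes one tag per token, whereas the model's raw output is aligned to BERT sub-word tokens and may need to be merged first. An I- tag with no preceding B- tag of the same category is dropped, which is one possible policy rather than the model's documented behavior.

```python
def bio_to_spans(tokens, tags):
    """Group aligned tokens and BIO tags into (category, text) spans."""
    spans = []
    current = None  # (category, [tokens]) for the span being built
    for token, tag in zip(tokens, tags):
        prefix, _, category = tag.partition('-')
        if prefix == 'B':
            # A B- tag always starts a new span.
            if current:
                spans.append(current)
            current = (category, [token])
        elif prefix == 'I' and current and current[0] == category:
            # An I- tag continues the open span of the same category.
            current[1].append(token)
        else:
            # 'O', special markers such as 'CLS', or a stray I- tag close the span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(category, ' '.join(words)) for category, words in spans]


tokens = ['tomato', 'paste', ',', 'sesame', 'oil']
tags = ['B-CARBOHYDRATES', 'I-CARBOHYDRATES', 'O', 'B-LIPIDS', 'I-LIPIDS']
print(bio_to_spans(tokens, tags))
# [('CARBOHYDRATES', 'tomato paste'), ('LIPIDS', 'sesame oil')]
```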

Github

https://github.com/ESgarbi/bert-fda-nutrition-ner