---
title: Brightly Ai
emoji: 👁
colorFrom: blue
colorTo: pink
sdk: gradio
python_version: 3.9.6
sdk_version: 4.36.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Brightly AI

AI algorithms that classify words provided by food rescue organizations into a predefined dictionary given by the USDA.

## Overview

The Brightly algorithm classifies items for Food Rescue Organizations (FROs) by leveraging Large Language Models (LLMs). At a high level, the algorithm ingests CSV files provided by FROs, iterates through each row, identifies each item as food or non-food, performs syntax analysis to break multi-item descriptions down into individual items, and maps the resulting data to the USDA dictionary database.
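
A rough sketch of that flow (the column name and the two callables below are illustrative assumptions, not the repo's actual interfaces):

```
import csv

def process_file(path, classify_item, map_to_usda):
    """Iterate an FRO-provided CSV; classify each row and map food items.
    classify_item and map_to_usda stand in for the real pipeline stages."""
    mapped = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            item = row["description"]          # column name is an assumption
            if classify_item(item) == "food":
                mapped.append((item, map_to_usda(item)))
    return mapped
```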

On a technical level, Brightly AI converts input items into numerical representations using text embeddings. It uses cosine similarity to compare these embeddings against the predefined USDA dictionary, which yields semantic rather than purely lexical matches. This lets the algorithm map an item like "Xoconostle" to "Prickly pears, raw" with high accuracy, where string-distance algorithms like Levenshtein distance would fail.
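
The comparison itself is standard embedding similarity. A minimal sketch, assuming a sentence-transformers model (the checkpoint named here is an illustration, not necessarily the one the repo uses):

```
from sentence_transformers import SentenceTransformer, util

# Model choice is an assumption for illustration; the repo's models may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

dictionary = ["Prickly pears, raw", "Bananas, raw", "Peanut butter, chunky"]
dict_embeddings = model.encode(dictionary, convert_to_tensor=True)

query = model.encode("Xoconostle", convert_to_tensor=True)
scores = util.cos_sim(query, dict_embeddings)[0]  # cosine similarity per entry
best = int(scores.argmax())
print(dictionary[best], float(scores[best]))      # ideally "Prickly pears, raw"
```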

Additionally, it handles varied word forms and multi-term descriptions to maintain high accuracy. For example, only "crunchy peanut butter" is extracted from "Peanut butter, crunchy, 16 oz.", while "Banana, hot chocolate & chips" is broken down into "Banana", "Hot Chocolate", and "Chips", with each part categorized accordingly.

## Running

```
# Start the Celery worker
celery -A tasks worker --loglevel=info

# Start the app
python run.py

# Clear the task queue
celery -A tasks purge
```

```
# Build and run the Docker image
docker build -t brightly-ai .
docker run -p 7860:7860 brightly-ai
```

## TODO

- [ ] Add instructions re: each file in repo

## Files and their purpose

Here's a table of each file and a brief description of what it does.

| Filename                      | Description                                                                                                                                                                          |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| run.py                        | The main entry point. Pass it an array of words; it processes each word, writes the results to a CSV file in the results folder, and stores any new mappings in the SQLite database. |
| algo_fast.py                  | Uses a fast version of our LLM to encode word embeddings and cosine similarity to determine whether they are similar.                                                                 |
| algo_slow.py                  | A similar version of the algorithm; however, it uses a larger set of embeddings from the dictionary.                                                                                  |
| multi_food_item_detector.py   | Determines whether a given string of text contains multiple food items or a single food item.                                                                                        |
| update_pickle.py              | Updates the dictionary pickle file with any new words that have been added to the dictionary/additions.csv file.                                                                      |
| add_mappings_to_embeddings.py | Takes all the reviewed mappings in the mappings database and adds them to the embeddings file.                                                                                        |

### How It Works

1. Initialization:

- Database Connection: Connects to a database to store and retrieve word mappings.
- Similarity Models: Initializes models to quickly and accurately find similar words.
- Pluralizer: Handles singular and plural forms of words.

2. Processing Input Words:

- Reading Input: The script reads input words, either from a file or a predefined list.
- Handling Multiple Items: If an input contains multiple items (separated by commas or slashes), it splits them and processes each item separately.

3. Mapping Words:

- Fast Similarity Search: Quickly finds the most similar word from the dictionary.
- Slow Similarity Search: If the fast search is inconclusive, it performs a more thorough search.
- Reverse Mapping: Attempts to find similar words by reversing the input word order.
- GPT-3 Query: If all else fails, queries GPT-3 for a recommended mapping (see the fallback sketch after this list).

4. Classifying as Food or Non-Food:

- Classification: Determines if the word is a food item.
- Confidence Score: Assigns a score based on the confidence of the classification.

5. Storing Results:

- Database Storage: Stores the results in the database for future reference.
- CSV Export: Saves the final results to a CSV file for easy access.
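
Putting the mapping steps together, the search behaves like a fallback chain. A sketch under assumed interfaces (the callables and thresholds are illustrative, not the repo's actual signatures or tuned values):

```
def map_word(word, fast_search, slow_search, query_gpt,
             fast_threshold=0.85, slow_threshold=0.75):
    """Try progressively more expensive strategies until one is confident.
    Each search callable returns a (match, score) pair."""
    match, score = fast_search(word)
    if score >= fast_threshold:
        return match
    match, score = slow_search(word)
    if score >= slow_threshold:
        return match
    # Retry with the word order reversed ("butter peanut" for "peanut butter").
    match, score = fast_search(" ".join(reversed(word.split())))
    if score >= fast_threshold:
        return match
    return query_gpt(word)  # last resort: ask GPT-3 for a recommendation
```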

## How it works (for auditor)

### Single Item Classification

1. We clean the input text by removing any special characters and converting it to lowercase (e.g. "Bananas (organic)" becomes "bananas organic").
2. We run the input text through a zero-shot classifier to identify whether it's a food item. We get back a label, "food" or "non-food", and a score. If the score is above a set threshold for non-food, we classify the input as "Non-Food Item" (see the sketch after this list).
3. If "USDA" appears in the input text, we classify it as "Government Donation (Not Counted)".
4. We check whether the input text is a specific food item, a broad category, or a heterogeneous mixture. For example, "carrot" is a specific food item, "vegetable" is a broad category, and "Mixed groceries" is a heterogeneous mixture.
5. If it's a specific food item, we run the input text through our fast similarity search algorithm. This algorithm uses a pre-trained LLM to encode both the input text and the dictionary, then uses cosine similarity to find the most similar word in the dictionary. If the similarity is above a set threshold, we classify the input text as that dictionary word.
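
A minimal sketch of steps 1 and 2, assuming a standard zero-shot checkpoint and an illustrative threshold (neither is necessarily what the repo uses):

```
import re
from transformers import pipeline

def clean(text):
    """Strip special characters and lowercase: "Bananas (organic)" -> "bananas organic"."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

# The checkpoint and threshold are assumptions for illustration.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
NON_FOOD_THRESHOLD = 0.8

def is_non_food(text):
    result = classifier(clean(text), candidate_labels=["food", "non-food"])
    return result["labels"][0] == "non-food" and result["scores"][0] >= NON_FOOD_THRESHOLD
```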

### Multi-Item Classification

1. Identify delimiters in the input word (commas, slashes, "&", "and").
2. Split the input word into parts using the identified delimiter ("Green beans, carrots, produce" becomes "green beans", "carrots", "produce").
3. Process each part individually, running it through the "Single Item Classification" flow above.
4. If any part is a non-food item, classify the entire input as "Non-Food Item" and return this classification.
5. If any part is a heterogeneous mixture, classify the entire input as "Heterogeneous Mixture" and return this classification.
6. If neither a heterogeneous mixture nor a non-food item is found, identify and return the mapping with the lowest DMC value (see the sketch below).
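
A sketch of that split-and-aggregate logic; the per-part classifier is left as a callable, and the delimiter handling is simplified relative to multi_food_item_detector.py:

```
import re

def classify_multi(text, classify_single):
    """Split on common delimiters and aggregate results per the rules above.
    classify_single returns a (label, dmc) pair; DMC handling is illustrative."""
    parts = [p.strip() for p in re.split(r",|/|&|\band\b", text) if p.strip()]
    results = [classify_single(p) for p in parts]
    labels = [label for label, _ in results]
    if "Non-Food Item" in labels:
        return "Non-Food Item"
    if "Heterogeneous Mixture" in labels:
        return "Heterogeneous Mixture"
    return min(results, key=lambda r: r[1])[0]  # mapping with the lowest DMC
```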

### Double Counting Prevention

For each CSV file, we identify donations from the current organization to other organizations in the same year (donations_from). This is done by scanning the other organizations' CSV files for donations received from the current organization in the same year and summing the total_emissions_reduction values.

We also identify donations received by the current organization from other organizations whose CSV files we have already processed (donations_to). This is done by scanning the current organization's CSV file for donations received from other organizations in the same year and summing the total_emissions_reduction values.

We then sum the emissions reduction values in donations_from and donations_to to get the total overlapping emissions reduction. We take half of this sum and subtract it from the total_emissions_reduction value in the current organization's CSV file to prevent double counting.

We repeat this process for each organization’s CSV file.
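
Numerically, the adjustment is a half-credit split over the overlapping donations. A sketch of the arithmetic with made-up values:

```
def adjusted_emissions(own_total, donations_from, donations_to):
    """Subtract half of the overlapping donation emissions so each
    transfer ends up credited half to each of the two organizations."""
    overlap = sum(donations_from) + sum(donations_to)
    return own_total - overlap / 2

# Example with made-up values (e.g. tons of CO2e):
print(adjusted_emissions(1000.0, [120.0, 80.0], [50.0]))  # 1000 - 250/2 = 875.0
```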