### Load packages

In [1]:
!pip install datasets
!pip install datatrove
import datasets
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from datatrove.pipeline.readers import ParquetReader



## Methodology

To measure bias in the dataset, we use a simple [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)-based approach. The idea is that the specificity of a term (in our case, how `biased` it is) can be quantified as an inverse function of the number of documents in which it occurs.
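
For reference, the weighting can be written out explicitly. The form below is the one scikit-learn's `TfidfVectorizer` applies with its default settings (smoothed IDF, L2-normalized document vectors), which is how the vectorizer is configured in this notebook. Here $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ the raw count of $t$ in document $d$:

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) =
  \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}
       {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\bigr)^{2}}}
```

A term that occurs in almost every document gets an IDF close to 1, while a term that occurs in few documents gets a larger weight.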

Given a dataset and a set of terms for a subpopulation of interest (here, gender):
1. Compute inverse document frequencies on the full dataset.
2. Compute the average TF-IDF vector over the documents that mention each subpopulation term.
3. Sort the terms by their variance across subpopulations to surface words that are much more likely to appear for one subpopulation than for the others.
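
The three steps above can be sketched on a toy corpus using only the standard library. This is an illustrative sketch, not the notebook's implementation: the four sentences are made up, the helper names (`tfidf`, `average_tfidf`, `spread`) are hypothetical, document vectors are not length-normalized, and step 3 uses the deviation from the mean across groups.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration; not taken from FineWeb).
docs = [
    "the man called the police",
    "the woman read a book",
    "the man read the news",
    "the woman called her family",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Step 1: inverse document frequency on the full corpus (smoothed form).
doc_freq = Counter(term for toks in tokenized for term in set(toks))
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(tokens):
    """Raw term counts weighted by IDF (no length normalization)."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items()}


def average_tfidf(group_term):
    """Step 2: mean TF-IDF vector over documents containing group_term."""
    group = [tfidf(toks) for toks in tokenized if group_term in toks]
    return {t: sum(v.get(t, 0.0) for v in group) / len(group) for t in idf}


avg = {g: average_tfidf(g) for g in ("man", "woman")}


def spread(term):
    """Step 3: root sum of squared deviations from the mean across groups."""
    values = [avg[g][term] for g in avg]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values))


ranked = sorted(idf, key=spread, reverse=True)
print(ranked[:5])
```

In this toy corpus, `read` and `called` appear equally often in both groups, so their spread is zero, while `police` and `family` each appear in only one group and rank higher.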




### Load Fineweb


In [2]:
# NOTE: this is only a sample so that the notebook runs in Colab. Remove limit=10000 if you have more RAM.
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb/sample/10BT", progress=True, limit=10000)
corpus = map(lambda doc: doc.text, data_reader())
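
Note that `corpus` is a lazy `map` object, so it can be iterated only once; this is why later cells call `data_reader()` again rather than reusing `corpus`. A minimal sketch of that behavior, using a hypothetical in-memory list of texts in place of the reader:

```python
# Stand-in for the document texts; the notebook streams them from a reader.
texts = ["first document", "second document", "third document"]

lazy_corpus = map(str.upper, texts)

first_pass = list(lazy_corpus)   # consumes the iterator
second_pass = list(lazy_corpus)  # nothing left to yield

print(len(first_pass), len(second_pass))  # 3 0

# Materialize once if the texts are needed more than once (costs memory).
corpus_list = [t.upper() for t in texts]
```

Materializing the corpus as a list trades memory for convenience, which may not be acceptable once the `limit` is removed.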

### Compute frequencies

In [3]:
# Step 1: get document frequencies for the dataset
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
full_tfidf = vectorizer.fit_transform(corpus)
tfidf_feature_names = np.array(vectorizer.get_feature_names_out())


  0%|          | 0/10000 [00:00<?, ?it/s]
2024-05-31 01:37:47.091 | INFO | datatrove.pipeline.readers.base:read_files_shard:193 - Reading input file 000_00000.parquet
100%|██████████| 10000/10000 [00:05<00:00, 1726.34it/s]


### Bias analysis: Gender tf-idf

In [4]:
# Step 2: get average TF-IDF vectors **for each gender**
woman_docs = map(lambda doc: doc.text, filter(lambda doc: "woman" in doc.text.split(), data_reader()))
man_docs = map(lambda doc: doc.text, filter(lambda doc: "man" in doc.text.split(), data_reader()))
tfidf_by_gender = {}
tfidf_by_gender["man"] = np.asarray(vectorizer.transform(man_docs).mean(axis=0))[0]
tfidf_by_gender["woman"] = np.asarray(vectorizer.transform(woman_docs).mean(axis=0))[0]

  0%|          | 0/10000 [00:00<?, ?it/s]
2024-05-31 01:37:53.349 | INFO | datatrove.pipeline.readers.base:read_files_shard:193 - Reading input file 000_00000.parquet
100%|██████████| 10000/10000 [00:02<00:00, 3546.68it/s]
  0%|          | 0/10000 [00:00<?, ?it/s]
2024-05-31 01:37:56.196 | INFO | datatrove.pipeline.readers.base:read_files_shard:193 - Reading input file 000_00000.parquet
100%|██████████| 10000/10000 [00:01<00:00, 5873.86it/s]
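
The filter in Step 2 checks `"woman" in doc.text.split()`, which is case-sensitive and does not match tokens with attached punctuation such as `woman,`. A possible alternative, sketched below with a hypothetical helper `mentions`, uses a word-boundary regular expression. This is not what the notebook does, and switching to it would change which documents are selected and therefore the results.

```python
import re


def mentions(term, text):
    """True if `term` occurs as a whole word, ignoring case and punctuation."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Made-up sample texts for illustration.
samples = ["A woman, aged 30.", "Woman wins race", "the womanly ideal", "a man's coat"]

split_hits = ["woman" in s.split() for s in samples]
regex_hits = [mentions("woman", s) for s in samples]

print(split_hits)  # [False, False, False, False]
print(regex_hits)  # [True, True, False, False]
```

The word boundaries also keep `man` from matching inside `woman`, so the two groups stay distinct.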


In [6]:
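# Step 3: for each term, compute a spread score across genders.
# Note: the column *sum* is subtracted here, not the column mean, so with two
# genders the score equals sqrt(man**2 + woman**2), which favors frequent terms.
# Subtracting all_tfidf.mean(axis=0, keepdims=True) instead would give the
# deviation from the mean; the tables below were produced with the sum.
all_tfidf = np.array(list(tfidf_by_gender.values()))
tf_idf_var = all_tfidf - all_tfidf.sum(axis=0, keepdims=True)
tf_idf_var = np.power((tf_idf_var * tf_idf_var).sum(axis=0), 0.5)
sort_by_variance = tf_idf_var.argsort()[::-1]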
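
Because the cell above subtracts the column sum, with two groups each term's score works out to the L2 norm of its two averages. The sketch below compares that score with a mean-centered one on made-up numbers, in plain Python so the arithmetic is visible. The values and the helper names `norm_score` and `centered_score` are hypothetical.

```python
import math

# Hypothetical per-gender average TF-IDF values for three terms
# (rows: man, woman). The numbers are illustrative only.
terms = ["said", "police", "woman"]
averages = [
    [0.03, 0.02, 0.01],  # man
    [0.03, 0.01, 0.06],  # woman
]


def column(j):
    """Values of term j across all groups."""
    return [row[j] for row in averages]


def norm_score(j):
    """Score obtained when the column sum is subtracted from each value."""
    total = sum(column(j))
    return math.sqrt(sum((v - total) ** 2 for v in column(j)))


def centered_score(j):
    """Root sum of squared deviations from the column mean."""
    mean = sum(column(j)) / len(averages)
    return math.sqrt(sum((v - mean) ** 2 for v in column(j)))


for j, term in enumerate(terms):
    print(f"{term:8s} norm={norm_score(j):.3f} centered={centered_score(j):.3f}")
```

With these numbers, `said` scores higher than `police` under the norm even though it is identical across groups, while the mean-centered score gives `said` a value of zero.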

In [8]:
# Create the data structure for the visualization,
# showing the highest variance words for each gender,
# and how they deviate from the mean
pre_pandas_lines = [
    {
        "word": tfidf_feature_names[w],
        "man": all_tfidf[0, w],
        "woman": all_tfidf[1, w],
        "man+": all_tfidf[0, w] - all_tfidf[:, w].mean(),
        "woman+": all_tfidf[1, w] - all_tfidf[:, w].mean(),
        "variance": tf_idf_var[w],
        "total": all_tfidf[:, w].sum(),
    }
    for w in sort_by_variance[:50]
]

## Results

In [9]:
# Display the top terms as a color-graded table
df = pd.DataFrame.from_dict(pre_pandas_lines)
df.style.background_gradient(
    axis=None,
    vmin=0,
    vmax=0.2,
    cmap="YlGnBu"
).format(precision=2)

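| | word | man | woman | man+ | woman+ | variance | total |
|---|---|---|---|---|---|---|---|
| 0 | man | 0.05 | 0.02 | 0.02 | -0.02 | 0.06 | 0.07 |
| 1 | woman | 0.01 | 0.06 | -0.02 | 0.02 | 0.06 | 0.07 |
| 2 | said | 0.03 | 0.03 | -0.0 | 0.0 | 0.04 | 0.06 |
| 3 | like | 0.02 | 0.02 | -0.0 | 0.0 | 0.03 | 0.05 |
| 4 | women | 0.01 | 0.03 | -0.01 | 0.01 | 0.03 | 0.04 |
| 5 | just | 0.02 | 0.02 | -0.0 | 0.0 | 0.03 | 0.05 |
| 6 | time | 0.02 | 0.02 | -0.0 | 0.0 | 0.03 | 0.04 |
| 7 | people | 0.02 | 0.02 | -0.0 | 0.0 | 0.03 | 0.04 |
| 8 | life | 0.02 | 0.02 | -0.0 | 0.0 | 0.03 | 0.04 |
| 9 | love | 0.01 | 0.02 | -0.0 | 0.0 | 0.02 | 0.03 |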


### Sorting by bias

To better surface biases, we can sort the table by how far each word's average TF-IDF for one gender term lies above the mean across genders (the `man+` and `woman+` columns).

#### Bias towards `man`

Here, occupation-related words such as `police` are more strongly associated with `man` (average TF-IDF of 0.02 versus 0.01 for `woman`), although most differences in this sample are small.

In [10]:
df.sort_values('man+', ascending=False).style.background_gradient(
    axis=None,
    vmin=0,
    vmax=0.2,
    cmap="YlGnBu"
).format(precision=2)

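| | word | man | woman | man+ | woman+ | variance | total |
|---|---|---|---|---|---|---|---|
| 0 | man | 0.05 | 0.02 | 0.02 | -0.02 | 0.06 | 0.07 |
| 11 | men | 0.02 | 0.01 | 0.0 | -0.0 | 0.02 | 0.03 |
| 14 | police | 0.02 | 0.01 | 0.0 | -0.0 | 0.02 | 0.03 |
| 34 | home | 0.01 | 0.01 | 0.0 | -0.0 | 0.02 | 0.02 |
| 18 | new | 0.01 | 0.01 | 0.0 | -0.0 | 0.02 | 0.03 |
| 46 | state | 0.01 | 0.01 | 0.0 | -0.0 | 0.01 | 0.02 |
| 47 | told | 0.01 | 0.01 | 0.0 | -0.0 | 0.01 | 0.02 |
| 26 | old | 0.01 | 0.01 | 0.0 | -0.0 | 0.02 | 0.03 |
| 43 | does | 0.01 | 0.01 | 0.0 | -0.0 | 0.01 | 0.02 |
| 32 | say | 0.01 | 0.01 | 0.0 | -0.0 | 0.02 | 0.02 |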


#### Bias towards `woman`

Here, words like `love`, `life`, and `family` skew towards `woman`, although apart from `woman` and `women` the differences round to 0.00 at two decimals.

In [11]:
df.sort_values('woman+', ascending=False).style.background_gradient(
    axis=None,
    vmin=0,
    vmax=0.2,
    cmap="YlGnBu"
).format(precision=2)

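| | word | man | woman | man+ | woman+ | variance | total |
|---|---|---|---|---|---|---|---|
| 1 | woman | 0.01 | 0.06 | -0.02 | 0.02 | 0.06 | 0.07 |
| 4 | women | 0.01 | 0.03 | -0.01 | 0.01 | 0.03 | 0.04 |
| 9 | love | 0.01 | 0.02 | -0.0 | 0.0 | 0.02 | 0.03 |
| 8 | life | 0.02 | 0.02 | -0.0 | 0.0 | 0.03 | 0.04 |
| 36 | family | 0.01 | 0.01 | -0.0 | 0.0 | 0.02 | 0.02 |
| 40 | sex | 0.01 | 0.01 | -0.0 | 0.0 | 0.01 | 0.02 |
| 21 | make | 0.01 | 0.02 | -0.0 | 0.0 | 0.02 | 0.03 |
| 7 | people | 0.02 | 0.02 | -0.0 | 0.0 | 0.03 | 0.04 |
| 5 | just | 0.02 | 0.02 | -0.0 | 0.0 | 0.03 | 0.05 |
| 29 | book | 0.01 | 0.01 | -0.0 | 0.0 | 0.02 | 0.02 |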
