{ "cells": [ { "cell_type": "markdown", "source": [ "### Load packages" ], "metadata": { "id": "utSDkGUL101i" }, "id": "utSDkGUL101i" }, { "cell_type": "code", "execution_count": 1, "id": "34299990-bd58-4fe9-99fe-15d4b6796106", "metadata": { "id": "34299990-bd58-4fe9-99fe-15d4b6796106", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "08b365f9-4076-4762-d913-a66f4721ee7e" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [] } ], "source": [ "!pip install datasets\n", "!pip install datatrove\n", "import datasets\n", "import json\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from datatrove.pipeline.readers import ParquetReader" ] }, { "cell_type": "markdown", "id": "703c7781-0a33-41dc-8da9-2fa034483cad", "metadata": { "id": "703c7781-0a33-41dc-8da9-2fa034483cad" }, "source": [ "## Methodology\n", "\n", "In order to measure bias in the dataset, we consider the following simple [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) based approach. The idea is that the specificity of a term -- in our case, how `biased` it is -- can be quantified as an inverse function of the number of documents in which it occurs.\n", "\n", "Given a dataset and terms for a subpopulation (gender) of interest:\n", "1. Evaluate Inverse Document Frequencies on the full dataset\n", "2. Compute the average TF-IDF vectors for the dataset for a given subpopulation (gender)\n", "3. 
Sort the terms by variance to surface words that are much more likely to appear for one subpopulation than the other\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "7c837c65-987f-45cf-b18d-fc7836894372", "metadata": { "id": "7c837c65-987f-45cf-b18d-fc7836894372" }, "source": [ "### Load Fineweb\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "dbd19018", "metadata": { "id": "dbd19018" }, "outputs": [], "source": [ "# NOTE: limit=10000 keeps this a small sample that can run in Colab; remove it on a machine with more RAM.\n", "data_reader = ParquetReader(\"hf://datasets/HuggingFaceFW/fineweb/sample/10BT\", progress=True, limit=10000)\n", "corpus = map(lambda doc: doc.text, data_reader())" ] }, { "cell_type": "markdown", "source": [ "### Compute frequencies" ], "metadata": { "id": "eBj1TtiW2C-6" }, "id": "eBj1TtiW2C-6" }, { "cell_type": "code", "source": [ "# Step 1: get document frequencies for the dataset\n", "vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')\n", "full_tfidf = vectorizer.fit_transform(corpus)\n", "tfidf_feature_names = np.array(vectorizer.get_feature_names_out())\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "fOi2tMVr7ORS", "outputId": "ea085db2-a7e4-4038-9f0b-2264905d13dc" }, "id": "fOi2tMVr7ORS", "execution_count": 3, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ " 0%| | 0/10000 [00:00" ], "text/html": [
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 wordmanwomanman+woman+variancetotal
0man0.050.020.02-0.020.060.07
1woman0.010.06-0.020.020.060.07
2said0.030.03-0.000.000.040.06
3like0.020.02-0.000.000.030.05
4women0.010.03-0.010.010.030.04
5just0.020.02-0.000.000.030.05
6time0.020.02-0.000.000.030.04
7people0.020.02-0.000.000.030.04
8life0.020.02-0.000.000.030.04
9love0.010.02-0.000.000.020.03
10know0.010.02-0.000.000.020.03
11men0.020.010.00-0.000.020.03
12don0.010.02-0.000.000.020.03
13god0.010.01-0.000.000.020.03
14police0.020.010.00-0.000.020.03
15day0.010.01-0.000.000.020.03
16way0.010.02-0.000.000.020.03
17good0.010.01-0.000.000.020.03
18new0.010.010.00-0.000.020.03
19did0.010.01-0.000.000.020.03
20years0.010.01-0.000.000.020.03
21make0.010.02-0.000.000.020.03
22going0.010.01-0.000.000.020.03
23want0.010.01-0.000.000.020.03
24think0.010.01-0.000.000.020.03
25year0.010.01-0.000.000.020.03
26old0.010.010.00-0.000.020.03
27ve0.010.01-0.000.000.020.02
28work0.010.01-0.000.000.020.02
29book0.010.01-0.000.000.020.02
30world0.010.01-0.000.000.020.02
31right0.010.01-0.000.000.020.02
32say0.010.010.00-0.000.020.02
33really0.010.01-0.000.000.020.02
34home0.010.010.00-0.000.020.02
35got0.010.01-0.000.000.020.02
36family0.010.01-0.000.000.020.02
37story0.010.01-0.000.000.010.02
38things0.010.01-0.000.000.010.02
39didn0.010.01-0.000.000.010.02
40sex0.010.01-0.000.000.010.02
41great0.010.01-0.000.000.010.02
42come0.010.01-0.000.000.010.02
43does0.010.010.00-0.000.010.02
44young0.010.01-0.000.000.010.02
45let0.010.01-0.000.000.010.02
46state0.010.010.00-0.000.010.02
47told0.010.010.00-0.000.010.02
48read0.010.01-0.000.000.010.02
49little0.010.01-0.000.000.010.02
\n" ] }, "metadata": {}, "execution_count": 9 } ] }, { "cell_type": "markdown", "id": "e273abff-3d81-431f-9188-82d87d1ecda2", "metadata": { "id": "e273abff-3d81-431f-9188-82d87d1ecda2" }, "source": [ "### Sorting by bias\n", "\n", "In order to better surface biases, we can sort the table by how much one gender term over-represents other terms associated with it." ] }, { "cell_type": "markdown", "source": [ "#### Bias towards `man`\n", "\n", "In this case, we see occupation words such as `police` with a higher association towards `man`" ], "metadata": { "id": "pSzSO6RrTw9l" }, "id": "pSzSO6RrTw9l" }, { "cell_type": "code", "execution_count": 10, "id": "34229f06-5bf7-4ece-b43e-7d453931abd4", "metadata": { "id": "34229f06-5bf7-4ece-b43e-7d453931abd4", "collapsed": true, "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "outputId": "e9dda580-1b0e-4673-d077-10576a440ca7" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ], "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 wordmanwomanman+woman+variancetotal
0man0.050.020.02-0.020.060.07
11men0.020.010.00-0.000.020.03
14police0.020.010.00-0.000.020.03
34home0.010.010.00-0.000.020.02
18new0.010.010.00-0.000.020.03
46state0.010.010.00-0.000.010.02
47told0.010.010.00-0.000.010.02
26old0.010.010.00-0.000.020.03
43does0.010.010.00-0.000.010.02
32say0.010.010.00-0.000.020.02
37story0.010.01-0.000.000.010.02
20years0.010.01-0.000.000.020.03
6time0.020.02-0.000.000.030.04
17good0.010.01-0.000.000.020.03
13god0.010.01-0.000.000.020.03
25year0.010.01-0.000.000.020.03
42come0.010.01-0.000.000.010.02
41great0.010.01-0.000.000.010.02
30world0.010.01-0.000.000.020.02
45let0.010.01-0.000.000.010.02
33really0.010.01-0.000.000.020.02
19did0.010.01-0.000.000.020.03
49little0.010.01-0.000.000.010.02
35got0.010.01-0.000.000.020.02
22going0.010.01-0.000.000.020.03
28work0.010.01-0.000.000.020.02
31right0.010.01-0.000.000.020.02
15day0.010.01-0.000.000.020.03
24think0.010.01-0.000.000.020.03
38things0.010.01-0.000.000.010.02
27ve0.010.01-0.000.000.020.02
12don0.010.02-0.000.000.020.03
16way0.010.02-0.000.000.020.03
10know0.010.02-0.000.000.020.03
39didn0.010.01-0.000.000.010.02
2said0.030.03-0.000.000.040.06
48read0.010.01-0.000.000.010.02
3like0.020.02-0.000.000.030.05
23want0.010.01-0.000.000.020.03
44young0.010.01-0.000.000.010.02
29book0.010.01-0.000.000.020.02
5just0.020.02-0.000.000.030.05
7people0.020.02-0.000.000.030.04
21make0.010.02-0.000.000.020.03
40sex0.010.01-0.000.000.010.02
36family0.010.01-0.000.000.020.02
8life0.020.02-0.000.000.030.04
9love0.010.02-0.000.000.020.03
4women0.010.03-0.010.010.030.04
1woman0.010.06-0.020.020.060.07
\n" ] }, "metadata": {}, "execution_count": 10 } ], "source": [ "df.sort_values('man+', ascending=False).style.background_gradient(\n", " axis=None,\n", " vmin=0,\n", " vmax=0.2,\n", " cmap=\"YlGnBu\"\n", ").format(precision=2)" ] }, { "cell_type": "markdown", "source": [ "#### Bias towards `woman`\n", "\n", "In this case, we see words like `love`, `life`, and `family` with a higher skew towards `woman`." ], "metadata": { "id": "w-uy1KgQUDhP" }, "id": "w-uy1KgQUDhP" }, { "cell_type": "code", "source": [ "df.sort_values('woman+', ascending=False).style.background_gradient(\n", " axis=None,\n", " vmin=0,\n", " vmax=0.2,\n", " cmap=\"YlGnBu\"\n", ").format(precision=2)" ], "metadata": { "id": "ufATwOCojOdv", "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "outputId": "b7890818-92a6-400f-99d4-8406658c0560" }, "id": "ufATwOCojOdv", "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ], "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
 wordmanwomanman+woman+variancetotal
1woman0.010.06-0.020.020.060.07
4women0.010.03-0.010.010.030.04
9love0.010.02-0.000.000.020.03
8life0.020.02-0.000.000.030.04
36family0.010.01-0.000.000.020.02
40sex0.010.01-0.000.000.010.02
21make0.010.02-0.000.000.020.03
7people0.020.02-0.000.000.030.04
5just0.020.02-0.000.000.030.05
29book0.010.01-0.000.000.020.02
44young0.010.01-0.000.000.010.02
23want0.010.01-0.000.000.020.03
3like0.020.02-0.000.000.030.05
48read0.010.01-0.000.000.010.02
2said0.030.03-0.000.000.040.06
39didn0.010.01-0.000.000.010.02
10know0.010.02-0.000.000.020.03
16way0.010.02-0.000.000.020.03
12don0.010.02-0.000.000.020.03
27ve0.010.01-0.000.000.020.02
38things0.010.01-0.000.000.010.02
24think0.010.01-0.000.000.020.03
15day0.010.01-0.000.000.020.03
31right0.010.01-0.000.000.020.02
28work0.010.01-0.000.000.020.02
22going0.010.01-0.000.000.020.03
35got0.010.01-0.000.000.020.02
49little0.010.01-0.000.000.010.02
19did0.010.01-0.000.000.020.03
33really0.010.01-0.000.000.020.02
45let0.010.01-0.000.000.010.02
30world0.010.01-0.000.000.020.02
41great0.010.01-0.000.000.010.02
42come0.010.01-0.000.000.010.02
25year0.010.01-0.000.000.020.03
13god0.010.01-0.000.000.020.03
17good0.010.01-0.000.000.020.03
6time0.020.02-0.000.000.030.04
20years0.010.01-0.000.000.020.03
37story0.010.01-0.000.000.010.02
32say0.010.010.00-0.000.020.02
43does0.010.010.00-0.000.010.02
26old0.010.010.00-0.000.020.03
47told0.010.010.00-0.000.010.02
46state0.010.010.00-0.000.010.02
18new0.010.010.00-0.000.020.03
34home0.010.010.00-0.000.020.02
14police0.020.010.00-0.000.020.03
11men0.020.010.00-0.000.020.03
0man0.050.020.02-0.020.060.07
\n" ] }, "metadata": {}, "execution_count": 11 } ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.1" }, "colab": { "provenance": [], "machine_shape": "hm" } }, "nbformat": 4, "nbformat_minor": 5 }