---
license: mit
language:
- am
- ar
- bg
- bn
- cs
- da
- de
- el
- en
- es
- fa
- fi
- fr
- gu
- ha
- hi
- hu
- id
- it
- ja
- jv
- kn
- ko
- lt
- mr
- nl
- 'no'
- yo
- zh
- pl
- pt
- ro
- ru
- sk
- sv
- sw
- ta
- te
- th
- tr
- uk
- ur
- vi
- tl
---

# Shitsu

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/Lkw-M8a-AAfJiC81DobNl.jpeg" alt="A logo of a Shih Tzu reading a book" width="400"/>
</p>

A text scorer that rates text based on the amount of useful, textbook-like information it contains.
It outputs a score that generally lies between 0 and 1, but the score can fall outside these bounds because the model is a regressor.

Our model is built on fastText embeddings, so it can process large amounts of data quickly with limited compute.

This scorer can be used to filter useful information from large text corpora in many languages.

This model can also be found on [GitHub](https://github.com/lightblue-tech/shitsu).

# How to use

### With our scorer package

```bash
pip install git+https://github.com/lightblue-tech/shitsu.git
```

```python
from shitsu import ShitsuScorer

text_list = [
    "Photosynthesis is a system of biological processes by which photosynthetic organisms, such as most plants, algae, and cyanobacteria, convert light energy, typically from sunlight, into the chemical energy necessary to fuel their metabolism.",
    "Congratulations! You have all been selected to receive a free gift card worth $1000. Click on this link [Link] to claim your reward now. Limited time offer, so act fast! Don't miss out on this amazing opportunity."]

# Choose a language code from one of the 44 supported languages: 'am', 'ar', 'bg', 'bn', 'cs', 'da', 'de', 'el', 'en', 'es', 'fa', 'fi', 'fr', 'gu', 'ha', 'hi', 'hu', 'id', 'it', 'ja', 'jv', 'kn', 'ko', 'lt', 'mr', 'nl', 'no', 'pl', 'pt', 'ro', 'ru', 'sk', 'sv', 'sw', 'ta', 'te', 'th', 'tl', 'tr', 'uk', 'ur', 'vi', 'yo', 'zh'
language_code = "en"
scorer = ShitsuScorer(language_code)
scores = scorer.score(text_list)
scores
# array([ 0.9897383 , -0.08109612], dtype=float32)
```
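
Since the scorer returns one float per document, filtering a corpus reduces to a simple threshold. A minimal sketch continuing from the example above; the 0.5 cutoff is an arbitrary illustration, not a recommended value:

```python
# Keep only documents scoring above a chosen quality threshold.
threshold = 0.5  # assumption: tune this cutoff to suit your corpus
corpus = text_list  # any list of documents
kept = [doc for doc, score in zip(corpus, scorer.score(corpus)) if score > threshold]
print(f"Kept {len(kept)} of {len(corpus)} documents")
```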

### Without our scorer package (i.e. without pip install)

<details>
  <summary>Show full code</summary>
    
```python
import fasttext
import numpy as np
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from safetensors.torch import load_model
from tqdm.auto import tqdm

class FasttextEmbedRegressor(nn.Module):
    """Small MLP regressor over 300-dim fastText sentence embeddings."""

    def __init__(self, input_size=300):
        super().__init__()
        # Two hidden layers (64 -> 32) followed by a single-score output head.
        self.fc1 = nn.Linear(input_size, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

class ShitsuScorer:
    def __init__(self, lang_code):
        # Download the pretrained fastText vectors for the chosen language.
        fasttext_model_path = hf_hub_download(repo_id=f"facebook/fasttext-{lang_code}-vectors", filename="model.bin")
        self.fasttext_model = fasttext.load_model(fasttext_model_path)
        # Load the matching regressor weights from this repository.
        self.regressor_model = FasttextEmbedRegressor().eval()
        regressor_model_path = hf_hub_download(repo_id="lightblue/shitsu_text_scorer", filename=f"{lang_code}.safetensors")
        load_model(self.regressor_model, regressor_model_path)

    def score(self, text_list):
        # fastText sentence vectors must be computed on single-line text.
        embeddings = np.stack([self.fasttext_model.get_sentence_vector(x.replace("\n", " ")) for x in tqdm(text_list)])
        return self.regressor_model(torch.Tensor(embeddings)).detach().numpy().flatten()

text_list = [
    "Photosynthesis is a system of biological processes by which photosynthetic organisms, such as most plants, algae, and cyanobacteria, convert light energy, typically from sunlight, into the chemical energy necessary to fuel their metabolism.",
    "Congratulations! You have all been selected to receive a free gift card worth $1000. Click on this link [Link] to claim your reward now. Limited time offer, so act fast! Don't miss out on this amazing opportunity."]

scorer = ShitsuScorer("en")
scores = scorer.score(text_list)
scores
# array([ 0.9897383 , -0.08109612], dtype=float32)
```

</details>
<br/>



# How we made the training data

We provided a sample of tens of thousands of texts from [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) in various languages to a popular state-of-the-art LLM with the following system prompt:

```python
system_message = """You are a text filtering AI model.
Your input is a piece of text.
Your output is a score of how likely the text is to appear in a useful {language} textbook, encyclopedia, or any other important document.

Output your score on a scale of 0-100, with 0 meaning that the text contains no useful {language} information and 100 meaning that the text is very useful and is exceedingly likely to appear in a {language} textbook, encyclopedia, or any other important document. If the text is not mostly fluent, natural {language}, output 0.

Your output should be only an integer from 0-100."""
```
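
As a sketch of how such a prompt might be instantiated and its reply parsed (the actual LLM and calling code are not disclosed, so this is an assumption rather than the authors' pipeline; `system_message` is the string defined above):

```python
import re

# Fill the {language} placeholder for the target language.
prompt = system_message.format(language="English")

def parse_rating(reply: str) -> int:
    """Extract the 0-100 integer rating from a raw LLM reply, clamped to range."""
    match = re.search(r"\d+", reply)
    return min(max(int(match.group()), 0), 100) if match else 0

print(parse_rating("87"))  # -> 87
```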

This resulted in the dataset found at [lightblue/text_ratings](https://huggingface.co/datasets/lightblue/text_ratings).

We then trained a small neural network on top of fastText's embeddings to predict these scores.
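
A minimal training sketch under stated assumptions: the LLM's 0-100 ratings are normalized to 0-1, and the loss, optimizer, learning rate, and epoch count are illustrative guesses rather than the authors' published recipe. It reuses `FasttextEmbedRegressor` and the imports from the snippet above:

```python
import torch.optim as optim

def train_regressor(texts, ratings, lang_code="en", epochs=100, lr=1e-3):
    """Fit a FasttextEmbedRegressor to LLM ratings (integers in 0-100)."""
    fasttext_model = fasttext.load_model(
        hf_hub_download(repo_id=f"facebook/fasttext-{lang_code}-vectors", filename="model.bin")
    )
    # Embed each text as a single fastText sentence vector.
    X = torch.tensor(np.stack([
        fasttext_model.get_sentence_vector(t.replace("\n", " ")) for t in texts
    ]))
    y = torch.tensor(ratings, dtype=torch.float32).unsqueeze(1) / 100.0  # assumed normalization
    model = FasttextEmbedRegressor()
    optimizer = optim.Adam(model.parameters(), lr=lr)  # assumed optimizer
    loss_fn = nn.MSELoss()  # assumed loss
    for _ in range(epochs):
        optimizer.zero_grad()
        loss_fn(model(X), y).backward()
        optimizer.step()
    return model.eval()
```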

We chose the 44 languages in this dataset by taking the union of the 30 most popular languages on Earth according to [Ethnologue 2024](https://www.ethnologue.com/insights/ethnologue200/) and the 30 most popular languages within MADLAD-400.