Update README.md
README.md
CHANGED
@@ -6,10 +6,11 @@ library_name: fasttext
 pipeline_tag: text-classification
 inference: false
 ---
-# llm-data-textbook-quality-fasttext-classifer-v2
+# 📚llm-data-textbook-quality-fasttext-classifer-v2
 
 
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/acAPg-_NawdIfE2XXwcgc.png)
+
 
 ## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**
 
@@ -28,11 +29,11 @@ The classifier had been applied to various pretraining dataset. See [**Benchmark
 
 Please note textbook quality is a subset of high quality.
 
-## Feedback welcomed!
+## 💬Feedback welcomed!
 Please give a like and leave a comment if you find this model helpful. I am in a continual journey to make LLM data curation better and easier.
 
 
-## Examples
+## ✏️Examples
 Educational value is [0, 2]. Detailed formula is explained below.
 ```python
 predict_education_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
@@ -56,7 +57,7 @@ From inspection, it can be noted that the model does like scientific knowledge.
 It is also interested in Arsenal as a football club, however, it does not think a summary of a particular match has good educational value.
 
 
-## Usage
+## 🛠️Usage
 ```python
 from typing import List
 import re
@@ -95,7 +96,7 @@ predict_educational_value(["Hi"])
 # Output: [3.0000010156072676e-05]
 
 ```
-# Benchmark
+# 📊Benchmark
 To make sure this classifier makes sense, it is applied to various datasets.
 
 Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
@@ -123,6 +124,7 @@ The score can be roughly interpreted as:
 |[HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.347 |Synthetic|
 |[armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.260 |Real|
 |[HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.154 |Synthetic|
+|[teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |First 100,000 | 1.121 |Synthetic|
 |[timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) |First 100,000 | 1.115 |Real|
 |[open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) |First 100,000 | 1.089 |Real|
 |[armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.068 |Real|
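
The Benchmark section of the README in this diff defines the score as Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low). A minimal sketch of that arithmetic in plain Python may help when reading the table values. The function name `educational_value`, the `CLASS_POINTS` table, and the dictionary keys are hypothetical and chosen for illustration only. How the classifier's raw output labels map to High, Mid, and Low is not visible in this diff, so the sketch takes the three probabilities as given.

```python
from typing import Dict

# Points per class, copied from the README formula:
# Educational Value = 2 * P(High) + 1 * P(Mid) + 0 * P(Low)
CLASS_POINTS: Dict[str, float] = {"High": 2.0, "Mid": 1.0, "Low": 0.0}


def educational_value(probs: Dict[str, float]) -> float:
    """Return the expected score for one document.

    `probs` maps the class names "High", "Mid" and "Low" to probabilities.
    A class missing from `probs` is treated as probability 0, which is an
    assumption of this sketch and not something the README states.
    """
    return sum(points * probs.get(name, 0.0) for name, points in CLASS_POINTS.items())


if __name__ == "__main__":
    # 2 * 0.5 + 1 * 0.25 + 0 * 0.25 = 1.25
    print(educational_value({"High": 0.5, "Mid": 0.25, "Low": 0.25}))
```

Because the three probabilities sum to 1, the result always falls in [0, 2], which matches the range stated in the Examples section. A document the classifier rates as almost certainly Low scores close to 0, as with the `[3.0000010156072676e-05]` output shown for `"Hi"`.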