Text Classification
fastText
English
kenhktsui committed on
Commit
f0e676b
•
1 Parent(s): ae5e8a4

Update README.md

Files changed (1)
  1. README.md +8 -6
README.md CHANGED
@@ -6,10 +6,11 @@ library_name: fasttext
6
  pipeline_tag: text-classification
7
  inference: false
8
  ---
9
- # llm-data-textbook-quality-fasttext-classifer-v2
10
 
11
 
12
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/IPmnl6Fc4bvUYnpkVZg8N.png)
 
13
 
14
  ## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**
15
 
@@ -28,11 +29,11 @@ The classifier had been applied to various pretraining dataset. See [**Benchmark
28
 
29
  Please note textbook quality is a subset of high quality.
30
 
31
- ## Feedback welcomed!
32
  Please give a like and leave a comment if you find this model helpful. I am in a continual journey to make LLM data curation better and easier.
33
 
34
 
35
- ## Examples
36
  Educational value is [0, 2]. Detailed formula is explained below.
37
  ```python
38
  predict_education_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
@@ -56,7 +57,7 @@ From inspection, it can be noted that the model does like scientific knowledge.
56
  It is also interested in Arsenal as a football club, however, it does not think a summary of a particular match has good educational value.
57
 
58
 
59
- ## Usage
60
  ```python
61
  from typing import List
62
  import re
@@ -95,7 +96,7 @@ predict_educational_value(["Hi"])
95
  # Output: [3.0000010156072676e-05]
96
 
97
  ```
98
- # Benchmark
99
  To make sure this classifier makes sense, it is applied to various datasets.
100
 
101
  Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
@@ -123,6 +124,7 @@ The score can be roughly interpreted as:
123
  |[HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.347 |Synthetic|
124
  |[armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.260 |Real|
125
  |[HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.154 |Synthetic|
 
126
  |[timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) |First 100,000 | 1.115 |Real|
127
  |[open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) |First 100,000 | 1.089 |Real|
128
  |[armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.068 |Real|
 
6
  pipeline_tag: text-classification
7
  inference: false
8
  ---
9
+ # 📚llm-data-textbook-quality-fasttext-classifer-v2
10
 
11
 
12
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/acAPg-_NawdIfE2XXwcgc.png)
13
+
14
 
15
  ## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**
16
 
 
29
 
30
  Please note textbook quality is a subset of high quality.
31
 
32
+ ## 💬Feedback welcomed!
33
  Please give a like and leave a comment if you find this model helpful. I am in a continual journey to make LLM data curation better and easier.
34
 
35
 
36
+ ## ✍️Examples
37
  Educational value is [0, 2]. Detailed formula is explained below.
38
  ```python
39
  predict_education_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
 
57
  It is also interested in Arsenal as a football club, however, it does not think a summary of a particular match has good educational value.
58
 
59
 
60
+ ## 🛠️Usage
61
  ```python
62
  from typing import List
63
  import re
 
96
  # Output: [3.0000010156072676e-05]
97
 
98
  ```
99
+ # 📊Benchmark
100
  To make sure this classifier makes sense, it is applied to various datasets.
101
 
102
  Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
 
124
  |[HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.347 |Synthetic|
125
  |[armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.260 |Real|
126
  |[HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.154 |Synthetic|
127
+ |[teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |First 100,000 | 1.121 |Synthetic|
128
  |[timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) |First 100,000 | 1.115 |Real|
129
  |[open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) |First 100,000 | 1.089 |Real|
130
  |[armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.068 |Real|
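
The README's formula, Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low), is an expected score over the three classes. A minimal sketch of that arithmetic follows. It is not the author's implementation: the `educational_value` helper, the `POINTS` mapping, and the label strings are hypothetical names chosen for illustration, and the probabilities are made-up inputs rather than classifier output.

```python
from typing import Dict

# Hypothetical label names: the README names the classes High, Mid, and Low,
# but the exact label strings used by the model are an assumption here.
POINTS: Dict[str, float] = {"High": 2.0, "Mid": 1.0, "Low": 0.0}


def educational_value(probs: Dict[str, float]) -> float:
    """Expected score: 2 * P(High) + 1 * P(Mid) + 0 * P(Low)."""
    return sum(POINTS[label] * p for label, p in probs.items())


# Made-up probabilities for illustration only.
score = educational_value({"High": 0.5, "Mid": 0.25, "Low": 0.25})
print(score)  # 1.25
```

Because the weights are 2, 1, and 0, the score always falls in [0, 2], which matches the range stated in the Examples section.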