kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2

Text Classification

fastText

English

kenhktsui committed on May 20, 2024

Commit

f0e676b

1 Parent(s): ae5e8a4

Update README.md

Files changed (1)

README.md +8 -6

@@ -6,10 +6,11 @@ library_name: fasttext
 pipeline_tag: text-classification
 inference: false
 ---
-# llm-data-textbook-quality-fasttext-classifer-v2
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/IPmnl6Fc4bvUYnpkVZg8N.png)
 ## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**
@@ -28,11 +29,11 @@ The classifier had been applied to various pretraining dataset. See [**Benchmark
 Please note textbook quality is a subset of high quality.
-## Feedback welcomed!
 Please give a like and leave a comment if you find this model helpful. I am in a continual journey to make LLM data curation better and easier.
-## Examples
 Educational value is [0, 2]. Detailed formula is explained below.
 ```python
 predict_education_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
@@ -56,7 +57,7 @@ From inspection, it can be noted that the model does like scientific knowledge.
 It is also interested in Arsenal as a football club, however, it does not think a summary of a particular match has good educational value.
-## Usage
 ```python
 from typing import List
 import re
@@ -95,7 +96,7 @@ predict_educational_value(["Hi"])
 # Output: [3.0000010156072676e-05]
 ```
-# Benchmark
 To make sure this classifier makes sense, it is applied to various datasets.
 Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
@@ -123,6 +124,7 @@ The score can be roughly interpreted as:
 |[HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.347 |Synthetic|
 |[armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.260 |Real|
 |[HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.154 |Synthetic|
 |[timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) |First 100,000 | 1.115 |Real|
 |[open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) |First 100,000 | 1.089 |Real|
 |[armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.068 |Real|

 pipeline_tag: text-classification
 inference: false
 ---
+# 📚llm-data-textbook-quality-fasttext-classifer-v2
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/acAPg-_NawdIfE2XXwcgc.png)
 ## **"Garbage in, garbage out. A language model is only as good as its training data irrespective of its parameter count."**
 Please note textbook quality is a subset of high quality.
+## 💬Feedback welcomed!
 Please give a like and leave a comment if you find this model helpful. I am in a continual journey to make LLM data curation better and easier.
+## ✏️Examples
 Educational value is [0, 2]. Detailed formula is explained below.
 ```python
 predict_education_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
 It is also interested in Arsenal as a football club, however, it does not think a summary of a particular match has good educational value.
+## 🛠️Usage
 ```python
 from typing import List
 import re
 # Output: [3.0000010156072676e-05]
 ```
+# 📊Benchmark
 To make sure this classifier makes sense, it is applied to various datasets.
 Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)
 |[HuggingFaceTB/cosmopedia auto_math_text](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.347 |Synthetic|
 |[armanc/scientific_papers pubmed](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.260 |Real|
 |[HuggingFaceTB/cosmopedia stories](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) |First 100,000 | 1.154 |Synthetic|
+|[teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |First 100,000 | 1.121 |Synthetic|
 |[timdettmers/openassistant-guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) |First 100,000 | 1.115 |Real|
 |[open-web-math/open-web-math](https://huggingface.co/datasets/open-web-math/open-web-math) |First 100,000 | 1.089 |Real|
 |[armanc/scientific_papers arxiv](https://huggingface.co/datasets/armanc/scientific_papers) |First 100,000 | 1.068 |Real|
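
For reference, the scoring rule quoted in the Benchmark section of this README (`Educational Value = 2 point * P(High) + 1 point * P(Mid) + 0 point * P(Low)`) can be sketched as a small standalone function. This is a minimal sketch, not the README's own `predict_educational_value` implementation, which is elided in this diff. The names `SCORE_WEIGHTS` and `educational_value` are hypothetical, and the class keys `"High"`, `"Mid"`, `"Low"` are assumed to already be plain strings; the model's real label strings may differ.

```python
from typing import Dict

# Points per class, taken from the formula in the README.
# The keys are assumed label names; the model's actual labels may differ.
SCORE_WEIGHTS: Dict[str, float] = {"High": 2.0, "Mid": 1.0, "Low": 0.0}


def educational_value(probs: Dict[str, float]) -> float:
    """Return the expected score in [0, 2] for one document.

    `probs` maps a class name to its predicted probability.
    Classes missing from `probs` are treated as probability 0.
    """
    return sum(weight * probs.get(label, 0.0)
               for label, weight in SCORE_WEIGHTS.items())


if __name__ == "__main__":
    # 2 * 0.5 + 1 * 0.25 + 0 * 0.25
    print(educational_value({"High": 0.5, "Mid": 0.25, "Low": 0.25}))  # 1.25
```

Because the weights are 2, 1, and 0, the result is the expected number of points under the predicted class distribution, which is why the score is bounded by [0, 2] as stated in the Examples section.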