tokyotech-llm/edu-classifier

Tags: Text Classification, fastText, Japanese

aya-se committed on Jan 27

Commit 3cff5d9 (parent: 8499e52)

Update README.md

Files changed (2)

README.md +23 -11
README_ja.md +38 -1

README.md CHANGED

@@ -1,7 +1,7 @@
 ---
 license: other
 license_name: mixed
-license_link: https://huggingface.co/datasets/tokyotech-llm/edu-classifier/blob/main/README.md
 language: ja
 pipeline_tag: text-classification
 library_name: fasttext
@@ -9,7 +9,7 @@ library_name: fasttext
 # Swallow Edu Classifier
-[日本語版の README はこちら](https://huggingface.co/datasets/tokyotech-llm/edu-classifier/blob/main/README_ja.md)
 ## Model summary
@@ -17,8 +17,8 @@ library_name: fasttext
 This repository contains fastText classifiers for judging the educational value of Japanese web pages. It includes two types of classifiers:
-1. **Wiki-based classifier**: trained on Japanese Wikipedia text from academic categories. This classifier is released under the [Apache-2.0 License](https://huggingface.co/datasets/tokyotech-llm/edu-classifier/blob/main/APACHE_LICENSE_VERSION_2.0.md).
-2. **LLM-based classifier**: trained on annotations provided by an LLM, governed by the license applicable to the underlying LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/datasets/tokyotech-llm/edu-classifier/blob/main/LLAMA_3) or [Gemma Terms of Use](https://huggingface.co/datasets/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).
 These classifiers were developed as part of the quality-filtering process for Swallow Corpus Version 2, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our ablation experiments showed that filtering by the classifiers' scores improved the LLM's Japanese knowledge.
@@ -30,18 +30,30 @@ The Wiki-based classifier outputs a probability between 0 and 1, indicating how
 from huggingface_hub import hf_hub_download
 import fasttext
-model = fasttext.load_model(hf_hub_download("tokyotech-llm/edu-classifier", "model.bin"))
-```
-### Best practice
-If you aim to assign appropriately ranked scores to a wide range of documents, we recommend the LLM-based classifier. The Wiki-based classifier tends to assign scores close to 0 to most documents, so it is specialized for detecting the few documents that resemble Wikipedia. In contrast, the LLM-based classifier can grade documents according to a broader definition of educational value.
-## Training
-### Wiki-based classifier
-### LLM-based classifier
 ## How to cite

 ---
 license: other
 license_name: mixed
+license_link: https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/README.md
 language: ja
 pipeline_tag: text-classification
 library_name: fasttext
 # Swallow Edu Classifier
+[日本語版の README はこちら](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/README_ja.md)
 ## Model summary
 This repository contains fastText classifiers for judging the educational value of Japanese web pages. It includes two types of classifiers:
+1. **Wiki-based classifier**: trained on Japanese Wikipedia text from academic categories. This classifier is released under the [Apache-2.0 License](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/APACHE_LICENSE_VERSION_2.0.md).
+2. **LLM-based classifier**: trained on annotations provided by an LLM, governed by the license applicable to the underlying LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/LLAMA_3) or [Gemma Terms of Use](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).
 These classifiers were developed as part of the quality-filtering process for Swallow Corpus Version 2, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our ablation experiments showed that filtering by the classifiers' scores improved the LLM's Japanese knowledge.
 from huggingface_hub import hf_hub_download
 import fasttext
+# Example text
+text = "Llama 3.1 Swallow\nLlama 3.1 SwallowはLlama 3.1の英語の能力を維持しながら、日本語の能力を強化した大規模言語モデル (8B, 70B) です。"
+text = text.replace("\n", " ")
+# If you use the Wiki-based classifier
+model = fasttext.load_model(hf_hub_download("tokyotech-llm/edu-classifier", "wiki.bin"))
+res = model.predict(text, k=-1)
+## Use the positive prediction probability as the educational score
+edu_score = res[1][0] if res[0][0] == "__label__pos" else 1 - res[1][0]
+# If you use the LLM-based classifier
+model = fasttext.load_model(
+    hf_hub_download("tokyotech-llm/edu-classifier", "llm_llama.bin")
+)
+res = model.predict(text, k=-1)
+## Use the weighted sum of the prediction probabilities as the educational score
+edu_score = sum([int(label[-1]) * prob for label, prob in zip(res[0], res[1])])
+```
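The two scoring rules in the snippet above can be factored into small helpers that work directly on the `(labels, probs)` pair returned by `model.predict(text, k=-1)`. The following is a minimal sketch and not part of the repository: the function names `wiki_edu_score` and `llm_edu_score` are hypothetical, the label strings in the demo are hand-written stand-ins, and the code assumes the label formats implied by the snippet (`__label__pos` for the Wiki-based positive class, and labels whose last character is the score digit for the LLM-based classifier). Looking the positive label up by name should be approximately equivalent to the `1 - res[1][0]` form for a two-label model, but that is an assumption rather than something the README states.

```python
from typing import Sequence


def wiki_edu_score(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability of the positive class from a k=-1 prediction."""
    # Look the positive label up by name instead of relying on its rank.
    for label, prob in zip(labels, probs):
        if label == "__label__pos":
            return float(prob)
    return 0.0  # positive label absent from the prediction


def llm_edu_score(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the expected score: sum of (score digit * probability)."""
    return float(sum(int(label[-1]) * prob for label, prob in zip(labels, probs)))


# Hand-written stand-ins for the output of model.predict(text, k=-1).
wiki_demo = wiki_edu_score(("__label__neg", "__label__pos"), (0.9, 0.1))
llm_demo = llm_edu_score(
    ("__label__2", "__label__3", "__label__1", "__label__0"),
    (0.5, 0.25, 0.125, 0.125),
)
print(wiki_demo, llm_demo)
```

With a loaded model, the call would look like `llm_edu_score(*model.predict(text, k=-1))`.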
+### Best practice
+If you aim to assign appropriately ranked scores to a wide range of documents, we recommend the LLM-based classifier. The Wiki-based classifier tends to assign scores close to 0 to most documents, so it is specialized for detecting the few documents that resemble Wikipedia. In contrast, the LLM-based classifier can grade documents according to a broader definition of educational value.
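As a rough illustration of score-based quality filtering, the sketch below keeps documents whose score meets a threshold and returns them best first. It is a sketch under stated assumptions, not the pipeline used for Swallow Corpus Version 2: `filter_by_score`, `score_fn`, `toy_score`, and the threshold value are all hypothetical, and the README does not say which threshold was applied in practice.

```python
from typing import Callable, Iterable, List, Tuple


def filter_by_score(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> List[Tuple[float, str]]:
    """Score each document and keep those at or above the threshold, best first."""
    scored = []
    for doc in docs:
        # fastText predicts one line at a time, so newlines are flattened first.
        score = score_fn(doc.replace("\n", " "))
        if score >= threshold:
            scored.append((score, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_score(text: str) -> float:
    """Toy scorer standing in for a real classifier call; it only counts characters."""
    return min(len(text) / 10.0, 3.0)


kept = filter_by_score(
    ["short", "a much longer document\nwith two lines"],
    toy_score,
    threshold=1.0,
)
print(kept)
```

In real use, `score_fn` would wrap a loaded fastText model and return the educational score computed from its prediction.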
 ## How to cite

README_ja.md CHANGED

@@ -4,7 +4,7 @@
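 **Note**: These classifiers work only on Japanese text. Quality on English or any other language is not guaranteed.
-fastText classifiers for judging the educational value of Japanese web pages. This repository contains a classifier trained on Japanese Wikipedia text from academic categories (Wiki-based classifier) and a classifier trained on annotations provided by an LLM (LLM-based classifier). The former is released under the [Apache-2.0 License](https://huggingface.co/datasets/tokyotech-llm/edu-classifier/blob/main/APACHE_LICENSE_VERSION_2.0.md), and the latter under the license corresponding to the LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/datasets/tokyotech-llm/edu-classifier/blob/main/LLAMA_3.1_COMMUNITY_LICENSE_AGREEMENT.md) or [Gemma Terms of Use](https://huggingface.co/datasets/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).
 These classifiers were developed as part of the quality filtering for Swallow Corpus Version 2, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. In ablation experiments, we confirmed that filtering by the classifiers' scores improves the LLM's Japanese knowledge.
@@ -12,6 +12,43 @@
 The Wiki-based classifier outputs a probability between 0 and 1 indicating how Wikipedia-like a given document is. The LLM-based classifier, in contrast, predicts which educational score (0, 1, 2, or 3) a given document belongs to, as a four-label classification problem. The expected score (0 to 3), computed from the predicted probability of each label, can be used as the final score.
 ### Best practice
 If you aim to assign appropriately ranked scores to a wide range of documents, we recommend the LLM-based classifier. The Wiki-based classifier tends to assign scores close to 0 to most documents, so it is specialized for detecting the few documents that resemble Wikipedia. In contrast, the LLM-based classifier can grade documents according to a more general definition of educational value.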

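 **Note**: These classifiers work only on Japanese text. Quality on English or any other language is not guaranteed.
+fastText classifiers for judging the educational value of Japanese web pages. This repository contains a classifier trained on Japanese Wikipedia text from academic categories (Wiki-based classifier) and a classifier trained on annotations provided by an LLM (LLM-based classifier). The former is released under the [Apache-2.0 License](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/APACHE_LICENSE_VERSION_2.0.md), and the latter under the license corresponding to the LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/LLAMA_3.1_COMMUNITY_LICENSE_AGREEMENT.md) or [Gemma Terms of Use](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).
 These classifiers were developed as part of the quality filtering for Swallow Corpus Version 2, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. In ablation experiments, we confirmed that filtering by the classifiers' scores improves the LLM's Japanese knowledge.
 The Wiki-based classifier outputs a probability between 0 and 1 indicating how Wikipedia-like a given document is. The LLM-based classifier, in contrast, predicts which educational score (0, 1, 2, or 3) a given document belongs to, as a four-label classification problem. The expected score (0 to 3), computed from the predicted probability of each label, can be used as the final score.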
+```python
+from huggingface_hub import hf_hub_download
+import fasttext
+# Example text
+text = "Llama 3.1 Swallow\nLlama 3.1 SwallowはLlama 3.1の英語の能力を維持しながら、日本語の能力を強化した大規模言語モデル (8B, 70B) です。"
+text = text.replace("\n", " ")
+# When using the Wiki-based classifier
+model = fasttext.load_model(hf_hub_download("tokyotech-llm/edu-classifier", "wiki.bin"))
+res = model.predict(text, k=-1)
+## Treat the positive-class prediction probability as the educational score
+edu_score = res[1][0] if res[0][0] == "__label__pos" else 1 - res[1][0]
+# When using the LLM-based classifier
+model = fasttext.load_model(
+    hf_hub_download("tokyotech-llm/edu-classifier", "llm_llama.bin")
+)
+res = model.predict(text, k=-1)
+## Treat the expected score under the predicted probabilities as the educational score
+edu_score = sum([int(label[-1]) * prob for label, prob in zip(res[0], res[1])])
+```
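 ### Best practice
 If you aim to assign appropriately ranked scores to a wide range of documents, we recommend the LLM-based classifier. The Wiki-based classifier tends to assign scores close to 0 to most documents, so it is specialized for detecting the few documents that resemble Wikipedia. In contrast, the LLM-based classifier can grade documents according to a more general definition of educational value.
+## How to cite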
+```bibtex
+@inproceedings{hattori-2025-swallow-v2,
+  author = {服部 翔 and 岡崎 直観 and 水木 栄 and 藤井 一喜 and 中村 泰士 and 大井 聖也 and 塩谷 泰平 and 齋藤 幸史郎 and Youmi Ma and 前田 航希 and 岡本 拓己 and 石田 茂樹 and 横田 理央 and 高村 大也},
+  title = {Swallowコーパスv2: 教育的な日本語ウェブコーパスの構築},
+  booktitle = {言語処理学会第31回年次大会 (NLP2025)},
+  month = mar,
+  year = {2025},
+}
+```