Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +124 -0
- hubrainv5_decoder.pth +3 -0
- hubrainv5_encoder.pth +3 -0
- latens_space.png +3 -0
- metadata.tsv +108 -0
- pos_vocab.json +19 -0
- vectors.tsv +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+latens_space.png filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,124 @@
---
library_name: pytorch
tags:
- hungarian
- transformer
- encoder
- tokenization-free
- character-based
- glass-box
license: cc-by-sa-4.0
datasets:
- webkorpusz-2.0
metrics:
- pos-accuracy
- word-reconstruction
---

# 🧠 HuBrain: Tokenization-free Hungarian Semantic Encoder / Tokenizáció-mentes magyar szemantikai encoder

[English Version](#english) | [Magyar Változat](#magyar)

---

<a name="english"></a>
## 🌍 English Description

HuBrain is an experimental, character-based **Glass-Box Semantic Encoder** designed to model the morphological richness and semantic relationships of the Hungarian language without traditional tokenization (e.g., BPE).

### 🚀 Live Visualization
View a PCA/t-SNE projection of the 1280-dimensional semantic latent space here:
👉 **[HuBrain Projector Visualization](https://projector.tensorflow.org/?config=https://jevcsak.hu/model/hubrain.json)**



### 📈 Training Progress (latest logs)
The model is currently in **Phase 2 (Joint Training)**. Recent logs show high stability and emergent factual knowledge:
- **POS Accuracy (Pm):** ~91.5% - 97.1%
- **Word Reconstruction (Wm):** ~30% - 74% (Emerging)
- **Latent Stability (Mag):** ~100 (Balanced vector magnitude)
- **Learning Rate:** 2.4e-05

### 🛠️ Technical Specifications
- **Architecture:** Transformer Encoder with RoPE support.
- **Dimensions:** 1536 (256 Anchors + 1280 Semantic Context).
- **Layers:** 18 Layers, 24 Heads.
- **Input:** Raw characters (64-char fixed word length).
- **Vocab:** No OOV issues (Character-level coverage).
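
The figures in the specification list can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the class name, field names, padding value, and the code-point encoding are assumptions, not taken from the HuBrain source. It shows that 256 + 1280 gives the 1536-wide model dimension, that 24 heads divide it evenly, and one plausible way to turn a word into a fixed 64-position character input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HuBrainDims:
    """Sizes quoted in the specification list; the field names are hypothetical."""
    anchor_dim: int = 256
    semantic_dim: int = 1280
    num_heads: int = 24
    num_layers: int = 18
    max_word_len: int = 64

    @property
    def model_dim(self) -> int:
        # Anchor part plus semantic-context part.
        return self.anchor_dim + self.semantic_dim

    @property
    def head_dim(self) -> int:
        # Multi-head attention needs the width to split evenly across heads.
        if self.model_dim % self.num_heads:
            raise ValueError("model_dim must be divisible by num_heads")
        return self.model_dim // self.num_heads


PAD_ID = 0  # assumed padding value, not taken from the model code


def encode_word(word: str, max_len: int = 64) -> list[int]:
    """Map a word to a fixed-length list of Unicode code points (assumed scheme)."""
    ids = [ord(ch) for ch in word[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


dims = HuBrainDims()
print(dims.model_dim, dims.head_dim)
print(encode_word("ad")[:4])
```

Because every character maps to some integer, no word can fall outside the vocabulary; the cost is that words longer than the fixed length are truncated in this sketch.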

### 📥 Model Download
The model weight files (`.pth`, 6.7 GB in total) are stored on Hugging Face because of their size. You can download them with the following command:

```bash
# Required: pip install huggingface_hub
python download_model.py
```
Or manually from: [https://huggingface.co/Braien/HuBrain-Encoder](https://huggingface.co/Braien/HuBrain-Encoder)
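
The contents of `download_model.py` are not shown in this commit. As a rough idea of what such a script could do, here is a minimal sketch built on `hf_hub_download` from `huggingface_hub`. It assumes the repository linked above hosts the two checkpoint files under the names listed in this commit; the function name and the target directory are hypothetical.

```python
from pathlib import Path

REPO_ID = "Braien/HuBrain-Encoder"  # repository from the README link
WEIGHT_FILES = ("hubrainv5_encoder.pth", "hubrainv5_decoder.pth")  # names from this commit


def download_weights(target_dir: str = "weights") -> list[Path]:
    """Fetch both checkpoints into target_dir and return their local paths."""
    # Imported lazily so the module can be inspected without the dependency installed.
    from huggingface_hub import hf_hub_download

    Path(target_dir).mkdir(parents=True, exist_ok=True)
    return [
        Path(hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=target_dir))
        for name in WEIGHT_FILES
    ]

# Example (downloads about 6.7 GB):
# paths = download_weights("weights")
```

Files that are already present in the local cache are not fetched again, so the call is safe to repeat after an interrupted download.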

### ⚙️ Requirements
```bash
pip install torch numpy huggingface_hub
```

### 🧪 Diagnostic Tools
- **`test_mask_prediction.py`**: Context-based word recovery.
- **`test_analogy.py`**: Semantic analogies (e.g. King-Man+Woman).
- **`export_projector.py`**: Export to TF Projector format.
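
The analogy test follows the familiar vector-arithmetic pattern: compute `vec(a) - vec(b) + vec(c)` and look for the nearest remaining word by cosine similarity. The sketch below is not `test_analogy.py` itself; it is a dependency-free illustration, and the three-dimensional vectors are toy values chosen by hand, not model output.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c, top_k=1):
    """Return the top_k words closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    scored = [(cosine(query, vec), word)
              for word, vec in vectors.items() if word not in (a, b, c)]
    return [word for _, word in sorted(scored, reverse=True)[:top_k]]


# Toy vectors for illustration only; real HuBrain vectors have 1280 semantic dimensions.
toy = {
    "király": [0.9, 0.8, 0.1],
    "férfi": [0.1, 0.9, 0.1],
    "nő": [0.1, 0.1, 0.9],
    "királynő": [0.9, 0.1, 0.9],
    "kutya": [0.2, 0.5, 0.3],
}
print(analogy(toy, "király", "férfi", "nő"))
```

Excluding the three input words from the candidates matters: without that filter the query vector usually stays closest to one of its own inputs.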

### ⚖️ Licensing & Data Sources
This model was trained using the **Webkorpusz 2.0** dataset. By using this model, you agree to comply with the following licenses:
- **Common Crawl subcorpus**: Used under the same terms as [Common Crawl](https://commoncrawl.org/terms-of-use/) itself.
- **Wikipedia subcorpus & processed data**: Licensed under **Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)**.
- **Disclaimer**: The training data originates from automated web crawling; the model creator assumes no responsibility for the content.

---

<a name="magyar"></a>
## 🇭🇺 Magyar leírás

A HuBrain egy kísérleti, karakter-alapú **Glass-Box Szemantikai Encoder**, amely a magyar nyelv morfológiai gazdagságát és szemantikai összefüggéseit modellezi hagyományos tokenizáció (pl. BPE) használata nélkül.

### 🚀 Élő Vizualizáció
A modell látens terének 1280 dimenziós szemantikai leképzése megtekinthető itt:
👉 **[HuBrain Projector Visualization](https://projector.tensorflow.org/?config=https://jevcsak.hu/model/hubrain.json)**



### 📈 Tréning Állapot (utolsó logok)
A modell jelenleg a **Phase 2 (Joint Training)** fázisban van. Az utolsó logok stabil tanulást és kialakuló tudást mutatnak:
- **POS Pontosság (Pm):** ~91.5% - 97.1%
- **Szó Rekonstrukció (Wm):** ~30% - 74% (Folyamatosan javul)
- **Látens Stabilitás (Mag):** ~100 (Kiegyensúlyozott vektor magnitúdó)
- **Tanulási ráta:** 2.4e-05

### 🛠️ Technikai adatok
- **Architektúra:** Transformer Encoder RoPE támogatással.
- **Dimenziók:** 1536 (256 Horgony + 1280 Szemantikai kontextus).
- **Rétegszám:** 18 réteg, 24 fej.
- **Bemenet:** Nyers karakterek (64 karakteres fix szóhossz).
- **Vocab:** Nincs OOV (szótáron kívüli szó) probléma a karakter-szintű lefedettség miatt.

### 📥 Modell letöltése
A nagyméretű modellfájlok (`.pth`, összesen 6.7 GB) a Hugging Face-en tárolódnak. Az alábbi parancs futtatásával töltheted le őket:

```bash
# Szükséges: pip install huggingface_hub
python download_model.py
```
Vagy manuálisan innen: [https://huggingface.co/Braien/HuBrain-Encoder](https://huggingface.co/Braien/HuBrain-Encoder)

### ⚙️ Követelmények
```bash
pip install torch numpy huggingface_hub
```

### 🧪 Diagnosztikai eszközök
- **`test_mask_prediction.py`**: Környezet alapú szó-visszafejtés.
- **`test_analogy.py`**: Szemantikai analógiák (pl. király - férfi + nő).
- **`export_projector.py`**: Exportálás TF Projector vizualizációhoz.

### ⚖️ Licenc és Adatforrások
A modell tanításához a **Webkorpusz 2.0** adatbázist használtuk fel. A modell használatával Ön elfogadja az alábbi licencfeltételeket:
- **Common Crawl alkorpusz**: A [Common Crawl](https://commoncrawl.org/terms-of-use/) saját felhasználási feltételei szerint került felhasználásra.
- **Wikipedia alkorpusz és feldolgozott adatok**: A **Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)** licenc alá tartoznak.
- **Felelősségkizárás**: Az adatok automatizált webes gyűjtésből származnak, a tartalmukért a modell készítője nem vállal felelősséget.

hubrainv5_decoder.pth
ADDED
|
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c87b52de75d48a18261e5d782f3ffe7bd1258d6aef8c724ddfa79a734e7e4986
size 657997178
|
hubrainv5_encoder.pth
ADDED
|
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f4663af533cf6129a897518f212d96e0677dfe20f8037f4a35f7d5fe67088658
size 6056357880
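
The two `.pth` entries are Git LFS pointer files: each records the spec version, a SHA-256 object ID, and the byte size of the real checkpoint. A downloaded file can be checked against its pointer along the following lines. This is a sketch; the function names are hypothetical, and only the pointer text for the encoder is copied from the diff above.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file's size and SHA-256 digest against a parsed pointer."""
    if pointer["algo"] != "sha256" or path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in chunks so a multi-gigabyte checkpoint is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


ENCODER_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f4663af533cf6129a897518f212d96e0677dfe20f8037f4a35f7d5fe67088658
size 6056357880
"""
pointer = parse_lfs_pointer(ENCODER_POINTER)
print(pointer["algo"], pointer["size"])
```

The size comparison runs first because it is cheap and catches truncated downloads before any hashing starts.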
|
latens_space.png
ADDED
|
Git LFS Details
|
metadata.tsv
ADDED
|
@@ -0,0 +1,108 @@
Word	Category	POS
Budapest	City	[/N]
Debrecen	City	[/N]
Szeged	City	[/N]
Pécs	City	[/N]
Győr	City	[/N]
Miskolc	City	[/N]
London	City	[/N]
Párizs	City	[/N]
Berlin	City	[/N]
Róma	City	[/N]
Prága	City	[/N]
Bécs	City	[/N]
New York	City	[/N]
Tokió	City	[/N]
Moszkva	City	[/N]
kutya	Animal	[/N]
macska	Animal	[/N]
ló	Animal	[/N]
tehén	Animal	[/N]
farkas	Animal	[/N]
medve	Animal	[/N]
oroszlán	Animal	[/N]
tigris	Animal	[/N]
sas	Animal	[/N]
hal	Animal	[/N]
madár	Animal	[/N]
béka	Animal	[/N]
apa	Family	[/N]
anya	Family	[/N]
fiú	Family	[/N]
lány	Family	[/N]
testvér	Family	[/N]
báty	Family	[/N]
öcs	Family	[/N]
húg	Family	[/N]
nagypapa	Family	[/N]
nagymama	Family	[/N]
gyerek	Family	[/N]
unoka	Family	[/N]
fut	Verb	[/V]
megy	Verb	[/V]
jön	Verb	[/V]
lát	Verb	[/V]
hall	Verb	[/V]
beszél	Verb	[/V]
eszik	Verb	[/V]
iszik	Verb	[/V]
alszik	Verb	[/V]
játszik	Verb	[/V]
tanul	Verb	[/V]
dolgozik	Verb	[/V]
vág	Verb	[/V]
szeret	Verb	[/V]
akar	Verb	[/V]
tud	Verb	[/V]
kér	Verb	[/V]
ad	Verb	[/V]
nagy	Adj	[/Adj]
kicsi	Adj	[/Adj]
szép	Adj	[/Adj]
csúnya	Adj	[/Adj]
jó	Adj	[/Adj]
rossz	Adj	[/Adj]
hideg	Adj	[/Adj]
meleg	Adj	[/Adj]
erős	Adj	[/Adj]
gyenge	Adj	[/Adj]
okos	Adj	[/Adj]
buta	Adj	[/Adj]
asztal	Tool	[/N]
szék	Tool	[/N]
ágy	Tool	[/N]
szekrény	Tool	[/N]
kés	Tool	[/N]
villa	Tool	[/N]
kanál	Tool	[/N]
tányér	Tool	[/N]
pohár	Tool	[/N]
autó	Tool	[/N]
hajó	Tool	[/N]
repülő	Tool	[/N]
bicikli	Tool	[/N]
telefon	Tool	[/N]
számítógép	Tool	[/N]
idő	Abstract	[/N]
élet	Abstract	[/N]
halál	Abstract	[/N]
szabadság	Abstract	[/N]
törvény	Legal	[/N]
jog	Legal	[/N]
bíróság	Legal	[/N]
bűn	Legal	[/N]
igazság	Legal	[/N]
egy	Number	[/Num]
kettő	Number	[/Num]
három	Number	[/Num]
négy	Number	[/Num]
öt	Number	[/Num]
tíz	Number	[/Num]
száz	Number	[/Num]
ezer	Number	[/Num]
piros	Color	[/Adj]
kék	Color	[/Adj]
zöld	Color	[/Adj]
sárga	Color	[/Adj]
fekete	Color	[/Adj]
fehér	Color	[/Adj]
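
`metadata.tsv` labels the rows of `vectors.tsv` for the projector: one header row, then one word per line with its category and POS tag. A quick way to inspect it is to group the words by category. The sketch below assumes the columns are tab-separated, which is what keeps a two-word entry such as "New York" in a single field; the function name is hypothetical, and the inline sample stands in for the real file.

```python
import csv
import io
from collections import defaultdict


def group_by_category(tsv_text: str) -> dict[str, list[str]]:
    """Group the Word column by Category for tab-separated metadata with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["Category"]].append(row["Word"])
    return dict(groups)


# Small inline sample in the same layout as metadata.tsv.
sample = (
    "Word\tCategory\tPOS\n"
    "Budapest\tCity\t[/N]\n"
    "New York\tCity\t[/N]\n"
    "kutya\tAnimal\t[/N]\n"
)
print(group_by_category(sample))
```

For the real file, pass `Path("metadata.tsv").read_text(encoding="utf-8")` instead of the sample string.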
|
pos_vocab.json
ADDED
|
@@ -0,0 +1,19 @@
{
  "token_to_id": {
    "<PAD>": 0,
    "<UNK>": 1,
    "<MASK>": 2,
    "[/N]": 3,
    "[/V]": 4,
    "[/Prev]": 5,
    "[/Det]": 6,
    "[/Adj]": 7,
    "[Punct]": 8,
    "[/Adv]": 9,
    "[/Pron]": 10,
    "[/Cnj]": 11,
    "[/Post]": 12,
    "[/X]": 13,
    "[/Num]": 14
  }
}
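
`pos_vocab.json` stores only the tag-to-ID direction. Turning predicted IDs back into tags needs the inverse mapping, which can be built in one line. The helper names below are hypothetical, and the fallback to `<UNK>` for unknown tags is an assumption about how the vocabulary is meant to be used; the mapping itself is copied from the file above.

```python
import json

POS_VOCAB_JSON = """
{"token_to_id": {"<PAD>": 0, "<UNK>": 1, "<MASK>": 2, "[/N]": 3, "[/V]": 4,
 "[/Prev]": 5, "[/Det]": 6, "[/Adj]": 7, "[Punct]": 8, "[/Adv]": 9,
 "[/Pron]": 10, "[/Cnj]": 11, "[/Post]": 12, "[/X]": 13, "[/Num]": 14}}
"""

token_to_id = json.loads(POS_VOCAB_JSON)["token_to_id"]
# Invert the mapping so that model outputs (integer IDs) can be read as tags.
id_to_token = {idx: tok for tok, idx in token_to_id.items()}


def encode_tags(tags):
    """Map POS tags to IDs, sending unknown tags to <UNK> (assumed behaviour)."""
    unk = token_to_id["<UNK>"]
    return [token_to_id.get(tag, unk) for tag in tags]


def decode_ids(ids):
    """Map IDs back to POS tags; IDs outside the vocabulary become <UNK>."""
    return [id_to_token.get(i, "<UNK>") for i in ids]


print(encode_tags(["[/N]", "[/V]", "[/Foo]"]))
print(decode_ids([7, 14, 0]))
```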
|
vectors.tsv
ADDED
|
The diff for this file is too large to render.