ItsMaxNorm committed
Commit 9b2a433 · verified · 1 Parent(s): 7ee016a

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +136 -0
  2. merges.json +1 -0
  3. tokenizer_config.json +12 -0
  4. vocab.json +131 -0
README.md ADDED
@@ -0,0 +1,136 @@
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - chess
+ - tokenizer
+ - bpe
+ - game-ai
+ library_name: rustbpe
+ datasets:
+ - angeluriot/chess_games
+ ---
+
+ # Chess BPE Tokenizer
+
+ A Byte Pair Encoding (BPE) tokenizer trained on chess moves in a custom notation format.
+
+ ## Model Details
+
+ - **Tokenizer Type**: BPE (Byte Pair Encoding)
+ - **Vocabulary Size**: 256
+ - **Training Data**: [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
+ - **Training Split**: train[0:1000]
+ - **Move Format**: Custom notation with Unicode chess pieces (e.g., `w.♘g1♘f3..`)
+
+ ## Move Format Description
+
+ The tokenizer is trained on a custom chess move notation:
+
+ | Component | Description | Example |
+ |-----------|-------------|---------|
+ | Player prefix | `w.` (white) or `b.` (black) | `w.` |
+ | Piece + Source | Unicode piece + square | `♘g1` |
+ | Piece + Destination | Unicode piece + square | `♘f3` |
+ | Flags | `..` (quiet move), `.x.` (capture), `..+` (check), `..#` (checkmate) | `..` |
+
+ ### Examples
+
+ | Move | Meaning |
+ |------|---------|
+ | `w.♘g1♘f3..` | White knight from g1 to f3 |
+ | `b.♟c7♟c5..` | Black pawn from c7 to c5 |
+ | `b.♟c5♟d4.x.` | Black pawn captures on d4 |
+ | `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
+ | `b.♛d7♛d5..+` | Black queen to d5 with check |
+
+ ### Chess Piece Symbols
+
+ | White | Black | Piece |
+ |-------|-------|-------|
+ | ♔ | ♚ | King |
+ | ♕ | ♛ | Queen |
+ | ♖ | ♜ | Rook |
+ | ♗ | ♝ | Bishop |
+ | ♘ | ♞ | Knight |
+ | ♙ | ♟ | Pawn |
+
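+ ### Parsing a Move String
+
+ Because every move is `<player>.<piece><square><piece><square><flags>`, it can be unpacked mechanically. Below is a minimal illustrative sketch, not part of this repo (the `parse_move` helper and `SEG` pattern are hypothetical), based only on the tables above:
+
+ ```python
+ import re
+
+ PIECES = "♔♕♖♗♘♙♚♛♜♝♞♟"
+ SEG = re.compile(rf"([{PIECES}])([a-h][1-8])")  # one piece+square segment
+
+ def parse_move(move: str):
+     """Hypothetical helper: split 'w.♘g1♘f3..' into player, segments, flags."""
+     player, rest = move[0], move[2:]   # 'w' or 'b'; skip the '.' separator
+     segments = SEG.findall(rest)       # piece+square pairs (four when castling)
+     flags = SEG.sub("", rest)          # leftover: '..', '.x.', '..+' or '..#'
+     return player, segments, flags
+
+ print(parse_move("w.♘g1♘f3.."))
+ # ('w', [('♘', 'g1'), ('♘', 'f3')], '..')
+ print(parse_move("w.♔e1♔g1♖h1♖f1.."))
+ # ('w', [('♔', 'e1'), ('♔', 'g1'), ('♖', 'h1'), ('♖', 'f1')], '..')
+ ```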
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install rustbpe huggingface_hub
+ ```
+
+ ### Loading and Using the Tokenizer
+
+ ```python
+ import json
+ from huggingface_hub import hf_hub_download
+
+ # Download tokenizer files
+ vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
+ config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")
+
+ # Load vocabulary and config
+ with open(vocab_path, "r", encoding="utf-8") as f:
+     vocab = json.load(f)
+
+ with open(config_path, "r", encoding="utf-8") as f:
+     config = json.load(f)
+
+ print(f"Vocab size: {len(vocab)}")
+ print(f"Pattern: {config['pattern']}")
+ ```
+
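+ With `vocab_size` 256 the tokenizer learned no merges (`merges.json` is `[]`), so every token is a single base character and `vocab.json` works directly as a lookup table. A minimal round-trip sketch continuing from the snippet above (variable names are illustrative; note the shipped `vocab.json` lists the ASCII range plus one replacement entry, so the Unicode piece symbols are not individual keys):
+
+ ```python
+ # Invert the token -> id mapping so ids decode back to text.
+ id_to_token = {i: tok for tok, i in vocab.items()}
+
+ ids = [vocab[ch] for ch in "w.g1f3.."]           # ASCII-only fragment
+ decoded = "".join(id_to_token[i] for i in ids)
+ assert decoded == "w.g1f3.."
+ print(ids)
+ ```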
+ ### Using with rustbpe (for encoding)
+
+ ```python
+ import rustbpe
+
+ # Note: the rustbpe tokenizer must be retrained (or rebuilt from the saved
+ # merges) before it can encode; see the training script and the sketch below.
+ ```
+
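+ A sketch of what retraining could look like, continuing from the loading snippet above. The exact rustbpe calls (`Tokenizer()`, `train_from_iterator`, `encode`) are assumed from the rustbpe README and may differ between versions:
+
+ ```python
+ import rustbpe
+
+ # Assumed API: train a fresh tokenizer over an iterator of game strings,
+ # reusing the split pattern stored in tokenizer_config.json.
+ games = ["w.♘g1♘f3.. b.♟c7♟c5.."]  # replace with the real move stream
+ tokenizer = rustbpe.Tokenizer()
+ tokenizer.train_from_iterator(iter(games), config["vocab_size"], pattern=config["pattern"])
+
+ ids = tokenizer.encode("w.♘g1♘f3..")  # assumed encode() method
+ print(ids)
+ ```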
+ ### Training Your Own
+
+ ```python
+ from bpess.main import train_chess_tokenizer, push_to_hub
+
+ # Train
+ tokenizer = train_chess_tokenizer(
+     vocab_size=4096,
+     dataset_fraction="train",
+     moves_key='moves_custom'
+ )
+
+ # Push to HuggingFace
+ push_to_hub(
+     tokenizer=tokenizer,
+     repo_id="your-username/chess-bpe-tokenizer",
+     config={
+         "vocab_size": 4096,
+         "dataset_fraction": "train",
+         "moves_key": "moves_custom"
+     }
+ )
+ ```
+
+ ## Training Details
+
+ - **Library**: [rustbpe](https://github.com/karpathy/rustbpe) by Andrej Karpathy
+ - **Algorithm**: Byte Pair Encoding with GPT-4 style regex pre-tokenization (see the sketch below)
+ - **Source Dataset**: ~14M chess games from [angeluriot/chess_games](https://huggingface.co/datasets/angeluriot/chess_games)
+
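+ To see how the GPT-4 split pattern from `tokenizer_config.json` chunks a move string before any merges apply, run it through the third-party `regex` package (the stdlib `re` module rejects the pattern's possessive quantifiers). The output below is what the pattern should yield:
+
+ ```python
+ import regex  # pip install regex
+
+ PATTERN = (
+     r"'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}"
+     r"| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"
+ )
+
+ # '♘' is a symbol (not a letter), so it attaches to adjacent punctuation
+ # or letters, while digits always split off into their own chunks.
+ print(regex.findall(PATTERN, "w.♘g1♘f3.."))
+ # ['w', '.♘', 'g', '1', '♘f', '3', '..']
+ ```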
+ ## Intended Use
+
+ This tokenizer is designed for:
+ - Training language models on chess games
+ - Chess move prediction tasks
+ - Game analysis and embedding generation
+
+ ## License
+
+ MIT License
merges.json ADDED
@@ -0,0 +1 @@
+ []
tokenizer_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "tokenizer_type": "BPE",
+   "vocab_size": 256,
+   "pattern": "'(?i:[sdmt]|ll|ve|re)|[^\\r\\n\\p{L}\\p{N}]?+\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]++[\\r\\n]*|\\s*[\\r\\n]|\\s+(?!\\S)|\\s+",
+   "special_tokens": {},
+   "training_config": {
+     "vocab_size": 256,
+     "dataset_fraction": "train[0:1000]",
+     "moves_key": "moves_custom",
+     "separator": " "
+   }
+ }
vocab.json ADDED
@@ -0,0 +1,131 @@
+ {
+   "\u0000": 0,
+   "\u0001": 1,
+   "\u0002": 2,
+   "\u0003": 3,
+   "\u0004": 4,
+   "\u0005": 5,
+   "\u0006": 6,
+   "\u0007": 7,
+   "\b": 8,
+   "\t": 9,
+   "\n": 10,
+   "\u000b": 11,
+   "\f": 12,
+   "\r": 13,
+   "\u000e": 14,
+   "\u000f": 15,
+   "\u0010": 16,
+   "\u0011": 17,
+   "\u0012": 18,
+   "\u0013": 19,
+   "\u0014": 20,
+   "\u0015": 21,
+   "\u0016": 22,
+   "\u0017": 23,
+   "\u0018": 24,
+   "\u0019": 25,
+   "\u001a": 26,
+   "\u001b": 27,
+   "\u001c": 28,
+   "\u001d": 29,
+   "\u001e": 30,
+   "\u001f": 31,
+   " ": 32,
+   "!": 33,
+   "\"": 34,
+   "#": 35,
+   "$": 36,
+   "%": 37,
+   "&": 38,
+   "'": 39,
+   "(": 40,
+   ")": 41,
+   "*": 42,
+   "+": 43,
+   ",": 44,
+   "-": 45,
+   ".": 46,
+   "/": 47,
+   "0": 48,
+   "1": 49,
+   "2": 50,
+   "3": 51,
+   "4": 52,
+   "5": 53,
+   "6": 54,
+   "7": 55,
+   "8": 56,
+   "9": 57,
+   ":": 58,
+   ";": 59,
+   "<": 60,
+   "=": 61,
+   ">": 62,
+   "?": 63,
+   "@": 64,
+   "A": 65,
+   "B": 66,
+   "C": 67,
+   "D": 68,
+   "E": 69,
+   "F": 70,
+   "G": 71,
+   "H": 72,
+   "I": 73,
+   "J": 74,
+   "K": 75,
+   "L": 76,
+   "M": 77,
+   "N": 78,
+   "O": 79,
+   "P": 80,
+   "Q": 81,
+   "R": 82,
+   "S": 83,
+   "T": 84,
+   "U": 85,
+   "V": 86,
+   "W": 87,
+   "X": 88,
+   "Y": 89,
+   "Z": 90,
+   "[": 91,
+   "\\": 92,
+   "]": 93,
+   "^": 94,
+   "_": 95,
+   "`": 96,
+   "a": 97,
+   "b": 98,
+   "c": 99,
+   "d": 100,
+   "e": 101,
+   "f": 102,
+   "g": 103,
+   "h": 104,
+   "i": 105,
+   "j": 106,
+   "k": 107,
+   "l": 108,
+   "m": 109,
+   "n": 110,
+   "o": 111,
+   "p": 112,
+   "q": 113,
+   "r": 114,
+   "s": 115,
+   "t": 116,
+   "u": 117,
+   "v": 118,
+   "w": 119,
+   "x": 120,
+   "y": 121,
+   "z": 122,
+   "{": 123,
+   "|": 124,
+   "}": 125,
+   "~": 126,
+   "\u007f": 127,
+   "�": 255
+ }