Upload binary-tokenizer-005

Files changed (3)

README.md +49 -0
tokenizer.json +0 -0
train.sh +12 -0

README.md ADDED

	@@ -0,0 +1,49 @@

+# Binary Tokenizer 005
+A BPE tokenizer trained on binary executable files for security research and binary analysis tasks.
+## Overview
+This tokenizer uses Byte Pair Encoding (BPE) trained on latin-1 encoded binary data from system executables across multiple Linux distributions (Alpine, Ubuntu, Debian, Rocky Linux) and architectures (ARM64, x86-64).
+- **Vocabulary Size**: 65,536 tokens
+- **Training Data**: System binaries from various OS distributions
+- **Encoding**: Latin-1 (each byte 0-255 maps to a single character)
+## Usage
+```python
+from tokenizers import Tokenizer
+# Load tokenizer
+tokenizer = Tokenizer.from_file("tokenizer.json")
+# Process binary data - MUST use latin-1 encoding
+with open("binary_file", "rb") as f:
+    raw_bytes = f.read()
+    text = raw_bytes.decode('latin-1')  # Convert bytes to latin-1 string
+    encoded = tokenizer.encode(text)
+    tokens = encoded.ids
+```
+## Important: Data Format
+The tokenizer expects binary data encoded as latin-1 strings, NOT hex strings:
+```python
+# CORRECT
+raw_bytes = b'\x7fELF\x01\x01'
+text = raw_bytes.decode('latin-1')  # → '\x7fELF\x01\x01'
+# WRONG - a hex string is tokenized as ASCII digits and spaces, not as the bytes it spells
+hex_str = "7f 45 4c 46 01 01"  # ❌ Do not pass this to tokenizer.encode()
+```
+## Related Projects
+- [mjbommar/glaurung](https://github.com/mjbommar/glaurung) - Binary analysis framework
+- [mjbommar/glaurung-models](https://github.com/mjbommar/glaurung-models) - Binary embedding models and tokenizers
+## License
+For research and educational purposes.
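
The README's data-format rule (one latin-1 character per byte, never a hex string) can be checked without the tokenizer itself. The following is a minimal sketch using only the Python standard library; the helper names `bytes_to_text`, `text_to_bytes`, and `hex_to_text` are hypothetical and are not part of this repository or of the `tokenizers` package.

```python
def bytes_to_text(raw: bytes) -> str:
    """Map each byte 0-255 to the code point of the same value."""
    return raw.decode("latin-1")


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text; raises UnicodeEncodeError above U+00FF."""
    return text.encode("latin-1")


def hex_to_text(hex_str: str) -> str:
    """Turn a hex dump such as '7f 45 4c 46' into the latin-1 string form.

    bytes.fromhex ignores ASCII whitespace between byte pairs (Python 3.7+).
    """
    return bytes.fromhex(hex_str).decode("latin-1")


# Every possible byte value survives the round trip unchanged.
all_bytes = bytes(range(256))
assert text_to_bytes(bytes_to_text(all_bytes)) == all_bytes

# One character per byte, so string length equals byte length.
assert len(bytes_to_text(all_bytes)) == 256

# A hex dump must be converted back to bytes before it is tokenized.
assert hex_to_text("7f 45 4c 46 01 01") == "\x7fELF\x01\x01"
```

If input only exists as a hex dump, converting it with something like `hex_to_text` before calling `tokenizer.encode` should yield the same string as reading the original file in `"rb"` mode and decoding it as latin-1.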

tokenizer.json ADDED

The diff for this file is too large to render.

train.sh ADDED

	@@ -0,0 +1,12 @@

+target/release/bbpe train \
+		    --vocab-size 65536 \
+		    --mode random \
+		    --min-chunk-exp 2 \
+		    --max-chunk-exp 6 \
+		    --entropy-filter \
+		    --progress \
+		    --boundaries \
+		    --pad-pow2 \
+		    --max-token-length 16 \
+		    --sample-rate 0.15 \
+		    /home/ubuntu/src/glaurung-models/binary-sample/binaries/