lucapernice/BERT-Bytecode

genetic-improvement

genetic-programming

lucapernice committed on Sep 27, 2025

Commit 5d4bddd · verified · 1 Parent(s): f3e3f53

Update README.md

Files changed (1)

README.md +3 -3

@@ -56,8 +56,7 @@ decoded = t.decode(pred, skip_special_tokens=True)  # "..." -> space-separated i
 new_bytes = bytes(map(int, decoded.split()))
 ```
-## Dataset
 Data construction summary:
 - Source dataset: bigcode/the-stack-dedup (subset: data/python), loaded in streaming mode.
@@ -66,10 +65,11 @@ Data construction summary:
 - Extract raw bytecode bytes from compiled_code.co_code and convert to a list of integers in [0, 255].
 - Save as JSON Lines: one sample per line, each line a JSON array of integers.
 - Cap: up to 100,000 samples in this release.
--
 Notes: Bytecode format is Python-version dependent (these samples use CPython 3.12, 2-byte instructions). No extra normalization or dedup beyond the source dataset. Any truncation/padding or chunking is handled at training time.
 ### About Source Dataset
 - Source: bigcode/the-stack-dedup
 - Link: https://huggingface.co/datasets/bigcode/the-stack-dedup

 new_bytes = bytes(map(int, decoded.split()))
 ```
+##  Dataset
 Data construction summary:
 - Source dataset: bigcode/the-stack-dedup (subset: data/python), loaded in streaming mode.
 - Extract raw bytecode bytes from compiled_code.co_code and convert to a list of integers in [0, 255].
 - Save as JSON Lines: one sample per line, each line a JSON array of integers.
 - Cap: up to 100,000 samples in this release.
 Notes: Bytecode format is Python-version dependent (these samples use CPython 3.12, 2-byte instructions). No extra normalization or dedup beyond the source dataset. Any truncation/padding or chunking is handled at training time.
 ### About Source Dataset
 - Source: bigcode/the-stack-dedup
 - Link: https://huggingface.co/datasets/bigcode/the-stack-dedup
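
The data construction summary in the README can be illustrated with a small sketch. This is not the author's actual script: the function names (`source_to_bytecode_ints`, `iter_samples`, `to_jsonl_line`, `write_jsonl`), the `"<sample>"` filename, and the choice to skip files that fail to compile are all assumptions made for illustration. Only `compile(...).co_code`, the integer range [0, 255], the JSON Lines layout, and the 100,000-sample cap come from the README text.

```python
import json
from typing import Iterable, Iterator, List, Optional

MAX_SAMPLES = 100_000  # cap stated in the README


def source_to_bytecode_ints(source: str) -> Optional[List[int]]:
    """Compile Python source and return the module-level co_code as ints in [0, 255].

    Returns None when the source does not compile (assumed policy: skip it).
    """
    try:
        compiled_code = compile(source, "<sample>", "exec")
    except (SyntaxError, ValueError):
        return None
    # co_code is a bytes object; iterating over bytes yields integers 0..255.
    return list(compiled_code.co_code)


def iter_samples(sources: Iterable[str], limit: int = MAX_SAMPLES) -> Iterator[List[int]]:
    """Yield at most `limit` bytecode samples from an iterable of source strings."""
    count = 0
    for source in sources:
        if count >= limit:
            break
        ints = source_to_bytecode_ints(source)
        if ints is not None:
            yield ints
            count += 1


def to_jsonl_line(sample: List[int]) -> str:
    """Serialize one sample as a single JSON array followed by a newline."""
    return json.dumps(sample) + "\n"


def write_jsonl(samples: Iterable[List[int]], path: str) -> None:
    """Write samples as JSON Lines: one JSON array of integers per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for sample in samples:
            fh.write(to_jsonl_line(sample))
```

In a real run, `sources` would presumably be the file contents streamed from bigcode/the-stack-dedup (subset data/python) rather than an in-memory list. Note that `co_code` on a module code object covers only the top-level instructions; bodies of nested functions and classes live in separate code objects under `co_consts`, and whether the original pipeline descends into them is not stated in the visible part of the diff. The resulting bytes also depend on the interpreter version, so reproducing the README's samples would require CPython 3.12.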