Update README.md
README.md
CHANGED
@@ -56,8 +56,7 @@ decoded = t.decode(pred, skip_special_tokens=True) # "..." -> space-separated i
 new_bytes = bytes(map(int, decoded.split()))
 ```
 
-##
-
+## Dataset
 
 Data construction summary:
 - Source dataset: bigcode/the-stack-dedup (subset: data/python), loaded in streaming mode.
@@ -66,10 +65,11 @@ Data construction summary:
 - Extract raw bytecode bytes from compiled_code.co_code and convert to a list of integers in [0, 255].
 - Save as JSON Lines: one sample per line, each line a JSON array of integers.
 - Cap: up to 100,000 samples in this release.
-
+
 Notes: Bytecode format is Python-version dependent (these samples use CPython 3.12, 2-byte instructions). No extra normalization or dedup beyond the source dataset. Any truncation/padding or chunking is handled at training time.
 
 ### About Source Dataset
+
 - Source: bigcode/the-stack-dedup
 - Link: https://huggingface.co/datasets/bigcode/the-stack-dedup
 
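For readers of this change, the data construction steps listed in the README can be sketched roughly as follows. This is a minimal illustration, not the dataset author's script: the helper names (`source_to_bytecode_ints`, `to_jsonl_line`, `from_space_separated`) and the sample source string are hypothetical. Using the built-in `compile()` in `"exec"` mode is also an assumption, since the README lines between the two hunks are not shown here.

```python
import json


def source_to_bytecode_ints(source: str, filename: str = "<sample>") -> list[int]:
    """Compile Python source and return co_code as integers in [0, 255].

    Assumption: the sample is compiled as a module ("exec" mode).
    """
    compiled_code = compile(source, filename, "exec")
    # Iterating over a bytes object yields integers in [0, 255].
    return list(compiled_code.co_code)


def to_jsonl_line(values: list[int]) -> str:
    """Serialize one sample as a single JSON Lines record (a JSON array)."""
    return json.dumps(values)


def from_space_separated(text: str) -> bytes:
    """Inverse step, mirroring `bytes(map(int, decoded.split()))` in the README."""
    return bytes(map(int, text.split()))


# Hypothetical sample source, used only for illustration.
sample_source = "x = 1\nprint(x + 2)\n"

ints = source_to_bytecode_ints(sample_source)
line = to_jsonl_line(ints)

# Round trip: JSON array -> space-separated integers -> raw bytes.
space_separated = " ".join(map(str, json.loads(line)))
restored = from_space_separated(space_separated)
```

The exact integers depend on the interpreter version, which is why the README pins CPython 3.12. Note also that a module's `co_code` covers only top-level instructions; the bodies of nested functions live in separate code objects under `co_consts`, and the README does not say whether those are extracted as well.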