lucapernice committed on
Commit 5d4bddd · verified · 1 Parent(s): f3e3f53

Update README.md

  1. README.md +3 -3
README.md CHANGED
@@ -56,8 +56,7 @@ decoded = t.decode(pred, skip_special_tokens=True) # "..." -> space-separated i
  new_bytes = bytes(map(int, decoded.split()))
  ```
 
- ## Dataset
-
+ ## Dataset
 
  Data construction summary:
  - Source dataset: bigcode/the-stack-dedup (subset: data/python), loaded in streaming mode.
@@ -66,10 +65,11 @@ Data construction summary:
  - Extract raw bytecode bytes from compiled_code.co_code and convert to a list of integers in [0, 255].
  - Save as JSON Lines: one sample per line, each line a JSON array of integers.
  - Cap: up to 100,000 samples in this release.
- -
+
  Notes: Bytecode format is Python-version dependent (these samples use CPython 3.12, 2-byte instructions). No extra normalization or dedup beyond the source dataset. Any truncation/padding or chunking is handled at training time.
 
  ### About Source Dataset
+
  - Source: bigcode/the-stack-dedup
  - Link: https://huggingface.co/datasets/bigcode/the-stack-dedup
 
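For reviewers, the data construction steps described in the README can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the helper names `source_to_sample` and `to_jsonl_line` are hypothetical, streaming from bigcode/the-stack-dedup is omitted, and the sketch assumes only the top-level code object's `co_code` is kept (the README does not say how nested code objects are handled). The exact byte values depend on the interpreter version used to run it.

```python
import json


def source_to_sample(source, filename="<sample>"):
    """Compile Python source and return the top-level bytecode as ints in [0, 255].

    Hypothetical helper. Returns None when the source does not compile.
    """
    try:
        compiled_code = compile(source, filename, "exec")
    except (SyntaxError, ValueError):
        return None
    # co_code is a bytes object; list() yields one integer per byte.
    return list(compiled_code.co_code)


def to_jsonl_line(sample):
    """Serialize one sample as a single JSON Lines record (a JSON array of ints)."""
    return json.dumps(sample)


source = "x = 1\nprint(x)\n"
sample = source_to_sample(source)
line = to_jsonl_line(sample)

# Round trip from the JSONL record back to raw bytes.
restored = bytes(json.loads(line))

# Same conversion as the README snippet: space-separated integers -> bytes.
decoded = " ".join(map(str, sample))
new_bytes = bytes(map(int, decoded.split()))
```

Because CPython uses 2-byte instructions, each sample has an even length, and both `restored` and `new_bytes` equal the original `co_code` when produced and read under the same interpreter version.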