Bremin committed on
Commit
2cc06ce
·
verified ·
1 Parent(s): 38a0a5c

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +48 -0
README.md ADDED
@@ -0,0 +1,48 @@
+ ---
+ license: apache-2.0
+ tags:
+ - code
+ - multi-language
+ - pretraining-data
+ ---
+
+ # code-graph-v4
+
+ Packaged git clones, with full git history, for the graphjepa / code-transformer project.
+
+ ## Contents
+
+ - clones_c_full.tar.gz
+ - clones_cpp_full.tar.gz
+ - clones_csharp_full.tar.gz
+ - clones_java_full.tar.gz
+ - clones_javascript_full.tar.gz
+ - clones_python_full.tar.gz
+ - clones_typescript_full.tar.gz
+
+ Each tarball contains `{language}/{repo_id}/...`. Extract it anywhere, then
+ point the parser at the extracted directory.
+
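The `{language}/{repo_id}/...` layout described above can be walked with a few lines of Python. This is only a sketch: the helper name `list_repos` is hypothetical and not part of the project, and it assumes every directory directly under the extraction root is a language folder.

```python
from pathlib import Path


def list_repos(root):
    """Return sorted (language, repo_id) pairs found under an extraction root.

    Hypothetical helper: assumes the layout is <root>/<language>/<repo_id>/...
    and ignores any plain files sitting next to the directories.
    """
    root = Path(root)
    pairs = []
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for repo_dir in sorted(p for p in lang_dir.iterdir() if p.is_dir()):
            pairs.append((lang_dir.name, repo_dir.name))
    return pairs
```

Calling `list_repos("./data_multilang")` after extraction would give one entry per cloned repository, which can then be handed to the parser one at a time.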
+ ## On the receiving (big) machine
+
+ ```python
+ import os
+ import tarfile
+
+ from huggingface_hub import hf_hub_download
+
+ # Download one tarball from the Hub into the current directory.
+ path = hf_hub_download(
+     repo_id="IDMedicine/code-graph-v4",
+     filename="clones_python_full.tar.gz",
+     repo_type="model",
+     local_dir=".",
+ )
+
+ # Python equivalent of `tar -xzf $path -C ./data_multilang/`.
+ os.makedirs("./data_multilang", exist_ok=True)
+ with tarfile.open(path, "r:gz") as tar:
+     tar.extractall("./data_multilang")
+
+ # Then process each repo with build_bundle.py (needs include_git=True for
+ # temporal processing; or single-snapshot parsing if code-only).
+ ```
+
+ ## Limitations
+
+ - If packaged without `.git` (the `_code` variants), **no temporal
+   processing is possible** downstream; only single-snapshot SSL remains.
+ - If packaged with `.git` (the `_full` variants), tarballs are larger,
+   but the full commit history is preserved for `build_bundle.py`.
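
The distinction between the two variants can be checked per repository before processing. The sketch below is an assumption-laden illustration: `has_git_history`, `choose_mode`, and the mode names `"temporal"` and `"snapshot"` are hypothetical, and how `build_bundle.py` actually receives `include_git` is not shown in this README.

```python
from pathlib import Path


def has_git_history(repo_dir):
    """Return True if repo_dir contains a .git entry (directory or file)."""
    return (Path(repo_dir) / ".git").exists()


def choose_mode(repo_dir):
    """Pick a processing mode for one extracted repository.

    Hypothetical mapping: "temporal" when git history is present
    (a `_full` variant), otherwise "snapshot" (a `_code` variant).
    """
    return "temporal" if has_git_history(repo_dir) else "snapshot"
```

A caller could use the result to decide whether to request `include_git=True`, falling back to single-snapshot parsing when no `.git` entry is found.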