Full-text search
1,000+ results
ll922 / RedPajama-Data-1T-Sample-Backup
README.md
dataset
7 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:100K<n<1M, format:parquet, modality:text, library:datasets, library:dask, library:polars, library:mlcroissant, region:us, redpajama, language-modeling, backup, parquet
20
# RedPajama Data 1T Sample Backup
21
22
This dataset is a backup mirror of `togethercomputer/RedPajama-Data-1T-Sample`.
⋯
34
"togethercomputer/RedPajama-Data-1T-Sample",
⋯
46
"ll922/RedPajama-Data-1T-Sample-Backup",
⋯
57
Original dataset: `togethercomputer/RedPajama-Data-1T-Sample`
ethzanalytics / RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g
README.md
model
6 matches
tags: transformers, gpt_neox, text-generation, auto-gptq, license:apache-2.0, region:us
10
# redpajama gptq: RedPajama-INCITE-Chat-3B-v1
⋯
16
A GPTQ quantization of the [RedPajama-INCITE-Chat-3B-v1](https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3...
⋯
39
model_repo = Path.cwd() / "RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g"
40
device = "cuda:0" if torch.cuda.is_available() else "cpu"
theblackcat102 / redpajama-3b-evol-coder
README.md
model
2 matches
datajuicer / redpajama-arxiv-refined-by-data-juicer
README.md
dataset
4 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama -- ArXiv (refined by Data-Juicer)
15
16
A refined version of ArXiv dataset in RedPajama by [Data-Juicer](https://github.com/alibaba/data-juicer). Removing some ...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-arxiv-refine-result.jsonl) (About 85GB).
datajuicer / redpajama-cc-2023-06-refined-by-data-juicer
README.md
dataset
5 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama -- CommonCrawl-2023-06 (refined by Data-Juicer)
15
16
A refined version of CommonCrawl-2023-06 dataset in RedPajama by [Data-Juicer](https://github.com/alibaba/data-juicer). ...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2023-06-refine-result.js...
datajuicer / redpajama-cc-2022-05-refined-by-data-juicer
README.md
dataset
5 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
15
16
A refined version of CommonCrawl-2022-05 dataset in RedPajama by [Data-Juicer](https://github.com/alibaba/data-juicer). ...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2022-05-refine-result.js...
datajuicer / redpajama-stack-code-refined-by-data-juicer
README.md
dataset
5 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama & TheStack -- Github Code (refined by Data-Juicer)
15
16
A refined version of Github Code dataset in RedPajama & TheStack by [Data-Juicer](https://github.com/alibaba/data-juicer...
⋯
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-stack-code-refine-result.jsonl) (About 232GB).
⋯
28
### RedPajama code refinement
datajuicer / redpajama-pile-stackexchange-refined-by-data-juicer
README.md
dataset
4 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama & The Pile -- StackExchange (refined by Data-Juicer)
15
16
A refined version of StackExchange dataset in RedPajama & The Pile by [Data-Juicer](https://github.com/alibaba/data-juic...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-pile-stackexchange-refine-result.jsonl) (About 71GB).
datajuicer / redpajama-wiki-refined-by-data-juicer
README.md
dataset
4 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama -- Wikipedia (refined by Data-Juicer)
15
16
A refined version of Wikipedia dataset in RedPajama by [Data-Juicer](https://github.com/alibaba/data-juicer). Removing s...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-wiki-refine-result.jsonl) (About 68GB).
datajuicer / redpajama-cc-2021-04-refined-by-data-juicer
README.md
dataset
5 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
15
16
A refined version of CommonCrawl-2021-04 dataset in RedPajama by [Data-Juicer](https://github.com/alibaba/data-juicer). ...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2021-04-refine-result.js...
datajuicer / redpajama-cc-2020-05-refined-by-data-juicer
README.md
dataset
5 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama -- CommonCrawl-2020-05 (refined by Data-Juicer)
15
16
A refined version of CommonCrawl-2020-05 dataset in RedPajama by [Data-Juicer](https://github.com/alibaba/data-juicer). ...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2020-05-refine-result.js...
datajuicer / redpajama-cc-2019-30-refined-by-data-juicer
README.md
dataset
5 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama -- CommonCrawl-2019-30 (refined by Data-Juicer)
15
16
A refined version of CommonCrawl-2019-30 dataset in RedPajama by [Data-Juicer](https://github.com/alibaba/data-juicer). ...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-cc-refine-results/redpajama-cc-2019-30-refine-result.js...
datajuicer / redpajama-book-refined-by-data-juicer
README.md
dataset
4 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:100K<n<1M, region:us, data-juicer, pretraining
14
# RedPajama -- Book (refined by Data-Juicer)
15
16
A refined version of Book dataset in RedPajama by [Data-Juicer](https://github.com/alibaba/data-juicer). Removing some "...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-book-refine-result.jsonl) (About 91GB).
datajuicer / redpajama-c4-refined-by-data-juicer
README.md
dataset
4 matches
tags: task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
14
# RedPajama -- C4 (refined by Data-Juicer)
15
16
A refined version of C4 dataset in RedPajama by [Data-Juicer](https://github.com/alibaba/data-juicer). Removing some "ba...
20
...aliyuncs.com/LLM_data/our_refined_datasets/pretraining/redpajama-c4-refine-result.jsonl) (About 832GB).
latam-gpt / red_pajama_es_hq
README.md
dataset
7 matches
tags: language:es, size_categories:100M<n<1B, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2406.17557, region:us
27
# RedPajama's High Quality Spanish subset
⋯
31
...lity dataset distilled from the Spanish subsection of [RedPajama-Data-v2](https://github.com/togethercomputer/RedPajama-...
⋯
38
ds = load_dataset("latam-gpt/red_pajama_es_hq")
⋯
48
ds = load_dataset("latam-gpt/red_pajama_es_hq")
⋯
72
The text documents of the source database (RedPajama-Data-v2) were collected using 84 CommonCrawl snapshots, processed u...
canho / RedPajamas_EN_Phase1
README.md
dataset
7 matches
tags: task_categories:text-generation, language:en, size_categories:1K<n<10K, format:parquet, modality:text, library:datasets, library:pandas, library:polars, library:mlcroissant, region:us, redpajama, information-extraction, atomic-facts, negation, unanswerable
16
# RedPajamas EN Phase1
17
18
This dataset contains Phase 1 logic and information-extraction annotations for English RedPajama text chunks.
⋯
27
- Source family: `togethercomputer/RedPajama-Data-V2` / local RedPajama English sample
⋯
31
- Destination repo: `canho/RedPajamas_EN_Phase1`
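
The `tags:` lines in the results above mix bare tags (for example `redpajama`, `pretraining`) with `key:value` pairs (for example `license:apache-2.0`, `library:datasets`). A minimal sketch of grouping such a line by its prefix follows, assuming the comma-separated layout shown in this listing. The function name `parse_tags` and the `"tag"` bucket used for bare tags are hypothetical names chosen for illustration; they are not part of any Hugging Face API.

```python
from collections import defaultdict


def parse_tags(line: str) -> dict[str, list[str]]:
    """Group a 'tags: a, key:value, ...' line by prefix.

    Bare tags (no colon) are collected under the hypothetical key 'tag'.
    """
    body = line.removeprefix("tags:").strip()
    grouped: dict[str, list[str]] = defaultdict(list)
    for raw in body.split(","):
        tag = raw.strip()
        if not tag:
            continue
        # Split on the first colon only, so values keep any later colons.
        key, sep, value = tag.partition(":")
        if sep:
            grouped[key].append(value)
        else:
            grouped["tag"].append(tag)
    return dict(grouped)


# Tag line copied from one of the model results in this listing.
example = (
    "tags: transformers, gpt_neox, text-generation, auto-gptq, "
    "license:apache-2.0, region:us"
)
parsed = parse_tags(example)
print(parsed["license"])
print(parsed["tag"])
```

Grouping this way makes it straightforward to filter results, for instance by keeping only entries whose `license` list contains `apache-2.0`.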