Full-text search
1,000+ results

latam-gpt / red_pajama_es_hq
README.md
dataset
7 matches
tags:
language:es, size_categories:100M<n<1B, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2406.17557, region:us
# RedPajama's High Quality Spanish subset
## What is this?
The following is a high-quality dataset distilled from the Spanish subset of [RedPajama-Data-v2](https://github.com/togethercomputer/RedPajama-Data), created using the methodology proposed in [FineWeb-Edu](https://arxiv.org/abs/2406.17557).
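For a quick look at the data, the subset can be streamed with the `datasets` library. The snippet below is a minimal sketch that assumes the default configuration and a `train` split; adjust both if the repo is organized differently.

```python
from datasets import load_dataset

# Stream the Spanish high-quality subset without downloading every parquet shard.
# The default config and the "train" split are assumptions; check the dataset
# card if either differs.
ds = load_dataset("latam-gpt/red_pajama_es_hq", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example)  # inspect column names and a sample document
    if i >= 2:
        break
```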

ethzanalytics / RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g
README.md
model
6 matches
tags:
transformers, gpt_neox, text-generation, auto-gptq, license:apache-2.0, autotrain_compatible, region:us
# RedPajama GPTQ: RedPajama-INCITE-Chat-3B-v1
<a href="https://colab.research.google.com/gist/pszemraj/86d2e8485df182302646ed2c5a637059/inference-with-redpajama-incite-chat-3b-v1-gptq-4bit-128g.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
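
Besides the Colab notebook, the quantized checkpoint can be loaded locally with `auto-gptq`. The snippet below is a rough sketch, assuming the weights are stored as safetensors and that the standard RedPajama-INCITE `<human>:`/`<bot>:` chat format applies.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "ethzanalytics/RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g"

# Load the tokenizer and the 4-bit, 128-group GPTQ weights onto the first GPU.
# use_safetensors=True is an assumption about how the weights are stored.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0", use_safetensors=True)

# RedPajama-INCITE chat models expect the <human>:/<bot>: turn format.
prompt = "<human>: What is RedPajama?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```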

theblackcat102 / redpajama-3b-evol-coder
README.md
model
2 matches
datajuicer / redpajama-cc-2020-05-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2020-05 (refined by Data-Juicer)
A refined version of the CommonCrawl-2020-05 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
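As a concrete starting point, the refined data can be pulled with the `datasets` library like any other JSON-formatted dataset on the Hub. The sketch below assumes a `train` split and a `text` column; the same pattern applies to the sibling Data-Juicer subsets listed below by swapping the repo id.

```python
from datasets import load_dataset

# Load the Data-Juicer-refined CommonCrawl-2020-05 sample. The "train" split
# and the "text" column are assumptions; check the dataset card if they differ.
ds = load_dataset("datajuicer/redpajama-cc-2020-05-refined-by-data-juicer", split="train")

print(ds)                    # number of rows and column names
print(ds[0]["text"][:500])   # preview the first refined document
```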
datajuicer / redpajama-cc-2019-30-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2019-30 (refined by Data-Juicer)
A refined version of the CommonCrawl-2019-30 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-book-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:100K<n<1M, region:us, data-juicer, pretraining
# RedPajama -- Book (refined by Data-Juicer)
A refined version of the Book dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-c4-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- C4 (refined by Data-Juicer)
A refined version of the C4 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-arxiv-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- ArXiv (refined by Data-Juicer)
A refined version of the ArXiv dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-cc-2023-06-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2023-06 (refined by Data-Juicer)
A refined version of the CommonCrawl-2023-06 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-cc-2022-05-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
A refined version of the CommonCrawl-2022-05 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-stack-code-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama & The Stack -- GitHub Code (refined by Data-Juicer)
A refined version of the GitHub Code dataset in RedPajama and The Stack, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-pile-stackexchange-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama & The Pile -- StackExchange (refined by Data-Juicer)
A refined version of the StackExchange dataset in RedPajama and The Pile, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original merged dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-wiki-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- Wikipedia (refined by Data-Juicer)
A refined version of the Wikipedia dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-cc-2021-04-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
A refined version of the CommonCrawl-2021-04 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.

pcuenq / RedPajama-3B-instruct-lora
README.md
model
7 matches
tags:
peft, lora, alpaca, redpajama, dataset:johnrobinsn/alpaca-cleaned, base_model:togethercomputer/RedPajama-INCITE-Base-3B-v1, base_model:adapter:togethercomputer/RedPajama-INCITE-Base-3B-v1, license:apache-2.0, region:us
# RedPajama-3B-instruct-lora
This is an instruction fine-tuned version of https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1, trained using `int8` mixed training.
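Loading the adapter on top of the base checkpoint is straightforward with `peft`. The snippet below is a minimal sketch; the fp16 loading and the Alpaca-style prompt are assumptions, not necessarily what the model card prescribes.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "togethercomputer/RedPajama-INCITE-Base-3B-v1"
adapter_id = "pcuenq/RedPajama-3B-instruct-lora"

# Load the base model (fp16 here; pass load_in_8bit=True instead to mirror the
# int8 setup used during training) and attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Alpaca-style prompt, assumed from the alpaca-cleaned training data.
prompt = "### Instruction:\nExplain what RedPajama is.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```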
## Training dataset

unionai / RedPajama-INCITE-Base-3B-v1-wikipedia
README.md
model
3 matches

unionai / RedPajama-INCITE-Base-3B-v1-wikipedia-8bit
README.md
model
3 matches