Full-text search
1,000+ results

latam-gpt / red_pajama_es_hq
README.md
dataset
7 matches
tags:
language:es, size_categories:100M<n<1B, format:parquet, modality:tabular, modality:text, library:datasets, library:dask, library:mlcroissant, library:polars, arxiv:2406.17557, region:us
# RedPajama's High Quality Spanish subset
## What is this?
The following is a high-quality dataset distilled from the Spanish subset of [RedPajama-Data-v2](https://github.com/togethercomputer/RedPajama-Data), created using the methodology proposed in [FineWeb-Edu](https://arxiv.org/abs/2406.17557).
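For a quick look at the data, the subset can be streamed with the `datasets` library. The snippet below is a minimal sketch that assumes the default configuration and a `train` split; adjust both if the repo is organized differently.

```python
from datasets import load_dataset

# Stream the Spanish high-quality subset without downloading every parquet shard.
# The default config and the "train" split are assumptions; check the dataset
# card if either differs.
ds = load_dataset("latam-gpt/red_pajama_es_hq", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example)  # inspect column names and a sample document
    if i >= 2:
        break
```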

ethzanalytics / RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g
README.md
model
6 matches
tags:
transformers, gpt_neox, text-generation, auto-gptq, license:apache-2.0, autotrain_compatible, region:us
# RedPajama GPTQ: RedPajama-INCITE-Chat-3B-v1
<a href="https://colab.research.google.com/gist/pszemraj/86d2e8485df182302646ed2c5a637059/inference-with-redpajama-incite-chat-3b-v1-gptq-4bit-128g.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
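
Besides the Colab notebook, the quantized checkpoint can be loaded locally with `auto-gptq`. The snippet below is a rough sketch, assuming the weights are stored as safetensors and that the standard RedPajama-INCITE `<human>:`/`<bot>:` chat format applies.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

repo = "ethzanalytics/RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g"

# Load the tokenizer and the 4-bit, 128-group GPTQ weights onto the first GPU.
# use_safetensors=True is an assumption about how the weights are stored.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoGPTQForCausalLM.from_quantized(repo, device="cuda:0", use_safetensors=True)

# RedPajama-INCITE chat models expect the <human>:/<bot>: turn format.
prompt = "<human>: What is RedPajama?\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```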

theblackcat102 / redpajama-3b-evol-coder
README.md
model
2 matches
datajuicer / redpajama-cc-2020-05-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2020-05 (refined by Data-Juicer)
A refined version of the CommonCrawl-2020-05 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
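As a concrete starting point, the refined data can be pulled with the `datasets` library like any other JSON-formatted dataset on the Hub. The sketch below assumes a `train` split and a `text` column; the same pattern applies to the sibling Data-Juicer subsets listed below by swapping the repo id.

```python
from datasets import load_dataset

# Load the Data-Juicer-refined CommonCrawl-2020-05 sample. The "train" split
# and the "text" column are assumptions; check the dataset card if they differ.
ds = load_dataset("datajuicer/redpajama-cc-2020-05-refined-by-data-juicer", split="train")

print(ds)                    # number of rows and column names
print(ds[0]["text"][:500])   # preview the first refined document
```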
datajuicer / redpajama-cc-2019-30-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2019-30 (refined by Data-Juicer)
A refined version of the CommonCrawl-2019-30 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-book-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:100K<n<1M, region:us, data-juicer, pretraining
# RedPajama -- Book (refined by Data-Juicer)
A refined version of the Book dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-c4-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- C4 (refined by Data-Juicer)
A refined version of the C4 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-arxiv-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- ArXiv (refined by Data-Juicer)
A refined version of the ArXiv dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-cc-2023-06-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2023-06 (refined by Data-Juicer)
A refined version of the CommonCrawl-2023-06 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-cc-2022-05-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
A refined version of the CommonCrawl-2022-05 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-stack-code-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama & The Stack -- GitHub Code (refined by Data-Juicer)
A refined version of the GitHub Code dataset in RedPajama and The Stack, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-pile-stackexchange-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama & The Pile -- StackExchange (refined by Data-Juicer)
A refined version of the StackExchange dataset in RedPajama and The Pile, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original merged dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-wiki-refined-by-data-juicer
README.md
dataset
4 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- Wikipedia (refined by Data-Juicer)
A refined version of the Wikipedia dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.
datajuicer / redpajama-cc-2021-04-refined-by-data-juicer
README.md
dataset
5 matches
tags:
task_categories:text-generation, language:en, license:apache-2.0, size_categories:n<1K, format:json, modality:tabular, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, region:us, data-juicer, pretraining
# RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
A refined version of the CommonCrawl-2021-04 dataset in RedPajama, produced by [Data-Juicer](https://github.com/alibaba/data-juicer). Some low-quality ("bad") samples were removed from the original dataset to improve its quality.
This dataset is typically used to pretrain large language models.

pcuenq / RedPajama-3B-instruct-lora
README.md
model
7 matches
tags:
peft, lora, alpaca, redpajama, dataset:johnrobinsn/alpaca-cleaned, base_model:togethercomputer/RedPajama-INCITE-Base-3B-v1, base_model:adapter:togethercomputer/RedPajama-INCITE-Base-3B-v1, license:apache-2.0, region:us
# RedPajama-3B-instruct-lora
This is an instruction fine-tuned version of https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-3B-v1, trained using `int8` mixed training.
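Loading the adapter on top of the base checkpoint is straightforward with `peft`. The snippet below is a minimal sketch; the fp16 loading and the Alpaca-style prompt are assumptions, not necessarily what the model card prescribes.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "togethercomputer/RedPajama-INCITE-Base-3B-v1"
adapter_id = "pcuenq/RedPajama-3B-instruct-lora"

# Load the base model (fp16 here; pass load_in_8bit=True instead to mirror the
# int8 setup used during training) and attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Alpaca-style prompt, assumed from the alpaca-cleaned training data.
prompt = "### Instruction:\nExplain what RedPajama is.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```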
## Training dataset

unionai / RedPajama-INCITE-Base-3B-v1-wikipedia
README.md
model
3 matches

unionai / RedPajama-INCITE-Base-3B-v1-wikipedia-8bit
README.md
model
3 matches