Tasks
Datasets
wikipedia
common_voice
dcep europarl jrc-acquis
squad
bookcorpus
c4
CLUECorpusSmall
parsinlu
oscar
squad_v2
cnn_dailymail
imagenet
conll2003
librispeech_asr
xsum
PropBank.Br
jrc-acquis
OSIAN
1.5B Arabic Corpus
gigaword
natural_questions
imagenet-21k
ontonotes
multi_nli
OSCAR Arabic Unshuffled
brWaC
wikisql
mustc
openslr
CoNLL-2012
lince
Indo4B
covost2
snli
OPUS
code_search_net
wmt19
sep_clean
mnli
xnli
blended_skill_talk
OpenLegalData
wtq
twitter
tab_fact
msr_sqa
librispeech
enh_single
ai-soco
mc4
common_crawl
xtreme
fever
trivia_qa
samsum
race
imdb
arabic_billion_words
open_subtitles
Libri1Mix
sep_noisy
openwebtext
flaubert
DAGW
emotion
cc100
piaf
PubMed
quoref
docred
gap
winograd_wsc
winogender
glue
ms_marco
OpenSLR
ag_news
SAIL 2017
biomedical literature from Scielo and Pubmed
squad2
openbookqa
w11wo/imdb-javanese
Libri2Mix
Libri3Mix
web_questions
wiki_dpr
cc_news
FQuAD
SQuAD-FR
id_liputan6
reddit singapore, malaysia
hardwarezone
opus100
jsut
wmt16
ComVE
interspeech_2021_asr
voxceleb
dihard
wham
Universal Dependencies
commonsenseqa
arc
qqp
the Pile
id_newspapers_2018
BFD
STSbenchmark
dindebat.dk
hestenettet.dk
danish OpenSubtitles
AI4Bharat IndicNLP Corpora
anli
mlqa
MIMIC-III
Wikipedia
go_emotions
tydiqa
pubmed
fquad
common_gen
arabic_speech_corpus
scientific_papers
NST Swedish ASR Database
NQ
Trivia
SQuAD
MLQA
DRCD
arcd
MS MARCO document ranking
Indonesian Wikipedia
wikipedia-turkish
germeval_14
wer
TQUAD
wmt14
mulit_nli
MNLI
mlsum - es
timit_asr
muchocine
indosum
EMBO/sd-panels
CSS10
https://arabicspeech.org/
NSC2018
sts
scancode-rules
trec
Twitter
IndianPolitics
conll2000
ljspeech
break_data
sst-2
Shuffled Dutch section of the OSCAR corpus (https://oscar-corpus.com/)
msmarco
SciDocs
yahoo-answers
Uniref100
Marefa-NER
squad_v1
nadi
220M words (IndoWiki, IndoWC, News)
parlament_parla
multi_nli_mismatch
CommonCrawl
triviaqa
coqa
eli5
mlsum
Wikihow
Jean-Baptiste/wikiner_fr
Arabic poetry from several eras
Yves/fhnw_swiss_parliament
discofuse
Interspeech 2021
xquad
xsum_nl
UniRef50
webqa
dureader
tweets_hate_speech_detection
susumu2357/squad_v2_sv
bioASQ
sail
yelp_polarity
100GB Chinese corpus
HARD-Arabic-Dataset
quora
emo
movies
cord19
vivos
arxiv_dataset
custom-book-corpus
sms_spam
wiki-mk
time-mk-news-2010-2015
quartz
Arabic Poetry Dataset (6th - 21st century)
Spotify Podcasts Dataset
legal entity recognition
created a new dataset based on https://www.openslr.org/92/
MLSUM
ai2_arc
openslr_hindi
Indic TTS Malayalam Speech Corpus
Openslr Malayalam Speech Corpus
SMC Malayalam Speech Corpus
IIIT-H Indic Speech Databases
Wikipedia (Hindi, Sanskrit, Gujarati)
shemo
cifar10
common_voice, infore_25h
Arabic Wikipedia
google_wellformed_query
RuSentiment
Squad
XQuad
Tydiqa
squad_v1_pt
masakhaner
mgb5
quotes-500K
LJSpeech
LibriTTS
indic tts
iiith
socian
bangla-sentiment-benchmark
EMBO/sd-nlp
common_voice mn
marefa-mt
L3CubeMahaSent
XSUM
Gigaword
ALFFA,Gamayun & IWSLT
bible_para
JW300 + [Menyo-20k](https://huggingface.co/datasets/menyo20k_mt)
google
codexglue
Oscar Corpus, News, Stories
BembaSpeech
Icelandic portion of the OSCAR corpus from INRIA
https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation
event2Mind
EMBO/biolang
augmented_codesearchnet
pytorrent
JW300
iapp-wiki-qa-dataset
XQuAD
Finnish parliament session 2
CC100
kazakh_speech_corpus
custom danish dataset
coco
RuTweetCorp
RuReviews
fon_dataset
https://github.com/staeiou/arxiv_archive/tree/v1.0.1
Tesserae
Phi5
Thomas Aquinas
algebra_linear_1d
algebra_linear_1d_composed
measurement_time
numbers_gcd
kowiki
news
OpenSLR 77
DaNE
legal
DAMP-VSEP
tatoeba
setimes
csmsc
SUC 3.0
sqa
libri1mix
qasc
quarel
CC-aligned
TACDataset
ami
voxconverse
Farasa
urdu-text-news
Voicebank
DEMAND
WHAM!
WHAMR!
WSJ0-2Mix
WSJ0-3Mix
Timers and Such
wikimovies
imagenet_21k
NST Estonian ASR Database
Languages
en
es
fr
sv
de
fi
multilingual
zh
ru
ar
fa
it
id
pt
tr
nl
uk
eo
ja
pl
da
Chinese
bg
ro
he
el
hi
ca
et
no
cs
lt
af
vi
hu
sl
is
ms
ko
ht
hr
mr
tl
bn
mt
lv
gl
gu
mk
rw
sg
eu
ig
ur
lg
ny
or
sn
xh
ee
ts
ln
yo
as
si
mn
rn
ga
be
jv
sm
ta
th
ty
to
nso
fy
ha
lb
sq
te
yi
nn
nb
fj
gaa
bcl
crs
niu
guw
tn
co
wa
ceb
cy
ka
st
fo
br
mh
ilo
bzs
iso
gil
efi
lua
pon
pis
pap
pag
rm
oc
an
am
hy
sk
zu
ti
tw
bem
kg
lus
loz
swc
tvl
tll
ml
english
gv
bi
war
lu
ase
hil
lue
gd
km
kn
so
os
ps
se
kqn
srn
toi
mg
tt
wo
kw
tiv
ho
zne
wls
run
tpi
ne
az
cv
sc
kwy
ber
mfe
tum
chk
rnd
yap
ve
c++
mi
sw
tk
dv
yue
mos
sh
roa
my
code
protein
ky
lo
su
na
ba
sa
pa-IN
English
fy-NL
kk
bo
bs
pa
sr
sv-SE
sla
gem
kl
io
ce
ab
luo
ine
fiu
zls
gmq
zle
itc
cel
gmw
afa
umb
iir
sem
urj
cpp
inc
zlw
French
sah
hsb
la
eng
jap
ch
gn
nv
mul
zh-tw
lzh
trk
py
ga-IE
kj
bat
grk
phi
om
dra
fse
ng
ss
kwn
csg
euq
nyk
alv
csn
aed
lun
pqe
aav
bnt
cpf
cus
mkh
nic
sal
mfs
prl
tzo
zai
thai
Cszech
Deustch
Swedish
Cszech Deustch
Cszech English
Cszech Spanish
Cszech French
Cszech Italian
Cszech Swedish
Deustch Cszech
Deustch English
Deustch Spanish
Deustch French
Deustch Italian
Deustch Swedish
English Cszech
English Deustch
English Italian
French Cszech
French Deustch
French English
French Spanish
French Italian
French Swedish
Italian Cszech
Italian Deustch
Italian English
Italian Spanish
Italian French
Italian Swedish
Swedish Cszech
Swedish Deustch
Swedish English
Swedish Spanish
Swedish French
Swedish Italian
ia
rm-sursilv
cnr
hbs
haw
hmn
ku
tg
ug
uz
vn
dutch
italian
scientific english
hi-en
ks
sd
[en]
nah specifically ncj
zh-HK
amh
hau
ibo
kin
lug
pcm
swa
wol
yor
ary
scn
nap
Guj
ssp
arz
scandinavia
art
cau
ccs
map
poz
pqw
sit
tdt
tut
yua
sami
kab
taw
vsl
wal
ach
English Spanish
English French
English Swedish
Spanish Cszech
Spanish Deustch
Spanish English
Spanish French
Spanish Italian
Spanish Swedish
rm-vallader
fon
Go
Java
javascript
php
python
en de nl es
cnh
nr
sot
ven
xho
zul
zh-CN
esperanto
xal