Pedro Ortiz Suarez
pjox
AI & ML interests
Language modeling, parsing, sequence tagging, NER, historical languages.
pjox's activity
Set `sep="\s+"` for the duplicates file
2
#1 opened 6 months ago
by
lhoestq
Porn-related strings in the datasets (zh)
2
#8 opened 12 months ago
by
kiwakwok
colab crashed after trying to load the dataset
1
#4 opened over 1 year ago
by
MhondGhod
Change foldernames
4
#3 opened over 1 year ago
by
hac541309
Unsafe Files
20
#12 opened over 1 year ago
by
GetzPro
About the number of documents
6
#6 opened over 1 year ago
by
lixin4ever
Upload the rest of the data for 05-06-23
#1 opened over 1 year ago
by
pjox
Changing into Parquet
2
#5 opened over 1 year ago
by
hac541309
the link to RoBERTa base model directs us to bert-base-uncased
1
#1 opened over 1 year ago
by
hurrial
Deduplicated English Corpus
2
#3 opened over 1 year ago
by
conceptofmind
Data hosting on Huggingface
1
#2 opened over 1 year ago
by
hieuhocnlp
How to download only one language?
2
#1 opened almost 2 years ago
by
musabg
full of sexy content and doesn't have 200G in zh corpus
1
#10 opened almost 2 years ago
by
Hzhiqiang