mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper • arXiv:2406.08707
Extraction of structured data from the Common Crawl: schema.org annotations, web tables, and hyperlink graphs