ABOUT_TEXT = """
## Web Languages Project

Welcome! This is a crowd-sourced effort to improve crawling
of low-resource languages. This dataset is public.

[Common Crawl](https://commoncrawl.github.io/cc-crawl-statistics/plots/languages)
recognizes many languages, and we can see that we don't have
enough content in languages like Hindi (500 million speakers!), national
languages of smaller countries like Hungarian, and regional languages like Catalan.

We are interested in languages from all over the world. If you choose
to help, you'll be creating lists of websites related to
languages that you read or speak.
### How can I contribute?

If you look below, you'll see a huge list of living languages. If
one looks interesting, click on it. You'll see a
language-specific document, probably mostly blank, that you can fill
out.

There are two ways to add to this document. If you aren't very familiar
with GitHub, you can copy the entire document into an email, fill it
out, and send it to web-languages ZAT commoncrawl ZOT org. We'll do the rest.

If you are familiar with GitHub and are logged in, click on the pen
icon in the upper right corner to start editing the document.
GitHub will ask you to fork the repo. Do that, edit the
document, and finally create a pull request.

To see a partially completed example, look at the
[Welsh](living/welsh.md) entry.

Sometimes asking a Large Language Model can be helpful, for example: "What are some
top websites written in the Welsh language?"
### What kind of websites are you looking for?

If you look at the template, you'll see that we request URLs in a few
categories: News, Culture/History, Government, Political Parties, and
Other. Remember that we're looking for websites in this particular
language. If the language is only a part of the website, and that's
visible in the URL as https://example.com/catalan/, then that's the
URL you should add.
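
If you collect candidate URLs with a script, the rule above can be sketched
in Python. This is only an illustration under stated assumptions: the helper
name `language_section_url` is hypothetical and not part of this project, and
it assumes the language shows up as its own path segment, as in the
https://example.com/catalan/ example.

```python
from urllib.parse import urlsplit


def language_section_url(url, section):
    # Hypothetical helper: trim a page URL down to the language section.
    # Returns None when `section` is not a path segment of the URL.
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if section not in segments:
        return None
    # Keep everything up to and including the first matching segment.
    kept = segments[: segments.index(section) + 1]
    return f"{parts.scheme}://{parts.netloc}/" + "/".join(kept) + "/"


print(language_section_url("https://example.com/catalan/news/today.html", "catalan"))
```

A URL with no matching segment returns `None`, which is a hint to check by
hand whether the site really has a section in that language.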

For a language like Hindi, with 500 million speakers, there are a lot
of websites to choose from. Please suggest websites that are important
and influential, and please think about diversity. Are all geographic
regions represented?

For a country-wide language like Hungarian, there is probably still a
wide variety of websites to choose from. If a website is entirely in English,
however, that's not what we're looking for.

For a regional language like Catalan, things are trickier. Catalan has
multiple names -- it's called Valencian in some parts of Spain -- and
use of the Catalan language is part of a vigorous debate in Spanish
national and regional politics. You might not be able to find
Catalan-language content for every political party, and government
websites might offer Catalan content one day and remove it after
the next election. In that case, please do the best you can.

If your favorite language has its own Wikipedia -- [check the list here](https://en.wikipedia.org/wiki/List_of_Wikipedias) --
please include a link to it under "Other".
"""