chansung
posted an update Jan 22
Update on the Newsletter of 🤗 Daily Paper

Automatic Korean translation is now integrated. In the newsletter, "KO" links appear next to each paper, and they bring you to a translated version of the full paper. This is done with the following workflow.

1. Fetch the list of arXiv IDs from the 🤗 Daily Paper API (see the sketch after this list)
2. Distribute sub-lists of arXiv IDs to a number of VMs (possibly spot instances, since each job finishes quickly)
3. Commit & push the translated papers in HTML to the designated GitHub repository
4. The newsletter includes the links to the HTML of each paper
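
As a minimal sketch of steps 1 and 2, assuming the public `https://huggingface.co/api/daily_papers` endpoint and its JSON shape (my reading of the API, not something documented in this post):

```python
import requests

# Assumption: the 🤗 Daily Papers feed is exposed at this endpoint and each
# entry carries the arXiv ID under entry["paper"]["id"]; adjust if the API
# shape differs.
resp = requests.get("https://huggingface.co/api/daily_papers", timeout=30)
resp.raise_for_status()
arxiv_ids = [entry["paper"]["id"] for entry in resp.json()]

# Split the IDs into sub-lists, one per VM (chunk size is illustrative).
chunk = 5
sublists = [arxiv_ids[i:i + chunk] for i in range(0, len(arxiv_ids), chunk)]
print(sublists)
```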

Job distribution across a number of VMs is done super easily with [dstack](https://dstack.ai/), and the translation sub-workflow goes: 1) download the PDF of each paper with the arxiv-dl package, 2) convert the PDF to text with the nougat-ocr package, 3) use a custom-trained model (nlp-with-deeplearning/enko-t5-small-v0) in 🤗 transformers to translate the English text into Korean line by line, and 4) reformat the translation into HTML.
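
Roughly, the per-paper sub-workflow could look like the sketch below. The CLI flags for arxiv-dl and nougat, the file paths, and the generation settings are assumptions on my side, not taken from the actual source:

```python
import subprocess
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Steps 1) and 2): download the PDF and OCR it to text. Both tools are invoked
# via their CLIs ("paper" from arxiv-dl, "nougat" from nougat-ocr). Flags and
# output paths are illustrative; arxiv-dl may name files differently.
arxiv_id = "2401.05749"
subprocess.run(["paper", arxiv_id, "-d", "papers"], check=True)
subprocess.run(["nougat", f"papers/{arxiv_id}.pdf", "-o", "ocr_out"], check=True)

# Step 3): translate line by line with the custom EN->KO T5 model.
model_id = "nlp-with-deeplearning/enko-t5-small-v0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

def translate(line: str) -> str:
    inputs = tokenizer(line, return_tensors="pt", truncation=True)
    out = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Step 4): reformat into HTML (a trivial stand-in for the real templating).
with open(f"ocr_out/{arxiv_id}.mmd", encoding="utf-8") as f:
    html = "\n".join(f"<p>{translate(line)}</p>" for line in f if line.strip())
```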

Many people in Korea are not fluent in English but want to learn about new things in AI, so they usually rely on Google Translate or similar services. This is why I made this feature: easier, more direct access to SOTA knowledge.

Are there other countries with similar needs? If so, it would be wonderful to cooperate to support more languages. Please reach out if you are interested in this.

PS: I always wanted to show the usefulness of open ML models by building a well-working end-to-end product, and this newsletter shows it by featuring T5ForConditionalGeneration (translation) and a SOLAR LLM (summarization).
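
For the summarization side, a minimal sketch along these lines would work; the exact SOLAR checkpoint (upstage/SOLAR-10.7B-Instruct-v1.0) and the prompt are my assumptions, not necessarily what the newsletter uses:

```python
from transformers import pipeline

# Assumption: an instruct-tuned SOLAR checkpoint; swap in whatever the
# newsletter actually uses.
summarizer = pipeline(
    "text-generation",
    model="upstage/SOLAR-10.7B-Instruct-v1.0",
    device_map="auto",
)

abstract = "..."  # abstract text fetched for one daily paper
prompt = f"Summarize the following paper abstract in three sentences:\n\n{abstract}"
summary = summarizer(prompt, max_new_tokens=200, return_full_text=False)
print(summary[0]["generated_text"])
```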

If you want to subscribe to the newsletter
: https://groups.google.com/g/hf-daily-paper-newsletter

If you want to look into the source code
: https://github.com/deep-diver/hf-daily-paper-newsletter

That's so cool, really helpful for non-English-language audiences <3

Hey, this reminded me of a paper that highlights an important issue when collecting data for any language (mainly those other than English):
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism
https://arxiv.org/abs/2401.05749

The major takeaway for me: when collecting new training data, it is important to be aware of data that may already have been machine translated at scale (which correlates with low quality); multi-way parallelism is a promising way to detect this.

"a large fraction of the total sentences in lower resource languages have at least one translation (Β§ 4.1), implying that a large fraction of the total web in those languages is MT generated"

MT detection, specifically multi-way parallelism, is a promising way to spot low-quality, machine-translated data, especially in lower-resource languages, and to filter both bilingual and monolingual data.
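
To make the idea concrete, here is a toy sketch (mine, not the paper's code) of the multi-way parallelism signal: group mined sentence pairs by their shared source sentence and flag sentences that are parallel across many languages, since high multi-way parallelism correlates with machine-translated content.

```python
from collections import defaultdict

# (source_sentence, target_language) pairs from a mined bitext corpus;
# the data and threshold below are purely illustrative.
pairs = [
    ("climate change is real", "ko"),
    ("climate change is real", "fr"),
    ("climate change is real", "de"),
    ("hand-written local news story", "ko"),
]

langs_per_sentence = defaultdict(set)
for src, lang in pairs:
    langs_per_sentence[src].add(lang)

THRESHOLD = 3  # illustrative cutoff; the paper studies the full distribution
suspect = {s for s, langs in langs_per_sentence.items() if len(langs) >= THRESHOLD}
print(suspect)  # {'climate change is real'}
```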


Thanks @samusenps for sharing the interesting paper.

Yeah, it makes a lot of sense to me. I have been working as a professional book translator for a few years as a side job. Translators earn very little income while having to spend a lot of time producing quality translations (there is no 1:1 mapping; it is actually like recreating the original book).

With that perspective/experience in the translation industry, I have no doubt that most translated content on the web is machine translated.