Lazarus NLP

non-profit

https://lazarusnlp.github.io/

lazarusnlp

AI & ML interests

Neural Machine Translation, Sentence Embeddings, Low-Resource Languages

Recent Activity

w11wo authored a paper 30 days ago

COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

w11wo new activity 2 months ago

LazarusNLP/simcse-indobert-lite-base:Adding `safetensors` variant of this model

w11wo new activity 2 months ago

LazarusNLP/bloom-1b7-fp32:Adding `safetensors` variant of this model

Organization Card

Lazarus NLP is a collective initiative to revive the dying languages of Indonesia through speech and language technology.

Projects

NusaBERT: Teaching IndoBERT to be multilingual and multicultural!

This project extends the multilingual and multicultural capabilities of IndoBERT. We extended the IndoBERT tokenizer to cover 12 new regional languages of Indonesia, then continued pre-training on a large-scale corpus of Indonesian and the 12 regional languages. Our models are competitive and robust on multilingual and multicultural benchmarks such as IndoNLU, NusaX, and NusaWrites.
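As a rough illustration of what extending a tokenizer involves, the sketch below appends unseen tokens to a vocabulary and grows the embedding table by one row per new token. The function name `extend_vocab`, the toy vocabulary, and the mean-initialization of new rows are all assumptions made for this example; the source does not say how NusaBERT initialized its new embeddings.

```python
def extend_vocab(vocab, embeddings, new_tokens):
    """Append unseen tokens to the vocab and add one embedding row per new token.

    New rows start at the mean of the existing rows. This is one common
    heuristic and an assumption of this sketch, not a documented NusaBERT choice.
    """
    dim = len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    vocab = dict(vocab)                              # copy so the inputs stay untouched
    embeddings = [list(row) for row in embeddings]
    for token in new_tokens:
        if token not in vocab:                       # skip tokens that already exist
            vocab[token] = len(vocab)                # next free id
            embeddings.append(list(mean_row))
    return vocab, embeddings


# Toy data (illustrative only): a tiny Indonesian vocab with 2-dimensional embeddings.
vocab = {"[UNK]": 0, "saya": 1, "makan": 2}
embeddings = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]

# "kulo" (Javanese) and "abdi" (Sundanese) are new; "makan" is already present.
new_vocab, new_embeddings = extend_vocab(vocab, embeddings, ["kulo", "makan", "abdi"])
print(new_vocab)
print(len(new_embeddings))
```

After the table grows, continued pre-training is what gives the new rows meaningful values; the initialization only sets a starting point.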

IndoT5: T5 Language Models for the Indonesian Language

IndoT5 is a T5-based language model trained specifically for Indonesian. With just 8 hours of training on a limited budget, we developed a competitive sequence-to-sequence, encoder-decoder model that can be fine-tuned for tasks such as summarization, chit-chat, and question answering. Despite these training constraints, our model is competitive on the IndoNLG (text generation) benchmark.

Indonesian Sentence Embedding Models

We trained open-source sentence embedding models for Indonesian, enabling applications such as information retrieval (useful for retrieval-augmented generation!), semantic text similarity, and zero-shot text classification. We build on existing pre-trained Indonesian language models like IndoBERT, apply state-of-the-art unsupervised techniques, and evaluate on established sentence embedding benchmarks.
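To make the retrieval use case concrete, here is a minimal sketch of ranking documents by cosine similarity between embedding vectors. The three-dimensional vectors are made-up stand-ins; in practice they would come from a sentence embedding model, and the helper names `cosine_similarity` and `rank_documents` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real sentence embeddings (values are made up).
query = [1.0, 0.0, 1.0]
docs = [
    [0.9, 0.1, 0.8],     # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],     # orthogonal to the query
    [-1.0, 0.0, -1.0],   # opposite direction
]

ranking = rank_documents(query, docs)
print(ranking)
```

The same scoring supports semantic text similarity directly, and zero-shot classification if each candidate label is embedded and treated as a "document".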

Indonesian Natural Language Inference Models

Open-source, lightweight NLI models that are competitive with larger models on the IndoNLI benchmark while using significantly fewer parameters. We applied knowledge distillation methods to small existing pre-trained language models like IndoBERT Lite. These models offer an efficient solution for tasks that require natural language inference, such as cross-encoder-based semantic search, while minimizing computational resources.
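For readers unfamiliar with knowledge distillation, the sketch below computes the classic soft-target loss: the KL divergence between temperature-softened teacher and student distributions over the three NLI labels. This is an assumption about the general technique, not a description of the exact method used for these models; the logits, the temperature of 2.0, and the function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes moderate logits so no student probability underflows to zero.
    """
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Made-up logits for the labels (entailment, neutral, contradiction).
teacher = [3.0, 0.5, -1.0]
student = [1.0, 1.0, 0.0]

print(distillation_loss(student, teacher))   # positive: the student disagrees
print(distillation_loss(teacher, teacher))   # zero: identical distributions
```

In training, this term is typically combined with the ordinary cross-entropy loss on the gold labels.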

Many-to-Many Multilingual Translation Models

By adapting mT5 to 45 languages of Indonesia, we developed a robust baseline model for multilingual translation among them. This baseline facilitates further fine-tuning for niche domains and low-resource languages, contributing to greater linguistic inclusivity. Our models are competitive with existing multilingual translation models on the NusaX benchmark.
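One way to picture "many-to-many" training data: every aligned sentence row yields one example per directed language pair. The sketch below expands a single parallel row accordingly. The `translate X to Y:` prefix, the language codes, and the function name are hypothetical choices for illustration; the source does not specify the input format these models use.

```python
from itertools import permutations


def make_translation_pairs(parallel_row):
    """Expand one aligned row {lang_code: sentence} into all directed pairs."""
    examples = []
    for src, tgt in permutations(sorted(parallel_row), 2):
        examples.append({
            # Hypothetical task prefix; the real prompt format is not documented here.
            "source": f"translate {src} to {tgt}: {parallel_row[src]}",
            "target": parallel_row[tgt],
        })
    return examples


# Toy aligned row ("thank you" in Indonesian, Javanese, Sundanese); not from the datasets above.
row = {"ind": "Terima kasih", "jav": "Matur nuwun", "sun": "Hatur nuhun"}

pairs = make_translation_pairs(row)
print(len(pairs))   # n * (n - 1) directed pairs for n languages
print(pairs[0])
```

With n languages a fully aligned row gives n * (n - 1) directions, which is why a single many-to-many model is attractive compared with training one model per pair.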

Collections 5

models 40

LazarusNLP/simcse-indobert-lite-base

LazarusNLP/bloom-1b7-fp32

Text Generation • Updated Feb 10 • 9

LazarusNLP/bloomz-1b7-fp32

Text Generation • Updated Feb 10 • 1

LazarusNLP/congen-indobert-base

Sentence Similarity • Updated Feb 1 • 5

LazarusNLP/indo-t5-base-v2

Text2Text Generation • Updated Jan 31 • 3 • 1

LazarusNLP/indo-t5-base

Text2Text Generation • Updated Dec 15, 2024 • 26

LazarusNLP/indo-t5-base-v2-nusax

Text2Text Generation • Updated Dec 11, 2024 • 265

LazarusNLP/simcse-indoroberta-base

LazarusNLP/s-indobert-base-mmarco

LazarusNLP/indo-t5-base-nusax

Text2Text Generation • Updated Nov 20, 2024 • 34 • 1

datasets 5

LazarusNLP/multilingual-NLI-26lang-2mil7-id

Viewer • Updated Jan 30, 2024 • 105k • 27 • 1

LazarusNLP/mini_pile_cc

Viewer • Updated Jan 19, 2024 • 10M • 30

LazarusNLP/wikipedia_id_backtranslated

Viewer • Updated Jan 16, 2024 • 1M • 35

LazarusNLP/stsb_mt_id

Viewer • Updated Jan 6, 2024 • 2.88k • 24 • 2

LazarusNLP/wikipedia_id_20230520

Viewer • Updated May 27, 2023 • 10.1M • 33

Collections 5

NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural

LazarusNLP/NusaBERT-base

LazarusNLP/NusaBERT-large

LazarusNLP/stsb_mt_id

LazarusNLP/all-indo-e5-small-v4

LazarusNLP/all-indo-e5-small-v3

LazarusNLP/all-indo-e5-small-v2

spaces 2

NusaBERT

LazarusNLP

Team members 5
