👋 Open to Work

Hasan Kurşun PRO

hasankursun

AI & ML interests

LLM, Embedding Models, OCR, MoE

Recent Activity

posted an update about 2 hours ago

Greek Corpus 150B is now live on the Hub. A deduplicated, ~146B-token Greek dataset for pretraining and fine-tuning foundation models: a pretrain layer + an instruction (SFT) layer, one unified schema, globally deduplicated.

📊 49.6M documents / ~146B pretrain tokens
📚 Web (FineWeb-2) + long-form PDFs (FinePDFs) + FineWiki + native Greek legislation (47k statutes from the Government Gazette)
💬 ~10B-token SFT layer (9.9M conversations)

The newest in my Global Corpus family (Dutch, Turkish, Bulgarian, Greek), built on a consistent, reproducible pipeline.

🔗 https://huggingface.co/datasets/hasankursun/greek-corpus-150b

#greek #llm #dataset #multilingual
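For a rough sense of scale, the headline figures in the post above can be turned into average lengths. This is a minimal sketch that uses only the rounded numbers quoted in the post; the constant and function names are hypothetical, and the results are back-of-the-envelope estimates, not statistics published with the dataset.

```python
# Rounded figures quoted in the post; treat them as approximate.
PRETRAIN_DOCS = 49.6e6      # 49.6M documents
PRETRAIN_TOKENS = 146e9     # ~146B pretrain tokens
SFT_CONVERSATIONS = 9.9e6   # 9.9M conversations
SFT_TOKENS = 10e9           # ~10B SFT tokens


def mean_tokens(total_tokens: float, items: float) -> float:
    """Average tokens per item, given a total token count and an item count."""
    if items <= 0:
        raise ValueError("items must be positive")
    return total_tokens / items


pretrain_mean = mean_tokens(PRETRAIN_TOKENS, PRETRAIN_DOCS)
sft_mean = mean_tokens(SFT_TOKENS, SFT_CONVERSATIONS)

print(f"Mean tokens per pretrain document: {pretrain_mean:,.0f}")
print(f"Mean tokens per SFT conversation:  {sft_mean:,.0f}")
```

On these rounded inputs the pretrain layer averages roughly 2,900 tokens per document and the SFT layer roughly 1,000 tokens per conversation. The true per-source averages (web pages, PDFs, statutes) likely differ a good deal.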

updated a dataset about 12 hours ago

hasankursun/greek-corpus-150b

upvoted an article about 14 hours ago

Comparative evaluation of GPT‑OSS‑20B vs GPT‑OSS‑120B on Arabic & ILMAAM benchmarks

hasankursun's collections (4)