Hasan Kurşun PRO
hasankursun
AI & ML interests
LLM, Embedding Models, OCR, MoE
Recent Activity
posted an update about 2 hours ago
Greek Corpus 150B is now live on the Hub.
A ~146B-token Greek dataset for pretraining and fine-tuning foundation models: a pretrain layer plus an instruction (SFT) layer, sharing one unified schema and deduplicated globally.
📊 49.6M documents / ~146B pretrain tokens
📚 Web (FineWeb-2) + long-form PDFs (FinePDFs) + FineWiki + native Greek legislation (47k statutes from the Government Gazette)
💬 ~10B-token SFT layer (9.9M conversations)
The newest in my Global Corpus family (Dutch, Turkish, Bulgarian, Greek), built on a consistent, reproducible pipeline.
🔗 https://huggingface.co/datasets/hasankursun/greek-corpus-150b
#greek #llm #dataset #multilingual
updated a dataset about 12 hours ago
hasankursun/greek-corpus-150b
upvoted an article about 14 hours ago
Comparative evaluation of GPT‑OSS‑20B vs GPT‑OSS‑120B on Arabic & ILMAAM benchmarks