ZemResearch
AI & ML interests
On-Device AI, Small Language Models (SLMs), hyper-specialized code generation, model compression (quantization), and synthetic dataset curation.
Hello from ZemResearch!
Mixing Artificial Intelligence with Chemistry, one dataset at a time.
Who We Are
Welcome to ZemResearch! We are an open-source research initiative passionate about bridging the gap between computer science and molecular biology. We believe that training specialized, lightweight Large Language Models (LLMs) shouldn't require massive corporate budgets; it just needs incredibly clean data and smart engineering.
What We Do
- Extreme Data Cleaning: We don't just scrape data; we sterilize it. We rely heavily on tools like RDKit to ensure our molecular datasets obey the fundamental laws of chemistry.
- Lightweight AI Models: We focus on fine-tuning accessible, efficient LLMs that can run smoothly without needing massive GPU clusters.
- Open Science: Everything we build is dedicated to the global open-source community. Let's democratize AI drug discovery together!
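As an illustration of the kind of RDKit-based cleaning described above, here is a minimal sketch, not our actual pipeline. It assumes the molecules arrive as SMILES strings; the function names `rdkit_canonical` and `clean_smiles` are hypothetical. It drops strings RDKit cannot parse or sanitize and removes duplicates after canonicalization.

```python
from typing import Callable, Iterable, List, Optional, Set


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string is rejected."""
    # Imported lazily so the rest of this sketch can run without RDKit installed.
    from rdkit import Chem

    # MolFromSmiles returns None when parsing or sanitization fails
    # (for example, an atom with an impossible valence).
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # canonical SMILES by default


def clean_smiles(
    raw: Iterable[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical,
) -> List[str]:
    """Drop rejected entries and duplicates, keeping first-seen order."""
    seen: Set[str] = set()
    cleaned: List[str] = []
    for smiles in raw:
        canonical = canonicalize(smiles.strip())
        if canonical is None or canonical in seen:
            continue  # invalid molecule, or already kept under another spelling
        seen.add(canonical)
        cleaned.append(canonical)
    return cleaned
```

With RDKit installed, `clean_smiles(["CCO", "OCC"])` keeps a single entry, because both strings are spellings of ethanol and share one canonical form. The `canonicalize` parameter is only there so a different validator can be swapped in.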
Our Flagship Projects
- HippoCrates: A massive, heavily sterilized dataset containing 1.46 million molecular structures. It ships in Apache Parquet format, ready to use for text-generation and chemical bioactivity fine-tuning.
- HippoXic: A premium, domain-specific instruction-tuning dataset containing 10,630 highly curated rows focused on chemical toxicology, FDA clinical safety, and real-world side effects. It bridges the gap between molecular structures and clinical bio-safety reasoning.
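Because the data ships as Parquet, loading it should take only a few lines. The sketch below assumes the Parquet files have already been downloaded to a local directory and that the Hugging Face `datasets` library is installed; the function names and the directory layout are hypothetical, so adjust them to the actual repository contents.

```python
from pathlib import Path
from typing import List


def find_parquet_shards(directory: str) -> List[str]:
    """Return every .parquet file under `directory`, sorted for a stable order."""
    return sorted(str(p) for p in Path(directory).rglob("*.parquet"))


def load_parquet_dataset(directory: str):
    """Load local Parquet shards as a single Hugging Face Dataset."""
    # Imported lazily: only needed when the data is actually loaded.
    from datasets import load_dataset

    shards = find_parquet_shards(directory)
    if not shards:
        raise FileNotFoundError(f"no .parquet files found under {directory}")
    # The generic "parquet" builder reads local files; a plain list of
    # data files is exposed as the "train" split.
    return load_dataset("parquet", data_files=shards, split="train")
```

Calling `load_parquet_dataset("path/to/download")` returns a `Dataset` whose columns mirror the Parquet schema, so inspecting `column_names` first is a sensible way to see what each row contains.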
Let's Collaborate
Got a cool idea for molecular LLMs, or just want to chat about AI in healthcare? Feel free to explore our datasets, open a discussion in our repositories, or reach out. We are always open to new collaborations!