❗❗WARNING❗❗: Read the Usage and Access section below to learn how to use the model. Also note that incremental training is currently paused; I will resume training the model soon.

Celestia: A Next-Generation Transformer Model

Celestia is a breakthrough transformer model designed to deliver high-quality, contextually rich, and creative text generation. With 290 million parameters in its first variant, Celestia has been pretrained on approximately 8 billion tokens using an innovative incremental training approach on Kaggle with TPU v3-8 hardware. This efficient training method leverages the FineWeb-Edu dataset to achieve performance that rivals, and in many cases outperforms, popular small-scale models from Hugging Face.

Key Features

  • Sophisticated Architecture:
    Celestia is built on a state-of-the-art transformer architecture that includes the following (a sketch of these components appears after this feature list):

    • Multi-head attention with optimized key-value mechanisms.
    • Sliding-window attention for efficient handling of long contexts.
    • A Mixture-of-Experts (MoE) feed-forward network to boost performance.
    • Advanced normalization techniques to ensure stability during training.
  • Resource-Efficient Training:
    Celestia was pretrained on only about 8 billion tokens, yet performs strongly for its 290 million parameters. This was made possible by an incremental training strategy, which allowed us to push the boundaries of model performance even with limited resources. Training was carried out on a TPU v3-8 on Kaggle, making it an example of how cutting-edge work can be achieved on a modest computational budget.

  • Superior Performance:
    In our tests, Celestia has outperformed several well-known small models, such as SmolLM by Hugging Face, that are often considered benchmarks for both speed and accuracy. Its ability to handle complex, abstract, and nuanced contexts sets it apart from many existing alternatives.

  • Flexible Generation Capabilities:
    Originally designed as a sentence completion model, Celestia excels at generating thought-provoking and creative continuations. It supports both beam search and temperature-based sampling, ensuring versatility in a variety of text-generation applications. With minor fine-tuning, Celestia can also be adapted for specialized tasks such as conversational agents, summarization, or other domain-specific applications.
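
The complete implementation is distributed only by email (see Usage and Access below), so to make the feature list concrete, here is a minimal, hypothetical PyTorch sketch of how one transformer block could combine these components. The dimensions, window size, the choice of RMSNorm, and the top-1 expert routing are all illustrative assumptions, not Celestia's actual configuration.

```python
# Hypothetical sketch of a Celestia-style transformer block (NOT the released code).
# All names, sizes, and routing choices below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """One possible 'advanced normalization' choice; the card does not specify which."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SlidingWindowAttention(nn.Module):
    """Multi-head attention restricted to a causal window of recent tokens."""
    def __init__(self, dim, n_heads, window):
        super().__init__()
        self.n_heads, self.window = n_heads, window
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        # Causal mask that also forbids attending further back than `window` tokens.
        idx = torch.arange(t, device=x.device)
        mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < self.window)
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(y.transpose(1, 2).reshape(b, t, d))

class MoEFeedForward(nn.Module):
    """Tiny top-1 Mixture-of-Experts FFN; expert count and routing are illustrative,
    and load balancing is omitted for brevity."""
    def __init__(self, dim, hidden, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):
        gate = self.router(x).softmax(-1)   # (batch, seq, n_experts)
        top = gate.argmax(-1)               # route each token to a single expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top == i
            if sel.any():
                out[sel] = gate[sel, i].unsqueeze(-1) * expert(x[sel])
        return out

class CelestiaBlock(nn.Module):
    """Pre-norm residual block: sliding-window attention followed by an MoE FFN."""
    def __init__(self, dim=768, n_heads=12, window=512):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(dim), RMSNorm(dim)
        self.attn = SlidingWindowAttention(dim, n_heads, window)
        self.ffn = MoEFeedForward(dim, 4 * dim)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))

# Quick shape check with random activations:
x = torch.randn(2, 16, 768)
print(CelestiaBlock()(x).shape)  # torch.Size([2, 16, 768])
```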

Why Celestia Stands Apart

  • Efficiency & Innovation:
    Celestia demonstrates that high-quality language models can be built with relatively modest computational resources. By leveraging an incremental training approach, it not only reduces training time and resource demands but also produces outputs with greater depth and nuance compared to many small-scale models available today.

  • Performance Beyond the Norm:
    While many small models on Hugging Face have garnered attention for their accuracy and speed, Celestia has consistently shown superior results in generating creative, coherent, and context-aware text. Its performance on abstract reasoning, complex narrative generation, and sophisticated sentence completions has set a new standard for what can be achieved in this model size category.

  • Ongoing Development:
    Celestia is not a finished product; it is an evolving project. Our long-term plan is to continue training on the full FineWeb-Edu dataset, with periodic updates to the model as more tokens are processed and new techniques are integrated (a sketch of this resume-and-continue idea follows this list). This commitment to continuous improvement ensures that Celestia will remain at the cutting edge of language generation research.
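
The card does not spell out the incremental training procedure, but one plausible reading is a resume-and-continue loop in which each Kaggle session restores the latest checkpoint and trains on the next slice of data. The sketch below illustrates only that idea; the shard layout, optimizer, and checkpoint format are assumptions, not the actual training code.

```python
# Hypothetical sketch of an incremental (resume-and-continue) pretraining loop.
# The real Celestia training code is not public; everything here is illustrative.
import os
import torch
import torch.nn.functional as F

def train_one_shard(model, optimizer, shard, device="cpu"):
    """Run one pass over a single dataset shard (an iterable of token-id batches)."""
    model.train()
    for batch in shard:                       # batch: (batch_size, seq_len) token ids
        batch = batch.to(device)
        logits = model(batch[:, :-1])         # predict each next token
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def incremental_pretrain(model, shards, ckpt_path="celestia_ckpt.pt"):
    """Each session resumes from the last checkpoint and trains on one more shard."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    start = 0
    if os.path.exists(ckpt_path):
        state = torch.load(ckpt_path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start = state["next_shard"]
    for i in range(start, len(shards)):
        train_one_shard(model, optimizer, shards[i])
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "next_shard": i + 1}, ckpt_path)
```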

Future Directions

We plan to:

  • Extend the pretraining to the entire FineWeb-Edu dataset.
  • Regularly update the model with incremental training iterations.
  • Explore fine-tuning strategies for task-specific applications such as interactive conversations, summarization, and more.

Usage and Access

If you wish to use Celestia in your own projects, please contact naqeeb.ajk63@gmail.com for the complete model code and usage instructions. Both beam search and temperature-based sampling are supported, so you can choose the decoding approach that best fits your needs.

Do not load the model through the Hugging Face library. Instead, download the model and its tokenizer file directly from this page and run inference with the provided usage.py script; if you use that script, there is no need to download config.json. usage.py implements temperature sampling combined with nucleus (top-p) sampling, which is the recommended decoding method, although the model also produces good results with beam search. Both usage.py and Beam_search.py are uploaded in the Files section so anyone can test the model: usage.py contains the temperature-and-nucleus hybrid sampling inference code, while Beam_search.py contains the pure beam search inference logic.
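
usage.py itself ships with the model files, but for readers who want to see what temperature-plus-nucleus hybrid sampling looks like in general, here is a minimal, generic sketch. The function name and default values are illustrative assumptions and are not taken from usage.py.

```python
# Generic sketch of temperature + nucleus (top-p) hybrid sampling.
# This is NOT the contents of usage.py; names and defaults are illustrative.
import torch

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Pick the next token id from a 1-D tensor of vocabulary logits."""
    probs = torch.softmax(logits / temperature, dim=-1)   # temperature scaling
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix of tokens whose cumulative probability reaches top_p.
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(top_p)).item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_ids[choice].item()

# Example with random logits standing in for the model's output:
print(sample_next_token(torch.randn(32000)))
```

Lower temperature and top_p values make the output more focused and deterministic; higher values make it more diverse and creative.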

Limitations

The model still needs further pretraining; at this stage its factual knowledge is imperfect. I have not formally evaluated the model, but informal testing with temperature sampling suggests that fine-tuning could improve it further. In the future, I will run incremental training on a combination of The Stack (v1), FineWeb-Edu, and FineMath. Even so, it already shows remarkable results despite its low parameter count (290M).

License

This project is licensed under the Apache License 2.0.


Celestia represents a new paradigm in resource-efficient language modeling—delivering superior performance and creative output even when compared to established small-scale models. We invite you to explore its capabilities and join us in pushing the boundaries of what is possible in natural language generation.
