dthulke committed on
Commit 0e4b0fa
1 Parent(s): e5dd0fa

update model card

Files changed (1)
  1. README.md +79 -28
README.md CHANGED
@@ -1,39 +1,90 @@
- # Model Card for climategpt/climategpt-7b-fsg
- - This model is the 7B parameter from-scratch general ("fsg") variant of the ClimateGPT model release.

- ## Overview
- - **Developed by:** AppTek, Eqtylab, Erasmus AI
  - **Model type:** decoder-only Transformer
- - **Language(s) (NLP):** natively supported: English; supported via cascaded MT on web interface: Arabic, Bangla, Chinese (simplified), Dutch, Finnougoric, French, Germanic, Greek, Hebrew, Indonesian, Japanese, Korean, Lithuanian, Pashto, Persian, Portuguese, Russian, Spanish, Thai, Turkish, Vietnamese
  - **License:** TO BE ADDED
- - **Repository:** https://huggingface.co/climategpt/climategpt-7b-fsg
- - **Paper:** TO BE ADDED
- - **Demo:** TO BE ADDED
  ## Uses
- - This model is intended to be directly used as a question answering model that is specialized in the climate domain.
- - The model is aimed at providing useful feedback for decision makers, scientists and journalists involved in climate discussions.
- - The model can also be used as a starting point for interested developers for further finetuning.
- - The model is NOT intended to be a general-purpose chatbot (although it has chat capabilities).
- - For the full system including cascaded MT, RAG, etc., we recommend the user to go to our demo website: TO BE ADDED.
- - For hands-on finetuning, deployment and inference, we recommend the user to directly use the Huggingface helpers.
- - For in-depth model conversion and finetuning, we recommend the user to use https://github.com/epfLLM/Megatron-LLM/.
- - **Despite the efforts from the development team to eliminate them, as with every other chat-capable LLM, this model may generate biased, offensive or inaccurate responses.**
-
- ## How to Get Started with the Model
- After downloading the HF formatted model, the HF helpers should work out-of-the-box.
- It is also possible to evaluate the model with https://github.com/EleutherAI/lm-evaluation-harness by plugging in the model identifier `--model_args pretrained=climategpt/climategpt-7b-fsg`.
  ## Training
- - For pretraining, a 300B-token dataset with an emphasis on the climate domain was prepared and used.
- - For instruction finetuning, about 1.1B instruction-finetuning tokens (in both the climate and general domains) were used.

  ## Environmental Impact
- - **Hardware Type:** H100
- - **Hours used:** 30720 hrs
- - **Cloud Provider:** TO BE ADDED
- - **Compute Region:** TO BE ADDED
- - **Carbon Emitted:** TO BE ADDED

  ## Citation
- **BibTeX:** TO BE ADDED

+ ---
+ language:
+ - en
+ datasets:
+ - OpenAssistant/oasst1
+ - databricks/databricks-dolly-15k
+ base_model: meta-llama/Llama-2-7b-hf
+ tags:
+ - climate
+ co2_eq_emissions:
+   emissions: 265800
+   training_type: "pre-training"
+   geographical_location: "Washington, USA"
+   hardware_used: "8x NVIDIA H100 HBM"
+ ---
+ # ClimateGPT 7B FSG
 
+ <blockquote style="padding: 10px; margin: 0 0 10px; border-left: 5px solid #ddd;">
+ ⚠️ This is a research experiment to explore training from scratch on climate-related data. If you're just interested in using the model, we recommend using the Llama 2 based [ClimateGPT 7B](https://huggingface.co/eci-io/climategpt-7b).
+ </blockquote>
+
+ ClimateGPT is an ensemble of AI models designed to augment human decisions in the fast-moving field of climate change.
+ ClimateGPT 7B FSG (from scratch general) is a 7 billion parameter transformer decoder model that was pre-trained for 319.5B tokens and then continuously pre-trained on a collection of 4.2B tokens from curated climate documents.
+ The model is further instruction fine-tuned on a dataset of instruction-completion pairs manually collected by AppTek in cooperation with climate scientists.
+ [ClimateGPT 7B](https://huggingface.co/eci-io/climategpt-7b) outperforms Llama 2 70B Chat on our climate-specific benchmarks.
+ The model is designed to be used together with retrieval augmentation to extend its knowledge and increase its factuality, and with cascaded machine translation to increase language coverage.
+
+ <blockquote style="padding: 10px; margin: 0 0 10px; border-left: 5px solid #ddd;">
+ A paper describing our approach will be released soon.
+ </blockquote>
+
+ ## Model Details
+ - **Trained by:** [AppTek](https://apptek.com)
+ - **Powered by:** [Erasmus AI](https://erasmus.ai)
+ - **Verified by:** [EQTYLab](https://eqtylab.io)
  - **Model type:** decoder-only Transformer
+ - **Language(s) (NLP):** English
  - **License:** TO BE ADDED
+ - **Continued pre-trained from:** Llama 2 7B
+ - **Context length:** 4K tokens
+ - **Input:** Text-only data
+ - **Output:** Model generates text only
+ - **Paper:** The paper will be released soon.
+ - **Website:** [eci.io](https://eci.io)

  ## Uses
+ - This is an experimental model and it is only intended to be used to reproduce our results and for LLM research. For any other use case, we recommend using [ClimateGPT 7B](https://huggingface.co/eci-io/climategpt-7b), [13B](https://huggingface.co/eci-io/climategpt-13b) or [70B](https://huggingface.co/eci-io/climategpt-70b).
+ - **Despite the efforts from the development team to eliminate them, as with every other chat-capable LLM, this model may generate biased, offensive or inaccurate responses.**
+
+ ## Downstream Use
+
+ ClimateGPT 7B FSG is an instruction-tuned model that can be directly used for climate-specific question-answering applications.
+ It was trained to perform well with retrieval augmentation and supports up to 5 references in context.
+
+ The model was trained using ChatML, so the following format should be followed when prompting, including the `<|im_start|>`, `<|im_end|>` tags, the `system`, `user`, `context` and `assistant` identifiers, and the `[[0]]`, `[[1]]`, etc. tokens to indicate references.
+
+ """
+ <|im_start|>system
+ {system_message}<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>context
+ [[0]] "{reference1_title}", {reference1_year}
+ {reference1_text}
+ [[1]] "{reference2_title}", {reference2_year}
+ {reference2_text}
+ [...]<|im_end|>
+ <|im_start|>assistant
+ """

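As an illustration, the template above can be filled programmatically. The following is a minimal sketch; the helper name `build_prompt` and the example values are ours, not part of the release:

```python
# Hypothetical helper: fill the ChatML-style template from the card,
# including the [[i]] markers that indicate retrieved references.
def build_prompt(system_message, prompt, references):
    # references: list of (title, year, text) tuples; the card states
    # the model supports up to 5 references in context.
    context = "\n".join(
        f'[[{i}]] "{title}", {year}\n{text}'
        for i, (title, year, text) in enumerate(references)
    )
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>context\n{context}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

example = build_prompt(
    "You are a helpful climate assistant.",
    "What drives sea level rise?",
    [("Assumed Reference", 2020, "Thermal expansion and ice melt.")],
)
```

The resulting string can then be tokenized and passed to the model with the usual Hugging Face `AutoTokenizer`/`AutoModelForCausalLM` helpers.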
  ## Training
+ - Details on the pre-training data are given in our paper.
+ - For continued pre-training, 4.2B climate-domain tokens (tokenized by the Llama tokenizer) are used.
+ - For instruction fine-tuning, about 272K instruction-completion pairs (from both the climate domain and the general domain) are used.
+
+ ## Evaluation
+
+ Detailed evaluation results are presented on our model card website: [eci.io/model-card](https://eci.io/model-card)

  ## Environmental Impact
+ - **Hardware Type:** 8x NVIDIA H100 HBM
+ - **Power Consumption per GPU:** 775W
+ - **Hours used:** 14,288 hrs
+ - **Cloud Provider:** MLFoundry
+ - **Compute Region:** Washington, USA
+ - **Energy Mix:** 100% Hydro Power (24g CO2eq/kWh according to IPCC 2014)
+ - **Carbon Emitted:** 265.8kg CO2eq
 
  ## Citation
+ **BibTeX:** Paper will be released soon.