metadata

language:
  - en
datasets:
  - OpenAssistant/oasst1
  - databricks/databricks-dolly-15k
base_model: meta-llama/Llama-2-7b-hf
tags:
  - climate
co2_eq_emissions:
  emissions: 265800
  training_type: pre-training
  geographical_location: Washington, USA
  hardware_used: 8x NVIDIA H100 HBM

ClimateGPT 7B FSG

⚠️ This is a research experiment to explore training from scratch on climate-related data. If you're just interested in using the model, we recommend using the Llama 2-based [ClimateGPT 7B](https://huggingface.co/eci-io/climategpt-7b).

ClimateGPT is an ensemble of AI models designed to augment human decisions in the fast-moving field of climate change. ClimateGPT 7B FSG (from scratch climate) is a 7 billion parameter transformer decoder model that was pre-trained for 319.5B tokens and then continuously pre-trained on a collection of 4.2B tokens from curated climate documents. The model is further instruction fine-tuned on a dataset of instruction-completion pairs manually collected by AppTek in cooperation with climate scientists. ClimateGPT 7B outperforms Llama 2 70B Chat on our climate-specific benchmarks. The model is designed to be used together with retrieval augmentation, which extends its knowledge and increases its factuality, and with cascaded machine translation, which increases its language coverage.

A paper describing our approach will be released soon.

Model Details

Trained by: AppTek
Powered by: Erasmus AI
Verified by: EQTYLab
Model type: decoder-only Transformer
Language(s) (NLP): English
License: TO BE ADDED
Continued pre-trained from: Llama 2 7B
Context length: 4K tokens
Input: Text-only data
Output: Model generates text only
Paper: The paper will be released soon.
Website: eci.io

Uses

This is an experimental model, intended only for reproducing our results and for LLM research. For any other use case, we recommend using ClimateGPT 7B, 13B or 70B.
Despite the development team's efforts to eliminate them, this model, like every other chat-capable LLM, may generate biased, offensive or inaccurate responses.

Downstream Use

ClimateGPT 7B FSG is an instruction-tuned model that can be directly used for climate-specific question-answering applications. It was trained to perform well with retrieval augmentation and supports up to 5 references in context.

The model was trained using ChatML, so prompts should follow the format below, including the <|im_start|> and <|im_end|> tags, the system, user, context and assistant identifiers, and the [[0]], [[1]], etc. tokens that indicate references.

"""
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>context
[[0]] "{reference1_title}", {reference1_year}
{reference1_text}
[[1]] "{reference2_title}", {reference2_year}
{reference2_text}
[...]<|im_end|>
<|im_start|>assistant
"""
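As a minimal sketch, the template above can be assembled programmatically. The `Reference` class and `build_prompt` function are hypothetical names introduced here for illustration and are not part of any released ClimateGPT tooling. The limit of 5 references comes from this card. Omitting the context block when no references are supplied is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence

# The card states that up to 5 references are supported in context.
MAX_REFERENCES = 5


@dataclass
class Reference:
    """Hypothetical container for one retrieved document."""
    title: str
    year: int
    text: str


def build_prompt(system_message: str, prompt: str,
                 references: Sequence[Reference] = ()) -> str:
    """Assemble a ChatML prompt following the template in this card."""
    if len(references) > MAX_REFERENCES:
        raise ValueError(f"at most {MAX_REFERENCES} references are supported")

    parts = [
        f"<|im_start|>system\n{system_message}<|im_end|>",
        f"<|im_start|>user\n{prompt}<|im_end|>",
    ]

    # Assumption: the context block is left out when there are no references.
    if references:
        lines = []
        for index, ref in enumerate(references):
            # References are numbered from 0: [[0]], [[1]], ...
            lines.append(f'[[{index}]] "{ref.title}", {ref.year}')
            lines.append(ref.text)
        parts.append("<|im_start|>context\n" + "\n".join(lines) + "<|im_end|>")

    # The prompt ends with an open assistant turn for the model to complete.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)
```

The returned string can be passed to whatever tokenizer and generation setup you use. Generation should stop at the `<|im_end|>` tag.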

Training

Details on the pre-training data are given in our paper.
For continued pre-training, 4.2B climate domain tokens (tokenized by the Llama tokenizer) are used.
For instruction fine-tuning, about 272K instruction-completion pairs (from both the climate domain and the general domain) are used.

Evaluation

Detailed evaluation results are presented on our model card website: eci.io/model-card

Environmental Impact

Hardware Type: 8x NVIDIA H100 HBM
Power Consumption per GPU: 775W
Hours used: 14,288 hrs
Cloud Provider: MLFoundry
Compute Region: Washington, USA
Energy Mix: 100% Hydro Power (24g CO2eq/kWh according to IPCC 2014)
Carbon Emitted: 265.8kg CO2eq
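The reported figure can be reproduced from the numbers above. The sketch below assumes that "Hours used" counts total GPU-hours across all GPUs and that no data-center overhead (PUE) is applied. Both assumptions are inferred from the arithmetic and are not stated in this card. The function name is illustrative.

```python
# Values taken from the Environmental Impact section of this card.
GPU_HOURS = 14_288                 # assumed to be total GPU-hours
POWER_PER_GPU_W = 775              # watts per GPU
CARBON_INTENSITY_G_PER_KWH = 24    # hydro power, IPCC 2014


def estimate_emissions_kg(gpu_hours: float, power_w: float,
                          intensity_g_per_kwh: float) -> float:
    """Estimate emissions in kg CO2eq, ignoring data-center overhead."""
    energy_kwh = gpu_hours * power_w / 1000          # W*h -> kWh
    return energy_kwh * intensity_g_per_kwh / 1000   # g -> kg


emissions_kg = estimate_emissions_kg(
    GPU_HOURS, POWER_PER_GPU_W, CARBON_INTENSITY_G_PER_KWH
)
print(f"{emissions_kg:.1f} kg CO2eq")
```

Under these assumptions the energy used is about 11,073 kWh, and the result rounds to the 265.8 kg CO2eq reported above.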

Citation

BibTeX: Paper will be released soon.