Model Card for climategpt/climategpt-7b

This model is the 7B parameter variant of the ClimateGPT model release.

Overview

Developed by: AppTek, Eqtylab, Erasmus AI
Model type: decoder-only Transformer
Language(s) (NLP): natively supported: English; supported via cascaded MT on the web interface: Arabic, Bangla, Chinese (simplified), Dutch, Finno-Ugric, French, Germanic, Greek, Hebrew, Indonesian, Japanese, Korean, Lithuanian, Pashto, Persian, Portuguese, Russian, Spanish, Thai, Turkish, Vietnamese
License: TO BE ADDED
Finetuned from model: Llama2 7B
Repository: https://huggingface.co/climategpt/climategpt-7b
Paper: TO BE ADDED
Demo: TO BE ADDED

Uses

This model is intended to be directly used as a question answering model that is specialized in the climate domain.
The model is aimed at providing useful feedback for decision makers, scientists and journalists involved in climate discussions.
The model can also be used as a starting point for interested developers for further finetuning.
The model is NOT intended to be a general-purpose chatbot (although it has chat capabilities).
For the full system including cascaded MT, RAG, etc., we recommend using our demo website: TO BE ADDED.
For hands-on finetuning, deployment, and inference, we recommend using the Huggingface helpers directly.
For in-depth model conversion and finetuning, we recommend using https://github.com/epfLLM/Megatron-LLM/.
Despite the development team's efforts to eliminate them, this model, like every other chat-capable LLM, may generate biased, offensive, or inaccurate responses.

How to Get Started with the Model

After downloading the model in HF format, the HF helpers should work out of the box. The model can also be evaluated with https://github.com/EleutherAI/lm-evaluation-harness by passing the model identifier as --model_args pretrained=climategpt/climategpt-7b.
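
The following is a minimal sketch of both routes, not an official recipe. It assumes the `transformers` and `torch` packages are installed and that the weights are reachable under the repository name given in this card. This card does not specify a prompt or chat template, so the question is passed as plain text; that is an assumption. The helper names `load_model`, `answer`, and `lm_eval_command` are hypothetical and exist only for illustration. The evaluation command assumes a recent release of lm-evaluation-harness (v0.4 or later), where the CLI entry point is `lm_eval` and the Hugging Face backend is named `hf`; `hellaswag` is just an example task.

```python
import shlex

MODEL_ID = "climategpt/climategpt-7b"  # repository named in this model card


def load_model(model_id=MODEL_ID):
    """Download (or load from the local cache) the tokenizer and the model."""
    # Imported lazily so the command builder below works without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    return tokenizer, model


def answer(question, tokenizer, model, max_new_tokens=256):
    """Generate a completion for a plain-text question (assumption: no chat template)."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the newly generated text is returned.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)


def lm_eval_command(tasks, model_id=MODEL_ID, batch_size=8):
    """Build an lm-evaluation-harness command line (v0.4+ CLI assumed)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
    ]


# Print the evaluation command; "hellaswag" is only an example task name.
print(shlex.join(lm_eval_command(["hellaswag"])))
```

To run inference, call `tokenizer, model = load_model()` once and then `answer("How does sea level rise affect coastal cities?", tokenizer, model)`. Loading is kept in a separate function so that the multi-gigabyte download happens only when explicitly requested.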

Training

For the Llama2 training data, we refer the user to https://huggingface.co/meta-llama/Llama-2-7b-hf.
For continued pretraining, 4.2B climate domain tokens (tokenized by the Llama tokenizer) are used.
For instruction finetuning, about 272K instruction-completion pairs (from both the climate domain and the general domain) are used.

Environmental Impact

Hardware Type: H100
Hours used: 230 hrs
Cloud Provider: TO BE ADDED
Compute Region: TO BE ADDED
Carbon Emitted: TO BE ADDED

Citation

BibTeX: TO BE ADDED