GeohazardGPT
GeohazardGPT is the first large language model purpose-built for geohazard analysis and engineering practice. Built on a Qwen3-8B backbone with LoRA-based parameter-efficient fine-tuning, it is trained on a curated domain corpus of 883 million tokens spanning 12 major geological hazard categories. When combined with a retrieval-augmented generation (RAG) pipeline over authoritative engineering standards, GeohazardGPT achieves performance comparable to much larger models on both general geohazard knowledge and professional engineering examination tasks.
Model Details
| Property | Value |
|---|---|
| Base model | Qwen3-8B |
| Fine-tuning method | LoRA (rank 128, α 256) |
| Trainable parameters | 349M |
| Training data | ~100K instruction–response pairs |
| Domain corpus | 883M tokens / 1.82M documents |
| Hazard categories | 12 major / 49 subcategories |
| Context length | 32K tokens (extendable to 128K via YaRN) |
| Language | English |
| License | Apache 2.0 |
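As a rough sanity check on the trainable-parameter count, the sketch below counts the LoRA weights for rank 128. Two things here are assumptions and are not stated in this card: the Qwen3-8B layer dimensions (verify them against the base model's `config.json`) and the choice to adapt all seven linear projections in every transformer block. Under those assumptions the count lands at the 349M reported in the table.

```python
# Assumed Qwen3-8B dimensions (not given in this card; check config.json).
HIDDEN = 4096
INTERMEDIATE = 12288
NUM_LAYERS = 36
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

RANK = 128  # LoRA rank from the Model Details table

# (in_features, out_features) of each adapted linear layer.
# Adapting all seven projections is an assumption, not a documented setting.
PROJECTIONS = {
    "q_proj": (HIDDEN, NUM_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) next to each frozen weight."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable parameters (~{total / 1e6:.0f}M)")
# 349,175,808 trainable parameters (~349M)
```

The scaling factor α = 256 does not change the parameter count; it only rescales the adapter update by α / rank = 2.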
Intended Use
GeohazardGPT supports knowledge-intensive workflows in geohazard assessment and geotechnical engineering practice, including:
- Factual QA — precise recall of geohazard definitions, geomaterial properties, and code requirements
- Open-ended explanation — interpretation of hazard mechanisms, failure processes, and impact analysis
- Engineering recommendation — selection of stabilization measures, mitigation strategies, and monitoring plans for site-specific conditions
- Report summarization — structured extraction of key findings from investigation reports, case studies, and technical specifications
It is designed for use by geotechnical engineers, geohazard researchers, and practitioners who require technically accurate, domain-grounded responses. Model outputs should complement, not replace, professional field investigation and expert judgment.
Training Data
The instruction-tuning dataset was constructed using GeoInstruct, a taxonomy-guided and corpus-grounded instruction generation framework. It comprises:
- 49,776 domain-specific instruction–response pairs generated from a filtered geohazard corpus
- 51,699 general instruction samples (Alpaca-GPT4) to preserve general instruction-following capability
- 101,475 total training pairs (~100K)
The geohazard corpus draws from four sources:
| Source | Documents | Tokens |
|---|---|---|
| Open-access full-text papers | 1,613,089 | 788.9M |
| Licensed scientific books | 118,217 | 54.5M |
| Closed-access abstracts | 87,668 | 28.9M |
| Filtered C4 web corpus | 3,443 | 10.8M |
| Total | 1,822,417 | 883.1M |
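The totals in the table can be reproduced, and each source's share made explicit, with a few lines of Python. The figures are copied from the table above; the variable names are illustrative only.

```python
# (documents, tokens in millions) per source, copied from the table above.
corpus = {
    "Open-access full-text papers": (1_613_089, 788.9),
    "Licensed scientific books": (118_217, 54.5),
    "Closed-access abstracts": (87_668, 28.9),
    "Filtered C4 web corpus": (3_443, 10.8),
}

total_docs = sum(docs for docs, _ in corpus.values())
total_tokens_m = sum(tokens for _, tokens in corpus.values())

for name, (docs, tokens) in corpus.items():
    doc_share = docs / total_docs
    token_share = tokens / total_tokens_m
    print(f"{name:30s} {doc_share:6.1%} of documents, {token_share:6.1%} of tokens")

print(f"Total: {total_docs:,} documents, {total_tokens_m:.1f}M tokens")
```

Open-access papers account for roughly nine tenths of the corpus by either measure, so the other three sources act mainly as supplements.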
RAG Integration
For standards-based engineering questions, GeohazardGPT is designed to be used with a retrieve-and-rerank RAG pipeline:
- Offline indexing: technical specifications are chunked into sections/clauses and encoded with Qwen3-Embedding into a ChromaDB vector database
- Dense retrieval: top-30 candidate clauses are retrieved via approximate nearest-neighbor search
- Cross-encoder re-ranking: candidates are re-ranked using Qwen3-Reranker-4B; top-15 clauses are retained as final evidence
- Grounded generation: retrieved clauses are injected into the prompt alongside the query
The RAG corpus covers national and sectoral standards in geotechnical investigation, foundation engineering, seismic design, transportation infrastructure, and hydraulic engineering.
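The control flow of the four stages above can be sketched without any model downloads. This is a minimal illustration, not the project's pipeline: `toy_embed` and `toy_rerank_score` are hypothetical stand-ins for Qwen3-Embedding and Qwen3-Reranker-4B, the search is exact rather than approximate, an in-memory list replaces ChromaDB, and the prompt layout in `build_prompt` is an assumption. Only the top-30 and top-15 cut-offs come from the text.

```python
import math
from collections import Counter
from typing import List

TOP_K_RETRIEVE = 30  # candidates kept after dense retrieval (from the text)
TOP_K_RERANK = 15    # clauses kept after re-ranking (from the text)


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for Qwen3-Embedding: a bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_rerank_score(query: str, clause: str) -> float:
    """Hypothetical stand-in for Qwen3-Reranker-4B: query-term coverage."""
    q_terms = set(query.lower().split())
    c_terms = set(clause.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0


def retrieve_and_rerank(query: str, clauses: List[str]) -> List[str]:
    q_vec = toy_embed(query)
    # Stage 1: similarity search over all clauses, keep the top 30.
    candidates = sorted(
        clauses, key=lambda c: cosine(q_vec, toy_embed(c)), reverse=True
    )[:TOP_K_RETRIEVE]
    # Stage 2: score each (query, clause) pair jointly, keep the top 15.
    return sorted(
        candidates, key=lambda c: toy_rerank_score(query, c), reverse=True
    )[:TOP_K_RERANK]


def build_prompt(query: str, evidence: List[str]) -> str:
    """Inject the retained clauses ahead of the question (assumed layout)."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(evidence, 1))
    return f"Reference clauses:\n{numbered}\n\nQuestion: {query}"


if __name__ == "__main__":
    clauses = [f"clause {i} on pile foundation design" for i in range(40)]
    clauses.append("minimum factor of safety for landslide slope stability")
    query = "landslide slope stability"
    print(build_prompt(query, retrieve_and_rerank(query, clauses)))
```

The string returned by `build_prompt` would be passed as the user message in the Usage snippet. The two-stage split keeps the expensive pairwise scorer off the full index: it only ever sees the 30 candidates that survive the cheaper vector search.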
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pengfali/GeohazardGPT"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "What engineering measures should be adopted for a landslide with a tension crack at the crest and signs of local seepage?"
messages = [
    {"role": "system", "content": "You are an expert in geological disasters. This is a recommendation task."},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (skip the prompt).
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
Hardware Requirements
| Configuration | GPU Memory | Latency |
|---|---|---|
| GeohazardGPT (standalone) | ~10 GB | ~3.9 s/query |
| GeohazardGPT + RAG | ~26 GB | ~5.8 s/query |
Tested on NVIDIA A100 (80GB) under 4-bit deployment. The RAG configuration includes additional memory for Qwen3-Embedding-4B and Qwen3-Reranker-4B.
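For planning on other hardware, a back-of-the-envelope estimate of weight memory can help put these figures in context. The sketch below counts weights only and uses the nominal parameter counts implied by the model names (8B and 4B), which is an approximation. The measured figures above are higher because they also include activations, the KV cache, and runtime overhead. This card does not state the precision of the two auxiliary models, so both 4-bit and 16-bit are shown.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal sizes taken from the model names; real counts differ slightly.
LLM_PARAMS = 8e9  # GeohazardGPT (Qwen3-8B backbone)
AUX_PARAMS = 4e9  # Qwen3-Embedding-4B or Qwen3-Reranker-4B, each

print(f"LLM weights at 4-bit:        {weight_memory_gb(LLM_PARAMS, 4):.1f} GB")
print(f"One 4B aux model at 4-bit:   {weight_memory_gb(AUX_PARAMS, 4):.1f} GB")
print(f"One 4B aux model at 16-bit:  {weight_memory_gb(AUX_PARAMS, 16):.1f} GB")
```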
Citation
If you use GeohazardGPT in your research, please cite:
@article{ge2025geohazardgpt,
  title={GeohazardGPT: Towards Large Language Models for Geohazards},
  author={Ge, Qi and Li, Pengfa and Dai, Yinhao and Li, Jin and An, Ni and Yu, Yang and Lv, Qing and Sun, Hongyue},
  journal={Under review},
  year={2025}
}
License
This model is released under the Apache 2.0 License.