GeohazardGPT

GeohazardGPT is the first large language model purpose-built for geohazard analysis and engineering practice. Built on a Qwen3-8B backbone with LoRA-based parameter-efficient fine-tuning, it is trained on a curated domain corpus of 883 million tokens spanning 12 major geological hazard categories. When combined with a retrieval-augmented generation (RAG) pipeline over authoritative engineering standards, GeohazardGPT achieves performance comparable to much larger models on both general geohazard knowledge and professional engineering examination tasks.

Model Details

Property	Value
Base model	Qwen3-8B
Fine-tuning method	LoRA (rank 128, α 256)
Trainable parameters	349M
Training data	~100K instruction–response pairs
Domain corpus	883M tokens / 1.82M documents
Hazard categories	12 major / 49 subcategories
Context length	32K tokens (extendable to 128K via YaRN)
Language	English
License	Apache 2.0
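
The trainable-parameter figure in the table can be sanity-checked with a short calculation. The sketch below is a back-of-envelope estimate, not the authors' training configuration: it assumes the layer dimensions from the public Qwen3-8B config and assumes LoRA adapters on all seven linear projections of every decoder layer. The card does not state which modules were adapted. Under those assumptions the count comes out close to the reported 349M.

```python
# Back-of-envelope LoRA parameter count.
# ASSUMPTIONS (not stated in this model card): Qwen3-8B layer dimensions as in
# its public config, and adapters on all seven linear projections per layer.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 12288   # MLP intermediate_size
LAYERS = 36            # decoder layers
Q_OUT = 32 * 128       # attention heads x head_dim
KV_OUT = 8 * 128       # key/value heads x head_dim (grouped-query attention)
RANK = 128             # LoRA rank, from the table above
ALPHA = 256            # LoRA alpha, from the table above

# (in_features, out_features) of each adapted projection
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"{total:,} trainable parameters (~{total / 1e6:.0f}M), scaling = {scaling}")
```

With rank 128 and α 256, the low-rank update is scaled by α / rank = 2.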

Intended Use

GeohazardGPT supports knowledge-intensive workflows in geohazard assessment and geotechnical engineering practice, including:

Factual QA — precise recall of geohazard definitions, geomaterial properties, and code requirements
Open-ended explanation — interpretation of hazard mechanisms, failure processes, and impact analysis
Engineering recommendation — selection of stabilization measures, mitigation strategies, and monitoring plans for site-specific conditions
Report summarization — structured extraction of key findings from investigation reports, case studies, and technical specifications

It is designed for use by geotechnical engineers, geohazard researchers, and practitioners who require technically accurate, domain-grounded responses. Model outputs should complement, not replace, professional field investigation and expert judgment.

Training Data

The instruction-tuning dataset was constructed using GeoInstruct, a taxonomy-guided and corpus-grounded instruction generation framework. It comprises:

49,776 domain-specific instruction–response pairs generated from a filtered geohazard corpus
51,699 general instruction samples (Alpaca-GPT4) to preserve general instruction-following capability
101,475 training pairs in total (~100K)

The geohazard corpus draws from four sources:

Source	Documents	Tokens
Open-access full-text papers	1,613,089	788.9M
Licensed scientific books	118,217	54.5M
Closed-access abstracts	87,668	28.9M
Filtered C4 web corpus	3,443	10.8M
Total	1,822,417	883.1M
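
As a quick consistency check, the totals above can be recomputed from the per-source rows, along with each source's share of tokens and the split of the instruction set. This is only arithmetic on the figures quoted in this section; the variable names are illustrative.

```python
# Figures copied from this section: (documents, tokens in millions).
corpus = {
    "Open-access full-text papers": (1_613_089, 788.9),
    "Licensed scientific books": (118_217, 54.5),
    "Closed-access abstracts": (87_668, 28.9),
    "Filtered C4 web corpus": (3_443, 10.8),
}

total_docs = sum(docs for docs, _ in corpus.values())
total_tokens_m = round(sum(tokens for _, tokens in corpus.values()), 1)

for name, (_, tokens_m) in corpus.items():
    share = 100 * tokens_m / total_tokens_m
    print(f"{name:30s} {share:5.1f}% of tokens")

# Instruction-tuning mix: domain-specific pairs vs. general Alpaca-GPT4 samples.
domain_pairs, general_pairs = 49_776, 51_699
total_pairs = domain_pairs + general_pairs
domain_share = 100 * domain_pairs / total_pairs

print(f"{total_docs:,} documents, {total_tokens_m}M tokens")
print(f"{total_pairs:,} pairs, {domain_share:.1f}% domain-specific")
```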

RAG Integration

For standards-based engineering questions, GeohazardGPT is designed to be used with a retrieve-and-rerank RAG pipeline:

Offline indexing — technical specifications are chunked into sections/clauses and encoded with Qwen3-Embedding into a ChromaDB vector database
Dense retrieval — top-30 candidate clauses are retrieved via approximate nearest-neighbor search
Cross-encoder re-ranking — candidates are re-ranked using Qwen3-Reranker-4B; top-15 clauses are retained as final evidence
Grounded generation — retrieved clauses are injected into the prompt alongside the query

The RAG corpus covers national and sectoral standards in geotechnical investigation, foundation engineering, seismic design, transportation infrastructure, and hydraulic engineering.
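
The four stages above can be sketched in a few lines of plain Python. This is a toy illustration of the retrieve-and-rerank control flow only, not the authors' pipeline: the bag-of-words `embed` function stands in for the embedding model, the term-overlap `rerank_score` stands in for the cross-encoder, an in-memory list with exact search stands in for the vector database and its approximate search, and the clauses are made-up placeholders rather than text from any standard. All function names are hypothetical. Only the top-30 and top-15 cut-offs come from the description above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for a dense encoder: bag-of-words term counts."""
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rerank_score(query, clause):
    """Toy stand-in for a cross-encoder: fraction of query terms in the clause."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokenize(clause))) / len(query_terms)


def build_index(clauses):
    # Stage 1, offline indexing: encode every clause once.
    return [(clause, embed(clause)) for clause in clauses]


def retrieve(index, query, k=30):
    # Stage 2, dense retrieval (exact search here instead of approximate search).
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [clause for clause, _ in ranked[:k]]


def rerank(query, candidates, k=15):
    # Stage 3, re-ranking: score each (query, clause) pair and keep the best k.
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:k]


def build_prompt(query, evidence):
    # Stage 4, grounded generation: put the evidence in front of the question.
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(evidence, 1))
    return f"Reference clauses:\n{numbered}\n\nQuestion: {query}"


# Made-up placeholder clauses, not quotations from any standard.
clauses = [
    "Placeholder clause A: drainage measures for slopes with seepage.",
    "Placeholder clause B: bearing capacity of shallow foundations.",
    "Placeholder clause C: monitoring of tension cracks at a slope crest.",
]
query = "What measures apply to a slope with seepage and tension cracks?"

index = build_index(clauses)
evidence = rerank(query, retrieve(index, query, k=30), k=15)
prompt = build_prompt(query, evidence)
print(prompt)
```

The resulting `prompt` string would then be passed as the user message in the chat template. Retrieving a wider candidate set before re-ranking is the usual reason for two stages: the first pass is cheap and recall-oriented, while the second pass scores each query and clause jointly.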

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pengfali/GeohazardGPT"

# Load the tokenizer and model. device_map="auto" requires the accelerate package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "What engineering measures should be adopted for a landslide with a tension crack at the crest and signs of local seepage?"

# The system prompt sets the expert role and names the task type.
messages = [
    {"role": "system", "content": "You are an expert in geological disasters. This is a recommendation task."},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the echoed prompt.
prompt_length = inputs.input_ids.shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)

Hardware Requirements

Configuration	GPU Memory	Latency
GeohazardGPT (standalone)	~10 GB	~3.9 s/query
GeohazardGPT + RAG	~26 GB	~5.8 s/query

Tested on an NVIDIA A100 (80 GB) with the model deployed in 4-bit precision. The RAG configuration requires additional memory for Qwen3-Embedding-4B and Qwen3-Reranker-4B.
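
The reported figures can be put in context with a rough weights-only estimate. The sketch below is a back-of-envelope calculation under stated assumptions, not a measurement: it uses the nominal parameter counts implied by the model names (8B and 4B) and assumes the two auxiliary models are held in 16-bit precision, which this card does not state.

```python
GB = 1e9  # decimal gigabytes


def weight_memory_gb(n_params, bits_per_param):
    """Memory taken by the weights alone, ignoring caches and runtime overhead."""
    return n_params * bits_per_param / 8 / GB


# Nominal sizes taken from the model names; precision of the auxiliary
# models is an ASSUMPTION (16-bit), not something the card reports.
llm_4bit = weight_memory_gb(8e9, 4)
aux_16bit = 2 * weight_memory_gb(4e9, 16)  # embedding model + reranker

reported_standalone_gb, reported_rag_gb = 10, 26  # from the table above
reported_overhead_gb = reported_rag_gb - reported_standalone_gb

print(f"LLM weights at 4-bit:        ~{llm_4bit:.0f} GB")
print(f"Two 4B models at 16-bit:     ~{aux_16bit:.0f} GB")
print(f"Reported RAG overhead:       ~{reported_overhead_gb} GB")
```

Under these assumptions the 4-bit weights account for roughly 4 GB of the standalone footprint, with the remainder presumably going to the KV cache, activations, and runtime overhead, and the extra memory reported for the RAG configuration is in line with two 4B models at 16-bit precision.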

Citation

If you use GeohazardGPT in your research, please cite:

@article{ge2025geohazardgpt,
  title={GeohazardGPT: Towards Large Language Models for Geohazards},
  author={Ge, Qi and Li, Pengfa and Dai, Yinhao and Li, Jin and An, Ni and Yu, Yang and Lv, Qing and Sun, Hongyue},
  journal={Under review},
  year={2025}
}

License

This model is released under the Apache 2.0 License.
