GeohazardGPT

GeohazardGPT is the first large language model purpose-built for geohazard analysis and engineering practice. It is built on a Qwen3-8B backbone and adapted with LoRA-based parameter-efficient fine-tuning on roughly 100K instruction–response pairs, about half of which are generated from a curated domain corpus of 883 million tokens spanning 12 major geological hazard categories. When combined with a retrieval-augmented generation (RAG) pipeline over authoritative engineering standards, GeohazardGPT achieves performance comparable to much larger models on both general geohazard knowledge and professional engineering examination tasks.


Model Details

| Property | Value |
|---|---|
| Base model | Qwen3-8B |
| Fine-tuning method | LoRA (rank 128, α 256) |
| Trainable parameters | 349M |
| Training data | ~100K instruction–response pairs |
| Domain corpus | 883M tokens / 1.82M documents |
| Hazard categories | 12 major / 49 subcategories |
| Context length | 32K tokens (extendable to 128K via YaRN) |
| Language | English |
| License | Apache 2.0 |
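
The trainable-parameter figure in the table above can be sanity-checked with a short calculation. The sketch below assumes LoRA adapters on all seven linear projections of every transformer block and uses the published Qwen3-8B dimensions (hidden size 4096, 36 layers, 32 query heads, 8 key/value heads, head dimension 128, MLP width 12288). Neither the target modules nor these dimensions are stated in this card, so treat both as assumptions.

```python
# Assumed Qwen3-8B dimensions (not stated in this card).
HIDDEN = 4096
LAYERS = 36
Q_DIM = 32 * 128      # query heads * head dimension
KV_DIM = 8 * 128      # key/value heads * head dimension (grouped-query attention)
MLP = 12288
RANK = 128            # LoRA rank reported in the Model Details table

# (in_features, out_features) of each adapted projection.
# Assumption: all seven linear layers in each block carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, Q_DIM),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (Q_DIM, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(rank: int) -> int:
    """A LoRA pair A (rank x in) and B (out x rank) adds rank * (in + out) weights."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * LAYERS


total = lora_params(RANK)
print(f"{total:,} trainable parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count rounds to the 349M reported in the table.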

Intended Use

GeohazardGPT supports knowledge-intensive workflows in geohazard assessment and geotechnical engineering practice, including:

  • Factual QA — precise recall of geohazard definitions, geomaterial properties, and code requirements
  • Open-ended explanation — interpretation of hazard mechanisms, failure processes, and impact analysis
  • Engineering recommendation — selection of stabilization measures, mitigation strategies, and monitoring plans for site-specific conditions
  • Report summarization — structured extraction of key findings from investigation reports, case studies, and technical specifications

It is designed for use by geotechnical engineers, geohazard researchers, and practitioners who require technically accurate, domain-grounded responses. Model outputs should complement, not replace, professional field investigation and expert judgment.


Training Data

The instruction-tuning dataset was constructed using GeoInstruct, a taxonomy-guided and corpus-grounded instruction generation framework. It comprises:

  • 49,776 domain-specific instruction–response pairs generated from a filtered geohazard corpus
  • 51,699 general instruction samples (Alpaca-GPT4) to preserve general instruction-following capability
  • 101,475 training pairs in total (~100K)

The geohazard corpus draws from four sources:

| Source | Documents | Tokens |
|---|---|---|
| Open-access full-text papers | 1,613,089 | 788.9M |
| Licensed scientific books | 118,217 | 54.5M |
| Closed-access abstracts | 87,668 | 28.9M |
| Filtered C4 web corpus | 3,443 | 10.8M |
| Total | 1,822,417 | 883.1M |
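
The totals in the table can be reproduced, and the relative weight of each source made explicit, with a few lines of arithmetic. The figures are copied from the table; the variable names are illustrative.

```python
# (documents, tokens in millions) per source, copied from the corpus table.
corpus = {
    "Open-access full-text papers": (1_613_089, 788.9),
    "Licensed scientific books": (118_217, 54.5),
    "Closed-access abstracts": (87_668, 28.9),
    "Filtered C4 web corpus": (3_443, 10.8),
}

total_docs = sum(docs for docs, _ in corpus.values())
total_tokens_m = sum(tokens_m for _, tokens_m in corpus.values())

# Print each source with its share of the total token count.
for name, (docs, tokens_m) in corpus.items():
    share = 100 * tokens_m / total_tokens_m
    print(f"{name:<30} {docs:>9,} docs  {tokens_m:>6.1f}M tokens  {share:5.1f}%")

print(f"{'Total':<30} {total_docs:>9,} docs  {total_tokens_m:>6.1f}M tokens")
```

Open-access full-text papers account for roughly 89% of the tokens, so the corpus is dominated by peer-reviewed literature rather than web text.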

RAG Integration

For standards-based engineering questions, GeohazardGPT is designed to be used with a retrieve-and-rerank RAG pipeline:

  1. Offline indexing — technical specifications are chunked into sections/clauses and encoded with Qwen3-Embedding into a ChromaDB vector database
  2. Dense retrieval — top-30 candidate clauses are retrieved via approximate nearest-neighbor search
  3. Cross-encoder re-ranking — candidates are re-ranked using Qwen3-Reranker-4B; top-15 clauses are retained as final evidence
  4. Grounded generation — retrieved clauses are injected into the prompt alongside the query
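
The control flow of steps 1 to 3 can be sketched in plain Python. This is a toy illustration, not the pipeline used with the model: `embed` and `rerank_score` are hypothetical stand-ins for the embedding and re-ranking models, exhaustive cosine search replaces the approximate nearest-neighbor lookup in the vector database, and the example clauses are invented placeholders rather than quotations from any standard.

```python
import math
import re
from collections import Counter

TOP_K_RETRIEVE = 30   # dense-retrieval candidates (step 2)
TOP_K_FINAL = 15      # clauses kept after re-ranking (step 3)


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rerank_score(query: str, clause: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms found in the clause."""
    q, c = set(tokenize(query)), set(tokenize(clause))
    return len(q & c) / len(q) if q else 0.0


def retrieve_and_rerank(query: str, clauses: list[str]) -> list[str]:
    # Step 1 (done offline in practice): encode every clause once.
    index = [(clause, embed(clause)) for clause in clauses]
    # Step 2: keep the TOP_K_RETRIEVE clauses closest to the query embedding.
    q_vec = embed(query)
    candidates = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:TOP_K_RETRIEVE]
    # Step 3: score each (query, clause) pair jointly and keep the best TOP_K_FINAL.
    reranked = sorted(candidates, key=lambda item: rerank_score(query, item[0]), reverse=True)
    return [clause for clause, _ in reranked[:TOP_K_FINAL]]


# Invented placeholder clauses, for illustration only.
clauses = [
    "4.2.1 Slope stability analysis shall consider groundwater seepage.",
    "5.1.3 Pile foundations shall be checked for negative skin friction.",
    "4.2.4 Tension cracks at the slope crest shall be sealed to limit infiltration.",
]
evidence = retrieve_and_rerank("How should tension cracks at a slope crest be treated?", clauses)
print(evidence[0])
```

The two-stage design keeps the expensive pairwise scorer off the full index: the cheap vector search narrows the corpus to 30 candidates, and only those are scored jointly with the query.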

The RAG corpus covers national and sectoral standards in geotechnical investigation, foundation engineering, seismic design, transportation infrastructure, and hydraulic engineering.


Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pengfali/GeohazardGPT"

# Load the tokenizer and model. torch_dtype="auto" uses the checkpoint's dtype;
# device_map="auto" places the weights on available devices (requires accelerate).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = "What engineering measures should be adopted for a landslide with a tension crack at the crest and signs of local seepage?"

# The system message sets the domain role and names the task type.
messages = [
    {"role": "system", "content": "You are an expert in geological disasters. This is a recommendation task."},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
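
When the model is used with the RAG pipeline, the retrieved clauses have to be placed in the prompt alongside the question. This card does not specify the prompt layout, so the helper below is a hypothetical sketch: the name `build_grounded_messages`, the numbered-evidence format, and the placeholder clause are all assumptions.

```python
def build_grounded_messages(question: str, clauses: list[str], system_prompt: str) -> list[dict]:
    """Prepend numbered evidence clauses to the user question (hypothetical layout)."""
    evidence = "\n".join(f"[{i}] {clause}" for i, clause in enumerate(clauses, start=1))
    user_content = (
        "Reference clauses:\n"
        f"{evidence}\n\n"
        f"Question: {question}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_grounded_messages(
    question="What drainage measures apply to a slope with local seepage?",
    clauses=["(placeholder clause text retrieved from a standard)"],
    system_prompt="You are an expert in geological disasters. This is a recommendation task.",
)
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` list in the usage snippet, so it can be passed to `tokenizer.apply_chat_template` unchanged.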

Hardware Requirements

| Configuration | GPU memory | Latency |
|---|---|---|
| GeohazardGPT (standalone) | ~10 GB | ~3.9 s/query |
| GeohazardGPT + RAG | ~26 GB | ~5.8 s/query |

Tested on NVIDIA A100 (80GB) under 4-bit deployment. The RAG configuration includes additional memory for Qwen3-Embedding-4B and Qwen3-Reranker-4B.
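
For planning purposes, the weight footprint at different precisions can be estimated from the parameter count alone. The sketch below assumes a round 8 billion parameters and ignores the KV cache, activations, and quantization metadata, so its results are lower bounds, not reproductions of the measured figures in the table.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumed round figure for an 8B-parameter model

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights take about 4 GB, which leaves room within the ~10 GB measured total for runtime overhead. Loading the same model in 16-bit precision would need about 16 GB for the weights alone.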


Citation

If you use GeohazardGPT in your research, please cite:

@article{ge2025geohazardgpt,
  title={GeohazardGPT: Towards Large Language Models for Geohazards},
  author={Ge, Qi and Li, Pengfa and Dai, Yinhao and Li, Jin and An, Ni and Yu, Yang and Lv, Qing and Sun, Hongyue},
  journal={Under review},
  year={2025}
}

License

This model is released under the Apache 2.0 License.

