---
language:
- nl
thumbnail: >-
  geobertje.png
tags:
- GEOBERTje
---

# GEOBERTje
**A Domain-Adapted Dutch Language Model Trained on Geological Borehole Descriptions**

<img src="geobertje.png" width="400">


## Description 
GEOBERTje is a language model built on the [BERTje](https://github.com/wietsedv/bertje) architecture, with 109 million parameters.
It was further trained with masked language modeling (MLM) on approximately 300,000 Dutch-language borehole descriptions
from the Flanders region of Belgium.
It can serve as the base language model for a variety of geological applications.
For instance, because the model has learned geological terminology from borehole data,
it can help professionals interpret subsurface information and build detailed 3D representations of geological structures.
This can support exploration, interpretation, and analysis in geology.


## Hugging Face API
How to use GEOBERTje as a pipeline:
```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="hghcomphys/geobertje-base-dutch-uncased")
pipe("grijs fijn zand met enkele [MASK]")
```
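
The `fill-mask` pipeline returns a list of candidate fills, each a dictionary that includes `score`, `token_str`, and `sequence` keys. The following is a minimal sketch for listing the top candidates. The helper names `format_predictions` and `demo` are illustrative and are not part of the GEOBERTje repository.

```python
def format_predictions(predictions, top_k=3):
    """Return one 'token (score)' string per candidate, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f"{p['token_str']} ({p['score']:.3f})" for p in ranked[:top_k]]


def demo():
    # Imported here so the helper above works without transformers installed.
    from transformers import pipeline

    pipe = pipeline("fill-mask", model="hghcomphys/geobertje-base-dutch-uncased")
    for line in format_predictions(pipe("grijs fijn zand met enkele [MASK]")):
        print(line)
```

Calling `demo()` downloads the model on first use and prints the three highest-scoring fills for the masked token.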

How to load it directly:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hghcomphys/geobertje-base-dutch-uncased")
model = AutoModel.from_pretrained("hghcomphys/geobertje-base-dutch-uncased")
```

## Application
To showcase a potential application of GEOBERTje, we **fine-tuned** it on a small dataset of 3,000 labeled samples to classify lithological categories.
For example, `Grijs kleiig zand, zeer fijn, met enkele grindjes` is classified as main lithology `fijn zand`, second lithology `klei`, and third lithology `grind`.
Our classifier obtained higher accuracy than both conventional rule-based approaches and zero-shot classification using GPT-4.
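
As a rough sketch of how such a fine-tuning setup could start, the snippet below attaches a classification head to GEOBERTje with `AutoModelForSequenceClassification`. It is not necessarily the setup we used: the helper names `build_label_maps` and `load_classifier` are hypothetical, the three labels are only those from the example above, and using one classifier per lithology rank is an assumption. The repository linked under References contains the actual training code.

```python
def build_label_maps(labels):
    """Map each distinct label to an integer id and back, in sorted order."""
    unique = sorted(set(labels))
    label2id = {label: i for i, label in enumerate(unique)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def load_classifier(labels, checkpoint="hghcomphys/geobertje-base-dutch-uncased"):
    # Imported here so build_label_maps can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    label2id, id2label = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # Adds a randomly initialised classification layer on top of the encoder;
    # it must be fine-tuned on labeled descriptions before it is useful.
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=len(label2id), id2label=id2label, label2id=label2id
    )
    return tokenizer, model


# Hypothetical label set taken from the example above, not the full taxonomy.
label2id, id2label = build_label_maps(["fijn zand", "klei", "grind"])
```

The returned tokenizer and model can then be passed to a standard training loop or to the `Trainer` class from `transformers`.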

The figure below compares the accuracy of the rule-based, GPT-4, and GEOBERTje models in classifying
main, secondary, and tertiary lithology from geological drill core descriptions.
<img src="accuracy.png" height="400">

## Citation
If you use GEOBERTje or fine-tune the model, please include this citation.
```bibtex
@misc{ghorbanfekr2024classificationgeologicalboreholedescriptions,
      title={Classification of Geological Borehole Descriptions Using a Domain Adapted Large Language Model}, 
      author={Hossein Ghorbanfekr and Pieter Jan Kerstens and Katrijn Dirix},
      year={2024},
      eprint={2407.10991},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.10991}
}
```

## References
- See the [GEOBERTje repository](https://github.com/VITObelgium/geobertje) for further details on the training of GEOBERTje and its fine-tuning for the lithology classification task.
- The reference datasets are available on [Hugging Face](https://huggingface.co/datasets/hghcomphys/geological-borehole-descriptions-dutch).
- Check out our paper on [arXiv](https://arxiv.org/abs/2407.10991).