---
license: apache-2.0
base_model: distilgpt2
tags:
- generated_from_trainer
model-index:
- name: distilgpt2-finetuned-microbiology
results: []
widget:
- text: "Microorganisms are involved in the decomposition of organic matter,"
- text: "Some microorganisms, such as yeast and certain bacteria, can convert"
- text: "Microbial biotechnology can be used to increase the efficiency and"
- text: "Some viruses carry oncogenes, which are genes that"
- text: "Employing a diverse group of microorganisms with complementary pollutant degradation"
- text: "Synthetic biology is an interdisciplinary field that combines"
- text: "Disruption of the microbiota due to antifungal drug use can"
- text: "Knowledge of microorganisms' genetic makeup can be used to"
- text: "Bacteriophages, or phages, are viruses that"
- text: "Microorganisms, such as bacteria and yeast, can be genetically engineered to produce"
- text: "Changes in microbial diversity within aquatic ecosystems can"
---
# distilgpt2-finetuned-microbiology
## Model description
A small language model based on [distilgpt2](https://huggingface.co/distilgpt2) and fine-tuned on microbiology-related text data.
It achieves the following results on the evaluation set:
- Loss: 2.1073
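Assuming the reported loss is the standard causal language modeling cross-entropy, this corresponds to an evaluation perplexity of roughly exp(2.1073) ≈ 8.2.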
## Intended uses & limitations
This model was finetuned solely for academic purposes, specifically:
- Notes enhancement
- Study
- Research
Keep in mind that the model does not always provide correct information, so **always** double-check everything.
_distilgpt2-finetuned-microbiology_ must not be used for medical/health purposes, as it was not trained for that.
Besides the limitations already highlighted for distilgpt2, _distilgpt2-finetuned-microbiology_ was trained on a small dataset of microbiology-related texts, so its knowledge is not nearly as comprehensive as that of many other sources of information. It is still useful when employed as an _assistant_, not as a substitute for human researchers/experts.
## Training and evaluation data
Training data were taken from the [Biology dataset on HuggingFace](https://huggingface.co/datasets/andersonbcdefg/biology), and microbiology texts were extracted from the `.parquet` file associated with this dataset, following this workflow:
### Data preprocessing and extraction
Find all files and scripts on [GitHub](https://github.com/AstraBert/distilgpt2-finetuned-microbiology):
```bash
# UNZIP LARGE DATA FILES
gzip -d data/*.gz
# CONVERT .parquet FILE TO .jsonl
python3 scripts/parquet_to_jsonl.py
# FILTER MICROBIOLOGY TEXTS FROM microbiology.jsonl
python3 scripts/data_preprocess.py
```
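The actual scripts are in the GitHub repository linked above. As a rough illustration of what the conversion and filtering steps do, a minimal sketch could look like this (file paths, the `text` column name, and the filtering keyword are assumptions, not the exact scripts):
```python
# Minimal sketch of the preprocessing steps (assumed; see the GitHub repo for the real scripts)
import pandas as pd

# Convert the dataset's .parquet file to JSON Lines (paths are assumptions)
df = pd.read_parquet("data/biology.parquet")
df.to_json("data/biology.jsonl", orient="records", lines=True)

# Keep only microbiology-related entries (column name and keyword are assumptions)
micro = df[df["text"].str.contains("microb", case=False, na=False)]
micro.to_json("data/microbiology.jsonl", orient="records", lines=True)
```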
## Training procedure
The training procedure is the one described in this [HuggingFace notebook](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling.ipynb).
You can find the related script on [GitHub](https://github.com/AstraBert/distilgpt2-finetuned-microbiology).
Once you have preprocessed and extracted everything, you only have to run this command:
```bash
# GENERATE MODEL
python3 scripts/build_distilgpt2-finetuned-microbiology.py
```
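The script itself is in the GitHub repository. As a rough illustration, the core of a fine-tuning run following the linked notebook and using the hyperparameters listed below might look like this (dataset path, text column, and sequence handling are assumptions, not the exact script):
```python
# Minimal sketch of the fine-tuning run (assumed; not the exact repository script)
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# GPT-2-style models have no pad token by default; reuse the EOS token
tokenizer.pad_token = tokenizer.eos_token

# Load the preprocessed microbiology texts (path and "text" field are assumptions)
dataset = load_dataset("json", data_files="data/microbiology.jsonl", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="distilgpt2-finetuned-microbiology",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3.0,
    seed=42,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```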
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| No log | 1.0 | 364 | 2.2399 |
| 2.4867 | 2.0 | 728 | 2.1351 |
| 2.213 | 3.0 | 1092 | 2.1073 |
### Framework versions
- Transformers 4.38.1
- Pytorch 2.1.0+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2
- accelerate 0.27.2
- scikit-learn 1.2.2
- huggingface_hub 0.20.3
## Use the model in Python
Here is a code snippet showing how to load the model in Python:
```python
# Load the necessary dependencies
from transformers import AutoTokenizer, AutoModelForCausalLM

# Model checkpoint on the Hugging Face Hub
model_checkpoint = "as-cle-bert/distilgpt2-finetuned-microbiology"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
```
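A minimal generation example, using one of the widget prompts above (the sampling parameters are illustrative choices, not recommendations from this card):
```python
# Generate a short continuation for a microbiology-related prompt
prompt = "Bacteriophages, or phages, are viruses that"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.95,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```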
## References
- [HuggingFace notebook](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling.ipynb) - template for building _distilgpt2-finetuned-microbiology_
- [Biology dataset on HuggingFace](https://huggingface.co/datasets/andersonbcdefg/biology) - microbiology texts were extracted from the `.parquet` file associated with this dataset and put in [microbiology.jsonl](./data/microbiology.jsonl)