pranav-s committed on
Commit
aa693c2
1 Parent(s): 16991ec

Upload 5 files


Add supporting files

Files changed (6)
  1. .gitattributes +1 -0
  2. License +63 -0
  3. README.md +80 -0
  4. config.json +25 -0
  5. training_DOI.txt +3 -0
  6. vocab.txt +0 -0
.gitattributes CHANGED
@@ -32,3 +32,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ training_DOI.txt filter=lfs diff=lfs merge=lfs -text
License ADDED
@@ -0,0 +1,63 @@
+ GENERAL PUBLIC USE LICENSE AGREEMENT
+
+ PLEASE READ THIS DOCUMENT CAREFULLY BEFORE UTILIZING THE PROGRAM
+
+ BY UTILIZING THIS PROGRAM, YOU AGREE TO BECOME BOUND BY THE TERMS OF THIS LICENSE. IF YOU DO NOT AGREE TO THE TERMS OF THIS LICENSE, DO NOT USE THIS PROGRAM OR ANY PORTION THEREOF IN ANY FORM OR MANNER.
+
+ This Program is licensed, not sold, to you by GEORGIA TECH RESEARCH CORPORATION ("GTRC"), owner of all code and accompanying documentation (hereinafter “Program”), for use only under the terms of this License, and GTRC reserves any rights not expressly granted to you.
+
+ 1. In accordance with the terms and conditions set forth herein, this License allows you to:
+
+ (a) make copies and distribute copies of the Program’s source code, provided that any such copy clearly displays any and all appropriate copyright notices and the disclaimer of warranty as set forth in Articles 5 and 6 of this License. All notices that refer to this License, the developers of this Program, and to the absence of any warranty must be kept intact at all times. A copy of this License must accompany any and all copies of the Program distributed to third parties.
+
+ Notwithstanding anything to the contrary contained herein, a fee may be charged to cover the actual cost of the physical act of transferring a copy to a third party. At no time shall the Program be sold for commercial gain, either alone or incorporated with other program(s), without entering into a separate agreement with GTRC.
+
+ (b) modify the original copy or copies of the Program or any portion thereof (“Modification(s)”). Modifications may be copied and distributed under the terms and conditions set forth above, provided the following conditions are met:
+
+ i) any and all modified files must be affixed with prominent notices stating that you have changed the files and the date that the changes occurred.
+
+ ii) any work that you distribute, publish, or make available, that in whole or in part contains portions of the Program or derivative work thereof, must be licensed at no charge to all third parties under the terms of this License.
+
+ iii) if the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to display and/or print an announcement with all appropriate copyright notices and the disclaimer of warranty as set forth in Articles 5 and 6 of this License clearly displayed. In addition, you must provide reasonable access to this License to the user.
+
+ Any portion of a Modification that can be reasonably considered independent of the Program and a separate work in and of itself is not subject to the terms and conditions set forth in this License, as long as it is not distributed with the Program or any portion thereof.
+
+ 2. This License further allows you to copy and distribute the Program, or a work based on it as set forth in Article 1 Section b, in object code or executable form under the terms of Article 1 above, provided that you also either:
+
+ i) accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Article 1, on a medium customarily used for software interchange; or,
+
+ ii) accompany it with a written offer, valid for no less than three (3) years from the time of distribution, to give any third party, for no consideration greater than the cost of physical transfer, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Article 1 on a medium customarily used for software interchange; or,
+
+ 3. Export Law Assurance.
+
+ You agree that the Software will not be shipped, transferred, or exported directly into any country prohibited by the United States Export Administration Act and the regulations thereunder, nor will it be used for any purpose prohibited by the Act.
+
+ 4. Termination.
+
+ If at any time you are unable to comply with any portion of this License, you must immediately cease use of the Program and all distribution activities involving the Program or any portion thereof.
+
+ 5. Disclaimer of Warranties and Limitation on Liability.
+
+ YOU ACCEPT THE PROGRAM ON AN "AS IS" BASIS. GTRC MAKES NO WARRANTY THAT ALL ERRORS CAN BE OR HAVE BEEN ELIMINATED FROM PROGRAM. GTRC SHALL NOT BE RESPONSIBLE FOR LOSSES OF ANY KIND RESULTING FROM THE USE OF PROGRAM AND ITS ACCOMPANYING DOCUMENT(S), AND CAN IN NO WAY PROVIDE COMPENSATION FOR ANY LOSSES SUSTAINED, INCLUDING BUT NOT LIMITED TO ANY OBLIGATION, LIABILITY, RIGHT, CLAIM OR REMEDY FOR TORT, OR FOR ANY ACTUAL OR ALLEGED INFRINGEMENT OF PATENTS, COPYRIGHTS, TRADE SECRETS, OR SIMILAR RIGHTS OF THIRD PARTIES, NOR ANY BUSINESS EXPENSE, MACHINE DOWNTIME OR DAMAGES CAUSED TO YOU BY ANY DEFICIENCY, DEFECT OR ERROR IN PROGRAM OR MALFUNCTION THEREOF, NOR ANY INCIDENTAL OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED. GTRC DISCLAIMS ALL WARRANTIES, BOTH EXPRESS AND IMPLIED, RESPECTING THE USE AND OPERATION OF PROGRAM AND ITS ACCOMPANYING DOCUMENTATION, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR PARTICULAR PURPOSE AND ANY IMPLIED WARRANTY ARISING FROM COURSE OF PERFORMANCE, COURSE OF DEALING OR USAGE OF TRADE. GTRC MAKES NO WARRANTY THAT PROGRAM IS ADEQUATELY OR COMPLETELY DESCRIBED IN, OR BEHAVES IN ACCORDANCE WITH, ANY ACCOMPANYING DOCUMENTATION. THE USER OF PROGRAM IS EXPECTED TO MAKE THE FINAL EVALUATION OF PROGRAM'S USEFULNESS IN USER'S OWN ENVIRONMENT.
+
+ GTRC represents that, to the best of its knowledge, the software furnished hereunder does not infringe any copyright or patent.
+
+ GTRC shall have no obligation for support or maintenance of Program.
+
+ 6. Copyright Notice.
+
+ THE SOFTWARE AND ACCOMPANYING DOCUMENTATION ARE COPYRIGHTED WITH ALL RIGHTS RESERVED BY GTRC. UNDER UNITED STATES COPYRIGHT LAWS, THE SOFTWARE AND ITS ACCOMPANYING DOCUMENTATION MAY NOT BE COPIED EXCEPT AS GRANTED HEREIN.
+
+ You acknowledge that GTRC is the sole owner of Program, including all copyrights subsisting therein. Any and all copies or partial copies of Program made by you shall bear the copyright notice set forth below and affixed to the original version, or such other notice as GTRC shall designate. Such notice shall also be affixed to all improvements or enhancements of Program made by you, or portions thereof, in such a manner and location as to give reasonable notice of GTRC's copyright as set forth in Article 1.
+
+ Said copyright notice shall read as follows:
+
+ Copyright 2022
+ Georgia Tech Research Corporation
+ Atlanta, Georgia 30332-4024
+ All Rights Reserved
README.md CHANGED
@@ -1,3 +1,83 @@
  ---
+ language: en
+ tags:
+ - transformers
+ - feature-extraction
+ - materials
  license: other
  ---
+
+ # MaterialsBERT
+
+ This model is a fine-tuned version of the [PubMedBERT model](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext), trained on a dataset of 2.4 million materials science abstracts.
+ It was introduced in [this paper](https://arxiv.org/abs/2209.13136). This model is uncased.
+
+ ## Model description
+
+ Domain-specific fine-tuning has been [shown](https://arxiv.org/abs/2007.15779) to improve performance on a variety of downstream NLP tasks. MaterialsBERT fine-tunes PubMedBERT, a pre-trained language model trained on biomedical literature; this starting point was chosen because the biomedical domain is close to the materials science domain. When further fine-tuned on a variety of downstream sequence labeling tasks in materials science, MaterialsBERT outperformed the other baseline language models tested on three out of five datasets.
+
+ ## Intended uses & limitations
+
+ You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to
+ be fine-tuned on materials-science-relevant downstream tasks.
+
+ Note that this model is primarily aimed at being fine-tuned on tasks that use a sentence or a paragraph (potentially masked)
+ to make decisions, such as sequence classification, token classification, or question answering.
+
+ ## How to Use
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import BertForMaskedLM, BertTokenizer
+
+ tokenizer = BertTokenizer.from_pretrained('pranav-s/MaterialsBERT')
+ model = BertForMaskedLM.from_pretrained('pranav-s/MaterialsBERT')
+
+ text = "Enter any text you like"
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ ```
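Token-level features such as the model's `last_hidden_state` are often pooled into a single sentence vector before use in downstream tasks. A minimal masked mean-pooling sketch in plain PyTorch (dummy tensors stand in for the model output; `mean_pool` is an illustrative helper, not part of this repository):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over real tokens only.
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts

# Dummy stand-ins for model output: batch=2, seq=4, hidden_size=768
hidden = torch.randn(2, 4, 768)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
vec = mean_pool(hidden, mask)
print(vec.shape)  # torch.Size([2, 768])
```

In practice you would pass the real `output.last_hidden_state` and the tokenizer's `attention_mask` instead of the dummy tensors.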
+
+ ## Training data
+
+ A fine-tuning corpus of 2.4 million materials science abstracts was used. The DOIs of the journal articles used are provided in the file training_DOI.txt.
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 5e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 42
+ - optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 3.0
+ - mixed_precision_training: Native AMP
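With `lr_scheduler_type: linear` and no warmup reported, the learning rate decays linearly from 5e-05 to zero over training. A tiny stdlib-only sketch of that schedule (the step count is hypothetical; no warmup is assumed):

```python
def linear_lr(step, total_steps, base_lr=5e-05):
    """Linearly decay base_lr to 0 over total_steps (no warmup assumed)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

total = 1000  # hypothetical total optimizer steps
print(linear_lr(0, total))     # base learning rate at the start
print(linear_lr(500, total))   # half the base rate at the midpoint
print(linear_lr(total, total)) # decays to 0.0 at the end
```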
58
+
59
+
60
+ ### Framework versions
61
+
62
+ - Transformers 4.17.0
63
+ - Pytorch 1.10.2
64
+ - Datasets 1.18.3
65
+ - Tokenizers 0.11.0
66
+
67
+
68
+ ## Citation
69
+
70
+ If you find MaterialsBERT useful in your research, please cite the following paper:
71
+
72
+ ```latex
73
+ @misc{materialsbert,
74
+ author = {Pranav Shetty, Arunkumar Chitteth Rajan, Christopher Kuenneth, Sonkakshi Gupta, Lakshmi Prerana Panchumarti, Lauren Holm, Chao Zhang, and Rampi Ramprasad},
75
+ title = {A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing},
76
+ year = {2022},
77
+ eprint = {arXiv:2209.13136},
78
+ }
79
+ ```
80
+
81
+ <a href="https://huggingface.co/exbert/?model=pranav-s/MaterialsBERT">
82
+ <img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
83
+ </a>
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "/data/pranav/projects/matbert/pretrained_models/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/",
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.17.0.dev0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
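The sizes in this config pin down a standard BERT-base encoder. A back-of-the-envelope parameter count from those numbers (a rough sketch: the pooler, MLM head, and bias-tying details are ignored):

```python
# All sizes taken from config.json above.
vocab, hidden, layers, inter = 30522, 768, 12, 3072
max_pos, type_vocab = 512, 2

# Word + position + token-type embeddings, plus one LayerNorm (weight + bias).
embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
per_layer = (
    4 * (hidden * hidden + hidden)  # Q, K, V, and attention output projections
    + 2 * 2 * hidden                # two LayerNorms per layer
    + hidden * inter + inter        # FFN up-projection
    + inter * hidden + hidden       # FFN down-projection
)
total = embeddings + layers * per_layer
print(f"{total / 1e6:.1f}M parameters")  # → 108.9M parameters
```

This lands at the familiar ~110M scale of BERT-base checkpoints, consistent with MaterialsBERT being a fine-tune of PubMedBERT rather than a new architecture.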
training_DOI.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2eb7af71f0c95a953fd54503e667c41d512513b7a57d464ec8c78ec0d99cba8f
+ size 59030492
vocab.txt ADDED
The diff for this file is too large to render. See raw diff