PCSciBERT_uncased was initialized from the uncased variant of SciBERT (https://huggingface.co/allenai/scibert_scivocab_uncased) and then pre-trained on text from 1,560,661 arXiv research articles in the physics and computer science domains. The tokenizer for PCSciBERT_uncased uses the same vocabulary as allenai/scibert_scivocab_uncased.
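A minimal sketch of loading the model and checking the shared-vocabulary claim with the Hugging Face `transformers` library. It assumes the model is published under the repository id `jmzk96/PCSciBERT_uncased` and that `transformers` (with a PyTorch backend) is installed. The helper names `load_tokenizer_and_model` and `same_vocabulary` are hypothetical and exist only for illustration.

```python
# Assumed Hub repository ids; adjust if the model lives elsewhere.
MODEL_ID = "jmzk96/PCSciBERT_uncased"
BASE_ID = "allenai/scibert_scivocab_uncased"


def load_tokenizer_and_model(model_id: str = MODEL_ID):
    """Download (or read from the local cache) the tokenizer and encoder weights."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


def same_vocabulary(id_a: str = MODEL_ID, id_b: str = BASE_ID) -> bool:
    """Return True if both tokenizers map exactly the same tokens to the same ids."""
    from transformers import AutoTokenizer

    vocab_a = AutoTokenizer.from_pretrained(id_a).get_vocab()
    vocab_b = AutoTokenizer.from_pretrained(id_b).get_vocab()
    return vocab_a == vocab_b
```

Calling `load_tokenizer_and_model()` fetches the files from the Hub on first use, so it needs network access. If the tokenizer description above holds, `same_vocabulary()` should return `True`.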

The model was also evaluated on downstream named entity recognition using the adsabs/WIESP2022-NER and CS-NER (https://github.com/jd-coderepos/contributions-ner-cs/tree/main) datasets. PCSciBERT_uncased achieved higher micro F1 scores than SciBERT (uncased) on both WIESP (micro F1: 81.54%) and CS-NER (micro F1: 75.67%).

It improves on SciBERT (uncased) by 0.26% on the CS-NER test set and by 0.8% on the WIESP test set.
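For readers unfamiliar with the metric, the sketch below shows how a micro-averaged F1 score pools counts across entity types before computing a single score. It assumes entity-level true positive, false positive, and false negative counts; the exact scoring procedure used in the evaluation above is not stated here. The `Counts` class, the entity labels, and the numbers are hypothetical toy values, not results from either dataset.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Counts:
    """Prediction counts for one entity type."""
    tp: int  # predicted entities that match a gold entity
    fp: int  # predicted entities with no matching gold entity
    fn: int  # gold entities the model missed


def micro_f1(per_type: Mapping[str, Counts]) -> float:
    """Micro F1: sum counts over all entity types, then F1 = 2TP / (2TP + FP + FN)."""
    tp = sum(c.tp for c in per_type.values())
    fp = sum(c.fp for c in per_type.values())
    fn = sum(c.fn for c in per_type.values())
    denominator = 2 * tp + fp + fn
    return 0.0 if denominator == 0 else 2 * tp / denominator


# Hypothetical toy counts for two made-up entity types.
toy_counts = {
    "Method": Counts(tp=8, fp=2, fn=2),
    "Dataset": Counts(tp=2, fp=0, fn=4),
}
toy_score = micro_f1(toy_counts)
```

Because counts are pooled first, frequent entity types dominate the score, which is why micro F1 can differ noticeably from a per-type (macro) average.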

Model size: 110M params (Safetensors; tensor types: I64, F32)