arazd/MIReAD

	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	tags:
	- pubmed
	- arxiv
	- representations
	- scientific documents
	- bert
	widget:
	- example_title: "Journal prediction"
	  text: "Tissue-based diagnostics and research is incessantly evolving with the development of new molecular tools. It has long been realized that immunohistochemistry can add an important new level of information on top of morphology and that protein expression patterns in a cancer may yield crucial diagnostic and prognostic information. We have generated an immunohistochemistry-based map of protein expression profiles in normal tissues, cancer and cell lines."
	---
	This is the finetuned model presented in **MIReAD: a simple method for learning high-quality representations from
	scientific documents (ACL 2023)**.

	We trained MIReAD on >500,000 PubMed and arXiv abstracts across over 2,000 journal classes. MIReAD was initialized with SciBERT weights and finetuned to predict journal class based on the abstract and title of the paper. MIReAD uses SciBERT's tokenizer.

	Overall, with MIReAD you can:
	* extract a semantically meaningful representation from a paper's abstract
	* predict the journal class from a paper's abstract

	To load the MIReAD model:
	```python
	from transformers import BertForSequenceClassification, AutoTokenizer

	mpath = 'arazd/miread'
	model = BertForSequenceClassification.from_pretrained(mpath)
	tokenizer = AutoTokenizer.from_pretrained(mpath)
	```

	To use MIReAD for feature extraction and classification:
	```python
	import torch

	# sample abstract text
	abstr = 'Learning semantically meaningful representations from scientific documents can ...'
	source_len = 512
	# pad/truncate to a fixed length (padding='max_length' replaces the deprecated pad_to_max_length)
	inputs = tokenizer(abstr,
	                   max_length=source_len,
	                   padding='max_length',
	                   truncation=True,
	                   return_tensors='pt')

	model.eval()  # disable dropout for inference
	with torch.no_grad():
	    # classification (getting logits over 2,734 journal classes)
	    out = model(**inputs)
	    logits = out.logits

	    # feature extraction (getting 768-dimensional feature profiles)
	    out = model.bert(**inputs)
	    # IMPORTANT: use [CLS] token representation as document-level representation (hence, 0th idx)
	    feature = out.last_hidden_state[:, 0, :]
	```
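
The two outputs above usually need a small post-processing step before they are useful: logits are turned into ranked class probabilities, and [CLS] features are compared with a similarity measure. The sketch below shows one possible way to do both in plain Python. It is an illustration, not part of the MIReAD release: the helper names `softmax`, `top_k_classes` and `cosine_similarity`, the toy logits, and the `toy_labels` mapping are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_classes(logits, k=5, id2label=None):
    """Return the k most probable (label, probability) pairs, best first.

    If id2label is None, the integer class index is used as the label.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i] if id2label else i, probs[i]) for i in ranked]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy values standing in for real model outputs (hypothetical, 4 classes only).
toy_logits = [0.1, 2.0, -1.0, 0.5]
toy_labels = {0: "journal_a", 1: "journal_b", 2: "journal_c", 3: "journal_d"}

top2 = top_k_classes(toy_logits, k=2, id2label=toy_labels)
print(top2)

# Toy 3-dimensional vectors standing in for 768-dimensional [CLS] features.
sim = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(sim)
```

With real model outputs, the helpers could be called as `top_k_classes(logits[0].tolist(), k=5, id2label=model.config.id2label)` and `cosine_similarity(feature_a[0].tolist(), feature_b[0].tolist())`, where `feature_a` and `feature_b` are features of two different abstracts. Whether `model.config.id2label` holds journal names or generic placeholder labels depends on the checkpoint's configuration, so that is worth checking before relying on it.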