---
license: cc-by-4.0
language:
- en
metrics:
- f1
- accuracy
library_name: transformers
pipeline_tag: text-classification
widget:
- text: |
    X chromosome inactivation (XCI) serves as a paradigm for RNA-mediated regulation of gene expression, wherein the long non-coding RNA XIST spreads across the X chromosome in cis to mediate gene silencing chromosome-wide. In female naive human pluripotent stem cells (hPSCs), XIST is in a dispersed configuration, and XCI does not occur, raising questions about XIST's function. We found that XIST spreads across the X chromosome and induces dampening of X-linked gene expression in naive hPSCs. Surprisingly, XIST also targets specific autosomal regions, where it induces repressive chromatin changes and gene expression dampening. Thereby, XIST equalizes X-linked gene dosage between male and female cells while inducing differences in autosomes. The dispersed Xist configuration and autosomal localization also occur transiently during XCI initiation in mouse PSCs. Together, our study identifies XIST as the regulator of X chromosome dampening, uncovers an evolutionarily conserved trans-acting role of XIST/Xist, and reveals a correlation between XIST/Xist dispersal and autosomal targeting.
---

# lncrna-biocontext

This model determines whether a given abstract discusses an lncRNA in the context of disease or not. It was trained on data from [lncBook-Wiki](https://ngdc.cncb.ac.cn/lncbook/), whose papers have been curated by experts according to the biological context they discuss. We collected the abstracts for these papers, simplified the classification into disease/not disease, and then fine-tuned a [longformer](https://huggingface.co/allenai/longformer-base-4096) model for binary classification.

We achieve the following results on the held-out test set:

| Metric   | Score |
|----------|-------|
| Accuracy | 0.84  |
| F1       | 0.82  |
| ROC AUC  | 0.98  |

Note, however, that the test set contains only 59 examples, 22 of which discuss disease.

## Key stats

The model size is ~600 MB. It runs very fast on the MPS device of an M-series Mac and should also be reasonably fast on a normal CPU. The context window is 4,096 tokens, inherited from the base longformer model. During training we limited the context to 1,280 tokens, slightly longer than the longest abstract we saw, so very long abstracts may cause trouble.

## Limitations

The base model was pre-trained with masked language modelling on Wikipedia text (and possibly other corpora). As such, it may not have a strong understanding of scientific literature.

The dataset used to train this model was _tiny_, at only 588 examples overall. This means only 470 samples for training, with 59 each for validation and testing. These were deliberately sampled to be roughly evenly distributed between the two classes. The dataset they are derived from is also _massively_ imbalanced, having 19,229 examples of which only 294 are not disease-related. As a result, the model is trained on a dataset that hugely undersamples the disease-context abstracts.

While the model has been tested on abstracts derived from lncBook's annotations, it has not really been tested on 'wild' abstracts.

## Next steps

The next step will be to classify both the specific disease (e.g. lung adenocarcinoma) and the non-disease context (e.g. localisation) a paper discusses.
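
## Example usage

A minimal inference sketch using the `transformers` text-classification pipeline is shown below. The repository id `your-username/lncrna-biocontext` is a placeholder (substitute the actual model path), and the label names in the comments are illustrative; check the model's `config.json` for the real label mapping.

```python
from transformers import pipeline

# NOTE: placeholder repo id; replace with the actual Hugging Face model path.
classifier = pipeline(
    "text-classification",
    model="your-username/lncrna-biocontext",
    device="mps",  # Apple Silicon GPU; use "cpu" or "cuda" as appropriate
)

abstract = (
    "X chromosome inactivation (XCI) serves as a paradigm for RNA-mediated "
    "regulation of gene expression, wherein the long non-coding RNA XIST "
    "spreads across the X chromosome in cis to mediate gene silencing "
    "chromosome-wide."
)

# Truncate to the 1,280-token limit used during training; longer abstracts
# may behave less predictably.
result = classifier(abstract, truncation=True, max_length=1280)
print(result)  # e.g. [{'label': 'disease', 'score': 0.97}] (labels are illustrative)
```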