<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ESM

## Overview
This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental
AI Research Team: the state-of-the-art ESMFold and ESM-2, and the previously released ESM-1b and ESM-1v.
Transformer protein language models were introduced in the paper [Biological structure and function emerge from scaling
unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott,
C. Lawrence Zitnick, Jerry Ma, and Rob Fergus.
The first version of this paper was [preprinted in 2019](https://www.biorxiv.org/content/10.1101/622803v1?versioned=true).
ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks,
and enables atomic resolution structure prediction.
It was released with the paper [Language models of protein sequences at the scale of evolution enable accurate
structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie,
Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido and Alexander Rives.
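
As a minimal sketch of extracting per-residue representations from an ESM-2 checkpoint (the small
`facebook/esm2_t6_8M_UR50D` model is used here purely for illustration; larger ESM-2 checkpoints expose the same
interface):

```python
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Protein sequences are passed as plain strings of amino-acid letters.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-residue representations, including the special tokens added at the sequence ends.
print(outputs.last_hidden_state.shape)  # (batch, sequence length + special tokens, hidden size)
```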
Also introduced in this paper was ESMFold. It uses an ESM-2 stem with a head that can predict folded protein
structures with state-of-the-art accuracy. Unlike [AlphaFold2](https://www.nature.com/articles/s41586-021-03819-2),
it relies on the token embeddings from the large pre-trained protein language model stem and does not perform a multiple
sequence alignment (MSA) step at inference time, which means that ESMFold checkpoints are fully "standalone" -
they do not require a database of known protein sequences and structures with associated external query tools
to make predictions, and are much faster as a result.
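
The sketch below shows single-sequence folding with the `EsmForProteinFolding` class documented further down this
page, assuming the `facebook/esmfold_v1` checkpoint. Note that ESMFold inputs are tokenized without the special
tokens the tokenizer would otherwise add, and the forward pass returns, among other fields, predicted atom positions
and pLDDT confidence estimates:

```python
import torch
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")

# ESMFold expects raw sequences without the BOS/EOS tokens the tokenizer would normally add.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer(sequence, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    outputs = model(**inputs)

# Predicted atom positions and pLDDT confidence estimates for the folded structure.
print(outputs.positions.shape, outputs.plddt.shape)
```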
The abstract from "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" is
*In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised
learning has led to major advances in representation learning and statistical generation. In the life sciences, the
anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling
at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To
this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250
million protein sequences spanning evolutionary diversity. The resulting model contains information about biological
properties in its representations. The representations are learned from sequence data alone. The learned representation
space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to
remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and
can be identified by linear projections. Representation learning produces features that generalize across a range of
applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and
improving state-of-the-art features for long-range contact prediction.*
The abstract from "Language models of protein sequences at the scale of evolution enable accurate structure prediction" is
*Large language models have recently been shown to develop emergent capabilities with scale, going beyond
simple pattern matching to perform higher level reasoning and generate lifelike images and text. While
language models trained on protein sequences have been studied at a smaller scale, little is known about
what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters,
the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn
information enabling the prediction of the three-dimensional structure of a protein at the resolution of
individual atoms. We present ESMFold for high accuracy end-to-end atomic level structure prediction directly
from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for
sequences with low perplexity that are well understood by the language model. ESMFold inference is an
order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic
proteins in practical timescales.*
Tips:

- ESM models are trained with a masked language modeling (MLM) objective.
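
As a minimal sketch of this objective in practice, a masked residue can be filled in with `EsmForMaskedLM`
(the small `facebook/esm2_t6_8M_UR50D` ESM-2 checkpoint is assumed here; any ESM-2 checkpoint on the Hub should
behave the same way, and the `fill-mask` pipeline also works):

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Replace one residue with the mask token and let the model predict the original amino acid.
sequence = "MQIFVKTLTGKTITLEVEPS<mask>TIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring amino acid at the masked position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_token_id = logits[mask_positions].argmax(-1)
print(tokenizer.decode(predicted_token_id))
```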
The original code can be found [here](https://github.com/facebookresearch/esm) and was
developed by the Fundamental AI Research team at Meta AI.
ESM-1b, ESM-1v and ESM-2 were contributed to huggingface by [jasonliu](https://huggingface.co/jasonliu)
and [Matt](https://huggingface.co/Rocketknight1).

ESMFold was contributed to huggingface by [Matt](https://huggingface.co/Rocketknight1) and
[Sylvain](https://huggingface.co/sgugger), with a big thank you to Nikita Smetanin, Roshan Rao and Tom Sercu for their
help throughout the process!

The HuggingFace port of ESMFold uses portions of the [openfold](https://github.com/aqlaboratory/openfold) library.
The `openfold` library is licensed under the Apache License 2.0.
## Documentation resources

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
## EsmConfig

[[autodoc]] EsmConfig
    - all

## EsmTokenizer

[[autodoc]] EsmTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## EsmModel

[[autodoc]] EsmModel
    - forward

## EsmForMaskedLM

[[autodoc]] EsmForMaskedLM
    - forward

## EsmForSequenceClassification

[[autodoc]] EsmForSequenceClassification
    - forward

## EsmForTokenClassification

[[autodoc]] EsmForTokenClassification
    - forward

## EsmForProteinFolding

[[autodoc]] EsmForProteinFolding
    - forward

## TFEsmModel

[[autodoc]] TFEsmModel
    - call

## TFEsmForMaskedLM

[[autodoc]] TFEsmForMaskedLM
    - call

## TFEsmForSequenceClassification

[[autodoc]] TFEsmForSequenceClassification
    - call

## TFEsmForTokenClassification

[[autodoc]] TFEsmForTokenClassification
    - call