<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ESM

## Overview
This page provides code and pre-trained weights for Transformer protein language models from Meta AI's Fundamental
AI Research Team: the state-of-the-art ESMFold and ESM-2, and the previously released ESM-1b and ESM-1v.
Transformer protein language models were introduced in the paper [Biological structure and function emerge from scaling
unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by
Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott,
C. Lawrence Zitnick, Jerry Ma, and Rob Fergus.
The first version of this paper was [preprinted in 2019](https://www.biorxiv.org/content/10.1101/622803v1?versioned=true).
ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks,
and enables atomic resolution structure prediction.
It was released with the paper [Language models of protein sequences at the scale of evolution enable accurate
structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie,
Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido and Alexander Rives.
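
As a minimal sketch of extracting per-residue representations from an ESM-2 checkpoint (the small
`facebook/esm2_t6_8M_UR50D` model is used here purely for illustration; larger ESM-2 checkpoints expose the same
interface):

```python
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Protein sequences are passed as plain strings of amino-acid letters.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-residue representations, including the special tokens added at the sequence ends.
print(outputs.last_hidden_state.shape)  # (batch, sequence length + special tokens, hidden size)
```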
Also introduced in this paper was ESMFold. It uses an ESM-2 stem with a head that can predict folded protein
structures with state-of-the-art accuracy. Unlike [AlphaFold2](https://www.nature.com/articles/s41586-021-03819-2),
it relies on the token embeddings from the large pre-trained protein language model stem and does not perform a multiple
sequence alignment (MSA) step at inference time, which means that ESMFold checkpoints are fully "standalone" -
they do not require a database of known protein sequences and structures with associated external query tools
to make predictions, and are much faster as a result.
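
The sketch below shows single-sequence folding with the `EsmForProteinFolding` class documented further down this
page, assuming the `facebook/esmfold_v1` checkpoint. Note that ESMFold inputs are tokenized without the special
tokens the tokenizer would otherwise add, and the forward pass returns, among other fields, predicted atom positions
and pLDDT confidence estimates:

```python
import torch
from transformers import AutoTokenizer, EsmForProteinFolding

tokenizer = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
model = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")

# ESMFold expects raw sequences without the BOS/EOS tokens the tokenizer would normally add.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
inputs = tokenizer(sequence, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    outputs = model(**inputs)

# Predicted atom positions and pLDDT confidence estimates for the folded structure.
print(outputs.positions.shape, outputs.plddt.shape)
```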
The abstract from "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" is
*In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised
learning has led to major advances in representation learning and statistical generation. In the life sciences, the
anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling
at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To
this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250
million protein sequences spanning evolutionary diversity. The resulting model contains information about biological
properties in its representations. The representations are learned from sequence data alone. The learned representation
space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to
remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and
can be identified by linear projections. Representation learning produces features that generalize across a range of
applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and
improving state-of-the-art features for long-range contact prediction.*
The abstract from "Language models of protein sequences at the scale of evolution enable accurate structure prediction" is
*Large language models have recently been shown to develop emergent capabilities with scale, going beyond
simple pattern matching to perform higher level reasoning and generate lifelike images and text. While
language models trained on protein sequences have been studied at a smaller scale, little is known about
what they learn about biology as they are scaled up. In this work we train models up to 15 billion parameters,
the largest language models of proteins to be evaluated to date. We find that as models are scaled they learn
information enabling the prediction of the three-dimensional structure of a protein at the resolution of
individual atoms. We present ESMFold for high accuracy end-to-end atomic level structure prediction directly
from the individual sequence of a protein. ESMFold has similar accuracy to AlphaFold2 and RoseTTAFold for
sequences with low perplexity that are well understood by the language model. ESMFold inference is an
order of magnitude faster than AlphaFold2, enabling exploration of the structural space of metagenomic
proteins in practical timescales.*
Tips:

- ESM models are trained with a masked language modeling (MLM) objective.
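
As a minimal sketch of this objective in practice, a masked residue can be filled in with `EsmForMaskedLM`
(the small `facebook/esm2_t6_8M_UR50D` ESM-2 checkpoint is assumed here; any ESM-2 checkpoint on the Hub should
behave the same way, and the `fill-mask` pipeline also works):

```python
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = EsmForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")

# Replace one residue with the mask token and let the model predict the original amino acid.
sequence = "MQIFVKTLTGKTITLEVEPS<mask>TIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring amino acid at the masked position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
predicted_token_id = logits[mask_positions].argmax(-1)
print(tokenizer.decode(predicted_token_id))
```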
The original code can be found [here](https://github.com/facebookresearch/esm) and was
developed by the Fundamental AI Research team at Meta AI.
ESM-1b, ESM-1v and ESM-2 were contributed to huggingface by [jasonliu](https://huggingface.co/jasonliu)
and [Matt](https://huggingface.co/Rocketknight1).

ESMFold was contributed to huggingface by [Matt](https://huggingface.co/Rocketknight1) and
[Sylvain](https://huggingface.co/sgugger), with a big thank you to Nikita Smetanin, Roshan Rao and Tom Sercu for their
help throughout the process!

The HuggingFace port of ESMFold uses portions of the [openfold](https://github.com/aqlaboratory/openfold) library.
The `openfold` library is licensed under the Apache License 2.0.
## Documentation resources

- [Text classification task guide](../tasks/sequence_classification)
- [Token classification task guide](../tasks/token_classification)
- [Masked language modeling task guide](../tasks/masked_language_modeling)
## EsmConfig

[[autodoc]] EsmConfig
    - all

## EsmTokenizer

[[autodoc]] EsmTokenizer
    - build_inputs_with_special_tokens
    - get_special_tokens_mask
    - create_token_type_ids_from_sequences
    - save_vocabulary

## EsmModel

[[autodoc]] EsmModel
    - forward

## EsmForMaskedLM

[[autodoc]] EsmForMaskedLM
    - forward

## EsmForSequenceClassification

[[autodoc]] EsmForSequenceClassification
    - forward

## EsmForTokenClassification

[[autodoc]] EsmForTokenClassification
    - forward

## EsmForProteinFolding

[[autodoc]] EsmForProteinFolding
    - forward

## TFEsmModel

[[autodoc]] TFEsmModel
    - call

## TFEsmForMaskedLM

[[autodoc]] TFEsmForMaskedLM
    - call

## TFEsmForSequenceClassification

[[autodoc]] TFEsmForSequenceClassification
    - call

## TFEsmForTokenClassification

[[autodoc]] TFEsmForTokenClassification
    - call