How to use raphassaraf/MNLP_M3_document_encoder with sentence-transformers:
from sentence_transformers import SentenceTransformer

# Download the model from the Hugging Face Hub
model = SentenceTransformer("raphassaraf/MNLP_M3_document_encoder")
sentences = [
"What are the de facto required fields in a SAM/BAM read group?",
"Question: Several gene set enrichment methods are available, the most famous/popular is the Broad Institute tool. Many other tools are available (See for example the biocView of GSE which list 82 different packages). There are several parameters in consideration :\n\nthe statistic used to order the genes, \nif it competitive or self-contained,\nif it is supervised or not,\nand how is the enrichment score calculated.\n\nI am using the fgsea - Fast Gene Set Enrichment Analysis package to calculate the enrichment scores and someone told me that the numbers are different from the ones on the Broad Institute despite all the other parameters being equivalent.\nAre these two methods (fgsea and Broad Institute GSEA) equivalent to calculate the enrichment score?\nI looked to the algorithms of both papers, and they seem fairly similar, but I don't know if in real datasets they are equivalent or not.\nIs there any article reviewing and comparing how does the enrichment score method affect to the result?\n\nAnswer: According to the FGSEA preprint:\n\nWe ran reference GSEA with default parameters. The permutation number\n was set to 1000, which means that for each input gene set 1000\n independent samples were generated. The run took 100 seconds and\n resulted in 79 gene sets with GSEA-adjusted FDR q-value of less than\n 10−2. All significant gene sets were in a positive mode. First, to get\n a similar nominal p-values accuracy we ran FGSEA algorithm on 1000\n permutations. This took 2 seconds, but resulted in no significant hits\n due after multiple testing correction (with FRD ≤ 1%).\n\nThus, FGSEA and GSEA are not identical.\nAnd again in the conclusion:\n\nConsequently, gene sets can be ranked more precisely in the results\n and, which is even more important, standard multiple testing\n correction methods can be applied instead of approximate ones as in\n [GSEA].\n\nThe author argues that FGSEA is more accurate, so it can't be equivalent.\nIf you are interested specifically in the enrichment score, that was addressed by the author in the preprint comments:\n\nValues of enrichment scores and normalized enrichment scores are the\n same for both broad version and fgsea.\n\nSo that part seems to be the same.",
"Question: I am running samtools mpileup (v1.4) on a bam file with very choppy coverage (ChIP-seq style data). I want to get a first-pass list of positions with SNVs and their frequency as reported by the read counts, but no matter what I do, I keep getting all SNVs filtered out as not passing QC.\nWhat's the magic parameter set for an initial list of SNVs and frequencies?\nEDIT: this is a question I posted on \"the other\" website, but didn't get a reply there.\n\nAnswer: I used this in the past for ChIP-seq data and it generated SNVs:\nsamtools mpileup \\\n--uncompressed --max-depth 10000 --min-MQ 20 --ignore-RG --skip-indels \\\n--fasta-ref ref.fa file.bam \\\n| bcftools call --consensus-caller \\\n> out.vcf\n\nThis was samtools 1.3 in case that makes a difference.",
"Question: The SAM specification indicates that each read group must have a unique ID field, but does not mark any other field as required. \nI have also discovered that htsjdk throws exceptions if the sample (SM) field is empty, though there is no indication in the specification that this is required. \nAre there other read group fields that I should expect to be required by common tools? \n\nAnswer: The sample tag (i.e. SM) was a mandatory tag in the initial SAM spec (see the .pages file; you need a mac to open it). When transitioned to Latex, this requirement was mysteriously dropped. Picard is conforming to the initial spec. Anyway, the sample tag is important to quite a few tools. I would encourage you to add it."
]
# Encode the sentences into dense embeddings, then compute the
# pairwise similarity matrix between all of them.
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])
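By default, `model.similarity` computes pairwise cosine similarity between the two sets of embeddings, which is why encoding 4 sentences yields a 4×4 matrix with 1.0 on the diagonal. A minimal sketch of the same computation in NumPy (the toy 3-dimensional vectors below are made up for illustration; real embeddings from this encoder are much higher-dimensional):

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy embeddings standing in for model.encode(sentences).
emb = np.array([
    [0.1, 0.3, 0.5],
    [0.2, 0.1, 0.4],
    [0.9, 0.0, 0.1],
    [0.3, 0.3, 0.3],
])

sims = cosine_similarity_matrix(emb, emb)
print(sims.shape)  # (4, 4); the matrix is symmetric with 1.0 on the diagonal
```

For retrieval, you would typically encode a query and a set of documents separately and rank the documents by their row of this matrix.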