Questions about Nucleotide-Transformer and 1000 Genomes Project Data

#3
by JayceCeleste - opened

Hello InstaDeepAI and community members,

I am currently delving into the fascinating work presented in the nucleotide-transformer paper, particularly the use of data from the 1000 Genomes Project. The research is incredibly insightful, and I have a few questions that I hope the community or the authors can help clarify:

  1. Data Interpretation: The paper mentions the use of 3202 high-coverage human genomes, totaling 20.5 trillion nucleotides. As I understand it, with each human genome at about 3.2 billion nucleotides and both haplotypes of the diploid genome counted, the calculation 3.2B * 2 * 3202 = 20,492.8B ≈ 20.5T matches your figure (a quick arithmetic check is included after this list). Can anyone confirm whether this interpretation is correct, or whether there's something I'm missing?

  2. Handling Large Data Volumes: In working with data from the 1000 Genomes Project, I've run into the challenge of managing large data volumes. With each haploid genome sequence taking about 3 GB and accounting for diploidy, the data for 3202 samples comes to 19,212 GB (= 3 GB * 2 * 3202), i.e. roughly 19.2 TB, or about 18.76 TiB in binary units (see the second sketch after this list). How did the team manage such a vast amount of training data? Did you use the entire dataset directly for training, or were specific preprocessing or reduction techniques applied?

  3. FASTA File Generation Issues: I've been having difficulty generating FASTA files from VCF files; the process takes an excessive amount of time even on high-spec servers. I've tried tools such as bcftools consensus and gatk FastaAlternateReferenceMaker, but to no avail (a minimal parallelized sketch of the bcftools route follows this list). I'm curious which tools or methods the team used for this conversion and would appreciate any suggestions or recommendations.
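
For question 1, here is the quick arithmetic check referenced above (a sketch only; the ~3.2 billion nucleotides per haploid genome is an approximation, and the paper's exact accounting may differ):

```python
# Back-of-the-envelope check of the total nucleotide count (question 1).
# Assumption: ~3.2 billion nucleotides per haploid human genome.
nucleotides_per_haploid_genome = 3.2e9
haplotypes_per_sample = 2            # diploid
num_samples = 3202                   # 1000 Genomes high-coverage samples

total = nucleotides_per_haploid_genome * haplotypes_per_sample * num_samples
print(f"{total:.4e} nucleotides ≈ {total / 1e12:.1f} trillion")
# -> 2.0493e+13 nucleotides ≈ 20.5 trillion
```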
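
And the corresponding storage estimate for question 2 (again only a sketch; 3 GB per haploid FASTA is a rough figure, not an exact measurement):

```python
# Back-of-the-envelope storage estimate (question 2).
# Assumption: ~3 GB per haploid consensus FASTA (uncompressed), a rough figure.
gb_per_haploid_fasta = 3
haplotypes_per_sample = 2
num_samples = 3202

total_gb = gb_per_haploid_fasta * haplotypes_per_sample * num_samples
print(f"{total_gb} GB ≈ {total_gb / 1000:.1f} TB ≈ {total_gb / 1024:.2f} TiB")
# -> 19212 GB ≈ 19.2 TB ≈ 18.76 TiB
```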
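
For question 3, here is the minimal sketch mentioned above of how one might fan out bcftools consensus per sample and per haplotype from Python. This is not the authors' pipeline; the file names (reference.fa, calls.vcf.gz), sample IDs, and worker count are placeholders, and it assumes bcftools is on PATH and the VCF is bgzipped and indexed.

```python
"""Parallel per-sample, per-haplotype consensus FASTA generation with bcftools.

Sketch only: file names and sample IDs below are placeholders; the VCF must be
bgzipped and indexed, and the reference FASTA must be faidx-indexed.
"""
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

REFERENCE = "reference.fa"        # placeholder reference FASTA
VCF = "calls.vcf.gz"              # placeholder bgzipped + indexed multi-sample VCF
OUT_DIR = Path("consensus_fastas")
SAMPLES = ["HG00096", "HG00097"]  # placeholders; in practice read sample IDs from the VCF header

def make_consensus(sample: str, haplotype: int) -> Path:
    """Run bcftools consensus for one sample/haplotype and return the output path."""
    out = OUT_DIR / f"{sample}_hap{haplotype}.fa"
    cmd = [
        "bcftools", "consensus",
        "-f", REFERENCE,          # reference to apply the variants onto
        "-s", sample,             # use this sample's genotypes
        "-H", str(haplotype),     # haplotype 1 or 2 (needs phased genotypes)
        "-o", str(out),
        VCF,
    ]
    subprocess.run(cmd, check=True)
    return out

if __name__ == "__main__":
    OUT_DIR.mkdir(exist_ok=True)
    jobs = [(s, h) for s in SAMPLES for h in (1, 2)]
    # One bcftools process per sample/haplotype; tune max_workers to your CPU and disk.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for path in pool.map(make_consensus, *zip(*jobs)):
            print("wrote", path)
```

If the bottleneck is running samples one at a time, fanning out processes like this may help; splitting further by chromosome (piping a single-chromosome reference from samtools faidx into bcftools consensus and concatenating afterwards) is another option. I'd still be curious what the team actually used.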

Any insights or guidance on these matters would be greatly appreciated and would significantly enhance my understanding and application of this data.

Thank you for your time and contributions to this exciting field!

Best regards
