GreenGenomicsLab committed on
Commit 849e9a1 · verified · 1 Parent(s): 261b39f

Upload 10 files

Files changed (1)
  1. README.txt +74 -3
README.txt CHANGED
@@ -1,6 +1,77 @@
- to run, list your fasta files in the filelist.txt files and submit the .sbatch script, or just run run_la4sr_TI-inc-algaGPT.sh if no scheduler is available
 
- alternatively, you can run the inference script (for raw outputs from next-token generation) and the model metrics script separately
 
- expected outputs from default run are in results-archive
 
+ # LA4SR TI-inclusive algaGPT Distribution
 
+ This distribution contains the TI-inclusive algaGPT model packaged for easy use with Singularity, along with datasets, inference scripts, and evaluation tools.
 
+ ## Directory Structure
+
+ * **`la4sr_sp2.sif`** (6.8 GB): Singularity container with the complete computational environment.
+ * **Model Files:**
+
+   * `ckpt.pt`: Model checkpoint file (990 MB).
+   * `meta.pkl`: Metadata for the model.
+   * `model.py`: Python script defining the model architecture.
+ * **Datasets:**
+
+   * FASTA files (`generated_prompts_*_headed.fa`) for various taxa (algae, archaea, bacteria, fungi, viruses).
+ * **Scripts:**
+
+   * `run_la4sr_TI-inc-algaGPT.sh`: Main inference and metrics generation script.
+   * `run_la4sr_loop.sbatch`: SLURM batch script to run multiple inference jobs.
+   * `infer_TI-inc-algaGPT.py`: Python inference script.
+   * `llm-metrics-two-files.py`: Generates classification metrics and visualizations.
+ * **Utility Files:**
+
+   * `filelist.txt`, `contam-filelist.txt`, `algae-filelist.txt`: Lists of FASTA files to analyze.
+   * `la4sr_sp2.sif.md5`: Checksum for container verification.
+
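The `la4sr_sp2.sif.md5` entry suggests the container image can be checked before first use. The following is a minimal sketch, assuming the checksum file uses the usual `md5sum` output format (`<hash>  <filename>`); the README does not state the format, and `verify_md5` is a hypothetical helper name, not part of the distribution.

```bash
#!/usr/bin/env bash
# Verify a file against an md5sum-style checksum file.
# Assumption: the .md5 file contains "<hash>  <filename>" lines as written by `md5sum`.
verify_md5() {
    local checksum_file="$1"
    local dir
    dir="$(dirname "$checksum_file")"
    # md5sum -c resolves the listed filenames relative to the current directory,
    # so run the check from the directory that holds the checksum file.
    (cd "$dir" && md5sum -c "$(basename "$checksum_file")")
}
```

A call such as `verify_md5 la4sr_sp2.sif.md5 && echo "container OK"` returns a non-zero status if the image does not match the recorded hash.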
+ ### Subdirectories:
+
+ * **`cache/`**: Cache directory for Hugging Face models and tokenizers.
+ * **`results-archive/`**: Archived results from previous runs.
+ * **`algaGPT_fungi-algae2x-update_cleaned/`**: Fine-tuned algaGPT variant optimized for fungi.
+ * **`TI-free-la4sr/`**: Pythia-based TI-free flagship model.
+ * **`slurm-logs/`**: SLURM job output logs.
+
+ ## Quick Start
+
+ 1. **Setup:** Ensure Singularity is installed on your HPC or local system.
+
+ 2. **Inference (no scheduler):**
+
+    ```bash
+    ./run_la4sr_TI-inc-algaGPT.sh resume <algal_fasta> <contaminant_fasta>
+    ```
+
+    Replace `<algal_fasta>` and `<contaminant_fasta>` with your FASTA file paths.
+
+ 3. **Inference (with SLURM scheduler):**
+
+    * Update `algae-filelist.txt` and `contam-filelist.txt` with paths to your FASTA files.
+    * Submit the SLURM job array:
+
+      ```bash
+      sbatch run_la4sr_loop.sbatch
+      ```
+
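The SLURM step relies on `algae-filelist.txt` and `contam-filelist.txt` being filled in by hand. One way to generate them is sketched below. It assumes that each list holds one FASTA path per line and that algal inputs can be recognized by `algae` in the filename; neither convention is confirmed by the README. `build_filelists` is a hypothetical helper that overwrites both list files in the current directory.

```bash
#!/usr/bin/env bash
# Split FASTA files into the two list files mentioned in the SLURM step.
# Assumptions: one path per line; algal inputs have "algae" in the filename.
build_filelists() {
    local fasta_dir="${1:-.}"
    : > algae-filelist.txt      # truncate or create both lists
    : > contam-filelist.txt
    local f
    for f in "$fasta_dir"/generated_prompts_*_headed.fa; do
        [ -e "$f" ] || continue   # the glob matched nothing
        case "$(basename "$f")" in
            *algae*) printf '%s\n' "$f" >> algae-filelist.txt ;;
            *)       printf '%s\n' "$f" >> contam-filelist.txt ;;
        esac
    done
}
```

Running `build_filelists /path/to/fasta_dir` from the distribution directory would then leave both lists ready for `sbatch run_la4sr_loop.sbatch`, provided the batch script reads them in this form.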
+ ## Outputs
+
+ Results, including TSV files, metrics reports, misclassification reports, and visualizations, are stored in the `results/` directory.
+
+ ## Additional Information
+
+ * To manually run the inference script:
+
+   ```bash
+   singularity exec --nv la4sr_sp2.sif python3 infer_TI-inc-algaGPT.py --init_from resume input.fasta -o output.tsv
+   ```
+
+ * To generate metrics independently:
+
+   ```bash
+   singularity exec la4sr_sp2.sif python3 llm-metrics-two-files.py algal_results.tsv contaminant_results.tsv -o metrics_report.txt -m misclassified_report.txt -v
+   ```
+
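The manual inference command handles one FASTA file at a time. The sketch below wraps it in a loop over a list file, using exactly the flags shown in the README. The list format (one path per line), the `manual-results` output directory, the per-input `.tsv` naming, and the `run_inference_list` and `DRY_RUN` names are all assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Run the manual inference command for every FASTA path in a list file.
# Assumptions: one path per line; output directory and file names are hypothetical.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_inference_list() {
    local list_file="$1" out_dir="${2:-manual-results}"
    mkdir -p "$out_dir"
    local fasta out
    local -a cmd
    # Read the list on file descriptor 3 so the container cannot consume it via stdin.
    while IFS= read -r fasta <&3; do
        [ -n "$fasta" ] || continue    # skip blank lines
        out="$out_dir/$(basename "${fasta%.*}").tsv"
        cmd=(singularity exec --nv la4sr_sp2.sif python3 infer_TI-inc-algaGPT.py
             --init_from resume "$fasta" -o "$out")
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf '%s\n' "${cmd[*]}"
        else
            "${cmd[@]}"
        fi
    done 3< "$list_file"
}
```

Calling `DRY_RUN=1 run_inference_list algae-filelist.txt` first is a cheap way to inspect the generated commands before committing GPU time. The resulting TSV files can then be passed to `llm-metrics-two-files.py` as in the metrics command above.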
+ ---
+
+ For further assistance, contact the maintainers.