<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>mtDNA Tool – System Overview</title>

  <style>
    .custom-container {
      background-color: #ffffff !important;
      color: #222222 !important;
      font-family: Arial, sans-serif !important;
      line-height: 1.6 !important;
      padding: 2rem !important;
      max-width: 900px !important;
      margin: auto !important;
    }

    .custom-container h1,
    .custom-container h2,
    .custom-container h3,
    .custom-container strong,
    .custom-container b,
    .custom-container p,
    .custom-container li,
    .custom-container ol,
    .custom-container ul,
    .custom-container span {
      color: #222222 !important;
      font-weight: normal !important;
    }

    .custom-container h1,
    .custom-container h2 {
      font-weight: bold !important;
    }

    .custom-container img {
      max-width: 100%;
      border: 1px solid #ccc;
      padding: 5px;
      background: #fff;
    }

    .custom-container code {
      background: none !important;
      color: #222 !important;
      font-family: inherit !important;
      font-size: inherit !important;
      padding: 0 !important;
      border-radius: 0 !important;
    }


    .custom-container .highlight {
      background: #ffffcc;
      padding: 4px 8px;
      border-left: 4px solid #ffcc00;
      margin: 1rem 0;
      color: #333 !important;
    }
  </style>
</head>

<body>
  <div class="custom-container">

    <h1>mtDNA Location Classifier – Brief System Pipeline and Usage Guide</h1>

    <p>The <strong>mtDNA Tool</strong> is a lightweight pipeline that helps researchers extract metadata for mtDNA sequences identified by GenBank accession numbers: geographic origin, sample type (ancient or modern), and optional niche labels (e.g., ethnicity or a specific location). It supports batch input and produces structured Excel summaries.</p>

    <h2>System Overview Diagram</h2>
    <p>The figure below shows the core execution flow, from input accession to final output.</p>
    <img src="https://huggingface.co/spaces/VyLala/mtDNALocation/resolve/main/flowchart.png" alt="mtDNA Pipeline Flowchart">


    <h2>Key Steps</h2>
    <ol>
      <li><strong>Input</strong>: One or more GenBank accession numbers are submitted (e.g., through the UI, a CSV file, or pasted text).</li>

      <li><strong>Metadata Collection</strong>: Using <code>fetch_ncbi_metadata</code>, the pipeline retrieves metadata such as country, isolate, collection date, and reference title. When supplementary material or full-text articles are available, they are located through DOI, PubMed, or Google Custom Search and then parsed.</li>

      <li><strong>Text Extraction &amp; Preprocessing</strong>:
        <ul>
          <li>All available documents are parsed and cleaned (tables, paragraphs, overlapping sections).</li>
          <li>Text is merged into two formats: a smaller <code>chunk</code> and a full <code>all_output</code>.</li>
        </ul>
      </li>

      <li><strong>LLM-based Inference (Gemini + RAG)</strong>:
        <ul>
          <li>Chunks are embedded, indexed with FAISS, and stored for reuse.</li>
          <li>The Gemini model answers targeted queries: the predicted country, the sample type, and any niche label the user requests.</li>
        </ul>
      </li>

      <li><strong>Result Structuring</strong>:
        <ul>
          <li>Each output includes the predicted fields plus explanation text (methods used, supporting quotes, and sources).</li>
          <li>Results are summarized and saved with <code>save_to_excel</code>.</li>
        </ul>
      </li>
    </ol>
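    <p>As a rough illustration of the input step, the sketch below pulls accession numbers out of pasted text or CSV content. The function name <code>parse_accessions</code> and the regular expression are assumptions made for this example, not the tool's actual code; the pattern covers the common GenBank nucleotide formats (one letter plus five digits, two letters plus six or eight digits) and RefSeq-style prefixes such as NC_.</p>
<pre>
```python
import re

# Assumed pattern (not taken from the tool): common GenBank nucleotide
# accession shapes, RefSeq-style prefixes, and an optional version suffix.
ACCESSION_RE = re.compile(
    r"\b(?:[A-Z]{2}_\d{6,9}|[A-Z]\d{5}|[A-Z]{2}\d{6}(?:\d{2})?)(?:\.\d+)?\b"
)


def parse_accessions(raw_text):
    """Return unique accession IDs in first-seen order, without version suffixes."""
    seen = []
    for match in ACCESSION_RE.finditer(raw_text.upper()):
        accession = match.group(0).split(".")[0]  # drop ".1", ".2", ...
        if accession not in seen:
            seen.append(accession)
    return seen
```
</pre>
    <p>Deduplicating before any network call keeps a batch from fetching the same record twice; whether the tool itself strips version suffixes is not stated in this guide.</p>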

    <h2>Output Format</h2>
    <p>The final output is an Excel file with the following fields:</p>
    <ul>
      <li><code>Sample ID</code></li>
      <li><code>Predicted Country</code> and <code>Country Explanation</code></li>
      <li><code>Predicted Sample Type</code> and <code>Sample Type Explanation</code></li>
      <li><code>Sources</code> (links to articles)</li>
      <li><code>Time Cost</code></li>
    </ul>
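    <p>To make the column layout concrete, here is a minimal sketch that writes one result row with the fields listed above. It uses the standard-library <code>csv</code> module instead of an Excel writer so that it runs without extra dependencies. The helper <code>rows_to_csv</code>, the <code>"unknown"</code> filler, and the sample values are placeholders invented for this example and do not describe how <code>save_to_excel</code> works.</p>
<pre>
```python
import csv
import io

# Column names follow the "Output Format" list in this guide.
COLUMNS = [
    "Sample ID",
    "Predicted Country",
    "Country Explanation",
    "Predicted Sample Type",
    "Sample Type Explanation",
    "Sources",
    "Time Cost",
]


def rows_to_csv(rows):
    """Serialize result dicts to CSV text; missing fields become 'unknown'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="unknown", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values only; these are not real predictions.
example = rows_to_csv([{"Sample ID": "ABC12345", "Predicted Country": "CountryX"}])
```
</pre>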

    <h2>System Highlights</h2>
    <ul>
      <li>RAG + Gemini integration for improved explanation and transparency</li>
      <li>Excel export for structured research use</li>
      <li>Optional ethnic/location/language inference using isolate names</li>
      <li>Quality checks (e.g., a fallback when an explanation is too short or the token count is low)</li>
      <li>Report Button – After results are displayed, users can submit errors or mismatches using the report text box below the output table</li>
    </ul>
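    <p>The quality check mentioned in the list above could look roughly like the sketch below. Both thresholds and the function name <code>needs_fallback</code> are hypothetical, and whitespace splitting is only a stand-in for a real tokenizer; the tool's actual limits are not documented here.</p>
<pre>
```python
# Hypothetical thresholds; the tool's actual limits are not stated in this guide.
MIN_EXPLANATION_CHARS = 40
MIN_CONTEXT_TOKENS = 50


def needs_fallback(explanation, context_text):
    """Return True when the explanation or its supporting context looks too thin."""
    explanation_length = len(explanation.strip())
    # Whitespace splitting is a rough stand-in for a real tokenizer.
    token_count = len(context_text.split())
    return (
        MIN_EXPLANATION_CHARS > explanation_length
        or MIN_CONTEXT_TOKENS > token_count
    )
```
</pre>
    <p>A caller might re-run the query with the larger <code>all_output</code> text when this returns <code>True</code>, though that retry policy is also an assumption.</p>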

    <h2>Citation</h2>
    <div class="highlight">
      Phung, V. (2025). mtDNA Location Classifier. HuggingFace Spaces. https://huggingface.co/spaces/VyLala/mtDNALocation
    </div>

    <h2>Contact</h2>
    <p>If you are a researcher working with historical mtDNA data or edge-case accessions and need scalable inference or logging, reach out through the HuggingFace space or the email address provided in the repo README.</p>

  </div>
</body>
</html>