dmozzherin committed
Commit 4c99959 · 1 Parent(s): 382a0cf

instruction

Files changed (2):
  1. INSTRUCTIONS.md +150 -0
  2. Modelfile +13 -0
INSTRUCTIONS.md ADDED
@@ -0,0 +1,150 @@
# Insect Label Parser — Setup Instructions

This tool reads raw entomology collection label text and extracts structured
data (country, state, locality, date, collector, elevation, etc.) as JSON.
It runs entirely on your computer — no internet connection required after
the one-time setup.

---

## Step 1 — Which file do I need?

Copy one of these files from `output/gguf/` to your computer:

| File | Size | Use when |
|------|------|----------|
| `insect-parser-q4_k_m.gguf` | 3.2 GB | Your computer has **8 GB RAM** (most laptops) |
| `insect-parser-q5_k_m.gguf` | 3.4 GB | Your computer has **16 GB RAM or more** (slightly better quality) |

Not sure how much RAM you have?
- **Mac:** Apple menu → About This Mac → look for "Memory"
- **Windows:** Settings → System → About → look for "Installed RAM"

> **The Q4 file works well for this task.** Label parsing is a simple
> extraction job — the quality difference between Q4 and Q5 is very small.

---
27
+
28
+ ## Option A: LM Studio (recommended for most users β€” no terminal needed)
29
+
30
+ LM Studio is a free desktop app with a chat interface, similar to ChatGPT
31
+ but running fully on your own machine.
32
+
33
+ ### Install
34
+
35
+ 1. Go to **lmstudio.ai** and download the version for your operating system
36
+ (Mac, Windows, or Linux)
37
+ 2. Install and open it
38
+
39
+ ### Load the model
40
+
41
+ 1. In LM Studio, click **My Models** in the left sidebar
42
+ 2. Click **"Load model from file"** (or drag the `.gguf` file into the window)
43
+ 3. Navigate to the `insect-parser-q4_k_m.gguf` file you copied in Step 1
44
+ 4. Wait for the model to load (progress bar at the bottom)
45
+
46
+ ### Configure the system prompt
47
+
48
+ This step tells the model what it is supposed to do.
49
+
50
+ 1. Click the **Chat** icon in the left sidebar
51
+ 2. Find the **System Prompt** box (usually at the top of the right panel)
52
+ 3. Paste this text exactly:
53
+
54
+ ```
55
+ Parse this insect collection label and return a JSON object with the extracted fields. Only include fields that are present in the label.
56
+ ```
57
+
58
+ 4. Set **Temperature** to `0` in the model settings panel (this makes
59
+ output deterministic β€” the same label always gives the same result)
60
+
61
+ ### Parse a label
62
+
63
+ Paste the raw label text into the chat box and press Enter. The model will
64
+ return a JSON object. Example:
65
+
66
+ **Input:**
67
+ ```
68
+ U.S.A., Texas: Austin, Travis Co., 15.iv.2021, J. Doe, sweeping
69
+ ```
70
+
71
+ **Output:**
72
+ ```json
73
+ {
74
+ "country": "USA",
75
+ "state": "Texas",
76
+ "county": "Travis",
77
+ "verbatim_locality": "Austin",
78
+ "verbatim_date": "15.iv.2021",
79
+ "start_date_year": "2021",
80
+ "start_date_month": "4",
81
+ "start_date_day": "15",
82
+ "verbatim_collectors": "J. Doe",
83
+ "verbatim_method": "sweeping"
84
+ }
85
+ ```
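If you plan to load results into a spreadsheet or database, it is worth confirming the reply is well-formed JSON. A minimal Python check using the example output above (the field names come from that example; other labels may yield different fields):

```python
import json

# The example model output from above, pasted as a string.
reply = """
{
  "country": "USA",
  "state": "Texas",
  "county": "Travis",
  "verbatim_locality": "Austin",
  "verbatim_date": "15.iv.2021",
  "start_date_year": "2021",
  "start_date_month": "4",
  "start_date_day": "15",
  "verbatim_collectors": "J. Doe",
  "verbatim_method": "sweeping"
}
"""

record = json.loads(reply)  # raises ValueError if the JSON is malformed
print(record["country"], record["start_date_year"])  # USA 2021
```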
86
+
87
---

## Option B: Ollama (for users comfortable with a terminal)

Ollama is a lightweight tool that runs models from the command line and also
exposes a local API for scripting.

### Requirement: Ollama version 0.20.7 or newer

Older versions do not support this model's architecture. Check your version:

```
ollama --version
```

If it shows a version older than 0.20.7, update from **ollama.com**.

### Install

Go to **ollama.com**, download, and install for your operating system.

### Register the model

Open a terminal, navigate to the project folder, and run:

```bash
ollama create insect-parser -f Modelfile
```

Make sure the `.gguf` file named in the Modelfile's `FROM` line is present
in the same folder. You only need to do this once.

### Parse a label

```bash
ollama run insect-parser "U.S.A., Texas: Austin, 15.iv.2021, J. Doe"
```

Or pipe a text file:

```bash
ollama run insect-parser < my_label.txt
```
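To parse a whole folder of labels in one go, the CLI can be scripted. A hypothetical Python batch helper — the script name and `labels/*.txt` layout are assumptions, and it requires `ollama` plus the registered `insect-parser` model from the step above:

```python
# batch_parse.py — run each .txt label through the model, save a .json next to it.
# Hypothetical sketch; assumes `ollama` is installed and `insect-parser` is registered.
import json
import subprocess
import sys
from pathlib import Path

def extract_json(reply: str) -> dict:
    """Keep only the { ... } portion of the reply (it may contain extra text)."""
    start, end = reply.find("{"), reply.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("no JSON object found in model output")
    return json.loads(reply[start:end])

def parse_label(text: str) -> dict:
    """Send one label through `ollama run` and return the parsed fields."""
    out = subprocess.run(
        ["ollama", "run", "insect-parser"],
        input=text, capture_output=True, text=True, check=True,
    ).stdout
    return extract_json(out)

if __name__ == "__main__":
    for path in sys.argv[1:]:  # usage: python batch_parse.py labels/*.txt
        record = parse_label(Path(path).read_text())
        Path(path).with_suffix(".json").write_text(json.dumps(record, indent=2))
```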
129
+
130
---

## Troubleshooting

**The model is very slow.**
This is normal on a laptop without a dedicated GPU. The Q4 file typically
takes 5–30 seconds per label on a CPU. If you have an NVIDIA or AMD GPU
with 4+ GB of video memory, Ollama and LM Studio will use it automatically
and be much faster.

**LM Studio says "not enough memory."**
Try the Q4 file if you were using Q5. If Q4 also fails, your computer may
have less than 8 GB of RAM available — try closing other applications first.

**Ollama says "unknown model architecture: gemma4".**
Your Ollama version is too old. Update it from **ollama.com**.

**The output is not valid JSON.**
Occasionally the model will include a short thinking passage before the
JSON. Copy just the `{ ... }` portion of the output. If this happens
frequently, make sure Temperature is set to `0`.
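If you hit this often while scripting, the trimming step can be automated. A small Python sketch of "copy just the `{ ... }` portion" (the noisy reply below is made up for illustration):

```python
import json

def json_from_reply(reply: str) -> dict:
    """Keep only the first '{' through the last '}' and parse that slice."""
    return json.loads(reply[reply.find("{"): reply.rfind("}") + 1])

# Made-up example of a reply with a short thinking passage before the JSON:
noisy = 'Let me look at this label. {"country": "USA", "state": "Texas"}'
print(json_from_reply(noisy)["state"])  # Texas
```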
Modelfile ADDED
@@ -0,0 +1,13 @@
FROM ./ento-label-parser-q4_k_m.gguf

SYSTEM """Parse this insect collection label and return a JSON object with the extracted fields. Only include fields that are present in the label."""

# Deterministic output — this is a structured extraction task, not creative generation
PARAMETER temperature 0.0
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

# Gemma 4 uses a thinking/reasoning mode before outputting the final answer.
# Labels are short inputs but the model may think for several hundred tokens
# before writing the JSON. 4096 gives enough room for thinking + output.
PARAMETER num_ctx 4096