phyloforfun committed on
Commit
576994f
2 Parent(s): c47d7dd b2ad93f

Merge branch 'main' of https://huggingface.co/spaces/phyloforfun/VoucherVision

Files changed (1)
  1. custom_prompts/SLTPvM_long.yaml +75 -0
custom_prompts/SLTPvM_long.yaml ADDED
@@ -0,0 +1,75 @@
+ prompt_author: Will Weaver
+ prompt_author_institution: University of Michigan
+ prompt_name: SLTPvB_long
+ prompt_version: v-1-0
+ prompt_description: Prompt developed by the University of Michigan.
+   SLTPvB prompts all have standardized column headers (fields) that were chosen due to their reliability and prevalence in herbarium records.
+   All field descriptions are based on the official Darwin Core guidelines.
+   SLTPvB_long - The most verbose prompt option. Descriptions closely follow DwC guides. Detailed rules for the LLM to follow. Works best with double or triple OCR to increase attention back to the OCR (select 'use both OCR models' or 'handwritten + printed' along with trOCR).
+   SLTPvB_medium - Shorter version of _long.
+   SLTPvB_short - The least verbose possible prompt while still providing rules and DwC descriptions.
+ LLM: General Purpose
+ instructions: 1. Refactor the unstructured OCR text into a dictionary based on the JSON structure outlined below.
+   2. Map the unstructured OCR text to the appropriate JSON key and populate the field given the user-defined rules.
+   3. JSON key values are permitted to remain empty strings if the corresponding information is not found in the unstructured OCR text.
+   4. Duplicate dictionary fields are not allowed.
+   5. Ensure all JSON keys are in camel case.
+   6. Ensure new JSON field values follow sentence case capitalization.
+   7. Ensure all key-value pairs in the JSON dictionary strictly adhere to the format and data types specified in the template.
+   8. Ensure output JSON string is valid JSON format. It should not have trailing commas or unquoted keys.
+   9. Only return a JSON dictionary represented as a string. You should not explain your answer.
+ json_formatting_instructions: This section provides rules for formatting each JSON value organized by the JSON key.
+ rules:
+   catalogNumber: Barcode identifier, typically a number with at least 6 digits, but fewer than 30 digits.
+   scientificName: The scientific name of the taxon including genus, specific epithet, and any lower classifications.
+   genus: Taxonomic determination to genus. Genus must be capitalized. If genus is not present use the taxonomic family name followed by the word 'indet'.
+   specificEpithet: The name of the species epithet of the scientificName. Only include the species epithet.
+   speciesNameAuthorship: The authorship information for the scientificName formatted according to the conventions of the applicable Darwin Core nomenclatural code.
+   collectedBy: A comma separated list of names of people, groups, or organizations responsible for observing, recording, collecting, or presenting the original specimen. The primary collector or observer should be listed first.
+   collectorNumber: An identifier given to the occurrence at the time it was recorded, the specimen collector's number.
+   identifiedBy: A comma separated list of names of people, groups, or organizations who assigned the taxon to the subject organism. This is not the specimen collector.
+   verbatimCollectionDate: The verbatim original representation of the date and time information for when the specimen was collected. Date of collection exactly as it appears on the label. Do not change the format or correct typos.
+   collectionDate: Date the specimen was collected formatted as year-month-day, YYYY-MM-DD. If specific components of the date are unknown, they should be replaced with zeros. Use 0000-00-00 if the entire date is unknown, YYYY-00-00 if only the year is known, and YYYY-MM-00 if year and month are known but day is not.
+   collectionDateEnd: If a range of collection dates is provided, this is the later end date while collectionDate is the beginning date. Use the same formatting as for collectionDate.
+   occurrenceRemarks: Verbatim text describing the specimen's geographic location. Text describing the appearance of the specimen. A statement about the presence or absence of a taxon at the collection location. Text describing the significance of the specimen, such as a specific expedition or notable collection. Description of plant features such as leaf shape, size, color, stem texture, height, flower structure, scent, fruit or seed characteristics, root system type, overall growth habit and form, any notable aroma or secretions, presence of hairs or bristles, and any other distinguishing morphological or physiological characteristics.
+   habitat: Verbatim category or description of the habitat in which the specimen collection event occurred.
+   cultivated: Cultivated plants are intentionally grown by humans. In text descriptions, look for planting dates, garden, cult, cultivated, ornamental, cultivar names, or farm to indicate cultivated plant. Use yes if cultivated, otherwise leave blank.
+   country: The name of the country or major administrative unit in which the specimen was originally collected.
+   stateProvince: The name of the next smaller administrative region than country (state, province, canton, department, region, etc.) in which the specimen was originally collected.
+   county: The full, unabbreviated name of the next smaller administrative region than stateProvince (county, shire, department, parish etc.) in which the specimen was originally collected.
+   locality: Description of geographic location, landscape, landmarks, regional features, nearby places, municipality, city, or any contextual information aiding in pinpointing the exact origin or location of the specimen.
+   verbatimCoordinates: Verbatim location coordinates as they appear on the label. Do not convert formats. Possible coordinate types include [Lat, Long, UTM, TRS].
+   decimalLatitude: Latitude decimal coordinate. Correct and convert the verbatim location coordinates to conform with the decimal degrees GPS coordinate format.
+   decimalLongitude: Longitude decimal coordinate. Correct and convert the verbatim location coordinates to conform with the decimal degrees GPS coordinate format.
+   minimumElevationInMeters: Minimum elevation or altitude in meters. Only if units are explicit then convert from feet ("ft" or "ft." or "feet") to meters ("m" or "m." or "meters"). Round to integer. Values greater than 6000 are in feet and need to be converted.
+   maximumElevationInMeters: Maximum elevation or altitude in meters. If only one elevation is present, then max_elevation should be set to the null_value. Only if units are explicit then convert from feet ("ft" or "ft." or "feet") to meters ("m" or "m." or "meters"). Round to integer. Values greater than 6000 are in feet and need to be converted.
+   elevationUnits: Use m if the final elevation is reported in meters. Use ft if the final elevation is in feet. Units should match minimumElevationInMeters and maximumElevationInMeters.
+
+ mapping:
+   TAXONOMY:
+     - catalogNumber
+     - scientificName
+     - genus
+     - specificEpithet
+     - speciesNameAuthorship
+     - collectedBy
+     - collectorNumber
+     - identifiedBy
+   GEOGRAPHY:
+     - country
+     - stateProvince
+     - county
+     - locality
+     - verbatimCoordinates
+     - decimalLatitude
+     - decimalLongitude
+     - minimumElevationInMeters
+     - maximumElevationInMeters
+     - elevationUnits
+   COLLECTING:
+     - verbatimCollectionDate
+     - collectionDate
+     - collectionDateEnd
+     - cultivated
+     - habitat
+     - occurrenceRemarks
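
The `collectionDate` and elevation rules in this prompt are described only in prose, so a small Python sketch may help show what output they ask the LLM to produce. This is not part of the commit or of VoucherVision. The function names `format_collection_date` and `elevation_to_meters` are hypothetical, and treating every value above 6000 as feet, even when no unit is given, is one reading of the rule text.

```python
from typing import Optional

FEET_TO_METERS = 0.3048  # exact definition of the international foot


def format_collection_date(
    year: Optional[int] = None,
    month: Optional[int] = None,
    day: Optional[int] = None,
) -> str:
    """Return YYYY-MM-DD, replacing unknown components with zeros.

    Hypothetical helper mirroring the collectionDate rule text.
    """
    if year is None:
        return "0000-00-00"  # entire date unknown
    if month is None:
        return f"{year:04d}-00-00"  # only the year is known
    if day is None:
        return f"{year:04d}-{month:02d}-00"  # year and month known
    return f"{year:04d}-{month:02d}-{day:02d}"


def elevation_to_meters(value: float, units: Optional[str] = None) -> int:
    """Return the elevation as an integer number of meters.

    Assumption: convert when the units are explicitly feet, or when the
    value exceeds 6000 (which the rule text says indicates feet).
    """
    unit = (units or "").strip().lower().rstrip(".")
    if unit in {"ft", "feet"} or value > 6000:
        return round(value * FEET_TO_METERS)
    return round(value)


if __name__ == "__main__":
    print(format_collection_date(1987, 6))   # 1987-06-00
    print(elevation_to_meters(1000, "ft."))  # 305
```

A post-processing check along these lines could be compared against the LLM's JSON output to flag records where the date padding or unit conversion was not applied as the rules describe.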