VoucherVision / custom_prompts /version_2_OSU.yaml
phyloforfun's picture
file upload gallery
bf58fe8
raw
history blame
No virus
10.3 kB
prompt_author: Will Weaver
prompt_author_institution: UM
prompt_description: Version 2 slightly modified for OSU, but still unfinished
LLM: gpt
instructions: '1. Refactor the unstructured OCR text into a dictionary based on the
JSON structure outlined below.
2. You should map the unstructured OCR text to the appropriate JSON key and then
populate the field based on its rules.
3. Some JSON key fields are permitted to remain empty if the corresponding information
is not found in the unstructured OCR text.
4. Ignore any information in the OCR text that doesn''t fit into the defined JSON
structure.
5. Duplicate dictionary fields are not allowed.
6. Ensure that all JSON keys are in lowercase.
7. Ensure that new JSON field values follow sentence case capitalization.
8. Ensure all key-value pairs in the JSON dictionary strictly adhere to the format
and data types specified in the template.
9. Ensure the output JSON string is valid JSON format. It should not have trailing
commas or unquoted keys.
10. Only return a JSON dictionary represented as a string. You should not explain
your answer.'
json_formatting_instructions: "The next section of instructions outlines how to format\
\ the JSON dictionary. The keys are the same as those of the final formatted JSON\
\ object.\nFor each key there is a format requirement that specifies how to transcribe\
\ the information for that key. \nThe possible formatting options are:\n1. \"verbatim\
\ transcription\" - field is populated with verbatim text from the unformatted OCR.\n\
2. \"spell check transcription\" - field is populated with spelling corrected text\
\ from the unformatted OCR.\n3. \"boolean yes no\" - field is populated with only\
\ yes or no.\n4. \"boolean 1 0\" - field is populated with only 1 or 0.\n5. \"integer\"\
\ - field is populated with only an integer.\n6. \"[list]\" - field is populated\
\ from one of the values in the list.\n7. \"yyyy-mm-dd\" - field is populated with\
\ a date in the format year-month-day.\nThe desired null value is also given. Populate\
\ the field with the null value of the information for that key is not present in\
\ the unformatted OCR text."
mapping:
COLLECTING:
- collectors
- collector_number
- determined_by
- multiple_names
- verbatim_date
- date
- end_date
GEOGRAPHY:
- country
- state
- county
- min_elevation
- max_elevation
- elevation_units
LOCALITY:
- locality_name
- verbatim_coordinates
- decimal_coordinates
- datum
- plant_description
- cultivated
- habitat
MISCELLANEOUS: []
TAXONOMY:
- catalog_number
- genus
- species
- subspecies
- variety
- forma
rules:
Dictionary:
catalog_number:
description: The barcode identifier, typically a number with at least 6 digits,
but fewer than 30 digits.
format: verbatim transcription
null_value: ''
collector_number:
description: Unique identifier or number that denotes the specific collecting
event and associated with the collector.
format: verbatim transcription
null_value: s.n.
collectors:
description: Full name(s) of the individual(s) responsible for collecting the
specimen. When multiple collectors are involved, their names should be separated
by commas.
format: verbatim transcription
null_value: not present
country:
description: Country that corresponds to the current geographic location of
collection. Capitalize first letter of each word. If abbreviation is given
populate field with the full spelling of the country's name.
format: spell check transcription
null_value: ''
county:
description: Administrative division 2 that corresponds to the current geographic
location of collection; capitalize first letter of each word. Administrative
division 2 is equivalent to a U.S. county, parish, borough.
format: spell check transcription
null_value: ''
cultivated:
description: Cultivated plants are intentionally grown by humans. In text descriptions,
look for planting dates, garden locations, ornamental, cultivar names, garden,
or farm to indicate cultivated plant. The value 1 indicates that the specimen
was cultivated, the value zero otherwise.
format: boolean 1 0
null_value: '0'
date:
description: 'Date the specimen was collected formatted as year-month-day. If
specific components of the date are unknown, they should be replaced with
zeros. Examples: ''0000-00-00'' if the entire date is unknown, ''YYYY-00-00''
if only the year is known, and ''YYYY-MM-00'' if year and month are known
but day is not.'
format: yyyy-mm-dd
null_value: ''
datum:
description: Datum of location coordinates. Possible values are include in the
format list. Leave field blank if unclear. [WGS84, WGS72, WGS66, WGS60, NAD83,
NAD27, OSGB36, ETRS89, ED50, GDA94, JGD2011, Tokyo97, KGD2002, TWD67, TWD97,
BJS54, XAS80, GCJ-02, BD-09, PZ-90.11, GTRF, CGCS2000, ITRF88, ITRF89, ITRF90,
ITRF91, ITRF92, ITRF93, ITRF94, ITRF96, ITRF97, ITRF2000, ITRF2005, ITRF2008,
ITRF2014, Hong Kong Principal Datum, SAD69]
format: '[list]'
null_value: ''
decimal_coordinates:
description: Correct and convert the verbatim location coordinates to conform
with the decimal degrees GPS coordinate format.
format: spell check transcription
null_value: ''
determined_by:
description: Full name of the individual responsible for determining the taxanomic
name of the specimen. Sometimes the name will be near to the characters 'det'
to denote determination. This name may be isolated from other names in the
unformatted OCR text.
format: verbatim transcription
null_value: ''
elevation_units:
description: 'Elevation units must be meters. If min_elevation field is populated,
then elevation_units: ''m''. Otherwise elevation_units: ''''.'
format: spell check transcription
null_value: ''
end_date:
description: 'If a date range is provided, this represents the later or ending
date of the collection period, formatted as year-month-day. If specific components
of the date are unknown, they should be replaced with zeros. Examples: ''0000-00-00''
if the entire end date is unknown, ''YYYY-00-00'' if only the year of the
end date is known, and ''YYYY-MM-00'' if year and month of the end date are
known but the day is not.'
format: yyyy-mm-dd
null_value: ''
forma:
description: Taxonomic determination to form (f.).
format: verbatim transcription
null_value: ''
genus:
description: Taxonomic determination to genus. Genus must be capitalized. If
genus is not present use the taxonomic family name followed by the word 'indet'.
format: verbatim transcription
null_value: ''
habitat:
description: Description of a plant's habitat or the location where the specimen
was collected. Ignore descriptions of the plant itself.
format: verbatim transcription
null_value: ''
locality_name:
description: Description of geographic location, landscape, landmarks, regional
features, nearby places, or any contextual information aiding in pinpointing
the exact origin or site of the specimen.
format: verbatim transcription
null_value: ''
max_elevation:
description: Maximum elevation or altitude in meters. If only one elevation
is present, then max_elevation should be set to the null_value. Only if units
are explicit then convert from feet ('ft' or 'ft.' or 'feet') to meters ('m'
or 'm.' or 'meters'). Round to integer.
format: integer
null_value: ''
min_elevation:
description: Minimum elevation or altitude in meters. Only if units are explicit
then convert from feet ('ft' or 'ft.' or 'feet') to meters ('m' or 'm.' or
'meters'). Round to integer.
format: integer
null_value: ''
multiple_names:
description: Indicate whether multiple people or collector names are present
in the unformatted OCR text. If you see more than one person's name the value
is 'yes'; otherwise the value is 'no'.
format: boolean yes no
null_value: ''
plant_description:
description: Description of plant features such as leaf shape, size, color,
stem texture, height, flower structure, scent, fruit or seed characteristics,
root system type, overall growth habit and form, any notable aroma or secretions,
presence of hairs or bristles, and any other distinguishing morphological
or physiological characteristics.
format: verbatim transcription
null_value: ''
species:
description: Taxonomic determination to species, do not capitalize species.
format: verbatim transcription
null_value: ''
state:
description: Administrative division 1 that corresponds to the current geographic
location of collection. Capitalize first letter of each word. Administrative
division 1 is equivalent to a U.S. State.
format: spell check transcription
null_value: ''
subspecies:
description: Taxonomic determination to subspecies (subsp.).
format: verbatim transcription
null_value: ''
variety:
description: Taxonomic determination to variety (var).
format: verbatim transcription
null_value: ''
verbatim_coordinates:
description: Verbatim location coordinates as they appear on the label. Do not
convert formats. Possible coordinate types are one of [Lat, Long, UTM, TRS].
format: verbatim transcription
null_value: ''
verbatim_date:
description: Date of collection exactly as it appears on the label. Do not change
the format or correct typos.
format: verbatim transcription
null_value: s.d.
SpeciesName:
taxonomy:
- Genus_species