LLM: gpt
instructions: '1. Refactor the unstructured OCR text into a dictionary based on the
  JSON structure outlined below.

  2. You should map the unstructured OCR text to the appropriate JSON key and then
  populate the field based on its rules.

  3. Some JSON key fields are permitted to remain empty if the corresponding information
  is not found in the unstructured OCR text.

  4. Ignore any information in the OCR text that doesn''t fit into the defined JSON
  structure.

  5. Duplicate dictionary fields are not allowed.

  6. Ensure that all JSON keys are in lowercase.

  7. Ensure that new JSON field values follow sentence case capitalization.

  8. Ensure all key-value pairs in the JSON dictionary strictly adhere to the format
  and data types specified in the template.

  9. Ensure the output JSON string is valid JSON format. It should not have trailing
  commas or unquoted keys.

  10. Only return a JSON dictionary represented as a string. You should not explain
  your answer.'
json_formatting_instructions: "The next section of instructions outlines how to format\
  \ the JSON dictionary. The keys are the same as those of the final formatted JSON\
  \ object.\nFor each key there is a format requirement that specifies how to transcribe\
  \ the information for that key.\nThe possible formatting options are:\n1. \"verbatim\
  \ transcription\" - field is populated with verbatim text from the unformatted OCR.\n\
  2. \"spell check transcription\" - field is populated with spelling-corrected text\
  \ from the unformatted OCR.\n3. \"boolean yes no\" - field is populated with only\
  \ yes or no.\n4. \"boolean 1 0\" - field is populated with only 1 or 0.\n5. \"integer\"\
  \ - field is populated with only an integer.\n6. \"[list]\" - field is populated\
  \ from one of the values in the list.\n7. \"yyyy-mm-dd\" - field is populated with\
  \ a date in the format year-month-day.\nThe desired null value is also given. Populate\
  \ the field with the null value if the information for that key is not present in\
  \ the unformatted OCR text."
mapping:
  COLLECTING:
  - collectors
  - collector_number
  - determined_by
  - multiple_names
  - verbatim_date
  - date
  - end_date
  GEOGRAPHY:
  - country
  - state
  - county
  - min_elevation
  - max_elevation
  - elevation_units
  LOCALITY:
  - locality_name
  - verbatim_coordinates
  - decimal_coordinates
  - datum
  - plant_description
  - cultivated
  - habitat
  MISCELLANEOUS: []
  TAXONOMY:
  - catalog_number
  - genus
  - species
  - subspecies
  - variety
  - forma
rules:
  Dictionary:
    catalog_number:
      description: The barcode identifier, typically a number with at least 6 digits,
        but fewer than 30 digits.
      format: verbatim transcription
      null_value: ''
    collector_number:
      description: Unique identifier or number that denotes the specific collecting
        event and associated with the collector.
      format: verbatim transcription
      null_value: s.n.
    collectors:
      description: Full name(s) of the individual(s) responsible for collecting the
        specimen. When multiple collectors are involved, their names should be separated
        by commas.
      format: verbatim transcription
      null_value: not present
    country:
      description: Country that corresponds to the current geographic location of
        collection. Capitalize first letter of each word. If abbreviation is given
        populate field with the full spelling of the country's name.
      format: spell check transcription
      null_value: ''
    county:
      description: Administrative division 2 that corresponds to the current geographic
        location of collection; capitalize first letter of each word. Administrative
        division 2 is equivalent to a U.S. county, parish, borough.
      format: spell check transcription
      null_value: ''
    cultivated:
      description: Cultivated plants are intentionally grown by humans. In text descriptions,
        look for planting dates, garden locations, ornamental, cultivar names, garden,
        or farm to indicate cultivated plant. The value 1 indicates that the specimen
        was cultivated, the value zero otherwise.
      format: boolean 1 0
      null_value: '0'
    date:
      description: 'Date the specimen was collected formatted as year-month-day. If
        specific components of the date are unknown, they should be replaced with
        zeros. Examples: ''0000-00-00'' if the entire date is unknown, ''YYYY-00-00''
        if only the year is known, and ''YYYY-MM-00'' if year and month are known
        but day is not.'
      format: yyyy-mm-dd
      null_value: ''
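    # Illustrative examples of the zero-padding rule above. The label strings are
    # hypothetical, not taken from a real specimen:
    #   "12 June 1923" -> date: '1923-06-12'
    #   "June 1923"    -> date: '1923-06-00'
    #   "1923"         -> date: '1923-00-00'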
    datum:
      description: Datum of location coordinates. Possible values are included in the
        format list. Leave field blank if unclear. [WGS84, WGS72, WGS66, WGS60, NAD83,
        NAD27, OSGB36, ETRS89, ED50, GDA94, JGD2011, Tokyo97, KGD2002, TWD67, TWD97,
        BJS54, XAS80, GCJ-02, BD-09, PZ-90.11, GTRF, CGCS2000, ITRF88, ITRF89, ITRF90,
        ITRF91, ITRF92, ITRF93, ITRF94, ITRF96, ITRF97, ITRF2000, ITRF2005, ITRF2008,
        ITRF2014, Hong Kong Principal Datum, SAD69]
      format: '[list]'
      null_value: ''
    decimal_coordinates:
      description: Correct and convert the verbatim location coordinates to conform
        with the decimal degrees GPS coordinate format.
      format: spell check transcription
      null_value: ''
    determined_by:
      description: Full name of the individual responsible for determining the taxonomic
        name of the specimen. Sometimes the name will be near to the characters 'det'
        to denote determination. This name may be isolated from other names in the
        unformatted OCR text.
      format: verbatim transcription
      null_value: ''
    elevation_units:
      description: 'Elevation units must be meters. If min_elevation field is populated,
        then elevation_units: ''m''. Otherwise elevation_units: ''''.'
      format: spell check transcription
      null_value: ''
    end_date:
      description: 'If a date range is provided, this represents the later or ending
        date of the collection period, formatted as year-month-day. If specific components
        of the date are unknown, they should be replaced with zeros. Examples: ''0000-00-00''
        if the entire end date is unknown, ''YYYY-00-00'' if only the year of the
        end date is known, and ''YYYY-MM-00'' if year and month of the end date are
        known but the day is not.'
      format: yyyy-mm-dd
      null_value: ''
    forma:
      description: Taxonomic determination to form (f.).
      format: verbatim transcription
      null_value: ''
    genus:
      description: Taxonomic determination to genus. Genus must be capitalized. If
        genus is not present use the taxonomic family name followed by the word 'indet'.
      format: verbatim transcription
      null_value: ''
    habitat:
      description: Description of a plant's habitat or the location where the specimen
        was collected. Ignore descriptions of the plant itself.
      format: verbatim transcription
      null_value: ''
    locality_name:
      description: Description of geographic location, landscape, landmarks, regional
        features, nearby places, or any contextual information aiding in pinpointing
        the exact origin or site of the specimen.
      format: verbatim transcription
      null_value: ''
    max_elevation:
      description: Maximum elevation or altitude in meters. If only one elevation
        is present, then max_elevation should be set to the null_value. Only if units
        are explicit then convert from feet ('ft' or 'ft.' or 'feet') to meters ('m'
        or 'm.' or 'meters'). Round to integer.
      format: integer
      null_value: ''
    min_elevation:
      description: Minimum elevation or altitude in meters. Only if units are explicit
        then convert from feet ('ft' or 'ft.' or 'feet') to meters ('m' or 'm.' or
        'meters'). Round to integer.
      format: integer
      null_value: ''
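    # Worked conversion for the elevation rules above, using hypothetical label
    # values and the exact factor 1 ft = 0.3048 m:
    #   "5000 ft"      -> min_elevation: 1524, max_elevation: '' (single elevation)
    #                     5000 * 0.3048 = 1524.0
    #   "4500-5000 ft" -> min_elevation: 1372, max_elevation: 1524
    #                     4500 * 0.3048 = 1371.6, rounded to 1372
    # In both cases elevation_units is 'm' because min_elevation is populated.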
    multiple_names:
      description: Indicate whether multiple people or collector names are present
        in the unformatted OCR text. If you see more than one person's name the value
        is 'yes'; otherwise the value is 'no'.
      format: boolean yes no
      null_value: ''
    plant_description:
      description: Description of plant features such as leaf shape, size, color,
        stem texture, height, flower structure, scent, fruit or seed characteristics,
        root system type, overall growth habit and form, any notable aroma or secretions,
        presence of hairs or bristles, and any other distinguishing morphological
        or physiological characteristics.
      format: verbatim transcription
      null_value: ''
    species:
      description: Taxonomic determination to species, do not capitalize species.
      format: verbatim transcription
      null_value: ''
    state:
      description: Administrative division 1 that corresponds to the current geographic
        location of collection. Capitalize first letter of each word. Administrative
        division 1 is equivalent to a U.S. State.
      format: spell check transcription
      null_value: ''
    subspecies:
      description: Taxonomic determination to subspecies (subsp.).
      format: verbatim transcription
      null_value: ''
    variety:
      description: Taxonomic determination to variety (var.).
      format: verbatim transcription
      null_value: ''
    verbatim_coordinates:
      description: Verbatim location coordinates as they appear on the label. Do not
        convert formats. Possible coordinate types are one of [Lat, Long, UTM, TRS].
      format: verbatim transcription
      null_value: ''
    verbatim_date:
      description: Date of collection exactly as it appears on the label. Do not change
        the format or correct typos.
      format: verbatim transcription
      null_value: s.d.
  SpeciesName:
    taxonomy:
    - Genus_species