t5-base-sci-names

Biodiversity literature is dedicated to the identification, documentation, and categorization of plants, fungi, animals, and other living organisms. Correctly extracting the name of an organism within these documents involves finding the entire scientific name–including the genus, specific epithet, and author name. Extracting these names allows biologists to access documents about a species more comprehensively, and to track an organism’s history of documentation, which includes biological changes and changes in how scientists describe them.

t5-base-sci-names uses advances in text-to-text generation to generate scientific names and authors from biodiversity literature. This model was trained on hand-labeled biodiversity texts, including labeled information about a mentioned organism's genus (abbreviated and expanded), specific epithet, and author. This model was trained to output 0-N scientific names with specific prefixes (e.g. "genus = " or "epithet = ") and performs best with anywhere from 20-120 words.

You can also use the model in this tutorial for scientific names generation.

Thanks to Damon Little and Nelson Salinas at the New York Botanical Gardens for their support.

Note that this model is still a work in progress. Any feedback is welcome.