textgraphs / docs /nlp.md
Paco Nathan
A new start
91eaff6

A newer version of the Streamlit SDK is available: 1.36.0

Upgrade

The open source spaCy library in Python provides full-featured NLP capabilities. #honnibal2020spacy This serves as a core component of this project. Recent releases of spaCy have provided features to integrate with selected large models, and also support native features for entity linking.

On the one hand, spaCy pipelines offer a broad range of integrations and "opinionated" selections for both utility and ease of use. The resulting pipelines are optimized for annotating streams of spans of tokens. On the other hand, the opinionated API calls and the abstractions use for pipeline construction and configuration present some important constraints:

  • Pipelines are not especially well-suited for propagating other forms of generated data, beyond token/span streams.
  • Tokenization used in spaCy does not align with the requirements for relation extraction projects of interest.
  • Entity linking capabilities rely on using an internally defined "knowledge base" which is not well-suited for integrating with heterogeneous resources.

Consequently, while spaCy serves as a core component for NLP capabilities, this project presents a library of Python class definitions for KG construction which can be extended and configured to accommodate a broad range of LLM components. These "less opinionated" pipeline definitions, in the broader scope, are optimized for managing streams of KG candidate elements which have been produced by generative AI.