metadata
title: NLC Explorer
emoji: 🧭 🔍 ⁉️
colorFrom: gray
colorTo: purple
sdk: streamlit
sdk_version: 1.10.0
app_file: app.py
pinned: false
license: mit

NLC-Explorer

A Natural Language Counterfactual Generator for Exploring Bias in Sentiment Analysis Algorithms

Overview

This project is an offshoot of the project on Interactive Model Cards. It focuses on giving people more ways to explore a model's outputs by generating alternatives (technically, counterfactuals). We believe that seeing multiple alternatives may help people better understand the limitations of a model and develop a sense of its trustworthiness and bias.
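
To make the idea of alternatives concrete, here is a minimal sketch of how counterfactual sentences could be produced by swapping one term for others from a curated list. The helper name `generate_counterfactuals` and the sample list are assumptions made for illustration; they are not taken from the app's code.

```python
import re


def generate_counterfactuals(sentence, target, alternatives):
    """Return copies of `sentence` with `target` swapped for each alternative.

    Hypothetical helper for illustration. Matching is whole-word and
    case-insensitive; the target itself is skipped if it is in the list.
    """
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", flags=re.IGNORECASE)
    if not pattern.search(sentence):
        return []
    return [
        pattern.sub(lambda _match, alt=alt: alt, sentence)
        for alt in alternatives
        if alt.lower() != target.lower()
    ]


# Sample list invented for this example.
professions = ["doctor", "nurse", "teacher"]
for text in generate_counterfactuals("The doctor was rude to me.", "doctor", professions):
    print(text)
```

Each generated sentence differs from the original by a single term, so any change in the model's prediction can be attributed to that term.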

Known Limitations
  • Words that are not in the spaCy vocabulary for en_core_web_lg have no vectors, so similarity scores cannot be computed for them.
  • WordNet has many limitations owing to its age and the lack of funding for ongoing maintenance. It covers a large part of the English language, but certain words simply do not exist in it.
  • There are currently only 2 lists (Countries and Professions). We would like to find community-curated lists for: Race, Sexual Orientation and Gender Identity (SOGI), Religion, Age, and other protected statuses.
  • We do not have a custom pipeline for Named Entity Recognition (NER), or a matcher, to identify complex terms (e.g., "two spirit", "male to female", "Asian American") and so these will not be fully available for interrogation.
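
The first limitation can be guarded against in code. The sketch below is one possible approach, not the app's implementation: `similarity_or_none` is a hypothetical helper that relies only on the `has_vector` attribute and `similarity` method that spaCy tokens expose, and `demo` assumes en_core_web_lg has been downloaded.

```python
def similarity_or_none(token_a, token_b):
    """Return a similarity score, or None if either token lacks a vector.

    Hypothetical helper; works with spaCy Token objects, which expose
    `has_vector` and `similarity`.
    """
    if not (token_a.has_vector and token_b.has_vector):
        return None
    return token_a.similarity(token_b)


def demo():
    # Requires: pip install spacy && python -m spacy download en_core_web_lg
    import spacy

    nlp = spacy.load("en_core_web_lg")
    doc = nlp("doctor nurse qwxzvbn")  # the last string is a made-up word
    for other in doc[1:]:
        print(doc[0].text, other.text, similarity_or_none(doc[0], other))
```

Returning None rather than a score lets the caller tell "no vector available" apart from "genuinely dissimilar".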
Key Dependencies and Packages
  1. Hugging Face Transformers - The model we've designed this iteration for is hosted on Hugging Face. It is: distilbert-base-uncased-finetuned-sst-2-english.
  2. Streamlit - This is the library we're using to build the prototype app because it is easy to stand up and quick to fix.
  3. spaCy - This is the main NLP library we're using, and it runs most of the text manipulation we're doing as part of the project.
  4. NLTK + WordNet - This is the initial lexical database we're using because it is accessible directly through Python and it is free. We will be considering a move to ConceptNet for future iterations because it offers better lateral movement across edges.
  5. LIME - We chose LIME over SHAP because LIME has more of the functionality we need. SHAP appears to provide greater performance but is not as well suited to our original designs.
  6. Altair - We're using Altair because it's well integrated into Streamlit.
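
As a rough illustration of how the model from item 1 could be used to compare alternatives, the sketch below scores a set of sentences and reports the largest gap between them. The helpers `score_alternatives` and `score_spread` are hypothetical names, and the "signed score" convention is an assumption made for this example rather than the app's method. `demo` assumes `transformers` and a backend such as `torch` are installed.

```python
def score_alternatives(texts, classify):
    """Map each text to a signed sentiment score in [-1, 1].

    `classify` is any callable that takes a list of strings and returns one
    dict per string with "label" and "score" keys, the shape returned by a
    Transformers text-classification pipeline.
    """
    results = classify(texts)
    signed = {}
    for text, res in zip(texts, results):
        sign = 1.0 if res["label"] == "POSITIVE" else -1.0
        signed[text] = sign * res["score"]
    return signed


def score_spread(signed_scores):
    """Largest gap between any two alternatives (0.0 for an empty input)."""
    values = list(signed_scores.values())
    return max(values) - min(values) if values else 0.0


def demo():
    # Requires: pip install transformers torch
    from transformers import pipeline

    classify = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )
    texts = ["The doctor was rude to me.", "The nurse was rude to me."]
    scores = score_alternatives(texts, classify)
    print(scores, score_spread(scores))
```

A large spread across sentences that differ by a single term suggests the prediction is sensitive to that term, which is the kind of signal the alternatives are meant to surface.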