---
title: NLC Explorer
emoji: 🧭 🔍 ⁉️
colorFrom: gray
colorTo: purple
sdk: streamlit
sdk_version: 1.10.0
app_file: app.py
pinned: false
license: mit
---

# NLC-Explorer
### A Natural Language Counterfactual Generator for Exploring Bias in Sentiment Analysis Algorithms

##### Overview
This project is an offshoot of the [Interactive Model Cards](https://github.com/amcrisan/interactive-model-cards) project. It gives people more ways to explore a model's outputs by generating alternatives (technically, [counterfactuals](https://plato.stanford.edu/entries/counterfactuals/#WhatCoun)). We believe that seeing multiple alternatives may help people better understand the limitations of a model and develop a sense of its trustworthiness and bias.

##### Known Limitations
* Words not in the spaCy vocab for `en_core_web_lg` have no vectors, so similarity scores cannot be computed for them.
* WordNet has many limitations owing to its age and the lack of funding for ongoing maintenance. It covers a large portion of the English language, but certain words simply do not exist in it.
* There are currently only 2 lists (Countries and Professions). We would like to find community-curated lists for: Race, Sexual Orientation and Gender Identity (SOGI), Religion, Age, and other protected statuses.
* We do not have a custom pipeline for Named Entity Recognition (NER), or a matcher, to identify complex terms (e.g., "two spirit", "male to female", "Asian American"), so these are not fully available for interrogation.

##### Key Dependencies and Packages
1. [Hugging Face Transformers](https://huggingface.co/) - The model we've designed this iteration for is hosted on Hugging Face. It is [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
2. [Streamlit](https://streamlit.io) - This is the library we're using to build the prototype app because it is easy to stand up and quick to fix.
3. [spaCy](https://spacy.io) - This is the main NLP library we're using, and it runs most of the text manipulation in the project.
4. [NLTK + WordNet](https://www.nltk.org/howto/wordnet.html) - This is the initial lexical database we're using because it is free and accessible directly through Python. We will consider a move to [ConceptNet](https://conceptnet.io/) in future iterations for its better lateral movement across edges.
5. [Lime](https://github.com/marcotcr/lime) - We chose LIME over SHAP because LIME has more of the functionality we need. SHAP appears to offer better performance but does not fit our original designs as easily.
6. [Altair](https://altair-viz.github.io/user_guide/encoding.html) - We're using Altair because it's well integrated into Streamlit.
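
To make the idea of list-based alternatives concrete, here is a minimal sketch of how a term from a curated list (such as Professions) might be swapped for the other members of that list. It is an illustration only and not the code in `app.py`: the function name `generate_counterfactuals` and the abbreviated `PROFESSIONS` list are hypothetical, and it uses plain whole-word matching from the standard library rather than the spaCy and WordNet machinery described above.

```python
import re

# Hypothetical, abbreviated stand-in for a curated term list (not the project's data).
PROFESSIONS = ["doctor", "nurse", "teacher", "engineer"]


def generate_counterfactuals(sentence, terms):
    """Return variants of `sentence` with each matched list term swapped for the others."""
    variants = []
    for term in terms:
        # Match the term as a whole word, ignoring case.
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        if not pattern.search(sentence):
            continue
        for alternative in terms:
            if alternative != term:
                # A callable replacement inserts the alternative literally.
                variants.append(pattern.sub(lambda _match: alternative, sentence))
    return variants


alternatives = generate_counterfactuals("The nurse was late again.", PROFESSIONS)
for text in alternatives:
    print(text)
```

Each variant could then be passed to the sentiment model alongside the original sentence so the predictions can be compared. Because the matching is single-word only, this sketch shares the limitation noted above for multi-word terms such as "Asian American".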