Model Card for RoBERTa Social Roles Classifier

This model is a token classifier that extracts social roles from explicit expressions of self-identification in sentences, e.g. I am a designer, entrepreneur, and mother.

Model Details

Model Description

We continued pretraining RoBERTa-base for 10 epochs on individuals' about pages, a subset of Common Crawl that can be accessed here.

We then fine-tuned the model on hand-annotated token-level labels, as described in this paper, using a train/dev/test split of 600/200/200 labeled sentences.
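
A 600/200/200 split of 1,000 labeled sentences could be produced along the following lines. This is only a sketch: the card does not say how the split was drawn, so the random shuffle, the `seed` value, and the function name `split_sentences` are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_sentences(
    sentences: Sequence[str],
    n_train: int = 600,
    n_dev: int = 200,
    n_test: int = 200,
    seed: int = 0,  # hypothetical seed, not taken from the model card
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle the labeled sentences and cut them into train/dev/test."""
    if len(sentences) != n_train + n_dev + n_test:
        raise ValueError("split sizes must add up to the number of sentences")
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Placeholder sentences standing in for the 1,000 annotated examples.
examples = [f"sentence {i}" for i in range(1000)]
train, dev, test = split_sentences(examples)
print(len(train), len(dev), len(test))
```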

Our definition of "roles" or "occupations" on about pages is any singular noun referring to the subject of the bio. These can include roles and occupations that the subject held in the past, e.g. Throughout my life I have been a teacher, a startup founder, and a seashell collector.

Subject of the about page

  • First person biographies: the subject is I, me, my, mine.
  • Third person biographies: we assume the bio’s subject is the main person referenced in the excerpt sentence.

Positive examples of self-identification

  • I am a chef, author, and mom living in Virginia.
  • As an award-winning geologist, Sebastian has given talks around the world.
  • Knitter, blogger, & dreamer.

In the last example above, the sentence’s relation to the subject of the bio is implied rather than stated.

Negative examples

  • My wife loves beekeeping as well.
  • Janice works hard to accommodate every client.
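
To make the positive and negative examples above concrete, the sketch below shows how token-level labels might be turned into a list of roles. It assumes a binary scheme (1 for a role token, 0 for everything else) and whitespace-style tokens; the model's actual label set and tokenization are not stated in this card, and `extract_roles` is a hypothetical helper, not part of the released code.

```python
from typing import List


def extract_roles(tokens: List[str], labels: List[int]) -> List[str]:
    """Merge consecutive role-tagged tokens (label 1) into role strings."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    roles: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == 1:
            current.append(token)
        elif current:
            # A non-role token closes the role span collected so far.
            roles.append(" ".join(current))
            current = []
    if current:
        roles.append(" ".join(current))
    return roles


# Positive example: three roles refer to the subject of the bio.
positive_tokens = ["I", "am", "a", "chef", ",", "author", ",", "and",
                   "mom", "living", "in", "Virginia", "."]
positive_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

# Negative example: no token is a role held by the subject.
negative_tokens = ["My", "wife", "loves", "beekeeping", "as", "well", "."]
negative_labels = [0] * len(negative_tokens)

positive_roles = extract_roles(positive_tokens, positive_labels)
negative_roles = extract_roles(negative_tokens, negative_labels)
print(positive_roles)
print(negative_roles)
```

Under this assumed scheme, a multi-word role such as "startup founder" would come out as a single string because its tokens are adjacent and both tagged 1.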

Language(s) (NLP): English

License: Apache 2.0

Uses

We use tagged social roles in web pages to assess the social impact of LLM pretraining data curation decisions. Text linked to descriptions of its creators can also facilitate other areas of research, including self-presentation and language variation.

Evaluation

On our test set, the model achieves a precision of 0.856, a recall of 0.945, and an F1 score of 0.898.
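
As a quick consistency check, the reported F1 score follows from the reported precision and recall if F1 is the usual harmonic mean of the two. The snippet below is a minimal sketch of that arithmetic; the function name `f1_from_precision_recall` is illustrative only.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the test set in this card.
f1 = f1_from_precision_recall(0.856, 0.945)
print(round(f1, 3))
```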

Citation

@misc{lucy2024aboutme,
      title={AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters}, 
      author={Li Lucy and Suchin Gururangan and Luca Soldaini and Emma Strubell and David Bamman and Lauren Klein and Jesse Dodge},
      year={2024},
      eprint={2401.06408},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

lucy3_li@berkeley.edu
