
Bidirectional Encoder Representations from Transformers (BERT) is a groundbreaking method in natural language processing (NLP) that has revolutionized how machines understand human language. Developed by researchers at Google, BERT is based on the Transformer architecture and uses deep learning to process each word in relation to all the other words in a sentence, unlike traditional methods that process words sequentially. This bidirectional approach allows BERT to capture the context of a word more effectively, leading to significant improvements in a variety of NLP tasks, such as question answering, named entity recognition, and sentiment analysis.

The objective of our project is to leverage the powerful capabilities of BERT for a specific, critical NLP task: identifying mentions of datasets within textual content. In the realm of research, especially within fields that heavily depend on data, such as economics, life sciences, and social sciences, recognizing references to datasets is crucial. These mentions can provide insights into the data sources researchers are utilizing, foster data sharing, and enhance reproducibility in scientific research. The goal is to eventually construct a centralized database of dataset mentions.

By fine-tuning BERT on our specially consolidated training data (details can be found here), we aim to develop a robust classifier capable of accurately distinguishing sentences that contain dataset mentions from those that do not. The ability to automatically detect dataset references can significantly benefit researchers, librarians, and data curators by streamlining the process of linking research outcomes with the underlying data, thereby advancing the frontiers of open science and data-driven research.
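
Once fine-tuned, the classifier can be applied to individual sentences. Below is a minimal, illustrative sketch using the Hugging Face transformers library; the model identifier, the label mapping, and the example sentence are placeholders for this write-up, not details taken from this card.

```python
# Minimal sketch: classifying a sentence as containing a dataset mention or not.
# The model identifier is a placeholder; substitute the actual repository name.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "your-org/bert-dataset-mention-classifier"  # hypothetical repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

sentence = "We estimate poverty rates using the 2018 Demographic and Health Survey."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

# Assumes label 1 = "contains a dataset mention", label 0 = "no mention".
predicted = logits.argmax(dim=-1).item()
print("dataset mention" if predicted == 1 else "no dataset mention")
```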
