DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science

About

This repository contains the official implementation of DiSCoMaT: Distantly Supervised Composition Extraction from Tables in Materials Science. The model is trained on 4,408 distantly supervised tables from published materials science research papers and extracts the compositions reported in those tables. These tables yield a total of 38,799 tuples in the training set.

The tuples are of the form ${(id, c_k^{id}, p_k^{id}, u_k^{id} )}_{k=1}^{K^{id}}$, where

$id$ is the id of the material composition reported in the tables
$c_k^{id}$ is the k-th chemical element in the material composition
$p_k^{id}$ is the percentage of the k-th chemical element in the material composition
$u_k^{id}$ is the unit of $p_k^{id}$ (either mole % or weight %)
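
As an illustration of the tuple format above, the following minimal Python sketch shows one way such tuples could be represented and grouped by composition id. It is not taken from this repository: the names `CompositionTuple`, `group_by_material` and `VALID_UNITS`, the unit labels `"mol%"` and `"wt%"`, and the example values are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Assumed unit labels for illustration; the dataset may encode units differently.
VALID_UNITS = {"mol%", "wt%"}


@dataclass(frozen=True)
class CompositionTuple:
    """One (id, c_k, p_k, u_k) tuple; a hypothetical container, not the repository's class."""
    material_id: str   # id: identifier of the material composition
    constituent: str   # c_k: the k-th chemical element
    percentage: float  # p_k: percentage of the k-th chemical element
    unit: str          # u_k: unit of p_k (mole % or weight %)

    def __post_init__(self) -> None:
        # Reject units other than the two named in the tuple definition.
        if self.unit not in VALID_UNITS:
            raise ValueError(f"unexpected unit: {self.unit!r}")


def group_by_material(tuples: List[CompositionTuple]) -> Dict[str, List[CompositionTuple]]:
    """Collect the K tuples that share the same composition id."""
    grouped: Dict[str, List[CompositionTuple]] = defaultdict(list)
    for t in tuples:
        grouped[t.material_id].append(t)
    return dict(grouped)


# Made-up example values, not drawn from the dataset.
example = [
    CompositionTuple("A1", "Cu", 60.0, "wt%"),
    CompositionTuple("A1", "Zn", 40.0, "wt%"),
    CompositionTuple("A2", "Cu", 70.0, "wt%"),
    CompositionTuple("A2", "Zn", 30.0, "wt%"),
]
by_material = group_by_material(example)
```

Grouping by id recovers, for each composition, the set of $K^{id}$ tuples described above; a simple sanity check under this sketch is that the percentages within one group sum to roughly 100.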

The following figure represents the architecture of the model proposed in our work.

Notes

The code directory contains the file for training models reported in this paper.
The data directory contains the dataset for training models reported in this paper.
The notebooks directory contains a Jupyter notebook for visualising the dataset.
The respective directories and sub-directories contain task-specific README files.

Citation

If you find this repository useful, please cite our work as follows:

To appear at ACL 2023. The citation will be added soon.