<h2>Importing modules for accessing the dataset</h2>

In [5]:
import pickle

import pandas as pd

In [6]:
# The with-statement closes the file automatically, so no explicit f.close() is needed.
with open('../data/train_data.pkl', 'rb') as f:
    train_data = pickle.load(f)

<h2>Description of train_data</h2>


- **train_data** is a list of dictionaries.
- Each dictionary describes **one table**.
- Each dictionary has the following keys:
    - **act_table** : The table extracted from the XML/HTML of a materials science research paper.
    - **caption** : Caption of the extracted table.
    - **row_label** : Per-row labels indicating whether a component/composition is present in the row (in the example below, 1 marks composition rows and 0 marks the rest).
    - **col_label** : Per-column labels indicating whether a component/composition is present in the column (in the example below, 2 marks columns holding chemical compounds and 0 marks the rest).
    - **edge_list** : List of edges of the table graph.
    - **pii** : Publisher Item Identifier (PII) of the research article in Elsevier's ScienceDirect database.
    - **t_idx** : Zero-based index of the table within its research paper (table number minus 1).
    - **regex_table** : 1 if a regular expression is present in the table, else 0.
    - **num_rows** : Number of rows in the table
    - **num_cols** : Number of columns in the table
    - **num_cells** : Number of cells in the table.
    - **comp_table** : True/False, indicating whether the table is a composition table.
    - **input_ids** : Token IDs for each node, obtained by tokenizing with the m3rg-iitd/matscibert model from Hugging Face.
    - **attention_mask** : Attention mask for each node, obtained by tokenizing with the m3rg-iitd/matscibert model from Hugging Face.
    - **caption_input_ids** : Token IDs of the table caption, obtained by tokenizing with the m3rg-iitd/matscibert model from Hugging Face.
    - **caption_attention_mask** : Attention mask of the table caption, obtained by tokenizing with the m3rg-iitd/matscibert model from Hugging Face.
    - **footer** : Table footer text; None or empty (`{}` in the example below) if the table has no footer.
    - **gid_row_label** : Per-row flags; 1 for rows that contain glass (material) IDs, else 0.
    - **gid_col_label** : Per-column flags; 1 for columns that contain glass (material) IDs, else 0.
    - **sum_less_100** : 0 for a complete-information table (reported compositions add up to 100); 1 for a partial-information table.
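
To get a first feel for these keys, it can help to count composition tables and list the keys that every dictionary shares. The following is a minimal sketch, not part of the dataset's own tooling: `toy_data` is a hypothetical two-entry stand-in for `train_data` (the first entry copies metadata from the example table in this notebook, the second is invented, with a placeholder `pii`), and `count_composition_tables` is a helper name introduced here for illustration.

```python
from collections import Counter

# Hypothetical stand-in for train_data; only a few scalar keys are included.
toy_data = [
    {"pii": "S0022309309000416", "t_idx": 0, "num_rows": 12, "num_cols": 7,
     "num_cells": 84, "comp_table": True, "regex_table": 0, "sum_less_100": 0},
    # Invented entry for illustration only.
    {"pii": "PLACEHOLDER_PII", "t_idx": 1, "num_rows": 3, "num_cols": 2,
     "num_cells": 6, "comp_table": False, "regex_table": 0, "sum_less_100": 0},
]


def count_composition_tables(tables):
    """Count tables by their comp_table flag (True = composition table)."""
    return Counter(bool(t["comp_table"]) for t in tables)


counts = count_composition_tables(toy_data)
print(counts[True], "composition table(s),", counts[False], "other table(s)")

# Keys shared by every dictionary; useful for spotting missing fields.
common_keys = set.intersection(*(set(t) for t in toy_data))
print(sorted(common_keys))
```

With the real data, passing `train_data` in place of `toy_data` should work the same way, assuming every dictionary carries a `comp_table` key.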

<h3>Example of the information in one dictionary</h3>

In [70]:
idx = 30
table = train_data[idx]

print(table['t_idx'], table['pii'], table['caption'],'\n')
pd.DataFrame(table['act_table'])

0 S0022309309000416 Nominal compositions of samples, in mol%. 



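| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Series | Sample | Na2O | CaO | B2O3 | Al2O3 | SiO2 |
| 1 | B7 | B7N20 | 20 | 0 | 7 | 8 | 65 |
| 2 | B7 | B7N15 | 15 | 5 | 7 | 8 | 65 |
| 3 | B7 | B7N10 | 10 | 10 | 7 | 8 | 65 |
| 4 | B7 | B7N05 | 5 | 15 | 7 | 8 | 65 |
| 5 | B7 | B7N00 | 0 | 20 | 7 | 8 | 65 |
| 6 | | | | | | | |
| 7 | B21 | B21N20 | 20 | 0 | 21 | 8 | 51 |
| 8 | B21 | B21N15 | 15 | 5 | 21 | 8 | 51 |
| 9 | B21 | B21N10 | 10 | 10 | 21 | 8 | 51 |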


In [67]:
print(table['row_label'])
# 1 for rows where composition is present, else 0

[0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]


In [69]:
print(table['col_label'])
# 2 for columns where chemical compounds are present, else 0

[0, 0, 2, 2, 2, 2, 2]
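
The row and column labels can be combined to pull out the composition values. The sketch below illustrates this on the first ten rows of the example table (the full table has 12 rows, so only the first ten entries of `row_label` are used). It assumes that cells are stored as strings, that the empty row holds empty strings, and that the header row at index 0 holds the compound names; none of this is guaranteed for other tables in the dataset.

```python
# First ten rows of the example table, copied from the output above.
act_table = [
    ["Series", "Sample", "Na2O", "CaO", "B2O3", "Al2O3", "SiO2"],
    ["B7", "B7N20", "20", "0", "7", "8", "65"],
    ["B7", "B7N15", "15", "5", "7", "8", "65"],
    ["B7", "B7N10", "10", "10", "7", "8", "65"],
    ["B7", "B7N05", "5", "15", "7", "8", "65"],
    ["B7", "B7N00", "0", "20", "7", "8", "65"],
    ["", "", "", "", "", "", ""],
    ["B21", "B21N20", "20", "0", "21", "8", "51"],
    ["B21", "B21N15", "15", "5", "21", "8", "51"],
    ["B21", "B21N10", "10", "10", "21", "8", "51"],
]
row_label = [0, 1, 1, 1, 1, 1, 0, 1, 1, 1]  # first ten entries of table['row_label']
col_label = [0, 0, 2, 2, 2, 2, 2]           # table['col_label']

# Indices of rows labelled 1 (composition) and columns labelled 2 (compound).
comp_rows = [i for i, lab in enumerate(row_label) if lab == 1]
comp_cols = [j for j, lab in enumerate(col_label) if lab == 2]

# Assumption: the header row (index 0) names the compound in each column.
compounds = [act_table[0][j] for j in comp_cols]

# One {compound: value} mapping per composition row.
compositions = [
    {name: float(act_table[i][j]) for name, j in zip(compounds, comp_cols)}
    for i in comp_rows
]

for comp in compositions:
    print(comp)
```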


In [72]:
print(table['regex_table'])

0


In [73]:
table['num_rows'], table['num_cols'], table['num_cells']

(12, 7, 84)
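
For this table the three numbers are consistent: 12 rows times 7 columns gives 84 cells, and the label lists have 12 and 7 entries. A small sanity check along these lines could be run over every table. This is a sketch under the assumption that `num_cells` always equals `num_rows * num_cols` and that the label lists have one entry per row or column; `check_dimensions` is a hypothetical helper introduced here, not part of the dataset.

```python
def check_dimensions(table):
    """Return a list of detected inconsistencies; empty if all checks pass."""
    problems = []
    if table["num_rows"] * table["num_cols"] != table["num_cells"]:
        problems.append("num_cells != num_rows * num_cols")
    if len(table["row_label"]) != table["num_rows"]:
        problems.append("len(row_label) != num_rows")
    if len(table["col_label"]) != table["num_cols"]:
        problems.append("len(col_label) != num_cols")
    return problems


# Values copied from the example table in this notebook.
example = {
    "num_rows": 12,
    "num_cols": 7,
    "num_cells": 84,
    "row_label": [0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    "col_label": [0, 0, 2, 2, 2, 2, 2],
}

print(check_dimensions(example))
```

If the assumption does not hold for some tables (for example, tables with merged cells), the returned list simply points to the tables worth inspecting by hand.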

In [74]:
table['comp_table']

True

In [76]:
table['footer'] # no footer is present in this table

{}

In [78]:
table['gid_row_label'] # material ids are not present in rows

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [80]:
table['gid_col_label'] # material ids are present in the second column

[0, 1, 0, 0, 0, 0, 0]
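
Since the flag is set for column index 1, the material IDs of the composition rows can be read from that column. The following sketch shows this on the first four rows of the example table, using the first four entries of `row_label`; storing cells as strings is an assumption, and the approach only covers the case where IDs sit in a column rather than in a row.

```python
# First four rows of the example table and the matching labels.
act_table = [
    ["Series", "Sample", "Na2O", "CaO", "B2O3", "Al2O3", "SiO2"],
    ["B7", "B7N20", "20", "0", "7", "8", "65"],
    ["B7", "B7N15", "15", "5", "7", "8", "65"],
    ["B7", "B7N10", "10", "10", "7", "8", "65"],
]
row_label = [0, 1, 1, 1]                # first four entries of table['row_label']
gid_col_label = [0, 1, 0, 0, 0, 0, 0]   # table['gid_col_label']

# Columns flagged as holding material IDs.
id_cols = [j for j, flag in enumerate(gid_col_label) if flag == 1]

sample_ids = []
if id_cols:
    id_col = id_cols[0]
    # Read the ID from every row labelled as a composition row.
    sample_ids = [
        act_table[i][id_col] for i, lab in enumerate(row_label) if lab == 1
    ]

print(sample_ids)
```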

In [83]:
table['sum_less_100'] # since all the rows reporting material composition add up to 100, this flag is 0 for this table

0
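
The flag can be cross-checked by summing the composition values of each row. The sketch below does this for the eight composition rows visible in the example output. The rule it uses (flag is 1 if any row sums to less than 100, within a small tolerance) is an assumption made for illustration; the dataset's exact definition is not stated in this notebook, and `partial_information_flag` is a hypothetical helper.

```python
def partial_information_flag(rows, tol=1e-6):
    """Return 1 if any composition row sums to less than 100, else 0.

    Assumed rule for illustration; the dataset may define the flag differently.
    """
    return int(any(sum(row) < 100 - tol for row in rows))


# Na2O, CaO, B2O3, Al2O3, SiO2 values (mol%) from the example table.
example_rows = [
    [20, 0, 7, 8, 65],
    [15, 5, 7, 8, 65],
    [10, 10, 7, 8, 65],
    [5, 15, 7, 8, 65],
    [0, 20, 7, 8, 65],
    [20, 0, 21, 8, 51],
    [15, 5, 21, 8, 51],
    [10, 10, 21, 8, 51],
]

flag = partial_information_flag(example_rows)
print(flag)
```

Every row here adds up to 100, so the computed flag agrees with the stored value of 0.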