{ "cells": [ { "cell_type": "markdown", "id": "a6b5d57a-0922-4f4f-8003-ae41c9d45762", "metadata": {}, "source": [ "

Importing modules for accessing the dataset" ] }, { "cell_type": "code", "execution_count": 5, "id": "d612cf91-70ad-43d4-addc-ff91be943a35", "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 6, "id": "8ae54d1c-26aa-4e7e-bbf0-789e41385c29", "metadata": {}, "outputs": [], "source": [ "with open('../data/train_data.pkl', 'rb') as f:\n", " train_data = pickle.load(f)\n", "f.close()" ] }, { "cell_type": "markdown", "id": "dc7fa971-de07-4c47-ac86-36019cf5925d", "metadata": {}, "source": [ "

Description of train_data" ] }, { "cell_type": "markdown", "id": "5f790948-2ae6-4228-9234-3e8e1d8536d1", "metadata": {}, "source": [ "\n", "- **train_data** contains a list of dictionaries\n", "- each dictionary is associated with **one table**\n", "- Following keys are present in each dictionary:\n", " - **act_table** : It is the table extracted from the XML/HTML of the materials science research papers.\n", " - **caption** : Caption of the extracted table.\n", " - **row_label** : It tells whether the component/composition is present in the row.\n", " - **col_label** : It tells whether the component/composition is present in the column.\n", " - **edge_list** : List of edges of table graph.\n", " - **pii** : Personally identifiable information of research articles in Elsevier's ScienceDirect database.\n", " - **t_idx** : Number of the table in the respective research paper - 1\n", " - **regex_table** : 1, if regular expression is present in the table, else 0.\n", " - **num_rows** : Number of rows in the table\n", " - **num_cols** : Number of columns in the table\n", " - **num_cells** : Number of cells in the table.\n", " - **comp_table** : True/False, to identify if a table is composition table or not.\n", " - **input_ids** : obtained after tokenization using m3rg-iitd/matsicbert model from huggingface for each node.\n", " - **attention_mask** : obtained after tokenization using m3rg-iitd/matsicbert model from huggingface for each node.\n", " - **caption_input_ids** : obtained after tokenization of table caption using m3rg-iitd/matsicbert model from huggingface.\n", " - **caption_attention_mask** : obtained after tokenization of table caption using m3rg-iitd/matsicbert model from huggingface.\n", " - **footer** : Table footer text, if not provided, None.\n", " - **gid_row_label** : Index of row having glass ids\n", " - **gid_col_label** : Index of columns having glass ids\n", " - **sum_less_100** : 0, if complete information table; 1, if partial information table." ] }, { "cell_type": "markdown", "id": "943f9d03-a476-49af-8380-93b41e155a38", "metadata": {}, "source": [ "

Showing example of information in one dictionary" ] }, { "cell_type": "code", "execution_count": 70, "id": "d5b3ed83-7005-4310-8842-ce8f5768f314", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 S0022309309000416 Nominal compositions of samples, in mol%. \n", "\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456
0SeriesSampleNa2OCaOB2O3Al2O3SiO2
1B7B7N202007865
2B7B7N151557865
3B7B7N1010107865
4B7B7N055157865
5B7B7N000207865
6
7B21B21N2020021851
8B21B21N1515521851
9B21B21N10101021851
10B21B21N0551521851
11B21B21N0002021851
\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6\n", "0 Series Sample Na2O CaO B2O3 Al2O3 SiO2\n", "1 B7 B7N20 20 0 7 8 65\n", "2 B7 B7N15 15 5 7 8 65\n", "3 B7 B7N10 10 10 7 8 65\n", "4 B7 B7N05 5 15 7 8 65\n", "5 B7 B7N00 0 20 7 8 65\n", "6 \n", "7 B21 B21N20 20 0 21 8 51\n", "8 B21 B21N15 15 5 21 8 51\n", "9 B21 B21N10 10 10 21 8 51\n", "10 B21 B21N05 5 15 21 8 51\n", "11 B21 B21N00 0 20 21 8 51" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idx = 30\n", "table = train_data[idx]\n", "\n", "print(table['t_idx'], table['pii'], table['caption'],'\\n')\n", "pd.DataFrame(table['act_table'])" ] }, { "cell_type": "code", "execution_count": 67, "id": "8f1a5439-6400-42d8-ab57-b9f631636028", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]\n" ] } ], "source": [ "print(table['row_label'])\n", "# 1 for rows where composition is present, else 0" ] }, { "cell_type": "code", "execution_count": 69, "id": "80ddc5c2-9380-4f4a-a695-ef7c7aabf482", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0, 0, 2, 2, 2, 2, 2]\n" ] } ], "source": [ "print(table['col_label'])\n", "# 2 for columns where chemical ompounds is present, else 0" ] }, { "cell_type": "code", "execution_count": 72, "id": "6b6b0ab7-e661-4e9e-9b58-db6bc39489b8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n" ] } ], "source": [ "print(table['regex_table'])" ] }, { "cell_type": "code", "execution_count": 73, "id": "a7238beb-c7db-470b-9faa-b7d0296a8756", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(12, 7, 84)" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table['num_rows'], table['num_cols'], table['num_cells']" ] }, { "cell_type": "code", "execution_count": 74, "id": "c2edb306-65de-4af8-b9cb-266eaf770384", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table['comp_table']" ] }, { "cell_type": "code", "execution_count": 76, "id": "d4fa38c8-47eb-4d6b-b1a6-9a5a71ca308d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{}" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table['footer'] # no footer is present in this table" ] }, { "cell_type": "code", "execution_count": 78, "id": "f60caba1-eeab-485c-a9a4-3c8ba4118c03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table['gid_row_label'] # material ids are not present in rows" ] }, { "cell_type": "code", "execution_count": 80, "id": "acc59f2f-fcae-4612-bbdd-96f789bb9d17", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[0, 1, 0, 0, 0, 0, 0]" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table['gid_col_label'] # material ids are present in second column" ] }, { "cell_type": "code", "execution_count": 83, "id": "b1c74c22-322b-4f54-acb9-3ee0b2e736a2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table['sum_less_100'] # since all the rows reporting material compostion add upto 100, this flag is 0 for this table" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }