{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {
    "id": "0"
   },
   "source": [
    "# Creation of Serialized File From AI Model Output\n",
    "---\n",
    "This notebook demonstrates how to use the AI-assisted data model output (originally just a collection of TSV files) to a serialized file, a [PFB (Portable Format for Bioinformatics)](https://pmc.ncbi.nlm.nih.gov/articles/PMC10035862/) file.\n",
    "\n",
    "PFB is widely used within NIH-funded initiatives that our center is a part of, as a means for efficient storage and transfer of data between systems."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {
    "id": "1"
   },
   "source": [
    "### Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "2",
    "outputId": "93bf3200-e3e2-4607-b7fc-23de90f967e1"
   },
   "outputs": [],
   "source": [
    "%pip install pandas gen3"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {
    "id": "3"
   },
   "source": [
    "We need some helper files to demonstrate this, so pull them in from Huggingface."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "4",
    "outputId": "ca90e09b-4d66-4019-ea91-4f9694b246ec"
   },
   "outputs": [],
   "source": [
    "!git clone https://huggingface.co/spaces/uc-ctds/llama-data-model-generator-demo\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5",
   "metadata": {},
   "outputs": [],
   "source": [
    "%cd llama-data-model-generator-demo/serialized_file_creation_demo/"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6",
   "metadata": {
    "id": "5"
   },
   "source": [
    "### Imports and Initial Loading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {
    "id": "6"
   },
   "outputs": [],
   "source": [
    "from utils import *\n",
    "import os\n",
    "from pathlib import Path\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {
    "id": "7"
   },
   "outputs": [],
   "source": [
    "# read in the minimal Gen3 data model scaffold\n",
    "scaffold_file = \"./gen3_dm_scaffold.json\"\n",
    "scaffold = read_schema(scaffold_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {
    "id": "8"
   },
   "source": [
    "We are demonstrating the ability to use this against an AI-generated model, but not directly inferencing to get the data model. Instead we're using a Sythnetic Data Contribution (a sample of what a data contributor would provide AND the expected simplified data model). We use these to train and test the AI model. For simplicity, we're using the model here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {
    "id": "9"
   },
   "outputs": [],
   "source": [
    "# Find the simplified data model in a Synthetic Data Contribution directory\n",
    "sdm_dir = \"./submitted_genotyping_array.mass_cytometry_image.actionable_mutation\"\n",
    "sdm_file = next(\n",
    "    (f for f in os.listdir(sdm_dir) if f.endswith(\"_jsonschema_dd.json\")), None\n",
    ")\n",
    "sdm_path = os.path.join(sdm_dir, sdm_file)\n",
    "print(sdm_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {
    "id": "10"
   },
   "outputs": [],
   "source": [
    "sdm = read_schema(schema=sdm_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12",
   "metadata": {
    "id": "11"
   },
   "source": [
    "### Creation of Serialized File"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13",
   "metadata": {
    "id": "12"
   },
   "source": [
    "As of writing, PFB requires a Gen3-style data model, so the next steps are to ensure we can go from the simplified AI model output to a Gen3 data model. Note that in the future we may allow alternative, non-Gen3 models to create such PFBs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {
    "id": "13"
   },
   "outputs": [],
   "source": [
    "## Create a Gen3 data model from the simplified data model\n",
    "\n",
    "gdm = sdm_to_gen3(sdm)  # convert simplified data model nodes into the Gen3-style nodes\n",
    "gdm = merge_scaffold_into_gdm(\n",
    "    gdm, scaffold\n",
    ")  # merge the scaffold into the Gen3-style data model\n",
    "gdm = fix_project(gdm)  # ensure project links to program and has req'd props\n",
    "gdm = add_gen3_required_properties(\n",
    "    gdm\n",
    ")  # add required Gen3 properties to the project node\n",
    "gdm = add_yaml_suffix_to_nodes(gdm)  # ensure all nodes have .yaml suffix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {
    "id": "14"
   },
   "outputs": [],
   "source": [
    "## Write the Gen3-style data model to a JSON file\n",
    "sdm_name = Path(\n",
    "    sdm_path\n",
    ").stem  # get the stem (basename without extension) of the sdm file\n",
    "out_file = os.path.join(sdm_dir, f\"Gen3_{sdm_name}_pfb.json\")\n",
    "write_schema(gdm, out_file)  # write the schema to a file"
   ]
  },
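  {
   "cell_type": "markdown",
   "id": "15a",
   "metadata": {},
   "source": [
    "To make the naming above concrete: `Path(...).stem` drops only the final extension, so the output file name keeps the rest of the input name. The cell below is a small sketch that uses a hypothetical input path (`./some_dir/example_jsonschema_dd.json`), chosen purely for illustration; it does not touch any real files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# Hypothetical path, used only to illustrate how the output name is derived\n",
    "example_path = \"./some_dir/example_jsonschema_dd.json\"\n",
    "example_stem = Path(example_path).stem  # file name without directory or final extension\n",
    "print(example_stem)  # example_jsonschema_dd\n",
    "print(f\"Gen3_{example_stem}_pfb.json\")  # Gen3_example_jsonschema_dd_pfb.json"
   ]
  },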
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {
    "id": "15"
   },
   "source": [
    "Now we have the data model in proper format, we can serialize it into a PFB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {
    "id": "16"
   },
   "outputs": [],
   "source": [
    "# Convert the Gen3-style data model to PFB format schema\n",
    "pfb_schema = os.path.join(sdm_dir, Path(out_file).stem + \".avro\")\n",
    "!pfb from -o $pfb_schema dict $out_file"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {
    "id": "17"
   },
   "source": [
    "### PFB Utilities"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19",
   "metadata": {
    "id": "18"
   },
   "source": [
    "Now we can demonstrate creation of a PFB when you have content for it (in this case in the form of TSV metadata). The above is a PFB which contains only the data model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {
    "id": "19"
   },
   "outputs": [],
   "source": [
    "# Get a list of TSV files in the sdm_dir\n",
    "tsv_files = [f for f in os.listdir(sdm_dir) if f.endswith(\".tsv\")]\n",
    "tsv_files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21",
   "metadata": {
    "id": "20"
   },
   "outputs": [],
   "source": [
    "# calculate tsv file size and md5sum for each tsv_files\n",
    "for tsv_file in tsv_files:\n",
    "    tsv_path = os.path.join(sdm_dir, tsv_file)\n",
    "    file_size = os.path.getsize(tsv_path)\n",
    "    # get the md5sum of the TSV file using md5 bash command\n",
    "    md5sum = get_md5sum(tsv_path)\n",
    "    tsv_metadata = {\n",
    "        \"submitter_id\": \"actionable_mutation_metadata.tsv\",\n",
    "        \"file_format\": \"TSV\",\n",
    "        \"file_name\": \"actionable_mutation_metadata.tsv\",\n",
    "        \"file_size\": file_size,\n",
    "        \"md5sum\": md5sum,\n",
    "    }\n",
    "    os.makedirs(\n",
    "        os.path.join(sdm_dir, \"tsv_metadata\"), exist_ok=True\n",
    "    )  # create the tsv_metadata directory if it doesn't exist\n",
    "    tsv_metadata_stem = Path(tsv_file).stem\n",
    "    if tsv_metadata_stem.endswith(\"_metadata\"):\n",
    "        tsv_metadata_stem = tsv_metadata_stem.replace(\"_metadata\", \".json\")\n",
    "    elif tsv_metadata_stem.endswith(\"_file_manifest\"):\n",
    "        tsv_metadata_stem = tsv_metadata_stem.replace(\"_file_manifest\", \".json\")\n",
    "    tsv_metadata_file = os.path.join(sdm_dir, \"tsv_metadata\", tsv_metadata_stem)\n",
    "    with open(tsv_metadata_file, \"w\") as f:\n",
    "        json.dump(tsv_metadata, f, indent=4)\n",
    "    print(f\"\\tTSV metadata written to {tsv_metadata_file}\")"
   ]
  },
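  {
   "cell_type": "markdown",
   "id": "21a",
   "metadata": {},
   "source": [
    "The `get_md5sum` helper used above comes from `utils`, and its implementation is not shown in this notebook. As a rough sketch of what such a helper might do, the cell below computes an MD5 hex digest with Python's `hashlib`, reading the file in chunks. The name `md5_hexdigest_sketch` and the chunk size are illustrative assumptions and are not part of `utils`; `tsv_files` and `sdm_dir` are the variables defined in earlier cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import hashlib\n",
    "import os\n",
    "\n",
    "\n",
    "def md5_hexdigest_sketch(path, chunk_size=1024 * 1024):\n",
    "    \"\"\"Hypothetical stand-in for get_md5sum: return the MD5 hex digest of a file.\"\"\"\n",
    "    digest = hashlib.md5()\n",
    "    with open(path, \"rb\") as fh:\n",
    "        # Read fixed-size chunks so large files are not loaded into memory at once\n",
    "        for chunk in iter(lambda: fh.read(chunk_size), b\"\"):\n",
    "            digest.update(chunk)\n",
    "    return digest.hexdigest()\n",
    "\n",
    "\n",
    "# Usage: print a digest for each TSV found earlier in the notebook\n",
    "for tsv_file in tsv_files:\n",
    "    print(tsv_file, md5_hexdigest_sketch(os.path.join(sdm_dir, tsv_file)))"
   ]
  },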
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22",
   "metadata": {
    "id": "21"
   },
   "outputs": [],
   "source": [
    "%ls -l $sdm_dir/tsv_metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23",
   "metadata": {
    "id": "22"
   },
   "outputs": [],
   "source": [
    "tsv_metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24",
   "metadata": {
    "id": "23"
   },
   "outputs": [],
   "source": [
    "pfb_data = os.path.join(sdm_dir, Path(out_file).stem + \"_data.avro\")\n",
    "!pfb from -o $pfb_data json -s $pfb_schema --program DEV --project test $sdm_dir/tsv_metadata\n",
    "if Path(pfb_data).exists():\n",
    "    print(f\"PFB containing TSV files written to:\\n{pfb_data}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25",
   "metadata": {
    "id": "24"
   },
   "source": [
    "PFB contains a utility to convert from the serialized format to more readable and workable files, including TSVs. Here we demonstrate that utility:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26",
   "metadata": {
    "id": "25"
   },
   "outputs": [],
   "source": [
    "!gen3 pfb to -i $pfb_data tsv # convert the PFB file to TSV format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27",
   "metadata": {
    "id": "26"
   },
   "outputs": [],
   "source": [
    "!gen3 pfb show -i $pfb_data # show the contents of the PFB file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28",
   "metadata": {
    "id": "27"
   },
   "outputs": [],
   "source": [
    "!gen3 pfb show -i $pfb_data schema | jq # show the schema of the PFB file"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29",
   "metadata": {
    "id": "28"
   },
   "source": [
    "Now we've gone all the way from a dump of data contribution files, to a simple structured data model, to a serialized PFB, and back to usable files!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30",
   "metadata": {
    "id": "29"
   },
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}