shwetashweta05
/

Zero_to_Hero_Machine_Learning

shwetashweta05 committed on Dec 15, 2024

Commit

9bdd214

verified ·

1 Parent(s): eea6d1d

Upload XML_guide.ipynb

XML_guide.ipynb +263 -0

XML_guide.ipynb ADDED

	@@ -0,0 +1,263 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "61a28560-a233-418e-8266-442a4a0cb810",
+   "metadata": {},
+   "source": [
+    "# a. What is XML?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "88af4e03-44db-41a9-a8d7-66d4c06301d7",
+   "metadata": {},
+   "source": [
+    "- XML (eXtensible Markup Language) is a markup language used to store and transport data in a structured format.\n",
+    "- It is both human-readable and machine-readable, and it organizes data hierarchically with nested tags.\n",
+    "\n",
+    "## Advantages:\n",
+    "- Flexible and self-descriptive.\n",
+    "- Widely used for data exchange between systems, such as web APIs and configuration files.\n",
+    "\n",
+    "## Common File Extensions:\n",
+    "- .xml"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5d9bfd21-9483-4fc1-9a3d-9c1067f437b9",
+   "metadata": {},
+   "source": [
+    "Example of XML structure:\n",
+    "\n",
+    "    <person>\n",
+    "        <name>Shweta Singh</name>\n",
+    "        <age>27</age>\n",
+    "        <city>Kolkata</city>\n",
+    "    </person>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "91ef31d0-018d-4ece-b1aa-26b9fb11cec0",
+   "metadata": {},
+   "source": [
+    "# b. How to Read XML Files\n",
+    "- XML files can be parsed and processed with Python libraries such as xml.etree.ElementTree, lxml, or pandas.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69b8009e-71cc-49ec-a604-8f5ef329b972",
+   "metadata": {},
+   "source": [
+    "1. Using xml.etree.ElementTree:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8b12e0ae-6189-48f7-98e2-33dac2f4f9f7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import xml.etree.ElementTree as ET\n",
+    "\n",
+    "# Parse an XML file\n",
+    "tree = ET.parse(\"file.xml\")\n",
+    "root = tree.getroot()\n",
+    "\n",
+    "# Access elements\n",
+    "for child in root:\n",
+    "    print(child.tag, child.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "65b985fa-4873-46dc-9770-8d9736547959",
+   "metadata": {},
+   "source": [
+    "2. Using pandas for tabular data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "83ad5cda-43d1-44a7-9cf3-fee7a584f5cf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Read XML into a DataFrame (pd.read_xml needs pandas 1.3+ and, with the default parser, lxml)\n",
+    "df = pd.read_xml(\"file.xml\")\n",
+    "print(df.head())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e5035955-6155-4a8c-8e16-8aac5f967e50",
+   "metadata": {},
+   "source": [
+    "3. Using lxml for advanced parsing:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "52cb4fa5-5b98-4701-9c1f-5bb16ce56c42",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from lxml import etree\n",
+    "\n",
+    "# Parse XML file\n",
+    "tree = etree.parse(\"file.xml\")\n",
+    "root = tree.getroot()\n",
+    "\n",
+    "# Extract specific elements\n",
+    "for element in root.iter(\"name\"):\n",
+    "    print(element.text)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ae8fff0c-2018-4098-bfae-1dd2fe0f4db1",
+   "metadata": {},
+   "source": [
+    "# c. Issues Encountered When Handling XML Files\n",
+    "1. Complex Structures:\n",
+    "- XML files can have deeply nested and complex hierarchies.\n",
+    "2. Large File Sizes:\n",
+    "- Parsing a large XML file into a full in-memory tree can consume significant memory.\n",
+    "3. Data Inconsistency:\n",
+    "- Missing or unexpected tags can break code that assumes every element is present.\n",
+    "4. Encoding Issues:\n",
+    "- XML files in a non-UTF-8 encoding (e.g., ISO-8859-1) may fail to parse when the encoding is not declared or is declared incorrectly."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8602e413-fbd8-4839-8eb6-440dbe6b2ae2",
+   "metadata": {},
+   "source": [
+    "# d. How to Overcome These Issues"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e74ed8c-476f-4ceb-826f-07361f98f10a",
+   "metadata": {},
+   "source": [
+    "1. Handle Complex Structures:\n",
+    "\n",
+    "- Use libraries like lxml for efficient navigation and processing of nested XML structures.\n",
+    "  \n",
+    "2. Optimize Large File Processing:\n",
+    "\n",
+    "- Use event-driven parsing with xml.sax or lxml.etree.iterparse to process files incrementally instead of loading the whole tree:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7e0a577a-8fa4-4dd2-8426-48e1422674e3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from lxml import etree\n",
+    "\n",
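+    "# Process the file incrementally: each element is handled when its closing tag is read\n",
+    "for event, element in etree.iterparse(\"large_file.xml\", events=(\"end\",)):\n",
+    "    print(element.tag, element.text)\n",
+    "    element.clear()  # drop the element's text and children to keep memory use low"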
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2e486525-88a1-4272-b205-9ecccd1775fe",
+   "metadata": {},
+   "source": [
+    "3. Handle Missing or Unexpected Tags:\n",
+    "\n",
+    "- Use default values or conditional checks to handle missing elements:\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2b14f7b0-d18c-4bf5-9b28-883acde3989b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Assumes root was created by one of the parsing cells above\n",
+    "for child in root:\n",
+    "    name = child.find(\"name\")\n",
+    "    print(name.text if name is not None else \"Unknown\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3b53524c-7150-41d2-9bc9-c1e4dea2f1fa",
+   "metadata": {},
+   "source": [
+    "4. Resolve Encoding Issues:\n",
+    "\n",
+    "- Explicitly specify the encoding when parsing:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "58eb5b60-9304-4929-a6aa-4c9655a9c492",
+   "metadata": {},
+   "outputs": [],
+   "source": [
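+    "# Force ISO-8859-1 decoding, overriding any encoding declared in the file (ET is imported in an earlier cell)\n",
+    "tree = ET.parse(\"file.xml\", parser=ET.XMLParser(encoding=\"ISO-8859-1\"))"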
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
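
Section d.1 of the notebook recommends a library for navigating nested structures but shows no code, and section d.3 checks for missing tags by hand. The two ideas can be combined in a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, so it runs without extra installs. The sample document, the `address/city` nesting, the second person's name, and the `extract_people` helper are all hypothetical and not taken from the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not from the notebook): two nested records,
# the second of which has no <city> tag.
SAMPLE = """
<people>
    <person id="1">
        <name>Shweta Singh</name>
        <address><city>Kolkata</city></address>
    </person>
    <person id="2">
        <name>Example Person</name>
        <address></address>
    </person>
</people>
""".strip()


def extract_people(xml_text):
    """Return one dict per <person>, using defaults for missing tags."""
    root = ET.fromstring(xml_text)
    rows = []
    # "person" matches direct children of the root; ".//person" would match at any depth.
    for person in root.findall("person"):
        rows.append({
            "id": person.get("id"),
            # findtext returns the default when the path matches no element.
            "name": person.findtext("name", default="Unknown"),
            "city": person.findtext("address/city", default="Unknown"),
        })
    return rows


people = extract_people(SAMPLE)
for row in people:
    print(row)
```

Path expressions such as `address/city` reach nested elements without writing a loop per level, and `findtext` with a default replaces the explicit `is not None` check. lxml accepts the same calls and additionally supports full XPath through its `xpath()` method.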