shwetashweta05 committed on
Commit
9bdd214
·
verified ·
1 Parent(s): eea6d1d

Upload XML_guide.ipynb

  1. XML_guide.ipynb +263 -0
XML_guide.ipynb ADDED
@@ -0,0 +1,263 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "61a28560-a233-418e-8266-442a4a0cb810",
6
+ "metadata": {},
7
+ "source": [
8
+ "# a. What is XML?"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "88af4e03-44db-41a9-a8d7-66d4c06301d7",
14
+ "metadata": {},
15
+ "source": [
16
+ "- XML (eXtensible Markup Language) is a markup language for storing and transporting data in a structured format.\n",
17
+ "- It is both human-readable and machine-readable, and it organizes data hierarchically using nested tags.\n",
18
+ "## Advantages:\n",
19
+ "- Flexible and self-descriptive.\n",
20
+ "- Widely used for data exchange between systems (for example, web APIs) and for configuration files.\n",
21
+ "## Common File Extensions:\n",
22
+ "- .xml"
23
+ ]
24
+ },
25
+ {
26
+ "cell_type": "markdown",
27
+ "id": "5d9bfd21-9483-4fc1-9a3d-9c1067f437b9",
28
+ "metadata": {},
29
+ "source": [
30
+ "Example of XML structure:\n\n",
31
+ "    <person>\n",
32
+ "        <name>Shweta Singh</name>\n",
33
+ "        <age>27</age>\n",
34
+ "        <city>Kolkata</city>\n",
35
+ "    </person>"
36
+ ]
37
+ },
38
+ {
39
+ "cell_type": "markdown",
40
+ "id": "91ef31d0-018d-4ece-b1aa-26b9fb11cec0",
41
+ "metadata": {},
42
+ "source": [
43
+ "# b. How to Read XML Files\n",
44
+ "- XML files can be parsed in Python with xml.etree.ElementTree (standard library), lxml, or pandas.read_xml (pandas 1.3 or later).\n"
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "markdown",
49
+ "id": "69b8009e-71cc-49ec-a604-8f5ef329b972",
50
+ "metadata": {},
51
+ "source": [
52
+ "1. Using xml.etree.ElementTree:"
53
+ ]
54
+ },
55
+ {
56
+ "cell_type": "code",
57
+ "execution_count": null,
58
+ "id": "8b12e0ae-6189-48f7-98e2-33dac2f4f9f7",
59
+ "metadata": {},
60
+ "outputs": [],
61
+ "source": [
62
+ "import xml.etree.ElementTree as ET\n",
63
+ "\n",
64
+ "# Parse an XML file\n",
65
+ "tree = ET.parse(\"file.xml\")\n",
66
+ "root = tree.getroot()\n",
67
+ "\n",
68
+ "# Access elements\n",
69
+ "for child in root:\n",
70
+ "    print(child.tag, child.text)"
71
+ ]
72
+ },
73
+ {
74
+ "cell_type": "markdown",
75
+ "id": "65b985fa-4873-46dc-9770-8d9736547959",
76
+ "metadata": {},
77
+ "source": [
78
+ "2. Using pandas for tabular data:"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "code",
83
+ "execution_count": null,
84
+ "id": "83ad5cda-43d1-44a7-9cf3-fee7a584f5cf",
85
+ "metadata": {},
86
+ "outputs": [],
87
+ "source": [
88
+ "import pandas as pd\n",
89
+ "\n",
90
+ "# Read XML into a DataFrame (pandas 1.3+; the default parser requires lxml to be installed)\n",
91
+ "df = pd.read_xml(\"file.xml\")\n",
92
+ "print(df.head())"
93
+ ]
94
+ },
95
+ {
96
+ "cell_type": "markdown",
97
+ "id": "e5035955-6155-4a8c-8e16-8aac5f967e50",
98
+ "metadata": {},
99
+ "source": [
100
+ "3. Using lxml for advanced parsing:"
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "execution_count": null,
106
+ "id": "52cb4fa5-5b98-4701-9c1f-5bb16ce56c42",
107
+ "metadata": {},
108
+ "outputs": [],
109
+ "source": [
110
+ "from lxml import etree\n",
111
+ "\n",
112
+ "# Parse XML file\n",
113
+ "tree = etree.parse(\"file.xml\")\n",
114
+ "root = tree.getroot()\n",
115
+ "\n",
116
+ "# Extract specific elements\n",
117
+ "for element in root.iter(\"name\"):\n",
118
+ "    print(element.text)"
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "markdown",
123
+ "id": "ae8fff0c-2018-4098-bfae-1dd2fe0f4db1",
124
+ "metadata": {},
125
+ "source": [
126
+ "# c. Issues Encountered When Handling XML Files\n",
127
+ "1. Complex Structures:\n",
128
+ "- XML files can have deeply nested and complex hierarchies.\n",
129
+ "2. Large File Sizes:\n",
130
+ "- Parsing large XML files can consume significant memory.\n",
131
+ "3. Data Inconsistency:\n",
132
+ "- Missing or unexpected tags can cause parsing errors.\n",
133
+ "4. Encoding Issues:\n",
134
+ "- XML files in a non-UTF-8 encoding (e.g., ISO-8859-1) may fail to parse, or yield garbled text, when the encoding is missing from or misstated in the XML declaration."
135
+ ]
136
+ },
137
+ {
138
+ "cell_type": "markdown",
139
+ "id": "8602e413-fbd8-4839-8eb6-440dbe6b2ae2",
140
+ "metadata": {},
141
+ "source": [
142
+ "# d. How to Overcome These Issues"
143
+ ]
144
+ },
145
+ {
146
+ "cell_type": "markdown",
147
+ "id": "3e74ed8c-476f-4ceb-826f-07361f98f10a",
148
+ "metadata": {},
149
+ "source": [
150
+ "1. Handle Complex Structures:\n",
151
+ "\n",
152
+ "- Use libraries like lxml for efficient navigation and processing of nested XML structures.\n",
153
+ " \n",
154
+ "2. Optimize Large File Processing:\n",
155
+ "\n",
156
+ "- Use event-driven parsing with xml.sax or lxml.etree.iterparse to process the file incrementally instead of loading it all into memory:"
157
+ ]
158
+ },
159
+ {
160
+ "cell_type": "code",
161
+ "execution_count": null,
162
+ "id": "7e0a577a-8fa4-4dd2-8426-48e1422674e3",
163
+ "metadata": {},
164
+ "outputs": [],
165
+ "source": [
166
+ "from lxml import etree\n",
167
+ "\n",
168
+ "# Stream the file: handle each element as its end tag is read\n",
169
+ "for event, element in etree.iterparse(\"large_file.xml\", events=(\"end\",)):\n",
170
+ "    print(element.tag, element.text)\n",
171
+ "    element.clear()  # free the element's content to keep memory use low"
172
+ ]
173
+ },
174
+ {
175
+ "cell_type": "markdown",
176
+ "id": "2e486525-88a1-4272-b205-9ecccd1775fe",
177
+ "metadata": {},
178
+ "source": [
179
+ "3. Handle Missing or Unexpected Tags:\n",
180
+ "\n",
181
+ "- Use default values or conditional checks to handle missing elements:\n",
182
+ "\n"
183
+ ]
184
+ },
185
+ {
186
+ "cell_type": "code",
187
+ "execution_count": null,
188
+ "id": "2b14f7b0-d18c-4bf5-9b28-883acde3989b",
189
+ "metadata": {},
190
+ "outputs": [],
191
+ "source": [
192
+ "for child in root:  # root comes from the ElementTree example above\n",
193
+ "    name = child.find(\"name\")\n",
194
+ "    print(name.text if name is not None else \"Unknown\")"
195
+ ]
196
+ },
197
+ {
198
+ "cell_type": "markdown",
199
+ "id": "3b53524c-7150-41d2-9bc9-c1e4dea2f1fa",
200
+ "metadata": {},
201
+ "source": [
202
+ "4. Resolve Encoding Issues:\n",
203
+ "\n",
204
+ "- If the encoding in the XML declaration is missing or wrong, override it explicitly when parsing:"
205
+ ]
206
+ },
207
+ {
208
+ "cell_type": "code",
209
+ "execution_count": null,
210
+ "id": "58eb5b60-9304-4929-a6aa-4c9655a9c492",
211
+ "metadata": {},
212
+ "outputs": [],
213
+ "source": [
214
+ "tree = ET.parse(\"file.xml\", parser=ET.XMLParser(encoding=\"ISO-8859-1\"))"
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "code",
219
+ "execution_count": null,
220
+ "id": "2732b34d-eadd-4acc-921d-1594b52843d9",
221
+ "metadata": {},
222
+ "outputs": [],
223
+ "source": []
224
+ },
225
+ {
226
+ "cell_type": "code",
227
+ "execution_count": null,
228
+ "id": "7e902fed-2e6f-4de8-879e-de88e665ae39",
229
+ "metadata": {},
230
+ "outputs": [],
231
+ "source": []
232
+ },
233
+ {
234
+ "cell_type": "code",
235
+ "execution_count": null,
236
+ "id": "e63144df-35e8-4a07-9dad-1c9466948487",
237
+ "metadata": {},
238
+ "outputs": [],
239
+ "source": []
240
+ }
241
+ ],
242
+ "metadata": {
243
+ "kernelspec": {
244
+ "display_name": "Python 3 (ipykernel)",
245
+ "language": "python",
246
+ "name": "python3"
247
+ },
248
+ "language_info": {
249
+ "codemirror_mode": {
250
+ "name": "ipython",
251
+ "version": 3
252
+ },
253
+ "file_extension": ".py",
254
+ "mimetype": "text/x-python",
255
+ "name": "python",
256
+ "nbconvert_exporter": "python",
257
+ "pygments_lexer": "ipython3",
258
+ "version": "3.11.7"
259
+ }
260
+ },
261
+ "nbformat": 4,
262
+ "nbformat_minor": 5
263
+ }
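
Note on the notebook above: its code cells read `file.xml` and `large_file.xml`, which are not part of this commit, so they cannot be run as committed. The following is a minimal self-contained sketch of the same ideas (full parsing, default values for missing tags, and incremental parsing) using only the standard library. The sample document, the second `<person>` record, and the helper names `read_people` and `stream_names` are assumptions made for illustration; they do not appear in the notebook.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the second <person> has no <name> element.
SAMPLE_XML = """<people>
    <person><name>Shweta Singh</name><age>27</age><city>Kolkata</city></person>
    <person><age>31</age><city>Pune</city></person>
</people>"""


def read_people(xml_text):
    """Parse the whole document and return one dict per <person>."""
    root = ET.fromstring(xml_text)
    return [
        {
            # findtext returns the default when the child element is absent
            "name": person.findtext("name", default="Unknown"),
            "age": int(person.findtext("age", default="0")),
            "city": person.findtext("city", default="Unknown"),
        }
        for person in root.iter("person")
    ]


def stream_names(source):
    """Yield names one <person> at a time, clearing each element after use."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "person":
            yield element.findtext("name", default="Unknown")
            element.clear()  # drop the element's children to limit memory use


people = read_people(SAMPLE_XML)
names = list(stream_names(io.BytesIO(SAMPLE_XML.encode("utf-8"))))
print(people)
print(names)
```

`read_people` suits small files that fit in memory, while `stream_names` follows the incremental pattern from section d and would be the likelier choice for large inputs; passing a file path instead of the `BytesIO` object should work the same way.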