shwetashweta05 committed on
Commit
4a57e16
·
verified ·
1 Parent(s): 3d35f5f

Upload CSV_guide.ipynb

Files changed (1)
  1. CSV_guide.ipynb +356 -0
CSV_guide.ipynb ADDED
@@ -0,0 +1,356 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "288dc3d6-2f59-4af4-b9a0-ac11110c95a4",
6
+ "metadata": {},
7
+ "source": [
8
+ "# a. What is CSV?"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "8a29ef9f-d2b1-44ae-aa00-b7307dc1f1fa",
14
+ "metadata": {},
15
+ "source": [
16
+ "- CSV (Comma-Separated Values) is a simple and widely used file format for storing structured data.\n",
17
+ "- Each row in a CSV file represents a record, and fields within a record are separated by a delimiter (typically a comma, though semicolons, tabs, and other characters are also used)."
18
+ ]
19
+ },
20
+ {
21
+ "cell_type": "markdown",
22
+ "id": "aed4bbe7-49a7-44f7-a222-1dbc76b94b74",
23
+ "metadata": {},
24
+ "source": [
25
+ "## Advantages"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "id": "0908c962-52a0-481d-9c4a-734d0954aeb5",
31
+ "metadata": {},
32
+ "source": [
33
+ "- Lightweight and easy to create.\n",
34
+ "- Supported by almost all data tools and programming languages."
35
+ ]
36
+ },
37
+ {
38
+ "cell_type": "markdown",
39
+ "id": "9a3c3937-cb91-411b-8606-16728aabbbc1",
40
+ "metadata": {},
41
+ "source": [
42
+ "## Common File Extensions"
43
+ ]
44
+ },
45
+ {
46
+ "cell_type": "markdown",
47
+ "id": "41bf2a14-0cc1-458b-be33-62e9431a9b31",
48
+ "metadata": {},
49
+ "source": [
50
+ "- .csv\n",
51
+ "- .txt (sometimes used with a CSV structure)."
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "markdown",
56
+ "id": "00250776-617f-49d9-88bb-e6cba943f599",
57
+ "metadata": {},
58
+ "source": [
59
+ "# b. How to Read CSV Files"
60
+ ]
61
+ },
62
+ {
63
+ "cell_type": "markdown",
64
+ "id": "98989d08-8d4d-4a02-82b1-ba08757e71ff",
65
+ "metadata": {},
66
+ "source": [
67
+ "- Using Python, CSV files can be handled with libraries such as pandas or Python's built-in csv module."
68
+ ]
69
+ },
70
+ {
71
+ "cell_type": "markdown",
72
+ "id": "6776fe4e-8155-47ff-99f4-ec26c916c45d",
73
+ "metadata": {},
74
+ "source": [
75
+ "## 1. Using pandas:"
76
+ ]
77
+ },
78
+ {
79
+ "cell_type": "code",
80
+ "execution_count": null,
81
+ "id": "a508ae8d-3a3d-43f0-9453-11c87877b2b1",
82
+ "metadata": {},
83
+ "outputs": [],
84
+ "source": [
85
+ "import pandas as pd\n",
86
+ "\n",
87
+ "# Read a CSV file\n",
88
+ "df = pd.read_csv(\"file.csv\")\n",
89
+ "print(df.head())\n",
90
+ "\n",
91
+ "# Reading a CSV file with a custom delimiter\n",
92
+ "df = pd.read_csv(\"file.csv\", sep=\";\")"
93
+ ]
94
+ },
95
+ {
96
+ "cell_type": "markdown",
97
+ "id": "7c3f7a6a-0c13-45f2-930b-2c5796985efd",
98
+ "metadata": {},
99
+ "source": [
100
+ "## 2. Using Python's Built-in csv Module:"
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "code",
105
+ "execution_count": null,
106
+ "id": "a33ffb8b-88b6-4061-b816-00397f2b3a3e",
107
+ "metadata": {},
108
+ "outputs": [],
109
+ "source": [
110
+ "import csv\n",
111
+ "\n",
112
+ "with open(\"file.csv\", \"r\", newline=\"\") as file:\n",
113
+ "    reader = csv.reader(file)\n",
114
+ "    for row in reader:\n",
115
+ "        print(row)"
116
+ ]
117
+ },
118
+ {
119
+ "cell_type": "markdown",
120
+ "id": "2a57c10b-51bf-4a4e-978a-51644964b856",
121
+ "metadata": {},
122
+ "source": [
123
+ "## 3. Reading Large CSV Files in Chunks:"
124
+ ]
125
+ },
126
+ {
127
+ "cell_type": "code",
128
+ "execution_count": null,
129
+ "id": "5a056573-3d16-400a-8ccd-a15d0398b454",
130
+ "metadata": {},
131
+ "outputs": [],
132
+ "source": [
133
+ "# Process large CSV files in smaller chunks\n",
134
+ "for chunk in pd.read_csv(\"large_file.csv\", chunksize=1000):\n",
135
+ "    print(chunk.head())"
136
+ ]
137
+ },
138
+ {
139
+ "cell_type": "markdown",
140
+ "id": "b52ebad6-0c64-4317-974a-3498f05feaea",
141
+ "metadata": {},
142
+ "source": [
143
+ "# c. Issues Encountered When Handling CSV Files"
144
+ ]
145
+ },
146
+ {
147
+ "cell_type": "markdown",
148
+ "id": "8fb34287-7754-4170-8095-46c2a82db4ba",
149
+ "metadata": {},
150
+ "source": [
151
+ "1. Delimiter Issues:\n",
152
+ "   - Not all CSV files use commas as delimiters. Some may use semicolons, tabs, or other characters.\n",
153
+ "2. Encoding Problems:\n",
154
+ "   - Non-UTF-8 encodings may cause errors while reading files.\n",
155
+ "   - Example: a \"UnicodeDecodeError\" raised when a file in another encoding is read as UTF-8.\n",
156
+ "3. Missing or Inconsistent Data:\n",
157
+ "   - Some fields may be empty, and the number of fields may vary from row to row.\n",
158
+ "4. Header Issues:\n",
159
+ "   - Files may lack headers or have duplicate/misaligned headers.\n",
160
+ "5. Large File Sizes:\n",
161
+ "   - Processing very large CSV files can lead to memory issues."
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "markdown",
166
+ "id": "67c01a56-9b7c-46ba-8a79-9586a244978c",
167
+ "metadata": {},
168
+ "source": [
169
+ "# d. How to Overcome These Issues"
170
+ ]
171
+ },
172
+ {
173
+ "cell_type": "markdown",
174
+ "id": "45564d75-7870-45e1-8d53-e78ff71ff018",
175
+ "metadata": {},
176
+ "source": [
177
+ "1. Delimiter Issues:\n",
178
+ "   - Specify the correct delimiter while reading:"
179
+ ]
180
+ },
181
+ {
182
+ "cell_type": "code",
183
+ "execution_count": null,
184
+ "id": "36c282a6-cbdc-4a3e-933a-91080ea4dccc",
185
+ "metadata": {},
186
+ "outputs": [],
187
+ "source": [
188
+ "df = pd.read_csv(\"file.csv\", sep=\";\")"
189
+ ]
190
+ },
191
+ {
192
+ "cell_type": "markdown",
193
+ "id": "7b998672-5d7e-4a6a-8cc4-36b18446b9be",
194
+ "metadata": {},
195
+ "source": [
196
+ "2. Encoding Problems:\n",
197
+ "   - Explicitly set the encoding:"
198
+ ]
199
+ },
200
+ {
201
+ "cell_type": "code",
202
+ "execution_count": null,
203
+ "id": "2657d869-a303-4e03-bc07-b15f012f76e6",
204
+ "metadata": {},
205
+ "outputs": [],
206
+ "source": [
207
+ "df = pd.read_csv(\"file.csv\", encoding=\"ISO-8859-1\")"
208
+ ]
209
+ },
210
+ {
211
+ "cell_type": "markdown",
212
+ "id": "113e7e43-7031-4904-9e87-c9df4acefaff",
213
+ "metadata": {},
214
+ "source": [
215
+ "3. Handling Missing Data:\n",
216
+ "   - Fill missing values:"
217
+ ]
218
+ },
219
+ {
220
+ "cell_type": "code",
221
+ "execution_count": null,
222
+ "id": "67ea80c7-7a86-4694-b6ff-a55ca27caad5",
223
+ "metadata": {},
224
+ "outputs": [],
225
+ "source": [
226
+ "df.fillna(\"Unknown\", inplace=True)"
227
+ ]
228
+ },
229
+ {
230
+ "cell_type": "markdown",
231
+ "id": "a6d7a40b-c495-4482-b006-767c14209bf2",
232
+ "metadata": {},
233
+ "source": [
234
+ "- Drop rows with missing data (pass axis=1 to drop columns instead):"
235
+ ]
236
+ },
237
+ {
238
+ "cell_type": "code",
239
+ "execution_count": null,
240
+ "id": "c8191abd-6281-466e-aeba-5f8df351de2d",
241
+ "metadata": {},
242
+ "outputs": [],
243
+ "source": [
244
+ "df.dropna(inplace=True)"
245
+ ]
246
+ },
247
+ {
248
+ "cell_type": "markdown",
249
+ "id": "6542d341-d38f-4c59-a5ca-d2503bd35e51",
250
+ "metadata": {},
251
+ "source": [
252
+ "4. Header Issues:\n",
253
+ "   - Manually assign headers:"
254
+ ]
255
+ },
256
+ {
257
+ "cell_type": "code",
258
+ "execution_count": null,
259
+ "id": "3f2ee8b5-c54d-4349-b473-a8c3d6230c38",
260
+ "metadata": {},
261
+ "outputs": [],
262
+ "source": [
263
+ "df = pd.read_csv(\"file.csv\", header=None, names=[\"Col1\", \"Col2\", \"Col3\"])"
264
+ ]
265
+ },
266
+ {
267
+ "cell_type": "markdown",
268
+ "id": "2461fa9d-02bb-4008-85d0-1cc47e412671",
269
+ "metadata": {},
270
+ "source": [
271
+ "5. Optimizing for Large Files:\n",
272
+ "   - Use chunk processing:"
273
+ ]
274
+ },
275
+ {
276
+ "cell_type": "code",
277
+ "execution_count": null,
278
+ "id": "e684dc56-f980-4d37-affb-3d7fde7a99b0",
279
+ "metadata": {},
280
+ "outputs": [],
281
+ "source": [
282
+ "for chunk in pd.read_csv(\"file.csv\", chunksize=5000):\n",
283
+ "    process(chunk)  # process() is a placeholder for your own per-chunk logic"
284
+ ]
285
+ },
286
+ {
287
+ "cell_type": "markdown",
288
+ "id": "5a9677e2-e475-4829-9660-a2ec1674d221",
289
+ "metadata": {},
290
+ "source": [
291
+ "- Use libraries such as dask or polars, which support lazy or parallel processing, for very large files."
292
+ ]
293
+ },
294
+ {
295
+ "cell_type": "code",
296
+ "execution_count": null,
297
+ "id": "c378769c-56a9-4675-b988-e6b57eeed54e",
298
+ "metadata": {},
299
+ "outputs": [],
300
+ "source": []
301
+ },
302
+ {
303
+ "cell_type": "code",
304
+ "execution_count": null,
305
+ "id": "fe9e2b34-a679-4b8e-923a-f296f775a6a2",
306
+ "metadata": {},
307
+ "outputs": [],
308
+ "source": []
309
+ },
310
+ {
311
+ "cell_type": "code",
312
+ "execution_count": null,
313
+ "id": "3ece1968-048b-4337-a79e-3c9a7161231d",
314
+ "metadata": {},
315
+ "outputs": [],
316
+ "source": []
317
+ },
318
+ {
319
+ "cell_type": "code",
320
+ "execution_count": null,
321
+ "id": "b940d8eb-c668-4553-9bb9-c1b8e39cf211",
322
+ "metadata": {},
323
+ "outputs": [],
324
+ "source": []
325
+ },
326
+ {
327
+ "cell_type": "code",
328
+ "execution_count": null,
329
+ "id": "8a88eeae-cfdf-48bd-aa05-3b0c29ff25f0",
330
+ "metadata": {},
331
+ "outputs": [],
332
+ "source": []
333
+ }
334
+ ],
335
+ "metadata": {
336
+ "kernelspec": {
337
+ "display_name": "Python 3 (ipykernel)",
338
+ "language": "python",
339
+ "name": "python3"
340
+ },
341
+ "language_info": {
342
+ "codemirror_mode": {
343
+ "name": "ipython",
344
+ "version": 3
345
+ },
346
+ "file_extension": ".py",
347
+ "mimetype": "text/x-python",
348
+ "name": "python",
349
+ "nbconvert_exporter": "python",
350
+ "pygments_lexer": "ipython3",
351
+ "version": "3.11.7"
352
+ }
353
+ },
354
+ "nbformat": 4,
355
+ "nbformat_minor": 5
356
+ }
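
The delimiter and encoding workarounds from section d of the notebook can also be sketched with only the standard library, which may help when pandas is not available. This is a minimal sketch under stated assumptions, not part of the committed notebook: the helper name `read_rows`, the candidate encoding list, and the sample data are all hypothetical. It tries UTF-8 first, falls back to ISO-8859-1 (the encoding the notebook uses), and lets `csv.Sniffer` guess the delimiter instead of hard-coding `sep`.

```python
import csv
import io


def read_rows(raw: bytes, encodings=("utf-8", "ISO-8859-1")):
    """Decode raw CSV bytes, guess the delimiter, and return rows as lists.

    Hypothetical helper for illustration; the encoding order is an assumption.
    """
    # Try each encoding in order. ISO-8859-1 accepts any byte sequence, so it
    # acts as a last-resort fallback (it may mis-render some characters).
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        raise ValueError("none of the candidate encodings could decode the data")

    # Let csv.Sniffer guess the delimiter from a sample of the text,
    # restricted to the delimiters mentioned in the notebook.
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=",;\t")
    return list(csv.reader(io.StringIO(text, newline=""), dialect))


# Made-up sample: semicolon-delimited and not valid UTF-8.
sample = "name;city\nZoë;Köln\nAna;Lima\n".encode("ISO-8859-1")
rows = read_rows(sample)
print(rows)
```

Guessing is convenient for exploratory work, but when the delimiter and encoding of a file are known, passing them explicitly (as the notebook does with `sep` and `encoding`) is the more predictable choice.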