shwetashweta05 committed on
Commit
3d35f5f
·
verified ·
1 Parent(s): caeab36

Delete csv_guide.ipynb

Files changed (1)
  1. csv_guide.ipynb +0 -356
csv_guide.ipynb DELETED
@@ -1,356 +0,0 @@
1
- {
2
- "cells": [
3
- {
4
- "cell_type": "markdown",
5
- "id": "288dc3d6-2f59-4af4-b9a0-ac11110c95a4",
6
- "metadata": {},
7
- "source": [
8
- "# a. What is CSV?"
9
- ]
10
- },
11
- {
12
- "cell_type": "markdown",
13
- "id": "8a29ef9f-d2b1-44ae-aa00-b7307dc1f1fa",
14
- "metadata": {},
15
- "source": [
16
- "- CSV (Comma-Separated Values) is a simple and widely used file format for storing structured data.\n",
17
- "- Each row in a CSV file represents a record, and fields within a record are separated by a delimiter (typically a comma, but can also be semicolons, tabs, etc.)."
18
- ]
19
- },
20
- {
21
- "cell_type": "markdown",
22
- "id": "aed4bbe7-49a7-44f7-a222-1dbc76b94b74",
23
- "metadata": {},
24
- "source": [
25
- "## Advantages"
26
- ]
27
- },
28
- {
29
- "cell_type": "markdown",
30
- "id": "0908c962-52a0-481d-9c4a-734d0954aeb5",
31
- "metadata": {},
32
- "source": [
33
- "- Lightweight and easy to create.\n",
34
- "- Supported by almost all data tools and programming languages."
35
- ]
36
- },
37
- {
38
- "cell_type": "markdown",
39
- "id": "9a3c3937-cb91-411b-8606-16728aabbbc1",
40
- "metadata": {},
41
- "source": [
42
- "## Common File Extensions"
43
- ]
44
- },
45
- {
46
- "cell_type": "markdown",
47
- "id": "41bf2a14-0cc1-458b-be33-62e9431a9b31",
48
- "metadata": {},
49
- "source": [
50
- "- .csv\n",
51
- "- .txt (sometimes used with a CSV structure)."
52
- ]
53
- },
54
- {
55
- "cell_type": "markdown",
56
- "id": "00250776-617f-49d9-88bb-e6cba943f599",
57
- "metadata": {},
58
- "source": [
59
- "# b. How to Read CSV Files"
60
- ]
61
- },
62
- {
63
- "cell_type": "markdown",
64
- "id": "98989d08-8d4d-4a02-82b1-ba08757e71ff",
65
- "metadata": {},
66
- "source": [
67
- "- In Python, CSV files can be read with the pandas library or with the built-in csv module."
68
- ]
69
- },
70
- {
71
- "cell_type": "markdown",
72
- "id": "6776fe4e-8155-47ff-99f4-ec26c916c45d",
73
- "metadata": {},
74
- "source": [
75
- "## 1. Using pandas:"
76
- ]
77
- },
78
- {
79
- "cell_type": "code",
80
- "execution_count": null,
81
- "id": "a508ae8d-3a3d-43f0-9453-11c87877b2b1",
82
- "metadata": {},
83
- "outputs": [],
84
- "source": [
85
- "import pandas as pd\n",
86
- "\n",
87
- "# Read a CSV file\n",
88
- "df = pd.read_csv(\"file.csv\")\n",
89
- "print(df.head())\n",
90
- "\n",
91
- "# Reading a CSV file with a custom delimiter\n",
92
- "df = pd.read_csv(\"file.csv\", sep=\";\")"
93
- ]
94
- },
95
- {
96
- "cell_type": "markdown",
97
- "id": "7c3f7a6a-0c13-45f2-930b-2c5796985efd",
98
- "metadata": {},
99
- "source": [
100
- "## 2. Using Python's Built-in csv Module:"
101
- ]
102
- },
103
- {
104
- "cell_type": "code",
105
- "execution_count": null,
106
- "id": "a33ffb8b-88b6-4061-b816-00397f2b3a3e",
107
- "metadata": {},
108
- "outputs": [],
109
- "source": [
110
- "import csv\n",
111
- "\n",
112
- "with open(\"file.csv\", \"r\", newline=\"\") as file:\n",
113
- "    reader = csv.reader(file)\n",
114
- "    for row in reader:\n",
115
- "        print(row)"
116
- ]
117
- },
118
- {
119
- "cell_type": "markdown",
120
- "id": "2a57c10b-51bf-4a4e-978a-51644964b856",
121
- "metadata": {},
122
- "source": [
123
- "## 3. Reading Large CSV Files in Chunks:"
124
- ]
125
- },
126
- {
127
- "cell_type": "code",
128
- "execution_count": null,
129
- "id": "5a056573-3d16-400a-8ccd-a15d0398b454",
130
- "metadata": {},
131
- "outputs": [],
132
- "source": [
133
- "# Process large CSV files in smaller chunks\n",
134
- "for chunk in pd.read_csv(\"large_file.csv\", chunksize=1000):\n",
135
- "    print(chunk.head())"
136
- ]
137
- },
138
- {
139
- "cell_type": "markdown",
140
- "id": "b52ebad6-0c64-4317-974a-3498f05feaea",
141
- "metadata": {},
142
- "source": [
143
- "# c. Issues Encountered When Handling CSV Files"
144
- ]
145
- },
146
- {
147
- "cell_type": "markdown",
148
- "id": "8fb34287-7754-4170-8095-46c2a82db4ba",
149
- "metadata": {},
150
- "source": [
151
- "1. Delimiter Issues:\n",
152
- "   - Not all CSV files use commas as delimiters. Some use semicolons, tabs, or other characters.\n",
153
- "2. Encoding Problems:\n",
154
- "   - Non-UTF-8 encodings may cause errors while reading files.\n",
155
- "   - Example: a `UnicodeDecodeError` raised when the file is decoded with the wrong encoding.\n",
156
- "3. Missing or Inconsistent Data:\n",
157
- "   - Some fields may be empty, and the number of fields may vary from row to row.\n",
158
- "4. Header Issues:\n",
159
- "   - Files may lack headers or have duplicate/misaligned headers.\n",
160
- "5. Large File Sizes:\n",
161
- "   - Processing very large CSV files can lead to memory issues."
162
- ]
163
- },
164
- {
165
- "cell_type": "markdown",
166
- "id": "67c01a56-9b7c-46ba-8a79-9586a244978c",
167
- "metadata": {},
168
- "source": [
169
- "# d. How to Overcome These Issues"
170
- ]
171
- },
172
- {
173
- "cell_type": "markdown",
174
- "id": "45564d75-7870-45e1-8d53-e78ff71ff018",
175
- "metadata": {},
176
- "source": [
177
- "1. Delimiter Issues:\n",
178
- "   - Specify the correct delimiter while reading:"
179
- ]
180
- },
181
- {
182
- "cell_type": "code",
183
- "execution_count": null,
184
- "id": "36c282a6-cbdc-4a3e-933a-91080ea4dccc",
185
- "metadata": {},
186
- "outputs": [],
187
- "source": [
188
- "df = pd.read_csv(\"file.csv\", sep=\";\")"
189
- ]
190
- },
191
- {
192
- "cell_type": "markdown",
193
- "id": "7b998672-5d7e-4a6a-8cc4-36b18446b9be",
194
- "metadata": {},
195
- "source": [
196
- "2. Encoding Problems:\n",
197
- "   - Explicitly set the encoding:"
198
- ]
199
- },
200
- {
201
- "cell_type": "code",
202
- "execution_count": null,
203
- "id": "2657d869-a303-4e03-bc07-b15f012f76e6",
204
- "metadata": {},
205
- "outputs": [],
206
- "source": [
207
- "df = pd.read_csv(\"file.csv\", encoding=\"ISO-8859-1\")"
208
- ]
209
- },
210
- {
211
- "cell_type": "markdown",
212
- "id": "113e7e43-7031-4904-9e87-c9df4acefaff",
213
- "metadata": {},
214
- "source": [
215
- "3. Handling Missing Data:\n",
216
- "   - Fill missing values:"
217
- ]
218
- },
219
- {
220
- "cell_type": "code",
221
- "execution_count": null,
222
- "id": "67ea80c7-7a86-4694-b6ff-a55ca27caad5",
223
- "metadata": {},
224
- "outputs": [],
225
- "source": [
226
- "df.fillna(\"Unknown\", inplace=True)"
227
- ]
228
- },
229
- {
230
- "cell_type": "markdown",
231
- "id": "a6d7a40b-c495-4482-b006-767c14209bf2",
232
- "metadata": {},
233
- "source": [
234
- "- Drop rows/columns with missing data:"
235
- ]
236
- },
237
- {
238
- "cell_type": "code",
239
- "execution_count": null,
240
- "id": "c8191abd-6281-466e-aeba-5f8df351de2d",
241
- "metadata": {},
242
- "outputs": [],
243
- "source": [
244
- "df.dropna(inplace=True)"
245
- ]
246
- },
247
- {
248
- "cell_type": "markdown",
249
- "id": "6542d341-d38f-4c59-a5ca-d2503bd35e51",
250
- "metadata": {},
251
- "source": [
252
- "4. Header Issues:\n",
253
- "   - Manually assign headers:"
254
- ]
255
- },
256
- {
257
- "cell_type": "code",
258
- "execution_count": null,
259
- "id": "3f2ee8b5-c54d-4349-b473-a8c3d6230c38",
260
- "metadata": {},
261
- "outputs": [],
262
- "source": [
263
- "df = pd.read_csv(\"file.csv\", header=None, names=[\"Col1\", \"Col2\", \"Col3\"])"
264
- ]
265
- },
266
- {
267
- "cell_type": "markdown",
268
- "id": "2461fa9d-02bb-4008-85d0-1cc47e412671",
269
- "metadata": {},
270
- "source": [
271
- "5. Optimizing for Large Files:\n",
272
- "   - Use chunk processing:"
273
- ]
274
- },
275
- {
276
- "cell_type": "code",
277
- "execution_count": null,
278
- "id": "e684dc56-f980-4d37-affb-3d7fde7a99b0",
279
- "metadata": {},
280
- "outputs": [],
281
- "source": [
282
- "for chunk in pd.read_csv(\"file.csv\", chunksize=5000):\n",
283
- "    process(chunk)  # process() is a placeholder for your own per-chunk function"
284
- ]
285
- },
286
- {
287
- "cell_type": "code",
288
- "execution_count": null,
289
- "id": "3e3ebeb4-1758-499c-9c0a-a6389b6ed6cd",
290
- "metadata": {},
291
- "outputs": [],
292
- "source": []
293
- },
294
- {
295
- "cell_type": "code",
296
- "execution_count": null,
297
- "id": "c378769c-56a9-4675-b988-e6b57eeed54e",
298
- "metadata": {},
299
- "outputs": [],
300
- "source": []
301
- },
302
- {
303
- "cell_type": "code",
304
- "execution_count": null,
305
- "id": "fe9e2b34-a679-4b8e-923a-f296f775a6a2",
306
- "metadata": {},
307
- "outputs": [],
308
- "source": []
309
- },
310
- {
311
- "cell_type": "code",
312
- "execution_count": null,
313
- "id": "3ece1968-048b-4337-a79e-3c9a7161231d",
314
- "metadata": {},
315
- "outputs": [],
316
- "source": []
317
- },
318
- {
319
- "cell_type": "code",
320
- "execution_count": null,
321
- "id": "b940d8eb-c668-4553-9bb9-c1b8e39cf211",
322
- "metadata": {},
323
- "outputs": [],
324
- "source": []
325
- },
326
- {
327
- "cell_type": "code",
328
- "execution_count": null,
329
- "id": "8a88eeae-cfdf-48bd-aa05-3b0c29ff25f0",
330
- "metadata": {},
331
- "outputs": [],
332
- "source": []
333
- }
334
- ],
335
- "metadata": {
336
- "kernelspec": {
337
- "display_name": "Python 3 (ipykernel)",
338
- "language": "python",
339
- "name": "python3"
340
- },
341
- "language_info": {
342
- "codemirror_mode": {
343
- "name": "ipython",
344
- "version": 3
345
- },
346
- "file_extension": ".py",
347
- "mimetype": "text/x-python",
348
- "name": "python",
349
- "nbconvert_exporter": "python",
350
- "pygments_lexer": "ipython3",
351
- "version": "3.11.7"
352
- }
353
- },
354
- "nbformat": 4,
355
- "nbformat_minor": 5
356
- }
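
The deleted notebook handles delimiter and encoding problems one at a time, with the separator and encoding hard-coded. As a rough sketch of how the two fixes could be combined when neither is known in advance, the following uses only the standard library. The helper name `read_rows` and the candidate encoding list are assumptions made for illustration and do not come from the notebook. `csv.Sniffer` only guesses the delimiter and can raise `csv.Error` on ambiguous input.

```python
import csv
import io


def read_rows(raw: bytes, encodings=("utf-8", "ISO-8859-1")):
    """Decode raw CSV bytes, guess the delimiter, and return rows as lists.

    Hypothetical helper: the name and the default encodings are assumptions.
    """
    text = None
    for enc in encodings:
        try:
            text = raw.decode(enc)
            break
        except UnicodeDecodeError:
            # Try the next candidate encoding.
            continue
    if text is None:
        raise ValueError("none of the candidate encodings could decode the data")

    # Guess the dialect from a sample, restricted to a few likely delimiters.
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=",;\t")
    return list(csv.reader(io.StringIO(text, newline=""), dialect))


if __name__ == "__main__":
    # Semicolon-delimited data stored as ISO-8859-1, so UTF-8 decoding fails first.
    sample = "name;city\nJosé;Lyon\nAna;Porto\n".encode("ISO-8859-1")
    for row in read_rows(sample):
        print(row)
```

ISO-8859-1 accepts any byte sequence, so placing it last makes it a catch-all fallback. It will not raise an error even when the real encoding is something else, so the decoded text should still be checked by eye.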