Spaces:

shwetashweta05
/

Zero_to_Hero_Machine_Learning

shwetashweta05 committed 30 days ago

Commit

4a57e16

verified ·

1 Parent(s): 3d35f5f

Upload CSV_guide.ipynb

Files changed (1)

CSV_guide.ipynb +356 -0

CSV_guide.ipynb ADDED

	@@ -0,0 +1,356 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "288dc3d6-2f59-4af4-b9a0-ac11110c95a4",
+   "metadata": {},
+   "source": [
+    "# a. What is CSV?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a29ef9f-d2b1-44ae-aa00-b7307dc1f1fa",
+   "metadata": {},
+   "source": [
+    "- CSV (Comma-Separated Values) is a simple, widely used plain-text format for storing tabular data.\n",
+    "- Each row in a CSV file represents a record, and the fields within a record are separated by a delimiter: typically a comma, though semicolons, tabs, and other characters are also used."
+   ]
+  },
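+  {
+   "cell_type": "markdown",
+   "id": "sketch-record-structure-note",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of the record and field structure described above, using an in-memory string instead of a real file. The sample data and column names are made up for illustration. Note how the quoted field keeps its embedded comma."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "sketch-record-structure-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import csv\n",
+    "import io\n",
+    "\n",
+    "# Hypothetical sample data; the quoted field contains a comma.\n",
+    "sample = 'name,city\\n\"Doe, Jane\",Paris\\n'\n",
+    "\n",
+    "rows = list(csv.reader(io.StringIO(sample)))\n",
+    "print(rows)  # [['name', 'city'], ['Doe, Jane', 'Paris']]"
+   ]
+  },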
+  {
+   "cell_type": "markdown",
+   "id": "aed4bbe7-49a7-44f7-a222-1dbc76b94b74",
+   "metadata": {},
+   "source": [
+    "## Advantages"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0908c962-52a0-481d-9c4a-734d0954aeb5",
+   "metadata": {},
+   "source": [
+    "- Lightweight and easy to create.\n",
+    "- Supported by almost all data tools and programming languages."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a3c3937-cb91-411b-8606-16728aabbbc1",
+   "metadata": {},
+   "source": [
+    "## Common File Extensions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "41bf2a14-0cc1-458b-be33-62e9431a9b31",
+   "metadata": {},
+   "source": [
+    "- .csv\n",
+    "- .txt (sometimes used with a CSV structure)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00250776-617f-49d9-88bb-e6cba943f599",
+   "metadata": {},
+   "source": [
+    "# b. How to Read CSV Files"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "98989d08-8d4d-4a02-82b1-ba08757e71ff",
+   "metadata": {},
+   "source": [
+    "- In Python, CSV files can be read with the pandas library or with the built-in csv module."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6776fe4e-8155-47ff-99f4-ec26c916c45d",
+   "metadata": {},
+   "source": [
+    "## 1. Using pandas:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a508ae8d-3a3d-43f0-9453-11c87877b2b1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Read a CSV file\n",
+    "df = pd.read_csv(\"file.csv\")\n",
+    "print(df.head())\n",
+    "\n",
+    "# Reading a CSV file with a custom delimiter\n",
+    "df = pd.read_csv(\"file.csv\", sep=\";\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7c3f7a6a-0c13-45f2-930b-2c5796985efd",
+   "metadata": {},
+   "source": [
+    "## 2. Using Python's Built-in csv Module:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a33ffb8b-88b6-4061-b816-00397f2b3a3e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import csv\n",
+    "\n",
+    "# newline=\"\" lets the csv module handle line endings inside quoted fields\n",
+    "with open(\"file.csv\", \"r\", newline=\"\") as file:\n",
+    "    reader = csv.reader(file)\n",
+    "    for row in reader:\n",
+    "        print(row)"
+   ]
+  },
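+  {
+   "cell_type": "markdown",
+   "id": "sketch-dictreader-note",
+   "metadata": {},
+   "source": [
+    "When the file has a header row, `csv.DictReader` maps each row to a dictionary keyed by column name. A small sketch, assuming `file.csv` exists and its first row holds the column names:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "sketch-dictreader-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import csv\n",
+    "\n",
+    "# Assumes file.csv exists and its first row holds the column names.\n",
+    "with open(\"file.csv\", \"r\", newline=\"\") as file:\n",
+    "    reader = csv.DictReader(file)\n",
+    "    for row in reader:\n",
+    "        print(row)  # each row is a dict: {column_name: value}"
+   ]
+  },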
+  {
+   "cell_type": "markdown",
+   "id": "2a57c10b-51bf-4a4e-978a-51644964b856",
+   "metadata": {},
+   "source": [
+    "## 3. Reading Large CSV Files in Chunks:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5a056573-3d16-400a-8ccd-a15d0398b454",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Process large CSV files in smaller chunks\n",
+    "for chunk in pd.read_csv(\"large_file.csv\", chunksize=1000):\n",
+    "    print(chunk.head())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b52ebad6-0c64-4317-974a-3498f05feaea",
+   "metadata": {},
+   "source": [
+    "# c. Issues Encountered When Handling CSV Files"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8fb34287-7754-4170-8095-46c2a82db4ba",
+   "metadata": {},
+   "source": [
+    "1. Delimiter Issues:\n",
+    "    - Not all CSV files use commas as delimiters. Some may use semicolons, tabs, or other characters.\n",
+    "2. Encoding Problems:\n",
+    "    - Files that are not UTF-8 encoded may raise errors when read with the default encoding.\n",
+    "    - Example: `UnicodeDecodeError`.\n",
+    "3. Missing or Inconsistent Data:\n",
+    "    - Some fields may be empty, and rows may contain differing numbers of fields.\n",
+    "4. Header Issues:\n",
+    "    - Files may lack headers or have duplicate/misaligned headers.\n",
+    "5. Large File Sizes:\n",
+    "    - Loading a very large CSV file all at once can exhaust available memory."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "67c01a56-9b7c-46ba-8a79-9586a244978c",
+   "metadata": {},
+   "source": [
+    "# d. How to Overcome These Issues"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "45564d75-7870-45e1-8d53-e78ff71ff018",
+   "metadata": {},
+   "source": [
+    "1. Delimiter Issues:\n",
+    "   - Specify the correct delimiter while reading:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "36c282a6-cbdc-4a3e-933a-91080ea4dccc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = pd.read_csv(\"file.csv\", sep=\";\")"
+   ]
+  },
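+  {
+   "cell_type": "markdown",
+   "id": "sketch-sniff-delimiter-note",
+   "metadata": {},
+   "source": [
+    "When the delimiter is not known in advance, the standard library's `csv.Sniffer` can guess it from a sample of the text. A hedged sketch: the sample string is made up, and sniffing is a heuristic that can fail on unusual files, so treat the result as a guess to verify."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "sketch-sniff-delimiter-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import csv\n",
+    "import io\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical semicolon-separated sample standing in for the start of a file.\n",
+    "sample = \"id;name;score\\n1;Ann;9.5\\n2;Bob;7.0\\n\"\n",
+    "\n",
+    "# Restrict the candidates to common delimiters.\n",
+    "dialect = csv.Sniffer().sniff(sample, delimiters=\",;\\t|\")\n",
+    "df = pd.read_csv(io.StringIO(sample), sep=dialect.delimiter)\n",
+    "print(dialect.delimiter, list(df.columns))"
+   ]
+  },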
+  {
+   "cell_type": "markdown",
+   "id": "7b998672-5d7e-4a6a-8cc4-36b18446b9be",
+   "metadata": {},
+   "source": [
+    "2. Encoding Problems:\n",
+    "   - Explicitly set the encoding:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2657d869-a303-4e03-bc07-b15f012f76e6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = pd.read_csv(\"file.csv\", encoding=\"ISO-8859-1\")"
+   ]
+  },
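+  {
+   "cell_type": "markdown",
+   "id": "sketch-encoding-fallback-note",
+   "metadata": {},
+   "source": [
+    "If the encoding is unknown, one common approach is to try UTF-8 first and fall back to another encoding on `UnicodeDecodeError`. A sketch under assumptions: the bytes below are made-up Latin-1 data, and the fallback order is illustrative. ISO-8859-1 can decode any byte sequence, so it never fails, but it may produce wrong characters if the file was really written in some other encoding."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "sketch-encoding-fallback-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import io\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical file contents encoded as Latin-1 (not valid UTF-8).\n",
+    "raw = \"name,city\\nJosé,Málaga\\n\".encode(\"latin-1\")\n",
+    "\n",
+    "for enc in (\"utf-8\", \"ISO-8859-1\"):\n",
+    "    try:\n",
+    "        df = pd.read_csv(io.BytesIO(raw), encoding=enc)\n",
+    "        print(\"decoded with\", enc)\n",
+    "        break\n",
+    "    except UnicodeDecodeError:\n",
+    "        continue  # try the next candidate encoding"
+   ]
+  },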
+  {
+   "cell_type": "markdown",
+   "id": "113e7e43-7031-4904-9e87-c9df4acefaff",
+   "metadata": {},
+   "source": [
+    "3. Missing or Inconsistent Data:\n",
+    "   - Fill missing values:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "67ea80c7-7a86-4694-b6ff-a55ca27caad5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.fillna(\"Unknown\", inplace=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a6d7a40b-c495-4482-b006-767c14209bf2",
+   "metadata": {},
+   "source": [
+    "- Drop rows with missing data (pass `axis=1` to `dropna` to drop columns instead):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c8191abd-6281-466e-aeba-5f8df351de2d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.dropna(inplace=True)"
+   ]
+  },
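+  {
+   "cell_type": "markdown",
+   "id": "sketch-per-column-missing-note",
+   "metadata": {},
+   "source": [
+    "Filling every column with the same placeholder can turn numeric columns into mixed-type columns. A sketch of per-column handling, with made-up data and column names: `fillna` accepts a dictionary mapping column names to fill values, and `dropna` accepts `subset` to consider only certain columns."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "sketch-per-column-missing-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import io\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical data with one missing name and one missing age.\n",
+    "sample = \"name,age\\nAnn,30\\n,25\\nBob,\\n\"\n",
+    "df = pd.read_csv(io.StringIO(sample))\n",
+    "\n",
+    "# Text placeholder for names, median for the numeric column.\n",
+    "filled = df.fillna({\"name\": \"Unknown\", \"age\": df[\"age\"].median()})\n",
+    "\n",
+    "# Alternatively, drop only the rows where age is missing.\n",
+    "dropped = df.dropna(subset=[\"age\"])\n",
+    "print(filled)\n",
+    "print(dropped)"
+   ]
+  },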
+  {
+   "cell_type": "markdown",
+   "id": "6542d341-d38f-4c59-a5ca-d2503bd35e51",
+   "metadata": {},
+   "source": [
+    "4. Header Issues:\n",
+    "   - Manually assign headers when the file has no header row:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f2ee8b5-c54d-4349-b473-a8c3d6230c38",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df = pd.read_csv(\"file.csv\", header=None, names=[\"Col1\", \"Col2\", \"Col3\"])"
+   ]
+  },
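+  {
+   "cell_type": "markdown",
+   "id": "sketch-replace-header-note",
+   "metadata": {},
+   "source": [
+    "Note that `header=None` treats the first line as data, which suits files with no header row. If the file does have a header row that should be replaced, passing `header=0` together with `names` discards the original header. A sketch with made-up data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "sketch-replace-header-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import io\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical file whose header row has unhelpful names.\n",
+    "sample = \"a,b,c\\n1,2,3\\n4,5,6\\n\"\n",
+    "\n",
+    "# header=0 marks the first line as a header; names overrides it.\n",
+    "df = pd.read_csv(io.StringIO(sample), header=0, names=[\"Col1\", \"Col2\", \"Col3\"])\n",
+    "print(df)"
+   ]
+  },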
+  {
+   "cell_type": "markdown",
+   "id": "2461fa9d-02bb-4008-85d0-1cc47e412671",
+   "metadata": {},
+   "source": [
+    "5. Optimizing for Large Files:\n",
+    "   - Use chunk processing:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e684dc56-f980-4d37-affb-3d7fde7a99b0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "total_rows = 0\n",
+    "for chunk in pd.read_csv(\"file.csv\", chunksize=5000):\n",
+    "    # Replace this line with your own per-chunk processing.\n",
+    "    total_rows += len(chunk)"
+   ]
+  },
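+  {
+   "cell_type": "markdown",
+   "id": "sketch-chunk-aggregate-note",
+   "metadata": {},
+   "source": [
+    "Chunking only saves memory if each chunk is reduced to a small result before the next one is read. A sketch that sums a column across chunks; the data, the column name `amount`, and the tiny `chunksize` are made up for illustration:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "sketch-chunk-aggregate-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import io\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Hypothetical data standing in for a large file.\n",
+    "sample = \"amount\\n10\\n20\\n30\\n40\\n50\\n\"\n",
+    "\n",
+    "total = 0\n",
+    "for chunk in pd.read_csv(io.StringIO(sample), chunksize=2):\n",
+    "    total += chunk[\"amount\"].sum()  # keep only the running total\n",
+    "print(total)  # 150"
+   ]
+  },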
+  {
+   "cell_type": "markdown",
+   "id": "5a9677e2-e475-4829-9660-a2ec1674d221",
+   "metadata": {},
+   "source": [
+    "- For very large files, consider libraries built for parallel or out-of-core processing, such as dask or polars."
+   ]
+  },
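+  {
+   "cell_type": "markdown",
+   "id": "sketch-polars-lazy-note",
+   "metadata": {},
+   "source": [
+    "A sketch of the lazy approach with polars, assuming the package is installed and that `large_file.csv` has a numeric column named `amount` (both are assumptions for illustration). `scan_csv` builds a query plan without loading the file, and `collect` runs it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "sketch-polars-lazy-code",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import polars as pl\n",
+    "\n",
+    "# Assumes large_file.csv exists with a numeric \"amount\" column.\n",
+    "result = (\n",
+    "    pl.scan_csv(\"large_file.csv\")\n",
+    "    .filter(pl.col(\"amount\") > 100)\n",
+    "    .select(pl.col(\"amount\").sum())\n",
+    "    .collect()\n",
+    ")\n",
+    "print(result)"
+   ]
+  },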
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8a88eeae-cfdf-48bd-aa05-3b0c29ff25f0",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}