{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "title: Which cheese are we eating?\n",
    "description: 'Did you ever wonder what kind of cheese you should buy? They all look the same. And then the embarrasement when you can just point and say: that one. Meet the cheese classifier.'\n",
    "date: '2025-03-13'\n",
    "categories:\n",
    "- machine learning\n",
    "- python\n",
    "- computer vision\n",
    "execute:\n",
    "  message: false\n",
    "  warning: false\n",
    "editor_options:\n",
    "  chunk_output_type: console\n",
    "output-file: meet-the-cheese\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's start with the why\n",
    "I love cheese. Sometimes it is quite difficult to distinguish the varieties. Think about the embarrasement when you are in front of a mountain of cheese and can only point with your finger.\n",
    "\n",
    "Therefore, I decided to built a ML classifier to help me.\n",
    "\n",
    "The special difficulty here is that cheeses all look quite similar. Take, for example, the swiss Gruyere and the French Comte.\n",
    "\n",
    "They are twins.\n",
    "\n",
    "## Let’s continue with with the data.\n",
    "\n",
    "First, we need some data. Fast.ai provides an easy download module to download images from DuckDuckGo.\n",
    "\n",
    "As an alternative, we could use a dataset, if we have one. Let’s start by downloading the files and then create a dataset.\n",
    "\n",
    "### Getting data from DuckDuckGo\n",
    "\n",
    "Let’s start by defining what we want to download. We want cheese. In particular, French cheese. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cheeses = [\n",
    "    \"Camembert\",\n",
    "    \"Roquefort\",\n",
    "    \"Comté\",\n",
    "    \"Époisses de Bourgogne\",\n",
    "    \"Tomme de Savoie\",\n",
    "    \"Bleu d’Auvergne\",\n",
    "    \"Brie de Meaux\",\n",
    "    \"Mimolette\",\n",
    "    \"Munster\",\n",
    "    \"Livarot\",\n",
    "    \"Pont-l’Évêque\",\n",
    "    \"Reblochon\",\n",
    "    \"Chabichou du Poitou\",\n",
    "    \"Valençay\",\n",
    "    \"Pélardon\",\n",
    "    \"Fourme d’Ambert\",\n",
    "    \"Selles-sur-Cher\",\n",
    "    \"Cantal\",\n",
    "    \"Neufchâtel\",\n",
    "    \"Banon\",\n",
    "    \"Gruyere\"\n",
    "]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To have a larger variety of images we define some extra search terms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "search_terms = [\n",
    "    \"cheese close-up texture\",\n",
    "    \"cheese macro shot\",\n",
    "    \"cheese cut section\"\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we work with Fast.ai , let's import the basic stuff."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from duckduckgo_search import DDGS\n",
    "from fastcore.all import *\n",
    "from fastai.vision.all import *\n",
    "def search_images(keywords, max_images=20): return L(DDGS().images(keywords, max_results=max_images)).itemgot('image')\n",
    "import time, json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then define our download function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fastdownload import download_url\n",
    "from pathlib import Path\n",
    "import time\n",
    "\n",
    "data_acquisition=False\n",
    "\n",
    "def download():\n",
    "    # Loop through all combinations of cheeses and search terms\n",
    "    for cheese in cheeses:\n",
    "        dest = Path(\"which_cheese\") / cheese  # Create subdirectory for each cheese\n",
    "        dest.mkdir(exist_ok=True, parents=True)\n",
    "\n",
    "        for term in search_terms:\n",
    "            query = f\"{cheese} {term}\"\n",
    "            download_images(dest, urls=search_images(f\"{query} photo\"))\n",
    "            time.sleep(5)\n",
    "\n",
    "        # Resize images after downloading\n",
    "        resize_images(dest, max_size=400, dest=dest)\n",
    "\n",
    "# Run download only if data acquisition is enabled\n",
    "if data_acquisition:\n",
    "    download()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can verify the images now or later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if data_acquisition:\n",
    "    failed = verify_images(get_image_files(path))\n",
    "    failed.map(Path.unlink)\n",
    "    len(failed)\n",
    "    failed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading data from a Kaggle dataset\n",
    "I created a dataset of these images to avoid having to download again when I start over.\n",
    "\n",
    "Sadly to uncertain copyright issues of this data, my dataset needs to remain private. But you can easily create your own.\n",
    "\n",
    "As I run most of my code locally, I have some code to get it from Kaggle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2025-03-08T21:04:15.839924Z",
     "start_time": "2025-03-08T21:04:15.606127Z"
    }
   },
   "outputs": [],
   "source": [
    "competition_name= None\n",
    "dataset_name = 'cheese'\n",
    "\n",
    "import os\n",
    "from pathlib import Path\n",
    "\n",
    "iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')\n",
    "if competition_name:\n",
    "    if iskaggle: \n",
    "        comp_path = Path('../input/'+ competition_name)\n",
    "    else:\n",
    "        comp_path = Path(competition_name)\n",
    "        if not path.exists():\n",
    "            import zipfile,kaggle\n",
    "            kaggle.api.competition_download_cli(str(comp_path))\n",
    "            zipfile.ZipFile(f'{comp_path}.zip').extractall(comp_path)\n",
    "\n",
    "\n",
    "if dataset_name:\n",
    "    if iskaggle:\n",
    "        path = Path(f'../input/{dataset_name}')\n",
    "    else:\n",
    "        path = Path(dataset_name)\n",
    "        if not path.exists():\n",
    "            import zipfile, kaggle\n",
    "            kaggle.api.dataset_download_cli(dataset_name, path='.')\n",
    "            zipfile.ZipFile(f'{dataset_name}.zip').extractall(path)        \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have downloaded the data, we can start using it."
   ]
  }
 ],
 "metadata": {
  "kaggle": {
   "accelerator": "none",
   "dataSources": [
    {
     "datasetId": 6811989,
     "sourceId": 10951089,
     "sourceType": "datasetVersion"
    }
   ],
   "dockerImageVersionId": 30919,
   "isGpuEnabled": false,
   "isInternetEnabled": true,
   "language": "python",
   "sourceType": "notebook"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}