{ "cells": [ { "cell_type": "markdown", "id": "73ee3ec9", "metadata": {}, "source": [ "# **Hopsworks Feature Store** \n", "\n", "- Part 01: Backfill Features to the Feature Store\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/logicalclocks/hopsworks-tutorials/blob/master/advanced_tutorials/air_quality/1_backfill_feature_groups.ipynb)\n", "\n", "\n", "## 🗒️ This notebook is divided into the following sections:\n", "1. Fetch historical data\n", "2. Connect to the Hopsworks feature store\n", "3. Create feature groups and insert them into the feature store\n", "\n", "![tutorial-flow](../../images/01_featuregroups.png)" ] }, { "cell_type": "markdown", "id": "f04d5c5e", "metadata": {}, "source": [ "### 📝 Imports" ] }, { "cell_type": "code", "execution_count": 2, "id": "f65f0db4-1e4b-4f28-a17c-eadcb0d0f016", "metadata": { "tags": [] }, "outputs": [], "source": [ "%pip install geopy folium streamlit-folium --quiet" ] }, { "cell_type": "code", "execution_count": 3, "id": "cd165941", "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import time\n", "import requests\n", "from urllib.request import urlopen\n", "import json\n", "import pandas as pd\n", "import folium\n", "from functions import *\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "id": "ba9903fc", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "b7a1965a-0da7-4263-a68a-8b2e8cb753f1", "metadata": {}, "source": [ "## 🌍 Representing the Target Cities" ] }, { "cell_type": "code", "execution_count": 4, "id": "bd578db1-69e7-4230-b3f2-807b8056283a", "metadata": { "tags": [] }, "outputs": [], "source": [ "target_url = 'https://repo.hops.works/dev/jdowling/target_cities.json'\n", "response = urlopen(target_url)\n", "target_cities = json.loads(response.read())\n" ] }, { "cell_type": "markdown", "id": "2246ca9d", "metadata": {}, "source": [ "## 🌫 Processing Air 
Quality data" ] }, { "cell_type": "markdown", "id": "b4a1c5d1", "metadata": {}, "source": [ "### [🇪🇺 EEA](https://discomap.eea.europa.eu/map/fme/AirQualityExport.htm)\n", "#### EEA means European Environment Agency" ] }, { "cell_type": "code", "execution_count": 5, "id": "96b8be01-6286-4886-8043-56e0e49b314e", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'Amsterdam': [52.37, 4.89],\n", " 'Athina': [37.98, 23.73],\n", " 'Berlin': [52.52, 13.39],\n", " 'Gdansk': [54.37, 18.61],\n", " 'Kraków': [50.06, 19.94],\n", " 'London': [51.51, -0.13],\n", " 'Madrid': [40.42, -3.7],\n", " 'Marseille': [43.3, 5.37],\n", " 'Milano': [45.46, 9.19],\n", " 'München': [48.14, 11.58],\n", " 'Napoli': [40.84, 14.25],\n", " 'Paris': [48.85, 2.35],\n", " 'Sevilla': [37.39, -6.0],\n", " 'Stockholm': [59.33, 18.07],\n", " 'Tallinn': [59.44, 24.75],\n", " 'Varna': [43.21, 27.92],\n", " 'Wien': [48.21, 16.37]}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_cities[\"EU\"]" ] }, { "cell_type": "code", "execution_count": 6, "id": "5bb2a868-5f3a-4065-b651-318c24826b97", "metadata": {}, "outputs": [], "source": [ "df_eu = pd.read_csv(\"data/backfill_pm2_5_eu.csv\")" ] }, { "cell_type": "code", "execution_count": 7, "id": "5620df22-f744-4550-a81a-7e5d71aae542", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_eu.isna().sum().sum()" ] }, { "cell_type": "code", "execution_count": 8, "id": "b0e23728-a01d-45bc-bf25-4a9c77f21d66", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size of this dataframe: (63548, 3)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>city_name</th><th>date</th><th>pm2_5</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>16477</th><td>Kraków</td><td>2017-01-05</td><td>16.0</td></tr>\n",
"    <tr><th>12612</th><td>Gdansk</td><td>2016-09-15</td><td>10.0</td></tr>\n",
"    <tr><th>58456</th><td>Varna</td><td>2018-12-03</td><td>11.0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " city_name date pm2_5\n", "16477 Kraków 2017-01-05 16.0\n", "12612 Gdansk 2016-09-15 10.0\n", "58456 Varna 2018-12-03 11.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Size of this dataframe:\", df_eu.shape)\n", "\n", "df_eu.sample(3)" ] }, { "cell_type": "markdown", "id": "c2e45567-dd6b-4e5e-a153-82a2f4f32fbc", "metadata": {}, "source": [ "### [🇺🇸 USEPA](https://aqs.epa.gov/aqsweb/documents/data_api.html#daily)\n", "#### USEPA means United States Environmental Protection Agency\n", "[Manual downloading](https://www.epa.gov/outdoor-air-quality-data/download-daily-data)\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "c4952759-0fb9-4229-8b78-2e37cffb144d", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'Albuquerque': [35.08, -106.65],\n", " 'Atlanta': [33.75, -84.39],\n", " 'Chicago': [41.88, -87.62],\n", " 'Columbus': [39.96, -83.0],\n", " 'Dallas': [32.78, -96.8],\n", " 'Denver': [39.74, -104.98],\n", " 'Houston': [29.76, -95.37],\n", " 'Los Angeles': [34.05, -118.24],\n", " 'New York': [40.71, -74.01],\n", " 'Phoenix-Mesa': [33.66, -112.04],\n", " 'Salt Lake City': [40.76, -111.89],\n", " 'San Francisco': [37.78, -122.42],\n", " 'Tampa': [27.95, -82.46]}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_cities[\"US\"]" ] }, { "cell_type": "code", "execution_count": 10, "id": "c6aceaee-9431-48fd-818a-41fbdd07575c", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_us = pd.read_csv(\"data/backfill_pm2_5_us.csv\")" ] }, { "cell_type": "code", "execution_count": 11, "id": "4e7ff20e-8a1a-4fa3-b801-71beead7b5f2", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_us.isna().sum().sum()" ] }, { "cell_type": "code", "execution_count": 12, "id": 
"3818e3e1-8674-4634-9023-92be8410fba5", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size of this dataframe: (46037, 3)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>date</th><th>city_name</th><th>pm2_5</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>39995</th><td>2016-05-09</td><td>San Francisco</td><td>7.3</td></tr>\n",
"    <tr><th>18276</th><td>2016-04-10</td><td>Denver</td><td>3.1</td></tr>\n",
"    <tr><th>32122</th><td>2014-10-17</td><td>Phoenix-Mesa</td><td>11.7</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " date city_name pm2_5\n", "39995 2016-05-09 San Francisco 7.3\n", "18276 2016-04-10 Denver 3.1\n", "32122 2014-10-17 Phoenix-Mesa 11.7" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Size of this dataframe:\", df_us.shape)\n", "\n", "df_us.sample(3)" ] }, { "cell_type": "markdown", "id": "25557752-31c8-4da9-a52c-4415c4d20ae3", "metadata": {}, "source": [ "### 🏢 Processing special city - `Seattle`\n", "#### We need different stations across the Seattle. \n", "I downloaded daily `PM2.5` data manually [here](https://www.epa.gov/outdoor-air-quality-data/download-daily-data)" ] }, { "cell_type": "code", "execution_count": 13, "id": "2f54d2cb-991c-47cb-a686-76c9f7a87170", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'Bellevue-SE 12th St': [47.60086, -122.1484],\n", " 'DARRINGTON - FIR ST (Darrington High School)': [48.2469, -121.6031],\n", " 'KENT - JAMES & CENTRAL': [47.38611, -122.23028],\n", " 'LAKE FOREST PARK TOWNE CENTER': [47.755, -122.2806],\n", " 'MARYSVILLE - 7TH AVE (Marysville Junior High)': [48.05432, -122.17153],\n", " 'NORTH BEND - NORTH BEND WAY': [47.49022, -121.77278],\n", " 'SEATTLE - BEACON HILL': [47.56824, -122.30863],\n", " 'SEATTLE - DUWAMISH': [47.55975, -122.33827],\n", " 'SEATTLE - SOUTH PARK #2': [47.53091, -122.3208],\n", " 'Seattle-10th & Weller': [47.59722, -122.31972],\n", " 'TACOMA - ALEXANDER AVE': [47.2656, -122.3858],\n", " 'TACOMA - L STREET': [47.1864, -122.4517],\n", " 'Tacoma-S 36th St': [47.22634, -122.46256],\n", " 'Tukwila Allentown': [47.49854, -122.27839],\n", " 'Tulalip-Totem Beach Rd': [48.06534, -122.28519]}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target_cities[\"Seattle\"]" ] }, { "cell_type": "code", "execution_count": 14, "id": "31c8505d-68bc-40b6-be0f-42d8532dbd48", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_seattle = 
pd.read_csv(\"data/backfill_pm2_5_seattle.csv\")" ] }, { "cell_type": "code", "execution_count": 15, "id": "2f6583c9-3b2a-41c6-a020-aeede88c4867", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_seattle.isna().sum().sum()" ] }, { "cell_type": "code", "execution_count": 16, "id": "065a5b03-28f7-475c-9c6a-4340388157d8", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size of this dataframe: (46479, 3)\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>city_name</th><th>date</th><th>pm2_5</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>3345</th><td>MARYSVILLE - 7TH AVE (Marysville Junior High)</td><td>2013-05-03</td><td>5.3</td></tr>\n",
"    <tr><th>22979</th><td>TACOMA - L STREET</td><td>2018-08-13</td><td>19.2</td></tr>\n",
"    <tr><th>14456</th><td>DARRINGTON - FIR ST (Darrington High School)</td><td>2016-11-09</td><td>8.4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " city_name date pm2_5\n", "3345 MARYSVILLE - 7TH AVE (Marysville Junior High) 2013-05-03 5.3\n", "22979 TACOMA - L STREET 2018-08-13 19.2\n", "14456 DARRINGTON - FIR ST (Darrington High School) 2016-11-09 8.4" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(\"Size of this dataframe:\", df_seattle.shape)\n", "\n", "df_seattle.sample(3)" ] }, { "cell_type": "code", "execution_count": 17, "id": "e3b17ca4-0e9d-4207-ad62-90ea9c157def", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "city_name\n", "NORTH BEND - NORTH BEND WAY 3705\n", "TACOMA - L STREET 3696\n", "SEATTLE - BEACON HILL 3691\n", "MARYSVILLE - 7TH AVE (Marysville Junior High) 3648\n", "DARRINGTON - FIR ST (Darrington High School) 3614\n", "SEATTLE - SOUTH PARK #2 3577\n", "TACOMA - ALEXANDER AVE 3569\n", "KENT - JAMES & CENTRAL 3556\n", "SEATTLE - DUWAMISH 3439\n", "Seattle-10th & Weller 3097\n", "LAKE FOREST PARK TOWNE CENTER 2999\n", "Tacoma-S 36th St 2574\n", "Bellevue-SE 12th St 2172\n", "Tukwila Allentown 2074\n", "Tulalip-Totem Beach Rd 1068\n", "Name: count, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_seattle.city_name.value_counts()" ] }, { "cell_type": "markdown", "id": "c278a55d-f083-4f95-b292-92e545b9c408", "metadata": {}, "source": [ "### 🌟 All together" ] }, { "cell_type": "code", "execution_count": 18, "id": "0d55ae92-4bf9-43ae-8841-6767f5f68bec", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_air_quality = pd.concat([df_eu, df_us, df_seattle]).reset_index(drop=True)" ] }, { "cell_type": "code", "execution_count": 19, "id": "d5df39e2-2ce6-48df-9063-9827da8e7317", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>city_name</th><th>date</th><th>pm2_5</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>155596</th><td>Tacoma-S 36th St</td><td>2023-03-12</td><td>13.9</td></tr>\n",
"    <tr><th>72851</th><td>Chicago</td><td>2018-07-04</td><td>10.3</td></tr>\n",
"    <tr><th>150716</th><td>Bellevue-SE 12th St</td><td>2022-12-07</td><td>1.8</td></tr>\n",
"    <tr><th>88999</th><td>Los Angeles</td><td>2016-07-11</td><td>10.5</td></tr>\n",
"    <tr><th>127366</th><td>Tacoma-S 36th St</td><td>2017-12-01</td><td>4.6</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " city_name date pm2_5\n", "155596 Tacoma-S 36th St 2023-03-12 13.9\n", "72851 Chicago 2018-07-04 10.3\n", "150716 Bellevue-SE 12th St 2022-12-07 1.8\n", "88999 Los Angeles 2016-07-11 10.5\n", "127366 Tacoma-S 36th St 2017-12-01 4.6" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_air_quality.sample(5)" ] }, { "cell_type": "code", "execution_count": 20, "id": "794c30fe-fb54-4fa0-a34c-5cef68f52473", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "(156064, 3)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_air_quality.shape" ] }, { "cell_type": "code", "execution_count": 21, "id": "ed9bc7f1-d62e-4b1f-97af-6ecd30fe4b67", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Index(['city_name', 'date', 'pm2_5'], dtype='object')" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_air_quality.columns" ] }, { "cell_type": "markdown", "id": "88a9e0ef-e9d2-4e3c-91af-c4e619b8c906", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "4687e802", "metadata": { "tags": [] }, "source": [ "## 🌦 Loading Weather Data from [Open Meteo](https://open-meteo.com/en/docs)" ] }, { "cell_type": "code", "execution_count": 22, "id": "c46283b4", "metadata": {}, "outputs": [], "source": [ "df_weather = pd.read_csv(\"data/backfill_weather.csv\")" ] }, { "cell_type": "code", "execution_count": 23, "id": "1921b61c-d002-417e-88a6-9fe1cad0a7d4", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "city_name\n", "Amsterdam 3767\n", "Athina 3767\n", "Berlin 3767\n", "Gdansk 3767\n", "Kraków 3767\n", "London 3767\n", "Madrid 3767\n", "Marseille 3767\n", "Milano 3767\n", "München 3767\n", "Napoli 3767\n", "Paris 3767\n", "Sevilla 3767\n", "Stockholm 3767\n", "Tallinn 3767\n", "Varna 3767\n", "Wien 3767\n", "Albuquerque 3767\n", "Atlanta 3767\n", "Chicago 3767\n", 
"Columbus 3767\n", "Dallas 3767\n", "Denver 3767\n", "Houston 3767\n", "Los Angeles 3767\n", "New York 3767\n", "Phoenix-Mesa 3767\n", "Salt Lake City 3767\n", "San Francisco 3767\n", "Tampa 3767\n", "Bellevue-SE 12th St 3767\n", "DARRINGTON - FIR ST (Darrington High School) 3767\n", "KENT - JAMES & CENTRAL 3767\n", "LAKE FOREST PARK TOWNE CENTER 3767\n", "MARYSVILLE - 7TH AVE (Marysville Junior High) 3767\n", "NORTH BEND - NORTH BEND WAY 3767\n", "SEATTLE - BEACON HILL 3767\n", "SEATTLE - DUWAMISH 3767\n", "SEATTLE - SOUTH PARK #2 3767\n", "Seattle-10th & Weller 3767\n", "TACOMA - ALEXANDER AVE 3767\n", "TACOMA - L STREET 3767\n", "Tacoma-S 36th St 3767\n", "Tukwila Allentown 3767\n", "Tulalip-Totem Beach Rd 3767\n", "Name: count, dtype: int64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_weather.city_name.value_counts()" ] }, { "cell_type": "code", "execution_count": 24, "id": "8d5dcd0a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>city_name</th><th>date</th><th>temperature_max</th><th>temperature_min</th><th>precipitation_sum</th><th>rain_sum</th><th>snowfall_sum</th><th>precipitation_hours</th><th>wind_speed_max</th><th>wind_gusts_max</th><th>wind_direction_dominant</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>56824</th><td>Varna</td><td>2014-03-01</td><td>9.4</td><td>5.5</td><td>2.6</td><td>2.6</td><td>0.00</td><td>7.0</td><td>13.2</td><td>22.7</td><td>150</td></tr>\n",
"    <tr><th>146508</th><td>SEATTLE - SOUTH PARK #2</td><td>2022-12-08</td><td>5.6</td><td>1.8</td><td>7.9</td><td>7.6</td><td>0.21</td><td>15.0</td><td>18.1</td><td>38.9</td><td>285</td></tr>\n",
"    <tr><th>53035</th><td>Tallinn</td><td>2014-01-31</td><td>-8.6</td><td>-17.0</td><td>1.0</td><td>0.0</td><td>0.98</td><td>3.0</td><td>29.6</td><td>55.8</td><td>158</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " city_name date temperature_max temperature_min \\\n", "56824 Varna 2014-03-01 9.4 5.5 \n", "146508 SEATTLE - SOUTH PARK #2 2022-12-08 5.6 1.8 \n", "53035 Tallinn 2014-01-31 -8.6 -17.0 \n", "\n", " precipitation_sum rain_sum snowfall_sum precipitation_hours \\\n", "56824 2.6 2.6 0.00 7.0 \n", "146508 7.9 7.6 0.21 15.0 \n", "53035 1.0 0.0 0.98 3.0 \n", "\n", " wind_speed_max wind_gusts_max wind_direction_dominant \n", "56824 13.2 22.7 150 \n", "146508 18.1 38.9 285 \n", "53035 29.6 55.8 158 " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_weather.sample(3)" ] }, { "cell_type": "markdown", "id": "cc9b7ad6", "metadata": {}, "source": [ "---" ] }, { "cell_type": "code", "execution_count": 25, "id": "a8f886c3-a5ac-4370-a6a2-22838ab7409e", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_air_quality.date = pd.to_datetime(df_air_quality.date)\n", "df_weather.date = pd.to_datetime(df_weather.date)\n", "\n", "df_air_quality[\"unix_time\"] = df_air_quality[\"date\"].apply(convert_date_to_unix)\n", "df_weather[\"unix_time\"] = df_weather[\"date\"].apply(convert_date_to_unix)" ] }, { "cell_type": "code", "execution_count": 26, "id": "1b6af890-87a3-4468-8eda-576c2dd75464", "metadata": { "tags": [] }, "outputs": [], "source": [ "df_air_quality.date = df_air_quality.date.astype(str)\n", "df_weather.date = df_weather.date.astype(str)" ] }, { "cell_type": "code", "execution_count": 27, "id": "2ad5ea08", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>city_name</th><th>date</th><th>pm2_5</th><th>unix_time</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Amsterdam</td><td>2013-01-01</td><td>14.0</td><td>1356994800000</td></tr>\n",
"    <tr><th>1</th><td>Amsterdam</td><td>2013-01-02</td><td>8.0</td><td>1357081200000</td></tr>\n",
"    <tr><th>2</th><td>Amsterdam</td><td>2013-01-03</td><td>12.0</td><td>1357167600000</td></tr>\n",
"    <tr><th>3</th><td>Amsterdam</td><td>2013-01-04</td><td>12.0</td><td>1357254000000</td></tr>\n",
"    <tr><th>4</th><td>Amsterdam</td><td>2013-01-05</td><td>14.0</td><td>1357340400000</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>156059</th><td>MARYSVILLE - 7TH AVE (Marysville Junior High)</td><td>2023-03-30</td><td>7.9</td><td>1680127200000</td></tr>\n",
"    <tr><th>156060</th><td>MARYSVILLE - 7TH AVE (Marysville Junior High)</td><td>2023-03-31</td><td>3.7</td><td>1680213600000</td></tr>\n",
"    <tr><th>156061</th><td>MARYSVILLE - 7TH AVE (Marysville Junior High)</td><td>2023-04-01</td><td>3.4</td><td>1680300000000</td></tr>\n",
"    <tr><th>156062</th><td>MARYSVILLE - 7TH AVE (Marysville Junior High)</td><td>2023-04-02</td><td>3.1</td><td>1680386400000</td></tr>\n",
"    <tr><th>156063</th><td>MARYSVILLE - 7TH AVE (Marysville Junior High)</td><td>2023-04-03</td><td>4.4</td><td>1680472800000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>156064 rows × 4 columns</p>\n",
"</div>
" ], "text/plain": [ " city_name date pm2_5 \\\n", "0 Amsterdam 2013-01-01 14.0 \n", "1 Amsterdam 2013-01-02 8.0 \n", "2 Amsterdam 2013-01-03 12.0 \n", "3 Amsterdam 2013-01-04 12.0 \n", "4 Amsterdam 2013-01-05 14.0 \n", "... ... ... ... \n", "156059 MARYSVILLE - 7TH AVE (Marysville Junior High) 2023-03-30 7.9 \n", "156060 MARYSVILLE - 7TH AVE (Marysville Junior High) 2023-03-31 3.7 \n", "156061 MARYSVILLE - 7TH AVE (Marysville Junior High) 2023-04-01 3.4 \n", "156062 MARYSVILLE - 7TH AVE (Marysville Junior High) 2023-04-02 3.1 \n", "156063 MARYSVILLE - 7TH AVE (Marysville Junior High) 2023-04-03 4.4 \n", "\n", " unix_time \n", "0 1356994800000 \n", "1 1357081200000 \n", "2 1357167600000 \n", "3 1357254000000 \n", "4 1357340400000 \n", "... ... \n", "156059 1680127200000 \n", "156060 1680213600000 \n", "156061 1680300000000 \n", "156062 1680386400000 \n", "156063 1680472800000 \n", "\n", "[156064 rows x 4 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_air_quality" ] }, { "cell_type": "markdown", "id": "f2ebd846-0420-4e4c-8a5b-0827fa91c693", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "cb6f83ba", "metadata": {}, "source": [ "### 🔮 Connecting to Hopsworks Feature Store " ] }, { "cell_type": "code", "execution_count": 29, "id": "dd068240", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected. Call `.close()` to terminate connection gracefully.\n", "Copy your Api Key (first register/login): https://c.app.hopsworks.ai/account/api/generated\n", "Connected. Call `.close()` to terminate connection gracefully.\n", "\n", "Multiple projects found. \n", "\n", "\t (1) annikaij\n", "\t (2) miknie20\n", "\n", "Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/549019\n", "Connected. 
Call `.close()` to terminate connection gracefully.\n" ] } ], "source": [ "import hopsworks\n", "\n", "project = hopsworks.login()\n", "\n", "fs = project.get_feature_store() " ] }, { "cell_type": "code", "execution_count": 30, "id": "71db5ac1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{\"expectation_type\": \"expect_column_values_to_be_between\", \"kwargs\": {\"column\": \"pm2_5\", \"min_value\": 0.0, \"max_value\": 1000.0}, \"meta\": {}}" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from great_expectations.core import ExpectationSuite, ExpectationConfiguration\n", "\n", "expectation_suite = ExpectationSuite(expectation_suite_name=\"pmi_data\")\n", "\n", "expectation_suite.add_expectation(\n", " ExpectationConfiguration(\n", " expectation_type=\"expect_column_values_to_be_between\",\n", " kwargs={\n", " \"column\": \"pm2_5\", \n", " \"min_value\": 0.0,\n", " \"max_value\": 1000.0,\n", " }\n", " )\n", ")" ] }, { "cell_type": "markdown", "id": "63d8c3b9", "metadata": {}, "source": [ "## 🪄 Creating Feature Groups" ] }, { "cell_type": "markdown", "id": "4a2515c4", "metadata": {}, "source": [ "### 🌫 Air Quality Data" ] }, { "cell_type": "code", "execution_count": 31, "id": "9d7088a8", "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "air_quality_fg = fs.get_or_create_feature_group(\n", " name='air_quality',\n", " description='Air Quality characteristics of each day',\n", " version=1,\n", " primary_key=['city_name'], #'unix_time',\n", " online_enabled=False,\n", " expectation_suite = expectation_suite,\n", " event_time=\"unix_time\"\n", ") " ] }, { "cell_type": "code", "execution_count": 32, "id": "7e04a975-bb58-42e2-9abd-90e68ae37864", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature Group created successfully, explore it at \n", "https://c.app.hopsworks.ai:443/p/549019/fs/544841/fg/758117\n", "Validation failed.\n", "Validation Report 
saved successfully, explore a summary at https://c.app.hopsworks.ai:443/p/549019/fs/544841/fg/758117\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Uploading Dataframe: 100.00% |██████████| Rows 156064/156064 | Elapsed Time: 00:16 | Remaining Time: 00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Launching job: air_quality_1_offline_fg_materialization\n", "Job started successfully, you can follow the progress at \n", "https://c.app.hopsworks.ai/p/549019/jobs/named/air_quality_1_offline_fg_materialization/executions\n" ] }, { "data": { "text/plain": [ "(,\n", " {\n", " \"evaluation_parameters\": {},\n", " \"success\": false,\n", " \"statistics\": {\n", " \"evaluated_expectations\": 1,\n", " \"successful_expectations\": 0,\n", " \"unsuccessful_expectations\": 1,\n", " \"success_percent\": 0.0\n", " },\n", " \"results\": [\n", " {\n", " \"exception_info\": {\n", " \"raised_exception\": false,\n", " \"exception_message\": null,\n", " \"exception_traceback\": null\n", " },\n", " \"expectation_config\": {\n", " \"expectation_type\": \"expect_column_values_to_be_between\",\n", " \"kwargs\": {\n", " \"column\": \"pm2_5\",\n", " \"min_value\": 0.0,\n", " \"max_value\": 1000.0\n", " },\n", " \"meta\": {\n", " \"expectationId\": 473089\n", " }\n", " },\n", " \"success\": false,\n", " \"result\": {\n", " \"element_count\": 156064,\n", " \"missing_count\": 0,\n", " \"missing_percent\": 0.0,\n", " \"unexpected_count\": 84,\n", " \"unexpected_percent\": 0.05382407217551774,\n", " \"unexpected_percent_total\": 0.05382407217551774,\n", " \"unexpected_percent_nonmissing\": 0.05382407217551774,\n", " \"partial_unexpected_list\": [\n", " -1.0,\n", " -1.0,\n", " -1.0,\n", " -1.0,\n", " -0.2,\n", " -0.1,\n", " -1.2,\n", " -1.2,\n", " -1.1,\n", " -0.9,\n", " -0.6,\n", " -0.2,\n", " -1.0,\n", " -0.5,\n", " -0.7,\n", " -0.1,\n", " -0.4,\n", " -0.5,\n", " -0.1,\n", " -0.2\n", " ]\n", " },\n", " \"meta\": {\n", " \"ingestionResult\": \"INGESTED\",\n", " 
\"validationTime\": \"2024-04-27T01:53:43.000307Z\"\n", " }\n", " }\n", " ],\n", " \"meta\": {\n", " \"great_expectations_version\": \"0.15.12\",\n", " \"expectation_suite_name\": \"pmi_data\",\n", " \"run_id\": {\n", " \"run_name\": null,\n", " \"run_time\": \"2024-04-27T13:53:43.307739+00:00\"\n", " },\n", " \"batch_kwargs\": {\n", " \"ge_batch_id\": \"8f57f63a-049d-11ef-9d82-e2cf145aedc8\"\n", " },\n", " \"batch_markers\": {},\n", " \"batch_parameters\": {},\n", " \"validation_time\": \"20240427T135343.307573Z\",\n", " \"expectation_suite_meta\": {\n", " \"great_expectations_version\": \"0.15.12\"\n", " }\n", " }\n", " })" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "air_quality_fg.insert(df_air_quality, write_options={\"wait_for_job\": False})" ] }, { "cell_type": "markdown", "id": "a73a9029", "metadata": {}, "source": [ "### 🌦 Weather Data" ] }, { "cell_type": "code", "execution_count": 33, "id": "acc2b799", "metadata": {}, "outputs": [], "source": [ "weather_fg = fs.get_or_create_feature_group(\n", " name='weather',\n", " description='Weather characteristics of each day',\n", " version=1,\n", " primary_key=['city_name'], #'unix_time'\n", " online_enabled=False,\n", " event_time=\"unix_time\"\n", ") " ] }, { "cell_type": "code", "execution_count": 34, "id": "9583b4d1-e2e3-4f56-9e5d-23caa0c49457", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature Group created successfully, explore it at \n", "https://c.app.hopsworks.ai:443/p/549019/fs/544841/fg/760147\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Uploading Dataframe: 100.00% |██████████| Rows 169515/169515 | Elapsed Time: 00:22 | Remaining Time: 00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Launching job: weather_1_offline_fg_materialization\n", "Job started successfully, you can follow the progress at \n", 
"https://c.app.hopsworks.ai/p/549019/jobs/named/weather_1_offline_fg_materialization/executions\n" ] }, { "data": { "text/plain": [ "(, None)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "weather_fg.insert(df_weather, write_options={\"wait_for_job\": False})" ] }, { "cell_type": "code", "execution_count": null, "id": "b087a12f", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "ucloud-sml", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }