# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "marimo",
#     "plotly.express",
#     "plotly==6.0.1",
#     "duckdb==1.2.1",
#     "sqlglot==26.11.1",
#     "pyarrow==19.0.1",
#     "polars==1.27.1",
# ]
# ///

import marimo

__generated_with = "0.12.10"
app = marimo.App(width="medium")


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""#Loading CSVs with DuckDB""")
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        <p> I remember when I first learnt about DuckDB; it was a game changer. I used to load the data I wanted to work on into database software like MS SQL Server, and then build a bridge to an IDE for the language I wanted to use, like Python or R. It was quite the hassle. DuckDB changed my whole world: I could just import the data file into the IDE or notebook, open a DuckDB connection, and there we go! Then I realized I didn't even need the step of first importing the file with Python. I could query the CSV file directly using SQL through a DuckDB connection.</p>

        ## Introduction
        <p> I found this dataset on the evolution of AI research by discipline from the <a href= "https://oecd.ai/en/data?selectedArea=ai-research&selectedVisualization=16731"> OECD</a>, and it piqued my interest. My sense is that publications in natural language processing jumped drastically in the mid-2010s, and I'm excited to find out whether that's the case. </p>

        <p> In this notebook, we'll: </p>
        <ul>
            <li> Import the CSV file into the notebook</li>
            <li> Create another table within the database based on the CSV</li>
            <li> Dig into how publications on natural language processing have evolved over the years</li>
        </ul>
        """
    )
    return
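

@app.cell
def _():
    # A minimal sketch, not part of the original analysis: the same "query the
    # CSV directly" idea using the plain duckdb Python API instead of a marimo
    # SQL cell. Recent DuckDB releases auto-load the httpfs extension needed
    # for HTTPS URLs, so no explicit INSTALL/LOAD step is shown here.
    import duckdb

    _preview = duckdb.sql(
        """
        SELECT *
        FROM 'https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv'
        LIMIT 5;
        """
    ).pl()  # materialize the relation as a Polars DataFrame
    _preview
    return (duckdb,)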


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""##Load the CSV""")
    return


@app.cell
def _(mo):
    _df = mo.sql(
        f"""
        /* Another way to load the CSV is with read_csv():
        SELECT *
        FROM read_csv('https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv')
        Note: the github.com .../blob/... page URL returns HTML rather than the
        CSV itself, so the raw.githubusercontent.com URL is used throughout.
        */
        SELECT * 
        FROM "https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv"
        LIMIT 5;
        """
    )
    return
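

@app.cell
def _(mo):
    # A hedged sketch, not in the original notebook: read_csv() also accepts
    # explicit options when auto-detection needs overriding. The option values
    # below are illustrative assumptions, not requirements of this file.
    _df = mo.sql(
        f"""
        SELECT *
        FROM read_csv(
            'https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv',
            header = true,
            auto_detect = true
        )
        LIMIT 5;
        """
    )
    return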


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""##Create Another Table""")
    return


@app.cell
def _(mo):
    Discipline_Analysis = mo.sql(
        f"""
        -- Build a table from the CSV that keeps only the specified columns
        CREATE TABLE Domain_Analysis AS
            SELECT Year, Concept, publications
            FROM "https://raw.githubusercontent.com/Mustjaab/Loading_CSVs_in_DuckDB/refs/heads/main/AI_Research_Data.csv"
        """
    )
    return Discipline_Analysis, Domain_Analysis


@app.cell
def _(Domain_Analysis, mo):
    Analysis = mo.sql(
        f"""
        -- Collapse duplicate rows; grouping by every column with no aggregates
        -- behaves like SELECT DISTINCT
        SELECT DISTINCT Year, Concept, publications
        FROM Domain_Analysis
        ORDER BY Year
        """
    )
    return (Analysis,)


@app.cell
def _(Domain_Analysis, mo):
    _df = mo.sql(
        f"""
        -- Compare average yearly NLP publications before 2020 vs 2020 onward
        SELECT
            AVG(CASE WHEN Year < 2020 THEN publications END) AS avg_pre_2020,
            AVG(CASE WHEN Year >= 2020 THEN publications END) AS avg_2020_onward
        FROM Domain_Analysis
        WHERE Concept = 'Natural language processing';
        """
    )
    return


@app.cell
def _(Domain_Analysis, mo):
    NLP_Analysis = mo.sql(
        f"""
        -- Label each year's NLP publication count as Pre-2020 or 2020-onward
        SELECT
            publications, 
            CASE 
                WHEN Year < 2020 THEN 'Pre-2020' 
                ELSE '2020-onward' 
            END AS period
        FROM Domain_Analysis
        WHERE Year >= 2000 
        AND Concept = 'Natural language processing';
        """,
        output=False
    )
    return (NLP_Analysis,)


@app.cell
def _(NLP_Analysis, px):
    px.box(NLP_Analysis, x='period', y='publications', color='period')
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""<p> We can see there's a significant increase in NLP publications 2020 and onwards which definitely makes sense provided the rapid emergence of commercial large language models, and AI assistants. </p>""")

@app.cell(hide_code=True)
def _(mo):
    mo.md(
        r"""
        ## Conclusion
        <p> In this notebook, we learned how to:</p> 
        <ul>
            <li> Load a CSV into DuckDB </li>
            <li> Create other tables using the imported CSV </li>
            <li> Seamlessly analyze and visualize data across SQL and Python cells</li>
        </ul>
        """
    )
    return


@app.cell
def _():
    # Dataframe libraries that marimo's SQL cells use to return query results
    import pyarrow
    import polars
    return polars, pyarrow


@app.cell
def _():
    import marimo as mo
    import plotly.express as px
    return mo, px


if __name__ == "__main__":
    app.run()