## Motivation
In an age where culinary creativity often crosses into the realm of the absurd, the sanctity of pizza is under threat. Everywhere you look, there’s a new, wild rendition being labelled as “pizza”. Not every round dough with toppings earns the right to be called a pizza.
In “The Great Pizza Clean-Up”, we embark on a flavourful journey. Armed with Lea, a minimalistic alternative to dbt developed by Max Halford, and a dash of Italian flair, we set out to distinguish the authentic from the outrageous.
It’s time to restore the honour of this beloved dish, one data point at a time.
## The data
In my quest to showcase Lea’s simplicity, I stumbled upon the perfect candidate: the Pizza Place Sales dataset. It consists of a year’s worth of sales from a fictitious pizza place, including the date and time of each order and the pizzas served, with additional details on the type, size, quantity, price, and ingredients.
But here’s the twist: amongst these pizze lie some so wildly unconventional that they just can’t make the cut.
With Lea’s capabilities and a little help from the pizza police 🚨, we’re set to explore some intriguing aspects of our data, while carefully navigating around the pizza pretenders:
- How many customers do we have each day? Are there any peak hours?
- How many pizzas are typically in an order? Do we have any bestsellers?
- How much money did we make this year? Can we identify any seasonality in the sales?
Join me as we slice through the data 🍕
## Setting up
All the essential code for our analysis is available on GitHub.
Our data journey begins with four key CSV files, which you’ll find in the `seeds/` folder here. These files are the foundational ingredients of our pizza-themed exploration. Our goal is to integrate them into our DuckDB database under a `raw` schema, and Lea is the tool that simplifies this process.
We organise our workspace by creating a `views/` folder, with a `raw/` subfolder inside it, for managing our data transformations. Think of it as setting up the prep area for our data analysis. We transfer our CSV files into DuckDB using pandas; it’s important to note that the name of the dataframe will serve as the table name in DuckDB. Here’s a quick glimpse of the process:
```python
import pathlib

import pandas as pd

# seeds/ sits at the project root, two levels above views/raw/
here = pathlib.Path(__file__).parent
pizzas = pd.read_csv(here.parent.parent / "seeds" / "pizzas.csv")
```
Our project structure, ready for data processing, will resemble the following ⬇️
```
views
└── raw
    ├── order_details.py
    ├── orders.py
    ├── pizza_types.py
    └── pizzas.py
```
## Preparing the data
With our raw data now successfully loaded, it’s time to roll up our sleeves and delve deeper. A closer inspection of the `pizzas` and `pizza_types` tables reveals an opportunity for merging to streamline our analysis.
Take, for example, the data for “Barbecue Chicken Pizza” (a choice that might raise a few Italian eyebrows 🤨) in the `pizzas` table:
| pizza_id | pizza_type_id | size | price |
|---|---|---|---|
| bbq_ckn_s | bbq_ckn | S | 12.75 |
| bbq_ckn_m | bbq_ckn | M | 16.75 |
| bbq_ckn_l | bbq_ckn | L | 20.75 |
and from the `pizza_types` table:
| pizza_type_id | name | category | ingredients |
|---|---|---|---|
| bbq_ckn | The Barbecue Chicken Pizza | Chicken | Barbecued Chicken, Red Peppers, Green Peppers, Tomatoes, Red Onions, Barbecue Sauce |
A more intuitive format would merge these details, like so:
| pizza_type_id | name | category | size | price |
|---|---|---|---|---|
| bbq_ckn | The Barbecue Chicken Pizza | Chicken | S | 12.75 |
| bbq_ckn | The Barbecue Chicken Pizza | Chicken | M | 16.75 |
| bbq_ckn | The Barbecue Chicken Pizza | Chicken | L | 20.75 |
And we’ll relocate the ingredients into a separate table:
| pizza_type_id | ingredient_name |
|---|---|
| bbq_ckn | Barbecued Chicken |
| bbq_ckn | Red Peppers |
| bbq_ckn | Green Peppers |
| bbq_ckn | Tomatoes |
| bbq_ckn | Red Onions |
| bbq_ckn | Barbecue Sauce |
Since our `raw` data is already nestled in our database, transforming it becomes a breeze with Lea. We can use a `SELECT` query for this transformation and place the resulting data in the `staging/` folder.
For example, our `pizzas.sql` file in the `staging` folder would look something like this:
```sql
SELECT
    pizza_type_id,
    name,
    category,
    size,
    price
FROM
    raw.pizzas
    JOIN raw.pizza_types USING (pizza_type_id)
```
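The `pizza_ingredients` table from earlier can be built the same way, by exploding the comma-separated `ingredients` column of `raw.pizza_types`. Here’s a minimal sketch of what such a query could look like in DuckDB (the exact query lives in the repository):

```sql
-- A sketch: split the comma-separated ingredients string
-- into one row per (pizza_type_id, ingredient_name) pair
SELECT
    pizza_type_id,
    trim(ingredient_name) AS ingredient_name
FROM (
    SELECT
        pizza_type_id,
        unnest(string_split(ingredients, ',')) AS ingredient_name
    FROM
        raw.pizza_types
)
```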
We apply similar transformations to the `orders` and `order_details` data, as sketched below. For a detailed look at the real queries, check out the provided code.
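To give a rough idea of the shape this takes, here’s a sketch of what the staged orders view might look like. The column names are my assumptions about the staged schema (the repository has the authoritative version):

```sql
-- A sketch of staging/orders.sql: one row per pizza ordered,
-- with a proper timestamp and the pizza_type_id used downstream.
-- Assumes date and time are stored as text in raw.orders.
SELECT
    order_id,
    CAST(date || ' ' || time AS TIMESTAMP) AS datetime,
    pizza_type_id,
    quantity
FROM
    raw.orders
    JOIN raw.order_details USING (order_id)
    JOIN raw.pizzas USING (pizza_id)
```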
After these steps, our `views/` folder is now neatly organised as follows ⬇️:
```
views
├── raw
│   ├── order_details.py
│   ├── orders.py
│   ├── pizza_types.py
│   └── pizzas.py
└── staging
    ├── orders.sql
    ├── pizza_ingredients.sql
    └── pizzas.sql
```
With our data neatly prepped and staged, we’re all set for the deep-dive analysis.
## Pizza Police
🚨🚨🚨 Alert! The pizza police have entered the chat! 🚨🚨🚨
With the help of our newly crafted ingredients table, our vigilant pizza police have spotted some truly outrageous pizza claims. Alfredo sauce with chicken, chipotle sauce, pineapple, and even Thai sweet chilli sauce on a pizza? This culinary chaos is simply unacceptable!
Thankfully, the pizza police, along with Lea, are here to enforce a standard of authenticity. It’s time to establish some order in this pizza pandemonium. We’re setting up a `pizza_police/` folder within our `views/` directory to sort things out. And let’s get it straight once and for all: it’s “pizze,” not “pizzas”!
As it turns out, maintaining pizza integrity is just a matter of a few well-placed `SELECT` queries on our staged data.
For example, here’s how we ensure only the most authentic pizze make it through our filter:
```sql
SELECT
    *
FROM
    staging.pizzas sp
WHERE
    NOT EXISTS (
        SELECT
            1
        FROM
            staging.pizza_ingredients pi
        WHERE
            pi.pizza_type_id = sp.pizza_type_id
            AND pi.ingredient_name IN (
                'Alfredo Sauce',
                'Barbecued Chicken',
                'Chicken',
                'Chipotle Sauce',
                'Pineapple',
                'Thai Sweet Chilli Sauce'
            )
    );
```
We’re applying a similar level of scrutiny to our `orders` as well; a sketch follows, and you can view the actual query here.
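The idea is simply to keep the order lines that reference a pizza which survived the filter above. A minimal sketch (again, not the verbatim query from the repository):

```sql
-- A sketch of pizza_police/orders.sql: drop order lines whose
-- pizza type didn't pass the authenticity filter
SELECT
    o.*
FROM
    staging.orders o
WHERE
    EXISTS (
        SELECT
            1
        FROM
            pizza_police.pizze p
        WHERE
            p.pizza_type_id = o.pizza_type_id
    );
```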
Now, our `views/` folder looks clean:
```
views
├── pizza_police
│   ├── orders.sql
│   └── pizze.sql
├── raw
│   ├── order_details.py
│   ├── orders.py
│   ├── pizza_types.py
│   └── pizzas.py
└── staging
    ├── orders.sql
    ├── pizza_ingredients.sql
    └── pizzas.sql
```
With the pizza police on our side, we’re ready to answer our initial questions, with the assurance that only the most authentic pizze are included in our analysis.
## Analytics
It’s time to crunch some numbers and turn our data into meaningful insights, all while respecting traditions.
### Crafting KPIs
First, we create an `analytics` schema. Within it, we consolidate our primary queries into a single `kpis.sql` file (a sketch follows this list) covering:
- The daily customer count.
- The average number of pizze per order.
- The total sales for the year.
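To give a flavour of the approach, here’s a minimal sketch of what such a `kpis.sql` could look like, stacking each metric into a single `(metric, value)` table. The column names are assumptions about the staged schema; the exact queries are in the repository:

```sql
-- A sketch: compute each KPI and stack them into one table.
-- Assumes pizza_police.orders has one row per pizza ordered,
-- with order_id, datetime and quantity.
WITH per_order AS (
    SELECT order_id, SUM(quantity) AS n_pizze
    FROM pizza_police.orders
    GROUP BY order_id
)
SELECT 'total_orders' AS metric, COUNT(*)::VARCHAR AS value
FROM per_order
UNION ALL
SELECT 'average_pizze_per_order', ROUND(AVG(n_pizze), 1)::VARCHAR
FROM per_order
UNION ALL
SELECT 'avg_orders_per_day',
       ROUND(COUNT(DISTINCT order_id) / COUNT(DISTINCT datetime::DATE), 1)::VARCHAR
FROM pizza_police.orders
UNION ALL
SELECT 'total_sales', ROUND(SUM(quantity * price))::VARCHAR
FROM pizza_police.orders
JOIN pizza_police.pizze USING (pizza_type_id);
```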
Curious about the queries? Find them detailed here. Below are the results:
| metric | value |
|---|---|
| average_pizze_per_order | 1.9 |
| avg_orders_per_day | 49.9 |
| total_sales | $1,722,907 |
| total_orders | 18,214 |
### Unravelling Seasonality in Sales
Our next step is to detect any seasonality in sales. For this, we craft a `sales_by_month.sql` query to break down sales data by month. Here’s the query structure:
```sql
SELECT
    monthname(datetime) AS month,
    SUM(quantity * price) AS total_sales
FROM
    pizza_police.orders
    JOIN pizza_police.pizze USING (pizza_type_id)
GROUP BY
    month
ORDER BY
    total_sales DESC;
```
### Diving Deeper
We’re not stopping there! We also have queries ready for identifying peak hours (`peak_hours.sql`) and our top 10 best-selling pizze (`top10_bestseller.sql`), all neatly organised under the `analytics` schema.
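As a flavour of these, here’s a minimal sketch of what a `peak_hours.sql` could look like (the real query is in the repository):

```sql
-- A sketch: count distinct orders for each hour of the day
SELECT
    hour(datetime) AS hour,
    COUNT(DISTINCT order_id) AS n_orders
FROM
    pizza_police.orders
GROUP BY
    1
ORDER BY
    n_orders DESC;
```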
Now, our `views/` folder illustrates our comprehensive data journey:
```
views
├── analytics
│   ├── kpis.sql
│   ├── peak_hours.sql
│   ├── sales_by_month.sql
│   └── top10_bestseller.sql
├── pizza_police
│   ├── orders.sql
│   └── pizze.sql
├── raw
│   ├── order_details.py
│   ├── orders.py
│   ├── pizza_types.py
│   └── pizzas.py
└── staging
    ├── orders.sql
    ├── pizza_ingredients.sql
    └── pizzas.sql
```
And the whole pipeline can be beautifully represented as a flowchart: raw → staging → pizza_police → analytics.
## Wrapping It Up
The final step is to materialise our views with Lea. This is a straightforward process, detailed in the README:

```sh
lea prepare
lea run
```

And, if you want to generate some beautiful docs:

```sh
lea docs
```
With these commands, our schemas and tables are now fully operational in DuckDB, ready to deliver insightful analyses.
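If you’d like to poke at the results yourself, you can open the database straight from Python. A quick sketch, where `pizza.duckdb` is a placeholder for whichever DuckDB file you configured for Lea:

```python
import duckdb

# "pizza.duckdb" is a placeholder: point this at the DuckDB
# file your Lea project materialises into
con = duckdb.connect("pizza.duckdb")
con.sql("SELECT * FROM analytics.kpis").show()
```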
As we bring our exploration of pizza sales data to a close, the key to unlocking our insights has been Lea. This journey demonstrated that diving into data analysis doesn’t require you to be a seasoned data engineer. Lea stands out as a testament to this, offering a user-friendly, minimalistic yet powerful tool for data processing and analysis.