# Tutorial

## Explanation
To simplify, let's consider a single profile, for a single year. Let's denote it as $p_i$, where $i = 1,\dots,N$. The clustering process consists of:
1. Split $N$ into (let's assume equal) periods of size $m =$ `period_duration`. We can rename $p_i$ as
   \[p_{j,k}, \qquad \text{where} \qquad j = 1,\dots,m, \quad k = 1,\dots,N/m.\]
2. Compute `num_rps` representative periods
   \[r_{j,\ell}, \qquad \text{where} \qquad j = 1,\dots,m, \quad \ell = 1,\dots,\text{num\_rps}.\]
During the computation of the representative periods, we also obtain a weight $w_{k,\ell}$ between each period $k$ and each representative period $\ell$, such that

\[p_{j,k} = \sum_{\ell = 1}^{\text{num\_rps}} r_{j,\ell} \ w_{k,\ell}, \qquad \forall j = 1,\dots,m, \quad k = 1,\dots,N/m.\]
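To make this relationship concrete, here is a small Julia sketch with made-up numbers (not TulipaClustering output): stacking the representative periods and the weights as matrices, the reconstruction above is just a matrix product.

```julia
# Toy example: m = 4 timesteps per period, N/m = 3 periods, num_rps = 2
m, num_periods, num_rps = 4, 3, 2

r = rand(m, num_rps)             # representative periods r[j, ℓ]
w = [1.0 0.0; 0.3 0.7; 0.0 1.0]  # weights w[k, ℓ], one row per period k

# p[j, k] = ∑_ℓ r[j, ℓ] * w[k, ℓ], i.e., p = r * wᵀ
p = r * transpose(w)
@assert size(p) == (m, num_periods)
```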
## High-level API / DuckDB API
This tutorial focuses on the highest level of the API, which requires the use of a DuckDB connection.
The high-level API of TulipaClustering focuses on using TulipaClustering as part of the Tulipa workflow. This API consists of three main functions: `transform_wide_to_long!`, `cluster!`, and `dummy_cluster!`. In this tutorial we'll use all three.
Normally, you will have the DuckDB connection from the larger Tulipa workflow, so here we will create a temporary connection with fake data to show an example of the workflow. You can look into the source code of this documentation to see how to create this fake data.
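For instance, a minimal sketch of such a setup might look like the following. The profile formulas here are invented for illustration and are not the ones used to generate this page's data.

```julia
using DataFrames, DuckDB

# In-memory DuckDB database standing in for the one from the Tulipa workflow
connection = DBInterface.connect(DuckDB.DB)

# Hypothetical profiles for a single year with 672 timesteps
timestep = 1:672
df = DataFrame(;
    year = 2030,
    timestep,
    avail = 4.0 .+ 0.5 .* sin.(2π .* timestep ./ 168),
    solar = max.(0.0, sin.(2π .* (timestep .- 6) ./ 24)),
    demand = 5.0 .+ 2.0 .* cos.(2π .* timestep ./ 24),
)
DuckDB.register_data_frame(connection, df, "profiles_wide")
```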
Here is the content of that connection:
```julia
using DataFrames, DuckDB

# Helper to run a query and show the result as a DataFrame
nice_query(str) = DataFrame(DuckDB.query(connection, str))

nice_query("show tables")
```
| Row | name (String) |
|----:|:--------------|
| 1 | profiles_wide |
And here are the first rows of `profiles_wide`:

```julia
nice_query("from profiles_wide limit 10")
```
| Row | year (Int32) | timestep (Int64) | avail (Float64) | solar (Float64) | demand (Float64) |
|----:|-------------:|-----------------:|----------------:|----------------:|-----------------:|
| 1 | 2030 | 1 | 4.34091 | 0.0 | 3.67921 |
| 2 | 2030 | 2 | 4.17384 | 0.0 | 3.90639 |
| 3 | 2030 | 3 | 4.13356 | 0.0 | 4.13572 |
| 4 | 2030 | 4 | 4.03112 | 0.0 | 4.6879 |
| 5 | 2030 | 5 | 3.88836 | 0.0 | 5.24468 |
| 6 | 2030 | 6 | 3.81341 | 0.0 | 5.8205 |
| 7 | 2030 | 7 | 3.91162 | 0.438109 | 6.32735 |
| 8 | 2030 | 8 | 4.03719 | 1.38293 | 7.08529 |
| 9 | 2030 | 9 | 4.07284 | 2.21684 | 6.73004 |
| 10 | 2030 | 10 | 4.24665 | 2.66758 | 7.21155 |
And finally, this is the plot of the data:
```julia
using Plots

# Plot each profile column of the wide table against the timestep
table = DuckDB.query(connection, "from profiles_wide")
plot(size = (800, 400))
timestep = [row.timestep for row in table]
for profile_name in (:avail, :solar, :demand)
    value = [row[profile_name] for row in table]
    plot!(timestep, value, lab = string(profile_name))
end
plot!()
```
## Transform a wide profiles table into a long table
The long table format is a requirement of TulipaClustering, even for the dummy clustering example.
In this context, a wide table is a table where each new profile occupies a new column. A long table is a table where the profile names are stacked in a column with the corresponding values in a separate column. Given the name of the source table (in this case, `profiles_wide`), we can create a long table with the following call:
```julia
using TulipaClustering

transform_wide_to_long!(connection, "profiles_wide", "profiles")
nice_query("FROM profiles LIMIT 10")
```
| Row | year (Int32) | timestep (Int64) | profile_name (String) | value (Float64) |
|----:|-------------:|-----------------:|:----------------------|----------------:|
| 1 | 2030 | 1 | avail | 4.34091 |
| 2 | 2030 | 2 | avail | 4.17384 |
| 3 | 2030 | 3 | avail | 4.13356 |
| 4 | 2030 | 4 | avail | 4.03112 |
| 5 | 2030 | 5 | avail | 3.88836 |
| 6 | 2030 | 6 | avail | 3.81341 |
| 7 | 2030 | 7 | avail | 3.91162 |
| 8 | 2030 | 8 | avail | 4.03719 |
| 9 | 2030 | 9 | avail | 4.07284 |
| 10 | 2030 | 10 | avail | 4.24665 |
Here, we decided to save the long profiles table with the name `profiles` to use in the clustering below.
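A convenient consequence of the long format is that selecting a single profile is just a filter on `profile_name`. For example (an extra query for illustration, output omitted):

```julia
nice_query("FROM profiles WHERE profile_name = 'solar' LIMIT 5")
```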
## Dummy Clustering
A dummy clustering essentially skips the clustering itself and just creates the tables required for the next steps in the Tulipa workflow.
```julia
# Drop any tables from previous runs so they can be recreated
for table_name in (
    "rep_periods_data",
    "rep_periods_mapping",
    "profiles_rep_periods",
    "timeframe_data",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = dummy_cluster!(connection)
```
nice_query("FROM rep_periods_data LIMIT 5")
Row | rep_period | num_timesteps | resolution |
---|---|---|---|
Int64 | Int64 | Float64 | |
1 | 1 | 672 | 1.0 |
nice_query("FROM rep_periods_mapping LIMIT 5")
Row | period | rep_period | weight |
---|---|---|---|
Int64 | Int64 | Float64 | |
1 | 1 | 1 | 1.0 |
nice_query("FROM profiles_rep_periods LIMIT 5")
Row | rep_period | timestep | year | profile_name | value |
---|---|---|---|---|---|
Int64 | Int64 | Int32 | String | Float64 | |
1 | 1 | 1 | 2030 | avail | 4.34091 |
2 | 1 | 2 | 2030 | avail | 4.17384 |
3 | 1 | 3 | 2030 | avail | 4.13356 |
4 | 1 | 4 | 2030 | avail | 4.03112 |
5 | 1 | 5 | 2030 | avail | 3.88836 |
nice_query("FROM timeframe_data LIMIT 5")
Row | period | num_timesteps |
---|---|---|
Int64 | Int64 | |
1 | 1 | 672 |
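In other words, the dummy clustering produces a single representative period spanning all 672 timesteps, and the single period maps to it with weight 1.0. You can confirm this with a quick count (an extra query for illustration):

```julia
nice_query("SELECT COUNT(*) AS num_rows FROM rep_periods_mapping")
```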
## Clustering
We can perform a real clustering by using the `cluster!` function with two extra arguments (see the Explanation section above for their deeper meaning):

- `period_duration`: how long the split periods are;
- `num_rps`: how many representative periods to compute.
```julia
period_duration = 24
num_rps = 3

# Drop the tables created by the dummy clustering above
for table_name in (
    "rep_periods_data",
    "rep_periods_mapping",
    "profiles_rep_periods",
    "timeframe_data",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = cluster!(connection, period_duration, num_rps)
```
nice_query("FROM rep_periods_data LIMIT 5")
Row | rep_period | num_timesteps | resolution |
---|---|---|---|
Int64 | Int64 | Float64 | |
1 | 1 | 24 | 1.0 |
2 | 2 | 24 | 1.0 |
3 | 3 | 24 | 1.0 |
nice_query("FROM rep_periods_mapping LIMIT 5")
Row | period | rep_period | weight |
---|---|---|---|
Int64 | Int64 | Float64 | |
1 | 1 | 1 | 0.0643044 |
2 | 1 | 2 | 0.935696 |
3 | 2 | 1 | 0.201336 |
4 | 2 | 2 | 0.798664 |
5 | 3 | 3 | 1.0 |
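Note in the output above that the weights mapping each period to its representative periods sum to one (up to rounding; for example, period 1: 0.0643 + 0.9357 = 1.0). We can check this across all periods with an aggregation query (an extra query for illustration):

```julia
nice_query("""
    SELECT period, SUM(weight) AS total_weight
    FROM rep_periods_mapping
    GROUP BY period
    ORDER BY period
    LIMIT 5
""")
```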
nice_query("FROM profiles_rep_periods LIMIT 5")
Row | rep_period | timestep | year | profile_name | value |
---|---|---|---|---|---|
Int64 | Int64 | Int32 | String | Float64 | |
1 | 1 | 1 | 2030 | avail | 4.37501 |
2 | 1 | 2 | 2030 | avail | 4.24057 |
3 | 1 | 3 | 2030 | avail | 4.1751 |
4 | 1 | 4 | 2030 | avail | 3.96964 |
5 | 1 | 5 | 2030 | avail | 3.76434 |
nice_query("FROM timeframe_data LIMIT 5")
Row | period | num_timesteps |
---|---|---|
Int64 | Int64 | |
1 | 1 | 24 |
2 | 2 | 24 |
3 | 3 | 24 |
4 | 4 | 24 |
5 | 5 | 24 |
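To visualize the result, we can reuse the plotting pattern from earlier in this tutorial. The following sketch (our addition, assuming the tables created above) plots the `demand` profile of each representative period:

```julia
using DataFrames, Plots

# Materialize the representative 'demand' profiles and draw one line per
# representative period
df = DataFrame(
    DuckDB.query(connection, "FROM profiles_rep_periods WHERE profile_name = 'demand'"),
)
plot(size = (800, 400))
for group in groupby(df, :rep_period)
    plot!(group.timestep, group.value; lab = "rep_period $(group.rep_period[1])")
end
plot!()
```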
## Using a custom layout
You can customize the in-memory (and output) column names used for the time step and value columns by passing a `ProfilesTableLayout` to `cluster!`. The input SQL table must still use the standard column names (`profile_name`, `timestep`, `value`). The layout only affects the in-memory DataFrames and the resulting `profiles_rep_periods` output table. Other output tables are layout-agnostic.

Below we cluster again but ask the output table to use `ts` and `val` instead of the defaults `timestep` and `value`:
```julia
# Drop the tables created by the previous clustering
for table_name in (
    "rep_periods_data",
    "rep_periods_mapping",
    "profiles_rep_periods",
    "timeframe_data",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

layout = TulipaClustering.ProfilesTableLayout(; timestep = :ts, value = :val)
clusters_custom = cluster!(connection, period_duration, num_rps; layout)
nice_query("FROM profiles_rep_periods LIMIT 5")
```
| Row | rep_period (Int64) | ts (Int64) | year (Int32) | profile_name (String) | val (Float64) |
|----:|-------------------:|-----------:|-------------:|:----------------------|--------------:|
| 1 | 1 | 1 | 2030 | avail | 4.33712 |
| 2 | 1 | 2 | 2030 | avail | 4.27616 |
| 3 | 1 | 3 | 2030 | avail | 4.11844 |
| 4 | 1 | 4 | 2030 | avail | 3.95463 |
| 5 | 1 | 5 | 2030 | avail | 3.84521 |
Notice the columns `ts` and `val` in the output above (instead of `timestep` / `value`).

If you prefer to always export the default column names, omit the `layout` argument.