Tutorial

Explanation

To simplify, let's consider a single profile, for a single year. Let's denote it as $p_i$, where $i = 1,\dots,N$. The clustering process consists of:

  1. Split the N time steps into periods (assumed equal in length here) of size m = period_duration. We can then reindex $p_i$ as

    \[p_{j,k}, \qquad \text{where} \qquad j = 1,\dots,m, \quad k = 1,\dots,N/m.\]

  2. Compute num_rps representative periods

    \[r_{j,\ell}, \qquad \text{where} \qquad j = 1,\dots,m, \qquad \ell = 1,\dots,\text{num\_rps}.\]

  3. While computing the representative periods, we also obtain weights $w_{k,\ell}$ between each period $k$ and each representative period $\ell$, such that

    \[p_{j,k} = \sum_{\ell = 1}^{\text{num\_rps}} r_{j,\ell} \ w_{k,\ell}, \qquad \forall j = 1,\dots,m, \quad k = 1,\dots,N/m\]


High level API

This tutorial focuses on the highest level of the API, which requires the use of a DuckDB connection.

The high-level API of TulipaClustering focuses on using TulipaClustering as part of the Tulipa workflow. This API consists of three main functions: transform_wide_to_long!, cluster!, and dummy_cluster!. In this tutorial we'll use all three.

Normally, you will have the DuckDB connection from the larger Tulipa workflow, so here we will create a temporary connection with fake data to show an example of the workflow. You can look into the source code of this documentation to see how to create this fake data.
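For instance, one way to build such fake data is sketched below. The column names match the `profiles_wide` table shown next, but this generator is hypothetical and not the one used by this documentation.

```julia
using DuckDB

# In-memory DuckDB database for the example
connection = DBInterface.connect(DuckDB.DB)

# One fake year (2030) of hourly profiles; the sin/cos shapes are
# arbitrary placeholders for availability, solar, and demand.
DuckDB.query(
    connection,
    "CREATE TABLE profiles_wide AS
     SELECT
        2030 AS year,
        i AS timestep,
        4 + sin(i / 24.0) AS avail,
        greatest(0, sin(i / 12.0)) AS solar,
        5 + cos(i / 24.0) AS demand
     FROM range(1, 673) AS t(i)",
)
```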

Here is the content of that connection:

using DataFrames, DuckDB

nice_query(str) = DataFrame(DuckDB.query(connection, str))
nice_query("show tables")
1×1 DataFrame
 Row │ name
     │ String
─────┼───────────────
   1 │ profiles_wide

And here are the first rows of profiles_wide:

nice_query("from profiles_wide limit 10")
10×5 DataFrame
 Row │ year   timestep  avail    solar     demand
     │ Int32  Int64     Float64  Float64   Float64
─────┼───────────────────────────────────────────────
   1 │  2030         1  4.34091  0.0       3.67921
   2 │  2030         2  4.17384  0.0       3.90639
   3 │  2030         3  4.13356  0.0       4.13572
   4 │  2030         4  4.03112  0.0       4.6879
   5 │  2030         5  3.88836  0.0       5.24468
   6 │  2030         6  3.81341  0.0       5.8205
   7 │  2030         7  3.91162  0.438109  6.32735
   8 │  2030         8  4.03719  1.38293   7.08529
   9 │  2030         9  4.07284  2.21684   6.73004
  10 │  2030        10  4.24665  2.66758   7.21155

And finally, this is the plot of the data:

using Plots

table = DuckDB.query(connection, "from profiles_wide")
plot(size=(800, 400))
timestep = [row.timestep for row in table]
for profile_name in (:avail, :solar, :demand)
    value = [row[profile_name] for row in table]
    plot!(timestep, value, lab=string(profile_name))
end
plot!()
[Plot of the avail, solar, and demand profiles over the timesteps]

Transform a wide profiles table into a long table

Required

The long table format is a requirement of TulipaClustering, even for the dummy clustering example.

In this context, a wide table is a table where each new profile occupies a new column. A long table is a table where the profile names are stacked in a column with the corresponding values in a separate column. Given the name of the source table (in this case, profiles_wide), we can create a long table with the following call:

using TulipaClustering

transform_wide_to_long!(connection, "profiles_wide", "profiles")

nice_query("FROM profiles LIMIT 10")
10×4 DataFrame
 Row │ year   timestep  profile_name  value
     │ Int32  Int64     String        Float64
─────┼─────────────────────────────────────────
   1 │  2030         1  avail         4.34091
   2 │  2030         2  avail         4.17384
   3 │  2030         3  avail         4.13356
   4 │  2030         4  avail         4.03112
   5 │  2030         5  avail         3.88836
   6 │  2030         6  avail         3.81341
   7 │  2030         7  avail         3.91162
   8 │  2030         8  avail         4.03719
   9 │  2030         9  avail         4.07284
  10 │  2030        10  avail         4.24665

Here, we decided to save the long profiles table with the name profiles to use in the clustering below.
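For context, the same wide-to-long reshaping can be expressed directly in DuckDB SQL with UNPIVOT. This is a hedged sketch on a hypothetical toy table, not necessarily how transform_wide_to_long! is implemented internally.

```julia
using DuckDB

# Toy two-profile wide table (hypothetical, for illustration only)
con = DBInterface.connect(DuckDB.DB)
DuckDB.query(
    con,
    "CREATE TABLE toy_wide AS SELECT 1 AS timestep, 0.5 AS avail, 0.1 AS solar",
)

# Stack the profile columns into (profile_name, value) pairs
DuckDB.query(
    con,
    "CREATE TABLE toy_long AS
     UNPIVOT toy_wide ON avail, solar INTO NAME profile_name VALUE value",
)
```

Using transform_wide_to_long! is still preferable within the Tulipa workflow, since it handles the expected column conventions for you.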

Dummy Clustering

Dummy clustering essentially skips the clustering step and creates the tables required by the next steps in the Tulipa workflow.

for table_name in (
    "rep_periods_data",
    "rep_periods_mapping",
    "profiles_rep_periods",
    "timeframe_data",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = dummy_cluster!(connection)

nice_query("FROM rep_periods_data LIMIT 5")
1×3 DataFrame
 Row │ rep_period  num_timesteps  resolution
     │ Int64       Int64          Float64
─────┼───────────────────────────────────────
   1 │          1            672         1.0
nice_query("FROM rep_periods_mapping LIMIT 5")
1×3 DataFrame
 Row │ period  rep_period  weight
     │ Int64   Int64       Float64
─────┼─────────────────────────────
   1 │      1           1      1.0
nice_query("FROM profiles_rep_periods LIMIT 5")
5×5 DataFrame
 Row │ rep_period  timestep  year   profile_name  value
     │ Int64       Int64     Int32  String        Float64
─────┼─────────────────────────────────────────────────────
   1 │          1         1   2030  avail         4.34091
   2 │          1         2   2030  avail         4.17384
   3 │          1         3   2030  avail         4.13356
   4 │          1         4   2030  avail         4.03112
   5 │          1         5   2030  avail         3.88836
nice_query("FROM timeframe_data LIMIT 5")
1×2 DataFrame
 Row │ period  num_timesteps
     │ Int64   Int64
─────┼───────────────────────
   1 │      1            672

Clustering

We can perform a real clustering by using the cluster! function with two extra arguments (see Explanation for their deeper meaning):

  • period_duration: the length of each split period;
  • num_rps: the number of representative periods.

period_duration = 24
num_rps = 3

for table_name in (
    "rep_periods_data",
    "rep_periods_mapping",
    "profiles_rep_periods",
    "timeframe_data",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

clusters = cluster!(connection, period_duration, num_rps)

nice_query("FROM rep_periods_data LIMIT 5")
3×3 DataFrame
 Row │ rep_period  num_timesteps  resolution
     │ Int64       Int64          Float64
─────┼───────────────────────────────────────
   1 │          1             24         1.0
   2 │          2             24         1.0
   3 │          3             24         1.0
nice_query("FROM rep_periods_mapping LIMIT 5")
5×3 DataFrame
 Row │ period  rep_period  weight
     │ Int64   Int64       Float64
─────┼───────────────────────────────
   1 │      1           1  0.0643044
   2 │      1           2  0.935696
   3 │      2           1  0.201336
   4 │      2           2  0.798664
   5 │      3           3  1.0
nice_query("FROM profiles_rep_periods LIMIT 5")
5×5 DataFrame
 Row │ rep_period  timestep  year   profile_name  value
     │ Int64       Int64     Int32  String        Float64
─────┼─────────────────────────────────────────────────────
   1 │          1         1   2030  avail         4.37501
   2 │          1         2   2030  avail         4.24057
   3 │          1         3   2030  avail         4.1751
   4 │          1         4   2030  avail         3.96964
   5 │          1         5   2030  avail         3.76434
nice_query("FROM timeframe_data LIMIT 5")
5×2 DataFrame
 Row │ period  num_timesteps
     │ Int64   Int64
─────┼───────────────────────
   1 │      1             24
   2 │      2             24
   3 │      3             24
   4 │      4             24
   5 │      5             24
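As a sanity check, the weights in rep_periods_mapping sum to 1 for each period in this data (each period is a convex combination of representative periods). Continuing the session above, one way to verify this is:

```julia
# Assumes the `connection` and `nice_query` helper from the session above.
nice_query(
    "SELECT period, sum(weight) AS total_weight
     FROM rep_periods_mapping
     GROUP BY period
     ORDER BY period
     LIMIT 5",
)
```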

Using a custom layout

You can customize the in-memory (and output) column names used for the time step and value columns by passing a ProfilesTableLayout to cluster!. The input SQL table must still use the standard column names (profile_name, timestep, value). The layout only affects the in-memory DataFrames and the resulting profiles_rep_periods output table. Other output tables are layout-agnostic.

Below we cluster again but ask the output table to use ts and val instead of the defaults timestep and value:

for table_name in (
    "rep_periods_data",
    "rep_periods_mapping",
    "profiles_rep_periods",
    "timeframe_data",
)
    DuckDB.query(connection, "DROP TABLE IF EXISTS $table_name")
end

layout = TulipaClustering.ProfilesTableLayout(; timestep = :ts, value = :val)
clusters_custom = cluster!(connection, period_duration, num_rps; layout)

nice_query("FROM profiles_rep_periods LIMIT 5")
5×5 DataFrame
 Row │ rep_period  ts     year   profile_name  val
     │ Int64       Int64  Int32  String        Float64
─────┼──────────────────────────────────────────────────
   1 │          1      1   2030  avail         4.33712
   2 │          1      2   2030  avail         4.27616
   3 │          1      3   2030  avail         4.11844
   4 │          1      4   2030  avail         3.95463
   5 │          1      5   2030  avail         3.84521

Notice the columns ts and val in the output above (instead of timestep / value).

If you prefer to always export the default column names, omit the layout argument.
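Tying back to the Explanation section, each original period can be (approximately) reconstructed as the weighted sum $\sum_{\ell} r_{j,\ell} \, w_{k,\ell}$. Assuming the session above, including the custom ts/val layout from the last clustering run, a sketch of that reconstruction in SQL:

```julia
# Assumes `connection`, `nice_query`, and the ts/val layout from above.
# Each (period, ts, profile_name) value is the weighted sum of the
# corresponding representative-period values.
nice_query(
    "SELECT map.period,
            prof.ts,
            prof.profile_name,
            sum(map.weight * prof.val) AS reconstructed_val
     FROM rep_periods_mapping AS map
     JOIN profiles_rep_periods AS prof USING (rep_period)
     GROUP BY map.period, prof.ts, prof.profile_name
     ORDER BY map.period, prof.ts, prof.profile_name
     LIMIT 5",
)
```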