Tutorial
- Getting Started
- Hull Clustering with Blended Representative Periods
- Clustering by other columns
- Using a Custom Layout
- Extra Functions in the High-level API/DuckDB API
- Low-level API
Getting Started
Input files
We use DuckDB as the database backend for the input data of the profiles. The input data must be in a long table format with at least the columns `year`, `profile_name`, `timestep`, and `value`. Extra columns such as `scenario` are allowed. These default values are configurable by passing a `ProfilesTableLayout` to `cluster!`. See the Using a Custom Layout section for an example of how to use a custom layout.
For this tutorial we will use the profiles file available in the TulipaClustering repository.
In this example the profiles are in a CSV file, but you can also load them from other sources (e.g., Parquet, Excel, etc.) using DuckDB's readers.
```julia
using DataFrames, DuckDB

# change the following path to your file location
profiles_file = joinpath(@__DIR__, "../../test/inputs/EU/profiles.csv")

connection = DBInterface.connect(DuckDB.DB)

DuckDB.query(
    connection,
    """
    CREATE TABLE profiles AS
    SELECT * FROM read_csv('$profiles_file');
    """,
)

# helper function to query the DuckDB tables in the connection
nice_query(str) = DuckDB.query(connection, str) |> DataFrame

# show the tables in the connection using the helper function
nice_query("SHOW tables")
```
| Row | name (String) |
|-----|---------------|
| 1 | profiles |
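The same pattern works with DuckDB's other readers. For example, here is a sketch loading from a hypothetical Parquet file (with a different table name, since `profiles` already exists):

```julia
DuckDB.query(
    connection,
    "CREATE TABLE profiles_parquet AS SELECT * FROM read_parquet('profiles.parquet');",
)
```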
And here we can have a look at the first rows of `profiles`:

```julia
nice_query("FROM profiles LIMIT 10")
```
| Row | year (Int64) | profile_name (String) | timestep (Int64) | value (Float64) |
|-----|--------------|------------------------|------------------|-----------------|
| 1 | 2030 | AUS_SunPV | 1 | 0.0 |
| 2 | 2030 | AUS_SunPV | 2 | 0.0 |
| 3 | 2030 | AUS_SunPV | 3 | 0.0 |
| 4 | 2030 | AUS_SunPV | 4 | 0.0 |
| 5 | 2030 | AUS_SunPV | 5 | 0.0 |
| 6 | 2030 | AUS_SunPV | 6 | 0.0 |
| 7 | 2030 | AUS_SunPV | 7 | 0.0 |
| 8 | 2030 | AUS_SunPV | 8 | 0.01 |
| 9 | 2030 | AUS_SunPV | 9 | 0.088 |
| 10 | 2030 | AUS_SunPV | 10 | 0.107 |
Let's explore the first 10 unique profile names in the `profiles` table:

```julia
nice_query("""
    SELECT DISTINCT profile_name
    FROM profiles
    ORDER BY profile_name
    LIMIT 10
""")
```
| Row | profile_name (String) |
|-----|------------------------|
| 1 | AUS_SunPV |
| 2 | AUS_WindOff |
| 3 | AUS_WindOn |
| 4 | AUS_demand |
| 5 | BEL_SunPV |
| 6 | BEL_WindOff |
| 7 | BEL_WindOn |
| 8 | BEL_demand |
| 9 | BLK_SunPV |
| 10 | BLK_WindOff |
And finally, we can use `nice_query` to filter the profiles and plot them. For example, here we filter the profiles in the Netherlands (i.e., those starting with `NED_`) and plot a sample covering the first week of the year:
```julia
using Plots

df = nice_query("""
    SELECT *
    FROM profiles
    WHERE profile_name LIKE 'NED_%'
    ORDER BY profile_name, timestep
""")

sample = 1:168  # first week of hourly timesteps
plot(size=(800, 400))
for group in groupby(df, :profile_name)
    name = group.profile_name[1]
    plot!(group.timestep[sample], group.value[sample], label=name)
end
plot!(xlabel="Timestep", ylabel="Value", title="Profiles in the Netherlands")
```
Clustering
We can perform the clustering by using the `cluster!` function, passing the connection with the profiles table and two extra arguments (see Concepts for their deeper meaning):

- `period_duration`: how long the periods are (e.g., 24 for daily periods if the timestep is hourly);
- `num_rps`: how many representative periods to find.

In this example we use the function `cluster!` with its default parameters. Section Hull Clustering with Blended Representative Periods explains the extra parameters that can be passed to the clustering function.

After the clustering, four new tables will be created in the DuckDB connection, which we will explore in the next section:
```julia
using TulipaClustering

period_duration = 24
num_rps = 4

clusters = cluster!(connection, period_duration, num_rps)
nice_query("SHOW tables")
```
| Row | name (String) |
|-----|----------------------|
| 1 | profiles |
| 2 | profiles_rep_periods |
| 3 | rep_periods_data |
| 4 | rep_periods_mapping |
| 5 | timeframe_data |
Output Tables
The output tables are:

`profiles_rep_periods` contains the profiles for each representative period (RP):

```julia
nice_query("FROM profiles_rep_periods LIMIT 5")
```
| Row | rep_period (Int64) | timestep (Int64) | year (Int64) | profile_name (String) | value (Float64) |
|-----|--------------------|------------------|--------------|------------------------|-----------------|
| 1 | 1 | 1 | 2030 | AUS_SunPV | 0.0 |
| 2 | 1 | 2 | 2030 | AUS_SunPV | 0.0 |
| 3 | 1 | 3 | 2030 | AUS_SunPV | 0.0 |
| 4 | 1 | 4 | 2030 | AUS_SunPV | 0.0 |
| 5 | 1 | 5 | 2030 | AUS_SunPV | 0.007 |
`rep_periods_data` contains the general information of the RPs:

```julia
nice_query("FROM rep_periods_data")
```
| Row | year (Int64) | rep_period (Int64) | num_timesteps (Int64) | resolution (Float64) |
|-----|--------------|--------------------|------------------------|----------------------|
| 1 | 2030 | 1 | 24 | 1.0 |
| 2 | 2030 | 2 | 24 | 1.0 |
| 3 | 2030 | 3 | 24 | 1.0 |
| 4 | 2030 | 4 | 24 | 1.0 |
`rep_periods_mapping` contains the weights used to map the RPs to the original (or base) periods:

```julia
nice_query("FROM rep_periods_mapping LIMIT 5")
```
| Row | year (Int64) | period (Int64) | rep_period (Int64) | weight (Float64) |
|-----|--------------|----------------|--------------------|------------------|
| 1 | 2030 | 1 | 3 | 1.0 |
| 2 | 2030 | 2 | 3 | 1.0 |
| 3 | 2030 | 3 | 3 | 1.0 |
| 4 | 2030 | 4 | 3 | 1.0 |
| 5 | 2030 | 5 | 4 | 1.0 |
`timeframe_data` contains information about the original (or base) periods:

```julia
nice_query("FROM timeframe_data LIMIT 5")
```
| Row | year (Int64) | period (Int64) | num_timesteps (Int64) |
|-----|--------------|----------------|------------------------|
| 1 | 2030 | 1 | 24 |
| 2 | 2030 | 2 | 24 |
| 3 | 2030 | 3 | 24 |
| 4 | 2030 | 4 | 24 |
| 5 | 2030 | 5 | 24 |
You can use DuckDB to explore the results using SQL queries or export them to CSV or Parquet files using DuckDB's writers.
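To save a result table, you can use DuckDB's `COPY` statement; here is a sketch with a hypothetical output path:

```julia
DuckDB.query(
    connection,
    "COPY profiles_rep_periods TO 'profiles_rep_periods.parquet' (FORMAT parquet);",
)
```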
For example, we can plot again the profiles in the Netherlands, but this time using the clustered profiles:
```julia
df = nice_query("""
    SELECT *
    FROM profiles_rep_periods
    WHERE profile_name LIKE 'NED_%'
    ORDER BY profile_name, timestep
""")

# one subplot per representative period
rep_periods = unique(df.rep_period)
plots = []
for rp in rep_periods
    df_rp = filter(row -> row.rep_period == rp, df)
    p = plot(size=(400, 300), title="Representative Period $rp")
    for group in groupby(df_rp, :profile_name)
        name = group.profile_name[1]
        plot!(p, group.timestep, group.value, label=name)
    end
    show_legend = (rp == rep_periods[1])
    plot!(p,
        xlabel="Timestep",
        ylabel="Value",
        xticks=0:2:period_duration,
        xlim=(1, period_duration),
        ylim=(0, 1),
        legend=show_legend ? :bottomleft : false,
        legendfontsize=6,
    )
    push!(plots, p)
end
plot(plots..., layout=(2, 2), size=(800, 600))
```
🎉 Congratulations! You have successfully finished the first part of the tutorial. Now you can continue with more concepts and options in the following sections 😉
Hull Clustering with Blended Representative Periods
The function `cluster!` has several keyword arguments that can be used to customize the clustering process. You can use the help mode in the Julia REPL by typing `?cluster!` to see all the available keyword arguments and their descriptions. Here is a summary of the most important ones for this tutorial:

- `method` (default `:k_medoids`): the clustering method to use: `:k_means`, `:k_medoids`, `:convex_hull`, `:convex_hull_with_null`, or `:conical_hull`.
- `distance` (default `Distances.Euclidean()`): the semimetric used to measure the distance between data points, from the package Distances.jl.
- `weight_type` (default `:dirac`): the type of weights to find; possible values are:
  - `:dirac`: each period is represented by exactly one representative period (one unit weight and the rest are zeros);
  - `:convex`: each period is represented as a convex sum of the representative periods (a sum with nonnegative weights adding up to one);
  - `:conical`: each period is represented as a conical sum of the representative periods (a sum with nonnegative weights);
  - `:conical_bounded`: each period is represented as a conical sum of the representative periods (a sum with nonnegative weights) with the total weight bounded from above by one.
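In symbols, with `w[p, r]` denoting the weight of representative period `r` in base period `p` and `rp[r]` the profile data of that representative period, the weight types correspond to the following constraints:

```math
p \approx \sum_{r} w_{p,r}\, \mathrm{rp}_r, \qquad
\begin{cases}
\text{dirac:} & w_{p,r} \in \{0, 1\}, \quad \textstyle\sum_r w_{p,r} = 1, \\
\text{convex:} & w_{p,r} \ge 0, \quad \textstyle\sum_r w_{p,r} = 1, \\
\text{conical:} & w_{p,r} \ge 0, \\
\text{conical bounded:} & w_{p,r} \ge 0, \quad \textstyle\sum_r w_{p,r} \le 1.
\end{cases}
```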
As you can see, there are several keyword arguments that can be combined to explore different clustering strategies. Our proposed method is the Hull Clustering with Blended Representative Periods, which can be activated by setting the following keyword arguments:

```julia
method = :convex_hull
distance = Distances.CosineDist()
weight_type = :convex
```
You can read more about the proposed method in the Concepts section.
So, let's cluster again using the proposed method:
```julia
using Distances

clusters = cluster!(connection,
    period_duration,
    num_rps;
    method = :convex_hull,
    distance = Distances.CosineDist(),
    weight_type = :convex,
)
nice_query("SHOW tables")
```
| Row | name (String) |
|-----|----------------------|
| 1 | profiles |
| 2 | profiles_rep_periods |
| 3 | rep_periods_data |
| 4 | rep_periods_mapping |
| 5 | timeframe_data |
As you can see, the output table names are the same as before, but the results will be different. You can explore the results again using SQL queries or export them to CSV or Parquet files using DuckDB's writers.
Let's plot again the profiles in the Netherlands, but this time using the clustered profiles with the hull clustering method:
```julia
df = nice_query("""
    SELECT *
    FROM profiles_rep_periods
    WHERE profile_name LIKE 'NED_%'
    ORDER BY profile_name, timestep
""")

rep_periods = unique(df.rep_period)
plots = []
for rp in rep_periods
    df_rp = filter(row -> row.rep_period == rp, df)
    p = plot(size=(400, 300), title="Hull Clustering RP $rp")
    for group in groupby(df_rp, :profile_name)
        name = group.profile_name[1]
        plot!(p, group.timestep, group.value, label=name)
    end
    show_legend = (rp == rep_periods[1])
    plot!(p,
        xlabel="Timestep",
        ylabel="Value",
        xticks=0:2:period_duration,
        xlim=(1, period_duration),
        ylim=(0, 1),
        legend=show_legend ? :topleft : false,
        legendfontsize=6,
    )
    push!(plots, p)
end
plot(plots..., layout=(2, 2), size=(800, 600))
```
The first difference you may notice is that the representative periods (RPs) obtained with hull clustering are more extreme than those obtained with the default method. This is because hull clustering selects RPs that are more likely to be constraint-binding in an optimization model.
The parameters `niters` and `learning_rate` control how many iterations of the descent to run and by how much to adjust the weights in each iteration. More iterations make the method slower but produce better results. A larger learning rate makes the method converge faster but in a less stable manner (i.e., the weights might oscillate a lot from iteration to iteration). Sometimes you need to find the right balance yourself. In general, if the weights produced by the method look strange, try decreasing the learning rate and/or increasing the number of iterations.
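For example, here is a sketch of passing these parameters to `cluster!` (the values below are illustrative, not recommendations; check `?cluster!` for the defaults):

```julia
clusters = cluster!(connection,
    period_duration,
    num_rps;
    method = :convex_hull,
    distance = Distances.CosineDist(),
    weight_type = :convex,
    niters = 200,          # run the weight-fitting descent for longer
    learning_rate = 0.001, # smaller steps for more stable weights
)
```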
For more details on the comparison of clustering methods please refer to the Scientific References section.
Clustering by other columns
TulipaClustering.jl
clusters by default using the columns year
, i.e., it will create representative periods for each year in the input data. The total number of representative periods will be num_rps * number_of_years
. This is useful when the profiles have a strong seasonal component that changes from year to year.
However, sometimes you might want to cluster by other columns, e.g., `scenario` or `region`, or even by multiple columns, e.g., `year` and `scenario`. The package allows clustering by different columns by passing a custom `ProfilesTableLayout` to `cluster!`.

The `cols_to_groupby` argument in `ProfilesTableLayout` is a vector of symbols, e.g., `cols_to_groupby = [:year, :scenario]`.
When clustering by multiple columns, the total number of representative periods will be `num_rps * number_of_unique_combinations_of_groupby_columns`. For example, if the input data has 3 unique years and 2 unique scenarios, and you cluster by `year` and `scenario`, then the total number of representative periods will be `num_rps * 3 * 2 = num_rps * 6`. A sketch of such a call is shown below.
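The following sketch assumes the profiles table has a `scenario` column, which the tutorial data does not have:

```julia
# Hypothetical: cluster by both year and scenario
layout = TulipaClustering.ProfilesTableLayout(; cols_to_groupby = [:year, :scenario])
clusters = cluster!(connection, period_duration, num_rps; layout)
```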
Using a Custom Layout
Let's say that you have a table that uses different names for its columns. For example, let's rename the column `timestep` to `hour` in the profiles table.
```julia
DuckDB.query(
    connection,
    "ALTER TABLE profiles
    RENAME COLUMN timestep TO hour;
    ",
)
nice_query("FROM profiles LIMIT 10")
```
| Row | year (Int64) | profile_name (String) | hour (Int64) | value (Float64) |
|-----|--------------|------------------------|--------------|-----------------|
| 1 | 2030 | AUS_SunPV | 1 | 0.0 |
| 2 | 2030 | AUS_SunPV | 2 | 0.0 |
| 3 | 2030 | AUS_SunPV | 3 | 0.0 |
| 4 | 2030 | AUS_SunPV | 4 | 0.0 |
| 5 | 2030 | AUS_SunPV | 5 | 0.0 |
| 6 | 2030 | AUS_SunPV | 6 | 0.0 |
| 7 | 2030 | AUS_SunPV | 7 | 0.0 |
| 8 | 2030 | AUS_SunPV | 8 | 0.01 |
| 9 | 2030 | AUS_SunPV | 9 | 0.088 |
| 10 | 2030 | AUS_SunPV | 10 | 0.107 |
In this case, you can use the custom column name by passing a `ProfilesTableLayout` to `cluster!`. The layout names will also be preserved in the output tables. Below we cluster again, but pass the information to use `hour` instead of the default `timestep`:
```julia
layout = TulipaClustering.ProfilesTableLayout(; timestep = :hour)
clusters = cluster!(connection, period_duration, num_rps; layout)
nice_query("FROM profiles_rep_periods LIMIT 10")
```
| Row | rep_period (Int64) | hour (Int64) | year (Int64) | profile_name (String) | value (Float64) |
|-----|--------------------|--------------|--------------|------------------------|-----------------|
| 1 | 1 | 1 | 2030 | AUS_SunPV | 0.0 |
| 2 | 1 | 2 | 2030 | AUS_SunPV | 0.0 |
| 3 | 1 | 3 | 2030 | AUS_SunPV | 0.0 |
| 4 | 1 | 4 | 2030 | AUS_SunPV | 0.0 |
| 5 | 1 | 5 | 2030 | AUS_SunPV | 0.0 |
| 6 | 1 | 6 | 2030 | AUS_SunPV | 0.0 |
| 7 | 1 | 7 | 2030 | AUS_SunPV | 0.025 |
| 8 | 1 | 8 | 2030 | AUS_SunPV | 0.132 |
| 9 | 1 | 9 | 2030 | AUS_SunPV | 0.225 |
| 10 | 1 | 10 | 2030 | AUS_SunPV | 0.307 |
Notice the column `hour` in the output above (instead of `timestep`).
Extra Functions in the High-level API/DuckDB API
The high-level API of TulipaClustering focuses on using TulipaClustering as part of the Tulipa workflow. This API consists of three main functions: `cluster!`, `transform_wide_to_long!`, and `dummy_cluster!`. These functions are designed to work with DuckDB connections and tables, making it easy to integrate clustering into your data processing pipeline. The previous sections already covered the usage of `cluster!`. In this section, we will explore the other two functions: `dummy_cluster!`, which is useful for testing, debugging, and preparing the data for TulipaEnergyModel without actually clustering the profiles, and `transform_wide_to_long!`, which converts a wide profiles table into the long format that TulipaClustering requires.
Dummy Clustering
A dummy cluster essentially skips the clustering, but it creates the necessary tables that are often used in the next steps of the Tulipa workflow.
```julia
clusters = dummy_cluster!(connection; layout)
nice_query("FROM rep_periods_data LIMIT 10")
```
| Row | year (Int64) | rep_period (Int64) | num_timesteps (Int64) | resolution (Float64) |
|-----|--------------|--------------------|------------------------|----------------------|
| 1 | 2030 | 1 | 8760 | 1.0 |
In this case the function created a single representative period for all the data.

Notice that we passed the `layout` argument to `dummy_cluster!` to ensure that the output tables have the correct column names, since we renamed the `timestep` column to `hour` in the previous section.
Transform a wide profiles table into a long table
The long table format is a requirement of TulipaClustering, even for the dummy clustering example.

A long table is a table where the profile names are stacked in one column, with the corresponding values in a separate column. However, sometimes the input data is in a wide format, i.e., each profile is in a separate column. In those cases, you can use the function `transform_wide_to_long!` to transform a wide table into a long table. You need to provide the DuckDB connection, the name of the source table (the wide table), and the name of the target table (the long table that will be created), as sketched below.
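A minimal sketch, assuming a wide table named `profiles_wide` (a hypothetical name) already exists in the connection:

```julia
# Creates a long table `profiles_long` from the wide table `profiles_wide`
transform_wide_to_long!(connection, "profiles_wide", "profiles_long")
```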
Low-level API
The `cluster!` function is a wrapper around the low-level clustering functions. It simplifies the process of clustering by handling the creation of temporary tables and managing the clustering workflow. However, if you want more control over the clustering process, you can use the low-level functions directly. The low-level API consists of the following functions:

- `split_into_periods!`: splits the profiles into periods based on the specified period duration.
- `find_representative_periods`: finds the representative periods using the specified clustering method.
- `fit_rep_period_weights!`: fits the weights for the representative periods to map them to the original periods.
At the end of the clustering process, you will get a `TulipaClustering.ClusteringResult` struct that contains the detailed results of the clustering process (a sketch of the full workflow follows this list):

- `profiles` is a dataframe with the profiles for the RPs;
- `weight_matrix` is a matrix of weights of the RPs in the blended periods;
- `clustering_matrix` and `rp_matrix` are matrices of profile data for each base and representative period (useful to keep for the next step, but you should not need these unless you want to do some extra math here);
- `auxiliary_data` contains extra data generated during the clustering process, generally only of interest if you plan to interact with the clustering method on a very low level. For example, if you use the `:k_medoids` method, `auxiliary_data` will contain the indices of the medoids in the original data.
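Here is a minimal sketch of chaining these functions under assumed signatures; consult the docstrings (e.g., `?find_representative_periods`) for the exact API. It also assumes the profiles table uses the default column names (recall that earlier in this tutorial we renamed `timestep` to `hour`, so adapt accordingly):

```julia
using DataFrames, Distances, TulipaClustering

# load the long-format profiles into a DataFrame
df = nice_query("FROM profiles")

# 1. split the profiles into periods of 24 timesteps (in place)
split_into_periods!(df; period_duration = 24)

# 2. find 4 representative periods with the chosen method
clustering_result = find_representative_periods(df, 4; method = :k_medoids)

# 3. fit the weights that map base periods to representative periods
fit_rep_period_weights!(clustering_result; weight_type = :convex)

clustering_result.weight_matrix  # weights of the RPs in the blended periods
```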
So, although we recommend using the high-level API for most use cases, you can use the low-level functions if you need more control over the clustering process.