Group By

Modified on Mon, 5 Aug, 2024 at 10:24 AM

Description

With this function you can group a dataset based on one or multiple columns and perform calculations on each group.

Application

Some examples on when you typically would use Group By:

You have time series data with a sampling rate of 1 Hz spanning several days and want to know the average value for each minute, hour, or day.
You have data from several types of tests and want to know the maximum value occurring for each test type.
Or, on the same dataset you want to know how many data points you have for each test type.
You have a dataset with different materials and want to now the average value of a material property for each material in your dataset (e.g. the composites dataset from Challenge 1 - Composites Materials).

In general, whenever you can group your data by one or more variables (typically categorical variables) and want to get statistics (mean, min, max, count, …) or other characteristics for each group this can be done with Group By.

How to use

Choose the dataset to work on in the field Data.
Choose the column(s) by which to group the dataset in the field GroupBy columns.
- You need to choose at least one column. But a group can also be defined by multiple columns if required.
- The columns can have any data type. But if you choose a column with data type float you hardly will get any groups as the numbers have to be exactly the same to form a group. Typically, you would use categorical columns for grouping. Integer and Boolean columns would work as well.
Choose an Operation which should be applied to each group. See the section below for details.
The operation is not automatically applied to all available columns of the dataset. Instead, you have to choose on which columns the operations are applied for each group. Choose those in the field Columns.
- A column can't be assigned to GroupBy columns and Columns at the same time. This will result in an error when running the step. You either have to assign a column to GroupBy columns or Columns.
Typically, you should save the result of Group By in a new dataset by enabling Save output under different name because the resulting dataset is completely restructured (see below).
Click Apply to execute the Group By operation.

Structure of output

The resulting dataset has the following structure:

The number of rows is equivalent to the number of groups found in the dataset. There will be one row for each group. (If custom code is used this could be different as any number of rows could be defined as result.)
The Group By columns will be kept so you can identify each group.
All columns selected in the Columns field are kept in the dataset. All other columns will not be part of the new dataset.

The output would look like this:

GroupBy Col1	GroupBy Col2	…	Column1	Column2	…	ColumnK
`Group1 ID1`	`Group1 ID2`	…	`value`	`value`	…	`value`
`Group2 ID1`	`Group2 ID2`		`value`	`value`	…	`value`
…	…	…	…	…	…	…
`GroupN ID1`	`GroupN ID2`	…	`value`	`value`	…	`value`

Available operations

The following operations are available in GroupBy:

Operation	Description
Min	Returns the minimum value within a group for each selected column. This works also for categorical data and will return the first string after ordering all entries found in a group alphanumerical.
Max	Returns the maximum value within a group for each selected column. This works also for categorical data and will return the last string after ordering all entries found in a group alphanumerical.
Mean	Returns the mean value within a group for each selected column. Will throw an error due to wrong data type for categorical columns.
Sum	Returns the sum of all rows within a group for each selected column. For categorical columns this will yield a single long string for each group with all entries of a group concatenated.
Count	Counts the number of non-empty rows within a group for each selected column. If the dataset is complete (i.e. no missing values) this will return the same value for any selected column. Works on both numerical and categorical data.
Unique Count	Returns the number of distinct values within a group for each selected column. Works on both numerical and categorical data.

Examples

We use the dataset from Challenge 1 - Composite Materials for these examples.

Question 1: What is the average fiber length for each Carbon fiber in the dataset?

To answer that question we can use Group By. See the required settings in the screenshot below.

And this will be the resulting dataset:

You have only the two columns involved in the dataset (Carbon fiber as GroupBy column and Fiber length as the only column on which the operation was performed). To work further with this information you could do the following:

Rename the column Fiber length to Average fiber length so that it is clear what the column content is.
Join this new dataset with the original dataset. That way you have now additionally the average fiber length for each carbon fiber in your original dataset. You could compare now the actual fiber length of each data point with the average fiber length of the group.

Question 2: What is the maximum ultimate strain that was achieved in a composite for each glass fiber?

To answer that question we can use Group By as well. See the required settings in the screenshot below.

And this will be the resulting dataset: