Return specific percentiles of an array in a table in Quicksight - Like Pandas Describe Function

Hello,

I’ve searched the AWS QuickSight documentation and I’ve not been able to find what I am looking for. I am working within QuickSight, and I’ve created a table that provides the following information in 4 columns:

Model Serial Number, Minimum Hour Meter Reading, Maximum Hour Meter Reading, and Total Hours

This table has two filters. First I select a model that I am interested in, then I select the timespan that I am interested in. The first column of the table (Model Serial Number) therefore lists all of the serial numbers of that model which show up within the selected timespan. The Min and Max hour meter readings are calculated fields in my data; I simply use min(hour_meter) and max(hour_meter), and the values returned are those that exist within the timespan filter. The last column, Total Hours, is another calculated field which simply subtracts the min reading from the max reading to give me the total usage or run time within the time period that I’ve selected.

The table is fine - but what I really want is a summary table that gives me: number of units, median value of Total Hours, 90th percentile of Total Hours, and 99th percentile of Total Hours.

This is very similar to the describe function in Pandas. I tried creating calculated fields in my dataset using the percentileDisc function, but I am either not using it correctly or it doesn’t do what I need it to do.
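For reference, here is roughly the summary I am after, sketched in pandas with made-up data and column names:

import pandas as pd

# Raw readings: one hour-meter log per row (illustrative data only)
raw = pd.DataFrame({
    "serial": ["A1", "A1", "A2", "A2", "A3", "A3"],
    "hour_meter": [100.0, 220.0, 20.0, 95.0, 500.0, 910.0],
})

# The table I already have: min, max, and total hours per serial number
table = raw.groupby("serial")["hour_meter"].agg(["min", "max"])
table["total_hours"] = table["max"] - table["min"]

# The describe()-style summary I want QuickSight to produce
print(pd.Series({
    "units": len(table),
    "median": table["total_hours"].quantile(0.50),
    "p90": table["total_hours"].quantile(0.90),
    "p99": table["total_hours"].quantile(0.99),
}))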

Does anyone know how to create a simple table of the data that I want?

Thank you!

Hi @dhouser -

You can use this approach to generate your summary table.

To accomplish this, you will use two level-aware aggregation (LAA) functions.

  1. sumOver - Amazon QuickSight
  2. percentileDiscOver - Amazon QuickSight

Step 1. Create a calculated field for your median of total hours.
The sumOver aggregates total hours at the Model Serial Number level. The second level-aware function, percentileDiscOver, then calculates the median of those aggregates.

c_median_total_hours_pre_agg

percentileDiscOver(sumOver(Hours,[{Model Serial Number}],PRE_AGG),50,[],PRE_AGG)

Notes: The [] is where you would put a partition. If you plan on having dimensions in your group-by and want your descriptive stats at the dimension level, you will need to add them here.
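One nuance worth noting: percentileDisc/percentileDiscOver return an actual value from the data rather than an interpolated one. A minimal pandas sketch of that distinction:

import pandas as pd

s = pd.Series([50, 100, 200, 400])

# A continuous percentile may interpolate between data points
print(s.quantile(0.5))                         # 150.0

# A discrete percentile returns an actual data value, like percentileDisc
print(s.quantile(0.5, interpolation="lower"))  # 100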

Step 2. Modify the c_median_total_hours_pre_agg field for the 90th percentile.
c_percentile_total_hours_90_pre_agg

percentileDiscOver(sumOver(Hours,[{Model Serial Number}],PRE_AGG),90,[],PRE_AGG)

Step 3. Modify the c_median_total_hours_pre_agg field for the 99th percentile.
c_percentile_total_hours_99_pre_agg

percentileDiscOver(sumOver(Hours,[{Model Serial Number}],PRE_AGG),99,[],PRE_AGG)

Step 4. ℹ️ WORKAROUND: You need to wrap one of the calculated fields above in a min() or include another regular aggregate function in your visual. I’m not sure why QuickSight requires this before it will do a total aggregation.

Here is a calculated field I created to illustrate this workaround.
c_min_hours

min(Hours)

Result: [screenshot]

Thank you for your help! Unfortunately I am not able to get this to work. I am getting an error:

“We encountered an error trying to save your calculated field. Please try creating it again.”

I am pretty new to this, so I am not sure I follow what the calculated fields are meant to be returning. For instance, what is the following function meant to return for us?

sumOver(Hours,[{Model Serial Number}],PRE_AGG)

I am guessing that “Hours” is simply the array of values coming from a calculated field (max(hrs) minus min(hrs)). What is the difference between what is returned from this function and the “Hours” array? When I tried creating this as a calculated field, it gave me the same error as I pointed out above. Also, is the “measure” for this function simply the “Hours” array because we are using “PRE_AGG”?

I will keep trying to decipher what you’ve proposed, but for now I don’t fully understand the PRE_AGG modifier, so I am having a bit of trouble. I think once I understand the mechanics of these functions I’ll have a better understanding.

I will update if I figure this out on my own. Thanks!

Hi @dhouser -

I missed that you wanted the percentile for the max-min. You can achieve that with one extra step. I’ve broken it out as its own calculated field.

Create a calculated field (level aware agg) for max-min.

c_max-min

maxOver(Hours,[{Model Serial Number}],PRE_AGG)-minOver(Hours,[{Model Serial Number}],PRE_AGG)

Update your percentiles to use your c_max-min
c_median_total_hours_pre_agg

percentileDiscOver({c_max-min},50,[],PRE_AGG)
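Conceptually, the pair of fields does something like this pandas sketch (the data and column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "serial": ["A1", "A1", "A2", "A2"],
    "hour_meter": [100.0, 150.0, 20.0, 95.0],
})

# c_max-min: each serial's max reading minus its min reading, on every row
g = df.groupby("serial")["hour_meter"]
df["max_min"] = g.transform("max") - g.transform("min")

# c_median_total_hours_pre_agg: discrete median over those per-row values
print(df["max_min"].quantile(0.5, interpolation="lower"))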

Hours and PRE_AGG
Hours is a measure from the example dataset. You’re correct that PRE_AGG stands for pre-aggregation; the LAA function is looking for an unaggregated measure. You can read more in Using Level-Aware Aggregations - Amazon QuickSight. I would also recommend reading up on the order of evaluation to help understand what PRE_AGG is doing: Order of Evaluation in Amazon QuickSight - Amazon QuickSight.
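To make PRE_AGG more concrete: a regular aggregate collapses rows into one value per group, while an LAA function computed at PRE_AGG produces a value on every underlying row. A rough pandas analogy (not QuickSight itself, names invented):

import pandas as pd

df = pd.DataFrame({
    "serial": ["A1", "A1", "A2"],
    "hours": [10.0, 20.0, 5.0],
})

# Like sum(Hours): one value per group, rows collapse
print(df.groupby("serial")["hours"].sum())             # A1 -> 30, A2 -> 5

# Like sumOver(Hours, [{Model Serial Number}], PRE_AGG):
# the group total repeated on every unaggregated row
print(df.groupby("serial")["hours"].transform("sum"))  # 30.0, 30.0, 5.0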

Hi @robdhondt - Thanks again for all of your help!

I had no problem creating the c_max-min field. I checked the values it produced by exporting them to Excel, and they were correct. However, when I create the c_median_total_hours_pre_agg field, the value it gives me doesn’t equate to the median (50th percentile) of the c_max-min array. I tried the percentileDiscOver() function that you provided, as well as the percentile() function. Both returned the same value, but when I check that number against the median of the c_max-min array in Excel, they don’t match. Any thoughts?

Thanks again for your support - I really do appreciate it.

@dhouser - I just did the same with the sample dataset, and the median and percentiles match in Excel.

Some ideas:

  1. At first I forgot to apply the same filters to the data I exported to Excel – by chance did you do the same?

  2. Are you introducing additional dimensions/group-bys? These would have to be added to the partitions of the functions (for example, if you added region, etc.).

Hi @robdhondt - Yes the same filters are applied to all of my data tables.

The calculated fields that I created are identical to the ones you’ve provided:

c_max-min = maxOver(Hours,[{Model Serial Number}],PRE_AGG)-minOver(Hours,[{Model Serial Number}],PRE_AGG)

&

c_median_total_hours_pre_agg = percentileDiscOver({c_max-min},50,[],PRE_AGG)

I did notice something that I didn’t expect to see, which I think may point to the source of the error… I hope. As I mentioned above, I created a calculated field using:

percentile({c_max-min},50)

This gave me a single value.

When I created a calculated field using:

percentileDiscOver({c_max-min},50,[],PRE_AGG)

This gave me a total of 10,520,823 values when grouped by model (I created a table using the Count of the above calculated field as the value, grouped by Model). This didn’t seem to make sense to me, because there are only 1,969 distinct serial numbers within the model of interest. I then created the same table but grouped by serial number instead. As expected, there are 1,969 rows (one for each distinct serial number); however, the count of the above calculated field ranges from 1 to 32,574. I then modified the table and changed the value from Count to Min (or Max or Median). This gave me the same 1,969 rows, all with the same value, and that value is identical to the single value that the percentile({c_max-min},50) function gave me…

Investigating further, I created a table giving the count of timestamps for each model serial number. The values in this table are exactly the same as the counts of the percentileDiscOver() calculated field grouped by model serial number - that is, ranging from 1 to 32,574 entries.

This tells me that my errors are related to the number of data entries for each serial number in the raw data. The raw data contains an “Hour Meter Reading” value for each individual machine, logged every 15 minutes or so. Since some machines are older than others, the number of hour meter readings differs from machine to machine.

I did the same exercise for the c_max-min calculated field and got the same results, which makes sense. When grouped by model, this calculated field gives me 10,520,823 readings. When grouped by model serial number, each serial number has the same count as the percentileDiscOver() calculated field. If I ask for the min or the max of c_max-min grouped by machine serial number, I get the same number, and it’s correct. So I am getting the correct value repeated between 1 and 32,574 times.

Does any of the above insight give you an idea of what needs to be modified in my calculated fields? It seems that since not every machine serial number has the same amount of data logs, I am getting a skewed value for the median. I think I need to account for the number of unique timestamps somehow. Any suggestions?

@robdhondt - I did figure it out, now I just need to figure out how to fix it.

The c_max-min field does give me 10,520,823 readings - several for each machine serial number. If there are 5 timestamp records for the hour meter reading of a given machine serial number, I will get 5 copies of c_max-min. The median that I am currently getting is based on the 10,520,823 values of c_max-min instead of the 1,969 values of c_max-min (one per machine serial number). I checked and confirmed this in Excel.

Simplified, I am getting the median of an array like this:

[100, 100, 100, 200, 200, 50, 50, 50, 50, 50]

instead of this:

[100, 200, 50]
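A quick check with those numbers shows how far off the duplicated version is (sketched in pandas):

import pandas as pd

per_row = pd.Series([100, 100, 100, 200, 200, 50, 50, 50, 50, 50])
per_serial = pd.Series([100, 200, 50])

print(per_row.median())     # 75.0  - serials that log more often weigh more
print(per_serial.median())  # 100.0 - one value per serial, what I actually want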

I am not sure how to fix this…

@dhouser - You’re right. Maybe this approach will work better. Using the same max-min field.

Step 1. Create a calculated field to act as a row_number
This will only work if you do not have duplicates at the rank level.
c_row_number

Rank([DT ASC, Hours ASC],[{Model Serial Number}],PRE_AGG) 

Step 2. Filter to the c_row_number = 1.
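In effect this keeps one row per serial number before the percentile is taken; a pandas sketch of the same idea (DT stands in for the timestamp, all names illustrative):

import pandas as pd

df = pd.DataFrame({
    "serial": ["A1", "A1", "A2"],
    "dt": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-01"]),
    "hours": [10.0, 20.0, 5.0],
    "max_min": [150.0, 150.0, 75.0],
})

# c_row_number: rank rows within each serial by DT, then Hours
df["row_number"] = (
    df.sort_values(["dt", "hours"]).groupby("serial").cumcount() + 1
)

# Step 2: keep only row_number == 1, i.e. one max_min value per serial
deduped = df[df["row_number"] == 1]
print(deduped["max_min"].median())  # 112.5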

Result: [screenshot]

Excel check: [screenshot]

Hi @robdhondt

Sorry for the late reply - been away for the holidays!

This comes very close to working - unfortunately, I seem to have duplicates at the rank level! I’m not sure why, but I verified that I do. Is there a way to drop the duplicates somehow?

I’m so close!

Thanks again!

Hi @robdhondt !

I got it to work by creating a custom SQL query that selects DISTINCT values from the table in Redshift!

I only need help with one last thing - I promise!

I now have accurate data on the hours. What I would like to do now is simply divide the number of hours by the reporting time period for each machine, to put the data in terms of hours per day. I can get the number of working days for each machine by subtracting the min reporting date from the max reporting date for each machine - but I can’t simply divide c_max-min by num_days because of the aggregation rules.

I’m still trying to figure this out on my own but if you have any input that would be great!

Thanks again for all of your help!!

I figured it out!

I created a calculated field using:

c_hours =
dateDiff(minOver({timestamp},[{machine_serial}],PRE_AGG), maxOver({timestamp},[{machine_serial}],PRE_AGG), "DD")

Then I created another calculated field for hours per day by dividing c_max-min / c_hours.
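For anyone who wants to sanity-check this outside QuickSight, here is the same logic sketched in pandas (illustrative data and names):

import pandas as pd

df = pd.DataFrame({
    "machine_serial": ["A1", "A1", "A2", "A2"],
    "timestamp": pd.to_datetime(
        ["2022-01-01", "2022-01-11", "2022-02-01", "2022-02-06"]
    ),
    "max_min": [150.0, 150.0, 75.0, 75.0],
})

# c_hours: days between each machine's first and last report (dateDiff "DD")
g = df.groupby("machine_serial")["timestamp"]
df["num_days"] = (g.transform("max") - g.transform("min")).dt.days

# Hours per day: c_max-min / num_days
df["hours_per_day"] = df["max_min"] / df["num_days"]
print(df[["machine_serial", "num_days", "hours_per_day"]])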

Thanks again for everything!

Hi @dhouser - I was too slow with my response. Your solution is the approach I would have taken as well. Nice job!

It was fun working on this problem with you!

Best of luck and happy new year

Rob
