Median result is multiplied x3.46

cechevarria · April 9, 2024, 12:46pm

Hello everyone,

I’m working with a dataset containing transactions in a comercial sector, every row is a transaction with a userid, ticket_amount, transaction_date, etc. I’m trying to obtain the anual median expenditure by user.

For this, first I calculate the sum of ticket amounts grouped by userid and year:

1- anual_user _spend=sumOver({ticket_amount},[year, userid], PRE_AGG)

And then I either represent it in a visual against year and aggregating it as median

Or I directly calculate it in a new field:

2-median_anual_spend=median(sumOver({ticket_amount}, [year, userid], PRE_AGG), [year])

I have data for 2022, 2023 and 2024, no matter how I calculate it (option 1 or 2). I always get values that are 3.46 times higher than I should (for each year) (based on working on the same dataset with pandas)

For example:
Quicksight
2022-----548,34€
2023-----618,17€

Calculated elsewhere
2022-----157.61€
2023-----178.68€

I have also made a table in quicksight with fields userid, year, anual_user _spend and median_anual_spend, exported it, and performed:

df.groupby(‘year’)[‘anual_user_spend’].median()

And obtained the correct value (quicksight result/3,46)

I don’t know what else to do to get the correct values

DylanM · April 9, 2024, 3:19pm

Hello @cechevarria, welcome to the QuickSight community!

So that 2nd calculation you wrote above would give you the median of the sum of each ticket amount for each user per year. That is likely why you are seeing values that are significantly higher. I am wondering if you would need the sumOver function at all if you are wanting to see the median ticket_amount for each user. My first thought would be to try a function like this:
median({ticket_amount}, [year, userid])

That should give you the median ticket amount value for each userid per year. Let me know if that helps!

DylanM · April 15, 2024, 3:39pm

Hello @cechevarria, since we have not heard back from you, I will mark my above response as the solution. If you have any follow-up questions on this let me know and I can help guide you further. Thank you!

cechevarria · April 16, 2024, 7:27am

Hi @DylanM,

I want the median of the total expenditure in a year, that’s why I need to sum all ticket_amounts per user per year

I have been told by a colleage that median is a problematic measurement to calculate upon window functions in bi tools, both power bi and now quicksight. I have given up and calculated just the mean (Quicksight does not alllow me to aggregate by the median here) over a simple sum(,) not sumOver(), and that gives the correct value, but that’s not robust to outlayers. If I don’t find a solution I will just upload the dataset with medians calculated, a shame

DylanM · April 16, 2024, 10:09pm

Hello @cechevarria, okay, I think I understand. That initial calculation you wrote up does make more sense now. I do think the lack of a medianOver function is a bit of a road block on ensuring your aggregation is calculation properly. I do think adding the median calculation on the dataset layer in custom SQL might be your best bet to get the value you are looking for. If you switched to an average calculation instead of median, it seems like that wouldn’t meet your expectations, that would be the closest option to a work-around. Since there is an avgOver aggregation, it would work better with the sumOver nested inside of it.

I can mark this as a feature request, but for now I think calculating on your dataset layer will be the best way to ensure your data is being returned correctly. Let me know if you have any follow-up questions, otherwise I will archive this for our support team. Thank you!

DylanM · April 22, 2024, 3:54pm

Hello @cechevarria, since we have not heard back with any remaining questions, I will archive this topic. Thank you!