Average aggregation over parallel periods of time

amico · November 20, 2022, 1:00pm

Hi

I have a dataset with the below structure:

transId	Name	Category	Start date	End date	Amount
111	Jhon	Taxi	20/10/2022 0:00	21/10/2022 0:00	100
222	Jhon	Taxi	20/10/2022 0:00	21/10/2022 0:00	600
333	Jhon	Taxi	25/10/2022 0:00	28/10/2022 0:00	100
444	Jhon	Taxi	25/10/2022 0:00	28/10/2022 0:00	150
555	Jhon	Taxi	05/11/2022 0:00	07/11/2022 0:00	400

I want to get the average amount per day and category.
In the example below, it should be 1350/6 = 225 in average for taxi per day.
Considering I can’t change my data structure and adjust it to this specific calculation (doing group by in the data prep layer etc.), what can be the suggested calculation?

Thanks!
Amihai

Max · November 21, 2022, 4:41pm

Hmmm,

Can you try this. Let me know if it doesn’t work as I don’t have the same format as you.

sumOver(sum(amount),[Category])/sumOver(avg(dateDiff({Start Date},{End Date},‘DD’)),[Category,{Start Date},{End Date}])

What this is doing it take the sum of amounts per category, and then dividing by the sum of the days where the days are averages over Category, Start Date, and End Date. This denominator should work as how you expected it by taking just the averages of days from start date to end date and summing them.

Let me know if that works.

gillepa · November 22, 2022, 9:17pm

@amico

Some ideas that might help:

By “average by days” you mean all distinct dates in either Start or End date, right? This is how you end up with 6 as the denominator: Oct 20, 21, 25, 28 and Nov 5 and 7. (There is another way to look at it: include all days, even the missing dates. Or we can we include the dates between start to end dates and we end up with more than 6 days as the denominator.)

Considering there are duplicate dates (start and end dates), we need first to find a way to calculate the number of distinct dates. For example, for the first 2 rows, you want to sum up the amounts (100+600) but use only one day when you calculate the average.

If there are no overlapping dates on different rows, then we use a

distinct_count({Start Date}, [Category]) *2

to get the denominator (the number of distinct dates). I multiplied by 2 because for every Start Date there is another distinct End Date that needs to be included in the denominator (“number of days”)

sum({Amount},[Category]) / ( distinct_count({Start Date}, [Category]) * 2 )

For more info on Level aware calculations:

Try it and let us know if it works for you.