SumOver - Summing distinct entries from a table with duplicate rows

Hi All,
I have a dataset wherein the columns are Customer Name, the purchase the customer made, and the customer's salary.

Dummy Data: (screenshot of the sample table)

A single person may have made multiple purchases and hence has multiple records in the table, each carrying the same salary.

Problem: To calculate the average salary of our customers, which is

sum of salary (over distinct Customer Name) / distinct customer count
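
For illustration with made-up numbers: if Alice (salary 100) appears on three purchase rows and Bob (salary 200) on one, the result should be (100 + 200) / 2 = 150, not (3 × 100 + 200) / 4 = 125.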

In this case, how can I calculate the numerator using the sumOver function?

PS: Bringing in a new dataset with just distinct customer names and salaries won't work, since many other fields, calculations, and filters are linked to the salary.

Is this possible inside of QuickSight?

@emilyzhu @hafeng this case is to calculate a distinct average.

For now, please do this:
select distinct customer, null as item, salary, 1 as flag
from original_table
union
select *, 0 as flag
from original_table

and then, create the calculated field as sumOver(salary * flag, [], PRE_AGG) / distinctCountOver(customer, [], PRE_AGG).
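
To illustrate with made-up rows (columns customer, item, salary, flag; the values are hypothetical, not from the screenshot above):

Alice, null, 100, 1
Bob, null, 200, 1
Alice, Shoes, 100, 0
Alice, Hat, 100, 0
Alice, Bag, 100, 0
Bob, Book, 200, 0

Only the flag = 1 rows contribute to salary * flag, so the numerator is 100 + 200 = 300, the distinct customer count is 2, and the average comes out to 150.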

Hey @Ying_Wang ,
Thanks for the reply.

  1. You mean to say that I create a new dataset using a union, where the first part of the data is the unique customers and their salary (flag = 1), and the second part is the actual dataset with flag = 0?

The calculation made sense. Thanks!

  1. Is there no other way to go about it, with something like a distinct sumOver?

Thanks a ton for your help:)

Correct. Get the distinct customers with their salary, add a dummy column called “flag”, and set flag = 1.
Union this pre-aggregated data with the original table; in the original table, the dummy column “flag” = 0.

After that, the calculated field in the post above should work.

If you visualize the data with a KPI, or a table without any dimensions, you can avoid distinctCountOver by using avgIf(salary, flag = 1). If you visualize the data with dimensions in the visual, distinctCountOver can't be avoided. Enjoy it! In the near future we might have a better solution; we will contact you again after the new feature release.
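
To make that concrete with the illustrative rows above: avgIf(salary, flag = 1) averages only the two flag = 1 values, 100 and 200, and returns 150 directly.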

Hi @Ummehaani -

Here is an alternative solution if you don't have any true duplicate rows in your dataset, but rather duplicate values (Customer Name). You can approach this in the analysis without data prep/ETL.

Note: You want to weigh the cost of the window functions if you are working with a large dataset (rows and columns). You can also generate a row number in your ETL (for this example I kept everything at the analysis level).
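
If you do want to push the row number into ETL instead, here is a minimal sketch (the table and column names original_table, customer_name, purchase, and salary are placeholders, not taken from the post):

select
  customer_name,
  purchase,
  salary,
  row_number() over (partition by customer_name order by purchase) as rn
from original_table

A calculated field such as sumOver(ifelse(rn = 1, salary, 0), [], PRE_AGG) would then count each customer's salary exactly once, without needing the union.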

I tested it out. This solution is very efficient; the performance is better than I expected. Very good job, Rob and the SPICE team. The columnar storage in SPICE is truly well optimized.

I have seen a different approach used in some of the YouTube videos on the QuickSight channel.

Weighted Salary = {Salary} / countOver({Customer Name}, [{Customer Name}], PRE_AGG)

Total Salary = sumOver({Weighted Salary}, [], PRE_AGG)

Distinct Customers = distinctCountOver({Customer Name}, [], PRE_AGG)

Average Salary = {Total Salary} / {Distinct Customers}
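
Checking this against the illustrative numbers from earlier (Alice: salary 100 across three purchase rows; Bob: salary 200 on one row): each Alice row gets a Weighted Salary of 100/3, so Total Salary = 3 × (100/3) + 200 = 300, Distinct Customers = 2, and Average Salary = 150, matching the union/flag approach.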

Are there any particular reasons why you would choose one approach over another?

@David_Wong You are right. The weighted solution works correctly too.

@Ying_Wang When you said that in the future you might have a better solution, were you referring to the new level-aware calculations? If so, how would you use the new level-aware calculations to solve this problem?

Hi @Ummehaani and @David_Wong, we just launched the Level-Aware Calculation feature. Now you are able to quickly solve this problem using avg(avg(salary, [customer])). This calculation won't be impacted by other dimensions added to the visual and will give you the correct total number.
For details, please see the following user guide.
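
For reference, with the field names from the original question (assumed here), the LAC-A calculated field would look something like:

Average Salary = avg(avg({Salary}, [{Customer Name}]))

The inner avg collapses each customer's duplicate rows into a single salary value, and the outer avg then averages those per-customer values.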