Joining Data -- Resultant Dataset Issues

Hi,

I am trying to join two different .xlsx files over a common column that I’ve uploaded as datasets in AWS Quicksight. One file contains Cases(all unique and identifiable through alm_case_no col) and the second file contains people info(people associated with a case, for e.g. there could be multiple people on a single case so the alm_case_no column contains duplication). I am using the alm_case_no column in both the files to form the join.

However, after joining the files, the resulting dataset only returns rows from the people info file no matter what type of join I use.

Additionally, if I use 2 columns from the Cases file in a visual the resultant numbers are not valid and are way off from when I analyze them in excel or power bi. My assumption is that since the columns are from the same file, join should be irrelevant here.

One last thing: If I were to import both the files in power bi, I would form a relation between the files through alm_case_no column and then I can reference columns from both the files in whatever visual I create. The resultant numbers are also validated. I am assuming this is how the joins should work in AWS quicksight as well.

Any help on above problems would be appreciated. Thanks!

Hi @amc_5 - Welcome to AWS QuickSight community and thanks for posting the question. Can you please share sample input records for both data sets and joining key. This will help in replicating the issue and provide the right solution.

Note - Please do not expose the PII elements.

Regards -Sanjeeb

1 Like

Hi @Sanjeeb2022,

Here’s some sample data in a single file.

If you say compare the values for:

  1. Make a horizontal bar chart of Case_type vs Award to a pivot table in excel file, you’ll see that there’s a difference in aggregated values.
  2. You could check with different joins as well. Issue remains same.

@Sanjeeb2022 Were you able to replicate the issue?

hi @amc_5 - Not yet, I will spend sometime next week on this. Apologies for the same.
Hi @Biswajit_1993 - Can you also have a look on this problem statement.

Regards - Sanjeeb

1 Like

@Sanjeeb2022 sure. looking forward to it.

Hi @Biswajit_1993 Could you please have a look?

HI @amc_5 ,
Thanks for posting you query in community.
No issues I am going through it and any finding I will revert back.

Thanks & Regards
Biswajit Dash

2 Likes

Hi @amc_5 , I checked with creating the two datasets one is cases and another is people_info as per your shared data. Then take join between these two datasets but from my end I can see the exact 104 rows as per your people_info row count.

PFB the join datasets & record count screen shots.

Left Outer Join between cases with people_info

Result KPI Chart

Thanks & Regards
Biswajit Dash

2 Likes

Thanks for your reply @Biswajit_1993

Actually I am used to working with Power BI and am not able to replicate the same feature here.

When I connect two tables in Power BI, it creates a relationship between them and then allows cross referencing but it also allows to use the individual tables in visuals as if no relationship exists between the tables(even when relationship is active).

What is happening here in AWS quicksight is that is creating a new (resultant) dataset based on the specified condition which enhances my cross referencing ability but at the same time altogether confining me to the resultant dataset instead of enabling me to use the individual datasets as well.

Yes the count is 104 but see the 2nd image. Case_type and Award columns are both present in the casesFF file but the resultant figures are not accurate. If I were to present it to someone, this wouldn’t make sense.

I hope you are getting my point!

2nd image:

Hi @amc_5

I believe this is naturally something the occurs when we do joins in our dataprep. How it can be mitigated - at the formula/calculated field level by creating Level Aware Calculations (LAC) and applying the appropriate group bys. More info about lac-A or lac-W in this blog post.

1 Like

Hi @Ramon_Lopez,

The closest I’ve come to solving this issue(i.e. sum of award according to case_type of unique alm_case_no since there are duplicate alm_case_no in the data) is:

sumOver(max({AWARD}),[{CASE_TYPE}])

but this is only returning the sum of max values. Could you please extend your help here?

Thanks!

hi @amc_5

I would think that something like this would do the trick.

max(award,[{Case_Type},{alm_case_no}])

Please try it out and let me know.

thanks!

1 Like

It works. Thanks @Ramon_Lopez!

2 Likes