How to handle multiple calculation when joining the two data source


Sorry if my English doesn’t well enough.

I have two data CSV containing of

  1. Participant’s data with a premium (participants_id, effective date, name, birthdate, premium per person etc)
  2. Claim (participants_id, effective date, name, claim date, and the amount of claim)

I want to join these two data. However, when I perform left join (participants data as primary) the sum of the claim amount and the premium become huge amount, the claim amount multiple by 3,57 times, and the premium is 24,08 times.

how to solve the issues as I joined only from the participants_id.


I’m assuming this is because participants can have multiple claims.

Can you share what values you are expecting vs what you are getting?

In regards to the premiums increasing this makes sense because you add an additional row of premiums per claim.

For instance, if person A has a premium of 10 dollars and has two claims, then the sum of the premiums would be 20 because there are two rows of claims.

In regards to the claim amount… you will need to make sure that there aren’t duplicate participant ids. That is the only way that should increase.

Hi Max,

Many thanks for your reply,

Yes, you are correct. participants can have multiple claims during the period and here is my expected value versus on quicksight.

Correct Value
Total Premium IDR. 27,629,163,785.46
Total Claim IDR. 22,068,414,902.18
Quicksight Value
Total Premium IDR. 663,153,240,876.00
Total Claim IDR. 79,377,511,963.00

My data structure currently I have looks like this

Claim Data

Insurance Period Participants ID Name Claim Number ID Claim Amount
2021 101 A 20218991910 125,000.00
2021 108 B 20218991911 145,000.00
2021 123 C 20218991912 1,792,810.00
2021 101 A 20218991913 8,937,100.00
2022 108 B 20229818819 862,300.00
2022 123 C 20229818820 927,100.00
2022 123 C 20229818821 1,582,700.00
2022 101 A 20229818822 7,162,730.00
2022 108 B 20229818823 192,800.00
2022 108 B 20229818824 9,817,100.00

Participants Data

Insurance Period Participants ID Name Annual Premium
2021 101 A 700,000.00
2021 108 B 1,212,000.00
2021 123 C 1,212,000.00
2021 154 D 700,000.00
2021 180 E 700,000.00
2021 127 F 700,000.00
2021 119 G 700,000.00
2022 101 A 920,000.00
2022 108 B 1,324,000.00
2022 123 C 1,324,000.00
2022 154 D 920,000.00
2022 180 E 920,000.00
2022 127 F 920,000.00
2022 121 H 1,324,000.00

In the claim data, there is a possibility that ID’s become so many duplications as one person can be submitted 1 to 5 times during one year, but every claim has the Claim number ID as a unique ID.

Thank you.

Yes, then you will need to configure a column that possibly divides the amount by the total amount of rows per participant id or however you want the granularity to be and then sum on that number. For instance if you have a Claim Amount then you would divide by the count of rows where that is duplicated and sum over it.

For instance, if you have 125,000 where that is duplicated 5 times, you would for each row divide 125,000 by 5 and then if you sum on that it should come to 125,000.

Hi Max,

Thank you for your advice.

I’ve realized the value of the Total Premium depends on how many claims are recorded for one person, if a person doesn’t have any claims yet, then the total premium is correct.

example: if one person has 5 claims, means (total premium * 5)

My next question is, what is the correct formula for solving the multiplied value?
my personal opinion might be


but I get some errors.
I really appreciated your help.

Thank you.

When you don’t get the results that you expect, I found that the best way to troubleshoot is to create a table visual and add every single field to it. It will show you where your duplicates are coming from. You can see the duplicate values below.
Claims Premiums

For example, the sum of the premiums for Participant ID 101 should be 125,000 + 8,937,100 + 7,162,730. However, after you perform your join, each value appears twice in your dataset, so if you just do a sum, it will be twice bigger than the correct sum.

There are 3 ways you can approach this:

Approach #1:
If you know there are duplicates, count how times each value appears in your dataset and divide by that number.
Rows per Claim = countOver({Annual Premium}, [{Participants ID}, {Insurance Period}], PRE_AGG)

Divide the value that you want to sum by the number of times that it appears (2 in this case).
Weighted Claim Amount = {Claim Amount} / {Rows per Claim}

Then take the sum of Weighted Claim Amount to get the correct sum.

Approach #2
Find a way to flag the first instance of each Claim Number ID and take the sum of all claim amounts with the flag.

Row Number = Rank([{Annual Premium} ASC], [{Claim Number ID}], PRE_AGG)

First Row = ifelse({Row Number} = 1, 1, 0)

Claim Amount (Deduplicated) = {Claim Amount} * {First Row}

Finally take the sum of Claim Amount (Depuplicated).

Approach #3
Nest a LAC-A function in an aggregate function.

Total Claim = sum(min({Claim Amount}, [{Claim Number ID}]))

What this is saying is basically take the min of every claim amount in your partition (Claim Number ID) and then take the sum of all those minimum values.

1 Like

Hi David,

Thanks for your help.

Yes, you’re right. I follow as you explained and create a table to show each value. I’ve tried using method No 3 and show the correct amount as I expected.

Now I’m trying to scale up the data just to ensure the amount as I expected.

Many thanks to @Max and @David_Wong to solved my problem.
Keep safe and healthy :slight_smile: