How to handle multiplied values when joining two data sources

Hi

Sorry if my English isn’t good enough.

I have two CSV data sources containing:

  1. Participant data with premiums (participants_id, effective date, name, birthdate, premium per person, etc.)
  2. Claim data (participants_id, effective date, name, claim date, and the claim amount)

I want to join these two datasets. However, when I perform a left join (with the participants data as the primary table), the sums of the claim amount and the premium become huge: the claim amount is multiplied by about 3.57 times, and the premium by about 24.08 times.

How can I solve this issue, given that I joined only on participants_id?

Thanks

I’m assuming this is because participants can have multiple claims.

Can you share what values you are expecting vs what you are getting?

In regards to the premiums increasing, this makes sense because the join adds an additional premium row per claim.

For instance, if person A has a premium of 10 dollars and has two claims, then the sum of the premiums would be 20 because there are two rows of claims.
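In pandas terms, a minimal sketch of the same effect (the column names here are made up for illustration, not taken from your data):

import pandas as pd

# One participant row with a 10-dollar premium, two claim rows for the same person.
participants = pd.DataFrame({"participant_id": ["A"], "premium": [10.0]})
claims = pd.DataFrame({"participant_id": ["A", "A"], "claim_amount": [3.0, 4.0]})

# The left join repeats the participant row once per matching claim.
joined = participants.merge(claims, on="participant_id", how="left")
print(joined["premium"].sum())       # 20.0 -- the single premium counted twice
print(joined["claim_amount"].sum())  # 7.0  -- the claim side is still correct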

As for the claim amount… you will need to make sure there aren’t duplicate participant IDs. That is the only way it should increase.

Hi Max,

Many thanks for your reply,

Yes, you are correct: participants can have multiple claims during the period. Here are my expected values versus what QuickSight shows.

Correct Value
Total Premium: IDR 27,629,163,785.46
Total Claim: IDR 22,068,414,902.18

QuickSight Value
Total Premium: IDR 663,153,240,876.00
Total Claim: IDR 79,377,511,963.00

The data structure I currently have looks like this:

Claim Data

Insurance Period | Participants ID | Name | Claim Number ID | Claim Amount
2021 | 101 | A | 20218991910 | 125,000.00
2021 | 108 | B | 20218991911 | 145,000.00
2021 | 123 | C | 20218991912 | 1,792,810.00
2021 | 101 | A | 20218991913 | 8,937,100.00
2022 | 108 | B | 20229818819 | 862,300.00
2022 | 123 | C | 20229818820 | 927,100.00
2022 | 123 | C | 20229818821 | 1,582,700.00
2022 | 101 | A | 20229818822 | 7,162,730.00
2022 | 108 | B | 20229818823 | 192,800.00
2022 | 108 | B | 20229818824 | 9,817,100.00

Participants Data

Insurance Period | Participants ID | Name | Annual Premium
2021 | 101 | A | 700,000.00
2021 | 108 | B | 1,212,000.00
2021 | 123 | C | 1,212,000.00
2021 | 154 | D | 700,000.00
2021 | 180 | E | 700,000.00
2021 | 127 | F | 700,000.00
2021 | 119 | G | 700,000.00
2022 | 101 | A | 920,000.00
2022 | 108 | B | 1,324,000.00
2022 | 123 | C | 1,324,000.00
2022 | 154 | D | 920,000.00
2022 | 180 | E | 920,000.00
2022 | 127 | F | 920,000.00
2022 | 121 | H | 1,324,000.00

In the claim data, participant IDs can be duplicated many times, since one person can submit 1 to 5 claims during one year, but every claim has a unique Claim Number ID.

Thank you.

Yes, then you will need to create a column that divides the amount by the total number of rows per participant ID (or at whatever granularity you want) and then sum on that number. For instance, if you have a Claim Amount, you would divide it by the count of rows where it is duplicated and sum over that.

For instance, if you have 125,000 duplicated 5 times, you would divide each row’s 125,000 by 5; if you then sum those values, it comes back to 125,000.
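Just to illustrate that arithmetic outside QuickSight, a minimal pandas sketch (the claim ID here is made up):

import pandas as pd

# One claim amount that the join has duplicated 5 times.
rows = pd.DataFrame({
    "claim_number_id": ["X"] * 5,
    "claim_amount": [125000.0] * 5,
})

# Count how many rows share the same claim number, divide each copy by it, then sum.
dupes = rows.groupby("claim_number_id")["claim_amount"].transform("count")
print((rows["claim_amount"] / dupes).sum())  # 125000.0 -- back to the true amount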

Hi Max,

Thank you for your advice.

I’ve realized that the Total Premium value depends on how many claims are recorded for one person; if a person doesn’t have any claims yet, their total premium is correct.

For example: if one person has 5 claims, the result is (total premium * 5).

My next question is: what is the correct formula for fixing the multiplied value?
My own attempt was

ifelse(
{MEMBER_ID}={CLAIMANT_ID},sum({TOTAL PREMIUM})/count({CLAIM_ID})),
sum({TOTAL PREMIUM})
)

but I get some errors.
I really appreciate your help.

Thank you.

When you don’t get the results you expect, I’ve found that the best way to troubleshoot is to create a table visual and add every single field to it. It will show you where your duplicates are coming from. You can see the duplicate values below.
[screenshots: Claims table, Premiums table]

For example, the sum of the claim amounts for Participant ID 101 should be 125,000 + 8,937,100 + 7,162,730. However, because participant 101 appears in two insurance periods and the join is on participants_id alone, each claim value appears twice in your dataset after the join, so if you just do a sum, it will be twice the correct sum.
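To see the mechanics, here is a minimal pandas sketch using only participant 101’s rows from the tables above (pandas standing in for the join here; the snake_case column names are illustrative):

import pandas as pd

# Participant 101 appears once per insurance period...
participants = pd.DataFrame({
    "insurance_period": [2021, 2022],
    "participants_id": [101, 101],
    "annual_premium": [700000.0, 920000.0],
})
# ...and has three claims across the two periods.
claims = pd.DataFrame({
    "participants_id": [101, 101, 101],
    "claim_number_id": ["20218991910", "20218991913", "20229818822"],
    "claim_amount": [125000.0, 8937100.0, 7162730.0],
})

# Joining on participants_id alone pairs each of the 2 participant rows with
# all 3 claims, producing 6 rows instead of 3.
joined = participants.merge(claims, on="participants_id", how="left")
print(joined["claim_amount"].sum())  # 32449660.0 -- double the true 16224830.0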

There are 3 ways you can approach this:

Approach #1:
If you know there are duplicates, count how many times each value appears in your dataset and divide by that number.
Rows per Claim = countOver({Annual Premium}, [{Participants ID}, {Insurance Period}], PRE_AGG)

Divide the value that you want to sum by the number of times that it appears (2 in this case).
Weighted Claim Amount = {Claim Amount} / {Rows per Claim}

Then take the sum of Weighted Claim Amount to get the correct sum.
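If you want to sanity-check Approach #1 outside QuickSight, a rough pandas equivalent might look like this (assuming the post-join data sits in a DataFrame named joined with snake_case column names; this is a sketch, not QuickSight itself):

import pandas as pd

def weighted_claim_sum(joined: pd.DataFrame) -> float:
    # Rows per Claim: count the joined rows within each participant/period
    # partition, mirroring countOver(..., PRE_AGG).
    rows_per_claim = joined.groupby(
        ["participants_id", "insurance_period"]
    )["annual_premium"].transform("count")
    # Weighted Claim Amount = {Claim Amount} / {Rows per Claim}, then sum it.
    return (joined["claim_amount"] / rows_per_claim).sum()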

Approach #2
Find a way to flag the first instance of each Claim Number ID and take the sum of all claim amounts with the flag.

Row Number = Rank([{Annual Premium} ASC], [{Claim Number ID}], PRE_AGG)

First Row = ifelse({Row Number} = 1, 1, 0)

Claim Amount (Deduplicated) = {Claim Amount} * {First Row}

Finally take the sum of Claim Amount (Deduplicated).
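A rough pandas equivalent of Approach #2, on the same assumed joined DataFrame (cumcount stands in for Rank; the sort key does not matter here, since we only need to keep one copy per claim):

import pandas as pd

def deduplicated_claim_sum(joined: pd.DataFrame) -> float:
    # Row Number: number the rows within each Claim Number ID partition.
    row_number = joined.groupby("claim_number_id").cumcount() + 1
    # First Row = ifelse({Row Number} = 1, 1, 0): flag one copy per claim.
    first_row = (row_number == 1).astype(int)
    # Claim Amount (Deduplicated) = {Claim Amount} * {First Row}, then sum.
    return (joined["claim_amount"] * first_row).sum()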

Approach #3
Nest a LAC-A function in an aggregate function.

Total Claim = sum(min({Claim Amount}, [{Claim Number ID}]))

What this is saying is basically: take the min of the claim amounts within each partition (Claim Number ID), and then take the sum of all those minimum values.
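And a rough pandas equivalent of Approach #3, again on the same assumed joined DataFrame:

import pandas as pd

def lac_a_claim_sum(joined: pd.DataFrame) -> float:
    # The duplicated copies of a claim all carry the same amount, so taking
    # the min within each Claim Number ID partition collapses them to one
    # value; summing those minima gives the true total.
    return joined.groupby("claim_number_id")["claim_amount"].min().sum()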


Hi David,

Thanks for your help.

Yes, you’re right. I followed your explanation and created a table to show each value. I tried method No. 3, and it shows the correct amounts as I expected.

Now I’m scaling up the data just to make sure the amounts stay as expected.

Many thanks to @Max and @David_Wong for solving my problem.
Keep safe and healthy :slight_smile: