Calculations with LEFT JOIN

Hi there!

I’m using a joined dataset for my dashboard.

  • Table A contains a list of phone calls.

| id | callStart | callEnd | callDuration | dimension1 | dimension2 |
| A | 2022-11-22 10:00:01 | 2022-11-22 10:04:23 | 262 | D1-A | D2-A |
| B | 2022-11-22 10:00:42 | 2022-11-22 10:02:43 | 122 | D1-A | D2-B |
| C | 2022-11-22 10:03:35 | 2022-11-22 10:03:59 | 24 | D1-B | D2-B |
| D | 2022-11-22 10:04:04 | 2022-11-22 10:07:54 | 230 | D1-C | D2-B |

  • Table B contains the list of state changes that each call has forego.

| id | stateA | stateB |
| A | S0 | S1 |
| A | S1 | S2 |
| A | S2 | S4 |
| B | S0 | S1 |
| B | S1 | S4 |
| C | S0 | S2 |
| C | S2 | S3 |
| C | S3 | S4 |
| D | S0 | S1 |
| D | S1 | S2 |
| D | S2 | S5 |

I’m building a Sankey diagram with the state transitions defined in table B, but I am failing to calculate simple metrics (averageCallDuration) when filtering by dimensions 1 & 2 because of the LEFT JOIN.

How do I need to calculate the averageCallDuration so that I can filter on the Sankey Diagram and still get the correct KPI?


Are you saying because id A has three “State’s” it went through that your calculations are off?

Here is what I have done before with custom SQL & calculated fields. Let me know if it will work for your situation.

First make a field in SQL that will divide the callDuration by the amount of rows that the id is duplicated in.

In Athena here is how it would look: callDuration/(count(id) OVER (PARTITION BY id))

So for A it would be

A | 262/3
A | 262/3
A | 262/3

Then in a calculated field for total callDuration you can sum this field.

For average you can make a calculated field that looks like this. sum({new_sql_field})/disinct_count(id)

Let me know if that helps

Yes, that’s right. Because each call goes through a different number of states. Hence, calls with more states contribute more to the average…

I didn’t want to make the calculation beforehand because I want to calculate the average duration of the calls after filtering by state (which leaves a lot of registries out each time the filter is applied and makes the beforehand calculation unusable)

Ex (after filtering in the Sankey Diagram the path S0-S1):

| id | stateA | stateB |
| A | S0 | S1 |
| B | S0 | S1 |
| D | S0 | S1 |

My guess is that the calculation has to be dynamic, taking into account the number of registries “not filtered”…

So for id A, if you filter to only the path s0-s1, without any calculation, the average would work right?

However, if you don’t have it filtered then my first response would work right?

I think so…

I was hoping the solution could be valid for any type of filter.

Say I want to visualize al calls going through S0, S1 and S2

  • For id A, I would have all three entries
  • For id C, I would only have 2 out of 3 (last state change is S3-S4)

Hmm, can you also in SQL take the amount of states it goes through per id: count(DISTINCT concat(stateA,‘-’,stateB) OVER (PARTITION BY ID)

That will get you the total amount of states it went through.

Then in a calculated field take the amount of current states and divide it by this field.


Then use this as a multiplier against my first response?

Thanks for your help Max.

As I wanted to make the calculation dynamic, thanks to your first response, I implemented it as follows:

  • Calculate a partial call duration (callDuration / number or registries)
    partialCallDuration = minOver(callDuration, [callId], PRE_AGG) / countOver(callDuration, [callId], PRE_AGG)

  • Calculate total duration as sum of partial durations
    totalDuration = sum(partialCallDuration)

  • Calculate the total number of calls
    countLlamadas = distinct_count(callId)

  • Calculate average call duration
    averageCallDuration = totalDuration / countLlamadas

Hope this helps to anyone else with this problem!

1 Like