New or existing column value with data partitioned by date

I have a dataset which contains a snapshot of data from an external system for each month. Every month we add a new snapshot into the dataset, and the rows have a {Snapshot} field indicating the month it relates to. The rows in each snapshot have an {ID} field.

I am trying to add a calculated field, which will tell me if an {ID} value in one month, was in the snapshot data from the previous month? If it was, then the field should return “Existing”, otherwise it returns “New”.

I suspect I need a lag function to do this, but I cannot get this working - either the field shows “Unavailable” in the dataset, or I cannot visualise the breakdown of New/Existing per snapshot month.

Any advice much appreciated!

@nday Can you share sample data?

1 Like

Hi @neelay - here is some CSV sample data:

ID,Value,Snapshot
A,43,2023-10-01
B,66,2023-10-01
C,54,2023-10-01
A,44,2023-11-01
B,67,2023-11-01
C,55,2023-11-01
D,78,2023-11-01

For the October snapshot there are 3 items (A, B and C). For the November snapshot there are 4 items - A, B and C are “Existing”, D is “New” since the ID didn’t exist in the October snapshot.

The idea is that I would have a table visual, grouped by Snapshot date, and under each date group I would have a count of the number of Existing/New items within each snapshot. November would show 3 existing and 1 new. October would show 3 existing (because there is no previous month of snapshot data, so we assume it was existing).

Hopefully this is something I can calculate at the dataset level, rather than an analysis-level calculation or having to setup the aggregation on the visual itself (as we will be using the same query across multiple analyses/dashboards).

I don’t think qs can do it. Lag is aggregated function, so it probably won’t help you either.

You probably need etl service between s3 and qs, which has to facilitate this. Probably redshift or something similar.

1 Like

Can you try to use the rank function to calculate a snapshot number based on the date in ascending order? If the snapshot number is 1, you know it’s a new item. Otherwise, it’s an existing item. You should be able to do it at the dataset level.

Snapshot Number = rank([Snapshot, [ID], PRE_AGG)
New/Existing = ifelse({Snapshot Number} = 1, ‘New’, ‘Existing’)

Finally do a count of ID, grouped by Snapshot and New/Existing.

4 Likes

hi @nday,

did the solution provided by @David_Wong help you solve your question? if so, please help us mark it as Solution.

kind regards,
Wakana

1 Like

@David_Wong That works perfectly, many thanks!

1 Like