Remove duplicate rows from dataset

Good morning,
in my dataset there are two columns (“T040” and “T041”) that contain temperature values. I would like to remove every row where both T040 and T041 are equal to the previous row’s values. I need something like this:

Original dataset

TimeStamp  T040  T041
16:15:00   28.3  27.8
16:15:01   28.2  27.9
16:15:02   28.2  27.9
16:15:03   28.2  27.9
16:15:04   28.4  27.6

Desired dataset

TimeStamp  T040  T041
16:15:00   28.3  27.8
16:15:01   28.2  27.9
16:15:04   28.4  27.6

Is there a way to do it? I’m using a DirectQuery connection to my S3 storage.

Best regards,
Enrico

You can use the lag function and the firstValue function for both columns, T040 and T041.

Here is how it would work for t040 (you would also do this for t041):

is_duplicate_t040 = ifelse(lag({t040},[{timestamp} ASC],1) = firstValue({t040},[{timestamp} ASC],[{timestamp}]),'Duplicate','Keep')

Then you would add a custom filter that keeps only the rows that say 'Keep' in both of these newly calculated fields.
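
If you want to sanity-check the row-over-row logic outside QuickSight, here is the same idea in pandas (just a sketch, not QuickSight syntax; the data and column names are copied from your example):

import pandas as pd

# Toy data matching the original post.
df = pd.DataFrame({
    "TimeStamp": ["16:15:00", "16:15:01", "16:15:02", "16:15:03", "16:15:04"],
    "T040": [28.3, 28.2, 28.2, 28.2, 28.4],
    "T041": [27.8, 27.9, 27.9, 27.9, 27.6],
})

# shift(1) plays the role of lag(..., 1): a row is a duplicate only when
# BOTH temperatures equal the previous row's values.
is_dup = (df["T040"] == df["T040"].shift(1)) & (df["T041"] == df["T041"].shift(1))
print(df[~is_dup])  # keeps 16:15:00, 16:15:01 and 16:15:04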

Let me know if that works!

Thanks for the reply, but I have already tried your idea without luck. I don’t need a filter on the dashboard; I need less data in the dataset.

Let me be more detailed. I created a calculated field like this:

vibrationMaxValue = maxOver({l3dspmaxvalue},[{t042}],PRE_FILTER)

This way vibrationMaxValue holds the max value within each t042 partition. That part is right, but I need to exclude all duplicated values and I don’t know how to do it. I tried changing that field to

vibrationMaxValue = maxOver({l3dspmaxvalue},[{t042}, {is_duplicated_t040}],PRE_FILTER)

but it gives me an error. All I want is to exclude duplicated values…
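
To make the goal concrete, this is roughly what I’m after, expressed in pandas (made-up values; the duplicate test on t040 stands in for the {is_duplicated_t040} field above):

import pandas as pd

# Made-up sample with the fields from this post.
df = pd.DataFrame({
    "t042":          ["A",  "A",  "A",  "B",  "B"],
    "t040":          [28.2, 28.2, 28.4, 28.1, 28.1],
    "l3dspmaxvalue": [5.0,  5.0,  3.0,  2.0,  4.0],
})

# Drop rows whose t040 repeats the previous row, THEN take the max of
# l3dspmaxvalue within each t042 partition (the maxOver analogue).
dedup = df[df["t040"] != df["t040"].shift(1)].copy()
dedup["vibrationMaxValue"] = dedup.groupby("t042")["l3dspmaxvalue"].transform("max")
print(dedup)

Note how the 4.0 on the duplicated row is excluded from partition B’s max, which is exactly what I can’t get maxOver to do.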

Thanks
Enrico

Hello @egobbo.

You can also create calculated fields at the dataset level and then create a filter at the dataset level as well.

This way you can filter the data directly there and make sure that all the analyses that use it receive the filtered data.

However, remember that best practice is to work with already clean, massaged datasets, to avoid overloading the BI layer with calculations and filters so your visuals can render blazing fast.

I would recommend creating a data pipeline using AWS Glue or your favorite ETL tool to ensure that the data arriving at the consumption layer (QuickSight in this case) doesn’t contain duplicates.
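
For example, a Glue job could drop the consecutive duplicates with a few lines of PySpark before the data ever reaches QuickSight. A rough sketch (the S3 paths are placeholders, and the column names are the ones from this thread):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("drop-consecutive-duplicates").getOrCreate()

# Assumption: the data lives in S3 as CSV with the columns from this thread.
df = spark.read.csv("s3://your-bucket/your-dataset/", header=True, inferSchema=True)

# Compare each row's T040/T041 with the previous row, ordered by TimeStamp.
w = Window.orderBy("TimeStamp")
df = (
    df.withColumn("prev_t040", F.lag("T040", 1).over(w))
      .withColumn("prev_t041", F.lag("T041", 1).over(w))
)

# Keep a row if either temperature changed (or it is the first row).
deduped = df.filter(
    F.col("prev_t040").isNull()
    | (F.col("T040") != F.col("prev_t040"))
    | (F.col("T041") != F.col("prev_t041"))
).drop("prev_t040", "prev_t041")

deduped.write.mode("overwrite").parquet("s3://your-bucket/your-dataset-deduped/")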

Hope it helps. If so, please mark this reply as the solution, also to help other members of the community; otherwise, let us know.

Happy dashboarding!

@EnriqueS thank you very much!! As a first step I will set up filters in a not-so-best-practice way; in the meantime I will study Glue and use it as soon as I’m ready.

The Glue way is absolutely the correct one, and I think I will use it a lot in the future.

Thanks
Enrico

Do you run this formula in QuickSight or in SQL? is_duplicate_t040 = ifelse(lag({t040},[{timestamp} ASC],1) = firstValue({t040},[{timestamp} ASC],[{timestamp}]),'Duplicate','Keep')

@Raulsc

in QuickSight

Regards,
Koushik

How? I tried it and I got “unavailable” in the column it created.

Thank you!

Please create a new post with your question and include the details 🙂

Here, please!

Hi, can you help me with this formula?

For instance:
I have these columns:

id  start id  length  country
1   12        2       IN
1   12        2       IN
1   12        2       IN
2   15        2       Mexico
3   10        2       US
4   25        1       Spain

These duplicates occur because there are other columns in the dataset that are not displayed in the QuickSight visual. Now I want distinct rows in the table above that I created in QuickSight.
How would my formula go? I got an error when I tried to use the formula you shared.