Filtering Out Data Outliers (errors)

Joe_C · August 7, 2023, 11:48am

In my example, I have Clients A, B, C, D, etc., and each has a numerical measure (KPI_1) in the dataset. Some clients have KPI_1 values that are clearly errors and need to be removed from data visualizations.

How do I create a filter, within an Analysis (visualization), to remove these error outliers based on the 3 Sigma principle?
The goal is to remove values that are more than 3 standard deviations above or below the mean. In the final visualization, I will display KPI_1 as the value, with another field, ‘Region’, on the x-axis, giving a column chart of the median/average KPI_1 grouped by ‘Region’. This chart will render for individual Clients as well as for the total aggregation of all Clients.

Any help is much appreciated! Thanks!
-Joe

Max · August 7, 2023, 1:25pm

Have you looked into the stdev function?

sagmukhe · August 7, 2023, 3:31pm

@Joe_C - When you say 3 Sigma principle, I believe you are referring to the normal distribution. Is there any way for you to calculate the mean (average) and standard deviation? If yes, then you can put a filter in place to eliminate all values that are greater than (mean + 3 * standard deviation) or less than (mean - 3 * standard deviation).
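To make the rule concrete outside of QuickSight, here is a minimal Python sketch of the same filter. The function names and the sample numbers are made up for illustration, and it assumes the sample standard deviation is the one wanted.

```python
from statistics import mean, stdev


def three_sigma_bounds(values, k=3.0):
    """Return (lower, upper) = mean -/+ k * sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return mu - k * sigma, mu + k * sigma


def drop_outliers(values, k=3.0):
    """Keep only the values that fall inside the k-sigma band."""
    lower, upper = three_sigma_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical KPI_1 values: twenty plausible readings plus one obvious error.
kpi_1 = [98, 99, 100, 101, 102] * 4 + [5000]
cleaned = drop_outliers(kpi_1)
print(len(kpi_1), len(cleaned))
```

One caveat: with 10 or fewer rows, no single value can sit more than 3 sample standard deviations from the mean, so the rule only starts to bite on larger groups.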

Joe_C · August 7, 2023, 4:35pm

This is what I’ve been trying (in various forms) without success. If I create the following calculated field, “3*stdev({KPI_1})”, how do I tell my filter to exclude all values that are greater than (or less than - in my case, I’m only concerned about excluding values greater than) the result of that calculated field? Within the Filter, I have to enter a ‘maximum value’, as opposed to creating a comparison to the calculated field.

Joe_C · August 7, 2023, 4:56pm

Hi Max,
I’ve been using this formula, but in the use case below, I may not be including the appropriate groupings where I need to, i.e. :

3*stdev({KPI_1},[Client])

Here is a sample of the table I’m using to explore this filtering method:

Joe_C · August 9, 2023, 2:46pm

Hey Max! Any thoughts on my last comments below? Any additional help is much appreciated!

Shambhavi · October 3, 2023, 6:49pm

You can use a flag that marks a value as “Outlier” if it is greater than (mean + 3 * standard deviation) or less than (mean - 3 * standard deviation), and as “Not Outlier” if it is within that range. Then filter out the Outliers.
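As a rough illustration of the flag idea (in Python rather than a QuickSight calculated field), the sketch below tags each row and then filters on the tag. The row layout, the `outlier_tag` column name and the sample data are assumptions for the example only.

```python
from statistics import mean, stdev


def tag_outliers(rows, field, k=3.0):
    """Return copies of the rows with an added 'outlier_tag' column."""
    values = [row[field] for row in rows]
    mu, sigma = mean(values), stdev(values)
    lower, upper = mu - k * sigma, mu + k * sigma
    tagged = []
    for row in rows:
        is_outlier = row[field] > upper or row[field] < lower
        tag = "Outlier" if is_outlier else "Not Outlier"
        tagged.append({**row, "outlier_tag": tag})
    return tagged


# Hypothetical rows: twenty plausible values and one erroneous one.
rows = [{"client": "A", "kpi_1": float(v)} for v in [98, 99, 100, 101, 102] * 4]
rows.append({"client": "A", "kpi_1": 5000.0})

tagged = tag_outliers(rows, "kpi_1")
# The filter step: keep only the rows that were not flagged.
kept = [row for row in tagged if row["outlier_tag"] == "Not Outlier"]
```

Note that the mean and standard deviation are computed once over all rows and only then compared against each individual row.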

Shambhavi · October 4, 2023, 1:02pm

I tried what I suggested before, but it didn’t work for me; it gives a “Mismatched aggregation” error.

Joe_C · October 4, 2023, 1:30pm

Hi Shambhavi, Thanks for responding and providing this insight. I’ll try your method as well and report back with my findings. Thanks again!

Shambhavi · October 4, 2023, 7:24pm

@sagmukhe @Max can you pls guide us here

Joe_C · October 5, 2023, 3:08pm

Could you please share what you’re using for your ‘-3Sigma’ calc field?

Shambhavi · October 5, 2023, 3:11pm

+3Sigma → {Mean Of Bill Rate} + (3 * {Std Dev Of Bill Rate})
-3Sigma → {Mean Of Bill Rate} - (3 * {Std Dev Of Bill Rate})

Joe_C · October 5, 2023, 3:19pm

Thanks! That’s what I just used, and I got the same ‘Mismatched Aggregation’ error when similarly trying to create the ‘Outlier Tag’. Hoping to get some additional feedback, as you requested. Thanks again for helping with this.

Shambhavi · October 5, 2023, 3:23pm

Yes. I even tried adding this calculated field at the data ingestion stage, but that didn’t help either. Let’s hope we get a response from someone.

Joe_C · October 5, 2023, 3:27pm

I just used ‘bill_rate’ as a name to obscure the actual value I’m working with. It’s a number.

Joe_C · November 6, 2023, 1:36pm

@sagmukhe and @Max, are you able to provide guidance on this topic? Thanks! ~Joe

ErikG · November 6, 2023, 1:53pm

@Joe_C you can’t have aggregated and non-aggregated values in the same calculation.

Joe_C · November 6, 2023, 2:01pm

Thanks. I’m hoping for a suggestion on a workaround to achieve my overall objective stated within my original post.

ErikG · November 6, 2023, 2:29pm

What are the current definitions of the 3 KPIs?
I guess “bill rate” is a row-level KPI and the others are calculated with group functions?

Joe_C · November 6, 2023, 2:53pm

Every KPI in our dataset, at the row level, that I’m trying to identify as an outlier is a numeric value, not a calculated field (‘Bill Rate’ was a bad pseudonym). That’s likely where I’m getting into trouble with the mismatch. I’m trying to compare a non-aggregated field ‘KPI_1’ to the aggregated calc field that’s using the standard deviation function.

I might try wrapping the KPI in a ‘min()’ or ‘max()’, since it theoretically shouldn’t change the actual value being compared but could resolve the mismatch.
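For reference, the comparison I’m after can be sketched in Python as two passes: first compute one threshold per Client (the aggregated part), then compare every row against its own Client’s threshold (the row-level part). This is only a sketch of the logic, not a QuickSight solution; the function names, field names and data are hypothetical, and it only checks the upper bound since that’s the side I care about.

```python
from collections import defaultdict
from statistics import mean, stdev


def upper_bounds_by_group(rows, group_field, value_field, k=3.0):
    """Pass 1: one aggregated threshold (mean + k * stdev) per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    # stdev needs at least two values, so smaller groups get no threshold.
    return {g: mean(vs) + k * stdev(vs) for g, vs in groups.items() if len(vs) >= 2}


def keep_rows(rows, group_field, value_field, k=3.0):
    """Pass 2: compare each row with the threshold of its own group."""
    bounds = upper_bounds_by_group(rows, group_field, value_field, k)
    return [
        row for row in rows
        if row[value_field] <= bounds.get(row[group_field], float("inf"))
    ]


# Hypothetical data: Client A has one bad value, Client B is simply larger.
rows = [{"client": "A", "kpi_1": v} for v in [98, 99, 100, 101, 102] * 4 + [5000]]
rows += [{"client": "B", "kpi_1": v} for v in [990, 1000, 1010] * 7]

kept = keep_rows(rows, "client", "kpi_1")
```

Grouping by Client means B’s legitimately larger values are judged against B’s own spread rather than A’s, which matters when the chart is rendered for individual Clients.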