Filtering Out Data Outliers (errors)

In my example, I have Clients A, B, C, D, etc., and each have a numerical measure (KPI_1) in the dataset. Some clients have KPI_1 values that are clearly errors and need to be removed from data visualizations.

How do I create a filter, within an Analysis (visualization), to remove these error outliers based on the 3 Sigma principle?
The goal is to remove values that are greater than 3 standard deviations above or below the mean. In the final visualization, I will be displaying KPI_1, as the value, with another field, ā€˜Region’, on the x-axis to display a column chart showing ā€˜Region’ groupings of the median/average KPI_1. This chart will render for individual Clients as well as the total aggregation of all Clients.

Any help is much appreciated! Thanks!
-Joe

Have you looked into stdev function?

@Joe_C - When you say 3 Sigma principle, I believe you are referring to the normal distribution. Is there any way for you to calculate the mean (average) and standard deviation? If yes, then you can basically put a filter to eliminate all the values which are greater than (mean + 3 * standard deviation) and less than (mean - 3 * standard deviation).

This is what I’ve been trying (in various forms) without success. If I create the following calculated field, ā€œ3*stdev({KPI_1})ā€, how do I tell my filter to exclude all values that are greater than (or less than - in my case, I’m only concerned about excluding values greater than) the result of that calculated field? Within the Filter, I have to enter a ā€˜maximum value’, as opposed to creating a comparison to the calculated field.
Screen Shot 2023-08-07 at 12.37.35 PM

Hi Max,
I’ve been using this formula, but in the use case below, I may not be including the appropriate groupings where I need to, i.e. :

3*stdev({KPI_1},[Client])

Here is a sample of the table I’m using to explore this filtering method:

Hey Max! Any thoughts on my last comments below? Any additional help is much appreciated!

You can use a flag to set a ā€œOutlierā€ if greater than (mean + 3 * standard deviation) and less than (mean - 3 * standard deviation) and ā€œNot Outlierā€ if within the range. And filter Outliers


I tried what I said before, but it didnt work for me as it gives an error of ā€œMismatched aggregationā€

Hi Shambhavi, Thanks for responding and providing this insight. I’ll try your method as well and report back with my findings. Thanks again!

@sagmukhe @Max can you pls guide us here

Could you please share what you’re using for your ā€˜-3Sigma’ calc field?

+3Sigma → {Mean Of Bill Rate} + (3 * {Std Dev Of Bill Rate})
-3Sigma → {Mean Of Bill Rate} - (3 * {Std Dev Of Bill Rate})

Thanks! That’s what I just used, and I got the same ā€˜Mismatched Aggregation’ error when similarly trying to create the ā€˜Outlier Tag’. Hoping to get some additional feedback, as you requested. Thanks again for helping with this.

Yes. I even tried adding this calculated field at the data ingestion stage but also didnt help. Lets hope we get a response from someone

1 Like

I just added a ā€˜bill_rate’ to obscure my actual value I’m working with. It’s a number

1 Like

@sagmukhe and @Max, are you able to provide guidance on this topic? Thanks! ~Joe

@Joe_C you cant have aggregated and non-aggregated values in same calculation.

Thanks. I’m hoping for a suggestion on a workaround to achieve my overall objective stated within my original post.

what are the current definition of the 3 KPIs?
I guess ā€œbill rateā€ is a row level KPI and the others are calculated with group functions?

Every KPI in our dataset, at the row level, that I’m trying to identify as an outlier is a numeric value, not a calculated field (ā€˜Bill Rate’ was a bad pseudonym). That’s likely where I’m getting into trouble with the mismatch. I’m trying to compare a non-aggregated field ā€˜KPI_1’ to the aggregated calc field that’s using the standard deviation function.

I might try wrapping the kPI in a ā€˜min()’ or ā€˜max()’, since it theoretically shouldn’t impact the actual value being compared, but could resolve the mismatch.