In my example, I have Clients A, B, C, D, etc., and each have a numerical measure (KPI_1) in the dataset. Some clients have KPI_1 values that are clearly errors and need to be removed from data visualizations.
How do I create a filter, within an Analysis (visualization), to remove these error outliers based on the 3 Sigma principle?
The goal is to remove values that are greater than 3 standard deviations above or below the mean. In the final visualization, I will be displaying KPI_1, as the value, with another field, āRegionā, on the x-axis to display a column chart showing āRegionā groupings of the median/average KPI_1. This chart will render for individual Clients as well as the total aggregation of all Clients.
@Joe_C - When you say 3 Sigma principle, I believe you are referring to the normal distribution. Is there any way for you to calculate the mean (average) and standard deviation? If yes, then you can basically put a filter to eliminate all the values which are greater than (mean + 3 * standard deviation) and less than (mean - 3 * standard deviation).
This is what Iāve been trying (in various forms) without success. If I create the following calculated field, ā3*stdev({KPI_1})ā, how do I tell my filter to exclude all values that are greater than (or less than - in my case, Iām only concerned about excluding values greater than) the result of that calculated field? Within the Filter, I have to enter a āmaximum valueā, as opposed to creating a comparison to the calculated field.
You can use a flag to set a āOutlierā if greater than (mean + 3 * standard deviation) and less than (mean - 3 * standard deviation) and āNot Outlierā if within the range. And filter Outliers
Thanks! Thatās what I just used, and I got the same āMismatched Aggregationā error when similarly trying to create the āOutlier Tagā. Hoping to get some additional feedback, as you requested. Thanks again for helping with this.
Every KPI in our dataset, at the row level, that Iām trying to identify as an outlier is a numeric value, not a calculated field (āBill Rateā was a bad pseudonym). Thatās likely where Iām getting into trouble with the mismatch. Iām trying to compare a non-aggregated field āKPI_1ā to the aggregated calc field thatās using the standard deviation function.
I might try wrapping the kPI in a āmin()ā or āmax()ā, since it theoretically shouldnāt impact the actual value being compared, but could resolve the mismatch.