How to improve QuickSight Q topic indexing time

alaresch · October 12, 2023, 2:38am

This post covers some best practices to reduce the time it takes for a QuickSight Q topic to index.

First, let’s define what a QuickSight Q topic index is. When you create a topic, Amazon QuickSight Q creates, stores, and maintains an index with definitions for data in that topic. The topic index is an index of unique string values for fields included in a topic. Q uses this index to generate correct answers when there are cell values mentioned for filters, e.g. “sales for Amazon” implying a filter on sales = 'Amazon.com, Inc.", provide autocomplete suggestions when someone asks a question, and suggest mappings of terms to columns or data values.
Read more here: Refreshing Amazon QuickSight Q topic indexes - Amazon QuickSight

There are a few elements that contribute to indexing time:

Size of the dataset in terms of the number of indexable columns (dimensions, not measures) and the number of rows
If row-level security (RLS) is enabled for the dataset
Type of dataset (SPICE vs. Direct Query)
Cardinality of the enabled fields

Here are some suggestions or best practices for each element above.

Size of the dataset
Consider creating a copy of your dataset that is curated for the topic’s specific use case. If the topic is only covering one area of the business, consider making a copy that removes unnecessary fields/rows. Reducing the available time range can also help if applicable for the topic business use case. For example, the dataset might contain 4 years of historical data that is used in the dashboard, but the topic is only needed for looking at the last year of data.

Row-level Security (RLS)
RLS is a requirement based on the business use case and data security standards. Adjusting your rules so they are more streamlined is a best practice to help reduce indexing time. The indexing performance depends on the number of unique values for the fields in the main dataset that are part of RLS rules and number of columns in the rules table. So if you include a high cardinality column in the rules table like User and there are many users, this will increase the time. Consider adding your users to user groups and setting pre-formatted text rules for the highest level like department or job function.

Here is another example to understand the indexing impact of RLS rules:

Imagine you had the data below

Region	Segment	Country
AMER	Enterprise	Argentina
AMER	SMB	Argentina
AMER	Strategic	Argentina
EMEA	Enterprise	Austria

For each field value that Q indexes, we keep track of which combinations of values from RLS rule columns appear in rows with that field value. If Region and Segment are the fields used for RLS rules, and the value “Argentina” is being indexed, then the index will keep track of the fact that “Argentina” appeared in rows with the following rule combinations: (AMER, Enterprise), (AMER, SMB), (AMER, Strategic). This allows Q indexes to ensure that users only see autocomplete values and answers that they are permitted to see.

In some cases, a field value may have a very large number of combinations of rule column values that it appears in rows with. Though uncommon, if a field value has more than 250,000 distinct rule combinations, the Q index will truncate them. This truncation will never result in a user being able to see a value they are not permitted to see, but it may result in a user not being able to see an answer, even though they are allowed to see it. Additionally, if there are a large number of indexed field values that each have large numbers of rule column value combinations, it can cause index refreshing to fail. This situation may be caused by using high cardinality columns as rule columns. Using rule columns with a smaller number of distinct values can help mitigate this.

Type of Dataset
We recommend using SPICE instead of Direct Query. For Direct Query, QuickSight does not have any control over how long it takes to extract the data. SPICE has an efficient export implementation and stores data in the robust in-memory engine that is built to serve data more rapidly. Read more here: Importing data into SPICE - Amazon QuickSight

Cardinality of the Fields
As mentioned in the definition at the top, the topic index is an index of unique string values for fields included in a topic. Since Q needs to create and store a copy of each string value for each enabled dimension, one way to reduce indexing time is to consider removing unnecessary high-cardinality fields. For example, if you have a Product Name and Product ID field, but for the topic readers will only want to see Product Name, you can remove Product ID to eliminate the need for every product ID to be indexed. In general, it is a best practice to only include the fields that are relevant to the specific topic use case to minimize lexical overlap. Read more here about best practices: Best practices for enabling business users to answer questions about data using natural language in Amazon QuickSight | AWS Business Intelligence Blog

Note: If you did want both Product Name and Product ID to be used for answering questions but will only need fuzzy matching of cell values on product name then you can use disableIndexing = true to have Q skip this field for the sake of indexing; the field will otherwise still be included it just will no longer have values suggested in autocomplete, nor fuzzy matching for mentions for filters. Read more here: TopicCalculatedField - Amazon QuickSight
You can also achieve this right from the topic authoring page by unchecking the “Show field values in search suggestions” setting.

Thanks!