Reference for Large QuickSight Customers

IHAC (I have a customer) analyzing image metadata. The dataset is around 200 GB but contains billions of rows stored in S3. For the POC they used SPICE and it went well, but they will exceed 2-3B rows in a production deployment. They are tracking more than 300 million assets a day and are looking for query performance under 30 seconds.

What they are trying now is a direct query approach, but Athena is likely too slow. What alternatives do we have? Any best practices to optimize this workload?

Also, do we have any customer references who are processing this many rows in SPICE?

Hi @zgaurash ,

The 1B-row limit per dataset is a hard limitation for SPICE, so they will need to design a partitioning scheme to store all the data. SPICE does have a soft top limit of 50 TB, so they have plenty of capacity, but they will need to break the data up across multiple datasets.

I don’t know their data, but aggregation and time usually have potential. If their grain is hourly or by the minute, a daily aggregation might be a good starting point. Then keep hot, warm, and cold data in separate datasets as a drill-down (or drill to a separate tab). E.g. the first 90 days might be the hot partition (with all the details at the finest grain), the next 9 months might be the warm partition, and then there might be a cold partition for each of the following years. A rough sketch of this approach follows.
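Here is a minimal sketch of that layout, assuming the data already sits behind an Athena table and using boto3 to run the DDL. All names (image_db.asset_events, event_ts, asset_id, the S3 buckets) are hypothetical placeholders, not the customer's actual schema:

```python
# Hypothetical sketch: roll minute/hour-grain events up to a daily
# aggregate (warm/cold SPICE datasets) and expose the last 90 days at
# full detail (hot SPICE dataset). Table, column, and bucket names are
# assumptions for illustration only.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Daily rollup via CTAS: one row per asset per day instead of billions
# of raw events, stored as partitioned Parquet for cheap scans.
DAILY_AGG_CTAS = """
CREATE TABLE image_db.asset_events_daily
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['event_date'],
    external_location = 's3://my-bucket/curated/asset_events_daily/'
) AS
SELECT
    asset_id,
    COUNT(*)             AS event_count,
    MAX(file_size_bytes) AS max_file_size_bytes,
    DATE(event_ts)       AS event_date
FROM image_db.asset_events
GROUP BY asset_id, DATE(event_ts)
"""

# "Hot" slice: the most recent 90 days at the finest grain, small
# enough to fit in a single SPICE dataset under the 1B-row limit.
HOT_VIEW = """
CREATE OR REPLACE VIEW image_db.asset_events_hot AS
SELECT *
FROM image_db.asset_events
WHERE event_ts >= date_add('day', -90, current_date)
"""

for sql in (DAILY_AGG_CTAS, HOT_VIEW):
    athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "image_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
```

Each SPICE dataset then imports one slice (the hot view, or a date-bounded query over the daily aggregate), and the dashboard drills from the aggregate down to the hot detail.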

Hope that helps.

ws


Are you talking about partitioning in SPICE or at the data source? Any reference docs / blogs?

Also, if we use direct query against a Redshift cluster for 1000+ concurrent users, does QuickSight support querying through a proxy like Heimdall / connection pooling to handle this concurrency at the Redshift level?

Hi @zgaurash - The partitioning should be done on the Athena table, with the data stored in Parquet format. You can then filter the data per usage in the QuickSight layer as well. A sketch is below.
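A minimal sketch of that table setup, again with hypothetical names (image_db.asset_events, the S3 locations). Partition projection is one optional way to avoid repairing partitions as new daily data lands; it is an assumption here, not something required by QuickSight:

```python
# Hypothetical sketch of a date-partitioned Parquet table in Athena.
# Partition projection lets Athena resolve event_date partitions from
# the S3 path template instead of the Glue catalog, so no
# MSCK REPAIR TABLE is needed as daily data arrives.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

PARTITIONED_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS image_db.asset_events (
    asset_id        string,
    event_ts        timestamp,
    file_size_bytes bigint
)
PARTITIONED BY (event_date date)
STORED AS PARQUET
LOCATION 's3://my-bucket/raw/asset_events/'
TBLPROPERTIES (
    'projection.enabled'           = 'true',
    'projection.event_date.type'   = 'date',
    'projection.event_date.range'  = '2023-01-01,NOW',
    'projection.event_date.format' = 'yyyy-MM-dd',
    'storage.location.template'    = 's3://my-bucket/raw/asset_events/event_date=${event_date}/'
)
"""

athena.start_query_execution(
    QueryString=PARTITIONED_TABLE_DDL,
    QueryExecutionContext={"Database": "image_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```

QuickSight direct queries (or SPICE imports) that filter on event_date will then prune to only the matching Parquet partitions.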

Regards - Sanjeeb