Spice Computation

Hello everyone,

I have a simple question about Spice and Dataset. Let’s imagine I have a CSV file with 10,000 rows and 4 columns: LoadDate, ColumnA, ColumnB, and ColumnC. If I load this file into QuickSight, it gives me a 10MB dataset. If I generate this file over 10 days, changing only the LoadDate column (same LoadDate for all rows), and then load all these files into QuickSight Dataset, it gives me a 100MB dataset. I expected a smaller dataset considering columnar storage, given that ColumnA, ColumnB, and ColumnC have the same data. Is my expectation correct?

Hi @woliveiras

SPICE uses columnar storage, which is optimized for analytics and often compresses data by storing columns separately. When many rows share the same values in certain columns (like ColumnA, ColumnB, and ColumnC in your example), SPICE can apply compression effectively.

SPICE also stores metadata, indexing, and other overhead for each file, which can result in a dataset size larger than expected when concatenating multiple files.

Full Refresh vs. Incremental Refresh: If you perform a full refresh, SPICE reloads the entire dataset, potentially leading to increased storage requirements as new, unique data accumulates. On the other hand, using incremental refresh can optimize storage by only appending new rows (e.g., for a new LoadDate) instead of reloading the entire dataset. This method is more storage-efficient when only a portion of your data changes over time.

Please refer the below documentation this might be helpful for you.

1 Like

Hey @Xclipse ,

Thank you for your response. Unfortunately, the behavior I’m observing in my environment is different from what you described.

I created two files:

File1.csv (5 MB)
Filetotal.csv (15 MB)
Both files have the same columns. The only difference is that Filetotal.csv contains three snapshots of the same data found in File1.csv, with the LoadDate column having different dates and the value columns having different values.

When I import File1.csv into a dataset in QuickSight, it generates a 10 MB dataset. However, when I import Filetotal.csv, QuickSight generates a 45 MB dataset.

This is not what I expected, given the repetition of column data.

Could you please provide some insights into why this might be happening?

File1.csv (5mb)
LoadDate, ColumnA, ColumnB, and ColumnC.
01/01/2024,AAA,BBB,10
01/01/2024,AAA,BBB2,15
01/01/2024,AAA,BBB3,15
01/01/2024,AAA1,BBB,100
01/01/2024,AAA1,BBB2,13
01/01/2024,AAA1,BBB3,11

Filetotal.csv (15mb)
LoadDate, ColumnA, ColumnB, and ColumnC.
01/01/2024,AAA,BBB,10
01/01/2024,AAA,BBB2,15
01/01/2024,AAA,BBB3,15
01/01/2024,AAA1,BBB,100
01/01/2024,AAA1,BBB2,13
01/01/2024,AAA1,BBB3,11
02/01/2024,AAA,BBB,102
02/01/2024,AAA,BBB2,152
02/01/2024,AAA,BBB3,151
02/01/2024,AAA1,BBB,190
02/01/2024,AAA1,BBB2,130
02/01/2024,AAA1,BBB3,111
03/01/2024,AAA,BBB,9
03/01/2024,AAA,BBB2,153
03/01/2024,AAA,BBB3,152
03/01/2024,AAA1,BBB,70
03/01/2024,AAA1,BBB2,135
03/01/2024,AAA1,BBB3,110

@woliveiras ,

SPICE is geared for performance and the expectation that repeated data would lead to a smaller dataset size is not true . If you want to get an estimate of the logical size of your dataset , use the following where we describe how it’s calculated spice-capacity-formula .

Kind regards,
Koushik

2 Likes