Dataset from an S3 folder: cannot add new columns to the latest CSV file

We use S3 as our data source, and the JSON manifest includes a URIPrefixes entry that picks up all files in an S3 folder. The folder contains CSV files.
For example,
Sales-Jan-2022.csv
Sales-Feb-2022.csv
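For reference, the manifest we use is along these lines (the bucket and folder names here are placeholders, not our real ones):

```json
{
    "fileLocations": [
        {
            "URIPrefixes": [
                "s3://my-sales-bucket/sales-data/"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "CSV",
        "containsHeader": "true"
    }
}
```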

The issue is that when we add the latest month's file with additional new column(s) appended at the end of the CSV, the dataset fails because the previous months' files do not have those columns.

I would expect this to work, with the new columns treated as null for the previous months.

The workaround is to add the new columns to all the CSV files, but that could mean a lot of work and re-uploading huge amounts of data.
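If anyone ends up doing that workaround, a minimal sketch of the per-file rewrite (assuming Python; the `normalize_csv` helper and the column names are just illustrative) would be to pad each older file against the superset of columns before re-uploading:

```python
import csv
import io

def normalize_csv(text: str, target_header: list[str]) -> str:
    """Rewrite CSV text so it has every column in target_header,
    filling columns missing from the source with empty values."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    # restval="" supplies an empty value for any column the
    # source file does not have.
    writer = csv.DictWriter(out, fieldnames=target_header,
                            restval="", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# An older file that lacks the new "Number 2" column:
old = "Name,Number\nAlice,1\nBob,2\n"
print(normalize_csv(old, ["Name", "Number", "Number 2"]))
# Name,Number,Number 2
# Alice,1,
# Bob,2,
```

You would still have to run this over every existing file and re-upload, which is exactly the effort we want to avoid.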

Thanks in advance!

Hi,

You can add a new column in dataset mode or in the analysis, if the new column is based on the existing data.

Hi David. Did Naveed’s suggestion work for you?


No…
I did not find a way to add the new column.

I will explain with a clear example and attach an image to clarify the example.
Any help will be greatly appreciated.

  1. In an S3 folder I have two files: File1.csv and File2.csv

  2. I created an S3 dataset in QuickSight from that folder using a JSON manifest file:

  3. I was expecting the dataset to include the 3 columns: Name, Number and Number 2.
    But it only has the first two from File1.csv:

  4. And in an analysis using the dataset I get this error, saying that File2 has fewer columns:

In our use case, we continually add CSV files to an S3 folder, and sometimes we want to add new columns to the latest file. We would expect the new columns to be created, with null values for the previous CSV files. But that does not happen: we get an error, and the analyses stop working. Is there a way to solve this?

Hi Lillie, I added more info below…
Can someone take a look at it?

Thanks in advance!

Hi David,

A dataset created from a manifest for files in Amazon S3 requires all files to have the same number of fields (columns). See: Amazon S3 Files

In your use case, the schema is changing, so you will have to accommodate these changes as part of your data pipeline.
One example solution is below (there will be costs associated if you test it):
Keep the files in S3 and use an interactive query service like Amazon Athena (Workshop Studio). Within Athena you can create a table that points to the location of the S3 bucket (S3, Athena, and QuickSight have to be in the same Region).
Every time the schema changes based on the underlying files, you also need to update the table definition in Athena.
Amazon QuickSight can connect to Athena as a data source. Do note that with a changing schema, you will also need to refresh your dataset for the new columns to become visible.
Use the newly created dataset in your analysis.
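As a rough sketch of the Athena side (the database, table, bucket, and column names below are hypothetical, and the SerDe choice assumes simple comma-separated files with a header row):

```sql
-- Initial table over the S3 folder
CREATE EXTERNAL TABLE sales_db.sales (
    name   string,
    number int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
LOCATION 's3://my-sales-bucket/sales-data/'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- When a new file introduces a column at the end, extend the table;
-- older files with fewer columns simply return NULL for it.
ALTER TABLE sales_db.sales ADD COLUMNS (number2 int);
```

Because the new column is appended at the end, rows from the older files come back as NULL for it, which is the behavior David was expecting from the S3 dataset directly.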

Regards,
Koushik