Some columns are missing in dataset

Hello,
I am importing a relatively large (around 400 MB) JSON file (DynamoDB-JSON format) into QuickSight as a dataset, either by uploading it directly (“Upload a file”) or by specifying a manifest file on s3 that points to the data file on s3.

There are some columns/database fields that only start to appear in the (about) last third of that data file (because these db fields were only recently been added to the db where the data originally comes from).

QuickSight successfully imports all lines from the json data file, I have checked this by comparing the number of successfully imported rows (no skipped rows) with the number of lines in the json file – they match.

However, in the resulting dataset, the rather newish columns / db fields are not displayed anywhere, neither in the field list nor in the sample data. There are no excluded fields and all fields ought to be displayed. Also, when I create a small subset of the imported json file (e.g. with only a handful of data records of which some contain those newish columns), the newish columns are displayed in the QS dataset.

What is the reason for those newish columns not being displayed at all in the dataset for the whole json file? Please note that this is not a refresh of an existing dataset but I am creating a new dataset based on the large json file.

Could it be that QuickSight only scans the first part of that large json file for determining the data structure and since the mentioned newish columns/fields likely do not appear within that first part, QuickSight ignores them when it encounters them when importing the remaining data from the file? If so, how can I get those columns/fields be made available in the dataset?

More generally, how can I ensure that all columns/fields in a data file end up in the data structure QS is creating based on its analysis of the data file, even if some column/field only appears a few times somewhere (and not necessarily near the beginning) of a data file?

Thanks,
Kaspar

PS: After moving a few data records which contain the mentioned newish fields from the rather end of the data file to the rather beginning of it, those columns finally get displayed in the dataset. However, this result was based on manual editing of the data file. Is there a way to tell QuickSight to build the data structure based on all columns/fields found in a data file?

Hi @Kaspar
QuickSight does perform schema and type inference on JSON files during import, including flattening nested JSON structures. However, there are some limitations on the types of JSON structures that are supported. QuickSight supports importing flat JSON files and semi-structured JSON data, and it automatically infers the schema, data types, and flattens the JSON data during import. It also supports JSON records with structures and JSON records with root elements as arrays. However, list attributes and list objects within a JSON record are not supported and will be skipped during import. This could be the reason why some columns are not appearing in your QuickSight dataset, if those columns are nested within list attributes or list objects in the JSON file. To address this issue, it is recommended to use tools like AWS Glue to unnest the JSON data before importing it into QuickSight. These tools can help flatten the nested JSON structures, making all columns accessible in the QuickSight dataset.
This post talks along similar direction.

How many columns do you have? There is a limit of 2K columns on SPICE dataset.

1 Like

@awsvig Thanks for those useful details about JSON structure. I’ve indeed already encountered such scenarios in which in my JSON file there were columns represented as lists/nested structure which then did not appear in the dataset. However, in this particular case, the impacted columns (DB fields) are not nested, they are very simple fields (AMOUNT (number), DURATION (number)). And as it has turned out (see my “PS:” in my original post) those columns will get displayed/made available in the dataset, as soon as I move the first occurrences of them in the JSON source data file from the bottom of the file towards the beginning, before the JSON file gets imported as a dataset in QuickSight.

In total, I have only about 20 columns in the JSON file, definitely less than the 2K limit. :slight_smile:

Therefore, can anyone confirm that QuickSight – for creating the data structure of a dataset – is indeed only scanning parts of a (large) JSON file and not always the whole content of it? And therefore, if some field/column (not a list or anything, just a plain simple data field like one containing numbers) for the first time only appears towards the end of a large JSON file, then it will not be considered for the data structure of the dataset into which the content of the JSON file is imported?