Parsing null values causes ingestion error on Athena datasource

Hi,

We are facing an issue with date parsing using athena as datasource.

When there are missing values for the date field, QuickSight will try to parse the dataset as a String. Changing the value to a date with expected format causes ingestion to fail when there are missing values. The only workaround we have found is using custom SQL and explicitly parsing a date, replacing null values with empty strings. This was not an issue when using s3 connection directly. Our new data connections is converting data from csv to parquet, crawling the data to athena.
This workaround is not feasible because we are moving upwards of hundreds of datasets in this manner and would like to find a dynamic way of fixing this.

Thanks in advance

Hi @carlsaug - Welcome to AWS QuickSight community and thanks for posting the question. If I understood the problem correctly, you are creating a table in athena where you have date field and have some empty value but QuickSight treated it as string and that causes the ingestion failure. Is my understanding correct? Is it possible to create the data set via API where you can enforce the data types and see the outcome.

Hi @Koushik_Muthanna @ArunSanthosh - Kindly give your feedback on the same?

Regards - Sanjeeb

Hi @Sanjeeb2022

Yes your understanding is correct! We are creating the data sets via API, getting the phsyical and logical table mapping from DescribeDataSet on the original data set. Then translating the PhyiscalTableMap to use Athena and using the logical table mapping from the source dataset (typescript example):

function createAthenaTableMap(
  dataSet: DatasetMigrationInput,
  sourceDataSet: DataSet,
  team: string,
): { [key: string]: PhysicalTable } {
  const athenaDataSourceArn = `arn:aws:quicksight:eu-west-1:12345678910:datasource/sophon-${team}-alpha-athena`; 
  const physicalTableMap: { [key: string]: PhysicalTable } = {};
  for (const tableId in sourceDataSet.PhysicalTableMap) {
    const table = sourceDataSet.PhysicalTableMap[tableId];

    if (table.S3Source) {
      physicalTableMap[tableId] = {
        RelationalTable: {
          DataSourceArn: athenaDataSourceArn,
          Catalog: 'AwsDataCatalog',
          Schema: `${team}-${stage}`,
          Name: dataSet.SophonAthenaTable,
          InputColumns: table.S3Source.InputColumns,
        } as RelationalTable,
      };
    }
  }

  return physicalTableMap;
}

..... 



  const createDataSetCommand = new CreateDataSetCommand({
    ...sourceDataSet,
    PhysicalTableMap: physicalTableMap,
    AwsAccountId: targetAwsAccount,
    DataSetId: formatName(dataSet.Name!),
    ImportMode: 'SPICE',
    Name: stage === 'prod' ? dataSet.Name : `${stage}-${dataSet.Name}`,
  });

The ingestion fails though on the CastColumnTypeOperation(s) in the LogicalTableMap when doing this since the date is not parsed correctly due to this issue.

Regards,
August

1 Like

Hi @carlsaug - Thank you so much. Is it possible to raise a ticket to AWS Customer support team so that this can be analyzed in detail. You are following right approach.

To raise the request, please follow the link - Creating support cases and case management - AWS Support

Hi @Koushik_Muthanna @ArunSanthosh @Max - Any expert advise on this.

Regards - Sanjeeb

1 Like

Hi @carlsaug ,

Below an example of missing date value in a parquet filet which is queried through and also ingested into SPICE . Check again what is causing the ingestion issues.

column = order_delivered_carrier_date ( timestamp )

Athena

QuickSight

Kind regards,
Koushik

1 Like

Hi Koushik,

As mentioned in #quicksight-users, could you please share the DescribeDataSet payload so we can see how your table maps are configured?

Kind regards,
August

1 Like

Sure

{
    "Status": 200,
    "DataSet": {
        "Arn": "arn:aws:quicksight:eu-central-1:XXXX:dataset/477ebcd8-a922-4c60-89bc-a086fa0eeb13",
        "DataSetId": "477ebcd8-a922-4c60-89bc-a086fa0eeb13",
        "Name": "orders_date_null_testing_table",
        "CreatedTime": "2023-06-15T11:17:27.553000+02:00",
        "LastUpdatedTime": "2023-06-15T11:18:13.065000+02:00",
        "PhysicalTableMap": {
            "027e1852-ce31-4770-a0cb-1c0ba9f9c897": {
                "RelationalTable": {
                    "DataSourceArn": "arn:aws:quicksight:eu-central-1:XXXX:datasource/a6daac79-a4ca-401f-b740-27db623756b8",
                    "Catalog": "AwsDataCatalog",
                    "Schema": "default",
                    "Name": "orders_date_null_testing_table",
                    "InputColumns": [
                        {
                            "Name": "order_id",
                            "Type": "STRING"
                        },
                        {
                            "Name": "customer_id",
                            "Type": "STRING"
                        },
                        {
                            "Name": "order_status",
                            "Type": "STRING"
                        },
                        {
                            "Name": "order_purchase_timestamp",
                            "Type": "DATETIME"
                        },
                        {
                            "Name": "order_approved_at",
                            "Type": "STRING"
                        },
                        {
                            "Name": "order_delivered_carrier_date",
                            "Type": "DATETIME"
                        },
                        {
                            "Name": "order_delivered_customer_date",
                            "Type": "STRING"
                        },
                        {
                            "Name": "order_estimated_delivery_date",
                            "Type": "STRING"
                        }
                    ]
                }
            }
        },
        "LogicalTableMap": {
            "9efbe083-b65c-4e18-8fcd-6aba6e914578": {
                "Alias": "orders_date_null_testing_table",
                "DataTransforms": [
                    {
                        "ProjectOperation": {
                            "ProjectedColumns": [
                                "order_id",
                                "customer_id",
                                "order_status",
                                "order_purchase_timestamp",
                                "order_approved_at",
                                "order_delivered_carrier_date",
                                "order_delivered_customer_date",
                                "order_estimated_delivery_date"
                            ]
                        }
                    }
                ],
                "Source": {
                    "PhysicalTableId": "027e1852-ce31-4770-a0cb-1c0ba9f9c897"
                }
            }
        },
        "OutputColumns": [
            {
                "Name": "order_id",
                "Type": "STRING"
            },
            {
                "Name": "customer_id",
                "Type": "STRING"
            },
            {
                "Name": "order_status",
                "Type": "STRING"
            },
            {
                "Name": "order_purchase_timestamp",
                "Type": "DATETIME"
            },
            {
                "Name": "order_approved_at",
                "Type": "STRING"
            },
            {
                "Name": "order_delivered_carrier_date",
                "Type": "DATETIME"
            },
            {
                "Name": "order_delivered_customer_date",
                "Type": "STRING"
            },
            {
                "Name": "order_estimated_delivery_date",
                "Type": "STRING"
            }
        ],
        "ImportMode": "SPICE",
        "ConsumedSpiceCapacityInBytes": 29759,
        "FieldFolders": {},
        "DataSetUsageConfiguration": {
            "DisableUseAsDirectQuerySource": false,
            "DisableUseAsImportedSource": false
        }
    },
    "RequestId": "4b11328a-41a6-4a71-b755-517f829257d0"
}
1 Like

Hi @Koushik_Muthanna,

The difference here seems to be that datatype is defined in the PhysicalTableMap rather than cast in the LogicalTableMap. Did you specify datatype when creating the data set via the API?

I am concerned about the user experience in that case, since they will be creating the data sets in the console and won’t be able to specify the table maps in this way. All our data sources on the account will be Athena.

Regarding the way we create the data, I think a large part of the issue is that empty rows are being crawled as empty strings rather than null values. Do you know if there is a specific crawler configuration to avoid this?

Would be curious to hear your thoughts on this.

Kind regards,
August