I currently have an S3 bucket folder that receives data every 5 minutes. I used an S3-from-manifest dataset in QuickSight, which allows data refreshes down to once every hour, but I need my datasets refreshed at 15-minute intervals at most. If it is not possible to refresh that fast from the S3 source, what workflow of other AWS services can I use to achieve a 15-minute dataset refresh interval? (I still need the S3 bucket to hold the incoming data.)
Hello @Adnan and welcome to the Quicksight community!
Would using Direct Query be an option? For an S3 data source the lowest granularity is a 1-hour refresh rate.
Another workaround could be using Athena to query the data in the S3 bucket and turning on a 15-minute incremental refresh.
Hi @Adnan - For real time, "Direct Query" is the solution. A better approach is to put the file in S3, crawl the data via a Glue crawler to create an Athena table, convert the file format to Parquet and partition it (depending on the source system), and then do a direct query against the Athena table. If you can live with near real time, a good approach is the 15-minute incremental refresh (SPICE) suggested by @duncan.
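A rough boto3 sketch of that pipeline for illustration (the crawler, database, and table names, the S3 locations, and the ingest_time column below are all placeholders, and the crawler is assumed to already exist against the landing folder):

import boto3

glue = boto3.client('glue', region_name='us-east-1')
athena = boto3.client('athena', region_name='us-east-1')

# Crawl the landing folder so the CSV files are registered in the Glue Data Catalog.
glue.start_crawler(Name='my-s3-crawler')

# Once the raw table exists, a CTAS query rewrites it as partitioned Parquet,
# which is much cheaper and faster for QuickSight direct queries.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE my_db.events_parquet
        WITH (format = 'PARQUET',
              external_location = 's3://mybucket/parquet/',
              partitioned_by = ARRAY['dt'])
        AS SELECT *, date(ingest_time) AS dt FROM my_db.events_raw
    """,
    QueryExecutionContext={'Database': 'my_db'},
    ResultConfiguration={'OutputLocation': 's3://mybucket/athena-results/'}
)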
Regards - Sanjeeb
@Adnan
Lots of missing information…
Does your S3 bucket receive delta files every 5 minutes, or a full file?
How big is the file?
Does it carry aggregated data or raw data?
These answers will decide which additional AWS service or architecture piece will fit in…
Hi @neelay,
The bucket receives a full file every 5 minutes. The objects within the bucket are named by day - e.g. 07/10/23 - and have object versioning enabled. The received file (CSV) replaces/becomes the new version of the current day's file object.
The day's final object ends up around 275-325 KB, while the version-to-version size difference is about 2 KB.
The file contains raw data with a few aggregated fields.
Thank you! I will look into this.
@Adnan you can use a Lambda trigger on S3, and from the Lambda call the update-data-set API, which will itself trigger a data refresh.
e.g. change the S3 file_name of your dataset through the API (update_data_set - Boto3 1.28.1 documentation).
Lambda steps:
- (optional) Check if there are files in the S3 bucket following the file_name_{YYYYMMDDHHMM}.csv convention, and delete them
- Copy the latest version of the file to a new file named file_name_{YYYYMMDDHHMM}.csv
- Call the update-data-set API to change the file source to this file (this will update the dataset with the new data)
The idea I am proposing here is to use the update-data-set API to trigger the data refresh, because QS doesn't have a refresh API yet.
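A minimal sketch of such a Lambda handler (the account ID, dataset ID, and key names are placeholders; it assumes the dataset's S3 data source reads a manifest object, which the function rewrites to point at the newest file):

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client('s3')
quicksight = boto3.client('quicksight', region_name='us-east-1')

ACCOUNT_ID = '111111111111'   # placeholder
DATASET_ID = 'my-dataset-id'  # placeholder

def lambda_handler(event, context):
    # The S3 put event carries the bucket and key of the file that just arrived.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    src_key = record['object']['key']

    # Copy the latest version to a timestamped name, per the convention above.
    stamp = datetime.now(timezone.utc).strftime('%Y%m%d%H%M')
    new_key = f'nrt/file_name_{stamp}.csv'
    s3.copy_object(Bucket=bucket, CopySource={'Bucket': bucket, 'Key': src_key}, Key=new_key)

    # Rewrite the manifest the QuickSight S3 data source points at, so the
    # next ingestion reads the new file.
    manifest = {
        'fileLocations': [{'URIs': [f'https://{bucket}.s3.amazonaws.com/{new_key}']}],
        'globalUploadSettings': {'format': 'CSV', 'containsHeader': 'true'},
    }
    s3.put_object(Bucket=bucket, Key='nrt/manifest.json',
                  Body=json.dumps(manifest).encode('utf-8'))

    # Calling update-data-set re-imports the data into SPICE; the dataset's
    # existing physical table definition is reused so only the data changes.
    dataset = quicksight.describe_data_set(AwsAccountId=ACCOUNT_ID,
                                           DataSetId=DATASET_ID)['DataSet']
    quicksight.update_data_set(
        AwsAccountId=ACCOUNT_ID,
        DataSetId=DATASET_ID,
        Name=dataset['Name'],
        PhysicalTableMap=dataset['PhysicalTableMap'],
        ImportMode='SPICE',
    )
    return {'statusCode': 200}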
Hello @neelay.
Your solution is exactly what I am developing; however, instead of a Lambda function, I am using a Glue job. I did all the script configuration according to the boto3 documentation, and the Glue job runs successfully, but it does not update the dataset. Any idea what it could be? Thank you very much in advance.
Hi @Leosandre - Can you share the code used for refreshing the dataset? This will help in validating it, and in case of issues we can guide you.
Regards - Sanjeeb
@Sanjeeb2022 of course, below is the code used (hiding sensitive information).
import boto3

def update_dataset():
    quicksight = boto3.client('quicksight', region_name='us-east-1')
    account_id = 'xxxxxxxxx'
    dataset_id = 'yyyyyyyyyyyyyyyyyyyyyyyyyyyyy'
    s3_bucket = 'mybucket'
    s3_key = 'nrt/cmn_nrt.csv'
    try:
        response = quicksight.update_data_set(
            AwsAccountId=account_id,
            DataSetId=dataset_id,
            Name='newtest',
            PhysicalTableMap={
                'Primary': {
                    'S3Source': {
                        'DataSourceArn': f'arn:aws:s3:::{s3_bucket}/{s3_key}',
                        'InputColumns': [
                            {'Name': 'texto', 'Type': 'STRING'},
                            {'Name': 'data', 'Type': 'DATE'},
                            {'Name': 'inteiro', 'Type': 'INTEGER'},
                            {'Name': 'decimal', 'Type': 'DECIMAL'}
                        ],
                        'UploadSettings': {
                            'Format': 'CSV',
                            'StartFromRow': 1,
                            'ContainsHeader': True
                        }
                    }
                }
            },
            ImportMode='SPICE',
            DataSetReferences=[
                {
                    'DataSetPlaceholder': 'Primary',
                    'DataSetArn': f'arn:aws:quicksight:us-east-1:{account_id}:dataset/{dataset_id}'
                }
            ]
        )
        return {
            'statusCode': 200,
            'body': 'Dataset updated successfully!'
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'Error updating dataset: {str(e)}'
        }

# Calling the function
update_dataset()
Hi @Leosandre - Can you please remove the below details from the code and test it? Remove the DataSetReferences information.

DataSetReferences=[
    {
        'DataSetPlaceholder': 'Primary',
        'DataSetArn': f'arn:aws:quicksight:us-east-1:{account_id}:dataset/{dataset_id}'
    }
]
Regards - Sanjeeb
Same problem @Sanjeeb2022: the job runs successfully, but it does not update the dataset.
The code:
import boto3

def update_dataset():
    quicksight = boto3.client('quicksight', region_name='us-east-1')
    account_id = 'xxxxxxxxx'
    dataset_id = 'yyyyyyyyyyyyyyyyyyyyyyyyyyyyy'
    s3_bucket = 'mybucket'
    s3_key = 'nrt/cmn_nrt.csv'
    try:
        response = quicksight.update_data_set(
            AwsAccountId='111111111111111',
            DataSetId=dataset_id,
            Name='newtest',
            PhysicalTableMap={
                'Primary': {
                    'S3Source': {
                        'DataSourceArn': f'arn:aws:s3:::{s3_bucket}/{s3_key}',
                        'InputColumns': [
                            {'Name': 'texto', 'Type': 'STRING'},
                            {'Name': 'data', 'Type': 'DATE'},
                            {'Name': 'inteiro', 'Type': 'INTEGER'},
                            {'Name': 'decimal', 'Type': 'DECIMAL'}
                        ],
                        'UploadSettings': {
                            'Format': 'CSV',
                            'StartFromRow': 1,
                            'ContainsHeader': True
                        }
                    }
                }
            },
            ImportMode='SPICE'
        )
        return {
            'statusCode': 200,
            'body': 'Dataset updated successfully!'
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'Error updating dataset: {str(e)}'
        }
Hi @Leosandre - Can you please update the manifest file with the new file name and test it? If it is not working, then I suggest logging a ticket with the AWS customer support team so that they can help you. I tested the latest code you shared, updated the manifest file, and it worked for me.
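For reference, a QuickSight S3 manifest is a small JSON file; a minimal example pointing at the file used in the code above (bucket and key are the same placeholders from that code) could look like this:

{
    "fileLocations": [
        {
            "URIs": [
                "https://mybucket.s3.amazonaws.com/nrt/cmn_nrt.csv"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "CSV",
        "containsHeader": "true"
    }
}

Changing the URI here to the newly written file (and then updating the dataset) is what makes QuickSight ingest the new data.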
To log a ticket, please follow the link - Creating support cases and case management - AWS Support
Regards - Sanjeeb
@Sanjeeb2022 in fact, I'm trying another way. I connected my local Jupyter notebook to the AWS account and I'm trying from there (it's easier to see the execution logs), and now the error has changed. I'm getting the following error when executing the code:
{'statusCode': 500,
'body': "Error updating dataset: An error occurred (ValidationException) when calling the UpdateDataSet operation: 1 validation error detected: Value 'arn:aws:s3:::poc-cmn-teste-simulacao-nrt/nrt/cmn_nrt.csv' at 'physicalTableMap.Primary.member.s3Source.dataSourceArn' failed to satisfy constraint: Specified resource is not reachable in this region ('us-east-1')"}
In other words, from what I understand, it seems that my S3 bucket where the data is located is not in the same region as the QuickSight dataset, which is not true; both the S3 bucket and the dataset were created in the same region, as you can see in the images below:
Dataset region:
The code:
import boto3

def update_dataset():
    quicksight = boto3.client('quicksight', region_name='us-east-1')
    account_id = 'xxxxxxxxx'
    dataset_id = 'yyyyyyyyyyyyyyyyyyyyyyyyyyyyy'
    s3_bucket = 'mybucket'
    s3_key = 'nrt/cmn_nrt.csv'
    try:
        response = quicksight.update_data_set(
            AwsAccountId='1111111111111',
            DataSetId=dataset_id,
            Name='Cmn_NRT',
            PhysicalTableMap={
                'Primary': {
                    'S3Source': {
                        'DataSourceArn': f'arn:aws:s3:::{s3_bucket}/{s3_key}',
                        'InputColumns': [
                            {'Name': 'texto', 'Type': 'STRING'},
                            {'Name': 'data', 'Type': 'DATETIME'},
                            {'Name': 'inteiro', 'Type': 'INTEGER'},
                            {'Name': 'decimal', 'Type': 'DECIMAL'}
                        ],
                        'UploadSettings': {
                            'Format': 'CSV',
                            'StartFromRow': 1,
                            'ContainsHeader': True
                        }
                    }
                }
            },
            ImportMode='SPICE'
        )
        return {
            'statusCode': 200,
            'body': 'Dataset updated successfully!'
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'Error updating dataset: {str(e)}'
        }
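For context, this ValidationException typically indicates that DataSourceArn was given an S3 object ARN, whereas update_data_set expects the ARN of an existing QuickSight data source (arn:aws:quicksight:...:datasource/...); the "not reachable in this region" wording is misleading. A small sketch to look up the correct ARN (the account ID is a placeholder):

import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

# List the QuickSight data sources in the account; the S3 data source behind
# the dataset shows up with Type 'S3'.
for ds in quicksight.list_data_sources(AwsAccountId='111111111111')['DataSources']:
    print(ds['Name'], ds['Type'], ds['Arn'])

# Use the printed arn:aws:quicksight:...:datasource/... value as the
# DataSourceArn in PhysicalTableMap, not the arn:aws:s3::: ARN.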
Hi @Leosandre - I faced the same error, and I suspect this is a bug. However, after I updated the manifest file, my dataset was updated despite the error. I believe it is better to create a case with the AWS support team.
That said, I am also tagging some experts to help with this. @ErikG @DylanM @duncan @David_Wong - Please help on this.
Regards - Sanjeeb