I currently have an S3 bucket folder that receives data every 5 minutes. I used an S3-from-manifest dataset in QuickSight, which allows data refreshes down to once every hour, but I need my datasets refreshed at 15-minute intervals at most. If it is not possible to refresh that fast from the S3 source, what workflow of other AWS services can I use to achieve a 15-minute dataset refresh interval? (I still need the S3 bucket to hold the incoming data.)
Hello @Adnan and welcome to the Quicksight community!
Would using Direct Query be an option? For an S3 data source the lowest granularity is a 1-hour refresh rate.
Another workaround could be using Athena to query the data in the S3 bucket and turning on a 15-minute incremental refresh.
Hi @Adnan - For real time, "Direct Query" is the solution. A better approach is to put the file in S3, crawl the data via a Glue crawler to create an Athena table, convert the file format to Parquet and partition it (depending on the source system), and then do a direct query against the Athena table. If you can live with near real time, a good approach is the 15-minute incremental refresh (SPICE) suggested by @duncan.
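A rough boto3 sketch of that pipeline for illustration (the crawler, database, and table names, the S3 locations, and the ingest_time column below are all placeholders, and the crawler is assumed to already exist against the landing folder):

import boto3

glue = boto3.client('glue', region_name='us-east-1')
athena = boto3.client('athena', region_name='us-east-1')

# Crawl the landing folder so the CSV files are registered in the Glue Data Catalog.
glue.start_crawler(Name='my-s3-crawler')

# Once the raw table exists, a CTAS query rewrites it as partitioned Parquet,
# which is much cheaper and faster for QuickSight direct queries.
athena.start_query_execution(
    QueryString="""
        CREATE TABLE my_db.events_parquet
        WITH (format = 'PARQUET',
              external_location = 's3://mybucket/parquet/',
              partitioned_by = ARRAY['dt'])
        AS SELECT *, date(ingest_time) AS dt FROM my_db.events_raw
    """,
    QueryExecutionContext={'Database': 'my_db'},
    ResultConfiguration={'OutputLocation': 's3://mybucket/athena-results/'}
)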
Regards - Sanjeeb
@Adnan
Lots of missing information…
Does your S3 bucket receive delta files every 5 minutes, or a full file?
How big is the file?
Does it carry aggregated data or raw data?
These answers will decide which additional AWS service or architecture piece will fit in…
Hi @neelay,
The bucket receives a full file every 5 minutes. The objects within the bucket are named by day - e.g. 07/10/23 - and have object versioning enabled. The received file (CSV) replaces/becomes the new version of the current day's file object.
The day's final object ends up around 275-325 KB, while the version-to-version size difference is about 2 KB.
The file contains raw data with a few aggregated fields.
Thank you! I will look into this.
@Adnan you can use a Lambda trigger on S3, and from the Lambda call the update-data-set API, which will itself trigger a data refresh.
e.g. change the S3 file_name of your dataset through the API (update_data_set - Boto3 1.28.1 documentation).
Lambda steps:
- (optional) Check if there are files in the S3 bucket following the file_name_{YYYYMMDDHHMM}.csv convention, and delete them
- Copy the latest version of the file to a new file named file_name_{YYYYMMDDHHMM}.csv
- Call the update-data-set API to change the file source to this file (this will update the dataset with the new data)
The idea I am proposing here is to use the update-data-set API to trigger the data refresh, because QS doesn't have a refresh API yet.
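A minimal sketch of such a Lambda handler (the account ID, dataset ID, and key names are placeholders; it assumes the dataset's S3 data source reads a manifest object, which the function rewrites to point at the newest file):

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client('s3')
quicksight = boto3.client('quicksight', region_name='us-east-1')

ACCOUNT_ID = '111111111111'   # placeholder
DATASET_ID = 'my-dataset-id'  # placeholder

def lambda_handler(event, context):
    # The S3 put event carries the bucket and key of the file that just arrived.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    src_key = record['object']['key']

    # Copy the latest version to a timestamped name, per the convention above.
    stamp = datetime.now(timezone.utc).strftime('%Y%m%d%H%M')
    new_key = f'nrt/file_name_{stamp}.csv'
    s3.copy_object(Bucket=bucket, CopySource={'Bucket': bucket, 'Key': src_key}, Key=new_key)

    # Rewrite the manifest the QuickSight S3 data source points at, so the
    # next ingestion reads the new file.
    manifest = {
        'fileLocations': [{'URIs': [f'https://{bucket}.s3.amazonaws.com/{new_key}']}],
        'globalUploadSettings': {'format': 'CSV', 'containsHeader': 'true'},
    }
    s3.put_object(Bucket=bucket, Key='nrt/manifest.json',
                  Body=json.dumps(manifest).encode('utf-8'))

    # Calling update-data-set re-imports the data into SPICE; the dataset's
    # existing physical table definition is reused so only the data changes.
    dataset = quicksight.describe_data_set(AwsAccountId=ACCOUNT_ID,
                                           DataSetId=DATASET_ID)['DataSet']
    quicksight.update_data_set(
        AwsAccountId=ACCOUNT_ID,
        DataSetId=DATASET_ID,
        Name=dataset['Name'],
        PhysicalTableMap=dataset['PhysicalTableMap'],
        ImportMode='SPICE',
    )
    return {'statusCode': 200}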
Hello @neelay.
Your solution is exactly what I am developing; however, instead of a Lambda function, I am using a Glue job. I did all the script configuration according to the boto3 documentation, and the Glue job runs successfully, but it does not update the dataset. Any idea what it could be? Thank you very much in advance.
Hi @Leosandre - Can you share the code used for refreshing the dataset? This will help in validating it, and in case of issues we can guide you.
Regards - Sanjeeb
@Sanjeeb2022 of course, below is the code used (hiding sensitive information).
import boto3

def update_dataset():
    quicksight = boto3.client('quicksight', region_name='us-east-1')
    account_id = 'xxxxxxxxx'
    dataset_id = 'yyyyyyyyyyyyyyyyyyyyyyyyyyyyy'
    s3_bucket = 'mybucket'
    s3_key = 'nrt/cmn_nrt.csv'
    try:
        response = quicksight.update_data_set(
            AwsAccountId=account_id,
            DataSetId=dataset_id,
            Name='newtest',
            PhysicalTableMap={
                'Primary': {
                    'S3Source': {
                        'DataSourceArn': f'arn:aws:s3:::{s3_bucket}/{s3_key}',
                        'InputColumns': [
                            {'Name': 'texto', 'Type': 'STRING'},
                            {'Name': 'data', 'Type': 'DATE'},
                            {'Name': 'inteiro', 'Type': 'INTEGER'},
                            {'Name': 'decimal', 'Type': 'DECIMAL'}
                        ],
                        'UploadSettings': {
                            'Format': 'CSV',
                            'StartFromRow': 1,
                            'ContainsHeader': True
                        }
                    }
                }
            },
            ImportMode='SPICE',
            DataSetReferences=[
                {
                    'DataSetPlaceholder': 'Primary',
                    'DataSetArn': f'arn:aws:quicksight:us-east-1:{account_id}:dataset/{dataset_id}'
                }
            ]
        )
        return {
            'statusCode': 200,
            'body': 'Dataset updated successfully!'
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'Error updating dataset: {str(e)}'
        }

# Calling the function
update_dataset()
Hi @Leosandre - Can you please remove the below details from the code and test it? Remove the DataSetReferences information.

DataSetReferences=[
    {
        'DataSetPlaceholder': 'Primary',
        'DataSetArn': f'arn:aws:quicksight:us-east-1:{account_id}:dataset/{dataset_id}'
    }
]
Regards - Sanjeeb
Same problem @Sanjeeb2022: the job runs successfully, but it does not update the dataset.
The code:
import boto3

def update_dataset():
    quicksight = boto3.client('quicksight', region_name='us-east-1')
    account_id = 'xxxxxxxxx'
    dataset_id = 'yyyyyyyyyyyyyyyyyyyyyyyyyyyyy'
    s3_bucket = 'mybucket'
    s3_key = 'nrt/cmn_nrt.csv'
    try:
        response = quicksight.update_data_set(
            AwsAccountId='111111111111111',
            DataSetId=dataset_id,
            Name='newtest',
            PhysicalTableMap={
                'Primary': {
                    'S3Source': {
                        'DataSourceArn': f'arn:aws:s3:::{s3_bucket}/{s3_key}',
                        'InputColumns': [
                            {'Name': 'texto', 'Type': 'STRING'},
                            {'Name': 'data', 'Type': 'DATE'},
                            {'Name': 'inteiro', 'Type': 'INTEGER'},
                            {'Name': 'decimal', 'Type': 'DECIMAL'}
                        ],
                        'UploadSettings': {
                            'Format': 'CSV',
                            'StartFromRow': 1,
                            'ContainsHeader': True
                        }
                    }
                }
            },
            ImportMode='SPICE'
        )
        return {
            'statusCode': 200,
            'body': 'Dataset updated successfully!'
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'Error updating dataset: {str(e)}'
        }
Hi @Leosandre - Can you please update the manifest file with the new file name and test it? If it is not working, then I suggest logging a ticket with the AWS customer support team so that they can help you. I tested the latest code you shared, updated the manifest file, and it worked for me.
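For reference, a QuickSight S3 manifest is a small JSON file; a minimal example pointing at the file used in the code above (bucket and key are the same placeholders from that code) could look like this:

{
    "fileLocations": [
        {
            "URIs": [
                "https://mybucket.s3.amazonaws.com/nrt/cmn_nrt.csv"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "CSV",
        "containsHeader": "true"
    }
}

Changing the URI here to the newly written file (and then updating the dataset) is what makes QuickSight ingest the new data.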
To log a ticket, please follow the link - Creating support cases and case management - AWS Support
Regards - Sanjeeb
@Sanjeeb2022 in fact, I'm trying another way. I connected my local Jupyter notebook to the AWS account and I'm trying from there (it's easier to see the execution logs), and now the error has changed. I'm getting the following error when executing the code:
{'statusCode': 500,
'body': "Error updating dataset: An error occurred (ValidationException) when calling the UpdateDataSet operation: 1 validation error detected: Value 'arn:aws:s3:::poc-cmn-teste-simulacao-nrt/nrt/cmn_nrt.csv' at 'physicalTableMap.Primary.member.s3Source.dataSourceArn' failed to satisfy constraint: Specified resource is not reachable in this region ('us-east-1')"}
In other words, from what I understand, it seems that my S3 bucket where the data is located is not in the same region as the QuickSight dataset, which is not true; both the S3 bucket and the dataset were created in the same region, as you can see in the images below:
Dataset region:
The code:
import boto3

def update_dataset():
    quicksight = boto3.client('quicksight', region_name='us-east-1')
    account_id = 'xxxxxxxxx'
    dataset_id = 'yyyyyyyyyyyyyyyyyyyyyyyyyyyyy'
    s3_bucket = 'mybucket'
    s3_key = 'nrt/cmn_nrt.csv'
    try:
        response = quicksight.update_data_set(
            AwsAccountId='1111111111111',
            DataSetId=dataset_id,
            Name='Cmn_NRT',
            PhysicalTableMap={
                'Primary': {
                    'S3Source': {
                        'DataSourceArn': f'arn:aws:s3:::{s3_bucket}/{s3_key}',
                        'InputColumns': [
                            {'Name': 'texto', 'Type': 'STRING'},
                            {'Name': 'data', 'Type': 'DATETIME'},
                            {'Name': 'inteiro', 'Type': 'INTEGER'},
                            {'Name': 'decimal', 'Type': 'DECIMAL'}
                        ],
                        'UploadSettings': {
                            'Format': 'CSV',
                            'StartFromRow': 1,
                            'ContainsHeader': True
                        }
                    }
                }
            },
            ImportMode='SPICE'
        )
        return {
            'statusCode': 200,
            'body': 'Dataset updated successfully!'
        }
    except Exception as e:
        return {
            'statusCode': 500,
            'body': f'Error updating dataset: {str(e)}'
        }
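For context, this ValidationException typically indicates that DataSourceArn was given an S3 object ARN, whereas update_data_set expects the ARN of an existing QuickSight data source (arn:aws:quicksight:...:datasource/...); the "not reachable in this region" wording is misleading. A small sketch to look up the correct ARN (the account ID is a placeholder):

import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

# List the QuickSight data sources in the account; the S3 data source behind
# the dataset shows up with Type 'S3'.
for ds in quicksight.list_data_sources(AwsAccountId='111111111111')['DataSources']:
    print(ds['Name'], ds['Type'], ds['Arn'])

# Use the printed arn:aws:quicksight:...:datasource/... value as the
# DataSourceArn in PhysicalTableMap, not the arn:aws:s3::: ARN.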
Hi @Leosandre - I faced the same error, and I suspect this is a bug. However, after I updated the manifest file, my dataset was updated despite the error. I believe it is better to create a case with the AWS support team.
That said, I am also tagging some experts to help with this. @ErikG @DylanM @duncan @David_Wong - Please help on this.
Regards - Sanjeeb