We are trying to mine data from the folder APIs. We can extract groups and users, but we are unable to extract the assets inside folders, i.e., datasets, analyses, and dashboards. We also want to output all relevant data from the APIs to S3 as Parquet files, partitioned by date, so that we have historical logs we can later append into tables or datasets. What is the easiest or best way to achieve this, especially finding the assets in folders?
When working with AWS QuickSight APIs to pull detailed information about folders and assets, it’s crucial to ensure that your data is clean and efficiently processed. Recently, I encountered a scenario where I needed to gather detailed information about QuickSight folders, including permissions and the assets (like dashboards, datasets, and analyses) contained within those folders. Here’s how I streamlined the data processing and saved only the relevant information to an S3 bucket.
The Challenge
In this project, the goal was to gather detailed information about QuickSight folders, including permissions and the assets contained within those folders. The task involved:
- Collecting and processing the necessary data efficiently.
- Ensuring that the final output was clean and saved to S3 without any unnecessary steps or complications.
The Step-by-Step Solution
Step 1: Data Collection
The first step was to gather all the necessary data from QuickSight:
- Listing Folders: Using the list_quicksight_folders function, I gathered all the folders in the QuickSight account.
- Listing Folder Members: For each folder, I collected the assets (like dashboards, datasets, etc.) using the list_folder_members function.
- Describing Permissions: Detailed permissions for each folder were pulled using describe_folder_permissions and describe_folder_resolved_permissions.
This approach ensured that I had all the raw data needed: folder IDs, folder names, permissions, and assets linked to each folder.
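For a quick sanity check before any processing, the raw pieces can be inspected directly. Here is a minimal sketch using the helper functions from the full code at the end of this post (the ARN in the comment is just an example of the shape; the account ID and asset ID are made up):

for folder in list_quicksight_folders():
    for member in list_folder_members(folder['FolderId']):
        # Each member is a MemberId/MemberArn pair, e.g.
        # arn:aws:quicksight:us-east-1:123456789012:dashboard/abc-123
        print(folder['Name'], member['MemberId'], member['MemberArn'])

The asset type is encoded right in the MemberArn, which is what makes the matching in Step 2 possible.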
Step 2: Processing the Data
Once the data was collected, the next step was to process it:
- Matching Asset Types: I created a helper function to match each asset (dashboards, analyses, datasets) to its corresponding folder by extracting the asset type and ID from the folder members.
- Simplifying Ownership: Instead of keeping detailed permission types, I used a simple “Y” or “N” flag to indicate whether the user had ownership permissions for the folder.
This processing ensured that all relevant data was accurately captured without adding unnecessary complexity.
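To make the matching concrete, here is a minimal self-contained sketch (the account ID and asset ID are made up) of the same ARN-splitting logic the extract_asset_type_and_id helper uses, plus the simplified ownership flag:

arn_prefix = "arn:aws:quicksight:us-east-1:123456789012:"  # made-up account
member_arn = arn_prefix + "dashboard/abc-123"              # hypothetical member ARN
asset_type, asset_id = member_arn.replace(arn_prefix, "").split('/')
# asset_type == 'dashboard', asset_id == 'abc-123'

# Ownership reduced to a Y/N flag based on the folder permission's actions
permission = {'Actions': ['quicksight:CreateFolder', 'quicksight:DescribeFolder']}  # sample shape
is_owner = "Y" if "quicksight:CreateFolder" in permission['Actions'] else "N"       # -> "Y"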
Step 3: Creating a Clean DataFrame
As the data was being appended to the list (folder_details), I ensured that only the necessary fields were included. This streamlined approach meant that the DataFrame created from this list was clean and contained only the relevant information.
Step 4: Saving the Data to S3
Finally, I saved the processed data to S3:
- Creating a Clean DataFrame: The list of processed data (folder_details) was converted to a Pandas DataFrame. Since I carefully managed which fields were included, the DataFrame was clean and only contained the necessary information.
- Saving to S3: The final DataFrame was then saved as a Parquet file in the specified S3 bucket, using the current date as part of the filename.
Example: Key Parts of the Solution
Here’s how the main parts of the solution came together:
1. Collecting Folder Details:
folder_details.append({
    'FolderId': folder_id,
    'FolderName': folder['Name'],
    'FolderType': folder['FolderType'],
    'Email': email,
    'Alias': alias,
    'GroupName': group_name,
    'is_owner': "Y" if "quicksight:CreateFolder" in permission['Actions'] else "N",
    'AnalysisName': analysis_name,
    'AnalysisId': analysis_id,
    'DashboardName': dashboard_name,
    'DashboardId': dashboard_id,
    'DatasetName': dataset_name,
    'DatasetId': dataset_id,
    'SnapshotDate': snapshot_date
})
In this step, I carefully selected which fields to include. This ensured that only the relevant data was captured and prepared for saving.
2. Writing the Data to S3:
df = pd.DataFrame(folder_details)
if not df.empty:
    date = datetime.now().strftime('%Y-%m-%d')
    output_path = f's3://{s3_bucket}/{s3_prefix}/{date}.parquet'
    df.to_parquet(output_path)
    print(f"Data saved to {output_path}")
else:
    print("No data to save.")
Here, I converted the clean list into a DataFrame and saved it to S3, ensuring that the output was exactly what was needed, with no extra or unnecessary data.
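One practical note on the write itself: calling to_parquet against an s3:// path requires a Parquet engine (pyarrow or fastparquet) plus s3fs to be installed, since pandas hands the S3 I/O to fsspec. If you need to pass credentials explicitly rather than relying on the default credential chain, storage_options can carry s3fs arguments; the profile name below is a placeholder:

df.to_parquet(
    output_path,
    index=False,                                  # don't persist the RangeIndex as a column
    storage_options={'profile': 'your-profile'}   # optional; forwarded to s3fs
)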
Conclusion
By following this streamlined approach, I was able to process QuickSight data efficiently, match it correctly to relevant assets, and ensure the final output was clean and accurate. The key takeaway here is that careful management of the data during collection and processing can significantly simplify the final output, ensuring that only the necessary information is saved.
This method can be applied to various data processing scenarios, especially when working with large datasets and complex API integrations. By focusing on clean data handling from the outset, you can save time, reduce complexity, and achieve more reliable results.
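Coming back to the last part of the original question, appending the daily snapshots into tables or datasets later: because every run writes one date-stamped Parquet file under the same prefix, and every row carries SnapshotDate, rebuilding the full history is a matter of globbing the prefix and concatenating. A rough sketch, assuming s3fs is installed and the same bucket and prefix variables:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
# Gather every daily snapshot written under the prefix
paths = fs.glob(f"{s3_bucket}/{s3_prefix}/*.parquet")
# Stack them into one historical table; SnapshotDate distinguishes the runs
history = pd.concat((pd.read_parquet(f"s3://{p}") for p in paths), ignore_index=True)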
The Final Code
Here’s the complete code that implements the solution described above:
import boto3
import re
import pandas as pd
from datetime import datetime

# Initialize the QuickSight client
region = 'your-region'
quicksight_client = boto3.client('quicksight', region_name=region)

# Set your AWS Account ID and S3 bucket
aws_account_id = 'your-account-id'
s3_bucket = 'your-s3-bucket'
s3_prefix = 'qs_data/folder_metadata'

# ARN prefix used to strip the boilerplate off member ARNs
arn_prefix = f"arn:aws:quicksight:{region}:{aws_account_id}:"
# Function to list all QuickSight folders
def list_quicksight_folders():
    folders = []
    next_token = None
    while True:
        request_params = {
            'AwsAccountId': aws_account_id,
        }
        if next_token:
            request_params['NextToken'] = next_token
        response = quicksight_client.list_folders(**request_params)
        folders.extend(response.get('FolderSummaryList', []))
        next_token = response.get('NextToken')
        if not next_token:
            break
    return folders
# Function to list folder members
def list_folder_members(folder_id):
    members = []
    next_token = None
    while True:
        request_params = {
            'AwsAccountId': aws_account_id,
            'FolderId': folder_id
        }
        if next_token:
            request_params['NextToken'] = next_token
        response = quicksight_client.list_folder_members(**request_params)
        members.extend(response.get('FolderMemberList', []))
        next_token = response.get('NextToken')
        if not next_token:
            break
    return members
# Function to describe folder permissions
def describe_folder_permissions(folder_id):
    response = quicksight_client.describe_folder_permissions(
        AwsAccountId=aws_account_id,
        FolderId=folder_id
    )
    return response.get('Permissions', [])

# Function to describe folder resolved permissions
def describe_folder_resolved_permissions(folder_id):
    response = quicksight_client.describe_folder_resolved_permissions(
        AwsAccountId=aws_account_id,
        FolderId=folder_id
    )
    return response.get('Permissions', [])
# Helper function to extract email, alias, and group name from the Principal ARN
def extract_email_alias_and_group(principal_arn):
    if "group" in principal_arn.lower():
        # Extract the group name (everything after the last '/')
        group_name_match = re.search(r'.*/(.*)', principal_arn)
        group_name = group_name_match.group(1) if group_name_match else None
        return None, None, group_name
    elif "Fleet+P-" in principal_arn:
        # Extract the alias after "Fleet+P-"
        alias_match = re.search(r'Fleet\+P-(.*)', principal_arn)
        alias = alias_match.group(1) if alias_match else None
        return None, alias, None
    else:
        # Extract email and alias
        email_match = re.search(r'.*/(.*)', principal_arn)
        if email_match:
            email = email_match.group(1)
            alias = email.split('@')[0] if '@' in email else None
            return email, alias, None
        return None, None, None
# Helper function to extract asset type and ID from a member ARN
def extract_asset_type_and_id(member_arn, arn_prefix):
    arn_suffix = member_arn.replace(arn_prefix, "")
    arn_parts = arn_suffix.split('/')
    asset_type = arn_parts[0]  # 'dashboard', 'analysis', 'dataset', or 'datasource'
    asset_id = arn_parts[1]    # The asset ID
    return asset_type, asset_id
# Function to get asset details (ID -> name lookups for each asset type)
def get_asset_details():
    dashboards = {}
    datasets = {}
    analyses = {}

    # Fetch all dashboards
    next_token = None
    while True:
        params = {'AwsAccountId': aws_account_id}
        if next_token:
            params['NextToken'] = next_token
        response = quicksight_client.list_dashboards(**params)
        dashboards.update({item['DashboardId']: item['Name'] for item in response.get('DashboardSummaryList', [])})
        next_token = response.get('NextToken')
        if not next_token:
            break

    # Fetch all datasets
    next_token = None
    while True:
        params = {'AwsAccountId': aws_account_id}
        if next_token:
            params['NextToken'] = next_token
        response = quicksight_client.list_data_sets(**params)
        datasets.update({item['DataSetId']: item['Name'] for item in response.get('DataSetSummaries', [])})
        next_token = response.get('NextToken')
        if not next_token:
            break

    # Fetch all analyses
    next_token = None
    while True:
        params = {'AwsAccountId': aws_account_id}
        if next_token:
            params['NextToken'] = next_token
        response = quicksight_client.list_analyses(**params)
        analyses.update({item['AnalysisId']: item['Name'] for item in response.get('AnalysisSummaryList', [])})
        next_token = response.get('NextToken')
        if not next_token:
            break

    return dashboards, datasets, analyses
# Main function to get all folder details
def get_folder_details():
    dashboards, datasets, analyses = get_asset_details()
    folders = list_quicksight_folders()
    folder_details = []
    snapshot_date = datetime.now().strftime('%Y-%m-%d')  # Today's date as the snapshot date

    for folder in folders:
        folder_id = folder['FolderId']
        permissions = describe_folder_permissions(folder_id)
        resolved_permissions = describe_folder_resolved_permissions(folder_id)
        members = list_folder_members(folder_id)

        # Process standard permissions
        for permission in permissions:
            principal = permission['Principal']
            email, alias, group_name = extract_email_alias_and_group(principal)
            # Ensure alias is populated
            if alias is None and email is not None:
                alias = email.split('@')[0]

            # Initialize asset columns with None
            analysis_name = None
            analysis_id = None
            dashboard_name = None
            dashboard_id = None
            dataset_name = None
            dataset_id = None

            # Match folder members to assets. Note: since each asset type has a
            # single pair of columns per row, if a folder holds several assets
            # of the same type, only the last one encountered is kept here.
            for member in members:
                member_arn = member.get('MemberArn')
                asset_type, asset_id = extract_asset_type_and_id(member_arn, arn_prefix)
                if asset_type == 'analysis':
                    analysis_name = analyses.get(asset_id)
                    analysis_id = asset_id
                elif asset_type == 'dashboard':
                    dashboard_name = dashboards.get(asset_id)
                    dashboard_id = asset_id
                elif asset_type == 'dataset':
                    dataset_name = datasets.get(asset_id)
                    dataset_id = asset_id

            folder_details.append({
                'FolderId': folder_id,
                'FolderName': folder['Name'],
                'FolderType': folder['FolderType'],
                'Email': email,
                'Alias': alias,
                'GroupName': group_name,
                'is_owner': "Y" if "quicksight:CreateFolder" in permission['Actions'] else "N",
                'AnalysisName': analysis_name,
                'AnalysisId': analysis_id,
                'DashboardName': dashboard_name,
                'DashboardId': dashboard_id,
                'DatasetName': dataset_name,
                'DatasetId': dataset_id,
                'SnapshotDate': snapshot_date  # Snapshot date as the last column
            })

        # Process resolved permissions the same way
        for resolved_permission in resolved_permissions:
            principal = resolved_permission['Principal']
            email, alias, group_name = extract_email_alias_and_group(principal)
            # Ensure alias is populated
            if alias is None and email is not None:
                alias = email.split('@')[0]

            # Initialize asset columns with None
            analysis_name = None
            analysis_id = None
            dashboard_name = None
            dashboard_id = None
            dataset_name = None
            dataset_id = None

            for member in members:
                member_arn = member.get('MemberArn')
                asset_type, asset_id = extract_asset_type_and_id(member_arn, arn_prefix)
                if asset_type == 'analysis':
                    analysis_name = analyses.get(asset_id)
                    analysis_id = asset_id
                elif asset_type == 'dashboard':
                    dashboard_name = dashboards.get(asset_id)
                    dashboard_id = asset_id
                elif asset_type == 'dataset':
                    dataset_name = datasets.get(asset_id)
                    dataset_id = asset_id

            folder_details.append({
                'FolderId': folder_id,
                'FolderName': folder['Name'],
                'FolderType': folder['FolderType'],
                'Email': email,
                'Alias': alias,
                'GroupName': group_name,
                'is_owner': "Y" if "quicksight:CreateFolder" in resolved_permission['Actions'] else "N",
                'AnalysisName': analysis_name,
                'AnalysisId': analysis_id,
                'DashboardName': dashboard_name,
                'DashboardId': dashboard_id,
                'DatasetName': dataset_name,
                'DatasetId': dataset_id,
                'SnapshotDate': snapshot_date  # Snapshot date as the last column
            })

    return folder_details
# Run the function and get all details
folder_details = get_folder_details()

# Convert to Pandas DataFrame and write to S3 as Parquet
df = pd.DataFrame(folder_details)
if not df.empty:
    date = datetime.now().strftime('%Y-%m-%d')
    output_path = f's3://{s3_bucket}/{s3_prefix}/{date}.parquet'
    df.to_parquet(output_path)
    print(f"Data saved to {output_path}")
else:
    print("No data to save.")