We are trying to mine data from the folder APIs. We can extract groups and users, but we are unable to extract the assets inside folders, i.e., datasets, analyses, and dashboards. We also want to output all relevant data from the APIs to S3 as Parquet files, partitioned by date, so that we have historical logs we can later append into tables or datasets. What is the easiest or best way to achieve this, especially finding the assets in folders?
When working with AWS QuickSight APIs to pull detailed information about folders and assets, it’s crucial to ensure that your data is clean and efficiently processed. Recently, I encountered a scenario where I needed to gather detailed information about QuickSight folders, including permissions and the assets (like dashboards, datasets, and analyses) contained within those folders. Here’s how I streamlined the data processing and saved only the relevant information to an S3 bucket.
The Challenge
In this project, the goal was to gather detailed information about QuickSight folders, including permissions and the assets contained within those folders. The task involved:
- Collecting and processing the necessary data efficiently.
- Ensuring that the final output was clean and saved to S3 without any unnecessary steps or complications.
The Step-by-Step Solution
Step 1: Data Collection
The first step was to gather all the necessary data from QuickSight:
- Listing Folders: Using the list_quicksight_folders function, I gathered all the folders in the QuickSight account.
- Listing Folder Members: For each folder, I collected the assets (like dashboards, datasets, etc.) using the list_folder_members function.
- Describing Permissions: Detailed permissions for each folder were pulled using describe_folder_permissions and describe_folder_resolved_permissions.
This approach ensured that I had all the raw data needed: folder IDs, folder names, permissions, and assets linked to each folder.
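For a quick sanity check before any processing, the raw pieces can be inspected directly. Here is a minimal sketch using the helper functions from the full code at the end of this post (the ARN in the comment is just an example of the shape; the account ID and asset ID are made up):

for folder in list_quicksight_folders():
    for member in list_folder_members(folder['FolderId']):
        # Each member is a MemberId/MemberArn pair, e.g.
        # arn:aws:quicksight:us-east-1:123456789012:dashboard/abc-123
        print(folder['Name'], member['MemberId'], member['MemberArn'])

The asset type is encoded right in the MemberArn, which is what makes the matching in Step 2 possible.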
Step 2: Processing the Data
Once the data was collected, the next step was to process it:
- Matching Asset Types: I created a helper function to match each asset (dashboards, analyses, datasets) to its corresponding folder by extracting the asset type and ID from the folder members.
- Simplifying Ownership: Instead of keeping detailed permission types, I used a simple “Y” or “N” flag to indicate whether the user had ownership permissions for the folder.
This processing ensured that all relevant data was accurately captured without adding unnecessary complexity.
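To make the matching concrete, here is a minimal self-contained sketch (the account ID and asset ID are made up) of the same ARN-splitting logic the extract_asset_type_and_id helper uses, plus the simplified ownership flag:

arn_prefix = "arn:aws:quicksight:us-east-1:123456789012:"  # made-up account
member_arn = arn_prefix + "dashboard/abc-123"              # hypothetical member ARN
asset_type, asset_id = member_arn.replace(arn_prefix, "").split('/')
# asset_type == 'dashboard', asset_id == 'abc-123'

# Ownership reduced to a Y/N flag based on the folder permission's actions
permission = {'Actions': ['quicksight:CreateFolder', 'quicksight:DescribeFolder']}  # sample shape
is_owner = "Y" if "quicksight:CreateFolder" in permission['Actions'] else "N"       # -> "Y"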
Step 3: Creating a Clean DataFrame
As the data was being appended to the list (folder_details), I ensured that only the necessary fields were included. This streamlined approach meant that the DataFrame created from this list was clean and contained only the relevant information.
Step 4: Saving the Data to S3
Finally, I saved the processed data to S3:
- Creating a Clean DataFrame: The list of processed data (folder_details) was converted to a Pandas DataFrame. Since I carefully managed which fields were included, the DataFrame was clean and only contained the necessary information.
- Saving to S3: The final DataFrame was then saved as a Parquet file in the specified S3 bucket, using the current date as part of the filename.
Example: Key Parts of the Solution
Here’s how the main parts of the solution came together:
1. Collecting Folder Details:
folder_details.append({
    'FolderId': folder_id,
    'FolderName': folder['Name'],
    'FolderType': folder['FolderType'],
    'Email': email,
    'Alias': alias,
    'GroupName': group_name,
    'is_owner': "Y" if "quicksight:CreateFolder" in permission['Actions'] else "N",
    'AnalysisName': analysis_name,
    'AnalysisId': analysis_id,
    'DashboardName': dashboard_name,
    'DashboardId': dashboard_id,
    'DatasetName': dataset_name,
    'DatasetId': dataset_id,
    'SnapshotDate': snapshot_date
})
In this step, I carefully selected which fields to include. This ensured that only the relevant data was captured and prepared for saving.
2. Writing the Data to S3:
df = pd.DataFrame(folder_details)
if not df.empty:
    date = datetime.now().strftime('%Y-%m-%d')
    output_path = f's3://{s3_bucket}/{s3_prefix}/{date}.parquet'
    df.to_parquet(output_path)
    print(f"Data saved to {output_path}")
else:
    print("No data to save.")
Here, I converted the clean list into a DataFrame and saved it to S3, ensuring that the output was exactly what was needed, with no extra or unnecessary data.
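One practical note on the write itself: calling to_parquet against an s3:// path requires a Parquet engine (pyarrow or fastparquet) plus s3fs to be installed, since pandas hands the S3 I/O to fsspec. If you need to pass credentials explicitly rather than relying on the default credential chain, storage_options can carry s3fs arguments; the profile name below is a placeholder:

df.to_parquet(
    output_path,
    index=False,                                  # don't persist the RangeIndex as a column
    storage_options={'profile': 'your-profile'}   # optional; forwarded to s3fs
)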
Conclusion
By following this streamlined approach, I was able to process QuickSight data efficiently, match it correctly to relevant assets, and ensure the final output was clean and accurate. The key takeaway here is that careful management of the data during collection and processing can significantly simplify the final output, ensuring that only the necessary information is saved.
This method can be applied to various data processing scenarios, especially when working with large datasets and complex API integrations. By focusing on clean data handling from the outset, you can save time, reduce complexity, and achieve more reliable results.
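Coming back to the last part of the original question, appending the daily snapshots into tables or datasets later: because every run writes one date-stamped Parquet file under the same prefix, and every row carries SnapshotDate, rebuilding the full history is a matter of globbing the prefix and concatenating. A rough sketch, assuming s3fs is installed and the same bucket and prefix variables:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
# Gather every daily snapshot written under the prefix
paths = fs.glob(f"{s3_bucket}/{s3_prefix}/*.parquet")
# Stack them into one historical table; SnapshotDate distinguishes the runs
history = pd.concat((pd.read_parquet(f"s3://{p}") for p in paths), ignore_index=True)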
The Final Code
Here’s the complete code that implements the solution described above:
import boto3
import re
import pandas as pd
from datetime import datetime

# Initialize the QuickSight client
region = 'your-region'
quicksight_client = boto3.client('quicksight', region_name=region)

# Set your AWS Account ID and S3 bucket
aws_account_id = 'your-account-id'
s3_bucket = 'your-s3-bucket'
s3_prefix = 'qs_data/folder_metadata'

# ARN prefix used to strip the boilerplate off member ARNs
arn_prefix = f"arn:aws:quicksight:{region}:{aws_account_id}:"
# Function to list all QuickSight folders
def list_quicksight_folders():
    folders = []
    next_token = None
    while True:
        request_params = {
            'AwsAccountId': aws_account_id,
        }
        if next_token:
            request_params['NextToken'] = next_token
        response = quicksight_client.list_folders(**request_params)
        folders.extend(response.get('FolderSummaryList', []))
        next_token = response.get('NextToken')
        if not next_token:
            break
    return folders
# Function to list folder members
def list_folder_members(folder_id):
    members = []
    next_token = None
    while True:
        request_params = {
            'AwsAccountId': aws_account_id,
            'FolderId': folder_id
        }
        if next_token:
            request_params['NextToken'] = next_token
        response = quicksight_client.list_folder_members(**request_params)
        members.extend(response.get('FolderMemberList', []))
        next_token = response.get('NextToken')
        if not next_token:
            break
    return members
# Function to describe folder permissions
def describe_folder_permissions(folder_id):
    response = quicksight_client.describe_folder_permissions(
        AwsAccountId=aws_account_id,
        FolderId=folder_id
    )
    return response.get('Permissions', [])

# Function to describe folder resolved permissions
def describe_folder_resolved_permissions(folder_id):
    response = quicksight_client.describe_folder_resolved_permissions(
        AwsAccountId=aws_account_id,
        FolderId=folder_id
    )
    return response.get('Permissions', [])
# Helper function to extract email, alias, and group name from the Principal ARN
def extract_email_alias_and_group(principal_arn):
    if "group" in principal_arn.lower():
        # Extract the group name (everything after the last '/')
        group_name_match = re.search(r'.*/(.*)', principal_arn)
        group_name = group_name_match.group(1) if group_name_match else None
        return None, None, group_name
    elif "Fleet+P-" in principal_arn:
        # Extract the alias after "Fleet+P-"
        alias_match = re.search(r'Fleet\+P-(.*)', principal_arn)
        alias = alias_match.group(1) if alias_match else None
        return None, alias, None
    else:
        # Extract email and alias
        email_match = re.search(r'.*/(.*)', principal_arn)
        if email_match:
            email = email_match.group(1)
            alias = email.split('@')[0] if '@' in email else None
            return email, alias, None
        return None, None, None
# Helper function to extract asset type and ID from a member ARN
def extract_asset_type_and_id(member_arn, arn_prefix):
    arn_suffix = member_arn.replace(arn_prefix, "")
    arn_parts = arn_suffix.split('/')
    asset_type = arn_parts[0]  # 'dashboard', 'analysis', 'dataset', or 'datasource'
    asset_id = arn_parts[1]    # The asset ID
    return asset_type, asset_id
# Function to get asset details (ID -> name lookups for each asset type)
def get_asset_details():
    dashboards = {}
    datasets = {}
    analyses = {}

    # Fetch all dashboards
    next_token = None
    while True:
        params = {'AwsAccountId': aws_account_id}
        if next_token:
            params['NextToken'] = next_token
        response = quicksight_client.list_dashboards(**params)
        dashboards.update({item['DashboardId']: item['Name'] for item in response.get('DashboardSummaryList', [])})
        next_token = response.get('NextToken')
        if not next_token:
            break

    # Fetch all datasets
    next_token = None
    while True:
        params = {'AwsAccountId': aws_account_id}
        if next_token:
            params['NextToken'] = next_token
        response = quicksight_client.list_data_sets(**params)
        datasets.update({item['DataSetId']: item['Name'] for item in response.get('DataSetSummaries', [])})
        next_token = response.get('NextToken')
        if not next_token:
            break

    # Fetch all analyses
    next_token = None
    while True:
        params = {'AwsAccountId': aws_account_id}
        if next_token:
            params['NextToken'] = next_token
        response = quicksight_client.list_analyses(**params)
        analyses.update({item['AnalysisId']: item['Name'] for item in response.get('AnalysisSummaryList', [])})
        next_token = response.get('NextToken')
        if not next_token:
            break

    return dashboards, datasets, analyses
# Main function to get all folder details
def get_folder_details():
    dashboards, datasets, analyses = get_asset_details()
    folders = list_quicksight_folders()
    folder_details = []
    snapshot_date = datetime.now().strftime('%Y-%m-%d')  # Today's date as the snapshot date

    for folder in folders:
        folder_id = folder['FolderId']
        permissions = describe_folder_permissions(folder_id)
        resolved_permissions = describe_folder_resolved_permissions(folder_id)
        members = list_folder_members(folder_id)

        # Process standard permissions
        for permission in permissions:
            principal = permission['Principal']
            email, alias, group_name = extract_email_alias_and_group(principal)
            # Ensure alias is populated
            if alias is None and email is not None:
                alias = email.split('@')[0]

            # Initialize asset columns with None
            analysis_name = None
            analysis_id = None
            dashboard_name = None
            dashboard_id = None
            dataset_name = None
            dataset_id = None

            # Match folder members to assets. Note: since each asset type has a
            # single pair of columns per row, if a folder holds several assets
            # of the same type, only the last one encountered is kept here.
            for member in members:
                member_arn = member.get('MemberArn')
                asset_type, asset_id = extract_asset_type_and_id(member_arn, arn_prefix)
                if asset_type == 'analysis':
                    analysis_name = analyses.get(asset_id)
                    analysis_id = asset_id
                elif asset_type == 'dashboard':
                    dashboard_name = dashboards.get(asset_id)
                    dashboard_id = asset_id
                elif asset_type == 'dataset':
                    dataset_name = datasets.get(asset_id)
                    dataset_id = asset_id

            folder_details.append({
                'FolderId': folder_id,
                'FolderName': folder['Name'],
                'FolderType': folder['FolderType'],
                'Email': email,
                'Alias': alias,
                'GroupName': group_name,
                'is_owner': "Y" if "quicksight:CreateFolder" in permission['Actions'] else "N",
                'AnalysisName': analysis_name,
                'AnalysisId': analysis_id,
                'DashboardName': dashboard_name,
                'DashboardId': dashboard_id,
                'DatasetName': dataset_name,
                'DatasetId': dataset_id,
                'SnapshotDate': snapshot_date  # Snapshot date as the last column
            })

        # Process resolved permissions the same way
        for resolved_permission in resolved_permissions:
            principal = resolved_permission['Principal']
            email, alias, group_name = extract_email_alias_and_group(principal)
            # Ensure alias is populated
            if alias is None and email is not None:
                alias = email.split('@')[0]

            # Initialize asset columns with None
            analysis_name = None
            analysis_id = None
            dashboard_name = None
            dashboard_id = None
            dataset_name = None
            dataset_id = None

            for member in members:
                member_arn = member.get('MemberArn')
                asset_type, asset_id = extract_asset_type_and_id(member_arn, arn_prefix)
                if asset_type == 'analysis':
                    analysis_name = analyses.get(asset_id)
                    analysis_id = asset_id
                elif asset_type == 'dashboard':
                    dashboard_name = dashboards.get(asset_id)
                    dashboard_id = asset_id
                elif asset_type == 'dataset':
                    dataset_name = datasets.get(asset_id)
                    dataset_id = asset_id

            folder_details.append({
                'FolderId': folder_id,
                'FolderName': folder['Name'],
                'FolderType': folder['FolderType'],
                'Email': email,
                'Alias': alias,
                'GroupName': group_name,
                'is_owner': "Y" if "quicksight:CreateFolder" in resolved_permission['Actions'] else "N",
                'AnalysisName': analysis_name,
                'AnalysisId': analysis_id,
                'DashboardName': dashboard_name,
                'DashboardId': dashboard_id,
                'DatasetName': dataset_name,
                'DatasetId': dataset_id,
                'SnapshotDate': snapshot_date  # Snapshot date as the last column
            })

    return folder_details
# Run the function and get all details
folder_details = get_folder_details()

# Convert to Pandas DataFrame and write to S3 as Parquet
df = pd.DataFrame(folder_details)
if not df.empty:
    date = datetime.now().strftime('%Y-%m-%d')
    output_path = f's3://{s3_bucket}/{s3_prefix}/{date}.parquet'
    df.to_parquet(output_path)
    print(f"Data saved to {output_path}")
else:
    print("No data to save.")