Amazon Athena is a serverless query engine that lets you query data in Amazon S3 using standard SQL. You can also use a query's results to create another table, which can then serve as the source for further analytics or reporting. For example:
- You can use Athena to perform feature engineering and create new features for model building
- You can use Athena to create datasets that form the input to QuickSight dashboards
In the spirit of creating everything as code, in this blog post I show you how to create and delete an Athena table using Python, specifically with the PyAthena library.
IAM Role to create/delete Athena tables via Python
Before we run the Python code, we need to create an IAM role that has permission to create and delete Athena tables. Any code that you run needs to be able to assume this role. For example, if you are running the code from a SageMaker notebook, then the notebook's execution role will need the permissions in the policy below.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "athena:ListDataCatalogs",
                "glue:CreateTable",
                "glue:GetPartitions",
                "logs:DescribeLogStreams",
                "glue:DeleteTable",
                "athena:ListWorkGroups",
                "glue:GetDatabases",
                "logs:CreateLogGroup",
                "logs:PutLogEvents",
                "glue:GetTable",
                "glue:GetDatabase",
                "logs:CreateLogStream"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::[bucket_name]",
                "arn:aws:s3:::[bucket_name]/*"
            ]
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": "athena:*",
            "Resource": [
                "arn:aws:athena:*:*:workgroup/*",
                "arn:aws:athena:*:*:datacatalog/[databasename]"
            ]
        }
    ]
}
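A stray comma or similar typo makes a policy document invalid, so it can be worth sanity-checking the JSON in Python before attaching it to the role. A minimal sketch, with the policy abbreviated to just the S3 statement and `[bucket_name]` kept as a placeholder you would substitute:

```python
import json

# Abbreviated policy document; [bucket_name] is a placeholder.
policy_text = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::[bucket_name]",
                "arn:aws:s3:::[bucket_name]/*"
            ]
        }
    ]
}
"""

# json.loads raises an error on malformed JSON (e.g. a trailing comma),
# so this fails fast before the policy ever reaches IAM.
policy = json.loads(policy_text)
print(policy["Statement"][0]["Sid"])  # VisualEditor1
```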
Create/Drop Athena table using Python in SageMaker
To create or drop an Athena table using Python in SageMaker, use the code below. Ensure that the SageMaker notebook's execution role has the above policy attached.
# Install PyAthena
import sys
!{sys.executable} -m pip install PyAthena
from pyathena import connect
import pandas as pd

# Open a connection and get a cursor; s3_staging_dir is where Athena
# writes query results.
cursor = connect(s3_staging_dir='s3://[bucket_name]',
                 region_name='us-east-1').cursor()

# Drop the table if it already exists, then recreate it with CTAS
cursor.execute("drop table if exists db.tableC")
cursor.execute("""
create table db.tableC as
SELECT A.col1, B.col2, B.col3
FROM db.tableA as A, db.tableB as B
WHERE A.id = B.id
""")
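Because this drop-then-create pattern tends to be repeated for every derived table, it can help to factor the SQL generation into small functions that you can unit-test without an AWS connection. A minimal sketch, assuming illustrative table and column names rather than anything specific to your schema:

```python
def drop_table_sql(table: str) -> str:
    """Build an idempotent DROP statement for an Athena table."""
    return f"DROP TABLE IF EXISTS {table}"

def ctas_sql(target: str, select_query: str) -> str:
    """Build a CREATE TABLE AS SELECT (CTAS) statement."""
    return f"CREATE TABLE {target} AS\n{select_query}"

# Example SELECT to wrap in a CTAS statement
select = (
    "SELECT A.col1, B.col2, B.col3\n"
    "FROM db.tableA AS A\n"
    "JOIN db.tableB AS B ON A.id = B.id"
)

print(drop_table_sql("db.tableC"))
print(ctas_sql("db.tableC", select))
```

Each generated statement can then be passed to the PyAthena cursor's `execute()` method in the same way as the inline SQL above.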