Daniel Lacko · 12.07.2021 · 10 minutes
Whenever you are generating a lot of data, a natural question comes up: how do you process it? Do you need to set up a whole infrastructure and create a new full-fledged database? Perhaps you do not need to actively work with the data at all; you just need to access it occasionally and pull out the necessary information. AWS Athena provides a solution for the latter case.
As the headline suggests, AWS Athena is part of the AWS analytics toolset. It lets you create ad hoc databases or tables and run queries on them. All it needs is data in the form of CSV, JSON, or a columnar format such as Apache Parquet or ORC.
Query services like Amazon Athena make it easy to run interactive queries against data directly in Amazon S3 without worrying about formatting data or managing infrastructure. For example, Athena is great if you just need to run a quick query on logs to help you debug an application. You just define a table for your data and start querying using standard SQL.
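To give a feel for how little setup this takes, here is a minimal sketch of defining a table over CSV files in S3 and querying it. The bucket path, table name, and columns are made up for illustration, so adjust them to your own data.

-- Define an external table over CSV files sitting in S3 (hypothetical bucket and columns)
CREATE EXTERNAL TABLE app_logs (
  request_time string,
  status_code  int,
  message      string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/app-logs/';

-- Query it with standard SQL right away
SELECT status_code, count(*) AS hits
FROM app_logs
GROUP BY status_code;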
It is not advisable to use Athena for anything at the production, business, or enterprise level. For such cases, there are dedicated services such as Amazon Redshift or EMR.
In short, AWS Glue can extract, transform, and load data. It can be used to create a crawler that infers database and table schemas from your data in Amazon S3. The inferred schema is then stored in the AWS Glue Data Catalog. Athena can access this catalog, load the databases and tables, and run queries on them. AWS Glue is already integrated into AWS Athena. A tutorial on how to add a data source to Athena can be found here.
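Once a crawler has populated the Data Catalog, the crawled tables show up in Athena like any other. As a rough illustration, with a hypothetical database and table created by a crawler, you query them by database and table name:

-- glue_crawled_db and orders are placeholder names for crawler-created objects
SELECT *
FROM glue_crawled_db.orders
LIMIT 10;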
Athena can fetch data from other AWS services thanks to these integrations. For example, if you have AWS CloudTrail running, you can create a table from CloudTrail -> Event History with just one click. Afterward, you can run queries on CloudTrail logs in the Athena query editor. Some other useful Athena integrations are:
You can read the full list of integrations here.
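If you prefer to set the CloudTrail table up by hand rather than using the one-click integration, the table the console generates looks roughly like the abridged sketch below. Only the columns used later in this article are listed, and the bucket path is a placeholder; the AWS documentation has the full column list.

-- Abridged sketch of a CloudTrail table definition; LOCATION is a placeholder path
CREATE EXTERNAL TABLE cloudtrail_logs_aws_dl_tutorial_bucket (
  eventversion string,
  useridentity struct<type: string, principalid: string, arn: string,
                      accountid: string, username: string>,
  eventtime    string,
  eventsource  string,
  eventname    string
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://aws-dl-tutorial-bucket/AWSLogs/<account-id>/CloudTrail/';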
The interface is quite self-explanatory. On the left side, you can select the data source and database and see what tables are present. On the right side, there is a text editor for SQL queries and buttons to run or save them. After you run a query, the results show up under the text editor. Some tips:
CloudTrail is an AWS service responsible for logging user activity. The logs are stored in an S3 bucket, and you can view them via CloudTrail -> Event History. It has one downside though: Event History shows only the last 90 days of logs. One way of viewing all logs is using AWS Athena:
Provided that you arrived at point 6 above, we can start writing a query. Let's see which services have been used by the user intern_user:
SELECT DISTINCT eventsource FROM cloudtrail_logs_aws_dl_tutorial_bucket WHERE useridentity.username LIKE '%intern_user%'
We can now run the query and see results under the query editor:
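From here you can refine the query however you like. As one possible follow-up, assuming the same table and the standard eventsource and eventname columns, you could count how often the user touched each service and action:

-- Count the user's calls per service and API action, most frequent first
SELECT eventsource, eventname, count(*) AS calls
FROM cloudtrail_logs_aws_dl_tutorial_bucket
WHERE useridentity.username LIKE '%intern_user%'
GROUP BY eventsource, eventname
ORDER BY calls DESC;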
Payment is per query. More precisely, you are charged based on the amount of data scanned by each query. Pricing may differ per region. Examples:
It all comes down to the amount of data Athena has to read, so if you manage to reduce the size of your files, you can make some savings. If you compress files using GZIP, you might see up to 3:1 compression gains.
It can also be beneficial to load the data in more efficient formats. Columnar formats like Apache Parquet allow more efficient querying: they let Athena read only the columns specified by a query, which reduces the amount of data scanned.
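One convenient way to do the conversion is Athena's own CREATE TABLE AS SELECT. The sketch below assumes an existing CSV-backed table called events_csv and writes a compressed Parquet copy; the table names and the output location are placeholders.

-- Convert a CSV-backed table to compressed Parquet so queries scan less data
CREATE TABLE events_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-example-bucket/events-parquet/'
) AS
SELECT * FROM events_csv;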