24. August 2017 20:32
by Aaron Medacco

New Pluralsight Course: Getting Started with AWS Athena


After a few months of developing and recording content, my first Pluralsight course, Getting Started with AWS Athena, is live and published. A lot of work went into this, especially since I'd never recorded professional-quality video content before, so I'm relieved to finally cross the finish line. I never realized how much goes into producing an online course, which has given me a newfound respect for my fellow authors. 


Besides learning how to produce quality video content, I underestimated how much more I would learn about AWS and Athena. There's certainly a difference between knowing enough to solve your problem with AWS, and knowing enough to teach others how to solve theirs with it. 

For those interested in checking it out, you can find the course here. You'll need an active Pluralsight subscription; otherwise, you can start a free trial. If you work in technology, the value of a subscription is pretty crazy given the amount of content available.

The course is separated into 7 modules:

  1. Exploring AWS Athena

    Sets the stage for the course. I speak to the value proposition of Athena, why you would want to use it, its features, supported data formats, limitations, and pricing model. If you're someone who's unfamiliar with Athena, this module's designed to give you a primer.

  2. Establishing Access to Your Data

    Athena's not very useful if you can't invoke it. In this module, I show you how to upload data to S3 and configure a user account with Athena access within IAM. Many will find this a review, especially those practiced with Amazon Web Services, but it's a prerequisite before getting your hands dirty in Athena.

  3. Understanding AWS Athena Internals

    You never want to be in a place where things seem like magic. Here I address the technologies that operate underneath Athena, namely Apache Hive and the Presto SQL engine. If you've never used these tools to query data before, knowing what they are and how they fit within Athena is important. The only real barrier to entry for using Athena is the ability to write SQL, so I imagine a lot of users with no experience with big data technologies will be trying it out, and this module gives a small crash course to help offset that.

  4. Creating Databases & Tables to Define Your Schema

    We start getting our hands dirty in this module. We talk about what databases and tables are within Athena's catalog and how they compare to those of relational databases. This one's pretty hands-on heavy as I demonstrate how to construct tables correctly using both the console and third-party tools over a JDBC connection.

  5. Retrieving Information by Querying Tables with SQL

    Here we finally get to start eating our cake. I cover the Presto SQL engine in brief detail and show how easy it is to query S3 data using ANSI SQL. Athena's built to be easy to get started with, so by the end of this module, most will feel comfortable enough to start using Athena on their own datasets.

  6. Optimizing Cost and Performance Using Best Practices

    My favorite module of the course, tailored to those who want to get more performance and keep more of their money. I review what methods you can employ to improve query times and reduce the amount of data scanned. The 3 primary ways of doing this involve compression, columnar formats, and table partitioning. In a lot of cases, it's not as simple as "Just compress your data and win." or "Columnar formats are faster so just use that." and I talk about what factors are important when deciding on an optimization strategy for Athena workloads. I also demonstrate how you would transform data into a columnar format using Amazon Elastic MapReduce for those who may have never done it before.

  7. AWS Athena vs. Other Solutions

    Finally, I thought it would be interesting to discuss how Athena stacks up against other data services within the Amazon cloud. Knowing when to use each service is vital for anyone responsible for proposing solutions within AWS, so I felt some high-level, admittedly apples-to-oranges, comparisons would help steer viewers in the right direction.
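As an aside on the table partitioning technique covered in module 6: Athena (via Hive) can skip entire partitions when your S3 keys follow a `column=value` prefix convention. Here's a minimal local sketch of that layout; the `logs` table and its partition columns are just made-up examples:

```shell
# Hive-style partition layout; Athena maps each "column=value"
# prefix to a partition of the table (names here are hypothetical).
mkdir -p "logs/year=2017/month=07" "logs/year=2017/month=08"

# Each partition prefix holds its own data files. A query filtering
# on year/month only scans objects under the matching prefixes.
echo "id,msg" > "logs/year=2017/month=08/events.csv"

# Inspect the resulting layout.
find logs -type d | sort
```

Mirror that structure under your S3 bucket and register the partitions on your table, and queries that filter on the partition columns will scan far less data.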

Right, so go watch the course! And leave feedback! I want to keep improving the quality of the content I create, and comments are extremely helpful.


19. August 2017 19:11
by Aaron Medacco

No More Excuses for AWS S3 Bucket Leaks


You hear about it all the time: customers of Amazon Web Services storing sensitive information in their S3 buckets and leaking it to the world because of misconfiguration. Well, per one of the announcements at AWS Summit New York, there is no longer an excuse for misconfiguring an S3 bucket. AWS Config now has new managed rules that will evaluate your account for any S3 buckets allowing global read and/or write access. 


I won't regurgitate what's already been said on AWS's blog, which you can read here. AWS Config is a pretty easy service to set up. Just know that you'll be charged $2 for each rule you enable on your account, which shouldn't be a problem for any business or organization storing sensitive information in S3. 

You have no excuse anymore! Protect against your own incompetence, no matter how comfortable you are in AWS.


4. August 2017 01:44
by Aaron Medacco

Use Bzip2 Compression w/ AWS Athena


For those using Amazon Athena to query their S3 data, it's no secret that you can save money and boost performance by using compression and columnar data formats whenever possible. Currently, Amazon's documentation doesn't yet list Bzip2 as a supported compression format for Athena; however, it's absolutely supported.

This was confirmed in Jeff Barr's post on optimizing performance with Athena. There you can see that Bzip2 is a splittable compression format, which allows you to take advantage of multiple readers. Other compression formats that aren't splittable don't have this benefit, so it stands to reason you should use Bzip2 if you aren't using a columnar format such as ORC or Parquet.
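To get a feel for the savings, here's a quick experiment you can run locally. The data is synthetic and the file names are made up; the point is just that repetitive, log-like data shrinks dramatically under bzip2:

```shell
# Generate a repetitive CSV resembling typical log data (synthetic).
seq 1 10000 | awk '{print $1 ",user" $1 % 50 ",2017-08-04"}' > sample.csv

# Compress with bzip2, keeping the original (-k) for comparison.
bzip2 -k sample.csv

# Compare sizes; Athena bills by bytes scanned, so smaller is cheaper.
ls -l sample.csv sample.csv.bz2
```

Since Athena's pricing is based on the amount of data scanned per query, every byte you shave off with compression comes straight off your bill.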


In this post, I'll show how easy it is to compress your data into this format. Once that's done, just upload your data to S3, define a schema within the Athena catalog that points to the location of the compressed files, and query away.

For Windows users:

  1. Download 7-zip here. Once installed, you'll be able to invoke 7-zip from File Explorer by right-clicking on files > 7-Zip > Add to archive....
  2. Select "bzip2" as the Archive format with "BZip2" as the Compression method: 

    (Screenshot: Compress To Bzip2 archive settings)
  3. Click OK. 

For Linux users:

  1. Open a terminal and change your working directory to the directory containing the files you want to compress.
  2. Invoke the following command: 
    bzip2 file.csv
    If you want to compress multiple files, you can list them out:
    bzip2 file.csv file2.csv file3.csv
    For more information on other bzip2 options, check this out.
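Beyond the basic invocation above, a few bzip2 flags are worth knowing before you batch-compress anything important. The file names below are placeholders:

```shell
# Create a small sample file to work with.
printf 'id,name\n1,alice\n2,bob\n' > file.csv

# -k keeps the original instead of deleting it; -9 is maximum compression.
bzip2 -k -9 file.csv

# -t tests archive integrity without extracting anything.
bzip2 -t file.csv.bz2

# -d decompresses; adding -c writes to stdout and leaves the .bz2 intact.
bzip2 -d -c file.csv.bz2 > restored.csv
```

Note that without -k, bzip2 replaces the original file with the .bz2, so keep the flag handy while you're experimenting.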

Those of you interested in using the recommended columnar storage formats should check out the AWS documentation, which shows how you can spin up an EMR cluster to convert data to Parquet.


Copyright © 2016-2017 Aaron Medacco