
Data Lakehouses Unleashed: How AWS and Apache Iceberg are Changing the Game

An exploration of how to build a data lakehouse entirely in Amazon S3.
Ryan Farina | May 22, 2023
8 min read

When working with data lakes in AWS, the standard process has historically been to move that data into a warehouse: ingest data into S3, transform and catalog it, and then load it into Redshift, Snowflake, or another analytical warehouse of your choice. Recently, however, a newer approach has been growing in popularity: the lakehouse. How you build one differs depending on the tool of choice, but ultimately, a lakehouse is about combining your warehouse with your data lake. In many cases, you see tools like Snowflake (with external tables) and Redshift (with Spectrum) try to get closer to the source data to implement this. There is another approach: build a table format directly in S3 and use tools such as Athena to perform analysis there. This keeps your data platform as flexible as possible by leveraging S3 as the storage layer. In this post, we’ll walk through this approach and show how simple and straightforward getting started is.

Before we get into how one goes about building a lakehouse entirely in S3, we should take a moment to discuss what is meant by a table format. Text formats such as CSV, JSON, and XML hold raw data points with no guaranteed schema or metadata; a list of JSON entries, for example, carries no guaranteed relationship between entries, so to make any statement about the whole dataset you must inspect every data point. The next evolution is file formats. File formats embed metadata about a file's contents so that systems can make assumptions about the data simply by reading that metadata. These formats are considerably more complex and include, but are not limited to, Parquet, Avro, and ORC. They are far more efficient for analytical systems to process and understand, but they still fall short of what is needed for true analysis: with the files alone, we lack the greater context of how the collection of files relates. This is where table formats come into play. Table formats build on file formats to manage data, but they add metadata in the form of manifest files that tell analytical engines how all of the data relates, without exposing that detail to end users. Examples of table formats are Iceberg, Hudi, and Delta. In this article, we will focus our discussion on Iceberg specifically.

Iceberg is a table format that provides benefits such as the following:

  • ACID Compliant Transactions
  • Schema Evolution
  • Time Travel
  • Hidden Partitions
  • …and more

All of these benefits are attainable with just a few AWS services: Athena, S3, and a bit of Glue.

With all these benefits in mind, it is important to point out that a table format gives you the ability to build write-optimized solutions in Amazon S3. Write-optimized solutions focus on the ability to update, delete, and generally transform data at rest in S3. For solutions with significant read-optimization requirements, this approach would need to be extended with another tier of data focused on that objective.

Setting Things Up

Before we get started, we will need some data. For this example, we’ll use a dataset of flights from 2008; you may use any dataset you would like. Be aware that you will need to write out the dataset’s schema definition in full.

Once we have selected a dataset, we can begin building. To make the data accessible, we need to register it with Athena. There are two relatively easy ways to do this: the first is to point a Glue crawler at the data and let it identify the format and schema for us; the second is to create an external table that points directly at the raw data in S3. I personally find it easier to simply use a Glue crawler for this task.
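
If you do go the external-table route, a minimal sketch might look like the following. The bucket path, the assumption of CSV files with a header row, and the truncated column list are all illustrative; you would write out the full schema for your dataset.

create external table 
    flights (
        id bigint
        ,FlightYear int
        ,FlightMonth int
        ,DayofMonth int
        -- remaining columns omitted for brevity; declare every column in file order
        ,LateAircraftDelay bigint
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://your_bucket/flights_raw/'
    TBLPROPERTIES ( 'skip.header.line.count' = '1' );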

Once we have a table and schema available to us in Glue, we can now leverage Athena to query it. After navigating to Athena, ensure that your data is appropriately loaded and available by running a simple SELECT Athena query:

select * from flights;

You may also perform operations such as SELECT COUNT(*) in your query to validate the number of records.
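For example, a quick record-count check:

select count(*) from flights;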

Creating The Iceberg Table

With your raw data table set up and ready to go, we can now create the Iceberg table. To do this, we simply use a CREATE TABLE statement with the full schema definition of the table along with any other properties. The initial table we create will be empty, but Athena will take care of configuring the target S3 location for Iceberg. When creating the table, you provide the full schema and data types; the list of available data types can be found in the Athena documentation.

create table 
    demo.flights_iceberg (
        id bigint
        ,FlightYear int
        ,FlightMonth int
        ,DayofMonth int
        ,...
        ,LateAircraftDelay bigint
    )
    LOCATION 's3://your_bucket/flights/'
    TBLPROPERTIES ( 'table_type' = 'ICEBERG' );

Another important optimization to consider when creating these tables is how partitions will play a role in your query patterns. To partition your data, you add a PARTITIONED BY clause to the CREATE TABLE statement. For example, if I decide that flightyear, flightmonth, and dayofmonth are always used to query my data, I could partition the table like this:

create table 
    demo.flights_iceberg (
        id bigint
        ,FlightYear int
        ,FlightMonth int
        ,DayofMonth int
        ,...
        ,LateAircraftDelay bigint
    )
    PARTITIONED BY (flightyear, flightmonth, dayofmonth)
    LOCATION 's3://your_bucket/flights/'
    TBLPROPERTIES ( 'table_type' = 'ICEBERG' );

You may also apply partition transforms in the PARTITIONED BY clause; the full list of transforms can be found in the Athena documentation.
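
For instance, instead of partitioning down to the day, you could bucket on id at write time. A sketch of just the altered clause; the transform choice here is illustrative, not a recommendation for this dataset:

    PARTITIONED BY (flightyear, flightmonth, bucket(16, id))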

Loading Data

Now that we have an Iceberg table and our raw data in S3 is accessible from Athena, we can load the Iceberg table.

The most direct way to get this done for the initial load is to simply perform an INSERT INTO call.

insert into flights_iceberg
    select
        *
    from
        flights;

This should be pretty straightforward: flights is the table created by the Glue crawler, and flights_iceberg is the Iceberg table I created. This operation took me an average of around 30 seconds to load 240 MB of data.

All operations in this post were executed on both Athena engine version 2 and engine version 3. While working on this, I discovered that version 2 was very particular about data types for the insert; version 3 was capable of casting the values from one table into the other. Keep this in mind if you see errors while executing this operation.
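
If you do hit type errors on engine version 2, explicit casts in the select list are the first workaround I would try. A minimal sketch, assuming only a handful of the columns (columns left out of the target list are simply not populated):

insert into flights_iceberg (id, flightyear, flightmonth, dayofmonth)
    select
        cast(id as bigint)
        ,cast(FlightYear as int)
        ,cast(FlightMonth as int)
        ,cast(DayofMonth as int)
    from
        flights;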

One downside of the insert approach, of course, is that it is not intelligent. It will work for full loads and intentional inserts, but what about something that needs more nuance or control?

Two other valid ways to update information within an Iceberg table are to perform either an UPDATE, which is useful for changing every occurrence of a value in a table, or a MERGE.

MERGE INTO is certainly one of the more powerful features available; for full details, you can review Athena’s documentation. Performing an UPDATE can help batch-change certain values; the syntax details are covered there as well.
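
As a rough sketch of both, assuming a hypothetical flights_updates staging table keyed on id (that table and the columns chosen are illustrations, not part of the walkthrough above):

MERGE INTO flights_iceberg t
    USING flights_updates s
    ON (t.id = s.id)
    WHEN MATCHED THEN
        UPDATE SET LateAircraftDelay = s.LateAircraftDelay
    WHEN NOT MATCHED THEN
        INSERT (id, FlightYear, FlightMonth, DayofMonth, LateAircraftDelay)
        VALUES (s.id, s.FlightYear, s.FlightMonth, s.DayofMonth, s.LateAircraftDelay);

UPDATE flights_iceberg
    SET LateAircraftDelay = 0
    WHERE LateAircraftDelay IS NULL;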

Managing Tables

Now that we have the table available to us as an Iceberg table, we may want to perform some optimizations or evolutions. This is where Iceberg truly shines: not only does it provide the full capabilities of create, insert, update, and delete, but you can also perform operations such as ALTER TABLE to add or remove columns. Iceberg also contains certain features which aren’t fully implemented in Athena quite yet; to access these, you would need to use another interface such as Spark, Flink, or Hive. For a full breakdown of how to use the different interfaces, refer to Iceberg’s documentation.
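
For instance, adding and then removing a column from Athena might look like this (carrier_name is a hypothetical column used purely for illustration):

ALTER TABLE flights_iceberg ADD COLUMNS (carrier_name string);

ALTER TABLE flights_iceberg DROP COLUMN carrier_name;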

After you have created your table, you will want to regularly perform some maintenance operations. The two to note here are OPTIMIZE and VACUUM. OPTIMIZE compacts rewrites and deletes into better-sized data files based on your requested configuration. VACUUM, on the other hand, expires old snapshots and removes orphaned files. These processes help maintain the performance of your tables and should be executed with some regularity depending on your analytical needs; this can vary from daily to monthly.
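
A sketch of both statements, with the OPTIMIZE scoped to a single year purely as an example of limiting the rewrite:

OPTIMIZE flights_iceberg REWRITE DATA USING BIN_PACK
    WHERE flightyear = 2008;

VACUUM flights_iceberg;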

Conclusion

This process was very simple and very powerful. At this point, you have an Iceberg table that can be further optimized to improve your write performance and extend your data lake into a lakehouse in S3. Once you have this table, you can perform any and all of your analytical operations; however, you will still want to build materialized views, views, and other optimizations for read-heavy workloads. To clean up your environment, you can simply drop the Iceberg table with a DROP TABLE statement.

drop table flights_iceberg;

This operation will unregister the table from Glue and delete the files in S3.

Take the opportunity to unleash the full potential of your data infrastructure. Our experts are ready to help you get started building your own lakehouse on AWS today. Contact us for an initial discovery. Regardless of where you may be in the process, we are here to help!
