
Data Lakehouses Unleashed: How AWS and Apache Iceberg are Changing the Game

An exploration of how to build a data lakehouse entirely in Amazon S3.
Ryan Farina | May 22, 2023
8 min read

When working with data lakes in AWS, the standard process has historically been to move that data into a warehouse: ingest data into S3, transform and catalog it, and then load it into Redshift, Snowflake, or another analytical warehouse of your choice. Recently, however, a newer approach has been growing in popularity: the lakehouse. How you build one differs depending on the tool of choice, but ultimately, a lakehouse is about combining your warehouse with your data lake. In many cases, you see tools like Snowflake (with external tables) and Redshift (with Spectrum) try to get closer to the source data to implement this. There is another approach: build a table format directly in S3 and use tools such as Athena to perform analysis there. This keeps your data platform as flexible as possible by leveraging S3 as the storage layer. In this post, we’ll walk through this approach and show how simple and straightforward getting started is.

Before we get into how one goes about building a lakehouse entirely in S3, we should take a moment to discuss what is meant by a table format. Text formats such as CSV, JSON, and XML hold raw data points with no guaranteed schema or metadata; a list of JSON entries, for example, carries no guaranteed relationship between entries, so to make any statement about the whole dataset you must inspect every data point. The next evolution is file formats. File formats embed metadata about a file's contents so that systems can make assumptions about the data simply by reading that metadata. These formats are considerably more complex and include, but are not limited to, Parquet, Avro, and ORC. They are far more efficient for analytical systems to process and understand, but they still fall short of what is needed for true analysis: with the files alone, we lack the greater context of how the collection of files relates. This is where table formats come into play. Table formats build on file formats to manage data, but they add metadata in the form of manifest files that tell analytical engines how all of the data relates, without exposing that detail to end users. Examples of table formats are Iceberg, Hudi, and Delta. In this article, we will focus our discussion on Iceberg specifically.

Iceberg is a table format that provides benefits such as the following:

  • ACID Compliant Transactions
  • Schema Evolution
  • Time Travel
  • Hidden Partitions
  • …and more

All of these benefits are attainable with just a few AWS services: Athena, S3, and a bit of Glue.

With all these benefits in mind, it is important to point out that a table format gives you the ability to build write-optimized solutions in Amazon S3. Write-optimized solutions focus on the ability to update, delete, and generally transform data at rest in S3. For solutions with significant read-optimization requirements, this approach would need to be extended with another tier of data focused on that objective.

Setting Things Up

Before we get started, we will need some data. For this example, we’ll use a dataset of flights from 2008; you may use any dataset you would like. Be aware that you will need to write out the dataset’s schema definition in full.

Once we have selected a dataset, we can begin building. To make the data accessible, we need to register it with Athena. There are two relatively easy ways to do this: the first is to point a Glue crawler at the data and let it identify the format and schema for us; the second is to create an external table that points directly at the raw data in S3. I personally find it easier to simply use a Glue crawler for this task.
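
If you do go the external-table route, a minimal sketch might look like the following. The bucket path, the assumption of CSV files with a header row, and the truncated column list are all illustrative; you would write out the full schema for your dataset.

create external table 
    flights (
        id bigint
        ,FlightYear int
        ,FlightMonth int
        ,DayofMonth int
        -- remaining columns omitted for brevity; declare every column in file order
        ,LateAircraftDelay bigint
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://your_bucket/flights_raw/'
    TBLPROPERTIES ( 'skip.header.line.count' = '1' );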

Once we have a table and schema available to us in Glue, we can now leverage Athena to query it. After navigating to Athena, ensure that your data is appropriately loaded and available by running a simple SELECT Athena query:

select * from flights;

You may also perform operations such as SELECT COUNT(*) in your query to validate the number of records.
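For example, a quick record-count check:

select count(*) from flights;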

Creating The Iceberg Table

With your raw data table set up and ready to go, we can now create the Iceberg table. To do this, we simply use a CREATE TABLE statement with the full schema definition of the table along with any other properties. The initial table we create will be empty, but Athena will take care of configuring the target S3 location for Iceberg. When creating the table, you provide the full schema and data types; the list of available data types can be found in the Athena documentation.

create table 
    demo.flights_iceberg (
        id bigint
        ,FlightYear int
        ,FlightMonth int
        ,DayofMonth int
        ,...
        ,LateAircraftDelay bigint
    )
    LOCATION 's3://your_bucket/flights/'
    TBLPROPERTIES ( 'table_type' = 'ICEBERG' );

Another important optimization to consider when creating these tables is how partitions will play a role in your query patterns. To partition your data, you add a PARTITIONED BY clause to the CREATE TABLE statement. For example, if I decide that flightyear, flightmonth, and dayofmonth are always used to query my data, I could partition the table like this:

create table 
    demo.flights_iceberg (
        id bigint
        ,FlightYear int
        ,FlightMonth int
        ,DayofMonth int
        ,...
        ,LateAircraftDelay bigint
    )
    PARTITIONED BY (flightyear, flightmonth, dayofmonth)
    LOCATION 's3://your_bucket/flights/'
    TBLPROPERTIES ( 'table_type' = 'ICEBERG' );

You may also apply partition transforms in the PARTITIONED BY clause; the full list of transforms can be found in the Athena documentation.
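
For instance, instead of partitioning down to the day, you could bucket on id at write time. A sketch of just the altered clause; the transform choice here is illustrative, not a recommendation for this dataset:

    PARTITIONED BY (flightyear, flightmonth, bucket(16, id))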

Loading Data

Now that we have an Iceberg table and our raw data in S3 is accessible from Athena, we can load the Iceberg table.

The most direct way to get this done for the initial load is to simply perform an INSERT INTO call.

insert into flights_iceberg
    select
        *
    from
        flights;

This should be pretty straightforward: flights is the table created by the Glue crawler, and flights_iceberg is the Iceberg table I created. This operation took me an average of around 30 seconds to load 240 MB of data.

All operations in this post were executed on both Athena engine version 2 and engine version 3. While working on this, I discovered that version 2 was very particular about data types for the insert; version 3 was capable of casting the values from one table into the other. Keep this in mind if you see errors while executing this operation.
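
If you do hit type errors on engine version 2, explicit casts in the select list are the first workaround I would try. A minimal sketch, assuming only a handful of the columns (columns left out of the target list are simply not populated):

insert into flights_iceberg (id, flightyear, flightmonth, dayofmonth)
    select
        cast(id as bigint)
        ,cast(FlightYear as int)
        ,cast(FlightMonth as int)
        ,cast(DayofMonth as int)
    from
        flights;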

One downside of the insert approach, of course, is that it is not intelligent. It will work for full loads and intentional inserts, but what about something that needs more nuance or control?

Two other valid ways to update information within an Iceberg table are to perform either an UPDATE, which is useful for changing every occurrence of a value in a table, or a MERGE.

MERGE INTO is certainly one of the more powerful features available; for full details, you can review Athena’s documentation. Performing an UPDATE can help batch-change certain values; the syntax details are covered there as well.
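
As a rough sketch of both, assuming a hypothetical flights_updates staging table keyed on id (that table and the columns chosen are illustrations, not part of the walkthrough above):

MERGE INTO flights_iceberg t
    USING flights_updates s
    ON (t.id = s.id)
    WHEN MATCHED THEN
        UPDATE SET LateAircraftDelay = s.LateAircraftDelay
    WHEN NOT MATCHED THEN
        INSERT (id, FlightYear, FlightMonth, DayofMonth, LateAircraftDelay)
        VALUES (s.id, s.FlightYear, s.FlightMonth, s.DayofMonth, s.LateAircraftDelay);

UPDATE flights_iceberg
    SET LateAircraftDelay = 0
    WHERE LateAircraftDelay IS NULL;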

Managing Tables

Now that we have the table available to us as an Iceberg table, we may want to perform some optimizations or evolutions. This is where Iceberg truly shines: not only does it provide the full capabilities of create, insert, update, and delete, but you can also perform operations such as ALTER TABLE to add or remove columns. Iceberg also contains certain features which aren’t fully implemented in Athena quite yet; to access these, you would need to use another interface such as Spark, Flink, or Hive. For a full breakdown of how to use the different interfaces, refer to Iceberg’s documentation.
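
For instance, adding and then removing a column from Athena might look like this (carrier_name is a hypothetical column used purely for illustration):

ALTER TABLE flights_iceberg ADD COLUMNS (carrier_name string);

ALTER TABLE flights_iceberg DROP COLUMN carrier_name;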

After you have created your table, you will want to regularly perform some maintenance operations. The two to note here are OPTIMIZE and VACUUM. OPTIMIZE compacts rewrites and deletes into better-sized data files based on your requested configuration. VACUUM, on the other hand, expires old snapshots and removes orphaned files. These processes help maintain the performance of your tables and should be executed with some regularity depending on your analytical needs; this can vary from daily to monthly.
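
A sketch of both statements, with the OPTIMIZE scoped to a single year purely as an example of limiting the rewrite:

OPTIMIZE flights_iceberg REWRITE DATA USING BIN_PACK
    WHERE flightyear = 2008;

VACUUM flights_iceberg;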

Conclusion

This process was very simple and very powerful. At this point, you have an Iceberg table that can be further optimized to improve your write performance and extend your data lake into a lakehouse in S3. Once you have this table, you can perform any and all of your analytical operations; however, you will still want to build materialized views, views, and other optimizations for read-heavy workloads. To clean up your environment, you can simply drop the Iceberg table with a DROP TABLE statement.

drop table flights_iceberg;

This operation will unregister the table from Glue and delete the files in S3.

Take the opportunity to unleash the full potential of your data infrastructure. Our experts are ready to help you get started building your own lakehouse on AWS today. Contact us for an initial discovery. Regardless of where you may be in the process, we are here to help!
