Have you ever thought about the significance of a table format while organizing data, and how the right table format can make our lives much easier when working with it? We hear a lot about file formats like Avro, ORC, and Parquet, which are widely used to store data in the analytical world. In this article, we will look at why table formats matter and get started with Apache Iceberg.
Before we get started, let’s form a solid understanding of what a table format is and what it actually contains. A table is, at its core, a collection of files put together, and a table format is the way a table is structured to organize the underlying data across those files. Table formats allow us to interact with data in data lakes by querying, inserting, updating, and deleting it, using languages like SQL, Scala, Python, and Java, and engines like Spark and Flink to do the work.
Hive: A standardized table format
Hive, simply put, is a data warehouse that stores data on underlying filesystems like HDFS, Amazon S3, and Google Cloud Storage. Hive’s table format consists of a set of files stored in one or more directories: when data is queried in Hive, it searches these directory structures for the files it needs. But this approach struggles to meet the demands of high data volumes for analytical purposes, where speed plays a crucial role in delivering results, and that struggle led to the birth of table formats designed to scale up and make speed a habit.
Level-Up: Apache Iceberg
Apache Iceberg is an open-source table format for analytical datasets that exhibit the key characteristics of big data: volume, velocity, veracity, variety, and value. It makes heavy use of metadata to handle huge datasets: this metadata carries table-level information like the schema, the partitioning scheme, and the locations of the data files.
Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Hive, and Presto to work with the same tables at the same time.
Let’s try to understand the data reliability Iceberg provides. Iceberg was developed to solve correctness problems with data retrieved from Hive tables. Hive keeps both partition-level and table-level information in a central metastore, which makes atomic changes difficult: the file listings involved are so large that changes are hard to apply and to track back. Iceberg instead tracks the complete list of data files in each snapshot using a persistent tree structure. Each write or delete operation produces a new snapshot that reuses the previous snapshot’s metadata wherever possible, recording only the changes and avoiding high-volume metadata writes.
Apache Iceberg’s approach is to define the table through three categories of metadata. These categories are:
- “metadata files” that define the table itself: its schema, partition spec, and list of snapshots
- “manifest lists” that define which manifests make up a given snapshot of the table
- “manifests” that define groups of data files that may be part of one or more snapshots of the table
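These layers can be inspected directly: Iceberg exposes metadata tables alongside each table. As a sketch, assuming a Spark session with an Iceberg catalog named `local` and a table `db.orders` (both hypothetical names), queries like the following show the snapshot and manifest layers:

```sql
-- Hypothetical catalog `local` and table `db.orders`
SELECT * FROM local.db.orders.snapshots;   -- one row per table snapshot
SELECT * FROM local.db.orders.manifests;   -- manifests referenced by the current snapshot
SELECT * FROM local.db.orders.files;       -- data files tracked by the current snapshot
```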
The list of valid snapshots is stored in the table metadata file, along with a reference to the current snapshot. Any data manipulation operation writes a new metadata file and atomically swaps the pointer to the current one, making it possible to perform data operations in isolation. This design leads to the following possibilities:
Serializable isolation: All table changes occur in a linear history of atomic table updates
Reliable reads: Readers always use a consistent snapshot of the table without holding a lock
Version history and rollback: Table snapshots are kept as history and tables can roll back if a job produces bad data
Safe file-level operations: By supporting atomic changes, Iceberg enables new use cases, like safely compacting small files and safely appending late data to tables
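As a sketch of the rollback capability above: Iceberg ships a `rollback_to_snapshot` stored procedure for Spark. Assuming a catalog named `local` and a snapshot ID taken from the table's snapshot history (both placeholders), the call looks like:

```sql
-- Roll the hypothetical table db.orders back to an earlier snapshot
CALL local.system.rollback_to_snapshot('db.orders', 5781947118336215154);
```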
Concurrent write operations:
Iceberg supports concurrent write operations by using optimistic concurrency.
Each writer assumes that no other writers are operating and writes out new table metadata for an operation. Then, the writer attempts to commit by atomically swapping the new table metadata file for the existing metadata file.
If the atomic swap fails because another writer has committed, the failed writer retries by writing a new metadata tree based on the new current table state.
More details on the features of Iceberg will be shared in the upcoming blog post. Stay tuned for more exciting content.
Features of Iceberg
In the previous blog post, we got the basics right to start understanding Apache Iceberg's flexible and reliable table format. Now let’s get straight to the point and dig into the features that make Iceberg worth a try, and see how well it fits our data warehouses and common data tables when compared against the widely known Hive tables, which have been in the market for quite a long time and have been broadly adopted alongside big data technologies to handle vast amounts of data.
1. ACID Transactions:
ACID (Atomicity, Consistency, Isolation, Durability) properties at the table level mean that a set of changes to a table's rows and columns is executed as a single atomic operation: either all of the changes succeed or all of them fail. Consistency keeps the table moving from one valid state to another, isolation keeps concurrent transactions from interfering with each other, and durability ensures that committed changes survive failures.
Hive supports the ACID transaction properties, which allow us to use transactional tables and perform operations like Insert, Update, and Delete on them. But there are limitations attached that need to be addressed:
→ Hive tables must be created as transactional tables to support ACID operations
→ Only the ORC file format is supported
→ External tables can't be transactional, as Hive has no control over their data
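For reference, a minimal sketch of a Hive table reflecting those constraints (table and column names are made up): it must be a managed table, stored as ORC, with the transactional property set:

```sql
-- Hive ACID requires ORC and the transactional table property
CREATE TABLE orders (
  id INT,
  amount DECIMAL(10, 2)
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```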
Apache Iceberg has all its table-level metadata stored in the layers of metadata files, manifest lists, and manifests explained earlier, which is what allows it to achieve ACID transaction guarantees on tables. Beyond table-level Insert, Update, and Delete operations, row-level updates and deletes are also handled.
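A sketch of those row-level operations in Spark SQL, assuming an Iceberg catalog named `local` and hypothetical tables `db.orders` and `db.staged_orders`:

```sql
-- Row-level delete and update on an Iceberg table
DELETE FROM local.db.orders WHERE amount < 0;

UPDATE local.db.orders SET amount = 0 WHERE id = 42;

-- Upsert staged rows into the table
MERGE INTO local.db.orders t
USING local.db.staged_orders s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT *;
```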
2. Partition Evolution:
Partitions are an efficient way to query a table: instead of scanning the entire table, we scan only the partition that holds the data we need, so they can be treated as a form of query optimisation. Partition evolution allows us to update the partition scheme of a table without the need to rewrite the entire table.
Hive tables support partitions, which has been a time saver when querying large datasets. But Hive doesn't support partition evolution yet: for example, if a table is partitioned by year and now needs to be partitioned by month, an entire rewrite of the table is expected.
Iceberg supports partition evolution. Partitions are tracked based on the partition column and the transformation applied to it. For example, a table partitioned by year on a creation_date column can be switched to month-level partitioning without restructuring the table and triggering a complete rewrite.
→ An entire rewrite of the table data is not required
→ Queries are planned using each partition spec, so old and new partition layouts can be used side by side
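With Iceberg's Spark SQL extensions enabled, the year-to-month change from the example above is a metadata-only statement along these lines (catalog and table names are placeholders):

```sql
-- Switch partitioning on creation_date from year to month granularity
ALTER TABLE local.db.events
REPLACE PARTITION FIELD years(creation_date) WITH months(creation_date);
```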
3. Schema evolution:
Schema evolution can be defined as the ability to change the schema of a table as our requirements change, without rewriting the table's data and while maintaining backward compatibility with the old table structure. It accommodates changes to the table schema as the data evolves over time.
Hive doesn’t support schema evolution fully. A few changes can be made without affecting the Hive table structure: changing the definition of columns, adding new columns, and renaming columns. However, we can’t directly drop a column from a Hive table; we can only replace the existing columns with a new set.
We can change a column's name, type, or position with the command below:
ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type
[COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];
Iceberg natively supports all the schema evolution operations: adding new columns, updating the type of an existing column, renaming a column, dropping an existing column, and reordering columns.
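A sketch of those operations in Spark SQL, again using a hypothetical catalog `local` and table `db.orders`:

```sql
ALTER TABLE local.db.orders ADD COLUMN discount DOUBLE;           -- add a column
ALTER TABLE local.db.orders ALTER COLUMN id TYPE BIGINT;          -- widen a type
ALTER TABLE local.db.orders RENAME COLUMN amount TO total_amount; -- rename
ALTER TABLE local.db.orders ALTER COLUMN discount AFTER id;       -- reorder
ALTER TABLE local.db.orders DROP COLUMN discount;                 -- drop
```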
4. Time Travel:
Time travel allows us to query a data table in its previous version. Most of the table formats enable time travel through snapshots.
Snapshots of the table are retained in the table's metadata over time. A query can be pointed at a previous snapshot of the table, whose data can then be acquired and queried as well.
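In Spark SQL, such a query can reference a snapshot ID or a timestamp directly; the snapshot ID and table names below are placeholders:

```sql
-- Query the table as of a specific snapshot ID
SELECT * FROM local.db.orders VERSION AS OF 5781947118336215154;

-- Or as of a point in time
SELECT * FROM local.db.orders TIMESTAMP AS OF '2023-01-01 00:00:00';
```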
Every time an Iceberg table is updated, a new snapshot is created and stored. These snapshots should be cleaned up over time, as they can accumulate into a huge pile of data. The expire_snapshots procedure is used to expire old snapshots.
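A sketch of that cleanup using Iceberg's Spark stored procedure, with placeholder catalog and table names; snapshots older than the given timestamp become eligible for removal:

```sql
CALL local.system.expire_snapshots(
  table => 'db.orders',
  older_than => TIMESTAMP '2023-01-01 00:00:00'
);
```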