Hudi's Latest Feature: Auto-Generating Primary Keys for Modern Data Lakes¶
Step 1: Define Imports¶
try:
    import os
    import sys
    import uuid
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark import SparkConf, SparkContext
    from faker import Faker

    print("Imports loaded")
except Exception as e:
    print("error", e)
Imports loaded
Step 2: Create Spark Session¶
HUDI_VERSION = '0.14.0'
SPARK_VERSION = '3.4'
SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION} pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable
spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.extensions', 'org.apache.hudi.HoodieSparkSessionExtension') \
    .config('className', 'org.apache.hudi') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()
Generate Data¶
# Sample data as a list of dictionaries with "id," "name," "age," and "ts" fields
data = [
{"id": 1, "name": "Alice", "age": 25, "ts": "2023-10-07 10:00:00"},
{"id": 2, "name": "Bob", "age": 30, "ts": "2023-10-07 11:15:00"},
{"id": 3, "name": "Charlie", "age": 35, "ts": "2023-10-07 12:30:00"},
{"id": 4, "name": "David", "age": 40, "ts": "2023-10-07 13:45:00"}
]
# Create a DataFrame from the sample data
spark_df = spark.createDataFrame(data)
# Show the contents of the DataFrame
spark_df.show()
+---+---+-------+-------------------+
|age| id|   name|                 ts|
+---+---+-------+-------------------+
| 25|  1|  Alice|2023-10-07 10:00:00|
| 30|  2|    Bob|2023-10-07 11:15:00|
| 35|  3|Charlie|2023-10-07 12:30:00|
| 40|  4|  David|2023-10-07 13:45:00|
+---+---+-------+-------------------+
Inserting Data into Hudi Tables¶
db_name = "hudidb"
table_name = "sample_table_without_record_key"
path = f"file:///Users/soumilnitinshah/Downloads/{db_name}/{table_name}" # Updated path
method = 'upsert'
table_type = "COPY_ON_WRITE"
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.table.type': table_type,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': method,
}

spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)
23/10/07 09:23:19 WARN AutoRecordKeyGenerationUtils$: Precombine field ts will be ignored with auto record key generation enabled
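This warning is expected: with auto-generated record keys, Hudi treats every incoming record as a new insert, so the precombine field (ts) is not used to merge or deduplicate records.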
Read the Data From Hudi Tables¶
spark.read.format("org.apache.hudi").load(path).createOrReplaceTempView("hudi_snapshot")
spark.sql("select _hoodie_record_key,id from hudi_snapshot").show(truncate=False)
+----------------------+---+
|_hoodie_record_key    |id |
+----------------------+---+
|20231007092319081_8_0 |3  |
|20231007092319081_5_0 |2  |
|20231007092319081_11_0|4  |
|20231007092319081_2_0 |1  |
+----------------------+---+
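A quick sanity check on the snapshot (reusing the hudi_snapshot view registered above) is to compare the number of distinct keys with the number of rows; with auto-generated keys the two counts should match:

# Every row should carry its own auto-generated _hoodie_record_key,
# so the distinct key count should equal the row count
spark.sql("""
    SELECT count(DISTINCT _hoodie_record_key) AS distinct_keys,
           count(*) AS total_rows
    FROM hudi_snapshot
""").show()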
Use Cases: When to Go Auto-Generated vs. Specifying a Primary Key¶
Now, the question arises: in which scenarios should you utilize the primary key auto-generation feature, and when is it still advantageous to specify a primary key explicitly?
Auto-Generated Primary Key: This feature shines when your dataset lacks a natural primary key or when you're dealing with append-only use cases. It simplifies table creation, making it ideal for scenarios where you want to ingest data quickly without worrying about key configuration.
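For instance, an append-only ingestion with auto-generated keys could look like the sketch below; the bulk_insert operation, table name, and path are illustrative choices, not part of the example above:

# Hypothetical append-only ingestion: no record key is configured,
# so Hudi generates keys and simply appends the incoming rows
append_only_options = {
    'hoodie.table.name': 'events_append_only',            # illustrative table name
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'bulk_insert',   # append-friendly write operation
}
spark_df.write.format("hudi"). \
    options(**append_only_options). \
    mode("append"). \
    save("file:///tmp/hudidb/events_append_only")          # illustrative path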
Specifying a Primary Key: On the other hand, specifying a primary key is still essential when you have well-defined primary keys in your data, especially for tables that require frequent updates or deletes. It gives you fine-grained control over data manipulation and allows for efficient indexing.
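By contrast, here is a minimal sketch of the same write with an explicit record key, so that upserts and deletes can target rows by id; the table name and path are again illustrative:

# Explicit-key version of the write: upserts and deletes match rows on 'id',
# and 'ts' acts as the precombine field to keep the latest version of each record
keyed_options = {
    'hoodie.table.name': 'sample_table_with_record_key',   # illustrative table name
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'ts',
}
spark_df.write.format("hudi"). \
    options(**keyed_options). \
    mode("append"). \
    save("file:///tmp/hudidb/sample_table_with_record_key")  # illustrative path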
Conclusion¶
In conclusion, Apache Hudi's 0.14.0 release, with its primary key auto-generation feature, opens up new possibilities for managing data lakes more efficiently. Whether you choose to go auto-generated or specify a primary key depends on your specific use case and data requirements. This flexibility empowers users to adapt Hudi to their unique needs, making it a powerful tool in the world of big data management.
For more information on this exciting update, you can visit the official release notes.