Hudi's Latest Feature: Auto-Generating Primary Keys for Modern Data Lakes¶
Step 1: Define Imports¶
try:
    import os
    import sys
    import uuid
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark import SparkConf, SparkContext
    from faker import Faker

    print("Imports loaded")
except Exception as e:
    print("error", e)
Imports loaded
Step 2: Create Spark Session¶
HUDI_VERSION = '0.14.0'
SPARK_VERSION = '3.4'
SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION} pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable
spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.extensions', 'org.apache.hudi.HoodieSparkSessionExtension') \
    .config('className', 'org.apache.hudi') \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .getOrCreate()
Generate Data¶
# Sample data as a list of dictionaries with "id," "name," "age," and "ts" fields
data = [
{"id": 1, "name": "Alice", "age": 25, "ts": "2023-10-07 10:00:00"},
{"id": 2, "name": "Bob", "age": 30, "ts": "2023-10-07 11:15:00"},
{"id": 3, "name": "Charlie", "age": 35, "ts": "2023-10-07 12:30:00"},
{"id": 4, "name": "David", "age": 40, "ts": "2023-10-07 13:45:00"}
]
# Create a DataFrame from the sample data
spark_df = spark.createDataFrame(data)
# Show the contents of the DataFrame
spark_df.show()
+---+---+-------+-------------------+
|age| id|   name|                 ts|
+---+---+-------+-------------------+
| 25|  1|  Alice|2023-10-07 10:00:00|
| 30|  2|    Bob|2023-10-07 11:15:00|
| 35|  3|Charlie|2023-10-07 12:30:00|
| 40|  4|  David|2023-10-07 13:45:00|
+---+---+-------+-------------------+
Inserting Data into Hudi Tables¶
db_name = "hudidb"
table_name = "sample_table_without_record_key"
path = f"file:///Users/soumilnitinshah/Downloads/{db_name}/{table_name}" # Updated path
method = 'upsert'
table_type = "COPY_ON_WRITE"
hudi_options = {
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.table.type': table_type,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.write.operation': method,
}

spark_df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(path)
23/10/07 09:23:19 WARN AutoRecordKeyGenerationUtils$: Precombine field ts will be ignored with auto record key generation enabled
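This warning is expected: with auto-generated record keys, Hudi treats every incoming record as a new insert, so the precombine field (ts) is not used to merge or deduplicate records.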
Read the Data From Hudi Tables¶
spark.read.format("org.apache.hudi").load(path).createOrReplaceTempView("hudi_snapshot")
spark.sql("select _hoodie_record_key,id from hudi_snapshot").show(truncate=False)
+----------------------+---+
|_hoodie_record_key    |id |
+----------------------+---+
|20231007092319081_8_0 |3  |
|20231007092319081_5_0 |2  |
|20231007092319081_11_0|4  |
|20231007092319081_2_0 |1  |
+----------------------+---+
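A quick sanity check on the snapshot (reusing the hudi_snapshot view registered above) is to compare the number of distinct keys with the number of rows; with auto-generated keys the two counts should match:

# Every row should carry its own auto-generated _hoodie_record_key,
# so the distinct key count should equal the row count
spark.sql("""
    SELECT count(DISTINCT _hoodie_record_key) AS distinct_keys,
           count(*) AS total_rows
    FROM hudi_snapshot
""").show()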
Use Cases: When to Go Auto-Generated vs. Specifying a Primary Key¶
Now, the question arises: in which scenarios should you utilize the primary key auto-generation feature, and when is it still advantageous to specify a primary key explicitly?
Auto-Generated Primary Key: This feature shines when your dataset lacks a natural primary key or when you're dealing with append-only use cases. It simplifies table creation, making it ideal for scenarios where you want to ingest data quickly without worrying about key configuration.
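For instance, an append-only ingestion with auto-generated keys could look like the sketch below; the bulk_insert operation, table name, and path are illustrative choices, not part of the example above:

# Hypothetical append-only ingestion: no record key is configured,
# so Hudi generates keys and simply appends the incoming rows
append_only_options = {
    'hoodie.table.name': 'events_append_only',            # illustrative table name
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'bulk_insert',   # append-friendly write operation
}
spark_df.write.format("hudi"). \
    options(**append_only_options). \
    mode("append"). \
    save("file:///tmp/hudidb/events_append_only")          # illustrative path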
Specifying a Primary Key: On the other hand, specifying a primary key is still essential when you have well-defined primary keys in your data, especially for tables that require frequent updates or deletes. It gives you fine-grained control over data manipulation and allows for efficient indexing.
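By contrast, here is a minimal sketch of the same write with an explicit record key, so that upserts and deletes can target rows by id; the table name and path are again illustrative:

# Explicit-key version of the write: upserts and deletes match rows on 'id',
# and 'ts' acts as the precombine field to keep the latest version of each record
keyed_options = {
    'hoodie.table.name': 'sample_table_with_record_key',   # illustrative table name
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'ts',
}
spark_df.write.format("hudi"). \
    options(**keyed_options). \
    mode("append"). \
    save("file:///tmp/hudidb/sample_table_with_record_key")  # illustrative path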
Conclusion¶
In conclusion, Apache Hudi's 0.14.0 release, with its primary key auto-generation feature, opens up new possibilities for managing data lakes more efficiently. Whether you choose to go auto-generated or specify a primary key depends on your specific use case and data requirements. This flexibility empowers users to adapt Hudi to their unique needs, making it a powerful tool in the world of big data management.
For more information on this exciting update, you can visit the official release notes.