Sunday, October 6, 2024

Learn How to Read Hudi Tables on S3 Locally in Your PySpark Environment | Essential Packages You Need to Use

Define Imports

In [1]:
from pyspark.sql import SparkSession
import os, sys

# Set Java Home environment variable if needed
os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11"

HUDI_VERSION = '0.14.0'
SPARK_VERSION = '3.4'
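The Hudi bundle artifact name embeds both the Spark minor version and the Scala version, which is why the two variables above matter. A small helper (hypothetical, not part of the original notebook) makes the Maven coordinate explicit:

```python
# Hypothetical helper: assemble the Maven coordinate for the Hudi Spark bundle.
# The artifact name embeds the Spark minor version and the Scala version.
def hudi_bundle_coordinate(spark_version: str, hudi_version: str,
                           scala_version: str = "2.12") -> str:
    return f"org.apache.hudi:hudi-spark{spark_version}-bundle_{scala_version}:{hudi_version}"

print(hudi_bundle_coordinate("3.4", "0.14.0"))
# org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0
```

If the bundle's Spark or Scala version does not match your local PySpark installation, the session will fail with class-loading errors, so double-check these values first.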

Create Spark Session

In [4]:
SUBMIT_ARGS = f"--packages org.apache.hudi:hudi-spark{SPARK_VERSION}-bundle_2.12:{HUDI_VERSION},org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.773 pyspark-shell"

os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable

# Spark session
spark = SparkSession.builder \
    .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
    .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.hudi.catalog.HoodieCatalog') \
    .config("spark.hadoop.fs.s3a.prefetch.enable", "false") \
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random") \
    .config('spark.sql.hive.convertMetastoreParquet', 'false') \
    .config("spark.hadoop.fs.s3a.access.key", os.getenv("ACCESS_KEY")) \
    .config("spark.hadoop.fs.s3a.secret.key", os.getenv("SECRET_KEY")) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.amazonaws.com") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain") \
    .getOrCreate()
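The builder above pulls `ACCESS_KEY` and `SECRET_KEY` from the environment; if either is unset, `os.getenv` silently returns `None` and the problem only surfaces later as a confusing S3A authentication error. A minimal sketch (hypothetical helper, not from the original post) that fails fast before the session is built:

```python
import os

# Hypothetical helper: report which credential environment variables
# the session builder expects are missing, so a typo fails fast
# instead of surfacing later as an S3A 403.
def missing_credentials(required=("ACCESS_KEY", "SECRET_KEY")):
    return [name for name in required if not os.getenv(name)]

for name in missing_credentials():
    print(f"warning: {name} is not set")
```

Run this check before `SparkSession.builder`; an empty result means both variables are visible to the Python process.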

Read your Hudi tables

In [5]:
path = "s3a://soumilshah-dev-1995/tmp/people/"

df = spark.read.format("hudi") \
    .load(path) 

df.show(truncate=True)
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---+-------------------+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|   name|age|          create_ts|city|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---+-------------------+----+
|  20240916192523888|20240916192523888...|                 1|              city=NYC|623018ca-95f1-4c6...|  1|   John| 25|2023-09-28 00:00:00| NYC|
|  20240916192523888|20240916192523888...|                 4|              city=NYC|623018ca-95f1-4c6...|  4| Andrew| 40|2023-10-28 00:00:00| NYC|
|  20240916192523888|20240916192523888...|                 3|              city=ORD|3aa82092-735a-45a...|  3|Michael| 35|2023-09-28 00:00:00| ORD|
|  20240916192523888|20240916192523888...|                 6|              city=DFW|0a806fde-cac6-4bd...|  6|Charlie| 31|2023-08-29 00:00:00| DFW|
|  20240916192523888|20240916192523888...|                 2|              city=SFO|2a1d4681-44f9-4e9...|  2|  Emily| 30|2023-09-28 00:00:00| SFO|
|  20240916192523888|20240916192523888...|                 5|              city=SEA|fc7db0ea-3bc1-46c...|  5|    Bob| 28|2023-09-23 00:00:00| SEA|
+-------------------+--------------------+------------------+----------------------+--------------------+---+-------+---+-------------------+----+
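The `_hoodie_commit_time` values above are Hudi commit instants in `yyyyMMddHHmmssSSS` form, i.e. the commit wall-clock time down to the millisecond. A short sketch of decoding one with plain Python:

```python
from datetime import datetime

# A Hudi commit instant packs the commit time as yyyyMMddHHmmssSSS;
# strptime's %f tolerates the 3-digit millisecond suffix.
def parse_commit_time(instant: str) -> datetime:
    return datetime.strptime(instant, "%Y%m%d%H%M%S%f")

print(parse_commit_time("20240916192523888"))
# 2024-09-16 19:25:23.888000
```

These instants are also what you pass to options such as `hoodie.datasource.read.begin.instanttime` when running incremental queries against the table.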
