Sunday, June 30, 2024

Apache Hudi Using Spark SQL on AWS S3 | Insert | Update | Delete | Stored Procedures on AWS Using Glue Notebooks: A Hands-On Guide


In an AWS Glue notebook, set the following configurations:


%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%%configure
{
    "--conf": "spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog",
    "--enable-glue-datacatalog": "true",
    "--datalake-formats": "hudi"
}
  • Since I am running the AWS Glue notebook locally in a Docker container, I use the configuration below instead. If you are running in a Glue notebook on AWS, use the configuration above.
In [ ]:
%%configure -f
{
    "conf": {
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.hive.convertMetastoreParquet": "false",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
        "spark.sql.legacy.pathOptionBehavior.enabled": "true",
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
    }
}
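
Before running any DDL, you can optionally echo a setting back to confirm the Hudi session extension was picked up; plain Spark SQL SET syntax is enough for this sanity check:

In [ ]:
%%sql
SET spark.sql.extensions;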

Spark SQL Commands

In [2]:
%%sql
show databases
Starting Spark application
SparkSession available as 'spark'.
In [3]:
%%sql
use default;
In [4]:
%%sql
show tables;

Create Hudi Table

Before creating the table, enable the Hudi metadata table and its column statistics index, so file listings and min/max column stats are served from Hudi metadata instead of S3 listings:

In [11]:
%%sql
SET hoodie.metadata.enable=true;
In [12]:
%%sql
SET hoodie.metadata.column.stats.enable=true;
In [13]:
%%sql
CREATE TABLE hudi_table (
    ts BIGINT,
    uuid STRING,
    rider STRING,
    driver STRING,
    fare DOUBLE,
    city STRING
) USING hudi
OPTIONS (
    primaryKey = 'uuid',
    preCombineField = 'ts',
    path = 's3://soumilshah-dev-1995/hudi/table_name=hudi_table'
)
PARTITIONED BY (city);
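
To inspect what was created (including the meta columns such as _hoodie_commit_time that Hudi adds automatically), a standard DESCRIBE works:

In [ ]:
%%sql
DESCRIBE TABLE EXTENDED hudi_table;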

Insert Items using Spark SQL

In [14]:
%%sql
INSERT INTO hudi_table
VALUES
(1695159649087,'334e26e9-8355-45cc-97c6-c31daf0df330','rider-A','driver-K',19.10,'san_francisco'),
(1695091554788,'e96c4396-3fad-413a-a942-4cb36106d721','rider-C','driver-M',27.70,'san_francisco'),
(1695046462179,'9909a8b1-2d15-4d3d-8ec9-efc48c536a00','rider-D','driver-L',33.90,'san_francisco'),
(1695332066204,'1dced545-862b-4ceb-8b43-d2a568f6616b','rider-E','driver-O',93.50,'san_francisco'),
(1695516137016,'e3cf430c-889d-4015-bc98-59bdce1e530c','rider-F','driver-P',34.15,'sao_paulo'),
(1695376420876,'7a84095f-737f-40bc-b62f-6b69664712d2','rider-G','driver-Q',43.40,'sao_paulo'),
(1695173887231,'3eeb61f7-c2b0-4636-99bd-5d7a5a1d2c04','rider-I','driver-S',41.06,'chennai'),
(1695115999911,'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa','rider-J','driver-T',17.85,'chennai');

Selecting Items from the Table

In [15]:
%%sql 
SELECT ts, fare, rider, driver, city FROM hudi_table WHERE fare > 20.0;

Updating Items with Spark SQL

In [17]:
%%sql
UPDATE hudi_table SET fare = 25.0 WHERE rider = 'rider-D';
In [18]:
%%sql 
SELECT * FROM hudi_table WHERE rider = 'rider-D';

Deleting Items from the Table

In [19]:
%%sql
DELETE FROM hudi_table WHERE uuid = 'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa';
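
Querying the same key again should return no rows, confirming the delete (this is rider-J's key from the insert above):

In [ ]:
%%sql
SELECT * FROM hudi_table WHERE uuid = 'c8abbe79-8d89-47ea-b4ce-4d224bae5bfa';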

Stored procedures

Show commits

In [20]:
%%sql
CALL show_commits(table => 'hudi_table', limit => 10);
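
If the summary view is not enough, Hudi also provides show_commits_metadata, which adds per-commit write statistics; the arguments mirror show_commits:

In [ ]:
%%sql
CALL show_commits_metadata(table => 'hudi_table', limit => 10);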

Cleaner

Run the cleaner to reclaim storage from older file versions; the call below triggers once 2 commits have accumulated and keeps only the latest version of each file:

In [21]:
%%sql
CALL run_clean(
  table => 'hudi_table',
  trigger_max_commits => 2,
  clean_policy => 'KEEP_LATEST_FILE_VERSIONS',
  file_versions_retained => 1
);

Savepoints

A savepoint pins a commit so the cleaner cannot remove the files needed to restore to it. Create one using a commit time returned by show_commits above (substitute your own instant):

In [22]:
%%sql
CALL create_savepoint(table => 'hudi_table', commit_time => '20240630132252427');
In [23]:
%%sql
CALL show_savepoints(table => 'hudi_table');
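
To put a savepoint to use, Hudi ships companion procedures for rolling back to it and for cleaning it up. A minimal sketch, reusing the instant created above; note that rollback_to_savepoint discards any commits made after the savepoint:

In [ ]:
%%sql
CALL rollback_to_savepoint(table => 'hudi_table', instant_time => '20240630132252427');
In [ ]:
%%sql
CALL delete_savepoint(table => 'hudi_table', instant_time => '20240630132252427');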

Clustering

Clustering is a two-step operation: schedule a plan first, then execute it. The size options take literal byte values (a 1 GB target file size and a 2 GB cap per file group below):

In [25]:
%%sql

CALL run_clustering(
  table => 'hudi_table',
  op => 'schedule',
  options => 'hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824,hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648'
);
In [26]:
%%sql

CALL run_clustering(
  table => 'hudi_table',
  op => 'execute',
  options => 'hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824,hoodie.clustering.plan.strategy.max.bytes.per.group=2147483648'
);
In [29]:
%%sql
CALL show_clustering(table => 'hudi_table');

Up next: Learn How to Read Hudi Tables on S3 Locally in Your PySpark Environment | Essential Packages You Need to Use
