Sunday, May 11, 2025

Getting Started with lakeFS and Apache Iceberg Running Locally

learnlakefs

Install lakeFS

pip install lakefs
python -m lakefs.quickstart
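
If the pip quickstart doesn't work in your environment, the lakeFS documentation also provides a Docker one-liner that starts a local server (UI and API on http://127.0.0.1:8000), which you can use instead:

docker run --pull always -p 8000:8000 treeverse/lakefs run --quickstart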

Step 1: Define Imports

In [1]:
import os
import sys
from pyspark.sql import SparkSession

os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@11"

SPARK_VERSION = '3.4'
ICEBERG_VERSION = '1.3.0'
LAKEFS_ICEBERG_VERSION = '0.1.2'

# Pull the Iceberg Spark runtime, S3A support, and the lakeFS-Iceberg catalog at startup
SUBMIT_ARGS = (
    "--packages "
    f"org.apache.iceberg:iceberg-spark-runtime-{SPARK_VERSION}_2.12:{ICEBERG_VERSION},"
    "com.amazonaws:aws-java-sdk-bundle:1.12.661,"
    "org.apache.hadoop:hadoop-aws:3.3.4,"
    f"io.lakefs:lakefs-iceberg:v{LAKEFS_ICEBERG_VERSION} "
    "pyspark-shell"
)

os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable

# lakeFS quickstart credentials (placeholders; use the keys from your own setup)
lakefs_access_key = "XX"
lakefs_secret_key = "X"
lakefs_endpoint = "http://127.0.0.1:8000"
repo_name = "learn-demo"
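
The rest of this post assumes a repository named learn-demo already exists. If yours doesn't, one way to create it is with lakectl from a notebook cell; the local://learn-demo storage namespace below is an assumption that matches a local-blockstore quickstart, so adjust it to your setup:

! lakectl repo create lakefs://learn-demo local://learn-demo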

Step 2: Create the Spark Session

In [2]:
spark = SparkSession.builder \
    .appName("Iceberg-lakeFS-Example") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") \
    .config("spark.sql.catalog.lakefs.warehouse", f"lakefs://{repo_name}") \
    .config("spark.sql.catalog.lakefs.uri", lakefs_endpoint) \
    .config("spark.sql.defaultCatalog", "lakefs") \
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.endpoint", lakefs_endpoint) \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.access.key", lakefs_access_key) \
    .config("spark.hadoop.fs.s3a.secret.key", lakefs_secret_key) \
    .getOrCreate()
:: loading settings :: url = jar:file:/Users/soumilshah/IdeaProjects/icebergpython/venv/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/soumilshah/.ivy2/cache
The jars for the packages stored in: /Users/soumilshah/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.4_2.12 added as a dependency
com.amazonaws#aws-java-sdk-bundle added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
io.lakefs#lakefs-iceberg added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2b858584-7ab7-48bc-a9a7-b6ffc0ced615;1.0
	found org.apache.iceberg#iceberg-spark-runtime-3.4_2.12;1.3.0 in central
	found com.amazonaws#aws-java-sdk-bundle;1.12.661 in spark-list
	found org.apache.hadoop#hadoop-aws;3.3.4 in central
	found io.lakefs#lakefs-iceberg;v0.1.2 in central
	...
:: resolution report :: resolve 1061ms :: artifacts dl 35ms
:: retrieving :: org.apache.spark#spark-submit-parent-2b858584-7ab7-48bc-a9a7-b6ffc0ced615
	0 artifacts copied, 20 already retrieved (0kB/29ms)
25/05/11 08:13:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
In [3]:
spark
Out[3]:

SparkSession - in-memory
SparkContext
Version: v3.4.0
Master: local[*]
AppName: Iceberg-lakeFS-Example
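
Note the table naming used throughout the rest of this post: because spark.sql.defaultCatalog points at the lakefs catalog, every table identifier takes the form <branch>.<database>.<table>. Reading the same table from two branches is just a matter of changing the first part, e.g.:

spark.sql("select * from main.sampledb.my_table").show()  # read the table on main
spark.sql("select * from dev.sampledb.my_table").show()   # read the same table on dev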

Step 3: Write Data to the Main Branch

In [6]:
from pyspark.sql import Row

# Create a sample DataFrame
data = [Row(id=1, data="foo"), Row(id=2, data="bar")]
df = spark.createDataFrame(data)
df.show()
df.writeTo("main.sampledb.my_table").using("iceberg").createOrReplace()
+---+----+
| id|data|
+---+----+
|  1| foo|
|  2| bar|
+---+----+

In [7]:
spark.sql("select * from main.sampledb.my_table").show()
+---+----+
| id|data|
+---+----+
|  1| foo|
|  2| bar|
+---+----+
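
Since this is a plain Iceberg table, Iceberg's metadata tables are queryable through the same identifier. For example, listing snapshots (a standard Iceberg feature; this sketch assumes the lakeFS catalog exposes metadata tables like any other Iceberg catalog):

spark.sql("select snapshot_id, operation from main.sampledb.my_table.snapshots").show()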

Step 4: Commit to the Main Branch

In [8]:
! lakectl commit lakefs://learn-demo/main -m "Add foo and bar rows to my_table"
Branch: lakefs://learn-demo/main
Commit for branch "main" completed.

ID: dcbd145f2598297e11b806118cc48ca513c3c44435e1c16956a7f071a54ede37
Message: Add foo and bar rows to my_table
Timestamp: 2025-05-10 16:29:55 -0400 EDT
Parents: 27266b88e6ab1d475142f3a48b0c4adc74e1d8516cb22ee768c99d2f61526d2a
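
If you'd rather stay in Python than shell out to lakectl, the high-level lakefs SDK installed earlier can commit too. A minimal sketch, assuming the SDK's documented Client / Repository / Branch objects:

import lakefs
from lakefs.client import Client

# Explicit client pointing at the local quickstart server
clt = Client(host=lakefs_endpoint, username=lakefs_access_key, password=lakefs_secret_key)

# Commit whatever is staged on main, equivalent to the lakectl call above
ref = lakefs.Repository(repo_name, client=clt).branch("main").commit(
    message="Add foo and bar rows to my_table"
)
print(ref.get_commit().id)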

Step 5: Create a New Branch and Write to It

In [12]:
! lakectl branch create lakefs://learn-demo/dev -s lakefs://learn-demo/main
Source ref: lakefs://learn-demo/main
created branch 'dev' dcbd145f2598297e11b806118cc48ca513c3c44435e1c16956a7f071a54ede37
In [10]:
from pyspark.sql import Row

# Create a sample DataFrame
data = [Row(id=3, data="foo***"), Row(id=4, data="bar***")]
df = spark.createDataFrame(data)
df.show()
df.writeTo("dev.sampledb.my_table").append()
+---+------+
| id|  data|
+---+------+
|  3|foo***|
|  4|bar***|
+---+------+

In [11]:
spark.sql("select * from dev.sampledb.my_table order by id asc").show()
+---+------+
| id|  data|
+---+------+
|  1|   foo|
|  2|   bar|
|  3|foo***|
|  4|bar***|
+---+------+

In [16]:
! lakectl commit lakefs://learn-demo/dev -m "Add foo*** and bar*** rows to my_table"
Branch: lakefs://learn-demo/dev
Commit for branch "dev" completed.

ID: deef301a0619182d8b4f778e1ce8b35ffdc135d8e8b8161919c0ceed88871ad1
Message: Add foo*** and bar*** rows to my_table
Timestamp: 2025-05-10 16:32:06 -0400 EDT
Parents: dcbd145f2598297e11b806118cc48ca513c3c44435e1c16956a7f071a54ede37

Step 6: Compare Both Branches

In [12]:
spark.sql("select * from main.sampledb.my_table").show()
spark.sql("select * from dev.sampledb.my_table").show()
+---+----+
| id|data|
+---+----+
|  1| foo|
|  2| bar|
+---+----+

+---+------+
| id|  data|
+---+------+
|  3|foo***|
|  4|bar***|
|  1|   foo|
|  2|   bar|
+---+------+
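
Before merging, you can also ask lakeFS for the object-level differences between the two refs (what dev adds on top of main):

! lakectl diff lakefs://learn-demo/main lakefs://learn-demo/dev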

Step 7: Merge the Dev Branch into Main

In [15]:
! lakectl merge lakefs://learn-demo/dev lakefs://learn-demo/main
Source: lakefs://learn-demo/dev
Destination: lakefs://learn-demo/main
In [13]:
spark.sql("select * from main.sampledb.my_table").show()
spark.sql("select * from dev.sampledb.my_table").show()
+---+------+
| id|  data|
+---+------+
|  3|foo***|
|  1|   foo|
|  4|bar***|
|  2|   bar|
+---+------+

+---+------+
| id|  data|
+---+------+
|  3|foo***|
|  4|bar***|
|  1|   foo|
|  2|   bar|
+---+------+
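
The merge lands as a regular commit on main, which you can confirm from the commit log:

! lakectl log lakefs://learn-demo/main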

Step 8: Delete the Dev Branch

In [14]:
! lakectl branch delete lakefs://learn-demo/dev -y
Branch: lakefs://learn-demo/dev
