Thursday, December 11, 2025

MSSQL 2012SP4 to 2022 upgrade plan

Please refer to the plan book with commands:

https://docs.google.com/spreadsheets/d/1W-IoA9EbWHnASrp80yLhq5PZkfO4Wdeu/edit?usp=sharing&ouid=108977388436519456629&rtpof=true&sd=true

Please refer to the plan book with commands for MSSQL with AG:

https://docs.google.com/spreadsheets/d/1w0tq8zWSvg9FVCi2AvbmEcZmW2ArqKdy/edit?usp=sharing&ouid=108977388436519456629&rtpof=true&sd=true

Wednesday, December 10, 2025

Learn Data Engineering on Your Laptop: Local Iceberg Lakehouse with Docker

Objective: Learn Data Engineering on your laptop, without any cloud subscription.


YouTube:



Reference (credits): https://dev.to/alexmercedcoder/data-engineering-create-a-apache-iceberg-based-data-lakehouse-on-your-laptop-41a8


I followed the blog as it is provided; there were two challenges worth calling out, which is why this video was recorded. Neither is actually a problem with the blog itself; both are likely down to how our environment is set up and the product versions used.


Environment setup: 

Not performed on a local laptop setup; instead, we run this in a virtual machine!!

VM OS: Oracle Linux 9

Hypervisor: VirtualBox

Network interface: default setting, so no external connections can reach the machine.


Steps followed:

1. Set up Docker. A separate video has already been recorded for this; please refer to it.

2. Create the docker-compose file as outlined in the blog...


mkdir /opt/de

cd /opt/de

vi docker-compose.yml


copy/paste below:


version: "3.9"


services:

  dremio:

    platform: linux/x86_64

    image: dremio/dremio-oss:latest

    ports:

      - 9047:9047

      - 31010:31010

      - 32010:32010

    container_name: dremio


  minioserver:

    image: minio/minio

    ports:

      - 9000:9000

      - 9001:9001

    environment:

      MINIO_ROOT_USER: minioadmin

      MINIO_ROOT_PASSWORD: minioadmin

    container_name: minio

    command: server /data --console-address ":9001"


  spark_notebook:

    image: alexmerced/spark33-notebook

    ports: 

      - 8888:8888

    env_file: .env

    container_name: notebook


  nessie:

    image: projectnessie/nessie

    container_name: nessie

    ports:

      - "19120:19120"


networks:

  default:

    name: iceberg_env

    driver: bridge




3. Create a .env file in the same location as docker-compose.yml


# Fill in Details


# AWS_REGION is used by Spark

AWS_REGION=us-east-1

# This must match if using minio

MINIO_REGION=us-east-1

# Used by pyIceberg

AWS_DEFAULT_REGION=us-east-1

# AWS Credentials (this can use minio credential, to be filled in later)

AWS_ACCESS_KEY_ID=minioadmin

AWS_SECRET_ACCESS_KEY=minioadmin

# If using Minio, this should be the API address of Minio Server

AWS_S3_ENDPOINT=http://minioserver:9000

# Location where files will be written when creating new tables

WAREHOUSE=s3a://warehouse/

# URI of Nessie Catalog

NESSIE_URI=http://nessie:19120/api/v1


Note:

issue#1

The original blog suggests creating an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY later and filling them in here, but the MinIO community edition does not provide an option to create a new access key and secret.

So use the default admin user and password (minioadmin/minioadmin) as the access key and secret.
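
If you want to double-check that these credentials are accepted before wiring up the notebook, a quick call against the MinIO S3 API will do it. The snippet below is my own sketch, not from the blog; it assumes boto3 is installed (pip install boto3), the MinIO container from step 4 is up, and the API port 9000 is reachable on 127.0.0.1 from wherever you run it.

import boto3

# Hypothetical sanity check: connect to MinIO's S3 API with the root credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",   # MinIO S3 API port (not the console port 9001)
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    region_name="us-east-1",
)

# Lists buckets; a wrong key/secret would raise a ClientError here.
print([b["Name"] for b in s3.list_buckets()["Buckets"]])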


issue#2

The hostnames minioserver and nessie won't resolve to any IP, since this is a lab with no DNS server set up, and the machine itself is an isolated setup with no incoming connectivity from the outside world.

So add /etc/hosts entries like the ones below...


[root@localhost de]# cat /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4

::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

127.0.0.1   minioserver

127.0.0.1   nessie

Recommendation: Keep the Linux firewall down until you have a working setup.
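
Once the hosts entries are in and the firewall is down, a quick connectivity check can save time before touching the notebook. The snippet below is my own sketch (not from the blog), using only the Python standard library; it assumes it is run on the VM host after the containers from step 4 are up.

import socket

# Hypothetical check: confirm minioserver and nessie resolve and accept connections.
for host, port in [("minioserver", 9000), ("nessie", 19120)]:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"{host}:{port} reachable")
    except OSError as err:
        print(f"{host}:{port} NOT reachable: {err}")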


4. Set the PATH for docker-compose and bring up the containers one by one or all at once...


export PATH=$PATH:/usr/libexec/docker/cli-plugins/


one line command:

docker-compose up spark_notebook minioserver nessie dremio


to launch the containers in detached mode:

docker-compose up -d spark_notebook minioserver nessie dremio


to view docker log:


docker logs --tail 100 notebook


5. Verify if all containers are up and running

[root@localhost de]# docker ps -a

CONTAINER ID   IMAGE                         COMMAND                  CREATED          STATUS          PORTS                                                                                                                                                      NAMES

162f2fc51f04   dremio/dremio-oss:latest      "bin/dremio start-fg"    12 seconds ago   Up 10 seconds   0.0.0.0:9047->9047/tcp, [::]:9047->9047/tcp, 0.0.0.0:31010->31010/tcp, [::]:31010->31010/tcp, 0.0.0.0:32010->32010/tcp, [::]:32010->32010/tcp, 45678/tcp   dremio

fe18a2927d5c   minio/minio                   "/usr/bin/docker-ent…"   40 minutes ago   Up 40 minutes   0.0.0.0:9000-9001->9000-9001/tcp, [::]:9000-9001->9000-9001/tcp                                                                                            minio

ab3d6f24873e   projectnessie/nessie          "/usr/local/s2i/run"     40 minutes ago   Up 40 minutes   8080/tcp, 8443/tcp, 0.0.0.0:19120->19120/tcp, [::]:19120->19120/tcp                                                                                        nessie

178ec9f92254   alexmerced/spark33-notebook   "/bin/sh -c '~/.loca…"   40 minutes ago   Up 40 minutes   0.0.0.0:8888->8888/tcp, [::]:8888->8888/tcp                                                                                                                 notebook

[root@localhost de]#


6. You won't be able to access the URLs you have set up from outside of your VM. To access them from the Windows host OS, set up port forwarding in VirtualBox.

In my case I have set up the following port forwarding (guest port -> host port):


9001 -> 59001 # for the minio console

8888 -> 58888 # for the notebook


7. Now launch the URLs below (using the forwarded ports)


minio:

http://127.0.0.1:59001/ [with forward port]


notebook:

http://127.0.0.1:58888/ [with forward port]


The blog points out that we need to fetch the notebook URL (with its token) from the docker log, like below...


docker logs --tail 100 notebook

http://127.0.0.1:8888/?token=dc27dd45e85007edef5219fcfab1e90229bc802083348a94


but it also works fine with just http://127.0.0.1:8888, without the token.
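
Before opening a browser, you can also confirm the forwarded ports answer from the Windows host. This is a small sketch of mine (not from the blog); it assumes Python is available on the host and uses the host ports 59001/58888 chosen in step 6.

import urllib.request

# Hypothetical check: hit the forwarded MinIO console and notebook URLs.
for name, url in [("minio console", "http://127.0.0.1:59001/"),
                  ("notebook", "http://127.0.0.1:58888/")]:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except Exception as err:
        print(f"{name}: not reachable ({err})")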


8. Now create the table...


import pyspark

from pyspark.sql import SparkSession

import os



## DEFINE SENSITIVE VARIABLES

NESSIE_URI = os.environ.get("NESSIE_URI") ## Nessie Server URI

WAREHOUSE = os.environ.get("WAREHOUSE") ## BUCKET TO WRITE DATA TO

AWS_ACCESS_KEY = os.environ.get("AWS_ACCESS_KEY_ID") ## AWS CREDENTIALS (name matches the .env file)

AWS_SECRET_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY") ## AWS CREDENTIALS (name matches the .env file)

AWS_S3_ENDPOINT = os.environ.get("AWS_S3_ENDPOINT") ## MINIO ENDPOINT



print(AWS_S3_ENDPOINT)

print(NESSIE_URI)

print(WAREHOUSE)



conf = (

    pyspark.SparkConf()

        .setAppName('app_name')

        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.67.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178')

        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions')

        .set('spark.sql.catalog.nessie', 'org.apache.iceberg.spark.SparkCatalog')

        .set('spark.sql.catalog.nessie.uri', NESSIE_URI)

        .set('spark.sql.catalog.nessie.ref', 'main')

        .set('spark.sql.catalog.nessie.authentication.type', 'NONE')

        .set('spark.sql.catalog.nessie.catalog-impl', 'org.apache.iceberg.nessie.NessieCatalog')

        .set('spark.sql.catalog.nessie.s3.endpoint', AWS_S3_ENDPOINT)

        .set('spark.sql.catalog.nessie.warehouse', WAREHOUSE)

        .set('spark.sql.catalog.nessie.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')

        .set('spark.hadoop.fs.s3a.access.key', AWS_ACCESS_KEY)

        .set('spark.hadoop.fs.s3a.secret.key', AWS_SECRET_KEY)

)



## Start Spark Session

spark = SparkSession.builder.config(conf=conf).getOrCreate()

print("Spark Running")



## Create a Table

spark.sql("CREATE TABLE nessie.names (name STRING) USING iceberg;").show()



## Insert Some Data

spark.sql("INSERT INTO nessie.names VALUES ('Alex Merced'), ('Dipankar Mazumdar'), ('Jason Hughes')").show()



## Query the Data

spark.sql("SELECT * FROM nessie.names;").show()


Log:

http://minioserver:9000

http://nessie:19120/api/v1

s3a://warehouse/

:: loading settings :: url = jar:file:/home/docker/.local/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml

Ivy Default Cache set to: /home/docker/.ivy2/cache

The jars for the packages stored in: /home/docker/.ivy2/jars

org.apache.iceberg#iceberg-spark-runtime-3.3_2.12 added as a dependency

org.projectnessie.nessie-integrations#nessie-spark-extensions-3.3_2.12 added as a dependency

software.amazon.awssdk#bundle added as a dependency

software.amazon.awssdk#url-connection-client added as a dependency

:: resolving dependencies :: org.apache.spark#spark-submit-parent-50482bc6-c583-4c4a-892d-1cdc577a038d;1.0

confs: [default]

found org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.3.1 in central

found org.projectnessie.nessie-integrations#nessie-spark-extensions-3.3_2.12;0.67.0 in central

found software.amazon.awssdk#bundle;2.17.178 in central

found software.amazon.eventstream#eventstream;1.0.1 in central

found software.amazon.awssdk#url-connection-client;2.17.178 in central

found software.amazon.awssdk#utils;2.17.178 in central

found org.reactivestreams#reactive-streams;1.0.3 in central

found software.amazon.awssdk#annotations;2.17.178 in central

found org.slf4j#slf4j-api;1.7.30 in central

found software.amazon.awssdk#http-client-spi;2.17.178 in central

found software.amazon.awssdk#metrics-spi;2.17.178 in central

downloading https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.3.1/iceberg-spark-runtime-3.3_2.12-1.3.1.jar ...

[SUCCESSFUL ] org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.3.1!iceberg-spark-runtime-3.3_2.12.jar (5045ms)

downloading https://repo1.maven.org/maven2/org/projectnessie/nessie-integrations/nessie-spark-extensions-3.3_2.12/0.67.0/nessie-spark-extensions-3.3_2.12-0.67.0.jar ...

[SUCCESSFUL ] org.projectnessie.nessie-integrations#nessie-spark-extensions-3.3_2.12;0.67.0!nessie-spark-extensions-3.3_2.12.jar (283ms)

downloading https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.17.178/bundle-2.17.178.jar ...

[SUCCESSFUL ] software.amazon.awssdk#bundle;2.17.178!bundle.jar (21399ms)

downloading https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.17.178/url-connection-client-2.17.178.jar ...

[SUCCESSFUL ] software.amazon.awssdk#url-connection-client;2.17.178!url-connection-client.jar (98ms)

downloading https://repo1.maven.org/maven2/software/amazon/eventstream/eventstream/1.0.1/eventstream-1.0.1.jar ...

[SUCCESSFUL ] software.amazon.eventstream#eventstream;1.0.1!eventstream.jar (117ms)

downloading https://repo1.maven.org/maven2/software/amazon/awssdk/utils/2.17.178/utils-2.17.178.jar ...

[SUCCESSFUL ] software.amazon.awssdk#utils;2.17.178!utils.jar (133ms)

downloading https://repo1.maven.org/maven2/software/amazon/awssdk/annotations/2.17.178/annotations-2.17.178.jar ...

[SUCCESSFUL ] software.amazon.awssdk#annotations;2.17.178!annotations.jar (98ms)

downloading https://repo1.maven.org/maven2/software/amazon/awssdk/http-client-spi/2.17.178/http-client-spi-2.17.178.jar ...

[SUCCESSFUL ] software.amazon.awssdk#http-client-spi;2.17.178!http-client-spi.jar (171ms)

downloading https://repo1.maven.org/maven2/org/reactivestreams/reactive-streams/1.0.3/reactive-streams-1.0.3.jar ...

[SUCCESSFUL ] org.reactivestreams#reactive-streams;1.0.3!reactive-streams.jar (111ms)

downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.30/slf4j-api-1.7.30.jar ...

[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.30!slf4j-api.jar (105ms)

downloading https://repo1.maven.org/maven2/software/amazon/awssdk/metrics-spi/2.17.178/metrics-spi-2.17.178.jar ...

[SUCCESSFUL ] software.amazon.awssdk#metrics-spi;2.17.178!metrics-spi.jar (118ms)

:: resolution report :: resolve 10476ms :: artifacts dl 27945ms

:: modules in use:

org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.3.1 from central in [default]

org.projectnessie.nessie-integrations#nessie-spark-extensions-3.3_2.12;0.67.0 from central in [default]

org.reactivestreams#reactive-streams;1.0.3 from central in [default]

org.slf4j#slf4j-api;1.7.30 from central in [default]

software.amazon.awssdk#annotations;2.17.178 from central in [default]

software.amazon.awssdk#bundle;2.17.178 from central in [default]

software.amazon.awssdk#http-client-spi;2.17.178 from central in [default]

software.amazon.awssdk#metrics-spi;2.17.178 from central in [default]

software.amazon.awssdk#url-connection-client;2.17.178 from central in [default]

software.amazon.awssdk#utils;2.17.178 from central in [default]

software.amazon.eventstream#eventstream;1.0.1 from central in [default]

---------------------------------------------------------------------

|                  |            modules            ||   artifacts   |

|       conf       | number| search|dwnlded|evicted|| number|dwnlded|

---------------------------------------------------------------------

|      default     |   11  |   11  |   11  |   0   ||   11  |   11  |

---------------------------------------------------------------------

:: retrieving :: org.apache.spark#spark-submit-parent-50482bc6-c583-4c4a-892d-1cdc577a038d

confs: [default]

11 artifacts copied, 0 already retrieved (367715kB/2734ms)

25/12/09 23:23:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

25/12/09 23:24:26 WARN Executor: Issue communicating with driver in heartbeater

org.apache.spark.SparkException: Exception thrown in awaitResult: 

at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)

at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)

at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)

at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:87)

at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:79)

at org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:643)

at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1057)

at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:238)

at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066)

at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)

at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)

at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)

at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)

at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)

at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

at java.base/java.lang.Thread.run(Thread.java:829)

Caused by: java.lang.NullPointerException

at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$register(BlockManagerMasterEndpoint.scala:579)

at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:121)

at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)

at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)

at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)

at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)

at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)

at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)

at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)

... 3 more

25/12/09 23:24:26 ERROR Inbox: Ignoring error

java.lang.NullPointerException

at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$register(BlockManagerMasterEndpoint.scala:579)

at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:121)

at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)

at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)

at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)

at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)

at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)

at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)

at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)

at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)

at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)

at java.base/java.lang.Thread.run(Thread.java:829)

Spark Running

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".

SLF4J: Defaulting to no-operation (NOP) logger implementation

SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

++

||

++

++


                                                                                

++

||

++

++


                                                                                

+-----------------+

|             name|

+-----------------+

|      Alex Merced|

|Dipankar Mazumdar|

|     Jason Hughes|

+-----------------+


9. Query the table data in different ways from the Spark notebook (a few more query variants are sketched after the log below).

spark.sql("SELECT * FROM nessie.names order by 1 desc;").show()


log:

+-----------------+

|             name|

+-----------------+

|     Jason Hughes|

|Dipankar Mazumdar|

|      Alex Merced|

+-----------------+


25/12/09 23:36:56 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 341119 ms exceeds timeout 120000 ms

25/12/09 23:36:56 WARN SparkContext: Killing executors is not supported by current scheduler.
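
Beyond a simple ORDER BY, the same session can run filters, aggregates, and Iceberg's built-in metadata tables. The snippet below is my own sketch rather than something from the blog; it assumes the SparkSession from step 8 is still active and that nessie.names exists.

## Filter and count
spark.sql("SELECT count(*) AS cnt FROM nessie.names").show()
spark.sql("SELECT * FROM nessie.names WHERE name LIKE 'A%'").show()

## Iceberg metadata tables: inspect the table's snapshots and data files
spark.sql("SELECT snapshot_id, committed_at, operation FROM nessie.names.snapshots").show(truncate=False)
spark.sql("SELECT file_path, record_count FROM nessie.names.files").show(truncate=False)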



10. Verify the data storage in MinIO: visit the MinIO console URL and check the warehouse bucket (a programmatic check is sketched below).
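
If you prefer a programmatic check over the console, the bucket contents can be listed via the S3 API as well. Again a sketch of mine (not from the blog), assuming boto3 is installed, the MinIO API is reachable on 127.0.0.1:9000, and the warehouse bucket was created as per the blog.

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    region_name="us-east-1",
)

# Expect parquet data files plus Iceberg metadata (metadata.json, .avro) under the table path.
for obj in s3.list_objects_v2(Bucket="warehouse").get("Contents", []):
    print(obj["Key"], obj["Size"])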


Next step: usage of Nessie and Dremio, which will be covered in part 2.

Thanks
