Monday, March 1, 2021

Install & Setup Hadoop in Standalone Mode

Environment:
OS: CentOS 7.9
Kernel: 3.10.0-1160.6.1.el7.x86_64
Java version:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)

1) Set up the hadoop user on the CentOS machine (follow the blog link below; a sketch of the commands follows the link)

https://oracledbaplanner.blogspot.com/2021/02/adding-linux-usergroup-and-modifying.html
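
For reference, a minimal sketch of what that setup looks like, run as root, assuming the bigdata group and hadoop user names that appear in the listings below (the linked post has the full detail):

groupadd bigdata
useradd -g bigdata -m hadoop
passwd hadoop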

2) Set up a directory for the Apache Hadoop install

[root@localhost ~]# mkdir /opt/hadoop

[root@localhost ~]# chown hadoop:bigdata /opt/hadoop

[root@localhost ~]# ls -altr /opt/hadoop
total 0
drwxr-xr-x. 4 root   root    53 Mar  1 02:14 ..
drwxr-xr-x. 2 hadoop bigdata  6 Mar  1 02:14 .
[root@localhost ~]#

3) Download the Hadoop binary from https://downloads.apache.org/hadoop/common/, specifically:


wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
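
Optionally, verify the download against the SHA-512 checksum that Apache publishes alongside the tarball (this assumes GNU coreutils' sha512sum, which understands the BSD-style lines in the .sha512 file):

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz.sha512
sha512sum -c hadoop-3.3.0.tar.gz.sha512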

4) Once the download finishes, decompress and untar the archive

[hadoop@localhost hadoop]$ ls -altr
total 489016
-rw-r--r--. 1 hadoop bigdata 500749234 Jul 15  2020 hadoop-3.3.0.tar.gz
drwxr-xr-x. 4 root   root           53 Mar  1 02:14 ..
drwxr-xr-x. 2 hadoop bigdata        33 Mar  1 02:32 .
[hadoop@localhost hadoop]$ gzip -d hadoop-3.3.0.tar.gz

[hadoop@localhost hadoop]$ ls -altr
total 1034752
-rw-r--r--. 1 hadoop bigdata 1059584000 Jul 15  2020 hadoop-3.3.0.tar
drwxr-xr-x. 4 root   root            53 Mar  1 02:14 ..
drwxr-xr-x. 2 hadoop bigdata         30 Mar  1 23:13 .

[hadoop@localhost hadoop]$ pwd
/opt/hadoop

[hadoop@localhost hadoop]$ tar -tvf hadoop-3.3.0.tar|head
drwxr-xr-x brahma/brahma     0 2020-07-06 15:50 hadoop-3.3.0/
-rw-rw-r-- brahma/brahma   175 2020-03-24 13:23 hadoop-3.3.0/README.txt
..
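
The extraction step itself isn't captured in this transcript; given the nohup.out file that shows up in the next listing, it was presumably run in the background along these lines (an assumption, not the recorded command):

nohup tar -xvf hadoop-3.3.0.tar &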

I checked nohup.out for any errors reported during the untar operation. The tar file is nearly 1 GB in size, and after untarring the total usage of the directory is about 2 GB:

[hadoop@localhost hadoop]$ ls -altr
total 1038160
drwxr-xr-x. 10 hadoop bigdata        215 Jul  6  2020 hadoop-3.3.0
-rw-r--r--.  1 hadoop bigdata 1059584000 Jul 15  2020 hadoop-3.3.0.tar
drwxr-xr-x.  4 root   root            53 Mar  1 02:14 ..
drwxr-xr-x.  3 hadoop bigdata         67 Mar  1 23:13 .
-rw-------.  1 hadoop bigdata    3486147 Mar  1 23:14 nohup.out

[hadoop@localhost hadoop]$ du -sh .
2.0G    .

[hadoop@localhost hadoop]$ du -sk .
2092064 .
[hadoop@localhost hadoop]$

5) Detect and set the Java home. Run the command below on the CentOS machine, look for java.home in the output, and use that value to set JAVA_HOME in the hadoop user's .bash_profile.

java -XshowSettings:properties -version
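
Note that -XshowSettings writes its output to stderr, so to pull out just the java.home line:

java -XshowSettings:properties -version 2>&1 | grep 'java.home'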

JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre
export JAVA_HOME
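
To make this persist across sessions, append the two lines to the hadoop user's .bash_profile and re-source it (the JDK path above is the java.home value from this machine and will differ on yours):

echo 'JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre' >> ~/.bash_profile
echo 'export JAVA_HOME' >> ~/.bash_profile
source ~/.bash_profile
echo $JAVA_HOME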

6) Now launch a new session as the hadoop user and follow the steps below [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html]

    a) Go to the untarred hadoop-3.3.0 directory

[hadoop@localhost hadoop-3.3.0]$ ls -altr
total 84
-rw-r--r--.  1 hadoop bigdata   175 Mar 24  2020 README.txt
-rw-r--r--.  1 hadoop bigdata  1541 Mar 24  2020 NOTICE.txt
-rw-r--r--.  1 hadoop bigdata 27570 Mar 24  2020 NOTICE-binary
-rw-r--r--.  1 hadoop bigdata 15697 Mar 24  2020 LICENSE.txt
-rw-r--r--.  1 hadoop bigdata 22976 Jul  4  2020 LICENSE-binary
drwxr-xr-x.  3 hadoop bigdata  4096 Jul  6  2020 sbin
drwxr-xr-x.  3 hadoop bigdata    20 Jul  6  2020 etc
drwxr-xr-x.  2 hadoop bigdata  4096 Jul  6  2020 licenses-binary
drwxr-xr-x.  3 hadoop bigdata    20 Jul  6  2020 lib
drwxr-xr-x. 10 hadoop bigdata   215 Jul  6  2020 .
drwxr-xr-x.  2 hadoop bigdata   203 Jul  6  2020 bin
drwxr-xr-x.  2 hadoop bigdata   106 Jul  6  2020 include
drwxr-xr-x.  4 hadoop bigdata   288 Jul  6  2020 libexec
drwxr-xr-x.  4 hadoop bigdata    31 Jul  6  2020 share
drwxr-xr-x.  3 hadoop bigdata    67 Mar  1 23:13 ..
[hadoop@localhost hadoop-3.3.0]$

[hadoop@localhost hadoop-3.3.0]$ pwd
/opt/hadoop/hadoop-3.3.0
[hadoop@localhost hadoop-3.3.0]$

    b) Create an input directory and copy the *.xml files from the etc/hadoop directory

[hadoop@localhost hadoop-3.3.0]$ mkdir input

[hadoop@localhost hadoop-3.3.0]$ ls -ld input
drwxr-xr-x. 2 hadoop bigdata 6 Mar  2 01:13 input
[hadoop@localhost hadoop-3.3.0]$ cp etc/hadoop/*.xml input/
[hadoop@localhost hadoop-3.3.0]$
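
A quick check that the configuration XMLs landed in input:

ls input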

    c) Run the command below. The example uses the copied configuration files as input, then finds and displays every match of the given regular expression; output is written to the given output directory. (The grep example actually runs two MapReduce jobs back to back, a search job and then a sort job, which is why the log below shows job _0001 being submitted but job _0002 completing.)

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep input output 'dfs[a-z.]+'

Output: 

2021-03-02 01:24:00,805 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-03-02 01:24:00,932 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-03-02 01:24:00,932 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-03-02 01:24:01,129 INFO input.FileInputFormat: Total input files to process : 10
2021-03-02 01:24:01,172 INFO mapreduce.JobSubmitter: number of splits:10
2021-03-02 01:24:01,460 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local440722701_0001
2021-03-02 01:24:01,460 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-03-02 01:24:01,680 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
2021-03-02 01:24:01,680 INFO mapreduce.Job: Running job: job_local440722701_0001
2021-03-02 01:24:01,686 INFO mapred.LocalJobRunner: OutputCommitter set in config null
...
2021-03-02 01:24:04,925 INFO mapreduce.Job:  map 100% reduce 100%
2021-03-02 01:24:04,926 INFO mapreduce.Job: Job job_local2096128567_0002 completed successfully
2021-03-02 01:24:04,931 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=1203532
                FILE: Number of bytes written=3576646
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=1
                Map output records=1
                Map output bytes=17
                Map output materialized bytes=25
                Input split bytes=127
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=25
                Reduce input records=1
                Reduce output records=1
                Spilled Records=2
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=41
                Total committed heap usage (bytes)=273997824
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=123
        File Output Format Counters
                Bytes Written=23
[hadoop@localhost hadoop-3.3.0]$

    d) Verify the output

[hadoop@localhost hadoop-3.3.0]$ cat output/*
1       dfsadmin
[hadoop@localhost hadoop-3.3.0]$

[hadoop@localhost hadoop-3.3.0]$ ls -altr output/*
-rw-r--r--. 1 hadoop bigdata 11 Mar  2 01:24 output/part-r-00000
-rw-r--r--. 1 hadoop bigdata  0 Mar  2 01:24 output/_SUCCESS
[hadoop@localhost hadoop-3.3.0]$ cat output/part-r-00000
1       dfsadmin
[hadoop@localhost hadoop-3.3.0]$
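
One caveat if you re-run the example: MapReduce refuses to write to an output directory that already exists, so remove it first (in standalone mode it is just a local directory):

rm -r output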

# We are done with the Standalone mode of operation. We will cover Pseudo-Distributed Operation in a separate blog post.
