CCAH
"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term." - Doug Cutting, Hadoop Project Creator
Course topics:
- Introduction
- HDFS
- MapReduce
- YARN
- Pseudo-distributed mode
- Running MapReduce jobs
- Hadoop ecosystem
- Supporting tools and Big Data future
- Hadoop cluster planning
- Security
- Monitoring and logging
- Hadoop cluster installation and configuration
- YARN cluster installation and configuration
- Case Studies, Certification and Surveys
Administrator Training for Apache Hadoop
Hadoop introduction⌘
What does Hadoop mean?⌘
Outline⌘
- First Day:
- Session I:
- Introduction to the course
- Introduction to Cloud Computing and Big Data solutions
- Session II:
- HDFS
- Session III:
- MapReduce
- Session IV:
- YARN
Outline #2⌘
- Second Day:
- Session I:
- Installation and configuration of Hadoop in pseudo-distributed mode
- Session II:
- Running MapReduce jobs on the Hadoop cluster
- Session III:
- Hadoop ecosystem: Pig, Hive, Sqoop, HBase, Flume, Oozie
- Session IV:
- Supporting tools: Hue, Cloudera Manager
- Big Data future: Impala, Cassandra
Outline #3⌘
- Third Day:
- Session I:
- Hadoop cluster planning
- Session II:
- Hadoop cluster planning
- Session III:
- Hadoop security
- Session IV:
- Hadoop monitoring and logging
Outline #4⌘
- Fourth Day:
- Session I:
- Hadoop cluster installation and configuration
- Session II:
- Hadoop cluster installation and configuration
- Session III:
- Hadoop cluster installation and configuration
- Session IV:
- Hadoop cluster installation and configuration
Outline #5⌘
- Fifth Day:
- Session I:
- Hadoop cluster installation and configuration
- Session II:
- Hadoop cluster installation and configuration
- Session III:
- Hadoop cluster installation and configuration
- Session IV:
- Case Studies
- Certification and Surveys
First Day - Session I⌘
"I think there is a world market for maybe five computers"⌘
How big are the data we are talking about?⌘
Data storage evolution⌘
- 1956 - HDD (Hard Disk Drive), now up to 6 TB
- 1983 - SSD (Solid State Drive), now up to 16 TB
- 1984 - NFS (Network File System), first NAS (Network Attached Storage) implementation
- 1987 - RAID (Redundant Array of Independent Disks), now up to ~100 disks
- 1993 - Disk Arrays, now up to ~200 disks
- 1994 - Fibre-channel, first SAN (Storage Area Network) implementation
- 2003 - GFS (Google File System), first Big Data implementation
Virtualization vs clustering⌘
Foundations of cloud computing⌘
The base features of a cloud:
- it is completely transparent to users
- it delivers value to users in the form of services
- its storage and computing resources are infinite from the users' perspective
- it is geographically distributed
- it is highly available
- it runs on commodity hardware
- it leverages computer network technologies
- it is easily scalable
- it operates on the basis of the distributed computing paradigm
- it is multi-tenant
- it implements "pay-as-you-go" billing
What is Hadoop?⌘
- Apache project for storing and processing large data sets
- Open-source implementation of Google Big Data solutions
- Components:
- HDFS (Hadoop Distributed File System)
- YARN (Yet Another Resource Negotiator)
- Data processing models (MapReduce, Impala, Tez, etc.)
- Underpinning tools (Pig, Hive, Sqoop, HBase, etc.)
- Written in Java
Big Data and Hadoop evolution⌘
- 2003 - GFS (Google File System) publication, available here
- 2004 - Google MapReduce publication, available here
- 2005 - Hadoop project is founded
- 2006 - Google BigTable publication, available here
- 2008 - First Hadoop implementation
- 2010 - Google Dremel publication, available here
- 2010 - First HBase implementation
- 2012 - Apache Hadoop YARN project is founded
- 2013 - Cloudera releases Impala
- 2013 - Apache releases Tez
- 2014 - Cloudera signs cooperation agreements with Teradata and Red Hat
Hadoop distributions⌘
- Apache Hadoop, http://hadoop.apache.org/
- CDH (Cloudera's Distribution including Apache Hadoop), http://www.cloudera.com/
- HDP (Hortonworks Data Platform), http://hortonworks.com/
- M3, M5 and M7, http://www.mapr.com/
- Amazon Elastic MapReduce (EMR), http://aws.amazon.com/
- BigInsights Enterprise Edition, http://www.ibm.com/
- Intel Distribution for Apache Hadoop, http://hadoop.intel.com/
- And more ...
Who uses Hadoop?⌘
Where is it all going?⌘
- At this point we have all the tools to start changing the future:
- capability of storing and processing any amount of data
- tools for storing unstructured and structured data
- tools for batch and interactive processing
- tools allowing an integration with other services
- cloud computing paradigm has matured
- BigData use and future:
- search engines
- scientific analysis
- advertisements
- trends prediction
- business intelligence
- artificial intelligence
CCA-500 exam⌘
- Detailed Objectives: http://www.cloudera.com/content/cloudera/en/training/certification/ccah/prep.html
- Certification Authority: PearsonVUE (http://www.pearsonvue.com/)
- Exam Cost: USD 295 (prices in local currencies are updated on a daily basis)
- Exam Duration: 90 minutes
- Number of questions: 60
- Exam Passing Score: 70%
- Question Type: closed
- Alternative exams: CCA-410, CCA-505
References⌘
- Books:
- E. Sammer. "Hadoop Operations". O'Reilly, 1st edition
- T. White. "Hadoop: The Definitive Guide". O'Reilly, 3rd edition
- Publications:
- Websites:
- Apache Hadoop project website: http://hadoop.apache.org/
- Cloudera website: http://www.cloudera.com/content/cloudera/en/home.html
- Hortonworks website: http://hortonworks.com/
- MapR website: http://www.mapr.com/
- Internet
Introduction to the lab⌘
Lab components:
- Laptop with Windows
- Virtual Machine with Linux on VirtualBox:
- Hadoop instance in pseudo-distributed mode
- Hadoop ecosystem
- terminal for connecting to GCE
- GCE (Google Compute Engine) cloud-based IaaS platform
- Hadoop cluster
VirtualBox⌘
- VirtualBox 64-bit version (click here to download)
- VM name: CCAH
- Credentials: terminal / terminal (use root account)
- Snapshots (top right corner)
- Press right "Control" key to release
GCE⌘
- Accounts
- Project name: Hadoop
- Project ID: check after logging in
- GUI:
- CLI:
- gcloud auth login
- gcloud config set project [Project ID]
- sudo /opt/google-cloud-sdk/bin/gcloud components update
First Day - Session II⌘
Goals and motivation⌘
- Very large files:
- file size up to tens of terabytes
- filesystem size up to tens of petabytes
- Streaming data access:
- write-once, read-many-times approach
- optimized for batch processing
- Fault tolerance:
- data replication
- metadata high availability
- Scalability:
- commodity hardware
- ease of extension
- Resilience:
- components failures are common
- Support for MapReduce computing paradigm
Design⌘
- FUSE (Filesystem in USErspace)
- Non POSIX-compliant filesystem (http://standards.ieee.org/findstds/standard/1003.1-2008.html)
- Block abstraction:
- block size: 64 MB by default (editable by client per file)
- block replication: 3 replicas by default (editable by client per file)
- Benefits:
- support for files larger than disks
- segregation of data and metadata
- fault tolerance
- Lab Exercise 1.2.1
Daemons⌘
- Datanode:
- responsible for storing data
- many per cluster
- Namenode:
- responsible for storing metadata
- responsible for maintaining a database of data location
- 1 active per cluster
- Secondary namenode:
- responsible for processing metadata
- 1 active per cluster
Data⌘
- Blocks:
- Stored in the form of files on top of an underlying filesystem
- Stored on datanodes' data disks and accessed by datanodes themselves
- Reported periodically to the namenode in a form of block reports
- Concepts:
- Lazy space allocation (a block occupies only as much underlying disk space as the data it actually contains)
- Fault tolerance thanks to replication
- Parallel processing thanks to replication
Metadata⌘
- Stored in a form of files on the namenode:
- fsimage_[transaction ID] - complete snapshot of the filesystem metadata up to specified transaction
- edits_inprogress_[transaction ID] - incremental modifications made to the metadata since specified transaction
- Contain information on properties of the HDFS filesystem
- Copied and served from the namenode's RAM
- Processed by the secondary namenode
- Don't confuse metadata with a database of data blocks' location
- Lab Exercise 1.2.2
Metadata management⌘
- When does it occur?
- every hour
- edits file reaches 64 MB
- What does it look like?
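- The checkpoint trigger is configurable; a minimal sketch with the Hadoop 1.x property names (defaults shown; in Hadoop 2 the equivalents are dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns):
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>       <!-- seconds between checkpoints -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value>   <!-- edits size in bytes that forces a checkpoint -->
</property>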
Read path⌘
- A client opens a file by calling open() on FileSystem object
- The client calls the namenode to return a sorted list of datanodes for the first batch of blocks in the file
- The client caches the list
- The client connects to the first datanode on the list
- The client streams the data from the datanode
- The client closes the connection to the datanode
- The client repeats steps 3-5 for the next blocks or steps 2-5 for the next batch of blocks, or closes the file
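- As an illustration of the block locations the namenode hands out, a file's blocks and their replica locations can be listed with fsck (hypothetical path):
hadoop fsck /user/terminal/data/weather.txt -files -blocks -locations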
Read path #2⌘
Data read preferences⌘
- Determines the closest datanode hosting the same block of data
- Distance based on the available network bandwidth between two nodes:
- 0 - the same node
- 2 - the same rack
- 4 - the same data center
- 6 - different data centers
Write path⌘
- A client creates a file by calling create() on DistributedFileSystem object
- The client calls a namenode to create the file with no blocks in a filesystem namespace
- The client calls the namenode to return a list of datanodes to store replicas of a batch data blocks
- The client caches the list
- The client connects to the first datanode on the list and chain streams the data block over there
- The datanode connects to the second datanode on the list and immediately chain streams the data block over there, and so on ...
- The datanodes acknowledge the receipt of the data block back to the client
- The client repeats steps 4-5 for the next blocks or steps 3-5 for the next batch of blocks
- The client closes the data queue
- The client notifies the namenode that the file is complete
Write path #2⌘
Data write preferences⌘
- 1st copy - on the client (if running datanode daemon) or randomly selected non-overloaded datanode
- 2nd copy - on a datanode in a different rack in the same data center
- 3rd copy - on a randomly selected non-overloaded datanode in the same rack as the second copy
- 4th and further copies - on randomly selected non-overloaded datanodes
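- A quick way to observe replica placement (hypothetical file and paths; -D overrides the client-side replication factor for this upload only):
hdfs dfs -D dfs.replication=2 -put weather.txt /user/terminal/data/
hadoop fsck /user/terminal/data/weather.txt -files -blocks -locations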
Namenode High Availability⌘
- Namenode - SPOF (Single Point Of Failure)
- Manual namenode recovery process may take up to 30 minutes
- Namenode HA feature available since 0.23 release
- High Availability type: active / passive
- Failover type: manual / automatic
- Based on QJM (Quorum Journal Manager)
- STONITH (Shoot The Other Node In The Head)
- Fencing methods
- Secondary namenode functionality moved to the standby namenode
Namenode High Availability #2⌘
Journal Node⌘
- Relatively lightweight
- Usually collocated on the machines running namenode or jobtracker daemons
- edits file modifications must be written to a majority of the journal nodes
- Usually deployed as an odd number of nodes: 3, 5, 7, 9, etc.
- Lab Exercise 1.2.3
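- A minimal hdfs-site.xml sketch of a QJM-based HA pair (property names from the Apache HDFS HA documentation; the nameservice and hostnames are hypothetical):
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>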
Namenode Federation⌘
- Namenode memory limitations
- Namenode federation feature available since 0.23 release
- Horizontal scalability
- Filesystem partitioning
- ViewFS
- Federation of HA pairs
- Lab Exercise 1.2.4
Namenode Federation #2⌘
Fault Tolerance⌘
- Case 1:
- Failure type: datanode failure
- Detection method: heartbeat messages sent every 3 seconds
- Handling method: datanode is assumed dead after 10 minutes
- Case 2:
- Failure type: communication failure
- Detection method: acknowledgements sent in response to read / write requests
- Handling method: failed datanode is skipped during read / write operations
- Case 3:
- Failure type: data corruption
- Detection method: checksum stored with every data block
- Handling method: data blocks with a wrong checksum are not reported by datanodes
Under replicated blocks⌘
- Block is assumed under replicated when:
- a datanode on which one of its replica resides is assumed dead by the namenode
- not enough block replicas are reported by datanodes
- Under replicated block handling strategy:
- the namenode automatically replicates under replicated blocks
- the administrator can manually re-balance blocks at any time
- Decommissioning dead datanode:
- add the datanode to the list of datanodes excluded from the cluster
- refresh the list of datanodes allowed to communicate with the cluster
- physically decommission / re-provision the datanode
Interfaces⌘
- Underlying interface:
- Java API - native Hadoop access method
- Additional interfaces:
- CLI
- C interface
- HTTP interface
- FUSE
URIs⌘
- [path] - default filesystem (as defined by fs.default.name, typically HDFS)
- file://[path] - local filesystem
- hdfs://[namenode]/[path] - remote HDFS filesystem
- http://[namenode]/[path] - HTTP
- viewfs://[path] - ViewFS
Java interface⌘
- Reminder: Hadoop is written in Java
- Intended for Hadoop developers rather than administrators
- Based on the FileSystem class
- Reading data:
- by streaming from Hadoop URIs
- by using Hadoop API
- Writing data:
- by streaming to Hadoop URIs
- by appending to files
Command-line interface⌘
- Shell-like commands
- All begin with the hdfs keyword
- No current working directory
- Overriding replication factor:
hdfs dfs -setrep [RF] [file]
- All commands available here
Base commands⌘
- hdfs dfs -mkdir [URI] - creates directory
- hdfs dfs -rm [URI] - removes file
- hdfs dfs -rm -r [URI] - removes directory
- hdfs dfs -cat [URI] - displays content of file
- hdfs dfs -ls [URI] - displays content of directory
- hdfs dfs -cp [source URI] [destination URI] - copies file or directory
- hdfs dfs -mv [source URI] [destination URI] - moves file or directory
- hdfs dfs -du [URI] - displays size of file or directory
Base commands #2⌘
- hdfs dfs -chmod [mode] [URI] - changes permissions of file or directory
- hdfs dfs -chown [owner] [URI] - changes owner of file or directory
- hdfs dfs -chgrp [group] [URI] - changes owner group of file or directory
- hdfs dfs -put [source path] [destination URI] - uploads data into HDFS
- hdfs dfs -getmerge [source directory] [destination file] - merges files from directory into single file
- hadoop archive -archiveName [name] [source URI]* [destination URI] - creates Hadoop archive
- hadoop fsck [URI] - performs HDFS filesystem check
- hadoop version - displays Hadoop version
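- A short example session combining the commands above (hypothetical paths and file names):
hdfs dfs -mkdir /user/terminal/data
hdfs dfs -put weather.txt /user/terminal/data
hdfs dfs -ls /user/terminal/data
hdfs dfs -setrep 2 /user/terminal/data/weather.txt
hadoop fsck /user/terminal/data -files -blocks
hadoop version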
C interface⌘
- libhdfs library
- Mirrors Java FileSystem interface
- Leverages JNI (Java Native Interface)
- Documentation: libhdfs/docs/api directory
- Home page: http://wiki.apache.org/hadoop/libHDFS
HTTP interface⌘
- HDFS HTTP ports:
- 50070 - namenode
- 50075 - datanode
- 50090 - secondary namenode
- Access methods:
- direct
- via proxy
- Pure HTTP vs WebHDFS vs HTTPFS
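- A minimal WebHDFS sketch (assumes dfs.webhdfs.enabled is set to true; hostname and paths are hypothetical):
curl -i "http://namenode.example.com:50070/webhdfs/v1/user/terminal?op=LISTSTATUS"
curl -i -L "http://namenode.example.com:50070/webhdfs/v1/user/terminal/weather.txt?op=OPEN"
- the OPEN call is answered by the namenode with a redirect to a datanode (port 50075), which serves the actual data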
FUSE⌘
- Reminder:
- HDFS is user space filesystem
- HDFS is not POSIX-compliant filesystem
- Fuse-DFS module
- Allows HDFS to be mounted under the Linux FHS (Filesystem Hierarchy Standard)
- Allows Unix tools to be used against mounted HDFS data structures
- Documentation: src/contrib/fuse-dfs
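- A minimal mount sketch (CDH packaging; assumes the hadoop-fuse-dfs package is installed and the namenode listens on port 8020; the hostname is hypothetical):
mkdir -p /mnt/hdfs
hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs
ls /mnt/hdfs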
Quotas⌘
- Count quotas vs space quotas:
- count quotas - limit a number of files in HDFS directory
- space quotas - limit a post-replication disk space utilization of HDFS directory
- Setting quotas:
hadoop dfsadmin -setQuota [count] [path]
hadoop dfsadmin -setSpaceQuota [size] [path]
- Removing quotas:
hadoop dfsadmin -clrQuota [path]
hadoop dfsadmin -clrSpaceQuota [path]
- Viewing quotas:
hadoop fs -count -q [path]
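- Example with concrete values (hypothetical directory; sizes accept suffixes such as g and t):
hadoop dfsadmin -setQuota 10000 /user/terminal
hadoop dfsadmin -setSpaceQuota 1t /user/terminal
hadoop fs -count -q /user/terminal
hadoop dfsadmin -clrSpaceQuota /user/terminal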
Limitations⌘
- Low-latency data access:
- tens of milliseconds
- alternatives: HBase, Cassandra
- Lots of small files
- limited by namenode memory in use
- inefficient storage utilization due to large block size
- Multiple writers, arbitrary file modifications:
- single writer
- append operations only
First Day - Session III⌘
Goals and motivation⌘
- Simplicity of development:
- MapReduce programs are simple functional programs for data processing
- all background activities are handled by the MapReduce framework itself
- Scalability:
- parallelism
- ease of extension
- Automatic parallelization and distribution of work:
- developers are required to write map and reduce functions only
- distribution and parallelism are handled by the MapReduce framework itself
- Fault tolerance:
- tasks distribution
- jobs high availability
Design⌘
- Fully automated parallel data processing computing paradigm
- Job abstraction as an atomic unit
- Jobs return key-value pairs
- Job components:
- map tasks
- reduce tasks
- Benefits:
- simplicity of development
- fast processing of large data sets
Daemons⌘
- Jobtracker:
- responsible for MapReduce jobs execution
- 1 active per cluster
- Tasktracker:
- responsible for MapReduce tasks execution
- many per cluster
MapReduce job stages⌘
- A client creates a job object using the Java MapReduce API
- The client retrieves a new job ID from the jobtracker
- The client calculates input splits and writes the job resources (e.g. jar file) on HDFS
- The client submits the job by calling the submitJob() procedure on the jobtracker
- The jobtracker initializes the job
- The jobtracker retrieves the job resources from HDFS
- The jobtracker allocates tasks on tasktrackers based on a scheduler in use
- The tasktrackers retrieve the job resources and data from HDFS
- The tasktrackers launch a separate JVM for task execution purpose
- The tasktrackers periodically report progress and status updates to the jobtracker
- The client periodically polls the jobtracker for progress and status updates
- On the job completion the jobtracker and the tasktrackers clean up their working state
MapReduce job stages #2⌘
Shuffle and sort⌘
- The tasktrackers execute the map function against the chunk of data they process
- The tasktrackers write the intermediate key-value pairs returned as an output of the map function in a circular memory buffer
- The tasktrackers partition the content of the circular memory buffer, sort it by key and spill to disks
- The tasktrackers execute the combiner function against the partitioned and sorted key-value pairs
- The tasktrackers merge the intermediate key-value pairs returned as an output of the combiner function into output files
- Other tasktrackers fetch the output files and merge them
- The tasktrackers execute the reduce function against the intermediate key-value pairs
- The tasktrackers write ultimate key-value pairs into HDFS
Shuffle and sort #2⌘
MapReduce by analogy⌘
- Lab Exercise 1.3.1
- Lab Exercise 1.3.2
Map task in details⌘
- Map function:
- takes key-value pair
- returns key-value pair
- Input key-value pair: file-line (record)
- Output key-value pair: defined by the map function
- Output key-value pairs are sorted by the key
- Partitioner assigns each key-value pair to a partition
Reduce task in details⌘
- Reduce function:
- takes key-value pair
- returns key-value pair
- Input key-value pair: returned by the map function
- Output key-value pair: final result
- Input is received in the shuffle and sort step
- Each reducer is assigned with one or more partitions
- Reducer copies the data from partitions as soon as they are written
- Reducer merges intermediate key-value pair based on the key
Sample data⌘
- http://academic.udayton.edu/kissock/http/Weather/gsod95-current/allsites.zip
1 1 1995 82.4
1 2 1995 75.1
1 3 1995 73.7
1 4 1995 77.1
1 5 1995 79.5
1 6 1995 71.3
1 7 1995 71.4
1 8 1995 75.2
1 9 1995 66.3
1 10 1995 61.8
Sample map function⌘
- Function:
#!/usr/bin/perl
use strict;

# Group temperature readings by year and emit "year: t1 t2 ..." lines
my %results;
while (<>) {
    # record format: "month day year temperature"
    $_ =~ m/.*(\d{4}).*\s(\d+\.\d+)/;
    push (@{$results{$1}}, $2);
}
foreach my $year (sort keys %results) {
    print "$year: ";
    foreach my $temperature (@{$results{$year}}) {
        print "$temperature ";
    }
    print "\n";
}
- Function output:
year: temperature1 temperature2 ...
Sample reduce function⌘
- Function:
#!/usr/bin/perl
use List::Util qw(max);
use strict;

# For every "year: t1 t2 ..." line, print the maximum temperature of that year
my %results;
while (<>) {
    $_ =~ m/(\d{4})\:\s(.*)/;
    my @temperatures = split(' ', $2);
    @{$results{$1}} = @temperatures;
}
foreach my $year (sort keys %results) {
    my $temperature = max(@{$results{$year}});
    print "$year: $temperature\n";
}
- Function output:
year: temperature
Combine function⌘
- Executed by the map task
- Executed just after the map function
- Most often the same as the reduce function
- Lab Exercise 1.3.3
High Availability⌘
- Jobtracker - SPOF
- High availability type: active / passive
- Failover type: manual / automatic
- Automatic failover based on Apache ZooKeeper failover controller
- STONITH
- Fencing methods
- Lab Exercise 1.3.4
Schedulers⌘
- Worker nodes resources:
- CPU
- RAM
- IO
- network bandwidth
- Each job has different requirements
- Data locality concern
- Schedulers:
- FIFO scheduler
- Fair scheduler
- Capacity scheduler
FIFO scheduler⌘
- FIFO (First In First Out)
- Default scheduler in MRv1
- Priorities:
- 5 levels
- each priority is a separate FIFO queue
- Suitable for small, development or experimental environments
Fair scheduler⌘
- Developed by Matei Zaharia as a part of his PhD thesis to bypass limitations of the FIFO scheduler
- Support for SLAs (Service Level Agreements) and multi-tenant environments
- Design:
- submitted jobs are placed in pools
- each pool is assigned with a number of task slots based on an available cluster capacity and pool parameters
- the pool to which the scheduler assigns a MapReduce job is defined by a property of the job
Fair scheduler #2⌘
- Fairness:
- pools with demands are allocated first
- pools without demands are allocated later
- pools are allocated at least as many slots as guaranteed by the minshare parameter
- pools are not allocated more slots than they need
- pools are allocated equally based on their weight after all previous requirements are fulfilled
- Suitable for production environments
- Lab Exercise 1.3.5
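- A minimal allocation file sketch for the MRv1 fair scheduler (fair-scheduler.xml; pool names and values are hypothetical):
<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>10</minReduces>
    <weight>2.0</weight>
  </pool>
  <pool name="adhoc">
    <maxRunningJobs>5</maxRunningJobs>
  </pool>
</allocations>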
Capacity scheduler⌘
- Developed by the Hadoop team at Yahoo! to bypass limitations of the FIFO scheduler
- Support for SLAs and multi-tenant environments
- Design:
- submitted jobs are placed in queues
- internal queue order is FIFO with priorities
- each queue is assigned with a capacity representing the number of task slots
- the most starved queues receive task slots first
- queue starvation is measured by its percentage utilization
- queues can be hierarchical
- Complex configuration
- Suitable for production environments
Java MapReduce⌘
- MapReduce jobs written in Java
- org.apache.hadoop.mapreduce namespace:
- Mapper class and map() function
- Reducer class and reduce() function
- Executed by the hadoop program
- Sample Java MapReduce job: here
- Sample Java MapReduce job execution:
hadoop [MapReduce program path] [input data path] [output data path]
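- Example run against the weather data set (hypothetical jar, class and paths):
hadoop jar max-temperature.jar MaxTemperature /user/terminal/data /user/terminal/output
hdfs dfs -cat /user/terminal/output/part-r-00000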
Streaming⌘
- Leverages hadoop-streaming.jar class
- Supports all languages capable of reading from stdin and writing to stdout
- Sample streaming MapReduce job execution:
hadoop jar [hadoop-streaming.jar path] \
    -input [input data path] \
    -output [output data path] \
    -mapper [map function path] \
    -reducer [reduce function path] \
    -combiner [combiner function path] \
    -file [attached files path]*
- * - for each attached function
Pipes⌘
- Supports all languages capable of reading from and writing to network sockets
- Mostly applicable for C and C++ languages
- Sample pipes MapReduce job execution:
hadoop pipes \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -input [input data path] \
    -output [output data path] \
    -program [MapReduce program path] \
    -file [attached files path]*
- * - for each attached program
Limitations⌘
- Batch processing system:
- MapReduce jobs run on the order of minutes or hours
- optimized for full table scan operations
- no support for low-latency data access
- Simplicity:
- no support for data access optimization
- Too low-level:
- raw data processing functions
- alternatives: Pig, Hive
- Not all algorithms can be parallelized:
- algorithms with shared state
- algorithms with dependent variables
First Day - Session IV⌘
Goals and motivation⌘
- Segregation of services:
- Single MapReduce service for resource and application management purpose
- Separate YARN services for resource and application management purpose
- Scalability / resource pressure on the jobtracker:
- MapReduce framework hits scalability limitations in clusters of around 5,000 nodes
- YARN framework does not hit scalability limitations even in clusters of around 40,000 nodes
- Flexibility:
- MapReduce framework is capable of executing MapReduce jobs only
- YARN framework is capable of executing jobs other than MapReduce
Daemons⌘
- ResourceManager:
- responsible for cluster resources management and applications execution
- consists of the scheduler and application manager components
- 1 active per cluster
- NodeManager:
- responsible for node resources management and applications execution
- can run YARN containers or application master process
- many per cluster
- JobHistoryServer:
- responsible for serving information about completed jobs
- 1 active per cluster
Design⌘
YARN job stages part 1⌘
- A client creates a job object using the same Java MapReduce API as in MRv1
- The client retrieves a new application ID from the resource manager
- The client calculates input splits and writes the job resources (e.g. jar file) on HDFS
- The client submits the job by calling submitApplication() procedure on the resource manager
- The resource manager allocates a container for the job execution purpose and launches the application master process on the node manager
- The application master initializes the job
- The application master retrieves the job resources from HDFS
- The application master requests for containers for tasks execution purpose from the resource manager
YARN job stages part 2⌘
- The resource manager allocates tasks on containers based on a scheduler in use
- The containers launch a separate JVM for task execution purpose
- The containers retrieve the job resources and data from HDFS
- The containers periodically report progress and status updates to the application master
- The client periodically polls the application master for progress and status updates
- On the job completion the application master and the containers clean up their working state
YARN job stages #2⌘
ResourceManager high availability⌘
- ResourceManager - SPOF
- High Availability type: active / passive
- Failover type: manual / automatic
- Automatic failover based on the Apache ZooKeeper failover controller
- ZKFC (ZooKeeper Failover Controller) daemon embedded in the resourcemanager code
- STONITH
- Fencing methods
- Administration tool: yarn rmadmin
Resource configuration⌘
- No distinction between resources available for map and reduce tasks
- Resource requests for memory and virtual cores
- Resource configuration (yarn-site.xml):
- yarn.nodemanager.resource.memory-mb - available memory in megabytes
- yarn.nodemanager.resource.cpu-vcores - available number of virtual cores
- yarn.scheduler.minimum-allocation-mb - minimum allocated memory in megabytes
- yarn.scheduler.minimum-allocation-vcores - minimum number of allocated virtual cores
- yarn.scheduler.increment-allocation-mb - increment memory in megabytes into which the request is rounded up
- yarn.scheduler.increment-allocation-vcores - increment number of virtual cores into which the request is rounded up
- Lab Exercise 1.4.1
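- A minimal yarn-site.xml sketch using the properties above (values are illustrative, here for a worker reserving 24 GB and 12 virtual cores for containers):
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>12</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>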
Second Day - Session I⌘
Hadoop installation in pseudo-distributed mode⌘
- Lab Exercise 2.1.1
- Lab Exercise 2.1.2
- Lab Exercise 2.1.3
- Lab Exercise 2.1.4
Second Day - Session II⌘
Running MapReduce jobs in pseudo-distributed mode⌘
- Lab Exercise 2.2.1
- Lab Exercise 2.2.2
- Lab Exercise 2.2.3
- Lab Exercise 2.2.4
- Lab Exercise 2.2.5
- Lab Exercise 2.2.6
Second Day - Session III⌘
Pig - Goals and motivation⌘
- Developed by Yahoo!
- Simplicity of development:
- however, while MapReduce programs themselves are simple, designing the complex regular expressions they rely on may be challenging and time consuming
- Pig offers much richer data structures for pattern matching purposes
- Length and clarity of the source code:
- MapReduce programs are usually long and hard to comprehend
- Pig programs are usually short and understandable
- Simplicity of execution:
- MapReduce programs require compiling, packaging and submitting the job
- Pig programs can be written and executed from the console
Pig - Design⌘
- Pig components:
- Pig Latin - scripting language for data flows
- each program is made up of series of transformations applied to the input data
- each transformation is made up of series of MapReduce jobs run on the input data
- Pig Latin programs execution environment:
- local execution in a single JVM
- execution distributed over the Hadoop cluster
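- Both execution environments are selected on the command line (hypothetical script name):
pig -x local maxtemp.pig
pig -x mapreduce maxtemp.pig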
Pig - relational operators⌘
- LOAD - loads data from the filesystem or other storage into a relation
- STORE - saves the relation to a filesystem or other storage
- DUMP - prints the relation to the console
- FILTER - removes unwanted rows from the relation
- DISTINCT - removes duplicate rows from the relation
- MAPREDUCE - runs MapReduce job using the relation as an input
- STREAM - transforms the relation using an external program
- SAMPLE - selects a random sample of the relation
- JOIN - joins two or more relations
- GROUP - groups the data in a single relation
- ORDER - sorts the relation by one or more fields
- LIMIT - limits the size of the relation to a maximum number of tuples
Pig - mathematical and logical operators⌘
- literal - constant / variable
- $n - n-th column
- x + y / x - y - addition / subtraction
- x * y / x / y - multiplication / division
- x == y / x != y - equals / not equals
- x > y / x < y - greater than / less than
- x >= y / x <= y - greater than or equal / less than or equal
- x matches y - pattern matching with regular expressions
- x is null - is null
- x is not null - is not null
- x or y - logical or
- x and y - logical and
- not x - logical negation
Pig - types⌘
- int - 32-bit signed integer
- long - 64-bit signed integer
- float - 32-bit floating-point number
- double - 64-bit floating-point number
- chararray - character array in UTF-8 format
- bytearray - byte array
- tuple - sequence of fields of any type
- map - set of key-value pairs:
- key - character array
- value - any type
Pig - Exercises⌘
- Lab Exercise 2.3.1
Hive - Goals and Motivation⌘
- Developed by Facebook
- Data structurization:
- HDFS stores the data in an unstructured format
- Hive stores the data in a structured format
- SQL interface:
- HDFS does not provide the SQL interface
- Hive provides the SQL interface
- Cost-effectiveness:
- developing a Hive query is much faster than developing an equivalent MapReduce job
Hive - Design⌘
- Metastore:
- central repository of Hive metadata:
- metastore service
- database
- HiveQL:
- used to write data by uploading them into HDFS and updating the Metastore
- used to read data by accepting queries and translating them to a series of MapReduce jobs
- Interfaces:
- CLI
- Web GUI
- ODBC
- JDBC
Hive - Data Types⌘
- TINYINT - 1-byte signed integer
- SMALLINT - 2-byte signed integer
- INT - 4-byte signed integer
- BIGINT - 8-byte signed integer
- FLOAT - 4-byte floating-point number
- DOUBLE - 8-byte floating-point number
- BOOLEAN - true/false value
- STRING - character string
- BINARY - byte array
- MAP - an unordered collection of key-value pairs
Hive - SQL vs HiveQL⌘
Feature | SQL | HiveQL |
---|---|---|
Updates | UPDATE, INSERT, DELETE | INSERT OVERWRITE TABLE |
Transactions | Supported | Not Supported |
Indexes | Supported | Not supported |
Latency | Sub-second | Minutes |
Multitable inserts | Partially supported | Supported |
Create table as select | Partially supported | Supported |
Views | Updatable | Read-only |
Hive - Exercises⌘
- Lab Exercise 2.3.2
Sqoop - Goals and Motivation⌘
- Data formats diversity:
- HDFS can store data in any format
- MapReduce jobs can process data stored in any format
- Cooperation with RDBMS:
- RDBMS are widely used for data storing purpose
- need for importing / exporting data from RDBMS to HDFS and vice versa
- Data import / export flexibility:
- manual data import / export using HDFS CLI, Pig or Hive
- automatic data import / export using Sqoop
Sqoop - Design⌘
- JDBC - used to connect to the RDBMS
- SQL - used to retrieve / populate data from / to RDBMS
- MapReduce - used to retrieve / populate data from / to HDFS
- Sqoop Code Generator - used to create table-specific class to hold records extracted from the table
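- A minimal import sketch (hypothetical MySQL database, table, credentials and target directory):
sqoop import \
    --connect jdbc:mysql://db.example.com/sales \
    --table orders \
    --username sqoop --password sqoop \
    --target-dir /user/terminal/sales/orders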
Sqoop - Import process⌘
Sqoop - Export process⌘
Sqoop - Exercises⌘
- Lab Exercise 2.3.3
HBase - Goals and Motivation⌘
- Designed by Google (BigTable)
- Developed by Apache
- Interactive queries:
- MapReduce, Pig and Hive are batch processing frameworks
- HBase is a real-time processing framework
- Data structurization:
- HDFS stores data in an unstructured format
- HBase stores data in a semi-structured format
HBase - Design⌘
- Runs on top of HDFS
- Key-value NoSQL database
- Architecture:
- tables are distributed across the cluster
- tables are automatically partitioned horizontally into regions
- tables consist of rows and columns
- table cells are versioned (by a timestamp by default)
- table cell type is an uninterpreted array of bytes
- table rows are sorted by the row key which is the table's primary key
- table columns are grouped into column families
- table column families must be defined on the table creation stage
- table columns can be added on the fly if the column family exists
- Metadata stored in the -ROOT- and .META. tables
- Data can be accessed interactively or using MapReduce
HBase - Exercises⌘
- Lab Exercise 2.3.4
Flume - Goals and motivation⌘
- Streaming data:
- HDFS does not have any built-in mechanisms for handling streaming data flows
- Flume is designed to collect, aggregate and move streaming data flows into HDFS
- Data queueing:
- When writing directly to HDFS, data can be lost during spike periods
- Flume is designed to queue data during spike periods
- Guarantee of data delivery:
- Flume is designed to guarantee a delivery of data by using a single-hop message delivery semantics
Flume - Design⌘
- Event - unit of data that Flume can transport
- Flow - movement of the events from a point of origin to their final destination
- Client - an interface implementation that operates at the point of origin of the events
- Agent - an independent process that hosts flume components such as sources, channels and sinks
- Source - an interface implementation that can consume the events delivered to it and deliver them to channels
- Channel - a transient store for the events, which are delivered by the sources and removed by sinks
- Sink - an interface implementation that can remove the events from the channel and transmit them
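- A minimal single-agent flow sketch in the Flume NG properties format (agent, source, channel and sink names are arbitrary; the HDFS path and hostname are hypothetical):
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode.example.com/flume/events
a1.sinks.k1.channel = c1
- such an agent is started with: flume-ng agent --name a1 --conf-file agent.conf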
Flume - Design #2⌘
Flume - Closer look⌘
- Flume sources:
- Avro, HTTP, Syslog, etc.
- https://flume.apache.org/FlumeUserGuide.html#flume-sources
- Sink types:
- regular - transmits the event to another agent
- terminal - transmits the event to its final destination
- Flow pipeline:
- from one source to one destination
- from multiple sources to one destination
- from one source to multiple destinations
- from multiple sources to multiple destinations
Oozie - Goals and motivation⌘
- Hadoop jobs execution:
- Hadoop clients execute MapReduce jobs from CLI using hadoop command
- Oozie provides web-based GUI for Hadoop jobs definition and execution
- Hadoop jobs management:
- Hadoop doesn't provide any built-in mechanism for jobs management (e.g. forks, merges, decisions, etc.)
- Oozie provides Hadoop jobs management feature based on a control dependency DAG
Oozie - Design⌘
- Oozie - workflow scheduling system
- Oozie workflow
- control flow nodes:
- define the beginning and the end of a workflow
- provide a mechanism to control the workflow execution path
- action nodes:
- are mechanisms by which the workflow triggers an execution of Hadoop jobs
- supported job types include Java MapReduce, Pig, Hive, Sqoop and more
Oozie - Web GUI⌘
Second Day - Session IV⌘
Hue - Goals and motivation⌘
- Hadoop cluster management:
- each Hadoop component is managed independently
- Hue provides centralized Hadoop components management utility
- Hadoop components are mostly managed from the CLI
- Hue provides web-based GUI for Hadoop components management
- Open Source:
- other cluster management software (e.g. Cloudera Manager) is closed-source and commercially licensed
- Hue is free of charge and open-source software
Hue - Design⌘
- Hue components:
- Hue server - web application's container
- Hue database - web application's data
- Hue UI - web user interface application
- REST API - for communication with Hadoop components
- Supported browsers:
- Google Chrome
- Mozilla Firefox 17+
- Internet Explorer 9+
- Safari 5+
- Lab Exercise 2.4.1
- Lab Exercise 2.4.2
Hue - Design #2⌘
Cloudera Manager - Goals and motivation⌘
- Hadoop cluster deployment and configuration:
- Hadoop components need to be deployed and configured manually
- Cloudera Manager provides automatic deployment and configuration of Hadoop components
- Hadoop cluster management:
- each Hadoop component is managed independently
- Cloudera Manager provides centralized Hadoop components management utility
- Hadoop components are mostly managed from the CLI
- Cloudera Manager provides web-based GUI for Hadoop components management
Cloudera Manager - Design⌘
- Cloudera Manager versions:
- Express - deployment, configuration, management, monitoring and diagnostics
- Enterprise - advanced management features and support
- Cloudera Manager components:
- Cloudera Manager Server - web application's container and core Cloudera Manager engine
- Cloudera Manager Database - web application's data and monitoring information
- Admin Console - web user interface application
- Agent - installed on every component in the cluster
- REST API - for communication with Agents
- Lab Exercise 2.4.3
- Lab Exercise 2.4.4
- Lab Exercise 2.4.5
Cloudera Manager - Design #2⌘
Cassandra ⌘
- Designed by Amazon (Dynamo)
- Developed by Facebook
- Data access optimization:
- HDFS and HBase are optimized for read operations
- Cassandra is optimized for write operations
- Cluster architecture:
- HDFS and HBase nodes have dedicated roles assigned
- Cassandra nodes are equal
- Cassandra vs HBase:
- So far Cassandra is a winner:
Impala - Goals and Motivation⌘
- Designed by Google (Dremel)
- Developed by Cloudera
- Interactive queries:
- MapReduce, Pig and Hive are batch processing frameworks
- Impala is real-time processing framework
- Data structurization:
- HDFS stores data in an unstructured format
- HBase and Cassandra store data in semi-structured format
- Impala stores data in a structured format
- SQL interface:
- HDFS does not provide SQL interface
- HBase and Cassandra do not provide SQL interface by default
- Impala provides SQL interface
Impala - Design⌘
- Runs on top of HDFS or HBase
- Leverages Hive metastore
- Implements SQL tables and queries
- Interfaces:
- CLI
- web GUI
- ODBC
- JDBC
- Daemons
Impala - Daemons⌘
- Impala Daemon:
- performs read and write operations
- accepts queries and returns query results
- parallelizes queries and distributes the work
- Impala Statestore:
- checks health of the Impala Daemons
- reports detected failures to other nodes in the cluster
- Impala Catalog Server:
- relays metadata changes to all nodes in the cluster
Tez - Goals and Motivation⌘
- Interactive queries:
- MapReduce is a batch processing framework
- Tez is a real-time processing framework
- Job types:
- MapReduce is designed for key-value-based data processing
- Tez is designed for DAG-based data processing
Tez - Design⌘
Third Day - Sessions I and II⌘
Picking a distribution⌘
- Considerations:
- cost
- open / closed source code
- implemented Hadoop version
- available features
- supported OS types
- ease of deployment
- integration
- support
Hadoop releases⌘
- Release notes: http://hadoop.apache.org/releases.html
Hadoop releases and features⌘
- CDH Release Notes: http://www.cloudera.com/content/cloudera/en/documentation/cdh5/latest/CDH5-Release-Notes/CDH5-Release-Notes.html
Why CDH5?⌘
- Free
- Open source
- Fresh (always based on recent Hadoop releases)
- Wide range of available features
- Support for RHEL, SUSE and Debian
- Easy deployment process
- Very well integrated
- Very well supported
Hardware selection⌘
- Considerations:
- CPU
- RAM
- storage IO
- storage capacity
- storage resilience
- network bandwidth
- network resilience
- node resilience
Hardware selection #2⌘
- Master nodes:
- namenodes
- jobtrackers / resource managers
- secondary namenode
- Worker nodes
- Network devices
- Supporting nodes
Namenode hardware selection⌘
- CPU: quad-core 2.6 GHz
- RAM: 24-96 GB DDR3 (1 GB per 1 million blocks)
- Storage IO: SAS / SATA II
- Storage capacity: less than 1 TB
- Storage resilience: RAID
- Network bandwidth: 2 Gb/s
- Network resilience: LACP
- Node resilience: non-commodity hardware
Jobtracker / RM hardware selection⌘
- CPU: quad-core 2.6 GHz
- RAM: 24-96 GB DDR3 (unpredictable)
- Storage IO: SAS / SATA II
- Storage capacity: less than 1 TB
- Storage resilience: RAID
- Network bandwidth: 2 Gb/s
- Network resilience: LACP
- Node resilience: non-commodity hardware
Secondary namenode hardware selection⌘
- Almost always identical to namenode
- The same requirements for RAM, storage capacity and desired resilience level
- Very often used as a replacement hardware in case of namenode failure
Worker hardware selection⌘
- CPU: 2 * 6 core 2.9 GHz
- RAM: 64-96 GB DDR3
- Storage IO: SAS / SATA II
- Storage capacity: 24-36 TB JBOD
- Storage resilience: none
- Network bandwidth: 1-10 Gb/s
- Network resilience: none
- Node resilience: commodity hardware
Network devices hardware selection⌘
- Downlink bandwidth: 1-10 Gb/s
- Uplink bandwidth: 10 Gb/s
- Computing resources: vendor-specific
- Resilience: by network topology
Supporting nodes hardware selection⌘
- Backup system
- Monitoring system
- Security system
- Logging system
Planning cluster growth⌘
- Considerations:
- replication factor: 3 by default
- temporary data: 20-30% of the worker storage space
- data growth: based on the trends analysis
- the amount of data being analyzed: based on the cluster requirements
- usual task RAM requirements: 2-4 GB
- Lab Exercise 3.1.1
- Lab Exercise 3.1.2
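- A worked example under hypothetical assumptions: with 100 TB of input data, a replication factor of 3 and 25% of worker storage reserved for temporary data, the required raw capacity is 100 TB * 3 / 0.75 = 400 TB; at roughly 30 TB of JBOD storage per worker this translates into about 14 worker nodes, before accounting for data growth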
Network topologies⌘
- Hadoop East/West traffic flow design
- Network topologies:
- Leaf
- Spine
- Inter data center replication
- Lab Exercise 3.1.3
Leaf topology⌘
Spine topology⌘
Blades, SANs, RAIDs and virtualization⌘
- Blades considerations:
- Hadoop is designed for the commodity hardware class in terms of resilience
- Hadoop cluster scales out rather than up
- SANs considerations:
- HDFS is distributed and shared by design
- Hadoop is designed for processing local data
- RAIDs considerations:
- Hadoop is designed for data fault tolerance
- RAID introduces additional limitations and overhead
- Virtualization considerations:
- Hadoop worker nodes don't benefit from virtualization
- Hadoop is a clustering solution, which is conceptually the opposite of virtualization
Hadoop directories⌘
Directory | Linux path | Grows? |
---|---|---|
Hadoop install directory | /usr/* | no |
Namenode metadata directory | defined by an administrator | yes |
Secondary namenode metadata directory | defined by an administrator | yes |
Datanode data directory | defined by an administrator | yes |
MapReduce local directory | defined by an administrator | no |
Hadoop log directory | /var/log/* | yes |
Hadoop pid directory | /var/run/hadoop | no |
Hadoop temp directory | /tmp | no |
Filesystems⌘
- LVM (Logical Volume Manager) - HDFS killer
- Considerations:
- extent-based design
- burn-in time
- Winners:
- EXT4
- XFS
- Lab Exercise 3.1.4
Third Day - Session III⌘
Identification, authentication, authorization⌘
- Identity:
- Who does this user claim to be?
- controlled by Hadoop operation mode
- Authentication:
- Can this user prove they are who they say they are?
- controlled by Hadoop operation mode
- Authorization:
- Is this user allowed to do what they're asking to do?
- controlled by Hadoop services
- Hadoop operation modes:
- simple mode - uses system username as the identity
- secure mode - uses Kerberos principal as the identity
Kerberos - foundations⌘
- Kerberos - strong authentication protocol
- Strong authentication - combines at least two authentication factors of different types:
- "something you know"
- "something you own"
- "something you are"
- Kerberos - basic terms:
- Principal - Kerberos user
- Key - Kerberos user's password
- Keytab - file containing an exported key
- KDC (Key Distribution Center) - Kerberos service:
- AS (Authentication Service) - serves TGTs
- TGS (Ticket Granting Service) - serves Service Tickets
Kerberos - tickets⌘
- TGT (Ticket Granting Ticket) - used to obtain Service Ticket:
- encrypted with principal's key
- served to the principal by the AS
- contains TGT key and other TGT session parameters
- valid for a finite amount of time
- Service Ticket - used to grant principals access to particular resources
- encrypted with TGT key
- served to the principal by the TGS
- contains service key and other service session parameters (e.g. session key)
- valid for a finite amount of time
Kerberos - authentication process⌘
- A Client sends cleartext authentication request to the AS providing Client ID and Service ID
- The AS sends a Client/TGS session key encrypted with the Client's key to the Client
- The AS sends a TGT encrypted with the Client/TGS session key to the Client
- The Client sends cleartext authentication request to the TGS providing the TGT and Service ID
- The Client sends another message encrypted with the Client/TGS session key to the TGS
- The TGS sends a Client/Service session key encrypted with the Client/TGS session key
- The TGS sends a Service Ticket (containing Client/Service session key) encrypted with the Service key
- The Client sends the Service Ticket encrypted with the Service key to the Service
- The Client sends an authentication request encrypted with the Client/Service session key to the Service providing Client ID
- The Service sends a confirmation message encrypted with the Client/Service session key to the client
Kerberos - principals⌘
- Kerberos principal:
[primary]/[instance]@[realm]
- primary - name of the user / service
- instance - host on which the user or service resides
- realm - group of principals
- Hadoop primaries:
- hdfs - used by namenode and datanode daemons
- mapred - used by jobtracker, tasktracker and job history server daemons
- yarn - used by resourcemanager and nodemanager daemons
- [username] - used by Hadoop user
Configuring secure Hadoop cluster⌘
- Audit all services to ensure enabling Kerberos authentication will not break anything.
- Configure a working non-Kerberos enabled Hadoop cluster.
- Configure a working Kerberos environment.
- Ensure host name resolution is sane.
- Create Hadoop Kerberos principals.
- Export principal keys to keytabs and distribute them to the proper cluster nodes.
- Update Hadoop configuration files.
- Restart all services.
- Test your environment.
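- A minimal sketch of creating a service principal and exporting its key to a keytab with MIT Kerberos kadmin (realm, host and keytab names are hypothetical):
kadmin -q "addprinc -randkey hdfs/worker1.example.com@EXAMPLE.COM"
kadmin -q "xst -k hdfs.keytab hdfs/worker1.example.com@EXAMPLE.COM"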
Third Day - Session IV⌘
IT monitoring⌘
- To monitor = to be aware of a state of the system
- Monitoring types:
- health monitoring:
- short-term or live-time monitoring based on current samples of metrics
- used to determine whether the system is in an expected operational state
- performance monitoring:
- long-term monitoring based on samples of metrics gathered over time
- used to determine system performance
- Monitoring software:
- Hyperic HQ: http://www.hyperic.com/
- Nagios: http://www.nagios.org/
- Zabbix: http://www.zabbix.com/
Hadoop metrics⌘
- Hadoop has built-in support for exposing various metrics to outside systems
- Metrics contexts - group of related metrics:
- jvm - contains JVM information and metrics (e.g. maximum heap size)
- dfs - contains HDFS metrics (e.g. total HDFS capacity)
- mapred - contains MapReduce metrics (e.g. total number of map slots)
- rpc - contains RPC metrics (e.g. number of open connections)
- Access to metrics contexts needs to be enabled by plugins
Metric plugins⌘
- hadoop-metrics.properties - metrics configuration file
- [context].class=org.apache.hadoop.metrics.spi.NullContext
- metrics are neither collected nor exposed to the external systems
- [context].class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
- metrics are collected, but are not exposed to the external systems
- [context].class=org.apache.hadoop.metrics.file.FileContext
- metrics are collected and exposed to the external systems in a form of files
- [context].class=org.apache.hadoop.metrics.ganglia.GangliaContext
- metrics are collected and exposed to Ganglia software natively
- Example:
rpc.class=org.apache.hadoop.metrics.file.FileContext
rpc.filename=/tmp/rpcmetrics.log
- Although Cloudera distributes the hadoop-metrics.properties file, it does not support it!
REST interface⌘
- Metrics servlets:
- /metrics - metrics1
- /jmx - metrics2
- Metrics retrieval:
- plain-text:
curl 'http://[node]:[port]/jmx'
- JSON:
curl 'http://[node]:[port]/jmx?format=json'
- Lab Exercise 3.4.1
Considerations⌘
- The following aspects should be considered when setting up the monitoring system:
- what to monitor?
- on which daemon?
- what metrics to use?
- what thresholds to set?
- how often to notify?
- who to notify?
- should automated actions be configured instead of manual?
Hadoop monitoring - storage - namenode⌘
- What to monitor?
- dfs.name.dir directory capacity
- On which daemon?
- namenode
- What metrics to use?
- Hadoop:service=NameNode,name=FSNamesystemState => CapacityRemaining
- What thresholds to set?
- 5 to 30 days' capacity
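- The same metric can be pulled ad hoc from the namenode's JMX JSON servlet (hostname is hypothetical; the qry parameter filters by bean name):
curl 'http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState'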
Hadoop monitoring - storage - datanode⌘
- What to monitor?
- dfs.data.dir and mapred.local.dir directories capacity and state
- On which daemon?
- datanode
- What metrics to use?
- Hadoop:service=DataNode,name=FSDatasetState-null => Remaining
- Hadoop:service=DataNode,name=FSDatasetState-null => NumFailedVolumes
- What thresholds to set?
- opt for higher-level checks on the aggregate HDFS and MapReduce service level metrics
Hadoop monitoring - memory⌘
- What to monitor?
- physical and virtual memory utilization
- On which daemon?
- namenode, secondary namenode, datanode, resource manager, node manager
- What metrics to use?
- OperatingSystem => FreePhysicalMemorySize
- OperatingSystem => FreeSwapSpaceSize
- What thresholds to set?
- opt for higher-level checks on the aggregate HDFS and MapReduce service level metrics
Hadoop monitoring - processor⌘
- What to monitor?
- load average
- On which daemon?
- namenode, secondary namenode, datanode, resource manager, node manager
- What metrics to use?
- OperatingSystem => SystemLoadAverage
- What thresholds to set?
- [number of processors] * [number of cores]
- monitor processor utilization for performance monitoring purpose only
Hadoop monitoring - Java heap size⌘
- What to monitor?
- used heap size
- On which daemon?
- namenode, resource manager
- What metrics to use?
- java.lang:type=Memory => HeapMemoryUsage => used
- java.lang:type=Memory => HeapMemoryUsage => max
- What thresholds to set?
- measure the median used heap over a set of samples
Hadoop monitoring - Garbage collection⌘
- What to monitor?
- duration of garbage collection events
- On which daemon?
- namenode, resource manager
- What metrics to use?
- java.lang:type=GarbageCollector,name=ConcurrentMarkSweep => LastGcInfo => duration
- What thresholds to set?
- thresholds vary by application (~1s)
Hadoop monitoring - HDFS capacity⌘
- What to monitor?
- the absolute amount of HDFS capacity
- On which daemon?
- namenode
- What metrics to use?
- Hadoop:service=NameNode,name=FSNamesystemState => CapacityRemaining
- What thresholds to set?
- depends on the data growth speed
Hadoop monitoring - HDFS metadata paths⌘
- What to monitor?
- the absolute number of active metadata paths
- On which daemon?
- namenode
- What metrics to use?
- Hadoop:service=NameNode,name=NameNodeInfo => NameDirStatuses => failed
- What thresholds to set?
- it should be equal to zero
Hadoop monitoring - HDFS blocks⌘
- What to monitor?
- the absolute number of blocks which are missing, corrupted and allocated
- On which daemon?
- namenode
- What metrics to use?
- Hadoop:service=NameNode,name=FSNamesystem => MissingBlocks
- Hadoop:service=NameNode,name=FSNamesystem => CorruptBlocks
- Hadoop:service=NameNode,name=FSNamesystem => BlockCapacity
- What thresholds to set?
- the number of blocks which are missing or corrupted should be equal to zero
- the number of blocks which are allocated influences namenode heap size
- Lab Exercise 3.4.2
Log categories⌘
- Hadoop Daemon Logs - daemons startup (.out) and runtime (.log) messages
- Job Configuration XML - describes job configuration
- Job Statistics - contain job runtime statistics
- Standard Error - STDERR messages from tasks execution attempts
- Standard Out - STDOUT messages from tasks execution attempts
- log4j - information messages from within the task process
Log localization⌘
/var/log/hadoop-*/
    *.log               // daemon logs
    /job_*.xml          // current jobs configuration XML logs
    /history
        *_conf.xml      // history jobs configuration XML logs
        *               // job statistics logs
    /userlogs
        /attempt_*
            /stderr     // standard error logs
            /stdout     // standard out logs
            /syslog     // log4j logs
- Lab Exercise 3.4.3
Hadoop Daemon Logs⌘
- Name:
hadoop-<user-running-hadoop>-<daemon>-<hostname>.log
- user-running-hadoop - username of the user who started the daemon
- daemon - name of the daemon the log file is associated with
- hostname - hostname of the machine on which the daemon is running
Job Configuration XML⌘
- Name:
<hostname>_<epoch-of-jobtracker-start>_<job-id>_conf.xml
- hostname - hostname of the machine on which the daemon is running
- epoch-of-jobtracker-start - number of seconds since 1st of January 1970
- job-id - job ID
Job Statistics⌘
- Name:
<hostname>_<epoch-of-jobtracker-start>_<job-id>_<job-name>
- hostname - hostname of the machine on which the daemon is running
- epoch-of-jobtracker-start - number of seconds since 1st of January 1970
- job-id - job ID
- job-name - name of the job
Standard Error and Standard Out⌘
- Name:
/var/log/hadoop/userlogs/attempt_<job-id>_<map-or-reduce>_<attempt-id>
- job-id - job ID
- map-or-reduce - m or r depending on the task type
- attempt-id - task attempt ID
Fourth Day - Session I, II, III and IV⌘
Underpinning software⌘
- Prerequisites:
- JDK (Java Development Kit)
- Additional software:
- Cron daemon
- NTP client
- SSH server
- MTA (Mail Transport Agent)
- RSYNC tool
- NMAP tool
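A minimal sketch for checking these prerequisites on a node before installing Hadoop; the daemon names (crond, ntpd, sshd) and the sendmail MTA are typical RHEL defaults and may differ on other distributions:
java -version 2>&1 | head -1                   # JDK present and on the PATH?
for svc in crond ntpd sshd; do
  pgrep -x "$svc" > /dev/null && echo "$svc running" || echo "$svc NOT running"
done
for tool in rsync nmap sendmail; do
  command -v "$tool" > /dev/null || echo "missing: $tool"
done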
Hadoop users, groups and permissions⌘
- Users and groups:
- hdfs - controls namenode, secondary namenode and datanode daemon operations
- mapred - controls jobtracker and tasktracker daemon operations and client job execution
- hadoop - legacy user and group
- Permissions:
Directory | User:Group | Permissions |
---|---|---|
Namenode metadata directories | hdfs:hadoop | 0700 |
Secondary namenode metadata directories | hdfs:hadoop | 0700 |
Datanode data directories | hdfs:hadoop | 0700 |
Jobtracker local directories | mapred:hadoop | 0770 |
Tasktracker local directories | mapred:hadoop | 0700 |
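A minimal sketch applying the ownership and modes from the table above; the directory names are the placeholder paths used in the configuration examples later in this material:
chown -R hdfs:hadoop /namenodedir /secondarynamenodedir /datanodedir1 /datanodedir2
chmod 700            /namenodedir /secondarynamenodedir /datanodedir1 /datanodedir2
chown -R mapred:hadoop /mapredlocaldir
chmod 770              /mapredlocaldir    # 0770 on the jobtracker; use 0700 on tasktrackers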
Hadoop installation⌘
- Should be performed as the root user
- Installation instructions are distribution-dependent
- Installation types:
- tarball installation
- package installation
Hadoop package content⌘
- /etc/hadoop - configuration files
- /etc/rc.d/init.d - SYSV init scripts for Hadoop daemons
- /usr/bin - executables
- /usr/include/hadoop - C++ header files for Hadoop pipes
- /usr/lib - C libraries
- /usr/libexec - miscellaneous files used by various libraries and scripts
- /usr/sbin - administrator scripts
- /usr/share/doc/hadoop - License, NOTICE and README files
CDH-specific additions⌘
- Binaries for Hadoop Ecosystem
- Use of alternatives system for Hadoop configuration directory:
- /etc/alternatives/hadoop-conf
- Default nofile and nproc limits:
- /etc/security/limits.d
- Lab Exercise 4.1.1
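To see which configuration directory is currently active, the alternatives system can be queried directly; a sketch assuming a RHEL-style alternatives command and the hadoop-conf alternative registered by the CDH packages:
alternatives --display hadoop-conf      # shows the registered configuration directories and the active one
ls -l /etc/alternatives/hadoop-conf     # the symlink the daemons actually follow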
Hadoop configuration⌘
- hadoop-env.sh - environmental variables
- core-site.xml - all daemons
- hdfs-site.xml - HDFS daemons
- mapred-site.xml - MapReduce daemons
- log4j.properties - logging information
- masters (optional) - list of servers which run secondary namenode daemon
- slaves (optional) - list of servers which run datanode and tasktracker daemons
- fair-scheduler.xml (optional) - fair scheduler configuration and properties
- capacity-scheduler.xml (optional) - capacity scheduler configuration and properties
- dfs.include (optional) - list of servers permitted to connect to the namenode
- dfs.exclude (optional) - list of servers not permitted to connect to the namenode
- hadoop-policy.xml - user / group permissions for RPC functions execution
- mapred-queue-acl.xml - user / group permissions for jobs submission
Hadoop XML files structure⌘
<?xml version="1.0"?>
<configuration>
  <property>
    <name>[name]</name>
    <value>[value]</value>
    (<final>{true | false}</final>)
  </property>
</configuration>
Hadoop configuration⌘
- Parameter: fs.default.name
- Configuration file: core-site.xml
- Purpose: specifies the default filesystem used by clients
- Format: Hadoop URL
- Used by: NN, SNN, DN, JT, TT, clients
- Example:
<property> <name>fs.default.name</name> <value>hdfs://namenode.hadooplab:8020</value> </property>
Hadoop configuration⌘
- Parameter: io.file.buffer.size
- Configuration file: core-site.xml
- Purpose: specifies the size of the I/O buffer
- Guideline: 64 KB
- Format: integer
- Used by: NN, SNN, DN, JT, TT, clients
- Example:
<property> <name>io.file.buffer.size</name> <value>65536</value> </property>
Hadoop configuration⌘
- Parameter: fs.trash.interval
- Configuration file: core-site.xml
- Purpose: specifies the amount of time (in minutes) the file is retained in the .Trash directory
- Format: integer
- Used by: NN, clients
- Example:
<property> <name>fs.trash.interval</name> <value>60</value> </property>
Hadoop configuration⌘
- Parameter: ha.zookeeper.quorum
- Configuration file: core-site.xml
- Purpose: specifies the ZooKeeper quorum used by the ZKFCs for automatic jobtracker failover
- Format: comma separated list of hostname:port pairs
- Used by: JT
- Example:
<property> <name>ha.zookeeper.quorum</name> <value>jobtracker1.hadooplab:2181,jobtracker2.hadooplab:2181</value> </property>
Hadoop configuration⌘
- Parameter: dfs.name.dir
- Configuration file: hdfs-site.xml
- Purpose: specifies where the namenode should store a copy of the HDFS metadata
- Format: comma separated list of Hadoop URLs
- Used by: NN
- Example:
<property> <name>dfs.name.dir</name> <value>file:///namenodedir,file:///mnt/nfs/namenodedir</value> </property>
Hadoop configuration⌘
- Parameter: dfs.data.dir
- Configuration file: hdfs-site.xml
- Purpose: specifies where the datanode should store HDFS data
- Format: comma separated list of Hadoop URLs
- Used by: DN
- Example:
<property> <name>dfs.data.dir</name> <value>file:///datanodedir1,file:///datanodedir2</value> </property>
Hadoop configuration⌘
- Parameter: fs.checkpoint.dir
- Configuration file: hdfs-site.xml
- Purpose: specifies where the secondary namenode should store HDFS metadata
- Format: comma separated list of Hadoop URLs
- Used by: SNN
- Example:
<property> <name>fs.checkpoint.dir</name> <value>file:///secondarynamenodedir</value> </property>
Hadoop configuration⌘
- Parameter: dfs.permissions.supergroup
- Configuration file: hdfs-site.xml
- Purpose: specifies a group permitted to perform all HDFS filesystem operations
- Format: string
- Used by: NN, clients
- Example:
<property> <name>dfs.permissions.supergroup</name> <value>supergroup</value> </property>
Hadoop configuration⌘
- Parameter: dfs.balance.bandwidthPerSec
- Configuration file: hdfs-site.xml
- Purpose: specifies the maximum allowed bandwidth for HDFS balancing operations
- Format: integer (bytes per second)
- Used by: DN
- Example:
<property> <name>dfs.balance.bandwidthPerSec</name> <value>100000000</value> </property>
Hadoop configuration⌘
- Parameter: dfs.blocksize
- Configuration file: hdfs-site.xml
- Purpose: specifies the HDFS block size of newly created files
- Guideline: 64 MB
- Format: integer
- Used by: clients
- Example:
<property> <name>dfs.blocksize</name> <value>67108864</value> </property>
Hadoop configuration⌘
- Parameter: dfs.datanode.du.reserved
- Configuration file: hdfs-site.xml
- Purpose: specifies the disk space per volume reserved for non-HDFS use, such as MapReduce temporary data
- Guideline: 20-30% of the datanode disk space
- Format: integer (bytes)
- Used by: DN
- Example:
<property> <name>dfs.datanode.du.reserved</name> <value>10737418240</value> </property>
Hadoop configuration⌘
- Parameter: dfs.namenode.handler.count
- Configuration file: hdfs-site.xml
- Purpose: specifies the number of concurrent namenode handlers
- Guideline: natural logarithm of the number of cluster nodes, times 20, as a whole number
- Format: integer
- Used by: NN
- Example:
<property> <name>dfs.namenode.handler.count</name> <value>105</value> </property>
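A worked example of the guideline above; awk's log() is the natural logarithm, so for a cluster of roughly 200 nodes the formula yields the example value of 105:
awk 'BEGIN { printf "%d\n", log(200) * 20 }'    # -> 105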
Hadoop configuration⌘
- Parameter: dfs.datanode.failed.volumes.tolerated
- Configuration file: hdfs-site.xml
- Purpose: specifies the number of datanode disks that are permitted to die before failing the entire datanode
- Format: integer
- Used by: DN
- Example:
<property> <name>dfs.datanode.failed.volumes.tolerated</name> <value>0</value> </property>
Hadoop configuration⌘
- Parameter: dfs.hosts
- Configuration file: hdfs-site.xml
- Purpose: specifies the location of dfs.hosts file
- Format: Linux path
- Used by: NN
- Example:
<property> <name>dfs.hosts</name> <value>/etc/hadoop/conf/dfs.hosts</value> </property>
Hadoop configuration⌘
- Parameter: dfs.hosts.exclude
- Configuration file: hdfs-site.xml
- Purpose: specifies the location of dfs.hosts.exclude file
- Format: Linux path
- Used by: NN
- Example:
<property> <name>dfs.hosts.exclude</name> <value>/etc/hadoop/conf/dfs.hosts.exclude</value> </property>
Hadoop configuration⌘
- Parameter: dfs.nameservices
- Configuration file: hdfs-site.xml
- Purpose: specifies virtual namenode names used for the namenode HA and federation purpose
- Format: comma separated strings
- Used by: NN, clients
- Example:
<property> <name>dfs.nameservices</name> <value>HAcluster1,HAcluster2</value> </property>
Hadoop configuration⌘
- Parameter: dfs.ha.namenodes.[nameservice]
- Configuration file: hdfs-site.xml
- Purpose: specifies logical namenode names used for the namenode HA purpose
- Format: comma separated strings
- Used by: DN, NN, clients
- Example:
<property> <name>dfs.ha.namenodes.HAcluster1</name> <value>namenode1,namenode2</value> </property>
Hadoop configuration⌘
- Parameter: dfs.namenode.rpc-address.[nameservice].[namenode]
- Configuration file: hdfs-site.xml
- Purpose: specifies namenode RPC address used for the namenode HA purpose
- Format: Hadoop URL
- Used by: DN, NN, clients
- Example:
<property> <name>dfs.namenode.rpc-address.HAcluster1.namenode1</name> <value>http://namenode1.hadooplab:8020</value> </property>
Hadoop configuration⌘
- Parameter: dfs.namenode.http-address.[nameservice].[namenode]
- Configuration file: hdfs-site.xml
- Purpose: specifies namenode HTTP address used for the namenode HA purpose
- Format: Hadoop URL
- Used by: DN, NN, clients
- Example:
<property> <name>dfs.namenode.http-address.HAcluster1.namenode1</name> <value>http://namenode1.hadooplab:50070</value> </property>
Hadoop configuration⌘
- Parameter: topology.script.file.name
- Configuration file: core-site.xml
- Purpose: specifies the location of a script for rack-awareness purpose
- Format: Linux path
- Used by: NN
- Example:
<property> <name>topology.script.file.name</name> <value>/etc/hadoop/conf/topology.sh</value> </property>
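The namenode invokes this script with one or more IP addresses or hostnames as arguments and expects one rack path per argument on standard output. A minimal sketch for /etc/hadoop/conf/topology.sh; the subnet-to-rack mapping is purely illustrative and would normally come from a lookup table:
#!/bin/bash
# Print one rack path per host/IP argument passed by the namenode.
for host in "$@"; do
  case "$host" in
    10.0.1.*) echo "/rack1" ;;
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done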
Hadoop configuration⌘
- Parameter: dfs.namenode.shared.edits.dir
- Configuration file: hdfs-site.xml
- Purpose: specifies the shared edits directory on the QJM (JournalNodes) holding namenode metadata for the namenode HA purpose
- Format: Hadoop URL
- Used by: NN, clients
- Example:
<property> <name>dfs.namenode.shared.edits.dir</name> <value>qjournal://namenode1.hadooplab.com:8485/hadoop/hdfs/HAcluster</value> </property>
Hadoop configuration⌘
- Parameter: dfs.client.failover.proxy.provider.[nameservice]
- Configuration file: hdfs-site.xml
- Purpose: specifies the class used for locating active namenode for namenode HA purpose
- Format: Java class
- Used by: DN, clients
- Example:
<property> <name>dfs.client.failover.proxy.provider.HAcluster1</name> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value> </property>
Hadoop configuration⌘
- Parameter: dfs.ha.fencing.methods
- Configuration file: hdfs-site.xml
- Purpose: specifies the fencing methods for namenode HA purpose
- Format: new line-separated list of Linux paths
- Used by: NN
- Example:
<property> <name>dfs.ha.fencing.methods</name> <value>/usr/local/bin/STONITH.sh</value> </property>
Hadoop configuration⌘
- Parameter: dfs.ha.automatic-failover.enabled
- Configuration file: hdfs-site.xml
- Purpose: specifies whether the automatic namenode failover is enabled or not
- Format: 'true' / 'false'
- Used by: NN, clients
- Example:
<property> <name>dfs.ha.automatic-failover.enabled</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: ha.zookeeper.quorum
- Configuration file: core-site.xml
- Purpose: specifies the ZooKeeper quorum used by the ZKFCs for automatic namenode failover
- Format: comma separated list of hostname:port pairs
- Used by: NN
- Example:
<property> <name>ha.zookeeper.quorum</name> <value>namenode1.hadooplab:2181,namenode2.hadooplab:2181</value> </property>
Hadoop configuration⌘
- Parameter: fs.viewfs.mounttable.default.link.[path]
- Configuration file: hdfs-site.xml
- Purpose: specifies the mount point of the namenode or namenode cluster for namenode federation purpose
- Format: Hadoop URL or namenode cluster virtual name
- Used by: NN
- Example:
<property> <name>fs.viewfs.mounttable.default.link./federation1</name> <value>hdfs://namenode.hadooplab:8020</value> </property>
Hadoop configuration⌘
- Parameter: mapred.job.tracker
- Configuration file: mapred-site.xml
- Purpose: specifies the location of jobtracker
- Format: hostname:port
- Used by: JT, TT, clients
- Example:
<property> <name>mapred.job.tracker</name> <value>jobtracker.hadooplab:8021</value> </property>
Hadoop configuration⌘
- Parameter: mapred.local.dir
- Configuration file: mapred-site.xml
- Purpose: specifies where the jobtracker and tasktracker should store MapReduce temporary data
- Format: comma separated list of Hadoop URLs
- Used by: JT, TT
- Example:
<property> <name>mapred.local.dir</name> <value>file:///mapredlocaldir</value> </property>
Hadoop configuration⌘
- Parameter: mapred.child.java.opts
- Configuration file: mapred-site.xml
- Purpose: specifies the Java options automatically applied when launching the JVM for job / task execution purposes
- Format: string (JVM options)
- Used by: JT, TT
- Example:
<property> <name>mapred.child.java.opts</name> <value>-Xmx1g</value> </property>
Hadoop configuration⌘
- Parameter: mapred.child.ulimit
- Configuration file: mapred-site.xml
- Purpose: specifies the maximum amount of virtual memory a task is permitted to consume
- Guideline: 1.5 times the size of the JVM max heap size
- Format: integer (KB)
- Used by: TT
- Example:
<property> <name>mapred.child.ulimit</name> <value>1572864</value> </property>
Hadoop configuration⌘
- Parameter: mapred.tasktracker.map.tasks.maximum
- Configuration file: mapred-site.xml
- Purpose: specifies the maximum number of map tasks to run in parallel
- Format: integer
- Used by: TT
- Example:
<property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>10</value> </property>
Hadoop configuration⌘
- Parameter: mapred.tasktracker.reduce.tasks.maximum
- Configuration file: mapred-site.xml
- Purpose: specifies the maximum number of reduce tasks to run in parallel
- Format: integer
- Used by: TT
- Example:
<property> <name>mapred.tasktracker.reduce.tasks.maximum</name> <value>1</value> </property>
Hadoop configuration⌘
- Parameter: io.sort.mb
- Configuration file: mapred-site.xml
- Purpose: specifies the size of the circular in-memory buffer used to store MapReduce temporary data
- Guideline: ~10% of the JVM heap size
- Format: integer (MB)
- Used by: TT
- Example:
<property> <name>io.sort.mb</name> <value>128</value> </property>
Hadoop configuration⌘
- Parameter: io.sort.factor
- Configuration file: mapred-site.xml
- Purpose: specifies the number of files to merge at once
- Guideline: 64 for 1 GB JVM heap size
- Format: integer
- Used by: TT
- Example:
<property> <name>io.sort.factor</name> <value>64</value> </property>
Hadoop configuration⌘
- Parameter: mapred.compress.map.output
- Configuration file: mapred-site.xml
- Purpose: specifies whether the output of the map tasks should be compressed or not
- Format: 'true' / 'false'
- Used by: TT
- Example:
<property> <name>mapred.compress.map.output</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: mapred.map.output.compression.codec
- Configuration file: mapred-site.xml
- Purpose: specifies the codec used for map tasks output compression
- Format: Java class
- Used by: TT
- Example:
<property> <name>mapred.map.output.compression.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property>
Hadoop configuration⌘
- Parameter: mapred.output.compression.type
- Configuration file: mapred-site.xml
- Purpose: specifies whether compression is applied to individual record values (RECORD), to blocks of key-value pairs (BLOCK), or not at all (NONE)
- Format: 'RECORD' / 'BLOCK' / 'NONE'
- Used by: TT
- Example:
<property> <name>mapred.output.compression.type</name> <value>BLOCK</value> </property>
Hadoop configuration⌘
- Parameter: mapred.job.tracker.handler.count
- Configuration file: mapred-site.xml
- Purpose: specifies the number of concurrent jobtracker handlers
- Guideline: natural logarithm of the number of cluster nodes, times 20, as a whole number
- Format: integer
- Used by: JT
- Example:
<property> <name>mapred.job.tracker.handler.count</name> <value>105</value> </property>
Hadoop configuration⌘
- Parameter: mapred.jobtracker.taskScheduler
- Configuration file: mapred-site.xml
- Purpose: specifies the scheduler used for the MapReduce jobs
- Format: Java class
- Used by: JT
- Example:
<property> <name>mapred.jobtracker.taskScheduler</name> <value>org.apache.hadoop.mapred.FairScheduler</value> </property>
Hadoop configuration⌘
- Parameter: mapred.reduce.parallel.copies
- Configuration file: mapred-site.xml
- Purpose: specifies the number of threads started in parallel to fetch the output from map tasks
- Guideline: natural logarithm of the size of the cluster, times four, as a whole number
- Format: integer
- Used by: TT
- Example:
<property> <name>mapred.reduce.parallel.copies</name> <value>5</value> </property>
Hadoop configuration⌘
- Parameter: mapred.reduce.tasks
- Configuration file: mapred-site.xml
- Purpose: specifies the number of reduce tasks to use for the MapReduce jobs
- Format: integer
- Used by: JT
- Example:
<property> <name>mapred.reduce.tasks</name> <value>1</value> </property>
Hadoop configuration⌘
- Parameter: tasktracker.http.threads
- Configuration file: mapred-site.xml
- Purpose: specifies the number of threads started in parallel to serve the output of map tasks
- Format: integer
- Used by: JT
- Example:
<property> <name>tasktracker.http.threads</name> <value>1</value> </property>
Hadoop configuration⌘
- Parameter: mapred.reduce.slowstart.completed.maps
- Configuration file: mapred-site.xml
- Purpose: specifies when to start allocating reducers as the number of complete map tasks grows
- Format: float
- Used by: JT
- Example:
<property> <name>mapred.reduce.slowstart.completed.maps</name> <value>0.5</value> </property>
Hadoop configuration⌘
- Parameter: mapred.jobtrackers.[nameservice]
- Configuration file: mapred-site.xml
- Purpose: specifies logical jobtracker names used for the jobtracker HA purpose
- Format: comma separated strings
- Used by: JT, TT, clients
- Example:
<property> <name>mapred.jobtrackers.HAcluster</name> <value>jobtracker1,jobtracker2</value> </property>
Hadoop configuration⌘
- Parameter: mapred.ha.jobtracker.id
- Configuration file: mapred-site.xml
- Purpose: identifies the local jobtracker within the HA pair (its logical name)
- Format: string
- Used by: JT, TT, clients
- Example:
<property> <name>mapred.ha.jobtracker.id</name> <value>jobtracker1</value> </property>
Hadoop configuration⌘
- Parameter: mapred.jobtracker.rpc-address.[nameservice].[jobtracker]
- Configuration file: mapred-site.xml
- Purpose: specifies jobtracker RPC address
- Format: Hadoop URL
- Used by: JT, TT, clients
- Example:
<property> <name>mapred.jobtracker.rpc-address.HAcluster.jobtracker1</name> <value>http://jobtracker1.hadooplab:8021</value> </property>
Hadoop configuration⌘
- Parameter: mapred.jobtracker.http-address.[nameservice].[jobtracker]
- Configuration file: mapred-site.xml
- Purpose: specifies jobtracker HTTP address
- Format: Hadoop URL
- Used by: JT, TT
- Example:
<property> <name>mapred.jobtracker.http-address.HAcluster.jobtracker1</name> <value>http://jobtracker1.hadooplab:50030</value> </property>
Hadoop configuration⌘
- Parameter: mapred.ha.jobtracker.rpc-address.[nameservice].[jobtracker]
- Configuration file: mapred-site.xml
- Purpose: specifies jobtracker RPC address used for the jobtracker HA purpose
- Format: Hadoop URL
- Used by: JT
- Example:
<property> <name>mapred.ha.jobtracker.rpc-address.HAcluster.jobtracker1</name> <value>http://jobtracker1.hadooplab:8022</value> </property>
Hadoop configuration⌘
- Parameter: mapred.ha.jobtracker.http-redirect-address.[nameservice].[jobtracker]
- Configuration file: mapred-site.xml
- Purpose: specifies jobtracker HTTP address for the jobtracker HA purpose
- Format: Hadoop URL
- Used by: JT
- Example:
<property> <name>mapred.ha.jobtracker.http-redirect-address.HAcluster.jobtracker1</name> <value>http://jobtracker1.hadooplab:50030</value> </property>
Hadoop configuration⌘
- Parameter: mapred.client.failover.proxy.provider.[nameservice]
- Configuration file: mapred-site.xml
- Purpose: specifies the class used for locating active jobtracker for jobtracker HA purpose
- Format: Java class
- Used by: TT, clients
- Example:
<property> <name>mapred.client.failover.proxy.provider.HAcluster</name> <value>org.apache.hadoop.mapred.ConfiguredFailoverProxyProvider</value> </property>
Hadoop configuration⌘
- Parameter: mapred.client.failover.max.attempts
- Configuration file: mapred-site.xml
- Purpose: specifies the maximum number of times to try to fail over
- Format: integer
- Used by: TT, clients
- Example:
<property> <name>mapred.client.failover.max.attempts</name> <value>3</value> </property>
Hadoop configuration⌘
- Parameter: mapred.ha.fencing.methods
- Configuration file: mapred-site.xml
- Purpose: specifies the fencing methods for jobtracker HA purpose
- Format: new line-separated list of Linux paths
- Used by: JT
- Example:
<property> <name>mapred.ha.fencing.methods</name> <value>/usr/local/bin/STONITH.sh</value> </property>
Hadoop configuration⌘
- Parameter: mapred.ha.automatic-failover.enabled
- Configuration file: mapred-site.xml
- Purpose: enables automatic failover in jobtracker HA pair
- Format: boolean
- Used by: JT
- Example:
<property> <name>mapred.ha.automatic-failover.enabled</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: mapred.jobtracker.restart.recover
- Configuration file: mapred-site.xml
- Purpose: enables job recovery on jobtracker restart or during the failover
- Format: boolean
- Used by: JT
- Example:
<property> <name>mapred.jobtracker.restart.recover</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: mapred.ha.zkfc.port
- Configuration file: mapred-site.xml
- Purpose: specifies ZKFC port
- Format: integer
- Used by: JT
- Example:
<property> <name>mapred.ha.zkfc.port</name> <value>8018</value> </property>
Hadoop cluster tweaking⌘
- Make sure that hostname and canonical name are the same:
- hostname - returned by the uname() syscall (uname -n)
- canonical name - returned by the resolver
- Set the runtime memory swappiness to the minimum (see the command sketch after this list):
- set the value of the /proc/sys/vm/swappiness file to 0
- temporarily (manually or by using the sysctl command) or permanently (by editing the /etc/sysctl.conf file)
- Allow memory to be overcommitted:
- set the value of the /proc/sys/vm/overcommit_memory file to 1
- temporarily (manually or by using the sysctl command) or permanently (by editing the /etc/sysctl.conf file)
- Disable access time stamps:
- mount the filesystems with noatime and nodiratime options
- temporarily (by using the mount command) or permanently (by editing the /etc/fstab file)
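The tweaks above as commands, assuming RHEL-style paths; /data stands for whichever mount points hold the HDFS and MapReduce directories:
sysctl -w vm.swappiness=0                  # runtime; add "vm.swappiness = 0" to /etc/sysctl.conf to persist
sysctl -w vm.overcommit_memory=1           # runtime; add "vm.overcommit_memory = 1" to /etc/sysctl.conf to persist
mount -o remount,noatime,nodiratime /data  # runtime; add the options to the matching /etc/fstab entry to persist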
Lab preparation⌘
- Lab Exercise 4.2.1
- Lab Exercise 4.2.2
Hadoop cluster installation and configuration - HDFS⌘
- Lab Exercise 4.3.1
- Lab Exercise 4.3.2
- Lab Exercise 4.3.3
- Lab Exercise 4.3.4
- Lab Exercise 4.3.5
- Lab Exercise 4.3.6
Hadoop cluster installation and configuration - MapReduce⌘
- Lab Exercise 4.4.1
- Lab Exercise 4.4.2
- Lab Exercise 4.4.3
- Lab Exercise 4.4.4
Fifth Day - Session I, II and III⌘
Upgrading from MRv1 to YARN⌘
- Packages to be uninstalled:
- hadoop-0.20-mapreduce
- hadoop-0.20-mapreduce-jobtracker
- hadoop-0.20-mapreduce-tasktracker
- hadoop-0.20-mapreduce-zkfc
- hadoop-0.20-mapreduce-jobtrackerha
- Packages to be installed:
- hadoop-yarn
- hadoop-mapreduce
- hadoop-mapreduce-historyserver
- hadoop-yarn-resourcemanager
- hadoop-yarn-nodemanager
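A minimal sketch of the package swap on a yum-based CDH node, using exactly the package names listed above:
yum remove -y hadoop-0.20-mapreduce-jobtracker hadoop-0.20-mapreduce-tasktracker \
              hadoop-0.20-mapreduce-zkfc hadoop-0.20-mapreduce-jobtrackerha hadoop-0.20-mapreduce
yum install -y hadoop-yarn hadoop-mapreduce hadoop-mapreduce-historyserver \
               hadoop-yarn-resourcemanager hadoop-yarn-nodemanager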
Minimal configuration⌘
- yarn-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>your.hostname.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Minimal configuration #2⌘
- mapred-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
- Lab Exercise 2
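Once the two files above are in place, the cluster can be brought up and smoke-tested; a sketch assuming the CDH init scripts installed by the packages from the previous slide and the usual examples jar location (the jar path may differ between distributions):
service hadoop-yarn-resourcemanager start
service hadoop-yarn-nodemanager start
service hadoop-mapreduce-historyserver start
yarn node -list                                  # node managers should report as RUNNING
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10   # trivial test job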
Hadoop configuration⌘
- Parameter: fs.defaultFS
- Configuration file: core-site.xml
- Purpose: specifies the default filesystem used by clients
- Format: Hadoop URL
- Used by: NN, SNN, DN, RM, NM, clients
- Example:
<property> <name>fs.defaultFS</name> <value>hdfs://namenode.hadooplab:8020</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.hostname
- Configuration file: yarn-site.xml
- Purpose: specifies the location of resource manager
- Format: hostname:port
- Used by: RM, NM, clients
- Example:
<property> <name>yarn.resourcemanager.hostname</name> <value>resourcemanager.hadooplab:8021</value> </property>
Hadoop configuration⌘
- Parameter: yarn.nodemanager.local-dirs
- Configuration file: yarn-site.xml
- Purpose: specifies where the node manager should store application temporary data
- Format: comma separated list of Hadoop URLs
- Used by: NM
- Example:
<property> <name>yarn.nodemanager.local-dirs</name> <value>file:///applocaldir</value> </property>
Hadoop configuration⌘
- Parameter: yarn.nodemanager.resource.memory-mb
- Configuration file: yarn-site.xml
- Purpose: specifies memory in megabytes available for allocation for the applications
- Format: integer
- Used by: NM
- Example:
<property> <name>yarn.nodemanager.resource.memory-mb</name> <value>40960</value> </property>
Hadoop configuration⌘
- Parameter: yarn.nodemanager.resource.cpu-vcores
- Configuration file: yarn-site.xml
- Purpose: specifies virtual cores available for allocation for the applications
- Format: integer
- Used by: NM
- Example:
<property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>8</value> </property>
Hadoop configuration⌘
- Parameter: yarn.scheduler.minimum-allocation-mb
- Configuration file: yarn-site.xml
- Purpose: specifies minimum amount of memory in megabytes allocated for the applications
- Format: integer
- Used by: NM
- Example:
<property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>2048</value> </property>
Hadoop configuration⌘
- Parameter: yarn.scheduler.minimum-allocation-vcores
- Configuration file: yarn-site.xml
- Purpose: specifies minimum number of virtual cores allocated for the applications
- Format: integer
- Used by: NM
- Example:
<property> <name>yarn.scheduler.minimum-allocation-vcores</name> <value>2</value> </property>
Hadoop configuration⌘
- Parameter: yarn.scheduler.increment-allocation-mb
- Configuration file: yarn-site.xml
- Purpose: specifies the increment (in megabytes) by which memory allocations for applications are rounded up
- Format: integer
- Used by: NM
- Example:
<property> <name>yarn.scheduler.increment-allocation-mb</name> <value>512</value> </property>
Hadoop configuration⌘
- Parameter: yarn.scheduler.increment-allocation-vcores
- Configuration file: yarn-site.xml
- Purpose: specifies the increment by which virtual core allocations for applications are rounded up
- Format: integer
- Used by: NM
- Example:
<property> <name>yarn.scheduler.increment-allocation-vcores</name> <value>1</value> </property>
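A worked example of how the minimum and increment settings above interact, using the example values (minimum 2048 MB, increment 512 MB) and assuming the Fair Scheduler behaviour of rounding container requests up to the configured increment with the minimum as a floor; a 3000 MB request then becomes a 3072 MB container:
awk 'BEGIN { req=3000; min=2048; inc=512;
             alloc = inc * int((req + inc - 1) / inc);   # round the request up to the increment
             if (alloc < min) alloc = min;               # never go below the minimum allocation
             print alloc " MB" }'                        # -> 3072 MB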
Hadoop configuration⌘
- Parameter: mapreduce.task.io.sort.mb
- Configuration file: mapred-site.xml
- Purpose: specifies the size of the circular in-memory buffer used to store MapReduce temporary data
- Guideline: ~10% of the JVM heap size
- Format: integer (MB)
- Used by: NM
- Example:
<property> <name>mapreduce.task.io.sort.mb</name> <value>128</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.task.io.sort.factor
- Configuration file: mapred-site.xml
- Purpose: specifies the number of files to merge at once
- Guideline: 64 for 1 GB JVM heap size
- Format: integer
- Used by: NM
- Example:
<property> <name>mapreduce.task.io.sort.factor</name> <value>64</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.map.output.compress
- Configuration file: mapred-site.xml
- Purpose: specifies whether the output of the map tasks should be compressed or not
- Format: 'true' / 'false'
- Used by: NM
- Example:
<property> <name>mapreduce.map.output.compress</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.map.output.compress.codec
- Configuration file: mapred-site.xml
- Purpose: specifies the codec used for map tasks output compression
- Format: Java class
- Used by: NM
- Example:
<property> <name>mapreduce.map.output.compress.codec</name> <value>org.apache.hadoop.io.compress.SnappyCodec</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.output.fileoutputformat.compress.type
- Configuration file: mapred-site.xml
- Purpose: specifies whether compression is applied to individual record values (RECORD), to blocks of key-value pairs (BLOCK), or not at all (NONE)
- Format: 'RECORD' / 'BLOCK' / 'NONE'
- Used by: NM
- Example:
<property> <name>mapreduce.output.fileoutputformat.compress.type</name> <value>BLOCK</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.reduce.shuffle.parallelcopies
- Configuration file: mapred-site.xml
- Purpose: specifies the number of threads started in parallel to fetch the output from map tasks
- Guideline: natural logarithm of the size of the cluster, times four, as a whole number
- Format: integer
- Used by: NM
- Example:
<property> <name>mapreduce.reduce.shuffle.parallelcopies</name> <value>5</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.job.reduces
- Configuration file: mapred-site.xml
- Purpose: specifies the number of reduce tasks to use for the MapReduce jobs
- Format: integer
- Used by: RM
- Example:
<property> <name>mapreduce.job.reduces</name> <value>1</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.shuffle.max.threads
- Configuration file: mapred-site.xml
- Purpose: specifies the number of threads started in parallel to serve the output of map tasks
- Format: integer
- Used by: RM
- Example:
<property> <name>mapreduce.shuffle.max.threads</name> <value>1</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.job.reduce.slowstart.completedmaps
- Configuration file: mapred-site.xml
- Purpose: specifies when to start allocating reducers as the number of complete map tasks grows
- Format: float
- Used by: RM
- Example:
<property> <name>mapreduce.job.reduce.slowstart.completedmaps</name> <value>0.5</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.ha.rm-ids
- Configuration file: yarn-site.xml
- Purpose: specifies logical resource manager names used for resource manager HA purpose
- Format: comma separated strings
- Used by: RM, NM, clients
- Example:
<property> <name>yarn.resourcemanager.ha.rm-ids</name> <value>resourcemanager1,resourcemanager2</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.ha.id
- Configuration file: yarn-site.xml
- Purpose: identifies the local resource manager within the HA pair (its logical name)
- Format: string
- Used by: RM
- Example:
<property> <name>yarn.resourcemanager.ha.id</name> <value>resourcemanager1</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.address.[resourcemanager]
- Configuration file: yarn-site.xml
- Purpose: specifies resource manager RPC address
- Format: Hadoop URL
- Used by: RM, clients
- Example:
<property> <name>yarn.resourcemanager.address.resourcemanager1</name> <value>http://resourcemanager1.hadooplab:8032</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.scheduler.address.[resourcemanager]
- Configuration file: yarn-site.xml
- Purpose: specifies resource manager RPC address for Scheduler service
- Format: Hadoop URL
- Used by: RM, clients
- Example:
<property> <name>yarn.resourcemanager.scheduler.address.resourcemanager1</name> <value>http://resourcemanager1.hadooplab:8030</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.admin.address.[resourcemanager]
- Configuration file: yarn-site.xml
- Purpose: specifies resource manager RPC address for the Admin service
- Format: Hadoop URL
- Used by: RM, clients
- Example:
<property> <name>yarn.resourcemanager.admin.address.resourcemanager1</name> <value>http://resourcemanager1.hadooplab:8033</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.resource-tracker.address.[resourcemanager]
- Configuration file: yarn-site.xml
- Purpose: specifies resource manager RPC address for Resource Tracker service
- Format: Hadoop URL
- Used by: RM, NM
- Example:
<property> <name>yarn.resourcemanager.resource-tracker.address.resourcemanager1</name> <value>http://resourcemanager1.hadooplab:8031</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.webapp.address.[resourcemanager]
- Configuration file: yarn-site.xml
- Purpose: specifies resource manager address for web UI and REST API services
- Format: Hadoop URL
- Used by: RM, clients
- Example:
<property> <name>yarn.resourcemanager.webapp.address.resourcemanager1</name> <value>http://resourcemanager1.hadooplab:8088</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.ha.fencer
- Configuration file: yarn-site.xml
- Purpose: specifies the fencing methods for the resource manager HA purpose
- Format: new-line-separated list of Linux paths
- Used by: RM
- Example:
<property> <name>yarn.resourcemanager.ha.fencer</name> <value>/usr/local/bin/STONITH.sh</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.ha.auto-failover.enabled
- Configuration file: yarn-site.xml
- Purpose: enables automatic failover in resource manager HA pair
- Format: boolean
- Used by: RM
- Example:
<property> <name>yarn.resourcemanager.ha.auto-failover.enabled</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.recovery.enabled
- Configuration file: yarn-site.xml
- Purpose: enables job recovery on resource manager restart or during the failover
- Format: boolean
- Used by: RM
- Example:
<property> <name>yarn.resourcemanager.recovery.enabled</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.ha.auto-failover.port
- Configuration file: yarn-site.xml
- Purpose: specifies ZKFC port
- Format: integer
- Used by: RM
- Example:
<property> <name>yarn.resourcemanager.ha.auto-failover.port</name> <value>8018</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.ha.enabled
- Configuration file: yarn-site.xml
- Purpose: enables resource manager HA feature
- Format: boolean value
- Used by: RM
- Example:
<property> <name>yarn.resourcemanager.ha.enabled</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.store.class
- Configuration file: yarn-site.xml
- Purpose: specifies storage for resource manager internal state
- Format: Java class
- Used by: RM
- Example:
<property> <name>yarn.resourcemanager.store.class</name> <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.cluster-id
- Configuration file: yarn-site.xml
- Purpose: specifies the logical name of YARN cluster
- Format: string
- Used by: RM, NM, clients
- Example:
<property> <name>yarn.resourcemanager.cluster-id</name> <value>HAcluster</value> </property>
Configuration changes⌘
MRv1 | MRv2 |
---|---|
fs.default.name | fs.defaultFS |
ha.zookeeper.quorum | --- |
mapred.job.tracker | yarn.resourcemanager.hostname |
mapred.local.dir | yarn.nodemanager.local-dirs |
mapred.child.java.opts | --- |
mapred.child.ulimit | --- |
mapred.tasktracker.map.tasks.maximum | --- |
mapred.tasktracker.reduce.tasks.maximum | --- |
Configuration changes⌘
MRv1 | MRv2 |
---|---|
--- | yarn.nodemanager.resource.memory-mb |
--- | yarn.nodemanager.resource.cpu-vcores |
--- | yarn.scheduler.minimum-allocation-mb |
--- | yarn.scheduler.minimum-allocation-vcores |
--- | yarn.scheduler.increment-allocation-mb |
--- | yarn.scheduler.increment-allocation-vcores |
io.sort.mb | mapreduce.task.io.sort.mb |
io.sort.factor | mapreduce.task.io.sort.factor |
Configuration changes⌘
MRv1 | MRv2 |
---|---|
mapred.compress.map.output | mapreduce.map.output.compress |
mapred.map.output.compression.codec | mapreduce.map.output.compress.codec |
mapred.output.compression.type | mapreduce.output.fileoutputformat.compress.type |
mapred.reduce.parallel.copies | mapreduce.reduce.shuffle.parallelcopies |
mapred.reduce.tasks | mapreduce.job.reduces |
tasktracker.http.threads | mapreduce.shuffle.max.threads |
mapred.reduce.slowstart.completed.maps | mapreduce.job.reduce.slowstart.completedmaps |
mapred.jobtrackers.[nameservice] | yarn.resourcemanager.ha.rm-ids |
mapred.ha.jobtracker.id | yarn.resourcemanager.ha.id |
Configuration changes⌘
MRv1 | MRv2 |
---|---|
mapred.jobtracker*address / mapred.ha.jobtracker*address | yarn.resourcemanager*address |
mapred.client.failover.* | --- |
mapred.ha.fencing.methods | yarn.resourcemanager.ha.fencer |
mapred.ha.automatic-failover.enabled | yarn.resourcemanager.ha.auto-failover.enabled |
mapred.ha.zkfc.port | yarn.resourcemanager.ha.auto-failover.port |
--- | yarn.resourcemanager.ha.enabled |
--- | yarn.resourcemanager.store.class |
mapred.jobtracker.restart.recover | yarn.resourcemanager.recovery.enabled |
--- | yarn.resourcemanager.cluster-id |
Hadoop configuration⌘
- Parameter: hadoop.security.authentication
- Configuration file: core-site.xml
- Purpose: defines authentication mechanism to use within the Hadoop cluster
- Format: string
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>hadoop.security.authentication</name> <value>kerberos</value> </property>
Hadoop configuration⌘
- Parameter: hadoop.security.authorization
- Configuration file: core-site.xml
- Purpose: enables / disables client authorization within the Hadoop cluster
- Format: boolean
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>hadoop.security.authorization</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: dfs.block.access.token.enable
- Configuration file: hdfs-site.xml
- Purpose: enables / disables access to HDFS blocks for authenticated users only
- Format: boolean
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.block.access.token.enable</name> <value>true</value> </property>
Hadoop configuration⌘
- Parameter: dfs.namenode.keytab.file
- Configuration file: hdfs-site.xml
- Purpose: specifies a path to namenode keytab file
- Format: Linux path
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.namenode.keytab.file</name> <value>/etc/hadoop/conf/hdfs.keytab</value> </property>
Hadoop configuration⌘
- Parameter: dfs.namenode.kerberos.principal
- Configuration file: hdfs-site.xml
- Purpose: specifies the Kerberos principal that the namenode should use for authentication
- Format: Kerberos principal
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.namenode.kerberos.principal</name> <value>hdfs/namenode@HADOOPLAB.COM</value> </property>
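A quick way to verify that the keytab and principal configured above actually work, using the standard Kerberos client tools (the keytab path and principal are the ones from the examples):
klist -kt /etc/hadoop/conf/hdfs.keytab                   # list the principals stored in the keytab
kinit -kt /etc/hadoop/conf/hdfs.keytab hdfs/namenode@HADOOPLAB.COM
klist                                                    # a ticket for hdfs/namenode should now be present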
Hadoop configuration⌘
- Parameter: dfs.namenode.kerberos.internal.spnego.principal
- Configuration file: hdfs-site.xml
- Purpose: specifies the internal SPNEGO principal that the namenode should use
- Format: Kerberos principal
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.namenode.kerberos.internal.spnego.principal</name> <value>HTTP/namenode@HADOOPLAB.COM</value> </property>
Hadoop configuration⌘
- Parameter: dfs.secondary.namenode.keytab.file
- Configuration file: hdfs-site.xml
- Purpose: specifies a path to secondary namenode keytab file
- Format: Linux path
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.secondary.namenode.keytab.file</name> <value>/etc/hadoop/conf/hdfs.keytab</value> </property>
Hadoop configuration⌘
- Parameter: dfs.secondary.namenode.kerberos.principal
- Configuration file: hdfs-site.xml
- Purpose: specifies the Kerberos principal that the secondary namenode should use for authentication
- Format: Kerberos principal
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.secondary.namenode.kerberos.principal</name> <value>hdfs/secondarynamenode@HADOOPLAB.COM</value> </property>
Hadoop configuration⌘
- Parameter: dfs.secondary.namenode.kerberos.internal.spnego.principal
- Configuration file: hdfs-site.xml
- Purpose: specifies the internal SPNEGO principal that the secondary namenode should use
- Format: Kerberos principal
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name> <value>HTTP/namenode@HADOOPLAB.COM</value> </property>
Hadoop configuration⌘
- Parameter: dfs.datanode.keytab.file
- Configuration file: hdfs-site.xml
- Purpose: specifies a path to datanode keytab file
- Format: Linux path
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.datanode.keytab.file</name> <value>/etc/hadoop/conf/hdfs.keytab</value> </property>
Hadoop configuration⌘
- Parameter: dfs.datanode.kerberos.principal
- Configuration file: hdfs-site.xml
- Purpose: specifies the Kerberos principal that the datanode should use for authentication
- Format: Kerberos principal
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.datanode.kerberos.principal</name> <value>hdfs/datanode@HADOOPLAB.COM</value> </property>
Hadoop configuration⌘
- Parameter: dfs.datanode.data.dir.perm
- Configuration file: hdfs-site.xml
- Purpose: overrides the permissions assigned to the directories specified by the dfs.data.dir parameter
- Format: Linux permissions
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.datanode.data.dir.perm</name> <value>700</value> </property>
Hadoop configuration⌘
- Parameter: dfs.datanode.address
- Configuration file: hdfs-site.xml
- Purpose: specifies IP address and port on which the data transceiver RPC server should be bound
- Guideline: port must be below 1024 in secure mode
- Format: IP:port
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.datanode.address</name> <value>0.0.0.0:1004</value> </property>
Hadoop configuration⌘
- Parameter: dfs.datanode.http.address
- Configuration file: hdfs-site.xml
- Purpose: specifies IP address and port on which the embedded HTTP server should be bound
- Format: IP:port
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>dfs.datanode.http.address</name> <value>0.0.0.0:1006</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.keytab
- Configuration file: yarn-site.xml
- Purpose: specifies a path to resourcemanager keytab file
- Format: Linux path
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>yarn.resourcemanager.keytab</name> <value>/etc/hadoop/conf/yarn.keytab</value> </property>
Hadoop configuration⌘
- Parameter: yarn.resourcemanager.principal
- Configuration file: yarn-site.xml
- Purpose: specifies the Kerberos principal that the resourcemanager should use for authentication
- Format: Kerberos principal
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>yarn.resourcemanager.principal</name> <value>yarn/resourcemanager@HADOOPLAB.COM</value> </property>
Hadoop configuration⌘
- Parameter: yarn.nodemanager.keytab
- Configuration file: yarn-site.xml
- Purpose: specifies a path to nodemanager keytab file
- Format: Linux path
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>yarn.nodemanager.keytab</name> <value>/etc/hadoop/conf/yarn.keytab</value> </property>
Hadoop configuration⌘
- Parameter: yarn.nodemanager.principal
- Configuration file: yarn-site.xml
- Purpose: specifies the Kerberos principal that the nodemanager should use for authentication
- Format: Kerberos principal
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>yarn.nodemanager.principal</name> <value>yarn/nodemanager@HADOOPLAB.COM</value> </property>
Hadoop configuration⌘
- Parameter: yarn.nodemanager.container-executor.class
- Configuration file: yarn-site.xml
- Purpose: specifies the executor class responsible for launching and managing containers
- Format: Java class
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>yarn.nodemanager.container-executor.class</name> <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value> </property>
Hadoop configuration⌘
- Parameter: yarn.nodemanager.linux-container-executor.group
- Configuration file: yarn-site.xml
- Purpose: specifies the Unix group used by the Linux container executor
- Format: string
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>yarn.nodemanager.linux-container-executor.group</name> <value>yarn</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.jobhistory.address
- Configuration file: mapred-site.xml
- Purpose: specifies IP address and port on which the MapReduce Job History server should be bound
- Format: IP:port
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>mapreduce.jobhistory.address</name> <value>0.0.0.0:10020</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.jobhistory.keytab
- Configuration file: mapred-site.xml
- Purpose: specifies a path to jobhistoryserver keytab file
- Format: Linux path
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>mapreduce.jobhistory.keytab</name> <value>/etc/hadoop/conf/mapred.keytab</value> </property>
Hadoop configuration⌘
- Parameter: mapreduce.jobhistory.principal
- Configuration file: mapred-site.xml
- Purpose: specifies the Kerberos principal that the jobhistoryserver should use for authentication
- Format: Kerberos principal
- Used by: NN, SNN, DN, RM, NM
- Example:
<property> <name>mapreduce.jobhistory.principal</name> <value>mapred/jobhistoryserver@HADOOPLAB.COM</value> </property>
Hadoop configuration⌘
- /etc/hadoop/conf/container-executor.cfg:
yarn.nodemanager.local-dirs=[comma-separated list of paths to local nodemanager directories]
yarn.nodemanager.linux-container-executor.group=[user or group for containers execution purpose]
yarn.nodemanager.log-dirs=[comma-separated list of paths to local nodemanager log directories]
banned.users=[comma-separated list of users disallowed to execute jobs]
min.user.id=[minimum UID of users allowed to execute jobs]
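An illustrative, filled-in version of the template above; every value is a placeholder (the local-dirs entry reuses the example from the yarn.nodemanager.local-dirs slide, the remaining values are common defaults rather than mandated ones):
cat > /etc/hadoop/conf/container-executor.cfg <<'EOF'
yarn.nodemanager.local-dirs=/applocaldir
yarn.nodemanager.linux-container-executor.group=yarn
yarn.nodemanager.log-dirs=/var/log/hadoop-yarn/containers
banned.users=hdfs,yarn,mapred,bin
min.user.id=1000
EOF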
Security - YARN⌘
- Principals:
- JobHistoryServer: mapred@[HOST]
- ResourceManager, NodeManager: yarn@[HOST]
- Kerberos properties:
- MRv1: taskcontroller.cfg
- MRv2: container-executor.cfg
YARN cluster installation and configuration⌘
- Lab Exercise 5.2.1
- Lab Exercise 5.2.2
- Lab Exercise 5.2.3
Running MapReduce jobs on YARN cluster⌘
- Lab Exercise 5.3.1
Fifth Day - Session IV⌘
Certification and Surveys⌘
- Congratulations on completing the course!
- Official Cloudera certification authority:
- Visit http://www.nobleprog.com for other courses
- Surveys: http://www.nobleprog.pl/te