Overview
Big Data is a term for very large data sets that can be analyzed computationally to reveal patterns or trends in seemingly random data. Today the whole IT industry is restructuring the way it maintains its databases. The data could be anything: email IDs, employee counts, client records, patients' blood groups, or a worldwide collection of driving license numbers.
In simple words, Big Data is a set of techniques for managing important but scattered data and analyzing its behavior. It is a recent technology that organizations around the world are moving to, and it is expected to create many jobs and opportunities to start a business in the field.
IBM says: Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
Prerequisite
The workshop content consists of an approximately equal mixture of lecture and hands-on lab. This will be a minimum 1 / 2 days workshop. All students should have at least a moderate knowledge of basic C programming.
Recommendation: It is strongly recommended that you bring your own laptop to the training, so that you can install and run programs if you would like to do the optional hands-on experiments and exercises after the workshop.
Course Details
Introduction to Big Data
- What is Big data
- Big Data opportunities
- Big Data Challenges
- Characteristics of Big data
Introduction to Hadoop
- Hadoop Distributed File System
- Industries using Hadoop.
- Data Locality.
- Hadoop Architecture.
- Map Reduce & HDFS.
- Using the Hadoop single node image (Clone).
The Hadoop Distributed File System (HDFS)
- HDFS Design & Concepts
- Blocks, Name nodes and Data nodes
- HDFS High-Availability and HDFS Federation.
- The Hadoop DFS command-line interface.
- Anatomy of File Read
- Anatomy of File Write
- Block Placement Policy and Modes
- Configuration files in more detail.
- Metadata, FS image, Edit log, Secondary Name Node and Safe Mode.
- How to add New Data Node dynamically.
- How to decommission a Data Node dynamically (Without stopping cluster).
- FSCK utility (block report).
- How to override default configuration at system level and Programming level.
- ZooKeeper leader election algorithm.
- Exercise and small use case on HDFS.
Map Reduce
- Functional Programming Basics.
- Map and Reduce Basics
- How Map Reduce Works
- Anatomy of a Map Reduce Job Run
- Legacy Architecture -> Job Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status Updates, Job Completion, Failures
- Shuffling and Sorting
- Splits, Record reader, Partition, Types of partitions & Combiner
- Optimization Techniques -> Speculative Execution, JVM Reuse and Number of Slots.
- Types of Schedulers and Counters.
- Comparisons between Old and New API at code and Architecture Level.
- Getting the data from RDBMS into HDFS using Custom data types.
- Distributed Cache and Hadoop Streaming (Python, Ruby and R).
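To give a flavour of the map, shuffle-and-sort, and reduce steps listed above, here is a minimal single-process sketch of a word count in Python. It is only an illustration and does not use Hadoop itself: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `word_count`) are hypothetical, and a real job would run the mapper and reducer as separate tasks across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group values by key and sort by key, imitating what happens
    between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reducer: sum all counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(pairs))


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

The mapper and reducer are kept as separate functions because that split is what lets a framework run many mappers in parallel and route all values for one key to a single reducer.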
Introduction to R
- History of R
- An Insight into R
- Data Structure and Data Type
Data Management and Data Cleaning
- Missing Value Treatment
- Outlier Treatment
- Sorting Datasets
- Merging Datasets
- Creating new variables
- Binning variables
- Reading datasets from other environments into R (importing)
- Writing datasets from the R environment to other environments (exporting)
Data Visualization in R
- Bar Chart
- Dot Plot
- Scatter Plot (3D)
- Spinning Scatter Plots
- Pie Chart
- Histogram (3D), including colourful ones
- Overlapping Histograms
- Boxplot
- Plotting with Base and Lattice Graphics
- Plotting and Colouring
- Geo Charts
- Motion Charts
- Case Study with Data Management