All about DataSince, DataEngineering and ComputerScience
View the Project on GitHub datainsightat/DataScience_Examples
Big Data 5 V’s: Volume, Velocity, Variety, Veracity (Accuracy), Value
Distributed Storage, Distributed Computing
MapReduce: Distributed Computing
Name
Node (Master)
| |-----|-----| Data Data Data Node Node Node (Slave)
Fault tolerance: Data is stored in block size of 128 MB and replicated over 3 machines (configurable).
1) Enable Cloud Dataproc API > Create Cluster 2) Set up Cluster 3) Configure Node 4) CREATE
Use ‘SSH’ to get a Terminal to the master node. The nodes run on ‘SMP Debian 5.1’.
$ wget https://raw.githubusercontent.com/futurexskill/bigdata/master/retailstore.csv
$ hadoop fs -ls
$ hadoop fs -mkfir /user/newuser
$ hadoop fs -put retailstore.csv /user/newuser/
$ rm retailstore.csv
$ hadoop fs -get retailstore.csv
Way of sending computational tasks to Worker Nodes.
Job -> RessourceManager
Job Tracker
|
|-------------|-------------|
Mapper(m) NodeManager NodeManager NodeManager
Reducer(r) Task Tracker Task Tracker Task Tracker
Mapper(m)
Map tasks reads reach row and fetches an element. Reduce task performs aggregation operations like calculating sum, average etc on the fetched element. YARN was implemented to run non-MapReduce jobs on Hadoop clusters.