Hadoop

JULY 15 2018 | 05:00

Hadoop is an open source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless parallel tasks or jobs.

Importance of Hadoop:

  • Processing power: Hadoop's distributed computing model processes big data quickly. The more computing nodes you use, the more processing power you have.

  • Ability to store and process huge amounts of any kind of data, quickly: With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), this is a key consideration.

  • Fault tolerance: Data and application processing are protected against hardware failure. If a node fails, tasks are automatically redirected to other nodes so that the distributed computation does not fail, and multiple copies of all data are stored automatically.

  • Flexibility: You don't have to preprocess data before storing it. You can store as much data as you want and decide how to use it later; this includes unstructured data such as text, images, and videos (see the sketch after this list).

  • Low cost: The open-source framework is free and uses commodity hardware to store large volumes of data.

  • Scalability: You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
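
To make the flexibility point concrete, the sketch below loads raw, unprocessed files into Hadoop's distributed file system (HDFS, described in the next section) through the Java FileSystem API, with no schema imposed up front. The NameNode URI and the file paths are illustrative assumptions, not values taken from this article.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; in practice this comes from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    // Copy local files into HDFS exactly as they are: no schema, no preprocessing.
    fs.copyFromLocalFile(new Path("/data/raw/clickstream.log"),
                         new Path("/landing/clickstream/"));
    fs.copyFromLocalFile(new Path("/data/raw/product-images.tar"),
                         new Path("/landing/images/"));

    fs.close();
  }
}

The files sit in HDFS as-is; how they are parsed is decided later by whichever job reads them.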

How Hadoop works:

    Currently, four core modules are included in the basic framework from the Apache Foundation:

  1. Hadoop Common – the libraries and utilities used by other Hadoop modules.
  2. Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization.
  3. YARN (Yet Another Resource Negotiator) – provides resource management for the processes running on Hadoop.
  4. MapReduce – a parallel processing software framework. It consists of two steps: in the map step, a master node takes the input, partitions it into smaller subproblems, and distributes them to worker nodes; in the reduce step, the master node collects the answers to all the subproblems and combines them to produce the output (see the word-count sketch below).
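
To illustrate the map and reduce steps described above, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The class names and the command-line input/output paths are assumptions made for the example.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: split each input line into words and emit (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job would typically be packaged into a jar and launched with something like "hadoop jar wordcount.jar WordCount /input /output", where /input and /output are HDFS directories (hypothetical paths).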

Applications

  1. Job tracking
  2. Scheduling
  3. Marketing analytics
  4. Machine learning and/or sophisticated data mining
  5. Image processing
  6. Processing of XML messages
  7. Web crawling and/or text processing
  8. General archiving, including of relational/tabular data, e.g., for compliance