Hadoop Ecosystem Application

Aditi Dosi
10 min readDec 3, 2022

A Complete processing architecture

When Hadoop was released, it supplied two Big Data needs: distributed storage and distributed processing. However, this alone is not enough to work with Big Data; other tools are required, i.e., other functionalities to meet different business, application, and architecture needs.

The entire Hadoop Ecosystem is to work with billions of records. That’s what we have at our disposal the whole Apache Hadoop Ecosystem.

Over the years, other software has begun to appear to run seamlessly together with Hadoop and its Ecosystem. Several products benefit, for example, from the HDFS file system that allows you to manage multiple machines as if they were just one.

These other Ecosystem products benefit by running on The Hadoop Distributed File System. With this, we can create an Ecosystem focused on the need for an application, architecture, or what we want to do with Big Data. Just think of the Ecosystem as the iOS or Android operating system apps; without them, the smartphone would only serve to call and receive calls — their initial goal.

Applications serve to enhance the Operating System’s capability, so we can apply the same reasoning to the Hadoop Ecosystem components to complement the operation of Hadoop.

  • Data transfer (Flume, Sqoop, Kafka, Falcon)
  • File System (HDFS)
  • Data Storage (HBase, Cassandra)
  • Serialization (Avro, Trevni, Thrift)
  • Jobs Execution (MapReduce, YARN)
  • Data Interaction (Pig, Hive, Spark, Storm)
  • Intelligence (Mahout, Drill)
  • Search (Lucene, Solr)
  • Graphics (Giraph)
  • Security (Knox, Sentry)
  • Operation and Development (Ooozie, Zookeeper, Ambari)

In addition to using Hadoop, we can use other systems that will run on Hadoop and, with that, assemble a single architecture powerful enough to extract from Big Data the best it has to offer.

1. Apache Zookeeper — coordination of distributed services

To manage and organize the zoo of several of Hadoop’s animals was created the guardian Zookeeper, responsible for the entire functioning. ZooKeeper has become a standard for organizing Hadoop services, HBase, and other distributed structures.

Zookeeper is a high-performance open-source solution for coordinating services in distributed applications, i.e., corresponds to large cluster sets.

Coordinating and managing a service in a distributed environment is a complicated process. The zookeeper solves the debut problem with its simple architecture, doing the management of the Hadoop cluster. Zookeeper allows developers to focus on the logic of the main application without worrying about the distributed nature of the application.

2. Apache Oozie — Workflow Scheduling

The Hadoop Ecosystem also has a workflow manager, which allows us to schedule jobs and manage them through the Cluster. Remember that we are dealing with running on computer clusters, where all of this requires different management than companies are typically used to doing in more traditional Database environments.

When we create the Cluster to store and process large data sets, other concerns will arise. Hadoop came to solve the big data problem, but it brought cluster management, which requires other practices, different techniques, etc.

Apache Oozie is a workflow scheduling system used to manage MapReduce Jobs primarily. Apache Oozie is integrated with the rest of the Hadoop components to support various Hadoop jobs (such as Java Map-Reduce, Streaming Map-Reduce, Pig, Hive, and Sqoop) as well as system-specific and jobs (such as Java programs and shell scripts).

Oozie is a workflow processing system that allows users to define a series of jobs written in different languages like MapReduce, Pig, and Hive and then intelligently link them to each other. Ozzie will enable users to specify that a particular query can only be started after previous jobs that access the same data are complete.

3. Apache Hive — Data Warehouse (Developed by Facebook)

Apache Hive is a Data Warehouse that works with Hadoop and MapReduce. Hive is a data storage system for Hadoop that makes it easy to aggregate data for reporting and analysis of large data sets (Big Data). It is a system for managing and querying unstructured data in a structured format!

Just Hadoop alone, maybe not enough for an architecture of a Big Data application. Hadoop lacks some components that we need for the day to day, such as a tool that allows us to aggregate and generate reports from the data stored in Hadoop HDFS. Soon, Apache Hive came to meet this processing need, being another component that runs on Apache Hadoop.

Hive enables queries about data using a SQL-like language called HiveQL (HQL). This system provides fault tolerance capability for data storage and relies on MapReduce for execution, meaning Hive alone is of no use! We need Hive running on a Hadoop infrastructure because it needs the data stored in HDFS and also depends on MapReduce, so Hive is a kind of Hadoop plugin.

It allows JDBC/ODBC connections to be easily integrated with other business intelligence tools like Tableau, Microstrategy, Microsoft Power BI, and more. Hive is batch-oriented and has high latency for query execution. Like Pig, it generates MapReduce Jobs that run in the Hadoop Cluster.

In the Backend, Apache Hive itself will generate a Job from MapReduce. We do not need to create this Job from MapReduce; Hive will facilitate our life by serving as a user-friendly interface and easy to collect the data from the infrastructure.

Therefore, Hive uses MapReduce for execution and HDFS for data storage and research; it provides the specific HQL language for Hive engine queries, which supports the basics of the SQL language.

4. Apache Sqoop — SQL Server/Oracle for HDFS

Sqoop is a project of the Apache Hadoop ecosystem whose responsibility is to import and export data from relational databases. Sqoop was developed to transfer data from Hadoop to RDBMS and vice versa, transforming the data into Hadoop without further development.

At some point during the analysis process, we will need data stored in relational banks. We can collect data from a non-structured source, a Social Network, and structured sales data from the relational bank and record both in Hadoop, apply some analysis, and try to find out the relationship between the company’s interaction in Social Networks with the sales process. After all, we create a predictive model to help managers and decision-makers of the company.

Sqoop allows us to move data from traditional databases like Microsoft SQL Server or Oracle to Hadoop. You can import individual tables or entire databases into HDFS, and the developer can determine which columns or rows will be imported.

We can manipulate how Apache Sqoop will import or export the data from Apache HDFS. Sqoop uses the JDBC connection to connect relational databases and can directly create tables in Apache Hive and supports incremental import — if we forget to import some part of a table or the table has grown since the last do an incremental import.

5. Apache Pig

Apache Pig is a tool that is used to analyze large data sets that represent data flows. We can perform all data manipulation operations on Hadoop using Apache Pig.

To write data analysis programs, Pig offers a high-level language known as Pig Latin. This language provides several operators that programmers can use to create their own reading, writing, and processing data functions.

To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All of these scripts are converted internally to mapping and reduction tasks. Apache Pig has a component known as Pig Engine that accepts Pig Latin scripts as input and converts those scripts to Jobs MapReduce. We have 2 components of Apache Pig:

  • Pig Latin Script Language is a procedural language of data flow and contains syntax and commands applied to implement business logic.
  • Runtime Engine, which is the compiler that produces sequences of MapReduce programs, uses HDFS to store and fetch data, is used to interact with Hadoop systems, and validates and compiles scripts in Jobs MapReduce sequences.

6. Apache HBase — NoSQL key-value database

Apache HBase is one of the most impressive products in the Hadoop system. This is a non-relational NoSQL database designed to work with large data sets on Hadoop HDFS. This is a type of NoSQL database that uses the key-value model. A respective byte-array key identifies each value; in addition to that, tables do not have schemas, which is a fairly common characteristic in relational databases.

The goal of HBase is to store huge tables with billions of records. When working with a relational database, it is widespread to have a table with multiple columns or multiple tables with a smaller number of columns, but that relates. With HBase, we have only one table, usually with 3 to 4 columns and Billions of records — not ideal for any project, but for projects where it requires this type of access with fewer variables and a vast number of records HDFS.

It takes advantage of the fault tolerance provided by the Hadoop file system (HDFS), being one of the characteristics of ecosystem products. We don’t have to worry about fault tolerance, as this has already been implemented in HDFS, so it’s a product that will run on the HDFS, and we’ll only worry about the features and features of the accessory product.

HBase Node Architecture

1.Node Master

Only a Master node can be executed. ZooKeeper maintains high availability. He is responsible for managing cluster operations, such as assignment, load balancing, and splitting. It is not part of a reading and writing operation.

2.Node Region Server

We can have one or more. It is responsible for storing tables, performing reads, and writing buffers. The Client communicates with the RegionServer to process read and write operations.

We have an architecture very similar to Hadoop architecture, where we have the Master and the Slave. In the case of HBase, the difference is subtle, but in general terms, it’s as if we have the Master doing the management and the Slave.

HBase vs. RDBMS

In HBase, partitioning is automatic; already in a database of the category RDBMS, the partitioning can be automated or manual — performed by the administrator.

HBase can scale it linearly and automatically with new nodes; if the Cluster is no longer supporting processing capacity, we will add new nodes to the Cluster to gain horizontal scalability with machines. In the case of RDBMS, scalability is vertical with the addition of more hardware to the server. In this case, we have a single server to add more hardware, more memory, disk space, more processing, etc.

HBase uses commodity hardware, the same characteristic as a Hadoop cluster. RDBMS requires more robust and, therefore, more expensive hardware.

The HBase has fault tolerance; in the case of RDBMS, fault tolerance may be present. Soon, some issues will be resolved with relational databases, while other issues, especially those related to Big Data, will be fixed with Apache Hbase.

7. Apache Flume — Massive source data collection for HDFS

Apache Flume is a service that works in a distributed environment to collect efficiently, aggregate, and move large amounts of data, with a flexible and straightforward streaming architecture based on data — Twitter, how to collect network data and bring it to HDFS? Apache Flume is an option.

Flume is for when we need to bring data from different sources in real-time to Hadoop. Flume is a service that basically allows you to send data directly to Hadoop HDFS.

Flume’s data model allows it to be used in online analytics applications. We can have social networks, Facebook, Twitter, server logs, or any other data source on the left side. We collect this data with Apache Flume, record it in Apache HDFS to eventually apply MapReduce, or even use Apache HBase and overall architecture.

We set up a Big Data infrastructure to store data and then process it. Apache Flume is designed to bring and store the data in a distributed environment.

8. Apache Mahout — central data flow repository

At some point, we will need to apply Machine Learning, today’s leading technology. Machine Learning allows us to perform predictive modeling, predicting through historical data and automating the process.

Apache Mahout is an open-source library of machine learning algorithms, scalable and focused on clustering, classification, and recommendation systems. Therefore, when we need to apply Machine Learning to a large data set stored in Hadoop, Apache Mahout may be an option.

Suppose we need to use high-performance machine learning algorithms, an open-source and free solution. In that case, we have a large data set (Big Data), we use analytics tools like R and Python, process batch data, and a mature library on the market, then Mahout can meet our needs.

9. Apache Kafka — central data flow repository

Apache Kafka manages real-time data flows generated from websites, applications, and IoT sensors. This agent application is a central system that collects high-volume data and makes it available in real-time for other applications.

Producers are the data sources that produce the data, and consumers are the applications that will consume the data. All of this can be done in real-time, while Apache Kafka collects data made from clicks, logs, stock quotes and delivers them to the consumer to process and apply machine learning, and then the data is discarded.

We generate an amount of data never seen before, and this only tends to increase exponentially. Kafka proposes to analyze data in real-time rather than storing it — it no longer makes sense to talk only about data stored in tables, with rows and columns. The volume of data is now so large that the data needs to be seen as what it really is: a constant stream that needs to be analyzed in real-time.

In other words, we collect the data from Twitter, forward the data through some application, analyze the set, deliver the result and let the data go away, that is, analyze in real-time without having to store the data. It makes no sense to store the data because it is already data from the past.

Kafka must evolve a lot because we have no way to store so much data, and in some cases, it makes no sense for the business’s goal to store and analyze data from a recent past.

And there we have it. I hope you have found this helpful. Thank you for reading.