Can a student learn Hadoop?


Beginners to big data generally fall into one of the following groups:

  • Career changers from other industries (this is the hardest case: you may never have touched programming, so you are starting from a true zero foundation)
  • Seniors who are about to graduate (from computer science or other majors; this case is a bit easier, since you picked up at least some programming skills in college)
  • Veterans with software development experience (including Java web, .NET, C, etc.)

The students in the above situations have one thing in common: they have zero foundation in big data. Relatively speaking, it is not very difficult for a veteran with software development experience to learn it. As long as you are willing to spend more time and effort than others, it is actually not as difficult as you think. Effort is rewarded!

Okay, let's get straight to it: here are some learning suggestions for big data beginners (for students in any of the three situations above).

What is big data?

Many friends have asked me what exactly big data is. To summarize in one sentence:

  • For friends in the non-software industry
    • Based on your usual consumer behavior in supermarkets, gas stations, restaurants and other places, big data technology can be used to infer your current age range, whether you are married, whether you have children, roughly how old the children are, whether you have a permanent address, and roughly what price range your car is in.
  • For friends in the software industry
    • The programs we usually write run on a single computer with limited processing power, so the amount of data they can handle is also limited. Big data technology distributes our code across many computers so that large amounts of data can be processed in parallel, and valuable, meaningful information can then be extracted from that data.

Basic knowledge for learning big data

1. Linux basics are essential; at a minimum you need to master the basic commands on the Linux command line

2. Java SE basics [including MySQL]; note that this is Java SE, not Java EE. Big data engineers do not need to know Java web development





Hadoop has grown into a very extensive product family that can meet the requirements of big data processing in a wide range of scenarios. As a mainstream big data processing technology, Hadoop underlies the big data business of many companies and offers very mature solutions for many scenarios.

As a developer, mastering Hadoop development and the frameworks in its ecosystem is the essential route into big data.

The following is a detailed introduction to the roadmap for learning Hadoop development technology.
Hadoop itself was developed in Java, so its support for Java is very good, but other languages can also be used.

The roadmap below leans toward data mining. Because Python is efficient to develop in, we use Python for the tasks.

Since Hadoop runs on a Linux system, knowledge of Linux is also required.


The first stage: Hadoop ecosystem architecture
Language basics

Java: Master Java SE; understand and practice memory management in the Java virtual machine, as well as multithreading, thread pools, design patterns and parallelization. In-depth mastery is not required.

Linux: system installation (command line interface and graphical interface), basic commands, network configuration, Vim editor, process management, shell scripts, familiarity with virtual machine menus, etc.

Python: basic syntax, data structures, functions, conditionals, loops, etc.

Environment preparation

This part covers building a fully distributed cluster with 1 master and 2 slaves on a Windows computer.

You need to prepare the VMware virtual machines, the Linux system (CentOS 6.5) and the Hadoop installation package, and then set up the fully distributed Hadoop cluster environment.


MapReduce

MapReduce, a distributed offline computing framework, is the core programming model of Hadoop. It is mainly suited to batch tasks on large clusters. Because it runs in batches, its results are not real-time.
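
To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, using word count as the task. It only imitates the model in plain Python; the function names (map_phase, shuffle, reduce_phase) are illustrative assumptions and are not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data is valuable"])
print(counts)
```

On a real cluster the map and reduce steps would run on many machines at once, with the framework handling the shuffle between them.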


HDFS1.0 / 2.0

The Hadoop Distributed File System (HDFS) is a highly fault-tolerant system designed to be deployed on inexpensive machines. HDFS offers high-throughput data access, which makes it very well suited for applications that work with large amounts of data.
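
The fault tolerance comes from splitting files into blocks and keeping several copies of each block on different machines. The sketch below illustrates that idea only. The 128 MB block size and replication factor of 3 are common Hadoop 2.x defaults (both are configurable), while the round-robin placement and the node names are simplifying assumptions; real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE_MB = 128   # common default in Hadoop 2.x; configurable
REPLICATION = 3       # common default replication factor; configurable

def plan_blocks(file_size_mb, datanodes):
    """Return {block_index: [datanodes holding a replica]}.

    Assumes len(datanodes) >= REPLICATION so replicas land on distinct nodes.
    """
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    plan = {}
    for block in range(num_blocks):
        # Simplified round-robin placement, for illustration only.
        plan[block] = [datanodes[(block + r) % len(datanodes)]
                       for r in range(REPLICATION)]
    return plan

plan = plan_blocks(300, ["node1", "node2", "node3", "node4"])
for block, nodes in plan.items():
    print(block, nodes)
```

Because every block lives on three different nodes in this plan, losing any single node still leaves two copies of each block.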


Yarn (Hadoop2.0)

A basic understanding is enough at the early stage. Yarn is a resource scheduling platform, primarily responsible for assigning resources to tasks. It is a general-purpose platform: any framework that meets its requirements can use Yarn for resource scheduling.
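
As a rough illustration of what assigning resources to tasks means, the toy scheduler below hands out memory on a first-fit basis. This is a sketch under stated assumptions, not how Yarn is implemented: the allocate function, the node names and the job names are all hypothetical, and real Yarn schedulers are considerably more elaborate.

```python
def allocate(nodes, requests):
    """First-fit allocation of container requests to nodes.

    nodes:    {node_name: free_memory_mb}
    requests: list of (app_name, memory_mb)
    Returns a list of (app_name, node_name or None) assignments.
    """
    free = dict(nodes)  # copy so the caller's dict is not modified
    assignments = []
    for app, memory in requests:
        # Pick the first node that still has enough free memory.
        chosen = next((n for n, m in free.items() if m >= memory), None)
        if chosen is not None:
            free[chosen] -= memory
        assignments.append((app, chosen))
    return assignments

nodes = {"slave1": 4096, "slave2": 2048}
requests = [("mapreduce-job", 3072), ("spark-job", 2048), ("storm-job", 2048)]
assignments = allocate(nodes, requests)
for app, node in assignments:
    print(app, "->", node)
```

The point of the example is that jobs from different frameworks compete for the same pool of cluster resources, and a request that does not fit has to wait.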


Hive

Hive is a data warehouse, and all of its data is stored in HDFS. Hive is mainly used by writing HQL, which is very similar to the SQL of a MySQL database. In fact, when Hive runs HQL, what executes underneath is still a MapReduce program.
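
To show the kind of statement involved, the sketch below runs a simple aggregation query. It uses Python's built-in sqlite3 module purely as a stand-in so that the example is runnable without a cluster; it does not connect to Hive, and the orders table and its columns are hypothetical. A basic SELECT ... GROUP BY of this shape looks much the same in HQL.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Beijing", 120.0), ("Shanghai", 80.0), ("Beijing", 30.0)],
)

query = """
    SELECT city, COUNT(*) AS order_count, SUM(amount) AS total
    FROM orders
    GROUP BY city
    ORDER BY city
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Conceptually, the GROUP BY corresponds to grouping by key, and the COUNT and SUM correspond to the reduce step of a MapReduce job.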



Spark

Spark is a fast, general-purpose computing engine designed for large-scale data processing. It performs iterative computation in memory. Spark retains the benefits of MapReduce while greatly improving timeliness.


Spark streaming

Spark Streaming is a real-time processing framework in which incoming data is processed in small batches.
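
The batching idea can be sketched in plain Python: events from a stream are grouped into consecutive time windows, and each window is then processed as one small batch. This is only an illustration of the concept and does not use the Spark API; the micro_batches function and the sample events are assumptions made for the example.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive time windows."""
    batches = {}
    for timestamp, value in events:
        window = int(timestamp // batch_interval)
        batches.setdefault(window, []).append(value)
    # Return the batches in time order.
    return [batches[w] for w in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = micro_batches(events, batch_interval=1.0)
for batch in batches:
    print(len(batch), batch)
```

A shorter batch interval gives fresher results at the cost of more, smaller jobs.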


Spark Hive

Fast SQL queries based on Spark. With Spark as Hive's computing engine, Hive queries are submitted as Spark tasks to the Spark cluster for computation, which can improve the performance of Hive queries.



Storm

Storm is a real-time computing framework. It differs from MapReduce (MR) in that MR processes huge amounts of offline data in batches, while Storm processes each newly added piece of data in real time, one at a time, which keeps data processing timely.



Zookeeper

Zookeeper is the foundation of many big data frameworks and acts as the manager of the cluster. It monitors the status of each node in the cluster and performs the next appropriate operation based on the feedback the nodes provide.

Ultimately, it provides users with an easy-to-use interface and a high-performance, stable system.



HBase

HBase is a NoSQL, key-value database that is highly reliable, column-oriented, scalable and distributed.

It is suitable for storing unstructured data, and its underlying data is stored in HDFS.
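
The key-value, column-oriented layout can be pictured as a row key that maps to a set of "family:qualifier" columns, where different rows need not share the same columns. The class below is an in-memory sketch of that layout only. ToyTable and its methods are hypothetical and are not the HBase client API; details such as cell timestamps and versions are left out.

```python
from collections import defaultdict

class ToyTable:
    """In-memory imitation of a row-key / column-family layout."""

    def __init__(self):
        # row key -> "family:qualifier" -> value
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start, stop):
        """Yield rows whose key is in [start, stop), in sorted key order."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, dict(self.rows[key])

table = ToyTable()
table.put("user#001", "info:name", "Li")
table.put("user#001", "info:city", "Beijing")
table.put("user#002", "info:name", "Wang")  # rows need not share columns

print(table.get("user#001"))
print(list(table.scan("user#001", "user#002")))
```

Lookups by row key and scans over a sorted key range are the access patterns this kind of store is built around, which is why row key design matters so much.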



Kafka

Kafka is message middleware. It is often used in real-time processing scenarios as an intermediate buffer layer.
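
The buffering role can be sketched with Python's standard queue module: a producer writes messages into a buffer at its own pace, and a consumer reads them at its own pace. This is only an analogy for the idea of a buffer layer and does not use Kafka or any Kafka client; the message names are made up for the example.

```python
import queue
import threading

buffer = queue.Queue(maxsize=100)   # stands in for a message topic
SENTINEL = None                     # marks the end of the demo stream

def producer(n):
    for i in range(n):
        buffer.put(f"event-{i}")    # blocks if the buffer is full
    buffer.put(SENTINEL)

def consumer(results):
    while True:
        message = buffer.get()      # blocks until a message arrives
        if message is SENTINEL:
            break
        results.append(message.upper())

results = []
threads = [threading.Thread(target=producer, args=(5,)),
           threading.Thread(target=consumer, args=(results,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Because the two sides only share the buffer, a burst of incoming data does not overwhelm the processing side; it simply waits in the queue.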



Flume

Flume is a log collection tool. It is commonly used to collect data from log files generated by applications. There are generally two flows.

In the first, Flume collects data and stores it in Kafka so that Storm or Spark Streaming can process it in real time.

In the second, the data that Flume collects is stored on HDFS for offline processing with Hadoop or Spark.


The second stage: data mining algorithms
Chinese word segmentation

Offline and online applications of open source word segmentation

Natural language processing

Text relevance algorithm
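
One common way to score text relevance is TF-IDF. The sketch below is a minimal version under stated assumptions: it splits on whitespace (Chinese text would first need the word segmentation step mentioned above), the smoothing term in the IDF is one of several common variants, and the sample documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            tf = counts[term] / len(tokens)
            idf = math.log(n_docs / df) + 1.0  # +1 keeps common terms nonzero
            score += tf * idf
        scores.append(score)
    return scores

docs = ["hadoop stores data in hdfs",
        "spark computes in memory",
        "hive queries data in hdfs with hql"]
scores = tf_idf_scores("hdfs data", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```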

Recommendation algorithm

Content-based (CB) and collaborative filtering (CF) approaches, normalization methods, Mahout applications.
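
As a small illustration of collaborative filtering, the sketch below implements a basic user-based variant: users are compared by the cosine similarity of their ratings, and unrated items are ranked by similarity-weighted ratings from other users. The ratings data and function names are hypothetical, and this is one simple formulation among many, not the way Mahout implements it.

```python
import math

ratings = {  # hypothetical user -> {item: rating}
    "alice": {"book": 5, "phone": 3, "laptop": 4},
    "bob":   {"book": 4, "phone": 3, "camera": 5},
    "carol": {"phone": 1, "laptop": 5},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend(user):
    """Rank items the user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, rating in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

recommended = recommend("carol")
print(recommended)
```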

Classification algorithm


Regression algorithm

LR, Decision Tree

Clustering algorithm

Hierarchical clustering, Kmeans
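
For reference, K-means alternates between two steps: assign every point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below is a minimal pure-Python version for 2-D points. The starting centroids are passed in explicitly so the run is repeatable, and the sample points are invented for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared distance).
            nearest = min(range(len(centroids)),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```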

Neural network and deep learning

NN, TensorFlow


The above is a detailed roadmap for learning Hadoop development. For reasons of space, only the function of each framework is listed and briefly explained.

After you have acquired the knowledge of the first stage, you can already take on work related to big data architecture and be responsible for some of the development and maintenance work in a company.

After learning the second-stage knowledge, you can move on to data mining, which is currently the most valuable job in the big data industry.