How to Process Big Data with MapReduce – Packt Hub


This week, Microsoft, AWS, Google, and Meta released the following announcements and updates. 

  • Microsoft Dataverse 
  • AWS Big Data 
  • Google Cloud 
  • Meta research 

Weekly Picks 

We’ve selected some interesting articles from the world of data for you. 

  • Build Complex Time Series Regression Pipelines with sktime – Using sktime, you can convert a time series forecasting problem into a regression problem. You will also learn how to build a complex time series forecaster with the popular XGBoost library. 
  • Computational Linear Algebra for Coders – This post focuses on one question: how do we do matrix computations with acceptable speed and acceptable accuracy? It is taught in Python with Jupyter Notebooks, using libraries such as scikit-learn and NumPy for most lessons, as well as Numba and PyTorch.

Tutorial of the Week 

How to Process Big Data with MapReduce 

Post credit: Sridhar Alla 

This section will focus on the practical use case of building an end-to-end pipeline to perform big data analytics. 

The MapReduce framework 

The MapReduce framework enables you to write distributed applications to process large amounts of data from a filesystem, such as a Hadoop Distributed File System (HDFS), in a reliable and fault-tolerant manner. 

An example of using a MapReduce job to count frequencies of words is shown in the following diagram: 

[Diagram: MapReduce word-count example] 

MapReduce uses YARN as a resource manager, which is shown in the following diagram: 

[Diagram: MapReduce running on the YARN resource manager] 

Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and partitioner. The output of the map tasks, called the intermediate keys and values, is sent to the reducers. The reduce tasks are broken into the following phases: shuffle, sort, reducer, and output format. Map tasks are optimally scheduled on the nodes where the data rests; this way, the data typically does not have to move over the network and can be processed on the local machine. 
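To make these phases concrete, here is a minimal, framework-free Java sketch (our own illustration, not code from the book or the Hadoop API) of the word-count flow: the map step emits intermediate (word, 1) pairs, the shuffle/sort step groups them by word, and the reduce step sums each group. 

import java.util.*;

public class WordCountFlowDemo {
    public static void main(String[] args) {
        // Stand-in for the record reader: two lines of input text.
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog and the fox");

        // "Map" phase: emit an intermediate (word, 1) pair for every word read.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // "Shuffle and sort" phase: group the intermediate values by key (word).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // "Reduce" phase: sum the counts for each word and write the result.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}

In a real cluster, Hadoop performs the grouping across machines during the shuffle, and the combiner can pre-sum counts on each mapper node to reduce network traffic. 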
 

MapReduce job types 

MapReduce jobs can be written in multiple ways, depending on the desired outcome. The fundamental structure of a MapReduce job is as follows: 

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.Map;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EnglishWordCounter {

    // Mapper: emits an intermediate (word, 1) pair for every word it reads.
    public static class WordMapper
            extends Mapper<Object, Text, Text, IntWritable> { … }

    // Reducer: sums the counts collected for each word.
    public static class CountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> { … }

    // Driver: configures and submits the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "English Word Counter");
        job.setJarByClass(EnglishWordCounter.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
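Before turning to the driver, here is what a mapper along the lines of WordMapper might look like. This is a sketch based on the standard Hadoop Mapper API, not the book's elided body; the class name is ours. 

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch of a word-count mapper body; not the book's exact code.
public class WordMapperSketch extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The record reader supplies one line per call; tokenize it and emit
        // an intermediate (word, 1) pair for every token.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}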

The purpose of the driver is to orchestrate the jobs. The first few lines of main() deal with the command-line arguments; then we set up the Job object by telling it which classes to use for the computation and which input and output paths to use.  
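For completeness, a reducer along the lines of CountReducer receives each word together with all of its 1s after the shuffle and sort phases and sums them. Again, this is our own sketch using the standard Hadoop Reducer API, not the book's elided body, and the class name is ours. 

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch of a word-count reducer body; not the book's exact code.
public class CountReducerSketch extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the intermediate counts for this word and write (word, total)
        // through the job's output format.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}

Because summing is associative, a reducer like this can usually also be registered as the combiner with job.setCombinerClass(...). Once the classes are packaged into a jar, the job is typically launched with a command of the form hadoop jar <your-jar> EnglishWordCounter <input-path> <output-path>. 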
This how-to was curated from the book Big Data Analytics with Hadoop 3. Explore big data more deeply by clicking the button below! 

 

Read the Book

Source: https://hub.packtpub.com/how-to-process-big-data-with-mapreduce/