
Single mapper job
Single mapper jobs are used in transformation use cases: when we only need to change the format or shape of the input data, one record at a time, this pattern applies.


Now, let's look at a complete example of a single-mapper-only job. For this, we will simply output the cityID and temperature columns from the temperatures.csv file seen earlier.
The following is the code:
package io.somethinglikethis;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class SingleMapper
{
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "City Temperature Job");
        job.setJarByClass(SingleMapper.class);
        job.setMapperClass(TemperatureMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    /*
    Input format (temperatures.csv):
    Date,Id,Temperature
    2018-01-01,1,21
    2018-01-01,2,22
    */
    public static class TemperatureMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String txt = value.toString();
            String[] tokens = txt.split(",");
            String date = tokens[0];          // the date column is not used in this job
            String id = tokens[1].trim();
            String temperature = tokens[2].trim();
            // Skip the CSV header row, then emit (cityID, temperature) for every data row
            if (temperature.compareTo("Temperature") != 0)
                context.write(new Text(id), new IntWritable(Integer.parseInt(temperature)));
        }
    }
}
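Notice that the driver never calls job.setReducerClass(...). In that case Hadoop falls back to its default identity reducer, which simply passes every mapper output record through unchanged; this is also why the output file later in this section is named part-r-00000. If you want to skip the shuffle and reduce phase entirely, you can configure a true map-only job by setting the number of reduce tasks to zero. The following is a minimal sketch of such a variant (the class name SingleMapperNoReduce is hypothetical and not part of the original example); it reuses the TemperatureMapper defined above and writes its output directly to part-m-* files:
// Hypothetical variant of the driver above: a true map-only job.
// With zero reduce tasks there is no shuffle, sort, or identity reduce;
// the mapper output is written straight to part-m-* files.
package io.somethinglikethis;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleMapperNoReduce {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "City Temperature Job (map-only)");
        job.setJarByClass(SingleMapperNoReduce.class);
        job.setMapperClass(SingleMapper.TemperatureMapper.class);
        job.setNumReduceTasks(0);   // disable the reduce phase entirely
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}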
To execute this job, create a Maven project in your favorite editor and edit its pom.xml to look like the following:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <packaging>jar</packaging>

  <groupId>io.somethinglikethis</groupId>
  <artifactId>mapreduce</artifactId>
  <version>1.0-SNAPSHOT</version>

  <name>mapreduce</name>
  <!-- FIXME change it to the project's website -->
  <url>http://somethinglikethis.io</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>3.1.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>3.1.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.1.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <finalName>uber-${project.artifactId}-${project.version}</finalName>
          <transformers>
            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
          </transformers>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
                <exclude>META-INF/LICENSE*</exclude>
                <exclude>license/*</exclude>
              </excludes>
            </filter>
          </filters>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
Once you have the code, you can use Maven to build the shaded/fat .jar, as follows:
Moogie:mapreduce sridharalla$ mvn clean compile package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building mapreduce 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ mapreduce ---
[INFO] Deleting /Users/sridharalla/git/mapreduce/target
.......
............
You should see an uber-mapreduce-1.0-SNAPSHOT.jar file in the target directory; now we are ready to execute the job.
Make sure that the local Hadoop cluster, as seen in Chapter 1, Introduction to Hadoop, is started, and that you are able to browse to http://localhost:9870.
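If you are not sure whether the cluster is running, one quick check (assuming a standard local installation) is to list the running Java processes with the JDK's jps tool; you should see daemons such as NameNode and DataNode, plus ResourceManager and NodeManager if YARN is running:
jps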
To execute the job, we will use the Hadoop binaries and the fat .jar we just built, as shown in the following code:
export PATH=$PATH:/Users/sridharalla/hadoop-3.1.0/bin
hdfs dfs -chmod -R 777 /user/normal
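If the temperatures.csv file from the earlier section is not already in HDFS, upload it first; the local file name and path here are assumptions, so adjust them to wherever you saved the file:
hdfs dfs -mkdir -p /user/normal
hdfs dfs -put temperatures.csv /user/normal/temperatures.csv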
Now, run the command, as shown in the following code:
hadoop jar target/uber-mapreduce-1.0-SNAPSHOT.jar io.somethinglikethis.SingleMapper /user/normal/temperatures.csv /user/normal/output/SingleMapper
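Note that MapReduce will not overwrite an existing output directory; if you re-run the job, delete the previous output first:
hdfs dfs -rm -r /user/normal/output/SingleMapper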
The job will run, and you should be able to see output as shown in the following code:
Moogie:target sridharalla$ hadoop jar uber-mapreduce-1.0-SNAPSHOT.jar io.somethinglikethis.SingleMapper /user/normal/temperatures.csv /user/normal/output/SingleMapper
2018-05-20 18:38:01,399 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-05-20 18:38:02,248 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
......
Pay particular attention to the output counters:
Map-Reduce Framework
Map input records=28
Map output records=27
Map output bytes=162
Map output materialized bytes=222
Input split bytes=115
Combine input records=0
Combine output records=0
Reduce input groups=6
Reduce shuffle bytes=222
Reduce input records=27
Reduce output records=27
Spilled Records=54
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=13
Total committed heap usage (bytes)=1084227584
This shows that 27 records were output from the mapper (28 input records minus the CSV header row). Because we never set a reducer class, Hadoop ran its default identity reducer, which passed every mapper output record through unchanged, so all 27 records appear 1:1 in the reduce output. You can check this using the HDFS browser: go to http://localhost:9870 and navigate into the output directory under /user/normal/output, as shown in the following screenshot:

Figure: Screenshot showing how to check output from output directory
Now find the SingleMapper folder and go into this directory as shown in the following screenshot:

Figure: Screenshot showing SingleMapper folder
Going further down into this SingleMapper folder:

Figure: Screenshot showing further down in the SingleMapper folder
Finally, click on the part-r-00000 file seen in the following screenshot:

Figure: Screenshot showing the file to be selected
You will see a screen showing the file properties as seen in the following screenshot:

Figure: Screenshot showing the file properties
Using the head/tail option seen in the preceding screenshot, you can view the content of the file, as shown in the following screenshot:

Figure: Screenshot showing content of the file
This shows that the SingleMapper job simply writes each row's cityID and temperature without performing any calculation.
You can also use the command line to view the contents of the output:
hdfs dfs -cat /user/normal/output/SingleMapper/part-r-00000
The output file contents are shown in the following code:
1 25
1 21
1 23
1 19
1 23
2 20
2 22
2 27
2 24
2 26
3 21
3 25
3 22
3 25
3 23
4 21
4 26
4 23
4 24
4 22
5 18
5 24
5 22
5 25
5 24
6 22
6 22
This concludes the SingleMapper job execution, and the output is as expected.