hello-hadoop-2021-LahaLuhem created by GitHub Classroom
Downloading the jar was done by using:
cd /opt/hadoop
wget https://raw.githubusercontent.com/rubigdata-dockerhub/hadoop-dockerfile/master/WordCount.jar
hdfs dfs -put WordCount.jar input
and the rest of the compilation steps worked without any errors.
The old output directory was deleted from HDFS using
hadoop fs -ls
hadoop fs -rm -r output
and then ran hadoop jar wc.jar WordCount input output
The output was copied to the local file system using hadoop fs -copyToLocal output (easier than the previous approach).
According to the current WordCount.jar, all occurrences of every unique word were counted (and finally emitted).
On closer inspection, it was noted that the StringTokenizer was run with its defaults, tokenizing (and matching) on whitespace, i.e. whole words only. To change what is matched, the additional constructor argument for the delimiter can be used: for example, lines can be counted by passing '\n' as the delimiter. The same key Text("Number of lines: ") was supplied to the emitter every time, so that the reducer could aggregate (see the sketch after the code).
The final map method looked like:
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
  // tokenize with '\n' as the delimiter, so each token corresponds to a line
  StringTokenizer itr = new StringTokenizer(value.toString(), "\n");
  Text singleText = new Text("Number of lines: ");
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());       // the token itself is no longer emitted
    context.write(singleText, one);  // identical key for every line
    //context.write(word, one);      // original
  }
}
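The aggregation itself happens in the reducer. Assuming the jar uses the standard WordCount sum reducer (a sketch of the usual Hadoop example, not taken from this repo), every '1' emitted under the same key is summed:

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();            // add up all the ones emitted for this key
    }
    result.set(sum);
    context.write(key, result);    // e.g. "Number of lines: " <total>
  }
}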
The Java file was recompiled to class files, the class files were packaged into a jar, and the jar was submitted as a new job.
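Concretely, that amounts to the same commands collected at the end of this post:

hadoop com.sun.tools.javac.Main WordCount.java && jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount input output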
Number of lines: 130778
A similar procedure was followed for counting the number of words, i.e., using an identical key every time, with the tokenizer back on its default (whitespace) delimiters:
StringTokenizer itr = new StringTokenizer(value.toString());
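For completeness, the surrounding loop would look just like the line-count version (a sketch under that assumption):

StringTokenizer itr = new StringTokenizer(value.toString());  // default: whitespace
Text singleText = new Text("Number of words: ");
while (itr.hasMoreTokens()) {
  itr.nextToken();                 // consume the token; only the count matters
  context.write(singleText, one);
}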
Number of words: 962240
For the number of characters: StringTokenizer does not support regexes, so String.split() was used instead.
// StringTokenizer itr = new StringTokenizer(value.toString(), "");  // does not work: an empty delimiter set leaves the whole line as one token
Text singleText = new Text("Number of characters: ");
// split on every non-whitespace character; this yields roughly one array element per such character (split drops trailing empty strings)
for (String tok : value.toString().split("\\S")) {
  word.set(tok);                   // tok itself is not used
  context.write(singleText, one);
}
Number of characters: 3683179
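A simpler per-line alternative (just a sketch, not the submitted solution) is to strip the whitespace and emit the remaining length once, letting the sum reducer add up the per-line totals:

// inside the same map method
Text singleText = new Text("Number of characters: ");
int chars = value.toString().replaceAll("\\s", "").length();  // non-whitespace characters in this line
context.write(singleText, new IntWritable(chars));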
For the Romeo vs. Juliet occurrences, a regex that keeps the matched word in the split result is needed, to avoid a second traversal. This was achieved by using a 'look-ahead':
Text romeoText = new Text("Romeo occurrences: ");
Text julietText = new Text("Juliet occurrences: ");
// the look-ahead leaves the matched delimiter at the beginning of each split string
String[] splits = value.toString().split("(?=Romeo|Juliet)");
for (String tok : splits) {
  if (tok.startsWith("Romeo")) {
    context.write(romeoText, one);
  }
  else if (tok.startsWith("Juliet")) {
    context.write(julietText, one);
  }
}
Juliet occurrences: 71
Romeo occurrences: 123
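An alternative sketch (not the submitted code) that sidesteps the split() edge cases entirely is java.util.regex.Matcher, counting the matches directly:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// inside the same map method, with romeoText/julietText as above
Matcher m = Pattern.compile("Romeo|Juliet").matcher(value.toString());
while (m.find()) {
  context.write(m.group().equals("Romeo") ? romeoText : julietText, one);  // emit under the matched name's key
}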
In total, the following bunch of commands can be used (even for later assignments):
<starting everything> sudo /opt/ssh/sbin/sshd && start-dfs.sh && start-yarn.sh
<executable readying> hadoop com.sun.tools.javac.Main WordCount.java && jar cf wc.jar WordCount*.class
<running the job> hadoop jar wc.jar WordCount input output
<outputting and clearing storage> hadoop fs -copyToLocal output && cat output/* && hdfs dfs -rm -r output && rm -r output
<stopping everything> stop-dfs.sh && stop-yarn.sh && exit