The goal is to determine the most ‘famous’ video game from each selected genre. Additionally, if there is enough time, a second dimension, region, will be added, giving the most famous game of each genre/category in each geographic region.
After 1. and 2. from above are determined, the first step is to divide each genre into sub-genres, as in the table below.
Genre-subgenre table
*Adapted from this list.
Some of the sub-genres were omitted, as I suspected that there would be a high degree of overlap between them and other genres. It was important to create this division into sub-genres temporarily, in order to choose games uniformly from each sub-genre, and so avoid accidentally biasing a parent genre by drawing its games from only a selective few of its sub-genres. In other words, this was done to avoid accidental cherry-picking within each genre.
I chose to pick exactly 2 (or 3 in rare, exceptional cases) games for each sub-genre. The final picks to be tested on can be found in this table.
The games that were chosen may be heavily biased towards a specific platform (PC/console), as some genres have exclusives, but this is still acceptable.
If time permits, the division of the regions will be as follows:
*Adapted from the standing division systems used by major e-sports events.
The Spark context was reconfigured to work better with the WARC files, as follows:
// Override default settings in order to use WARC
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.jwat.warc.WarcRecord // assumption: the JWAT WarcRecord wrapped by the course's warcutils

val sparkConf = new SparkConf()
  .setAppName("RUBigDataApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[WarcRecord]))
  // .set("spark.dynamicAllocation.enabled", "true")

implicit val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
The WARC file was loaded using:
// Initialize and load WARC file
// Use s"hdfs:///cc-index-subset"
import org.apache.hadoop.io.NullWritable
import nl.surfsara.warcutils.{WarcGzInputFormat, WarcWritable} // assumption: SURFsara's warcutils, as in the course setup

val warcfile_name = s"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-00613.warc.gz"
val warcs = sc.newAPIHadoopFile(
  warcfile_name,
  classOf[WarcGzInputFormat], // InputFormat
  classOf[NullWritable],      // Key
  classOf[WarcWritable]       // Value
)
The compiled standalone app uses a single segment of a Common Crawl to test on, and that is what is used here right now. Later on (when the code runs without issues and gives reasonable results), that file will be replaced by the actual complete crawl, to scale the query up. When that happens, the standalone app will be passed onto the gold cluster queue. Until then, I will work with the bronze and silver queues only.
// Preliminary filtering out
val filt_warcs = warcs.
  map( wr => wr._2 ).                                                         // ⚫1
  filter( _.isValid() ).                                                      // ⚫2
  filter( _.getRecord.getHeader.getHeaderValue("WARC-Type") == "response" ).  // ⚫3
  filter( wr => {
    val contentType: String = wr.getRecord.getHttpHeaders.get("Content-Type") // ⚫4
    contentType match {
      case null => false
      case _    => contentType.startsWith("text/html")                        // ⚫5
    }
  })
⚫1:- Projecting onto the second element of each (key, value) pair, the WarcWritable, as that contains the data we can use for this specific purpose.
⚫2:- Filtering out invalid records.
⚫3:- Keeping only the records whose WARC-Type is ‘response’, i.e. those replying with data to a query ‘request’.
⚫4:- Using the Content-Type field of the HTTP header.
⚫5:- I chose to work only with plain text (in the form of HTML content) for the counting part.
After getting the object filt_warcs into a clean and friendly format, it was put through another processing step that further molded it to my exact needs before analysis, producing the object analyze_warcs. I used the Jsoup library to parse the HTML content of each page, and converted it to a pure String type (RDD[String]) in the end.
// Use Jsoup to work with wr.getRecord.getHttpStringBody()...
import org.jsoup.Jsoup

val analyze_warcs = filt_warcs.
  filter( _.getRecord.getHttpStringBody != null ).
  map( wr =>
    try {
      val http_body = Jsoup.parse(wr.getRecord.getHttpStringBody).body()
      if (http_body != null) http_body.text() else ""
    } catch {
      case _: Throwable => "" // malformed HTML, NullPointerExceptions, etc.
    }
  ).
  filter( _ != "" ).
  map( _.replaceAll("<([^>]*)>", "") ).            // remove leftover full html-tags
  map( _.replaceAll("[^\\x00-\\x7F]", "") ).       // remove non-ASCII
  map( _.replaceAll("[^a-zA-Z\\d:'\\s\\.]", "") ). // remove non-essential chars
  map( _.replaceAll("(\\s|\\.|'|:)(?=\\1)", "") ). // collapse repeated non-alphanumeric chars
  cache() // Final object must persist, as every counting job reuses it
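To make the effect of this cleaning chain concrete, here is a small, self-contained trace of the four replaceAll steps on a made-up string (the input and the intermediate results in the comments are purely illustrative):

```scala
// Illustrative only: tracing the replaceAll chain on a sample string
val sample = "Héllo <b>world</b>!!  FIFA   21:: rocks..."
val cleaned = sample
  .replaceAll("<([^>]*)>", "")            // drop tags      -> "Héllo world!!  FIFA   21:: rocks..."
  .replaceAll("[^\\x00-\\x7F]", "")       // drop non-ASCII -> "Hllo world!!  FIFA   21:: rocks..."
  .replaceAll("[^a-zA-Z\\d:'\\s\\.]", "") // drop '!'       -> "Hllo world  FIFA   21:: rocks..."
  .replaceAll("(\\s|\\.|'|:)(?=\\1)", "") // collapse runs  -> "Hllo world FIFA 21: rocks."
println(cleaned)
```

The last regex collapses runs of whitespace, dots, apostrophes and colons by deleting every such character that is immediately followed by an identical one.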
A hard-coded dictionary (in the form of a mutable Map) was initialized. This acts as a histogram of the counts of mentions of each of the games, along with the regexes created for matching them.
The simple code can be found here for any curious reference.
Using this method, the dictionary object was neatly initialized into freq_dict:
// Initializing the dictionary
val freq_dict = init_dictionary()
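For reference, here is a minimal sketch of the shape such an initializer could take; the two entries below are illustrative stand-ins, and the real regexes and all 44 games live in the linked code and table:

```scala
import scala.collection.mutable
import scala.util.matching.Regex

// Hypothetical sketch: each game ID maps to (matching regex, running count)
def init_dictionary(): mutable.Map[String, (Regex, Int)] = mutable.Map(
  "League of Legends" -> ("""League of Legends|LoL\s""".r, 0),
  "Final Fantasy"     -> ("""Final Fantasy""".r, 0)
  // ... one entry per game, 44 in total
)
```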
A lot of games have different kinds of references to them. While some people write out the full game name, others may choose one from a list of popular acronyms/initialisms for it. As an example, League of Legends is often referred to as ‘LOL’ or ‘LoL’. This means I had to define matching regexes for each game.
Additionally, there may be different versions of a game (FIFA 19, FIFA 20, …). Hence, for the sake of brevity, only the base franchise (= FIFA) is counted. Luckily, most such games use the base-game name and add a version next to it, which simplifies this use case.
The games and the regexes that I defined for them can, again, be found in this table.
Some common abbreviations for a game (such as ‘pop’ for Prince of Persia) were not included, as these would otherwise be common English words anyway. The benefit of excluding them (avoiding a sheer number of false positives) far outweighs the drawback of missing some genuine mentions.
Some of the single-word regexes had an extra space (‘\s’) appended, so that they are not detected as part of another (longer) word.
Some more game-specific optimizations were made, such as requiring a number (‘\d’) next to “FIFA” to distinguish the game from mentions of the association, as illustrated below.
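The snippet below illustrates these two tricks on a made-up sentence; the regexes are simplified stand-ins for the real ones in the table:

```scala
import scala.util.matching.Regex

val lolRegex: Regex  = """LoL\s""".r     // trailing \s: 'LoL' must end a word
val fifaRegex: Regex = """FIFA\s?\d""".r // a digit must follow, so the bare association name does not match

val text = "I play LoL daily. FIFA 21 is out, says FIFA."
println(lolRegex.findAllMatchIn(text).size)  // 1
println(fifaRegex.findAllMatchIn(text).size) // 1 (matches 'FIFA 2...', not the trailing bare 'FIFA')
```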
Using Regex.findAllMatchIn(string) and flatMap, I could easily get the number of occurrences of each game’s regex.
This creates a job for each game (44 jobs in total), with all the executors working in parallel on each WARC entry (line of text).
// Actual analysis
for ( (gameID, (regex, _)) <- freq_dict ) {
  freq_dict(gameID) = (
    null, // The regex is not needed anymore; releasing space ASAP
    analyze_warcs.
      flatMap( regex.findAllMatchIn(_) ). // Flattening a sequence of match iterators
      count().
      toInt // count() returns a Long, so convert it down to an Int
  )
}
prettyPrint_winners( freq_dict.map{ case (k, v) => (k, v._2) } ) // Discarding the 'nulled' regex column before passing on
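prettyPrint_winners itself is defined in the linked code; the sketch below only suggests what it could look like, assuming a hypothetical gameID → genre lookup (genre_of) that is not part of the original code:

```scala
// Hypothetical sketch: pick the game with the most references per genre and print it
def prettyPrint_winners(counts: scala.collection.Map[String, Int]): Unit = {
  val genre_of: Map[String, String] = Map( // illustrative lookup, not the real one
    "Final Fantasy" -> "RPG",
    "Civilization"  -> "Strategy"
  )
  counts.
    groupBy{ case (game, _) => genre_of.getOrElse(game, "Unknown") }.
    map{ case (genre, games) => (genre, games.maxBy(_._2)) }.
    foreach{ case (genre, (game, refs)) => println(f"$genre%-10s $game%-20s $refs%,d") }
}
```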
Initially, when I wanted to scale up from a single WARC file, I chose all the files in the folder (610 of them). A single job (game) took ~13.6 hours. Extrapolating this to the full task (44 jobs × ~13.6 hours ≈ 598 hours) would mean running it for ~25 days straight, longer than the vacation. So instead, I opted for roughly a third of them (~260 WARC files), using the following file-matching pattern passed to the Hadoop API. This brought the expected run-time down to a little over 6 days.
val warcfile_name = //"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-00613.warc.gz"
"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-000**.warc.gz,"+
"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-001**.warc.gz,"+
"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0021*.warc.gz,"+
"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0022*.warc.gz,"+
"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0023*.warc.gz,"+
"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0024*.warc.gz,"+
"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0025*.warc.gz"
| CATEGORY | WINNER | # REFERENCES |
|---|---|---|
| Action | Super Smash Bros. | 327,892 |
| Adventure | Zork | 9,249 |
| RPG | Final Fantasy | 25,359,941 |
| Simulation | Sims | 75,076 |
| Strategy | Civilization | 830,256 |
| Sports | Forza | 74,872 |
A screenshot from the Spark WebUI:
I had not had much experience with the winner of ‘Action’, Super Smash Bros. I had expected Doom, Halo or Minecraft to win, as they have had a long history.
For the ‘Adventure’ genre, I had only played Mirror’s Edge and What Remains of Edith Finch, and had not heard of Zork. But after some research, I came to know that this game has had a very long history indeed.
For ‘RPG’ I was expecting Final Fantasy to win, and it did. The close contenders would have to be League of Legends and RuneScape, and (although not shown in the intermediate results) they indeed were. The sheer difference in references from the rest of the winners sets it apart as well, as the champion amongst champions.
For ‘Simulation’ I had wanted it to be Trackmania, but given the history, I also kind of knew that it would be the Sims.
For ‘Strategy’, I had Age of Empires (one of my first games ever) and DotA in mind. I did not expect it to be Civilization, although it is quite popular.
For ‘Sports’, I had again expected one of the classics like Need for Speed or FIFA, but it turned out to be Forza.
I did not have enough time available to continue on to the second, further analysis using geographic locations.