
Most Famous Video Games, from each Genre


TABLE OF CONTENTS

  1. Aim of the project
  2. Procedure
  3. Fixed criteria
  4. Analysing
  5. Scaling up
  6. Results
  7. Personal expectations
  8. Appendix

AIM OF THE PROJECT

To determine the most ‘famous’ video game from each selected genre. Additionally, if there is enough time, a second dimension, region, will be added, giving the most famous game of a genre/category for each geographic region.


PROCEDURE

Deciding criteria

  1. The genres/categories of games to be used as class variables will be fixed.
  2. Famous games from each of these genres will be selected.
  3. (if time permits) The regions will be determined.

Analysis

Once 1. and 2. from above are fixed, the analysis consists of two steps:

  1. Filter the raw data (preprocessing)
  2. Obtain the mention frequency of each of those games, to determine the top winner of each category


FIXED CRITERIA

List of genres and examples of their sub-genres

Genre-subgenre table
*Adapted from this list.
Some of the sub-genres were omitted, as I suspected that there would be a high degree of overlap between them and other genres. It was important to create this temporary division into sub-genres in order to choose games uniformly from each sub-genre, so as to avoid a parent genre accidentally receiving games from only a select number of its sub-genres. In other words, this was done to avoid accidental cherry-picking within each genre.

List of games, by their genre

I chose to pick exactly 2 (or 3 in rare, exceptional cases) games for each sub-genre. The final picks to be tested can be found in this table.
The chosen games may be heavily biased towards a specific platform (PC/console), as some genres have exclusives. But this is still acceptable.

List of regions decided

If time permits, the division of the regions would be as follows:

  1. North America
  2. Latin America
  3. Europe
  4. Asia-Pacific
  5. Oceania

*Adapted from the standing division systems used by major e-sports events.


ANALYSING

Preliminaries

The Spark context was reconfigured to work better with the WARC files, as follows:

// Override default settings in order to use WARC
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.jwat.warc.WarcRecord   // the WarcRecord class to be registered with Kryo

    val sparkConf = new SparkConf()
                      .setAppName("RUBigDataApp")
                      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                      .registerKryoClasses(Array(classOf[WarcRecord]))
    //                  .set("spark.dynamicAllocation.enabled", "true")
    implicit val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._

The WARC file was loaded using:

//Initialize and load WARC file
    //Use "hdfs:///cc-index-subset" for the index subset instead
    import de.l3s.concatgz.io.warc.{WarcGzInputFormat, WarcWritable}   // WARC input format for Hadoop
    import org.apache.hadoop.io.NullWritable

    val warcfile_name = "hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-00613.warc.gz"
    val warcs = sc.newAPIHadoopFile(
                  warcfile_name,
                  classOf[WarcGzInputFormat],             // InputFormat
                  classOf[NullWritable],                  // Key
                  classOf[WarcWritable]                   // Value
                )

The compiled standalone app uses a single segment of a Common Crawl to test on, and that is what is used here right now. Later on (when the code runs without issues and gives reasonable results), that file will be replaced by the actual complete crawl, to scale the query up. When that happens, the standalone app will be passed on to the gold cluster queue. Until then, I will work with the bronze, and then silver, queues only.

Basic filtering

// Preliminary filtering
val filt_warcs = warcs.
                   map( _._2 ).                                                                // ⚫1
                   filter( _.isValid() ).                                                      // ⚫2
                   filter( _.getRecord.getHeader.getHeaderValue("WARC-Type") == "response" ).  // ⚫3
                   filter( wr => {
                     val contentType: String = wr.getRecord.getHttpHeaders.get("Content-Type") // ⚫4
                     contentType match {
                       case null => false
                       case  _   => contentType.startsWith("text/html")                        // ⚫5
                     }
                   })

⚫1:- Keeping only the second element of each (key, value) pair, the WarcWritable, as that contains the data we need for this purpose.
⚫2:- Filtering out invalid content.
⚫3:- Keeping only the records of type ‘response’, i.e. those that reply with data to a query ‘request’.
⚫4:- Reading the Content-Type field of the HTTP header.
⚫5:- I chose to work only with HTML content (converted to plain text later) for the counting part.

Advanced pre-processing

After getting the object filt_warcs into a clean and friendly format, it was put through another processing step that further moulded it to my exact needs before analysis. This produced the object analyze_warcs. I used the Jsoup library to parse the HTML content of each page, converting it into plain strings (RDD[String]) in the end.

// Use Jsoup to work with wr.getRecord.getHttpStringBody()
import org.jsoup.Jsoup

val analyze_warcs = filt_warcs.
                      filter( _.getRecord.getHttpStringBody != null ).
                      map( wr =>
                        try {
                          val http_body = Jsoup.parse(wr.getRecord.getHttpStringBody).body()
                          val body_text = if (http_body != null) http_body.text() else null
                          if (body_text != null) body_text else ""
                        } catch {
                          case _: Throwable => ""   // e.g. a NullPointerException on a malformed page
                        }
                      ).
                      filter( _ != "" ).
                      map( _.replaceAll("<([^>]*)>", "") ).                // remove leftover html-tags
                      map( _.replaceAll("[^\\x00-\\x7F]", "") ).           // remove non-ASCII
                      map( _.replaceAll("[^a-zA-Z\\d:'\\s\\.]", "") ).     // remove non-essential chars
                      map( _.replaceAll("(\\s|\\.|'|:)(?=\\1)", "") ).     // collapse repeated non-alphanumerics
                      cache()                                              // Final object must persist
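
As a quick sanity check (purely illustrative, not part of the original pipeline), the first few cleaned bodies can be inspected before the expensive counting starts:

// Peek at the first 200 characters of three cleaned page bodies
    analyze_warcs.take(3).foreach( body => println(body.take(200)) )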



Dictionary initialization for frequency

A hard-coded dictionary (in the form of a mutable Map) was initialized. It acts as a histogram of the counts of mentions of each of the games, alongside the regexes created for matching them.
The simple code can be found here for any curious reference.
Using this method, the dictionary object was neatly initialized into freq_dict:

// Initializing the dictionary
val freq_dict = init_dictionary()
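
The exact code behind init_dictionary is linked in the Appendix; a minimal sketch of its shape (with illustrative regexes and only a few of the 44 games, not the actual entries) could look as follows:

// Sketch of init_dictionary; entries are illustrative, not the full list
    import scala.collection.mutable
    import scala.util.matching.Regex

    def init_dictionary(): mutable.Map[String, (Regex, Int)] =
      mutable.Map(
        "LeagueOfLegends" -> ("""League of Legends|LoL\s""".r, 0),
        "FIFA"            -> ("""FIFA \d""".r,                 0),
        "Zork"            -> ("""Zork\s""".r,                  0)
      )

Each count starts at 0; the analysis loop further below replaces each tuple with (null, final count).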



Possible regexes for matching game names

Reducing false-negatives

A lot of games are referred to in different ways. While some people use the full game name, others may pick one of a list of popular acronyms/initialisms for it. For example, League of Legends is often referred to as ‘LOL’ or ‘LoL’. This means I had to define a matching regex for each game.

Additionally, there may be different versions of a game (FIFA 19, FIFA 20, …). Hence, for the sake of brevity, only the base franchise (= FIFA) will be counted. Luckily, most such games use the base-game name and add a version next to it, simplifying this use case.

The games and their regexes that I defined can be found in this table again.

Reducing false positives

Some common abbreviations for a game (such as ‘pop’ for ‘Prince of Persia’) were not included, as these would otherwise be common English words anyway. The benefit of excluding them (avoiding a sheer number of false positives) far outweighs the drawback (missing a few genuine mentions).

Some of the single-word regexes had an extra space (‘\\s’) appended, to make sure they are not detected as part of another word (at its start or end).

Some more game-specific optimizations were made, such as requiring a number (‘\\d’) next to “FIFA” to distinguish between the game and a mention of the association.
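
To illustrate these tricks (with hypothetical patterns, not the exact ones from the linked table):

// Illustrative regexes; the actual patterns are in the linked table
    val lol_regex  = """League of Legends|LoL\s|LOL\s""".r  // trailing '\s' keeps 'LoL' from matching inside longer words
    val fifa_regex = """FIFA \d""".r                        // requiring a digit separates the game from the association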

Counting occurrences

Using Regex.findAllMatchIn(string) and flatMap, I could easily get the number of occurrences of each game’s regex.
This creates a job for each game (44 jobs in total), and all the executors work in parallel on each WARC entry (line of text).

// Actual analysis
for ( (gameID, (regex, freq)) <- freq_dict ) {
    freq_dict(gameID) = (
                        null,                   // The regex is not needed anymore, releasing space ASAP
                        analyze_warcs.
                          flatMap( regex.findAllMatchIn(_) ).   // Flattening a sequence of iterators
                          count().                              // count() returns a Long...
                          toInt                                 // ...converted to Int for the histogram
                        )
}

prettyPrint_winners( freq_dict.map{ case (k, v) => (k, v._2) } )    // Discarding the 'nulled' column before passing on
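
The actual prettyPrint_winners method is linked in the Appendix. A minimal sketch of what it does, assuming a hypothetical lookup genreOf: Map[String, String] from gameID to genre is in scope, might be:

// Sketch of prettyPrint_winners; `genreOf` is a hypothetical gameID -> genre map
    def prettyPrint_winners(freqs: scala.collection.Map[String, Int]): Unit =
      freqs.groupBy { case (gameID, _) => genreOf(gameID) }
           .foreach { case (genre, games) =>
             val (winner, mentions) = games.maxBy(_._2)
             println(f"$genre%-12s $winner%-24s $mentions%,d")
           }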




SCALING UP

Initially, when I wanted to scale up from a single WARC file, I chose all the files (the whole folder: 610 of them). A single job (game) took ~13.6 hours. Extrapolating this to the full task would mean running it for ~25 days straight, longer than the vacation. So, instead, I opted for about a third of them (~260 WARC files), using the following file-matching pattern given to the Hadoop API. This brought the expected run-time down to a little over 6 days.

val warcfile_name = //"hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-00613.warc.gz"

                    "hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-000**.warc.gz,"+
                    "hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-001**.warc.gz,"+
                    "hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0021*.warc.gz,"+
                    "hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0022*.warc.gz,"+
                    "hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0023*.warc.gz,"+
                    "hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0024*.warc.gz,"+
                    "hdfs:///single-warc-segment/CC-MAIN-20210410105831-20210410135831-0025*.warc.gz"




RESULTS

CATEGORY      WINNER               # REFERENCES
Action        Super Smash Bros.         327,892
Adventure     Zork                        9,249
RPG           Final Fantasy          25,359,941
Simulation    Sims                       75,076
Strategy      Civilization              830,256
Sports        Forza                      74,872

The screenshot from the Spark Web UI.


PERSONAL EXPECTATIONS

I had not had much experience with the winner of ‘Action’, Super Smash Bros. I had expected Doom, Halo, or Minecraft to win, as they have a long history.
For the ‘Adventure’ genre, I had only played Mirror’s Edge and What Remains of Edith Finch, and had not heard of Zork. But after some research, I came to know that this game has a very long history indeed.
For ‘RPG’ I was expecting Final Fantasy to win, and it did. The close contenders would have to be League of Legends and RuneScape, and (although not shown in the intermediate results) they indeed were. The sheer difference in references from the rest of the winners sets it apart as well, as the champion amongst champions.
For ‘Simulation’ I had wanted it to be Trackmania, but given its history, I also kind of knew that it would be Sims.
For ‘Strategy’, I had expected Age of Empires (one of my first games ever) or DotA. I did not expect it to be Civilization, although it is quite popular.
For ‘Sports’, I had again expected one of the classics like Need for Speed or FIFA, but it turned out to be Forza.

I did not have enough time available to continue to the second (further) analysis using geographic regions.


APPENDIX

  1. Full source code
  2. init_dictionary method
  3. prettyPrint_winners method
  4. Helpful command chains
  5. Huygens ssh login script template
  6. My scala.nanorc file for better syntax highlighting
  7. ToC from Markdown