Assignments for Big Data: Assignment 4, Assignment 3B3 and Assignment 2
Assignment 4
by Stan van Lier
For this assignment I did two things. I started with a test to see whether I could load some content from the WARC files and do some calculations with it. After that I built an inverted file index of all words consisting only of letters. I did this together with Sietse Mooren; see here for Part 2.
Part 1. Domain popularity by links from unique domains
I made a program that calculates a kind of popularity measure by counting the number of domains (including subdomains) that link to a specific domain. I later also added two columns that show which domain has the most pages containing links to that specific domain, and the corresponding number of pages.
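To make the idea concrete, here is a minimal Scala sketch on a few made-up link pairs (hypothetical domains, not taken from the crawl): every linking host counts at most once per target, and self-links are ignored.
// Toy illustration of the popularity measure (hypothetical domains).
// Each pair is (host the page lives on, host it links to).
val links = Seq(
  ("blog.example.org", "news.example.com"),
  ("blog.example.org", "news.example.com"), // a second page on the same host counts only once
  ("shop.example.net", "news.example.com"),
  ("news.example.com", "news.example.com")  // self-link, ignored
)
val popularity = links
  .distinct                                 // one vote per linking host
  .filter { case (from, to) => from != to } // drop self-links
  .groupBy { case (_, to) => to }
  .mapValues(_.size)
// popularity("news.example.com") == 2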
First we initialize the Spark variables we need and load all the WARC files on the cluster. After this, warcf is an RDD filled with WarcRecord instances.
import org.apache.hadoop.io.LongWritable
import org.apache.spark.{SparkConf, SparkContext}
// WarcInputFormat and WarcRecord come from the WARC reading library available on the cluster

val conf = new SparkConf().setAppName("BackLinkCounter")
val sc = new SparkContext(conf)

val warcfile = "/data/public/common-crawl/crawl-data/CC-MAIN-2016-07/segments/*/warc/*.warc.gz"
val warcf = sc.newAPIHadoopFile(
  warcfile,
  classOf[WarcInputFormat], // InputFormat
  classOf[LongWritable],    // Key
  classOf[WarcRecord]       // Value
)
We define a function to extract the information we need from a given WarcRecord, so we can use it in a map over the warcf RDD. This function returns, for the page content contained in the given WarcRecord, a list of all domains that are linked to.
import java.io.IOException
import java.net.URI
import org.jsoup.Jsoup

def getHrefHosts(record: WarcRecord): Array[String] = {
  val cLen = record.header.contentLength.toInt
  val cStream = record.getPayloadContent()
  val content = new java.io.ByteArrayOutputStream()
  val buf = new Array[Byte](cLen)
  var nRead = cStream.read(buf)
  while (nRead != -1) {
    content.write(buf, 0, nRead)
    nRead = cStream.read(buf)
  }
  cStream.close()
  try {
    Jsoup
      .parse(content.toString("UTF-8"))
      .select("a[href]")
      .eachAttr("href")
      .toArray
      .distinct
      .map( v =>
        try { new URI(v.toString).getHost().replaceFirst("www\\.", "") }
        catch { case e: Exception => null }
      )
      .filter(_ != null)
      .distinct
  }
  catch {
    case e: Exception => throw new IOException("Caught exception processing input row ", e)
  }
}
Now we filter the warcf RDD so we end up with only the records we are interested in: the HTTP responses. After this we set the domain of each WarcRecord as the key of the RDD, so we later know where the links come from. Then we use our function to replace the WarcRecord by the list of all unique domains that are linked to from this WarcRecord. By using flatMap this list gets flattened, so we get an RDD of tuples (domain, domain linked to from the key domain). We are only interested in whether a domain links to another domain, not in how many of its pages do so, hence the distinct. We are also not interested in domains linking to themselves, so we filter those out too. Finally we map every domain being linked to onto the value 1, so a reduceByKey can sum how many distinct domains link to it.
Format strings make it easy to print nice-looking tables with the println function.
println("%-48s | %-16s".format(" Most links from unique hosts", " Nr of hosts"))
warcf
.filter{ a => a._2.header != null && a._2.getHttpHeader() != null }
.filter{ a => a._2.header.warcTypeIdx == 2 && a._2.header.warcTargetUriUri != null }
.filter{ a => a._2.getHttpHeader().contentType != null && a._2.getHttpHeader().statusCode == 200 }
.filter{ _._2.getHttpHeader().contentType.startsWith("text/html") }
.map{ wr => ( wr._2.header.warcTargetUriUri.getHost.replaceFirst("www\\.",""), wr._2 ) }
.flatMapValues{ v => getHrefHosts(v) }
.distinct
.filter{ a => a._1 != a._2 }
.map{ a => (a._2, 1) }
.reduceByKey{ _ + _ }
.sortBy( _._2, ascending = false)
.take(100)
.foreach{ case (k, v) => println("%-48s | %16d |".format(k, v)) }
We edit the code a bit so we can also see which host has the most pages with links to these most popular hosts. Instead of doing a distinct after the flatMap, we first count the number of occurrences of each combination (host linked to, host linked from); this equals the number of pages on the “from” host that contain a link to the “to” host. After this we make the key just the “to” host and do a reduce in which we sum the number of distinct linking hosts and keep only the linking host that has the most pages with a link to this “to” host.
println("%-48s | %-16s | %-48s | %-16s ".format(" Most links from unique hosts", " Nr of hosts", " Host with most pages with a link", " Nr of pages"))
warcf
.filter{ a => a._2.header != null && a._2.getHttpHeader() != null }
.filter{ a => a._2.header.warcTypeIdx == 2 && a._2.header.warcTargetUriUri != null }
.filter{ a => a._2.getHttpHeader().contentType != null && a._2.getHttpHeader().statusCode == 200 }
.filter{ _._2.getHttpHeader().contentType.startsWith("text/html") }
.map{ wr => ( wr._2.header.warcTargetUriUri.getHost.replaceFirst("www\\.",""), wr._2 ) }
.flatMapValues{ v => getHrefHosts(v) }
// .distinct
.filter{ a => a._1 != a._2 }
.map{ a => ((a._2,a._1),1) }
.reduceByKey{ _ + _ }
.map{ a => (a._1._1, (1, (a._1._2, a._2))) }
.reduceByKey{ case ((count1, host1), (count2, host2))
=> ( count1 + count2, if (host2._2 > host1._2) host2 else host1 ) }
.sortBy( _._2._1, ascending = false)
.take(100)
.foreach{ case (k, (c, (h, t))) => println("%-48s | %16d | %-48s | %16d".format(k, c, h, t)) }
The top results, sorted by this popularity measure, are shown below. You can see that these are indeed the famous domains. But the domains with the most pages that link to them are usually not famous at all, for example popsugar.com. It is unclear why this domain has so many pages with links to many of the famous domains. It may be that the links are in some kind of share bar that appears on every page, and that the Common Crawl crawler somehow crawled many duplicates of every page on popsugar.com.
Most links from unique hosts | Nr of hosts | Host with most pages with a link | Nr of pages
facebook.com | 6595548 | popsugar.com | 1401573
twitter.com | 5630776 | popsugar.com | 1401477
blogger.com | 2918127 | althouse.blogspot.com | 246860
plus.google.com | 2219580 | popsugar.com | 1235829
youtube.com | 2190185 | popsugar.com | 1291409
pinterest.com | 1602991 | popsugar.com | 1290824
instagram.com | 1337743 | popsugar.com | 1292977
4.bp.blogspot.com | 1273056 | sonsofstevegarvey.com | 144492
1.bp.blogspot.com | 1271229 | cakewrecks.blogspot.com | 71140
3.bp.blogspot.com | 1271095 | cakewrecks.blogspot.com | 70811
2.bp.blogspot.com | 1270863 | cakewrecks.blogspot.com | 73222
linkedin.com | 1022450 | cnet.com | 943059
wordpress.com | 996546 | blackamericaweb.com | 155079
google.com | 897449 | popsugar.com | 1290662
wordpress.org | 803283 | profiles.wordpress.org | 191975
en.wikipedia.org | 657487 | wikidata.org | 707017
amazon.com | 586295 | dpreview.com | 758586
gravatar.com | 576902 | variety.com | 279157
flickr.com | 569060 | eol.org | 711941
deviantart.com | 517495 | forum.deviantart.com | 12591
portfolio.deviantart.com | 508694 | deviantart.com | 164608
sta.sh | 507747 | deviantart.com | 164607
forum.deviantart.com | 507716 | deviantart.com | 164607
help.deviantart.com | 507601 | deviantart.com | 164607
about.deviantart.com | 507597 | deviantart.com | 164607
shop.deviantart.com | 507593 | deviantart.com | 164607
chat.deviantart.com | 507585 | deviantart.com | 164607
wallpaper.deviantart.com | 507580 | deviantart.com | 164607
welcome.deviantart.com | 507569 | deviantart.com | 164607
groups.deviantart.com | 507566 | deviantart.com | 164607
deviantart.theresumator.com | 507564 | deviantart.com | 164604
critiques.deviantart.com | 507558 | deviantart.com | 164607
l.deviantart.com | 507558 | deviantart.com | 164607
polyvore.com | 499512 | heels.com | 34848
blog.polyvore.com | 487540 | polyvore.com | 482960
tumblr.com | 484666 | popsugar.com | 1097858
comments.deviantart.com | 452580 | forum.deviantart.com | 12366
reddit.com | 448654 | engadget.com | 831893
stumbleupon.com | 442205 | dictionary.cambridge.org | 459831
maps.google.com | 399837 | google.com | 674721
livejournal.com | 389352 | ohnotheydidnt.livejournal.com | 89189
istockphoto.com | 385308 | gettyimages.com | 335128
statcounter.com | 364467 | onthesnow.com | 351338
vimeo.com | 330543 | dpreview.com | 758586
add.my.yahoo.com | 327603 | althouse.blogspot.com | 246860
feeds.feedburner.com | 325390 | threadless.com | 625906
bit.ly | 319351 | urbandictionary.com | 1427473
itunes.apple.com | 317194 | popsugar.com | 1290900
netvibes.com | 298714 | althouse.blogspot.com | 246860
adobe.com | 290237 | behance.net | 484770
addthis.com | 282213 | urbandictionary.com | 1123298
advertising.deviantart.com | 277703 | deviantart.com | 164040
feedproxy.google.com | 239322 | tmz.com | 459037
creativecommons.org | 237147 | stackoverflow.com | 1051596
etsy.com | 233342 | pinterest.com | 98576
http | 230390 | linncountyleader.com | 225294
youtu.be | 228614 | dpreview.com | 445031
myspace.com | 211624 | dailypuppy.com | 364175
nytimes.com | 204683 | iherb.com | 579764
t.co | 204362 | threadless.com | 625081
play.google.com | 181957 | popsugar.com | 1290472
validator.w3.org | 177953 | forums.poz.com | 443116
goo.gl | 177318 | jpcycles.com | 475792
feedjit.com | 171697 | macca-central.com | 51392
linkwithin.com | 162316 | twifans.com | 213327
sedo.com | 159331 | mdelections.org | 347911
docs.google.com | 150809 | chowhound.com | 351807
sedoparking.com | 149926 | mdelections.org | 347911
digg.com | 149680 | cyclonefanatic.com | 308409
sites.google.com | 145690 | hfboards.hockeysfuture.com | 361640
histats.com | 145648 | unitedbimmer.com | 183619
technorati.com | 143798 | cboard.cprogramming.com | 299715
imdb.com | 138602 | boxofficemojo.com | 882144
whoisprivacyprotect.com | 136540 | vitamindeal.com | 55822
alexa.com | 136197 | amazon.com | 365745
soundcloud.com | 135638 | popmatters.com | 316213
bbc.co.uk | 133665 | news.bbc.co.uk | 495347
huffingtonpost.com | 124229 | justjared.com | 506067
nbcnews.com | 123758 | today.com | 115780
github.com | 117873 | typekit.com | 394087
farfetch.com | 117301 | polyvore.com | 105639
yelp.com | 115872 | trusnow.com | 213183
web.archive.org | 114842 | en.wikipedia.org | 217572
feedburner.google.com | 114352 | drownedinsound.com | 183599
jigsaw.w3.org | 109548 | craftster.org | 272788
washingtonpost.com | 109210 | heraldnet.com | 813243
microsoft.com | 108837 | store.kobobooks.com | 299660
net-a-porter.com | 108445 | mrporter.com | 248843
newsvine.com | 107869 | nbcnews.com | 160216
nbcuni.com | 106686 | cnbc.com | 421916
weebly.com | 105738 | thepausemenu.weebly.com | 596
guardian.co.uk | 105653 | joyofsox.blogspot.com | 174741
godaddy.com | 103992 | righthondaparts.com | 150754
tinyurl.com | 101859 | ncnewspress.com | 234840
shop.nordstrom.com | 100695 | polyvore.com | 114200
bloglovin.com | 100482 | temptalia.com | 160783
theoutnet.com | 95066 | polyvore.com | 71420
paypal.com | 93630 | delmarfans.com | 529487
blackeyedpeas.com | 93131 | mobile17.com | 326
fergieshoes.com | 92668 | redbookmag.com | 50
fergie.blackeyedpeas.com | 92662 | blackeyedpeas.com | 16043
will-i-am.com | 92616 | blackeyedpeas.com | 16043
brothersinart.com | 92580 | blackeyedpeas.com | 16034
fergie.fanfire.com | 92562 | blackeyedpeas.com | 7
blackeyedpeas.cinderblock.com | 92561 | blackeyedpeas.com | 8
blog.newsvine.com | 92482 | newsvine.com | 45612
support.newsvine.com | 92462 | newsvine.com | 45612
secure.newsvine.com | 92441 | newsvine.com | 45612
bbb.org | 92151 | packersproshop.com | 909699
apple.com | 91686 | thestreet.com | 658842
quantcast.com | 91203 | yardbarker.com | 684150
get.adobe.com | 91107 | pandora.com | 935283
del.icio.us | 90889 | cyclonefanatic.com | 308409
amazon.co.uk | 86668 | amazon.com | 365770
us.topshop.com | 84155 | us.topman.com | 151341
yahoo.com | 81813 | groups.yahoo.com | 517218
slideshare.net | 81091 | icpsr.umich.edu | 228808
addtoany.com | 80794 | www2.ljworld.com | 200921
es.wikipedia.org | 80537 | wikidata.org | 169420
cnn.com | 80060 | money.cnn.com | 268458
heidi.deviantart.com | 79598 | deviantart.com | 14937
picasaweb.google.com | 79536 | beloit.edu | 245921
online.wsj.com | 79428 | blogs.wsj.com | 541957
npr.org | 77752 | iherb.com | 579764
fav.me | 77746 | forum.deviantart.com | 1263
bandcamp.com | 77120 | hypnos.com | 2954
news.bbc.co.uk | 76579 | statto.com | 68552
issuu.com | 76407 | durangoherald.com | 634710
mywot.com | 75480 | websitelooker.net | 9119
delicious.com | 75014 | metalstorm.net | 432050
forbes.com | 74666 | igourmet.com | 193962
geocities.com | 73818 | sirstevesguide.com | 53491
hm.com | 72888 | lookbook.nu | 55777
telegraph.co.uk | 72439 | blogs.telegraph.co.uk | 93292
scribd.com | 71798 | es.scribd.com | 166148
news.google.com | 71669 | google.com | 673645
lh3.googleusercontent.com | 71157 | weightweenies.starbike.com | 12387
mediafire.com | 70325 | ohhla.com | 119858
tripadvisor.com | 69669 | lasvegas.com | 593735
f1.bcbits.com | 68951 | grindcorekaraoke.com | 361
usatoday.com | 68525 | kffl.com | 240218
eepurl.com | 68086 | beeradvocate.com | 888244
modcloth.com | 67760 | polyvore.com | 17195
dailymail.co.uk | 67198 | jezebel.com | 42202
asos.com | 67141 | lookbook.nu | 45129
es.wordpress.com | 66064 | berkutmartin.mlblogs.com | 7177
goodreads.com | 65637 | amazon.com | 365745
networkedblogs.com | 64630 | kosherscene.wordpress.com | 30799
reuters.com | 63631 | business.gmu.edu | 123010
forever21.com | 63205 | lookbook.nu | 52496
stylebop.com | 63159 | polyvore.com | 65931
topshop.com | 62807 | lookbook.nu | 46999
vk.com | 62173 | suicidegirls.com | 329820
mytheresa.com | 62128 | polyvore.com | 60259
wired.com | 61616 | arstechnica.com | 458689
groups.google.com | 61462 | rubygems.org | 279527
latimes.com | 61418 | articles.latimes.com | 281385
ncbi.nlm.nih.gov | 61218 | origene.com | 118820
sourceforge.net | 60902 | kmeleon.sourceforge.net | 55642
pbs.org | 60568 | idahoptv.org | 591225
styleinsider.polyvore.com | 59871 | gaburrus.polyvore.com | 14
abcnews.go.com | 59479 | pjmedia.com | 37710
macromedia.com | 59201 | rockport.com | 256219
dailymotion.com | 57774 | musee-orsay.fr | 41240
lh5.googleusercontent.com | 57634 | parisbreakfasts.blogspot.com | 7351
lh6.googleusercontent.com | 57430 | weightweenies.starbike.com | 12382
en.support.wordpress.com | 57386 | en.forums.wordpress.com | 87065
lh4.googleusercontent.com | 57341 | parisbreakfasts.blogspot.com | 7410
4shared.com | 57077 | search.4shared.com | 13760
chrome.google.com | 56921 | weather.weatherbug.com | 298855
siteadvisor.com | 56557 | searchbug.com | 19142
wp.me | 56263 | ksnt.com | 93789
avgthreatlabs.com | 56176 | websitelooker.net | 9119
mozilla.org | 56023 | shapeways.com | 403539
newlook.com | 55974 | lookbook.nu | 17706
news.yahoo.com | 55643 | groups.yahoo.com | 517218
gmail.com | 55399 | youthsoccertalk.blogspot.com | 29515
johnlewis.com | 54919 | polyvore.com | 94320
barneys.com | 54839 | polyvore.com | 59938
webstatsdomain.org | 54545 | yummylittlecooks.com | 335
photos1.blogger.com | 54454 | michiganzone.blogspot.com | 9425
sb-ssl.google.com | 54228 | domainreferralsite.com | 240
bloomberg.com | 53825 | wtok.com | 104715
cbsnews.com | 53056 | cbs.com | 97060
time.com | 52610 | content.time.com | 235998
theguardian.com | 52103 | scoop.it | 47323
windows.microsoft.com | 51895 | thestreet.com | 658842
top100.rambler.ru | 51543 | en.trend.az | 16377
books.google.com | 50926 | en.wikipedia.org | 183323
cdc.gov | 50872 | ndep.nih.gov | 193019
msnbc.msn.com | 50705 | wdsu.com | 61246
neimanmarcus.com | 50341 | polyvore.com | 102452
translate.google.com | 50197 | iherb.com | 579764
houseoffraser.co.uk | 49950 | polyvore.com | 27518
mail.google.com | 49683 | google.com | 674546
polldaddy.com | 49349 | hollywoodlife.com | 90147
code.google.com | 48954 | murraysworld.com | 211714
foxnews.com | 48440 | tmz.com | 459037
groups.yahoo.com | 48105 | finance.yahoo.com | 471071
bloglines.com | 48008 | althouse.blogspot.com | 246860
id.wordpress.com | 47918 | omkicau.com | 2437
whitehouse.gov | 47895 | portal.hud.gov | 653499
shopbop.com | 47749 | amazon.com | 365744
matchesfashion.com | 47728 | polyvore.com | 46516
zappos.com | 47159 | 6pm.com | 822512
upload.wikimedia.org | 47151 | commons.wikimedia.org | 20682
fusion.google.com | 47022 | hotair.com | 190743
riverisland.com | 46711 | polyvore.com | 9167
gnu.org | 46633 | directory.fsf.org | 365632
jcrew.com | 46315 | polyvore.com | 13314
nelly.com | 45844 | polyvore.com | 19509
theatlantic.com | 45723 | govexec.com | 57874
adtrack.ministerial5.com | 45623 | members.tripod.com | 7163
barnesandnoble.com | 45536 | fmylife.com | 303105
target.com | 45276 | rockband4.com | 214793
boston.com | 45221 | bostonglobe.com | 348560
top.mail.ru | 45218 | en.trend.az | 16377
afternic.com | 45192 | formapuraliving.com | 88487
saksfifthavenue.com | 45122 | polyvore.com | 76859
examiner.com | 44813 | movieroomreviews.com | 73970
ebay.com | 44705 | stores.ebay.com | 312796
drive.google.com | 44519 | google.com | 674532
newgrounds.com | 44449 | psypokes.com | 734
Part 2. Inverted file index (made together with Sietse Mooren; see the link in the introduction)
Assignment 3B3 - PC Chairs of SIGIR 2018
by Stan van Lier
See Assingment3B3-StanvanLier.snb for the Spark notebook file, or the GitHub link. (All the code in the notebook is also in this blog post.)
The problem to solve
We have three tables. Our goal is to merge these tables into one with the most recent data. The columns that we want are “Name”, “Institute”, “Email”, “DBLP”, “Personal webpage” and “Topics of interest”. We call this new table “invitees_2018”.
committee.csv contains the most recent email addresses and institutions of the persons, and it has a column that contains the “Personal webpage” data we want.
For the “Topics of interest” column we need to think of a way to extract the data from committee_topic.csv so we can put it in our invitees_2018 table.
The data
SIGIR2017-invitees.csv
columns: Name, Affiliation, Email, DBLP, _c4, _c5, _c6, _c7, _c8, _c9, _c10, _c11, _c12, _c13, _c14, _c15, _c16, _c17, _c18, _c19, _c20, _c21, _c22, _c23, _c24
count: 281
The columns _c4 … _c24 do not contain any useful information for us, so we will not use them. In fact, they are all null values.
invitees_2018 will contain exactly the persons that are in this table, so we will use this table as our basis. Well, only the first four columns of course.
committee.csv
columns: #, person #, first name, last name, email, country, organization, Web page, role
count: 286
This table contains the most recent information, but it holds more persons than the SIGIR2017-invitees table. We don’t need all the persons from this table in our invitees_2018 table, so we have to be careful to use the right join later on (not literally a “right join”).
We will use the information from “email”, “organization” and “Web page” in the invitees_2018 table.
committee_topic.csv
columns: member #, member name, topic
count: 4098
This table contains many rows per person. Each row represents a topic of interest of a person. We need to use this information for our “Topics of interest” column in invitees_2018.
Step 1. Initialize SparkSession
This is needed for reading the csv files.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession
  .builder()
  .appName("A3b-spark-df")
  .getOrCreate()
Step 2. Reading csv files
We read the tables from the csv files. We only need the first four columns from SIGIR2017-invitees.csv so we select these before we save the table in a variable.
val invitees = spark.read
.format("csv")
.option("header", true)
.load("SIGIR2017-invitees.csv")
.select("Name","Affiliation","Email","DBLP")
val committee = spark.read
.format("csv")
.option("header", true)
.load("committee.csv")
val committee_topic = spark.read
.format("csv")
.option("header", true)
.load("committee_topic.csv")
Step 3. Joining the tables
We want to merge multiple tables. The easiest way to do this is by joining. To join two tables, both need to contain a column with corresponding values. The join then combines the columns of the two tables by matching up rows in which these columns hold the same value.
We want to take the invitees table as the basis and join the committee table onto it, because we only need information for the persons that are in the invitees table.
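As a tiny illustration with made-up data (not the assignment files): joining a second table onto a base table with the “leftouter” type keeps every row of the base table and fills the new columns with null where nothing matches.
import spark.implicits._

// Toy illustration with hypothetical data, not the assignment files.
val base  = Seq(("Ada Lovelace", "a@example.org"), ("Alan Turing", "t@example.org"))
  .toDF("Name", "Email")
val extra = Seq(("Ada Lovelace", "Analytical Engine Lab"))
  .toDF("fullName", "organization")

base
  .join(extra, extra("fullName") === base("Name"), "leftouter")
  .show()
// Alan Turing is kept, but his "organization" column is null.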
3.1 Concatenate first name and last name in committee
Before we can join the invitees table with committee, there has to be a column in each table with values that match. The column “Name” in the invitees table contains the full name of a person. The committee table has no such column, but it does have the columns “first name” and “last name”, which together contain the same information. So if we combine these two columns we get a column that represents the name in exactly the same format as in the invitees table. We do this by adding an extra column to the committee table, “fullName”, which concatenates the first name and the last name with a space in between.
import org.apache.spark.sql.functions._
import spark.implicits._

val committee_fullName = committee
  .withColumn("fullName",
    concat($"first name",
           lit(" "),
           $"last name"))
3.2 Join invitees with committee
Now we can join the invitees table with our new committee_fullName table. The type of the join needs to be “leftouter”. This means that the resulting table is based on the table the join operation is called on: it contains exactly the rows of that table. Every row whose “Name” matches a “fullName” value gets expanded with the columns from the other table. So a person who is in the invitees table but not in the committee table is still in the resulting table, but not the other way around. This is exactly what we want, so we use “leftouter” as the join type.
val invitees_committee = invitees
.join(committee_fullName,
committee_fullName("fullName") === invitees("Name"),
"leftouter")
3.3 Reduce the rows of committee_topic
In the committee_topic table a person has multiple rows, one row per topic of interest. We want to expand our invitees_2018 table with one column, “Topics of interest”, that contains all the topic information for that person. The only way to do this while keeping a single row per person in the result is to put an array in this new column. So we first need to reduce the committee_topic table so that we end up with one array per person containing all their topics.
val committee_topic_reduced = committee_topic
.groupBy("member #", "member name")
.agg(collect_list("topic")
.as("topics"))
3.4 Join the reduced committee_topic table
To be sure that we don’t miss any joins because of names that are spelled differently in the two tables, we use the “person #” and “member #” columns to join on. We got the “person #” column from the previous join with the committee table, so we might as well use it.
val fully_joined = invitees_committee
.join(committee_topic_reduced,
committee_topic_reduced("member #") === invitees_committee("person #"),
"leftouter")
Maybe you noticed that if you had first joined this table with the committee table, you would have gotten the column with the full name for free, and you could then have used that column for the join with the invitees table. This is not completely correct, because you would then rely on every person in the committee table having at least one topic of interest in the committee_topic table. That may well be true, but it takes extra work to check, and what do you do if it is not true? Then you would still need to fill in the missing full names by concatenating the first and last name.
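Such a check could look like this (a sketch using the column names above; the “left_anti” join type keeps only the rows of the left table that have no match in the right table):
// Sketch: committee members that do not appear in committee_topic at all.
// If this count is 0, the alternative join order would also have worked.
val membersWithoutTopics = committee_fullName
  .join(committee_topic_reduced,
        committee_topic_reduced("member #") === committee_fullName("person #"),
        "left_anti")
println(membersWithoutTopics.count())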
Step 4. Updating new information
The information in the columns that come from the committee table is more recent than the information in the columns from the invitees table. But the committee table does not contain all the persons that are in the invitees table, so there are some null values in the columns coming from it. This means we cannot simply use the most recent columns; we have to use the information from the committee table where it is present, and otherwise fall back on the information from the invitees table.
We do this by creating two new columns that contain the information that will be used in our invitees_2018 table. Each of these columns is filled with the value from the committee table if one is present, and otherwise with the value from the invitees table. This way the column contains the most recent information we have.
Before we can do this we have to be able to distinguish between the columns “Email” and “email”. By default Spark SQL is case insensitive, so we have to switch this setting to case sensitive.
spark.sql("set spark.sql.caseSensitive=true")
val invitees_updatedInfo = fully_joined
.withColumn("new_institute",
when($"organization".isNull, $"Affiliation")
.otherwise($"organization"))
.withColumn("new_email",
when($"email".isNull, $"Email")
.otherwise($"email"))
Step 5. Selecting only relevant columns
Ok, so now we have one table that contains all the information we need. The only remaining problem is that it also contains a lot of rubbish. Look at all the columns:
columns: Name, Affiliation, Email, DBLP, #, person #, first name, last name, email, country, organization, Web page, role, fullName, member #, member name, topics
count: 281
We only need “Name”, “new_institute”, “new_email”, “DBLP”, “Web page” and “topics” for our invitees_2018 table. To make it look really nice we also rename those columns so we have consistent names: “Name”, “Institute”, “Email”, “DBLP”, “Personal webpage” and “Topics of interest”. The count of the result is the same as that of SIGIR2017-invitees.csv, as expected, so we didn’t lose any persons by doing these operations.
val invitees_2018 = invitees_updatedInfo
.select("Name","new_institute","new_email","DBLP","Web page","topics")
.toDF("Name","Institute","Email","DBLP","Personal webpage","Topics of interest")
Now, invitees_2018 is ready to be used. It contains all the most recent information that we know of.
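To wrap up, a quick look at the result and one way to save it (a sketch; the output path is made up). Note that “Topics of interest” is an array column, so JSON or Parquet is a more natural output format than CSV, which cannot store array values directly.
// Inspect the first rows and write the table out (hypothetical output path).
invitees_2018.show(5, truncate = false)
invitees_2018.write.mode("overwrite").json("invitees_2018.json")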
Assignment 2 - Hello hadoop
by Stan van Lier
My first experience with Hadoop and MapReduce
I did everything in a docker container instance of the image from this command:
$ docker pull andypetrella/spark-notebook:0.7.0-scala-2.11.8-spark-2.1.0-hadoop-2.7.3-with-hive
What happens when you run the commands in the Standalone part of the tutorial?
$ mkdir input
Makes a directory to put the input data in.
$ cp etc/hadoop/*.xml input
Copies the Hadoop configuration files into the input directory to serve as input data.
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
Runs the grep example with the regular expression ‘dfs[a-z.]+’ on the data in the input directory and puts the results in the output directory.
$ cat output/*
Shows the result.
What is different in the Pseudo-Distributed case?
When you run a MapReduce program in pseudo-distributed mode it still runs on a single machine, but each Hadoop daemon runs in its own Java process, and the input and output go through HDFS instead of the local file system.
How do you use mapreduce to count the number of lines/words/characters/… in the Complete Shakespeare?
See CapsUsage.java for the source code (a modification of WordCount.java). I modified the StringTokenizer by adding more characters to the delimiter parameter:
StringTokenizer itr = new StringTokenizer(value.toString()," \t\n\r\f,.!?\"\':;(){}<>~`-_@\\/&[]");
This way the document is split on more characters than just spaces and newlines. Many words are attached to one or more of these characters, and without modifying the StringTokenizer every word stuck to a special character would become a unique key and be counted as if it were a different word.
I also converted all words to lowercase. This way a word at the beginning of a sentence gets the same key as all other occurrences of that word.
I wanted to do something that requires passing a tuple to the reducer. I came up with the percentage of a word’s occurrences that start with a capital letter (so names and words at the beginning of a sentence).
Maybe it’s not immediately clear why you need a tuple for this. The two things needed for the calculation are number_of_capital_occurrences and total_nr_of_occurrences. It is possible to compute these by emitting only a single value in the mapper: a 1 for words with a capital first letter and a 0 for words without. In the reducer you could then take the length of the Iterable values as total_nr_of_occurrences, and compute number_of_capital_occurrences by summing values. But the problem with this approach is that you rely on all the instances of 0 in values, so there cannot be a combiner between the mapper and the reducer. That may not be a big problem, since you can simply not use a combiner, but I think MapReduce is not meant to be used like this and the program would be less scalable.
So I think this is a suitable problem for the use of tuples. The key-value pair the mapper emits is word => (iscap, 1), where iscap is 1 only if the first letter of the word is uppercase.
In MapReduce it is not completely straightforward how to implement multiple values, because the value has to be a Writable. To implement this you can choose between three things: converting all the values to one big string, making your own custom Writable, or using the TupleWritable from org.apache.hadoop.mapreduce.lib.join. (The documentation of that class recommends not to use it for efficiency reasons, but to make a specific custom Writable instead.) I started by trying the latter anyway. It didn’t work; the only output I got for every word was [ , ]. After some debugging I found that the TupleWritable the mapper emits didn’t contain the values. Unfortunately I couldn’t find a way to fix this. Maybe the class wasn’t intended for this kind of usage? I don’t know, but I decided to go for the “quicker” solution: a string that contains the two values. The mapper now emits new Text(iscap+",1"), and the reducer splits this string on "," to get the two integers back.
public static class TokenizerMapper extends Mapper<Object, Text, Text, Text> {
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString(), " \t\n\r\f,.!?\"\':;(){}<>~`-_@\\/&[]", false);
    while (itr.hasMoreTokens()) {
      String word_str = itr.nextToken();
      word.set(word_str.toLowerCase());
      // iscap is 1 if the word starts with an uppercase letter, 0 otherwise
      int iscap = 0;
      if (Character.isUpperCase(word_str.charAt(0))) {
        iscap = 1;
      }
      // emit word => "iscap,1"
      context.write(word, new Text(iscap + ",1"));
    }
  }
}
public static class IntSumReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    int sumTotal = 0;
    int sumCapital = 0;
    for (Text val : values) {
      // each value is "iscap,1"
      String val_str = val.toString();
      String[] tuple = val_str.split(",");
      sumCapital += Integer.parseInt(tuple[0]);
      sumTotal += Integer.parseInt(tuple[1]);
    }
    // emit the percentage of occurrences with a capital first letter
    context.write(key, new Text((sumCapital * 100.0 / sumTotal) + "%"));
  }
}
Now it is possible to implement a combiner. The combiner also splits the string, sums the first values and the second values of the tuples, and emits the two sums converted back to the same string format.
public static class IntSumCombiner extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    int sumTotal = 0;
    int sumCapital = 0;
    for (Text val : values) {
      String val_str = val.toString();
      String[] tuple = val_str.split(",");
      sumCapital += Integer.parseInt(tuple[0]);
      sumTotal += Integer.parseInt(tuple[1]);
    }
    // emit partial sums in the same "capital,total" string format
    context.write(key, new Text(sumCapital + "," + sumTotal));
  }
}
public static void main(String[] args) throws Exception {
  ...
  job.setCombinerClass(IntSumCombiner.class);
  ...
}
The only other thing the reducer does differently is the emit: after summing the tuples it emits the result of the percentage calculation, number_of_capital_occurrences * 100 / total_nr_of_occurrences. Here is a part of the output:
...
yare 20.0%
yarely 0.0%
yarn 0.0%
yaughan 100.0%
yaw 0.0%
yawn 0.0%
yawning 0.0%
ycleped 0.0%
ycliped 0.0%
ye 7.534246575342466%
yea 68.75%
yead 100.0%
year 0.0%
yearly 40.0%
yearn 0.0%
yearns 0.0%
years 0.0%
yeas 0.0%
yeast 0.0%
yedward 100.0%
yell 0.0%
yellow 3.5714285714285716%
yellowed 0.0%
yellowing 0.0%
yellowness 0.0%
yellows 0.0%
yells 0.0%
yelping 0.0%
yeoman 0.0%
yeomen 0.0%
yerk 50.0%
yes 86.82926829268293%
yesterday 4.0%
yesterdays 0.0%
yesternight 0.0%
yesty 0.0%
yet 26.63284717818643%
yew 0.0%
yicld 0.0%
yield 10.48951048951049%
yielded 4.166666666666667%
...
For instance the word “yes” starts with a capital letter 86.8% of the time, so it is mostly used at the start of a sentence. Yaughan, Yead and Yedward always have an uppercase first letter, so these could be names.
Does Romeo or Juliet appear more often in the plays? Can you answer this question making only one pass over the corpus?
I made a little MapReduce program (see RvsJ.java) to count the number of occurrences of each; this is the output:
juliet 83
romeo 156
Romeo appears more often. This problem is almost the same as the word count problem, so it can also be answered in a single pass over the corpus.