Setup

Let’s walk you through webhoser with a simple use case: we’ll get articles and blog posts about the R programming language. But first, we need to authenticate: head over to webhose.io if you do not have an account yet. Your account gives you an API key, which is what you pass to the function below.

wh_token("xXX-x0X0xX0X-00X")

Get

Great, now that we have a token, we can fetch some articles. We’ll use some basic boolean search arguments and filters to narrow the search down to English-language blogs, since this is where we’d expect R to be mentioned. We can expect the number of articles about the R programming language to be relatively low, so we’ll extend the date range to make sure we get a decent number of results (the functions default to the last 3 days).

rstats <- wh_news(
    q = '"R programming" is_first:true language:english site_type:blogs',
    ts = (Sys.time() - (30 * 24 * 60 * 60)) # timestamp 30 days ago, in seconds
  ) 

Note that with the free version you are limited to 30 days of results.

webhose.io gives 1,000 results per month for free; wh_news and wh_broadcasts print the number of queries left by default.

Paginate

We can paginate through the results if we want more data; let’s get five more pages.

rstats_paginated <- rstats %>% 
  wh_paginate(5)
#> ▶ Crawling 2 page(s)

class(rstats_paginated)
#> [1] "webhoser"

So we fetched one page of results, then used that object to paginate through up to five more pages (the output above shows how many pages were actually crawled). However, this still returns an object of class webhoser rather than a plain data.frame; to extract the data we use wh_collect. The wh_collect function takes an argument, flatten, which defaults to FALSE.

  1. If TRUE the function attempts to flatten the results into a standard data.frame.
  2. If FALSE the function returns a nested data.frame.

rstats_df <- wh_collect(rstats_paginated)
class(rstats_df)
#> [1] "data.frame"

The above steps can be summed up as:

rstats <- wh_news(
    q = '"R programming" is_first:true language:english site_type:blogs',
    ts = (Sys.time() - (30 * 24 * 60 * 60))
  ) %>% 
  wh_paginate(5) %>% 
  wh_collect()

This returned 211 articles with a lot of variables.
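
We can quickly check the number of rows returned:

nrow(rstats)
#> [1] 211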

names(rstats)
#>  [1] "uuid"                             
#>  [2] "url"                              
#>  [3] "ord_in_thread"                    
#>  [4] "author"                           
#>  [5] "published"                        
#>  [6] "title"                            
#>  [7] "text"                             
#>  [8] "highlightText"                    
#>  [9] "highlightTitle"                   
#> [10] "language"                         
#> [11] "external_links"                   
#> [12] "external_images"                  
#> [13] "rating"                           
#> [14] "crawled"                          
#> [15] "thread.uuid"                      
#> [16] "thread.url"                       
#> [17] "thread.site_full"                 
#> [18] "thread.site"                      
#> [19] "thread.site_section"              
#> [20] "thread.site_categories"           
#> [21] "thread.section_title"             
#> [22] "thread.title"                     
#> [23] "thread.title_full"                
#> [24] "thread.published"                 
#> [25] "thread.replies_count"             
#> [26] "thread.participants_count"        
#> [27] "thread.site_type"                 
#> [28] "thread.country"                   
#> [29] "thread.spam_score"                
#> [30] "thread.main_image"                
#> [31] "thread.performance_score"         
#> [32] "thread.domain_rank"               
#> [33] "thread.social.facebook.likes"     
#> [34] "thread.social.facebook.comments"  
#> [35] "thread.social.facebook.shares"    
#> [36] "thread.social.gplus.shares"       
#> [37] "thread.social.pinterest.shares"   
#> [38] "thread.social.linkedin.shares"    
#> [39] "thread.social.stumbledupon.shares"
#> [40] "thread.social.vk.shares"          
#> [41] "entities.persons"                 
#> [42] "entities.organizations"           
#> [43] "entities.locations"

Below we use g2r for a quick plot of the number of articles published by day.

library(dplyr)
# remotes::install_github("JohnCoene/g2r")
library(g2r)

rstats %>% 
  mutate(
    date = as.Date(published, "%Y-%m-%d")
  ) %>% 
  count(date) %>% 
  g2(asp(date, n, color = "#247BA0")) %>% 
  fig_line()

webhose.io does some entity extraction for us, so we do not have to do it ourselves; the data.frame includes the entities mentioned in the body of the text, along with the sentiment associated with each:

  1. Locations
  2. Persons
  3. Organisations

How accurate this extraction is, however, depends on the articles you fetch.
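
The entity columns (entities.persons, entities.organizations, and entities.locations above) come back nested. As a rough sketch, assuming each is a list-column of data frames holding one entity per row (the exact structure may vary), they can be unnested with tidyr:

library(tidyr)

# unnest the persons mentioned in each article; adjust to the actual nested structure
persons <- rstats %>% 
  select(uuid, entities.persons) %>% 
  unnest(entities.persons)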

webhose.io also lets you fetch broadcast transcripts with wh_broadcasts. Happy text mining!