1 Scraping Google News: scraping news reports

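Try to scrape and parse one news website. (It must be a site whose HTML really has to be parsed; for example, 鉅亨網 (cnyes) serves its data as JSON, so there would be no need to parse HTML there.)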

# packages needed
library(stringr)
library(tidyverse) # includes tidyr, dplyr, magrittr
library(rvest)     # read_html(), html_nodes(), html_text(), html_attr()
# library(httr)
# library(dplyr)
library(lubridate)
options(stringsAsFactors = FALSE)
# options(encoding = "")

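I chose Google News: its categories are clear and well organized, it includes trending and breaking stories, and it has sections I like, such as Business and Technology.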

1.1 Collecting the article links

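From the Google News page we obtain the URL that leads out to the original news source (e.g. ETtoday, 中時電子報), together with the source name, headline, and publication time. The article text is fetched from the source pages in a later step and then added to the data frame.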

#news source
google <- "https://news.google.com"
google_news_url <- "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRFZxYUdjU0JYcG9MVlJYR2dKVVZ5Z0FQAQ?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant"
## Get post link
# Get and parse the url
doc  <- read_html(google_news_url)
cT <- Sys.time() #crawlingTime
# observe where the articles' links are
css <- "article a.VDXfz"
# retrieve links of news
news_links <- doc %>%
    html_nodes(css) %>% 
    html_attr("href")
# take a look at the result
head(news_links, 3)
[1] "./articles/CBMiLmh0dHBzOi8vbmV3dGFsay50dy9uZXdzL3ZpZXcvMjAxOC0xMC0yOC8xNTg4OTnSAQA?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant"                                                                                                                                        
[2] "./articles/CBMiPWh0dHBzOi8vd3d3LmNoaW5hdGltZXMuY29tL3JlYWx0aW1lbmV3cy8yMDE4MTAyODAwMDkyMC0yNjA0MDfSAQA?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant"                                                                                                                    
[3] "./articles/CBMiKGh0dHBzOi8vdWRuLmNvbS9uZXdzL3N0b3J5LzEwOTU4LzM0NDcxOTDSAWxodHRwczovL3Vkbi1jb20uY2RuLmFtcHByb2plY3Qub3JnL3Yvcy91ZG4uY29tL25ld3MvYW1wL3N0b3J5LzEwOTU4LzM0NDcxOTA_YW1wX2pzX3Y9MC4xI3dlYnZpZXc9MSZjYXA9c3dpcGU?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant"
length(news_links)
[1] 264
# the hrefs are relative ("./articles/..."), so replace the leading "." with the Google News host
new_links <- str_replace(news_links, "^\\.", google)
tail(new_links,3)
[1] "https://news.google.com/articles/CBMiugFodHRwczovL3R3Lm5ld3MueWFob28uY29tLyVFOSU5OSVCOCVFNiU4QiVCMyVFOCVCMyVCRCVFOSU5RCU5RSVFNiVCNCVCMiVFNiU4QiVCMyVFNyU4RSU4QiVFOSU4MSVBRCVFOCU5OSU5MC0lRTglQTIlQUIlRTclODglODYlRTUlOEYlQUElRTYlOTglQUYlRTclOTUlOTklRTUlQUQlQjglRTclOTQlOUYtMDUwNjM2OTgxLmh0bWzSAYwCaHR0cHM6Ly90dy1uZXdzLXlhaG9vLWNvbS5jZG4uYW1wcHJvamVjdC5vcmcvdi9zL3R3Lm5ld3MueWFob28uY29tL2FtcGh0bWwvJUU5JTk5JUI4JUU2JThCJUIzJUU4JUIzJUJEJUU5JTlEJTlFJUU2JUI0JUIyJUU2JThCJUIzJUU3JThFJThCJUU5JTgxJUFEJUU4JTk5JTkwLSVFOCVBMiVBQiVFNyU4OCU4NiVFNSU4RiVBQSVFNiU5OCVBRiVFNyU5NSU5OSVFNSVBRCVCOCVFNyU5NCU5Ri0wNTA2MzY5ODEuaHRtbD9hbXBfanNfdj0wLjEjd2Vidmlldz0xJmNhcD1zd2lwZQ?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant"
[2] "https://news.google.com/articles/CAIiEOzhLWCTsUTHEB1FtyNtCt4qFwgEKg4IACoGCAowr7I3MKfqBzDPlIMG?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant"
[3] "https://news.google.com/articles/CBMiJ2h0dHBzOi8vdWRuLmNvbS9uZXdzL3N0b3J5LzczMzIvMzQ0NjUxNtIBa2h0dHBzOi8vdWRuLWNvbS5jZG4uYW1wcHJvamVjdC5vcmcvdi9zL3Vkbi5jb20vbmV3cy9hbXAvc3RvcnkvNzMzMi8zNDQ2NTE2P2FtcF9qc192PTAuMSN3ZWJ2aWV3PTEmY2FwPXN3aXBl?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant"                                                                                                                                                                                                                                                                                                                                                                                                                              
# see if it works? browse the last one
browseURL(new_links[length(new_links)])
## Get post source: publisher
css_source <- "article .QmrVtf .KbnJ8"
news_source <- doc %>%
    html_nodes(css_source) %>% 
    html_text()
length(news_source)
[1] 264
# head(news_source)
## Get post headline
css_headline <- "article .ZulkBc a span"
news_title <- doc %>%
    html_nodes(css_headline) %>% 
    html_text()
# head(news_title) ##non-numeric to binary
## Get post datetime
css_time <- ".kybdz > div > time"
news_time <- doc %>%
    html_nodes(css_time) %>% 
    html_attr("datetime")
news_time[3]
[1] "seconds: 1540710321\n"
length(news_title);length(news_time)
[1] 264
[1] 264
# strip the "seconds: " prefix and the trailing newline, then convert epoch seconds to a datetime
news_time <- str_replace(news_time, "seconds: ","")
news_time <- str_replace(news_time, "\n","")
news_time <- as_datetime(as.numeric(news_time))
news_time[3]
[1] "2018-10-28 07:05:21 UTC"
## Create data frame
gnews_df <- data.frame(headline=news_title,
                       source=news_source,
                       link = new_links,
                       time = news_time
                       )
# take a look at df
View(gnews_df)
gnews_df[c(1:3, nrow(gnews_df)),]
sn <- table(gnews_df$source)
# names(sn); ##non-numeric to binary
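The four vectors above line up only because every article on the page happened to carry all four fields (each has length 264). A minimal sketch of a more defensive variant follows: it selects each `article` node first and then pulls one value per article with `html_node()`, which gives `NA` where a field is missing. The HTML fragment is made up for illustration (only the class name `VDXfz` and the `seconds: ...` timestamp format are taken from the code above), and `demo_df`, `raw_link` and `raw_time` are hypothetical names.

```r
library(rvest)

# Made-up fragment: the second article deliberately has no <time> element.
html <- '<html><body>
<article><a class="VDXfz" href="./articles/AAA"></a>
  <time datetime="seconds: 1540710321"></time></article>
<article><a class="VDXfz" href="./articles/BBB"></a></article>
</body></html>'

articles <- html_nodes(read_html(html), "article")

# html_node() (singular) returns exactly one result per article, NA if absent.
raw_link <- html_attr(html_node(articles, "a.VDXfz"), "href")
raw_time <- html_attr(html_node(articles, "time"), "datetime")

demo_df <- data.frame(
  # anchor the pattern so that only the leading "./" is rewritten
  link = sub("^\\./", "https://news.google.com/", raw_link),
  # keep only the digits and treat them as epoch seconds in UTC
  time = as.POSIXct(as.numeric(gsub("[^0-9]", "", raw_time)),
                    origin = "1970-01-01", tz = "UTC"),
  stringsAsFactors = FALSE
)
demo_df
```

Because every column is derived from the same `articles` node set, the rows stay aligned even when a field is absent.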

The stories on Google News come from many different outlets. Among the items scraped under the "Latest" tab, the sources that appear most often include:

  • “自由時報電子報” “中時電子報” “ETtoday 新聞雲”
  • “udn 聯合新聞網” “TVBS新聞”
sn <- sort(sn)
# names(sn[(length(sn):(length(sn)-4))])
# sn # table of all sources
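To rank the sources by article count, the result of `table()` can be sorted in decreasing order and truncated with `head()`. A small sketch on made-up data (the vector `src` is hypothetical; in the notebook it would be `gnews_df$source`):

```r
# Hypothetical source column standing in for gnews_df$source.
src <- c("ETtoday 新聞雲", "中時電子報", "ETtoday 新聞雲",
         "TVBS新聞", "ETtoday 新聞雲", "中時電子報")

# Count articles per source, most frequent first.
source_counts <- sort(table(src), decreasing = TRUE)

# Names of the two most frequent sources.
top_sources <- names(head(source_counts, 2))
top_sources
```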

1.2 Trying to fetch the article text from the source pages

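Problem: the links scraped above cannot be used to fetch the text directly, because they first go through a redirect before reaching the real URL of the source page. Take, for example, a story from ETtoday: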

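Test: locate the article paragraphs on the source page with a CSS selector. Nothing is found via URL A (the Google News link), but the paragraphs are retrieved successfully via URL B (the real URL).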

  • The result of testing a single page is as follows:
## Test ETtoday link
# paragraph css
css_ETt <- ".story p"
#A: google url
ETnews <- str_detect(gnews_df[,"source"],"ETtoday")
link <- gnews_df[ETnews,"link"][1]
link
[1] "https://news.google.com/articles/CAIiEG6qt7WqdQ0ZZESdB_eVMIIqFggEKg4IACoGCAowr7I3MKfqBzDpjAs?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant"
postprgr <- read_html(link) %>%
    html_nodes(css = css_ETt) %>% 
    html_text()
# postprgr #notebook, not preview
#B: real url
link <- "https://www.ettoday.net/news/20181022/1287141.htm"
postprgr <- read_html(link) %>%
    html_nodes(css = css_ETt) %>% 
    html_text()
# see one of the paragraph
# postprgr[3] ##non-numeric to binary
# 3:(n - 2): skip the first two and the last two <p> nodes
post <- paste(postprgr[3:(length(postprgr) - 2)], collapse = "")
# see the entire post
post ##non-numeric to binary
[1] "政治中心/綜合報導普悠瑪列車21日傍晚翻覆造成列車上18人死亡、183人輕重傷,行政院長賴清德凌晨趕到蘇澳榮民醫院慰問死者家屬,一聽到董姓乘客與家人共有8人罹難,當場傻住,關心慰問之餘也紅了眼眶。董家在22日上午再度痛失一名親屬。據了解,台東船長董進興,號召親友共17人北上吃喜酒,回程全家都坐上這列死亡列車上,造成夫妻與2個孫子等共8人死亡,還有2人在加護病房搶救。賴清德凌晨在宜蘭縣代理縣長陳金德陪同下,到蘇澳榮民醫院探視罹難者家屬,當他聽到董家一下失去8人,當場傻住,賴清德說,「那一定超級難過」,他還猶豫一下要不要敲休息室的門,旁人勸他還是要,他和家屬見面,紅了眼眶。▲賴清德到普悠瑪出軌意外現場探視。(圖/記者姜國輝攝)據了解,董姓一家人分別買到了車頭第8節車廂和第3節車廂的位置,死傷慘重幾乎多位於車頭,其中董進興夫妻檔、弟弟董進發和孫子董怡良、孫女董佳惠等8人皆不幸罹難,另2人則在加護病房觀察,而其他7人則因坐在第3節車廂逃過一劫。據宜蘭縣消防局統計,18名罹難者中,董家罹難者有8人,分別是董進興、王綠雲(董進興妻)、董玉蘭與何發仁(董玉蘭夫)、董進發(董進興弟)、董宜良(董孫)、董佳惠(董孫女)與何青宴。  賴清德表示,罹難者後事交通部與地方政府會共同合作,原則是優先尊重家屬意見,剛剛慰問一戶家屬,已經準備將親人大體運送回花蓮,交通部跟宜蘭縣政府會也會共同合作,辦聯合公祭, 後事會協助大家,一起完滿辦成。據了解,22日上午董家再度痛失另一名家屬,死亡人數更新為9名。""
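One detail worth checking when trimming paragraphs by index: in R the `:` operator binds more tightly than binary minus, so `3:n-2` means `(3:n) - 2`, not `3:(n - 2)`. A small sketch with a made-up vector `p` standing in for `postprgr`:

```r
# Made-up stand-in for the scraped paragraph vector.
p <- c("byline", "photo", "para 1", "para 2", "para 3", "ad", "footer")
n <- length(p)

idx_unparenthesised <- 3:n - 2     # (3:n) - 2, i.e. positions 1 to n - 2
idx_parenthesised   <- 3:(n - 2)   # drops the first two and the last two elements

p[idx_unparenthesised]
p[idx_parenthesised]
```

Only the parenthesised form removes the leading elements as well as the trailing ones.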

1.3 Going from A to B involves a redirect

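Solution: paste URL A into the browser and, while the redirect is still loading, click the stop (✕) button to the left of the address bar.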

  • The page then shows text such as: “Opening https://www.ettoday.net/news/20181022/1287141.htm”
  • Open Chrome DevTools to inspect the HTML of this interstitial page. URL B can be located in it, which gives the CSS selector.
# Redirecting page
redirection_css <- "c-wiz > div > a"
link <- gnews_df[ETnews,"link"][1]
real_link <- read_html(link) %>%
    html_nodes(css = redirection_css) %>% 
    html_attr("href")
real_link
[1] "https://www.ettoday.net/news/20181027/1291939.htm"
  • Extend the replacement to all links, changing gnews_df$link
length(gnews_df$link)
[1] 264
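# start from the Google News URLs so every row keeps exactly one link
new_link <- gnews_df$link
if(str_detect(gnews_df$link[1], fixed("news.google.com"))){
    for(i in seq_along(gnews_df$link)){
        real_link <- read_html(gnews_df$link[i]) %>%
            html_nodes(css = redirection_css) %>% 
            html_attr("href")
        # use the first match only; rows without a match keep their Google News URL
        if(length(real_link) > 0) new_link[i] <- real_link[1]
        # print(paste0(i, "......", real_link))
    }
}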
Warning in .Internal(lapply(X, FUN)) :
  closing unused connection 3 (https://news.google.com/articles/CAIiEATX7rPfsnzWwakr0LnRnHoqFwgEKg4IACoGCAowr7I3MKfqBzDfkoMG?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant)
# take a look at the converted link
head(new_link, 3)
[1] "https://newtalk.tw/news/view/2018-10-28/158899"               
[2] "https://www.chinatimes.com/realtimenews/20181028000920-260407"
[3] "https://udn.com/news/story/10958/3447190"                     
tail(new_link, 3)
[1] "https://tw.news.yahoo.com/%E9%99%B8%E6%8B%B3%E8%B3%BD%E9%9D%9E%E6%B4%B2%E6%8B%B3%E7%8E%8B%E9%81%AD%E8%99%90-%E8%A2%AB%E7%88%86%E5%8F%AA%E6%98%AF%E7%95%99%E5%AD%B8%E7%94%9F-050636981.html"
[2] "https://www.ettoday.net/news/20181028/1292431.htm"                                                                                                                                         
[3] "https://udn.com/news/story/7332/3446516"                                                                                                                                                   
# mutate to df
glinks <- gnews_df$link
gnews_df <- mutate(gnews_df, link = new_link)
gnews_df[1:6,]
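A single failed request would stop a loop like the one above partway through. A minimal sketch of a defensive helper follows; `resolve_link()` is a hypothetical name, and the two pages are made-up HTML fragments passed as strings so that the example runs offline (a real call would pass a news.google.com URL instead). The helper always returns exactly one value, `NA` when the page cannot be read or contains no matching anchor.

```r
library(rvest)

redirection_css <- "c-wiz > div > a"

# Hypothetical helper: first matching href, or NA on error / no match.
resolve_link <- function(page, css = redirection_css) {
  tryCatch({
    href <- html_attr(html_nodes(read_html(page), css), "href")
    if (length(href) == 0) NA_character_ else href[1]
  }, error = function(e) NA_character_)
}

# Made-up interstitial page and a page without the expected anchor.
page_ok  <- '<html><body><c-wiz><div><a href="https://www.ettoday.net/news/20181027/1291939.htm">open</a></div></c-wiz></body></html>'
page_bad <- '<html><body><p>no redirect here</p></body></html>'

resolved <- vapply(c(page_ok, page_bad), resolve_link, character(1),
                   USE.NAMES = FALSE)
resolved
```

Using `vapply()` with `character(1)` guarantees that the result has the same length as the input, so it can be assigned to a data frame column directly.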

1.4 Fetching the article text again

Now that we have the links to the original stories, we can locate the article paragraphs for each source in turn.

  • get at least 100 news reports
  • From the table() of source built above, after sorting, the most frequent sources are:
    • ETtoday
    • 自由時報
    • udn聯合
    • TVBS
    • 中時
    • Yahoo
  • So the article text is fetched for these sources.
##start parsing post
#css collection
css_chinatimes <- "article > p"
css_chinatimes_firstprgr <- ".pictext > p"
css_yahoo <- "article > div > p"
css_udn <- "#story_body_content > p"
css_tvbs <- "#news_detail_div"
css_ltn <- "div.text > p"

i <- 1
posts <- NULL


for(lnk in gnews_df$link){
    post <- NA
    # css_temp <- NA
    
    # Cases
    if(gnews_df$source[i]=="ETtoday 新聞雲")
    {
        postprgr <- read_html(lnk) %>%
           html_nodes(css = css_ETt) %>%
           html_text()
        post <- paste(postprgr[c(3:(length(postprgr)-2))], collapse = " ")
    }
    else if(gnews_df$source[i]=="中時電子報")
    {
        # read the page once and reuse it for both selectors
        page <- read_html(lnk)
        #content
        postprgr <- page %>%
           html_nodes(css = css_chinatimes) %>%
           html_text()
        post <- paste(postprgr, collapse = " ")
        #picture caption
        caption <- page %>%
           html_nodes(css = css_chinatimes_firstprgr) %>%
           html_text() %>%
            str_remove_all("\r\n") %>%
            str_trim()
        # collapse so that several (or zero) captions still give a single string
        post <- str_trim(paste(paste(caption, collapse = " "), post))
    }
    else if(gnews_df$source[i] %>% str_detect("Yahoo奇摩"))
    {
        postprgr <- read_html(lnk) %>%
           html_nodes(css = css_yahoo) %>%
           html_text()
        post <- paste(postprgr, collapse = " ")
    }
    else if(gnews_df$source[i]=="udn 聯合新聞網")
    {
        postprgr <- read_html(lnk) %>%
           html_nodes(css = css_udn) %>%
           html_text() %>% 
            str_trim()
        post <- paste(postprgr, collapse = " ")
    }
    else if(gnews_df$source[i]=="自由時報電子報")
    {
        postprgr <- read_html(lnk) %>%
           html_nodes(css = css_ltn) %>%
           html_text()
        post <- paste(postprgr, collapse = " ")
    }
    else if(gnews_df$source[i]=="TVBS新聞")
    {
        postprgr <- read_html(lnk) %>%
           html_nodes(css = css_tvbs) %>%
           html_text()
        postprgr <- postprgr %>%
            str_remove_all("\r\n") %>%
            str_trim() %>% 
            str_remove_all("最HOT話題在這!想跟上時事,快點我加入TVBS新聞LINE好友!") %>% 
            str_remove("2018全台選戰風雲 TVBS最新大數據分析") %>% 
            str_remove("看不到影音請點這")
        post <- paste(postprgr, collapse = " ")
    }
    
    # print(paste0(i, "......", post))
    posts <- c(posts, post)
    i <- i + 1
    
    if(i%%5 ==0){
        Sys.sleep(sample(1:3,size = 1))
    }
}
# see the result of posts
# head(posts);tail(posts)
# convert empty strings (no text found) to NA
posts[posts == ""] <- NA # posts[is.na(posts)] <- ""
# count posts with content
sum(!is.na(posts))
[1] 174
length(posts)
[1] 264
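The if/else chain above could also be expressed as a lookup table from source name to CSS selector, which keeps the per-source differences in one place. This is only a sketch: it assumes that matching on a distinctive part of the source name is acceptable (the loop above mostly uses exact comparison), and the ETtoday trimming and TVBS clean-up would still need per-source post-processing. `selectors` and `selector_for()` are hypothetical names; the selector strings are copied from the `css_*` variables above.

```r
# Selector per source, keyed by a distinctive part of the source name.
selectors <- c(
  "ETtoday"  = ".story p",
  "中時"     = "article > p",
  "Yahoo"    = "article > div > p",
  "udn"      = "#story_body_content > p",
  "TVBS"     = "#news_detail_div",
  "自由時報" = "div.text > p"
)

# Hypothetical helper: selector of the first key that occurs in the source
# name, or NA for sources that are not handled.
selector_for <- function(source_name) {
  hit <- vapply(names(selectors), grepl, logical(1),
                x = source_name, fixed = TRUE)
  if (any(hit)) unname(selectors[hit][1]) else NA_character_
}

selector_for("udn 聯合新聞網")
selector_for("Some other site")
```

Sources that return `NA` can then be skipped in the loop, which matches the current behaviour of leaving `post` as `NA` for unhandled outlets.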

1.5 Merging the articles into the data frame and saving

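  • Only articles from the six most frequent sources are kept (enough to meet the assignment's requirement of more than 100 articles); the rest are dropped. (That is, rows whose posts value is not NA are retained.)
  • Store news data as .rds (the zip file contains googleNews_10_28_12.rds, which holds the df)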
## Add posts to gnews_df
gnews_df <- mutate(gnews_df, article = posts)
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string
# keep only the articles that have content
gnews_df <- gnews_df[!is.na(gnews_df$article),]
dim(gnews_df)
[1] 174   5
gnews_df
## Saving DataFrame
format(cT)
[1] "2018-10-28 22:52:08"
# cT_tag <- paste0(month(cT),day(cT),"_",hour(cT),minute(cT))
cT_tag <- paste0(month(cT),day(cT))
rdsname <- paste("googleNews",cT_tag, hour(cT), sep = "_")
rdsname <- paste0(rdsname,".rds")
# save new gnews_df as rds
rdsname
[1] "googleNews_1028_22.rds"
saveRDS(gnews_df, rdsname)
## Example: load an existing df
# load old df
loaded_gnews_df <- readRDS("googleNews_10_28_12.rds")
dim(loaded_gnews_df)
[1] 157   5
loaded_gnews_df
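The file name could also be built in one step with `format()`, which zero-pads the month, day and hour (with `paste0(month(cT), day(cT))`, 5 January would become `15`). A sketch with a fixed timestamp and a temporary directory, so that it can be run without touching real data; `demo_time`, `demo_df` and `path` are hypothetical names.

```r
# Fixed timestamp standing in for cT (the crawl time).
demo_time <- as.POSIXct("2018-10-28 22:52:08", tz = "UTC")

# %m%d_%H gives zero-padded month, day and hour.
rdsname <- paste0("googleNews_", format(demo_time, "%m%d_%H"), ".rds")

# Round trip through a temporary file.
demo_df <- data.frame(headline = "example", source = "example source",
                      stringsAsFactors = FALSE)
path <- file.path(tempdir(), rdsname)
saveRDS(demo_df, path)
identical(readRDS(path), demo_df)
```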

by Ivan Chen

