Bitcoin Comment Analysis

This post uses R and Python to scrape comments from the StockTwits website. The site allows roughly 200 requests every few minutes; exceed that and your IP gets banned for a few minutes.

Finally, the scraped data is used to draw a stacked bar chart.

Analyzing the Site

A quick look reveals that the comments arrive as JSON files, each containing 30 comments. The first JSON file's URL is:
https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?filter=top
Every subsequent JSON file's URL carries a max parameter:
https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146067880&filter=top

So the key is finding this max parameter; once we have it, we can design a loop to scrape the data.
It turns out the response itself reveals where max comes from: every JSON file contains a cursor with a max value, and that value is exactly the max parameter needed to request the next JSON file!
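You can verify this by fetching one page and inspecting its cursor. A minimal sketch with jsonlite (the max value will be whatever the current page happens to return):

R
library(jsonlite)
json <- read_json("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?filter=top")
json$cursor$max  # this value goes into the next request's max parameter
nexturl <- paste0("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=",
                  json$cursor$max, "&filter=top")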

So what remains is the loop: iterate over the max parameter and merge each batch of JSON data into a single table.

Scraping with R

R
# https://stocktwits.com/symbol/BTC.X
library(jsonlite)
library(httr)
library(dplyr)
library(progress)

# Turn a parsed JSON object into the data frame we need
json2df <- function(jsontemp){
  dftemp <- data.frame(id = NA, created_at = NA, body = NA,
                       sentiment = NA, source = NA, userid = NA,
                       userfollowers = NA, userfollowing = NA,
                       username = NA)
  for(i in 1:30){
    dftemp[i, ] <- c(
      ifelse(!is.null(jsontemp$messages[[i]]$id), jsontemp$messages[[i]]$id, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$created_at), jsontemp$messages[[i]]$created_at, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$body), jsontemp$messages[[i]]$body, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$entities$sentiment$basic), jsontemp$messages[[i]]$entities$sentiment$basic, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$source$title), jsontemp$messages[[i]]$source$title, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$user$id), jsontemp$messages[[i]]$user$id, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$user$followers), jsontemp$messages[[i]]$user$followers, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$user$following), jsontemp$messages[[i]]$user$following, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$user$name), jsontemp$messages[[i]]$user$name, "")
    )
  }
  return(dftemp)
}

# Given a JSON object, return the max parameter for the next request
jsonmax <- function(jsontemp){
  return(jsontemp$cursor$max)
}

df <- data.frame(id = NA, created_at = NA, body = NA,
                 sentiment = NA, source = NA, userid = NA,
                 userfollowers = NA, userfollowing = NA,
                 username = NA)
json <- read_json("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?filter=top")
df <- rbind(df, json2df(json))
pb <- progress_bar$new(total = 199)
for(i in 1:199){
  pb$tick()
  tempmax <- jsonmax(json)
  json <- GET(paste0("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=", tempmax, "&filter=top"),
              add_headers(c(origin = "https://stocktwits.com",
                            `accept-encoding` = "gzip, deflate, br",
                            `accept-language` = "zh-CN,zh;q=0.9,en;q=0.8",
                            `user-agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
                            accept = "application/json",
                            referer = "https://stocktwits.com/symbol/BTC.X",
                            authority = "api.stocktwits.com"))) %>% content()
  df <- rbind(df, json2df(json))
  Sys.sleep(1)  # be polite between requests
}; rm(i, tempmax)
df <- df[!duplicated(df$id), ]  # drop duplicate messages
df <- df[!is.na(df$id), ]       # drop the initial all-NA row

Here I used the httr package to mimic a browser request, but it turned out this was pointless: even when you imitate a browser, you can still only make 200 requests per time window.

So in practice this one-liner is all you need:

R
json <- read_json("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top")

To construct the request headers I used a really fun tool: curl2r.
It translates a curl command directly into an R (httr) request:

Shell
$ curl2r curl 'https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top' -H 'origin: https://stocktwits.com' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: zh-CN,zh;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36' -H 'accept: application/json' -H 'referer: https://stocktwits.com/symbol/BTC.X' -H 'authority: api.stocktwits.com' --compressed

library(httr)
GET("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top",
    add_headers(c(origin = "https",
                  `accept-encoding` = "gzip, deflate, br",
                  `accept-language` = "zh-CN,zh;q=0.9,en;q=0.8",
                  `user-agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
                  accept = "application/json", referer = "https",
                  authority = "api.stocktwits.com")), set_cookies(c()))

Then just copy the returned result into your script and you're done!

The tool's GitHub page is badbye/curl2r; install it as follows:

R
devtools::install_github('badbye/curl2r')

Copy the bundled script onto your PATH so the command is available globally:

Shell
cp `Rscript -e "cat(system.file('bin/curl2r', package = 'curl2r'))"` /usr/local/bin

After that you can use it straight from the terminal!

Python users can use the uncurl command instead:

Shell
pip install uncurl

Example usage:

Shell
$ uncurl "curl 'https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top' -H 'origin: https://stocktwits.com' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: zh-CN,zh;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36' -H 'accept: application/json' -H 'referer: https://stocktwits.com/symbol/BTC.X' -H 'authority: api.stocktwits.com' --compressed"

requests.get("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top", headers={ "accept": "application/json", "accept-encoding": "gzip, deflate, br",
"accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
"authority": "api.stocktwits.com", "origin": "https://stocktwits.com", "referer": "https://stocktwits.com/symbol/BTC.X", "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
}, cookies={},
)

Scraping with Python + Selenium

The R program above can't get around the IP ban, so I turned to Selenium. I haven't used RSelenium yet; for that, see Ding Wenliang's article: stocktwits crawer Notes.

Here I wrote a function that, given a JSON URL, returns both a tidied data frame and the URL of the next JSON file.

Python
from os import environ
import codecs
import json as jsonp
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

chromedriver = "/usr/local/bin/chromedriver"
environ['webdriver.chrome.driver'] = chromedriver
driver = webdriver.Chrome(chromedriver)

# Given a JSON URL, return a tidied data frame and the next JSON URL
def json2df(jsonurl="https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?filter=top"):
    driver.get(jsonurl)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    json = soup.select('pre')[0].string
    # round-trip through a temp file to parse the JSON text
    with codecs.open('temp.json', 'w', 'utf-8') as f:
        f.write(json)
    with open('temp.json', encoding='utf-8') as data:
        strJson = jsonp.load(data)

    id = []
    created_at = []
    body = []
    sentiment = []
    source = []
    userid = []
    userfollowers = []
    userfollowing = []
    username = []
    for i in range(0, len(strJson['messages'])):
        handle = strJson['messages'][i]
        # message id
        if handle['id'] is not None:
            id.append(handle['id'])
        else:
            id.append('')
        # timestamp
        if handle['created_at'] is not None:
            created_at.append(handle['created_at'])
        else:
            created_at.append('')
        # message body
        if handle['body'] is not None:
            body.append(handle['body'])
        else:
            body.append('')
        # sentiment label
        if handle['entities']['sentiment'] is not None:
            sentiment.append(handle['entities']['sentiment']['basic'])
        else:
            sentiment.append('')
        # message source
        if handle['source']['title'] is not None:
            source.append(handle['source']['title'])
        else:
            source.append('')
        # user id
        if handle['user']['id'] is not None:
            userid.append(handle['user']['id'])
        else:
            userid.append('')
        # username
        if handle['user']['username'] is not None:
            username.append(handle['user']['username'])
        else:
            username.append('')
        # follower count
        if handle['user']['followers'] is not None:
            userfollowers.append(handle['user']['followers'])
        else:
            userfollowers.append('')
        # following count
        if handle['user']['following'] is not None:
            userfollowing.append(handle['user']['following'])
        else:
            userfollowing.append('')

    df = pd.DataFrame(data=[id, created_at, body,
                            sentiment, source, userid,
                            userfollowers, userfollowing, username]).transpose()
    df.columns = ['id', 'created_at', 'body',
                  'sentiment', 'source', 'userid',
                  'userfollowers', 'userfollowing', 'username']
    nexturl = "https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=" + str(int(strJson['cursor']['max'])) + "&filter=top"
    return {'df': df, 'nexturl': nexturl}

# Start scraping: fetch the first page once, then follow nexturl
first = json2df()
df = first['df']
nexturl = first['nexturl']
for i in range(0, 1000):
    try:
        driver.implicitly_wait(10)
        print(nexturl)
        print(i)
        myhandle = json2df(nexturl)
        df = pd.concat([df, myhandle['df']])
        df.to_csv("比特币看法.csv")  # save progress every iteration
        nexturl = myhandle['nexturl']
    except OSError:
        continue  # retry the same URL on the next iteration
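One thing this loop doesn't do (which the R version handles via duplicated()) is deduplicate messages by id; after the loop, something like df = df.drop_duplicates(subset='id') takes care of that.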

Unfortunately, Selenium still doesn't solve the IP-ban problem under rapid scraping. Perhaps Ding Wenliang's approach of processing the HTML files can avoid the ban, but it sounds convoluted: the browser renders the JSON into HTML, and then you dig the data back out of that HTML. It feels like taking the long way around.

If you really need this data, you can have the program rest 4 minutes after every 100 requests, about 12 rounds per hour, i.e. $12 \times 100 \times 30 = 36000$ comments per hour, which is roughly a month's worth of data. Scraping a full year would probably take an entire day.
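A minimal pacing sketch, reusing the json2df() and jsonmax() helpers from the R section (it assumes json and df are initialized as before):

R
# Pause 4 minutes after every 100 requests to stay under the rate limit
for(i in 1:1200){
  json <- read_json(paste0("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=",
                           jsonmax(json), "&filter=top"))
  df <- rbind(df, json2df(json))
  if(i %% 100 == 0) Sys.sleep(4 * 60)
}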

The other option, of course, is to prepare a pool of proxy IPs and rotate through them; I haven't tried that myself.
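I haven't done this, but with httr it could look roughly like the sketch below; the proxy addresses are hypothetical placeholders:

R
library(httr)
# Hypothetical proxy pool -- substitute real proxy hosts and ports
proxies <- list(list(host = "1.2.3.4", port = 8080),
                list(host = "5.6.7.8", port = 8080))
get_with_proxy <- function(url, i){
  p <- proxies[[(i - 1) %% length(proxies) + 1]]  # rotate through the pool
  GET(url, use_proxy(p$host, p$port))
}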

Data Analysis

The Python program above collected 7,000-odd comments; let's analyze them with R:
比特币看法.csv

R
library(dplyr)
library(ggplot2)
library(cowplot)
df <- read.csv("比特币看法.csv", stringsAsFactors = F)
df <- df[, 2:10]
df$created_at <- df$created_at %>%
  gsub(pattern = "T", replacement = " ") %>%
  gsub(pattern = "Z", replacement = "")

df$created_at <- df$created_at %>% as.Date("%Y-%m-%d %H:%M:%S")
df$sentiment <- factor(df$sentiment, levels = c("Bullish", "", "Bearish"))

# The same site also serves Bitcoin price data. Fetch the closing prices,
# divide each day's close by the previous day's, then min-max normalize
# the ratio onto the [0, 1] interval.
library(jsonlite)
json <- read_json("https://ql.stocktwits.com/chart?symbol=BTC.X&zoom=1w")
pricedf <- data.frame(date = NA, close = NA)
for(i in 1:length(json)){
  pricedf[i, 1] <- json[[i]]$Date
  pricedf[i, 2] <- json[[i]]$Close
}
pricedf$date <- pricedf$date %>% as.Date(format = "%b/%d/%Y")
pricedf <- pricedf[!duplicated(pricedf$date), ]
pricedf$closediff <- lag(pricedf$close)  # previous day's close
pricedf$closeratio <- pricedf$close / pricedf$closediff
pricedf <- pricedf[-1, ]
# Min-max normalize the price ratio (keep the raw min/max first,
# since the up/down threshold below needs them)
rmin <- min(pricedf$closeratio)
rmax <- max(pricedf$closeratio)
pricedf$closeratio <- (pricedf$closeratio - rmin) / (rmax - rmin)
# After normalization, the value separating up days from down days
# (i.e. where the raw ratio equals 1) is:
(updownid <- (1 - rmin) / (rmax - rmin))
row.names(pricedf) <- pricedf$date
pricedf <- pricedf[c("2018-11-19",
                     "2018-11-20",
                     "2018-11-21",
                     "2018-11-22",
                     "2018-11-23"), ]

(p <- ggplot(data = df) +
    geom_bar(aes(x = created_at, y = ..count.., colour = sentiment, fill = sentiment),
             position = position_fill(), stat = "count") +
    geom_line(data = pricedf, aes(x = date, y = closeratio),
              colour = "#6a3d9a", size = 2, alpha = 0.6) +
    geom_hline(aes(yintercept = updownid), colour = "#b15928") +
    scale_colour_brewer(palette = "Set2", guide = 'none') +
    scale_fill_brewer(palette = "Set2", labels = c(" Bullish", " Neutral", " Bearish")) +
    scale_x_date(labels = scales::date_format()) +
    labs(title = "What StockTwits users think of Bitcoin\n",
         x = "Date") +
    theme_bw(base_size = 18, base_family = 'STSongti-SC-Bold') +
    theme(plot.title = element_text(hjust = 0.1)) +
    theme(plot.margin = grid::unit(c(1, 1, 2, 2), "cm")) +
    theme(axis.title.y = element_blank()) +
    theme(axis.text.y = element_blank()) +
    theme(axis.ticks.y = element_blank()) +
    theme(legend.title = element_blank()) +
    theme(legend.position = "right"))
ggdraw(p) +
  draw_label("Source: https://stocktwits.com/symbol/BTC.X", x = 0.78, y = 0.05,
             fontfamily = 'STSong', size = 14) +
  draw_image("https://www.czxa.top/images/default28.svg",
             x = 0.52, y = 0.02, width = 0.06, height = 0.06) +
  draw_label("Up", x = 0.11, y = 0.82, fontfamily = 'STSongti-SC-Bold') +
  draw_label("Down", x = 0.11, y = 0.765, fontfamily = 'STSongti-SC-Bold')

The purple line in the chart is the normalized Bitcoin closing-price ratio. I also drew a brown horizontal line: whenever the purple line rises above it, that day's Bitcoin price went up; below it, the price went down. Because the data set is so small, the pattern may look suggestive, but it isn't significant. Never try to draw reliable conclusions from a tiny sample!
