Bitcoin Comment Analysis

This post uses R and Python to scrape comments from the StockTwits website. The site allows only about 200 requests every few minutes; exceed that and your IP is banned for several minutes.

Finally, the scraped data is used to draw a stacked bar chart.

Analyzing the Site

A quick look at the network requests shows that the comments are delivered as JSON files, each containing 30 comments. The first JSON file's URL is:
https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?filter=top
Every subsequent JSON file's URL carries a max parameter:
https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146067880&filter=top

So the key question is where this max parameter comes from; once we have it, we can loop over requests. The answer is in the response itself: every JSON file contains a max value, and that value is exactly the max parameter the next JSON request needs!

So all that's left is the loop, which iterates over the max parameter, converting each JSON response and appending it to a combined table.
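
In R, for example, the cursor can be read straight off the parsed JSON; this is the exact field the jsonmax() helper below extracts:

R
library(jsonlite)
json <- read_json("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?filter=top")
json$cursor$max  # the max parameter for the next page's URL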

Scraping with R

R
# https://stocktwits.com/symbol/BTC.X
library(jsonlite)
library(httr)
library(dplyr)
library(progress)
# Helper: turn one parsed JSON object into the data frame we need
json2df <- function(jsontemp){
  dftemp <- data.frame(id = NA, created_at = NA, body = NA,
                       sentiment = NA, source = NA, userid = NA,
                       userfollowers = NA, userfollowing = NA,
                       username = NA)
  for(i in 1:30){
    dftemp[i, ] <- c(
      ifelse(!is.null(jsontemp$messages[[i]]$id), jsontemp$messages[[i]]$id, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$created_at), jsontemp$messages[[i]]$created_at, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$body), jsontemp$messages[[i]]$body, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$entities$sentiment$basic), jsontemp$messages[[i]]$entities$sentiment$basic, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$source$title), jsontemp$messages[[i]]$source$title, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$user$id), jsontemp$messages[[i]]$user$id, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$user$followers), jsontemp$messages[[i]]$user$followers, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$user$following), jsontemp$messages[[i]]$user$following, ""),
      ifelse(!is.null(jsontemp$messages[[i]]$user$name), jsontemp$messages[[i]]$user$name, "")
    )
  }
  return(dftemp)
}

# Helper: given a parsed JSON object, return the max parameter for the next request
jsonmax <- function(jsontemp){
  return(jsontemp$cursor$max)
}

df <- data.frame(id = NA, created_at = NA, body = NA,
                 sentiment = NA, source = NA, userid = NA,
                 userfollowers = NA, userfollowing = NA,
                 username = NA)
json <- read_json("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?filter=top")
df <- rbind(df, json2df(json))
pb <- progress_bar$new(total = 199)
for(i in 1:199){
  pb$tick(0)
  pb$tick()
  tempmax = jsonmax(json)
  json <- GET(paste0("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=", tempmax, "&filter=top"),
              add_headers(c(origin = "https://stocktwits.com",
                            `accept-encoding` = "gzip, deflate, br",
                            `accept-language` = "zh-CN,zh;q=0.9,en;q=0.8",
                            `user-agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
                            accept = "application/json",
                            referer = "https://stocktwits.com/symbol/BTC.X",
                            authority = "api.stocktwits.com")),
              set_cookies(c())) %>% content()
  df <- rbind(df, json2df(json))
  Sys.sleep(1)
};rm(i, tempmax)
# Drop duplicate messages and the initial all-NA row
df <- df[!duplicated(df$id), ]
df <- df[!is.na(df$id), ]

Here I used the httr package to imitate a browser request, but I later realized it was pointless: even browser-like requests are limited to 200 per time window.

So in practice this is all you actually need:

R
json <- read_json("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top")

To build the request headers I used a very interesting tool: curl2r. It translates a curl command directly into an R request:

Shell
$ curl2r curl 'https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top' -H 'origin: https://stocktwits.com' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: zh-CN,zh;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36' -H 'accept: application/json' -H 'referer: https://stocktwits.com/symbol/BTC.X' -H 'authority: api.stocktwits.com' --compressed

library(httr)
GET("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top", add_headers(c(origin = "https",
`accept-encoding` = "gzip, deflate, br",
`accept-language` = "zh-CN,zh;q=0.9,en;q=0.8",
`user-agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
accept = "application/json", referer = "https",
authority = "api.stocktwits.com")), set_cookies(c()))

Then just copy the output into your script!

The tool's GitHub page is badbye/curl2r, and it installs like this:

R
devtools::install_github('badbye/curl2r')

Copy the script onto your PATH to make it available globally:

Shell
cp `Rscript -e "cat(system.file('bin/curl2r', package = 'curl2r'))"` /usr/local/bin

Now it can be used directly from the terminal!

Python users can use the uncurl command instead:

Shell
pip install uncurl

Example usage:

Shell
$ uncurl "curl 'https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top' -H 'origin: https://stocktwits.com' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: zh-CN,zh;q=0.9,en;q=0.8' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36' -H 'accept: application/json' -H 'referer: https://stocktwits.com/symbol/BTC.X' -H 'authority: api.stocktwits.com' --compressed"

requests.get("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=146068013&filter=top",
    headers={
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "authority": "api.stocktwits.com",
        "origin": "https://stocktwits.com",
        "referer": "https://stocktwits.com/symbol/BTC.X",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
    },
    cookies={},
)

Scraping with Python + Selenium

The R program above can't get around the IP ban, so I turned to Selenium. I haven't tried RSelenium yet; for that, see Ding Wenliang's article: stocktwits crawer Notes.

Here I wrote a function that, given a JSON URL, returns the tidied data frame together with the URL of the next JSON file.

Python
from selenium import webdriver
from os import environ
from bs4 import BeautifulSoup
import codecs
import json as jsonp
import pandas as pd

chromedriver = "/usr/local/bin/chromedriver"
environ['webdriver.chrome.driver'] = chromedriver
driver = webdriver.Chrome(chromedriver)

# Given a JSON URL, return the data frame and the next URL
def json2df(jsonurl = "https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?filter=top"):
    driver.get(jsonurl)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    json = soup.select('pre')[0].string
    f = codecs.open('temp.json', 'w', 'utf-8')
    f.write(json)
    f.close()
    data = open("temp.json", encoding = 'utf-8')
    strJson = jsonp.load(data)

    id = []
    created_at = []
    body = []
    sentiment = []
    source = []
    userid = []
    userfollowers = []
    userfollowing = []
    username = []
    for i in range(0, len(strJson['messages'])):
        handle = strJson['messages'][i]
        # comment id
        if handle['id'] is not None:
            id.append(handle['id'])
        else:
            id.append('')
        # comment time
        if handle['created_at'] is not None:
            created_at.append(handle['created_at'])
        else:
            created_at.append('')
        # comment body
        if handle['body'] is not None:
            body.append(handle['body'])
        else:
            body.append('')
        # comment sentiment
        if handle['entities']['sentiment'] is not None:
            sentiment.append(handle['entities']['sentiment']['basic'])
        else:
            sentiment.append('')
        # comment source
        if handle['source']['title'] is not None:
            source.append(handle['source']['title'])
        else:
            source.append('')
        # user id
        if handle['user']['id'] is not None:
            userid.append(handle['user']['id'])
        else:
            userid.append('')
        # user name
        if handle['user']['username'] is not None:
            username.append(handle['user']['username'])
        else:
            username.append('')
        # follower count
        if handle['user']['followers'] is not None:
            userfollowers.append(handle['user']['followers'])
        else:
            userfollowers.append('')
        # following count
        if handle['user']['following'] is not None:
            userfollowing.append(handle['user']['following'])
        else:
            userfollowing.append('')

    df = pd.DataFrame(data = [id, created_at, body,
                              sentiment, source, userid,
                              userfollowers, userfollowing, username]).transpose()
    df.columns = ['id', 'created_at', 'body',
                  'sentiment', 'source', 'userid',
                  'userfollowers', 'userfollowing', 'username']
    nexturl = "https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=" + str(int(strJson['cursor']['max'])) + "&filter=top"
    return {'df': df, 'nexturl': nexturl}

# Start scraping: fetch the first page once, then follow the cursor
first = json2df()
df = first['df']
nexturl = first['nexturl']
for i in range(0, 1000):
    try:
        driver.implicitly_wait(10)
        print(nexturl)
        print(i)
        myhandle = json2df(nexturl)
        mydf = myhandle['df']
        df = df.append(mydf)
        df.to_csv("比特币看法.csv")
        nexturl = myhandle['nexturl']
    except OSError:
        continue

Unfortunately, Selenium still can't avoid the IP ban when scraping quickly. Ding Wenliang's approach of saving and processing the HTML files might avoid the ban, but it feels convoluted: the browser renders the JSON into HTML, and then you dig the data back out of the HTML. It's a detour.

If you really need the data, you can have the program rest for 4 minutes after every 100 requests. That gives 12 batches per hour, i.e. $12 \times 100 \times 30 = 36000$ comments, which is roughly one month of data; a full year would probably take an entire day.
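
For instance, the scraping loop from earlier could be throttled like this (just a sketch reusing the json2df() and jsonmax() helpers defined above; the 100-requests/4-minutes figures are only the guess from the previous paragraph):

R
for(i in 1:1200){
  tempmax <- jsonmax(json)
  json <- read_json(paste0("https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?max=", tempmax, "&filter=top"))
  df <- rbind(df, json2df(json))
  # Rest 4 minutes after every 100 requests to stay under the rate limit
  if(i %% 100 == 0) Sys.sleep(4 * 60)
}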

Of course, the other option is to set up a pool of proxy IPs and rotate through them while scraping; I haven't tried that myself.
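
If you did, the rotation might look roughly like this with httr (a sketch only: the addresses and port below are placeholders, and you would need a pool of working proxies):

R
library(httr)
proxies <- c("203.0.113.1", "203.0.113.2", "203.0.113.3")  # hypothetical pool
url <- "https://api.stocktwits.com/api/2/streams/symbol/BTC.X.json?filter=top"
for(i in 1:6){
  # Round-robin over the pool so consecutive requests come from different IPs
  proxy <- proxies[(i - 1) %% length(proxies) + 1]
  resp <- GET(url, use_proxy(proxy, port = 8080))
  print(status_code(resp))
}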

Data Analysis

The Python program above collected 7000-odd comments, saved as 比特币看法.csv. Now let's analyze them with R:

R
library(dplyr)
library(ggplot2)
library(cowplot)
df <- read.csv("比特币看法.csv", stringsAsFactors = F)
df <- df[, 2:10]
# Strip the ISO-8601 "T" and "Z" markers from the timestamps
df$created_at <- df$created_at %>%
  gsub(pattern = "T", replacement = " ") %>%
  gsub(pattern = "Z", replacement = "")

df$created_at <- df$created_at %>% as.Date("%Y-%m-%d %H:%M:%S")
df$sentiment <- factor(df$sentiment, levels = c("Bullish", "", "Bearish"))

# The same site also serves Bitcoin price data. Below we scrape the closing
# prices, divide each day's close by the previous day's close, and min-max
# normalize the ratio (i.e. map it onto the 0-1 interval).
library(jsonlite)
json <- read_json("https://ql.stocktwits.com/chart?symbol=BTC.X&zoom=1w")
pricedf <- data.frame(date = NA, close = NA)
for(i in 1:length(json)){
  pricedf[i, 1] = json[[i]]$Date
  pricedf[i, 2] = json[[i]]$Close
}
pricedf$date <- pricedf$date %>% as.Date(format = "%b/%d/%Y")
pricedf <- pricedf[!duplicated(pricedf$date),]
pricedf$closediff <- lag(pricedf$close)   # previous day's close
pricedf$closeratio <- pricedf$close / pricedf$closediff
pricedf <- pricedf[-1,]
# Min-max normalize the price ratio; keep the original min/max, since we
# need them again below to locate the "no change" level
ratiomin <- min(pricedf$closeratio)
ratiomax <- max(pricedf$closeratio)
pricedf$closeratio <- (pricedf$closeratio - ratiomin) / (ratiomax - ratiomin)
# After normalization, the threshold separating up days from down days is:
(updownid <- (1 - ratiomin) / (ratiomax - ratiomin))
row.names(pricedf) <- pricedf$date
pricedf <- pricedf[c("2018-11-19",
                     "2018-11-20",
                     "2018-11-21",
                     "2018-11-22",
                     "2018-11-23"),]

(p <- ggplot(data = df) +
   geom_bar(aes(x = created_at, y = ..count.., colour = sentiment, fill = sentiment), position = position_fill(), stat = "count") +
   geom_line(data = pricedf, aes(x = date, y = closeratio),
             colour = "#6a3d9a", size = 2, alpha = 0.6) +
   geom_hline(aes(yintercept = updownid), colour = "#b15928") +
   scale_colour_brewer(palette = "Set2", guide = 'none') +
   scale_fill_brewer(palette = "Set2", labels = c(" Bullish", " No opinion", " Bearish")) +
   scale_x_date(labels = scales::date_format()) +
   labs(title = "How StockTwits users feel about Bitcoin\n",
        x = "Date") +
   theme_bw(base_size = 18, base_family = 'STSongti-SC-Bold') +
   theme(plot.title = element_text(hjust = 0.1)) +
   theme(plot.margin = grid::unit(c(1, 1, 2, 2), "cm")) +
   theme(axis.title.y = element_blank()) +
   theme(axis.text.y = element_blank()) +
   theme(axis.ticks.y = element_blank()) +
   theme(legend.title = element_blank()) +
   theme(legend.position = "right"))
ggdraw(p) +
  draw_label("Data source: https://stocktwits.com/symbol/BTC.X", x = 0.78, y = 0.05, fontfamily = 'STSong', size = 14) +
  draw_image("https://www.czxa.top/images/default28.svg",
             x = 0.52, y = 0.02, width = 0.06, height = 0.06) +
  draw_label("Up", x = 0.11, y = 0.82, fontfamily = 'STSongti-SC-Bold') +
  draw_label("Down", x = 0.11, y = 0.765, fontfamily = 'STSongti-SC-Bold')

The purple line in the chart is the normalized Bitcoin closing-price ratio. I also drew a brown horizontal line: whenever the purple line is above it, Bitcoin's price rose that day, and below it, the price fell. Because the data set is so small, the chart may look suggestive, but nothing here is statistically significant. Never try to draw reliable conclusions from a small sample!
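
To see why the brown line sits where it does: writing $r_t = c_t / c_{t-1}$ for the close ratio, min-max normalization maps it to

$$\tilde{r}_t = \frac{r_t - \min_t r_t}{\max_t r_t - \min_t r_t},$$

so the "no change" level $r_t = 1$ lands at

$$updownid = \frac{1 - \min_t r_t}{\max_t r_t - \min_t r_t},$$

which is exactly the threshold computed in the code above.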

