How to Scrape Every Song by Your Favorite Singer!

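  • Paid music is becoming more and more common. In an age where everything costs money, can you still get what you want with your own two hands? I often feel it is worth stockpiling some music now, in case all music eventually sits behind a paywall I cannot afford. The workflow introduced today makes that possible!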

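This post produces two Stata commands (I cannot guarantee they will run on other machines, but even if they do not, they should still be useful as a reference):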

kugousearch.ado
kugou.ado

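Today's scraping target: Kugou

You might think I mean the Kugou website, but what I actually plan to scrape is the Kugou app (the Mac version; 丁文亮 (Ding Wenliang) will probably try the Windows version soon).

Tools needed today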

  • Kugou app
  • Sublime Text 3
  • Stata 15.1 SE
  • Charles
  • curl

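Charles is a packet-capture tool (software that intercepts network packets and lets you inspect their contents); Fiddler is a similar program. There are plenty of installation guides online, so I will not repeat them here.
curl is an open-source file transfer tool that works from the command line using URL syntax. On Windows it has to be downloaded and installed first; on a Mac it comes preinstalled and can be used right away.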

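The Charles startup screen:

Step 1: Think about the task you want to accomplish

This step is of course the most important one. For example, I want to scrape every song by my favorite singer, Critty. Let's search in the Kugou app:

There are 107 songs in total. Until now I only knew how to scrape websites, so let's first go back to Kugou's official website and scrape Critty's songs from there. The website's search results look like this:

The website's search results contain no link to Critty's artist page (compare with the results for 周杰伦 (Jay Chou)), and only 30 results are shown. That means the only way to download all of Critty's songs is to download these search results. Because of the page limit, even that looks difficult at first, but it is easy to get around. Open the browser's Inspect panel: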

It is easy to find this js file whose name starts with song_search. Let's look at its request headers:

That is, this one:

http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=Critty&page=1&pagesize=30&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679

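It contains pagesize=30, which matches the 30 results shown per page. Opening this link gives a dense wall of text; in its last few lines there is "total":159, meaning there are 159 search results in total. Changing pagesize to 159 makes the response contain every search result we need:
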
http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=Critty&page=1&pagesize=159&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679
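
The total can also be read programmatically rather than by eye. The sketch below is only an illustration: the response string is a shortened, hypothetical stand-in for the real body, and the variable names are made up for this example.

```stata
* Hypothetical, shortened stand-in for the real response body
clear
set obs 1
gen str200 resp = `"jQuery1124({"status":1,"data":{"page":1,"total":159,"lists":[]}})"'

* Capture the digits that follow "total":
gen total = real(ustrregexs(1)) if ustrregexm(resp, `""total":([0-9]+)"')

* Number of 30-result pages needed to cover every hit
gen pages = ceil(total / 30)
list total pages
```

With the total in hand, you can either request everything at once by setting pagesize to that number, or loop over the computed number of pages.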

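Next, you can either save this file directly as temp.txt with Command+S or download it with Stata's copy command, as in the code further down:

  • The downloaded file is a single line with a very large number of characters, so it cannot be read into Stata directly. This is where Sublime Text 3 comes in: the goal is to turn the one-line file into a multi-line file by find-and-replace, using the following rule:

That is, replace },{ with }, followed by a line break, followed by {.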
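
If you would rather not leave Stata, the same replacement can probably be done with the built-in fileread() and filewrite() functions. This is a minimal sketch and not the approach used in this post; the sample text and the temporary file names are hypothetical.

```stata
* Hypothetical one-line sample standing in for the downloaded temp.txt
tempfile raw split
file open fh using "`raw'", write text replace
file write fh `"[{"SongName":"a","FileHash":"H1"},{"SongName":"b","FileHash":"H2"}]"'
file close fh

clear
set obs 1
* fileread() loads the whole file into a single string
gen strL s = fileread("`raw'")
* Put a line break between adjacent objects: "},{" becomes "}," + newline + "{"
replace s = subinstr(s, "},{", "}," + char(10) + "{", .)
* filewrite() with a third argument of 1 overwrites an existing file
gen long nbytes = filewrite("`split'", s, 1)
* Count the line breaks that were inserted
gen breaks = length(s) - length(subinstr(s, char(10), "", .))
list breaks nbytes
```

For a real download you would point fileread() at temp.txt and write the result back to a file of your choice.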

Once that is done, the code below reads the file into Stata and processes it (notes on the user-written commands follow the code):

clear all
copy "http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=Critty&page=1&pagesize=159&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679" temp.txt, replace
utrans temp.txt
infix strL v 1-5000 using temp.txt, clear
replace v = subinstr(v, `"jQuery1124031407034218114016_1528575153677({"status":1,"error_code":0,"data":{"page":1,"tab":"全部","lists":["', "", .)
drop in 163/166
split v, parse(,)
keep v1 v34 v35
* i.e. keep the variables that hold the song name and the FileHash
replace v1 = subinstr(v1, `"{"SongName":""', "", .)
replace v1 = subinstr(v1, `"""', "", .)
format v1 %20s
replace v35 = v34 if index(v35, "FileHash") == 0
drop v34
drop if !index(v35, "FileHash")
replace v35 = subinstr(v35, `""FileHash":""', "", .)
replace v35 = subinstr(v35, `"""', "", .)
replace v1 = subinstr(v1, `" HB to <em>CRITTY<\/em>"', "", .)
replace v1 = subinstr(v1, `" "', "", .)
replace v1 = subinstr(v1, `"."', "", .)
replace v1 = subinstr(v1, `"<em>"', "", .)
replace v1 = subinstr(v1, `"<\/em>"', "", .)
  • Note: utrans is a user-written command; its code is as follows:
*! Convert Chinese text to UTF-8
*! utrans filename.extension
*! Example: utrans temp.do
cap prog drop utrans
prog define utrans
version 14.0
syntax anything
cap preserve
clear
cap qui{
unicode encoding set gb18030
unicode translate "`anything'"
unicode erasebackups, badidea
unicode analyze "`anything'"
}
if r(N_needed) == 0 di in yellow "Conversion succeeded"
if r(N_needed) != 0 di in red "Conversion failed"
end

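We then end up with a table like this:

At this point you may wonder what this long string of characters (the FileHash) is for. Open the playback page of any song, for example this one:
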
http://www.kugou.com/song/#hash=214050FC671BDDAC6A68F8E891014E80&album_id=1214527

The playback page link for this song, 遇萤, contains exactly this hash value. In fact, the following link opens the same playback page:

http://www.kugou.com/song/#hash=214050FC671BDDAC6A68F8E891014E80

In other words, prefixing the FileHash value with
http://www.kugou.com/song/#hash=
takes us to the playback page. The code below does exactly that:

replace v35 = "http://www.kugou.com/song/#hash=" + v35
save songlist, replace

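So how do we download the song from this page? 丁文亮 (Ding Wenliang) has already done this in R; see his post "kugou 爬取记" (notes on scraping Kugou) for details. I took a different route, because I found this website: http://music.sonimei.cn/ . Paste the link from above into it and it returns the download link for the music:

In other words, the real download link is this one:
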
http://fs.open.kugou.com/5ce0942f75d061a29e3c84a43833e335/5b1c3e33/G018/M07/03/10/Ug0DAFV2wESAVjxAADqXLh5c7mg685.mp3

Right, the next step is to work out how to get from link 1:

http://www.kugou.com/song/#hash=214050FC671BDDAC6A68F8E891014E80

to link 2 automatically:

http://fs.open.kugou.com/5ce0942f75d061a29e3c84a43833e335/5b1c3e33/G018/M07/03/10/Ug0DAFV2wESAVjxAADqXLh5c7mg685.mp3

Open the Inspect panel on the music.sonimei.cn page and this request is easy to spot:

![](https://czxb.github.io/mr/faxian2.png)

Its request headers look like this:

![](https://czxb.github.io/mr/faxian1.png)

We can reproduce this request with curl (right-click and choose Copy as cURL to copy it):

![](https://czxb.github.io/mr/curl.png)

Then paste it into Stata. To run a shell command from Stata, just prefix it with shell or !.

!curl 'http://music.sonimei.cn/'
-H 'Cookie: UM_distinctid=163e4e1af53935-0db2428d718a45-336c7706-fa000-163e4e1af55ce; CNZZDATA1256427037=272117870-1528551817-null%7C1528574133; __51cke__=; __tins__15406580=%7B%22sid%22%3A%201528578179373%2C%20%22vd%22%3A%202%2C%20%22expires%22%3A%201528580401283%7D; __51laig__=2'
-H 'Origin: http://music.sonimei.cn'
-H 'Accept-Encoding: gzip, deflate'
-H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8'
-H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8'
-H 'Accept: application/json, text/javascript, */*; q=0.01'
-H 'Referer: http://music.sonimei.cn/?url=http%3A%2F%2Fwww.kugou.com%2Fsong%2F%23hash%3D214050FC671BDDAC6A68F8E891014E80'
-H 'X-Requested-With: XMLHttpRequest'
-H 'Proxy-Connection: keep-alive' --data 'input=http%3A%2F%2Fwww.kugou.com%2Fsong%2F%23hash%3D214050FC671BDDAC6A68F8E891014E80&filter=url&type=_&page=1' --compressed -o temp.json

The code above is really a single line; I broke it up by hand to make it easier to read. At the end I added -o temp.json, which saves the result to temp.json (the response is actually JSON, so it can be read directly with the user-written Stata command insheetjson; install it with ssc install insheetjson). Run the code above, open the file, and you will find the real download link for the music:

The following Stata code automates this process:

use songlist, clear
ren v1 name
ren v35 url
gen link = ""
forval i = 1/`=_N'{
local urlinput = url[`i']
percentencode `urlinput'
preserve
!curl 'http://music.sonimei.cn/' -H 'Cookie: UM_distinctid=163e4e1af53935-0db2428d718a45-336c7706-fa000-163e4e1af55ce; CNZZDATA1256427037=272117870-1528551817-null%7C1528574133; __51cke__=; __tins__15406580=%7B%22sid%22%3A%201528578179373%2C%20%22vd%22%3A%202%2C%20%22expires%22%3A%201528580401283%7D; __51laig__=2' -H 'Origin: http://music.sonimei.cn' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Referer: http://music.sonimei.cn/?url=`r(percentencode)'' -H 'X-Requested-With: XMLHttpRequest' -H 'Proxy-Connection: keep-alive' --data 'input=`r(percentencode)'&filter=url&type=_&page=1' --compressed -o temp.json
clear
gen str200 url = ""
insheetjson url using temp.json, table(data) col(url)
replace url = subinstr(url, "\", "", .)
local url = url[1]
di "`url'"
restore
replace link = "`url'" in `i'
}
format name %10s
format url %10s
save song_download_list, replace

We then get a data table like this:

link is the URL from which the song can be downloaded directly. The code below downloads every song on the list:

use song_download_list, clear
drop if link == ""
forval i = 1/`=_N'{
copy "`=link[`i']'" "`=name[`i']'.mp3", replace
}

Downloads were fairly fast on my machine, roughly 0.5 MB/s (I was on Wi-Fi, and that was the peak speed).

Finally, let's tidy up the code above

  1. Search by keyword and return the list of search results
cap prog drop utrans
prog define utrans
version 14.0
syntax anything
cap preserve
clear
cap qui{
unicode encoding set gb18030
unicode translate "`anything'"
unicode erasebackups, badidea
unicode analyze "`anything'"
}
if r(N_needed) == 0 di in yellow "Conversion succeeded"
if r(N_needed) != 0 di in red "Conversion failed"
end
clear all
copy "http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=Critty&page=1&pagesize=159&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679" temp.txt, replace
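* Note: the command above downloads too much content to read into Stata directly. Either split it into lines by hand first, or set a smaller pagesize and read, process, and merge the results page by page.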
utrans temp.txt
infix strL v 1-5000 using temp.txt, clear
replace v = subinstr(v, `"jQuery1124031407034218114016_1528575153677({"status":1,"error_code":0,"data":{"page":1,"tab":"全部","lists":["', "", .)
drop in 163/166
split v, parse(,)
keep v1 v34 v35
* i.e. keep the variables that hold the song name and the FileHash
replace v1 = subinstr(v1, `"{"SongName":""', "", .)
replace v1 = subinstr(v1, `"""', "", .)
format v1 %20s
replace v35 = v34 if index(v35, "FileHash") == 0
drop v34
drop if !index(v35, "FileHash")
replace v35 = subinstr(v35, `""FileHash":""', "", .)
replace v35 = subinstr(v35, `"""', "", .)
replace v1 = subinstr(v1, `" HB to <em>CRITTY<\/em>"', "", .)
replace v1 = subinstr(v1, `" "', "", .)
replace v1 = subinstr(v1, `"."', "", .)
replace v1 = subinstr(v1, `"<em>"', "", .)
replace v1 = subinstr(v1, `"<\/em>"', "", .)
replace v35 = "http://www.kugou.com/song/#hash=" + v35
save songlist, replace
  2. Get each song's download link from its hash value
use songlist, clear
ren v1 name
ren v35 url
gen link = ""
forval i = 1/`=_N'{
local urlinput = url[`i']
percentencode `urlinput'
preserve
!curl 'http://music.sonimei.cn/' -H 'Cookie: UM_distinctid=163e4e1af53935-0db2428d718a45-336c7706-fa000-163e4e1af55ce; CNZZDATA1256427037=272117870-1528551817-null%7C1528574133; __51cke__=; __tins__15406580=%7B%22sid%22%3A%201528578179373%2C%20%22vd%22%3A%202%2C%20%22expires%22%3A%201528580401283%7D; __51laig__=2' -H 'Origin: http://music.sonimei.cn' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Referer: http://music.sonimei.cn/?url=`r(percentencode)'' -H 'X-Requested-With: XMLHttpRequest' -H 'Proxy-Connection: keep-alive' --data 'input=`r(percentencode)'&filter=url&type=_&page=1' --compressed -o temp.json
clear
gen str200 url = ""
insheetjson url using temp.json, table(data) col(url)
replace url = subinstr(url, "\", "", .)
local url = url[1]
di "`url'"
restore
replace link = "`url'" in `i'
}
format name %10s
format url %10s
save song_download_list, replace
  3. Download the songs
use song_download_list, clear
drop if link == ""
forval i = 1/`=_N'{
copy "`=link[`i']'" "`=name[`i']'.mp3", replace
}

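The download results look like this:

Oops, I forgot to mention one thing: percentencode is also an external command. I did not write it, but for convenience I bundled it with my dict command, so installing dict also installs percentencode. To install dict:
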
github install czxa/dict, replace
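
To see what percent-encoding does to a playback link, here is a hand-rolled sketch that handles only the four reserved characters that occur in that link. It is an illustration under that assumption, not how percentencode is implemented, and the local macro names are made up.

```stata
* Hand-rolled encoding of the four reserved characters that occur in the link.
* Illustration only; a general encoder must handle many more characters.
local u "http://www.kugou.com/song/#hash=214050FC671BDDAC6A68F8E891014E80"
local e = subinstr("`u'", ":", "%3A", .)
local e = subinstr("`e'", "/", "%2F", .)
local e = subinstr("`e'", "#", "%23", .)
local e = subinstr("`e'", "=", "%3D", .)
di "`e'"
```

The result should be the same encoded string that appears after input= in the curl command copied from the browser.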

A small digression

Building on this scraper, it is easy to write a small command that checks whether a keyword is a singer listed on Kugou. Here is the implementation:
kugou.ado

cap prog drop kugou
prog def kugou, rclass
syntax anything
qui{
cap preserve
clear all
local name = "`anything'"
percentencode "`name'"
copy "http://so.service.kugou.com/get/complex?callback=jQuery112405110215310589978_1528549487656&word=`r(percentencode)'&_=1528549487659" temp.txt, replace
utrans temp.txt
set obs 1
gen v = fileread("temp.txt")
split v, parse(,)
sxpose, clear
ren _var1 v
drop in 1
keep if index(v, "singerid")
cap keep in 1
}
if _rc == 198{
di in red "No singer named `name' was found!"
}
else{
qui replace v = subinstr(v, `""singerid":"', "", .)
local singerid = v[1]
di "The id of singer `name' is `singerid'"
di `"Kugou homepage of `name': {bf:{browse "http://www.kugou.com/singer/`singerid'.html": `name' on Kugou}}"'
ret local singerid "www.kugou.com/singer/`singerid'"
}
end

For example:

The singer's id is a key part of the link to the singer's homepage.

Some thoughts

  1. Thought 1: While looking at this website we notice the following:

This means that, besides the Kugou website, other music sites can be scraped in exactly the same way! Try it yourself.

  2. Thought 2: I still wanted to wrap everything above into a single command. The packaged code is below:
    Here I read the results page by page to avoid splitting lines by hand;
    I use regular expressions to find the hash values;
    Because it relies heavily on curl to imitate a browser request, I am not sure whether the command will run on other people's machines. Click the link below to download the ado file directly:

kugousearch.ado

*! Search the Kugou website and download the results
*! kugousearch 周杰伦
*! kugousearch 程振兴, d
*! kugousearch 程振兴, download
*! Required external commands: utrans, insheetjson, dict (percentencode)
cap prog drop kugousearch
prog def kugousearch
version 13.0
syntax anything [, Download]
clear all
qui{
local keyword = "`anything'"
percentencode `keyword'
copy "http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=`r(percentencode)'&page=1&pagesize=5&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679" temp.txt, replace
utrans temp.txt
infix strL v 1-20000 using temp.txt, clear
split v, parse(({ [{ },{ }] }))
sxpose, clear
ren _var1 v
drop if index(v, "SongName") == 0
drop in 1
gen name = ustrregexs(1) if ustrregexm(v, `""SongName":"(.*)","OwnerCount""')
gen hash = ustrregexs(1) if ustrregexm(v, `""FileHash":"(.*)"?"')
drop v
replace name = subinstr(name, `"""', "", .)
replace name = subinstr(name, `" "', "", .)
replace name = subinstr(name, `"."', "", .)
replace name = subinstr(name, `"\/"', "", .)
replace hash = "http://www.kugou.com/song/#hash=" + hash
save songlist, replace
local id = 1
local page = 2
while `id'{
copy "http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=`r(percentencode)'&page=`page'&pagesize=5&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679" temp.txt, replace
utrans temp.txt
infix strL v 1-20000 using temp.txt, clear
cap{
split v, parse(({ [{ },{ }] }))
sxpose, clear
ren _var1 v
drop if index(v, "SongName") == 0
drop in 1
gen name = ustrregexs(1) if ustrregexm(v, `""SongName":"(.*)","OwnerCount""')
gen hash = ustrregexs(1) if ustrregexm(v, `""FileHash":"(.*)"?"')
drop v
}
if _rc != 0{
use songlist, clear
save songlist, replace
continue, break
}
cap{
replace hash = "http://www.kugou.com/song/#hash=" + hash
}
append using songlist.dta
save songlist, replace
local page = `page' + 1
}
}
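di "1. Finished scraping the search results list!"
di "2. Found `=_N' search results in total. Next, the audio download links will be fetched; this takes a while, so please be patient."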
qui{
use songlist, clear
replace hash = ustrregexs(1) if ustrregexm(hash, `"(.*)","SQPayType""')
drop if length(hash) < 40
drop if length(hash) > 80
replace name = subinstr(name, `"""', "", .)
replace name = subinstr(name, `" "', "", .)
replace name = subinstr(name, `"."', "", .)
replace name = subinstr(name, `"\/"', "", .)
format name %10s
format hash %40s
replace hash = subinstr(hash, `"""', "", .)
gen link = ""
}
forval i = 1/`=_N'{
local rt = (`i'/`=_N')*100
di "Completed " %6.0f `rt' "%"
local urlinput = hash[`i']
qui{
percentencode `urlinput'
preserve
!curl 'http://music.sonimei.cn/' -H 'Cookie: UM_distinctid=163e4e1af53935-0db2428d718a45-336c7706-fa000-163e4e1af55ce; CNZZDATA1256427037=272117870-1528551817-null%7C1528574133; __51cke__=; __tins__15406580=%7B%22sid%22%3A%201528578179373%2C%20%22vd%22%3A%202%2C%20%22expires%22%3A%201528580401283%7D; __51laig__=2' -H 'Origin: http://music.sonimei.cn' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Referer: http://music.sonimei.cn/?url=`r(percentencode)'' -H 'X-Requested-With: XMLHttpRequest' -H 'Proxy-Connection: keep-alive' --data 'input=`r(percentencode)'&filter=url&type=_&page=1' --compressed -o temp.json
clear
gen str200 url = ""
insheetjson url using temp.json, table(data) col(url)
replace url = subinstr(url, "\", "", .)
local url = url[1]
di "`url'"
restore
replace link = "`url'" in `i'
}
}
save song_download_list, replace
drop if link == ""
di "3. Finished scraping the download links list!"
qui{
if "`download'" != ""{
use song_download_list, clear
drop if link == ""
forval i = 1/`=_N'{
copy "`=link[`i']'" "`=name[`i']'.mp3", replace
}
}
}
end
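
The regular-expression step can be tried on its own with a single record. The sketch below is a variation rather than the exact patterns used in kugousearch.ado: it uses a negated character class and a fixed-length hex match so that each capture stops at the closing quote. The record is a shortened, hypothetical example.

```stata
* Hypothetical, shortened record; the real response has many more fields
clear
set obs 1
gen str200 v = `"{"SongName":"Demo Song","OwnerCount":12,"FileHash":"214050FC671BDDAC6A68F8E891014E80","SQPayType":0}"'

* [^"]* stops at the first closing quote, so no later trimming is needed
gen name = ustrregexs(1) if ustrregexm(v, `""SongName":"([^"]*)""')
* A FileHash is matched here as exactly 32 hexadecimal characters
gen hash = ustrregexs(1) if ustrregexm(v, `""FileHash":"([0-9A-Fa-f]{32})""')
list name hash
```

Whether every real FileHash is exactly 32 hex characters is an assumption based on the examples in this post; loosen the pattern if that turns out not to hold.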

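For example, to download the search results for 刘含笑, run kugousearch 刘含笑. The result looks like this:

The second post was supposed to be about scraping the Kugou app, but on a whim I decided to add a music player widget to the site. This requires the hexo-tag-aplayer plugin. The player looks like this:


Using the hexo-tag-aplayer plugin

The following tag produces a music player widget:
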
{% aplayer "Caffeine" "Jeff Williams" "caffeine.mp3" "picture.jpg" "lrc:caffeine.txt" %}

For example, the code for the player shown at the top is:

{% aplayer "同手同脚" "王亚楠" "https://czxb.github.io/Web_data_source/同手同脚.mp3"  "https://czxb.github.io/Web_data_source/WechatIMG22.jpeg" %}

Scraping the audio and album-art links

As the previous post showed, the JSON file downloaded from http://music.sonimei.cn/ already contains links to both the audio and the album art. So a small change to kugousearch.ado is enough to get the audio link and the album-art link at the same time. The modified code is as follows:

*! Search the Kugou website and download the results
*! kugousearch 周杰伦
*! kugousearch 程振兴, d
*! kugousearch 程振兴, download
*! Required external commands: utrans, insheetjson, dict (percentencode)
cap prog drop kugousearch
prog def kugousearch
version 13.0
syntax anything [, Download]
clear all
qui{
local keyword = "`anything'"
percentencode `keyword'
copy "http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=`r(percentencode)'&page=1&pagesize=5&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679" temp.txt, replace
utrans temp.txt
infix strL v 1-20000 using temp.txt, clear
split v, parse(({ [{ },{ }] }))
sxpose, clear
ren _var1 v
drop if index(v, "SongName") == 0
drop in 1
gen name = ustrregexs(1) if ustrregexm(v, `""SongName":"(.*)","OwnerCount""')
gen hash = ustrregexs(1) if ustrregexm(v, `""FileHash":"(.*)"?"')
drop v
replace name = subinstr(name, `"""', "", .)
replace name = subinstr(name, `" "', "", .)
replace name = subinstr(name, `"."', "", .)
replace name = subinstr(name, `"\/"', "", .)
replace hash = "http://www.kugou.com/song/#hash=" + hash
save songlist, replace
local id = 1
local page = 2
while `id'{
copy "http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=`r(percentencode)'&page=`page'&pagesize=5&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679" temp.txt, replace
utrans temp.txt
infix strL v 1-20000 using temp.txt, clear
cap{
split v, parse(({ [{ },{ }] }))
sxpose, clear
ren _var1 v
drop if index(v, "SongName") == 0
drop in 1
gen name = ustrregexs(1) if ustrregexm(v, `""SongName":"(.*)","OwnerCount""')
gen hash = ustrregexs(1) if ustrregexm(v, `""FileHash":"(.*)"?"')
drop v
}
if _rc != 0{
use songlist, clear
save songlist, replace
continue, break
}
cap{
replace hash = "http://www.kugou.com/song/#hash=" + hash
}
append using songlist.dta
save songlist, replace
local page = `page' + 1
}
}
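di "1. Finished scraping the search results list!"
di "2. Found `=_N' search results in total. Next, the audio download links will be fetched; this takes a while, so please be patient."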
qui{
use songlist, clear
replace hash = ustrregexs(1) if ustrregexm(hash, `"(.*)","SQPayType""')
drop if length(hash) < 40
drop if length(hash) > 80
replace name = subinstr(name, `"""', "", .)
replace name = subinstr(name, `" "', "", .)
replace name = subinstr(name, `"."', "", .)
replace name = subinstr(name, `"\/"', "", .)
format name %10s
format hash %40s
replace hash = subinstr(hash, `"""', "", .)
gen link = ""
gen piclink = ""
save songlist, replace
}
forval i = 1/`=_N'{
use songlist, clear
local rt = (`i'/`=_N')*100
di "Completed " %6.0f `rt' "%"
local urlinput = hash[`i']
qui{
percentencode `urlinput'
cap restore
preserve
!curl 'http://music.sonimei.cn/' -H 'Cookie: UM_distinctid=163e4e1af53935-0db2428d718a45-336c7706-fa000-163e4e1af55ce; CNZZDATA1256427037=272117870-1528551817-null%7C1528574133; __51cke__=; __tins__15406580=%7B%22sid%22%3A%201528578179373%2C%20%22vd%22%3A%202%2C%20%22expires%22%3A%201528580401283%7D; __51laig__=2' -H 'Origin: http://music.sonimei.cn' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Referer: http://music.sonimei.cn/?url=`r(percentencode)'' -H 'X-Requested-With: XMLHttpRequest' -H 'Proxy-Connection: keep-alive' --data 'input=`r(percentencode)'&filter=url&type=_&page=1' --compressed -o temp.json
clear
gen str200 url = ""
gen str200 pic = ""
cap{
insheetjson url pic using temp.json, table(data) col(url pic)
replace url = subinstr(url, "\", "", .)
local url = url[1]
replace pic = subinstr(pic, "\", "", .)
local pic = pic[1]
restore
replace link = "`url'" in `i'
replace piclink = "`pic'" in `i'
}
save songlist, replace
}
}
save song_download_list, replace
drop if link == ""
di "3. Finished scraping the download links list!"
qui{
if "`download'" != ""{
use song_download_list, clear
drop if link == ""
forval i = 1/`=_N'{
copy "`=link[`i']'" "`=name[`i']'.mp3", replace
}
}
}
end

kugousearch 银临

In theory the code above works, but it takes too long to run and often fails. We can trim it into a simplified version that, for example, scrapes only the first 20 search results:

*! Search the Kugou website and download the results
*! kugousearch 周杰伦
*! kugousearch 程振兴, d
*! kugousearch 程振兴, download
*! Required external commands: utrans, insheetjson, dict (percentencode)
cap prog drop kugousearch
prog def kugousearch
version 13.0
syntax anything [, Download]
clear all
qui{
local keyword = "`anything'"
percentencode `keyword'
copy "http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=`r(percentencode)'&page=1&pagesize=5&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679" temp.txt, replace
utrans temp.txt
infix strL v 1-20000 using temp.txt, clear
split v, parse(({ [{ },{ }] }))
sxpose, clear
ren _var1 v
drop if index(v, "SongName") == 0
drop in 1
gen name = ustrregexs(1) if ustrregexm(v, `""SongName":"(.*)","OwnerCount""')
gen hash = ustrregexs(1) if ustrregexm(v, `""FileHash":"(.*)"?"')
drop v
replace name = subinstr(name, `"""', "", .)
replace name = subinstr(name, `" "', "", .)
replace name = subinstr(name, `"."', "", .)
replace name = subinstr(name, `"\/"', "", .)
replace hash = "http://www.kugou.com/song/#hash=" + hash
save songlist, replace
forval i = 2/4{
copy "http://songsearch.kugou.com/song_search_v2?callback=jQuery1124031407034218114016_1528575153677&keyword=`r(percentencode)'&page=`i'&pagesize=5&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1528575153679" temp.txt, replace
utrans temp.txt
infix strL v 1-20000 using temp.txt, clear
cap{
split v, parse(({ [{ },{ }] }))
sxpose, clear
ren _var1 v
drop if index(v, "SongName") == 0
drop in 1
gen name = ustrregexs(1) if ustrregexm(v, `""SongName":"(.*)","OwnerCount""')
gen hash = ustrregexs(1) if ustrregexm(v, `""FileHash":"(.*)"?"')
drop v
}
if _rc != 0{
use songlist, clear
save songlist, replace
continue, break
}
cap{
replace hash = "http://www.kugou.com/song/#hash=" + hash
}
append using songlist.dta
save songlist, replace
}
}
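di "1. Finished scraping the search results list!"
di "2. Found `=_N' search results in total. Next, the audio download links will be fetched; this takes a while, so please be patient."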
qui{
use songlist, clear
replace hash = ustrregexs(1) if ustrregexm(hash, `"(.*)","SQPayType""')
drop if length(hash) < 40
drop if length(hash) > 80
replace name = subinstr(name, `"""', "", .)
replace name = subinstr(name, `" "', "", .)
replace name = subinstr(name, `"."', "", .)
replace name = subinstr(name, `"\/"', "", .)
format name %10s
format hash %40s
replace hash = subinstr(hash, `"""', "", .)
gen link = ""
gen piclink = ""
save songlist, replace
}
forval i = 1/`=_N'{
use songlist, clear
local rt = (`i'/`=_N')*100
di "Completed " %6.0f `rt' "%"
local urlinput = hash[`i']
qui{
percentencode `urlinput'
cap restore
preserve
!curl 'http://music.sonimei.cn/' -H 'Cookie: UM_distinctid=163e4e1af53935-0db2428d718a45-336c7706-fa000-163e4e1af55ce; CNZZDATA1256427037=272117870-1528551817-null%7C1528574133; __51cke__=; __tins__15406580=%7B%22sid%22%3A%201528578179373%2C%20%22vd%22%3A%202%2C%20%22expires%22%3A%201528580401283%7D; __51laig__=2' -H 'Origin: http://music.sonimei.cn' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Referer: http://music.sonimei.cn/?url=`r(percentencode)'' -H 'X-Requested-With: XMLHttpRequest' -H 'Proxy-Connection: keep-alive' --data 'input=`r(percentencode)'&filter=url&type=_&page=1' --compressed -o temp.json
clear
gen str200 url = ""
gen str200 pic = ""
cap{
insheetjson url pic using temp.json, table(data) col(url pic)
replace url = subinstr(url, "\", "", .)
local url = url[1]
replace pic = subinstr(pic, "\", "", .)
local pic = pic[1]
restore
replace link = "`url'" in `i'
replace piclink = "`pic'" in `i'
}
save songlist, replace
}
}
qui save song_download_list, replace
qui drop if link == ""
di "3. Finished scraping the download links list!"
qui{
if "`download'" != ""{
use song_download_list, clear
drop if link == ""
forval i = 1/`=_N'{
copy "`=link[`i']'" "`=name[`i']'.mp3", replace
}
}
}
end

kugousearch 银临

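The scraping results look like this:

With that we can make 22 music players. The plugin is said to support playlists, but my attempt at that failed. To generate the player code more quickly, let's write some more code.

Generating the music players
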
cap file close myfile
file open myfile using temp.json, write replace
forval i = 1/`=_N'{
file write myfile `"{% aplayer "`=name[`i']'" "银临 (probably)" "`=link[`i']'" "`=piclink[`i']'" %}"' _n
}
file close myfile
utrans temp.json

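Appendix
kugouslp.do

After two days of effort we have finally reached the main part. This post covers how I scraped the Kugou app with Stata.
First, open the Kugou app and search for 周杰伦 (Jay Chou):

Capturing packets with Charles

Wow, 606 songs in total! Today we will download all 606 of them.
Open 周杰伦's artist page, then start Charles:

After looking through the captured requests several times, I believe this is the one sent when I clicked the singles tab:

Right-click and choose Copy cURL Request:
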
curl -H 'Host: kmr.service.kugou.com' -H 'Content-Type: application/json' -H 'Cookie: KuGoo=KugooID=978183797&KugooPwd=64A545DAD7901B87DDA7CAAB8C543B8E&NickName=%u0048%u0065%u0074%u0065%u0072%u006f%u0073%u006b%u0065%u0064%u0061%u0073%u0074%u0069%u0063%u0069%u0074%u0079&Pic=http://imge.kugou.com/kugouicon/165/20161130/20161130172325344744.jpg&RegState=1&RegFrom=WEIXIN&t=e09e2563ce3b6dd1ab241213e28c19df7508d708a1d7923520cec4d40a44632c&a_id=1155&ct=1528731147&UserName=%u006b%u0067%u006f%u0070%u0065%u006e%u0039%u0037%u0038%u0031%u0038%u0033%u0037%u0039%u0037&t1=; kg_mid=6a496367090cfcc3f6d6ab73809b5a9c' -H 'Accept: */*' -H 'User-Agent: KugouMusic/2.6.1 (Mac OS X ban ben 10.13.4(ban hao 17E202))' -H 'Accept-Language: zh-Hans-CN;q=1, en-CN;q=0.9' --data-binary '{"clienttime":1528787597895,"author_id":"3520","mid":"7f24358ad3591e449129a4ef58668e09","sort":2,"clientver":261,"pagesize":30,"area_code":"all","key":"972197755efb7824eaad81031770d482","page":1,"appid":1155}' --compressed 'http://kmr.service.kugou.com/container/v1/audio_group/author'

Breaking the curl command above into its parts:

curl
-H 'Host: kmr.service.kugou.com'
-H 'Content-Type: application/json'
-H 'Cookie: KuGoo=KugooID=978183797&KugooPwd=64A545DAD7901B87DDA7CAAB8C543B8E&NickName=%u0048%u0065%u0074%u0065%u0072%u006f%u0073%u006b%u0065%u0064%u0061%u0073%u0074%u0069%u0063%u0069%u0074%u0079&Pic=http://imge.kugou.com/kugouicon/165/20161130/20161130172325344744.jpg&RegState=1&RegFrom=WEIXIN&t=e09e2563ce3b6dd1ab241213e28c19df7508d708a1d7923520cec4d40a44632c&a_id=1155&ct=1528731147&UserName=%u006b%u0067%u006f%u0070%u0065%u006e%u0039%u0037%u0038%u0031%u0038%u0033%u0037%u0039%u0037&t1=; kg_mid=6a496367090cfcc3f6d6ab73809b5a9c'
-H 'Accept: */*'
-H 'User-Agent: KugouMusic/2.6.1 (Mac OS X ban ben 10.13.4(ban hao 17E202))'
-H 'Accept-Language: zh-Hans-CN;q=1, en-CN;q=0.9'
--data-binary '{"clienttime":1528787597895,"author_id":"3520","mid":"7f24358ad3591e449129a4ef58668e09","sort":2,"clientver":261,"pagesize":30,"area_code":"all","key":"972197755efb7824eaad81031770d482","page":1,"appid":1155}'
--compressed
'http://kmr.service.kugou.com/container/v1/audio_group/author'

Note that this is a POST request. Looking closely, the request body contains "pagesize":30. Change it to "pagesize":606 and run it:

!curl -H 'Host: kmr.service.kugou.com' -H 'Content-Type: application/json' -H 'Cookie: KuGoo=KugooID=978183797&KugooPwd=64A545DAD7901B87DDA7CAAB8C543B8E&NickName=%u0048%u0065%u0074%u0065%u0072%u006f%u0073%u006b%u0065%u0064%u0061%u0073%u0074%u0069%u0063%u0069%u0074%u0079&Pic=http://imge.kugou.com/kugouicon/165/20161130/20161130172325344744.jpg&RegState=1&RegFrom=WEIXIN&t=e09e2563ce3b6dd1ab241213e28c19df7508d708a1d7923520cec4d40a44632c&a_id=1155&ct=1528731147&UserName=%u006b%u0067%u006f%u0070%u0065%u006e%u0039%u0037%u0038%u0031%u0038%u0033%u0037%u0039%u0037&t1=; kg_mid=6a496367090cfcc3f6d6ab73809b5a9c' -H 'Accept: */*' -H 'User-Agent: KugouMusic/2.6.1 (Mac OS X ban ben 10.13.4(ban hao 17E202))' -H 'Accept-Language: zh-Hans-CN;q=1, en-CN;q=0.9' --data-binary '{"clienttime":1528787597895,"author_id":"3520","mid":"7f24358ad3591e449129a4ef58668e09","sort":2,"clientver":261,"pagesize":606,"area_code":"all","key":"972197755efb7824eaad81031770d482","page":1,"appid":1155}' --compressed 'http://kmr.service.kugou.com/container/v1/audio_group/author' -o temp.txt
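
Rather than editing the number by hand each time, the page size can be held in a local macro and spliced into the request body. This is only a sketch: the body is a shortened, hypothetical version that keeps three of the fields from the request, and the macro names are made up.

```stata
* Page size to request; 606 is the song count shown in the app
local n 606
* Shortened, hypothetical request body; the real one carries more fields
local body `"{"author_id":"3520","pagesize":`n',"page":1}"'
* Confirm that the page size was spliced into the JSON text
di as text "pagesize found at position " strpos(`"`body'"', `""pagesize":`n'"')
```

The macro could then be placed inside the --data-binary argument of the curl command in the same way.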

Processing the data in Stata

The result is JSON, but when I tried reading it with insheetjson it failed, so let's fall back on splitting the lines by hand and reading it in:

!curl -H 'Host: kmr.service.kugou.com' -H 'Content-Type: application/json' -H 'Cookie: KuGoo=KugooID=978183797&KugooPwd=64A545DAD7901B87DDA7CAAB8C543B8E&NickName=%u0048%u0065%u0074%u0065%u0072%u006f%u0073%u006b%u0065%u0064%u0061%u0073%u0074%u0069%u0063%u0069%u0074%u0079&Pic=http://imge.kugou.com/kugouicon/165/20161130/20161130172325344744.jpg&RegState=1&RegFrom=WEIXIN&t=e09e2563ce3b6dd1ab241213e28c19df7508d708a1d7923520cec4d40a44632c&a_id=1155&ct=1528731147&UserName=%u006b%u0067%u006f%u0070%u0065%u006e%u0039%u0037%u0038%u0031%u0038%u0033%u0037%u0039%u0037&t1=; kg_mid=6a496367090cfcc3f6d6ab73809b5a9c' -H 'Accept: */*' -H 'User-Agent: KugouMusic/2.6.1 (Mac OS X ban ben 10.13.4(ban hao 17E202))' -H 'Accept-Language: zh-Hans-CN;q=1, en-CN;q=0.9' --data-binary '{"clienttime":1528787597895,"author_id":"3520","mid":"7f24358ad3591e449129a4ef58668e09","sort":2,"clientver":261,"pagesize":606,"area_code":"all","key":"972197755efb7824eaad81031770d482","page":1,"appid":1155}' --compressed 'http://kmr.service.kugou.com/container/v1/audio_group/author' -o temp.txt
* Split the file into lines by hand before continuing
utrans temp.txt
infix strL v 1-20000 using temp.txt, clear
gen name = ustrregexs(1) if ustrregexm(v, `""audio_name":"(.*)","video_timelength""')
gen hash = ustrregexs(1) if ustrregexm(v, `""hash":"(.*)","hash_320""')
drop if missing(name) | missing(hash)
drop v
replace name = subinstr(name, `"""', "", .)
replace name = subinstr(name, `" "', "", .)
replace name = subinstr(name, `"."', "", .)
replace name = subinstr(name, `"\/"', "", .)
format name %10s
format hash %40s

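Finding the real download link

Some songs were lost in the result. These are presumably the famous "因版权方要求,歌曲暂不提供服务!" cases ("At the copyright holder's request, this song is temporarily unavailable!"), and a few others may simply have no hash value. The response actually contains other fields such as hash_320 as well. That brings us back to finding a music download link from a hash value. I will not waste time repeating all of that; let's just take the sixth hash value, A50E563C152A4584321501E5BA824304, as an example and look up that song:
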
replace hash = "http://www.kugou.com/song/#hash=" + hash
keep in 6
gen link = ""
gen piclink = ""
local urlinput = hash[1]
percentencode `urlinput'
cap restore
preserve
!curl 'http://music.sonimei.cn/' -H 'Cookie: UM_distinctid=163e4e1af53935-0db2428d718a45-336c7706-fa000-163e4e1af55ce; CNZZDATA1256427037=272117870-1528551817-null%7C1528574133; __51cke__=; __tins__15406580=%7B%22sid%22%3A%201528578179373%2C%20%22vd%22%3A%202%2C%20%22expires%22%3A%201528580401283%7D; __51laig__=2' -H 'Origin: http://music.sonimei.cn' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.9,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Referer: http://music.sonimei.cn/?url=`r(percentencode)'' -H 'X-Requested-With: XMLHttpRequest' -H 'Proxy-Connection: keep-alive' --data 'input=`r(percentencode)'&filter=url&type=_&page=1' --compressed -o temp.json
clear
gen str200 url = ""
gen str200 pic = ""
insheetjson url pic using temp.json, table(data) col(url pic)
replace url = subinstr(url, "\", "", .)
local url = url[1]
replace pic = subinstr(pic, "\", "", .)
local pic = pic[1]
restore
replace link = "`url'"
replace piclink = "`pic'"
sxpose, clear

Done at last! Let's make one more music player:

The code is as follows:
{% aplayer "告白气球-Live版" "周杰伦" "http://fs.open.kugou.com/52222f834ca60dece2427e39abbf8f33/5b1f80bc/G114/M01/1F/10/sg0DAFnWF0mAWSkzADPphbA0r2s996.mp3"  "http://imge.kugou.com/stdmusic/150/20171005/20171005201444814336.jpg" %}

Appendix
kugou3.do

程振兴 @czxa.top