Downloading Images from WeChat Articles

This post presents two programs, one written in Stata and one in Python, for scraping the images embedded in a WeChat article page.

Stata version

I first ran into the problem of downloading the images contained in a WeChat article during this year's winter break, and at the time I wrote a Stata program for it. It is fairly involved; the code is as follows:
downpic.ado

*! Download the images embedded in WeChat articles
*! 程振兴, 15 June 2018
*! Syntax: downpic urls [, Ignore(string) Path(string)]
*! urls: a single link, or several links separated by spaces;
*! ignore(string): skip images of the given format; may be abbreviated to i;
*! path(string): the download folder, created automatically if it does not exist; may be abbreviated to p.
*! Examples:
*! downpic `"https://mp.weixin.qq.com/s?__biz=MjM5MzIyODY1NA==&mid=2653889621&idx=1&sn=b513d479b5b132d7b2b832ff8c286c65&chksm=bd41b6548a363f427affabf983fd0a69db2122d28336e6cd2018def97839a089b5ebd9bb6841&scene=0#rd"' `"https://mp.weixin.qq.com/s?__biz=MzA5NjIzNjgxNw==&mid=2653071676&idx=5&sn=81cc83ae876b5012027086f46974d63a&chksm=8b652d42bc12a45412303e83813a8011b76b216429a7ddaf5678b3d305deb8259d9367fa4e83&mpshare=1&scene=1&srcid=06148NGZ2TXpPnSmkmQ9ynCc#rd"'
*! downpic `"https://mp.weixin.qq.com/s?__biz=MjM5MzIyODY1NA==&mid=2653889621&idx=1&sn=b513d479b5b132d7b2b832ff8c286c65&chksm=bd41b6548a363f427affabf983fd0a69db2122d28336e6cd2018def97839a089b5ebd9bb6841&scene=0#rd"'
*! downpic `"https://mp.weixin.qq.com/s?__biz=MzA5NjIzNjgxNw==&mid=2653071676&idx=5&sn=81cc83ae876b5012027086f46974d63a&chksm=8b652d42bc12a45412303e83813a8011b76b216429a7ddaf5678b3d305deb8259d9367fa4e83&mpshare=1&scene=1&srcid=06148NGZ2TXpPnSmkmQ9ynCc#rd"'
cap prog drop downpic
prog define downpic
    version 14.0
    syntax anything(name = urls) [, Path(string) Ignore(string)]
    clear all
    * Clean up the target folder name and create the folder if needed
    if index("`path'", " ") {
        local path = subinstr("`path'", " ", "_", .)
    }
    if "`path'" != "" {
        cap mkdir "`path'"
    }
    if "`path'" == "" {
        local path = "`c(pwd)'"
        di "Your current working directory is `path'."
    }
    local m = 1
    foreach name in `urls' {
        qui {
            * Download the article's HTML and convert it to Unicode
            cap copy "`name'" temp.txt, replace
            cap unicode encoding set gb18030
            cap unicode translate temp.txt
            cap unicode erasebackups, badidea
            infix strL v 1-20000 using temp.txt, clear
            * Keep only the lines that contain image links
            keep if index(v, "https") & (index(lower(v), "png") | index(lower(v), "jpeg") | index(lower(v), "jpg") | index(lower(v), "gif") | index(lower(v), "bmp") | index(lower(v), "svg") | index(lower(v), "eps") | index(lower(v), "gph"))
            * Split each line on whichever quote character wraps the URLs
            if index(v[1], `"'http"') {
                split v, parse(`"'"')
            }
            else {
                split v, parse(`"""')
            }
            * Stack the split pieces into a single variable, one piece per observation
            drop v
            set obs 1000
            gen v = ""
            local i = 1
            foreach var of varlist _all {
                replace v = `var'[1] if _n == `i'
                local i = `i' + 1
            }
            keep v
            keep if index(v, "http") & (index(lower(v), "png") | index(lower(v), "jpeg") | index(lower(v), "jpg") | index(lower(v), "gif") | index(lower(v), "bmp") | index(lower(v), "svg") | index(lower(v), "eps") | index(lower(v), "gph"))
            * Honor the ignore() option by dropping links of that format
            if "`ignore'" != "" {
                drop if index(lower(v), lower("`ignore'"))
            }
            * Download each image; files are named articleindex_imageindex_format
            forvalues i = 1/`=_N' {
                local a = v[`i']
                local ext = subinstr("`a'", "=", ".", .)
                local b = length("`ext'")
                local c = substr("`ext'", `b' - 6, .)
                cap copy "`a'" "`path'/`m'_`i'_`c'", replace
            }
        }
        local m = `m' + 1
    }
    cap erase temp.txt
end

Several links can be passed to this command at once.
Examples:

downpic `"https://mp.weixin.qq.com/s?__biz=MjM5MzIyODY1NA==&mid=2653889621&idx=1&sn=b513d479b5b132d7b2b832ff8c286c65&chksm=bd41b6548a363f427affabf983fd0a69db2122d28336e6cd2018def97839a089b5ebd9bb6841&scene=0#rd"' `"https://mp.weixin.qq.com/s?__biz=MzA5NjIzNjgxNw==&mid=2653071676&idx=5&sn=81cc83ae876b5012027086f46974d63a&chksm=8b652d42bc12a45412303e83813a8011b76b216429a7ddaf5678b3d305deb8259d9367fa4e83&mpshare=1&scene=1&srcid=06148NGZ2TXpPnSmkmQ9ynCc#rd"'
downpic `"https://mp.weixin.qq.com/s?__biz=MjM5MzIyODY1NA==&mid=2653889621&idx=1&sn=b513d479b5b132d7b2b832ff8c286c65&chksm=bd41b6548a363f427affabf983fd0a69db2122d28336e6cd2018def97839a089b5ebd9bb6841&scene=0#rd"'
downpic `"https://mp.weixin.qq.com/s?__biz=MzA5NjIzNjgxNw==&mid=2653071676&idx=5&sn=81cc83ae876b5012027086f46974d63a&chksm=8b652d42bc12a45412303e83813a8011b76b216429a7ddaf5678b3d305deb8259d9367fa4e83&mpshare=1&scene=1&srcid=06148NGZ2TXpPnSmkmQ9ynCc#rd"'

Download results:
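
The path() and ignore() options documented in the header can be added to any of these calls as well. A minimal illustration (the folder name weixin_pics and the gif format here are only placeholder values):

downpic `"https://mp.weixin.qq.com/s/Pw3lzQpS7Lk8hXXse1TezA"', path(weixin_pics) ignore(gif)

This should place the downloaded images into a weixin_pics folder and skip any gif files.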

Python version

Today I wrote a Python version. Since I have only just started learning Python and am not yet very fluent with it, some parts may be overly convoluted. The code is as follows:
downpic.py

# 程振兴, 14 June 2018
"""
Download the images (including GIFs) embedded in a WeChat article.
Packages used:
urllib
BeautifulSoup (bs4)
ssl
time
os
re
"""


def downpic(url, foldername="下载结果"):
    """
    Download the images (including GIFs) embedded in a WeChat article.
    :param url: required; the link of the article whose images should be downloaded;
    :param foldername: target folder, defaults to "下载结果" ("download results"); optional;
    :return: nothing, but after the function runs a folder of downloaded images is created.
    Examples:
    downpic("https://mp.weixin.qq.com/s/Pw3lzQpS7Lk8hXXse1TezA")
    downpic(url = "https://mp.weixin.qq.com/s?__biz=MzA5NjIzNjgxNw=="
                  "&mid=2653071676&idx=5&sn=81cc83ae876b5012027086f469"
                  "74d63a&chksm=8b652d42bc12a45412303e83813a8011b76b21"
                  "6429a7ddaf5678b3d305deb8259d9367fa4e83&mpshare=1&sce"
                  "ne=1&srcid=06148NGZ2TXpPnSmkmQ9ynCc#rd", foldername = 'mytest')
    """
    from urllib.request import urlretrieve
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import ssl
    ssl._create_default_https_context = ssl._create_unverified_context
    import time
    import re
    import os

    print("Start scraping =============================================>")

    def mkdir(path):
        # Create the target folder if it does not exist yet
        if not os.path.exists(path):
            os.makedirs(path)
            print("--- Creating a new folder... ---")
            print("--- Done ---")
        else:
            print("--- The folder already exists! ---")

    mkdir(foldername)
    cwd = os.getcwd()
    os.chdir(foldername)

    # Parse the article page and collect the real image links (WeChat stores them in data-src)
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    imagelocation = bsObj.findAll("img")
    linklist = []
    for link in imagelocation:
        if 'fmt' in str(link) and 'data-src' in link.attrs:
            linklist.append(link.attrs['data-src'])

    for i, link in enumerate(linklist, start=1):
        # The image format is carried in the wx_fmt= query parameter; stop at the next parameter, if any
        fmt = re.findall(r"fmt=([^&]+)", link)
        local_time = time.strftime("%Y%m%d%H%M%S", time.localtime())
        filename = local_time + '_' + str(i) + '.' + fmt[0]
        urlretrieve(link, filename)
        print("Downloaded image %d" % i)
    print("Downloaded %d images in total, all done" % len(linklist))
    os.chdir(cwd)

Examples:

downpic("https://mp.weixin.qq.com/s/Pw3lzQpS7Lk8hXXse1TezA")
downpic(url = "https://mp.weixin.qq.com/s?__biz=MzA5NjIzNjgxNw=="
"&mid=2653071676&idx=5&sn=81cc83ae876b5012027086f469"
"74d63a&chksm=8b652d42bc12a45412303e83813a8011b76b21"
"6429a7ddaf5678b3d305deb8259d9367fa4e83&mpshare=1&sce"
"ne=1&srcid=06148NGZ2TXpPnSmkmQ9ynCc#rd", foldername = 'mytest')
