
Python Data Visualization: Who Are the Python Gurus?

亿华云, 2025-10-02 09:04:14


Learning with attitude

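Earlier posts covered proxy pools and Cookies. Here I put both into practice by scraping WeChat official-account articles through Sogou search.

In Cui's book, he uses proxy IPs to deal with Sogou's anti-scraping measures, because when the same IP requests pages too frequently, the requests are redirected to a CAPTCHA page.

But times move on, and Sogou's anti-scraping has been updated as well: it now checks both the IP and the Cookies.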

01 Page analysis

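The goal is to collect four fields for each WeChat article: title, opening text, account name, and publish time.

The request method is GET, and the request URL is the part marked with the red box in the screenshot; the parameters after it are not needed.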

02 Getting past the anti-scraping checks

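When does the CAPTCHA page shown in the screenshot above appear?

In two cases: when the same IP requests pages repeatedly, and when the same Cookies request pages repeatedly.

With both at once, you get blocked even faster! I managed a complete crawl only once...

At first I set nothing at all and got the CAPTCHA page. Then I used a proxy IP and was still redirected to the CAPTCHA page. Only when I finally changed the Cookies did the crawl succeed.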

01 Proxy IP setup

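import numpy as np
import pandas as pd


def get_proxies(i):
    """Return the i-th proxy in the CSV as a requests-style proxies dict."""
    df = pd.read_csv('sg_effective_ip.csv', header=None, names=["proxy_type", "proxy_url"])
    proxy_type = ["{}".format(t) for t in np.array(df['proxy_type'])]
    proxy_url = ["{}".format(u) for u in np.array(df['proxy_url'])]
    # Raises IndexError if i is not smaller than the number of rows in the file
    proxies = {proxy_type[i]: proxy_url[i]}
    return proxies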

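I won't go over how to obtain and use proxies here. Earlier posts cover that, so interested readers can look there.

After two days of trying, free IPs really are not much use: my real IP was exposed almost immediately.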
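Because get_proxies(i) indexes the proxy list directly with the page number, it fails with an IndexError as soon as the page number reaches the number of proxies in the file. The sketch below shows one possible way around that by cycling through the list. It assumes the rows are already loaded as (proxy_type, proxy_url) pairs; the helper name pick_proxy and the sample addresses are hypothetical and not part of the original code.

```python
def pick_proxy(rows, i):
    """Return a requests-style proxies dict for page i, cycling through rows.

    rows is a list of (proxy_type, proxy_url) pairs (hypothetical input shape).
    Returns None when no proxies are available, which requests treats as
    "no proxy configured".
    """
    if not rows:
        return None
    proxy_type, proxy_url = rows[i % len(rows)]
    return {proxy_type: proxy_url}


# Made-up sample data, for illustration only
rows = [("http", "http://10.0.0.1:8080"), ("https", "https://10.0.0.2:8080")]
print(pick_proxy(rows, 1))
print(pick_proxy(rows, 2))
```

With only a handful of working proxies, cycling spreads the requests evenly instead of stopping at the first out-of-range page.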

02 Cookies setup

import random
import re
import time

import requests


def get_cookies_snuid():
    """Get a fresh SNUID value."""
    time.sleep(float(random.randint(2, 5)))
    url = "http://weixin.sogou.com/weixin?type=2&s_from=input&query=python&ie=utf8&_sug_=n&_sug_type_="
    # Replace each "your value" with the value from your own browser session
    headers = {"Cookie": "ABTEST=your value;IPLOC=CN3301;SUID=your value;SUIR=your value"}
    # HEAD request: fetch only the response headers
    response = requests.head(url, headers=headers).headers
    result = re.findall('SNUID=(.*?); expires', response['Set-Cookie'])
    SNUID = result[0]
    return SNUID

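Overall, the Cookies setup is the most important part of getting past the anti-scraping checks, and the key to it is changing the SNUID value dynamically.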
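The SNUID extraction step can be tried offline. The sketch below pulls the SNUID value out of a Set-Cookie header string and returns None when it is missing, rather than failing on an empty match list. The function name extract_snuid and the sample header string are made up for illustration; a real Sogou response may be formatted differently.

```python
import re


def extract_snuid(set_cookie):
    """Return the SNUID value from a Set-Cookie header string, or None."""
    match = re.search(r"SNUID=([^;]+)", set_cookie)
    return match.group(1) if match else None


# Made-up header value, for illustration only
sample = "ABTEST=0|1; path=/, SNUID=ABC123DEF; expires=Thu, 01-Jan-2026 00:00:00 GMT; path=/"
print(extract_snuid(sample))                # ABC123DEF
print(extract_snuid("ABTEST=0|1; path=/"))  # None
```

Returning None lets the caller decide whether to retry or keep the previous SNUID when the server does not send a new one.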

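I won't go into the reasons in detail. I only picked this up from experts' posts online, and my understanding of it is still shallow.

I crawled all 100 pages successfully only once. Other runs died at 75 pages or 50 pages, and in the end some were blocked as soon as they started...

I have no wish to get stuck in the "scrape, anti-scrape, anti-anti-scrape" swamp. What comes after scraping is my real goal: data analysis and data visualization, for example.

So I grab one big haul and get out quickly, with nothing but respect for Sogou's engineers.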

03 Getting the data

1 Building the request headers

head = """
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding:gzip, deflate
Accept-Language:zh-CN,zh;q=0.9
Connection:keep-alive
Host:weixin.sogou.com
Referer:http://weixin.sogou.com/
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
"""

# Your Cookies, without the SNUID value. get_message() appends this to head
# as one header line, so it should start with "Cookie:" and end with ";".
cookie = 'your Cookies'


def str_to_dict(header):
    """Build the request-header dict; different functions can build different headers."""
    header_dict = {}
    header = header.split('\n')
    for h in header:
        h = h.strip()
        if h:
            # Split on the first colon only, so URLs in values stay intact
            k, v = h.split(':', 1)
            header_dict[k] = v.strip()
    return header_dict
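As a quick offline check of this parsing approach, the sketch below runs the same split-on-first-colon logic on a short header string. The sample header values, including the SNUID placeholder, are made up for illustration.

```python
def str_to_dict(header):
    """Turn a 'Key:value' per-line string into a dict of request headers."""
    header_dict = {}
    for line in header.split("\n"):
        line = line.strip()
        if line:
            # maxsplit=1 keeps colons inside values, e.g. in URLs
            key, value = line.split(":", 1)
            header_dict[key] = value.strip()
    return header_dict


# Made-up sample headers, for illustration only
raw = """
Host:weixin.sogou.com
Referer:http://weixin.sogou.com/
Cookie:SNUID=EXAMPLE;
"""
headers = str_to_dict(raw)
print(headers["Referer"])  # http://weixin.sogou.com/
```

Splitting only on the first colon matters here: a plain split(":") would break the Referer URL into several pieces.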

2 Fetching the page data

import random
import re
import time

import requests
from bs4 import BeautifulSoup


def get_message():
    """Fetch title, opening text, account name and date from each result page."""
    failed_list = []
    for i in range(1, 101):
        print('Page ' + str(i))
        # Delay between requests; a web search suggested 15 s or more avoids a ban
        delay = float(random.randint(15, 20))
        print(delay)
        time.sleep(delay)
        # Refresh the SNUID value every 10 pages
        if (i - 1) % 10 == 0:
            value = get_cookies_snuid()
        snuid = 'SNUID=' + value + ';'
        # Set the Cookies
        cookies = cookie + snuid
        url = 'http://weixin.sogou.com/weixin?query=python&type=2&page=' + str(i) + '&ie=utf8'
        host = cookies + '\n'
        header = head + host
        headers = str_to_dict(header)
        # Set the proxy IP
        proxies = get_proxies(i)
        try:
            response = requests.get(url=url, headers=headers, proxies=proxies)
            html = response.text
            soup = BeautifulSoup(html, 'html.parser')
            data = soup.find_all('ul', {'class': 'news-list'})
            lis = data[0].find_all('li')
            for j in range(len(lis)):
                h3 = lis[j].find_all('h3')
                # Replace ASCII commas so they do not break the CSV columns
                title = h3[0].get_text().replace('\n', '').replace(',', '，')
                p = lis[j].find_all('p')
                article = p[0].get_text().replace(',', '，')
                a = lis[j].find_all('a', {'class': 'account'})
                name = a[0].get_text()
                span = lis[j].find_all('span', {'class': 's2'})
                # The publish time is embedded as a 10-digit Unix timestamp
                cmp = re.findall(r"\d{10}", span[0].get_text())
                date = time.strftime("%Y-%m-%d", time.localtime(int(cmp[0])))
                with open('sg_articles.csv', 'a+', encoding='utf-8-sig') as f:
                    f.write(title + ',' + article + ',' + name + ',' + date + '\n')
            print('Page ' + str(i) + ' succeeded')
        except Exception:
            print('Page ' + str(i) + ' failed')
            failed_list.append(i)
            continue
    # Pages that failed
    print(failed_list)


def main():
    get_message()


if __name__ == '__main__':
    main()
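get_message() keeps the CSV columns aligned by replacing ASCII commas in the title and opening text with full-width ones, which alters the scraped text. An alternative, which is not what the original code does, is the standard csv module: it quotes any field that contains a comma, so the text is stored unchanged. The helper name append_row and the sample row below are hypothetical.

```python
import csv
import io


def append_row(f, title, article, name, date):
    """Write one article as a CSV row; fields containing commas are quoted."""
    csv.writer(f).writerow([title, article, name, date])


# In-memory file for the demo; for a real file, open it with newline=''
buf = io.StringIO()
append_row(buf, "Python, pandas", "intro text", "some_account", "2018-11-01")

# Reading it back returns the title with its comma intact
buf.seek(0)
row = next(csv.reader(buf))
print(row)
```

pandas.read_csv understands quoted fields by default, so the later analysis code would read such a file without changes.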

In the end, the data was fetched successfully.

04 Data visualization

1 Top 10 accounts by number of WeChat articles published

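Sorting the WeChat articles returned by the search by account reveals these ten Python gurus.

I would really like to know whether they are run by teams or by individuals. Never mind, I'll follow them first.

This result probably also reflects the fact that I searched with the keyword Python: the account names all contain "Python" (CSDN being the exception).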

import pandas as pd
from pyecharts import Bar  # pyecharts 0.5.x API

df = pd.read_csv('sg_articles.csv', header=None, names=["title", "article", "name", "date"])

# Extract the publish year of each article
list1 = []
for j in df['date']:
    year = j.split('-')[0]
    list1.append(year)
df['year'] = list1

# Keep only articles published in 2018 and count them per account
# (the year is a string here, so compare against '2018')
df = df.loc[df['year'] == '2018']
place_message = df.groupby(['name'])
place_com = place_message['name'].agg(['count'])
place_com.reset_index(inplace=True)
place_com_last = place_com.sort_index()
dom = place_com_last.sort_values('count', ascending=False)[0:10]

attr = dom['name']
v1 = dom['count']
bar = Bar("Top 10 accounts by number of WeChat articles", title_pos='center', title_top=18, width=800, height=400)
bar.add("", attr, v1, is_convert=True, xaxis_min=10, yaxis_rotate=30, yaxis_label_textsize=10, is_yaxis_boundarygap=True, yaxis_interval=0, is_label_show=True, is_legend_show=False, label_pos='right', is_yaxis_inverse=True, is_splitline_show=False)
bar.render("wechat_articles_top10.html")
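The groupby, agg and sort_values steps amount to counting rows per account and keeping the ten largest counts. That logic can be checked without pandas or pyecharts using collections.Counter. This is a minimal sketch; the function name top_accounts and the account names are made up for illustration.

```python
from collections import Counter


def top_accounts(names, n=10):
    """Return the n most frequent account names as (name, count) pairs."""
    return Counter(names).most_common(n)


# Made-up account names, for illustration only
names = ["A", "B", "A", "C", "A", "B"]
print(top_accounts(names, 2))  # [('A', 3), ('B', 2)]
```

In pandas, df['name'].value_counts().head(10) should give the same ranking in one step.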

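2 Distribution of WeChat article publish times

The search results include articles from before 2018. Those are dropped here, and the publish times of the remaining articles are checked.

After all, information is only useful while it is current. If the search returned nothing but stale articles it would be pointless, especially in the ever-changing internet industry.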

import numpy as np
import pandas as pd
from pyecharts import Bar  # pyecharts 0.5.x API

df = pd.read_csv('sg_articles.csv', header=None, names=["title", "article", "name", "date"])

# Extract the publish year and month of each article
list1 = []
list2 = []
for j in df['date']:
    time_1 = j.split('-')[0]
    time_2 = j.split('-')[1]
    list1.append(time_1)
    list2.append(time_2)
df['year'] = list1
df['month'] = list2

# Keep only articles published in 2018 and count them per month
# (the year is a string here, so compare against '2018')
df = df.loc[df['year'] == '2018']
month_message = df.groupby(['month'])
month_com = month_message['month'].agg(['count'])
month_com.reset_index(inplace=True)
month_com_last = month_com.sort_index()

# range(1, 12) gives labels for months 1 to 11 only; this assumes the data
# contains exactly those eleven months
attr = ["{}".format("Month " + str(i)) for i in range(1, 12)]
v1 = np.array(month_com_last['count'])
v1 = ["{}".format(int(i)) for i in v1]
bar = Bar("Distribution of WeChat article publish times", title_pos='center', title_top=18, width=800, height=400)
bar.add("", attr, v1, is_stack=True, is_label_show=True)
bar.render("wechat_articles_by_month.html")
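The chart code pairs hard-coded labels from range(1, 12) with however many month groups the data happens to contain. If a month has no articles, the labels and counts no longer line up. The sketch below shows one possible way to keep them aligned by counting every month from 1 to 12 and filling missing months with zero. It assumes dates are "YYYY-MM-DD" strings as written by the scraper; the function name monthly_counts and the sample dates are made up.

```python
from collections import Counter


def monthly_counts(dates, year="2018"):
    """Return (labels, values) for months 1 to 12 of the given year.

    Months with no articles get a count of 0, so both lists always have
    twelve entries and stay aligned.
    """
    counts = Counter(
        d.split("-")[1] for d in dates if d.split("-")[0] == year
    )
    labels = [str(m) for m in range(1, 13)]
    values = [counts.get("%02d" % m, 0) for m in range(1, 13)]
    return labels, values


# Made-up sample dates, for illustration only
dates = ["2018-01-05", "2018-03-10", "2018-03-11", "2017-12-31"]
labels, values = monthly_counts(dates)
print(values[:3])  # [1, 0, 2]
```

The two lists can then be passed to bar.add() in place of attr and v1.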

3 Word clouds of titles and article openings

from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
import pandas as pd
import jieba

df = pd.read_csv('sg_articles.csv', header=None, names=["title", "article", "name", "date"])

text = ''
# for line in df['article'].astype(str):  (use this line for the article-opening word cloud)
for line in df['title']:
    # Segment the Chinese text into words with jieba
    text += ' '.join(jieba.cut(line, cut_all=False))
backgroud_Image = plt.imread('python_logo.jpg')
wc = WordCloud(
    background_color='white',
    mask=backgroud_Image,
    # Raw string so the backslashes in the Windows path are not treated as escapes
    font_path=r'C:\Windows\Fonts\STZHONGS.TTF',
    max_words=2000,
    max_font_size=150,
    random_state=30
)
wc.generate_from_text(text)
# Recolor the words using the colors of the background image
img_colors = ImageColorGenerator(backgroud_Image)
wc.recolor(color_func=img_colors)
plt.imshow(wc)
plt.axis('off')
# wc.to_file("article.jpg")  (article-opening word cloud)
wc.to_file("title.jpg")
print('Word cloud generated!')