Web Scraping Tutorial: Collecting Data from a University Supervisor Rating Site
Introduction
This tutorial walks you through writing a Python crawler that collects supervisor reviews from a university supervisor rating site. We will use the requests library to send HTTP requests, BeautifulSoup to parse HTML pages, and the json module to handle JSON data.
Environment Setup
Before starting, make sure the following libraries are available in your Python environment:
requests
bs4 (BeautifulSoup)
json (standard library)
concurrent.futures (standard library, for multithreading)
If the third-party packages are not installed, install them with:
pip install requests beautifulsoup4 lxml
(lxml is included because the code below uses BeautifulSoup's 'lxml' parser.)
Part 1: Scraping University Links
First, we scrape the page that lists all universities and save the links to a JSON file.
1.1 Code for scraping university links
```python
from bs4 import BeautifulSoup
import requests
import json

# Target site URL
url = "https://www.rateyoursupervisor.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

# Find all university links
links = soup.find_all('a')
university = {}
for link in links:
    text = link.get_text(strip=True)  # link text
    href = link.get('href')           # href attribute
    if text and href and 'University' in href:  # both text and link must exist
        href = url + href  # complete the relative link
        university[text] = href

# Save to a JSON file
with open('university_links.json', 'w', encoding='utf-8') as f:
    json.dump(university, f, ensure_ascii=False, indent=4)

print("University links saved to file.")
```
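Note that concatenating `url + href` only works when every `href` is a root-relative path. A more robust approach (a sketch, not part of the original code) is the standard library's `urljoin`, which handles relative paths and already-absolute URLs uniformly:

```python
from urllib.parse import urljoin

base = "https://www.rateyoursupervisor.com"

# A root-relative path is resolved against the base URL
full = urljoin(base, "/University/123")
print(full)  # → https://www.rateyoursupervisor.com/University/123

# An already-absolute URL is returned unchanged
other = urljoin(base, "https://other.example/x")
print(other)  # → https://other.example/x
```

Using `urljoin` inside the loop avoids producing malformed links if the site ever emits absolute `href` values.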
Part 2: Scraping a University's Supervisors and Their Links
Next, we write a program that scrapes the supervisors of a chosen university along with their profile links.
2.1 Code for scraping supervisors and their links
```python
from bs4 import BeautifulSoup
import requests
import json

# Read the JSON file of university links
with open('university_links.json', 'r', encoding='utf-8') as file:
    university_links = json.load(file)

# Look up the university URL from user input
def get_university_url(university_name):
    url1 = university_links.get(university_name)
    if url1:
        return url1
    raise SystemExit("University link not found")

university = input("Enter the university name: ")
url = get_university_url(university.strip())
url2 = "https://www.rateyoursupervisor.com/Professor/"

# Send the request and parse the page
request = requests.get(url)
soup = BeautifulSoup(request.text, 'lxml')

# Extract supervisor information from the inline script
professor_info = {}
scripts = soup.find_all('script')
for script in scripts:
    if 'Send' in script.text:
        # Take everything after the first '=' and strip any trailing semicolon
        professor = script.text.split('=', 1)[1].strip().rstrip(';')
        professor = professor.encode().decode('unicode-escape')
        data = json.loads(professor)
        professors = data.get("professorLists", [])
        for teacher in professors:
            professor_info[teacher.get("name")] = {
                'url': url2 + str(teacher.get("id")),
                'college': teacher.get("collegeName"),
                'star': teacher.get("star")
            }
        break

# Save supervisor information to a JSON file
with open(f'{university}_professor_links.json', 'w', encoding='utf-8') as f:
    json.dump(professor_info, f, ensure_ascii=False, indent=4)

print("Supervisor links saved to file.")
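Splitting the script text on '=' is fragile: it breaks if the JSON itself contains an equals sign. A regex that anchors on the variable name is sturdier. This sketch assumes the page embeds a statement of the form `var professorLists = {...};`, as the code above does; the sample string here is illustrative only:

```python
import json
import re

# Sample inline script text, mimicking the structure assumed above
script_text = 'var professorLists = {"professorLists": [{"name": "Prof. Li", "id": "42"}]};'

# Capture the object literal between the '=' and the trailing semicolon
match = re.search(r'var\s+professorLists\s*=\s*(\{.*\})\s*;?\s*$', script_text, re.S)
if match:
    data = json.loads(match.group(1))
    print(data["professorLists"][0]["name"])  # → Prof. Li
```

The regex tolerates variable whitespace and a missing semicolon, and fails cleanly (no match) instead of raising if the page layout changes.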
Part 3: Scraping Supervisor Reviews
Finally, we write a program that scrapes the reviews for each supervisor.
3.1 Code for scraping supervisor reviews
```python
import concurrent.futures
import requests
from bs4 import BeautifulSoup
import json

# Output file path
output_file_path = 'professor_reviews.txt'

# Read the JSON file produced in Part 2 (here: Wuhan University)
with open('武汉大学_professor_links.json', 'r', encoding='utf-8') as file:
    professors = json.load(file)

# Fetch the reviews for one supervisor
def fetch_professor_reviews(name, info):
    url = info['url']
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'lxml')
        comments = soup.find_all('p')
        comments_text = [comment.text.strip() for comment in comments if comment.text]
        return name, info['college'], info['star'], comments_text
    except requests.ReadTimeout:
        return name, info['college'], info['star'], ["Read timed out"]
    except requests.RequestException as e:
        return name, info['college'], info['star'], [str(e)]

# Scrape reviews with a thread pool
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_prof = {
            executor.submit(fetch_professor_reviews, name, info): name
            for name, info in professors.items()
        }
        for future in concurrent.futures.as_completed(future_to_prof):
            name, college, star, comments = future.result()
            output_file.write(f"Professor: {name}, College: {college}, Rating: {star}\n")
            for comment in comments:
                output_file.write(comment + '\n')
            output_file.write("\n")

print(f"All professor reviews saved to '{output_file_path}'.")
```
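With five worker threads each firing requests as fast as they can, the target server sees a burst of traffic. A common refinement (a sketch, not part of the original program; the delay value and User-Agent string are illustrative) is to share one `requests.Session` across threads and throttle each request:

```python
import time
import requests

# A shared session reuses TCP connections across requests
session = requests.Session()
session.headers.update({"User-Agent": "tutorial-crawler/0.1"})

def polite_get(url, delay=1.0):
    """Fetch a URL after a fixed delay, to reduce load on the server."""
    time.sleep(delay)  # delay value is illustrative; tune it per site policy
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response
```

Swapping `requests.get(url, timeout=10)` in `fetch_professor_reviews` for `polite_get(url)` keeps the thread pool but caps the request rate per worker.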
Summary
In this tutorial you learned how to write a Python crawler that collects data from a university supervisor rating site. Remember that any crawler must comply with the target site's robots.txt rules and with applicable laws and regulations. In real-world use, you should also account for the load you place on the site and for its anti-crawling mechanisms.
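The robots.txt check mentioned above can be automated with the standard library. This sketch parses an example policy directly; in a real run you would call `rp.read()` after pointing it at the site's actual robots.txt (the Disallow rule below is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Parse an example policy; rp.set_url(...) plus rp.read() would fetch
# the real https://www.rateyoursupervisor.com/robots.txt instead
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])

print(rp.can_fetch("*", "https://www.rateyoursupervisor.com/University/1"))  # → True
print(rp.can_fetch("*", "https://www.rateyoursupervisor.com/private/x"))     # → False
```

Calling `can_fetch` before each request makes the crawler skip paths the site has asked robots not to visit.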