Scraping University Supervisor Reviews

Published 2024-07-14  157 views


Web Scraping Tutorial: Collecting Data from a University Supervisor Review Site

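Introduction

This tutorial walks through writing a Python crawler that collects supervisor reviews from a university supervisor review site. We use the requests library to send HTTP requests, BeautifulSoup to parse the HTML pages, and the standard-library json module to handle JSON data.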

Environment Setup

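Before you start, make sure the following libraries are available in your Python environment:

  • requests
  • bs4 (BeautifulSoup)
  • lxml (the parser that the code in this tutorial passes to BeautifulSoup)
  • json (standard library)
  • concurrent.futures (standard library, used for multithreading)

The third-party packages can be installed with:

pip install requests beautifulsoup4 lxml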
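
To confirm the setup before running anything, a small check along these lines may help. It is only a sketch: the helper name missing_modules is made up for illustration, and lxml is included on the assumption that you keep the 'lxml' parser used in this tutorial's code.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names, not pip package names: beautifulsoup4 is imported as bs4.
required = ["requests", "bs4", "lxml", "json", "concurrent.futures"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Anything reported as missing can be installed with pip; json and concurrent.futures ship with Python and should never appear in the list.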

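Part 1: Scraping the University Links

First, we scrape the page that lists all the universities and save their links to a JSON file.

1.1 Code for scraping the university links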

from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests
import json

# Target site URL
url = "https://www.rateyoursupervisor.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
soup = BeautifulSoup(response.content, 'lxml')

# Find all university links
links = soup.find_all('a')
university = {}
for link in links:
    text = link.get_text(strip=True)  # link text (the university name)
    href = link.get('href')  # href attribute
    if text and href and 'University' in href:  # need both a name and a university link
        href = urljoin(url, href)  # resolve relative links against the site URL
        university[text] = href

# Save to a JSON file
with open('university_links.json', 'w', encoding='utf-8') as f:
    json.dump(university, f, ensure_ascii=False, indent=4)

print("University links saved to file.")
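
Completing each link is the step most likely to go wrong, because plain string concatenation only works when every href starts with a slash. The sketch below shows how urljoin from the standard library handles the common cases. The href values are hypothetical and the function name resolve_university_links is made up for illustration; the real site's paths may look different.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.rateyoursupervisor.com"

def resolve_university_links(base_url, hrefs):
    """Return absolute URLs for the hrefs that contain 'University'."""
    return [urljoin(base_url, href) for href in hrefs if href and "University" in href]

# Hypothetical href values covering root-relative, relative, absolute,
# unrelated, and missing links.
sample_hrefs = ["/University/1", "University/2", BASE_URL + "/University/3", "/about", None]
resolved = resolve_university_links(BASE_URL, sample_hrefs)
for link in resolved:
    print(link)
```

With concatenation, the second sample would become "https://www.rateyoursupervisor.comUniversity/2" and the third would have the site URL prepended twice.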

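Part 2: Scraping a University's Supervisors and Their Links

Next, we write a program that collects the supervisors of one university along with the link to each supervisor's page.

2.1 Code for scraping supervisors and their links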

import json
import sys

from bs4 import BeautifulSoup
import requests

# Load the university links saved in Part 1
with open('university_links.json', 'r', encoding='utf-8') as file:
    university_links = json.load(file)

# Look up the university URL for the name the user entered
def get_university_url(university_name):
    url1 = university_links.get(university_name)
    if url1:
        return url1
    else:
        sys.exit("University link not found")

# The name must match a key in university_links.json exactly
university = input("Enter the university name: ").strip()
url = get_university_url(university)
url2 = "https://www.rateyoursupervisor.com/Professor/"

# Send the request and parse the page
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'lxml')

# Extract the supervisor data from the inline script that holds it
professor_info = {}  # stays empty if no matching script is found
scripts = soup.find_all('script')
for script in scripts:
    if 'Send' in script.text:
        # This depends on the page's inline-script layout: take the text between
        # the first and second '=', then turn \uXXXX escapes into characters.
        professor = script.text.split('=')[1].replace('var professorLists', '').encode().decode('unicode-escape')
        data = json.loads(professor)
        professors = data.get("professorLists", [])
        for teacher in professors:
            professor_info[teacher.get("name")] = {
                'url': url2 + str(teacher.get("id")),  # str() in case the id is a number
                'college': teacher.get("collegeName"),
                'star': teacher.get("star")
            }
        break

# Save the supervisor information to a JSON file
with open(f'{university}_professor_links.json', 'w', encoding='utf-8') as f:
    json.dump(professor_info, f, ensure_ascii=False, indent=4)

print("Supervisor links saved to file.")
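
Pulling JSON out of an inline script by splitting on '=' breaks as soon as the data itself contains an equals sign. One alternative, sketched here with the standard library only, is to find the start of the JSON value and let json.JSONDecoder.raw_decode stop where the value ends. The sample script text, the variable name professorData, and the helper extract_json_after are all assumptions for illustration; the real page's script may be laid out differently.

```python
import json

def extract_json_after(script_text, marker):
    """Decode the first JSON object or array that follows `marker`, or return None."""
    start = script_text.find(marker)
    if start == -1:
        return None
    # Jump to the first '{' or '[' after the marker.
    candidates = [script_text.find(ch, start) for ch in "{["]
    candidates = [pos for pos in candidates if pos != -1]
    if not candidates:
        return None
    # raw_decode parses one JSON value and ignores whatever text follows it.
    value, _end = json.JSONDecoder().raw_decode(script_text, min(candidates))
    return value

# Hypothetical inline script; note the '=' inside the data and the trailing code.
sample_script = (
    'var professorData = {"professorLists": [{"id": 7, "name": "\\u5f20\\u4e09", '
    '"collegeName": "a=b", "star": 4.5}]}; Send(professorData);'
)
data = extract_json_after(sample_script, "professorData")
for teacher in data["professorLists"]:
    print(teacher["id"], teacher["name"], teacher["collegeName"], teacher["star"])
```

The json module decodes \uXXXX escapes by itself, so no separate unicode-escape step is needed when the script contains valid JSON.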

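Part 3: Scraping the Supervisor Reviews

Finally, we write a program that collects the reviews for each supervisor.

3.1 Code for scraping supervisor reviews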

import concurrent.futures
import requests
from bs4 import BeautifulSoup
import json

# Output file path
output_file_path = 'professor_reviews.txt'

# Load the JSON file produced in Part 2 (here for 武汉大学, Wuhan University)
with open('武汉大学_professor_links.json', 'r', encoding='utf-8') as file:
    professors = json.load(file)

# Fetch the reviews for one supervisor
def fetch_professor_reviews(name, info):
    url = info['url']
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'lxml')

        # Every non-empty <p> element on the page is treated as a review
        comments = soup.find_all('p')
        comments_text = [comment.text.strip() for comment in comments if comment.text.strip()]
        return name, info['college'], info['star'], comments_text
    except requests.Timeout:  # covers both connect and read timeouts
        return name, info['college'], info['star'], ["Request timed out"]
    except requests.RequestException as e:
        return name, info['college'], info['star'], [str(e)]

# Fetch the reviews with a thread pool and write results as they complete
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_prof = {executor.submit(fetch_professor_reviews, name, info): name for name, info in professors.items()}

        for future in concurrent.futures.as_completed(future_to_prof):
            name, college, star, comments = future.result()
            output_file.write(f"Professor: {name}, College: {college}, Rating: {star}, reviews:\n")
            for comment in comments:
                output_file.write(comment + '\n')
            output_file.write("\n")

print(f"All professor reviews have been saved to '{output_file_path}'.")
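
With five worker threads, the program can send requests in quick bursts. If you want to space them out, one possible approach is a small limiter shared by all threads that enforces a minimum gap between request start times. This is a sketch, not part of the original program: the class name MinIntervalLimiter and the 0.05 second interval are arbitrary choices for illustration.

```python
import threading
import time

class MinIntervalLimiter:
    """Hands out start times at least `min_interval` seconds apart, across threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Block until this caller's slot arrives; return the seconds waited."""
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            # Reserve the next slot while still holding the lock.
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            time.sleep(delay)  # sleep outside the lock so other threads can reserve slots
        return delay

limiter = MinIntervalLimiter(0.05)
started = time.monotonic()
waits = [limiter.wait() for _ in range(3)]
elapsed = time.monotonic() - started
print(f"three slots took {elapsed:.2f}s")
```

To use it in the program above, create one limiter before the thread pool and call limiter.wait() at the top of fetch_professor_reviews, just before requests.get.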

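Summary

In this tutorial you learned how to write a Python crawler that collects data from a university supervisor review site. Note that any crawler should respect the target site's robots.txt rules and the applicable laws and regulations. In practice, you should also take the site's load and its anti-scraping measures into account.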
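
As a starting point for the robots.txt check, the standard library's urllib.robotparser can answer whether a given URL may be fetched. The sketch below parses a made-up rule set so that it runs offline; the rules, the user agent name "my-crawler", and the helper build_parser are assumptions for illustration and say nothing about the real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_lines):
    """Build a RobotFileParser from the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser

# Hypothetical rules, used only to demonstrate the API.
sample_rules = [
    "User-agent: *",
    "Disallow: /admin/",
]
parser = build_parser(sample_rules)

base = "https://www.rateyoursupervisor.com"
allowed_page = parser.can_fetch("my-crawler", base + "/Professor/1")
blocked_page = parser.can_fetch("my-crawler", base + "/admin/settings")
print("Professor page allowed:", allowed_page)
print("Admin page allowed:", blocked_page)
```

Against the live site you would instead call parser.set_url(base + "/robots.txt") followed by parser.read(), and then skip any URL for which can_fetch returns False.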