Web Scraping Tutorial: Collecting Data from a University Supervisor Rating Site
Introduction
This tutorial walks you through writing a Python crawler that collects supervisor reviews from a university supervisor rating site. We will use the requests library to send HTTP requests, BeautifulSoup to parse HTML pages, and the json module to handle JSON data.
Environment Setup
Before starting, make sure the following libraries are available in your Python environment:
requests
bs4 (BeautifulSoup)
json (standard library)
concurrent.futures (standard library, for multithreading)
If the third-party packages are not installed, install them with:
pip install requests beautifulsoup4 lxml
(lxml is included because the code below uses BeautifulSoup's 'lxml' parser.)
Part 1: Scraping University Links
First, we scrape the page that lists all universities and save the links to a JSON file.
1.1 Code for scraping university links
```python
from bs4 import BeautifulSoup
import requests
import json

# Target site URL
url = "https://www.rateyoursupervisor.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')

# Find all university links
links = soup.find_all('a')
university = {}
for link in links:
    text = link.get_text(strip=True)  # link text
    href = link.get('href')           # href attribute
    if text and href and 'University' in href:  # both text and link must exist
        href = url + href  # complete the relative link
        university[text] = href

# Save to a JSON file
with open('university_links.json', 'w', encoding='utf-8') as f:
    json.dump(university, f, ensure_ascii=False, indent=4)

print("University links saved to file.")
```
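Note that concatenating `url + href` only works when every `href` is a root-relative path. A more robust approach (a sketch, not part of the original code) is the standard library's `urljoin`, which handles relative paths and already-absolute URLs uniformly:

```python
from urllib.parse import urljoin

base = "https://www.rateyoursupervisor.com"

# A root-relative path is resolved against the base URL
full = urljoin(base, "/University/123")
print(full)  # → https://www.rateyoursupervisor.com/University/123

# An already-absolute URL is returned unchanged
other = urljoin(base, "https://other.example/x")
print(other)  # → https://other.example/x
```

Using `urljoin` inside the loop avoids producing malformed links if the site ever emits absolute `href` values.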
Part 2: Scraping a University's Supervisors and Their Links
Next, we write a program that scrapes the supervisors of a chosen university along with their profile links.
2.1 Code for scraping supervisors and their links
```python
from bs4 import BeautifulSoup
import requests
import json

# Read the JSON file of university links
with open('university_links.json', 'r', encoding='utf-8') as file:
    university_links = json.load(file)

# Look up the university URL from user input
def get_university_url(university_name):
    url1 = university_links.get(university_name)
    if url1:
        return url1
    raise SystemExit("University link not found")

university = input("Enter the university name: ")
url = get_university_url(university.strip())
url2 = "https://www.rateyoursupervisor.com/Professor/"

# Send the request and parse the page
request = requests.get(url)
soup = BeautifulSoup(request.text, 'lxml')

# Extract supervisor information from the inline script
professor_info = {}
scripts = soup.find_all('script')
for script in scripts:
    if 'Send' in script.text:
        # Take everything after the first '=' and strip any trailing semicolon
        professor = script.text.split('=', 1)[1].strip().rstrip(';')
        professor = professor.encode().decode('unicode-escape')
        data = json.loads(professor)
        professors = data.get("professorLists", [])
        for teacher in professors:
            professor_info[teacher.get("name")] = {
                'url': url2 + str(teacher.get("id")),
                'college': teacher.get("collegeName"),
                'star': teacher.get("star")
            }
        break

# Save supervisor information to a JSON file
with open(f'{university}_professor_links.json', 'w', encoding='utf-8') as f:
    json.dump(professor_info, f, ensure_ascii=False, indent=4)

print("Supervisor links saved to file.")
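Splitting the script text on '=' is fragile: it breaks if the JSON itself contains an equals sign. A regex that anchors on the variable name is sturdier. This sketch assumes the page embeds a statement of the form `var professorLists = {...};`, as the code above does; the sample string here is illustrative only:

```python
import json
import re

# Sample inline script text, mimicking the structure assumed above
script_text = 'var professorLists = {"professorLists": [{"name": "Prof. Li", "id": "42"}]};'

# Capture the object literal between the '=' and the trailing semicolon
match = re.search(r'var\s+professorLists\s*=\s*(\{.*\})\s*;?\s*$', script_text, re.S)
if match:
    data = json.loads(match.group(1))
    print(data["professorLists"][0]["name"])  # → Prof. Li
```

The regex tolerates variable whitespace and a missing semicolon, and fails cleanly (no match) instead of raising if the page layout changes.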
Part 3: Scraping Supervisor Reviews
Finally, we write a program that scrapes the reviews for each supervisor.
3.1 Code for scraping supervisor reviews
```python
import concurrent.futures
import requests
from bs4 import BeautifulSoup
import json

# Output file path
output_file_path = 'professor_reviews.txt'

# Read the JSON file produced in Part 2 (here: Wuhan University)
with open('武汉大学_professor_links.json', 'r', encoding='utf-8') as file:
    professors = json.load(file)

# Fetch the reviews for one supervisor
def fetch_professor_reviews(name, info):
    url = info['url']
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'lxml')
        comments = soup.find_all('p')
        comments_text = [comment.text.strip() for comment in comments if comment.text]
        return name, info['college'], info['star'], comments_text
    except requests.ReadTimeout:
        return name, info['college'], info['star'], ["Read timed out"]
    except requests.RequestException as e:
        return name, info['college'], info['star'], [str(e)]

# Scrape reviews with a thread pool
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_prof = {
            executor.submit(fetch_professor_reviews, name, info): name
            for name, info in professors.items()
        }
        for future in concurrent.futures.as_completed(future_to_prof):
            name, college, star, comments = future.result()
            output_file.write(f"Professor: {name}, College: {college}, Rating: {star}\n")
            for comment in comments:
                output_file.write(comment + '\n')
            output_file.write("\n")

print(f"All professor reviews saved to '{output_file_path}'.")
```
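With five worker threads each firing requests as fast as they can, the target server sees a burst of traffic. A common refinement (a sketch, not part of the original program; the delay value and User-Agent string are illustrative) is to share one `requests.Session` across threads and throttle each request:

```python
import time
import requests

# A shared session reuses TCP connections across requests
session = requests.Session()
session.headers.update({"User-Agent": "tutorial-crawler/0.1"})

def polite_get(url, delay=1.0):
    """Fetch a URL after a fixed delay, to reduce load on the server."""
    time.sleep(delay)  # delay value is illustrative; tune it per site policy
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response
```

Swapping `requests.get(url, timeout=10)` in `fetch_professor_reviews` for `polite_get(url)` keeps the thread pool but caps the request rate per worker.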
Summary
In this tutorial you learned how to write a Python crawler that collects data from a university supervisor rating site. Remember that any crawler must comply with the target site's robots.txt rules and with applicable laws and regulations. In real-world use, you should also account for the load you place on the site and for its anti-crawling mechanisms.
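The robots.txt check mentioned above can be automated with the standard library. This sketch parses an example policy directly; in a real run you would call `rp.read()` after pointing it at the site's actual robots.txt (the Disallow rule below is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Parse an example policy; rp.set_url(...) plus rp.read() would fetch
# the real https://www.rateyoursupervisor.com/robots.txt instead
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])

print(rp.can_fetch("*", "https://www.rateyoursupervisor.com/University/1"))  # → True
print(rp.can_fetch("*", "https://www.rateyoursupervisor.com/private/x"))     # → False
```

Calling `can_fetch` before each request makes the crawler skip paths the site has asked robots not to visit.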