
Extract Webpage Information with Python

Author: XD / Published: December 29, 2020 07:57 / Updated: December 29, 2020 07:57 / Programming Notes / Views: 4066

Here is a Python program that extracts information from a saved webpage with BeautifulSoup and appends the data to a CSV file.

from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

# Local copy of the roster page; urlopen also accepts file:// URLs.
url = 'file:///Users/xd/Desktop/ieee/Region_5_Student_Branch_Counselors_and_Chairs.htm'
save_file = 'ieee_info_1'
html = urllib.request.urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")

# Each university block is paired, in page order, with one roster block.
universities = soup.find_all('div', class_='spoName bullet pad-t15')
people = soup.find_all('div', class_='roster-results')

list_name = ['university', 'name', 'title', 'address', 'email']
content = []

for u, p in zip(universities, people):
    info = p.find_all('p')
    university = u.get_text()
    name = info[0].get_text()
    # Skip branches whose position is unfilled.
    if name == 'Position Vacant':
        continue
    title = info[2].get_text()
    address = info[3].get_text() + ', ' + info[4].get_text()
    # Drop the first 7 characters of the last paragraph (presumably a label such as 'Email: ').
    email = info[-1].get_text()[7:]
    content.append([university, name, title, address, email])

# Append all rows in a single write; no header row is written.
data = pd.DataFrame(columns=list_name, data=content)
data.to_csv("{}.csv".format(save_file), mode='a', index=False, header=False, encoding='utf-8')
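
The script depends on a page that is not included here, so the following is only a minimal sketch for trying the same pairing logic on an inline HTML string. The class names are taken from the script above; the sample markup, the names and addresses in it, the `extract_rows` helper, and the `EMAIL_LABEL` constant are all assumptions made for illustration. It also swaps the fixed `[7:]` slice for an explicit prefix check and writes with the standard `csv` module to an in-memory buffer instead of a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical miniature of the saved page: the class names come from the
# script above, but the paragraph layout and every value are invented.
SAMPLE_HTML = """
<div class="spoName bullet pad-t15">Example University</div>
<div class="roster-results">
  <p>Jane Doe</p><p>Member</p><p>Branch Counselor</p>
  <p>1 Example Street</p><p>Example City, TX</p>
  <p>Email: jane@example.org</p>
</div>
<div class="spoName bullet pad-t15">Sample College</div>
<div class="roster-results"><p>Position Vacant</p></div>
"""

EMAIL_LABEL = "Email: "  # assumed label; it happens to be 7 characters long


def extract_rows(html):
    """Return [university, name, title, address, email] rows from the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    universities = soup.find_all("div", class_="spoName bullet pad-t15")
    people = soup.find_all("div", class_="roster-results")
    rows = []
    for u, p in zip(universities, people):
        info = [tag.get_text(strip=True) for tag in p.find_all("p")]
        # Skip vacant positions and blocks too short to index safely.
        if len(info) < 5 or info[0] == "Position Vacant":
            continue
        email = info[-1]
        # Remove the label only if it is actually present.
        if email.startswith(EMAIL_LABEL):
            email = email[len(EMAIL_LABEL):]
        rows.append([u.get_text(strip=True), info[0], info[2],
                     info[3] + ", " + info[4], email])
    return rows


rows = extract_rows(SAMPLE_HTML)

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because `zip` pairs the two lists by position, the approach relies on every university block having exactly one roster block in the same order. If the real page breaks that assumption, the rows would be misaligned without any error being raised.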

Author: XD. Please credit the source when reposting: http://www.eadst.com/blog/34

This site is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Related tags

WebCrawler BeautifulSoup CSV
