In this example we will show how to implement a simple crawler program with Python.
2. Tips
First import the following package: Beautiful Soup, requests, re, then set the URL to access, and accept the return values.
Source Code
#! /usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
from pprint import pprint
headers = {
'Host': 'en.wikipedia.org',
'Referer': 'https://en.wikipedia.org/',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
}
url = 'https://en.wikipedia.org/wiki/Main_Page'
r = requests.session()
r = BeautifulSoup(r.get(url, headers=headers).content)
result = r.find(id="mp-otd").find('ul').find_all('li')
onThisDay = [] # store crawled results
for otd in result:
c = '{}'.format(otd.text)
onThisDay.append(c)
print(onThisDay)
pprint(onThisDay, width=90)
After running the codes above, we can see the results as follow. They include a group of today’s historical events. The number at the beginning is the year and the text is the event description.