Scraping HTML Data with BeautifulSoup

Scraping Numbers from HTML using BeautifulSoup

In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, parse the data, extract the numbers, and compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

You do not need to save these files to your folder since your program will read the data directly from the URL. 

Note: Each student will have a distinct data URL for the assignment, so use only your own data URL for analysis.

SOLUTION

import re
import urllib.request

from bs4 import BeautifulSoup

# Replace this with your own data URL; each student is assigned a different one.
url = 'http://py4e-data.dr-chuck.net/comments_676725.html'

# Read the raw HTML from the URL and hand it to BeautifulSoup for parsing
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")



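total = 0  # named "total" so the built-in sum() is not shadowed
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Search only the tag's text, so digits inside attributes are not counted
    numbers = re.findall("[0-9]+", tag.text)
    for number in numbers:
        total = total + int(number)
print(total)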
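
The extraction step can also be checked offline, before pointing the program at a real data URL. The block below is a minimal sketch, not part of the assignment. It swaps BeautifulSoup for the standard library's html.parser so that it runs with no extra packages. The names SpanSummer and sum_spans and the inline sample HTML are hypothetical; the sample only assumes that the numbers sit in the text of span tags, as in the solution above.

```python
from html.parser import HTMLParser


class SpanSummer(HTMLParser):
    """Adds up the integer text found inside <span> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <span> tags currently open
        self.total = 0   # running sum of the numbers seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only count text that is inside a span and consists of digits
        text = data.strip()
        if self.depth > 0 and text.isdigit():
            self.total += int(text)


def sum_spans(html_text):
    """Return the sum of all integers that appear as span text."""
    parser = SpanSummer()
    parser.feed(html_text)
    parser.close()
    return parser.total


# Made-up sample data, standing in for a downloaded page
sample = """
<table>
<tr><td>Alice</td><td><span class="comments">97</span></td></tr>
<tr><td>Bob</td><td><span class="comments">3</span></td></tr>
</table>
"""

print(sum_spans(sample))  # prints 100
```

If the BeautifulSoup solution and a cross-check like this agree on the sample file, whose sum is given, that is reasonable evidence the logic is right before running it on the actual data.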
