Web scraping is the automated extraction of URLs, email addresses, phone numbers, names and other text data from a webpage.
Python provides helpful libraries for reading webpages and extracting data from them. Let us look at web scraping with Python in more detail.
Libraries used:
- urllib --> to request a particular URL and read the page contents
- re (regular expressions) --> to extract and clean the data
- pandas --> to convert the extracted data into a DataFrame
import urllib.request
import re
import pandas as pd
url = "<URL>"
response = urllib.request.urlopen(url)
html = response.read()
htmlStr = html.decode()
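The fetch step above can also be wrapped in a small helper. This is only a sketch: the name fetch_html and its default_charset parameter are made up for illustration, and the data: URL is a stand-in for a real page so that the example runs without network access.

```python
import urllib.request


def fetch_html(url, default_charset="utf-8"):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    # The with-block closes the response even if reading fails.
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Use the charset declared in the Content-Type header, if there is one.
        charset = response.headers.get_content_charset() or default_charset
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(charset, errors="replace")


# A data: URL stands in for a real page so the sketch runs offline.
demo_url = "data:text/html;charset=utf-8,%3Cli%3EJane%20Doe%3Cbr/%3E"
print(fetch_html(demo_url))
```

With a real page you would pass its address instead of demo_url. Reading the charset from the response headers avoids assuming that every page is UTF-8.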
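# Extract all the phone numbers from the webpage.
# re.findall returns a list of every match; the pattern matches numbers written as (ddd) ddd-dddd.
# The raw string (r"...") keeps the backslashes from being treated as string escapes.
phdata = re.findall(r"\(\d{3}\) \d{3}-\d{4}", htmlStr)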
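To see what the phone number pattern does, here is a minimal sketch run against a made-up HTML snippet. The sample markup and the names PHONE_PATTERN and sample_html are assumptions for illustration, not content from any real page.

```python
import re

# Pattern for numbers written as "(ddd) ddd-dddd".
# The parentheses are escaped because they are special characters in a regex.
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

# Made-up sample markup, not taken from a real page.
sample_html = (
    "<li>Jane Doe<br/>(555) 010-1234</li>"
    "<li>John Roe<br/>(555) 010-5678</li>"
    "<li>No phone here<br/>555-0100</li>"
)

# findall returns every non-overlapping match, in page order.
phones = PHONE_PATTERN.findall(sample_html)
print(phones)
```

Note that the third entry is skipped because it is not in the (ddd) ddd-dddd format; a page that mixes formats would need a broader pattern.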
# Extract all the names from the webpage. The output will be a list.
# Remove the newlines first so that each <li>...<br/> entry sits on one line.
regex = re.compile("\n")
htmlStr1 = regex.sub("", htmlStr)
name1 = re.findall(r"<li>\w{2,20} \w{2,20}<br/>", htmlStr1)
for name in name1:
    print(name)
# Cleaning the data: strip the surrounding HTML tags from each name
for i in range(len(name1)):
    name1[i] = name1[i].replace("<li>", "")
    name1[i] = name1[i].replace("<br/>", "")
name1
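The extract-then-clean steps can be combined by using a capture group, since re.findall returns only the captured text when the pattern contains exactly one group. The snippet below is a sketch on made-up markup; NAME_PATTERN and sample_html are illustrative names.

```python
import re

# Made-up sample markup, not taken from a real page.
sample_html = (
    "<ul>\n"
    "<li>Jane Doe<br/>(555) 010-1234</li>\n"
    "<li>John Roe<br/>(555) 010-5678</li>\n"
    "</ul>"
)

# The parentheses capture only the "First Last" part between <li> and <br/>.
NAME_PATTERN = re.compile(r"<li>(\w{2,20} \w{2,20})<br/>")

# With a single capture group, findall returns the group, not the whole match,
# so no separate tag-stripping step is needed.
names = NAME_PATTERN.findall(sample_html)
print(names)
```

This keeps the same two-word name assumption as the original pattern, so names with middle names, hyphens or apostrophes would still be missed.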
# Adding the extracted data to a dictionary.
# This assumes the names and phone numbers appear on the page in the same order.
phDict = {}
x = 0
for name in name1:
    phDict[name] = phdata[x]
    x += 1
print(phDict)
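The same pairing can be written with zip. The sketch below adds a length check, because a mismatch between the two lists usually means one of the patterns missed an entry; pair_names_with_phones is a hypothetical helper name and the sample values are made up.

```python
def pair_names_with_phones(names, phones):
    """Build a {name: phone} dict from two parallel lists (hypothetical helper)."""
    # Fail loudly rather than silently pairing names with the wrong numbers.
    if len(names) != len(phones):
        raise ValueError(
            f"found {len(names)} names but {len(phones)} phone numbers"
        )
    # zip walks both lists in step; dict keeps the last value for a repeated name.
    return dict(zip(names, phones))


directory = pair_names_with_phones(
    ["Jane Doe", "John Roe"],
    ["(555) 010-1234", "(555) 010-5678"],
)
print(directory)
```

Because dictionary keys are unique, two people with the same name would collapse into one entry; keeping a list of (name, phone) pairs avoids that if duplicates are possible.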
# Creating a DataFrame from the dictionary
phone_df = pd.DataFrame(list(phDict.items()), index=range(len(phDict)))
phone_df.info()
# Renaming the columns of the DataFrame
phone_df = phone_df.rename(columns={0: "Name", 1: "Phone"})
# Extracting a particular person's phone number
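The post does not show the code for this last step, so the following is only one possible sketch. It assumes a DataFrame with "Name" and "Phone" columns like the one built above; the helper phone_for and the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the scraped data.
phone_df = pd.DataFrame(
    [("Jane Doe", "(555) 010-1234"), ("John Roe", "(555) 010-5678")],
    columns=["Name", "Phone"],
)


def phone_for(df, person):
    """Return the first phone number listed for person, or None (hypothetical helper)."""
    # Boolean mask selects the rows whose Name matches; "Phone" picks that column.
    matches = df.loc[df["Name"] == person, "Phone"]
    return matches.iloc[0] if not matches.empty else None


print(phone_for(phone_df, "Jane Doe"))
```

Returning None for an unknown name is a design choice here; raising an error would also be reasonable if a missing person should never happen.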