Sunday, 21 April 2019

Web Scraping In Python


Web scraping is the extraction of data such as URLs, email addresses, phone numbers, names, and other text from a web page.

Python provides helpful libraries for reading a web page and extracting data from it. The walkthrough below fetches a page, pulls out names and phone numbers with regular expressions, and loads the result into a pandas DataFrame.

Libraries used:

  • urllib --> fetches the given URL and returns the page content.
  • re (regular expressions) --> extracts and cleans the data.
  • pandas --> converts the extracted data into a DataFrame.

import urllib.request
import re
import pandas as pd

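url = "<URL>"

# Download the page; the with-block closes the connection when done.
with urllib.request.urlopen(url) as response:
    html = response.read()

# response.read() returns bytes; decode() assumes UTF-8 unless an encoding is given.
htmlStr = html.decode()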

# Extract all the phone numbers from the page.
# re.findall returns a list of every match. The raw string (r"...") passes
# the backslashes to the regex engine unchanged.
phdata = re.findall(r"\(\d{3}\) \d{3}-\d{4}", htmlStr)
print(phdata)
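
The phone pattern can be tried without a network call. The following is a minimal sketch that runs the same expression against a small hand-written HTML string; the markup, names, and numbers in `sample_html` are invented for illustration and are not taken from the page scraped above.

```python
import re

# Hypothetical sample markup; the real page's structure may differ.
sample_html = (
    "<ul>"
    "<li>Jane Doe<br/>(555) 010-2345</li>"
    "<li>John Roe<br/>(555) 010-6789</li>"
    "</ul>"
)

# \( and \) match literal parentheses; \d{3} matches exactly three digits.
phone_pattern = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

sample_phones = phone_pattern.findall(sample_html)
print(sample_phones)
```

Compiling the pattern once is optional here; it mainly helps when the same expression is applied to many pages.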



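# Strip newlines so the name pattern is matched against one continuous string.
htmlStr1 = htmlStr.replace("\n", "")

# Preview the matched name entries, with the surrounding tags still attached.
for name in re.findall(r"<li>\w{2,20} \w{2,20}<br/>", htmlStr1):
    print(name)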


# Extract all the names from the page. The output is a list.
name1 = re.findall(r"<li>\w{2,20} \w{2,20}<br/>", htmlStr1)



# Clean the data: strip the surrounding <li> and <br/> tags from each name.
name1 = [name.replace("<li>", "").replace("<br/>", "") for name in name1]

name1
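
As an alternative to stripping the tags afterwards, a capture group can return only the name. This is a sketch on invented sample markup (`sample_html` and its contents are assumptions, not the scraped page): when a pattern contains exactly one group, `re.findall` returns the text of that group instead of the whole match.

```python
import re

# Hypothetical sample markup; the real page's structure may differ.
sample_html = (
    "<li>Jane Doe<br/>(555) 010-2345</li>"
    "<li>John Roe<br/>(555) 010-6789</li>"
)

# The parentheses mark the part of the match that findall should return.
name_pattern = re.compile(r"<li>(\w{2,20} \w{2,20})<br/>")

sample_names = name_pattern.findall(sample_html)
print(sample_names)
```

Note that this pattern, like the one above, only matches names made of exactly two words; hyphenated or three-part names would be skipped.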
 

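# Add the extracted data to a dictionary keyed by name.
# zip pairs each name with the phone number at the same position and stops
# at the shorter list, so a length mismatch does not raise IndexError.
phDict = dict(zip(name1, phdata))
print(phDict)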


# Create a DataFrame from the dictionary: one row per (name, phone) pair.

phone_df = pd.DataFrame(list(phDict.items()), index=range(len(phDict)))

phone_df.info()

phone_df



# Rename the DataFrame columns from the default 0 and 1.
phone_df = phone_df.rename(columns={0: "Name", 1: "Phone"})

phone_df



# Extract a particular person's phone number.
phone_df[phone_df['Name']=='Tamara Howe']
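
The filter above returns a whole row. If only the number itself is needed, one possible approach is sketched below on invented data (`sample_dict` stands in for the scraped dictionary and is an assumption): the columns are named when the DataFrame is built, and `.loc` selects the `Phone` column for the matching rows.

```python
import pandas as pd

# Hypothetical data standing in for the scraped dictionary.
sample_dict = {"Jane Doe": "(555) 010-2345", "John Roe": "(555) 010-6789"}

# Naming the columns at construction time makes a later rename unnecessary.
sample_df = pd.DataFrame(list(sample_dict.items()), columns=["Name", "Phone"])

# The boolean mask selects matching rows; .loc then picks the Phone column.
matches = sample_df.loc[sample_df["Name"] == "Jane Doe", "Phone"]

# Guard against a name that is not in the table.
jane_phone = matches.iloc[0] if not matches.empty else None
print(jane_phone)
```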


