Python Basics: Web Scraping In Python

Web scraping is the automated extraction of data such as URLs, email addresses, phone numbers, names, and other text from a webpage.

Python provides helpful libraries for reading webpages and extracting data from them. This post walks through a small example that pulls names and phone numbers out of a page and loads them into a dataframe.

Libraries used:

urllib --> to request a particular URL and read the page content
re (regular expressions) --> to find and clean the data
pandas --> to convert the extracted data into a dataframe

import urllib.request

import re

import pandas as pd

url = "<URL>"

response = urllib.request.urlopen(url)

html = response.read()

htmlStr = html.decode()
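
The fetch-and-decode step above assumes the page is UTF-8, since that is what `decode()` uses by default. A minimal sketch of a slightly more defensive version follows. The helper names `decode_body` and `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original code.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode raw bytes with the declared charset, falling back to UTF-8."""
    try:
        return raw.decode(charset or "utf-8")
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: keep going and
        # substitute a replacement character for anything unreadable.
        return raw.decode("utf-8", errors="replace")


def fetch_html(url, timeout=10):
    """Fetch a page and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # The charset comes from the Content-Type header, if the server sent one.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

With this in place, `htmlStr = fetch_html(url)` would replace the three separate steps, and the `with` block closes the connection when it finishes.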

# extract all the phone numbers from the webpage

# re.findall returns every match as a list of strings.
# The pattern is a raw string (r"...") so the backslashes reach the regex engine unchanged.

phdata = re.findall(r"\(\d{3}\) \d{3}-\d{4}", htmlStr)

print(phdata)
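
To see what the phone pattern does and does not match, here is a small sketch run against a made-up HTML fragment. The sample names and numbers are invented for illustration and do not come from the page being scraped.

```python
import re

# \(\d{3}\)  -> three digits inside literal parentheses (area code)
# a space, then \d{3}-\d{4} -> three digits, a hyphen, four digits
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

# Hypothetical sample markup; the third entry has no parentheses.
sample_html = (
    "<li>Jane Doe<br/>(202) 555-0101</li>"
    "<li>John Roe<br/>(202) 555-0102</li>"
    "<li>Fax: 202-555-0103</li>"
)

phones = PHONE_PATTERN.findall(sample_html)
print(phones)  # ['(202) 555-0101', '(202) 555-0102']
```

The third number is skipped because the pattern requires the area code to be wrapped in parentheses. A page that formats numbers differently would need a different pattern.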

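# remove newlines so that a match cannot be split across two lines

regex = re.compile("\n")

htmlStr1 = regex.sub("", htmlStr)

# print every "<li>First Last<br/>" entry found in the page

for name in re.findall(r"<li>\w{2,20} \w{2,20}<br/>", htmlStr1):
    print(name)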

# extract all the names from the webpage. The output will be a list.

name1 = re.findall(r"<li>\w{2,20} \w{2,20}<br/>", htmlStr1)

# Cleaning the data: strip the surrounding HTML tags from each match

for i in range(len(name1)):
    name1[i] = name1[i].replace("<li>", "")
    name1[i] = name1[i].replace("<br/>", "")

name1
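
The find-then-strip steps above can also be done in one pass with a capture group. When a pattern contains a single group, `re.findall` returns only the text matched by that group. This is a sketch on made-up sample markup, not a change to the code above.

```python
import re

# Hypothetical sample markup standing in for the real page.
sample_html = """<ul>
<li>Jane Doe<br/>(202) 555-0101</li>
<li>John Roe<br/>(202) 555-0102</li>
</ul>"""

# The parentheses capture just the "First Last" part,
# so the <li> and <br/> tags never end up in the result.
NAME_PATTERN = re.compile(r"<li>(\w{2,20} \w{2,20})<br/>")

names = NAME_PATTERN.findall(sample_html)
print(names)  # ['Jane Doe', 'John Roe']
```

For anything beyond a simple, regular page, a real HTML parser is usually more robust than regular expressions, but that is outside the scope of this example.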

# adding the extracted data to a dictionary

phDict = {}

x = 0

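# pair each name with the phone number at the same position
# (this assumes both lists are in the same order and the same length)

for name in name1:
    phDict[name] = phdata[x]
    x += 1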

print(phDict)
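
The index-based loop above raises an `IndexError` if the page yields more names than phone numbers. A minimal alternative sketch uses `zip`, which pairs items by position and stops at the shorter list. The sample lists and the variable name `phone_by_name` are invented for illustration.

```python
# Hypothetical sample data: one more name than phone numbers.
names = ["Jane Doe", "John Roe", "Ann Lee"]
phones = ["(202) 555-0101", "(202) 555-0102"]

# Flag a mismatch instead of failing halfway through the loop.
if len(names) != len(phones):
    print(f"warning: {len(names)} names but {len(phones)} phone numbers")

# zip pairs items by position and stops at the shorter list.
phone_by_name = dict(zip(names, phones))
print(phone_by_name)
```

Note that a dictionary keeps only one entry per key, so two people with the same name would collapse into a single entry in either version.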

#Creating a dataframe through a dictionary

phone_df = pd.DataFrame(list(phDict.items()), index=range(len(phDict)))

phone_df.info()

phone_df

# Renaming the columns of dataframe

phone_df=phone_df.rename(index=str,columns={0:"Name",1:"Phone"})

phone_df
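
The create-then-rename steps above can also be combined by passing `columns` to the `DataFrame` constructor. This is a sketch on made-up sample data. The dictionary contents are assumptions, not values from the scraped page.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped dictionary.
phone_by_name = {
    "Jane Doe": "(202) 555-0101",
    "John Roe": "(202) 555-0102",
}

# Each (name, phone) pair becomes one row; columns are named up front.
phone_df = pd.DataFrame(list(phone_by_name.items()), columns=["Name", "Phone"])
print(phone_df)
```

This leaves the default integer index in place, whereas the `rename(index=str, ...)` call above also converts the index labels to strings.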

# extracting a particular person's phone number

phone_df[phone_df['Name']=='Tamara Howe']
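
The boolean filter above returns a dataframe containing every matching row. If only the phone number itself is needed, a sketch like the following selects the `Phone` column with `.loc` and takes the first match. The sample rows and the variable names `matches` and `phone` are invented for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the numbers are not from the scraped page.
phone_df = pd.DataFrame(
    {
        "Name": ["Tamara Howe", "John Roe"],
        "Phone": ["(202) 555-0101", "(202) 555-0102"],
    }
)

# Rows where Name matches, restricted to the Phone column (a Series).
matches = phone_df.loc[phone_df["Name"] == "Tamara Howe", "Phone"]

# Take the first match, or None when the name is not present.
phone = matches.iloc[0] if not matches.empty else None
print(phone)
```

Checking `matches.empty` first avoids an `IndexError` when the name does not appear in the dataframe.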

Sunday, 21 April 2019
