Firebase Cloud Storage & Python's BeautifulSoup 4 logo | Edited by Author ©

Scraping and Uploading Images to Firebase Storage using Python’s BeautifulSoup 4

Habibur Rahaman Fahim

--

Introduction:

In today’s digital era, web scraping and data storage are vital tasks for many web-based applications. Python, with its extensive range of libraries, provides powerful tools for web scraping and data manipulation. In this article, we will explore how to scrape images from websites and upload them to Firebase Storage using Python.

Prerequisites:

Before we delve into the code, make sure you have the following prerequisites set up:

  1. Python installed on your machine.
  2. requests library installed (pip install requests).
  3. BeautifulSoup library installed (pip install beautifulsoup4).
  4. Firebase Admin SDK installed (pip install firebase-admin).
  5. A Firebase project set up with a storage bucket and the Firebase Admin SDK service account JSON file (a quick sanity check for this follows the list).
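
If you want to confirm the Admin SDK credentials are wired up correctly before running the scraper, a minimal sanity check along these lines will fail fast on a bad path or a malformed key. The file path (firebase/firebase_config.json) and the bucket name (your-project.appspot.com) are placeholders here; substitute your own values.

import firebase_admin
from firebase_admin import credentials, storage

# Placeholder path and bucket name; replace with your own values
cred = credentials.Certificate("firebase/firebase_config.json")
firebase_admin.initialize_app(cred, {"storageBucket": "your-project.appspot.com"})
print(storage.bucket().name)  # Prints the bucket name if initialization succeeded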

Scraping and Uploading Images:

Let’s examine the Python code snippet and understand how it scrapes images from websites and uploads them to Firebase Storage.

"""
Copyright 2023, Habibur Rahaman Fahim

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""


import requests
from bs4 import BeautifulSoup
import os
import firebase_admin
from firebase_admin import credentials, storage

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# Path to the Firebase Admin SDK JSON file (named 'firebase_config.json' here, inside the 'firebase' folder)
FIREBASE_ADMIN_SDK_FILE = os.path.join(BASE_DIR, "firebase", "firebase_config.json")

# Initialize the Firebase app with the admin SDK credentials
cred = credentials.Certificate(FIREBASE_ADMIN_SDK_FILE)
firebase_admin.initialize_app(
    # Add storageBucket link here
    cred, {"storageBucket": "authentication-2244f.appspot.com"}
)

# Get a reference to the Firebase Storage bucket
bucket = storage.bucket()

urls = {
    # First tested site: https://rents.com.bd/all-properties/
    "https://rents.com.bd/all-properties/": (
        "h2",
        {"class": "item-title"},
    ),
    # Second tested site: https://www.bproperty.com/en/dhaka/apartments-for-rent/
    "https://www.bproperty.com/en/dhaka/apartments-for-rent/": (
        "h2",
        {"class": "_7f17f34f"},
    ),
}

for url, search_pattern in urls.items():
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    for item_listing_wrap in soup.find_all(
        "div", {"class": "item-listing-wrap hz-item-gallery-js card"}
    ) + soup.find_all("li", {"class": "ef447dde"}):
        ad_name_element = item_listing_wrap.find(search_pattern[0], search_pattern[1])
        if ad_name_element is not None:
            ad_name = ad_name_element.get_text().strip()
            ad_name = (
                ad_name.replace(".", " ")
                .replace("$", " ")
                .replace("#", " ")
                .replace("[", " ")
                .replace("]", " ")
                .replace("/", " ")
                .replace("-", " ")
            )
        else:
            ad_name = ""

        for img in item_listing_wrap.find_all("img"):
            src = img.get("src")
            if src:
                filename, extension = os.path.splitext(os.path.basename(src))
                if extension.lower() in (".jpg", ".jpeg"):
                    # Name the uploaded image after the ad it belongs to
                    filename = f"{ad_name}"
                    response = requests.get(src)
                    if response.status_code == 200:
                        # Upload the image bytes to Firebase Storage
                        blob = bucket.blob(filename)
                        blob.upload_from_string(
                            response.content, content_type="image/jpeg"
                        )
                        print(f"Image {filename} uploaded to Firebase Storage.")
                    else:
                        print(
                            f"Failed to load image {filename}. Status code: {response.status_code}"
                        )

  1. Importing Required Libraries: Begin by importing the necessary libraries: requests, BeautifulSoup, os, and firebase_admin, along with credentials and storage from the Firebase Admin SDK.
  2. Initializing the Firebase App: Load the Firebase Admin SDK JSON file and use it to initialize the Firebase app with the appropriate credentials. Obtain the reference to the storage bucket.
  3. Defining URLs and Search Patterns: Define a dictionary with the URLs of the websites to scrape and their corresponding search patterns. Each URL is associated with an HTML tag and its attributes, which aid in locating the desired elements containing the images and ad names.
  4. Scraping and Uploading: Iterate through the URLs using a loop and perform the following steps:
  • Send a GET request to the URL and retrieve the webpage content.
  • Parse the HTML content using BeautifulSoup.
  • Find the relevant div elements or list items that contain the image and ad name information.
  • Clean the ad name by replacing unwanted characters.
  • Iterate through the image elements, extract the source URL, and check if it is a JPEG image.
  • Download the image using the requests library and upload it to Firebase Storage using the Firebase Admin SDK.
  • Print the status of the image upload (a quick way to verify the uploads afterwards is sketched below).
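
As a quick way to confirm that step 4 actually worked, you can list what landed in the bucket. This is a minimal sketch, assuming the Firebase app and the bucket reference have been initialized as in the script above:

# List every object currently in the bucket, along with its size in bytes
for blob in bucket.list_blobs():
    print(blob.name, blob.size)

As a side note, the chain of replace() calls in step 4 could also be written as a single regular expression, re.sub(r"[.$#\[\]/-]", " ", ad_name); the chained version used in the script is simply more explicit about which characters are stripped.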

Conclusion:

In this article, we explored how to scrape images from websites and upload them to Firebase Storage using Python. We learned about the requests and BeautifulSoup libraries for web scraping and how to utilize the Firebase Admin SDK to interact with Firebase Storage. By leveraging these tools, you can automate the process of extracting images and storing them in the cloud for further processing or analysis.

Always remember to respect website scraping policies and terms of service, and ensure that you have the necessary rights or permissions to scrape and use the data obtained.
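
For example, before scraping a site you can ask its robots.txt file whether the page is open to crawlers. The following is only a sketch using Python's standard library, and robots.txt is just one signal; a site's terms of service may impose further restrictions:

from urllib.robotparser import RobotFileParser

# Check whether generic crawlers are allowed to fetch the listings page
rp = RobotFileParser("https://rents.com.bd/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://rents.com.bd/all-properties/"))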

That concludes this tutorial! Hope you found it informative and helpful. Happy scraping and image uploading!
