In the era of automation, manual data collection tasks like downloading PDFs from websites can be streamlined using tools like Playwright. Playwright, a robust framework for web automation, allows developers to interact with web pages and automate repetitive tasks efficiently. In this blog, we will explore how to use Playwright in Python to automate the process of downloading PDFs from websites by fetching links from an Excel file.
Playwright stands out among web automation tools due to its:
Cross-browser compatibility: It supports major browsers like Chromium, Firefox, and WebKit.
Robustness: Built-in features like automatic waiting and network interception make it reliable.
Ease of use: A simple API and Python bindings make it a great choice for Python developers.
Imagine a scenario where you need to download specific PDFs from various websites listed in an Excel file. Each row in the file contains a link to a website and metadata like folder names to store the downloaded PDFs. Automating this task can save hours of manual effort.
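To make the input format concrete, here is a minimal sketch of what such a sheet might look like once loaded into Pandas. The column names Website_Link and Folder_Name are the ones the script later in this post reads; the rows and URLs are made-up placeholders, and the DataFrame is built in memory rather than read from a real Excel file.

```python
import pandas as pd

# Hypothetical sample rows standing in for the Excel sheet.
# Only the column names come from the script in this post.
sample = pd.DataFrame(
    {
        "Website_Link": [
            "https://example.com/timetables",
            None,  # a row with a missing link
            "https://example.org/schedules",
        ],
        "Folder_Name": ["Line_A", "Line_B", "Port_C"],
    }
)

# Keep only rows where both the link and the folder name are present.
valid = sample[sample["Website_Link"].notna() & sample["Folder_Name"].notna()]
print(valid)
```

Rows with a missing link or folder name are dropped before any browser work starts, so an incomplete sheet does not cause navigation errors later.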
To get started, ensure the following are installed:
Python 3.7+
Playwright
Pandas and openpyxl (for reading the Excel file)
Requests (for downloading PDFs)
You can install the necessary packages using pip:
pip install playwright pandas openpyxl requests
Set up Playwright by running:
playwright install
Load the Excel File
Navigate to URLs
Search for PDF Links
Download and Save PDF
import os
import pandas as pd
from playwright.sync_api import sync_playwright
import requests
from urllib.parse import urljoin
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Load the Excel file
excel_file = "data.xlsx"  # Replace with your file path
df = pd.read_excel(excel_file, sheet_name="Sheet1")

# Filter rows with necessary data
df_filtered = df[df['Website_Link'].notna() & df['Folder_Name'].notna()]

if df_filtered.empty:
    logger.warning("No valid data found in the Excel file.")
else:
    logger.info(f"Processing {len(df_filtered)} rows from the Excel file.")

    # Start Playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        for _, row in df_filtered.iterrows():
            url = row['Website_Link']
            # Convert to str so numeric folder names from Excel also work
            folder_name = str(row['Folder_Name'])

            # Create a folder for downloads
            download_dir = os.path.join("downloads", folder_name)
            os.makedirs(download_dir, exist_ok=True)

            try:
                logger.info(f"Navigating to {url}...")
                page.goto(url, timeout=120000)
                page.wait_for_load_state('networkidle', timeout=120000)

                # Find all PDF links on the page
                pdf_links = page.locator('a[href$=".pdf"]')
                if pdf_links.count() == 0:
                    logger.warning(f"No PDF links found on the page: {url}")
                else:
                    for i in range(pdf_links.count()):
                        pdf_url = pdf_links.nth(i).get_attribute('href')
                        if pdf_url:
                            full_pdf_url = urljoin(url, pdf_url)
                            pdf_name = os.path.basename(full_pdf_url)

                            # Download the PDF (the timeout keeps a stalled server from hanging the run)
                            logger.info(f"Downloading {pdf_name} from {full_pdf_url}...")
                            response = requests.get(full_pdf_url, timeout=60)
                            if response.status_code == 200:
                                file_path = os.path.join(download_dir, pdf_name)
                                with open(file_path, 'wb') as f:
                                    f.write(response.content)
                                logger.info(f"Downloaded successfully: {file_path}")
                            else:
                                logger.error(f"Failed to download {pdf_name}. Status code: {response.status_code}")
            except Exception as e:
                logger.error(f"Error processing {url}: {e}")

        browser.close()
Each image below is followed by its description.
This is the main.py script that organizes and manages the execution of Python scripts for various companies, categorized by country. Each country's section in the script references and integrates the specific Python files related to companies operating within that country, ensuring structured and efficient execution of all company-specific functionalities.
The settings.json file serves as the centralized configuration hub for all company-specific scripts across different countries. It stores essential settings such as API keys, file paths, environment variables, and other parameters needed by each script to function properly. This approach promotes consistency and easy management of configurations across multiple scripts.
The script is designed to generate a structured folder system where each folder represents either a ferry line or a port name. Inside each folder, the corresponding timetable PDFs are stored. This ensures organized access to schedules, making it easy to locate and reference specific ferry or port timetables.
Example of a Ferry Company Website for Timetable PDF Downloads
Our script automates downloading timetable PDFs from ferry company websites, covering multiple routes and ports. It navigates the schedule sections of these sites, identifies available PDFs, and downloads them efficiently.
Read and Filter Excel Data: The script uses Pandas to read the Excel file and filter rows with valid website links and folder names.
Navigate to URLs: Playwright navigates to each URL and waits for the page to load.
Find and Download PDFs: All links ending with .pdf are located, and each PDF is downloaded using the requests library. The PDFs are saved into folders named after the Folder_Name column in the Excel file.
Error Handling: The script logs errors for any failed downloads or missing data.
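The link-resolution step can be looked at in isolation. The sketch below shows how urljoin turns a relative href into an absolute URL, and one possible way to derive a file name from the URL path only. The helper name resolve_pdf_link and the fallback name download.pdf are illustrative assumptions, not part of the script above. Stripping the query string and decoding percent-escapes would mainly matter if the link selector were broadened beyond hrefs ending in .pdf.

```python
import os
from urllib.parse import unquote, urljoin, urlparse


def resolve_pdf_link(page_url, href):
    """Return (absolute_url, file_name) for a link found on page_url.

    Hypothetical helper: the file name is taken from the URL path only,
    so query strings are ignored and percent-escapes are decoded.
    """
    full_url = urljoin(page_url, href)
    path = urlparse(full_url).path
    # Fall back to a generic name if the path has no final segment.
    file_name = unquote(os.path.basename(path)) or "download.pdf"
    return full_url, file_name


# Example with a relative, percent-encoded link (placeholder URLs).
print(resolve_pdf_link("https://example.com/schedules/", "files/summer%20timetable.pdf?v=2"))
```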
Efficiency: Automates repetitive tasks, saving time and effort.
Scalability: Can handle multiple websites and large datasets from Excel.
Flexibility: Easy to customize for specific requirements, such as filtering PDFs based on keywords.
Dynamic Filters: Add functionality to search for specific text in PDF links.
Error Logging: Store errors in a separate file for better debugging.
Headed Mode: Use headless=False during debugging to view browser actions.
Parallel Execution: Use multiprocessing to process multiple URLs simultaneously.
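Two of these enhancements can be sketched briefly: keyword filtering of PDF links and a separate error log file. This is one possible approach under stated assumptions, not a definitive implementation. The function names filter_links_by_keyword and add_error_file_handler, the sample links, and the errors.log file name are all hypothetical.

```python
import logging


def filter_links_by_keyword(links, keywords):
    """Keep links that contain at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [link for link in links if any(k in link.lower() for k in lowered)]


def add_error_file_handler(logger, path="errors.log"):
    """Attach a handler that writes only ERROR and above to a separate file."""
    # delay=True: the file is created only when the first error is written.
    handler = logging.FileHandler(path, delay=True)
    handler.setLevel(logging.ERROR)
    handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return handler


# Placeholder links; in the script these would be the resolved PDF URLs.
links = [
    "https://example.com/docs/Summer-Timetable.pdf",
    "https://example.com/docs/price-list.pdf",
]
print(filter_links_by_keyword(links, ["timetable"]))
```

In the main script, the filter would be applied to the resolved PDF URLs before downloading, and the handler would be attached once, right after the logger is created.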
Automating PDF downloads using Playwright and Python is a powerful way to streamline workflows involving web data extraction. By leveraging Excel for input data, the solution becomes dynamic and adaptable for various use cases. Playwright’s robust capabilities combined with Python’s versatility make this a go-to solution for automation enthusiasts and developers alike. With the provided script as a starting point, you can build custom automation solutions tailored to your specific needs.