Google-Image-Crawler is a Python-based tool designed to automate the process of downloading images from Google Image Search results. It leverages Selenium and BeautifulSoup to scrape images based on user-defined keywords and save them locally.
Features
- Automated image scraping from Google Image Search
- Supports keyword-based image queries
- Configurable download limits and output directories
- Uses Selenium WebDriver for dynamic page interaction
- Supports batch processing via JSON configuration
Tech Stack
- Python 3
- Selenium WebDriver
- BeautifulSoup4
- urllib3
Getting Started
Prerequisites
- Python 3.x installed
- Google Chrome browser installed
- ChromeDriver executable compatible with your Chrome version (download from https://sites.google.com/chromium.org/driver/)
Installation
- Clone the repository:
git clone https://github.com/justin-napolitano/Google-Image-Crawler.git
cd Google-Image-Crawler
- Install required Python packages:
pip install -r requirements.txt
Note: If requirements.txt is not present, install dependencies manually:
pip install selenium beautifulsoup4 urllib3
- Place the
chromedriver.exein the project directory or update the path inconfig.json.
Usage
- Modify
config.jsonto set keywords, download limits, ChromeDriver path, and output directories. - Run the crawler script (e.g.,
seleniumCrawler.pyorgooglecrawler.py) to start downloading images.
Example running seleniumCrawler.py:
python seleniumCrawler.py
Project Structure
Google-Image-Crawler/
├── chromedriver.exe # ChromeDriver executable for Selenium
├── config.json # Configuration file with keywords and settings
├── googlecrawler.py # Script using BeautifulSoup for scraping
├── seleniumCrawler.py # Script using Selenium WebDriver for scraping
├── seleniumCrawler.txt # Possibly notes or logs related to Selenium crawler
├── README.md # This file
seleniumCrawler.py: Uses Selenium to automate Chrome, scroll through image results, and download images.googlecrawler.py: Uses requests and BeautifulSoup to scrape image URLs and download images.config.json: Contains multiple records for batch image crawling with different keywords and settings.
Future Work / Roadmap
- Improve error handling and logging for robustness.
- Add command-line interface to specify keywords and settings dynamically.
- Support more flexible output naming conventions.
- Implement parallel downloads to improve speed.
- Add support for other search engines or image sources.
- Package as a Python module for easier integration.
- Update scraping logic to handle changes in Google Image Search markup.
Assumptions: The project requires manual setup of ChromeDriver and Python dependencies. The README assumes basic familiarity with Python and Selenium.