Build Your Own Image Scraper with PHP

There are various ways to pull information from other websites, including JSON APIs and RSS feeds. When a site offers neither, you can build a scraper with PHP.

With regular expressions you can extract portions of content such as images, text and metadata.
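As a rough illustration of pulling text and metadata out of a page, the sketch below reads the title and the meta description from a hard-coded sample string. The sample markup is made up for this example, and the second pattern assumes the name attribute appears before the content attribute, which is not guaranteed on real pages.

```php
<?php
// Hypothetical sample page; in practice this string would come from file_get_contents().
$html = <<<'HTML'
<html>
<head>
<title>Example Page</title>
<meta name="description" content="A short summary of the page.">
</head>
<body><p>Hello</p></body>
</html>
HTML;

// Capture the text between <title> and </title>. The "s" flag lets "." span newlines.
$title = preg_match('|<title[^>]*>(.*?)</title>|is', $html, $m) ? trim($m[1]) : null;

// Capture the content attribute of the description meta tag.
// Assumes name="..." appears before content="..." inside the tag.
$description = preg_match(
    '|<meta[^>]*name=[\'"]description[\'"][^>]*content=[\'"](.*?)[\'"]|is',
    $html,
    $m
) ? $m[1] : null;

echo $title, "\n";
echo $description, "\n";
```

Both variables are null when the pattern finds nothing, so the script can tell a missing tag apart from an empty one.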

Why would you do this? In one example, I built a WordPress site that automatically generates new posts every day by crawling the content of my Facebook page. You should be able to do the same on Joomla, Drupal or any other CMS you use.

In this post, I’ll share with you the code to extract images from a URL by using the file_get_contents() and preg_match_all() functions.

Step #1. The PHP code

Create a PHP file with the code below:

{codecitation php}<?php
$html = file_get_contents('http://www.website.any');
preg_match_all( '|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches );
echo $matches[ 1 ][ 0 ];
?>{/codecitation}

Let's break down the code you see above …

First, we tell the script which URL we want to scrape. Replace 'http://www.website.any' with a valid URL.

{codecitation php}$html = file_get_contents('http://www.website.any');{/codecitation}
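file_get_contents() returns false when the request fails, so it may be worth checking the result before searching it. Here is a minimal sketch of that check. The fetch_html() name and the five-second timeout are assumptions made for this example, not part of the original script.

```php
<?php
// fetch_html() is a hypothetical helper name, not part of PHP itself.
function fetch_html(string $url): ?string
{
    // Give up after 5 seconds (an arbitrary value chosen for this sketch).
    $context = stream_context_create(['http' => ['timeout' => 5]]);

    // The @ operator suppresses the warning PHP emits when the request fails,
    // so the failure can be handled through the return value instead.
    $html = @file_get_contents($url, false, $context);

    // file_get_contents() returns false on failure.
    return $html === false ? null : $html;
}

$html = fetch_html('http://www.website.any');
if ($html === null) {
    echo "Could not fetch the page.\n";
} else {
    echo 'Fetched ', strlen($html), " bytes.\n";
}
```

Fetching a remote URL this way also depends on the allow_url_fopen setting being enabled in your PHP configuration.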

Next, we search for the img tags in $html. We also save the src values in the $matches array:

{codecitation php}preg_match_all( '|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches );{/codecitation}
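To see what ends up in $matches, it can help to run the same pattern against a small string first. The markup below is a made-up sample, not taken from a real site.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = '<p><img src="/images/logo.png" alt="Logo"> text '
      . "<IMG class='photo' src='/images/photo.jpg'></p>";

$count = preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);

// $matches[0] holds each full <img> tag that matched.
// $matches[1] holds only the first capture group: the src values.
print_r($matches[1]);
echo $count, " images found\n";
```

The i flag makes the pattern case-insensitive, which is why the upper-case IMG tag is picked up as well.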

Finally, we can print the image paths. Here's how to print the path of the first image we find:

{codecitation php}echo $matches[ 1 ][ 0 ];{/codecitation}

If we change the index in the echo line to 1, we can print the second image:

{codecitation php}echo $matches[ 1 ][ 1 ];{/codecitation}

Changing the index to 2 prints the third image:

{codecitation php}echo $matches[ 1 ][ 2 ];{/codecitation}
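Rather than changing the index by hand, you could loop over every match. The sketch below uses a made-up sample string and also drops duplicate paths, which is an extra step not present in the original script.

```php
<?php
// Hypothetical sample markup; replace with the result of file_get_contents().
$html = '<img src="/a.png"><img src="/b.jpg"><img src="/a.png">';

preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);

// Drop duplicate paths, then re-index so keys run 0, 1, 2, ...
$paths = array_values(array_unique($matches[1]));

foreach ($paths as $index => $path) {
    echo $index, ': ', $path, "\n";
}
```

If the page contains no img tags, $matches[1] is an empty array and the loop simply prints nothing.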

Step #2. Execute the script

Now it’s time to run the PHP file through your browser. For example, visit the URL of your file: http://localhost/your-folder/your-file.php

In my test, I successfully extracted the path for the first image from ostraining.com.

Extract Images with PHP. Showing the relative path.

Step #3. Print the img tag

Now that we know the relative path to the image, let's prepend the site's URL so that we can display the image. Edit the echo line of the code from step 1 to include the img tag as follows:

{codecitation php}echo '<img src="http://www.website.any' . $matches[ 1 ][ 0 ] . '" />';{/codecitation}

In my example, I added the domain “http://www.ostraining.com”.

Run the script again through the browser to preview the image.
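Prepending the domain only works when the src value is a relative path. If a page mixes relative and absolute URLs, a small helper can decide when to add the domain. The to_absolute_url() function below is a hypothetical helper written for this example, and it only covers root-relative paths and URLs that are already absolute.

```php
<?php
// to_absolute_url() is a hypothetical helper, not a built-in PHP function.
// It only handles root-relative paths such as "/images/logo.png" and
// already-absolute URLs; paths like "../img/a.png" would need more work.
function to_absolute_url(string $base, string $src): string
{
    // Leave absolute (http://, https://) and protocol-relative (//) URLs alone.
    if (preg_match('#^(https?:)?//#i', $src)) {
        return $src;
    }
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}

$base = 'http://www.website.any';
$src  = to_absolute_url($base, '/images/logo.png');

// Escape the value before placing it inside an HTML attribute.
echo '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '" />', "\n";
```

Escaping with htmlspecialchars() is a precaution, since the src value comes from a page you do not control.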

Author

Valentin Garcia

Valentin discovered Joomla in 2010, and since then he has considered it the best CMS. Valentin has been coding extensions and templates for Joomla for many years and truly enjoys helping people build their own websites with Open Source tools. He lives in San Julián, Jalisco, México.

8 Comments

Rimsha Ishaq

8 years ago

how to scrap all the images of multiple website? any solution

thanks in advance

Zul Fikar

7 years ago

how to put the href and not jump to source website?

Uptown

While PHP can be nice to make a lightweight scraper, .net is starting to really get ahead with libs like HtmlAgilityPack. Like this one. [url=http://foggymountainsolutions.com/]http://foggymountainsolutio…[/url]

Anonymous

6 years ago

how to get title and domain

Vishal

I want to scrape all images means if the website has categories & subcategories so the code would iterate all folders until all images get found

muhammadmukhshif

5 years ago

my image is not show ?

sakthi

3 years ago

Reply to muhammadmukhshif

yes me too

alex

awesome tutorial, thank you for your it. it is very clear and easy. Also as newbie in WooCommerce eCommerce i am using e-scraper to scrape all product data from my supplier sites and other sources. It helps me a lot. maybe it helps somebody too.

Thank you for your input!!!