There are various ways to get information from other websites, including JSON and RSS feeds. However, it's also possible to build a scraper using PHP.
With regular expressions you can extract portions of content such as images, text and metadata.
Why would you do this? In one example, I built a WordPress site that automatically generates new posts every day by crawling the content of my Facebook page. You should be able to do the same on Joomla, Drupal or any other CMS you use.
In this post, I'll share with you the code to extract images from a URL by using the file_get_contents() and preg_match_all() functions.
Step #1. The PHP code
Create a PHP file with the code below:
<?php
$html = file_get_contents('http://www.website.any');
preg_match_all( '|<img.*?src=[\'"](.*?)[\'"].*?>|i',$html, $matches );
echo $matches[ 1 ][ 0 ];
?>
Let's break down the code you see above.
First, we tell the script which URL we want to scrape. Replace 'http://www.website.any' with a valid URL.
$html = file_get_contents('http://www.website.any');
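file_get_contents() returns false when it can't read its target, and remote URLs only work if allow_url_fopen is enabled in php.ini, so it can be worth checking the result before searching it. Here is a minimal sketch of that check. The fetchHtml() helper is a name I made up for illustration, not a PHP built-in, and the example reads a missing local file so the failure path runs without a network connection:

```php
<?php
// Hypothetical helper (not a built-in): returns the source as a string, or null on failure.
function fetchHtml(string $location): ?string
{
    // The @ operator silences the warning that file_get_contents() raises on failure.
    $html = @file_get_contents($location);

    // file_get_contents() returns false on failure, so compare strictly.
    return $html === false ? null : $html;
}

// A path that does not exist, used here only to trigger the failure branch.
$html = fetchHtml('/no-such-dir/no-such-file.html');

if ($html === null) {
    echo "Could not load the page.\n";
} else {
    echo 'Loaded ' . strlen($html) . " bytes.\n";
}
```

In the scraper itself you would pass the page URL, for example fetchHtml('http://www.website.any'), and only run preg_match_all() when the result is not null.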
Next, we search for the img tags in $html. preg_match_all() saves the full tags in $matches[0] and the captured src values in $matches[1]:
preg_match_all( '|<img.*?src=[\'"](.*?)[\'"].*?>|i',$html, $matches );
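To see what ends up in $matches, it can help to run the same pattern against a short hard-coded string instead of a live page. The sample markup below is made up for illustration:

```php
<?php
// Made-up sample markup standing in for a downloaded page.
$html = '<p>Hello</p><img src="/images/logo.png" alt="Logo"><IMG class="x" src=\'/images/banner.jpg\'>';

// Same pattern as in the script: the first (.*?) group captures the src value.
// The trailing "i" makes the match case-insensitive, so <IMG> is found too.
$count = preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);

echo "Found $count images\n";

// $matches[0] holds the full <img> tags, $matches[1] holds the captured src values.
print_r($matches[1]);
```

With this sample, $matches[1] contains the two paths /images/logo.png and /images/banner.jpg, in the order they appear in the markup. Note that the dot in the pattern does not match line breaks by default, so an img tag that is split across several lines would be missed.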
Finally, we can print the images. Here's how to print the path of the first image we find:
echo $matches[ 1 ][ 0 ];
If we change the index in the echo statement, we can print the second image:
echo $matches[ 1 ][ 1 ];
If we change the index again, we can print the third image:
echo $matches[ 1 ][ 2 ];
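Rather than editing the index by hand for each image, one option is to loop over every captured path. This is a minimal sketch that again uses made-up sample markup in place of a live page:

```php
<?php
// Made-up sample markup; in the real script $html comes from file_get_contents().
$html = '<img src="/a.png"><img src="/b.png"><img src="/c.png">';

preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);

// $matches[1] is a plain list of src values, so foreach visits each image in page order.
$lines = [];
foreach ($matches[1] as $index => $src) {
    $lines[] = ($index + 1) . ': ' . $src;
}

echo implode("\n", $lines), "\n";
```

A loop also copes with pages that contain no images: $matches[1] is then an empty array and nothing is printed, whereas reading $matches[1][0] directly would raise an undefined index warning.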
Step #2. Execute the script
Now it's time to run the PHP file through your browser. For example, visit the URL of your file: http://localhost/your-folder/your-file.php
In my test, I successfully extracted the path for the first image from OSTraining.com.
Step #3. Print the img tag
Now that we know the relative path to the image, let's add the domain so that we can display the image. Edit the echo statement from step 1 to include the img tag as follows:
echo '<img src="http://www.website.any' . $matches[ 1 ][ 0 ] . '" />';
In my example, I added the domain "http://www.ostraining.com".
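Some pages mix relative paths with full URLs, in which case always prepending the domain would break the full ones. Here is a minimal sketch of one way to handle both. The toAbsoluteUrl() helper is a hypothetical name of my own, and it assumes every relative path is relative to the site root:

```php
<?php
// Hypothetical helper (not a built-in): prefix the domain only when the path is relative.
function toAbsoluteUrl(string $domain, string $src): string
{
    // Already absolute (http://, https://) or protocol-relative (//): leave untouched.
    if (preg_match('#^(https?:)?//#i', $src)) {
        return $src;
    }

    // Join with exactly one slash between the domain and the path.
    return rtrim($domain, '/') . '/' . ltrim($src, '/');
}

$src = toAbsoluteUrl('http://www.website.any', '/images/logo.png');

// htmlspecialchars() escapes quotes and angle brackets before the value goes into the attribute.
echo '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '" />';
```

This does not resolve paths such as ../images/logo.png or paths relative to a subfolder of the scraped page, so treat it as a starting point rather than a complete URL resolver.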
Run the script again through the browser to preview the image.