
Build a Web Crawler in 10 Minutes

Today we’re going to show you how to write a web crawler in about 10 minutes. You’ll be able to crawl your website, collect your links and do whatever you want with them. Use this power wisely, my friend. We’re going to write this in PHP, and plenty of people will ask “why not NodeJS, or Python?”, which are both great languages for this type of work. But today we’re going to do this with PHP.

Requirements:

  • composer
  • PHP 5.6+
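If you don’t have the dependencies yet, the crawler below leans on Guzzle for the HTTP requests and Symfony’s DomCrawler for parsing (its filter() method also needs the CssSelector component), so a Composer setup along these lines should do it:

composer require guzzlehttp/guzzle symfony/dom-crawler symfony/css-selector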

The general structure of a crawler is a parent and children, or, another way to think about it, a commander and soldiers. The commander sends out soldiers who report back, and those reports may require more soldiers to be sent out. The process is usually controlled with a limit on how far the soldiers can go: once that limit is reached, the crawler finishes. The key thing to remember is that the crawler has a limit. Like any process it needs to complete, and we may need to run it again and again to collect all the things. Here at Yab we’ve developed a few projects with complex crawlers that run on CRON jobs in order to keep crawling and collecting content, and we’re constantly tweaking them. With that scope in mind, let’s code!

So let’s look first at a very basic use case: listing out the links on a site:

<?php

require 'vendor/autoload.php';
require 'src/crawler.php';

echo "A simple web crawler";

$crawler = new \Yab\Crawler\Crawler;

$dom = $crawler->crawl('http://yourwebsite.com', 10);

foreach ($dom->links() as $link) {
    if ($link['visited']) {
        echo $link['url'] . "\n";
    }
}

Obviously we pull in the vendor autoload, and we’re going to need our very simple crawler. We create a new instance of the class and have it crawl our desired URL. Did you notice the 10?


$dom = $crawler->crawl('http://yourwebsite.com', 10);

We can see here that we have the number 10. This is our limit: when we find a page, we’re willing to go 9 more pages deep in our crawling. You can set it higher or lower and compare the results. Bear in mind that the larger your limit, the longer the processing time, and nobody likes a process timeout.
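Just to illustrate the trade-off, you could crawl the same site with two different limits and compare how many links come back (a hypothetical comparison, and the counts will obviously depend on your site):

// crawl the same site with two different limits
$shallow = (new \Yab\Crawler\Crawler)->crawl('http://yourwebsite.com', 2);
$deep    = (new \Yab\Crawler\Crawler)->crawl('http://yourwebsite.com', 10);

echo count($shallow->links()) . " links found at a depth of 2\n";
echo count($deep->links()) . " links found at a depth of 10\n";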

We then see that we get our links with the links method and list them out accordingly. That’s the most basic thing we could have a crawler do (well, we could have it do nothing, but that’s pointless). The next component to cover is the actual crawler.
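And because each link entry also records its status code and whether it’s external (you’ll see where those get set in the spider method further down), you could just as easily report only the internal links that didn’t come back with a 200. A quick sketch:

foreach ($dom->links() as $link) {
    // external links never get spidered, so they carry no status code
    if (! $link['is_external'] && isset($link['status_code']) && $link['status_code'] != 200) {
        echo $link['url'] . ' came back with status ' . $link['status_code'] . "\n";
    }
}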

So we’ll look at the crawler in pieces – it just makes it easier – trust me.



<?php

namespace Yab\Crawler;

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler as DomCrawler;


We start off with our namespace and some use statements. Guzzle and Symfony provide the best tools to get the job done, so there’s no point in trying to write our own versions, or this would be “Build a Web Crawler in 10 Weeks”.


class Crawler
{
    protected $baseUrl;
    protected $links;
    protected $depth;

    public function __construct()
    {
        $this->baseUrl = '';
        $this->links = [];
        $this->depth = 0;
    }


The basic parts of the class are the base url, the links, and the depth. These are the attributes a basic Crawler object needs: they’re what allow us to build a nice array of links while keeping track of the depth and the base url.



    public function crawl($url, $maxDepth = 10)
    {
        $this->baseUrl = $url;
        $this->depth = $maxDepth;

        $this->spider($this->baseUrl, $maxDepth);

        return $this;
    }

    public function links()
    {
        return $this->links;
    }


The crawl and links methods are the only two public methods. The remaining methods are all private and serve no purpose outside the crawler object. It’s a simple enough story to read: the crawler crawls the website, and the crawler provides the links.

The next three methods cover the spawning, spidering, and extracting, or the commander, the soldiers, and the reports. Let’s take a quick look at the methods and their arguments to get a basic understanding:



spider($url, $maxDepth) // url and depth
spawn($links, $maxDepth) // link array and depth
extractLinks($html, $url) // html of page and its url
checkIfExternal($url) // url to check


Ok, I lied, there are four methods, but checkIfExternal is really just a small helper and isn’t critical to understanding the high level logic.

Let’s look at the spider method. This is the method that will crawl the page and get its content, as well as make some notes about the url being crawled.



    private function spider($url, $maxDepth)
    {
        try {

            $this->links[$url] = [
                'status_code' => 0,
                'url' => $url,
                'visited' => false,
                'is_external' => false,
            ];

            // Create a client and send out a request to a url
            $client = new Client();
            $crawler = $client->request('GET', $url);

            // get the content of the request result
            $html = $crawler->getBody()->getContents();
            // lets also get the status code
            $statusCode = $crawler->getStatusCode();

            // Set the status code
            $this->links[$url]['status_code'] = $statusCode;
            if ($statusCode == 200) {

                // Make sure the page is html
                $contentType = $crawler->getHeader('Content-Type');
                if (strpos($contentType[0], 'text/html') !== false) {

                    // collect the links within the page
                    $pageLinks = [];
                    if ($this->links[$url]['is_external'] === false) {
                        $pageLinks = $this->extractLinks($html, $url);
                    }

                    // mark current url as visited
                    $this->links[$url]['visited'] = true;
                    // spawn spiders for the child links, marking the depth as decreasing, or send out the soldiers
                    $this->spawn($pageLinks, $maxDepth - 1);
                }
            }
        } catch (\GuzzleHttp\Exception\RequestException $ex) {
            // do nothing or something
        } catch (\Exception $ex) {
            // call it a 404?
            $this->links[$url]['status_code'] = '404';
        }
    }


The spider may seem like the commander, but it’s actually the soldier. The spawn method will create more and more spiders, but the spider is just a foot soldier who goes into the trenches and reports back to the class by adding to the crawler class’s properties.
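To make the report concrete, each entry in the links array ends up looking roughly like this after a crawl (the keys are the ones set in the spider method above; the url shown is just an example):

// a sample of what $crawler->links() might contain after a crawl
$links = [
    'http://yourwebsite.com/about' => [
        'status_code' => 200,
        'url'         => 'http://yourwebsite.com/about',
        'visited'     => true,
        'is_external' => false,
    ],
    // ... one entry per url the spiders discovered
];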



    private function spawn($links, $maxDepth)
    {
        // if we hit the max depth, it's the end of the rope
        if ($maxDepth == 0) {
            return;
        }

        foreach ($links as $url => $info) {
            // only pay attention to those we do not know
            if (! isset($this->links[$url])) {
                $this->links[$url] = $info;
                // we really only care about links which belong to this domain
                if (! empty($url) && ! $this->links[$url]['visited'] && ! $this->links[$url]['is_external']) {
                    // restart the process by sending out more soldiers!
                    $this->spider($this->links[$url]['url'], $maxDepth);
                }
            }
        }
    }


The spawn method is the commander. Its job is very simple: if there is a reason to send out more soldiers, it will do exactly that. This is where you can modify the logic and set the rules for why you would crawl a site. The soldier, or spider, is what you modify to change what you collect on each page.
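For example (a purely hypothetical tweak), if you wanted the commander to ignore certain kinds of links entirely, you could add a small filter at the top of spawn’s loop before any spidering happens:

        foreach ($links as $url => $info) {
            // skip anything we never want to crawl, e.g. mailto: links and PDFs
            if (strpos($url, 'mailto:') === 0 || preg_match('/\.pdf$/i', $url)) {
                continue;
            }

            // ... the rest of the original spawn() logic stays the same
        }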

The next two methods handle the parsing logic, more or less. The checkIfExternal method does exactly what its name says. The extractLinks method, on the other hand, uses Symfony’s DomCrawler to rebuild the DOM of the page in order to isolate the links and their attributes.



    private function checkIfExternal($url)
    {
        $baseUrl = str_replace(['http://', 'https://'], '', $this->baseUrl);
        // if the url starts with our own domain, it's internal, so keep going!
        if (preg_match('@https?://' . preg_quote($baseUrl, '@') . '@', $url)) {
            return false;
        }

        return true;
    }

    private function extractLinks($html, $url)
    {
        $dom = new DomCrawler($html);
        $currentLinks = [];

        // get the links
        $dom->filter('a')->each(function(DomCrawler $node, $i) use (&$currentLinks) {
            // get the href
            $nodeUrl = $node->attr('href');

            // If we don't have it lets collect it
            if (! isset($this->links[$nodeUrl])) {
                // set the basics
                $currentLinks[$nodeUrl]['is_external'] = false;
                $currentLinks[$nodeUrl]['url'] = $nodeUrl;
                $currentLinks[$nodeUrl]['visited'] = false;

                // check if the link is external
                if ($this->checkIfExternal($currentLinks[$nodeUrl]['url'])) {
                    $currentLinks[$nodeUrl]['is_external'] = true;
                }
            }
        });

        // if page is linked to itself, ex. homepage
        if (isset($currentLinks[$url])) {
            // let's avoid endless cycles
            $currentLinks[$url]['visited'] = true;
        }

        // Send back the reports
        return $currentLinks;
    }

}
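One limitation worth calling out: extractLinks stores the raw href, so a relative link like /contact won’t match the http(s)://yourdomain pattern in checkIfExternal and ends up flagged as external, which means it never gets crawled. A hypothetical helper you could add to the class to normalise root-relative links before checking them might look like this:

    // hypothetical helper: resolve root-relative hrefs against the base url
    // before they are passed to checkIfExternal()
    private function normalizeUrl($nodeUrl)
    {
        // only rewrite root-relative paths like /contact, not protocol-relative //urls
        if (strpos($nodeUrl, '/') === 0 && strpos($nodeUrl, '//') !== 0) {
            return rtrim($this->baseUrl, '/') . $nodeUrl;
        }

        return $nodeUrl;
    }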

So with all that in mind you can get a copy of the repository here, but let’s do a recap of what happens so we’re all on the same page. A very simple crawler will grab a page, look at it, get the links on it, and repeat the process for each link until it reaches a given limit. Our example here is very simple and doesn’t do much with that data. There are some very powerful tools out there that do similar work and can handle more abstract collections.

Hopefully you’ve been able to get the basic crawler working, and hopefully this gives you some insight into how very simple crawlers work. We’d love to hear about the sort of crawlers you’re building and the tools that help you do your crawling.

One last thing: don’t forget that crawlers, when used carelessly, can seriously distort your website’s analytics, so don’t test on live sites that rely on that data, or you’ll be kicking yourself for days 😉
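One small mitigation, if you do have to point the crawler at a real site, is to send a recognisable User-Agent so the traffic can be filtered out later. Guzzle lets you set default headers when building the client, so inside the spider method you could do something like this (the agent string here is made up):

// identify the crawler so its traffic can be filtered out of analytics
$client = new Client([
    'headers' => ['User-Agent' => 'YabCrawler/1.0 (+http://yourwebsite.com/crawler)'],
]);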

If you’d like any other tutorials, or want to know more about the tools we use, reach out and let us know. And if you’re curious whether this really is less than 10 minutes, copy and paste the article into http://readtime.eu/ and you’ll see ~ 00:09:24:9 🙂