Simple URL Crawler to Extract Internal and External Links in PHP

One of my partners asked me for help with his personal website. He is not a pro user: he created a simple website a couple of years ago and relies on his hoster's support. Stuff like responsive design and SEO (what is this?) does not matter to him; he just wants to publish his work for his students.

My partner told me that one of his students had reported a dubious link on his website pointing to a suspicious site. He asked me to find and remove this link.

Challenge Accepted

It would have been easy enough to click through all menu items and follow every link manually. But well, let's build a simple tool instead 😃. My idea was a simple web crawler that fetches all (sub)pages of the website. This way, I get each page as a plain string and can use regular expressions to extract URLs as substrings. Whenever I identify a URL as a subpage of the current site, I need to visit it and parse its content as well.
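Just to illustrate the extraction idea (the actual crawler further down uses DOMDocument instead of a regular expression), something like the following pulls href values out of a page string. The pattern is a rough sketch, not a robust URL matcher:

```php
<?php
// Rough sketch: extract href values from raw HTML with a regular expression.
// This is only an illustration; the crawler below uses DOMDocument instead.
$html = file_get_contents('https://example.com'); // hypothetical root domain

if ($html !== false && preg_match_all('/href="([^"]+)"/i', $html, $matches)) {
    foreach ($matches[1] as $href) {
        echo $href . PHP_EOL;
    }
}
```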

But at this point, it is very important to be careful. I need a way to differentiate between internal and external links, as otherwise the crawler would visit and crawl the external sites as well, and I could end up crawling the whole web 😅

Some Code

First things first – this approach makes no claim to be mature. It just works and fulfills my requirements. There are things that could be done better, but hey – this was a two-hour evening “project”. Please have some mercy on me 😊

I had a Queue data structure in mind as a temporary buffer. A queue provides easy access to its elements and takes care of adding and removing them. Using a queue, we can enqueue all internal URLs and process them one by one, while other data structures serve as a “storage” for the URLs¹. I decided to use the HashSet data structure as the URL storage because it detects duplicates automatically and does not allow them.
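Here is a minimal sketch of how the two structures are meant to interact. The method names (enqueue(), dequeue(), isEmpty(), add(), contains()) are assumptions about the PHPAlgorithms API, so check them against the version you have installed:

```php
<?php
// Minimal sketch of the Queue/HashSet interplay described above.
// NOTE: the method names (enqueue(), dequeue(), isEmpty(), add(), contains())
// are assumptions about the PHPAlgorithms API; verify them in your version.
require_once __DIR__ . '/vendor/autoload.php';

use doganoo\PHPAlgorithms\Datastructure\Stackqueue\Queue;
use doganoo\PHPAlgorithms\Datastructure\Sets\HashSet;

$queue   = new Queue();   // buffer of URLs that still need to be crawled
$visited = new HashSet(); // storage that rejects duplicate URLs on its own

$url = 'https://example.com/contact.html'; // hypothetical URL

$visited->add($url);
$visited->add($url);                // adding twice keeps only one copy
var_dump($visited->contains($url)); // bool(true)

$queue->enqueue($url);              // schedule the URL for crawling
while (!$queue->isEmpty()) {
    $next = $queue->dequeue();      // FIFO: oldest entry first
    // ... crawl $next here ...
}
```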

Initially, we add the root domain to the queue to ensure that we crawl at least one page. Next, we retrieve a URL from the queue and try to load its content. If PHP’s “DOMDocument” cannot load the page (for whatever reason), we add the URL to the “error” HashSet and process the next URL.

Next, we look for all HTML a-tags, as we assume that every URL of interest sits in such a tag. We further assume that the href attributes contain nothing but URLs. The only special case we handle is relative paths (such as public/contact.html instead of https://example.com/public/contact.html); in this case we simply prepend the root domain to the URL and go on².
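One possible way to do that prepend step a little less fragilely than a plain string check is to look at the host component; this is only a sketch and still simplistic (see footnote 2), and the helper name is made up for illustration:

```php
<?php
// Sketch: turn a possibly relative href into an absolute URL.
// A missing host component is treated as "relative". makeAbsolute() is a
// hypothetical helper, not part of the original code.
function makeAbsolute(string $href, string $rootDomain): string
{
    $host = parse_url($href, PHP_URL_HOST);
    if ($host === null || $host === false) {
        // relative path such as "public/contact.html"
        return rtrim($rootDomain, '/') . '/' . ltrim($href, '/');
    }
    return $href; // already absolute
}

echo makeAbsolute('public/contact.html', 'https://example.com');
// https://example.com/public/contact.html
```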

We also want to prevent our code from crawling a page multiple times. Therefore, we check the internal and external storages to see whether we have processed this URL before. If so, we simply skip it and move on to the next one.

As a last (and very critical) step, we need to differentiate between internal and external URLs. In my case, it was enough to check whether the root domain is present in the URL. However, this may not be enough for you, so please adapt it to your own requirements! If the URL contains the root domain, we add it to the internal HashSet and to the queue as well. If it does not, we treat it as an external URL and add it only to the external set, not to the queue. This is critical, as otherwise we would start to crawl the external URLs as well!
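If the plain substring check is too loose for your site (it would, for example, classify https://evil.example/?ref=https://example.com as internal), comparing the host components is one possible tightening. A sketch, with isInternal() being a hypothetical helper:

```php
<?php
// Sketch: classify a URL as internal by comparing hosts instead of
// checking for the root domain as a substring. isInternal() is a
// hypothetical helper, not part of the original code.
function isInternal(string $url, string $rootDomain): bool
{
    $urlHost  = parse_url($url, PHP_URL_HOST);
    $rootHost = parse_url($rootDomain, PHP_URL_HOST);
    return $urlHost !== null && $urlHost !== false && $urlHost === $rootHost;
}

var_dump(isInternal('https://example.com/contact.html', 'https://example.com'));          // bool(true)
var_dump(isInternal('https://evil.example/?ref=https://example.com', 'https://example.com')); // bool(false)
```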

Here is the corresponding code:
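The listing is a sketch that follows the steps described above; the Queue/HashSet method names (enqueue(), dequeue(), isEmpty(), add(), contains()) are assumptions about the PHPAlgorithms API, so double-check them against the version you install.

```php
<?php
// Sketch of the crawler described above. The Queue/HashSet method names
// are assumptions about the PHPAlgorithms API; verify them in your version.
require_once __DIR__ . '/vendor/autoload.php';

use doganoo\PHPAlgorithms\Datastructure\Stackqueue\Queue;
use doganoo\PHPAlgorithms\Datastructure\Sets\HashSet;

$rootDomain = 'https://example.com'; // hypothetical root domain

$queue    = new Queue();   // URLs that still have to be crawled
$internal = new HashSet(); // internal URLs already seen
$external = new HashSet(); // external URLs already seen
$errors   = new HashSet(); // URLs that could not be loaded

// start with the root domain so at least one page is crawled
$queue->enqueue($rootDomain);
$internal->add($rootDomain);

while (!$queue->isEmpty()) {
    $url = $queue->dequeue();

    // try to load the page; on failure remember the URL and continue
    $dom = new DOMDocument();
    if (@$dom->loadHTMLFile($url) === false) {
        $errors->add($url);
        continue;
    }

    // we assume every URL of interest sits in the href of an <a> tag
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue;
        }

        // naive handling of relative paths: just prepend the root domain
        if (strpos($href, 'http') !== 0) {
            $href = rtrim($rootDomain, '/') . '/' . ltrim($href, '/');
        }

        // skip URLs that were registered before
        if ($internal->contains($href) || $external->contains($href)) {
            continue;
        }

        if (strpos($href, $rootDomain) !== false) {
            // internal: store it and crawl it later
            $internal->add($href);
            $queue->enqueue($href);
        } else {
            // external: store it only, never enqueue, or we crawl the web
            $external->add($href);
        }
    }
}

// at this point $internal, $external and $errors hold the collected URLs
```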

doganoo\PHPAlgorithms\Datastructure\Stackqueue\Queue and doganoo\PHPAlgorithms\Datastructure\Sets\HashSet are part of PHPAlgorithms, an open source PHP library for Algorithms and Data Structures. Simply require it via composer: composer require doganoo/php-algorithms

EDIT

Since we released PHPAlgorithms version 1, there have been some breaking changes relevant to this post. You can either use a version lower than 1 or make sure you use the right namespaces/classes.
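For example, one way to pin the library below version 1 with Composer's constraint syntax (adjust the constraint to whatever fits your setup):

```
composer require "doganoo/php-algorithms:<1.0"
```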

Footnotes

  1. In fact, the Stack data structure would also do the job
  2. This is one of the incomplete and ugly things. Please take note of this and correct it if necessary.