Crawling Web Pages and Creating Sitemaps

Creating a Sitemap Based on All the Links within a Website

I built this web crawler because I wanted a way to create a sitemap of a website I was building. There are a few online services that will do this for you, but I didn't want to rely on someone else and I wanted the freedom to change a few things. So I built my own using PHP and cURL.
I started by creating a class for the crawler. When I instantiate it I pass in the URL of the page I want to start with. The constructor uses cURL to fetch the page's content and headers. The class also has methods to get all the links on a page, the page title, the entire content, just the body content, and the headers; you could easily add more, say, to grab all the images on a page.

The Crawler Class

  

class Crawler {
  protected $markup='';
  protected $httpinfo='';

  public function __construct($uri, $justheaders=0){
    $output = $this->getMarkup($uri, $justheaders);
    $this->markup = $output['output'];
    $this->httpinfo = $output['code'];
  }

  public function getMarkup($uri, $justheaders) {
    $ch = curl_init($uri);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    if($justheaders){
      curl_setopt($ch, CURLOPT_NOBODY, 1);
      curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
      curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);

    $output['output'] = curl_exec($ch);
    $output['code'] = curl_getinfo($ch);
    curl_close($ch);
    return $output;
  }

  public function get($type){
    $method = "_get_{$type}";
    if (method_exists($this, $method)){
      // call_user_method() was removed in PHP 7; call the method directly
      return $this->$method();
    }
  }

  protected function _get_info(){
    return $this->httpinfo;
  }

  protected function _get_links(){
    if(!empty($this->markup)){
      // Capture the href value (quotes included; they get trimmed later)
      preg_match_all('/<a(?:.*?)href=(["\'].*?["\'])(.*?)>(.*?)<\/a>/is',
                               $this->markup, $links);
      // Flipping twice removes duplicate values while keeping them intact
      return !empty($links[1]) ? array_flip(array_flip($links[1])) : FALSE;
    }
  }

  protected function _get_body(){
    if(!empty($this->markup)){
      // Allow attributes on the body tag and match across newlines
      preg_match('/<body[^>]*>(.*?)<\/body>/is', $this->markup, $body);
      return !empty($body[1]) ? $body[1] : FALSE;
    }
  }
  protected function _get_content(){
    if(!empty($this->markup)){
      return $this->markup;
    }
  }

  protected function _get_pagetitle() {
    if (!empty($this->markup)){
     preg_match_all('/<title>(.*?)\<\/title\>/si', $this->markup, $pagetitles);
     return !empty($pagetitles[1]) ? $pagetitles[1] : FALSE;
    }
  }
}
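
Before moving on, here's a minimal sketch of how the class gets used (example.com is just a placeholder):

// Quick usage sketch of the Crawler class (example.com is a placeholder)
$crawl = new Crawler('http://www.example.com/');

$info  = $crawl->get('info');       // cURL info array (http_code, url, ...)
$links = $crawl->get('links');      // unique hrefs found on the page
$title = $crawl->get('pagetitle');  // array of <title> matches

if($info['http_code'] == 200 && is_array($links)){
  foreach($links as $href){
    echo trim($href, '"\'') . "\n"; // hrefs still carry their quotes here
  }
}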


After this I create a recursive function that follows each of the links. Each time the function is called it creates a new instance of the Crawler class. If the URL isn't valid I just return. If the URL is redirected, cURL can follow it via the CURLOPT_FOLLOWLOCATION option. Since this is turned on, you need to get the actual URL, which is contained in the header information. After this I call the links method, which returns all the unique links on a page (calling array_flip twice is what makes them unique).
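
The double-flip trick is worth a quick illustration:

// array_flip() turns values into keys, so duplicate values collapse
// into a single key; flipping back restores the deduplicated list.
$links = array('/about', '/blog', '/about', '/contact');
$unique = array_flip(array_flip($links));
print_r($unique); // [2 => '/about', 1 => '/blog', 3 => '/contact']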

I then get the title tag for each page; this is used when creating the sitemap. I also remove all the script tags and then all the HTML tags from the body. The next thing I do is get a base URL. Since I'm creating a sitemap of just one website, I don't need links to external pages, so if a link doesn't contain this base URL I won't follow it.

Now I begin looping through all the links on the page. If it's an external link I skip it. If it's an absolute link that has a "/" after the domain, I strip the scheme and domain, leaving a root-relative path. If it's an absolute link without that slash, the new value is just an empty string. I do this because I'll be prepending the base URL we grabbed earlier.

I then explode on the "/". If the first element is empty I put the base URL in there; otherwise I prepend the current URL's path to it. This is done to get the correct link whether it is relative, root-relative, or absolute.
The complete link is then formed and checked against $globallinkarr. If it isn't already there, I add it and begin building the different levels: every "/" in the URL marks a new level, and these levels are used later when creating the HTML sitemap. To build the array of levels I have to merge arrays recursively, but the regular array_merge_recursive() function didn't quite work. If a URL has a numeric segment, for example blog/2009/12/post, that function renumbers the 2009 key instead of merging into it. I needed the keys kept as-is, so I grabbed an alternative implementation from the comments on php.net.
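
Here's a quick way to see the problem:

// array_merge_recursive() reindexes numeric keys like 2009
$a = array('blog' => array('2009' => array('12' => array('post'))));
$b = array('blog' => array('2009' => array('11' => array('older'))));

print_r(array_merge_recursive($a, $b));
// 'blog' now holds keys 0 and 1 instead of a single merged '2009' branch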

The Function to Get all the Links


$dontfollow = array('pdf', 'jpg', 'png', 'jpeg','zip', 'gz', 'tar', 'txt');

function findAllLinks($url){
  global $globallinkarr;
  global $depthlinks;
  global $dontfollow;
  global $pagetitles;
  global $contentarr;

  $crawl = new Crawler($url);
  $info = $crawl->get('info');

  $validcodes = array(200,301,302);
  if(!in_array($info['http_code'], $validcodes))
    return;
  $url = $info['url'];
  $links = $crawl->get('links');
  $title = $crawl->get('pagetitle');
  $title = is_array($title) ? $title[0] : '';
  $body = $crawl->get('body');

  // Strip out script blocks first, then all remaining HTML tags
  $content = strip_tags(preg_replace('/<script.*?<\/script>/is', '', $body));

  if(!array_key_exists($url, $contentarr))
    $contentarr[$url] = array('title'=>"$title", 'pagecontent'=>"$content");

  if(!is_array($links) || !count($links)) return;
  else{
    if(preg_match('/http(?:s)?:\/\/(.*?)\/(.*)/', $url, $pattern)){
      $baseurl = $pattern[1];
    }else{
      $baseurl = $url;
    }

    foreach($links as $val){
      // Skip javascript pseudo-links and in-page anchors
      if(preg_match('/javascript:void\(0\)/', $val) || strpos($val, '#') !== false){
        continue;
      }
      if(!preg_match('/[0-9a-zA-Z]/', $val)) continue;
      $val = trim($val, '"\'');

      /**
       * Check if the link is going to another domain. If so, don't follow it.
       */
      if(preg_match('/^http(s)?:\/\//', $val) &&
               strpos($val, preg_replace('/http(s)?:\/\//', '', $baseurl)) === false){
        continue;
      }

      // Reduce absolute links to root-relative so the base URL can be prepended
      if(strpos($val, 'http') !== false && preg_match('/^http(s)?:\/\/.*?\//', $val)){
        $val = preg_replace('/^http(s)?:\/\/.*?\//', '/', $val);
      }else if(strpos($val, 'http') !== false){
        $val = '';
      }

      $sl = explode('/', $val);

      if(!preg_match('/[0-9a-zA-Z]/', $sl[0])){
        // Root-relative link: prepend the bare domain
        $sl[0] = preg_replace('/^http(s)?:\/\//', '', $baseurl);
        $complink = implode('/', $sl);
        $sl = explode('/', $complink);
      }else{
        // Relative link: prepend the current page's path (minus the last segment)
        $prepend = explode('/', preg_replace('/^http(s)?:\/\//', '', $url));
        if(count($prepend)>1){
          array_pop($prepend);
          $prep = implode('/', $prepend);
        }else $prep = $prepend[0];
        $sl[0] = $prep.'/'.$sl[0];
        $complink = implode('/', $sl);
        $sl = explode('/', $complink);
      }
      if(!end($sl)) array_pop($sl);

      if(!in_array($complink, $globallinkarr)){
        $globallinkarr[] = $complink;
        $pagetitles[$complink] = $title;

        $depth = count($sl);
        $templinks = array();
        $newlinks = array();
        if($depth > 1){
          // An empty last segment means the link ended in '/'; treat as index
          if(!$sl[$depth-1]) $sl[$depth-1] = 'index';
          $templinks[$sl[$depth-2]][] = $sl[$depth-1];

          // Wrap the array one level deeper for each remaining path segment
          if($depth > 2){
            for($i=$depth-2; $i>0; $i--){
              $hold = $templinks;
              $templinks = array();
              $templinks[$sl[$i-1]] = $hold;
            }
          }

          $temp = $templinks[$sl[0]];
          $newlinks[$sl[0]] = $temp;
        }
        $depthlinks = array_merge_recursive2($newlinks,$depthlinks);
        // end() needs a real variable, so split into a temp first
        $parts = explode('.', $complink);
        $end = strtolower(end($parts));

        if(!preg_match('/^http(s)?:\/\//', $complink))
          $complink = 'http://'.$complink;

        if(!in_array($end, $dontfollow) && strpos($complink, 'sitemap') === false){
          findAllLinks($complink);
        }
      }
    }
    return;
  }
}
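
To kick off a crawl, the globals just need to be initialized before the first call. Something like this (the starting URL is a placeholder):

// Initialize the globals the function relies on (example.com is a placeholder)
$globallinkarr = array();
$depthlinks    = array();
$pagetitles    = array();
$contentarr    = array();

findAllLinks('http://www.example.com/');

echo count($globallinkarr) . " unique links found\n";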

The Array Merge Recursive Function I Used From php.net

function array_merge_recursive2($array1, $array2){
  $arrays = func_get_args();
  $narrays = count($arrays);

  // check arguments
  // comment out if more performance is necessary
  //   (in this case the foreach loop will trigger a warning if the argument is not an array)
  for ($i = 0; $i < $narrays; $i ++) {
   if (!is_array($arrays[$i])) {
   // also array_merge_recursive returns nothing in this case
     trigger_error('Argument #' . ($i+1) . ' is not an array - trying to merge array with scalar! Returning null!', E_USER_WARNING);
     return;
    }
  }

  // the first array is in the output set in every case
  $ret = $arrays[0];

  // merge $ret with the remaining arrays
  for ($i = 1; $i < $narrays; $i ++) {
    foreach ($arrays[$i] as $key => $value) {
      /* KEPT COMMENTED OUT SO THE ORIGINAL (NUMERIC) KEYS ARE PRESERVED
      if (((string) $key) === ((string) intval($key))) { // integer key - append
        $ret[] = $value;
      }
      else { // string key - merge
      */
      if (is_array($value) && isset($ret[$key])) {
        // if $ret[$key] is not an array you try to merge a scalar
        // value with an array - the result is not defined (incompatible arrays);
        // the call will trigger an E_USER_WARNING and $ret[$key] will be null
        $ret[$key] = array_merge_recursive2($ret[$key], $value);
      }
      else {
        $ret[$key] = $value;
      }
      // } end of the commented-out integer-key branch
    }
  }
  return $ret;
}
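
With this version, the numeric segments stay put:

// Same example as before, but the numeric keys are preserved
$a = array('blog' => array('2009' => array('12' => array('post'))));
$b = array('blog' => array('2009' => array('11' => array('older'))));

print_r(array_merge_recursive2($a, $b));
// 'blog' => [2009 => [12 => ['post'], 11 => ['older']]]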

So after I've created the arrays with all the links, I create the sitemaps. The first one is an XML sitemap, referenced from the robots.txt file.
It's really simple and uses the $globallinkarr array.

XML Sitemap

function createXMLSiteMap($globallinkarr){
  $xml = '<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
  if(count($globallinkarr)){
    foreach($globallinkarr as $val){
      if(!preg_match('/http(s)?:\/\//', $val)){
        $val = 'http://'.$val;
      }
      // Ampersands must be entity-encoded inside the XML
      $xml .= '
        <url>
           <loc>'.str_replace('&', '&amp;', $val).'</loc>
        </url>';
    }
  }
  $xml .= '</urlset>';
  return $xml;
}
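
Writing the result out is a one-liner (the filename is whatever your robots.txt points at):

// Save the sitemap; reference it from robots.txt with a "Sitemap:" line
file_put_contents('sitemap.xml', createXMLSiteMap($globallinkarr));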

The HTML sitemap is created using a recursive function that walks through $depthlinks. It also checks whether a link is valid, but only if the link isn't already in $globallinkarr, since those have all been checked. When it does check, it only needs the headers, so to speed things up I set the cURL option CURLOPT_NOBODY to true. The page was timing out on me a lot, but setting this option and skipping links already in $globallinkarr helped stop that. Still, if you have a whole lot of links there is a good chance the page will time out.
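
That header-only check is what the second constructor argument on the Crawler class is for:

// Header-only request: the second argument enables CURLOPT_NOBODY
$test = new Crawler('http://www.example.com/some-page', 1);
$info = $test->get('info');
echo $info['http_code']; // e.g. 200 or 404, without downloading the body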

The HTML Sitemap

function createSiteMap($depthlinks, $before = ''){
  global $globallinkarr;
  global $pagetitles;
  $validcodes = array(200,301,302);
  if(count($depthlinks)){
    $sitetree = '<ul style="padding:5px; margin:5px;">';
    foreach($depthlinks as $key=>$val){
      if(is_array($val)){
        // Build the full path for this level of the tree
        $newbefore = $before ? $before.'/' : '';
        $newbefore .= $key;

        $newkey = preg_replace('/^http(s)?:\/\//', '', $newbefore);

        $title = ($pagetitles[$newkey] != "") ? $pagetitles[$newkey] : $newbefore;
        if(!preg_match('/^http(s)?:\/\//', $newbefore))
          $newbefore = 'http://'.$newbefore;

        // Only hit the network for links we haven't already crawled
        $exist = 0;
        if(in_array($newbefore, $globallinkarr)){
          $exist = 1;
        }else{
          $test = new Crawler($newbefore, 1);
          $info = $test->get('info');
          if(in_array($info['http_code'], $validcodes))
            $exist = 1;
        }
        if($exist){
          $sitetree .= '
           <li><a style="display:block;" href="'.$newbefore.'"
           target="_blank" title="'.$title.'">'.$title.'</a>';
        }else{
          $sitetree .= '
             <li>'.$title;
        }
        // Recurse into this branch, then close the list item
        $temp = createSiteMap($val, $newbefore);
        if($temp){
          $sitetree .= $temp;
        }
        $sitetree .= '</li>';
      }else{
        if($before != '') $newval = $before.'/'.$val;
        else $newval = $val;
        $newkey = preg_replace('/^http(s)?:\/\//', '', $newval);
        $title = $pagetitles[$newkey] ? $pagetitles[$newkey] : $newval;
        if(!preg_match('/^http(s)?:\/\//', $newval))
          $newval = 'http://'.$newval;

        $exist = 0;
        if(in_array($newval, $globallinkarr)){
          $exist = 1;
        }else{
          $test = new Crawler($newval, 1);
          $info = $test->get('info');
          if(in_array($info['http_code'], $validcodes))
            $exist = 1;
        }

        if($exist){
          $sitetree .= '
            <li><a href="'.$newval.'" title="'.$title.'"
                target="_blank">'.$title.'</a></li>';
        }
      }
    }
    $sitetree .= '</ul>';
    return $sitetree;
  }
}
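
Then the sitemap page just echoes the tree once the crawl has finished:

// Render the HTML sitemap from the depth array built during the crawl
echo createSiteMap($depthlinks);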

Execute Code After Browser Finishes Loading

How to execute a script after the browser stops loading

Recently I needed to execute a bit of code, but the browser was timing out. I was actually using the PHP exec function to tar up some files, and the browser kept showing a timeout page. So, to make the browser think the page was done loading and then execute the script, I put everything in the output buffer with ob_start(). When I'm ready to cut the browser loose, I get the size of the buffer, set the Content-Length header to that size, and flush the buffer. The browser no longer shows that it's loading, since it has received all the content it is going to receive. Now you can execute any script.

Here is the code


<?php
ob_end_clean();               // discard any buffer that may already be open
header("Connection: close");  // tell the browser not to hold the socket open
ignore_user_abort();          // optional: keep running if the user navigates away
ob_start();

echo "some content";

$size = ob_get_length();
header("Content-Length: $size"); // the browser knows exactly how much to expect
ob_end_flush();
flush();
// Browser is done loading at this point

// put your long-running script here

?>
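
As an aside, if you happen to be running PHP under PHP-FPM, the fastcgi_finish_request() function achieves the same effect in a single call. A minimal sketch, assuming PHP-FPM:

// PHP-FPM only: flush the response and close the connection in one call
echo "some content";
fastcgi_finish_request();

// put your long-running script here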

Round Robin Algorithm

Generating a Schedule Automatically

I needed a way for every team to play each other across the rounds of games. I first started out creating a chart of who each team plays. Once I had a chart that worked, I built an algorithm to generate it.

The Game Schedule

The Teams:  Team 1   Team 2   Team 3   Team 4   Team 5   Team 6
Round 1:    Team 6   Team 5   Team 4   Team 3   Team 2   Team 1
Round 2:    Team 4   Team 6   Team 5   Team 1   Team 3   Team 2
Round 3:    Team 2   Team 1   Team 6   Team 5   Team 4   Team 3
Round 4:    Team 5   Team 3   Team 2   Team 6   Team 1   Team 4
Round 5:    Team 3   Team 4   Team 1   Team 2   Team 6   Team 5

(Each column shows the opponent that team plays in a given round.)

Now you just need to figure out the pattern and we can code it up. I was told there is a name for this pattern but I didn't know it; it appears to be a variant of the classic round-robin "circle method," where one team stays fixed and the others rotate. If you know better, let me know.

The Variables

$gamearr and $tempgamearr are both arrays of all the teams, looking like this: $gamearr[0] = "firstteam", $gamearr[1] = "secondteam", and so on.

$numteamspadded is the number of teams. If that number is odd, I append a placeholder "byweek" team to the end of $gamearr and $tempgamearr so every team has an opponent each round.

$gamenumperteam is the number of games you want each team to play.

The Logic

As you can see, at the beginning I set the variable $firstteam. You'll see why I did this if you figured out the pattern above: I have an array of teams playing the same array of teams in reverse order. After each round I rotate the reverse-order array, moving the last element to the front, so we would have 654321, then 165432, then 216543. This would cause teams to play themselves, though, which is where $firstteam comes in: every time a team would play itself, we switch it out with the first team.

I start by looping through the number of games, and within that I loop through just half of the teams, since I set both the home and away team on each pass. If I looped through all the teams I would get both 1 vs 6 and 6 vs 1, and I just want the unique games.

I then loop through all the teams and set $gamearr to the values it would have if I had simply been rotating the array, moving the last item to the beginning, so $gamearr looks like 234561, 345612, 456123, and so on after each round. I use $tempgamearr because I need to know the original order of the teams, since the first team has to be switched out whenever a team would play itself.
After this I remove the first team from the array. I then loop through the teams to find whether one of them would be playing itself, and record its value and position. This is one of the reasons I remove the first team from the array: if I left it in, team 1 would "play itself" and overwrite the record of another team doing so.
I then switch the team that is playing itself with $firstteam.

After this you have an array of the games for each round.
Finally I create an array of all the games, checking whether the "byweek" placeholder is playing. If it is, I don't add that game to the array, because we don't want a byweek taking up a game slot.

The Round Robin Algorithm


$bottom = $numteamspadded-1;
$half = $numteamspadded/2;

$rrarr = array();
$firstteam = $gamearr[0];
for($i=0; $i<$gamenumperteam; $i++){
    // Pair the front half of the array against the back half
    for($j=0; $j<$half; $j++){
        if($numteamspadded==4)
            $rrarr[$i][$j]['home'] = $tempgamearr[$j];
        else $rrarr[$i][$j]['home'] = $gamearr[$j];
        $rrarr[$i][$j]['away'] = $gamearr[$bottom-$j];
    }

    // Rotate: set $gamearr to the next rotation of the original order
    for($j=1; $j<=$bottom+1; $j++){
        $start = ($i+$j)%$numteamspadded;
        $gamearr[$j-1] = $tempgamearr[$start];
    }

    // Take the first team out and find who would be playing itself
    array_splice($gamearr, array_search($firstteam, $gamearr), 1);
    foreach($gamearr as $key=>$val){
        if($tempgamearr[$bottom-$key] == $val){
            $TempGameValue = $val;
            $switch = $key;
        }
    }

    // Swap the self-playing team out with the first team
    $gamearr[$bottom] = $TempGameValue;
    $gamearr[$switch] = $firstteam;
}

// Collect all the games, skipping any that involve the padded 'byweek' team
$allgames = array();
foreach($rrarr as $key=>$val){
  foreach($val as $gkey=>$games){
    if($games['home'] != 'byweek' && $games['away'] != 'byweek'){
      $allgames[] = $games;
    }
  }
}
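
Here's roughly how I'd set it up for the six-team example above (the team names are placeholders):

// Placeholder setup for the six-team schedule shown earlier
$gamearr = array('Team 1', 'Team 2', 'Team 3', 'Team 4', 'Team 5', 'Team 6');
$tempgamearr = $gamearr;           // keep the original order for reference
$numteamspadded = count($gamearr); // already even, no 'byweek' needed
$gamenumperteam = 5;               // five rounds so everyone plays everyone

// ... run the algorithm above, then:
foreach($allgames as $game){
  echo $game['home'] . ' vs ' . $game['away'] . "\n";
}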

UPDATE: I came across another good round robin implementation at http://phpbuilder.com/board/showthread.php?p=10935200

Sorting with Umlauts

Recently I needed a drop-down box for countries. In English I simply order by name, but in the German translation, ordering by name puts all the countries that start with an umlaut character at the bottom. For example, the German translation for Austria is Österreich, and when I sort the translated names I want it to appear right after Oman. The German umlaut HTML entity for ö is &ouml;, and I use that entity to derive the plain character to sort the countries by.
Here is the code using PHP.

$origcountry = array();
$testarr = array();
$swap = array();

foreach($country as $val){
  $charset = htmlentities(utf8_decode($val['country_name']));
  if(substr($charset, 0, 1) == '&'){
    // Replace the accented first letter (e.g. 'Ö') with the base letter
    // from the entity name (the 'o' in '&ouml;') so it sorts naturally
    $newcharset = str_replace(substr(html_entity_decode($charset), 0, 1),
                substr($charset, 1, 1), html_entity_decode($charset));
    $origcountry[$val['country_id']] = $val['country_name'];
    $testarr[$val['country_id']] = $newcharset;
    $swap[$val['country_id']] = $newcharset;
  }else{
    $testarr[$val['country_id']] = $val['country_name'];
  }
}

asort($testarr);
$newcountryarr = array();
foreach($testarr as $ckey=>$val){
  // Swap the sortable stand-in back for the original accented name
  if(($key = array_search($val, $swap)) !== false){
    $newcountryarr[$ckey] = $origcountry[$key];
  }else{
    $newcountryarr[$ckey] = $val;
  }
}
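
As an aside, if the intl extension is available, PHP's Collator class handles this kind of locale-aware sorting directly. A minimal sketch, assuming ext-intl is installed:

// Locale-aware sort with the intl extension (requires ext-intl)
$names = array('Polen', 'Österreich', 'Oman');
$collator = new Collator('de_DE');
$collator->sort($names);
print_r($names); // Oman, Österreich, Polen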