Check if multiple remote files exist HELP! (Using curl)


#1
SeanStar

    Learning Programmer

  • Members
  • 32 posts
$con = mysql_connect("localhost", "seanstar", "");
if (!$con) {
    die('Could not connect: ' . mysql_error());
}
$db_selected = mysql_select_db("seanstar_urls", $con);

$SQL = mysql_query("SELECT url FROM urls", $con);
$mh = curl_multi_init();
$handles = array();
while ($resultSet = mysql_fetch_array($SQL)) {
    $ch = curl_init($resultSet['url']);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

$running = null;
$result = curl_multi_exec($mh, $running);
if (strstr($result, "404") != FALSE) {
    echo "It is 404.<br>";
} else {
    echo "It is not 404.<br>";
}

foreach ($handles as $ch) {
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);

I need it to report, for each URL in the urls table, whether it returns a 404 or not. Any help? What I tried does not work!

#2
Alexander

    It's Science!

  • Moderators
  • 4,124 posts
Have you tried to see what $result returns?
Be sure to read the updated FAQ! || Health is achieved through the same 10,000 steps.
If a suggested code/method fails, informing us is less important than telling us why or what errors occurred.

#3
SeanStar

    Learning Programmer

  • Members
  • 32 posts

Nullw0rm said:

Have you tried to see what $result returns?

It returns 0, for some reason. With a single curl call it displays the full header of the first URL, like it should.
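
A note on that 0: curl_multi_exec() returns a CURLM_* status code rather than any response data, and CURLM_OK is the integer 0, so searching its return value for "404" cannot work. A minimal sketch of that behaviour, assuming the curl extension is loaded (no URLs are needed to see it):

```php
<?php
// Sketch: the return value of curl_multi_exec() is a status code, not headers.
$mh = curl_multi_init();
$running = null;

// No handles are attached, so there is nothing to transfer;
// the call still reports its own status.
$status = curl_multi_exec($mh, $running);

var_dump($status === CURLM_OK); // CURLM_OK is defined as 0
var_dump($running);             // number of transfers still in progress

curl_multi_close($mh);
```

The response of each request has to be read from its own handle afterwards, for example with curl_multi_getcontent() or curl_getinfo().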

#4
SeanStar

    Learning Programmer

  • Members
  • 32 posts
SOLUTION:
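// $sql is the mysql_query() result for "SELECT url FROM urls"
while ($row = mysql_fetch_array($sql)) {
    if (doesFileExist($row['url'])) {
        echo "Website OK<br>";
    } else {
        echo "Website DOWN<br>";
    }
}

// Returns false if the request fails (timeout, bad host) or the server answers 404.
function doesFileExist($link) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $link);
    curl_setopt($ch, CURLOPT_HEADER, TRUE);
    curl_setopt($ch, CURLOPT_NOBODY, TRUE);          // headers only, no body
    curl_setopt($ch, CURLOPT_TIMEOUT, 3);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $result = curl_exec($ch);
    // Read the numeric status instead of searching the header text for "404",
    // which would also match values such as "Content-Length: 4040".
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    // curl_exec() returns false on failure; that must not count as "found".
    return $result !== false && $status != 404;
}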


#5
Alexander

    It's Science!

  • Moderators
  • 4,124 posts
I apologize, file handles are not the most straightforward thing. Your original code only checked whether a 404 header was sent. That header is only sent if the server chooses to send it, so it does not tell you whether an address exists or whether a server is physically offline. PHP has a great stream handler that is often underused or unknown, so I will give it a go:
<?php
$urls = array('http://php.net/mtime',            // exists
              'http://foobarbazinvalid3214.com', // invalid host
              'http://hosting.com/aboutus/',     // exists but subfolder does not
              'http://321252351xxz.com');        // invalid host

$options = array( 'http' => array(
        'max_redirects' => 10,        // stop after 10 redirects
        'timeout'       => 5,         // timeout on response
) );
$context = stream_context_create( $options ); // one context serves every request

foreach($urls as $url) {
    $page = @file_get_contents( $url, false, $context );
    if($page !== false) { // strict check: an empty page is still a page
        // $http_response_header[0] is the first status line, before any redirect is followed
        if(strstr($http_response_header[0], '200') === false && // 200 OK
           strstr($http_response_header[0], '302') === false) { // 302 Found
            print "Url $url does not exist<br/>";
        }
    } else {
        // unknown host, timeout, or an error status such as 404
        print "Url $url does not exist<br/>";
    }
}
?>
Essentially, anything other than a 302 Found or 200 OK is what you are trying to catch. If somebody's address is foo.com/subfolder and that no longer exists, it will either 404 or 301 back to the original domain (most likely serving a "Not found" page). My code checks both that the server is valid and that the requested file exists (which can add security, in case your addresses are supposed to be client entered).
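
Searching the status line for the substring '200' or '302' works for typical responses, but comparing the parsed number is less fragile. A minimal sketch, where statusCode() is a hypothetical helper name (not part of PHP or of the code above) that takes a status line such as $http_response_header[0]:

```php
<?php
// Hypothetical helper: extract the numeric code from a status line
// such as "HTTP/1.1 404 Not Found". Returns 0 if the line does not parse.
function statusCode($statusLine) {
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return 0;
}

// The codes the post treats as "exists" are 200 and 302.
function looksAlive($statusLine) {
    return in_array(statusCode($statusLine), array(200, 302), true);
}

var_dump(looksAlive('HTTP/1.1 200 OK'));
var_dump(looksAlive('HTTP/1.1 404 Not Found'));
```

In the loop above, looksAlive($http_response_header[0]) could stand in for the two strstr() calls.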

#6
SeanStar

    Learning Programmer

  • Members
  • 32 posts
Thanks for your response, null, but I already figured it out with a function. I figured I could use a function to allow the curl to do its magic.

#7
Orjan

    Writes binary right handed and hex left handed

  • Moderators
  • 3,299 posts
If you use curl the way your solution does, the sites are checked one after another, so the check times add up, and if there are many timeouts it will take a very long time. That's why curl_multi is a good option; you just didn't use it correctly in your first code. This is how I've been using it:


function multifeed($urls) {
	$mcurl   = curl_multi_init();
	$curl    = array();
	$results = array();

	// one easy handle per URL, all attached to the same multi handle
	foreach ((array)$urls as $key => $url) {
		$curl[$key] = curl_init();
		curl_setopt($curl[$key], CURLOPT_RETURNTRANSFER, 1);
		curl_setopt($curl[$key], CURLOPT_CONNECTTIMEOUT, 2);
		curl_setopt($curl[$key], CURLOPT_FRESH_CONNECT, true);
		curl_setopt($curl[$key], CURLOPT_URL, $url);
		curl_multi_add_handle($mcurl, $curl[$key]);
	}

	// keep driving the transfers until none are still running
	do {
		curl_multi_exec($mcurl, $running);
		usleep(40);
	} while ($running > 0);

	// collect each response body, then release the handles
	foreach ((array)$urls as $key => $url) {
		$results[$key] = curl_multi_getcontent($curl[$key]);
		curl_multi_remove_handle($mcurl, $curl[$key]);
		curl_close($curl[$key]);
	}
	curl_multi_close($mcurl);
	return $results;
}

// usage:
$list = array ("http://www.oracle.com", "http://www.microsoft.com", "http://www.linux.org");
$pages = multifeed($list);
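
For the original question (is each URL a 404 or not), the same curl_multi pattern can collect status codes instead of page bodies. The following is only a sketch under a few assumptions: checkUrls() and its $timeout parameter are hypothetical names, URLs are assumed to be unique because they are used as array keys, and curl_multi_select() is used in place of usleep() so the loop waits for network activity rather than spinning.

```php
<?php
// Hypothetical helper: returns array(url => HTTP status code),
// with 0 for URLs that produced no HTTP response at all.
function checkUrls(array $urls, $timeout = 3) {
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // headers only, no body
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep output off the page
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are running or the multi handle reports an error.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($status !== CURLM_OK) {
            break;
        }
        if ($running > 0) {
            curl_multi_select($mh, 1.0); // wait up to 1 second for activity
        }
    } while ($running > 0);

    // The status of each request lives on its own handle.
    $codes = array();
    foreach ($handles as $url => $ch) {
        $codes[$url] = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $codes;
}

// usage: $codes = checkUrls($list); then test $codes[$url] === 404
```

A code of 0 means the request never got an HTTP answer (bad host, timeout), which is a different situation from a server that answers with 404.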


__________________________________________
I study Information Systems at Karlstad University when I'm not on CodeCall

#8
SeanStar

    Learning Programmer

  • Members
  • 32 posts

I may try this. Fortunately, though, very rarely will I have a timeout.