PHP Keyword Cloud Generator

This topic has been archived. This means that you cannot reply to this topic.
5 replies to this topic

#1
BlaineSch

    Writes binary right handed and hex left handed

  • Members
  • 2,448 posts
My friend asked me to make a keyword cloud generator. If you haven't seen one, it's a program that takes user-supplied text, finds the most common words, and displays the more common ones in a larger font. It could be useful for showing a website's popular searches or something similar.

To explain how it works: first it grabs some text (the default is some random "lorem ipsum" text). It strips all special characters, leaving only a-z and spaces, splits the text into an array, and builds a second array of the form $array[word] = count, where word is the word and count is how many times it occurs. Then it keeps only the top 50 words and works out how common each one is as a percentage of the most common word: if the top word occurs 14 times, a word occurring 7 times would be 50% (7*(100/14)=50). I originally gave it 6 categories because I planned to use h1-h6; that didn't work out, but I kept the 6 categories anyway.
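
For anyone following along, here is a minimal sketch of just the counting and percentage steps described above. The helper names relativePercent() and sizeClass() and the sample words are made up for illustration; only the cutoffs (90, 80, 60, 40, 30, 15) come from the script below. It multiplies before dividing so that whole-number ratios such as 7 out of 14 stay exact.

```php
<?php
// Hypothetical helper: how common a word is relative to the most common word, in percent.
function relativePercent($count, $highest) {
    return (int) ceil($count * 100 / $highest);
}

// Hypothetical helper: map a percentage to one of six size classes (1 = largest, 0 = hidden).
function sizeClass($percent) {
    // Cutoffs taken from the script in this post, checked from largest to smallest.
    $cutoffs = array(90 => 1, 80 => 2, 60 => 3, 40 => 4, 30 => 5, 15 => 6);
    foreach ($cutoffs as $min => $class) {
        if ($percent > $min) {
            return $class;
        }
    }
    return 0;
}

// Made-up sample input, already lower-cased and split into words.
$words  = array('cloud', 'cloud', 'cloud', 'cloud', 'word', 'word', 'tag');
$counts = array_count_values($words); // word => number of occurrences
arsort($counts);                      // most common first
$highest = max($counts);

foreach ($counts as $word => $count) {
    $percent = relativePercent($count, $highest);
    echo $word, ': ', $percent, '% -> size', sizeClass($percent), PHP_EOL;
}
```

With this sample, "cloud" lands in size1 (100%), "word" in size4 (50%) and "tag" in size6 (25%).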

Working example:
http://www.blainesch.com/cloudIt.php

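<?php
//1) Define:
$_POST += array('q' => ''); //default to an empty query so the first page load raises no undefined-index notice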
$bigwds = (strlen($_POST['q'])!=0)?$_POST['q']:"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Praesent laoreet porttitor dolor sit amet gravida. Fusce ante lorem, adipiscing non malesuada id, elementum consectetur nunc. Proin auctor cursus sem vitae auctor. Morbi vestibulum diam non diam hendrerit varius. Aliquam ac risus quis nulla lacinia tempor. Sed vestibulum blandit malesuada. Nulla purus quam, condimentum placerat euismod vel, ultrices at purus. Quisque accumsan auctor nunc at auctor. Praesent pellentesque bibendum nisi et blandit. In auctor sapien iaculis massa sagittis non feugiat mi malesuada. Donec tincidunt volutpat ullamcorper. Donec volutpat elit et quam feugiat blandit. Pellentesque cursus quam et nunc dignissim in pulvinar lorem pharetra. Suspendisse scelerisque, orci vitae scelerisque facilisis, felis purus cursus urna, sit amet tincidunt nibh quam quis orci. Morbi non sapien et est euismod luctus interdum et lacus. Praesent sed purus quis nunc cursus vehicula. Donec ipsum risus, fringilla at euismod sit amet, fringilla non lacus. Aliquam sapien libero, fringilla sed scelerisque sit amet, interdum non elit. Donec sapien diam, bibendum et dictum congue, facilisis vitae ante. Donec lobortis, augue vitae volutpat vehicula, lectus lorem ultrices tortor, et volutpat urna libero vitae enim. Quisque vitae rhoncus velit. Integer lacus lorem, fringilla quis interdum nec, ultrices id lacus. Donec ac augue mauris. Aenean a dignissim magna. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Morbi sit amet erat vitae mi bibendum vehicula. Integer porta faucibus congue. Phasellus egestas elementum ligula, nec pellentesque magna egestas nec. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Proin dictum ligula quis elit fermentum quis ornare quam tincidunt. Phasellus sit amet metus eu lacus convallis pharetra. Nullam sit amet varius lectus. 
Donec ut felis nisi, sed hendrerit urna. Fusce vitae lacus nibh. Cras porta eleifend ipsum sed tincidunt. Pellentesque arcu tortor, fringilla vitae imperdiet vel, adipiscing a nisl. Praesent vulputate lobortis dolor, in malesuada ligula pretium quis. Cras vestibulum tempus velit in malesuada. Nullam malesuada tortor id neque rhoncus et dapibus nisl imperdiet. Morbi in tellus at quam varius consectetur. Maecenas porttitor condimentum lorem, at aliquet quam molestie ut. In at odio leo, a tempor felis. Pellentesque in dolor vitae neque faucibus lobortis. Curabitur sollicitudin ullamcorper consectetur. Quisque lacinia est ac dolor porttitor dapibus. Vivamus tempus ultricies leo vel varius. Vivamus ut metus dolor. Donec eu neque et dolor blandit vulputate. Praesent diam sapien, vehicula interdum luctus ac, consequat et tellus. Maecenas ac ullamcorper justo. Nullam sit amet felis sit amet massa fringilla vestibulum ac quis risus. Curabitur lorem tortor, rutrum sit amet tempus at, euismod vitae lectus. Suspendisse nec lacus dui, a rhoncus lectus. In sit amet mattis quam. Nunc ornare, quam nec tristique malesuada, libero purus vehicula tortor, egestas luctus erat ante in lorem. Nulla laoreet ligula at nibh bibendum sed lobortis ante hendrerit. Integer ac quam turpis, elementum feugiat lorem. Integer luctus gravida nisl, in cursus massa ullamcorper in. Aliquam luctus odio sit amet arcu congue sed varius velit porta. Duis dictum leo nisi, nec luctus nunc. Integer quis ultrices dolor. Mauris viverra pellentesque varius. Integer sit amet dolor vitae justo cursus hendrerit. Vivamus ullamcorper nisl velit. Phasellus laoreet, justo at hendrerit dapibus, metus lacus egestas mauris, in consectetur turpis leo porttitor mauris. Praesent tincidunt risus egestas eros tempus nec consequat lectus adipiscing. Aenean at imperdiet arcu. Nulla facilisi. Vestibulum ut lobortis dolor. Maecenas vestibulum augue eget felis eleifend aliquam. 
Donec lacinia luctus felis, vel scelerisque nisl lacinia ac. Nunc scelerisque sapien commodo arcu dapibus vel accumsan est hendrerit. Vestibulum id mi consectetur mi euismod rhoncus. Morbi lacinia, tortor quis interdum pulvinar, purus lectus aliquet leo, nec placerat tortor urna in nisl. ";
$words = explode(' ', preg_replace('/[^a-z ]+/', '', preg_replace('/\s+/', ' ', strtolower($bigwds))));
//puts everything in lower case, turns newlines and tabs into spaces so words on adjacent lines don't merge,
//takes out things that are not words (commas, smileys, etc.), then breaks up the string into an array :]

//2) Count
$keys = array(); //word => number of occurrences
foreach($words as $id=>$word) {
    //traverse the words, skipping anything shorter than 3 letters
    if(strlen($word)>= 3) {
        if(isset($keys[$word])) {
            $keys[$word]++; //increases value
        } else {
            $keys[$word] = 1; //initialize value
        }
    }
}
arsort($keys); //sorts by word count desc

//3) Delete
$i = 0;
$highest = 0; //highest word count seen so far
$nkeys = array(); //will hold the top 50 words
foreach($keys as $id=>$value) {
    //drop everything after the top 50 words
    if($i==50) { break; }
    $highest = ($value>$highest)?$value:$highest; //highest
    $nkeys[$id] = $value;
    $i++;
}


//4) mixed it the hell up!
$keys = array_keys($nkeys);
shuffle($keys);
foreach($keys as $key) {
    $new[$key] = $nkeys[$key];
}
$nkeys = $new;
    
//5) output
$total = ""; //starts string
foreach($nkeys as $id=>$value) {
    $hv = ceil($value*(100/$highest)); //gets percent
    if($hv>90) {
        $hn = 1;
    } else if ($hv>80) {
        $hn = 2;
    } else if ($hv>60) {
        $hn = 3;
    } else if ($hv>40) {
        $hn = 4;
    } else if ($hv>30) {
        $hn = 5;
    } else if ($hv>15) {
        $hn = 6;
    } else {
        $hn = 0;
    }
    if($hn != 0) {
        $total .= "<span class='size{$hn} scloud'>{$id}</span> \n";
    }
}
?>
<html>
    <head>
        <title>Status Cloud!</title>
        <style>
            body, div.cloud {
                background: #f8f8fa;
            }
            div.cloud {
                border: 1px solid #4c6db9;
                width: 600px;
                padding:10px;
                text-align:center;
            }
            span.scloud {
                font-weight:bold;
                line-height:18pt;
            }
            span.size6 {
                font-size: 12pt;
                color: #9cb3e7;
            }
            span.size5 {
                font-size: 16pt;
                color: #7693d4;
            }
            span.size4 {
                font-size: 22pt;
                color: #5375c3;
            }
            span.size3 {
                font-size: 30pt;
                color: #355aaf;
            }
            span.size2 {
                font-size: 40pt; 
                color: #1a429c;
            }
            span.size1 {
                font-size: 52pt;
                color: #002a8b;
            }
        </style>
    </head>
    <body>
        <div class="cloud">
            <?=$total?>
        </div><br /><br />
        <form method="post">
            <textarea name="q" style="width: 400px;height:200px;"><?=htmlspecialchars($bigwds)?></textarea><br />
            <input type="submit" value="Cloud it!">
        </form>
    </body>
</html>

Edited by BlaineSch, 18 May 2011 - 08:29 PM.


#2
James.H

    Programming God

  • Members
  • 866 posts
Thanks for this code BlaineSch, always helpful +rep!

#3
John

    Writes binary right handed and hex left handed

  • Moderators
  • 6,321 posts
explode() works well on "small" data sets, but when you are trying to analyze the "keywords" in War and Peace (like I'm doing), explode() will try to allocate a *big* chunk of memory (64+ MB in my case). Below are a few things you might find interesting:

1. You can remove words less than X characters in your preg_replace function.
2. You don't need a loop to get the N most popular words (you can use array slice).
3. When you tokenize the string using the method below you avoid having to allocate space for _every_ word (as you do with explode).

It doesn't do the _same_ thing as yours, but I think you could borrow a few of the ideas to improve your algorithm (improvements that are necessary in my case).
<?php
/* get the contents of War and Peace */
$contents = file_get_contents("2600.txt");

/* remove "special" characters and words of 3 characters or fewer */
$contents = preg_replace('/[^a-z ]*|\s\w{0,3}\s/', '', strtolower($contents));

/* tokenize the string, counting each word as it is read */
$a = array();
$tok = strtok($contents, " ");
while($tok !== false) {
    empty($a[$tok]) ? $a[$tok] = 1 : $a[$tok]++;
    $tok = strtok(" ");
}

/* sort by count, highest first */
arsort($a);

/* take the 50 most popular (array_slice simply returns fewer if there are fewer) */
$a = array_slice($a, 0, 50, true);

print_r($a);
?>


Edit: Actually, the regexp isn't perfect - but you can see the idea ^^
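
One likely reason the pattern misbehaves: the first alternative, [^a-z ]*, can match an empty string at every position, so the second alternative (the short-word part) never gets a turn. Deleting newlines outright also glues the last word of one line to the first word of the next. Below is a rough sketch of one way around both problems, using separate passes. The function name normalizeText() and the $minLength parameter are invented for this example, and the default of 3 follows the original post (keep words of 3 or more letters).

```php
<?php
// Hypothetical helper, not from the thread: lower-case the text, keep only letters,
// and drop words shorter than $minLength characters.
function normalizeText($text, $minLength = 3) {
    // Turn every run of non-letters (punctuation, digits, newlines) into one space,
    // so words on neighbouring lines stay separate.
    $text = preg_replace('/[^a-z]+/', ' ', strtolower($text));

    if ($minLength > 1) {
        // Remove whole words that are too short; \b matches at word edges
        // without consuming the surrounding spaces.
        $text = preg_replace('/\b[a-z]{1,' . ($minLength - 1) . '}\b/', '', $text);
    }

    // Collapse the gaps left behind by the removed words.
    return trim(preg_replace('/ {2,}/', ' ', $text));
}

echo normalizeText("It is a truth, universally\nacknowledged!"), PHP_EOL;
```

The sample call prints "truth universally acknowledged". The result can be passed to strtok() exactly as in the listing above.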

#4
BlaineSch

    Writes binary right handed and hex left handed

  • Members
  • 2,448 posts
Wow, many improvements. I really did not expect explode to take that much space. I will definitely run some tests!

Didn't think about using regex for the less than 3 character thing, I am still no master at it! I will play around with it more :]

Lots of useful tips for me to play around with [:

#5
John

    Writes binary right handed and hex left handed

  • Moderators
  • 6,321 posts
I was surprised by how memory-hungry explode was as well (and I'm curious as to why). But when I tried to explode the contents of War and Peace (http://www.gutenberg...s/2600/2600.txt) I got:

Quote

john@earth:~$ php parse.php

Fatal error: Allowed memory size of 67108864 bytes exhausted...

And I don't know if you saw the edit, but there is something wrong with the regexp (if you fix it let me know :)).
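
A possible explanation, offered as an educated guess rather than a measurement of the 2600.txt run: explode() has to build an array with one element per word, and every PHP array element carries bookkeeping overhead on top of the characters themselves, while strtok() only hands back one token at a time. The sketch below compares the two on a made-up sample string. The variable names are invented for the example, and the actual byte counts will differ between PHP versions.

```php
<?php
// Made-up sample text; a large file such as the one in this thread would show a bigger gap.
$text = str_repeat('war and peace and war ', 20000);

// Streaming approach: strtok() returns one token per call, so only the counts are stored.
$streamCounts = array();
for ($tok = strtok($text, ' '); $tok !== false; $tok = strtok(' ')) {
    $streamCounts[$tok] = isset($streamCounts[$tok]) ? $streamCounts[$tok] + 1 : 1;
}
$peakAfterStrtok = memory_get_peak_usage();

// Materializing approach: explode() first builds an array holding every single word.
$explodeCounts = array_count_values(explode(' ', trim($text)));
$peakAfterExplode = memory_get_peak_usage();

printf("peak after strtok:  %d bytes\n", $peakAfterStrtok);
printf("peak after explode: %d bytes\n", $peakAfterExplode);
```

Both approaches produce the same counts. The difference is only in how much has to sit in memory at once.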

#6
SoN9ne

    Programmer

  • Members
  • 129 posts
This looked to be fun for me so I gave it a go.

While stepping through the process, it was obvious that certain words occurred far more often than the rest. These very frequent words pushed the percentages of the other words below 15, and anything at or below 15 is basically ignored. By adding a list of ignored words and filtering them out with regex, I was able to "fix" it (though it's not really a fix...). UPDATE: Since I got the regex working properly, this is just an added feature. Now words of 3 characters or fewer are removed, so words like "the" won't mess up the percentages.
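
To illustrate the skew, here is a small sketch with invented counts (they are not taken from 2600.txt). It filters the ignore list after counting, using array_diff_key(), which is an alternative to folding the list into the regex as the script below does. The helper name removeIgnored() is made up for the example.

```php
<?php
// Hypothetical helper: drop ignored words from a word => count map.
function removeIgnored(array $counts, array $ignoreList) {
    // array_flip() turns the list into keys so array_diff_key() can compare by word.
    $ignored = array_flip(array_map('strtolower', $ignoreList));
    return array_diff_key($counts, $ignored);
}

// Invented counts, sorted highest first.
$counts = array('that' => 120, 'with' => 95, 'prince' => 40, 'moscow' => 12);

$kept    = removeIgnored($counts, array('That', 'with'));
$highest = max($kept);

foreach ($kept as $word => $count) {
    echo $word, ': ', ceil($count * 100 / $highest), '%', PHP_EOL;
}
```

Before filtering, "moscow" sits at 10% of the top count (12 out of 120), under the 15% cutoff, so it would never be shown. With "that" and "with" removed it rises to 30% of the new top count and makes it into the cloud.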

I am getting better with regex but I still feel I could have cleaned the expressions to be more efficient...

This is what worked for me, using the code from the original post, fixes from the later posts, and my own.
<?php
# Define limit
$limit = 50;

# Define ignore words (common words that shouldn't really be counted because they can be drastically high)
$ignoreList[]='that';
$ignoreList[]='with';
$ignoreList[]='hers';

# Define root path
$rootPath = dirname(__FILE__).'/';

# Load the fullText
$fullText = (isset($_POST['content']) && $_POST['content']) ? htmlentities($_POST['content'], ENT_QUOTES) : NULL;
if (!$fullText) {
	# Load file contents
	$fullText = file_get_contents($rootPath."2600.txt");
}

# Build ignore words for regex usage
$ignoreString='';
if (isset($ignoreList) && $ignoreList) {
	foreach ($ignoreList as $ignore) {
		# Just to be safe
		$ignore=strtolower($ignore);
		$ignoreString.="$ignore|";
	}
	$ignoreString=substr($ignoreString, 0, -1);
}
# Build our regex filters
$regex[]= "/[^a-z ]/"; // Get rid of anything we don't need
$regex[]= "/\b(\w){0,3}\b/"; // Drop the words 3 chars or less

# Are we using the ignore list?
if ($ignoreString) $regex[]= "/\b($ignoreString)\b/"; // Ignored list

$regex[]= '/\s\s+/'; // Get rid of double white spaces

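# Process the regex (decode first so escaped entities such as &quot; are not counted as words)
$contents = trim(preg_replace($regex, ' ', strtolower(html_entity_decode($fullText, ENT_QUOTES))));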

# Fail safe
if (!$contents) {
	exit('No content');
}

# Tokenize the string
$tok = strtok($contents, ' ');
while($tok !== false) {
	# strtok() skips empty tokens, so no blank check is needed (a bare continue here would loop forever, since $tok is never advanced)

	empty($a[$tok]) ? $a[$tok] = 1 : $a[$tok]++;
	$tok = strtok(' ');
}

# arsort with the SORT_NUMERIC flag orders the array from highest to lowest, so the lower values can be dropped by the array_slice below
arsort($a, SORT_NUMERIC);

# Keep only the amount we want
$nkeys = array_slice($a, 0, (count($a) > $limit) ? $limit : count($a), true);

# Drop useless vars
unset($a, $tok, $limit, $regex, $ignoreString, $ignoreList, $contents, $rootPath);

# Shuffle the array
$keys = array_keys($nkeys);
shuffle($keys);
foreach($keys as $key) {
	$new[$key] = $nkeys[$key];
}
unset($keys);

# Fetch the highest value from the array
$highest = max($new);

# Output
$output = '';
foreach($new as $id=>$value) {
	$hv = ceil($value*(100/$highest)); //gets percent

	switch (true) { // run the first case whose condition is true
		case ($hv > 90):
			$hn = 1;
			break;

		case ($hv > 80):
			$hn = 2;
			break;

		case ($hv > 60):
			$hn = 3;
			break;

		case ($hv > 40):
			$hn = 4;
			break;

		case ($hv > 30):
			$hn = 5;
			break;

		case ($hv > 15):
			$hn = 6;
			break;

		default:
			$hn = 0; // at or below the 15% cutoff: leave this word out
			break;
	}

	if($hn != 0) {
		$output .= "<span class=\"size{$hn} scloud\">{$id}</span> ".PHP_EOL;
	}
}

# Clean up
unset($highest, $new, $nkeys, $hv, $hn);
//unset($fullText);
?>
<html>
    <head>
        <title>Status Cloud!</title>
        <style>
            body, div.cloud {
                background: #f8f8fa;
            }
            div.cloud {
                border: 1px solid #4c6db9;
                width: 600px;
                padding:20px;
                text-align:center;
            }
            span.scloud {
                font-weight:bold;
                line-height:18pt;
            }
            span.size6 {
                font-size: 12pt;
                color: #9cb3e7;
            }
            span.size5 {
                font-size: 16pt;
                color: #7693d4;
            }
            span.size4 {
                font-size: 22pt;
                color: #5375c3;
            }
            span.size3 {
                font-size: 30pt;
                color: #355aaf;
            }
            span.size2 {
                font-size: 40pt; 
                color: #1a429c;
            }
            span.size1 {
                font-size: 52pt;
                color: #002a8b;
            }
        </style>
    </head>
    <body>
        <div class="cloud">
            <?php echo (isset($output) ? $output : ''); ?>
        </div><br /><br />
        <form method="post">
            <textarea name="content" style="width: 400px;height:200px;"><?php echo ((isset($fullText)) ? $fullText : ''); ?></textarea><br />
            <input type="submit" value="Cloud it!" />
        </form>
    </body>
</html>

When testing, you may notice the page takes a while to load with the 2600.txt file. This is because the whole text is also written into the textarea; uncomment the unset($fullText); line and it should load much faster.

Edited by SoN9ne, 06 April 2010 - 12:03 PM.
removed my pathing from the script, optimize code a little more, and added more information to the post