Issue #1 modernize library #4

Open. Wants to merge 6 commits into base: master.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
/vendor/
51 changes: 40 additions & 11 deletions README.md
@@ -14,32 +14,54 @@ than simply calling `strpos` many times, and it's much faster than calling
I originally wrote this to use with [F5Bot](https://f5bot.com), since it's
searching for the same set of a few thousand keywords over and over again.

# Install via Composer

Add the following to your project's `composer.json`:

```json
"repositories": [
{
"type": "vcs",
"url": "https://github.com/codeplea/ahocorasickphp"
}
],
"require": {
"codeplea/ahocorasickphp": "dev-master"
}
```

Then, install the package itself:

```bash
$ composer update
```

# Usage

It's designed to be really easy to use. You create the `ahocorasick` object,
It's designed to be really easy to use. You create the `Search` object,
add your keywords, call `finalize()` to finish setup, and then search your
text. It'll return an array of the keywords found and their position in the
search text.

Create, add keywords, and `finalize()`:

```php
require('ahocorasick.php');
use codeplea\AhoCorasick\Search;

$ac = new ahocorasick();
$ac = new Search();

$ac->add_needle('art');
$ac->add_needle('cart');
$ac->add_needle('ted');
$ac->addNeedle('art');
$ac->addNeedle('cart');
$ac->addNeedle('ted');

$ac->finalize();

```

Call `search()` to preform the actual search. It'll return an array of matches.
Call `execute()` to perform the actual search. It'll return an array of matches.

```php
$found = $ac->search('a carted mart lot one blue ted');
$found = $ac->execute('a carted mart lot one blue ted');
print_r($found);
```

@@ -97,10 +119,17 @@ time: 0.054709911346436

```

Note: the regex solutions are actually slightly broken. They won't work if you
**Note:** the regex solutions are actually slightly broken. They won't work if you
have a keyword that is a prefix or suffix of another. But hey, who really uses
regex when it's not slightly broken?
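
To make the note above concrete, here is a small sketch (not part of the benchmark itself) that searches a short text with a regex alternation and with a `strpos` loop. The keywords and text are borrowed from the usage example; the variable names are illustrative only. Regex matches cannot overlap, so `art` inside `carted` and the `ted` that overlaps `cart` are skipped.

```php
<?php
// Illustrative only: 'art' sits inside 'cart', and 'ted' overlaps 'cart'.
$needles = ['art', 'cart', 'ted'];
$haystack = 'a carted mart';

// Regex alternation: once 'cart' is consumed, nothing overlapping it can match.
$quoted = array_map(function ($n) {
    return preg_quote($n, '/');
}, $needles);
$regex = '/' . implode('|', $quoted) . '/';
preg_match_all($regex, $haystack, $m, PREG_OFFSET_CAPTURE);
$regexFound = $m[0]; // list of [match, offset] pairs

// strpos loop: reports every occurrence of every keyword, overlapping or not.
$strposFound = [];
foreach ($needles as $n) {
    $k = 0;
    while (($k = strpos($haystack, $n, $k)) !== false) {
        $strposFound[] = [$n, $k];
        ++$k;
    }
}

printf("regex: %d, strpos: %d\n", count($regexFound), count($strposFound));
// prints "regex: 2, strpos: 4"
```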

Also keep in mind that building the search tree (the `add_needle()` and
Also keep in mind that building the search tree (the `addNeedle()` and
`finalize()` calls) takes time. So you'll get the best speed-up if you're
reusing the same keywords and calling `search()` many times.
reusing the same keywords and calling `execute()` many times.
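
As a rough illustration of why the build step is worth amortizing, the sketch below is a minimal Aho-Corasick automaton. It is an assumption-laden teaching example, not the code in this repository: the class name `TinyAhoCorasick` is hypothetical, it only mirrors the method names from the README, and the `[keyword, position]` result shape is assumed from the benchmark's `strpos` comparison.

```php
<?php
// Hypothetical teaching sketch; NOT the library's Search class.
final class TinyAhoCorasick
{
    private $goto = [[]]; // state => [char => next state]
    private $fail = [0];  // state => fallback state after a mismatch
    private $out = [[]];  // state => keywords that end at this state

    public function addNeedle(string $needle)
    {
        if ($needle === '') {
            return; // empty keywords are ignored in this sketch
        }
        $state = 0;
        for ($i = 0, $len = strlen($needle); $i < $len; ++$i) {
            $c = $needle[$i];
            if (!isset($this->goto[$state][$c])) {
                $this->goto[] = [];
                $this->fail[] = 0;
                $this->out[] = [];
                $this->goto[$state][$c] = count($this->goto) - 1;
            }
            $state = $this->goto[$state][$c];
        }
        $this->out[$state][] = $needle;
    }

    // Call once, after all keywords are added: computes failure links breadth-first.
    public function finalize()
    {
        $queue = array_values($this->goto[0]);
        while ($queue) {
            $state = array_shift($queue);
            foreach ($this->goto[$state] as $c => $next) {
                $queue[] = $next;
                $f = $this->fail[$state];
                while ($f !== 0 && !isset($this->goto[$f][$c])) {
                    $f = $this->fail[$f];
                }
                $this->fail[$next] = $this->goto[$f][$c] ?? 0;
                // Inherit keywords that end at the fallback state (proper suffixes).
                $this->out[$next] = array_merge($this->out[$next], $this->out[$this->fail[$next]]);
            }
        }
    }

    // Single pass over the text; returns [keyword, start position] pairs.
    public function execute(string $haystack): array
    {
        $found = [];
        $state = 0;
        for ($i = 0, $len = strlen($haystack); $i < $len; ++$i) {
            $c = $haystack[$i];
            while ($state !== 0 && !isset($this->goto[$state][$c])) {
                $state = $this->fail[$state];
            }
            $state = $this->goto[$state][$c] ?? 0;
            foreach ($this->out[$state] as $needle) {
                $found[] = [$needle, $i - strlen($needle) + 1];
            }
        }
        return $found;
    }
}

// Build once...
$ac = new TinyAhoCorasick();
foreach (['art', 'cart', 'ted'] as $n) {
    $ac->addNeedle($n);
}
$ac->finalize();

// ...then reuse the same automaton for as many texts as needed.
$found = $ac->execute('a carted mart lot one blue ted');
$other = $ac->execute('no keywords here');
```

All of the trie and failure-link work happens in `addNeedle()` and `finalize()`; each `execute()` call is then a single left-to-right pass over the text, which is why repeated searches with a fixed keyword set benefit the most.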

# Running tests

```bash
$ composer install
$ ./vendor/bin/phpunit
```
117 changes: 0 additions & 117 deletions ahocorasick.php

This file was deleted.

123 changes: 7 additions & 116 deletions benchmark.php
@@ -1,120 +1,11 @@
<?php
use codeplea\AhoCorasick\Benchmark;

/* This program will benchmark searching for 1,000 keywords in a 5,000 word text all at once. */
/* It compares our ahocorasick method with regex and strpos. */


require('ahocorasick.php');
require('benchmark_setup.php'); /* keywords and text */

$loops = 10;

print("Loaded " . count($needles) . " keywords to search on a text of " .
strlen($haystack) . " characters.\n");

print("\nSearching with strpos...\n");

$st = microtime(1);
for ($loop = 0; $loop < $loops; ++$loop) {
$found = array();
foreach($needles as $n) {
$k = 0;
while(($k = strpos($haystack, $n, $k)) !== FALSE) {
$found[] = array($n, $k);
++$k;
}
}
}
$et = microtime(1);
print("time: " . ($et - $st) . "\n");
$found_strpos = $found;






print("\nSearching with preg_match...\n");
//Note, this actually sucks and misses cases where one needle is a prefix or
//suffix of another.
$regex = '/' . implode('|', $needles) . '/';

$st = microtime(1);
for ($loop = 0; $loop < $loops; ++$loop) {
$found = array();
$k = 0;
while(preg_match($regex, $haystack, $m, PREG_OFFSET_CAPTURE, $k)) {
$found[] = $m[0];
$k = $m[0][1] + 1;
}
}
$et = microtime(1);
print("time: " . ($et - $st) . "\n");
//print_r($found);






print("\nSearching with preg_match_all...\n");
//Note, this actually sucks and misses cases where one needle is a prefix or
//suffix of another.
$regex = '/' . implode('|', $needles) . '/';

$st = microtime(1);
for ($loop = 0; $loop < $loops; ++$loop) {
$found = array();
$k = 0;
preg_match_all($regex, $haystack, $found, PREG_OFFSET_CAPTURE);
$found = $found[0];
}
$et = microtime(1);
print("time: " . ($et - $st) . "\n");





print("\nSearching with aho corasick...\n");
$ac = new ahocorasick();
foreach ($needles as $n) $ac->add_needle($n);
$ac->finalize();

$st = microtime(1);
for ($loop = 0; $loop < $loops; ++$loop) {
$found = array();
$found = $ac->search($haystack);
}
$et = microtime(1);
print("time: " . ($et - $st) . "\n");






//Check that the answers match.
//First sort the arrays.
$comp = function($a, $b) {return ($a[1] === $b[1]) ? ($a[0] > $b[0]) : ($a[1] > $b[1]);};
usort($found, $comp);
usort($found_strpos, $comp);

if ($found_strpos !== $found) {
print("ERROR - Aho Corasick got the wrong result.\n");

print("strpos size: " . count($found_strpos) . "\n");
print("aho corasick size: " . count($found) . "\n");

for ($i = 0; $i < count($found); ++$i) {
if ($found_strpos[$i] !== $found[$i]) {
print("Mismatch $i\n");
print_r($found_strpos[$i]);
print_r($found[$i]);
}
}
}


require 'vendor/autoload.php';

/* keywords and text */
require 'benchmark_setup.php';

// Benchmark searching for 1,000 keywords in a 5,000 word text all at once.
$benchmark = new Benchmark();
$benchmark->run($needles, $haystack);
4 changes: 2 additions & 2 deletions benchmark_setup.php
@@ -1,6 +1,6 @@
<?php

$needles = array('abandonment', 'abashed', 'abashments', 'abduction',
$needles = ['abandonment', 'abashed', 'abashments', 'abduction',
'aberrant', 'abiding', 'abidingly', 'abjures', 'ablution', 'abolishes',
'abominably', 'aborted', 'abrasion', 'abridgment', 'abscesses', 'absconds',
'absences', 'absinthe', 'absolves', 'absorbingly', 'abundant', 'abused',
@@ -532,7 +532,7 @@
'remonstrates', 'remorse', 'removes', 'remunerated', 'rendering',
'renditions', 'reneged', 'renominate', 'renovators', 'reorders',
'repatriates', 'repave', 'repaying', 'repeatedly', 'repertoires', 'replied',
'reprehend', 'reprieves', 'reprimanded');
'reprehend', 'reprieves', 'reprimanded'];

$haystack = 'unscathed grampus antinuclear avenged waste oversee doggies spumes
senators balk gooseberries grilles respelled ceramists outlaid maladroitly
25 changes: 25 additions & 0 deletions composer.json
@@ -0,0 +1,25 @@
{
"name": "codeplea/ahocorasickphp",
"description": "Aho-Corasick multi-keyword string searching library in PHP.",
"authors": [
{
"name": "Lewis Van Winkle"
}
],
"type": "library",
"license": "zlib",
"config": {
"sort-packages": true
},
"require": {
"php": ">=7.0"
},
"require-dev": {
"phpunit/phpunit": "^7.3"
},
"autoload": {
"psr-4": {
"codeplea\\AhoCorasick\\": "src"
}
}
}