Pycspade

What is this?

This is a Python wrapper for the C++ implementation of the C-SPADE algorithm, written by the algorithm's author, Mohammed J. Zaki. The original code was downloaded from http://www.cs.rpi.edu/~zaki/www-new/pmwiki.php/Software/Software#toc11. Because this is only a thin wrapper, the mining itself runs at the speed of the C++ code.

How to install?

Compatible with Python 2 and 3. On Windows, the Visual Studio 2015 Build Tools are also required.

pip install Cython pycspade

How to use?

Your data needs to be in a particular format similar to the following:

1 1 3 8 37 42
1 2 4 4 11 37 42
2 1 2 10 73
2 2 1 72
2 3 3 4 24 77
...

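Each line describes one event. The first number is the sequence index, the second is the event index, and the third is the number of elements in the event; the elements themselves follow, all separated by spaces. For example, the line 1 2 4 4 11 37 42 is event 2 of sequence 1 and contains the four elements 4, 11, 37 and 42.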
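As a rough illustration of this layout, the sketch below builds and reads back a single input line. The helper names format_event and parse_event are hypothetical and are not part of pycspade; they only mirror the field order described above.

```python
def format_event(sequence_id, event_id, items):
    """Build one input line: sequence index, event index, item count, items."""
    fields = [sequence_id, event_id, len(items)] + list(items)
    return " ".join(str(field) for field in fields)


def parse_event(line):
    """Split one input line back into (sequence_id, event_id, items)."""
    numbers = [int(token) for token in line.split()]
    sequence_id, event_id, count = numbers[:3]
    items = numbers[3:]
    # The third field must match the number of items that follow it.
    if len(items) != count:
        raise ValueError("expected %d items, found %d" % (count, len(items)))
    return sequence_id, event_id, items


print(format_event(1, 2, [4, 11, 37, 42]))
print(parse_event("2 3 3 4 24 77"))
```

Checking the count field while parsing is a cheap way to catch malformed lines before they reach the miner.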

Let's call this file data.txt. You can then call cspade as follows (the example below reads tests/zaki.txt; substitute the path to your own file):

from pycspade.helpers import spade, print_result

# To get raw SPADE output
result = spade(filename='tests/zaki.txt', support=0.3, parse=False)
print(result['mined'])

1 -- 4 4 
2 -- 4 4 
4 -- 2 2 
6 -- 4 4 
4 -> 6 -- 2 2 
4 -> 2 -- 2 2 
2 -> 1 -- 2 2 
4 -> 1 -- 2 2 
6 -> 1 -- 2 2 
4 -> 6 -> 1 -- 2 2 
4 -> 2 -> 1 -- 2 2

print(result['logger'])

CONF 4 9 2.7 2.5
args.MINSUPPORT 2 4
MINMAX 1 4
1 SUPP 4
2 SUPP 4
4 SUPP 2
6 SUPP 4
numfreq 4 :   SUMSUP SUMDIFF = 0 0
EXTRARYSZ 2465792
OPENED /tmp/cspade-WWv9bQWBYdDyH85T.idx
OFF 9 38
Wrote Offt 
BOUNDS 1 5
WROTE INVERT 
Cleaned up successful: /tmp/cspade-WWv9bQWBYdDyH85T.tpose
Cleaned up successful: /tmp/cspade-WWv9bQWBYdDyH85T.idx
Cleaned up successful: /tmp/cspade-WWv9bQWBYdDyH85T.data
Cleaned up successful: /tmp/cspade-WWv9bQWBYdDyH85T.conf

print(result['summary'])

CONF 4 9 2.5 2.7 10 1 4 0.781025 4
TPOSE SEQ NOF2 /tmp/cspade-WWv9bQWBYdDyH85T.data 0.3 4 2 1 
F1stats = [ 4 0 0 ]
SPADE /tmp/cspade-WWv9bQWBYdDyH85T.tpose 0.3 2 7 0 0 0 0 0 -1 1 100 100 4 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

# To also get other sequence-mining measures, including lift, support, and confidence:
result = spade(filename='tests/zaki.txt', support=0.3, parse=True)
# Pretty print result:
print_result(result)

   Occurs     Accum   Support    Confid      Lift          Sequence
        4        14 1.0000000       N/A       N/A               (1) 
        4         6 1.0000000       N/A       N/A               (2) 
        2         4 0.5000000 0.5000000 0.5000000          (2)->(1) 
        2         2 0.5000000       N/A       N/A               (4) 
        2         2 0.5000000 1.0000000 1.0000000          (4)->(1) 
        2         2 0.5000000 1.0000000 1.0000000          (4)->(2) 
        2         2 0.5000000 1.0000000 1.0000000     (4)->(2)->(1) 
        2         2 0.5000000 1.0000000 1.0000000          (4)->(6) 
        2         2 0.5000000 1.0000000 1.0000000     (4)->(6)->(1) 
        4         6 1.0000000       N/A       N/A               (6) 
        2         4 0.5000000 0.5000000 0.5000000          (6)->(1)
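
The Support, Confid and Lift columns in the table above can be reproduced by hand. The sketch below is an assumption inferred from the printed numbers, not taken from the pycspade source: support is occurrences divided by the number of sequences (4 here, since a support of 1.0 corresponds to 4 occurrences), confidence is occurrences divided by the occurrences of the sequence without its last element, and lift is confidence divided by the support of that last element. The function name sequence_measures is hypothetical.

```python
def sequence_measures(occurs, prefix_occurs, last_support, n_sequences):
    """Return (support, confidence, lift) for one mined sequence.

    Assumed definitions, inferred from the example table:
      support    = occurs / n_sequences
      confidence = occurs / occurrences of the prefix (sequence minus last element)
      lift       = confidence / support of the last element
    """
    support = occurs / float(n_sequences)
    confidence = occurs / float(prefix_occurs)
    lift = confidence / last_support
    return support, confidence, lift


# (2)->(1): occurs 2 times, prefix (2) occurs 4 times, support of (1) is 1.0
print(sequence_measures(2, 4, 1.0, 4))
# (4)->(1): occurs 2 times, prefix (4) occurs 2 times, support of (1) is 1.0
print(sequence_measures(2, 2, 1.0, 4))
```

Single-element sequences have no prefix, which matches the N/A entries for confidence and lift in those rows.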

You can also provide cspade with a list of sequences instead of a file. Each entry has the form [sequence index, event index, list of elements]:

data = [
    [1, 10, [3, 4]],
    [1, 15, [1, 2, 3]],
    [1, 20, [1, 2, 6]],
    [1, 25, [1, 3, 4, 6]],
    [2, 15, [1, 2, 6]],
    [2, 20, [5]],
    [3, 10, [1, 2, 6]],
    [4, 10, [4, 7, 8]],
    [4, 20, [2, 6]],
    [4, 25, [1, 7, 8]]
]

result = spade(data=data, support=0.01)
print_result(result)

The raw mined result is a multi-line string that looks like this:

22 80 -> 72 -> 42 -> 22 -- 2 2
22 -> 45 71 -> 42 -- 1 1
80 -> 45 71 -> 42 -- 1 1
22 80 -> 45 71 -> 42 -- 1 1

Let's decipher the first row:

22 80 -> 72 -> 42 -> 22 -- 2 2

Each row gives a frequent sequence followed by its support (the last two numbers, which are always the same in this application). This row reads: the itemset (22 80) is followed by (72), then by (42), then by (22).
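
If you want to work with raw rows programmatically, a minimal sketch like the following splits one row into its itemsets and its two trailing counts. The function parse_mined_row is hypothetical and not part of pycspade, and it assumes every item is an integer, as in the rows shown above.

```python
def parse_mined_row(row):
    """Split a row such as 'a b -> c -- n m' into (itemsets, counts)."""
    # '--' separates the sequence from the two trailing support counts.
    sequence_part, counts_part = row.split("--")
    # '->' separates consecutive itemsets; spaces separate items in an itemset.
    itemsets = [
        [int(item) for item in itemset.split()]
        for itemset in sequence_part.split("->")
    ]
    counts = [int(value) for value in counts_part.split()]
    return itemsets, counts


print(parse_mined_row("22 80 -> 72 -> 42 -> 22 -- 2 2"))
```

For the built-in parsed output, passing parse=True to spade, as shown earlier, is the simpler route.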

Many parameters can be passed to this function. The most important ones are:

support: the minimum support level; defaults to 0 (nothing is excluded)
max_gap: the maximum number of itemsets that can be skipped in a sequence
min_gap: the minimum number of itemsets that must be skipped in a sequence

Read the original paper and the C++ implementation for more details.

How to contribute?

Fork this repo
Make your changes
Open a pull request

How to recompile to use in IDE?

rm cspade.cpp; python setup.py build_ext --inplace

Licence

MIT
