Change the way features are saved #158

Open
1 task
abojchevski opened this issue Dec 7, 2015 · 4 comments

Comments

@abojchevski
Collaborator

Currently, when someone does:

    token.features['my feature'] = value

we automatically append '[0]', and the feature becomes

    'my feature[0]' : value

  • Change this to NOT append a string; instead, the key should be a tuple ('my feature', 0)

NOTE:
Huge improvement in performance, since Python is slow at building strings. For example, WindowFeatureGenerator goes from 16 seconds to less than 1 after this change.

NOTE:
Changes in nalaf and nala are trivial; not sure about other dependent tools such as relna.
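
As a rough illustration of the proposal, here is a minimal sketch of a dictionary that stores tuple keys. The class name `TupleKeyFeatures` is hypothetical and is not nalaf's actual implementation; it only assumes that a bare feature name should default to position 0.

```python
class TupleKeyFeatures(dict):
    """Hypothetical sketch: store features under (name, position) tuple keys."""

    def __setitem__(self, key, value):
        # A bare name gets the default position 0; tuples are stored as given.
        if not isinstance(key, tuple):
            key = (key, 0)
        super().__setitem__(key, value)


features = TupleKeyFeatures()
features['my feature'] = 1.0         # stored under ('my feature', 0)
features[('my feature', -1)] = 0.5   # already a tuple, stored unchanged
print(sorted(features.items()))
```

Note that in CPython, `dict.update()` and the `dict` constructor do not go through an overridden `__setitem__`, so a real implementation would need to handle those paths separately.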

@abojchevski abojchevski added this to the 5-Maybe/Someday milestone Dec 7, 2015
@juanmirocks
Collaborator

Is this for real? How is creating a tuple in Python faster than appending to a string?? 😕 ❓

I don't doubt you, I'm just really surprised...

@abojchevski
Collaborator Author

Run this small (pure python) snippet of code, and even here the difference can be seen.

    import time

    features = {'some_feature_name_{}'.format(i): 1 for i in range(100000)}
    keys = list(features.keys())

    # WITH STRINGS: build 'name[index]' keys by string formatting
    start = time.time()
    for feature_name in keys:
        for template_index in [-3, -2, -1, 0, 1, 2, 3]:
            features['{}[{}]'.format(feature_name[:-3], template_index)] = 1
    print('strings', time.time() - start)

    # WITH TUPLES: build (name, index) keys without any string manipulation
    start = time.time()
    for feature_name in keys:
        for template_index in (-3, -2, -1, 0, 1, 2, 3):
            features[(feature_name[:-3], template_index)] = 1
    print('tuples', time.time() - start)

@juanmirocks
Collaborator

With this modified script, on my poor awesome rMBP 2015, I get these numbers (only 1.5x faster for me):

strings 9.162139892578125
tuples 6.199376106262207

    import time

    features = {'some_feature_name_{}'.format(i): 1 for i in range(1000000)}
    keys = list(features.keys())

    # WITH STRINGS
    start = time.time()
    for feature_name in keys:
        for template_index in [-3, -2, -1, 0, 1, 2, 3]:
            features['{}[{}]'.format(feature_name[:-3], template_index)] = 1
            #print('strings', time.time()-start)
            # updated on every iteration; the final value is the total elapsed time
            stringsTime = time.time() - start

    # WITH TUPLES
    start = time.time()
    for feature_name in keys:
        for template_index in (-3, -2, -1, 0, 1, 2, 3):
            features[(feature_name[:-3], template_index)] = 1
            #print('tuples', time.time()-start)
            # updated on every iteration; the final value is the total elapsed time
            tuplesTime = time.time() - start

    print('strings', stringsTime)
    print('tuples', tuplesTime)
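
Both scripts above share a single `features` dict between the two loops, so the tuple loop inserts into a dict that the string loop has already grown. A variant that may give a cleaner comparison builds a fresh dict per scheme and times it with the standard `timeit` module. This is only a sketch: the sizes, repeat counts, and function names are arbitrary assumptions, and the absolute numbers will vary by machine and Python version.

```python
import timeit

names = ['some_feature_name_{}'.format(i) for i in range(1000)]
offsets = (-3, -2, -1, 0, 1, 2, 3)


def build_string_keys():
    # Each key is built by string formatting, e.g. 'some_feature_name_5[-3]'.
    return {'{}[{}]'.format(name, offset): 1 for name in names for offset in offsets}


def build_tuple_keys():
    # Each key is a (name, offset) tuple; no new string is created.
    return {(name, offset): 1 for name in names for offset in offsets}


# Take the best of 3 runs, each run building the dict 20 times.
string_seconds = min(timeit.repeat(build_string_keys, number=20, repeat=3))
tuple_seconds = min(timeit.repeat(build_tuple_keys, number=20, repeat=3))
print('strings', string_seconds)
print('tuples', tuple_seconds)
```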

@abojchevski
Collaborator Author

Well, there is also another element to it, I guess.

In nalaf, every time a new feature is added, we check whether its key ends with '[0]', and if not, we append it. With tuples, by contrast, we only check whether the key is of type tuple, with no string manipulation. So that also has an impact.

More precisely, we check whether the key ends with '[some number]' (with a regex).
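
The two checks described above could look roughly like this. The regex and both function names are assumptions made for illustration; the actual pattern and code in nalaf may differ.

```python
import re

# Hypothetical pattern for "ends with '[some number]'"; nalaf's real regex may differ.
INDEX_SUFFIX = re.compile(r'\[-?\d+\]$')


def normalize_string_key(key):
    """String scheme: append '[0]' unless the key already ends with '[n]'."""
    if INDEX_SUFFIX.search(key):
        return key
    return key + '[0]'


def normalize_tuple_key(key):
    """Tuple scheme: wrap a bare name as (name, 0); leave tuples untouched."""
    if isinstance(key, tuple):
        return key
    return (key, 0)


print(normalize_string_key('my feature'))
print(normalize_string_key('my feature[-2]'))
print(normalize_tuple_key('my feature'))
print(normalize_tuple_key(('my feature', -2)))
```

The string scheme runs a regex search and may allocate a new string on every insertion, while the tuple scheme needs only a type check and, at most, one small tuple allocation.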
