jdvala/lazytext

LazyText

LazyText is inspired by the idea of lazypredict, a library which helps build a lot of basic models without much code. LazyText is for text what lazypredict is for numeric data.

  • Free Software: MIT licence

Installation

To install LazyText:

pip install lazytext

Usage

To use LazyText, import it in your project:

from lazytext.supervised import LazyTextPredict

Text Classification

Text classification on the BBC News article dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from lazytext.supervised import LazyTextPredict
import re
import nltk

# Load the dataset
df = pd.read_csv("tests/assets/bbc-text.csv")
df.dropna(inplace=True)

# Download models required for text cleaning
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

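# Split the data into train set and test set
df_train, df_test = train_test_split(df, test_size=0.3, random_state=13)
# Work on copies so that adding columns below does not trigger pandas' SettingWithCopyWarning
df_train, df_test = df_train.copy(), df_test.copy()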

# Tokenize the words
df_train['clean_text'] = df_train['text'].apply(nltk.word_tokenize)
df_test['clean_text'] = df_test['text'].apply(nltk.word_tokenize)

# Remove stop words
stop_words=set(nltk.corpus.stopwords.words("english"))
df_train['text_clean'] = df_train['clean_text'].apply(lambda x: [item for item in x if item not in stop_words])
df_test['text_clean'] = df_test['clean_text'].apply(lambda x: [item for item in x if item not in stop_words])

# Remove numbers, punctuation and special characters (only keep words)
regex = '[a-z]+'
df_train['text_clean'] = df_train['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])
df_test['text_clean'] = df_test['text_clean'].apply(lambda x: [item for item in x if re.match(regex, item)])

# Lemmatization
lem = nltk.stem.wordnet.WordNetLemmatizer()
df_train['text_clean'] = df_train['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])
df_test['text_clean'] = df_test['text_clean'].apply(lambda x: [lem.lemmatize(item, pos='v') for item in x])

# Join the words again to form sentences
df_train["clean_text"] = df_train.text_clean.apply(lambda x: " ".join(x))
df_test["clean_text"] = df_test.text_clean.apply(lambda x: " ".join(x))

# Tfidf vectorization
vectorizer = TfidfVectorizer()

x_train = vectorizer.fit_transform(df_train.clean_text)
x_test = vectorizer.transform(df_test.clean_text)
y_train = df_train.category.tolist()
y_test = df_test.category.tolist()

lazy_text = LazyTextPredict(
    classification_type="multiclass",
    )
models = lazy_text.fit(x_train, x_test, y_train, y_test)


            Label Analysis
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Classes       ┃ Weights            ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ business      │ 0.8725490196078431 │
│ sport         │ 1.1528497409326426 │
│ politics      │ 1.0671462829736211 │
│ entertainment │ 0.8708414872798435 │
│ tech          │ 1.1097256857855362 │
└───────────────┴────────────────────┘
                                                              Result Analysis
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Model                       ┃ Accuracy           ┃ Balanced Accuracy  ┃ F1 Score            ┃ Custom Metric Score ┃ Time Taken           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│ AdaBoostClassifier          │ 0.7260479041916168 │ 0.717737172132769  │ 0.7248335989941609  │ NA                  │ 1.4244091510772705   │
│ BaggingClassifier           │ 0.8817365269461078 │ 0.8796633962363677 │ 0.8814695332332374  │ NA                  │ 2.422576904296875    │
│ BernoulliNB                 │ 0.9535928143712575 │ 0.9505929193425733 │ 0.9533647387436917  │ NA                  │ 0.015914201736450195 │
│ CalibratedClassifierCV      │ 0.9760479041916168 │ 0.9760018220340847 │ 0.9755904096436046  │ NA                  │ 0.36926722526550293  │
│ ComplementNB                │ 0.9760479041916168 │ 0.9752329192546583 │ 0.9754237510855159  │ NA                  │ 0.009947061538696289 │
│ DecisionTreeClassifier      │ 0.8532934131736527 │ 0.8473956671194278 │ 0.8496464898940103  │ NA                  │ 0.34440088272094727  │
│ DummyClassifier             │ 0.2155688622754491 │ 0.2                │ 0.07093596059113301 │ NA                  │ 0.005555868148803711 │
│ ExtraTreeClassifier         │ 0.7275449101796407 │ 0.7253518459908658 │ 0.7255575847020816  │ NA                  │ 0.018934965133666992 │
│ ExtraTreesClassifier        │ 0.9655688622754491 │ 0.9635363285903302 │ 0.9649837485086689  │ NA                  │ 1.2101161479949951   │
│ GradientBoostingClassifier  │ 0.9550898203592815 │ 0.9526333887196529 │ 0.9539060578037555  │ NA                  │ 30.256237030029297   │
│ KNeighborsClassifier        │ 0.938622754491018  │ 0.9370053693959814 │ 0.9367294513157219  │ NA                  │ 0.12071108818054199  │
│ LinearSVC                   │ 0.9745508982035929 │ 0.974262691599302  │ 0.9740343976103922  │ NA                  │ 0.11713886260986328  │
│ LogisticRegression          │ 0.968562874251497  │ 0.9668995859213251 │ 0.9678778814908909  │ NA                  │ 0.8916082382202148   │
│ LogisticRegressionCV        │ 0.9715568862275449 │ 0.9708896757262861 │ 0.971147482393915   │ NA                  │ 37.82431483268738    │
│ MLPClassifier               │ 0.9760479041916168 │ 0.9753381642512078 │ 0.9752912960666735  │ NA                  │ 30.700589656829834   │
│ MultinomialNB               │ 0.9700598802395209 │ 0.9678795721187026 │ 0.9689200656860745  │ NA                  │ 0.01410818099975586  │
│ NearestCentroid             │ 0.9520958083832335 │ 0.9499045135454718 │ 0.9515097876015481  │ NA                  │ 0.018617868423461914 │
│ NuSVC                       │ 0.9670658682634731 │ 0.9656159420289855 │ 0.9669719954040374  │ NA                  │ 6.941549062728882    │
│ PassiveAggressiveClassifier │ 0.9775449101796407 │ 0.9772388820754925 │ 0.9770812340935414  │ NA                  │ 0.05249309539794922  │
│ Perceptron                  │ 0.9775449101796407 │ 0.9769254658385094 │ 0.9768161404324825  │ NA                  │ 0.030637741088867188 │
│ RandomForestClassifier      │ 0.9625748502994012 │ 0.9605135542632081 │ 0.9624462948504477  │ NA                  │ 0.9921820163726807   │
│ RidgeClassifier             │ 0.9775449101796407 │ 0.9769254658385093 │ 0.9769176825464448  │ NA                  │ 0.09582686424255371  │
│ SGDClassifier               │ 0.9700598802395209 │ 0.9695007868373973 │ 0.969787370271274   │ NA                  │ 0.04686570167541504  │
│ SVC                         │ 0.9715568862275449 │ 0.9703778467908902 │ 0.9713021262026043  │ NA                  │ 6.64256477355957     │
└─────────────────────────────┴────────────────────┴────────────────────┴─────────────────────┴─────────────────────┴──────────────────────┘
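
The values in the Label Analysis table look consistent with the common "balanced" class-weight heuristic, n_samples / (n_classes * class_count). That LazyText computes them this way is an assumption, not something this README states. The sketch below only illustrates the formula on a toy label list; `balanced_class_weights` is a hypothetical helper and not part of LazyText.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for a list of labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (not the BBC data): three "sport" articles and one "tech" article.
toy_labels = ["sport", "sport", "sport", "tech"]
weights = balanced_class_weights(toy_labels)

# Rare classes receive a weight above 1, frequent classes a weight below 1.
for cls, weight in weights.items():
    print(f"{cls:<10}{weight:.4f}")
```

Under this heuristic a weight above 1 marks a class that is under-represented relative to a perfectly even split.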

The result of each estimator is stored in models, a list with one entry per estimator. Each entry also contains the trained estimator, which can be used for further analysis.

The confusion matrix and classification report are also part of each entry, if they are needed.

print(models[0])
{
    'name': 'AdaBoostClassifier',
    'accuracy': 0.7260479041916168,
    'balanced_accuracy': 0.717737172132769,
    'f1_score': 0.7248335989941609,
    'custom_metric_score': 'NA',
    'time': 1.829047679901123,
    'model': AdaBoostClassifier(),
    'confusion_matrix': array([
        [ 89,   5,  12,  35,   3],
        [  8,  58,   5,  44,   0],
        [  5,   2, 108,  10,   1],
        [  5,   7,   5, 138,   2],
        [ 25,   5,   1,   3,  92]]),
    'classification_report':
    """
                  precision    recall  f1-score   support
               0       0.67      0.62      0.64       144
               1       0.75      0.50      0.60       115
               2       0.82      0.86      0.84       126
               3       0.60      0.88      0.71       157
               4       0.94      0.73      0.82       126
        accuracy                           0.73       668
       macro avg       0.76      0.72      0.72       668
    weighted avg       0.75      0.73      0.72       668
    """
}
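
Because models is a plain list of dictionaries, ordinary Python is enough to rank the results. The sketch below uses three hand-written entries shaped like the output above, with only the `name`, `f1_score` and `time` keys and values copied from the result table; a real entry would also carry the `model`, `confusion_matrix` and `classification_report` keys.

```python
# Illustrative stand-in for the list returned by lazy_text.fit(...).
models = [
    {"name": "AdaBoostClassifier", "f1_score": 0.7248335989941609, "time": 1.4244091510772705},
    {"name": "LinearSVC", "f1_score": 0.9740343976103922, "time": 0.11713886260986328},
    {"name": "PassiveAggressiveClassifier", "f1_score": 0.9770812340935414, "time": 0.05249309539794922},
]

# Rank the entries by F1 score, best first.
ranked = sorted(models, key=lambda entry: entry["f1_score"], reverse=True)
best = ranked[0]
print(best["name"], best["f1_score"])

# Pick the fastest estimator among those above an F1 threshold.
good_enough = [entry for entry in models if entry["f1_score"] >= 0.95]
fastest = min(good_enough, key=lambda entry: entry["time"])
print(fastest["name"], fastest["time"])
```

With a real result list, `best["model"]` would give the trained estimator for further use.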

Custom metrics

LazyText also supports a custom metric for evaluation, which can be set up as follows:

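from lazytext.supervised import LazyTextPredict

# Custom metric: takes the true and predicted labels and returns a score
def my_custom_metric(y_true, y_pred):
    # ...do your stuff
    score = ...  # placeholder: compute the score from y_true and y_pred here
    return score


lazy_text = LazyTextPredict(custom_metric=my_custom_metric)
lazy_text.fit(x_train, x_test, y_train, y_test)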

If the signature of the custom metric function does not match the one given above, the custom metric is ignored even though it was provided.
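
As a concrete illustration, here is a minimal sketch of a metric with the `(y_true, y_pred)` signature: macro-averaged recall written in plain Python. The function name `macro_recall` and the toy labels are made up for this example, and the exact label types LazyText passes to the metric are not documented here, so the function only assumes two equal-length sequences of hashable labels.

```python
from collections import defaultdict


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    if not total:
        return 0.0  # no samples: nothing to score
    return sum(correct[cls] / total[cls] for cls in total) / len(total)


# Toy check: recall is 0.5 for "sport" and 1.0 for "tech".
toy_true = ["sport", "sport", "tech", "tech"]
toy_pred = ["sport", "tech", "tech", "tech"]
toy_score = macro_recall(toy_true, toy_pred)
print(toy_score)  # 0.75
```

It would then be passed in the same way as above, for example `LazyTextPredict(custom_metric=macro_recall)`.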

Custom model parameters

LazyText also supports passing parameters to the estimators. Provide the parameters as dictionaries, as shown below, and they will be applied to the named estimator.

The following example changes the default parameters of the SVC classifier.

LazyText will still fit all the models, but in this case only SVC is fitted with non-default parameters.

from lazytext.supervised import LazyTextPredict

custom_parameters = [
    {
        "name": "SVC",
        "parameters": {
            "C": 0.5,
            "kernel": 'poly',
            "degree": 5
        }
    }
]


l = LazyTextPredict(
    classification_type="multiclass",
    custom_parameters=custom_parameters
    )
l.fit(x_train, x_test, y_train, y_test)
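
Since `custom_parameters` is a list, it can presumably hold entries for several estimators at once; this README only shows a single entry, so treat that as an assumption. The sketch below builds such a list and looks up an entry by name. `parameters_for` is a hypothetical helper written for this example and is not part of LazyText; `C`, `kernel`, `degree` and `max_iter` are scikit-learn constructor arguments.

```python
# One entry per estimator whose defaults should be overridden.
custom_parameters = [
    {"name": "SVC", "parameters": {"C": 0.5, "kernel": "poly", "degree": 5}},
    {"name": "LogisticRegression", "parameters": {"C": 2.0, "max_iter": 500}},
]


def parameters_for(name, entries):
    """Hypothetical helper: return the overrides for `name`, or {} if none were given."""
    for entry in entries:
        if entry["name"] == name:
            return entry["parameters"]
    return {}


svc_params = parameters_for("SVC", custom_parameters)
perceptron_params = parameters_for("Perceptron", custom_parameters)
print(svc_params)
print(perceptron_params)  # {} because no override was listed for Perceptron
```

Estimators without an entry, such as Perceptron here, would keep their defaults.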
