PCA, KDiscordODetector & Telemanon don't make predictions for all datapoints #99

Jeroenvanwely · 2023-06-09T14:30:01Z

I noticed that PCA, KDiscordODetector & Telemanon don't make predictions for all data points provided. The problem shows up after training (using .fit(X)) when you want to use .predict(Y) for evaluation. Let's say we want to run the following code:

X # Training data
Y # Eval data
y_true # Eval true labels
model # Either PCA, KDiscordODetector or Telemanon

model.fit(X) # Train model on X
y_pred = model.predict(Y) # Make model prediction on Y

# Analyse evaluation results
accuracy_score(y_true, y_pred)
confusion_matrix(y_true, y_pred)
classification_report(y_true, y_pred)

The last three lines won't run because y_pred is always shorter than y_true. That is because these methods use the function get_sub_matrices(X, window_size, step, return_numpy, flatten, flatten_order) (found in utility.py), which returns a numpy array of shape (valid_len, window_size*n_sequences), where each row is a flattened sub-matrix (a copy of this function is included below). The function cuts the data into sub-matrices based on the window_size and step parameters. However, if the last points in the data are not enough to form a new sub-matrix, they are left out of the prediction. With step=1, for example, valid_len = n_samples - window_size + 1, so get_sub_matrices returns window_size - 1 fewer rows than there are input points. Therefore, when analysing the evaluation results, you have to change the example code above to:

X # Training data
Y # Eval data
y_true # Eval true labels
model # Either PCA, KDiscordODetector or Telemanon

model.fit(X) # Train model on X
y_pred = model.predict(Y) # Make model prediction on Y

# Analyse evaluation results
accuracy_score(y_true[:len(y_pred)], y_pred)
confusion_matrix(y_true[:len(y_pred)], y_pred)
classification_report(y_true[:len(y_pred)], y_pred)
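
The truncation above seems to line up with the windows only when step is 1. As a rough sketch of a more general alignment, assuming that the i-th prediction belongs to the window starting at index i*step (which is what X_left_inds in get_sub_matrices records), one could pick the label at each window start. The helper name window_start_labels is hypothetical and not part of the library, and using the start index as the representative label is an assumption; another convention (for example the window end) may suit your data better.

```python
import numpy as np


def window_start_labels(y_true, n_pred, step=1):
    """Hypothetical helper: return the label at the start index of each window.

    Assumes prediction i corresponds to the window that starts at i * step.
    """
    y_true = np.asarray(y_true)
    starts = np.arange(n_pred) * step  # left index of every window
    return y_true[starts]


# Toy labels, for illustration only.
y_true = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

# step=1 reproduces the plain truncation y_true[:n_pred].
print(window_start_labels(y_true, n_pred=7, step=1))

# step=2 picks every second label instead.
print(window_start_labels(y_true, n_pred=4, step=2))
```

With step=1 this is equivalent to y_true[:len(y_pred)], so the workaround above is the special case of this sketch.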

Here is the code where the sub_matrices are produced:

def get_sub_matrices(X, window_size, step=1, return_numpy=True, flatten=True,
                     flatten_order='F'):
    """Chop a multivariate time series into sub sequences (matrices).

    Parameters
    ----------
    X : numpy array of shape (n_samples,)
        The input samples.

    window_size : int
        The moving window size.

    step_size : int, optional (default=1)
        The displacement for moving window.
    
    return_numpy : bool, optional (default=True)
        If True, return the data format in 3d numpy array.

    flatten : bool, optional (default=True)
        If True, flatten the returned array in 2d.
        
    flatten_order : str, optional (default='F')
        Decide the order of the flatten for multivarite sequences.
        ‘C’ means to flatten in row-major (C-style) order. 
        ‘F’ means to flatten in column-major (Fortran- style) order. 
        ‘A’ means to flatten in column-major order if a is Fortran contiguous in memory, 
        row-major order otherwise. ‘K’ means to flatten a in the order the elements occur in memory. 
        The default is ‘F’.

    Returns
    -------
    X_sub : numpy array of shape (valid_len, window_size*n_sequences)
        The numpy matrix with each row stands for a flattend submatrix.
    """
    X = check_array(X).astype(np.float)
    n_samples, n_sequences = X.shape[0], X.shape[1]

    # get the valid length
    valid_len = get_sub_sequences_length(n_samples, window_size, step)

    X_sub = []
    X_left_inds = []
    X_right_inds = []

    # exclude the edge
    steps = list(range(0, n_samples, step))
    steps = steps[:valid_len]

    # print(n_samples, n_sequences)
    for idx, i in enumerate(steps):
        X_sub.append(X[i: i + window_size, :])
        X_left_inds.append(i)
        X_right_inds.append(i + window_size)

    X_sub = np.asarray(X_sub)

    if return_numpy:
        if flatten:
            temp_array = np.zeros([valid_len, window_size * n_sequences])
            if flatten_order == 'C':
                for i in range(valid_len):
                    temp_array[i, :] = X_sub[i, :, :].flatten(order='C')

            else:
                for i in range(valid_len):
                    temp_array[i, :] = X_sub[i, :, :].flatten(order='F')
            return temp_array, np.asarray(X_left_inds), np.asarray(
                X_right_inds)

        else:
            return np.asarray(X_sub), np.asarray(X_left_inds), np.asarray(
                X_right_inds)
    else:
        return X_sub, np.asarray(X_left_inds), np.asarray(X_right_inds)


def get_sub_sequences_length(n_samples, window_size, step):
    """Pseudo chop a univariate time series into sub sequences. Return valid
    length only.

    Parameters
    ----------
    X : numpy array of shape (n_samples,)
        The input samples.

    window_size : int
        The moving window size.

    step_size : int, optional (default=1)
        The displacement for moving window.

    Returns
    -------
    valid_len : int
        The number of subsequences.
        
    """
    # if X.shape[0] == 1:
    #     n_samples = X.shape[1]
    # elif X.shape[1] == 1:
    #     n_samples = X.shape[0]
    # else:
    #     raise ValueError("X is not a univarite series. The shape is {shape}.".format(shape=X.shape))

    # valid_len = n_samples - window_size + 1
    # valida_len = int_down(n_samples-window_size)/step + 1 
    valid_len = int(np.floor((n_samples - window_size) / step)) + 1
    return valid_len
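
To make the size of the mismatch concrete, here is a small standalone sketch of the arithmetic in get_sub_sequences_length. It does not import the library; n_windows and n_uncovered_tail are hypothetical names used only for this illustration, and the sample sizes are made up.

```python
def n_windows(n_samples, window_size, step=1):
    """Same arithmetic as valid_len above; // is floor division for these non-negative ints."""
    return (n_samples - window_size) // step + 1


def n_uncovered_tail(n_samples, window_size, step=1):
    """Number of trailing points that fall in no window at all."""
    last_right = (n_windows(n_samples, window_size, step) - 1) * step + window_size
    return n_samples - last_right


# step=1: every point is inside some window, but there are
# window_size - 1 fewer windows than points.
print(n_windows(100, 10, step=1))         # 91
print(100 - n_windows(100, 10, step=1))   # 9

# step=3 with 101 points: the last point is not part of any window.
print(n_windows(101, 10, step=3))         # 31
print(n_uncovered_tail(101, 10, step=3))  # 1
```

So even with step=1, where no point is actually dropped from the windows, the number of rows returned is still smaller than the number of input points.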
