Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculation of CLR transformation #2

Open
wants to merge 32 commits into
base: master
Choose a base branch
from

Conversation

Vlasovets
Copy link

@Vlasovets Vlasovets commented Feb 2, 2022

the currently used function looks as follows:

def transform_features(
    features: pd.DataFrame, transformation: Str = "clr", coef: float = 0.5
) -> pd.DataFrame:
    if transformation == "clr":
        X = features.values
        #null_set = X <= 0.0 # just ignore zero replacement for the sake of experiment
        #X[null_set] = coef
        X = np.log(X)
		____________
		#bug is here
        X = (X.T - np.mean(X, axis=1)).T
		# one could change to 
		X = (X - np.mean(X, axis=0))
		____________

        return pd.DataFrame(
            data=X, index=list(features.index), columns=list(features.columns)
        )

    else:
        raise ValueError(
            "Unknown transformation name, use clr and not %r" % transformation
        )

a small test example, if check the first two values by hand, you will see the mistake

df = pd.DataFrame(np.random.randint(0,10,size=(3, 3)), columns=list('ABC'))
transform_features(df)

Cheers,
Oleg

@Vlasovets Vlasovets changed the title Mistake in calculation CLR transform Mistake in calculation of CLR transformation Feb 2, 2022
@Vlasovets Vlasovets changed the title Mistake in calculation of CLR transformation Calculation of CLR transformation Feb 2, 2022
@Leo-Simpson
Copy link
Owner

Hello Oleg,
Thank you for the pull request, and for "taking the lead" on the project somehow.

So it might be yes, then there might be something I do not understand. Shouldn't the mean be performed on each row of matrix X (if row i contains the features of sample i ) ?

@Vlasovets
Copy link
Author

oh, if row i contains the features of sample i", so your X is an (nxp) matrix where n - samples, p - features then axis=1 is correct since we need a geometric mean per sample
thanks for the clarification and sorry if I confused you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants