Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trying to implement CORR() using custom function logic, but it is not working #1966

Open
daviddabingu opened this issue Aug 30, 2024 · 4 comments

Comments

@daviddabingu
Copy link

daviddabingu commented Aug 30, 2024

Hello. I am trying to implement CORR() using custom function logic provided here:
https://github.com/alasql/alasql/wiki/User-Defined-Functions

This is my code.

alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize the accumulator object
        accumulator = {
            sumX: 0,
            sumY: 0,
            sumXY: 0,
            sumX2: 0,
            sumY2: 0,
            count: 0
        };
        console.log('Stage 1: Initialized accumulator');
    }

    if (stage === 1 || stage === 2) {
        // Check if valueX and valueY are valid numbers
        if (typeof valueX === 'number' && typeof valueY === 'number') {
            accumulator.sumX += valueX;
            accumulator.sumY += valueY;
            accumulator.sumXY += valueX * valueY;
            accumulator.sumX2 += valueX * valueX;
            accumulator.sumY2 += valueY * valueY;
            accumulator.count++;
            console.log('Stage 1/2: Updated accumulator with values:', valueX, valueY);
        } else {
            console.log('Stage 1/2: Skipped non-numeric values:', valueX, valueY);
        }
    }

    if (stage === 3) {
        console.log('Stage 3: Final accumulator state:', accumulator);

        // Calculate the Pearson correlation coefficient
        const count = accumulator.count;
        const sumX = accumulator.sumX;
        const sumY = accumulator.sumY;
        const sumXY = accumulator.sumXY;
        const sumX2 = accumulator.sumX2;
        const sumY2 = accumulator.sumY2;

        const numerator = (count * sumXY) - (sumX * sumY);
        const denominatorX = Math.sqrt((count * sumX2) - (sumX * sumX));
        const denominatorY = Math.sqrt((count * sumY2) - (sumY * sumY));
        const denominator = denominatorX * denominatorY;

        console.log('Stage 3: Calculated values:');

        // Check if the denominator is zero or if there are no valid data points
        if (denominator === 0 || count === 0) {
            console.log('Stage 3: Undefined correlation');
            return null; // Return null to indicate undefined correlation
        }

        const result = numerator / denominator;
        console.log('Stage 3: Correlation coefficient:', result);
        return result;
    }

    return accumulator;
};

But, somehow this returned {0 : {corr: 3}} , which signifies the calculation is not being done properly, considering pearson correlation always lies between -1 and 1...
Any guess or help to figure out what to do?
Thank you in advance.

@mathiasrw
Copy link
Member

If you provide some example data and expected output it makes it easier to help out.


From the AlaSQL Bot

Your implementation of the CORR() function logic looks well-structured, but there seems to be an issue with the Pearson correlation coefficient calculation, possibly due to either the data or the implementation of the accumulation logic.

Here are a few potential issues to check:

  1. Data Points: Make sure that the data passed to CORR() contains valid numeric values and there are no undefined or non-numeric values. If valueX or valueY is non-numeric or undefined, it should not contribute to the calculation.

  2. Numerator and Denominator Calculation:

    • Numerator: Ensure that the numerator calculation (count * sumXY) - (sumX * sumY) is correct and makes sense given the data.
    • Denominator: Ensure that the denominator (denominatorX * denominatorY) does not lead to a division by zero. If denominatorX or denominatorY is zero, the correlation should be undefined, hence returning null (which you already handle).
  3. Edge Cases: Ensure your function handles edge cases, such as when there is only one data point or when all valueX or valueY are the same, which would result in a zero denominator.

Here's an improved implementation suggestion:

  • Verify the calculation at each step by logging intermediate results.
  • Ensure stage 1/2 accumulates data correctly.
  • Return null when invalid data (e.g., single data point or zero variance) is encountered.
alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize the accumulator object
        accumulator = {
            sumX: 0,
            sumY: 0,
            sumXY: 0,
            sumX2: 0,
            sumY2: 0,
            count: 0
        };
    }

    if (stage === 1 || stage === 2) {
        if (typeof valueX === 'number' && typeof valueY === 'number') {
            accumulator.sumX += valueX;
            accumulator.sumY += valueY;
            accumulator.sumXY += valueX * valueY;
            accumulator.sumX2 += valueX * valueX;
            accumulator.sumY2 += valueY * valueY;
            accumulator.count++;
        }
    }

    if (stage === 3) {
        const { count, sumX, sumY, sumXY, sumX2, sumY2 } = accumulator;

        const numerator = (count * sumXY) - (sumX * sumY);
        const denominatorX = Math.sqrt((count * sumX2) - (sumX * sumX));
        const denominatorY = Math.sqrt((count * sumY2) - (sumY * sumY));
        const denominator = denominatorX * denominatorY;

        if (denominator === 0 || count === 0) {
            return null;
        }

        const result = numerator / denominator;
        return result;
    }

    return accumulator;
};

Debugging Steps:

  • Add logs for numerator, denominatorX, denominatorY, and denominator to check values during the computation.
  • Ensure that valueX and valueY are not mistakenly stringified or null during the process.

If the issue persists, you might want to inspect the input data to confirm it's as expected.

@daviddabingu
Copy link
Author

I confirmed that there is no null or undefined in the dataset.
When I try the same calculation in R, it gives me corr: 0.48.

This is the test dataset that we are using.
Global Superstore_latest.csv

Thank you.

@daviddabingu
Copy link
Author

I've checked the source file src/55functions.js where I was able to have a look at how aggr functions are defined.
However, I figured all the aggr functions only take a single variable as an input, such as below.

alasql.aggr.median = alasql.aggr.MEDIAN = function (v, s, stage) {
	if (stage === 2) {
		if (v !== null) {
			s.push(v);
		}
		return s;
	}

	if (stage === 1) {
		if (v === null) {
			return [];
		}
		return [v];
	}

	if (!s.length) {
		return null;
	}

	let r = s.sort((a, b) => {
		if (a > b) return 1;
		if (a < b) return -1;
		return 0;
	});

	let middle = (r.length + 1) / 2;
	let middleFloor = middle | 0;
	let el = r[middleFloor - 1];

	if (middle === middleFloor || (typeof el !== 'number' && !(el instanceof Number))) {
		return el;
	} else {
		return (el + r[middleFloor]) / 2;
	}
};

So, it seems it only takes a single column(v), whereas calculating correlation is always a matter of two columns... just like this mathematical equation.
image

Any clever way to take a detour on this?

@mathiasrw
Copy link
Member

Yep. A .aggr function only takes one col as input. I would try to do the calculation outside of alasql if you need two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants