
rri-204-2 [project transparency] #37

Open
4 of 6 tasks
chrisdburr opened this issue Feb 13, 2023 · 6 comments

chrisdburr commented Feb 13, 2023

The following root file has been created:

  • Title: Project Transparency
  • Module: Explainability (SAFE-D Module)
  • Skills Track: RRI
  • Section: 2

Tasks

  • Review root file (Clau)
  • Answer comments (Chris)
  • Draft web version (Clau)
  • Review web version (Chris)
  • Create slides (Chris)
  • Review slides and draft script (Clau)

Link to file: https://github.com/alan-turing-institute/turing-commons/blob/drafts/drafts/rri-skillstrack/rri-modules/root-files/rri-204-2.md

@chrisdburr added the `new root file` label Feb 13, 2023

ClauFischer commented Feb 15, 2023

  • @chrisdburr I think this sentence may be a bit confusing. It suggests that what has changed is people's purchasing behaviour (which may have happened), but what we have been exploring is the change in recommendations by the site's recommendation algorithm. We haven't yet established that these recommendations have then changed people's actual purchasing patterns. Unless I am missing something, I would suggest changing to something like "... that explains the change in the distribution of the system's recommendations".

Here, it would be easy enough to identify that the variable `season` is an important feature used by the model that explains the change in people's purchasing behaviour.

  • @chrisdburr Again this seems to be another case where we are not keeping a clear enough distinction between the model's behaviour and the customers' final purchasing behaviours (which are of course correlated). I would suggest changing the sentence below to something like "Here, it is the customers' actual behaviour which has changed drastically (they are now purchasing fewer holiday packs). However, it is sensible for the data analysts to assume that there has been another underlying shift in the data distribution (similar to the seasonal shift above), which has changed the model's behaviour in a way that affects the customers' final behaviour."

Again, this may seem like another case where there is a need to explain the model's behaviour in terms of an underlying shift in the data distribution—as with the seasonal shift above.

@chrisdburr (Collaborator, Author)

@ClauFischer, please take a look at this revised section when you have a chance:

Consider the following scenario.[^ambiata]
A team of data analysts who work for a travel booking website are asked to explain why a model has drastically changed its predictions about customer purchasing behaviour.
Perhaps the model is recommending significantly more trips to beach resorts now instead of ski trips.
Here, if the features used by the model were investigated, it would be easy enough to identify that `season` is a feature with high importance.
It is well known that customers alter their purchasing behaviour between seasons (e.g. Winter, Summer).[^example]
From this we could explain the change in the model's predictions as a result of a significant change in the data distribution, which itself is a representation of a change in the underlying phenomena (i.e. changing seasons).
Simple enough.
But now let's assume that there is another change in customer behaviour: this time, the conversion rate (i.e. the ratio of the number of people who view, say, a holiday deal, to the number who actually purchase it) suddenly drops.
That is, customers are not just booking different holidays; they are booking fewer holidays overall.
Again, this may seem like another case where there is a need to explain the model's behaviour in terms of an underlying shift in the data distribution, which is in turn representative of some change in the underlying phenomena.
However, this time, let's pretend that the problem turns out to be a fault with a third-party piece of software, used as a dependency in the team's data pipeline, which is now causing the data about a user's `location` to be incorrectly recorded.
As it turns out, the company's model has learned that those who live in affluent neighbourhoods are more likely to purchase more expensive packages, and the company's recommendation system uses this to show customers holidays that are in their predicted price range, or dynamically alter the price of holiday packages based on their estimated "willingness-to-pay"—two ethically dubious practices known as personalised and dynamic pricing[^pricing].
However, due to the aforementioned fault in the data pipeline, all customers are now being shown the same, more expensive, holiday deals because their `postcodes` are all being recorded as located in affluent neighbourhoods.
As such, fewer customers are purchasing their packages, because they cannot afford them, and the conversion rate has dropped.
Again, there is no fault with the model (or its parameters).
Rather, the target of any explanation lies in the data and the generative mechanisms responsible for producing the data.
The model is still making the same predictions, but the predictions are now incorrect.
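
As an illustrative aside (not part of the root file), the `season` explanation above could be sanity-checked with a crude permutation-style importance test: perturb one feature and count how many predictions change. The toy model, data, and `importance` helper below are all invented for this sketch:

```python
# Illustrative only: a toy, permutation-style check that `season` (and not,
# say, `age`) drives the model's predictions. Model and data are invented.

def model(features):
    # Toy "trained" model: recommends beach trips in summer, ski trips otherwise.
    return "beach" if features["season"] == "summer" else "ski"

data = [{"season": s, "age": a} for s in ("summer", "winter") for a in (25, 40, 60)]
baseline = [model(x) for x in data]

def importance(feature):
    # Permute one feature's values across the dataset (here: a fixed reversal,
    # to keep the sketch deterministic) and count how many predictions change
    # relative to the baseline.
    permuted_values = [x[feature] for x in data][::-1]
    permuted = [dict(x, **{feature: v}) for x, v in zip(data, permuted_values)]
    return sum(model(p) != b for p, b in zip(permuted, baseline)) / len(data)

print(importance("season"))  # 1.0: every prediction flips with the season
print(importance("age"))     # 0.0: the model ignores age entirely
```

A high score for `season` and a zero score for `age` would support the analysts' explanation that the seasonal shift in the data distribution, not the model itself, accounts for the changed predictions.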

ClauFischer commented Feb 16, 2023

@chrisdburr:

  • Here it says recommending when it should only be predicting at this stage.

    Perhaps the model is recommending significantly more trips to beach resorts now instead of ski trips.

  • I think that if we are changing to a recommender system now we should say something about it being a different model (that it's now a recommender system and not only a predictive system).

    Again, this may seem like another case where there is a need to explain the model's behaviour in terms of an underlying shift in the data distribution, which is in turn representative of some change in the underlying phenomena.

  • Perhaps a footnote should be added to say that there may be a fault in the fact that the model is discriminating based on location, even though there is no error in the way the model is operating (the fault is in the data pipeline). Although, on the other hand, you have already mentioned that the practice is ethically dubious.

    Again, there is no fault with the model (or its parameters).

Let me know what you think about these suggestions and I can draft some amendments and send them for your revision 😃

@chrisdburr (Collaborator, Author)

I think I've answered all your comments. Main changes are as follows:

- Determining the Problem the System is Designed to Address: this task includes information about why the problem is important and why the technical description (e.g. the translation of the set of input variables into target variables) is adequate for the problem at hand. For instance, why a set of features about a candidate are adequate measures for assessing their `suitability for a job role`. Aside from the technical "solution" to the problem, there is also a social dimension that needs to be justified, such as why an automated system is appropriate for use in hiring decisions (e.g. the system is not biased against protected groups).

Consider the following scenario.[^ambiata]
A team of data analysts who work for a travel booking website are asked to explain why a model has altered its predictions about customer purchasing behaviour.
This time, the model is used to drive a recommender system, which shows holiday packages to customers based on its predictions about which are most likely to be purchased.
Perhaps the system is recommending significantly more trips to beach resorts, whereas previously it was recommending ski trips.
Here, if the features used by the model were investigated, it would be easy enough to identify that `season` is a feature with high importance for the model.
It is well known that customers alter their purchasing behaviour between seasons (e.g. Winter, Summer).[^example]
From this we could explain the change in the model's predictions as a result of a significant change in the data distribution, which itself is a representation of a change in the underlying phenomena (i.e. changing seasons).
Simple enough.
But now let's assume that there is another change: the conversion rate (i.e. the ratio of the number of people who view, say, a holiday deal, to the number who actually purchase it) suddenly drops.
That is, customers are not just booking different holidays; they are booking fewer holidays overall.
Again, this may seem like another case where there is a need to explain the model's behaviour in terms of an underlying shift in the data distribution, which is in turn representative of some change in the underlying phenomena.
However, this time, let's pretend that the problem turns out to be a fault with a third-party piece of software, used as a dependency in the team's data pipeline, which is now causing the data about a user's `location` to be incorrectly recorded.
As it turns out, the company's model has learned that those who live in affluent neighbourhoods are more likely to purchase more expensive packages, and the company's recommendation system uses this to show customers holidays that are in their predicted price range, or dynamically alter the price of holiday packages based on their estimated "willingness-to-pay"—two ethically dubious practices known as personalised and dynamic pricing[^pricing].
However, due to the aforementioned fault in the data pipeline, all customers are now being shown the same, more expensive, holiday deals because their `postcodes` are all being recorded as located in affluent neighbourhoods.
As such, fewer customers are purchasing their packages, because they cannot afford them, and the conversion rate has dropped.
Again, there is no fault with the model (or its parameters).
Rather, the target of any explanation lies in the data and the generative mechanisms responsible for producing the data.
The model is still making the same predictions, but the predictions are now incorrect and the recommender system is now unable to recommend the correct holiday packages to customers.
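
The pipeline-fault scenario can be sketched in a few lines. Everything here is invented for illustration (the postcodes, prices, and the `record_postcode`, `recommend_package`, and `conversion_rate` helpers); it only shows how a corrupted `location` field can depress the conversion rate while the model itself is unchanged:

```python
# Hypothetical sketch of the faulty-pipeline scenario above.
# All names, postcodes, and prices are invented for illustration.

AFFLUENT = {"AF1 1AA"}  # postcodes the model associates with affluence

def record_postcode(true_postcode, pipeline_faulty):
    # The faulty third-party dependency overwrites every customer's
    # location with the same (affluent) postcode.
    return "AF1 1AA" if pipeline_faulty else true_postcode

def recommend_package(postcode):
    # The model has learned to show pricier deals to affluent postcodes.
    return 3000 if postcode in AFFLUENT else 800

def conversion_rate(customers, pipeline_faulty):
    # customers: list of (budget, postcode) pairs; a customer only
    # purchases a recommended package they can afford.
    purchases = sum(
        1
        for budget, postcode in customers
        if recommend_package(record_postcode(postcode, pipeline_faulty)) <= budget
    )
    return purchases / len(customers)

customers = [(1000, "XY2 3BB"), (900, "XY2 3BB"), (4000, "AF1 1AA"), (850, "XY9 9ZZ")]

print(conversion_rate(customers, pipeline_faulty=False))  # 1.0: everyone sees an affordable deal
print(conversion_rate(customers, pipeline_faulty=True))   # 0.25: only the wealthiest customer buys
```

Note that `recommend_package` behaves identically in both runs; only its input data has been corrupted, which is exactly why the target of the explanation is the data pipeline rather than the model or its parameters.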

@ClauFischer (Contributor)

@chrisdburr Is this paper (https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=36708ab24406aa3ec931fb9ba3f4a6cd7c3bd4b6) the one you want to refer to in this line

[^pricing]: This example refers to a practice known as 'personalised pricing', or sometimes 'price discrimination'. Neither are new practices (see [here](https://www.washingtonpost.com/archive/politics/2000/09/27/on-the-web-price-tags-blur/14daea51-3a64-488f-8e6b-c1a3654773da/)), but the widespread use of algorithmic techniques is enabling more dynamic and hyper-personalised forms of both personalised pricing and price discrimination (see [this article](https://www.washingtonpost.com/archive/politics/2000/09/27/on-the-web-price-tags-blur/14daea51-3a64-488f-8e6b-c1a3654773da/)).

?

Right now, both links take you to the Washington Post article. I searched through our conversations and the paper I just posted was the one I sent as more academic. If it is not, let me know please so I can link to the correct paper.

@chrisdburr (Collaborator, Author)

I think I was intending for the second link to be the Guardian article you shared: https://www.theguardian.com/global/2017/nov/20/dynamic-personalised-pricing

I guess I didn't copy/paste the link properly. Sorry.
