Data and code for a story we did on whether World Cups impact tourism

HindustanTimesLabs/world_cup_tourism

 
 

This repo contains the data and code for a story I did on football World Cups and their impact on tourism. You can read an expanded version of the story here.

Didn't have space to put this graphic in the print edition. If you read the text, it refers to a table of parameters on which Brazil and South Africa do worse than France, Japan and Germany. Here's the table:

[Image: table of parameters on which Brazil and South Africa do worse than France, Japan and Germany]

All the text below also appears in the expanded version; I'm putting it here too for the sake of posterity.

METHODOLOGY / HOW WE DID THE MATH

So how were the hypothetical countries in the story created? They were constructed using a method called ‘synthetic control’. This method is typically used by academics and analysts for ‘impact evaluation’, i.e. assessing whether a government policy or program has had any effect or not.

Now to figure out if hosting the World Cup leads to an increase or decrease in tourist spending, we need some kind of counterfactual: a way of knowing what would have happened to a country’s tourism figures if it hadn’t hosted the World Cup.

Synthetic control helps us in constructing this counterfactual case, something that we can compare the actual figures to and make an assessment of how well a country has done.

We do this by constructing a synthetic country whose tourist spending figures are similar to those of the World Cup host country. This synthetic country is a weighted combination of countries similar to the host country.

To construct this synthetic case, we’ve chosen neighbouring countries whose citizens are, on average, about as wealthy as the average person in the host country, i.e. their per capita income levels aren’t that far apart.

(Typically, they’re in the same income group of countries as classified by the World Bank. Because South Africa just had two neighbours in the same income group, we relaxed the rule in this case to give us more countries to use. The reasoning was that the other countries would be similar enough to South Africa in several other respects to make up for not being in the same income group.)

The table below shows the countries used to form the synthetic case for each of the past five World Cups in this analysis:

[Image: table of the countries used to form the synthetic case for each of the past five World Cups]

For example, for Brazil, we constructed a ‘synthetic Brazil’ whose tourist spending figures are similar to those of the actual Brazil for the three years prior to the World Cup in 2014, i.e. 2011, 2012 and 2013. These three years represent a period of normalcy for the host country, a period before the World Cup has had an effect on tourist spending figures.

In this analysis, Synthetic Brazil is a combination of five countries: Ecuador, Colombia, Paraguay, Venezuela and Peru. We would have chosen Argentina too if it hadn’t hosted the South American international football tournament, the Copa America, in 2011. Because it hosted that tournament, its tourist spending figures for that year would have been higher than usual and skewed the figures for Synthetic Brazil.

(In fact, while selecting the countries used for constructing the synthetic case, it’s important to select countries that haven’t been through a ‘shock’, such as hosting a major tournament like the Olympics or World Cup.)
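The selection rules above can be sketched as a simple filter. This is only an illustration: the `Candidate` fields, the `select_donors` helper and the neighbour names are all hypothetical, not taken from the repo's code or data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    same_income_group: bool        # same World Bank income group as the host?
    had_shock_in_fit_window: bool  # e.g. hosted a major tournament in those years


def select_donors(candidates, relax_income_rule=False):
    """Keep neighbours that had no shock and (unless relaxed) match the income group."""
    return [
        c.name
        for c in candidates
        if not c.had_shock_in_fit_window
        and (relax_income_rule or c.same_income_group)
    ]


# Made-up neighbours, purely for illustration.
neighbours = [
    Candidate("Neighbour A", same_income_group=True, had_shock_in_fit_window=False),
    Candidate("Neighbour B", same_income_group=True, had_shock_in_fit_window=True),
    Candidate("Neighbour C", same_income_group=False, had_shock_in_fit_window=False),
]

strict_pool = select_donors(neighbours)
relaxed_pool = select_donors(neighbours, relax_income_rule=True)
print(strict_pool)
print(relaxed_pool)
```

The `relax_income_rule` flag mirrors the exception described for South Africa, where the income-group rule was loosened to get more countries to work with.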

So we take the tourist spending figures for the five South American countries for 2011, 2012 and 2013, find weightages/multipliers that we can apply to each of the countries, so that when the figures are added up, we get something close to the figures of actual Brazil.

alt text

We arrive at the weightages/multipliers to be used for each country by doing something called ‘constrained optimisation’.
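To give a feel for what that optimisation does, here is a minimal sketch in plain Python. It assumes the constraint is simply that every weight must be zero or positive, and it uses coordinate descent as a stand-in solver; it is not the repo's R code, and all the spending figures and the three unnamed donors are made up.

```python
def nnls_coordinate_descent(X, y, sweeps=5000):
    """Minimise sum_t (y[t] - sum_j w[j] * X[t][j])**2 subject to w[j] >= 0.

    X has one row per year and one column per donor country.
    """
    n_rows, n_cols = len(X), len(X[0])
    w = [0.0] * n_cols
    residual = list(y)  # equals y - X @ w, and w starts at zero
    col_sq = [sum(X[t][j] ** 2 for t in range(n_rows)) for j in range(n_cols)]
    for _ in range(sweeps):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue
            # Best value for w[j] with the other weights held fixed, clipped at 0.
            step = sum(X[t][j] * residual[t] for t in range(n_rows)) / col_sq[j]
            new_wj = max(0.0, w[j] + step)
            delta = new_wj - w[j]
            if delta != 0.0:
                for t in range(n_rows):
                    residual[t] -= delta * X[t][j]
                w[j] = new_wj
    return w


# Made-up tourist spending figures; rows are 2011, 2012, 2013,
# columns are three hypothetical donor countries.
donors_pre = [
    [1.0, 3.0, 2.0],
    [1.2, 3.1, 2.5],
    [1.1, 3.6, 2.4],
]
host_pre = [5.1, 5.7, 6.07]  # made-up figures for the host country

weights = nnls_coordinate_descent(donors_pre, host_pre)
fitted = [sum(w * x for w, x in zip(weights, row)) for row in donors_pre]
sse = sum((a - f) ** 2 for a, f in zip(host_pre, fitted))
print("weights:", weights)
print("sum of squared errors:", sse)
```

With only three years of data and several donors, many different sets of weights can fit about equally well, so the individual weights are less meaningful than the fitted total.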

We then apply these weightages/multipliers to the tourist spending figures for these five countries for the years 2014 to 2016 and so arrive at the figures of Synthetic Brazil. By doing this, we get an idea of what the counterfactual would be, what the figures for Brazil would have looked like for all those years if it hadn’t hosted the World Cup.
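As a small sketch of this last step, assuming the weights have already been fitted: the weights, donor figures and host figures below are all made up, and `synthetic_series` is a hypothetical helper name.

```python
def synthetic_series(weights, donor_rows):
    """Weighted sum of donor figures, one value per year."""
    return [sum(w * x for w, x in zip(weights, row)) for row in donor_rows]


weights = [0.5, 1.0, 0.8]  # hypothetical weights from the pre-World Cup fit

post_years = [2014, 2015, 2016]
donors_post = [  # made-up donor figures, one row per year
    [1.2, 3.5, 2.6],
    [1.3, 3.7, 2.5],
    [1.3, 3.8, 2.7],
]
actual_post = [6.9, 6.3, 6.5]  # made-up figures for the actual host

synthetic_post = synthetic_series(weights, donors_post)
# Positive gap: the host did better than its synthetic counterpart that year.
gaps = [actual - synth for actual, synth in zip(actual_post, synthetic_post)]

for year, synth, gap in zip(post_years, synthetic_post, gaps):
    print(year, round(synth, 2), round(gap, 2))
```

The gap between the actual and synthetic figures in each year is the estimate of what hosting the World Cup added to, or took away from, tourist spending.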

CREDITS & FURTHER READING

The data for the story was taken from figures collated by the World Bank.

The data on tourism competitiveness of countries was taken from the World Travel & Tourism Competitiveness Report 2017 published by the World Economic Forum.

Got the idea for using the Synthetic Control method from this paper by Jorge Viana.

For an overview of synthetic control without much mathematical notation, this paper from the British Medical Journal does a good job.

If you’re comfortable with mathematical notation and have some understanding of statistics or econometrics, I guess you should go for this overview of the method in the American Journal of Political Science.

For a summary of impact evaluation methods in general, this Harvard Business School (HBS) working paper is a good one.

The particular method I use here is a different version of synthetic control than is normally used.

The method I used was inspired by this paper by Guido Imbens (who seems to be an expert in the broader field of causal inference; he’s even written textbooks on it).

Now this following sentence will only make sense to you if you’re familiar with the synthetic control method, but Imbens in his paper only uses outcome data and doesn’t use any covariates or predictor variables.

I don’t exactly use the method proposed by Imbens. The method I use is a simpler variant with (bear with me as I get technical for a moment) a non-negativity constraint on the coefficients for the outcome variables, while the intercept is allowed to be either positive or negative.

By allowing only for positive coefficients, it's closer to the 'traditional' synthetic control method. In a way what I’m doing kind of lies between the Difference-in-Differences method (see the HBS paper for what that is) and the traditional synthetic control method.
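In notation, the variant described above can be written roughly as follows. The symbols are my own shorthand, not taken from the repo: $y_t$ is the host's tourist spending in year $t$, $x_{jt}$ is the spending of donor country $j$, $T_0$ is the set of pre-World Cup years, $\alpha$ is the intercept and $w_j$ are the weights.

$$
\min_{\alpha,\; w_1,\dots,w_J} \; \sum_{t \in T_0} \Big( y_t - \alpha - \sum_{j=1}^{J} w_j \, x_{jt} \Big)^2
\quad \text{subject to} \quad w_j \ge 0 \ \text{for all } j
$$

As I understand the framework in the Imbens paper, traditional synthetic control would drop $\alpha$ and additionally require the weights to sum to one, while Difference-in-Differences would keep $\alpha$ and fix every $w_j$ at $1/J$. That is the sense in which this variant sits between the two.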

To arrive at the weightages I used the nnls package in R. The code used is available here. nnls stands for non-negative least squares. The package is usually used to do something called ‘constrained regression’ but I’ve used it here to do ‘constrained optimisation’.
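For readers who don't use R, here is a rough plain-Python sketch of a non-negative least squares fit with a free intercept. It is not a port of the repo's R code, and I'm not claiming it handles the intercept the same way. It assumes one common trick: subtract each series' mean first, fit non-negative weights on the demeaned data, then recover the intercept. The solver is simple coordinate descent rather than the Lawson-Hanson algorithm behind the nnls package, and the data is made up.

```python
def fit_nonneg_with_intercept(X, y, sweeps=5000):
    """Least squares with weights >= 0 and an unconstrained intercept.

    X has one row per year and one column per donor country.
    Returns (intercept, weights).
    """
    n, k = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    # For any fixed weights the best intercept is y_mean - sum(w_j * col_mean_j),
    # so the weights can be fitted on demeaned data.
    Xc = [[row[j] - col_means[j] for j in range(k)] for row in X]
    residual = [v - y_mean for v in y]
    w = [0.0] * k
    col_sq = [sum(Xc[t][j] ** 2 for t in range(n)) for j in range(k)]
    for _ in range(sweeps):
        for j in range(k):
            if col_sq[j] == 0.0:
                continue
            step = sum(Xc[t][j] * residual[t] for t in range(n)) / col_sq[j]
            new_wj = max(0.0, w[j] + step)  # clip at zero: the non-negativity constraint
            delta = new_wj - w[j]
            if delta != 0.0:
                for t in range(n):
                    residual[t] -= delta * Xc[t][j]
                w[j] = new_wj
    intercept = y_mean - sum(w[j] * col_means[j] for j in range(k))
    return intercept, w


# Made-up figures: rows are three pre-World Cup years, columns are hypothetical donors.
donors_pre = [
    [1.0, 3.0, 2.0],
    [1.2, 3.1, 2.5],
    [1.1, 3.6, 2.4],
]
host_pre = [5.1, 5.7, 6.07]

intercept, weights = fit_nonneg_with_intercept(donors_pre, host_pre)
fit_errors = [
    actual - (intercept + sum(w * x for w, x in zip(weights, row)))
    for actual, row in zip(host_pre, donors_pre)
]
print("intercept:", intercept)
print("weights:", weights)
```

Because the intercept is free, the fit errors always sum to zero; only the weights are restricted in sign.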

There’s a bit of Python code I used to create CSVs for the graphics; it's available in a Jupyter notebook here. Could have done it in R, but it’s just easier for me to code in Python.

If you want to do synthetic control the traditional way, academics usually use the ‘Synth’ package in R. If you want some sample R code to start you off, this blog post is useful.

There’s already been a lot, and I mean a lot, of work done on the impact of mega sporting events such as the World Cup and Olympics. I made a good-faith effort to see if someone had done anything similar to this article, i.e. something that is a) based on synthetic control methods and b) looks at tourism figures. I searched all the major journal databases and didn’t find any, so hopefully this article represents original work in some sense.

Now statistics can get very advanced and is very easy to get wrong. If you’re an econometrician, statistician or data scientist and think I should have done something differently in my analysis, do let me know! You can contact me on Twitter at @shijith and through email at [email protected].
