Commit
1 parent 9d09dda, commit 9dfd7b6
Showing 8 changed files with 241 additions and 7 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
""" | ||
To illustrate the limitations of presidential preference polling it is possible to simulate accuracy of polling conducted for the Georgia election in 2020. It is assumed that one or more pollsters independently sample after polls close election night but before official results are released. The pollsters each ask three questions: | ||
1. Did you vote? | ||
2. Did you vote for one of Donald Trump or Joe Biden? | ||
3. Which one? | ||
until some preset number of replies to the third question are obtained. | ||
The objective is to determine | ||
1. The minimum sample size needed to have a 97.5% probability of identifying a vote spread less than or equal to 24,677, or 0.5% of the total cast for Trump and Biden, which is the recount threshold. | ||
2. The probability that the mean spread is positive (a Biden win) at the 97.5% confidence level, which is 39 chances in 40 of being correct. | ||
3. The probability that all polls will show a positive spread at the 97.5% confidence level | ||
4. The probability that at least one poll will show a negative spread (Trump win) at the 97.5% confidence level | ||
""" | ||
|
||
using Distributions | ||
using Plots | ||
using Statistics | ||
|
||
z = 1.96 # 97.5th percentile of the standard normal (one-sided 97.5%, two-sided 95%) | ||
p = 0.5 # assumed population proportion | ||
margin_of_error = 0.005 # 0.5% margin of error | ||
|
||
# Solve for the sample size | ||
n = ceil(Int, (z^2 * p * (1 - p)) / margin_of_error^2) | ||
|
||
println("Minimum sample size needed: $n") | ||
|
||
biden = 2473633 | ||
trump = 2461854 | ||
total_votes = biden + trump | ||
spread = biden - trump | ||
recount = round(Int, 0.005 * total_votes) # 0.5% of the two-candidate total, 24,677 votes | ||
actual_spread = (biden - trump) / total_votes | ||
|
||
sample_size = 1000 | ||
repetitions = 10 | ||
|
||
predicted_spreads = zeros(repetitions) | ||
|
||
for i in 1:repetitions | ||
sample = rand(1:total_votes, sample_size) | ||
sample_biden = count(sample .<= biden) | ||
sample_trump = sample_size - sample_biden | ||
|
||
predicted_spread = (sample_biden - sample_trump) / sample_size | ||
predicted_spreads[i] = predicted_spread | ||
end | ||
|
||
mean_predicted_spread = mean(predicted_spreads) | ||
|
||
result = Int64.(floor.(total_votes .* predicted_spreads)) | ||
mean(result) | ||
|
||
mean_spread = mean(predicted_spreads) | ||
std_spread = std(predicted_spreads) | ||
|
||
# Calculate the z-score for the mean spread | ||
z_score = mean_spread / (std_spread / sqrt(repetitions)) | ||
|
||
# Probability that the true mean spread is positive: the standard normal cdf evaluated at the z-score | ||
prob_positive_mean = cdf(Normal(), z_score) | ||
|
||
println("Probability that the mean spread is positive: $(round(prob_positive_mean, digits=4))") | ||
|
||
# Calculate the z-score of a zero spread relative to the distribution of individual polls | ||
z_score = (0 - mean_spread) / std_spread | ||
|
||
# Calculate the probability of a single poll showing a positive spread | ||
prob_positive_poll = 1 - cdf(Normal(), z_score) | ||
|
||
# Calculate the probability of all polls showing a positive spread | ||
prob_all_positive = prob_positive_poll^repetitions | ||
|
||
println("Probability that all polls will show a positive spread: $(round(prob_all_positive, digits=4))") | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
using DataFrames | ||
using Distributions | ||
using Formatting | ||
using PrettyTables | ||
using Printf | ||
using Statistics | ||
|
||
const recount = 0.025 # margin of error of 2.5% used for the table; the Georgia recount rule itself is one half-percent (0.005) | ||
|
||
# significance levels (alpha), confidence levels, proportion and z scores | ||
cis = [0.01, 0.025, 0.05, 0.10, 0.20] # alpha for 99% 97.5% 95% 90% 80% confidence | ||
alf = 1 .- cis # confidence levels, despite the name | ||
""" | ||
The assumption of p = 0.5 for the population proportion is based on the concept of maximum variance in a binomial distribution. In a binomial distribution, the variance is calculated as p * (1 - p) * n, where p is the probability of success, and n is the number of trials. The maximum variance occurs when p = 0.5, which means that the probability of success is equal to the probability of failure. When calculating the sample size for a proportion, if there is no prior knowledge about the population proportion, it is common practice to assume p = 0.5. This assumption provides the most conservative estimate of the sample size needed to achieve a desired level of precision. By setting p = 0.5, we are essentially assuming the worst-case scenario in terms of variability, which leads to the largest possible sample size. This ensures that the sample size is sufficient to achieve the desired level of precision, regardless of the actual population proportion. | ||
""" | ||
p = 0.5 # assumed proportion | ||
|
||
# one tailed test | ||
|
||
zzs = quantile.(Normal(), alf) | ||
|
||
labs = map(f -> @sprintf("%.2f%%", f * 100), alf) | ||
|
||
|
||
# Solve for the sample size | ||
|
||
n = ceil.(Int, (zzs.^2 * p * (1 - p)) / recount^2) | ||
|
||
rads = Formatting.format.(n, commas=true) | ||
|
||
tab = DataFrame( | ||
ci = labs, | ||
n = rads | ||
) | ||
|
||
header = ["Confidence level", "Required sample"] | ||
|
||
# assumes the PrettyTables v2 keyword names | ||
pretty_table(tab, backend = Val(:html), header = header) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
+++ | ||
title = "Margin of error" | ||
+++ | ||
|
||
> The latest poll shows President Biden leading former President Trump in Arizona in the two-candidate race by 50.1% to 49.9% with a margin of error of 2.5% | ||
is a form of news lede often seen. It means that, if the election were held at the time of the poll, those are the proportions of respondents who would prefer each candidate. The margin of error adds that the respondents were selected at random, and it indicates how far the sample result may differ from what would be found if everyone, and not just the sample, were asked. So, while the proportion of voters *in the sample* favoring Biden is *exactly* 50.1%, the proportion of *all voters* could be larger or smaller: anywhere from 47.6% to 52.6%. The range for Trump, 47.4% to 52.4%, overlaps it, which is why such a result is often referred to as a "statistical dead heat." | ||
|
||
That is an over-simplification because there are two principal sources of error that can be estimated at the time of the survey. | ||
|
||
One is called "design error," which arises from using information gathered from the voters being surveyed in addition to their voting preference, such as age, gender and ethnicity, to weight responses. If only 42.12% of the respondents identified as female, for example, while 50.01% of the population does according to the latest demographic information, the female responses would be scaled up to match the population share. This adjustment is sometimes reported explicitly, but often it is not, or it is rolled into the other type of margin of error. | ||
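As a rough illustration of the weighting arithmetic, assume (purely for the example) that respondents are split into just two groups, female and everyone else. Each group's weight is its population share divided by its sample share:

$w_{\text{female}} = \frac{0.5001}{0.4212} \approx 1.187, \qquad w_{\text{other}} = \frac{1 - 0.5001}{1 - 0.4212} = \frac{0.4999}{0.5788} \approx 0.864$

Under that assumption each female response counts for about 1.19 responses and every other response for about 0.86. Real surveys typically weight on several characteristics at once.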
|
||
The second is simple random sampling error. This arises simply from the math. It depends on the number of respondents drawn at random, often referred to as $N$. | ||
|
||
~~~ | ||
<table> | ||
<thead> | ||
<tr class = "header headerLastRow"> | ||
<th style = "text-align: right;">Confidence level</th> | ||
<th style = "text-align: right;">Required sample</th> | ||
<th style = "text-align: right;">Chance of error</th> | ||
</tr> | ||
</thead> | ||
<tbody> | ||
<tr> | ||
<td style = "text-align: right;">99.00%</td> | ||
<td style = "text-align: right;">2,165</td> | ||
<td style = "text-align: right;">1%</td> | ||
</tr> | ||
<tr> | ||
<td style = "text-align: right;">97.50%</td> | ||
<td style = "text-align: right;">1,537</td> | ||
<td style = "text-align: right;">2.5%</td> | ||
</tr> | ||
<tr> | ||
<td style = "text-align: right;">95.00%</td> | ||
<td style = "text-align: right;">1,083</td> | ||
<td style = "text-align: right;">5%</td> | ||
</tr> | ||
<tr> | ||
<td style = "text-align: right;">90.00%</td> | ||
<td style = "text-align: right;">657</td> | ||
<td style = "text-align: right;">10%</td> | ||
</tr> | ||
<tr> | ||
<td style = "text-align: right;">80.00%</td> | ||
<td style = "text-align: right;">284</td> | ||
<td style = "text-align: right;">20%</td> | ||
</tr> | ||
</tbody> | ||
</table> | ||
~~~ | ||
|
||
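The margin of error always comes paired with a confidence level, and the complement of the confidence level is the probability of being wrong about the population. The sample sizes in the table are those required for a margin of error of 2.5% at each confidence level. A dramatic way to think of the chance of being wrong is in terms of Russian roulette. Imagine a 5-shot revolver on the table that has two cartridges. The drunken Russian holding it to his head has a 2 in 5 chance of blowing out his brains. If he is given a choice of two revolvers with only a single cartridge between them, that drops to 1 in 10. Four revolvers similarly is 1 in 20. Twenty revolvers is 1 in 100. | ||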
|
||
Random sampling derives its power from mathematical properties, not from the real world. There is little in the real world directly experienced in daily life that is truly random. Life is more organized than that. What's more, randomness itself is chock full of patterns. However, a random sample has useful properties. The means of repeated random samples tend toward a normal distribution, even with relatively few draws, *even if the **population** they are drawn from does not.* | ||
|
||
So, the well known regularities of the *normal distribution*, also called the *Gaussian distribution* or the bell curve, are properties of the abstraction, not of the underlying reality. | ||
|
||
~~~ | ||
<img src="/assets/img/normal.png" style="width: 100%; display: block;"> | ||
~~~ | ||
|
||
A margin of error of a sample is constructed by considering a measure of central tendency, the *mean*, and a measure of variability, the *standard deviation*. Those interact to produce the table above. If you see poll results reported with a margin of error but not an $N$, the approximate $N$ can be worked out from the margin of error formula that follows. | ||
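A minimal sketch in Julia of recovering $N$ from a reported margin of error, assuming simple random sampling, the worst-case proportion $p = 0.5$ and $z = 1.96$ (the conventional two-sided 95% value). The function name `implied_n` is made up for this illustration; it simply solves the margin of error formula for the sample size.

```julia
# Sample size implied by a reported margin of error under simple random sampling.
# Assumptions: z = 1.96 and p = 0.5 unless the caller supplies other values.
implied_n(moe; z = 1.96, p = 0.5) = ceil(Int, z^2 * p * (1 - p) / moe^2)

println(implied_n(0.03))   # a poll reporting a 3% margin of error
println(implied_n(0.025))  # a poll reporting a 2.5% margin of error
```

The second call returns 1537, matching the 97.50% row of the table. A pollster who applied weighting will usually report a larger margin than this simple formula implies for the same $N$.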
|
||
~~~ | ||
<img src="/assets/img/sampling_error.png" style="width: 100%; display: block;"> | ||
~~~ | ||
|
||
Unless there is a specific reason, most readers have no need to do the calculations themselves. For the curious, however, formulas follow. | ||
|
||
$\text{Margin of Error} = z \times \sqrt{\frac{p(1-p)}{n}}$ | ||
|
||
where | ||
|
||
* `z` is the z-score corresponding to the desired confidence level | ||
* `p` is the sample proportion (or assumed population proportion) | ||
* `n` is the sample size | ||
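As a worked example, plugging in $z = 1.96$, $p = 0.5$ and $n = 1537$ (values chosen to match the 97.50% row of the table) gives

$1.96 \times \sqrt{\frac{0.5 \times 0.5}{1537}} \approx 1.96 \times 0.01275 \approx 0.025$

or about 2.5%.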
|
||
The `z-score`, in turn, is $z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$ | ||
|
||
where | ||
|
||
* $\bar{x}$ is the sample mean | ||
* $\mu$ is the population mean | ||
* $\sigma$ is the population standard deviation | ||
* $n$ is the sample size | ||
|
||
The term $\frac{\sigma}{\sqrt{n}}$ represents the standard error of the mean, which is the standard deviation of the sampling distribution of the mean. | ||
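To connect this with the margin of error formula: for a yes/no question coded as 1 or 0, the population standard deviation is $\sigma = \sqrt{p(1-p)}$, so

$\frac{\sigma}{\sqrt{n}} = \sqrt{\frac{p(1-p)}{n}}$

and the margin of error is simply $z$ times the standard error. With $p = 0.5$ and $n = 1000$, for instance, the standard error is $0.5 / \sqrt{1000} \approx 0.0158$, which at $z = 1.96$ gives a margin of error of about 3.1%.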
|
||
The z-score measures how many standard deviations an individual value lies from the population mean or, for a sample mean, how many standard errors. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
+++ | ||
title = "Recounts" | ||
+++ | ||
|
||
## Pennsylvania | ||
|
||
Automatic when margin between first and second place candidates is less than or equal to 0.5% of the total vote for all candidates for the office. **34,685** votes, based on 2020 election. [Citation](https://govt.westlaw.com/pac/Document/NE079CBD017FB11EA9B799CBCA5DC090C?viewType=FullText&originationContext=documenttoc&transitionType=CategoryPageItem&contextData=(sc.Default)) | ||
|
||
## Georgia | ||
|
||
On request when margin between first and second place candidates is less than or equal to 0.5% of the total vote for all candidates for the office. **25,000** votes, based on 2020 election. [Citation](https://law.justia.com/codes/georgia/2022/title-21/chapter-2/article-12/section-21-2-495/) | ||
|
||
## North Carolina | ||
|
||
On request when margin between first and second place candidates is less than or equal to 0.5% of the total vote for all candidates for the office or 10,000 votes, whichever is less. **27,625** votes is 0.5% based on 2020 election, so the 10,000-vote cap would apply. [Citation](https://ncleg.gov/EnactedLegislation/Statutes/PDF/ByArticle/Chapter_163/Article_15A.pdf) | ||
|
||
## Michigan | ||
|
||
Automatic when margin between first and second place candidates is less than or equal to 0.5% of their combined vote. On request on allegation of fraud or mistake without limit. **27,270** votes, based on 2020 election. [Citation](https://www.legislature.mi.gov/Laws/MCL?objectName=mcl-168-881) | ||
|
||
## Arizona | ||
|
||
Automatic when margin between first and second place candidates is less than or equal to 0.5% of their combined vote. **16,937** votes, based on 2020 election. [Citation](https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/16/00661.htm) | ||
|
||
## Wisconsin | ||
|
||
On request when margin between first and second place candidates is less than or equal to 1% of the votes cast for the office. **32,981** votes, based on 2020 election. [Citation](https://docs.legis.wisconsin.gov/statutes/statutes/9/01) | ||
|
||
## Nevada | ||
|
||
On request. **No minimum number of votes** [Citation](https://www.leg.state.nv.us/Division/Legal/LawLibrary/NRS/NRS-293.html#NRS293Sec403) |