Added page on recounts
technocrat committed May 12, 2024
1 parent 9d09dda commit 9dfd7b6
Showing 8 changed files with 241 additions and 7 deletions.
Binary file added _assets/img/normal.png
Binary file added _assets/img/sampling_error.png
78 changes: 78 additions & 0 deletions _assets/scripts/hypo.jl
@@ -0,0 +1,78 @@
"""
To illustrate the limitations of presidential preference polling, this script simulates the accuracy of polling for the 2020 Georgia election. It assumes that one or more pollsters independently sample voters after the polls close on election night but before official results are released. Each pollster asks three questions:
1. Did you vote?
2. Did you vote for either Donald Trump or Joe Biden?
3. Which one?
until a preset number of replies to the third question is obtained.
The objectives are to determine
1. The minimum sample size needed to have a 97.5% probability of identifying a vote spread less than or equal to 24,677 votes, or 0.5% of the total cast for Trump and Biden, which is the recount threshold.
2. The probability that the mean spread is positive (a Biden win) at the 97.5% confidence level, which is 39 chances in 40 of being correct.
3. The probability that all polls will show a positive spread at the 97.5% confidence level.
4. The probability that at least one poll will show a negative spread (a Trump win) at the 97.5% confidence level.
"""

using Distributions
using Plots
using Statistics

z = 1.96 # 97.5th percentile of the standard normal (97.5% one-sided, 95% two-sided)
p = 0.5 # assumed population proportion
margin_of_error = 0.005 # 0.5% margin of error

# Solve for the sample size
n = ceil(Int, (z^2 * p * (1 - p)) / margin_of_error^2)

println("Minimum sample size needed: $n")

biden = 2473633
trump = 2461854
total_votes = biden + trump
spread = biden - trump
recount = round(Int, 0.005 * total_votes) # 0.5% recount threshold, about 24,677 votes
actual_spread = (biden - trump) / total_votes

sample_size = 1000
repetitions = 10

predicted_spreads = zeros(repetitions)

for i in 1:repetitions
    # Draw voter indices with replacement; indices up to `biden` count as Biden voters
    sample = rand(1:total_votes, sample_size)
    sample_biden = count(sample .<= biden)
    sample_trump = sample_size - sample_biden

    predicted_spread = (sample_biden - sample_trump) / sample_size
    predicted_spreads[i] = predicted_spread
end

mean_spread = mean(predicted_spreads)
std_spread = std(predicted_spreads)

# Predicted spreads converted to whole votes
result = floor.(Int, total_votes .* predicted_spreads)
println("Mean predicted spread in votes: $(round(Int, mean(result)))")

# Calculate the z-score for the mean spread relative to zero,
# using the standard error of the mean
z_score = mean_spread / (std_spread / sqrt(repetitions))

# P(mean spread > 0) under a normal approximation is the cdf at that z-score
prob_positive_mean = cdf(Normal(), z_score)

println("Probability that the mean spread is positive: $(round(prob_positive_mean, digits=4))")

# Calculate the z-score for a spread of 0 within the distribution of individual polls
z_score = (0 - mean_spread) / std_spread

# Calculate the probability of a single poll showing a positive spread
prob_positive_poll = 1 - cdf(Normal(), z_score)

# Calculate the probability of all polls showing a positive spread
prob_all_positive = prob_positive_poll^repetitions

println("Probability that all polls will show a positive spread: $(round(prob_all_positive, digits=4))")

# Calculate the probability of at least one poll showing a negative spread (objective 4)
prob_any_negative = 1 - prob_all_positive

println("Probability that at least one poll will show a negative spread: $(round(prob_any_negative, digits=4))")

38 changes: 38 additions & 0 deletions _assets/scripts/moe.jl
@@ -0,0 +1,38 @@
using DataFrames
using Distributions
using Formatting
using PrettyTables # provides pretty_table
using Printf       # provides @sprintf
using Statistics

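const recount = 0.025 # target margin of error (2.5%); the Georgia half-percent recount rule would be 0.005

# tail probabilities, confidence levels, proportion and z scores
cis = [0.01, 0.025, 0.05, 0.10, 0.20] # tail probabilities for 99% 97.5% 95% 90% 80% confidence
alf = 1 .- cis # confidence levels, despite the name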
"""
The assumption of p = 0.5 for the population proportion is based on the concept of maximum variance in a binomial distribution. In a binomial distribution, the variance is calculated as p * (1 - p) * n, where p is the probability of success, and n is the number of trials. The maximum variance occurs when p = 0.5, which means that the probability of success is equal to the probability of failure. When calculating the sample size for a proportion, if there is no prior knowledge about the population proportion, it is common practice to assume p = 0.5. This assumption provides the most conservative estimate of the sample size needed to achieve a desired level of precision. By setting p = 0.5, we are essentially assuming the worst-case scenario in terms of variability, which leads to the largest possible sample size. This ensures that the sample size is sufficient to achieve the desired level of precision, regardless of the actual population proportion.
"""
p = 0.5 # assumed proportion

# one-tailed test: z-score at each confidence level

zzs = quantile.(Normal(), alf)

labs = map(f -> @sprintf("%.2f%%", f * 100), alf)

# Solve for the sample size

n = ceil.(Int, (zzs.^2 * p * (1 - p)) / recount^2)

rads = Formatting.format.(n, commas=true)

tab = DataFrame(
ci = labs,
n = rads
)

header = ["Confidence level", "Required sample"]

# Assumes the PrettyTables 2.x API; a single header vector yields no subheader row
pretty_table(tab; backend = Val(:html), header = header)

7 changes: 1 addition & 6 deletions _assets/scripts/vote2022.jl
@@ -16,7 +16,7 @@ end
# Example usage
# df = DataFrame(name=["John", "Jane"], age=[28, 34])
# meta_info = Dict(:source => "Survey Data", :year => 2021)
-# df_meta = DataFrame_withMeta(meta_info, df)
+# df_meta = MetaFrames(meta_info, df)

turnout = CSV.read("../data/vote2022.csv",DataFrame)
meta_info = Dict(
@@ -39,11 +39,6 @@ state_turnout = turnout[turnout.cohort .== "Total", :]
turnout = turnout[turnout.cohort .!= "Total", :]
vote2022 = MetaFrames(meta_info,turnout)






state_turnout.st = convert.(String, state_turnout.st)
state_turnout.totpop = convert.(Int64, state_turnout.totpop)
state_turnout.cpop = convert.(Int64, state_turnout.cpop)
91 changes: 91 additions & 0 deletions moe.md
@@ -0,0 +1,91 @@
+++
title = "Margin of error"
+++

> The latest poll shows President Biden leading former President Trump in Arizona in the two-candidate race by 50.1% to 49.9% with a margin of error of 2.5%

This is a form of news lede often seen. It means that if the election had been held at the time of the poll, those are the proportions of respondents who indicated a preference for each candidate. The margin of error adds two pieces of information: that the respondents were selected at random, and how far the sample result may differ from what would be found if everyone, and not just the sample, were asked. So, while the proportion of voters *in the sample* favoring Biden is *exactly* 50.1%, the proportion of *all voters* could be larger or smaller, anywhere from 47.6% to 52.6%. Because the range for Trump, 47.4% to 52.4%, overlaps it, this is often referred to as a "statistical dead heat."
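
The interval arithmetic can be checked with a short sketch. It uses only the figures from the lede; the helper names `interval` and `overlaps` are illustrative and not part of the site's scripts.

```julia
# Poll figures from the lede above, in percent.
biden_share = 50.1
trump_share = 49.9
moe = 2.5   # margin of error, in percentage points

# Range implied by a share and a margin of error.
interval(share, moe) = (share - moe, share + moe)

# Two ranges overlap when each one starts before the other ends.
overlaps(a, b) = a[1] <= b[2] && b[1] <= a[2]

biden_ci = interval(biden_share, moe)
trump_ci = interval(trump_share, moe)

println("Biden: ", biden_ci)
println("Trump: ", trump_ci)
println("Overlap (statistical dead heat): ", overlaps(biden_ci, trump_ci))
```

With a lead of 0.2 points and a margin of 2.5 points, the two ranges share almost all of their width, which is why the lead by itself says little.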

That is an oversimplification, because a poll has two principal sources of error that can be estimated at the time of the survey.

One is called "design error," which arises from using information gathered from the surveyed voters in addition to voting preference, such as age, gender and ethnicity, to weight responses. If only 42.12% of the respondents identified as female, for example, while the latest demographic information puts the proportion in the population at 50.01%, the responses from women would be scaled up. The resulting error is sometimes reported explicitly, but often it is not, or it is rolled into the other type of margin of error.
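
As a rough sketch of the scaling described above, the weight for a group can be taken as its population share divided by its sample share. The two-group split and the names below are assumptions made for illustration; real pollsters weight on several characteristics at once and may use other methods.

```julia
# Shares from the paragraph above; everyone else is lumped into "other" (an assumption).
sample_share     = Dict("female" => 0.4212, "other" => 1 - 0.4212)
population_share = Dict("female" => 0.5001, "other" => 1 - 0.5001)

# Weight for each group: population share divided by sample share.
weights = Dict(g => population_share[g] / sample_share[g] for g in keys(sample_share))

for g in sort(collect(keys(weights)))
    println(g, ": weight = ", round(weights[g], digits = 3))
end

# After weighting, the sample's female share matches the population's.
weighted_female = sample_share["female"] * weights["female"]
println("Weighted female share: ", round(weighted_female, digits = 4))
```

Weights above 1 scale an under-represented group up and weights below 1 scale an over-represented group down.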

The second is simple random sampling error. This arises simply from the math. It depends strictly on the size of the random sample, often referred to as $N$ (written $n$ in the formulas below).

~~~
<table>
<thead>
<tr class = "header headerLastRow">
<th style = "text-align: right;">Confidence level</th>
<th style = "text-align: right;">Required sample (2.5% margin of error)</th>
<th style = "text-align: right;">Chance of error (1 − confidence level)</th>
</tr>
</thead>
<tbody>
<tr>
<td style = "text-align: right;">99.00%</td>
<td style = "text-align: right;">2,165</td>
<td style = "text-align: right;">1%</td>
</tr>
<tr>
<td style = "text-align: right;">97.50%</td>
<td style = "text-align: right;">1,537</td>
<td style = "text-align: right;">2.5%</td>
</tr>
<tr>
<td style = "text-align: right;">95.00%</td>
<td style = "text-align: right;">1,083</td>
<td style = "text-align: right;">5%</td>
</tr>
<tr>
<td style = "text-align: right;">90.00%</td>
<td style = "text-align: right;">657</td>
<td style = "text-align: right;">10%</td>
</tr>
<tr>
<td style = "text-align: right;">80.00%</td>
<td style = "text-align: right;">284</td>
<td style = "text-align: right;">20%</td>
</tr>
</tbody>
</table>
~~~

The complement of the confidence level is the probability of being wrong about the population; the margin of error is the range around the poll result to which that confidence level applies. A dramatic way to think of the chance of being wrong is in terms of Russian roulette. Imagine a 5-shot revolver on the table that has two cartridges. The drunken Russian holding it to his head has a 2 in 5 chance of blowing out his brains. If he is given a choice of two revolvers with only a single cartridge between them, that drops to 1 in 10. Four revolvers similarly is 1 in 20. Twenty revolvers is 1 in 100.

Random sampling derives its power from mathematical properties, not from the real world. There is little in the real world directly experienced in daily life that is truly random. Life is more organized than that. What's more, randomness itself is chock full of patterns. However, a random sample has useful properties. The means of repeated random samples may, with only relatively few draws each, have an approximately normal distribution *even if the **population** they are drawn from does not.*
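
A small simulation can make this concrete. The sketch below is illustrative only: the skewed population, the sample size of 50 and the 2,000 repetitions are arbitrary choices. It relies on the fact that about 68% of a normal distribution lies within one standard deviation of its mean.

```julia
using Random
using Statistics

Random.seed!(2020)  # fixed seed so the run is repeatable

# A deliberately skewed, non-normal "population": exponential draws.
population = randexp(100_000)

# Draw many random samples (with replacement) and record each sample's mean.
n = 50
draws = 2_000
sample_means = [mean(rand(population, n)) for _ in 1:draws]

# Share of values lying within one standard deviation of their mean.
within_one_sd(x) = mean(abs.(x .- mean(x)) .<= std(x))

println("Population within 1 sd:   ", round(within_one_sd(population), digits = 3))
println("Sample means within 1 sd: ", round(within_one_sd(sample_means), digits = 3))
```

The population figure should come out well above 68% because of the skew, while the figure for the sample means should land close to it.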

So, the well-known regularities of the *normal distribution*, also called the *Gaussian distribution* or the bell curve, are properties of abstraction, not of the underlying reality.

~~~
<img src="/assets/img/normal.png" style="width: 100%; display: block;">
~~~

A margin of error of a sample is constructed by considering a measure of central tendency, the *mean*, and a measure of variability, the *standard deviation*. Those interact to produce the table above. If you see poll results reported with a margin of error but not an $N$, the approximate $N$ can be backed out, at an assumed confidence level, from the margin of error formula below; the table above covers the case of a 2.5% margin of error.

~~~
<img src="/assets/img/sampling_error.png" style="width: 100%; display: block;">
~~~

Unless there is a specific reason, most readers have no need to do the calculations themselves. For the curious, however, formulas follow.

$\text{Margin of Error} = z \times \sqrt{\frac{p(1-p)}{n}}$

where

* `z` is the z-score corresponding to the desired confidence level
* `p` is the sample proportion (or assumed population proportion)
* `n` is the sample size
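
The formula can be sketched in a few lines of Julia, together with the same formula solved for `n`. The function names are illustrative, and the defaults assume `z = 1.96` and the conservative `p = 0.5`.

```julia
# Margin of error for a sample of size n.
margin_of_error(n; z = 1.96, p = 0.5) = z * sqrt(p * (1 - p) / n)

# The same formula solved for n, rounded up to a whole respondent.
required_sample(moe; z = 1.96, p = 0.5) = ceil(Int, z^2 * p * (1 - p) / moe^2)

for n in (500, 1_000, 1_537, 5_000)
    println("n = ", n, "  margin of error = ",
            round(100 * margin_of_error(n), digits = 2), "%")
end

println("Sample needed for a 2.5% margin: ", required_sample(0.025))
```

Because `n` sits under a square root, halving the margin of error takes roughly four times as many respondents.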

The `z-score` of a sample mean, in turn, is $z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$

where

* $\bar{x}$ is the sample mean
* $\mu$ is the population mean
* $\sigma$ is the population standard deviation
* $n$ is the sample size

The term $\frac{\sigma}{\sqrt{n}}$ represents the standard error of the mean, which is the standard deviation of the sampling distribution of the mean.

The z-score measures the number of standard deviations an individual value or sample mean is away from the population mean.
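
A minimal worked example of the z-score formula follows. The ten spreads, the population mean and the population standard deviation are hypothetical numbers chosen only to show the arithmetic; they are not poll results.

```julia
using Statistics

# Hypothetical values for illustration only: ten poll spreads in percentage points.
spreads = [1.2, -0.4, 0.8, 0.3, -1.1, 0.9, 0.2, 0.6, -0.3, 0.5]
mu = 0.0      # assumed population mean
sigma = 1.0   # assumed population standard deviation

x_bar = mean(spreads)
standard_error = sigma / sqrt(length(spreads))
z = (x_bar - mu) / standard_error

println("Sample mean:    ", round(x_bar, digits = 3))
println("Standard error: ", round(standard_error, digits = 3))
println("z-score:        ", round(z, digits = 3))
```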
32 changes: 32 additions & 0 deletions recount.md
@@ -0,0 +1,32 @@
+++
title = "Recounts"
+++

## Pennsylvania

Automatic when margin between first and second place candidates is less than or equal to 0.5% of the total vote for all candidates for the office. **34,685** votes, based on 2020 election. [Citation](https://govt.westlaw.com/pac/Document/NE079CBD017FB11EA9B799CBCA5DC090C?viewType=FullText&originationContext=documenttoc&transitionType=CategoryPageItem&contextData=(sc.Default))

## Georgia

On request when margin between first and second place candidates is less than or equal to 0.5% of the total vote for all candidates for the office. **25,000** votes, based on 2020 election. [Citation](https://law.justia.com/codes/georgia/2022/title-21/chapter-2/article-12/section-21-2-495/)

## North Carolina

On request when margin between first and second place candidates is less than or equal to 0.5% of the total vote for all candidates for the office or 10,000 votes, whichever is less. 0.5% is **27,625** votes, based on 2020 election, so the 10,000-vote limit applies. [Citation](https://ncleg.gov/EnactedLegislation/Statutes/PDF/ByArticle/Chapter_163/Article_15A.pdf)

## Michigan

Automatic when margin between first and second place candidates is less than or equal to 0.5% of their combined vote. On request on allegation of fraud or mistake without limit. **27,270** votes, based on 2020 election. [Citation](https://www.legislature.mi.gov/Laws/MCL?objectName=mcl-168-881)

## Arizona

Automatic when margin between first and second place candidates is less than or equal to 0.5% of their combined vote. **16,937** votes, based on 2020 election. [Citation](https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/16/00661.htm)

## Wisconsin

On request when margin between first and second place candidates is less than or equal to 1% of the votes cast for the office. **32,981** votes, based on 2020 election. [Citation](https://docs.legis.wisconsin.gov/statutes/statutes/9/01)

## Nevada

On request. **No minimum number of votes**. [Citation](https://www.leg.state.nv.us/Division/Legal/LawLibrary/NRS/NRS-293.html#NRS293Sec403)
2 changes: 1 addition & 1 deletion stats.md
@@ -150,7 +150,7 @@ Key points about simple random sampling error:

6. Confidence intervals: Sampling error is often quantified using confidence intervals, which provide a range of plausible values for the population parameter based on the sample data and the desired level of confidence.

-7. Margin of error: The margin of error is a common way to express sampling error, representing the maximum expected difference between the sample statistic and the population parameter at a given confidence level.
+7. Margin of error: The margin of error is a common way to express sampling error, representing the maximum expected difference between the sample statistic and the population parameter at a given confidence level. [*See more*](/moe)

Factors affecting simple random sampling error:
