We take the preliminary results of Project 2, and we redo some of them so it's quantitative instead of qualitative analysis- and then we slap that into a OpenAPI compliant interface using FastAPI, and produce a half-baked NextJS based dashboard to showcase some of the the graphs as dashboards. Unfortunately due to time constraints, a significant portion of the functionality is still locked behind the API routes only.
- Jacob Barkovitch, [email protected]
- Guy Ben-Yishai, [email protected]
- Alan Bixby, [email protected]
- Jacob Coddington, [email protected]
- Ryan Geary, [email protected]
- Joseph Lieberman, [email protected]
Python 3.8
- The project is developed and tested using python v3.8; packages are managed usingpoetry
, but thepyproject.toml
should be fairly package manager agnosticMongoDB v6.0.2
- An industry staple, web scale, NoSQL database- handles schemaless input and theres less of a concern for accidental SQL injections. MongoDBpython-dotenv
- Python-dotenv reads key-value pairs from a .env file and can set them as environment variables. python-dotenv reponumpy
- a fundamental package for scientific computing with Pythonmatplotlib
- a comprehensive library for creating static, animated, and interactive visualizations in Pythonseaborn
- a Python data visualization library based on matplotlibfastapi
- a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hintsuvicorn
- an ASGI web server implementation for Python, used to runfastapi
thefuzz
- used for fuzzy search of API inputs to prevent the need to memorize exact team names or subredditsscipy
- used for integrating the time series data to find overall sentiment scores across time
Dashboard
NextJS
- a flexible React framework that gives you building blocks to create fast web applicationsChartJS
- a simple yet flexible JavaScript charting library for the modern webChartJSNextJSDashboard Github Template
: a Python centric, full stack dashboard template found on YouTube
Minimal webscraping for extra data used, these projects will appear as branches:
NodeJS
- a JavaScript runtimeTypeScript
- the superior form of JavaScriptFuseJS
- a powerful, lightweight fuzzy-search library, with zero dependencies.Cheerio
- fast, flexible & lean implementation of core jQuery designed specifically for the server.got-scraping
- an extended version ofGot
that automatically changes user-agents, etc, to avoid rate limiting while webscraping
Our three data sources consist of the same dataset defined in Project 1; however for computational complexity, the focus of our time was put on NFL data. There is heavy overlap between what was implemented in Project 2 and Project3. The following cleaning has been done as outlines:
- Team rosters were web-scraped from
footballdb.com
usingCheerio
andgot-scraping
, implemented in project 2roster-to-entitlementids
- Game scores and durations (start/end times in EST) were fetched from the
pro-football-reference.com
data set, implemented innfl-endtimes
- Teams were designated for tweets by using the Twitter Entitlement IDs that matches the team name or player roster names; matches were done using a fuzzy search with
FuseJS
, to reduce the amount of missed data from nicknames/name formatting differences; in instances where more than one match was found, the strongest match was used - See
teamToEntitlementIds.json
for entity ID to team mappings
- Due to time constraints, only
subreddit
was used to map comments to a team. Unfortunately this means places like/r/nfl
are entirely excluded from our analysis- but it was too complexity to design a ML model, etc that could accurately classify arbitrary text to a given team - See
teamToSubreddits.json
for subreddit to team mappings
- Only NFL data is considered; NCAAF while very well covered by the betting odds, had insufficient amounts of filtered player data in the Twitter and Reddit data pipelines to practically use.
As outlined in Project 1, a Google VM was created for slightly better specs/redundancy during an outage (using free credit!)- the demo was conducted using the Google VM, but by the time this is graded, this should be runnable on the school's provided VM. The only hold up for the demo was the fact that there was insufficient space to migrate the segregated data to the VM; so the analysis was needed to be computed on the full data set (in-place), which there was insufficient space to do.
Navigate to the following links:
- Dashboard: https://dashboard.ds.bxb.gg
- OpenAPI Docs: https://api.ds.bxb.gg
localhost/code level access is the same as
School VM
listed below
ssh prod@***.***.***.*** # (Google VM)
ssh prod@***.***.***.*** # (Binghamton VM)
Code should match on both machines; if not,
git pull
, with the associated SSH-key, same password as SSH.
- Connect to the Ivanti VPN (formerly Pulse VPN)
- In two separate terminals, construct SSH reverse tunnels to the VM on port 3000 (dashboard) and port 5000 (API):
ssh -L 3000:localhost:3000 -L 5000:localhost:5000 <username>@***.***.***.***
- I personally use
SSH Remote
on VSCode which automatically handles the remote port forwarding, so YMMV- consult thebxb.gg
links otherwise.
- From a local browser, access
localhost:3000
andlocalhost:5000/docs
- This is equvialent to
https://dashboard.ds.bxb.gg
andhttps://api.ds.bxb.gg
, respectively
- This is equvialent to
- In the API Docs, expanding a route, clicking
Try it out
in the top right corner, then filling in the appropriate fields, and hittingExecute
will process a request.
- Clone the repository and navigate into the package.
- Initialize a new virtual environment with
poetry shell
- Install the Python dependencies with
poetry install
- Run
python src/api.py
to start the FastAPI webserver - Navigate to
dashboard
, runnpm install
to install NodeJS dependencies - Run
npm start
to start the NextJS webserver for the dashboard
On both the Google VM and Binghamton VM, these are managed via PM2 and will already be running. Running second processes with fail/launch on a different port.
Accessible on https://dashboard.ds.bxb.gg (localhost:3000)
A minimal web dashboard using some of the endpoints of the API to generate data using ChartJS
; much more functionality is available (exclusively) on the API docs page.
This dashboard was slapped together last minute from this YT Tutorial GitHub template repo.
Clicking elements on the legend can hide/show data sources
Accessible on https://api.ds.bxb.gg/docs (localhost:5000/docs)
An API for generating graphs and dataframes of sentiment analysis data for NFL teams on Reddit and Twitter 🏈
General note:
Los Angeles Rams
are not supported for any Reddit data retrieval; the fact that there are two Los Angeles teams likely caused us to think one was a mistake/duplicate and it was purged from our original subreddit list for data collection.
Method | Path | Description |
---|---|---|
GET | /df/{team_name}/{collection}/{mode} | Generate Dataframe Data |
GET | /games | List of NFL Games |
GET | /most_positive/{data_source} | Get Most Positive Team |
GET | /least_postive/{data_source} | Get Least Positive Team |
GET | /positivity_sort/{data_source} | Get Positivity Sort |
GET | /odds/odds_ts | Get Bookmaker Odds |
GET | /odds/odds_avg | Get Average Bookmaker Odds |
GET | /graph/odds | Graph Betting Odds |
GET | /graph/team/{team_name} | Graph Team Polarity |
GET | /graph/game/{game_id} | Graph Team Vs Team Polarities |
GET | /diff/{game_id} | Calculate Game Difference |
Summary
- Generate Dataframe Data
Description
-
Generate a dataframe of data for a given team, data source, and metric; allows for custom time windows
-
team_name: name of the team; fuzzy search enabled ("buffalo" -> "Buffalo Bills")
-
collection: data source to use; reddit or twitter
-
mode: metric to use; sentiment or frequency
-
focus_datetime: a
datetime
to center the graph on; generally the game start time -
window_before: a
timedelta
of the amount of time to fetch prior to thefocus_datetime
-
window_after: a
timedelta
of the amount of time to fetch following thefocus_datetime
-
sample_window: a
timedelta
template string to be used for the moving average calculation -
resample_window: a
timedelta
template string to be used for the graphing interval (otherwise every point is shown which takes forver to render) -
all_data: bool; override the window parameters and just graph everything
team_name: string
collection: 'reddit' | 'twitter'
mode: 'sentiment'
focus_datetime?: string // 2022-11-14T12:00:00+00:00
window_before?: number // 172800 (seconds)
window_after?: number // 172800 (seconds)
sample_window?: string // 2D
resample_window?: string // 90T
all_data?: boolean // false
- 200 Successful Response
application/json
{
polarity: number[] // [0.08553888081356927, ..., 0.07893374467957985]
sentiment: number[] // [0.4102488992935106, ..., 0.4113862354514848]
}
Summary
- List of NFL Games
Description
- Generate an array of all recorded games
- 200 Successful Response
application/json
[
{
_id: string // 075dbff0219a3a1100c513400c4796ef
timestamp: string // 2022-11-04T00:15:00
endTimestamp: string // "2022-11-04T03:05:00"
home_team: string // Houston Texans
home_score: number // 17
away_team: string // 2022
winner: string // "Philadelphia Eagles"
duration: string // "2:50"
}
]
Summary
- Get Most Positive Team
Description
-
Return the NFL team with highest positive sentiment
-
data_source: data source to use; reddit or twitter
data_source: "reddit" | "twitter"
- 200 Successful Response
application/json
string // "baltimore-ravens"
Summary
- Get Least Positive Team
Description
-
Return the NFL team with lowest positive sentiment
-
data_source: Name of social media site
data_source: "reddit" | "twitter"
- 200 Successful Response
application/json
string // "las-vegas-raiders"
Summary
- Get Positivity Sort
Description
-
Generate a dataframe of NFL teams sorted by positivity
-
data_source: data source to use; reddit or twitter
data_source: "reddit" | "twitter"
- 200 Successful Response
application/json
[string, number][] // [["baltimore-ravens", 8.006139505827427], ..., ["las-vegas-raiders", 3.5768940857376483]]
Summary
- Get Bookmaker Odds
Description
-
Generate a dataframe of all bookmaker odds for a specific game in timeseries formatting; use
/games
to fetch an applicablegame_id
-
game_id: game id of desired game; can be provided by /games
game_id: string // 38bf123ec9df3af4efff83e45f472c61
- 200 Successful Response
application/json
{
string: [{
time: string // "2022-11-18 16:59:09.263000"
team1_name: string // "Chicago Bears"
team1_odds: number // 152
team2_name: string // "New York Jets"
team2_odds: number // -180
}]
}
Summary
- Get Average Bookmaker Odds
Description
-
Generate a dataframe of average bookmaker odds for a specific game
-
game_id: game id of desired game; can be provided by /games
game_id: string // 38bf123ec9df3af4efff83e45f472c61
- 200 Successful Response
application/json
{
[key: str]: {
team1_name: string // "Chicago Bears"
team1_odds: number // 194.12056737588654
team2_name: string // "New York Jets"
team2_odds: number // -234.2127659574468
}
}
Summary
- Graph Betting Odds
Description
-
Generate graph of specified bookmaker odds for a desired game
-
game_id: game id of desired game; can be provided by /games
-
bookmaker: bookmaker to use; fuzzy searchable (draft -> "DraftKings")
Dotted vertical line indicates winner
game_id: string // 49b82cfa8d543643cf2b5d5109afa7be
bookmaker: string // fanduels
- 200 Successful Response
image/png
The vertical dotted line marks the start of the game, the color of the line matches the winner
Notice that odds fluctuate greatly after the dotted line due to mid-game betting
Summary
- Graph Team Polarity
Description
-
Generate graph of polarity for specified team
-
team_name: name of the team; fuzzy search enabled ("buffalo" -> "Buffalo Bills")
-
data_source:data source to use; reddit or twitter
-
sample_window: a
timedelta
template string to be used for the moving average calculation -
resample_window: a
timedelta
template string to be used for the graphing interval (otherwise every point is shown which takes forver to render)
team_name: TeamNameLiteral // Philidelphia Eagles
data_source: "reddit" | "twitter" // reddit
sample_window?: string // 3D
resample_window?: string // 3H
- 200 Successful Response
image/png
Vertical bars indicate the the start to end time of a game; the color of the bar indicates the outcome.
Green ✅= Win, Red ❌= Loss
Summary
- Graph Team Vs Team Polarities
Description
-
Generate graph of team VS team polarities
-
game_id: game id of desired game; can be provided by /games
-
data_source: data source to use; reddit or twitter
game_id: string // 38bf123ec9df3af4efff83e45f472c61
data_source: "reddit" | "twitter" // reddit
- 200 Successful Response
image/png
The vertical bar indicates the start and end of the game; the color of the bar matches the winning team.
Notice how the losing team has a bigger sentiment drop; we attribute the drop in the winning team to general less activity
Summary
- Calculate Game Difference
Description
-
Calculate overall difference in sentiment between two teams
-
game_id: game id of desired game; can be provided by /games
-
data_source: data source to use; reddit or twitter
game_id: string // 38bf123ec9df3af4efff83e45f472c61
data_source: "reddit" | "twitter" // reddit
- 200 Successful Response
application/json
{
<team_name_1>: number
<team_name_2>: number
delta: number
predicted_winner: <team_name_1> | <team_name_2>
winner: <team_name_1> | <team_name_2>
theory_supporting: bool
}
[
"Philadelphia Eagles",
"Buffalo Bills",
"Green Bay Packers",
"Minnesota Vikings",
"Los Angeles Chargers",
"Carolina Panthers",
"Miami Dolphins",
"Indianapolis Colts",
"Las Vegas Raiders",
"Seattle Seahawks",
"Los Angeles Rams",
"Tennessee Titans",
"Baltimore Ravens",
"Atlanta Falcons",
"Cleveland Browns",
"Denver Broncos",
"Houston Texans",
"Detroit Lions",
"Jacksonville Jaguars",
"New Orleans Saints",
"Dallas Cowboys",
"Arizona Cardinals",
"Washington Commanders",
"New York Jets",
"Chicago Bears",
"Cincinnati Bengals",
"Kansas City Chiefs",
"San Francisco 49ers",
"New York Giants",
"New England Patriots",
"Tampa Bay Buccaneers",
"Pittsburgh Steelers",
]