Merge pull request #65 from sfu-discourse-lab/V7.0

Update code base to V7.0

prrao87 authored Feb 20, 2023
2 parents d822b0c + 1396702 commit 172962b
Showing 89 changed files with 3,415 additions and 1,835 deletions.
695 changes: 21 additions & 674 deletions LICENSE

Large diffs are not rendered by default.

9 changes: 5 additions & 4 deletions README.md
@@ -1,4 +1,4 @@
__Status: V6.1__ (Code provided as-is; only sporadic updates expected).
__Status: V7.0__ (Code provided as-is; only sporadic updates expected)

# The Gender Gap Tracker

@@ -18,8 +18,9 @@ See [CONTRIBUTORS.md](CONTRIBUTORS.md)

* `scraper`: Modules for scraping English and French news articles from various Canadian news organizations' websites and RSS feeds.
* `nlp`: NLP modules for performing quote extraction and entity gender annotation on both English and French news articles.
* `statistics`: Example scripts for running batch queries on our MongoDB database to retrieve source/gender statistics.
* `dashboard_for_research`: [Research dashboard and apps](https://gendergaptracker.research.sfu.ca/) that allow us to explore the GGT data in more detail.
* `api`: FastAPI code base exposing endpoints that serve our daily statistics to public-facing dashboards: [Gender Gap Tracker](https://gendergaptracker.informedopinions.org) and [Radar de Parité](https://radardeparite.femmesexpertes.ca)
* `research_dashboard`: [A multi-page, extensible dashboard](https://gendergaptracker.research.sfu.ca/) built in Plotly Dash that allows us to explore the GGT data in more detail.
* `statistics`: Scripts for running batch queries on our MongoDB database to retrieve source/gender statistics.

## Data

@@ -31,7 +32,7 @@ In future versions of the software, we are planning to visualize more fine-grain

From a research perspective, questions of salience and space arise, i.e., whether quotes by men are presented more prominently in an article, and whether men are given more space on average (perhaps measured in number of words). More nuanced questions involving language analysis include whether quotes are presented differently in terms of endorsement or distance from the content of the quote (*stated* vs. *claimed*). Analyses of transitivity structure in clauses can yield further insights into the types of roles in which women are portrayed, complementing some of our studies' findings via dependency analyses.

We are mindful of and acknowledge the relative lack of work in NLP, topic modelling and gender equality for corpora in languages other than English. Our hope is that we are at least playing a small role here, through our analyses of Canadian French-language news whose code we share in this repo. We believe that such work will yield not only interesting methodological insights (for example, the relative benefits of stemming vs. lemmatization on topic keyword interpretability for non-English corpora), but also reveal whether the same gender disparities we observed in our English corpus are present in French. While we are actively pursuing such additional areas of inquiry, we also invite other researchers to join in this effort!
We are mindful of and acknowledge the relative lack of work in NLP, topic modelling and gender equality for corpora in languages other than English. Our hope is that we are at least playing a small role here, through our analyses of Canadian French-language news whose code we share in this repo. We believe that such work will yield not only interesting methodological insights, but also reveal whether the same gender disparities we observed in our English corpus are present in French. While we are actively pursuing such additional areas of inquiry, we also invite other researchers to join in this effort!


## Contact
35 changes: 35 additions & 0 deletions api/README.md
@@ -0,0 +1,35 @@
# APIs for public-facing dashboards

This section hosts code for the backend APIs that serve our public-facing dashboards for our partner organization, Informed Opinions.

We have two APIs: one each serving the English and French dashboards (for the Gender Gap Tracker and the Radar de Parité, respectively).

## Dashboards
* English: https://gendergaptracker.informedopinions.org
* French: https://radardeparite.femmesexpertes.ca

### Front end code

For clearer separation of roles and responsibilities, the front end code base is hosted separately in private repos. Access to these repos is restricted, so please reach out to [email protected] if you require access to the code.

## Setup

Both APIs are written using [FastAPI](https://fastapi.tiangolo.com/), a high-performance web framework for building APIs in Python.

This code base has been tested on Python 3.9, but should also work on newer Python versions.

Install the required dependencies via `requirements.txt` as follows.

Create and activate a new virtual environment (if one does not already exist), then install the dependencies into it:
```sh
$ python3.9 -m venv api_venv
$ source api_venv/bin/activate
$ python -m pip install -r requirements.txt
```

For subsequent sessions, simply activate the existing virtual environment:

```sh
$ source api_venv/bin/activate
```


35 changes: 35 additions & 0 deletions api/english/README.md
@@ -0,0 +1,35 @@
# Gender Gap Tracker: API

This section contains the code for the API that serves the [Gender Gap Tracker public dashboard](https://gendergaptracker.informedopinions.org/). The dashboard itself is hosted externally, and its front end code is hosted on this [GitLab repo](https://gitlab.com/client-transfer-group/gender-gap-tracker).

## API docs

The docs can be accessed in one of two ways:

* Swagger: https://gendergaptracker.informedopinions.org/docs
* Useful to test out the API interactively in the browser
* Redoc: https://gendergaptracker.informedopinions.org/redoc
* Clean, modern UI to see the API structure in a responsive format
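Beyond the docs pages, the endpoints can be queried directly. A hedged sketch, assuming the `/expertWomen` route prefix and the `begin`/`end` query parameters defined in `main.py` and `endpoints/outlet_stats.py` (the exact public path may differ in deployment):

```sh
# Fetch aggregate gender statistics between two dates (yyyy-mm-dd).
# Route prefix and parameter names are taken from main.py / outlet_stats.py.
$ curl "https://gendergaptracker.informedopinions.org/expertWomen/info_by_date?begin=2022-01-01&end=2022-01-31"
```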

## Extensibility

The code base is written so that future developers can easily add endpoints for new functionality, potentially serving other dashboards.

* `db`: Contains MongoDB-specific code (config and queries) that helps interact with the Gender Gap Tracker data on our MongoDB database
* `endpoints`: Add new functionality to process and serve results via RESTful API endpoints
* `schemas`: Perform response data validation so that the JSON results from the endpoint are formatted properly in the docs
* `utils`: Add utility functions that support data manipulation within the routers
* `gunicorn_conf.py`: Contains deployment-specific instructions for the web server, explained below.

## Deployment

We perform a standard deployment of FastAPI in production, as per the best practices [shown in this blog post](https://www.vultr.com/docs/how-to-deploy-fastapi-applications-with-gunicorn-and-nginx-on-ubuntu-20-04/).

* `uvicorn` is used as an async web server (compatible with the `gunicorn` web server for production apps)
* `gunicorn` works as a process manager that starts multiple `uvicorn` processes via the `uvicorn.workers.UvicornWorker` class
* `nginx` is used as a reverse proxy

The deployment and maintenance of the web server is carried out by SFU's Research Computing Group (RCG).
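Tying the pieces together, the server can be launched by pointing `gunicorn` at the FastAPI app with the config file above. A sketch of the invocation (the worker class and socket path come from `gunicorn_conf.py`, and `main:app` refers to the FastAPI instance in `main.py`; the actual service setup managed by RCG may differ):

```sh
# gunicorn acts as the process manager; the config file selects
# uvicorn.workers.UvicornWorker and the unix socket to bind to.
$ gunicorn -c gunicorn_conf.py main:app
```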



File renamed without changes.
18 changes: 18 additions & 0 deletions api/english/db/config.py
@@ -0,0 +1,18 @@
host = ["mongo0", "mongo1", "mongo2"]
# host = "localhost"
is_direct_connection = host == "localhost"

config = {
"MONGO_HOST": host,
"MONGO_PORT": 27017,
"MONGO_ARGS": {
"authSource": "admin",
"readPreference": "primaryPreferred",
"username": "username",
"password": "password",
"directConnection": is_direct_connection,
},
"DB_NAME": "mediaTracker",
"LOGS_DIR": "logs/",
}
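The `directConnection` flag above toggles on whether the host is a single local instance or a list of replica-set members. A minimal, self-contained sketch of that logic (the `make_config` helper is illustrative, not part of the repo):

```python
# Mirror of the host/directConnection logic in db/config.py: a single
# "localhost" host gets a direct connection, while a list of replica-set
# members ["mongo0", "mongo1", "mongo2"] does not.
def make_config(host):
    return {
        "MONGO_HOST": host,
        "MONGO_ARGS": {"directConnection": host == "localhost"},
    }

replica_cfg = make_config(["mongo0", "mongo1", "mongo2"])
local_cfg = make_config("localhost")
print(replica_cfg["MONGO_ARGS"]["directConnection"])  # False
print(local_cfg["MONGO_ARGS"]["directConnection"])    # True
```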

33 changes: 33 additions & 0 deletions api/english/db/mongoqueries.py
@@ -0,0 +1,33 @@
def agg_total_per_outlet(begin_date: str, end_date: str):
query = [
{"$match": {"publishedAt": {"$gte": begin_date, "$lte": end_date}}},
{
"$group": {
"_id": "$outlet",
"totalArticles": {"$sum": "$totalArticles"},
"totalFemales": {"$sum": "$totalFemales"},
"totalMales": {"$sum": "$totalMales"},
"totalUnknowns": {"$sum": "$totalUnknowns"},
}
},
]
return query


def agg_total_by_week(begin_date: str, end_date: str):
query = [
{"$match": {"publishedAt": {"$gte": begin_date, "$lte": end_date}}},
{
"$group": {
"_id": {
"outlet": "$outlet",
"week": {"$week": "$publishedAt"},
"year": {"$year": "$publishedAt"},
},
"totalFemales": {"$sum": "$totalFemales"},
"totalMales": {"$sum": "$totalMales"},
"totalUnknowns": {"$sum": "$totalUnknowns"},
}
},
]
return query
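Because the pipeline builders above are plain functions returning lists, they can be inspected or unit-tested without a database connection. A minimal sketch (the endpoint code converts the date strings via `dateutils.convert_date` before querying; the commented-out pymongo call uses the `mediaTracker`/`mediaDaily` names from `db/config.py` and `outlet_stats.py`):

```python
# Build the per-outlet aggregation pipeline (mirrors agg_total_per_outlet)
# and show how it would be handed to pymongo's aggregate().
def agg_total_per_outlet(begin_date, end_date):
    return [
        {"$match": {"publishedAt": {"$gte": begin_date, "$lte": end_date}}},
        {
            "$group": {
                "_id": "$outlet",
                "totalArticles": {"$sum": "$totalArticles"},
                "totalFemales": {"$sum": "$totalFemales"},
                "totalMales": {"$sum": "$totalMales"},
                "totalUnknowns": {"$sum": "$totalUnknowns"},
            }
        },
    ]

pipeline = agg_total_per_outlet("2022-01-01", "2022-01-31")
print(len(pipeline))  # 2 stages: $match, then $group

# Running it requires a live MongoDB instance:
# from pymongo import MongoClient
# client = MongoClient("localhost", 27017)
# results = list(client["mediaTracker"]["mediaDaily"].aggregate(pipeline))
```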
120 changes: 120 additions & 0 deletions api/english/endpoints/outlet_stats.py
@@ -0,0 +1,120 @@
import pandas as pd
import utils.dateutils as dateutils
from db.mongoqueries import agg_total_by_week, agg_total_per_outlet
from fastapi import APIRouter, HTTPException, Request, Query
from schemas.stats_by_date import TotalStatsByDate
from schemas.stats_weekly import TotalStatsByWeek

outlet_router = APIRouter()
COLLECTION_NAME = "mediaDaily"
LOWER_BOUNT_START_DATE = "2018-10-01"
ID_MAPPING = {"Huffington Post": "HuffPost Canada"}


@outlet_router.get(
"/info_by_date",
response_model=TotalStatsByDate,
response_description="Get total and per outlet gender statistics for English outlets between two dates",
)
def expertwomen_info_by_date(
request: Request,
begin: str = Query(description="Start date in yyyy-mm-dd format"),
end: str = Query(description="End date in yyyy-mm-dd format"),
) -> TotalStatsByDate:
if not dateutils.is_valid_date_range(begin, end, LOWER_BOUNT_START_DATE):
raise HTTPException(
status_code=416,
detail=f"Date range error: Should be between {LOWER_BOUNT_START_DATE} and tomorrow's date",
)
begin = dateutils.convert_date(begin)
end = dateutils.convert_date(end)

query = agg_total_per_outlet(begin, end)
response = request.app.connection[COLLECTION_NAME].aggregate(query)
# Work with the data in pandas
source_stats = list(response)
df = pd.DataFrame.from_dict(source_stats)
df["totalGenders"] = df["totalFemales"] + df["totalMales"] + df["totalUnknowns"]
# Replace outlet names if necessary
df["_id"] = df["_id"].replace(ID_MAPPING)
# Take sums of total males, females, unknowns and articles and convert to dict
result = df.drop("_id", axis=1).sum().to_dict()
# Compute per outlet stats
df["perFemales"] = df["totalFemales"] / df["totalGenders"]
df["perMales"] = df["totalMales"] / df["totalGenders"]
df["perUnknowns"] = df["totalUnknowns"] / df["totalGenders"]
df["perArticles"] = df["totalArticles"] / result["totalArticles"]
# Convert dataframe to dict prior to JSON serialization
result["sources"] = df.to_dict("records")
result["perFemales"] = result["totalFemales"] / result["totalGenders"]
result["perMales"] = result["totalMales"] / result["totalGenders"]
result["perUnknowns"] = result["totalUnknowns"] / result["totalGenders"]
return result


@outlet_router.get(
"/weekly_info",
response_model=TotalStatsByWeek,
response_description="Get gender statistics per English outlet aggregated WEEKLY between two dates",
)
def expertwomen_weekly_info(
request: Request,
begin: str = Query(description="Start date in yyyy-mm-dd format"),
end: str = Query(description="End date in yyyy-mm-dd format"),
) -> TotalStatsByWeek:
if not dateutils.is_valid_date_range(begin, end, LOWER_BOUNT_START_DATE):
raise HTTPException(
status_code=416,
detail=f"Date range error: Should be between {LOWER_BOUNT_START_DATE} and tomorrow's date",
)
begin = dateutils.convert_date(begin)
end = dateutils.convert_date(end)

query = agg_total_by_week(begin, end)
response = request.app.connection[COLLECTION_NAME].aggregate(query)
# Work with the data in pandas
df = (
pd.json_normalize(list(response), max_level=1)
.sort_values(by="_id.outlet")
.reset_index(drop=True)
)
df.rename(
columns={
"_id.outlet": "outlet",
"_id.week": "week",
"_id.year": "year",
},
inplace=True,
)
# Replace outlet names if necessary
df["outlet"] = df["outlet"].replace(ID_MAPPING)
# Construct DataFrame and handle begin/end dates as datetimes for summing by week
df["w_begin"] = df.apply(lambda row: dateutils.get_week_bound(row["year"], row["week"], 0), axis=1)
df["w_end"] = df.apply(lambda row: dateutils.get_week_bound(row["year"], row["week"], 6), axis=1)
df["w_begin"], df["w_end"] = zip(*df.apply(lambda row: (pd.to_datetime(row["w_begin"]), pd.to_datetime(row["w_end"])), axis=1))
df = (
df.drop(columns=["week", "year"], axis=1)
.sort_values(by=["outlet", "w_begin"])
)
# In earlier versions, there was a bug due to which we returned weekly information for the same week begin date twice
# This bug only occurred when the last week of one year spanned into the next year (partial week across a year boundary)
# To address this, we perform summation of stats by week to avoid duplicate week begin dates being passed to the front end
df = df.groupby(["outlet", "w_begin", "w_end"]).sum().reset_index()
df["totalGenders"] = df["totalFemales"] + df["totalMales"] + df["totalUnknowns"]
df["perFemales"] = df["totalFemales"] / df["totalGenders"]
df["perMales"] = df["totalMales"] / df["totalGenders"]
df["perUnknowns"] = df["totalUnknowns"] / df["totalGenders"]
# Convert datetimes back to string for JSON serialization
df["w_begin"] = df["w_begin"].dt.strftime("%Y-%m-%d")
df["w_end"] = df["w_end"].dt.strftime("%Y-%m-%d")
df = df.drop(columns=["totalGenders", "totalFemales", "totalMales", "totalUnknowns"], axis=1)

# Convert dataframe to dict prior to JSON serialization
weekly_data = dict()
    for outlet in df["outlet"].unique():
        per_outlet_data = df[df["outlet"] == outlet].to_dict(orient="records")
        # Remove the redundant outlet key from each weekly record
        for item in per_outlet_data:
            item.pop("outlet")
weekly_data[outlet] = per_outlet_data
output = {"outlets": weekly_data}
return output
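The year-boundary deduplication in the endpoint above (summing stats grouped by outlet and week bounds) can be illustrated on a toy frame. A minimal sketch with made-up numbers, not real tracker data:

```python
import pandas as pd

# Two rows for the same outlet and the same week bounds, as can happen when
# a week straddles a year boundary and MongoDB's $week/$year split it in two.
df = pd.DataFrame(
    {
        "outlet": ["CBC News", "CBC News"],
        "w_begin": ["2019-12-29", "2019-12-29"],
        "w_end": ["2020-01-04", "2020-01-04"],
        "totalFemales": [3, 2],
        "totalMales": [10, 5],
    }
)

# Summing within (outlet, w_begin, w_end) collapses the duplicate week rows,
# mirroring the groupby().sum() step in the endpoint above.
deduped = df.groupby(["outlet", "w_begin", "w_end"]).sum().reset_index()
print(deduped.to_dict("records"))
```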
14 changes: 14 additions & 0 deletions api/english/gunicorn_conf.py
@@ -0,0 +1,14 @@
# gunicorn_conf.py to point gunicorn to the uvicorn workers
from multiprocessing import cpu_count

# Socket path
bind = 'unix:/path_to_code/GenderGapTracker/api/english/g-tracker.sock'

# Worker Options
workers = cpu_count() + 1
worker_class = 'uvicorn.workers.UvicornWorker'

# Logging Options
loglevel = 'debug'
accesslog = '/path_to_code/GenderGapTracker/api/english/access_log'
errorlog = '/path_to_code/GenderGapTracker/api/english/error_log'
48 changes: 48 additions & 0 deletions api/english/logging.conf
@@ -0,0 +1,48 @@
[loggers]
keys=root, gunicorn.error, gunicorn.access

[handlers]
keys=console, error_file, access_file

[formatters]
keys=generic, access

[logger_root]
level=INFO
handlers=console

[logger_gunicorn.error]
level=INFO
handlers=error_file
propagate=1
qualname=gunicorn.error

[logger_gunicorn.access]
level=INFO
handlers=access_file
propagate=0
qualname=gunicorn.access

[handler_console]
class=StreamHandler
formatter=generic
args=(sys.stdout, )

[handler_error_file]
class=logging.FileHandler
formatter=generic
args=('/var/log/gunicorn/error.log',)

[handler_access_file]
class=logging.FileHandler
formatter=access
args=('/var/log/gunicorn/access.log',)

[formatter_generic]
format=%(asctime)s [%(process)d] [%(levelname)s] %(message)s
datefmt=%Y-%m-%d %H:%M:%S
class=logging.Formatter

[formatter_access]
format=%(message)s
class=logging.Formatter
56 changes: 56 additions & 0 deletions api/english/main.py
@@ -0,0 +1,56 @@
from pathlib import Path

from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
from pymongo import MongoClient

from db.config import config
from endpoints.outlet_stats import outlet_router

# Constants
HOST = config["MONGO_HOST"]
PORT = config["MONGO_PORT"]
MONGO_ARGS = config["MONGO_ARGS"]
DB = config["DB_NAME"]
STATIC_PATH = "gender-gap-tracker"
STATIC_HTML = "tracker.html"

app = FastAPI(
title="Gender Gap Tracker",
description="RESTful API for the Gender Gap Tracker public-facing dashboard",
version="1.0.0",
)


@app.get("/", include_in_schema=False)
async def root() -> HTMLResponse:
with open(Path(f"{STATIC_PATH}") / STATIC_HTML, "r") as f:
html_content = f.read()
return HTMLResponse(content=html_content, media_type="text/html")


@app.on_event("startup")
def startup_db_client() -> None:
app.mongodb_client = MongoClient(HOST, PORT, **MONGO_ARGS)
app.connection = app.mongodb_client[DB]
print("Successfully connected to MongoDB!")


@app.on_event("shutdown")
def shutdown_db_client() -> None:
app.mongodb_client.close()


# Attach routes
app.include_router(outlet_router, prefix="/expertWomen", tags=["info"])
# Add additional routers here for future endpoints
# ...

# Serve static files for front end from directory specified as STATIC_PATH
app.mount("/", StaticFiles(directory=STATIC_PATH), name="static")


if __name__ == "__main__":
import uvicorn
uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)