Merge pull request #65 from sfu-discourse-lab/V7.0

Update code base to V7.0

prrao87 authored Feb 20, 2023
2 parents d822b0c + 1396702 commit 172962b
Showing 89 changed files with 3,415 additions and 1,835 deletions.
695 changes: 21 additions & 674 deletions LICENSE

Large diffs are not rendered by default.

9 changes: 5 additions & 4 deletions README.md
@@ -1,4 +1,4 @@
__Status: V6.1__ (Code provided as-is; only sporadic updates expected).
__Status: V7.0__ (Code provided as-is; only sporadic updates expected)

# The Gender Gap Tracker

@@ -18,8 +18,9 @@ See [CONTRIBUTORS.md](CONTRIBUTORS.md)

* `scraper`: Modules for scraping English and French news articles from various Canadian news organizations' websites and RSS feeds.
* `nlp`: NLP modules for performing quote extraction and entity gender annotation on both English and French news articles.
* `statistics`: Example scripts for running batch queries on our MongoDB database to retrieve source/gender statistics.
* `dashboard_for_research`: [Research dashboard and apps](https://gendergaptracker.research.sfu.ca/) that allow us to explore the GGT data in more detail.
* `api`: FastAPI code base exposing endpoints that serve our daily statistics to public-facing dashboards: [Gender Gap Tracker](https://gendergaptracker.informedopinions.org) and [Radar de Parité](https://radardeparite.femmesexpertes.ca)
* `research_dashboard`: [A multi-page, extensible dashboard](https://gendergaptracker.research.sfu.ca/) built in Plotly Dash that allows us to explore the GGT data in more detail.
* `statistics`: Scripts for running batch queries on our MongoDB database to retrieve source/gender statistics.

## Data

@@ -31,7 +32,7 @@ In future versions of the software, we are planning to visualize more fine-grain

From a research perspective, questions of salience and space arise, i.e., whether quotes by men are presented more prominently in an article, and whether men are given more space on average (perhaps measured in number of words). More nuanced questions involving language analysis include whether quotes are presented differently in terms of endorsement or distance from the content of the quote (*stated* vs. *claimed*). Analyses of transitivity structure in clauses can yield further insights into the types of roles in which women are portrayed, complementing some of our studies' findings via dependency analyses.

We are mindful of and acknowledge the relative lack of work in NLP, topic modelling and gender equality for corpora in languages other than English. Our hope is that we are at least playing a small role here, through our analyses of Canadian French-language news whose code we share in this repo. We believe that such work will yield not only interesting methodological insights (for example, the relative benefits of stemming vs. lemmatization on topic keyword interpretability for non-English corpora), but also reveal whether the same gender disparities we observed in our English corpus are present in French. While we are actively pursuing such additional areas of inquiry, we also invite other researchers to join in this effort!
We are mindful of and acknowledge the relative lack of work in NLP, topic modelling and gender equality for corpora in languages other than English. Our hope is that we are at least playing a small role here, through our analyses of Canadian French-language news whose code we share in this repo. We believe that such work will yield not only interesting methodological insights, but also reveal whether the same gender disparities we observed in our English corpus are present in French. While we are actively pursuing such additional areas of inquiry, we also invite other researchers to join in this effort!


## Contact
35 changes: 35 additions & 0 deletions api/README.md
@@ -0,0 +1,35 @@
# APIs for public-facing dashboards

This section hosts code for the backend APIs that serve our public-facing dashboards for our partner organization, Informed Opinions.

We have two APIs: one each serving the English and French dashboards (for the Gender Gap Tracker and the Radar de Parité, respectively).

## Dashboards
* English: https://gendergaptracker.informedopinions.org
* French: https://radardeparite.femmesexpertes.ca

### Front end code

For clearer separation of roles and responsibilities, the front end code base is hosted separately in private repos. Access to these repos is restricted, so please reach out to [email protected] if you require access to the code.

## Setup

Both APIs are written using [FastAPI](https://fastapi.tiangolo.com/), a high-performance web framework for building APIs in Python.

This code base has been tested on Python 3.9, but should also work on newer Python versions.

Install the required dependencies via `requirements.txt` as follows.

Create and activate a new virtual environment (if one does not already exist), then install the dependencies into it:
```sh
$ python3.9 -m venv api_venv
$ source api_venv/bin/activate
$ python -m pip install -r requirements.txt
```

For subsequent sessions, simply activate the existing virtual environment:

```sh
$ source api_venv/bin/activate
```


35 changes: 35 additions & 0 deletions api/english/README.md
@@ -0,0 +1,35 @@
# Gender Gap Tracker: API

This section contains the code for the API that serves the [Gender Gap Tracker public dashboard](https://gendergaptracker.informedopinions.org/). The dashboard itself is hosted externally, and its front end code is hosted on this [GitLab repo](https://gitlab.com/client-transfer-group/gender-gap-tracker).

## API docs

The docs can be accessed in one of two ways:

* Swagger: https://gendergaptracker.informedopinions.org/docs
* Useful to test out the API interactively in the browser
* Redoc: https://gendergaptracker.informedopinions.org/redoc
* Clean, modern UI to see the API structure in a responsive format
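Beyond the docs pages, the endpoints can be queried directly. A hedged sketch, assuming the `/expertWomen` route prefix and the `begin`/`end` query parameters defined in `main.py` and `endpoints/outlet_stats.py` (the exact public path may differ in deployment):

```sh
# Fetch aggregate gender statistics between two dates (yyyy-mm-dd).
# Route prefix and parameter names are taken from main.py / outlet_stats.py.
$ curl "https://gendergaptracker.informedopinions.org/expertWomen/info_by_date?begin=2022-01-01&end=2022-01-31"
```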

## Extensibility

The code base is written so that future developers can easily add endpoints for new functionality, potentially serving other dashboards.

* `db`: Contains MongoDB-specific code (config and queries) that helps interact with the Gender Gap Tracker data on our MongoDB database
* `endpoints`: Add new functionality to process and serve results via RESTful API endpoints
* `schemas`: Perform response data validation so that the JSON results from the endpoint are formatted properly in the docs
* `utils`: Add utility functions that support data manipulation within the routers
* `gunicorn_conf.py`: Contains deployment-specific instructions for the web server, explained below.

## Deployment

We perform a standard deployment of FastAPI in production, as per the best practices [shown in this blog post](https://www.vultr.com/docs/how-to-deploy-fastapi-applications-with-gunicorn-and-nginx-on-ubuntu-20-04/).

* `uvicorn` is used as an async web server (compatible with the `gunicorn` web server for production apps)
* `gunicorn` works as a process manager that starts multiple `uvicorn` processes via the `uvicorn.workers.UvicornWorker` class
* `nginx` is used as a reverse proxy

The deployment and maintenance of the web server is carried out by SFU's Research Computing Group (RCG).
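Tying the pieces together, the server can be launched by pointing `gunicorn` at the FastAPI app with the config file above. A sketch of the invocation (the worker class and socket path come from `gunicorn_conf.py`, and `main:app` refers to the FastAPI instance in `main.py`; the actual service setup managed by RCG may differ):

```sh
# gunicorn acts as the process manager; the config file selects
# uvicorn.workers.UvicornWorker and the unix socket to bind to.
$ gunicorn -c gunicorn_conf.py main:app
```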



File renamed without changes.
18 changes: 18 additions & 0 deletions api/english/db/config.py
@@ -0,0 +1,18 @@
host = ["mongo0", "mongo1", "mongo2"]
# host = "localhost"
is_direct_connection = host == "localhost"

config = {
"MONGO_HOST": host,
"MONGO_PORT": 27017,
"MONGO_ARGS": {
"authSource": "admin",
"readPreference": "primaryPreferred",
"username": "username",
"password": "password",
"directConnection": is_direct_connection,
},
"DB_NAME": "mediaTracker",
"LOGS_DIR": "logs/",
}
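The `directConnection` flag above toggles on whether the host is a single local instance or a list of replica-set members. A minimal, self-contained sketch of that logic (the `make_config` helper is illustrative, not part of the repo):

```python
# Mirror of the host/directConnection logic in db/config.py: a single
# "localhost" host gets a direct connection, while a list of replica-set
# members ["mongo0", "mongo1", "mongo2"] does not.
def make_config(host):
    return {
        "MONGO_HOST": host,
        "MONGO_ARGS": {"directConnection": host == "localhost"},
    }

replica_cfg = make_config(["mongo0", "mongo1", "mongo2"])
local_cfg = make_config("localhost")
print(replica_cfg["MONGO_ARGS"]["directConnection"])  # False
print(local_cfg["MONGO_ARGS"]["directConnection"])    # True
```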

33 changes: 33 additions & 0 deletions api/english/db/mongoqueries.py
@@ -0,0 +1,33 @@
def agg_total_per_outlet(begin_date: str, end_date: str):
query = [
{"$match": {"publishedAt": {"$gte": begin_date, "$lte": end_date}}},
{
"$group": {
"_id": "$outlet",
"totalArticles": {"$sum": "$totalArticles"},
"totalFemales": {"$sum": "$totalFemales"},
"totalMales": {"$sum": "$totalMales"},
"totalUnknowns": {"$sum": "$totalUnknowns"},
}
},
]
return query


def agg_total_by_week(begin_date: str, end_date: str):
query = [
{"$match": {"publishedAt": {"$gte": begin_date, "$lte": end_date}}},
{
"$group": {
"_id": {
"outlet": "$outlet",
"week": {"$week": "$publishedAt"},
"year": {"$year": "$publishedAt"},
},
"totalFemales": {"$sum": "$totalFemales"},
"totalMales": {"$sum": "$totalMales"},
"totalUnknowns": {"$sum": "$totalUnknowns"},
}
},
]
return query
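Because the pipeline builders above are plain functions returning lists, they can be inspected or unit-tested without a database connection. A minimal sketch (the endpoint code converts the date strings via `dateutils.convert_date` before querying; the commented-out pymongo call uses the `mediaTracker`/`mediaDaily` names from `db/config.py` and `outlet_stats.py`):

```python
# Build the per-outlet aggregation pipeline (mirrors agg_total_per_outlet)
# and show how it would be handed to pymongo's aggregate().
def agg_total_per_outlet(begin_date, end_date):
    return [
        {"$match": {"publishedAt": {"$gte": begin_date, "$lte": end_date}}},
        {
            "$group": {
                "_id": "$outlet",
                "totalArticles": {"$sum": "$totalArticles"},
                "totalFemales": {"$sum": "$totalFemales"},
                "totalMales": {"$sum": "$totalMales"},
                "totalUnknowns": {"$sum": "$totalUnknowns"},
            }
        },
    ]

pipeline = agg_total_per_outlet("2022-01-01", "2022-01-31")
print(len(pipeline))  # 2 stages: $match, then $group

# Running it requires a live MongoDB instance:
# from pymongo import MongoClient
# client = MongoClient("localhost", 27017)
# results = list(client["mediaTracker"]["mediaDaily"].aggregate(pipeline))
```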
120 changes: 120 additions & 0 deletions api/english/endpoints/outlet_stats.py
@@ -0,0 +1,120 @@
import pandas as pd
import utils.dateutils as dateutils
from db.mongoqueries import agg_total_by_week, agg_total_per_outlet
from fastapi import APIRouter, HTTPException, Request, Query
from schemas.stats_by_date import TotalStatsByDate
from schemas.stats_weekly import TotalStatsByWeek

outlet_router = APIRouter()
COLLECTION_NAME = "mediaDaily"
LOWER_BOUNT_START_DATE = "2018-10-01"
ID_MAPPING = {"Huffington Post": "HuffPost Canada"}


@outlet_router.get(
"/info_by_date",
response_model=TotalStatsByDate,
response_description="Get total and per outlet gender statistics for English outlets between two dates",
)
def expertwomen_info_by_date(
request: Request,
begin: str = Query(description="Start date in yyyy-mm-dd format"),
end: str = Query(description="End date in yyyy-mm-dd format"),
) -> TotalStatsByDate:
if not dateutils.is_valid_date_range(begin, end, LOWER_BOUNT_START_DATE):
raise HTTPException(
status_code=416,
detail=f"Date range error: Should be between {LOWER_BOUNT_START_DATE} and tomorrow's date",
)
begin = dateutils.convert_date(begin)
end = dateutils.convert_date(end)

query = agg_total_per_outlet(begin, end)
response = request.app.connection[COLLECTION_NAME].aggregate(query)
# Work with the data in pandas
source_stats = list(response)
df = pd.DataFrame.from_dict(source_stats)
df["totalGenders"] = df["totalFemales"] + df["totalMales"] + df["totalUnknowns"]
# Replace outlet names if necessary
df["_id"] = df["_id"].replace(ID_MAPPING)
# Take sums of total males, females, unknowns and articles and convert to dict
result = df.drop("_id", axis=1).sum().to_dict()
# Compute per outlet stats
df["perFemales"] = df["totalFemales"] / df["totalGenders"]
df["perMales"] = df["totalMales"] / df["totalGenders"]
df["perUnknowns"] = df["totalUnknowns"] / df["totalGenders"]
df["perArticles"] = df["totalArticles"] / result["totalArticles"]
# Convert dataframe to dict prior to JSON serialization
result["sources"] = df.to_dict("records")
result["perFemales"] = result["totalFemales"] / result["totalGenders"]
result["perMales"] = result["totalMales"] / result["totalGenders"]
result["perUnknowns"] = result["totalUnknowns"] / result["totalGenders"]
return result


@outlet_router.get(
"/weekly_info",
response_model=TotalStatsByWeek,
response_description="Get gender statistics per English outlet aggregated WEEKLY between two dates",
)
def expertwomen_weekly_info(
request: Request,
begin: str = Query(description="Start date in yyyy-mm-dd format"),
end: str = Query(description="End date in yyyy-mm-dd format"),
) -> TotalStatsByWeek:
if not dateutils.is_valid_date_range(begin, end, LOWER_BOUNT_START_DATE):
raise HTTPException(
status_code=416,
detail=f"Date range error: Should be between {LOWER_BOUNT_START_DATE} and tomorrow's date",
)
begin = dateutils.convert_date(begin)
end = dateutils.convert_date(end)

query = agg_total_by_week(begin, end)
response = request.app.connection[COLLECTION_NAME].aggregate(query)
# Work with the data in pandas
df = (
pd.json_normalize(list(response), max_level=1)
.sort_values(by="_id.outlet")
.reset_index(drop=True)
)
df.rename(
columns={
"_id.outlet": "outlet",
"_id.week": "week",
"_id.year": "year",
},
inplace=True,
)
# Replace outlet names if necessary
df["outlet"] = df["outlet"].replace(ID_MAPPING)
# Construct DataFrame and handle begin/end dates as datetimes for summing by week
df["w_begin"] = df.apply(lambda row: dateutils.get_week_bound(row["year"], row["week"], 0), axis=1)
df["w_end"] = df.apply(lambda row: dateutils.get_week_bound(row["year"], row["week"], 6), axis=1)
df["w_begin"], df["w_end"] = zip(*df.apply(lambda row: (pd.to_datetime(row["w_begin"]), pd.to_datetime(row["w_end"])), axis=1))
df = (
df.drop(columns=["week", "year"], axis=1)
.sort_values(by=["outlet", "w_begin"])
)
# In earlier versions, there was a bug due to which we returned weekly information for the same week begin date twice
# This bug only occurred when the last week of one year spanned into the next year (partial week across a year boundary)
# To address this, we perform summation of stats by week to avoid duplicate week begin dates being passed to the front end
df = df.groupby(["outlet", "w_begin", "w_end"]).sum().reset_index()
df["totalGenders"] = df["totalFemales"] + df["totalMales"] + df["totalUnknowns"]
df["perFemales"] = df["totalFemales"] / df["totalGenders"]
df["perMales"] = df["totalMales"] / df["totalGenders"]
df["perUnknowns"] = df["totalUnknowns"] / df["totalGenders"]
# Convert datetimes back to string for JSON serialization
df["w_begin"] = df["w_begin"].dt.strftime("%Y-%m-%d")
df["w_end"] = df["w_end"].dt.strftime("%Y-%m-%d")
df = df.drop(columns=["totalGenders", "totalFemales", "totalMales", "totalUnknowns"], axis=1)

# Convert dataframe to dict prior to JSON serialization
weekly_data = dict()
    for outlet in df["outlet"].unique():
        per_outlet_data = df[df["outlet"] == outlet].to_dict(orient="records")
        # Remove the redundant outlet key from each weekly record
        for item in per_outlet_data:
            item.pop("outlet")
weekly_data[outlet] = per_outlet_data
output = {"outlets": weekly_data}
return output
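The year-boundary deduplication in the endpoint above (summing stats grouped by outlet and week bounds) can be illustrated on a toy frame. A minimal sketch with made-up numbers, not real tracker data:

```python
import pandas as pd

# Two rows for the same outlet and the same week bounds, as can happen when
# a week straddles a year boundary and MongoDB's $week/$year split it in two.
df = pd.DataFrame(
    {
        "outlet": ["CBC News", "CBC News"],
        "w_begin": ["2019-12-29", "2019-12-29"],
        "w_end": ["2020-01-04", "2020-01-04"],
        "totalFemales": [3, 2],
        "totalMales": [10, 5],
    }
)

# Summing within (outlet, w_begin, w_end) collapses the duplicate week rows,
# mirroring the groupby().sum() step in the endpoint above.
deduped = df.groupby(["outlet", "w_begin", "w_end"]).sum().reset_index()
print(deduped.to_dict("records"))
```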
14 changes: 14 additions & 0 deletions api/english/gunicorn_conf.py
@@ -0,0 +1,14 @@
# gunicorn_conf.py to point gunicorn to the uvicorn workers
from multiprocessing import cpu_count

# Socket path
bind = 'unix:/path_to_code/GenderGapTracker/api/english/g-tracker.sock'

# Worker Options
workers = cpu_count() + 1
worker_class = 'uvicorn.workers.UvicornWorker'

# Logging Options
loglevel = 'debug'
accesslog = '/path_to_code/GenderGapTracker/api/english/access_log'
errorlog = '/path_to_code/GenderGapTracker/api/english/error_log'
48 changes: 48 additions & 0 deletions api/english/logging.conf
@@ -0,0 +1,48 @@
[loggers]
keys=root, gunicorn.error, gunicorn.access

[handlers]
keys=console, error_file, access_file

[formatters]
keys=generic, access

[logger_root]
level=INFO
handlers=console

[logger_gunicorn.error]
level=INFO
handlers=error_file
propagate=1
qualname=gunicorn.error

[logger_gunicorn.access]
level=INFO
handlers=access_file
propagate=0
qualname=gunicorn.access

[handler_console]
class=StreamHandler
formatter=generic
args=(sys.stdout, )

[handler_error_file]
class=logging.FileHandler
formatter=generic
args=('/var/log/gunicorn/error.log',)

[handler_access_file]
class=logging.FileHandler
formatter=access
args=('/var/log/gunicorn/access.log',)

[formatter_generic]
format=%(asctime)s [%(process)d] [%(levelname)s] %(message)s
datefmt=%Y-%m-%d %H:%M:%S
class=logging.Formatter

[formatter_access]
format=%(message)s
class=logging.Formatter
56 changes: 56 additions & 0 deletions api/english/main.py
@@ -0,0 +1,56 @@
from pathlib import Path

from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
from pymongo import MongoClient

from db.config import config
from endpoints.outlet_stats import outlet_router

# Constants
HOST = config["MONGO_HOST"]
PORT = config["MONGO_PORT"]
MONGO_ARGS = config["MONGO_ARGS"]
DB = config["DB_NAME"]
STATIC_PATH = "gender-gap-tracker"
STATIC_HTML = "tracker.html"

app = FastAPI(
title="Gender Gap Tracker",
description="RESTful API for the Gender Gap Tracker public-facing dashboard",
version="1.0.0",
)


@app.get("/", include_in_schema=False)
async def root() -> HTMLResponse:
with open(Path(f"{STATIC_PATH}") / STATIC_HTML, "r") as f:
html_content = f.read()
return HTMLResponse(content=html_content, media_type="text/html")


@app.on_event("startup")
def startup_db_client() -> None:
app.mongodb_client = MongoClient(HOST, PORT, **MONGO_ARGS)
app.connection = app.mongodb_client[DB]
print("Successfully connected to MongoDB!")


@app.on_event("shutdown")
def shutdown_db_client() -> None:
app.mongodb_client.close()


# Attach routes
app.include_router(outlet_router, prefix="/expertWomen", tags=["info"])
# Add additional routers here for future endpoints
# ...

# Serve static files for front end from directory specified as STATIC_PATH
app.mount("/", StaticFiles(directory=STATIC_PATH), name="static")


if __name__ == "__main__":
import uvicorn
uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)