GlobalGiving Depth

Problem Statement: GlobalGiving’s network consists of many organizations based in the US, along with some nonprofits in other countries. There are hundreds of thousands more organizations that GlobalGiving knows of but for which it may lack information about the type of work they do. Given an NGO's website, it is possible to characterize that NGO's work automatically using statistics, natural language processing, and machine learning.

Repo Description: This repo contains our approaches to characterizing the work of NGOs. These approaches fall into a few categories:

  • Classification (see /classification folder for code, details, and examples)

Using machine learning classifiers, we can feed in text from an NGO's website and predict, with reasonable accuracy, the categories that NGO may fall into. The classifiers provided here are a Stochastic Gradient Descent classifier and a Bag of Words classifier.
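As a rough illustration of the idea (not the code in /classification), the sketch below turns text into bag-of-words count vectors and fits a binary logistic-regression classifier with stochastic gradient descent, using only the standard library. The function names, the toy texts, and the two labels are hypothetical and exist only for this example.

```python
import math
import random
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(texts):
    """Map each distinct token to a column index."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Return a sparse {column index: count} vector; unknown tokens are dropped."""
    counts = Counter(tokenize(text))
    return {vocab[t]: c for t, c in counts.items() if t in vocab}


def score(vector, weights, bias):
    """Linear score of a sparse vector under the current model."""
    return bias + sum(weights[j] * v for j, v in vector.items())


def train_sgd(vectors, labels, n_features, epochs=50, lr=0.1, seed=0):
    """Fit binary logistic regression one example at a time (SGD)."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    bias = 0.0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            prob = 1.0 / (1.0 + math.exp(-score(vectors[i], weights, bias)))
            error = prob - labels[i]  # gradient of the log loss w.r.t. the score
            for j, v in vectors[i].items():
                weights[j] -= lr * error * v
            bias -= lr * error
    return weights, bias


def predict(vector, weights, bias):
    """Return 1 if the score is positive, else 0."""
    return 1 if score(vector, weights, bias) > 0 else 0


# Hypothetical training data: 1 = "education", 0 = "health".
texts = [
    "we build schools and train teachers",
    "scholarships for students and teachers",
    "mobile clinics bring doctors to villages",
    "vaccines and clinics for rural health",
]
labels = [1, 1, 0, 0]

vocab = build_vocabulary(texts)
vectors = [bag_of_words(t, vocab) for t in texts]
weights, bias = train_sgd(vectors, labels, len(vocab))

education_guess = predict(
    bag_of_words("scholarships help students and teachers build schools", vocab),
    weights, bias)
health_guess = predict(
    bag_of_words("doctors bring vaccines to mobile clinics", vocab),
    weights, bias)
```

A real pipeline would use many more documents, a proper tokenizer, and one classifier per category (or a multi-label model); the notebooks in /classification show the approach actually used.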

  • Clustering (see /clustering folder for code, details, and examples)

GlobalGiving's existing categorization scheme serves its current purposes, but a scheme derived from differences in the language used on NGO websites would be more useful for identifying and characterizing unknown NGOs. The clustering algorithms provided here are a K-Means implementation using document embeddings and an implementation of Latent Dirichlet Allocation (LDA).
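To make the K-Means idea concrete, here is a minimal sketch of Lloyd's algorithm in plain Python. It assumes each document has already been turned into an embedding vector; the 2-D vectors below are invented for illustration, and the function names are hypothetical rather than taken from /clustering.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the algorithm has converged
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean(members)
    return assignments, centroids


# Hypothetical document embeddings: three documents near (1, 0), three near (0, 1).
embeddings = [
    [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],
]

assignments, centroids = k_means(embeddings, k=2)
```

Cluster indices are arbitrary, so only the grouping is meaningful: documents that share an index were placed in the same cluster.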

  • Processing (see /processing folder for code, details, and examples)

How we obtain and process the data is just as important as how we classify or cluster it. For this project, we used an HTML parser built on the BeautifulSoup library to pull clean, filtered text from NGO websites.
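The sketch below shows the general shape of this step: strip markup, skip non-visible content such as scripts and styles, and collapse whitespace. To stay dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it is not the parser in /processing; the class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping content inside non-visible tags."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left over from the markup layout.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    """Return the visible text of an HTML document as a single line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page, standing in for a downloaded NGO website.
sample_html = """
<html>
  <head><title>Example NGO</title><style>body { color: red; }</style></head>
  <body>
    <script>console.log("tracking");</script>
    <h1>Clean water for all</h1>
    <p>We drill wells in rural communities.</p>
  </body>
</html>
"""

clean_text = extract_text(sample_html)
```

The resulting string is what would be handed to the classification or clustering code; additional filtering (navigation menus, cookie banners, very short pages) is left out of this sketch.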

  • Past approaches

Refer to the wiki to read about other approaches that were tried and abandoned.

Getting Started

Installation

This project was built in Python 3.7. Dependencies can be installed into a virtual environment from the requirements.txt file using pipenv:

pipenv install -r requirements.txt

For LDA: Use the NLTK Downloader to obtain “stopwords,” “WordNetLemmatizer,” and other resources from the Natural Language Toolkit. For more information on the NLTK Downloader, please refer to the NLTK documentation.
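One possible way to script the download is sketched below. It assumes the resource identifiers `stopwords` and `wordnet` (the corpus that `WordNetLemmatizer` relies on); the other resources the LDA code needs are not listed here, so extend the list as required. The helper name is hypothetical.

```python
# Resource identifiers are an assumption based on the NLTK data index;
# add any other resources the LDA notebooks ask for.
NLTK_RESOURCES = ["stopwords", "wordnet"]


def download_nltk_resources(resources=NLTK_RESOURCES):
    """Download each resource and report whether nltk.download succeeded."""
    import nltk  # imported lazily so this file loads even before NLTK is installed

    return {name: nltk.download(name) for name in resources}
```

Call `download_nltk_resources()` once from inside the pipenv environment; it needs network access, and already-downloaded resources are simply reported as up to date.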

Usage

Each subfolder (classification, clustering, processing) has Jupyter notebooks with examples of code usage. Refer to the wiki for detailed function documentation.

Team

Software Devs

License

This project is licensed under the terms of the MIT license.