diff --git a/README.md b/README.md
index a63d82e0..8308bac5 100644
--- a/README.md
+++ b/README.md
@@ -21,28 +21,31 @@ You need to have installed on your local machine
 * Django and other Python libraries

 On a Debian- or Ubuntu-based system, it may suffice (untested) to run
-  $ sudo apt-get install git-core python-django python-django-south python-simplejson
-On Mac OS, the easiest way may be to install pip:
-  http://www.pip-installer.org/en/latest/installing.html
-and then
-  $ pip install Django
+    sudo apt-get install git-core python-django python-django-south python-simplejson
+
+
+On Mac OS, the easiest way may be to install [pip](http://www.pip-installer.org/en/latest/installing.html) and then
+
+    pip install Django
+



 Initial setup
 -------------

-  $ python website/manage.py syncdb && python website/manage.py migrate
-  $ mkdir articles
+    python website/manage.py syncdb && python website/manage.py migrate
+    mkdir articles


 Running NewsDiffs Locally
 -------------------------

 Do the initial setup above. Then to start the webserver for testing:

-  $ python website/manage.py runserver
-and visit http://localhost:8000/
+    python website/manage.py runserver
+
+and visit [http://localhost:8000/](http://localhost:8000/)


 Running the scraper
@@ -51,24 +54,26 @@ Running the scraper
 Do the initial setup above. You will also need additional
 Python libraries; on a Debian- or Ubuntu-based system, it may suffice
 (untested) to run
-  $ sudo apt-get install python-bs4 python-beautifulsoup
+
+    sudo apt-get install python-bs4 python-beautifulsoup

 on a Mac, you will want something like

-  $ pip install beautifulsoup4
-  $ pip install beautifulsoup
-  $ pip install html5lib
+    pip install beautifulsoup4
+    pip install beautifulsoup
+    pip install html5lib

 Note that we need two versions of BeautifulSoup, both 3.2 and 4.0;
 some websites are parsed correctly in only one version.

 Then run
-  $ python website/manage.py scraper
+
+    python website/manage.py scraper

 This will populate the articles repository with a list of current news
 articles. This is a snapshot at a single time, so the website will
 not yet have any changes. To get changes, wait some time (say, 3
-hours) and run 'python website/manage.py scraper' again. If any of
+hours) and run `python website/manage.py scraper` again. If any of
 the articles have changed in the intervening time, the website should
 display the associated changes.

@@ -78,7 +83,7 @@ is cumulative).

 To run the scraper every hour, run something like:

-  $ while true; do python website/manage.py scraper; sleep 60m; done
+    while true; do python website/manage.py scraper; sleep 60m; done

 or make a cron job.

@@ -88,25 +93,25 @@ Adding new sites to the scraper
 The procedure for adding new sites to the scraper is outlined in
 parsers/__init__.py . You need to

- (1) Create a new parser module in parsers/ . This should be a
+ 1. Create a new parser module in parsers/ . This should be a
     subclass of BaseParser (in parsers/baseparser.py). Model it off the
     other parsers in that directory. You can test the parser with by
     running, e.g.,

-$ python parsers/test_parser.py bbc.BBCParser
+    `python parsers/test_parser.py bbc.BBCParser`

     which will output a list of URLs to track, and

-$ python parsers/test_parser.py bbc.BBCParser http://www.bbc.co.uk/news/uk-21649494
+    `python parsers/test_parser.py bbc.BBCParser http://www.bbc.co.uk/news/uk-21649494`

     which will output the text that NewsDiffs would store.

- (2) Add the parser to 'parsers' in parsers/__init__.py
+ 2. Add the parser to 'parsers' in parsers/__init__.py

-This should cause the scraper to start tracking the site.
+    This should cause the scraper to start tracking the site.

-To make the source display properly on the website, you will need
-minor edits to two other files: website/frontend/models.py and
-website/frontend/views.py (to define the display name and create a tab
-for the source, respectively). Search for 'bbc' to find the locations
-to edit.
+ 3. To make the source display properly on the website, you will need
+    minor edits to two other files: website/frontend/models.py and
+    website/frontend/views.py (to define the display name and create a tab
+    for the source, respectively). Search for 'bbc' to find the locations
+    to edit.
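
The README's hourly loop (`while true; do python website/manage.py scraper; sleep 60m; done`) could also be written in Python, which may be convenient on systems without a POSIX shell. The following is only a minimal sketch: the helper name `run_periodically` and its parameters are hypothetical and not part of NewsDiffs; only the scraper command itself comes from the README.

```python
import subprocess
import time

# Command taken from the README. Everything else in this sketch is an
# assumption made for illustration, not NewsDiffs code.
SCRAPER_CMD = ["python", "website/manage.py", "scraper"]


def run_periodically(cmd, interval_seconds, max_runs=None,
                     run=subprocess.call, sleep=time.sleep):
    """Run cmd, wait interval_seconds, and repeat.

    max_runs=None loops forever, like the shell `while true`.
    The run and sleep arguments exist only so the loop can be exercised
    without launching the scraper or actually waiting.
    Returns the number of runs performed once max_runs is reached.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        exit_code = run(cmd)
        if exit_code != 0:
            # Keep going after a failed run, as the shell loop would.
            print("scraper exited with status %s" % exit_code)
        runs += 1
        sleep(interval_seconds)
    return runs


# To scrape every hour until interrupted, one might call:
#     run_periodically(SCRAPER_CMD, interval_seconds=60 * 60)
```

Like the shell loop, this sleeps after every run, so the time between scrapes is the interval plus however long the scraper takes. A cron job, as the README suggests, avoids that drift.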