Backend/parsers #29

Open

wants to merge 75 commits into base: master
Commits (75)
573fc22
added gitignore and modified manage.py
anjakammer Apr 15, 2015
c6e220e
Merge pull request #2 from NewsdiffsDE/Setup/Project
thomaspuppe Apr 15, 2015
a02e0c1
First Version of the Zeit Online Parser
S3TH22 Apr 26, 2015
7e19cba
enhanced Baseparser
S3TH22 Apr 26, 2015
d0e77a2
modified frontend for new scrapers
anjakammer Apr 26, 2015
aeec6f6
parsing fails if change occurs
anjakammer Apr 26, 2015
49ee7b6
displays just German platforms
anjakammer Apr 27, 2015
674ef0a
translated article_history_missing.html
Relana Apr 28, 2015
b1d1b7c
Update 404.html
Relana Apr 28, 2015
5f49eb0
translated 500.html
Relana Apr 28, 2015
3919048
added Welt.de-Parser(seems to be working)
S3TH22 Apr 28, 2015
6082ecd
translated about.html
Relana Apr 28, 2015
1ca3ea8
translated article_history.html
Relana Apr 28, 2015
7abf9ca
translated article_history.xml
Relana Apr 28, 2015
89e4aad
translated browse.html
Relana Apr 28, 2015
416b2e7
translated browse_base.html
Relana Apr 28, 2015
e600556
translated contact.html
Relana Apr 28, 2015
9ba5037
translated diffview.html
Relana Apr 28, 2015
336fed6
translated part of examples.html
Relana Apr 28, 2015
518efb0
translated feed.xml
Relana Apr 28, 2015
7d79ccd
translated find_by_uri.html
Relana Apr 28, 2015
6e0bd13
edited front.html
Relana Apr 28, 2015
10afc18
translated navigation
Relana Apr 28, 2015
14f67a4
translated upvote.html
Relana Apr 28, 2015
d76b89d
optimized grabbing
anjakammer Apr 29, 2015
d457383
modified welt.py
anjakammer Apr 29, 2015
f1d9eda
working on bild.de
anjakammer Apr 29, 2015
9a1c85c
added more 'content' pages
anjakammer Apr 29, 2015
410b7cb
modified bild.py
anjakammer Apr 29, 2015
28d69ca
minor changes like href fixes
Relana Apr 29, 2015
1fd2e30
corrected Twitter handle and URL
Relana Apr 29, 2015
5b142d1
deleted images in about and front
Relana May 1, 2015
50bb43f
Create webhook.txt
mgummich May 4, 2015
e7560e6
Update webhook.txt
mgummich May 4, 2015
f4dbacb
Delete webhook.txt
mgummich May 4, 2015
0ddf9e0
corrected some links, deleted unused pages/routing
anjakammer May 4, 2015
d1004ed
deleted unused links
anjakammer May 4, 2015
0655d21
deleted contactForm script
anjakammer May 4, 2015
9caa86b
Merge pull request #3 from NewsdiffsDE/Frontend/ContentTranslation
anjakammer May 4, 2015
ba9cc94
Merge branch 'master' of https://github.com/NewsdiffsDE/newsdiffs int…
anjakammer May 4, 2015
8d90efa
Merge pull request #4 from NewsdiffsDE/master
anjakammer May 4, 2015
477147b
modified Bild/Focus Parser
anjakammer May 4, 2015
e5407a2
focus parser works without errors
anjakammer May 4, 2015
0df0199
added temp db file to gitignore
anjakammer May 4, 2015
e9f784b
kicked panorama from focus and modified bild.py
anjakammer May 4, 2015
a050100
bildParser works
anjakammer May 4, 2015
c298c41
deleted upvote function use
anjakammer May 4, 2015
d51e8d3
Merge pull request #5 from NewsdiffsDE/bugfix/upvote
anjakammer May 4, 2015
6854217
Merge branch 'dev' of https://github.com/NewsdiffsDE/newsdiffs into B…
anjakammer May 4, 2015
b10eb8c
5 Parsers complete, tested with test parser and Test-URLs. Might stil…
S3TH22 May 5, 2015
8c0a659
modified baseparser, new method-> remove_non_content
anjakammer May 5, 2015
fc43207
stern parser does not get article text
anjakammer May 5, 2015
8093037
stern parser works
anjakammer May 6, 2015
cce1e22
Added byline-attribute for parsers
S3TH22 May 6, 2015
b6e9d24
completed stern.de parser
anjakammer May 8, 2015
6b521e7
completed spiegel.de parser
anjakammer May 8, 2015
c4f3061
Merge branch 'Backend/Parsers_6-10' of https://github.com/NewsdiffsDE…
S3TH22 May 11, 2015
7532011
Fixed parsers crashing with false articles
S3TH22 May 12, 2015
245baa0
merged
anjakammer May 12, 2015
3ab60c6
Merge branch 'Backend/Parser/Spiegel' of https://github.com/Newsdiffs…
anjakammer May 12, 2015
9ad0dcd
finished CodeReview, zeit.de was tested
anjakammer May 12, 2015
f9b37a8
FAZ completed, baseparser refactoring
anjakammer May 12, 2015
504a063
CodeReview completed, welt and ntv do not run
anjakammer May 12, 2015
8830c2c
modified view for all Parsers
anjakammer May 12, 2015
69a66c2
removed 'ü' in Süddeutsche
anjakammer May 12, 2015
8054c76
refactored all parsers
anjakammer May 12, 2015
d04361d
reduced faz paths
anjakammer May 12, 2015
b145298
refactored faz
anjakammer May 12, 2015
a9f1fc3
modified welt.py, author infos cannot be stripped
anjakammer May 19, 2015
53c0c9d
optimized n-tv parser
anjakammer May 19, 2015
11283be
scraper has all parsers
anjakammer May 19, 2015
5014f71
N-TV-Parsers URL-RegEx-Pattern only recognized articles with the requ…
S3TH22 Jun 8, 2015
8904776
Mediathek recognition improved, date parsing corrected
S3TH22 Jun 8, 2015
7e7e576
URL list is now a set, so redundant links are dropped.
S3TH22 Jun 8, 2015
5b64531
N-TV fixed
S3TH22 Jun 10, 2015
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,4 @@
/.idea
/articles

*~
@@ -30,4 +31,5 @@ pip-log.txt
.mr.developer.cfg

newsdiffs.db
newsdiffs.db-journal
database_settings.py
40 changes: 40 additions & 0 deletions parsers/RPOnline.py
@@ -0,0 +1,40 @@
from baseparser import BaseParser
from BeautifulSoup import BeautifulSoup, Tag


class RPOParser(BaseParser):
    domains = ['www.rp-online.de']

    feeder_pat = '1\.\d*$'
    feeder_pages = ['http://www.rp-online.de/']

    def _parse(self, html):
        soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES,
                             fromEncoding='utf-8')
        self.meta = soup.findAll('meta')
        #article headline
        elt = soup.find('meta', {'property': 'og:title'})
        if elt is None:
            self.real_article = False
            return
        self.title = elt['content']
        # byline / author
        author = soup.find('meta', {'itemprop': 'author'})
        self.byline = author['content'] if author else ''
        # article date
        created_at = soup.find('meta', {'property': 'vr:published_time'})
        self.date = created_at['content'] if created_at else ''
        #article content
        div = soup.find('div', {'class': 'main-text '})
        intro = soup.find('div', {'class': 'first intro'})
        if intro is None:
            intro = ''
        else:
            intro = intro.find('strong').getText()
        if div is None:
            self.real_article = False
            return
        div = self.remove_non_content(div)
        self.body = intro
        self.body += '\n' + '\n\n'.join([x.getText() for x in div.childGenerator()
                                         if isinstance(x, Tag) and x.name == 'p'])
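As a side note for reviewers: the snippet below is not part of the diff. It is a minimal sketch of the shared meta-tag lookup these parsers rely on, run against a made-up inline document with the same BeautifulSoup 3 calls, and it guards against a missing tag before indexing it. All names and values in it are illustrative only.

from BeautifulSoup import BeautifulSoup

# Illustrative only -- the HTML snippet and its values are made up.
html = '''<html><head>
<meta property="og:title" content="Beispielartikel" />
<meta itemprop="author" content="Erika Mustermann" />
</head><body><div class="main-text "><p>Erster Absatz.</p></div></body></html>'''

soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
title_tag = soup.find('meta', {'property': 'og:title'})
title = title_tag['content'] if title_tag else None  # check before indexing
author_tag = soup.find('meta', {'itemprop': 'author'})
byline = author_tag['content'] if author_tag else ''
print title, byline  # Beispielartikel Erika Mustermann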
29 changes: 23 additions & 6 deletions parsers/__init__.py
@@ -4,15 +4,32 @@
# - create a parser class in another file, based off (say) bbc.BBCParser
# - add it to parsers (below)
# Test with test_parser.py

# List of parsers to import and use based on parser.domains

"""
sueddeutsche.SDParser
stern.SternParser
bild.BildParser
focus.FocusParser
spiegel.SpiegelParser
zeit.ZeitParser
RPOnline.RPOParser
faz.FAZParser
n-tv.NTVParser
welt.WeltParser
"""

parsers = """
nyt.NYTParser
cnn.CNNParser
politico.PoliticoParser
bbc.BBCParser
washpo.WashPoParser
sueddeutsche.SDParser
stern.SternParser
bild.BildParser
focus.FocusParser
spiegel.SpiegelParser
zeit.ZeitParser
RPOnline.RPOParser
faz.FAZParser
n-tv.NTVParser
welt.WeltParser
""".split()

parser_dict = {}
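For reference, the recipe in the header comment above ("create a parser class in another file ... add it to parsers ... Test with test_parser.py") translates into roughly the skeleton below. The module name, domain, URL pattern and CSS class are hypothetical placeholders, not code from this pull request.

from baseparser import BaseParser
from BeautifulSoup import BeautifulSoup


class ExampleParser(BaseParser):
    # Placeholder values for illustration only.
    domains = ['www.example.de']

    feeder_pat = '^http://www.example.de/nachrichten/'
    feeder_pages = ['http://www.example.de/']

    def _parse(self, html):
        soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES,
                             fromEncoding='utf-8')
        elt = soup.find('meta', {'property': 'og:title'})
        if elt is None:
            self.real_article = False
            return
        self.title = elt['content']
        self.byline = ''
        self.date = ''
        div = soup.find('div', {'class': 'article-text'})  # hypothetical content container
        if div is None:
            self.real_article = False
            return
        div = self.remove_non_content(div)
        self.body = '\n'.join(p.getText() for p in div.findAll('p'))

The new module would then be listed in the parsers string above (for example as example.ExampleParser) and exercised with test_parser.py, as the header comment suggests.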
12 changes: 11 additions & 1 deletion parsers/baseparser.py
@@ -5,6 +5,7 @@
import sys
import time
import urllib2
from BeautifulSoup import BeautifulSoup, Comment

# Define a logger

@@ -153,4 +154,13 @@ def feed_urls(cls):

        all_urls = all_urls + [url for url in urls if
                               re.search(cls.feeder_pat, url)]
        return all_urls
        return set(all_urls)

    #removes all non-content
    def remove_non_content(self, html):
        map(lambda x: x.extract(), html.findAll('script'))
        map(lambda x: x.extract(), html.findAll('style'))
        map(lambda x: x.extract(), html.findAll('embed'))
        comments = html.findAll(text=lambda text: isinstance(text, Comment))
        [comment.extract() for comment in comments]
        return html
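The change to feed_urls above returns set(all_urls) instead of a list, so a link collected from several feeder pages (or matched twice on one page) is only kept once. A toy illustration, with made-up URLs:

# Made-up URLs, purely to show the deduplication set() provides.
urls = ['http://www.n-tv.de/politik/beispiel-article101.html',
        'http://www.n-tv.de/politik/beispiel-article101.html',
        'http://www.n-tv.de/wirtschaft/beispiel-article202.html']
print len(urls), len(set(urls))  # 3 2 -- the duplicate collapses

Callers that only iterate over the result are unaffected; anything that relied on list order or indexing would need a second look.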
33 changes: 0 additions & 33 deletions parsers/bbc.py

This file was deleted.

46 changes: 46 additions & 0 deletions parsers/bild.py
@@ -0,0 +1,46 @@
from baseparser import BaseParser
from BeautifulSoup import BeautifulSoup


class BildParser(BaseParser):
    SUFFIX = ''
    domains = ['www.bild.de']

    feeder_pat = '^http://www.bild.de/(politik|regional|geld|digital/[a-z])'
    feeder_pages = ['http://www.bild.de/politik/startseite',
                    'http://www.bild.de/geld/startseite/',
                    'http://www.bild.de/regional/startseite/',
                    'http://www.bild.de/digital/startseite/']

    def _parse(self, html):
        soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES,
                             fromEncoding='utf-8')

        self.meta = soup.findAll('meta')
        #article headline
        try:
            elt = soup.find('meta', {'property': 'og:title'})['content']
            self.title = elt
        except:
            self.real_article = False
            return

        # byline / author
        author = soup.find('div', {'itemprop':'author'})
        self.byline = author.getText() if author else ''
        # article date
        created_at = soup.find('div', {'class': 'date'})
        self.date = created_at.getText() if created_at else ''
        #article content
        div = soup.find('div', {'itemprop':'articleBody isFamilyFriendly'})
        if div is None:
            self.real_article = False
            return
        div = self.remove_non_content(div)
        map(lambda x: x.extract(), div.findAll('div', {'class':'infoEl center edge'}))  # commercials
        text = ''
        p = div.findAll('p')
        for txt in p:
            text += txt.getText() + '\n'
        self.body = text

38 changes: 0 additions & 38 deletions parsers/cnn.py

This file was deleted.

48 changes: 48 additions & 0 deletions parsers/faz.py
@@ -0,0 +1,48 @@
from baseparser import BaseParser
from BeautifulSoup import BeautifulSoup, Tag


class FAZParser(BaseParser):
    domains = ['www.faz.net']

    feeder_pat = 'aktuell/.*\.html$'
    feeder_pages = ['http://www.faz.net/aktuell/finanzen',
                    'http://www.faz.net/aktuell/gesellschaft',
                    'http://www.faz.net/aktuell/politik',
                    'http://www.faz.net/aktuell/wirtschaft',
                    'http://www.faz.net/aktuell/wissen',
                    'http://www.faz.net/aktuell/feuilleton',
                    ]

    def _parse(self, html):
        soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES,
                             fromEncoding='utf-8')
        self.meta = soup.findAll('meta')
        #article headline
        elt = soup.find('meta', {'property': 'og:title'})
        if elt is None:
            self.real_article = False
            return
        self.title = elt['content']
        # byline / author
        author = soup.find('meta', {'name': 'author'})
        self.byline = author['content'] if author else ''
        # article date
        created_at = soup.find('meta', {'name': 'DC.date.issued'})
        self.date = created_at['content'] if created_at else ''
        #article content
        div = soup.find('div', 'FAZArtikelContent')
        if div is None:
            self.real_article = False
            return
        div = self.remove_non_content(div)
        map(lambda x: x.extract(), div.findAll('span', {'class':'autorBox clearfix'}))  # Author description
        map(lambda x: x.extract(), div.findAll('p', {'class':'WeitereBeitraege'}))  # more articles like that one
        map(lambda x: x.extract(), div.findAll('ul', {'class':'WBListe'}))  # other articles from this author

        div = div.find('div', {'class': ''})
        if hasattr(div, "childGenerator"):
            self.body = '\n' + '\n\n'.join([x.getText() for x in div.childGenerator()
                                            if isinstance(x, Tag) and x.name == 'p'])
        else:
            self.real_article = False
44 changes: 44 additions & 0 deletions parsers/focus.py
@@ -0,0 +1,44 @@
from baseparser import BaseParser
from BeautifulSoup import BeautifulSoup


class FocusParser(BaseParser):
    SUFFIX = '?drucken=1'
    domains = ['www.focus.de']

    feeder_pat = '^http://www.focus.de/(politik|finanzen|gesundheit|wissen)'
    feeder_pages = ['http://www.focus.de/']

    def _parse(self, html):
        soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES,
                             fromEncoding='utf-8')

        self.meta = soup.findAll('meta')
        #article headline
        elt = soup.find('h1')
        if elt is None:
            self.real_article = False
            return
        self.title = elt.getText()
        # byline / author
        try:
            author = soup.find('a', {'rel':'author'}).text
        except:
            author = ''
        self.byline = author
        # article date
        created_at = soup.find('meta', {'name':'date'})
        self.date = created_at['content'] if created_at else ''
        #article content
        self.body = ''
        div = soup.find('div', 'articleContent')
        if div is None:
            self.real_article = False
            return
        div = self.remove_non_content(div)
        map(lambda x: x.extract(), div.findAll('div', {'class':'adition'}))  # focus
        text = ''
        p = div.findAll('p')
        for txt in p:
            text += txt.getText() + '\n'
        self.body = text
42 changes: 42 additions & 0 deletions parsers/n-tv.py
@@ -0,0 +1,42 @@
from baseparser import BaseParser
from BeautifulSoup import BeautifulSoup, Tag


class NTVParser(BaseParser):
    domains = ['www.n-tv.de']

    feeder_pat = '^http://www.n-tv.de/(politik|wirtschaft|panorama|technik|wissen)/.*article\d*'
    feeder_pages = ['http://www.n-tv.de']

    def _parse(self, html):
        soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES,
                             fromEncoding='utf-8')
        self.meta = soup.findAll('meta')
        # Remove any potential "rogue" video articles that bypass the URL check
        try:
            if 'Mediathek' in soup.find('title').getText():
                self.real_article = False
                return
        except:
            pass
        #article headline
        elt = soup.find('h1', {'class': 'h1'})
        if elt is None:
            self.real_article = False
            return
        self.title = elt.getText()
        # byline / author
        author = soup.find('p', {'class': 'author'})
        self.byline = author.getText() if author else ''
        # article date
        created_at = soup.find('div', {'itemprop': 'datePublished'})
        self.date = created_at['content'] if created_at else ''
        #article content
        div = soup.find('div', {'class': 'content'})
        if div is None:
            self.real_article = False
            return
        div = self.remove_non_content(div)
        map(lambda x: x.extract(), div.findAll('p', {'class': 'author'}))
        self.body = '\n' + '\n\n'.join([x.getText() for x in div.childGenerator()
                                        if isinstance(x, Tag) and x.name == 'p'])