Some fixes regarding dependencies for Python >= 3.11. The NumPy version was incompatible with Colab; this is now fixed. Also, the Nepali language code had a typo: it was "np" instead of "ne". This is now fixed.
Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module. It is now much easier to add new languages. Additionally, added support for Google News as a source. You can now search and parse news based on keywords, topic, location or website. Integrated cloudscraper as an optional dependency. If installed, it will use cloudscraper as a layer over requests. Cloudscraper tries to bypass Cloudflare protection. We now use two evaluation datasets: the one from scrapinghub and one created by us from the top 200 most popular websites. This will help keep track of future improvements and give a clear view of the impact of the changes.
We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset
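The optional cloudscraper layer described above follows a common optional-dependency pattern. The following is only a rough sketch of that pattern, not the code used in this package: `optional_import` and `make_session` are hypothetical helper names introduced for illustration. It relies on `cloudscraper.create_scraper()` and `requests.Session()`, which those libraries provide.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def make_session():
    """Prefer a cloudscraper session; fall back to plain requests.

    Hypothetical helper, shown only to illustrate the fallback order.
    """
    cloudscraper = optional_import("cloudscraper")
    if cloudscraper is not None:
        # create_scraper() returns a requests.Session subclass
        return cloudscraper.create_scraper()
    requests = optional_import("requests")
    if requests is None:
        raise RuntimeError("neither cloudscraper nor requests is installed")
    return requests.Session()
```

Because the scraper object behaves like a `requests` session, calling code can use `session.get(url)` in either case.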
- lang: ⚡ Rework of tokenizer. Additionally implemented new (easier) way of adding languages to the package (0833859) (by Andrei Paraschiv)
- lang: 🚀 added support for another 13 languages (fd41af5) (by Andrei Paraschiv)
- lang: 📝 Added stopwords for af, br, ca, eo, eu, ga, gl, gu, ha, hy, ku, ms, so, st, tl, ur, yo, zu from https://github.com/stopwords-iso (bba7a99) (by Andrei Paraschiv)
- lang: 📝 Added Burmese language (13670c3) (by Andrei Paraschiv)
- lang: 📝 Added Slovak language support (4ff82a8) (by Andrei Paraschiv)
- lang: 📝 Added Czech Language support (afcdc27) (by Andrei Paraschiv)
- lang: 📝 Added Latvian language support (89f3152) (by Andrei Paraschiv)
- lang: 📝 Added Telugu Language support (f0f8133) (by Andrei Paraschiv)
- lang: 📝 Added Marathi language support (ef40042) (by Andrei Paraschiv)
- lang: 📝 Added Georgian language support (afca45b) (by Andrei Paraschiv)
- lang: 📝 Added Tamil language support (0bd48ec) (by Andrei Paraschiv)
- lang: 📝 Added Bengali language support (7a08fc2) (by Andrei Paraschiv)
- parse: ✨ added filter that limits the source.build to a specific category. use source.build(url,only_in_path=True) to scrape only stories that are in the starting url path (665f6fe) (by Andrei Paraschiv)
- parse: 🔥 Source object is now pickleable (af3f80f) (by Andrei Paraschiv)
- parse: 🔥 article is now pickleable (f564524) (by Andrei Paraschiv)
- sources: ✨ New integration of Google news using GNews module. You can now use GoogleNewsSource to search and parse news based on keywords, topic, location or website (33c3409) (by Andrei Paraschiv)
- sources: ✨ new option when building sources. You can limit the article parsing to the source home page only. Other categories or feeds are then ignored (6b8c23e) (by Andrei Paraschiv)
- misc: 📈 added cloudscraper as optional dependency. If installed, it will use cloudscraper as a layer over requests. Cloudscraper tries to bypass Cloudflare protection (720bfe4) (by Andrei Paraschiv)
- misc: better typing support and type hinting Author: Tom Parker-Shemilt <palfrey@***.net>
- misc: Simplify favicon return Author: Tom Parker-Shemilt <palfrey@***.net>
- misc: Basic mypy support Author: Tom Parker-Shemilt <palfrey@***.net>
- core: added language dependencies, cloudscraper and gnews as optional (cd921a3) (by Andrei Paraschiv)
- doc: 📝 adding evaluation results
- doc: 🚀 Documentation Update. Added Examples, documented new features
- doc: 🔥 Added typing and docstrings to most of the code
- lang: moving all language related files in languages folder
- lang: added valid_languages function that returns available languages
- misc: ⚡ removed ParsingCandidate, RawHelper, URLHelper classes. Removed link_hash from article (was never used)
- parse: article.link_hash is no longer available
- parse: ✨ Tidying up the gravity scoring process. No changes in the final score result
- parse: 🚀 compute word statistics for a node taking children nodes into account
- core: Minimum Python now 3.8; Also test 3.10/11/12 Author: Tom Parker-Shemilt <palfrey@***.net>
- core: run gh actions on PR's. Author: Tom Parker-Shemilt <palfrey@***.net>
- core: Set SETUPTOOLS_USE_DISTUTILS. setuptools as per numpy recommendations. Upgrade numpy and pandas for >= 3.9. Author: Tom Parker-Shemilt <palfrey@***.net>
- core: Upgrade regex, virtualenv to avoid breaking pre-commit, distutils for everyone. Author: Tom Parker-Shemilt <palfrey@***.net>
- parse: 💥 deprecated text_cleaned, clean_doc. Removed clean_top_node, article.clean_top_node is removed. Failures if it was accessed
- lang: ⚡ better is_highlink_density for non-latin languages (a3b6250) (by Andrei Paraschiv)
- parse: 🐛 fixed an issue with non latin high density detection (17a2dad) (by Andrei Paraschiv)
- parse: 🐛 better feed discovery in Source objects (7a3abe9) (by Andrei Paraschiv)
- parse: 🔥 better binary content detection (7ad77cf) (by Andrei Paraschiv)
- parse: ⚡ Better title parsing. Added language specific regex for article titles (d5e8b2b) (by Andrei Paraschiv)
- parse: ⚡ get feeds fixed, it was not parsing the main page for possible feeds (2f7b698) (by Andrei Paraschiv)
- parse: 🔥 better article paragraph detection (0096999) (by Andrei Paraschiv)
- parse: ⚡ added figure as a tag to be removed before text generation (5a226e0) (by Andrei Paraschiv)
- parse: ⚡ Bug with autodetecting website language. If no language supplied, the detected language was not used (07076cb) (by Andrei Paraschiv)
- misc: ✨ tidying up some code in urls.py (3bb4ca9) (by Andrei Paraschiv)
- misc: 🚑 python-setup github action version bump (5bb581e) (by Andrei Paraschiv)
- misc: 🎨 mypy stubs for gnews and cloudscraper + small typing fixes (2644f7a) (by Andrei Paraschiv)
- cli: json output in stdout missing (by Andrei Paraschiv)
- types: 🎨 added stubs for gnews (86d7128) (by Andrei Paraschiv)
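The `only_in_path=True` filter listed above restricts scraping to stories under the starting URL path. The sketch below illustrates that idea with the standard library only. It is an assumption about how such a filter could work, not the package's actual implementation, and `in_start_path` and `filter_urls` are hypothetical names.

```python
from urllib.parse import urlparse


def in_start_path(start_url, candidate_url):
    """True if candidate_url is on the same host and under start_url's path."""
    start, cand = urlparse(start_url), urlparse(candidate_url)
    if start.netloc != cand.netloc:
        return False
    # Compare on whole path segments so /sports does not match /sportsbook
    prefix = start.path.rstrip("/") + "/"
    return (cand.path.rstrip("/") + "/").startswith(prefix)


def filter_urls(start_url, urls):
    """Keep only the URLs located under the starting URL path."""
    return [u for u in urls if in_start_path(start_url, u)]
```

Comparing on whole path segments is a design choice of this sketch; a real filter may treat query strings, subdomains or redirects differently.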
Some major changes in document parsing. In previous versions the chance that parts of the article body were missing was high. In addition, in some cases the order of the paragraphs was not correct. This release should fix these issues.
Highlighted features:
- You can now use the module as a command line interface (CLI). Usage: `python -m newspaper --url https://www.test.com`. More information in the documentation.
- I have added an evaluation script against a dataset from scrapinghub. This will help keep track of future improvements.
- Better handling of multithreaded requests. The previous version had a bug that could lead to a deadlock. I implemented ThreadPoolExecutor from the concurrent.futures module, which is more stable. The previous `news_pool` was replaced with a `fetch_news()` function.
- Caching is now much more flexible. You can disable it completely or for one request.
- You can now use the `newspaper.article()` function for convenience. It will create, download and parse an article in one step. It takes all the parameters of the `Article` class.
- Sites protected by Cloudflare are better detected and raise an exception. The reason will be in the exception message.
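The move to `ThreadPoolExecutor` mentioned above can be pictured with a small, generic sketch. This is not the signature or implementation of `fetch_news()`: `fetch_all` and `fake_download` are hypothetical names, and the download function is a stand-in that performs no network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, download, threads=4):
    """Run download(url) for every URL on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map yields results in the order of the input URLs
        return list(pool.map(download, urls))


def fake_download(url):
    """Stand-in for a real HTTP request (assumption for illustration)."""
    return f"<html>{url}</html>"


if __name__ == "__main__":
    print(fetch_all(["a", "b", "c"], fake_download))
```

The `with` block waits for all workers to finish before returning, so the pool manages thread shutdown itself.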
- category: ✨ improved category link parsing / category link detection (41677b0) (by Andrei)
- category: ⚡ Added option to disable the category_url cache for Source objects. Refactored the cache_disk decorator (670aad9) (by Andrei)
- cli: ✨ added command line interface (CLI) for the module. Usage: `python -m newspaper --url https://www.test.com` (f46b443) (by Andrei)
- cli: added output format "text" (31b9079) (by Andrei)
- core: Article.download() and Article.parse() now return self. Calls can be chained (3be1e47) (by Andrei)
- lang: 🎨 automatically load nltk punkt if not present (d0fcdd8) (by Andrei)
- nlp: added the keyword scores as a dictionary attribute in Articles. Additionally, config.MAX_KEYWORDS is really taken into consideration when computing article keywords (f51a04f) (by Andrei)
- parse: 🚀 improvements in the article body extraction. some sections that were ignored are now added to the extracted text. (1af12d2) (by Andrei)
- parse: ✨ better parametrization of top_node detection. magic constants moved out of the score computation (6485c40) (by Andrei)
- parse: 🚩 added some Author detection tags (Issue #347) (4aebf29) (by Andrei)
- parse: added fine-grained score for top node article attribute booster (0d41fc7) (by Andrei)
- parse: Added twitch as a video provider (Issue #349, #348) (f4d8f0f) (by Andrei)
- parse: minor improvement on top node detection (95d5cfa) (by Andrei)
- parse: parsing rules improvements suggested by @aleksandar-devedzic in issue #577 (8677dbe) (by Andrei)
- requests: 🔖 Added redirection history from the request calls in Article.download (8ca3d40) (by Andrei)
- requests: 📈 added a binary file detection. Files that are known binary content-types or have in the first 1000 bytes more than 40% non-ascii characters will raise an exception in article.download. (e7a60dd) (by Andrei)
- tests: ✨ added evaluation script to test against the dataset from https://github.com/scrapinghub/article-extraction-benchmark/ (737c226) (by Andrei)
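The binary file detection rule listed above (known binary content-types, or more than 40% non-ascii characters in the first 1000 bytes) can be sketched as follows. This is a minimal illustration of the stated rule, not the package's code: `looks_binary` is a hypothetical name, and the list of content-type prefixes is an assumption chosen for the example.

```python
def looks_binary(content, content_type="", sample_size=1000, threshold=0.4):
    """Heuristic check following the rule described in the changelog entry."""
    # Illustrative prefixes only; the real list of binary types is not given here
    binary_prefixes = ("image/", "video/", "audio/", "application/pdf", "application/zip")
    if content_type.lower().startswith(binary_prefixes):
        return True
    sample = content[:sample_size]
    if not sample:
        return False
    # Iterating over bytes yields integers; values above 127 are non-ascii
    non_ascii = sum(1 for byte in sample if byte > 127)
    return non_ascii / len(sample) > threshold
```

A caller could raise an exception when `looks_binary(response_bytes, content_type_header)` is true, which mirrors the behaviour the entry describes for `article.download`.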
- bug: 💄 instead of memorize_articles the option / function / parameter was memoize_articles (aaef712) (by Andrei)
- bug: MEMO_DIR is now Path object. addition with str forgotten from refactoring (0b98e71) (by Andrei)
- depend: removed feedfinder2 as dependency. was not used (c230aca) (by Andrei)
- doc: some minor documentation changes (764742a) (by Andrei)
- lang: added additional stopwords for "fa". Issue #398 (3453538) (by Andrei)
- lang: 💬 fixed serbian stopwords. added Cyrillic version (Issue #389) (dfcb760) (by Andrei)
- parse: itemprop containing but not equal to articleBody (510be0e) (by Andrei)
- parse: 🎨 removed some additional advertising snippets (bd30d48) (by Andrei Paraschiv)
- parse: 📈 removed possible image caption remains from cleaned article text (Issue #44) (7298140) (by Andrei)
- parse: 🌐 image parsing and movie parsing improvements. get links from additional attributes such as "data-src". (c02bb23) (by Andrei)
- parse: 📝 exclude some tags from get_text. Tags such as script, option can add garbage to the text output (f0e1965) (by Andrei Paraschiv)
- parse: 📝 Improved newline generation based on block level tags. 's are better taken into account. (22327d8) (by Andrei)
- parse: added youtu.be to video sources (bf516a1) (by Andrei)
- parse: additional fixes for caption (3e7fdcc) (by Andrei)
- refactor: deprecated non pythonic configuration attributes (all caps vs lower caps). for the moment both approaches work (691e12f) (by Andrei)
- sec: bump nltk and requests min version (553ef27) (by Andrei)
- sources: 🐛 fixed a problem with some type of articlelinks. (9a5c0e2) (by Andrei)
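The deprecation of all-caps configuration attributes noted above, with both spellings still working, can be illustrated with a toy class. This is a sketch of one possible approach and makes no claim about how the package implements it: `Config` here is a hypothetical stand-in, the attribute values are made up, and only reading (not assigning) the old names is covered.

```python
import warnings


class Config:
    """Toy configuration with lowercase attributes as the canonical names."""

    def __init__(self):
        # Illustrative values, not the package defaults
        self.max_keywords = 35
        self.memoize_articles = True

    def __getattr__(self, name):
        # Called only when normal lookup fails, e.g. for MAX_KEYWORDS
        lower = name.lower()
        if name.isupper() and lower in self.__dict__:
            warnings.warn(
                f"{name} is deprecated, use {lower}",
                DeprecationWarning,
                stacklevel=2,
            )
            return self.__dict__[lower]
        raise AttributeError(name)
```

With this sketch, `Config().MAX_KEYWORDS` returns the same value as `Config().max_keywords` while emitting a `DeprecationWarning`.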
- version bump (f7107be) (by Andrei)
- tests: Add test case for (592f6f6) (by Andrei)
- parse: added possibility to follow "read more" links in articles (0720de1) (by Andrei)
- core: Allow to pass any requests parameter to the Article constructor. You can now pass verify=False in order to ignore certificate errors (issue #462) (5ff5d27) (by Andrei)
- lang: Macedonian file raises an error (cadea6a) (by Murat Çorlu)
- parse: extended data parsing of json-ld metadata (issue #518) (fc413af) (by Andrei)
- tests: added script to create test cases (9df8c16) (by Andrei)
- parse: added tag for date detection issue #835 (41152eb) (by Andrei)
- parse: added og:regDate to known date tags (dc35e29) (by Andrei)
- tests: convert unittest to pytest (45c4e8d) (by Andrei)
- doc: add autodoc for readthedocs (22e9dca) (by Andrei)
- doc: Added docstring to Article, Source and Configuration. (8e54946) (by Andrei)
- doc: some clarifications in the documentation (e8126d5) (by Andrei)
- doc: some template changes (0261054, bfbac2c) (by Andrei)
- core: typing annotation for set python 3.8 (895343f) (by Andrei)
- parse: improve meta tag content for articles and pubdate (37bb0b7) (by Andrei)
- parse: 📝 improved author detection. improved video links detection (23c547f) (by Andrei)
- parse: ensured that clean_doc/doc to clean_top_node are on the same DOM. And doc/top_node on the same DOM. (6874d05) (by Andrei)
- core: small changes, replace os.path with pathlib (5598d95) (by Andrei)
- parse: use one file of stopwords for english, the one in the standard folder #503 (6bdf813) (by Andrei)
- parse: better author parsing based on issue #493 (f93a9c2) (by Andrei)
- parse: make the url date parsing stricter. Issue #514 (0cc1e83) (by Andrei)
- parse: replace \n with space in sentence split (Issue #506) (3ccb87c) (by Andrei)
- parsing: catch url errors resulting from parsed image links (9140a04) (by Andrei)
- repo: correct python versions in pipeline (7e671df) (by Andrei)
- repo: gitignore update (8855f00) (by Andrei)
First release after the fork. This release is based on the 0.1.7 release of the original newspaper3k project. I jumped versions such that it is clear that this is a fork and not the original project.
- tests: starting moving tests to pytest (f294a01) (by Andrei)
- parser: add yoast schema parse for date extraction (39a5cff) (by Andrei)
- docs: update README.md (d5f9209) (by Andrei)
- parse: feed_url parsing, issue #915 (ec2d474) (by Andrei)
- parse: better content detection. added `<article>` and `<div>` tag as candidate for content parent_node (447a429) (by Andrei)
- core: close pickle files - PR #938 (d7608da) (by Andrei)
- parse: improved publication date extraction (4d137eb) (by Andrei)
- core: some linter errors, whitespaces and spelling (79553f6) (by Andrei)
################################### These are the original newspaper3k release notes ################################### ########################################################################################################################
0.1.7 (2016-01-30)
Closed issues:
- ImportError: cannot import name 'Image' #183
- Won't let me import #182
- Install on Mac - El Capitan Failed - "Operation not permitted" #181
- Downgrades to old versions of required packages upon installation #174
- Handling 404, 500, and other non-200 http response codes to prevent scraping error pages #142
- Library downgrading in installation #138
Merged pull requests:
- Don't scrape error pages #190 (yprez)
- Added Hebrew stop words for language support #188 (alon7)
- Fix installation and build #187 (yprez)
- Fix installation docs #184 (yprez)
- Travis CI integration #180 (yprez)
- requirements.txt - Use minimal instead of exact versions #179 (yprez)
- Handle lxml raising ValueError on node.itertext() - Python 3 #178 (yprez)
- Handle lxml raising ValueError on node.itertext() #144 (yprez)
- Parse byline fix #132 (davecrumbacher)
0.1.6 (2016-01-10)
Closed issues:
- Critical leak in newspaper.mthreading.Worker #177
- HTMLParseError #165
- Take local paths to .html files #153
- Wall Street Journal Full Text is not Correctly Scraped #150
- Article HTML Returning Null #131
- No articles #130
- Loading Pages that use heavy javascript #127
- Login handling for premium websites #126
- Installation of nltk is failing #121
Merged pull requests:
- Support urls with dots #176 (alexanderlukanin13)
- upgrade beautifulsoup4 to 4.4.1 for python 3.5 #171 (AlJohri)
- Updated requests version #170 (adrienthiery)
- Turkish Language added #169 (muratcorlu)
- Add macedonian stopwords #166 (dimitrovskif)
- Issue#95 added graceful string concatenation #157 (surajssd)
- fix for "jpeg error with PIL, Can't convert 'NoneType' object to str implicitly" #154 (hnykda)
- bugfix in article.py, is_valid_body #149 (ms8r)
- Fixed typo #139 (Eleonore9)
- Correct link for the Python 3 branch #136 (jtpio)
- Add python3-pip install step for Ubuntu #135 (irnc)
0.1.5 (2015-03-04)
Closed issues:
- is there any kind of documentation on centos 7? #114
- Add extraction publishing date from article. #3
Merged pull requests:
0.1.4 (2015-02-04)
Closed issues:
- Getting rate limiting issue? #116
- newspaper.build( ) error #111
- Allow lists in Parser.clean_article_html() #108
Merged pull requests:
- Fix incorrect log call while generating articles #115 (curita)
- Allow lists in clean_article_html() - fixes #108 #112 (ecesena)
- Fixed nodeToString() to return valid HTML #110 (ecesena)
- Fixed empty return in top_meta_image #109 (ecesena)
0.1.3 (2015-01-15)
Implemented enhancements:
- Fulltext extraction improvement #1 #105
Closed issues:
- Tags h1 in article_html - indented behavior? #107
Merged pull requests:
0.1.2 (2015-01-01)
Closed issues:
- Metatags on Vice.com #103
- Can't extract images from german newspapers #96
- article_html misses many of the images #89
Merged pull requests:
- Integrate UnicodeDammit, deprecate parser_class, deprecate encodeValue, refactor, scaffolding for more unit tests #104 (codelucas)
0.1.1 (2014-12-27)
Closed issues:
- UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc #99
- TypeError: Can't convert 'bytes' object to str implicitly #98
- [Parse lxml ERR] Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration. #78
- UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128) #77
- article.text and keywords error #47
Merged pull requests:
- Huge bugfix to aid lxml DOM parsing + remove unhelpful and excess exception messages and added tracebacks to exception logging #102 (codelucas)
- Decode bytestring returned from lxml's `toString` early on before sending it out to outer code #101 (codelucas)
- Fixed #78: Remove encoding tag because lxml won't accept it for unicode #97 (mhall1)
0.1.0 (2014-12-17)
0.0.9 (2014-12-17)
Closed issues:
- object has no attribute clean Error when using parse method #90
- Questions #85
- [nltk_data] Error loading brown: <urlopen error [Errno -2] Name or [nltk_data] service not known> #84
- newspaper unable to find embedded youtube video #82
- Bound for memory usage #81
- Hosted demo #80
- Having issues installing due to lxml #79
- Add a BeautifulSoup4 parser. #44
- python 3 support request #36
Merged pull requests:
- update jieba to 0.35 #94 (WingGao)
- Parse was breaking in the method clean_article_html when keep_article_ht... #88 (phoenixwizard)
- split title with _ #87 (deweydu)
- Update to support python3 #86 (log0ymxm)
- Added link to basic demo #83 (iwasrobbed)
- Add splitting of slash-separated titles #75 (igor-shevchenko)
0.0.8 (2014-10-13)
Closed issues:
- Parsing Raw HTML #74
- Can't install newspaper #72
- Refactor codebase so newspaper is actually pythonic #70
- Article.top_node == Article.clean_top_node #65
- article.movies missing 'http:' #64
- KeyError when calling newspaper.languages() #62
- Memoize Articles - Not Printing #61
- Add URL headers while building a "paper" #60
- AttributeError: 'module' object has no attribute 'build' #59
- Typo in newspaper.build argument "memoize_articles" #58
- issue with stopwords-tr.txt #51
- Other language support. #34
- Character encoding detection #2
Merged pull requests:
- Huge refactor: entire codebase in PEP8, imports alphabetized, bugfixes, core changes #71 (codelucas)
- Meta tag extraction fixes #69 (karls)
- Test suite improvements #68 (karls)
- Test suite fixes #67 (karls)
- Revert "Added published date to the extractor+article" #66 (codelucas)
- Added published date to the extractor+article #63 (parhammmm)
0.0.7 (2014-06-17)
Closed issues:
- no document on how to add language #57
- Retain <a> tags in top article node? #56
- DocumentCleaner is missing clean_body_classes #55
- You must download and parse an article before parsing it #52
- Not extracting UL LI text #50
- article does not release_resources() #42
- Doesn't work on http://www.le360.ma/fr #40
- How to assign html content without downloading it? #37
- Python venv only? #32
- .nlp() could not work #27
- Doesn't work with Arabic news sites #23
- SyntaxError: invalid syntax #19
- Retain HTML markup for extracted article #18
- Portuguese is misspelled #14
- Multi-threading article downloads not working #12
- Timegm error? #10
- Problem in Brazilian sites #9
- Brazilian portuguese support #6
Merged pull requests:
- Fix typo in code and documentation #54 (jacquerie)
- removed quotes of 'filename' in utils\__init__.py #53 (jay8688)
- Fixed long-form article issue w/ calculate_best_node #49 (jeffnappi)
- Use first image from article top_node #35 (otemnov)
- Add a section with links to related projects #33 (cantino)
- Original #30 (otemnov)
- Fix reddit top image #29 (otemnov)
- Extract Meta Tags in structured way #28 (voidfiles)
- Replace instances of 'Portugease' with 'Portuguese' #26 (WheresWardy)
- It's The Changelog not The ChangeLog :) #24 (adamstac)
- syntax errors #22 (arjun024)
- Support for more HTML tags in parsers.py #21 (WheresWardy)
- Fixed syntax error #20 (damilare)
- Minor Performance tweaks #17 (techaddict)
- Update README.rst #15 (girasquid)
- Minor Typo candidate_words -> candidate_words #13 (techaddict)
0.0.6 (2014-01-18)
Closed issues:
- Port to Ruby #8
- Huge internationalization / API revamp underway! #7
- Multithread & gevent framework built into newspaper #4
Merged pull requests:
0.0.5 (2014-01-09)
0.0.4 (2013-12-31)
Closed issues:
- Calling nlp() on an article causes 'tokenizers/punkt/english.pickle' Not Found Error #1
Merged pull requests:
- Fix for keyword arg usage in print() on Python 2.7 #5 (michaelhood)
0.0.3 (2013-12-22)
0.0.2 (2013-12-21)
0.0.1 (2013-12-21)
* This Change Log was automatically generated by github_changelog_generator