
Memory usage of webkit never stops growing #41

Open

beltran opened this issue Nov 9, 2015 · 20 comments

@beltran

beltran commented Nov 9, 2015

First of all, thanks for this great package.

The memory usage of the webkit_server process seems to increase with each call to session.visit().
It happens to me with the following script:

import dryscrape
import time


dryscrape.start_xvfb()
session = dryscrape.Session()
session.set_attribute('auto_load_images', False)

while 1:    
    print "Iterating"
    session.visit("https://www.google.es")
    html_source = session.body()
    time.sleep(5)

I see the memory usage with this command:

ps -eo size,pid,user,command --sort -size | grep webkit | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=1 ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }'

Maybe I'm doing something wrong?

@niklasb
Owner

niklasb commented Nov 9, 2015

Hi @bjmb.

It is actually quite possible that there is a memory leak in webkit-server. I don't think many people have used it in a persistent way. If your use case allows it, a simple workaround would be to restart webkit-server every once in a while by killing it and creating a new session... Of course a better fix would be to track down the leak, but I'm not sure this is something I will be able to do in the next few weeks or so.
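
A minimal sketch of that workaround, assuming webkit_server exposes Server, ServerConnection and Server.kill() and that the dryscrape webkit Driver forwards keyword arguments to webkit_server.Client (please check against the version you actually have installed):

import dryscrape
import webkit_server
from dryscrape.driver.webkit import Driver


def new_session():
    # Give each session its own webkit_server process so we can kill
    # exactly that process later instead of a shared default server.
    server = webkit_server.Server()
    connection = webkit_server.ServerConnection(server=server)
    session = dryscrape.Session(driver=Driver(connection=connection))
    session.set_attribute('auto_load_images', False)
    return server, session


dryscrape.start_xvfb()
urls = ["https://www.google.es"] * 200  # whatever you are actually crawling

server, session = new_session()
for i, url in enumerate(urls):
    if i and i % 50 == 0:
        # Throw away the leaky process every 50 pages and start fresh.
        server.kill()
        server, session = new_session()
    session.visit(url)
    html_source = session.body()

If I remember the connection code correctly, plain dryscrape.Session() instances all end up talking to one shared webkit_server process, which is why the sketch gives each session a dedicated Server it can kill.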

Niklas

@beltran
Author

beltran commented Nov 9, 2015

Thanks for the answer. I will try your advice.

Best

@newpyguy

Hello,

Another thanks for this awesome package, and unfortunately another report of this condition. You mentioned killing webkit_server and restarting it as a workaround. I'm new and can't think of a relatively seamless way to do this. Any suggestions?

Thanks again!

@13steinj

This appears to be a QtWebKit bug; see ariya/phantomjs#11390 for more info.

I've been unable to find a solution with dryscrape, and because of the instability issues I mentioned in #43, I've switched over to using PhantomJS with Selenium, where I've solved it with:

from selenium.webdriver import PhantomJS
class WebDriver(PhantomJS):
    def __init__(self, *a, **kw):
        super(WebDriver, self).__init__(*a, **kw)
        self._maxRequestCount = 100
        self._RequestCount = 0

    def get(url):
        if self._RequestCount > self._maxRequestCount:
            self.__reset()
        super(WebDriver, self).get(url)
        self._RequestCount += 1

    def __reset(self):
        try:
            self.quit()
        except:
            print("couldn't quit, do so manually")
        self.__dict__ = self.__class__().__dict__

@SKART1

SKART1 commented Feb 12, 2016

The memory leaks are giant! After 70 visited pages, almost 15 GB of virtual memory...

@igittigitthub

Hi,

How can I restart the webkit-server? Thanks!

@JermellB

I'm with that guy from 4 hours ago. Is there a trick to restarting the server? I could never get my OSErrors resolved.

@ChrisQB8

Hello guys,

I'm experiencing similar issues here. I'm iterating over approx. 300 URLs, and after 70-80 URLs webkit_server takes up about 3 GB of memory. However, it is not really the memory that is the problem for me; it seems that dryscrape/webkit_server is getting slower with each iteration. After the said 70-80 iterations, dryscrape is so slow that it raises a timeout error (with the timeout set to 10 seconds) and I need to abort. Restarting webkit_server (e.g. after every 30 iterations) might help and would free the memory, but I'm unsure whether the 'memory leaks' are really the reason dryscrape keeps getting slower.

Does anyone know how to restart the webkit_server so I could test that?

I have not found an acceptable workaround for this issue; however, I also don't want to switch to another solution (Selenium/PhantomJS, ghost.py), as I simply love dryscrape for its simplicity. Dryscrape works great, by the way, as long as one is not iterating over too many URLs in one session.

@niklasb
Owner

niklasb commented Mar 31, 2016

It seems like a lot of people could really use the simple workaround of just restarting the server. After my exams for this semester are over, I will try to implement this in a way that should be useful to most people affected by this particular issue.

@ChrisQB8

ChrisQB8 commented Apr 1, 2016

Much appreciated, Niklas!

@trendsetter37
Contributor

@niklasb

I see this

def reset(self):
    """ Resets the current web session. """
    self.conn.issue_command("Reset")

In the Client class in webkit_server.py here. Is this the point of focus if one is attempting a workaround fix for the memory leak issue?

If so, I guess I could try to play around with implementing it here:

class Driver(webkit_server.Client,
             dryscrape.mixins.WaitMixin,
             dryscrape.mixins.HtmlParsingMixin):
  """ Driver implementation wrapping a ``webkit_server`` driver.
  Keyword arguments are passed through to the underlying ``webkit_server.Client``
  constructor. By default, `node_factory_class` is set to use the dryscrape
  node implementation. """
  def __init__(self, **kw):
    kw.setdefault('node_factory_class', NodeFactory)
    super(Driver, self).__init__(**kw)

@niklasb
Owner

niklasb commented Apr 1, 2016

@trendsetter37 I think that pretty much only triggers https://github.com/niklasb/webkit-server/blob/master/src/NetworkAccessManager.cpp#L50, which clears request headers and the username + password for HTTP auth, so no, that doesn't really do anything interesting in this context.

@niklasb
Owner

niklasb commented Apr 1, 2016

Also, that command is already exposed here: https://github.com/niklasb/webkit-server/blob/master/webkit_server.py#L254 and can be called as Session#reset().
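
For reference, a usage sketch (assuming the default webkit driver, where the Session proxies unknown methods through to the underlying webkit_server.Client):

import dryscrape

dryscrape.start_xvfb()
session = dryscrape.Session()
session.visit("https://www.google.es")

# Per NetworkAccessManager.cpp above, this only clears custom request headers
# and HTTP auth credentials; it is not expected to release leaked memory.
session.reset()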

@trendsetter37
Contributor

Ahh, OK, I see. I will keep poking around. Server.cpp catches my eye, but I am still unsure whether that's the right direction. I am willing to take a crack at it... would it be better if I sent an email?

@pbdeuchler

pbdeuchler commented May 18, 2016

If I could just bump this: it takes ~20 minutes to scrape 30 or so links, and memory usage climbs exponentially, hitting almost 80-90 MB per new link towards the end of those 30. JavaScript execution also slows down immensely as the number of links grows.

@niklasb
Owner

niklasb commented Jun 23, 2016

@siddarth so why not start Xvfb once manually and run your Python script inside it? What you describe seems to have nothing to do with the rest of this thread.

@siddarth3110

@niklasb Thanks, and agreed.

@ernests

ernests commented Feb 2, 2017

I ended up using PhantomJS and the approach mentioned by 13steinj.

P.S. There is a bug in his wrapper: the get function is missing self. The correct version is get(self, url).
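
That is, only the signature of the method inside the WebDriver class changes; the body stays the same:

    def get(self, url):
        if self._RequestCount > self._maxRequestCount:
            self.__reset()
        super(WebDriver, self).get(url)
        self._RequestCount += 1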

@13steinj

13steinj commented Feb 2, 2017

Whoops, thanks for the catch. It may be important to note, however, that the only thing that wrapper "solves" is the memory usage, and only on a basic level: a single tab, no persistence of data via a browser extension (though that would only apply to Firefox/Chrome), and it doesn't address Selenium's own instability (which is far rarer than with dryscrape). Some things have also changed since I wrote it; I believe the driver's .close() should be called after .quit(). However, due to Selenium's structure I ended up being able to write a more in-depth wrapper that "solves" those via a reset as well, which I couldn't do with dryscrape (I'd share it, but I'd have to dig up that old project).

@achinmay

There is a bug in Qt up to 5.2.1 where memory grows very fast if auto-loading of images is off: https://bugreports.qt.io/browse/QTBUG-34494. See if this helps.
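
If that Qt bug is the culprit, a quick test would be to leave image loading enabled (the script at the top of this issue turns it off):

# Trades extra bandwidth for (possibly) slower memory growth if QTBUG-34494 applies.
session.set_attribute('auto_load_images', True)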
