client is a potential bottleneck #39

Open · aglyzov opened this issue Jun 21, 2012 · 15 comments

aglyzov (Contributor) commented Jun 21, 2012

When testing, one has to ensure the client machine is powerful enough to withstand the CPU load created by the Erlang client program.

On several occasions I saw the client consume more CPU than the server on an identical pair of machines. Watching the htop output of both machines at the same time, it was clear that the Erlang client was CPU-bound while the server had a fair amount in reserve. This was especially so in the first stage of the test, when new connections get created. Then, after some connections died off due to the client timeout, the client's CPU usage dropped considerably.

So, assuming the testing machines are the same, the client might be a bottleneck in some cases.
This needs to be checked thoroughly.
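
One straightforward way to check would be to record a CPU trace on both boxes for the duration of a run and compare them afterwards, rather than eyeballing htop. A minimal sketch, assuming the psutil package is available (the interval and output format are arbitrary):

```python
# cpu_trace.py -- run simultaneously on the client and the server machine.
# Emits one timestamped line per second with system-wide CPU utilisation,
# so the two traces can be lined up and compared after the test.
import time
import psutil

while True:
    # cpu_percent(interval=1) blocks for one second and returns the
    # average CPU utilisation over that window.
    print(f"{time.time():.0f} {psutil.cpu_percent(interval=1):.1f}", flush=True)
```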

jlouis (Contributor) commented Jun 22, 2012

That sounds interesting. It is also interesting because different servers still handle the connections differently. If the client were the sole problem, then all servers "able to keep up" should show roughly the same behaviour. From my initial tests on the handshake-time data, only two systems exhibit the same behaviour: erlang-cowboy and go-websocket. The rest of the bunch have considerably different characteristics.

I agree this is worth investigating. Among other things, I'll try reading through the client code to figure out what it does and whether I can find anything odd in there.

aglyzov (Author) commented Jun 22, 2012

@jlouis, note that my systems were both single-core. Considering that the client does much more processing than a simplistic server, that might be what caused the oddity. Also, I can confirm that almost all systems behave comparably in my tests. I should add another core to the client machine and try again. Thanks for the insight.

aglyzov (Author) commented Jun 22, 2012

So guys, I added a second CPU core to my client machine, ran some tests, and now have interesting results for you.

First of all, I think my theory about the client being a bottleneck in some cases was right. Check out these screenshots to see what I mean (client with 2 CPU cores on the left, server with 1 CPU core on the right):
java-webbit: https://dl.dropbox.com/u/4663634/websocket-test/java-webbit.png
pypy-twisted: https://dl.dropbox.com/u/4663634/websocket-test/twisted-pypy-1.png
pypy-tornado: https://dl.dropbox.com/u/4663634/websocket-test/tornado-pypy-1.png

Results:
https://dl.dropbox.com/u/4663634/websocket-test/websocket-test-results.txt

On a side note: Haskell and Go were unbelievably awful in terms of memory consumption. While it is a known fact that Go has severe memory problems on 32-bit architectures due to its questionable GC design, I am surprised by the haskell-snap behavior.
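
To put numbers on "awful", one could log each server process's resident set size during a run instead of relying on htop snapshots. A minimal sketch, again assuming psutil, with the server's PID passed on the command line:

```python
# mem_trace.py <pid> -- sample the resident set size of the server process
# once a second, so memory growth can be compared across implementations.
import sys
import time
import psutil

proc = psutil.Process(int(sys.argv[1]))  # PID of the server under test
while True:
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    print(f"{time.time():.0f} {rss_mb:.1f} MB", flush=True)
    time.sleep(1)
```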

jlouis (Contributor) commented Jun 22, 2012

That definitely looks like an overload problem on your hardware. Also note you are not even getting the 10k handshakes Eric was getting on Go and Erlang, so Eric's faster machine might help here with handling all the connections. Perhaps you, as well as Eric, should post specs, so we have an idea of what kind of machine is currently needed to handle the load.

As for the 32-bit limit of the Go GC, it is the price they have to pay because their language does not use a precise GC (a rather bad decision, IMO).

aglyzov (Author) commented Jun 22, 2012

@jlouis, I am not sure it is overload now that I have added the second core to the client; at least it is no longer due to CPU. Perhaps it is some other hidden cost of virtualization. I am indeed eager to see the results on real hardware.

Note that with the screenshots I was trying to show that the Erlang client consumed more than one CPU core to keep up with certain fast servers.

ericmoritz (Owner) commented

The server hardware that I have is the following:

AMD Phenom 9600 quad-core @ 2300 MHz
2 GB of memory

The client I will be using is my MacBook Pro running Ubuntu 12.04 via Boot Camp (at least, that is the plan).

MBP's stats:

```
$ sysctl hw
hw.ncpu: 8
hw.byteorder: 1234
hw.memsize: 8589934592
hw.activecpu: 8
hw.physicalcpu: 4
hw.physicalcpu_max: 4
hw.logicalcpu: 8
hw.logicalcpu_max: 8
hw.cputype: 7
hw.cpusubtype: 4
hw.cpu64bit_capable: 1
hw.cpufamily: 1418770316
hw.cacheconfig: 8 2 2 8 0 0 0 0 0 0
hw.cachesize: 8589934592 32768 262144 6291456 0 0 0 0 0 0
hw.pagesize: 4096
hw.busfrequency: 100000000
hw.busfrequency_min: 100000000
hw.busfrequency_max: 100000000
hw.cpufrequency: 2200000000
hw.cpufrequency_min: 2200000000
hw.cpufrequency_max: 2200000000
hw.cachelinesize: 64
hw.l1icachesize: 32768
hw.l1dcachesize: 32768
hw.l2cachesize: 262144
hw.l3cachesize: 6291456
hw.tbfrequency: 1000000000
hw.packages: 1
hw.optional.floatingpoint: 1
hw.optional.mmx: 1
hw.optional.sse: 1
hw.optional.sse2: 1
hw.optional.sse3: 1
hw.optional.supplementalsse3: 1
hw.optional.sse4_1: 1
hw.optional.sse4_2: 1
hw.optional.x86_64: 1
hw.optional.aes: 1
hw.optional.avx1_0: 1
hw.cputhreadtype: 1
hw.machine = x86_64
hw.model = MacBookPro8,2
hw.ncpu = 8
hw.byteorder = 1234
hw.physmem = 2147483648
hw.usermem = 943783936
hw.pagesize = 4096
hw.epoch = 0
hw.vectorunit = 1
hw.busfrequency = 100000000
hw.cpufrequency = 2200000000
hw.cachelinesize = 64
hw.l1icachesize = 32768
hw.l1dcachesize = 32768
hw.l2settings = 1
hw.l2cachesize = 262144
hw.l3settings = 1
hw.l3cachesize = 6291456
hw.tbfrequency = 1000000000
hw.memsize = 8589934592
hw.availcpu = 8
```

ericmoritz (Owner) commented

Sorry, the server only has 2 GB of memory; I copy/pasted the specs from the Craigslist ad. One of the 2 GB modules was bad, so I removed it.

I may have to pick up a 1 or 2 GB module if the OS plus each server start swapping.

ericmoritz (Owner) commented

Does anyone know if I should add a "cool down" period between stopping one server and starting the next? Could there be residual effects from one test in the kernel that could affect the results of another?

ericmoritz (Owner) commented

To save you some googling, the server is 64-bit.

aglyzov (Author) commented Jun 22, 2012

Once all the processes have exited or been killed it should be fine. Add a 15-second pause to be on the safe side.
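
A harness could enforce that pause between runs. A minimal sketch, where start-server.sh and run-client.sh are hypothetical wrappers for however each server and the benchmark client are actually launched:

```python
# run_suite.py -- benchmark each server in turn with a cool-down pause in
# between, so residual kernel state (e.g. lingering sockets) can settle.
import subprocess
import time

SERVERS = ["erlang-cowboy", "go-websocket", "pypy-tornado"]  # example names from this thread
COOL_DOWN_SECONDS = 15

for name in SERVERS:
    server = subprocess.Popen(["./start-server.sh", name])      # hypothetical launcher
    try:
        subprocess.run(["./run-client.sh", name], check=True)   # hypothetical client wrapper
    finally:
        server.terminate()
        server.wait()
    time.sleep(COOL_DOWN_SECONDS)  # the 15 s cool-down suggested above
```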

aglyzov (Author) commented Jun 23, 2012

Update: I've been testing the servers on a pair of Linode 512 machines. The outcome: basic Linode hardware is capable of handling ~19k active concurrent connections (pypy, erlang, java).

That's about $1 a month per 1,000 websockets :)
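
As a back-of-the-envelope check of that figure, assuming the Linode 512 plan cost roughly $20/month at the time:

```python
# Rough cost-per-connection arithmetic; the plan price is an assumption.
monthly_price_usd = 20.0          # assumed Linode 512 price
concurrent_connections = 19_000   # measured above
print(monthly_price_usd / (concurrent_connections / 1_000))  # ~1.05 USD per 1k connections/month
```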

ericmoritz (Owner) commented

I like how this thing is turning into a way to benchmark VPS hosts as well as individual WS implementations.

perone commented Jul 2, 2012

@aglyzov, what was the number of active concurrent connections on the Linode for the other benchmarks, like gevent for instance?

aglyzov (Author) commented Jul 3, 2012

@perone, gevent-websocket was unfortunately not doing great. There was a cut-off near 11k.

perone commented Jul 3, 2012

@aglyzov, thanks for sharing!
