Question #140
Comments
Hi @Samistine,

We are using this project in limited production at this point, with between 80 and 100 active clients spread across three instances of this server. I believe that load could be supported by one instance, but I haven't reduced my load balancing lately to test. The capacity-limiting factor at this point seems to have more to do with the number of active clients and the amount of churn than with the number of events raised. During preliminary testing, I had a small number (~20) of devices producing multiple events per second each without any real issues.

My (very) simple load balancing solution can be found here: https://github.com/Ario-Inc/spark-routing-db. At present it requires that the "./spark-server/data/" folder be mounted on a file system shared by all instances.

Cheers
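For readers who want the shape of that setup, below is a minimal, hypothetical sketch of a device-to-instance routing table. It is not the actual spark-routing-db code; the routes-file path and the server pool are assumptions, and the one grounded detail is that the state must live on a file system all instances share.

```javascript
// Hypothetical sketch of a device-to-instance routing table; NOT the actual
// spark-routing-db implementation. Assumes the table is a JSON file on the
// mount shared by all spark-server instances.
const fs = require('fs');

const ROUTES_FILE = '/mnt/shared/spark-server/data/routes.json'; // assumed path
const SERVERS = ['10.0.0.1:5683', '10.0.0.2:5683', '10.0.0.3:5683']; // assumed pool

function loadRoutes() {
  try {
    return JSON.parse(fs.readFileSync(ROUTES_FILE, 'utf8'));
  } catch (e) {
    return {}; // first run: no routing table yet
  }
}

// Return the instance a device should connect to, pinning the device to the
// least-populated instance the first time it is seen.
function routeDevice(deviceID) {
  const routes = loadRoutes();
  if (!routes[deviceID]) {
    const counts = SERVERS.map(
      (server) => Object.keys(routes).filter((id) => routes[id] === server).length
    );
    routes[deviceID] = SERVERS[counts.indexOf(Math.min(...counts))];
    fs.writeFileSync(ROUTES_FILE, JSON.stringify(routes));
  }
  return routes[deviceID];
}

module.exports = { routeDevice };
```

Pinning each device to a single instance is what makes the shared data folder workable: whichever instance a device lands on can read the same stored device keys.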
Hi @wdimmit,

We're getting ready to launch into production in the next week, so your experience is super interesting to us right now :) What kind of symptoms do you see when the spark server is not able to handle load? Does it get spotty handling requests, or does the whole thing just go down? If you don't mind my asking, how 'beefy' are your server(s)?

The number-of-clients issue is definitely a concern. I'll have to take a look at your load balancer and maybe have that ready, just in case.
@Snazzypants - how many devices? You might need to use more servers until we add in clustering.
When the server hits its upper limit (probably between 80 and 120 clients at the moment), it appears to stop registering some keep-alive pings from the clients. This leads to clients disconnecting somewhat randomly at multiples of the timeout interval (15s). Also, if you dump a bunch of clients on a server at once (more than 20 or so), some percentage of them will disconnect at the first timeout interval. This means that if you add 50 clients to an empty server, perhaps 20 will disconnect; those 20 will then reconnect, causing 10 to disconnect, and then things will stabilize. The exact numbers here are estimates from memory.

I'm currently running 3 device servers, each on an Azure single-core VM. The load levels are essentially zero; this is not a CPU-bound process. I'm running the processes on separate VMs for redundancy, not extra processing power.

Finally, check out my code at these two points in the spark-protocol project to see how I'm communicating with my connection router:
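One cheap way to test the missed-keep-alive theory from the outside is to measure event-loop lag on the server process. The standalone probe below is not part of spark-server; the probe interval and warning threshold are arbitrary choices, and the 15s figure is the timeout interval mentioned above.

```javascript
// Standalone event-loop lag probe (not spark-server code). If the loop is
// blocked long enough that timers fire very late, keep-alive pings queued
// behind the blockage would plausibly be "missed" in the same way.
const KEEP_ALIVE_MS = 15000; // the 15s timeout interval discussed above
const PROBE_MS = 500;        // how often we expect our timer to fire

let last = Date.now();
setInterval(() => {
  const lag = Date.now() - last - PROBE_MS; // time beyond the scheduled tick
  if (lag > KEEP_ALIVE_MS / 2) {
    console.warn(`event loop lagged ~${lag}ms; keep-alives are at risk`);
  }
  last = Date.now();
}, PROBE_MS);
```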
@wdimmit Thanks for the info!

@jlkalberer We'll probably start out with ~75 in the first two weeks, then continue to add more as more people come online. The clustering would require multiple CPUs, is that correct? Right now we are also on a single core. It seems like it would still be more cost-efficient to set up multiple cheap servers rather than cluster on a multi-core machine?
Well... I'm thinking there is a bug somewhere in our implementation. Your server should be able to handle way more than 80-120 clients without disconnects occurring. What I think is happening is that somewhere we are blocking the thread, which blocks the pings from the devices. With clustering I'm hoping that while the CPU is blocked, it switches to another thread, and that will allow more devices to be connected even with a single core. In the short term this is a quick fix, but in the long term I'd love to figure out why the thread is blocking.
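For reference, the clustering discussed here could be done with Node's built-in cluster module. The sketch below is an assumption about how it might be wired up, not existing spark-server code, and `./main` is a hypothetical entry point. Because each worker is a separate OS process, the scheduler can run another worker while one is blocked, even on a single core.

```javascript
// Sketch of single-port clustering with Node's built-in cluster module.
// NOT existing spark-server code; worker count and entry point are assumptions.
const cluster = require('cluster');
const os = require('os');

// At least two workers so one blocked process doesn't stall all pings,
// even on a single-core VM.
const WORKERS = Math.max(os.cpus().length, 2);

if (cluster.isMaster) {
  for (let i = 0; i < WORKERS; i += 1) {
    cluster.fork();
  }
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} died; restarting`);
    cluster.fork(); // keep the pool at full strength
  });
} else {
  // Each worker runs the server; connections to the shared listening port
  // are distributed across the workers by the cluster module.
  require('./main'); // hypothetical entry point
}
```

One caveat: cluster forks processes, not threads, so workers share no memory. Any per-device state would have to live outside the worker, which is the same constraint the shared data folder already addresses.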
@wdimmit So I'm trying to set up your load balancing solution. I cloned your fork of spark-protocol, but am getting an error when starting up spark-server. Any help would be appreciated... we will definitely need to get your solution working in the next few days, as we are shipping soon.

EDIT: Sorry, ignore that... the error I was getting was related to some memory issues when installing packages. I've made a bit more progress and am slowly getting there, I think... (Sorry to threadjack, but I didn't see a place to open a separate issue in your spark-protocol fork.)
Would you say this would work well for a production product? What would be a reasonable quantity of device interactions for this to handle on a modern computer?