Rewrite #55

Open
Raxvis opened this issue Feb 10, 2019 · 14 comments

Comments

@Raxvis

Raxvis commented Feb 10, 2019

This issue thread will be used to keep everyone apprised of the rewrite taking place.

@Raxvis
Author

Raxvis commented Feb 10, 2019

@ash121121 @Milezz

I have run into trouble with MySQL because the database isn't fast enough. On top of that, the torrent scraper (which fetches metadata) looks to be broken in the new rewrite.

I have been looking for a way to overcome both problems and have worked through two iterations with no success. I am now on a third iteration that I hope will fare better.

The new rewrite should fix a lot of the issues you guys are seeing with the tracker and scraper struggling to keep up.

@ghost

ghost commented Feb 10, 2019

I agree MySQL isn't that great for what we're running here, although I was able to get some queries down to milliseconds using indexes. Bear in mind that's only with a database of 3 million. Query times would increase greatly as we approach 20+ million. Could you tell us what the two iterations you tried were, and what your third is? I'm interested to see if we can provide any ideas.

Kind regards

@Raxvis
Author

Raxvis commented Feb 10, 2019

Both iterations were based on separating out the tracker, scraper, and torrent lookup (the first used MySQL and the second used Redis). The next iteration will isolate the individual actions but run them in a single process that has access to the DHT server and the DHT nodes (for scraping metadata).
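
A rough sketch of what "isolated actions in a single process" might look like, assuming a shared event emitter stands in for the DHT server. The names `dht`, `tracker`, `metadataQueue`, and the `announce` event are hypothetical and are not taken from the scraper's code.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for the shared DHT server; a real one would speak UDP.
const dht = new EventEmitter();

// Tracker action: remembers which peers announced which infohash.
const tracker = new Map();
dht.on('announce', ({ infohash, peer }) => {
  if (!tracker.has(infohash)) tracker.set(infohash, new Set());
  tracker.get(infohash).add(peer);
});

// Scraper action: queues each infohash once for a later metadata fetch.
const metadataQueue = [];
const queued = new Set();
dht.on('announce', ({ infohash }) => {
  if (queued.has(infohash)) return;
  queued.add(infohash);
  metadataQueue.push(infohash);
});

// Both actions react to the same event without calling each other.
const exampleHash = '8eff86639946d68f2cea7485c59a3790794f78b9';
dht.emit('announce', { infohash: exampleHash, peer: '203.0.113.5:6881' });
dht.emit('announce', { infohash: exampleHash, peer: '203.0.113.9:6881' });

console.log(tracker.get(exampleHash).size, metadataQueue.length);
```

Each action keeps its own state, so one can be swapped or disabled without touching the others, while all of them share the one DHT connection.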

@ghost

ghost commented Feb 10, 2019

What data store do you plan to use now? Redis still? Have you looked at MongoDB? I've seen another DHT scraper on GitHub using it.

@Raxvis
Author

Raxvis commented Feb 11, 2019

Redis and Elasticsearch are the two that I will probably be using.

Redis for the peer / node information and Elasticsearch for the torrent information.
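
To illustrate that split, here is a minimal sketch in which plain in-memory maps stand in for both stores. The classes `PeerStore` and `TorrentIndex` and their methods are hypothetical; no real Redis or Elasticsearch client calls are shown.

```javascript
// In-memory stand-ins; a real build would swap in Redis and Elasticsearch clients.
class PeerStore {
  // Short-lived, high-churn data: peers per infohash with an expiry time.
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.peers = new Map(); // infohash -> Map(peer -> expiresAt)
  }
  add(infohash, peer, now = Date.now()) {
    if (!this.peers.has(infohash)) this.peers.set(infohash, new Map());
    this.peers.get(infohash).set(peer, now + this.ttlMs);
  }
  list(infohash, now = Date.now()) {
    const entries = this.peers.get(infohash) || new Map();
    return [...entries].filter(([, expiresAt]) => expiresAt > now).map(([peer]) => peer);
  }
}

class TorrentIndex {
  // Long-lived, searchable data: torrent documents keyed by infohash.
  constructor() {
    this.docs = new Map();
  }
  put(infohash, doc) {
    this.docs.set(infohash, doc);
  }
  search(term) {
    const needle = term.toLowerCase();
    return [...this.docs]
      .filter(([, doc]) => doc.name.toLowerCase().includes(needle))
      .map(([infohash]) => infohash);
  }
}

const sampleHash = 'ef719bfbe716bd970afb4e269eab5ccb8fc1b3f2';
const peerStore = new PeerStore(30 * 60 * 1000); // 30 minute expiry, an arbitrary choice
const torrentIndex = new TorrentIndex();
peerStore.add(sampleHash, '203.0.113.5:6881', 0);
torrentIndex.put(sampleHash, { name: 'Example Linux ISO' });
```

The point of the split is that peer data expires quickly and is only looked up by key, while torrent data is kept and searched by text.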

@Raxvis
Author

Raxvis commented Feb 11, 2019

Just an update: I have completely rewritten the DHT server portion and put it into its own package here: https://github.com/AlphaReign/dht-server

This is a standalone DHT server that will work as the backbone of our scraper, but will also allow us to query the DHT network for peer information so that we can download. With this done, I can set up the initial code to just keep looking for peers and getting torrent announcements without having it tied directly into the scraper.
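
For context, asking the DHT network for peer information is done with a `get_peers` query in the BitTorrent DHT protocol (BEP 5), sent as a bencoded dictionary over UDP. The sketch below builds such a message by hand; `bencode` and `getPeersQuery` are hypothetical helpers written for this example and are not the dht-server package's API.

```javascript
const crypto = require('crypto');

// Minimal bencode encoder covering the types a KRPC query needs.
function bencode(value) {
  if (Buffer.isBuffer(value)) {
    return Buffer.concat([Buffer.from(`${value.length}:`), value]);
  }
  if (typeof value === 'string') {
    return bencode(Buffer.from(value));
  }
  if (Number.isInteger(value)) {
    return Buffer.from(`i${value}e`);
  }
  if (Array.isArray(value)) {
    const items = value.map((item) => bencode(item));
    return Buffer.concat([Buffer.from('l'), ...items, Buffer.from('e')]);
  }
  // Dictionaries: keys are emitted in sorted order (ASCII keys assumed here).
  const keys = Object.keys(value).sort();
  const parts = keys.map((key) => Buffer.concat([bencode(key), bencode(value[key])]));
  return Buffer.concat([Buffer.from('d'), ...parts, Buffer.from('e')]);
}

// Build a BEP 5 get_peers query for a hex-encoded infohash.
function getPeersQuery(nodeId, infohashHex, transactionId = 'aa') {
  return bencode({
    t: transactionId, // transaction id echoed back in the response
    y: 'q', // message type: query
    q: 'get_peers', // query method name
    a: { id: nodeId, info_hash: Buffer.from(infohashHex, 'hex') },
  });
}

const nodeId = crypto.randomBytes(20); // random 160-bit node id for the example
const query = getPeersQuery(nodeId, '8eff86639946d68f2cea7485c59a3790794f78b9');
console.log(query.length);
```

The resulting buffer could then be sent to a known node with a `dgram` UDP socket; the response carries either peers for the infohash or closer nodes to ask next.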

@Raxvis
Author

Raxvis commented Feb 11, 2019

You can check out this branch here: https://github.com/AlphaReign/scraper/tree/split-fix and run:

  • yarn
  • node ./src/index.js

to watch it find torrents.

@ghost

ghost commented Feb 11, 2019

Thanks will check this out today :)

@milezzz

milezzz commented Feb 11, 2019

Awesome work!

@ghost

ghost commented Feb 11, 2019

node ./src/index.js
module.js:550
    throw err;
    ^

Error: Cannot find module 'dht-server'
    at Function.Module._resolveFilename (module.js:548:15)
    at Function.Module._load (module.js:475:25)
    at Module.require (module.js:597:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/root/newscraper/src/index.js:1:75)
    at Module._compile (module.js:653:30)
    at Object.Module._extensions..js (module.js:664:10)
    at Module.load (module.js:566:32)
    at tryModuleLoad (module.js:506:12)
    at Function.Module._load (module.js:498:3)

@Prefinem

@ghost

ghost commented Feb 11, 2019

Never mind, I installed dht-server and bencode.

@milezzz

milezzz commented Feb 11, 2019

seems to be working:

onGetPeersQuery - new torrent: 8eff86639946d68f2cea7485c59a3790794f78b9
onGetPeersQuery - new torrent: ef719bfbe716bd970afb4e269eab5ccb8fc1b3f2
total nodes 2000
onGetPeersQuery - new torrent: fc9b2d35164542b5704cef777b3b2560fe485cf9
onGetPeersQuery - new torrent: ad4f9ce5aa00943c01da3fd551250bd367729a7a
onGetPeersQuery - new torrent: 1224b03c763dafedae76d1a2dfb16a0396c90e72
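
One plausible reading of this output is that each incoming `get_peers` query is checked against the set of infohashes already seen, and only new ones are logged. The handler below is a guess at that behaviour for illustration, not code from the branch; `seenInfohashes` and the function signature are assumptions.

```javascript
// Hypothetical handler: logs an infohash only the first time it is seen.
const seenInfohashes = new Set();

function onGetPeersQuery(infohash, log = console.log) {
  if (seenInfohashes.has(infohash)) return false;
  seenInfohashes.add(infohash);
  log(`onGetPeersQuery - new torrent: ${infohash}`);
  return true;
}

onGetPeersQuery('fc9b2d35164542b5704cef777b3b2560fe485cf9');
onGetPeersQuery('fc9b2d35164542b5704cef777b3b2560fe485cf9'); // ignored, already seen
```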

@jangrewe
Contributor

If one were running the current scraper, is the dht-server a fully working replacement (feature-wise, at least), or just a PoC for now?

@Raxvis
Author

Raxvis commented Sep 15, 2019

Not currently. The end goal of this project is to have a working dht-server in its own package. There are a few out there on npm, but I have found most of them aren't suitable for a scraper, so I had planned on taking the pieces I have right now and finishing them into a full-fledged one.

The majority of the DHT server is here: https://github.com/AlphaReign/scraper/blob/master/src/crawler.js

What it mainly lacks is hooks for each method and public methods for the external hooks. A good data backend is also required for performance. I tested MongoDB, but it couldn't perform under the load; the same was true of SQLite. My next stop will be Redis or another in-memory cache. This is actually a large part of the reason the other dht-servers don't work: most of them a) don't maintain enough nodes and b) are slow to respond. This scraper works by being on the peer lists of thousands, if not tens of thousands, of nodes in order to receive announcements from them.
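
As a sketch of what "hooks for each method" could mean, assuming one list of callbacks per DHT method name: the class `HookedServer` and its `on` / `handle` methods are hypothetical and do not describe crawler.js as it stands.

```javascript
// Hypothetical hook registry: one list of callbacks per DHT method name.
class HookedServer {
  constructor() {
    this.hooks = new Map();
  }
  // Public method that lets external code subscribe to a DHT method.
  on(method, callback) {
    if (!this.hooks.has(method)) this.hooks.set(method, []);
    this.hooks.get(method).push(callback);
    return this;
  }
  // Called internally once an incoming query of that method has been decoded.
  // Returns how many hooks were run.
  handle(method, message) {
    const callbacks = this.hooks.get(method) || [];
    callbacks.forEach((callback) => callback(message));
    return callbacks.length;
  }
}

const server = new HookedServer();
const announced = [];
server.on('announce_peer', (message) => announced.push(message.infohash));
server.handle('announce_peer', { infohash: '1224b03c763dafedae76d1a2dfb16a0396c90e72' });
server.handle('ping', {}); // no hooks registered, so nothing runs
```

With this shape the scraper would subscribe to `get_peers` and `announce_peer` from outside, and the server itself would need no knowledge of the scraper.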

Ideally the project would be broken down into:
a) dht-server able to support 100K+ nodes
b) tracker (such as opentracker) that also helps maintain a list of torrents
c) API for torrent information / searching

That, or another idea I have had in mind is to set up AlphaReign nodes that are dht-servers but also support a second protocol for sharing torrent information between AlphaReign nodes, so that everyone using the AlphaReign scraper can help share the torrent information.
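
A very rough sketch of that second idea, assuming each node keeps a map of torrent records and can pull newer records from another node. The class `AlphaNode`, its methods, and the record shape are all hypothetical; no wire format or transport is implied.

```javascript
// Hypothetical node that keeps torrent records and can sync them from another node.
class AlphaNode {
  constructor() {
    this.torrents = new Map(); // infohash -> { name, seenAt }
  }
  record(infohash, name, seenAt) {
    const existing = this.torrents.get(infohash);
    // Keep the most recently seen record for each infohash.
    if (!existing || existing.seenAt < seenAt) {
      this.torrents.set(infohash, { name, seenAt });
    }
  }
  // Return records seen after the given timestamp, as another node would request them.
  since(timestamp) {
    return [...this.torrents].filter(([, torrent]) => torrent.seenAt > timestamp);
  }
  // Pull everything the other node has seen after `timestamp`.
  syncFrom(other, timestamp = 0) {
    const updates = other.since(timestamp);
    updates.forEach(([infohash, torrent]) => this.record(infohash, torrent.name, torrent.seenAt));
    return updates.length;
  }
}

const nodeA = new AlphaNode();
const nodeB = new AlphaNode();
nodeA.record('ad4f9ce5aa00943c01da3fd551250bd367729a7a', 'Example A', 10);
nodeB.record('1224b03c763dafedae76d1a2dfb16a0396c90e72', 'Example B', 20);
nodeA.syncFrom(nodeB);
nodeB.syncFrom(nodeA);
console.log(nodeA.torrents.size, nodeB.torrents.size);
```

After one exchange in each direction both nodes hold both records, which is the property the shared protocol would be after.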
