Cloning website #186
I think there are plenty of tools online for cloning a website. You could then use https://www.norconex.com/collectors/collector-filesystem/ to index that content with Solr.
Is the solution proposed by @OkkeKlein working for you? Otherwise, we can turn this ticket into a feature request to allow pluggable implementations of how files are saved when downloads are kept. One such implementation could mimic the directory structure of the website being crawled. Would that be useful to you?
I would still like to use Norconex for my entire project :). At the moment I am trying to write something myself.
Marking as a feature request for a way to override/customize how downloaded files are stored.
+1. Maybe an exporter rather than customizing how files are stored.
@dgomesbr, can you elaborate on what your exporter would look like? There are a few challenges with cloning websites in general. Some sites are dynamic or JavaScript-rendered and do not work well as static copies, and they may also rely on server-side logic we know nothing about. One example: http://example.com/home can lead to a home page, so we would make "home" a file. With what extension? If we do not give it ".html", the cloned static site will not open it properly; if we do give it ".html", we have to update all references to it. And what if http://example.com/home/about.html also exists? Then "home" needs to be a directory, but we already made it a file. We could add enough configuration options to cover such cases, but a one-size-fits-all solution may be hard, so maybe it is best to offer an interface people can extend to custom-code how they want the cloning done?
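To make the above concrete, here is a minimal sketch of what such a pluggable extension point could look like. None of this is an existing Norconex API; the names IUrlToPathMapper and DirectoryPerPageMapper are hypothetical, and the strategy shown is just one way to resolve the file-versus-directory conflict described above.

```java
import java.net.URI;
import java.nio.file.Path;

// Hypothetical extension point: maps a crawled URL to a local file path.
public interface IUrlToPathMapper {
    Path map(URI url, Path rootDir);
}

// One possible strategy: store every extensionless page as
// <path>/index.html so that http://example.com/home and
// http://example.com/home/about.html can coexist on disk.
class DirectoryPerPageMapper implements IUrlToPathMapper {
    @Override
    public Path map(URI url, Path rootDir) {
        String path = url.getPath();
        if (path == null || path.isEmpty() || "/".equals(path)) {
            path = "/index.html";
        } else {
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            if (!lastSegment.contains(".")) {
                // No extension: treat the URL as a "directory" page.
                path = path + "/index.html";
            }
        }
        return rootDir.resolve(url.getHost() + path).normalize();
    }
}
```

Under this strategy, http://example.com/home becomes example.com/home/index.html, so a later http://example.com/home/about.html no longer conflicts with it.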
The behavior I'm proposing is like what https://www.httrack.com/ does: copy everything locally as HTML. I don't have an opinion on how SPAs (single-page applications) and other JavaScript scenarios should be treated.
I want to build a system that makes an exact clone of a website and stores it locally. All links in pages have to be rewritten to point to the local structure, e.g. www.example.com/resource.jpg -> //local/file/system/mirrors/www.example.com/resource.jpg. This allows users to browse the copy of the website locally.
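For the link-rewriting part, a small sketch using the jsoup HTML parser. The mirrorPathFor() helper is hypothetical and stands in for whatever URL-to-local-path mapping scheme is chosen.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkRewriter {

    // Rewrites href/src attributes so cloned pages point at local files.
    // baseUrl is the page's own URL, used to resolve relative links.
    public static String rewriteLinks(String html, String baseUrl) {
        Document doc = Jsoup.parse(html, baseUrl);
        for (Element el : doc.select("a[href], link[href]")) {
            el.attr("href", mirrorPathFor(el.absUrl("href")));
        }
        for (Element el : doc.select("img[src], script[src]")) {
            el.attr("src", mirrorPathFor(el.absUrl("src")));
        }
        return doc.outerHtml();
    }

    // Hypothetical helper: maps an absolute URL to its local mirror path,
    // e.g. http://www.example.com/resource.jpg
    //   -> /local/file/system/mirrors/www.example.com/resource.jpg
    private static String mirrorPathFor(String absoluteUrl) {
        return "/local/file/system/mirrors/"
                + absoluteUrl.replaceFirst("^https?://", "");
    }
}
```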
In addition, all content needs to be sent to Solr.
As I understand it, the keepDownloads option is not meant for this purpose. Is there any other way to "clone" a website to a local file system today? If not, should I implement my own committer, using the ICommitter interface, for example?
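If the custom-committer route is taken, a minimal sketch of a filesystem-mirroring committer might look like the one below. The method signatures are my reading of the Committer Core 2.x ICommitter interface and should be verified against the version in use; the URL-to-path mapping is deliberately naive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import com.norconex.committer.core.ICommitter;
import com.norconex.commons.lang.map.Properties;

// Sketch: mirrors each crawled document onto the local file system.
public class FilesystemMirrorCommitter implements ICommitter {

    private final Path rootDir = Paths.get("/local/file/system/mirrors");

    @Override
    public void add(String reference, InputStream content, Properties metadata) {
        try {
            // e.g. http://www.example.com/resource.jpg
            //   -> /local/file/system/mirrors/www.example.com/resource.jpg
            // (Query strings and extensionless pages would need the kind
            // of mapping strategy discussed earlier in this thread.)
            Path target = rootDir.resolve(
                    reference.replaceFirst("^https?://", ""));
            Files.createDirectories(target.getParent());
            Files.copy(content, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new RuntimeException("Could not mirror: " + reference, e);
        }
    }

    @Override
    public void remove(String reference, Properties metadata) {
        // A full implementation would delete the local copy here.
    }

    @Override
    public void commit() {
        // Nothing to flush; files are written as they arrive in add().
    }
}
```

To also feed Solr, the same crawl could use two committers; if the committer-core version in use ships a MultiCommitter, it can wrap this one together with the Solr committer so every document goes to both.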