This repository has been archived by the owner on Apr 29, 2022. It is now read-only.

Search Enhancement: search bar should search for the preprint if it's not already in the database #89

Open
georgiamoon opened this issue Apr 15, 2020 · 14 comments
Labels
COVID-19 💵 Funded on Issuehunt This issue has been funded on Issuehunt enhancement New feature or request Mozilla 2020 Sprints priority question Further information is requested

Comments

@georgiamoon
Member

georgiamoon commented Apr 15, 2020


If a user searches for a preprint that is not already in the database, look up the preprint on various servers and allow the user to request or add a review.


IssueHunt Summary

Backers (Total: $100.00)


@georgiamoon georgiamoon added enhancement New feature or request COVID-19 labels Apr 15, 2020
@georgiamoon georgiamoon changed the title Search Enhancement: If a user searches for a preprint that is not already in the database, lookup the preprint on various servers and allow the user to request or add a review Search Enhancement: search bar should search for the preprint if it's not already in the database Apr 15, 2020
@sajacy

sajacy commented Apr 18, 2020

A couple things to note:

  • For both Preprints.org and Research Square, I was not able to find open APIs nor ToS or T&Cs for directly proxying preprint PDFs. How should we proceed with requests for preprints hosted by these types of sites? It seems inadvisable to simply reverse-engineer / scrape the PDF URLs.
  • The getpreprints repo only actually has EuropePMC implemented, whose catalog is about a day delayed (versus Crossref, for instance).

The sources that have APIs and documentation, which I can hook up to both search and resolution when a review is requested:

If there are other places to search, can we add a prioritized list here?

@issuehunt-oss

issuehunt-oss bot commented Jul 7, 2020

@prereview has funded $100.00 to this issue.


@wetneb

wetneb commented Jul 29, 2020

Because the search field says "Search preprints with PREreviews or requests for review by DOI, arXiv ID or title", I expect that if I paste in a DOI in that field, it will fetch the DOI metadata on the fly to display the paper in Prereview, letting me request a review or add one myself. Currently, it will return no results if the paper has not been added to Prereview before.

Recognizing such ids and fetching the corresponding metadata from the relevant services would perhaps be a good first step towards this issue. It is a lot easier than arbitrary search: fetching metadata with a known id is a lot cheaper than searching by free text.
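As a rough illustration of that first step, the sketch below classifies a search query as a DOI, an arXiv ID, or free text before any metadata is fetched. The function name `classify_query` and both regular expressions are assumptions for illustration, not code from this repository: the DOI pattern is a commonly used one for modern DOIs, and the arXiv pattern covers only new-style IDs such as `2004.01234v2`.

```python
import re
from typing import Tuple

# Commonly used pattern for modern DOIs (an assumption, not exhaustive).
DOI_RE = re.compile(r"^10\.\d{4,9}/[-._;()/:a-z0-9]+$", re.IGNORECASE)
# New-style arXiv IDs only: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
ARXIV_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")

DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def classify_query(query: str) -> Tuple[str, str]:
    """Return (kind, value) where kind is 'doi', 'arxiv' or 'text'."""
    q = query.strip()

    # Strip a resolver URL or "doi:" prefix if the user pasted one.
    candidate = q
    for prefix in DOI_PREFIXES:
        if candidate.lower().startswith(prefix):
            candidate = candidate[len(prefix):].strip()
            break
    if DOI_RE.match(candidate):
        # DOIs are case-insensitive, so normalise to lower case.
        return ("doi", candidate.lower())

    # Strip an optional "arXiv:" prefix.
    candidate = q
    if candidate.lower().startswith("arxiv:"):
        candidate = candidate[len("arxiv:"):].strip()
    if ARXIV_RE.match(candidate):
        return ("arxiv", candidate)

    # Anything else falls through to the existing free-text search.
    return ("text", q)
```

Only queries classified as `doi` or `arxiv` would trigger a single metadata lookup against the relevant service; everything else can keep using the existing database search.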

In my experience, querying multiple third-party search APIs to return search results to the user in real time is a bit brittle. We used to do this in https://dissem.in/ and that was pretty slow (for instance Crossref's API can be less reliable at times). We now ingest the sources proactively in our database (which is a challenge of its own given the size of these sources, of course).

@TheGuardianWolf

@harumhelmy this one seems good to start on. I've got a proposal for your search implementation, if you've not already considered it.

I currently work at a company that recently built search for a product from scratch, and we found this limiting compared to using a third-party search API such as those provided by Azure or Google. Would you be interested in leveraging these cloud search engines in the application?

The way this works: your documents live in your database. You submit an index to Azure or Google, then use their APIs to run your search queries. This gives you a host of features, such as search suggestions, and a more powerful search engine.

For this project, I would suggest constructing an index proactively from all the search sources, including your data and third-party data, as @wetneb suggests, then submitting it to one of the search services and querying their API with your search.

You might look into a closer integration with Azure since you are using that platform. The search service can pull and index data from an Azure DB without much glue code, if you don't mind platform dependence.
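To make the index-then-query flow in this proposal concrete, here is a minimal sketch that uses a tiny in-memory inverted index as a stand-in for a hosted search service. The class `InMemoryIndex` and the document fields `id` and `title` are hypothetical; this is not the Azure or Google API, only an illustration of merging local and third-party records into one index ahead of time and querying that index instead of calling third-party APIs at search time.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class InMemoryIndex:
    """Hypothetical stand-in for a hosted search index."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)
        self._docs: Dict[str, dict] = {}

    def add(self, docs: Iterable[dict]) -> None:
        """Index documents by the lower-cased words of their title."""
        for doc in docs:
            self._docs[doc["id"]] = doc
            for token in doc["title"].lower().split():
                self._postings[token].add(doc["id"])

    def search(self, query: str) -> List[dict]:
        """Return documents whose title contains every query word."""
        tokens = query.lower().split()
        if not tokens:
            return []
        matching = set.intersection(
            *(self._postings.get(token, set()) for token in tokens)
        )
        # Sort by id so results are deterministic.
        return [self._docs[doc_id] for doc_id in sorted(matching)]
```

Records from the site's own database and from each third-party source would be passed to `add` by separate ingestion jobs; a real search service would add ranking, suggestions, and typo tolerance on top of this.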

@TheGuardianWolf

I'm looking to work on this issue. Has anyone read my proposal above?

@harumhelmy
Contributor

@TheGuardianWolf sorry for the delay here! I think this might be a good solution! The rub is that we're also separately working on transitioning the site's backend to Postgres (it's currently on CouchDB). I don't know much about integrating cloud search engines yet, so I'm wondering whether you know how reusable your fix would be with a Postgres backend?

@TheGuardianWolf

TheGuardianWolf commented Aug 14, 2020

For this situation, let's imagine I've finished implementing the cloud solution. The end products are:

  • The indexer for the internal data stored on couchDB
  • Adapter to reshape 3rd party data from their api into a workable format
  • The indexer for any 3rd party data
  • The adapter for the cloud search engine API
  • The Search UI itself in the frontend

Of these, the only piece that needs to be rewritten is the indexer for internal data, as you'd need to fetch via SQL rather than NoSQL.

Because you are moving from NoSQL to SQL, I imagine there will be a moderate amount of schema change. I can try to abstract the data fetching from the DB as much as possible to minimise the time spent rewriting that part.

It would be good to get a bit more information about the current data structure vs the proposed new one along with any existing data indexing processes.
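One way to read the "abstract out the data fetching" idea in this comment is sketched below: the indexer depends only on a small interface, so a CouchDB-backed source could later be swapped for a Postgres-backed one without touching the indexing logic. All names here (`DocumentSource`, `ListSource`, `run_indexer`, the `submit` callback) are hypothetical, and `ListSource` is a test double rather than a real database client.

```python
from typing import Callable, Dict, Iterator, List, Protocol

Document = Dict[str, str]


class DocumentSource(Protocol):
    """Anything that can yield documents in the shape the index expects."""

    def iter_documents(self) -> Iterator[Document]: ...


class ListSource:
    """Test double standing in for a CouchDB- or Postgres-backed source."""

    def __init__(self, rows: List[Document]) -> None:
        self._rows = rows

    def iter_documents(self) -> Iterator[Document]:
        for row in self._rows:
            # Reshape each stored row into the fields the index needs.
            yield {"id": row["id"], "title": row["title"]}


def run_indexer(
    source: DocumentSource,
    submit: Callable[[List[Document]], None],
    batch_size: int = 100,
) -> int:
    """Send documents from `source` to `submit` in batches; return the count."""
    batch: List[Document] = []
    total = 0
    for doc in source.iter_documents():
        batch.append(doc)
        if len(batch) == batch_size:
            submit(batch)
            total += len(batch)
            batch = []  # start a new list so submitted batches stay intact
    if batch:
        submit(batch)
        total += len(batch)
    return total
```

With this shape, the migration would only require a new class implementing `iter_documents` against the SQL schema; `run_indexer` and the search-service adapter passed in as `submit` stay unchanged.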

@TheGuardianWolf

Would you be able to give me a working invite to your Slack team? The one in the README seems to be dead :(

I think we could talk about this more effectively via chat.

@harumhelmy
Contributor

@harumhelmy
Contributor

I'm logging off for the evening (EDT here), but for a bit more (vagueish) context: the new data structure is still WIP, but we should finalize it on Tuesday, or a little bit after, and in the meantime I'll dig up a spreadsheet that might help with elucidating the current data structure

@TheGuardianWolf

TheGuardianWolf commented Aug 15, 2020

Unfortunately I can't accept a shared channel request, as I don't have the paid version of Slack!

For the repo, could I suggest https://gitter.im for developer discussions? Unless there's a better solution I'm not aware of.

@harumhelmy
Contributor

Sorry for the bad invite 😅 can you try this one: https://join.slack.com/t/prereview/shared_invite/zt-9qpk9pc5-6fsyuI6hwMuenjusPxDTCw

@rudietuesdays rudietuesdays added the question Further information is requested label Oct 2, 2020
@murkatr

murkatr commented Nov 19, 2020

@harumhelmy @rudietuesdays Am I correct to say this issue is now also linked to the New Merge Platform project as issue #14?

@rudietuesdays
Contributor

@murkatr correct, though some of the implementation of this in the new merged platform is also related to building the API. Either way, it is related to the building taking place in the new merged platform.

cc @harumhelmy
