No index phone_doc-0520-3 found in Elastic Search #27

Open
mayankagrawal93 opened this issue May 21, 2021 · 6 comments

Comments


mayankagrawal93 commented May 21, 2021

I have been trying to run the code locally. All installation steps, including Spark, completed successfully. When I run shell_chat, the bot replies and I am able to chat with it. But an error gets printed: 'No index phone_doc-0520-3 was found in Elastic Search'.
I searched the Spark code I ran, and nothing in 'upload.py' creates an index named 'phone_doc-0520-3'.
What should I do to resolve this? Or do I even need to resolve it, given that the bot is already up and running?
It's searching for phone_doc-0520-3 in the file chirpy/core/asr/search_phone_to_ent.py.

Separately (instead of raising another issue, I am asking here), the bot's replies are okay but not exactly the same as in the live demo. The bot does not seem to understand some utterances that it understands in the live demo. Am I missing something here?
All Docker images have been pulled and the containers started. The only thing I have not set up is the Twitter opinion database in Postgres (for which it shows an error in the terminal).
Are these two errors (1. No index phone_doc-0520-3 found, 2. No Postgres) responsible for the reduced accuracy of my bot?

Thanks in advance for your reply!

Contributor

AshwinParanjape commented May 22, 2021

It is a phoneme-to-entity index that's primarily useful for correcting ASR errors. Since you aren't going to be running the bot with voice, you can ignore the error. But if you want, the following Python script can create the index for you:
https://github.com/stanfordnlp/chirpycardinal/blob/main/chirpy/core/asr/index_phone_to_ent.py

Neither of the two errors sounds serious enough to keep the bot from working as expected. What kind of replies are you getting? If you can post a couple of examples, I can probably take a guess.
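
To confirm whether the index is really absent, one option is to ask Elasticsearch directly. The following is a minimal sketch, not code from the chirpycardinal repository. It relies on the Elasticsearch REST behaviour that a `HEAD /<index>` request answers 200 when the index exists and 404 when it does not. The `ES_URL` value and the helper names `index_url` and `index_exists` are assumptions made for illustration; a secured cluster would also need credentials.

```python
import urllib.error
import urllib.request

# Assumption: Elasticsearch is reachable here without authentication.
ES_URL = "http://localhost:9200"


def index_url(base_url: str, index: str) -> str:
    """Build the URL for an index, tolerating a trailing slash on the base."""
    return f"{base_url.rstrip('/')}/{index}"


def index_exists(base_url: str, index: str, timeout: float = 5.0) -> bool:
    """Return True if HEAD /<index> answers 200, False if it answers 404."""
    request = urllib.request.Request(index_url(base_url, index), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        # Auth failures and server errors should stay visible to the caller.
        raise
```

Calling `index_exists(ES_URL, "phone_doc-0520-3")` against the running instance should then tell you whether the index named in the error message is present.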

@mayankagrawal93
Author

Thanks for your quick reply!
I see! Then I will just leave out the voice part.

My guess is that the bot is not able to understand entity names. Here are a few examples from both the live demo and my installation:

  1. Asking about musicians
    i. Live demo
    image

    ii. My installation
    image

  2. Asking about famous persons
    i. Live demo
    image

    ii. My installation
    image

The bot doesn't seem to understand famous persons, musicians, etc., and sometimes it also misses context such as 'Can we talk about music', which the live demo handles.

@AshwinParanjape
Contributor

There seems to be something wrong with entity linking.

  • Can you confirm if the elasticsearch instances are running fine?
  • How many docs are in the indices? (This is to confirm that the uploading and indexing has happened correctly)
  • Can you add a logtofile_path here:
    logtofile_level=LOGTOFILE_LEVEL, logtofile_path='',

    Or equivalently, change the logtoscreen_level so that it shows more info? You should be able to see if any entities are detected at all.

Since we already know phone_to_doc is throwing an error, it is a suspect and might be interfering with entity linking. Maybe just try indexing with https://github.com/stanfordnlp/chirpycardinal/blob/main/chirpy/core/asr/index_phone_to_ent.py to see if the error goes away.
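
For the document-count question above, a short script can list every index with its count in one request. This is a sketch under stated assumptions, not part of the project: it uses the Elasticsearch `_cat/indices?format=json` endpoint, whose rows carry the index name under `index` and the count as a string under `docs.count`. `ES_URL`, `parse_doc_counts`, and `fetch_doc_counts` are names introduced here for illustration.

```python
import json
import urllib.request
from typing import Dict, Optional

# Assumption: Elasticsearch is reachable here without authentication.
ES_URL = "http://localhost:9200"


def parse_doc_counts(cat_indices_json: str) -> Dict[str, Optional[int]]:
    """Map index name to document count from a `_cat/indices?format=json` body."""
    counts: Dict[str, Optional[int]] = {}
    for row in json.loads(cat_indices_json):
        raw = row.get("docs.count")
        # An index without a reported count is recorded as None rather than 0.
        counts[row["index"]] = int(raw) if raw is not None else None
    return counts


def fetch_doc_counts(base_url: str, timeout: float = 10.0) -> Dict[str, Optional[int]]:
    """Fetch the index listing from a running instance and parse it."""
    url = f"{base_url.rstrip('/')}/_cat/indices?format=json"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_doc_counts(response.read().decode("utf-8"))
```

Printing `fetch_doc_counts(ES_URL)` gives one number per index, which makes it easy to spot an index that is missing or much smaller than expected.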

@mayankagrawal93
Author

I tried this:

  1. The Elasticsearch instance seems to be running fine.
    image

  2. When I query for the document counts:
    image
    Does this count seem okay, or is it too low?
    The files I gave as input to preprocess.py were:
    i. From https://dumps.wikimedia.org/wikidatawiki/entities/ : latest-all.json.bz2 (1 file in total)
    ii. From https://dumps.wikimedia.org/enwiki/20210220/ : enwiki-20210220-pages-articles-multistream.xml.bz2 (18.0 GB) and enwiki-20210220-pages-articles-multistream-index.txt.bz2 (219.6 MB) (2 files in total)
    iii. From https://dumps.wikimedia.org/other/pagecounts-ez/merged/ : pagecounts-2020-08-views-ge-5-totals.bz2 (1 file in total)

  3. This is the output of the log file. I have pasted logs only for the particular utterances where the issue occurs. I had to manually add a line to the code to print the Elasticsearch results (results['hits']['hits']). To jump directly to that part of the logs, search for "results of elastic search are".
    logs.txt

Output is coming back, but it may be missing some entities. Please let me know whether the input to preprocess.py was correct.

Meanwhile, I will index phone_to_doc and let you know if that improves the accuracy.

@mayankagrawal93
Author

To create the phone_to_doc index, the script tries to find WIKI_ENTITIES = "/u/scr/nlp/data/Wikipedia/enwiki-20200520-pages-articles-multistream-spans.json.bz2", but I don't see any spans file in the Spark dump of Wikidata.
These are the files in the output:
image

Please let me know which one to give as input for WIKI_ENTITIES.

@AshwinParanjape
Contributor

Sorry for the super late reply. I was not available in the meantime.

It seems that you only have 700k articles in the Elasticsearch index, which means not everything got uploaded. I think that's the problem: there is no "Linkin Park" Wikipedia page to link to.

@anumpamme helped fix an error with the indexing and I just merged it in this commit - ccff9b9

Can you do this?
1 - Pull the latest version
2 - Rerun wiki-es-dump/upload.py with the appropriate args
3 - Get the counts (in particular articles)
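
Step 3 can be turned into a quick sanity check, sketched below under stated assumptions. It uses the Elasticsearch `GET /<index>/_count` endpoint, whose JSON response carries the total under `count`. The index name `enwiki-articles` is hypothetical (use whatever name you passed to upload.py), and the 6,000,000 floor is only a rough assumption based on the size of English Wikipedia in 2021; the right threshold depends on how preprocessing filters pages.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"     # assumption: local, unauthenticated instance
ARTICLES_INDEX = "enwiki-articles"   # hypothetical name; use the one given to upload.py
EXPECTED_MIN_ARTICLES = 6_000_000    # assumption: rough floor, adjust to your dump


def parse_count(count_json: str) -> int:
    """Extract the total from a `GET /<index>/_count` response body."""
    return int(json.loads(count_json)["count"])


def fetch_count(base_url: str, index: str, timeout: float = 10.0) -> int:
    """Ask a running instance how many documents an index holds."""
    url = f"{base_url.rstrip('/')}/{index}/_count"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_count(response.read().decode("utf-8"))


def upload_looks_complete(count: int, expected_min: int = EXPECTED_MIN_ARTICLES) -> bool:
    """Flag an upload as suspicious when the count falls below the floor."""
    return count >= expected_min
```

With the 700k figure from above, `upload_looks_complete(700_000)` is False, which matches the diagnosis that the upload stopped early.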
