PWA periodically return invalid JSON for same query #214

Open
arlake228 opened this issue Jun 1, 2021 · 8 comments

@arlake228
Collaborator

If you run the psconfig validate URL command against a PWA URL, you get a validation error roughly 1 time in 10. For some reason, PWA occasionally returns different JSON for the same call. We have seen this in multiple contexts:

  1. WLCG saw it first, because it was causing their hosts to recreate tests. A workaround has been added on the client that reads the URL, but the server-side problem should still be fixed.
  2. RNP independently saw the same issue during their own testing.

We need to identify what is causing the JSON to change and become invalid, and then determine the best way to fix it.

@arlake228 arlake228 assigned arlake228 and grigutis and unassigned arlake228 Jun 1, 2021
@grigutis

So far, I haven't been successful in recreating this behavior on my development host. Is there any more information about the servers that are exhibiting this issue [e.g., host OS & version, version of PWA installed, installation method (docker or RPM)]?

@DanielNeto

Hi @grigutis, I've seen this problem when I updated our PWA to the latest version here at RNP. We have a machine with CentOS 7.9 where we run the docker containers using this docker-compose file.
I thought it was a bug related to the JSON file size, so I removed most tests and reduced the number of hosts in the mesh, but I still had the problem from time to time. I ended up rolling back to the previous version that was working.

@grigutis

@DanielNeto I'm still not able to reproduce this. Would you be able to get some logs for me? This should do it:

docker logs -f --since 0m docker_pwa-pub1_1 > ~/pub.log & \
docker logs -f --since 0m docker_pwa-admin1_1 > ~/admin.log & \
docker logs -f --since 0m docker_mongo_1 > ~/mongo.log & \
docker logs -f --since 0m docker_nginx_1 > ~/nginx.log &

Run that, reproduce the problem, then you can kill those jobs and attach the logs to this issue.

@grigutis

@DanielNeto Actually, maybe logs won't be necessary after all. I was finally able to reproduce this. It only appears when configs have tests that use a disjoint topology.

@grigutis

grigutis commented Jul 7, 2021

Thanks to a user in Slack, I now know how to reliably reproduce this problem. It apparently only occurs when the app is under load. For example:

$ ab -n 100 -c 2 https://psconfig.opensciencegrid.org/pub/config/opn-all

and while that is going on, do

$ for i in `seq 1 10` ; do curl -s https://psconfig.opensciencegrid.org/pub/config/opn-all | wc -c ; done

If it's working correctly, you should see the same byte count for all 10 iterations. If it's not, you won't.
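
The same check can be scripted so that it also flags bodies that are not valid JSON. This is only a minimal sketch, assuming Node.js 18+ (for the global fetch); summarizeBodies and sample are hypothetical helper names and are not part of PWA:

```javascript
// Group response bodies by byte length and count the ones that fail JSON.parse.
function summarizeBodies(bodies) {
  const byLength = new Map();
  let invalid = 0;
  for (const body of bodies) {
    const size = Buffer.byteLength(body, 'utf8');
    byLength.set(size, (byLength.get(size) || 0) + 1);
    try {
      JSON.parse(body);
    } catch (err) {
      invalid += 1;
    }
  }
  return { distinctSizes: byLength.size, byLength, invalid };
}

// Fetch the same URL `count` times, one request after another,
// and summarize the bodies. Needs Node.js 18+ for the global fetch().
async function sample(url, count) {
  const bodies = [];
  for (let i = 0; i < count; i++) {
    const res = await fetch(url);
    bodies.push(await res.text());
  }
  return summarizeBodies(bodies);
}

// Example (run while ab is generating load):
// sample('https://psconfig.opensciencegrid.org/pub/config/opn-all', 10).then(console.log);
```

A healthy server should give distinctSizes equal to 1 and invalid equal to 0.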

I've also been reading a book about Node.js Design Patterns and came across something that sounds like it might be what is causing this issue.

> One of the most dangerous situations is to have an API that behaves synchronously under certain conditions and asynchronously under others.
>
> The bug that you've just seen can be extremely complicated to identify and reproduce in a real application. Imagine using a similar function in a web server, where there can be multiple concurrent requests. Imagine seeing some of those requests hanging, without any apparent reason and without any error being logged. This can definitely be considered a nasty defect.

I think the problem lies somewhere here, but that's just a hunch. I see that promise is being overridden in Mongoose, but I'm not sure yet whether that has anything to do with it.
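
To make the quoted pattern concrete, here is a rough sketch of a function that is synchronous on a cache hit and asynchronous on a miss, next to a version that always defers. All names (inconsistentRead, consistentRead, slowLoad) are made up for illustration; I have not confirmed that PWA's code actually does this:

```javascript
// Inconsistent: a cache hit calls back synchronously, a miss calls back
// whenever the (assumed asynchronous) loader finishes.
function inconsistentRead(cache, key, load, cb) {
  if (cache.has(key)) {
    cb(cache.get(key)); // runs before the caller's next statement
    return;
  }
  load(key, (value) => {
    cache.set(key, value);
    cb(value);
  });
}

// Consistent: the cache-hit path is deferred with process.nextTick,
// so cb never runs before the caller's next statement.
function consistentRead(cache, key, load, cb) {
  if (cache.has(key)) {
    process.nextTick(() => cb(cache.get(key)));
    return;
  }
  load(key, (value) => {
    cache.set(key, value);
    cb(value);
  });
}

// Hypothetical loader standing in for a file or database read.
function slowLoad(key, done) {
  setImmediate(() => done(key.toUpperCase()));
}
```

With the first version, code that registers listeners or sets up state after the call sees different ordering depending on whether the cache is warm, which is the kind of thing that would only show up under concurrent requests.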

@grigutis

grigutis commented Aug 25, 2021

Just to give some more details about this …

A colleague and I took a deeper look at this and when the issue appears, the host_groups_details variable is not being fully populated before the psconfig JSON object is returned.

We're not sure exactly where the error is happening, because the nested async functions and anonymous callbacks make the code very confusing to follow, but in general, the flow goes like this (all in meshconfig.js):

exports.generate
exports._process_published_config
async.eachSeries
async.parallel
generate_group_members
resolve_hostgroup

We made several attempts to fix the problem, but none were successful, and we came to the conclusion that rewriting the "/config/:url" route from scratch was probably the best way forward.
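
For anyone following along, the symptom (an object returned before it is fully populated) is what you get when a completion callback fires before all the nested async work has reported back. The sketch below shows that general class of bug and one callback-style fix. It is not the actual meshconfig.js code, and lookup, collectTooEarly and collectWhenReady are hypothetical names:

```javascript
// Hypothetical stand-in for an async lookup such as a database query.
function lookup(id, cb) {
  setTimeout(() => cb(null, { id, resolved: true }), 5);
}

// Buggy: done() runs before any lookup has called back,
// so `details` is still empty when the caller serializes it.
function collectTooEarly(ids, done) {
  const details = {};
  ids.forEach((id) => {
    lookup(id, (err, value) => {
      details[id] = value;
    });
  });
  done(null, details);
}

// Fixed: done() runs only after every lookup has reported back,
// or once with the first error.
function collectWhenReady(ids, done) {
  const details = {};
  let pending = ids.length;
  let failed = false;
  if (pending === 0) return done(null, details);
  ids.forEach((id) => {
    lookup(id, (err, value) => {
      if (failed) return;
      if (err) {
        failed = true;
        return done(err);
      }
      details[id] = value;
      pending -= 1;
      if (pending === 0) done(null, details);
    });
  });
}
```

In the buggy version, how much of the object is filled in depends on timing, which would fit a problem that only appears under load.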

@ShawnMcKee

Just wondering about the status on this. For OSG/WLCG we are worried that "variable" configs coming from PWA could be part of the problems we are seeing. We can track how often this is occurring using our CheckMK monitoring. For psconfig-itb see https://psetf.aglt2.org/etf/check_mk/index.py?start_url=%2Fetf%2Fpnp4nagios%2Findex.php%2Fgraph%3Fhost%3Dpsconfig-itb%26srv%3Dpsconfig-itb_stats%26theme%3Dmultisite%26baseurl%3D%2Fetf%2Fcheck_mk%2F%26view%3D4 and for psconfig see https://psetf.aglt2.org/etf/check_mk/index.py?start_url=%2Fetf%2Fpnp4nagios%2Findex.php%2Fgraph%3Fhost%3Dpsconfig%26srv%3Dpsconfig_stats%26source%3D0%26theme%3Dmultisite%26baseurl%3D%2Fetf%2Fcheck_mk%2F%26view%3D4

@grigutis

I'm still working on it, but I would appreciate any help.

I'm working in the issue-214 branch, and the problem seems to be in meshconfig.js. I suspect it is in either the exports._process_published_config or the generate_group_members function. The problem might be caused by how the generate_group_members function is being called asynchronously.

I'm trying to rewrite the callbacks into promises (async/await) to make the code flow clearer, but this is proving to be a real pain.
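
For reference, this is roughly the shape such a rewrite could take. It is only a sketch under assumptions: resolveHostgroupCb is a made-up stand-in for a callback-style function like resolve_hostgroup (its real signature is not shown in this thread), and toPromise and collectGroupDetails are hypothetical helpers:

```javascript
// Hypothetical callback-style function standing in for resolve_hostgroup.
function resolveHostgroupCb(groupId, cb) {
  setImmediate(() => cb(null, [`${groupId}-host1`, `${groupId}-host2`]));
}

// Wrap a call to a Node-style (err, result) callback function in a promise.
function toPromise(fn, ...args) {
  return new Promise((resolve, reject) => {
    fn(...args, (err, result) => (err ? reject(err) : resolve(result)));
  });
}

// Resolve every group concurrently and return only once all results are in.
async function collectGroupDetails(groupIds) {
  const members = await Promise.all(
    groupIds.map((id) => toPromise(resolveHostgroupCb, id))
  );
  const details = {};
  groupIds.forEach((id, i) => {
    details[id] = members[i];
  });
  return details;
}
```

Node's built-in util.promisify does the same kind of wrapping for functions that follow the (err, result) convention. The point of this shape is that the caller cannot get the details object until every lookup has resolved or one has rejected.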


4 participants