Routing table error after long uptime or 3+ nodes #38

Closed
2E0PGS opened this issue Jan 12, 2020 · 19 comments

@2E0PGS

2E0PGS commented Jan 12, 2020

Great firmware, I just successfully range tested two modules with impressive results on the 868MHz band.

A few improvement ideas

  • Can we get some kind of status page? I wasn't sure if the other node was connected or in range. The only way I knew was by getting a friend to ping messages back and forth with me on the web chat while I walked around outside.
  • Can we drive the OLED display that's present on most pre-made compatible boards from China? Even a simple status and IP address would show that it's working.

Cheers

@samuk
Collaborator

samuk commented Jan 12, 2020

Yes, both good ideas. The status page has been discussed in the issue below:

#35

Writing scrolling last-sent messages to the OLED has been mentioned on the mailing list.

@paidforby
Contributor

@samuk is correct, both of these issues have been on my mind.

For what it is worth, the latest firmware (which I will release as 0.1.1 soon) includes a console interface, accessed through serial, that allows you to print the routing table by typing lr -r. See the testing firmware section of the readme for the current abilities of the console interface. The routing table will show "connected" nodes along with the "quality" of the connection in the form of the metric.

I'm planning on figuring out a way to display the routing table in the web app.

@paidforby
Contributor

@2E0PGS check out the new "Active Nodes" list feature mentioned in the related issue, #35 (comment)

It should work if you build both the firmware and the web app from latest; otherwise, I will be compiling a pre-built binary for 0.1.1 soon.

No progress on utilizing the OLED screen yet.

@2E0PGS
Author

2E0PGS commented Jan 19, 2020

ok cool thanks!

@2E0PGS
Author

2E0PGS commented Jan 25, 2020

Sorry it took me a while to try v0.1.1. I just flashed it on my two boards and it's working great!
I can now see the hops and metric.

I presume node 000000000000 is the local node's address? hops 00 and metric 00

@2E0PGS
Author

2E0PGS commented Jan 26, 2020

It looks like there may be a bug with the beacon length when the device is left on for a long time.

2020-01-25 17:20:31

image

2020-01-26 00:46:40

image

I didn't change any settings in GQRX.

The only changes I can think of are room temperature, the laptop warming up, the LoRa board warming up, and the HackRF warming up.

Version 0.1.1 from the binary release.

I had two nodes running there. Oddly enough, unplugging and replugging didn't reset it.

However, back this morning with a cold room and cold devices (switched off overnight), they're back to how they were in the first screenshot.

@2E0PGS
Author

2E0PGS commented Jan 26, 2020

I will try to replicate it by artificially heating up my board, or by leaving one on and one off and comparing after a few hours.

@2E0PGS
Author

2E0PGS commented Jan 26, 2020

No sudden changes from artificially heating my SDR or my LoRa board. I am using TTGO.

I will try leaving one running for now. Then I can turn the other on later and compare.

@2E0PGS
Author

2E0PGS commented Jan 26, 2020

The only code references I see are these two: https://github.com/search?q=org%3Asudomesh+beaconInterval&type=Code

@2E0PGS
Author

2E0PGS commented Jan 26, 2020

Or the glitch relates to the route message getting longer. Android phone on the WiFi slowing it down?

@2E0PGS
Author

2E0PGS commented Jan 26, 2020

I did some testing today. Here are the results.

During all of this testing the GQRX settings were not modified. I am running v0.1.1 firmware from prebuilt binaries on two TTGO boards.

Ignore the extra harmonics. They are due to one board being powered via a grounded mains-to-5V PSU and the second via a battery power bank. The second, lagging signal is the power bank TTGO; we shall call this node 2.

2020-01-26 17:28:28 "Receiver Options"

This is the beginning of the test and I show a few setting windows.
image

2020-01-26 17:28:38 "FFT Settings"

image

2020-01-26 17:28:42 "Input controls"

image

2020-01-26 22:03:16

Several hours into testing I notice an increase in the signal TX length.
image

2020-01-26 22:33:48

I power cycle one of the boards to see if this changes its TX length. It makes no change.
image

2020-01-26 22:35:50

I decide to try power cycling both boards to see if the issue is related to packets exchanged between the two, maybe routing information. This resolves the problem.
image

@2E0PGS
Author

2E0PGS commented Jan 27, 2020

Running one node on its own for hours with no neighbours didn't produce this behavior. This makes me think it's route message related.

@samuk
Collaborator

samuk commented Jan 27, 2020

Interesting stuff, wonder if it's worth testing with the latest code? Realise not that much has changed, but might be worth verifying it's still an issue?

@paidforby
Contributor

It is highly likely that there is an unknown error in the routing message logic that only appears after a long uptime. My guess is that a byte gets shifted somewhere and starts filling the routing table with false routes. This would explain why it doesn't go away after only one node is restarted, because the node that was kept on immediately shares those false routes with the rebooted node. However, when both are rebooted, their routing tables are reset and their little network "forgets" about the false routes.

Note: this is just my theory, I would need to do some actual testing and write some debugging code to demonstrate that this is happening.
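
To make the theory above concrete, here is a toy simulation. It is not the firmware's code: the `Node` class, the two-character addresses, the merge rule, and the absence of route expiry are all assumptions made up for this sketch. It only shows why a bad entry could survive a single reboot when neighbours keep re-advertising their tables.

```python
class Node:
    """Hypothetical node that learns routes only from neighbours' adverts."""

    def __init__(self, addr):
        self.addr = addr
        self.routes = {}  # destination -> (next_hop, hops)

    def reboot(self):
        # A power cycle clears the in-memory routing table.
        self.routes = {}

    def advertise(self):
        # Snapshot of the table, as it would be sent in a routing packet.
        return dict(self.routes)

    def receive(self, sender, advertised):
        # The sender itself is a direct neighbour, one hop away.
        self.routes.setdefault(sender.addr, (sender.addr, 1))
        for dest, (_, hops) in advertised.items():
            if dest == self.addr:
                continue  # never store a route to ourselves
            if dest not in self.routes or hops + 1 < self.routes[dest][1]:
                self.routes[dest] = (sender.addr, hops + 1)


def exchange(a, b):
    """One round in which both nodes send their table to the other."""
    adv_a, adv_b = a.advertise(), b.advertise()
    b.receive(a, adv_a)
    a.receive(b, adv_b)


def simulate():
    a, b = Node("aa"), Node("bb")
    exchange(a, b)
    a.routes["ff"] = ("bb", 2)  # inject a corrupted (false) route on node a
    exchange(a, b)              # node b now learns the false route too

    b.reboot()                  # power cycle only one board
    exchange(a, b)
    false_route_after_one_reboot = "ff" in b.routes

    a.reboot()                  # power cycle both boards
    b.reboot()
    exchange(a, b)
    false_route_after_both = "ff" in a.routes or "ff" in b.routes
    return false_route_after_one_reboot, false_route_after_both


print(simulate())
```

Under these assumptions `simulate()` returns `(True, False)`: the false route comes straight back after one reboot and disappears only when both tables are cleared at the same time, which matches the behaviour reported earlier in this thread.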

@tlrobinson
Contributor

tlrobinson commented Jan 28, 2020 via email

@samuk added the bug label Feb 13, 2020
@2E0PGS
Author

2E0PGS commented Feb 13, 2020

Re the hypothesis, this sounds about right to me. I suspect the routing table is filling up, and this has the knock-on effect of a longer TX length because the routing message is longer.
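
As a rough illustration of why a longer message shows up as a longer burst on the waterfall, the sketch below uses the time-on-air formula published in the Semtech SX127x datasheet. The radio settings (SF9, 125 kHz bandwidth, coding rate 4/5, 8 preamble symbols) and the payload sizes are placeholder assumptions, not values taken from the firmware.

```python
import math


def lora_time_on_air(payload_len, sf=9, bw_hz=125_000, cr=1,
                     preamble_syms=8, explicit_header=True, crc=True,
                     low_dr_opt=False):
    """Return LoRa packet airtime in seconds (Semtech SX127x formula).

    cr is the coding-rate index 1..4, meaning 4/5 .. 4/8.
    All defaults are assumed settings, not the firmware's configuration.
    """
    t_sym = (2 ** sf) / bw_hz            # duration of one symbol
    ih = 0 if explicit_header else 1     # implicit-header flag
    de = 1 if low_dr_opt else 0          # low data rate optimisation flag
    numerator = 8 * payload_len - 4 * sf + 28 + 16 * int(crc) - 20 * ih
    payload_syms = 8 + max(
        math.ceil(numerator / (4 * (sf - 2 * de))) * (cr + 4), 0)
    return (preamble_syms + 4.25 + payload_syms) * t_sym


# Hypothetical payload sizes: a short routing message versus a bloated one.
for size in (20, 200):
    print(f"{size:3d} bytes -> {lora_time_on_air(size) * 1000:.1f} ms")
```

With any fixed radio settings the airtime grows roughly linearly with payload length, so a routing message padded out with extra entries would transmit for visibly longer.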

@paidforby changed the title from "Improvement ideas" to "Routing table error after long uptime or 3+ nodes" Apr 9, 2020
@samuk
Collaborator

samuk commented May 13, 2020

Would you be up for trying to replicate your error with the latest routing? Hoping this bug has just gone away: #57 (comment)

@paidforby
Contributor

Yes, it would be good to test whether this bug is resolved on the 1.0.0-rc.2 branch. That branch uses the latest updates to LoRaLayer2, which has switched to a more dynamic source routing (DSR) style and no longer requires the sharing of routing tables via routing table packets.

@paidforby
Contributor

Closing this issue and merging it with #81 since there is more activity on that thread and these seem closely (if not directly) related issues.
