
Bandwidth usage #2746

Open
oddjobz opened this issue Oct 25, 2024 · 15 comments

@oddjobz

oddjobz commented Oct 25, 2024

So, I've just noticed that the websocket updates seem to be a complete data refresh every couple of seconds .. this is generating ~120k per refresh, so maybe 2-3 MBytes per minute .. which is something like 2G per hour, or 50G per day. Which is kinda huge in the context of scaling to many users (which is what I'm looking at).

My current websocket client/server code just transfers deltas, I was wondering if there was any scope in the code for outputting to a local file / key-value store rather than a websocket, in order to hook in a more efficient ws mechanism?

(or alternatively, a way of cutting down the data packets ... other than disabling a lot of charts? or maybe compress the data?)

@allinurl
Owner

Good point! With a small tweak, you could probably read the named pipe that GoAccess uses to get the data directly. The --stdout option was added to gwsocket, but it hasn’t made it to GoAccess yet. Merging that change should be pretty straightforward, though. I recall there’s a request for real-time JSON format too, but for now, grabbing the output from the pipe seems like the easiest option.
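
A rough sketch of that idea, assuming the FIFO paths are set explicitly (the paths here are illustrative, and which of the two pipes carries the outbound JSON is worth verifying against your version):

# --fifo-in / --fifo-out are documented in goaccess(1); paths are illustrative
goaccess access.log --log-format=COMBINED -o report.html --real-time-html \
  --fifo-in=/tmp/ga-ws-in.fifo --fifo-out=/tmp/ga-ws-out.fifo &
# tap the raw payload from the pipe instead of over the WebSocket
cat /tmp/ga-ws-out.fifo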

@allinurl
Owner

Also, take a look at mod_deflate, I think it can handle application/json content types.
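
For reference, a minimal sketch of what that might look like in the Apache config (assuming mod_deflate, and mod_filter for AddOutputFilterByType, are loaded; whether it actually covers your report traffic is worth checking):

AddOutputFilterByType DEFLATE text/html application/json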

@oddjobz
Author

oddjobz commented Oct 25, 2024

Ok, the bandwidth usage is a little crazy .. is there any way to limit the frequency of responses, so no more than one every 5s? I know the whole idea is "live", but this will chew up my monthly bandwidth allowance in a matter of days ..?

@allinurl
Owner

--html-refresh=<seconds>
Refresh the HTML report every X seconds. The value has to be between 1 and 60 seconds. The default is set to refresh the HTML report every 1 second.
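
For example (the --html-refresh flag as per the excerpt above; the rest of the command line is illustrative):

goaccess access.log --log-format=COMBINED -o /var/www/html/report.html --real-time-html --html-refresh=5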

@oddjobz
Author

oddjobz commented Oct 26, 2024

It would appear html-refresh sends the first WS packet then stops .. and for some reason my persist and restore seem not to be working; logrotate just ran and after restarting goaccess instances I'm seeing blank charts. Very odd .. cache folders are not populated .. although I have an old cache folder that is. Might have to call it a night, will look again tomorrow.

@oddjobz
Author

oddjobz commented Oct 26, 2024

Mmm.. works when I launch it from the command line, but when I launch goaccess from a python script, persist doesn't write the database file in the cache folder .. no error. Will investigate tomorrow, I guess maybe it's a pty issue.

@allinurl
Owner

Just a quick heads up: if you're piping data or running it from a script, be sure to include -. e.g.,

# cat access.log | goaccess - --log-format=COMBINED

@oddjobz
Author

oddjobz commented Oct 26, 2024

Ok, so I seem to have resolved a number of issues. I'd not appreciated that the cache is only written on a clean exit, and my sub-process shutdown was obviously a little too severe. I'm now sending a SIGINT and that seems to result in the cache files being written on exit.
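
Roughly what I'm doing now, sketched as a shell equivalent of my Python wrapper (paths and options are illustrative):

goaccess /var/log/nginx/access.log --log-format=COMBINED \
  -o /var/www/html/report.html --real-time-html \
  --persist --restore --db-path=/var/cache/goaccess/site1 &
GOACCESS_PID=$!
# ... later, on shutdown, SIGINT lets GoAccess exit cleanly and write its cache:
kill -INT "$GOACCESS_PID"
wait "$GOACCESS_PID"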

However, what happens if the application (or server) has a hard crash? This seems to imply that logging information might be lost if there has been a logrotate since it was last restarted. Should I be restarting all goaccess instances following a logrotate (to ensure the cache is updated)?

html-refresh now seems to work for me; I was obviously doing something wrong here.
On bandwidth, I've reduced rows to 24, removed a couple of less important tables, and turned compression up to 9 .. which has left me with a throughput of 13k bytes per second .. a lot better, but possibly still 10x what it could be. I need to generate you an HTML file for the mouse-over issue, then I'll take a look at the alternative WS transport.
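
For reference, that trimming in command-line form is roughly (option and panel names here are per goaccess(1) but illustrative, so worth double-checking against your version):

goaccess access.log --log-format=COMBINED -o report.html --real-time-html \
  --max-items=24 --ignore-panel=REFERRERS --ignore-panel=KEYPHRASES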

@0bi-w6n-K3nobi
Contributor

Hi @oddjobz
It's great to talk to you here.
I think a few things are getting mixed up here.

What do you mean by "cache is only written ..."?
You are running in real-time mode, right?

Hmm... I see. You want to run in real-time mode and also generate persistent data storage.
Yes, GoAccess can do that, but there is a price to pay.
In real-time mode, GoAccess only needs to (or only worries about) generating the JSON data it transmits
over the WebSocket to the client.
I believe (and correct me if I'm wrong, @allinurl) that persistent data requires log processing to have
finished before it is saved to disk, as in normal (non real-time) mode.

Well... I use GoAccess day-to-day in real-time mode,
and I believe some of the tricks I've learned can be useful for you too.

I also use scripts for starting and stopping it; in fact, I created a systemd service for that.
I use the timeout tool to stop/end the service at the end of the day; I prefer that to running it indefinitely.
You only need to run: timeout --foreground TIME-TO-END goaccess ...
You can find more details with man timeout.
This works perfectly: it sends a TERM signal to GoAccess, which then finishes up by itself.
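
For example, a minimal sketch of that (the duration, paths and log format are illustrative):

# after 23 hours, timeout sends TERM and GoAccess shuts down cleanly,
# writing its persistent data on the way out
timeout --foreground 23h goaccess /var/log/nginx/access.log --log-format=COMBINED \
  -o /var/www/html/report.html --real-time-html --persist --restore --db-path=/var/cache/goaccess/site1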

But the price is... the service gets stopped. Of course, you then need to restart it,
and you need to refresh the page in the browser to get going again, because the WebSocket went away.
For me the advantage is that I get a frozen report (the state of the data) at the moment GoAccess was stopped.
For you, the advantage is that the persistent data gets saved to disk!

So yes, that's the short answer to your worry about losing persistent data:
make a copy every time before starting GoAccess,
and if something happens, you can reprocess the logs in normal mode at the end of the day.

I hope that is clear, and that it helps you.
Feel free to talk it over.

@oddjobz
Author

oddjobz commented Nov 5, 2024

" cache is only written ... "

Ok, so this would be a feature request :-)

Please can we have an option for goaccess to flush its persistent storage cache to disk periodically .. say every minute .. so the most data that could be lost would be one minute's worth?

In the meantime I'm going to set an hourly restart :) Just as a matter of interest, this is how I'm using it: embedded within a Vue application, so it automatically creates and maintains a live stats instance for every site it tracks. Another useful feature would be the ability to manipulate the side-bar a little more easily .. ;-)

[screenshot: GOAccessInMMS]

@0bi-w6n-K3nobi
Contributor

0bi-w6n-K3nobi commented Nov 8, 2024

Hi @oddjobz
Again, it's good to talk to you here.

Well... this is impossible in practice.
I will explain why.
I believe the best way to realize this is to run 2 instances:
one for real-time and one for offline log processing (at end-of-day, for example).

But why is this impractical?

  • The nature of GoAccess, first of all (@allinurl, please correct me if I'm wrong):
    this great tool was designed for 2 modes: real-time and offline log processing.
    The first is for real-time monitoring of your sites, and the second is for reports and persistent
    data storage (of course).
    The second mode saves its data at the end of processing, no more and no less.
  • Big overhead: if you are already complaining about the volume of JSON transported over the WebSocket,
    imagine what it would be like to save all of the data every minute.
    And a note here: the WebSocket only sends the data needed for the graphs
    and tables, which is not all of the data GoAccess retains.
    I know it is possible to use fast storage such as an M.2 NVMe drive,
    but GoAccess would need to save all of its data, not just a delta or a snapshot.
    Remember that it can retain millions of requests across several days, and that operation
    would slow it down.
  • Operating system and file system constraints: neither Windows nor Linux can be trusted enough
    with your data file, and we could talk about safety here at length.
    Only a real-time OS and very robust file systems can guarantee your data.
    Even ZFS/OpenZFS cannot do it; all file systems have an interval between data flushes.
    For example:
    if your machine turns off at the exact moment the data file is being saved, the file can be lost or corrupted;
    if not, your data will still be old data, not the data from the last minute.
    Of course, I can suppose you are using a UPS. Well, that can fail too.
    About the only safe option is a RAID controller in the server machine, with a battery (of course)!
  • The right mechanism for resuming from the stopping point: well, how do you resume from the stopping point?
    I.e., which mechanism/strategy would you adopt to resume,
    supposing the log file processing was interrupted before it finished?
    If you already have an answer to that, why not use it for offline processing at regular intervals,
    for example every hour, every half hour, etc.?

Hmm... what I propose is that you use GoAccess for what it does well,
and therefore without creating expectations where it may fail.
I use it day-to-day, processing roughly 20 sites and 6 million hits.
Processing logs offline/in batch and persisting the data is the way GoAccess works well.
That way you can split your processing into regular intervals, and back up your storage data files
before continuing with the next set of logs; that way you can pretty much guarantee against any data loss.
If an error happens, you can continue from the point of error by reprocessing the logs again.

Well, I hope that is clear. Again, feel free to make your point.

@oddjobz
Author

oddjobz commented Nov 8, 2024

Ok, so the statistics need to be relatively accurate, but losing a small percentage of the information is not an issue. So when you talk about data safety, I think you are missing the point.

Here is the operational scenario:

  1. (n) GoAccess processes run on (n) log files (say for example n=300)
  2. These logs are real-time and start from the beginning of the current log file
  3. Every night "logrotate" moves each log to "log.1" and starts a new log file
  4. After 3 days, the GoAccess processes (all 300) need to restart

Problems:

  1. Each GoAccess process will restart with the current .log file, losing 2 days of history
  2. If there is no persistence, this is a lot of concentrated processing in one hit, not great for the server
  3. If this repeats, there will never be any realistic history available for any of the virtual servers.
  4. If this is not "live", then most of the point of GoAccess is lost.

What I do to try to mitigate this:

0 0 * * * /usr/local/bin/mms_weblogs --restart

Which restarts all my GoAccess instances and forces them onto new logs. This works, and in the event of a crash the current logs would still be available so little or no information would be lost.

So, I'm not saying the issue can't be solved; I've already solved it. What I'm suggesting is that having to do this with "cron" is a bit of a messy / poor solution, and it would be a lot "cleaner" if GoAccess had this ability itself.

@0bi-w6n-K3nobi
Contributor

0bi-w6n-K3nobi commented Nov 8, 2024

@oddjobz
Ok, continuing...

Well, as I said above, if you separate this into 2 processes, all the problems you quoted will be solved!
So below I will describe (as one suggestion) how you can solve them:

  • Real-time processing does not need to be stopped. You can use the --keep-last option; see more detail
    in the manual, in the parse options section.
    That way you can keep, say, the last 3 days; when a new day comes, the data for the oldest one is cleaned out!
    And if you also use offline processing, and have a backup of the persistent data files, you can use different values,
    i.e. generate one report with the last 3 days, another with 7 days, another with 30 days, and so on.
    (See the sketch just after this list.)
  • logrotate can use a date suffix instead of the numbers 1, 2, 3, etc.
    That way you can be sure you are processing the correct logs by date rather than by number.
    Keep more than 3 days of logs, in case some processing error happens.
    If storage space is a problem, you can compress the older ones.
    Your script will then need to be smart enough to detect compressed and uncompressed logs.
  • In real-time mode, if an error happens, you can reprocess the old logs up to the current log (today).
    Of course that requires a more elaborate solution, and for it you cannot use the current log directly, only a
    clone of it maintained in real time.
    I use this solution myself, not least because I have a lot of servers rather than just one.
    Either way, if some error happens, that file retains (at least) the last 3 days, so you can just run again
    from zero (with --keep-last active) and wait until it catches up to now.
  • "Oh, but I cannot stop the real-time instance... several people depend on it."
    So why not have 2 instances?
    If an error happens, the 2nd instance restarts and reprocesses from the old logs up to today.
    Some HTTP front ends, like NGinX, can redirect between them transparently.
    Of course, the WebSocket connection will be lost, and you will need to refresh the page!
  • The server is corrupted, lost, or damaged!
    Well, that's what the second server is for!
    And with the suggestions described here, you can just "rewind and fast-forward" as if it never happened.
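
A brief sketch of the --keep-last idea from the first point (values and paths are illustrative):

# retain only the last 3 days of data in the persistent store
goaccess access.log --log-format=COMBINED -o report.html --real-time-html \
  --persist --restore --keep-last=3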

Well, I think the opposite of you. I lean more towards the Unix philosophy: do one thing and do it well.
GoAccess works well, with 2 modes, each with its own distinct objective.
Again, you may ask me: why should I separate the processes?
GoAccess can process with multiple threads in the latest versions, so a big data volume is not a problem anymore.
And you can apply different filters, different intervals for the statistics, inhibit some panels, etc.

Well. Again, I hope that is clear.

@oddjobz
Author

oddjobz commented Nov 8, 2024

  • I fail to see the point of having a second instance. If one instance is live and one is not, the one that is not live won't get used.

  • I already use "keep-last", but I don't see how that's relevant either way.

  • Logrotate/compression is an issue. If GoAccess could process all log files for a virtual host regardless of the extension and whether they were compressed, that would help. So if it started without persistence and would automatically process 30 days' worth of logs, that would be a start, but 30 days * 300 virtual hosts is a lot.

  • From my timings, saving a snapshot of the database takes almost no time; doing this every hour should not be a significant overhead.

If you're happy with it the way it is, that's great. For me, although it's all there and looks great, it's operationally problematic. Whereas I accept the bandwidth issue is complex and not easy to solve (and something I can probably do myself), simply storing the data in a way that doesn't involve excessive reprocessing or data loss seems to be a fundamental issue.

@0bi-w6n-K3nobi
Contributor

0bi-w6n-K3nobi commented Nov 8, 2024

Hi @oddjobz.

Well... for me, backups exist never to be used... but if something does happen, it's amazing to have one!
That's what RAID 0, RAID 10 and so on exist for... I really hope never to need them, but...

GoAccess can process from standard input (STDIN on UNIX systems). You can use:

(bunzip2 -c LOG1.bz2; gunzip -c LOG2.gz; cat LOG3) | goaccess - SOME-MORE-OPTIONS-HERE

No, that's not what I said. You do not need to reprocess 30 days again.
If you have a backup of the persistent data storage files (PSDF), you only need to reprocess from the point of failure.
For example:

  • Each day you process only that day's logs...
  • Then you find out that an error happened 3 days back;
  • Well, just take the PSDF backup from 3 days back and reprocess only the logs from the last 3 days, and so on.

What I propose is offline log processing at end-of-day, or at some shorter interval. It is not excessive,
and GoAccess has multi-threaded processing, so it will be fast enough for that.
If you save copies of the PSDF before each log-processing run, you never lose any data.
And then you can have different report statistics: 3 days, 7 days, 30 days and so on, or different filters
(with a different PSDF for each case, of course, and the logs need to be reprocessed for each one).
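
For example, a rough sketch of that end-of-day routine (paths are illustrative):

DB=/var/cache/goaccess/site1
# back up the PSDF before touching it
cp -a "$DB" "/backup/goaccess/site1-$(date +%F)"
# feed yesterday's rotated log (compressed or not) into GoAccess and persist the result
zcat -f /var/log/nginx/access.log.1* | \
  goaccess - --log-format=COMBINED --persist --restore --db-path="$DB" \
  -o /var/www/html/report.html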

Well, it is true: a database server can take a snapshot in a few seconds, but it has mechanisms built for exactly that.
High availability and fault tolerance are intrinsic to the nature of a database.

GoAccess has no snapshots, transactions or atomicity for that. Everything happens in memory as its storage,
and to save the data it needs to "walk" the entire hash tree.
