-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TCP support for writing logs #9
Conversation
That's awesome! Will review it in the next days! |
gelf/tcpwriter.go
Outdated
func (w *TCPWriter) writeToSocketWithReconnectAttempts(zBytes []byte) (n int, err error) { | ||
var errConn error | ||
w.mu.Lock() | ||
for i := 0; n <= 3; i++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this loop always run three times? What happens when the first write is successful? The No. of re-attempts should be configurable. And a reconnect should be logged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points, I'll update that portion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed the logic and made the maximum number of reconnection attempts configurable. I am uncertain how the logging should work however. Simply returning the number of bytes and an error would probably be considered by the client as a failure. Changing the method signature would mean not implementing the io.Writer interface. Do you have an idea I might have missed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a problem. Let's skip that one for now. Maybe we see a solution in a later stage.
I think the reconnection logic is not that easy. First I think we should wait for some time till the next attempt is triggered (also configurable?). Another issue happens when you close the server side while the writer tries to send:
|
gelf/udpwriter.go
Outdated
@@ -57,9 +61,9 @@ func numChunks(b []byte) int { | |||
// New returns a new GELF Writer. This writer can be used to send the | |||
// output of the standard Go log functions to a central GELF server by | |||
// passing it to log.SetOutput() | |||
func NewWriter(addr string) (*Writer, error) { | |||
func NewWriter(addr string) (*UDPWriter, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency I would call this NewUDPWriter
. That's a breaking change but it's a major update anyway.
I tried to reproduce the bug you found an was unable to. Did you have code that you can share to reproduce or just a specific procedure? |
Sure, I basically took the example from the Readme and added a end-less loop:
Start a Graylog GELF input, let the program run and while it's sending logs stop the input. |
Any luck in reproducing this? |
Not yet, I have not had the time to work on this this week, but I think I have enough to go on for this weekend. |
I was able to reproduce on commit 204855d but not on commits since the reconnection delay has been added. I also found a small error where I was applying the reconnect delay before all writes instead of before the reconnects. |
Actually, adding TCP and TLS support has been sort of bouncing on and off my to-do list for a while. I did fix up the docs to make it clear that GELF isn't restricted to UDP and clarified a few other things (PR #9); it may be worth rebasing this PR on to that but I also have no problem with re-doing my commit on that PR off of this, if this wants to go in first. I've had only a quick look through these commits and I'm certainly not an expert Go programmer, but one comment I have is that I'd find this a lot easier to review if it were rebased to a) separate formatting changes and b) remove the things that were done and then undone later. Commit e98b8fb provides several examples of things that could be done to make the PR easier to read and review.
My suggestion would actually be to take all the refactorings that touch the original code and make them the first commits on the PR; then the actual functionality changes will fall into the place prepared for them in a way that everything is exactly what's needed to add the new functionality without extraneous other stuff. Oh, one other thing: I notice that the tests have gotten really slow (less than 1/10 second to over 8 seconds on my i7 system). It would be really nice to fix that; I'll have a look as I do further review and see if I see anything that might help there. BTW, feel free to e-mail me to work out where to do text chat (I prefer Gitter) if you want to talk back-and-forth. |
Agree with @C-J-S, I would love to merge this but the final review is a little bit hard in the state the PR is right now. |
Makes sense, I will work on rebasing the current work. Should I force push to this branch or open a new PR once done? For the test being slow, it is because the tests for TCP sleep a while. The tcpreader code that is used for the test has some synchronization issues that I was not able to fix yet. All that code is not exported however. I also joined the Graylog2 room on Gitter for further chats. |
@ghislainbourgeois you can force push to this PR. But like @C-J-S said, having one PR for refactoring work and one for the new feature would be great. In that way we can see more quickly if we have to think about some edge cases. I can also need some help on testing this, especially when this should go into Docker it needs a certain quality. |
Oh, I wasn't suggesting a separate PR for refactoring work! When refactoring in preparation for adding new features, refactorings are often one that you otherwise wouldn't do or even might actively object to. A bunch of holes in your lawn is only desirable if you know that they'll be filled with fence posts in a few days. :-) I think it's fine to force push to this PR. If you're worried about being able to reference back to this work, you can always create another branch in your own repo to maintain a pointer to these commits until you feel you don't need it any more. I will try to have another look at this tomorrow. One thing I noticed is that the code currently doesn't allow multiline messages (the complete message passed in is sent as FWIW, my use case for this is with SystemdJournal2Gelf to export log messages from various Unix systems into Graylog while maintaining more metadata than rsyslogd can manage. I expect to have messages sometimes hundreds of KB long. |
@C-J-S oh sorry, misread that. Could you do some testing with SystemdJournal2Gelf and the latest commits? That would be great help. |
@mariussturm: Sorry for taking so long to get back to you. Yes! Your new commits are much better. The refactoring looks nice and the interface change and the TCP code are nicely split. My previous concerns about the interface were due to a misreading of the However, the tests failed when I ran them. Are they failing for you? I will try, in the next day or two, to look at the current testing framework and see if it can be made to work faster and better, and also convert SystemdJournal2Gelf to use the new library and see how it works out. |
@C-J-S My bad for the tests, I will fix it and redo the rebase the same way, I forgot a small change. |
I was able to figure out how to speed up the tests and make them a bit more robust and I refactored them a little at the same time. The tests now take about a second on my machine instead of 8. |
That's great. The test cleanup is nice. Looks like all those changes can be squashed into the |
This just assumes that the `github.com/c-j-s/go-gelf/gelf` working copy is set to the branch above. I'm not clear on how one might ask for a specific version of a package in Go. Perhaps make it part of the working copy name? For details see: Graylog2/go-gelf#9
Yes, I did not want to rebase before you had a chance to be made aware in case you started work on the previous commit. It is now rebased. |
Oh, you should always feel free to rebase away; I'll still have all your old commits in my repo and I can simply rebase any work I've done based on your old commits onto the new version of your branch at whatever point it's convenient to me. I should be able to get some work done on SystemdJournal2Gelf tomorrow. Looking back at the API stuff, I realize now the bit that concerned me slightly actually is in |
gelf/message.go
Outdated
|
||
// Syslog severity levels | ||
const ( | ||
LOG_EMERG = int32(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could declare const like this.
const (
LOG_EMERG int32 = iota
LOG_ALERT
LOG_CRIT
LOG_ERR
LOG_WARNING
LOG_NOTICE
LOG_INFO
LOG_DEBUG
)
Maybe you unnecessarily to fix type of const because Go const can untype value.
So you can declare only value like this.
const (
LOG_EMERG = iota
LOG_ALERT
LOG_CRIT
LOG_ERR
LOG_WARNING
LOG_NOTICE
LOG_INFO
LOG_DEBUG
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this looks more idiomatic Go, and it still passes all the tests, however, I did not write that code, I simply moved it from the UDP code, so I cannot be sure if there was a reason for declaring the const like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested with a Graylog instance and this did not not break behaviour, so I applied the change and force pushed.
@C-J-S, regarding the Write(p []byte) interface, I am not sure either, we could keep it as is, put the full message in 'short_message', or change 'full_message' to contain everything but the first line. According to the GELF documentation (http://docs.graylog.org/en/2.2/pages/gelf.html#gelf-payload-specification), 'full_message' can contain a backtrace, and 'short_message' is required and should be short. I understand from this that 'full_message' supports multiline but not 'short_message', but it is not completely clear. Only referring to that documentation and the name of the fields, I would keep things as is. If I am missing information, I would be willing to change the implementation, but would also suggest updating to documentation to clarify. |
gelf/message.go
Outdated
return messageBytes, nil | ||
} | ||
|
||
func constructMessage(p []byte, hostname string, facility string, file string, line int) (m Message) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think return type of constructMessage
could be pointer of Message m *Message
. Then line 132 could be m = &Message{
and caller side can use without &m
again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I will make that change and rebase it in the two separate commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can certainly leave out the int32
declarations in the serverity levels because they're not int32s; the allowed range of the priority is 0 through 23. That said, I'd probably stick with assigning explicit numbers, as in
LOG_EMERG = 0
LOG_ALERT = 1
LOG_CRIT = 2
LOG_ERR = 3
...
because these numbers for individual facilities are specifically (if informally) assigned and cannot be changed; I think doing it as above makes this more clear.
I would do this as a separate refactoring commit before your other changes to keep it small and clear about what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@C-J-S I just forced push that commit rebased before the rest.
I've just added issue #11 with my take on the |
@@ -52,7 +52,10 @@ giving us both centralized and local logs. (Redundancy is nice). | |||
flag.Parse() | |||
|
|||
if graylogAddr != "" { | |||
gelfWriter, err := gelf.NewWriter(graylogAddr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not keeping the NewWrite function which would return the gelf.Writer
interface?
Depending on the string given to the NewWriter
function, an UDPWriter
or TCPWriter
would be returned.
It would be easier to add new writers in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the parameters are different, or ought to be, at any rate. A UDP connection has optional compression; TCP connections do not. TLS connections will need more parameters for the key and certificate information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be achieved using functional options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I do like those functional options. That's basically a monoid with the operation being function application and the identity element being the identity function (that returns its input). (Well, but appropriately distorted to take into account that mutable variables are being passed around.) That kind of thing is a trick we use in Haskell.
That said, while I can see how this could be a handy trick for wrapping the TCP connection in TLS, since they're both stream connections:
- I think we'd may still have the library supply a constructor for the TLS connectivity since applications other than
SystemdJournal2Gelf
may want to use TLS; and - I can't really see using a common constructor working for stream (TCP/TLS) and datagram (UDP) connections because the options are just so completely different.
The only thing that the stream and datagram connections share in common are the host and port. But datagram has chunking (implying a size limit users may want to set) and (optional) compression, neither of which stream connections allow. Stream connections potentially have all the TLS key and cert info, which datagram connections do not have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change function name to Dial
as net
package https://golang.org/pkg/net/#Dial with functional options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just thinking the other day, after reading about conventions for package names, that NewWriter
was the wrong name anyway. Dial
for consistency with other stuff sounded good until I did more research, and now I'm realizing that we seem to be going about this all wrong; GELF and its framing is all application-level protocol stuff and should just run over any stream or datagram connection, all of which we could have the client supply to us as something that implements Write([]byte) (len, error)
. Of course, we'd need some internal configuration to deal with setting appropriate framing based on whether we're told it's a stream or datagram (or "packet", as they say in net
package world) connection. Things like reconnecting should be handled by a different package whose Writer
the go-gelf
package would use.
LOG_DEBUG = 7 | ||
) | ||
|
||
func (m *Message) MarshalJSONBuf(buf *bytes.Buffer) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to have a custom MarshalJSONBuf
function instead of implementing the standard Marshaler
interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's unchanged code from where the PR started; I think that it's probably better to move refactorings like that (not directly necessary to this PR) to a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, you are right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep it, @dqminh did a lot of optimization there to get it as fast as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MiLk usage of MarshalJSONBuf
is mostly because we want to accept an existing buffer from a pool or otherwise to save on an extra allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, got it. Thank you.
An FYI: Yesterday @MiLk and I did a bunch of trial refactorings against this code; you can see them on my dev/cjs/170721/refactoring branch. These do some more API cleanup, among other things. I think it's best to sort them out a bit and do them against Discussion of this in the go-gelf Gitter channel is welcome. |
I mentioned above that I have some new thoughts on what the architecture of this library should look like. I've written them up in this document. I'm not sure if here is the best place to discuss it; possibly we want to open a new issue or PR about this? (As usual, I welcome Gitter chat, too.) (Side note: Yes, I did just change my GitHub account name from |
@0cjs thanks for writing this up. As it looks like the refactoring is way bigger then the TCP support. So I think it's just fair to merge that first and than work on the architecture. @ghislainbourgeois is waiting for this a long time now. |
I'd like to see a merge as soon as possible, too, but I worry about the effect of API changes on clients. Perhaps we should start by declaring the current head of master version 1.0.0, starting a And I'm really not clear on how the Go world deals with stuff like this. Maybe separate checkouts of different branches under, e.g., However you want to do it, it doesn't really much affect implementing my ideas; all the code is there to be pulled over and munged regardless of what branch it's on. |
@0cjs @mariussturm, I also think the architecture refactoring is a nice endeavor, but it will take a while to get it done properly, forcing a good quantity of code in a diverging state for that period. However, Go does not support package versions at all, we need to have a different import path to release a new version. I just found the GoPkg.in service that does url rewriting to branches or tags. With that service, we could merge the TCP code and create version tags and unique import paths while keeping a single repository. I tested it on my repository and it works fine. You simply have to name branches or tags with their naming convention and it is completely transparent. We could also keep the default branch to the earlier UDP-only version so existing client do not break. This would let us do the big refactoring in small incremental steps that we could continually merge, rather than going for the perfect architecture in one step. What do you guys think? |
I also would like to see this PR merged before any refactor, even if it requires more work during the refactor. I think the following plan is good:
GoPkg.in should support that correctly. |
Unfortunately, I tested this morning and we can only target major version numbers, and the TCP feature has some breaking changes. Even if we keep NewWriter, Writer has changed from a struct to an interface. My suggestion is thus to use v1 for the UDP-only version, v2 for this PR and v3 for the refactoring of the API. PR #13 contains suggested changes to the documentation that should be reviewed and merged before this PR. |
Let's do that then. I didn't see that we could use only major version numbers. |
Targeting only major version numbers for branches is just fine; if we use semantic versioning properly anybody using a particular version can safely upgrade to a later version with the same major version number (e g., 1.0.0 to 1.9.73) without changing their code in any way. As for the return values, if my understanding of Go is correct there should be no problem maintaining interface compatibility.
So there is no problem with that interface that I can see, so long as It doesn't really affect me personally if we have one version or two versions of the API now, and two versions or three versions of the API later, but I think it would be nicer to clients of this library that aren't following along with all of this not to offer them two incompatible API changes in a relatively short period of time. |
gelf/tcpreader.go
Outdated
} | ||
|
||
func (r *TCPReader) readMessage() (*Message, error) { | ||
r.okToRead.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will create a dead lock if you write 2 messages on the same connection.
gelf/tcpwriter_test.go
Outdated
func TestNewTCPWriter(t *testing.T) { | ||
w, err := NewTCPWriter("") | ||
if err == nil && w != nil { | ||
t.Errorf("New didn't fail") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.Error("New didn't fail")
gelf/tcpwriter_test.go
Outdated
func TestNewTCPWriterConfig(t *testing.T) { | ||
r, _, err := newTCPReader("127.0.0.1:0") | ||
if err != nil { | ||
t.Errorf("Could not open TCPReader") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
t.Error("Could not open TCPReader")
gelf/tcpwriter_test.go
Outdated
signal <- "stop" | ||
done := <-signal | ||
if done == "done" { | ||
return nil, fmt.Errorf("Wrong signal received") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errors.New("Wrong signal received")
gelf/tcpwriter_test.go
Outdated
signal <- "stop" | ||
done := <-signal | ||
if done == "done" { | ||
return nil, fmt.Errorf("Wrong signal received") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errors.New("Wrong signal received")
gelf/tcpwriter_test.go
Outdated
signal <- "stop" | ||
done := <-signal | ||
if done == "done" { | ||
return nil, nil, fmt.Errorf("Wrong signal received") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errors.New("Wrong signal received")
@ghislainbourgeois What about having
Is there any other breaking change? |
It will not really help because some clients like Docker specify |
@ghislainbourgeois could you please rebase? |
The severity levels are not int32, they are limited from 0-23.
This is a breaking change. Clients will need to use gelf.NewUDPWriter instead of gelf.NewWriter. The type returned will also change from *gelf.Writer to *gelf.UDPWriter, or gelf.Writer for the interface.
We expose a NewTCPWriter function that returns a TCPWriter type that implements the gelf.Writer interface.
By using a channel instead of a pair of mutexes and a slice of bytes attached to the reader. This also allows to write more than one message over the same connection.
@mariussturm I just rebased it |
🎉 thanks a lot four your awesome work @ghislainbourgeois!!! |
Thanks to you and all the reviewers! |
@mariussturm: It looks to me as if you did a squash commit when you brought this on to master, thus mushing five commits into one and making it significantly more difficult to review or audit the code in the future. (This is especially problematic for anybody who wants to go back and see what was done to add TCP, since the hundreds of lines of diff for that are now intermixed with many hundreds more lines of diff for the refactoring: in total there's about 2000 lines of diff in which to look to find any particular part of the change. I feel pretty strongly that such destruction of history is a bad idea. Can I suggest we drop this change and instead preserve the original commits by simply moving the master branch pointer up to the last commit on this PR? This is simply done with:
As a side note, the reason you need to fetch from Ghislain's repo above is because the commits on this PR are not actually in the Graylog2 copy of this repo. Presumably if Ghislain deleted his repo the commits on this PR would simply be lost forever, except perhaps via this GitHub web interface. |
BTW, I also have some README fixes ready to PR, but I'd like to get this sorted out first. |
@0cjs you can go back to this PR at any time to review the work. In the git history you want to have one commit per added feature. So squashing is a common practice. Otherwise you would have tons of commits in the history without a significant commit message. That's the reason why we want to have clear feature PRs without refactoring ten thinks on the side. |
The problem is, you can only review here in GitHub and you must know that a PR exists. When you work normally in your own clone of the Git repo none of this (the PR or its associated commits) is there. And also, this PR includes at least one commit that is nothing to do with the TCP addition (0f932fd). At any rate, if you feel strongly that future people looking back through the code outside GitHub should not be able to see these changes except as one huge blob I guess that's something I'll just have to live with. |
This change adds a way to send GELF logs over TCP. It also includes unit tests targeting a new tcpreader. The tcpreader's API is limited to the package because it will not handle multiple concurrent connections. I additionally tested it against a Graylog 2.2.1-1 instance running in docker.