Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interval log support #70

Merged
merged 18 commits into from
Nov 6, 2017
Merged

Interval log support #70

merged 18 commits into from
Nov 6, 2017

Conversation

marshallpierce
Copy link
Collaborator

@marshallpierce marshallpierce commented Oct 27, 2017

Mostly done; figured I'd get feedback on what I had so far.

The log format has some issues:

  • No way to describe what max value scaling factor was used, so you pretty much just have to know what it was if you care about the max value (which you probably shouldn't -- IMO, we should build better tools for displaying that sort of stuff rather than jamming in a semi-human-readable thing, like perhaps one or more metadata modes for our CLI tool to display min, max, etc)
  • Some comments are magical metadata comments. It would be better if metadata had a non-comment syntax.
  • Metadata can technically be interleaved with interval log lines, but it's unclear if this has any actual meaning
  • Tag support is pretty clearly bolted-on, but it's simple enough to parse fortunately.

Reading logs

This exposes a somewhat low level interface: it doesn't keep track of an observed BaseTime and add it to interval histograms' timestamps, for instance. Instead, it exposes an Iterator of the useful log lines, from which it would be simple to apply your own preference for dealing with timestamps, skipping logs you don't care about, etc. I decided to try using nom for parsing, and I think it came out pretty well.

Writing logs

Pretty straightforward. I chose to make a Tag abstraction so that we could guarantee the lack of tag-forbidden characters in a str with the type system. I also split writing duties into two separate types to prevent users from writing lines I think should only exist in the preamble (StartTime and BaseTime) once they've started writing interval histograms. (Helper methods for those things aren't there yet but the structure is in place.)

To do

  • Figure out what to do StartTime and BaseTime lines. There's no support for writing them yet mainly because I can't figure out what their semantics are in a clear enough way to document it in a way that isn't terribly confusing. They seem to sort of clash with each other.
  • Benchmarks.
  • Builder
  • Duration for interval duration
  • Duration for interval timestamp
  • SystemTime for writing StartTime/BaseTime
  • Better type for reading StartTime/BaseTime
  • Comment filtering to avoid breaking log format

@jonhoo
Copy link
Collaborator

jonhoo commented Oct 27, 2017

I'm confused. Why do we now have our own Serialize trait?

Will try to look over this at some point in not too long, but flying to a conference in Shanghai tomorrow morning, so may take a little while :)

@marshallpierce
Copy link
Collaborator Author

The Serializer trait is to abstract over the different serializers (V2, V2 + Deflate) so that they can both be dropped into the log writer as the user prefers.

No big rush on reviewing it; I need some time to ponder the above timestamp issues anyway.

I want to use that in interval log benchmarks, hence tackling that TODO.

For serialization benchmarks, this is just right: we want to exercise
all different varint lengths (modulo bucketing, which throws it off a
little bit). For non-serialization benches (e.g. add and subtract) this
is still a lot better than just regular random numbers, which will be
heavily weighted towards the last few buckets.

This drops measurements quite a bit, but that's expected because now
histograms aren't mostly using values way in the high end that all get
bucketed together, so now there's a lot more work to do when traversing
the histogram.
It's fast enough that I'm not stressing about optimizing it further.
@marshallpierce
Copy link
Collaborator Author

I wrote up my interpretation of how the Java impl treats StartTime and BaseTime.

I now have this wishlist for interval log v2:

  • Legend line is just a plain comment
  • Headers with a defined semantic like StartTime should not be comments, but should have their own syntax
  • Users should be able to add their own metadata to the log in a sufficiently structured way that a parser could sanely spit it back to them. Key=value?
  • Users should be able to also add metadata to individual intervals in the log.
  • Don't allow headers intermingled with interval histograms
  • Don't allow duplicate headers

Spur of the moment strawman syntax:

!Version 2
!StartTime 123.456
#Comment
?Key=value
... normal interval histogram

Not sure what to do about metadata for an interval. Maybe re-use url query param encoding? But almost everyone gets that wrong, so perhaps it's best to steer clear of that mess.

@marshallpierce
Copy link
Collaborator Author

Oh, and also: Fixes #59.

/// will be represented by the iterator providing the corresponding variants here. The presence
/// of those timestamps will affect how you should interpret the timestamps for individual
/// intervals.
#[allow(variant_size_differences)]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be worried about a 56-byte enum variant?


/// Write a StartTime log line. See the module-level documentation for more info.
///
/// This should only be called once to avoid creating a confusing interval log.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should be enforced by panicking if it's called twice. I couldn't think of a way to enforce it with the compiler without it being rather clumsy.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this some more, IntervalLogHeaderWriter really just seems like a builder to me. Maybe that's the way we should present it. That would allow us to only write things out when the user has committed to them, and then only write them once. Something like:

let mut ilb = IntervalLogBuilder::default().with_start_time(42);
ilb.add_comment("I love comments");
let mut logger = ilb.build(&mut writer, &mut serializer);

This has some other neat secondary effects such as headers being easy to re-use, fields that can only be set once only being written once, and more efficient writes (because they can be batch-written on .build(). We can bikeshed the exact name for .build(), but overall following the builder pattern as outlined in the Rust API guidelines would be a good route to take.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the builder approach.

/// Add a comment line.
///
/// If you do silly things like write a comment with a newline character, you'll end up with
/// an un-parseable log file. Don't do that.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe panic? Or filter silently?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or .replace('\n', "\n#")?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe that would be better to protect users from accidentally horking their logs. I suppose it should probably look for all of https://en.wikipedia.org/wiki/Newline#Unicode.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want char::is_control, which includes all the Unicode control characters (including newlines).

/// The timestamp may be absolute vs the epoch, or there may be a `StartTime` or `BaseTime` for
/// the log, in which case you may wish to consider this number as a delta vs those timestamps.
/// See the module-level documentation about timestamps.
pub fn start_timestamp(&self) -> f64 {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add a method to return a chrono NaiveDateTime, but it didn't seem worth the dependency to do something that the consumer can just as easily do

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, given that it might be entirely user-defined, I'm not sure we should.

///
/// `tag` is an optional tag for this histogram.
///
/// `max_value_divisor` is used to scale down the max value to something that may be more human
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be fixed for the entire interval log? It seems like it can't be a good idea to allow different divisors for different intervals.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that sounds like a good idea. And it fits nicely with the proposed builder pattern!

//! log.extend_from_slice(b"3.789,1.007,2.769,base64EncodedHisto\n");
//! log.extend_from_slice(b"Tag=b,4.123,1.007,2.769,base64EncodedHisto\n");
//! log.extend_from_slice(b"5.456,1.007,2.769,base64EncodedHisto\n");
//! log.extend_from_slice(b"#Another comment\n");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be more readable with multi-line strings:

let mut log = "\
               #I'm a comment\n\
               Tag=a,0.123,1.007,2.769,base64EncodedHisto\n\
               1.456,1.007,2.769,base64EncodedHisto\n\
               3.789,1.007,2.769,base64EncodedHisto\n\
               Tag=b,4.123,1.007,2.769,base64EncodedHisto\n\
               5.456,1.007,2.769,base64EncodedHisto\n\
               #Another comment\n"
    .to_vec();

//! let count = iter.map(|r| r.unwrap())
//! // only look at intervals (which are the only non-comment lines in this log)
//! .filter_map(|e| match e {
//! interval_log::LogEntry::Interval(ilh) => Some(ilh),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would match on Ok(interval_log::LogEntry::Interval(ilh)) here to avoid the unwrap just above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without having read further down, I also don't see why the IntervalLogIterator yields a Result. What is the Err case? And wouldn't it be easier to use with the errors being another LogEntry type? Or does the iterator behave weirdly once an Err has been yielded?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only error (at the moment) is a parsing error, beyond which we can't really proceed. We could have errors represented in LogEntry, though that feels somewhat odd to me for reasons I can't really pin down. Not in love with the current way, though, that's for sure... an iterator of Results is clumsy. One good thing about Result is that it enables easy use of ?, which I don't think we would get from having LogEntry have an error variant.

Speaking of which, the iterator should probably only return None once it's returned a Some(Err(...))...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with all the above.

//! header_writer.write_comment("Comments are great").unwrap();
//! header_writer.write_start_time(123.456789).unwrap();
//!
//! let mut log_writer = header_writer.into_log_writer();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't read further down yet, but is an IntervalLogHeaderWriter the only way to get an IntervalLogWriter? If so, I think the former should be given a different name.

internal_writer: InternalLogWriter<'a, 'b, W, S>,
}

impl<'a, 'b, W: 'a + io::Write, S: 'b + Serializer> IntervalLogHeaderWriter<'a, 'b, W, S> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the IntervalLogHeaderWriter should be parameterized on S. Instead, the call to into_log_writer should be. The header writes are all independent of the Serializer as far as I can tell?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it doesn't use Serializer, so it can definitely be deferred. I was torn about whether to take the Serializer up front or wait until it was needed, and if it feels weird to you to do it eagerly, that's enough of a reason to switch for me.


/// Once you're finished with headers, convert this into a log writer so you can write interval
/// histograms.
pub fn into_log_writer(self) -> IntervalLogWriter<'a, 'b, W, S> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See notes above about making this more builder-like, and then only committing to a writer + serializer at this point (rather than when creating the header writer).

///
/// This isn't created directly; start with an `IntervalLogHeaderWriter`. Once you've written the
/// headers and ended up with an `IntervalLogWriter`, typical usage would be to write a histogram
/// at regular intervals (e.g. once a second).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an example would be more useful here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll add one once we get the builder API sorted out.

/// instead use a delta since that time. See the discussion about timestamps in the module-level
/// documentation.
///
/// `duration` is the duration of the interval in seconds.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you didn't make this a time::Duration?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I didn't think of it once I dismissed the idea of bringing in chrono ;)

self.text_buf.clear();
let _len = self.serializer
.serialize(h, &mut self.serialize_buf)
.map_err(|e| IntervalLogWriterError::SerializeError(e))?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add

impl<E> From<E> for IntervalLogWriterError<E> {}

so that you can remove this map_err.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... without a way to bound E, that would map every possible error type to SerializeError, which seems undesirable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isn't that what the IntervalLogWriterError type already suggests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there is the IoError variant, so it's not just a wrapper for serialization errors. Having everything else map to SerializeError(E) feels risky; wouldn't that make it possible to have a typo where some other Result-producing call gets a ? appended to it, and presto now we're returning SerializeError(SomeOtherUnrelatedError).

io::Error seems like a reasonable thing for From because it's called many times and is unambiguous. I've never used an open-ended From like that so maybe my unease is unfounded, but it reminds me unpleasantly of Scala's implicits which made code hard to follow.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that IntervalLogWriterError's one type argument is only used in the SerializeError variant. If we were to add some other wrapped error, that would be of a different type, which would change IntervalLogWriterError to take two type arguments <E1, E2>. In that case, an impl of From wouldn't make much sense without further bounds. But in the current case, if you have an error and want to produce some IntervalLogWriterError<E>, then that works exactly when that error is of type E. Which is what the above From would work for.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tightened up the bounds a bit:

pub enum IntervalLogWriterError<S: Serializer> {
    /// Histogram serialization failed.
    SerializeError(S::SerializeError),
    /// An i/o error occurred.
    IoError(io::ErrorKind),
}

Nonetheless, the compiler gets grumpy:

error[E0119]: conflicting implementations of trait std::convert::From<serialization::interval_log::IntervalLogWriterError<_>> for type `serialization::interval_log::IntervalLogWriterError<_>

Copy link
Collaborator

@jonhoo jonhoo Nov 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What code is it it's complaining on?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've extracted it into this playground: https://play.rust-lang.org/?gist=d6826495d99f9718c0980197a0fe14d3&version=stable

I think what the compiler is getting at is that you could have a Serializer whose SerializeError was io::Error, and then there would be no way to know which impl of From to use.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, that makes sense. Oh well, leave it as-is I guess.

}
}

impl<'a> ops::Deref for Tag<'a> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to impl AsRef instead here. Though not clear that it matters.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not real clear on when I should use one vs the other, but String impl's Deref<Target=str> so I was cargo-culting off of that since this seems like a similar sort of relationship. Since this is just a convenience for hypothetical usage, it's hard for me to say what would be best, but I'm happy to switch if AsRef is the Right Answer(TM).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is really poorly documented. Reading a bit more though, I think implementing Deref might actually be the right thing after all, so leave it as-is.

pub struct IntervalLogHistogram<'a> {
tag: Option<Tag<'a>>,
start_timestamp: f64,
duration: f64,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this should be exposed as time::Duration.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make write_histogram use time::Duration too.


/// Errors that occur when parsing an interval log.
#[derive(Debug, PartialEq)]
pub enum LogIteratorError {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unfortunate, but makes total sense. There's some related discussion here: https://users.rust-lang.org/t/handling-errors-from-iterators/2551/20. I guess it's fine.

///
/// This parses from a slice representing the complete file because it made implementation easier
/// (and also supports mmap'd files for maximum parsing speed). If parsing from a `Read` is
/// important for your use case, open an issue about it.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting point. I suspect parsing from a Read will be a very useful addition, and one that will be tricky to combine with the way the code is currently written, but I agree that we can deal with that on-demand.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it will be too bad; we'll just have to read into a buffer, parse what we can, and grow the buffer if we get Incomplete. There's some streaming support in nom, but I didn't investigate too closely. More than I felt like tackling on top of learning nom. :)

}

impl<'a> Iterator for IntervalLogIterator<'a> {
type Item = Result<LogEntry<'a>, LogIteratorError>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some issues with this.
If I've understood the log format correctly, this iterator will always first yield at most one each of StartTime and BaseTime, and from that point forwards, only yield LogEntry::Interval. If that interpretation is correct, then it seems like this should really be an iterator that yields IntervalLogHistogram<'a>, with some kind of initial step that gives you the start/base time. That'd be much more convenient to work with, as it would obviate the need for that pesky match.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an ideal world, that would be true: well defined headers, which we read once, and then only expose body elements (ala HTTP). However, I'm not sure that in the wild these logs actually are structured that way. The Java impl lets you write those header-ish comments anywhere you want as often as you want, and supports reading them interleaved with interval logs as well, so maybe that's a thing people actually do? I asked Gil but I haven't heard from him in a while on Gitter; he must be busy with other stuff.

I figure that for users who (1) have well-structured logs and (2) are bothered by this API it would be quick work to wrap this Iterator of Results into a much tidier thing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I suppose... It seems unfortunate, because it makes the API a fair amount less ergonomic. I guess we could provide a refined iterator on top of this, whereas the reverse is not true, so leave it like this for now.

@@ -137,7 +144,7 @@
//! // Make some histograms
//! for _ in 0..num_histograms {
//! let mut h = Histogram::<u64>::new_with_bounds(1, u64::max_value(), 3).unwrap();
//! h.record_n(42, 7);
//! h.record_n(42, 7).unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really want rust-lang/rfcs#1937 so we can make this use ?.

@@ -0,0 +1,126 @@
extern crate rand;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module is only used in tests, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. I'll give it a comment to make it less confusing.

@jonhoo
Copy link
Collaborator

jonhoo commented Nov 2, 2017

I like the change overall. See some suggestions for structural/interface changes inline above.

- Doc test tidying
- Use `time::Duration` for interval duration
@@ -250,10 +268,11 @@ mod tests {
}
}

println!("{}", ::std::str::from_utf8(&log_buf).unwrap());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is just leftover from debugging?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh!

/// will be represented by the iterator providing the corresponding variants here. The presence
/// of those timestamps will affect how you should interpret the timestamps for individual
/// intervals. See the module-level documentation.
#[allow(variant_size_differences)]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we care about this? Not really sure I see a way around it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought variant_size_differences was allowed by default? I think it's fine for this enum to have a somewhat large size variant, though it is also another reason to make the iterator only yield IntervalLogHistogram (as the enum would no longer be necessary).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are, but they're turned off in the crate-wide deny statement. The variant is only large because IntervalLogHistogram is large, though, so removing the enum isn't really going to help the "copying fat structs around" issue (though I think it's unlikely to be a performance issue).

))
/// This can be called multiple times, but only the value for the most recent invocation will
/// be written.
pub fn with_start_time(&mut self, seconds_since_epoch: f64) -> &mut Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if these should take time::SystemTime?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... perhaps. That would mean we'd have to do something about handling the error case for duration_since(), though. We could just take the Duration instead of the SystemTime to push that decision off to the caller, but still have something more "time-flavored" than a lowly f64.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let since_epoch = match time.duration_since(UNIX_EPOCH) {
    Ok(d) => dur_to_f64(d),
    Err(ste) => -1 * dur_to_f64(ste.duration())
};

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that should work, and we're very unlikely to hit the Err case, though I wonder if maybe we should panic because your timekeeping is clearly hosed if your system thinks it's before 1970?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh -- if we accept SystemTime, we can't have repeatable tests because you can't construct an arbitrary SystemTime.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you can, by adding to UNIX_EPOCH ;)
As for panicking, I'd rather avoid it given that we don't have to.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point. In that case I'm 👍 because I can get my testing done and we can always provide another way to set the times if SystemTime turns out to be awkward for users.

}

/// Build a LogWriter and apply any configured headers.
pub fn build_with<'a, 'b, W: 'a + io::Write, S: 'b + Serializer>(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty much the opposite of what https://aturon.github.io/ownership/builders.html suggests: we're taking the required things at the end, not the beginning, but being able to use the same setup on multiple writers or serializers seems hypothetically useful, and this API does not offend me otherwise.

Let the name bikeshedding begin! :P

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I'm okay with this approach.

I actually don't mind build_with. I think I'd be partial to a name that fits better in context though. How about begin_log_with?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, WFM.


// The Java impl doesn't write a comment for this but it's confusing to silently modify the
// max value without leaving a trace
if self.max_value_divisor != 1.0_f64 {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Straying off the reservation here but it is so easy to misinterpret a log (as a human) without this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it.

max_value_divisor: self.max_value_divisor,
};

for c in self.comments.iter() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, you can use for c in &self.comments here.

/// Add a comment line to be written when the writer is built.
///
/// If you do silly things like write a comment with a newline character, you'll end up with
/// an un-parseable log file. Don't do that.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to make sure that #70 (comment) doesn't get lost in these changes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, got it tracked in the opening comment on the PR. This is one of the things I prefer about BitBucket: PR tasks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tightened up parsing to only accept newlines as a line ending (because \r should be shunned), and parsing, which was already quite fast, sped up over 3x!

Anyway, it seems to work fine if we only transform \n into newline-plus-# and leave everything else as-is. Is there a particular reason to turn any control character into a subsequent line of comments?

pub fn write_histogram<T: Counter>(
&mut self,
h: &Histogram<T>,
start_timestamp: f64,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should maybe be a Duration as well? 🤔 Conceivably, that's how it would be used: a Duration since either UNIX_EPOCH or whatever the chosen StartTime/BaseTime was.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yeah, that's a good idea!

#[allow(variant_size_differences)]
pub enum LogEntry<'a> {
/// Logs may include a StartTime. If present, it represents seconds since the epoch.
StartTime(f64),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I suppose this should be Duration then too. More expensive to parse into Duration but these should be rare relative to histograms, so it's probably not an issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it back, that should be SystemTime I suppose.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, now I'm not so sure. There's not really much useful you can do with a SystemTime except turn it back into a Duration based on UNIX_EPOCH, which would force the user to do another round of error checking on the Result. So, it seems probably more desirable to give them the Duration directly. The path from Duration to chrono::naive::NaiveDateTime and friends is then pretty smooth.

@@ -365,7 +366,11 @@ impl<'a, 'b, W: 'a + io::Write, S: 'b + Serializer> InternalLogWriter<'a, 'b, W,
}

fn write_comment(&mut self, s: &str) -> io::Result<()> {
write!(self.writer, "#{}\n", s)
for l in s.split('\n') {
write!(self.writer, "#{}\n", l)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you not still have to escape control characters? \r for example?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not unless we start treating \r as a line separator. I added a test that has every control character in a comment. It gets a new line with a leading # for the \n but all others are left alone, and it parses fine.

Parsing into an f64 then separately truncating to an int and multiplying
the fractional part by 1_000_000_000 introduced errors in the fractional
part since frequently abc.xyz cannot produce 0.xyz as an f64, but has
to have 0.xyz0000001 or something like that.
IResult::Done(rest, data) => {
let (secs, nanos_str) = data;

// only read up to 9 digits since we can only support nanos, not smaller precision
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"only" heh

@jonhoo
Copy link
Collaborator

jonhoo commented Nov 6, 2017

cf9ec24 fixes the last task "Better type for reading StartTime/BaseTime", right?
I think I'm happy to merge.

@marshallpierce
Copy link
Collaborator Author

Yep I think this is good to go!

@jonhoo jonhoo merged commit 0b88329 into master Nov 6, 2017
@jonhoo
Copy link
Collaborator

jonhoo commented Nov 6, 2017

Nice! Good work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants