
Up-to-date trace inspector criteria? #2

Open
schulmar opened this issue Oct 10, 2014 · 34 comments

@schulmar

The readme says

Currently, there is no up-to-date or actively maintained application to view or inspect the profiles.

and links to Templar.

Which criteria need to be fulfilled for an application to be seen as up-to-date in your opinion? (I am planning to work on this when I have time to spare)

@mikael-s-persson (Owner)

Well, I said that because, at the time of writing, Templar had not been updated in about 9 months, was still in a kind of beta stage (at least, from what I remember of the last time I tried it), and still referred to the original templight patch, which is no longer compatible with the Clang trunk; I also remember having some issues making it work with the latest version of graphviz. That, for me, qualifies as not up-to-date or actively maintained. I did not mean to belittle or offend, and I understand full well how difficult it is to find time to work on a side project like this; I also left my version of templight sitting on a shelf for a long time, not finding the time or motivation to get back on track with it.

I could rephrase it. There's nothing formal about what qualifies as up-to-date and maintained. But a good rule of thumb is: if there are lingering bugs, or if a user who gets the code needs to do additional work to get it to compile or run properly on his machine (with his newer versions of the dependent libraries), then it isn't being maintained.

But regarding your work on Templar: I was hoping to change the output formats (and some changes have already been committed by Wojciech Mamrak). The first changes were mostly to remove redundant or empty fields in the begin / end entities, which reduces the size of the files significantly. But I'm also considering changing the structure to be more recursive / tree-like for the yaml and xml outputs, i.e., instead of a flat list of entities, having entities with sub-entities and so on. Would you have any comments, interests, or problems with that?

I'm also planning to produce graphviz and GraphML outputs, which could make things easier on your side.

And if I'm gonna be changing the output formats, I'd like to make most of the changes quickly and then never change them again, to avoid future compatibility problems. So, I'm all ears for ideas about what would make the "perfect" format.

@schulmar (Author)

I am sorry if you felt I was offended; I was just genuinely interested.

If you have any problems with the current version of Templar, then please open a new issue for each one and I will try to find out what the problem is.

Nesting the structure seems like a good idea to me. It is much more natural than a flat list. When I first saw the XML structure, I wondered why the values were stored as sub-elements instead of attributes; do you have any insight on that?

Regarding the outputs: you mean you want to produce graphviz and GraphML outputs directly from templight?
If so, then I am skeptical. If they can be created from the trace, I would avoid adding code for this to templight, for fear of feature creep.

@wmamrak (Contributor) commented Oct 19, 2014

Hello,

there is one additional inconsistency in the trace files produced in the various formats that I have somehow missed: the yaml trace file describes the location of a template instantiation using three keys (FileName, Line, Column), whereas the text and xml formats use a single PointOfInstantiation element, whose value holds the file name, line and column separated by the "|" character.
There are a few reasons I would switch to the same formatting for all the formats (preferably the yaml one):

  1. It is inconsistent.
  2. Using "|" saves some space when no compression is used, but since none of the formats was designed with space-efficiency in mind, this is rather a flaw. And arguably, parsing strings to find "|" takes more time than simply jumping to the next line or element.
  3. Why would anyone use additional formatting rules (i.e. some arbitrary character) instead of the well-understood rules of each format (elements/attributes in xml, new lines in txt)?

The patch would be very simple, but I can save you those few keystrokes and prepare one.

regards

PS. Another thing: for the text format, I would remove all these assignments ("=") in favor of new lines.

@schulmar (Author)

@wmamrak I second all of that.

After having spent some time on the loading code in Templar I think we should give the files some version information. This would make it easier to select the correct loader without having to dig through the files to find version specific constructs.

@mikael-s-persson (Owner)

It's odd that you guys bring this up today, because I was just doing some work on the output formats; I had a bit of time to spare. Because of the comments you expressed, I've put my commit into a new "develop" branch, to avoid polluting the master with controversial commits. That's because I kind of went in the opposite direction of what you just said, before I read the posts.

Nesting the structure seems like a good idea to me. [...] why the values were not stored as attributes?

I implemented both of these changes into an experimental format that you can produce with -format=nestedxml. So, now all the fields are attributes of the XML entries. And entries of nested instantiations are nested XML elements. Something like this:

...
<Entry Kind="TemplateInstantiation" Name="foo<int>" Location="foo.cpp|20|42" Time="0.0" Memory="0">
  <Entry Kind="TemplateInstantiation" Name="foo<int>::foo" Location="foo.cpp|25|21" Time="0.0" Memory="0">
  </Entry>
</Entry>
...
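For what it's worth, a nested format like this is trivial to consume with any off-the-shelf XML library. A minimal Python sketch, using the element and attribute names from the example above (note that, in a real trace, "<" inside attribute values would have to be escaped as "&lt;" for the XML to be well-formed):

```python
import xml.etree.ElementTree as ET

# A small document in the proposed nested-XML format (mirroring the
# example above, with "<" escaped inside attribute values).
trace = """
<Trace>
  <Entry Kind="TemplateInstantiation" Name="foo&lt;int&gt;" Location="foo.cpp|20|42">
    <Entry Kind="TemplateInstantiation" Name="foo&lt;int&gt;::foo" Location="foo.cpp|25|21"/>
  </Entry>
</Trace>
"""

def walk(entry, depth=0):
    """Recursively yield (depth, Name) for an Entry and its sub-entries."""
    yield depth, entry.get("Name")
    for child in entry.findall("Entry"):
        yield from walk(child, depth + 1)

root = ET.fromstring(trace)
entries = [pair for top in root.findall("Entry") for pair in walk(top)]
# entries now carries the nesting depth for free, with no begin/end matching.
```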

You mean you want to produce graphviz and GraphML outputs directly from templight?

Yes, and I did. Both GraphML and graphviz are pretty trivial formats to produce; the hard work is transforming the cached templight entries into a tree, which is needed to write the nested-XML format anyway. With most of the heavy lifting done, it was trivial to produce those formats too.

If they can be created from the trace I would avoid adding code for this to templight for fear of feature creep.

I understand that point, and I could reconsider. But there are a few advantages to producing the graphml and graphviz formats directly from templight. For one, they are rather compact formats, which is nice. It should also make tooling code easier to write: you currently load up the Templight traces and then generate an in-memory graphviz graph to display in Templar, but others (developing other tools) might not be willing to do this. For example, I could imagine a tool that generates the template equivalent of doxygen-style call-graphs to display in documentation, which would be more convenient if it could get a dot (gv) file directly out of templight, to load straight into graphviz. I don't plan to make those formats customizable in any way, so if someone wants more than a vanilla graphviz graph, he'll have to post-process it (to apply transforms and styles), which can all be done with dot / graphviz; that is the whole point (he doesn't have to go through any other templight-specific tool). And in terms of feature creep / bloat, we are only really talking about a couple of hundred lines of code at most, i.e., a fart in the wind.

  1. It is inconsistent.

I agree that it should be consistent. I don't really know why it was inconsistently done in the first place. And I might add that I don't really like the terminology used either, like "PointOfInstantiation", which is kind of redundant given that the entries are all about instantiations, and I think that a simple "Location" would do just fine.

  1. Using "|" saves some space when no compression is used, but since none of the formats was designed with space-efficiency in mind, this is rather a flaw.

Uhmm.. I'm not sure that the fact that the formats are not particularly terse to begin with makes it a flaw to have a more compact notation in one field. And an XML or yaml file that is 20% smaller (which is about what you can get with this, in the best cases) is still an improvement.

And arguably parsing strings to find "|" takes more time than simply jumping to the next line or element.

Well, no. I mean, tokenizing a string by the "|" character is about the easiest task you can do. And unless you are using an off-the-shelf yaml / xml reader, there is no way that this is harder than the multiple field / attribute version of it. This particular format, "filename.cpp|42|69", is a fairly common way of expressing a file name, line and column. I've seen it in debuggers, diagnostics, editors, etc., in part because it is very easy to parse.
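To illustrate, splitting such a location string is a one-liner in most languages; a Python sketch (the function name is mine):

```python
def parse_location(loc):
    """Split a "filename|line|column" location into its three parts."""
    filename, line, column = loc.split("|")
    return filename, int(line), int(column)

parsed = parse_location("filename.cpp|42|69")
# -> ("filename.cpp", 42, 69)
```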

Why would anyone use additional formatting rules (i.e. some arbitrary character) instead of the well-understood rules of each format (elements/attributes in xml, new lines in txt)?

That's a good point. But like I said, the "|" separation is not arbitrary; it's quite common (the other common format is "filename.cpp:42" (like GDB), but that can be a bit tricky to parse on operating systems like Windows that mark hard drives with "C:", and so the vertical bar is nicer because it is not ambiguous with anything likely to be found in a file-path string). And the other reason I like the compact version is that "Location" makes more sense as a whole: you are never going to look up the line or column where the instantiation occurred without first looking up the file name. In other words, together they are one single piece of information, not three separate "attributes".

But if there is any risk of things behaving badly with having this compact version for the locations, then I'll retract that.

another thing: for text format, I would remove all these assignments ("=") in favor of new lines.

No. That's a definite no. The vanilla text file format very typically uses the form "var = value" on each line. That's a very common (and, again, easy to parse) format. And, with the equal sign, it's more human-readable, which is the main point of the text format; if you want a format to be parsed by a program, you'd choose one of the more computer-friendly formats, like yaml / xml / graphml / dot / ... (maybe protobuf or JSON?).

After having spent some time on the loading code in Templar I think we should give the files some version information.

That's probably a good idea. For any XML format, this is pretty straightforward; most xml parsers don't really need to know what's in the file to read it, but there could be a version number, and possibly a DTD as well. I guess that for the yaml and text formats, it would have to be some ad hoc thing at the start of the file. (And for GraphML and graphviz, that's not really necessary, since those formats are fixed by their own standards.)
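For the XML case, a version attribute on the root element would be enough, and a reader can check it before doing anything else. A hypothetical sketch (the TemplightTrace root tag and the version attribute name are assumptions for illustration; nothing has been settled in this thread):

```python
import xml.etree.ElementTree as ET

# Hypothetical: a versioned root element wrapping the trace entries.
doc = '<TemplightTrace version="1"><Entry Kind="TemplateInstantiation"/></TemplightTrace>'

root = ET.fromstring(doc)
# Files written before versioning was introduced default to version 0.
version = int(root.get("version", "0"))
```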

I think it's still a bit early for this. I'd like to first get a nice, frozen, proper first version of the output formats (which is just a matter of reaching a consensus on issues like those above), and include a version number in those output formats. That will be the starting point for any kind of future backward- / forward-compatibility maintenance, which should be quite easy on most formats.

This would make it easier to select the correct loader without having to dig through the files to find version specific constructs.

That sounds a bit odd. You should write one loader that works for all versions of one format, and keep it backward and forward compatible. It's very simple: if you encounter fields you don't recognize, just ignore them (forward compatibility), and if you don't encounter the newer fields, you default them and / or fall back to the older behaviour. You do this mostly on a field-by-field basis; having a version number at the start certainly helps along, but you shouldn't have to "select" the correct loader.
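The field-by-field tolerance I mean can be sketched in a few lines of Python (the field names and defaults here are illustrative, not a fixed schema):

```python
# Field-by-field tolerant loading: unknown keys are ignored (forward
# compatibility) and missing keys get defaults (backward compatibility).
KNOWN_FIELDS = {"Kind": "Unknown", "Name": "", "Location": "", "Time": 0.0, "Memory": 0}

def load_entry(raw):
    # Take only the fields we know, defaulting the ones that are absent;
    # anything else present in `raw` is silently dropped.
    return {key: raw.get(key, default) for key, default in KNOWN_FIELDS.items()}

# An entry from an older trace (no Time/Memory) and one from a newer
# trace (with an unrecognized extra field) both load cleanly:
old_entry = load_entry({"Kind": "TemplateInstantiation", "Name": "foo<int>"})
new_entry = load_entry({"Kind": "TemplateInstantiation", "Name": "foo<int>", "Color": "red"})
```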

But anyways, hopefully, we should settle on one set of stable output formats pretty soon. After that, I don't think there will be much room for expansion, except maybe adding more "Kind" possibilities.

@schulmar (Author)

[...] there could be a version number, and possibly a DTD as well.

Everything that gives a better understanding of the format would be good. All the work I have done on the loader so far is based on what I see in the files, which is not a good basis for a correct implementation.

I would be especially interested in differences to the original format.

You should write one loader that works for all versions of one format

Of course, it is all a matter of perspective: at the top level we only want to call load('trace.xml') (or whatever the file is called), but internally the loader has to distinguish.

It's very simple, if you encounter fields you don't recognize, just ignore them (forward compatibility), and if you don't encounter the newer fields, you default them and / or fall-back to the older behaviour.

That works if the format does not change too much, but there might be cases where there is no sane default value.

Take your new format as an example: if I were to implement it in the same loader as the old one, the nesting would be totally wrong, because the old format nests via <TemplateBegin></TemplateBegin> and <TemplateEnd></TemplateEnd> while the new one nests via <Entry> and </Entry>. I am sure I could merge these two loaders (because they do the same thing with slightly different constructs), but at what price for code clarity/correctness? So there should be a separate loader for each of the two formats.

I think it's still a bit early for this. I'd like to first get a nice frozen first proper version of the output formats

These differences are where my wish for format versioning comes from; sure, it helps future revisions, but we already have at least two XML formats now. For now, their first tag might suffice to distinguish them, but I don't think that is the right approach.
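Sniffing that first tag does at least work without parsing the whole file; a hedged Python sketch of what I mean (the <Trace> root element is an assumption for illustration, and the two tag names are the ones discussed above):

```python
import io
import xml.etree.ElementTree as ET

def detect_format(xml_text):
    """Label the trace format from the first recognizable start tag,
    streaming the document instead of parsing it fully."""
    for _, elem in ET.iterparse(io.StringIO(xml_text), events=("start",)):
        if elem.tag == "TemplateBegin":
            return "flat"    # old format: <TemplateBegin>/<TemplateEnd> pairs
        if elem.tag == "Entry":
            return "nested"  # new format: nested <Entry> elements
    return "unknown"

old_fmt = detect_format("<Trace><TemplateBegin/></Trace>")
new_fmt = detect_format("<Trace><Entry/></Trace>")
```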

@wmamrak (Contributor) commented Oct 20, 2014

Hello,

I see there are a few misunderstandings which emerged from different points of view, and some from my lack of precision in formulating my thoughts.

Please ask yourself the following questions.

  1. Is it reasonable to provide several formats (currently 3 or 4, depending on how you count the two xml formats) which are equal in functionality, i.e. all the formats provide the same data and take more or less the same disk space?
  2. Why would anyone use an expensive and slow XML parser instead of a simple, e.g. raw text file format, except "because he can"? And since "he can", why should templight developers care?
  3. Are trace files meant to be processed by humans or by computer programs?

In my opinion, there should be one supported format: the one which, using the simplest syntax and parsing rules, can unambiguously describe all the data produced by Templight.
Trace files (in any format in the world) are human-readable, but they should be processed by computers. Hence, in my opinion, any human-legibility improvements should be removed, due to the additional processing and size costs they carry.
I am thinking, e.g., of the two spaces in front of each key-value pair in the text format.
This is also why I opted for replacing the "=" sign with a new line (or space): not only does it make the format simpler, it makes parsers simpler and trace files smaller.

Uhmm.. I'm not sure that the fact that the formats are not particularly terse to begin with make it a flaw to have a more compact notation in one field.

Consistency? You can save more space by giving the elements/attributes shorter names; you can make the format fixed, separate all the data using "|", and introduce versioning to the format file. Each of these would improve the size without loss of consistency. My point is: if a format gives you the tools to describe what you want, do not invent new ones or circumvent the existing ones, even in good faith.

another thing: for text format, I would remove all these assignments ("=") in favor of new lines.

No. That's a definite no. The vanilla text file format is very typically using the format "var = value" on each line. That's a very common (and again, easy to parse) format.

I don't especially like the argument of commonness, particularly because it is very subjective. What I prefer is rationality and usefulness.

Well, no. I mean, tokenizing a string by the "|" character is about the easiest task you can do. And unless you are using an off-the-shelf yaml / xml reader, there is no way that this is harder than the multiple field / attribute version of it.

I have been imprecise: I was thinking of the text file format only, and of the processing time and complexity of the two operations.
I have made a test, on a trace file of 1.4M lines. Processing the raw text file in pure Python took about 4 seconds (correction: about 3.5 s if stripping whitespace is omitted). Then I installed PyYaml with the LibYaml (C) bindings for maximum speed, and used, I believe, the fastest method possible: processing the yaml file took about 18 seconds. I have not tried an XML parser, as it would probably take a similar, or even longer, time.
As a side note, it does indeed look like parsing a text file with new lines as the separator is, for my trivial case, a bit slower.

And, with the equal sign, it's more human-readable, which is the main point of the text format; if you want a format to be parsed by a program, you'd choose one of the more computer-friendly formats, like yaml / xml / graphml / dot / ... (maybe protobuf or JSON?).

All the above-mentioned formats are more computer-friendly because they can describe complex data structures that a simple text format cannot without introducing additional rules. I don't think their computer-friendliness goes any step further, though :)

To sum it up, I would keep the tracer simple (not that it is overcomplicated), i.e. I would provide one simple, easy-to-parse format without any human-legibility improvements, as they are overrated :) If someone is using notepad or vi to view trace files, most likely she is doing something wrong.

@schulmar (Author)

I must agree with @wmamrak:

I would keep the tracer simple (not that it is overcomplicated), i.e. I would provide one, simple, easy to parse format without any human-legibility improvements

It would make life easier for both the producing and consuming end if there was only one output format to maintain.

I see no real use case for opening the file and viewing it in an editor, so we don't need human readability.

If we don't need human readability, we could start using a binary format, which would save even more space and could be tailor-made for the problem domain. (In fact, Thomas Benard has already created such a binary format and incorporated it into a fork of templar.)

The only advantage that I see with the XML/YAML outputs is that they can be parsed by tools that don't have to know about templight (e.g. for statistics plotting). IMHO, that alone does not justify producing these files in templight. If the need for them arises, it should be possible to generate them from a binary trace file with an external application.

@wmamrak (Contributor) commented Oct 20, 2014

Martin described my thoughts more accurately than I would by myself.

@mikael-s-persson (Owner)

Is it reasonable to provide several formats (currently 3 or 4, depending on how you count the two xml formats) which are equal in functionality, i.e. all the formats provide the same data and take more or less the same disk space?

I agree that it's not a good idea to support several redundant formats. I'm sort of working off what I inherited from the original templight version.

The original XML format and the YAML format are completely redundant. I'm guessing that they only provided the two formats because YAML is more native to LLVM, while XML is more wide-spread and well-known. Is that really a sufficient reason to provide both? I don't think so.

And then there is the text format. There is no reason to provide a text format other than human readability, and I agree with you guys that reading a templight trace in a text editor has no real use case. I would vote to eliminate the text format altogether; I never understood why it was part of the original templight. And to that point:

Why would anyone use an expensive and slow XML parser instead of a simple, e.g. raw text file format, except "because he can"? And since "he can", why should templight developers care?

Well, whether they use an off-the-shelf "slow" XML parser or not is beside the point. Relying on a raw text file format has the very big problem that there is no standard definition behind it, and making our own would be a waste of time. The advantage of using an established base format like XML or YAML is that there are clear syntax rules already established, widely known and followed. I've written XML parsers more often than I can count, like most seasoned developers, I presume (and btw, I've had far more trouble with basic text file formats in the past). The well-defined syntax rules of the XML format are what make it easy and reliable to parse. Simple text formats are, by definition, without any clear rules (e.g., quotes? multi-line? which chars must be escaped? are spaces significant? comments? etc.), which means those rules need to be established and communicated to anyone wanting to parse those files. I've seen many projects fail just on the basis that they came up with their own format that nobody was interested to learn or follow. The convention is that a raw text format (if any is produced) is for human consumption only, because a program needs precise rules for reliable parsing. Of course, for a quick-and-dirty parsing script, a simple text format is easier, but not for a reliable one. This is the whole reason all these standard extensible formats exist (XML, YAML, protobuf, JSON, etc.).

And so, if there is no use-case for a human-readable format, I would get rid of the text format.

I am thinking e.g. about two spaces in front of each key-value pair in text format.
This is also why I opted for replacing the "=" sign with new line (or space). Not only it makes the format simpler, it makes parsers simpler and trace files smaller.

This is exactly the type of discussion I want to avoid. You are talking about establishing some sort of clearer, less ambiguous set of syntax rules for this new format that you are proposing... This is a waste of time. Going down that road will turn into far more trouble than you can imagine. The reason formats like XML have so many quirky rules and "complex" syntax is that when you start establishing reliable syntax rules, you find so many issues and corner cases (which we are subject to as well, don't think that we aren't) that you end up with something like that; that is what makes a format mature and seasoned. We would be fools not to take advantage of all the stabilization work that has already been done.

To me, the primary / preferred output format for templight should be one that relies on a standard extensible format (XML, YAML, protobuf, JSON, whatever, as long as it's mature and well-established).

I don't especially like the argument of commonness, particularly because it is very subjective. What I prefer is rationality and usefulness.

I'm all for rationality, but the thing you have to understand about file formats is that commonness is the most rational choice. Formats are used to communicate information. We use the English language to communicate because it's the most common one, not because of its inherent rationality or effectiveness.

About the "Location" vs. "Filename / Line / Column" problem: I understand your point that the separated version is more in line with the format used. But my concern is that the triad of file name, line and column is one piece of information, not three. It could also cause issues. For example, the version of templight that Martin has referred to (see here) has started to output not just the single location of the instantiation, but the range (start to end) of locations. In the separated version, this would lead to fields like "StartFilename / StartLine / StartColumn / EndFilename / EndLine / EndColumn". If we were to also output the location of the template declaration (not just the instantiation), that would further compound the problem. With the |-separated version, you would have "Start / End" locations defined as something like "some_source.cpp|42|10" and "some_source.cpp|48|8". I find this solution much nicer; and as for being in line with the format (XML / YAML), I think the idea is that "one piece of information == one field".

Another option could be to use attributes:

<Location Filename="some_source.cpp" Line="42" Column="10"/>

I would provide one, simple, easy to parse format without any human-legibility improvements, as they are overrated :) If someone is using notepad or vi for viewing trace files, most likely she is doing something wrong.

I agree, sort of (except about what constitutes an "easy to parse" format). The way I see things, there should be one and only one primary / preferred format, used by default (currently yaml), and it should be the one format used by accessory tools (like Templar and, soon, Metashell too).

But, there can still be other formats provided for specific purposes:

  • The text file format fulfils the role of "use this if you only want to read it in a text editor", which, I agree, is not going to be very popular (though I can still imagine a few cases, like small toy examples and such). But then again, it's a pretty useless format.
  • The GraphML / graphviz formats fulfil the role of feeding directly into graph visualization and analysis tools. It's true that this could be done with a separate trace-to-graph program, but I don't see the point of doing that for some trivial bits of code (and with the associated trouble of having to parse the trace to produce the graph, instead of producing the graph directly from the in-memory cache of templight trace entries). The idea is that you can do things like "sources -> templight -> dot -> images -> doxygen / html documentation", which I find very appealing. Being able to load the graphml file with libraries like the BGL is also very appealing, both for doing high-level statistics on the instantiation graphs and for streamlining the development of tools.

It would make life easier for both the producing and consuming end if there was only one output format to maintain.

Yes. There should be one primary format, that tools should keep in sync with. Any other format that templight can produce, if any, should be aimed at general third-party applications that are not related to templight. Which is sort of the way it has always been: yaml is primary for templight tools (e.g., Templar); xml is for easy off-the-shelf loading in third-party applications that don't have to be in-sync with templight; and text is for the rare cases where one would want to see the trace in a text editor.

but we already have at least two XML formats now.

I consider the second (nested) XML format as a candidate to replace the existing one. There should not be two XML formats at the end. You said that you liked the idea of having a nested structure and that you would prefer the fields to be attributes of the XML entries. That's what I created with that format. If you like it, or want to make some minor changes to it (like renaming fields or whatever), then we could make that the official XML format.

The main drawback of the nested XML format is that it requires a conversion of the cached entries into a tree, so that the nesting can be resolved. See the EntryTraverseTask class's depth-first traversal in TemplightTracer.cpp for how that code looks. The main issues are that it adds run-time overhead (but it won't corrupt the profiles, because it's done only between instantiations, never during one), and that it disables the "safe-mode" option, because there is no way to immediately print entries for valid nested XML. But it does have the advantage of being much smaller, and of producing time and memory intervals instead of absolute time and memory usage (or is that something you want to keep?).
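That flat-to-tree conversion is essentially a stack walk over the begin/end events. A sketch of the idea (this is not the actual EntryTraverseTask code, just an illustration of the technique):

```python
class Node:
    """One instantiation entry in the nested tree."""
    def __init__(self, name):
        self.name = name
        self.children = []

def build_tree(events):
    """Fold a flat stream of ('begin', name) / ('end',) events into a
    forest of nested nodes, using a stack of currently-open entries."""
    roots, stack = [], []
    for ev in events:
        if ev[0] == "begin":
            node = Node(ev[1])
            # attach to the innermost open entry, or to the top level
            (stack[-1].children if stack else roots).append(node)
            stack.append(node)
        else:  # 'end' closes the innermost open entry
            stack.pop()
    return roots

forest = build_tree([("begin", "foo<int>"), ("begin", "foo<int>::foo"),
                     ("end",), ("end",)])
```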

(In fact Thomas Benard has already created such a binary format and incorporated it into a fork of templar).

Well, that binary format will definitely need re-working (it doesn't even handle endianness (!?!), and I've spotted a few dubious things in it too). Binary formats can be tricky to get done right. If we are going to go that route, I would prefer protobuf, because it's the best of both worlds, i.e., it's a simple binary format that performs very well and is easy to read / write, and it's extensible and has a schema language to allow it to be readable from any application. I've written some protobuf serializers before, and have code that I can easily grab for doing that. I'll mock-up some code to see how that could turn out.

The only advantage that I see with the XML/YAML outputs is that they can be parsed by tools that don't have to know about templight (e.g. for statistics plotting). IMHO that alone does not justify producing these files in templight.

The thing is that it costs virtually nothing to produce them from templight, so, they don't need much justification. I see all formats besides the primary one (currently yaml) as "advanced options" for third-party applications.

If need for them arises it should be possible to generate them from a binary trace file with an external application.

That's a possibility. I might consider that for the tree / graph outputs (nested XML, graphML, graphviz) because they have run-time overhead and don't support the "safe-mode".

So... in light of all this, one thing is clear: adopting a good set of output formats, or a single one, is not a trivial thing. That's why I knew it had to be the first order of business before going forward. My main wishes for the primary (or sole) output format are as follows:

  • Use an existing extensible format so that we can skip all the "reinvent-the-wheel" work of fixing the syntax rules. Frankly, I don't care which format it is, as long as it's a standard one.
  • Provide a version number.
  • Be extensible in the possible "Kind" fields and extensible for adding fields, in a backward/forward compatible manner.
  • Maintain a reference implementation for parsing it. This could be done as part of a small tool to convert it to other formats.
  • Be reasonably terse, because I am worried that traces can be large, especially in the target use-cases, which are code-bases that suffer from excessive template instantiations.

I'll start mocking-up some code for a protobuf output format, because I think it is optimal in all those aspects.

YAML sort of fits the bill, but I find it a bit awkward.
XML fits the bill well, but is a bit too large.
Text format doesn't fit the bill because it's not standard or well-defined enough.

@mikael-s-persson (Owner)

I also forgot to mention that I would like the output to be easily composable, i.e., by dumping several trace files into one single file, which would solve some problems related to invoking templight on several source files while producing a single output file. A protobuf format would naturally work for this, because it's a header-free format, so records can easily be dumped end-to-end into a file.
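The append-ability of header-free records can be sketched with a simple length-prefix scheme; the 4-byte big-endian prefix below is just an illustration (protobuf streams are typically concatenated the same way, usually with varint-delimited messages):

```python
import struct

def write_record(buf, payload):
    """Append one length-prefixed record (4-byte big-endian length)."""
    return buf + struct.pack(">I", len(payload)) + payload

def read_records(buf):
    """Split a concatenation of length-prefixed records back apart."""
    out, i = [], 0
    while i < len(buf):
        (n,) = struct.unpack_from(">I", buf, i)
        out.append(buf[i + 4 : i + 4 + n])
        i += 4 + n
    return out

# Two "trace files" dumped end-to-end into one stream, then recovered:
stream = write_record(write_record(b"", b"trace-one"), b"trace-two")
records = read_records(stream)
```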

I also uploaded a tentative protobuf .proto file for defining the messages.

@wmamrak
Contributor

wmamrak commented Oct 20, 2014

And then, for the text format. There is no reason to provide a text format other than human readability.

It depends on what you mean by text format. XML is text as well. JSON too. INI, dot, etc. And even though they are human-readable, they are not intended to be human-processable (is anybody browsing webpages this way?). Protocol buffers are a different beast, and if you decide to support only one format, which I am encouraging you to do, I would opt for protocol buffers.

Well, whether they use an off-the-shelf XML "slow" parser or not is beside the point.

Well, I gave you an example of how fast a production YAML parser is in comparison to a non-optimized raw text parser. XML is much more complex, so I assumed, unless using some non-standard-compliant or, even worse, hand-crafted XML-subset parser, that its speed would be similar.
I don't really care which format will be used, I will parse anything, but I don't like overcomplicating things. Data produced by templight has a very simple structure, I will stress that a few times in the following lines, hence there is no need to produce output in some awkward (XML), popular (XML), overcomplicated (XML), space-inefficient (guess by yourself) format :)

Relying on a raw text file format has the very big problem that there is no standard definition behind it, and making our own would be a waste of time.

Writing which would take literally 2 minutes of time.

The advantage of using an established base format like XML or YAML is that there are clear syntax rules already established, and widely known and followed by people.

But still, reading documentation is needed to know which elements and attributes are there, which values they can take, and what these attributes and values mean. Reading additional information about the trace file structure would impose very little overhead, not to mention writing code, not to mention testing.

Simple text formats are, by definition, without any clear rules (e.g., quotes?, multi-line?, what chars must be escaped?, spaces significant or not?, comments?, etc..),

In general I agree, but not with respect to templight.

You are talking about establishing some sort of clearer, less ambiguous, set of syntax rules for this new format that you are proposing.... This is a waste of time.

The waste of time is waiting for parsing results longer than you need to, not a substantive discussion about reducing this wasted time.
Also, to be precise, I was trying to fix a set of parsing rules of an existing, not new, format. Small, but significant difference.
Using XML for the templight trace file is like taking a sledgehammer to crack a nut. I know XML is mature, I know its complexity is there for a reason. But it is still a sledgehammer and a nut.

I'm all for rationality, but the thing you have to understand about file formats is that commonality is the most rational choice.

I don't have to understand that because I did not say I am against it. My point is, that I won't use a sledgehammer to crack a nut even if all other people act in this way.

But my concern is that the triad of "Filename / Line / Column" is one piece of information, not three.

I would argue - template instantiation is one piece of information as it gives full information about what and where happened :)

In my opinion Protocol Buffers are a way to go. Simple, extensible, fast, mature, widely-supported. In other words one format to rule them all.

@wmamrak
Contributor

wmamrak commented Oct 20, 2014

Two issues:

The smallest tag number you can specify is 1.

There is 0 used a few times.

optional uint64 memory_usage = 5;

Any reason for a 64bit integer?

Except for that, the protocol file looks fine, but I would consider one change. Right now, a message containing nothing but
is_begin=true
is well-formed, which makes little sense. I would consider making the protocol stricter by adding Begin and End structures. Each would define the minimum required values that make an encoded message complete and meaningful - Something like this
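
A stricter layout might look roughly like this (a hypothetical sketch only; the field names and numbers are illustrative, not taken from the actual .proto):

```proto
message TemplightEntry {
  // Hypothetical: the fields required to make a "begin" event meaningful.
  message Begin {
    required string name = 1;
    required SourceLocation location = 2;
    optional double time_stamp = 3;
    optional uint64 memory_usage = 4;
  }
  // Hypothetical: an "end" event only needs the closing measurements.
  message End {
    optional double time_stamp = 1;
    optional uint64 memory_usage = 2;
  }
  optional Begin begin = 1;
  optional End end = 2;
}
```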

@mikael-s-persson
Owner

There is 0 used a few times.

Oh, sorry. Forgot about that rule. I normally generate proto files from my code, so I don't write them by hand all that often.

Any reason for a 64bit integer?

Well, that is technically the type of the memory usage variable. And theoretically, there are translation units whose memory consumption during compilation can rise above 4GB... I have a few of those (a problem I hope to solve once I iron out a mature version of Templight). I think it's a good idea to use 64 bits. And in any case, with protobuf's VarInt encoding, it doesn't make much of a difference.
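
A quick sketch of why the VarInt point holds: protobuf varints use 7 payload bits per byte with a continuation bit, so the encoded size follows the value, not the declared width:

```python
# Protobuf base-128 varint: 7 payload bits per byte, MSB = continuation.
# A uint64 field costs extra bytes only for values that actually need them.
def varint(n):
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

assert varint(300) == b'\xac\x02'   # the classic example from the encoding docs
assert len(varint(1)) == 1          # small values: 1 byte, whether uint32 or uint64
assert len(varint(5 << 30)) == 5    # a ~5GB memory figure: 5 bytes, paid only when needed
```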

I would consider making the protocol more strict by adding Begin and End structures.

Alright. I can live with that. Can you pull request this thing so that I don't have to manually enter the changes. Thanks.

And about the whole "templight output is simple, so don't overcomplicate the format" argument. I agree that content-wise, it's very simple (at least for now, not that I have any crazy plans ;) , but you never know what might come up down the line). But what I mostly want to avoid is coming up with some simple syntax rules for a text format (as in, similar to the current -format=text output) and then later finding that some additions or changes cannot work because they break the parsing according to those simple rules. What I mean is, for example, if we have a simple rule about splitting fields with a new-line, and then, somewhere down the line, Clang people decide to change diagnostic prints to be multi-lined (e.g., with some pretty indenting scheme), then we'd have to retro-fit some way to deal with field values having new-lines and/or update the parsers, etc... That's just a hypothetical example, but the point is that established formats like XML, however overcomplicated, have the benefit of experience with all sorts of issues like that, and have many standard ways to work around special issues without having to retro-fit or step out of the rules. I know that it's taking a sledgehammer to crack a nut, but it does have the benefit of leaving no nuts uncracked!

That said, with protobuf, there's a natural immunity to so many issues, while remaining very simple. I like that a lot. Being binary, i.e., no string-parsing, avoids so many parsing quirks, like escaping characters and quotes, and character encodings.

Oh and to this:

unless using some non-standard-compliant or even worse, hand-crafted XML-subset parser

I have rarely seen people use a full-blown standard-compliant XML parser. Even the heaviest XML parsers in practical use are not fully standard-compliant, AFAIK. And many many projects use hand-crafted XML parsers too. But that is part of the intent of XML, it is not meant to always be used generally (with full DTD and all that). The vast majority of XML producers and consumers use only a tiny subset, often using hand-crafted code. I mean, that's what templight does, after all.

I would argue - template instantiation is one piece of information as it gives full information about what and where happened :)

Well.. sure :) I was more thinking along the lines of if someone asks you what time it is, he expects something like the hour and the minutes, maybe even the seconds, but people don't go around and ask each other the time in a series of separate questions like "hey, what's the current hour? what's the current minute? what's the current second?". Another way to put it is, if the templight trace was a data-base, and you wanted to get the location of an instantiation, would you expect to have to make three separate SQL queries to get the file-name, line number and column? I don't think so, you'd expect one query would do the trick. But anyway, I think that the way it is structured in the proto message is the right way to go.

@mikael-s-persson
Owner

Nevermind the pull request. I did it myself (didn't know I could do that ;) ).

@mikael-s-persson
Owner

Ok. So I put up an implementation with the protobuf output format. I had to change a couple of things in the message definition file:

  • Removed "oneof" because it's too recent (last August) and not supported much yet. I replaced it with two optional fields (begin / end) for now (which is binary-compatible with "oneof" anyway).
  • Added a TemplightTraceCollection as the top-level thing. This is just so that dumping (raw) multiple trace files into a single file will still be a valid protobuf file (yay!).

I tested my protobuf file output against google's reference implementation, and it works fine, no apparent problem or corrupt values (I didn't really expect any issues, given how simple it is). And for a human-readable version, you can always use a python script to convert it to a text file, or the C++ equivalent. Loading the traces up into C++ code should be straight-forward if you use libprotobuf, but in any case, it'll be good to make standalone loading code (like my standalone saving code), to avoid the dependency with google's library.

In terms of file sizes, it really can't do any miracles, since most of the output data are strings (name / file-name fields occupy most of the output files). Still, the protobuf outputs are significantly smaller (about half the size of the YAML output, in my simple tests).

@schulmar
Author

I also think protobuf is the way to go.

Regarding the file size: I think one of the biggest cost savers would be to only reference the file rather than write the file name every time. In the binary format that I linked to, there was a second file that contained only a list of the paths to the used source files. The trace then contained only the index of the file in that list instead of the full name.

However, I would not use a second file but inline the filepaths into the trace. On first use the full name is used and later on the index.

message SourceLocation {
  optional string file_name = 1;
  required uint32 file_id = 2;
  required uint32 line = 3;
  optional uint32 column = 4;
}
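
A minimal sketch of the writer/reader logic this message implies (dicts stand in for the protobuf messages; the names are illustrative):

```python
# Sketch of the proposed scheme: the first occurrence of a path carries the
# full name plus a fresh id; later occurrences carry only the id.
class LocationWriter:
    def __init__(self):
        self.ids = {}

    def encode(self, path, line, col):
        if path in self.ids:
            return {'file_id': self.ids[path], 'line': line, 'column': col}
        fid = self.ids[path] = len(self.ids)
        return {'file_name': path, 'file_id': fid, 'line': line, 'column': col}

class LocationReader:
    def __init__(self):
        self.paths = {}  # file_id -> path, filled in as names stream by

    def decode(self, msg):
        if 'file_name' in msg:
            self.paths[msg['file_id']] = msg['file_name']
        return (self.paths[msg['file_id']], msg['line'], msg['column'])

w, r = LocationWriter(), LocationReader()
first = w.encode('lib/foo.hpp', 10, 3)    # carries file_name and file_id
second = w.encode('lib/foo.hpp', 42, 1)   # carries only file_id
```

Because the full name always precedes any id-only reference in the stream, a sequential reader never sees an unknown id.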

Another possible space optimization: Since we don't need human readability we could use the mangled symbol names which should be shorter.

Regarding the format-version: In the current .proto file the version is in the header. This would mean that each trace in a collection can have its own format version. Is this wanted?

@wmamrak
Contributor

wmamrak commented Oct 21, 2014

And many many projects use hand-crafted XML parsers too.
The vast majority of XML producers and consumers use only a tiny subset, often using hand-crafted code. I mean, that's what templight does, after all.

Templight does not use a parser, it uses an emitter. And each hand-crafted XML parser is vulnerable to bugs and errors. Furthermore, it might not support some of the many standard ways to work around special issues. Hence, eventually, most developers end up with a sledgehammer, wondering why parsing such trivial data has to be so involved.
I never claimed text format is a remedy to all the problems, but in this particular case it has many significant advantages over other formats. That is what I have tried to point out here.
Good we have reached a consensus on protocol buffers, yet I don't know where your objections to a stricter protocol, expressed with "I can live with that", come from. I was politely suggesting some improvements I believed were rational.

Another way to put it is, if the templight trace was a data-base, and you wanted to get the location of an instantiation, would you expect to have to make three separate SQL queries to get the file-name, line number and column? I don't think so, you'd expect one query would do the trick.

If the templight trace were a database table, then each template instantiation would take one row in that table. A single query would indeed do the trick:
SELECT filename, line, column FROM TemplateInstantiations [...].

No one would ever favor concatenation of three values in a single field over three separate, well-typed fields, for a few reasons: no type guessing, space efficiency (no separators, numbers are not stored as text), easier handling of incorrect data (line and column emptiness verification), no extra processing (splitting) after querying the rows. Your solution would be a perfect anti-pattern realization :)

Removed "oneof" because it's too recent (last august) and not supported much yet.

Yeah, it was not in the docs when I last browsed them.

@mikael-s-persson
Owner

However, I would not use a second file but inline the filepaths into the trace. On first use the full name is used and later on the index.

That sounds good. I'll do that.

Regarding the format-version: In the current .proto file the version is in the header. This would mean that each trace in a collection can have its own format version. Is this wanted?

That's not really wanted per se. But it keeps things simpler. Otherwise the collection would need to have its own header, and then we would lose the composability via a straight file dump. A per trace version just keeps things simpler, and it's not really any harder to support multiple versions on a per trace basis as it is on a per collection basis.

Since we don't need human readability we could use the mangled symbol names which should be shorter.

I don't think that's a good idea. If consumers or translators of the trace files need to perform symbol de-mangling, it will most likely add an additional dependency. I don't want consumers to depend on any LLVM or Clang code at all. Also, AFAIK, Clang uses different mangling depending on the driver mode (compatibility with GCC or MSVC), and that adds an extra layer. And finally, I'm not sure that mangled names are necessarily that much shorter, in fact, sometimes they're bigger (decorated).

Yeah, it was not in the docs when I last browsed them.

Yeah, they were quicker at putting up that feature in the docs than they were at releasing it. I looked in the change logs and found that they added that feature in late August of this year. So, unless you are compiling from source (or on some up-stream ppa), you don't have support for that feature. But this doesn't matter too much, since having two optional fields or having them in a "oneof" block ends up looking exactly the same in the encoding (binary-compatible); it's only a rule enforcement during reading. So, it'll be easy to add later on, without breaking anything.

@schulmar
Author

Since we don't need human readability we could use the mangled symbol names which should be shorter.

I don't think that's a good idea.

You are right, and that does not solve the problem I wanted to attack: There are huge amounts of redundancy in the name field which makes it a perfect fit for compression. Now we could just compress the file after having saved it (gzip/bzip/...) but then the surrounding binary structure would increase the entropy and the stream compression would prevent random access in the file (if that is wanted/possible). It would be nice if all strings were stored in a compressed part and only referenced later on. However, this makes safe mode harder to implement and complicates the storage.

@wmamrak
Contributor

wmamrak commented Oct 21, 2014

oneof is supported in the latest stable version, the docs are correct and up-to-date. The problem is whether to force users to switch to the latest version.

@mikael-s-persson
Owner

Ok, so the file-name indexing has been added to the protobuf writer (and example reader). It reduces the file sizes quite a bit even on my small tests.

Now we could just compress the file after having saved it (gzip/bzip/...)

I thought about that. And LLVM does have a zlib binding to compress data. But I'm not sure it's really worth the trouble. If trace files really need to be compressed, they can easily be compressed externally.

I find that the main benefit of reducing file size with the file format is that it makes the saving and loading faster (less parsing or processing of the data). Compression reduces the file size but at the expense of increased processing time on both ends (save / load). It's not a favorable trade-off, and if people want to apply compression to the traces externally, they are welcome to do so.

@mikael-s-persson
Owner

stream compression would prevent random access in the file (if that is wanted/possible).

Random access in the file is not supported. Protobuf is a very "sequential" format. And that is in line with the overall philosophy (and practical reality) that you should never really do random access into files. Also, I have never encountered a file format that was friendly to random-access operations. Have you?

It would be nice if all strings were stored in a compressed part and only referenced later on.

Yeah, it would be nice to have a good way to deal with the template names. I don't remember, but, do you do any pretty-printing work on those names in Templar? For example, templates that are really long like foo<lots, and, lots, of, stuff>, and turn them into foo<...> where "..." can be expanded if the user wants to. My point with this is that if the basic structure of ns::foo<args>::bar (and a few others) could be encoded in the protobuf file, then each identifier (incl. class names, namespaces, etc..) could be saved once and only id-referenced afterwards (in the usual way, same as for file-names). This is a bit crazier as an idea, but it would be very neat (save space, and save parsing time later for pretty-displaying of template names).

@schulmar
Author

Also, I have never encountered a file-format that was friendly to random-access operations. Have you?

I have, but they were designed specifically for that purpose; after all, they were out-of-core data structures. The hierarchical nature of our data makes it a poor fit for such things, I was just curious.

do you do any pretty-printing work on those names in Templar?

I have not yet touched much of Templar (besides making it fit for CGraph and the trace loader) but I don't think that there is a contraction/expansion mechanism, just simple string rendering.

My point with this is that if the basic structure of ns::foo<args>::bar (and a few others) could be encoded in the protobuf file, then each identifier (incl. class names, namespaces, etc..) could be saved once and only id-referenced afterwards

I also thought about doing something of this kind but it could become really cumbersome. Just take multiple nested templates into account A<B<C>, D>::template E<F>::type. There could be some really nasty intricacies in the format required to store all that information. On the other hand that is a far more powerful format with regard to semantic information and would relieve consumers from parsing the strings to do more advanced things than simple displaying.

@mikael-s-persson
Owner

I also thought about doing something of this kind but it could become really cumbersome.

Yeah, I figured that much. The thing is that that structure is already available within Clang. Templight just happens to do a pretty-print of the whole thing and use it as the "name" field for the template instantiation. But Clang's AST has that hierarchical information to begin with (and I use some of it in the "debugger" part of templight). But, of course, if we go down the road of trying to preserve this semantic information, then it could turn into an AST dump, which is a whole different beast (at which point, you're gonna need to use libclang to make sense of it).

And at the end of the day, for programs like Templar, it might be sufficient to do simple < > matching and collapsing, which is easy enough to do without having to know anything about the semantics. But then again, maybe such simple collapsing rules could be encoded in the file format too. If you had a structure like this:

message TemplateName {
  required uint32 id = 1;
  optional string base_name = 2;
  repeated uint32 collapsed_id = 3;
}

And, have it such that a name like A<B<C>, D>::template E<F>::type would turn into something like these messages:

names {
  id: 0;
  base_name: "C";
};
names {
  id: 1;
  base_name: "B<#>";
  collapsed_id: 0;  // refers to "C"
};
names {
  id: 2;
  base_name: "D";
};
names {
  id: 3;
  base_name: "F";
};
names {
  id: 4;
  base_name: "A<#,#>::template E<#>::type";
  collapsed_id: 1;  // refers to "B<C>"
  collapsed_id: 2;  // refers to "D"
  collapsed_id: 3;  // refers to "F"
};
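
Under these assumptions, a consumer could reconstruct the full name with a simple recursive expansion, substituting each "#" placeholder in order (a sketch; dicts stand in for the proto messages):

```python
# Expand a TemplateName-style entry: each '#' in base_name is replaced,
# left to right, by the expansion of the corresponding collapsed_id.
def expand(names, nid):
    entry = names[nid]
    text = entry['base_name']
    for cid in entry.get('collapsed_id', []):
        text = text.replace('#', expand(names, cid), 1)
    return text

names = {
    0: {'base_name': 'C'},
    1: {'base_name': 'B<#>', 'collapsed_id': [0]},
    2: {'base_name': 'D'},
    3: {'base_name': 'F'},
    4: {'base_name': 'A<#,#>::template E<#>::type', 'collapsed_id': [1, 2, 3]},
}
expand(names, 4)  # -> 'A<B<C>,D>::template E<F>::type'
```

A viewer like Templar could stop the expansion at any depth to get the collapsed foo<...> display for free.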

But I agree that this can get hairy very quickly. I guess it all depends on where to strike the balance between a full string name (as it is now) and a full AST dump (retaining all of Clang's "knowledge").

The thing is that if someone is writing a tool (e.g., Templar) and wants to use libclang to do syntax analysis of the code to make his tool "smarter", then he can just as easily parse the template instantiation string (as it is now) with libclang and recreate most of the AST information. And if you need full AST introspection within the template instantiations, then the tool to use is the templight debugger (which is the whole reason I wrote it).

That's why I would tend towards the "dumb" string solution (as it is now), but it would be nice to put some smart structure to avoid repetition of the same strings again and again. That's why I'm starting to consider something like the above structure. What's your take on this?

@schulmar
Author

As we both said, there is not much use in dumping the AST. The only goal is to reduce the file size, so I would use an off-the-shelf compression algorithm. Then again, these algorithms have their own formats, so it would probably take some kind of reimplementation of an existing algorithm to fit it into the format.

Maybe we should postpone this step until it becomes a real issue. (Although that time period might be rather short)

@mikael-s-persson
Owner

In the mean time, it could be nice to add some extensibility on the template name. Something like this:

message TemplateName {
  oneof name_representation {  // or optional fields, until "oneof" is sufficiently well supported.
    string name = 1;
    bytes compressed_name = 2;
    // ... other options could be added.
  };
};

which would allow for any scheme we could come up with to compress or compact the template names.

Then again these algorithms have their own formats so it would probably be some kind of reimplementation of an existing algorithm to fit it into the format.

I believe that the zlib-based compression algorithm in LLVM is essentially head-less (takes a buffer and generates a compressed buffer, i.e., without any "container" header that you would normally find in a gzip / zip file). So, it could be used to simply compress the names. However, I doubt that it will perform very well, because what makes compression work well is repetition, and compressing individual names might not provide much size reduction. It would be much more effective if all the names were compressed together, at which point, we're basically compressing the whole file.

And as far as coming up with some sort of compression algorithm that is best tailored to this application, it would seem to me that the kind of structure I proposed in my last post is one such algorithm that uses domain-knowledge (about the usual structure of the template name strings) to efficiently reduce the space used.

@schulmar
Author

The problem with the LLVM "binding" to zlib is that it is "stateless" from the caller's view. What we really want is to retain the information of previous compressions, to compress later names using their redundancy with earlier ones. I don't know about the internals of zlib, but I assume that they have to do something like this anyway (to support big files). However, this rules out the nicer interface of llvm::zlib::compress.

The result should basically be like this: All names are concatenated in one big file and compressed therein. Then we store each name's compressed information in the corresponding entry's name field. Later on we can reconstruct the names by virtually concatenating the compressed strings into a big file and doing decompression on it.
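
This stop-and-continue scheme is essentially what zlib's streaming interface with a sync flush provides. A sketch in Python (Python's zlib standing in here; LLVM's binding may not expose an equivalent stateful interface):

```python
import zlib

# One compressor is shared across all names and flushed after each, so
# every name gets its own byte-aligned compressed chunk while still
# back-referencing the history of earlier names.
def compress_names(names):
    c = zlib.compressobj()
    return [c.compress(n.encode()) + c.flush(zlib.Z_SYNC_FLUSH) for n in names]

def decompress_names(chunks):
    d = zlib.decompressobj()
    return [d.decompress(ch).decode() for ch in chunks]

names = [
    'std::vector<std::basic_string<char>, std::allocator<std::basic_string<char> > >',
    'std::map<int, double>',
    'std::vector<std::basic_string<char>, std::allocator<std::basic_string<char> > >',
]
chunks = compress_names(names)
assert decompress_names(chunks) == names
# A repeated name is encoded as a back-reference into shared history:
assert len(chunks[2]) < len(chunks[0])
```

The per-name chunks can then be stored in the individual name fields, and reconstruction works by feeding them to one decompressor in order, exactly the "virtual concatenation" described above.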

An alternative method would be to store all names in an extra file that is compressed, and store names in the trace by referencing offsets into that name file. Yet, I dislike having multiple files, because that makes it possible to corrupt the trace by accidentally deleting the other file.

@mikael-s-persson
Owner

As you might have seen, I first implemented this extensible version of TemplateName message such that the names could be replaced by something more clever than just the raw string. Then, I also tried to use the llvm::zlib::compress on the individual names. As expected, the file sizes are mostly increased when doing this, due to some overhead that zlib puts on the buffers and with no real opportunity for compression when the strings are too short and not repetitive enough.

I agree that it would be nice to use zlib in a kind of stop-and-continue manner, to share the names across all name fields. But I'm pretty sure that zlib includes some form of post-processing (last pass), which would make this difficult to do in a sequential manner.

I haven't dealt too much with compression algorithms before. But looking at the high-level description of the DEFLATE algorithm, it seems like a pretty simple scheme of building a dictionary (in-line or separately) of repeated strings (similar to the way we compress the file-names currently) and then applying some Huffman encoding. Besides the Huffman encoding, it's very much the same as the approach used for the file-names, and what I suggested for the template names as well. But I think that using some knowledge of the structure of the names (there are natural delimiters in C++ names, like spaces, brackets, ::, parentheses, etc.) should provide the kind of domain-specific knowledge that could actually make the compression even more effective than a vanilla compression algorithm (that has to do unstructured matching of strings). I think it's worth a try, at least, and I'd be interested to see how it compares to a compressed original protobuf file.
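
As a rough sketch of that domain-aware idea: split names on the natural C++ delimiters and dictionary-code the tokens, so each distinct identifier is stored once (the delimiter set and the encoding here are assumptions for illustration):

```python
import re

# Tokenize on natural C++ name delimiters ('::', angle brackets, commas,
# parentheses, spaces); everything else is an identifier-like token.
TOKEN = re.compile(r'::|[<>,() ]|[^<>,() :]+|:')

def build_dictionary(names):
    """Return (token table, per-name lists of token ids)."""
    table, encoded = {}, []
    for name in names:
        ids = [table.setdefault(tok, len(table)) for tok in TOKEN.findall(name)]
        encoded.append(ids)
    return list(table), encoded

def decode(table, ids):
    return ''.join(table[i] for i in ids)

table, encoded = build_dictionary(['A<B<C>, D>::type', 'A<B<C>, E>::type'])
# decode(table, encoded[0]) reconstructs the original string exactly.
```

Repeated identifiers and namespaces across thousands of instantiation names then cost one table entry plus small integer ids, which a Huffman or varint pass could shrink further.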

I also don't like the idea of having multiple files. Protobuf can handle dumps of raw bytes, which makes it somewhat easy to embed a file into another. So, one option could be to dump the compressed list of all names at the start of the protobuf trace file, as a kind of compressed dictionary of all the names. But I first want to see if the simple compression of the names that I suggested earlier would work.

@wmamrak
Contributor

wmamrak commented Oct 25, 2014

The new compression method works well for me (no errors) and reduces trace file from 38MB to 6.3MB. For completeness: YAML file is 73MB in size and text file is 62MB.

@mikael-s-persson
Owner

That's great! In my own, smaller, tests, the compression was less impressive (from 120KB to 77KB), but I suspected that a much larger trace file would be compressed much more.

I'm a bit skeptical of the correctness of the compression, and especially the reconstruction (reading) of the data from the file. I'll start working on a small separate application that can use the protobuf-reader class that I have written (but not tested) to convert to the other output formats (xml, yaml, text, nested-xml, graphml, and graphviz).

Your code seems really good for testing templight (not too big, not too small). Is there any way you could send that code my way? It only needs to be sufficiently standalone (boost is fine), and you could remove whatever is sensitive, or strip out the "real" code, keeping just the meta-programs. I'm just saying that because I'm starting to realize that writing template-heavy test code is either impractical or highly contrived (e.g., making really stupidly deep recursive expansions with tuples and variadic templates), so I'm starting to look around for real-life code that I could use to test templight more thoroughly.

@wmamrak
Contributor

wmamrak commented Oct 26, 2014

I'm a bit skeptical of the correctness of the compression, and especially the reconstruction (reading) of the data from the file.

Over 29k template names, of which the longest (a single name) was 49kB, have been correctly reconstructed by my very simple python code, so I believe your Reader is correct as well.

Is there any way you could send that code my way? If it's sufficiently standalone (boost is fine), and you could also remove whatever is sensitive or remove the "real" code, just keeping the meta-programs.

You would require three public, quite common libraries besides boost, and, more importantly, you would be the first to test it outside of Windows. The code is standard C++, but, even though I am using the CMake build system, I am sure it won't work under Linux. Well, it does not work correctly even under Windows, but that is a different story. A proficient user, however, should be able to get it running in a few minutes. So it is up to you.

@mikael-s-persson
Owner

I have a first draft of a conversion tool. Currently, I kind of duplicated all the output code from the main templight program into that conversion utility, but eventually, that code would be removed from the main templight program and exist only in the converter. The protobuf decompression seems to work correctly, because the converted traces (templight -> protobuf -> templight-convert -> yaml / xml) are identical to those produced directly from the main templight program (except for fluctuating time-stamps).

At this point, I mostly want to test it out a bit more, and make small revisions here and there to finalize the code. Then, I'll make the switch in the master branch (and add some docs on templight-convert).

You would require three public, quite common libraries besides boost, and, more importantly, you would be the first to test it outside of Windows. The code is standard C++, but, even though I am using the CMake build system, I am sure it won't work under Linux.

If you are OK with just sending that code my way, I'll do whatever I can with it (and unless you are using some overtly platform-specific code, it should compile in Linux too, especially if you can compile it with Clang and GCC under Windows, the main porting issues are related to differences between MSVC and Clang/GCC). And, I don't really need anything that works correctly, it only needs to compile (not even link). As for using CMake, that's one of my next orders of business... making sure templight works well as a compiler substitute in CMake.. Have you tried that? I have lots of code (in https://github.com/mikael-s-persson/ReaK) that would be very good for testing templight, but I need templight to work well in CMake (hopefully just with CC / CXX environment variables).

@wmamrak
Contributor

wmamrak commented Oct 26, 2014

If you are OK with just sending that code my way

Of course I am, and, also, what I am sure won't work is not the compilation, but my CMake. :)

I have specified "native compilers" for the Visual Studio CMake project generator, which resulted in defining CMAKE_C[XX]_COMPILER entries (pointing to templight-cl.exe), but it has no effect on the generated project, which still uses the native MSVC compiler.
However, I have been using something similar with the MinGW project generator successfully. I am not sure if that is what you meant, though.
