Up-to-date trace inspector criteria? #2
Well, I just said that because, at the time of writing, Templar had not been updated in about 9 months, is still in something of a beta stage (at least, from what I remember the last time I tried it), and still refers to the original templight patch, which is no longer compatible with Clang trunk; I also remember having some issues making it work with the latest version of graphviz. That, for me, qualifies as not up-to-date or actively maintained. I did not mean to belittle or offend, and I understand full well how difficult it is to find time to work on a side-project like this; I also left my version of templight sitting on a shelf for a long time, not finding the time or motivation to get back on track with it. I could rephrase it. There's nothing formal about what qualifies as up-to-date and maintained, but a good rule of thumb is: if there are lingering bugs, or if a user who gets the code needs to do additional work to get it to compile or work properly on his machine (with his newer versions of the dependent libraries), then it isn't being maintained.

But regarding your work on Templar: I was hoping to change the output formats (and some changes have already been committed by Wojciech Mamrak). The first changes were mostly to reduce the redundant or empty fields in the begin / end entities, which reduces the size of the files significantly. But I'm also considering changing the structure to have more of a recursive / tree-like structure for the yaml and xml outputs, i.e., instead of a flat list of entities, having entities with sub-entities and so on. Would you have any comments, interests, or problems with that? I'm also planning to produce graphviz and GraphML outputs too, which could also make things easier on your side. And if I'm going to be changing the output formats, I'd like to make most of the changes quickly and then never change them again, to avoid future compatibility problems.
So, I'm all ears for ideas about what would make the "perfect" format. |
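To make the flat-versus-nested idea concrete, here is a hypothetical sketch (not the actual templight output; the entry and key names are invented purely for illustration) of a flat begin / end list next to a tree-like alternative:

```yaml
# Flat (current style): paired begin/end entities
- IsBegin: true
  Name: 'A<int>'
- IsBegin: true
  Name: 'B<int>'
- IsBegin: false
  Name: 'B<int>'
- IsBegin: false
  Name: 'A<int>'

# Nested (proposed style): sub-entities instead of begin/end pairs
- Name: 'A<int>'
  Children:
    - Name: 'B<int>'
      Children: []
```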
I am sorry if you got the impression I was offended; I was just genuinely interested. If you have any problems with the current version of Templar then please open a new issue for each one and I will try to find out what the problem is. Nesting the structure seems like a good idea to me; it is much more natural than having a flat list. When I first saw the XML structure I wondered why the values were not stored as attributes but instead as sub-elements; do you have any ideas on that? Regarding the outputs: you mean you want to produce graphviz / GraphML outputs directly from templight? |
Hello, there is one additional inconsistency in the produced trace files that I have somehow missed: the yaml trace file describes the location of a template instantiation using three keys (FileName, Line, Column), whereas the text and xml formats use a single PointOfInstantiation element, whose value is the file name, line, and column separated by the "|" character.
The patch would be very simple, but I can save you these few keystrokes and prepare one. Regards. PS: another thing: for the text format, I would remove all these assignments ("=") in favor of new lines. |
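For what it's worth, parsing the compact PointOfInstantiation value is a one-liner in most languages. A minimal Python sketch (the function name is invented for illustration; the "file|line|column" layout is as described above):

```python
def parse_location(value):
    """Split a compact "file|line|column" location string into parts."""
    file_name, line, column = value.split("|")
    return file_name, int(line), int(column)

print(parse_location("filename.cpp|42|69"))
```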
@wmamrak I second all of that. After having spent some time on the loading code in Templar I think we should give the files some version information. This would make it easier to select the correct loader without having to dig through the files to find version specific constructs. |
It's odd that you guys bring this up today, because I was just doing some work on the output formats, I had a bit of time to spare. Because of the comments you expressed, I've put my commit into a new "develop" branch, to avoid polluting the master with controversial commits. That's because I kind of went the opposite direction of what you just said, before I read the posts.
I implemented both of these changes into an experimental format that you can produce with
Yes, and I did. Both GraphML and graphviz are pretty trivial formats to produce, and the hard work is to transform the cached templight entries into a tree, which is needed to write the nested-XML format anyway. With most of the heavy-lifting done, it was trivial to produce those formats too.
I understand that point, and I could re-consider. There are a few advantages to producing the graphml and graphviz formats directly from templight. For one, they are rather compact formats, which is nice. It should make the tooling code easier to write, i.e., you currently load up the Templight traces, and then generate an in-memory graphviz graph to display in Templar, others (developing other tools) might not be willing to do this. For example, I could imagine a tool that generates the template equivalent of doxygen-style call-graphs to display them in documentation or something, which would be more convenient if it can get a dot (gv) file directly out of templight, to load up directly in graphviz. I don't plan to make those formats customizable in any way, so that if someone wants more than a vanilla graphviz graph, he'll have to post-process it (to apply transforms and styles), which can all be done with dot / graphviz, which is the whole point (he doesn't have to go through any other templight-specific tool). And also, in terms of feature creep / bloat, we are only really talking about at most a couple of hundred lines of code, i.e., a fart in the wind.
I agree that it should be consistent. I don't really know why it was inconsistently done in the first place. And I might add that I don't really like the terminology used either, like "PointOfInstantiation", which is kind of redundant given that the entries are all about instantiations, and I think that a simple "Location" would do just fine.
Uhmm.. I'm not sure that the fact that the formats are not particularly terse to begin with makes it a flaw to have a more compact notation in one field. And an XML or yaml file that is 20% smaller (which is about what you can get with this, in the best cases) is still an improvement.
Well, no. I mean, tokenizing a string by the "|" character is about the easiest task you can do. And unless you are using an off-the-shelf yaml / xml reader, there is no way that this is harder than the multiple field / attribute version of it. This particular format, "filename.cpp|42|69", is a fairly common way of expressing file-name, line, and column. I've seen it in debuggers, diagnostics, editors, etc., in part because it is very easy to parse.
That's a good point. But like I said, the "|" separation is not arbitrary; it's quite common (the other common format is "filename.cpp:42", as in GDB, but that can be a bit tricky to parse on operating systems like Windows that mark hard-drives like "C:", and so the vertical bar is nicer because it's not ambiguous with anything likely to be found in a file-path string). And the other reason why I like the compact version is that "Location" makes more sense as a whole, i.e., you are never going to look up the line or column where the instantiation occurred without first looking up the file-name; in other words, they are, together, one single piece of information, not three separate "attributes". But if there is any risk of things behaving badly with this compact version for the locations, then I'll retract it.
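On the "C:" ambiguity: a colon-separated location can still be parsed unambiguously by splitting from the right, though the "|" separator avoids the issue entirely. A small sketch (the helper name is hypothetical):

```python
def parse_colon_location(value):
    # Split from the right, so a Windows drive letter ("C:") in the
    # path does not break the parse.
    file_name, line, column = value.rsplit(":", 2)
    return file_name, int(line), int(column)

print(parse_colon_location(r"C:\src\some_source.cpp:42:10"))
```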
No. That's a definite no. The vanilla text file format very typically uses the form "var = value" on each line. That's a very common (and, again, easy to parse) format. And, with the equal sign, it's more human-readable, which is the main point of the text format, because if you want a format to be parsed by a program, you'd choose one of the more computer-friendly formats, like yaml / xml / graphml / dot / ... (maybe protobuf or JSON?).
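The "var = value" convention is indeed trivial to consume; a minimal sketch, assuming one assignment per line as in the current text output (the function name is invented):

```python
def parse_text_entry(lines):
    """Parse "var = value" lines into a dict (illustrative only)."""
    entry = {}
    for line in lines:
        if "=" in line:
            key, _, value = line.partition("=")
            entry[key.strip()] = value.strip()
    return entry

print(parse_text_entry(["Name = std::vector<int>", "Kind = TemplateInstantiation"]))
```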
That's probably a good idea. For any XML format, this is pretty straightforward; in any case, most xml parsers don't need to really know what's in the file to read it, but there could be a version number, and possibly a DTD as well. I guess that for the yaml and text formats, it would have to be some ad hoc thing at the start of the file. (And for graphML and graphviz, that's not really necessary, since those formats are fixed by their own standards.) I think it's still a bit early for this. I'd like to first get a nice frozen first proper version of the output formats (which is just a matter of reaching a consensus on issues like those above), include a version number into those output formats, and that'll be the starting point for any kind of future backward- / forward-compatibility maintenance, which should be possible on most formats quite easily.
That sounds a bit odd. You should write one loader that works for all versions of one format, and keep it backward and forward compatible. It's very simple: if you encounter fields you don't recognize, just ignore them (forward compatibility), and if you don't encounter the newer fields, default them and / or fall back to the older behaviour. You do this mostly on a field-by-field basis; having a version number at the start certainly helps along, but you shouldn't have to "select" the correct loader. But anyway, hopefully we should settle on one set of stable output formats pretty soon. After that, I don't think there will be much room for expansion, except maybe adding more "Kind" possibilities. |
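The field-by-field strategy described here (ignore unknown fields, default missing ones) can be sketched as follows; the field names and default values are hypothetical, not the actual trace schema:

```python
# Hypothetical defaults for fields that older traces may not contain.
DEFAULTS = {"Kind": "TemplateInstantiation", "TimeStamp": 0.0, "MemoryUsage": 0}

def load_entry(raw):
    """Build an entry from a parsed record, tolerating version skew.

    Unknown fields in `raw` are simply ignored (forward compatibility);
    missing fields fall back to defaults (backward compatibility).
    """
    return {key: raw.get(key, default) for key, default in DEFAULTS.items()}

# An old trace without TimeStamp, plus a field we do not know about:
print(load_entry({"Kind": "Memoization", "NewFangledField": 1}))
```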
I would be especially interested in differences to the original format.
Of course, it is all a matter of perspective: At the toplevel we only want to call
That works if the format does not change too much but there might be cases where there is no sane default value. Take your new format for an example: If I were to implement it in the same loader as the old one, then the nesting would be totally wrong because the old format nests via
The mentioned differences are where my wish for format versioning comes from. Sure, it helps future revisions, but we already have at least two XML formats now. For now, the first tag of the file might suffice to distinguish them, but I don't think that this is the right approach. |
Hello, I see there are a few misunderstandings which emerged from various points of view, and some from my lack of precision in formulating my thoughts. Please ask yourself the following questions.
In my opinion there should be one format supported. More precisely, the one which, using the simplest syntax and parsing rules, can unambiguously describe all the data produced by Templight.
Consistency? You can save more space by giving the elements/attributes shorter names; you can make the format fixed, separate all the data using "|", and introduce versioning to the format file. Each one would improve the size without loss of consistency. My point is: if a format gives you tools to describe what you want, do not invent new ones or circumvent the existing ones, even in good faith.
I don't especially like the argument of commonness, particularly because it is very subjective. What I prefer is rationality and usefulness.
I have been imprecise - I was thinking of a text file format only and about processing time and complexity of the two operations.
All the above-mentioned formats are more computer-friendly because they can describe complex data structures that a simple text format cannot without introducing additional rules. I don't think their computer-friendliness goes any step further, though :) To sum it up, I would keep the tracer simple (not that it is overcomplicated), i.e. I would provide one simple, easy-to-parse format without any human-legibility improvements, as they are overrated :) If someone is using notepad or vi for viewing trace files, most likely she is doing something wrong. |
I must agree with @wmamrak:
It would make life easier for both the producing and consuming end if there were only one output format to maintain. I see no real use case for opening the file and viewing it in an editor, so we don't need human readability. If we don't need human readability, we could start using a binary format, which would save even more space and could be tailor-made for the problem domain. (In fact, Thomas Benard has already created such a binary format and incorporated it into a fork of templar.) The only advantage that I see with the XML/YAML outputs is that they can be parsed by tools that don't have to know about templight (e.g. for statistics plotting). IMHO that alone does not justify producing these files in templight. If the need for them arises, it should be possible to generate them from a binary trace file with an external application. |
Martin described my thoughts more accurately than I would by myself. |
I agree that it's not a good idea to be supporting several redundant formats. I'm sort of working off of what I inherited from the original templight version. The original XML format and the YAML format are completely redundant. I'm guessing that they only provided the two formats because YAML is more native to LLVM, while XML is more widespread and well-known. Is that really a sufficient reason to provide both? I don't think so. And then, there's the text format. There is no reason to provide a text format other than human readability, and I agree with you guys that reading a templight trace in a text editor has no real purpose / use case. I would vote to eliminate the text format altogether; I never understood why it was part of the original templight. And to that point:
Well, whether they use an off-the-shelf "slow" XML parser or not is beside the point. Relying on a raw text file format has the very big problem that there is no standard definition behind it, and making our own would be a waste of time. The advantage of using an established base format like XML or YAML is that there are clear syntax rules already established, widely known, and followed by people. I've written XML parsers more often than I can count, like most seasoned developers, I presume (and btw, I've had far more trouble with basic text file formats in the past). The well-defined syntax rules of the XML format are what make it easy and reliable to parse. Simple text formats are, by definition, without any clear rules (e.g., quotes? multi-line? what chars must be escaped? are spaces significant? comments? etc.), which means those rules need to be established and communicated to anyone wanting to parse those files. I've seen many projects fail just on the basis that they came up with their own format that nobody is interested in learning or following. The convention is that a raw text format (if any is produced) is for human consumption only, because a program needs precise rules for reliable parsing. Of course, for a quick-and-dirty parsing script, a simple text format is easier, but not for a reliable one. This is the whole reason why all these standard extensible formats exist (XML, YAML, protobuf, JSON, etc.). And so, if there is no use-case for a human-readable format, I would get rid of the text format.
This is exactly the type of discussion I want to avoid. You are talking about establishing some sort of clearer, less ambiguous, set of syntax rules for this new format that you are proposing.... This is a waste of time. Going down that road is going to turn into far more trouble than you can imagine. And the reason why formats like XML have so many quirky rules and "complex" syntax is because when you start going down that road of establishing reliable syntax rules, you find so many issues and corner cases (which we are subject to as well, don't think that we don't), that you end up with something like that, that's what makes a format mature and seasoned. We would be fools not to take advantage of all that stabilization work that has already been done. To me, the primary / preferred output format for templight should be one that relies on a standard extensible format (XML, YAML, protobuf, JSON, whatever, as long as it's mature and well-established).
I'm all for rationality, but the thing you have to understand about file formats is that commonality is the most rational choice. Formats are used to communicate information. We use the English language to communicate because it's the most common one, not because of its inherent rationality or effectiveness. About the "Location" / "Filename / Line / Column" problem, I understand your point that the separated version is more in-line with the format used. But my concern is that the triad of "Filename / Line / Column" is one piece of information, not three. It could also cause issues. For example, the version of templight that Martin has referred to (see here) has started to output not just the single location of the instantiation, but the range (start to end) of locations. In the separated version, this would lead to fields of "StartFilename / StartLine / StartColumn / EndFilename / EndLine / EndColumn", and things like that. If we were to also output the location of the template declaration (not just the instantiation), that would further compound the problem. With the |-separated version, you would have "Start / End" as locations defined as something like "some_source.cpp|42|10" and "some_source.cpp|48|8". I find this solution much nicer, and for being in-line with the format (XML / YAML), I think that the idea is that "one piece of information == one field". Another option could be to use attributes:
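A hypothetical sketch of what such an attribute-based entry might look like (the element and attribute names are invented for illustration, not the actual format):

```xml
<TemplateBegin Kind="TemplateInstantiation" Name="foo&lt;int&gt;"
               Start="some_source.cpp|42|10" End="some_source.cpp|48|8"/>
```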
I agree, sort of (except about what constitutes an "easy to parse format"). The way I see things is that there should one and only one primary / preferred format, that is used by default (currently yaml), and it should be the one format used in accessory tools (like Templar and soon, Metashell too). But, there can still be other formats provided for specific purposes:
Yes. There should be one primary format, that tools should keep in sync with. Any other format that templight can produce, if any, should be aimed at general third-party applications that are not related to templight. Which is sort of the way it has always been: yaml is primary for templight tools (e.g., Templar); xml is for easy off-the-shelf loading in third-party applications that don't have to be in-sync with templight; and text is for the rare cases where one would want to see the trace in a text editor.
I consider the second (nested) XML format as a candidate to replace the existing one. There should not be two XML formats at the end. You said that you liked the idea of having a nested structure and that you would prefer the fields to be attributes of the XML entries. That's what I created with that format. If you like it, or want to make some minor changes to it (like renaming fields or whatever), then we could make that the official XML format. The main draw-back with the nested XML format is that it requires a conversion of the cached entries into a tree, such that the nesting can be resolved. See the EntryTraverseTask class' depth-first traversal in TemplightTracer.cpp for how that code looks. The main issues with that is that it will add run-time overhead (but it won't corrupt the profiles, because it's done only between instantiations, never during one), and it disables the "safe-mode" option because there is no way to immediately print entries for a valid nested XML. But it does have the advantage of being much smaller, and also producing time and memory intervals, instead of absolute time and memory usage (or is that something you want to keep?).
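The flat-to-tree conversion needed for the nested XML output amounts to a stack-based pass over the begin / end events. A minimal sketch of the idea (a simplified model, not the actual EntryTraverseTask code):

```python
def build_tree(events):
    """Nest a flat (kind, name) event stream into a tree.

    `events` is a list of ("begin", name) / ("end", name) pairs, a
    simplified stand-in for the cached templight entries.
    """
    root = {"name": None, "children": []}
    stack = [root]
    for kind, name in events:
        if kind == "begin":
            node = {"name": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:  # an "end" event pops back to the parent entry
            stack.pop()
    return root["children"]

events = [("begin", "A<int>"), ("begin", "B<int>"), ("end", "B<int>"), ("end", "A<int>")]
print(build_tree(events))
```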
Well, that binary format will definitely need re-working (it doesn't even handle endianness (!?!), and I've spotted a few dubious things in it too). Binary formats can be tricky to get done right. If we are going to go that route, I would prefer protobuf, because it's the best of both worlds, i.e., it's a simple binary format that performs very well and is easy to read / write, and it's extensible and has a schema language to allow it to be readable from any application. I've written some protobuf serializers before, and have code that I can easily grab for doing that. I'll mock-up some code to see how that could turn out.
The thing is that it costs virtually nothing to produce them from templight, so, they don't need much justification. I see all formats besides the primary one (currently yaml) as "advanced options" for third-party applications.
That's a possibility. I might consider that for the tree / graph outputs (nested XML, graphML, graphviz) because they have run-time overhead and don't support the "safe-mode". So.... in light of all this, one thing is clear: adopting a good single output format (or set of formats) is not a trivial thing. That's why I knew it had to be the "first order of business" before going forward. My main wishes, for the primary (or sole) output format, are as follows:
I'll start mocking-up some code for a protobuf output format, because I think it is optimal in all those aspects. YAML sort of fits the bill, but I find it a bit awkward. |
I also forgot to mention that I would wish the output to be easily composable, i.e., by dumping several trace files into one single file, which would solve some problems related to invoking templight on several source files while producing a single output file. A protobuf format would naturally work for this, because it's a header-free format, so they can easily be dumped end-to-end into a file. I also uploaded a tentative protobuf .proto file for defining the messages. |
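For reference, the kind of message layout being discussed might look roughly like the sketch below. This is illustrative only; the actual .proto file in the repository is authoritative, and all names here are invented:

```proto
// Illustrative sketch of a templight trace message; not the real schema.
message TemplightTrace {
  message Header {
    required uint32 version = 1;     // per-trace format version
  }
  message Location {
    optional string file_name = 1;
    optional uint32 line = 2;
    optional uint32 column = 3;
  }
  message Entry {
    optional string name = 1;
    optional Location location = 2;
    optional double time_stamp = 3;
    optional uint64 memory_usage = 4;  // 64-bit: compiles can exceed 4 GB
  }
  required Header header = 1;
  repeated Entry entries = 2;
}
```

Because each trace carries its own header, traces can simply be concatenated end-to-end into one file, which is the composability property mentioned above.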
It depends on what you mean by text format. XML is text as well. JSON too. INI, dot, etc. And even though they are human-readable, they are not intended to be human-processable (is anybody browsing webpages this way?). Protocol buffers are a different beast, and if you decide to support only one format, which I am encouraging you to, I would opt for protocol buffers.
Well, I gave you an example of how fast a production YAML parser is in comparison to a non-optimized raw text parser. XML is much more complex, so I assumed that, unless using some non-standard-compliant or, even worse, hand-crafted XML-subset parser, its speed would be similar.
Writing which would take literally 2 minutes of time.
But still, reading documentation is needed to know which elements and attributes are there, which values they can take, and what these attributes and values mean. Reading additional information about the trace file structure would impose very little overhead, not to mention writing the code, not to mention testing.
In general I agree, but not with respect to templight.
Waste of time is waiting for parsing results more than you need to, and not a substantive discussion about reducing this wasted time.
I don't have to understand that because I did not say I am against it. My point is, that I won't use a sledgehammer to crack a nut even if all other people act in this way.
I would argue - template instantiation is one piece of information as it gives full information about what and where happened :) In my opinion Protocol Buffers are a way to go. Simple, extensible, fast, mature, widely-supported. In other words one format to rule them all. |
Two issues:
Field number 0 is used a few times (protobuf does not allow 0 as a field number).
Any reason for a 64-bit integer? Apart from that, the proto file looks fine, but I would consider one change. Right now, such a message is well defined: |
Oh, sorry. Forgot about that rule. I normally generate proto files from my code, so I don't write them by hand all that often.
Well, that is technically the type of the memory usage variable. And theoretically, there are translation units whose memory consumption during compilation can rise above 4GB... I have a few of those (a problem I hope to solve once I iron out a mature version of Templight). I think it's a good idea to use 64 bits. And in any case, with protobuf's VarInt encoding, it doesn't make much of a difference.
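On the VarInt point: protobuf's varint encoding uses one byte per 7 bits of payload, so small values of a uint64 field cost no more on the wire than they would in a uint32 field. A minimal encoder sketch:

```python
def encode_varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

# The value 1 costs one byte whether the field is uint32 or uint64:
print(encode_varint(1).hex())    # 01
print(encode_varint(300).hex())  # ac02
```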
Alright, I can live with that. Can you pull-request this thing so that I don't have to manually enter the changes? Thanks. And about the whole "templight output is simple, so don't overcomplicate the format" argument: I agree that, content-wise, it's very simple (at least for now, not that I have any crazy plans ;) , but you never know what might come up down the line). But what I mostly want to avoid is coming up with some simple syntax rules for a text format (as in, similar to the current text format). That said, with protobuf, there's a natural immunity to so many issues, while remaining very simple. I like that a lot. Being binary, i.e., no string-parsing, avoids so many parsing quirks, like escaping characters and quotes, and character encodings. Oh, and to this:
I have rarely seen people use a full-blown standard-compliant XML parser. Even the heaviest XML parsers in practical use are not fully standard-compliant, AFAIK. And many many projects use hand-crafted XML parsers too. But that is part of the intent of XML, it is not meant to always be used generally (with full DTD and all that). The vast majority of XML producers and consumers use only a tiny subset, often using hand-crafted code. I mean, that's what templight does, after all.
Well.. sure :) I was more thinking along the lines of if someone asks you what time it is, he expects something like the hour and the minutes, maybe even the seconds, but people don't go around and ask each other the time in a series of separate questions like "hey, what's the current hour? what's the current minute? what's the current second?". Another way to put it is, if the templight trace was a data-base, and you wanted to get the location of an instantiation, would you expect to have to make three separate SQL queries to get the file-name, line number and column? I don't think so, you'd expect one query would do the trick. But anyway, I think that the way it is structured in the proto message is the right way to go. |
Nevermind the pull request. I did it myself (didn't know I could do that ;) ). |
Ok. So I put up an implementation with the protobuf output format. I had to change a couple of things in the message definition file:
I tested my protobuf file output against google's reference implementation, and it works fine, no apparent problem or corrupt values (I didn't really expect any issues, given how simple it is). And for a human-readable version, you can always use a python script to convert it to a text file, or the C++ equivalent. Loading the traces up into C++ code should be straight-forward if you use libprotobuf, but in any case, it'll be good to make standalone loading code (like my standalone saving code), to avoid the dependency with google's library. In terms of file-sizes, it really can't do any miracles since most of the output data are strings (name / file-name fields occupy most of the output files). Still, the protobuf outputs are significantly smaller (about half the size as yaml, in my simple tests). |
I also think protobuf is the way to go. Regarding the file size: I think one of the biggest cost savers would be to only reference the file and not write the file name every time. In the binary format that I linked to, there was a second file that contained only a list of the paths to the used source files. The trace then contained only the index of the file in that list instead of the full name. However, I would not use a second file but inline the file paths into the trace: on first use the full name is used, and later on the index.
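The inline interning scheme described here (full path on first use, a small index afterwards) can be sketched like this; a simplified model, not the actual writer code:

```python
class FileNameInterner:
    """Emit a full path on first use, a small index thereafter."""

    def __init__(self):
        self.index = {}

    def intern(self, path):
        if path in self.index:
            return ("ref", self.index[path])  # later uses: index only
        self.index[path] = len(self.index)
        return ("def", path)                  # first use: full path

interner = FileNameInterner()
print(interner.intern("src/main.cpp"))   # first use: full path emitted
print(interner.intern("src/util.hpp"))
print(interner.intern("src/main.cpp"))   # repeated use: emitted as index 0
```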
Another possible space optimization: Since we don't need human readability we could use the mangled symbol names which should be shorter. Regarding the format-version: In the current .proto file the version is in the header. This would mean that each trace in a collection can have its own format version. Is this wanted? |
Templight does not use a parser; it uses an emitter. And each hand-crafted XML parser is vulnerable to bugs and errors. Furthermore, it might not support some of the many standard ways to work around special issues. Hence, eventually, most developers end up with a sledgehammer, wondering why parsing such trivial data has to be so involved.
If the templight trace were a database table, then each template instantiation would take one row in this table. A single query would indeed do the trick. No one would ever favor concatenation of three values in a single field over three separate, well-typed fields, for a few reasons: no type guessing, space efficiency (no separators, numbers not stored as text), easier handling of incorrect data (line and column emptiness verification), no extra processing (splitting) after querying the rows. Your solution would be a perfect anti-pattern realization :)
Yeah, it was not in the docs when I last browsed them. |
That sounds good. I'll do that.
That's not really wanted per se, but it keeps things simple. Otherwise the collection would need its own header, and then we would lose the composability via a straight file dump. A per-trace version is the simpler option, and it's not really any harder to support multiple versions on a per-trace basis than on a per-collection basis.
I don't think that's a good idea. If consumers or translators of the trace files need to perform symbol de-mangling, it will most likely add an additional dependency. I don't want consumers to depend on any LLVM or Clang code at all. Also, AFAIK, Clang uses different mangling depending on the driver mode (compatibility with GCC or MSVC), and that adds an extra layer. And finally, I'm not sure that mangled names are necessarily that much shorter, in fact, sometimes they're bigger (decorated).
Yeah, they were quicker at putting up that feature in the docs than they were at releasing it. I looked in the change logs and found that they added that feature in late August of this year. So, unless you are compiling from source (or on some up-stream ppa), you don't have support for that feature. But this doesn't matter too much, since having two optional fields or having them in a "oneof" block ends up looking exactly the same in the encoding (binary compatible); it's only a rule enforcement during reading. So, it'll be easy to add later on, without breaking anything. |
You are right, and that does not solve the problem I wanted to attack: There are huge amounts of redundancy in the name field which makes it a perfect fit for compression. Now we could just compress the file after having saved it (gzip/bzip/...) but then the surrounding binary structure would increase the entropy and the stream compression would prevent random access in the file (if that is wanted/possible). It would be nice if all strings were stored in a compressed part and only referenced later on. However, this makes safe mode harder to implement and complicates the storage. |
oneof is supported in the latest stable version; the docs are correct and up-to-date. The problem is whether to force users to switch to the latest version. |
Ok, so the file-name indexing has been added to the protobuf writer (and example reader). It reduces the file sizes quite a bit even on my small tests.
I thought about that. And LLVM does have a zlib binding to compress data. But I'm not sure it's really worth the trouble. If trace files really need to be compressed, they can easily be compressed externally. I find that the main benefit of reducing file size with the file format is that it makes the saving and loading faster (less parsing or processing of the data). Compression reduces the file size but at the expense of increased processing time on both ends (save / load). It's not a favorable trade-off, and if people want to apply compression to the traces externally, they are welcome to do so. |
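To put a number on the redundancy argument: template name fields compress extremely well with a general-purpose stream compressor, which is why external compression is such an easy fallback. A quick sketch using Python's zlib (a stand-in for LLVM's zlib binding; the sample data is invented):

```python
import zlib

# Highly repetitive template names, typical of a trace file.
names = ("std::vector<std::pair<int, float>, "
         "std::allocator<std::pair<int, float>>>\n" * 1000).encode()

compressed = zlib.compress(names)
print(len(names), "->", len(compressed))
```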
Random access in the file is not supported. Protobuf is a very "sequential" format. And that is in line with the overall philosophy (and practical reality) that you should never really do random access into files. Also, I have never encountered a file format that was friendly to random-access operations. Have you?
Yeah, it would be nice to have a good way to deal with the template names. I don't remember: do you do any pretty-printing work on those names in Templar? For example, templates that are really long, like
I have, but they were designed specifically for that purpose; after all, they were out-of-core data structures. The hierarchical nature of our data makes it a poor fit for such things; I was just curious.
I also thought about doing something of this kind, but it could become really cumbersome. Just take multiple nested templates into account
Yeah, I figured that much. The thing is that that structure is already available within Clang. Templight just happens to do a pretty-print of the whole thing and use it as the "name" field for the template instantiation. But Clang's AST has that hierarchical information to begin with (and I use some of it in the "debugger" part of templight). But, of course, if we go down the road of trying to preserve this semantic information, then it could turn into an AST dump, which is a whole different beast (at which point, you're gonna need to use libclang to make sense of it). And at the end of the day, for programs like Templar, it might be sufficient to do simple
And, have it such that a name like
But I agree that this can get hairy very quickly. I guess it all depends on where to strike the balance between a full string name (as it is now) and a full AST dump (retaining all of Clang's "knowledge"). The thing is, if someone is writing a tool (e.g., Templar) and wants to use libclang to do syntax analysis of the code to make his tool "smarter", then he can just as easily parse the template instantiation string (as it is now) with libclang and recreate most of the AST information. And if you need full AST introspection within the template instantiations, then the tool to use is the templight debugger (which is the whole reason I wrote it). That's why I would tend towards the "dumb" string solution (as it is now), but it would be nice to put some smart structure in place to avoid repetition of the same strings again and again. That's why I'm starting to consider something like the above structure. What's your take on this?
As we both said, there is not much use in dumping the AST. The only goal is to reduce the file size, so I would use an off-the-shelf compression algorithm. Then again, these algorithms have their own formats, so it would probably take some kind of reimplementation of an existing algorithm to fit it into our format. Maybe we should postpone this step until it becomes a real issue. (Although that might happen rather soon.)
In the meantime, it could be nice to add some extensibility on the template name. Something like this:
which would allow for any scheme we could come up with to compress or compact the template names.
I believe that the zlib-based compression algorithm in LLVM is essentially header-less (it takes a buffer and generates a compressed buffer, i.e., without any "container" header that you would normally find in a gzip / zip file). So, it could be used to simply compress the names. However, I doubt that it will perform very well, because what makes compression work well is repetition, and compressing individual names might not provide much size reduction. It would be much more effective if all the names were compressed together, at which point we're basically compressing the whole file. And as far as coming up with some sort of compression algorithm best tailored to this application, it would seem to me that the kind of structure I proposed in my last post is one such algorithm: it uses domain knowledge (about the usual structure of the template name strings) to efficiently reduce the space used.
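The individual-vs-together point is easy to demonstrate with Python's zlib module as a stand-in for the LLVM binding (the template names below are made-up toy data): per-name compression pays zlib's fixed overhead on every short string, while compressing all names in one stream exploits the cross-name redundancy.

```python
import zlib

# toy corpus: a few short template names, heavily repeated as in a real trace
names = ["std::vector<int>", "std::vector<std::vector<int>>",
         "std::vector<double>"] * 200

raw_size = sum(len(n) for n in names)

# compressing each short name on its own barely helps (if at all),
# because of the per-buffer header/checksum overhead
per_name = sum(len(zlib.compress(n.encode())) for n in names)

# compressing the concatenation lets zlib back-reference earlier names
all_at_once = len(zlib.compress("\n".join(names).encode()))
```

On this toy corpus, `all_at_once` comes out far smaller than both `raw_size` and `per_name`, which matches the observation above that only whole-stream compression is worthwhile.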
The problem with the LLVM "binding" to zlib is that it is "stateless" from the caller's view. What we really want is to retain the state of previous compressions, so that later names can be compressed using their redundancy with earlier ones. I don't know about the internals of zlib, but I assume it has to do something like this anyways (to support big files). However, this rules out the nicer interface. The result should basically be like this: all names are concatenated in one big file and compressed therein; then we store each name's compressed information in the corresponding entry's name field. Later on, we can reconstruct the names by virtually concatenating the compressed strings into a big file and doing decompression on it. An alternative method would be to store all names in an extra file that is compressed, and to store names in the trace by referencing offsets into the name file. Yet, I dislike having multiple files, because that makes it possible to corrupt the trace by accidentally deleting the other file.
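The stateful scheme sketched above can actually be done with standard zlib, using Python's bindings as a toy stand-in for the LLVM one (the names are made up): `Z_SYNC_FLUSH` emits a byte-aligned, self-delimited chunk per name while the compressor keeps its history window, so later chunks can back-reference earlier names, and a single shared decompressor recovers each name from its own chunk.

```python
import zlib

names = ["std::vector<int>", "std::vector<int>::iterator",
         "std::vector<std::vector<int>>"]

# writer side: one shared compressor; each name becomes its own chunk,
# but the compression history is carried across all of them
comp = zlib.compressobj()
chunks = [comp.compress(n.encode()) + comp.flush(zlib.Z_SYNC_FLUSH)
          for n in names]

# reader side: feed the chunks, in order, to one shared decompressor
decomp = zlib.decompressobj()
recovered = [decomp.decompress(c).decode() for c in chunks]
```

This is essentially the "virtually concatenated big file" idea: each entry's name field holds one chunk, and reconstruction just streams the chunks through a single decompressor.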
As you might have seen, I first implemented this extensible version of the TemplateName message, such that the names could be replaced by something more clever than just the raw string. Then, I also tried to use llvm::zlib::compress on the individual names. As expected, the file sizes mostly increased when doing this, due to some overhead that zlib puts on the buffers, and with no real opportunity for compression when the strings are too short and not repetitive enough.

I agree that it would be nice to use zlib in a kind of stop-and-continue manner, to share the names across all name fields. But I'm pretty sure that zlib includes some form of post-processing (a last pass), which would make this difficult to do in a sequential manner. I haven't dealt much with compression algorithms before, but looking at the high-level description of the DEFLATE algorithm, it seems like a pretty simple scheme of building a dictionary (in-line or separately) of repeated strings (similar to the way we compress the file-names currently) and then applying some Huffman encoding. Besides the Huffman encoding, it's very much the same as the approach used for the file-names, and what I suggested for the template names as well. But I think that using some knowledge of the structure of the names (there are natural delimiters in C++ names, like spaces, brackets, ::, parentheses, etc.) should provide the kind of domain-specific knowledge that could actually make the compression even more effective than a vanilla compression algorithm (which has to do unstructured matching of strings). I think it's worth a try, at least, and I'd be interested to see how it compares to a compressed original protobuf file.

I also don't like the idea of having multiple files. Protobuf can handle dumps of raw bytes, which makes it somewhat easy to embed a file into another.
So, one option could be to dump the compressed list of all names at the start of the protobuf trace file, as a kind of compressed dictionary of all the names. But I first want to see if the simple compression of the names that I suggested earlier works.
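A minimal sketch of the delimiter-based idea (hypothetical helper names, not the actual writer code): split each template name on the natural C++ delimiters, build one shared token dictionary, and store each name as a list of token indices. Because names like `std::vector<...>` share almost all of their tokens, the dictionary stays small.

```python
import re

def tokenize(name):
    # split on ::, angle brackets, commas, parentheses, and spaces,
    # keeping the delimiters as tokens so the name reassembles exactly
    return [t for t in re.split(r"(::|[<>,() ])", name) if t]

def encode_names(names):
    tokens = []       # shared dictionary of unique tokens
    tok_to_idx = {}
    encoded = []      # one list of token indices per name
    for name in names:
        ids = []
        for t in tokenize(name):
            if t not in tok_to_idx:
                tok_to_idx[t] = len(tokens)
                tokens.append(t)
            ids.append(tok_to_idx[t])
        encoded.append(ids)
    return tokens, encoded

def decode_name(tokens, ids):
    return "".join(tokens[i] for i in ids)

names = ["std::vector<int>", "std::vector<std::vector<int>>"]
tokens, encoded = encode_names(names)
```

The round trip is exact, and the token dictionary itself could then be what gets dumped (optionally zlib-compressed) at the start of the trace file.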
The new compression method works well for me (no errors) and reduces the trace file from 38MB to 6.3MB. For completeness: the YAML file is 73MB in size, and the text file is 62MB.
That's great! In my own smaller tests, the compression was less impressive (from 120KB to 77KB), but I suspected that a much larger trace file would compress much better. I'm still a bit wary about the correctness of the compression, and especially of the reconstruction (reading) of the data from the file. I'll start working on a small separate application that can use the protobuf-reader class that I have written (but not tested) to convert to the other output formats (xml, yaml, text, nested-xml, graphml, and graphviz). Your code seems really good for testing templight (not too big, not too small). Is there any way you could send that code my way? It would need to be sufficiently standalone (Boost is fine), and you could also remove whatever is sensitive, or remove the "real" code and just keep the meta-programs. I'm just saying that because I'm starting to realize that writing template-heavy test code is either impractical or highly contrived (e.g., making really stupidly deep recursive expansions with tuples and variadic templates), so I'm starting to look around for real-life code that I could use to test templight more thoroughly.
Over 29k template names, of which the longest single name was 49kB, have been correctly reconstructed by my very simple Python code, so I believe your Reader is correct as well.
You would require three public, quite common libraries besides Boost, and, more importantly, you would be the first to test it outside of Windows. The code is standard C++, but even though I am using the CMake build system, I am sure it won't work under Linux out of the box. Well, it does not work correctly even under Windows, but that is a different story. A proficient user, however, should get it running in a few minutes. So, it is up to you.
I have a first draft of the conversion tool. Currently, I have more or less duplicated all the output code from the main templight program into that conversion utility, but eventually that code will be removed from the main templight program and exist only in the conversion tool. The protobuf decompression seems to work correctly, because the converted traces (templight -> protobuf -> templight-convert -> yaml / xml) are identical to those produced directly from the main templight program (except for fluctuating time-stamps). At this point, I mostly want to test it out a bit more and make small revisions here and there to finalize the code. Then, I'll make the switch in the master branch (and add some docs on templight-convert).
If you are OK with just sending that code my way, I'll do whatever I can with it (and unless you are using some overtly platform-specific code, it should compile on Linux too; especially if you can compile it with Clang and GCC under Windows, since the main porting issues come from differences between MSVC and Clang/GCC). And I don't really need anything that works correctly; it only needs to compile (not even link). As for using CMake, that's one of my next orders of business: making sure templight works well as a compiler substitute in CMake. Have you tried that? I have lots of code (in https://github.com/mikael-s-persson/ReaK) that would be very good for testing templight, but I need templight to work well with CMake (hopefully just via the CC / CXX environment variables).
Of course I am. And also, what I am sure won't work is not the compilation, but my CMake setup. :) I have specified "native compilers" for the Visual Studio CMake project generator, which resulted in defining CMAKE_C[XX]_COMPILER entries (pointing to templight-cl.exe), but it has no effect on the generated project, which still uses the native MSVC compiler.
The readme says
and links to Templar.
Which criteria need to be fulfilled for an application to be considered up-to-date, in your opinion? (I am planning to work on this when I have time to spare.)