-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: type derivation and function docs overhaul #352
Conversation
If we want to go forward with this change, I'd like to see patches for substrait-java, substrait-validator and any other mature libraries that work with this stuff to be available before we merge this. We can't have our sub-repos substantially out-of-date with the spec. |
I'll try to ignore the unreasonableness of that request and instead focus on trying to work out what prompted it. This proposal already very intentionally avoids making any changes to the specification, let alone substantial ones. It's a lot of words, but it's ultimately merely a rewrite of what's already there, to actually describe how the system should (or could) work, rather than just throwing a few examples out there and hoping users get it right. I classified this as a docs overhaul PR for a reason, but maybe the "overhaul" part threw you off? It does add some new syntax, but does so primarily to even be able to define what's going on in a coherent way. If this new syntax is what's bothering you, I can just change the proposal to indicate that the corresponding patterns are not yet supported and exist only as a means to communicate behavior in the specification. This is, in fact, how the current parser in the validator internals already treats these things; it will emit an error if you're using syntax not currently supported by Substrait, even though it understands the syntax perfectly well, because it has to for the internals to be sane.
The PRs that introduce this functionality internally are substrait-io/substrait-validator#32, substrait-io/substrait-validator#43, substrait-io/substrait-validator#48, and substrait-io/substrait-validator#49. But, again, the validator does not support YAML validation at all yet, so there was no preexisting behavior to patch. substrait-io/substrait-validator#47 is the WIP/tracking PR for supporting validation of type extensions; variations and functions don't have tracking PRs yet. substrait-io/substrait-validator#50 defined the protos for describing extensions that I was talking about, but I'll likely be changing those, still; I wrote them as something to work toward more than anything. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several fundamental problems with how you've approached this patch. The behaviors are not unique here and need to be addressed as they are part of a larger pattern that I see as counter to the kind of community that I think we are trying to build. I have no doubt that you approach all your work with good intentions. You're obviously brilliant and passionate and have made a huge set of contributions to the project. Unfortunately, community health is about more than good intentions and good code.
The key challenging behaviors I've seen include:
- Generally negative, denigrating & dismissive tone.
- Failure to collaborate
- Lack of appreciation/respect for domain (related: lack of effort on learning domain).
I don't have the time to go and find the all of the examples of these behaviors. It's been pretty consistent. From you rage-quitting on PRs I spent days reviewing to just being super rude, the pattern is clear to me. As a personal impact, I've often felt like a punching bag for the last several months as you've railed against the problems in the spec, failed to do the legwork to learn some relational algebra and instead chosen to often preface your comments with "I don't know the domain" and then remark how foolish or wrong something is followed by suggesting major refactoring.
If the behavior concepts I listed above are unclear, let me point out some problematic concrete things in this particular PR:
- Your comment: "I'll try to ignore the unreasonableness of that request"
- Your link to documentation of "conflicting" with no justification of what is conflicting or trying to solve that conflict in as narrow and collaborative way as possible.
- Your comment: "This is the culmination of multiple months of work." You shouldn't be working on rewriting the spec for functions unilaterally, especially months at a time.
- Your comment: "No idea if this list is exhaustive, I've not really been keeping track." You need to keep track of what you're proposing/changing. (And this really goes back to the more basic premise that none of us should be working on something for months without collaborating.)
- The fact that you forked code in the Java repo, modified it and are now proposing it for the main repo while never contributing back changes to the Java repo. This isn't okay. It's awesome that you want to improve the grammar and the specification. How about we start by making the single piece of code that is already trying to work with the function definition works as opposed to fracturing the community into two competing grammars.
I know you're trying to make Substrait better. It's awesome that you're volunteering so much time to the project. I also am the first to agree there are problems with the spec and things need to be improved. What I need is your help on is improving positivity, collaboration and domain appreciation. As it stands, these fundamental challenges make patches like this very hard to review.
|
Well, okay, this explains a lot. I appreciate your honesty. I suppose it's not worth much at this point, but I agree that I was out of line in #333 and apologize again. In general, many of the things you mention about my personality are fair enough. I have a hard time looking past a problem once I've seen it, and get upset if I can't solve it, not to mention when I see people apparently in the process of making things measurably worse. I find it especially aggravating when something is almost consistent but not quite, which is how I see Substrait right now. I also don't believe my opinion (or anyone else's) should matter when making decisions, so I try to argue with facts rather than sugarcoating things. Facts can be cold and can be taken personally, despite me not meaning them that way. Simultaneously, when other people start arguing based on feelings rather than facts, like what happened in #333, I get especially upset, because you can never reason someone out of a position they didn't reason themselves into, and disagreements that cannot be resolved leads to deadlock. There are already a while number of examples of that in the form of open PRs. But I guess I'm making excuses now. I don't know. I hope not, but maybe I'm just an asshole. I don't intend to be disrespectful of anyone else's field, and I try stay out of discussions when I don't know what I'm talking about. However, you're right, I don't really care for relational algebra personally. I do however care for and have a background in engineering specifications in general, defining domain-specific languages, and compiler construction. I would even argue that the fact that I don't have prior experience with any particular DBMS or whatever makes me uniquely qualified to look at things objectively without biases to what I'm used to, and to point out when things are unclear. I'm sorry you seem to disagree. I would also like to point out that you don't pay me to work on this; I'm not going to study a field I'm not all that interested in just because you want me to. That goes for both relational algebra in general and the Java programming ecosystem. I spent a day trying to get substrait-java to compile in an IDE so I could maybe look at proposing something, but failed completely, and despite you calling it mature, there doesn't seem to be any documentation or (binary) release + dependencies anywhere. Anyway, while we're making observations about things in general, here's one of mine: we seem to fundamentally disagree on what a specification is. To me, a specification is a document that unambiguously defines an interface. It always precedes any implementation, except perhaps a reference implementation like a validator. Especially for a vendor-agnostic thing like Substrait, a specification should not favor any particular possible implementation; it should strike a balance between letting different implementations communicate with one another by means of the specified interface, while allowing users the freedom to innovate. Something being unclear or open to interpretation is not acceptable, unless it is explicitly unspecified behavior; when something like that is found, the number one priority should be to fix it. Unless the fix is to make it explicitly unspecified, this is necessarily a breaking change, because something that was previously legal due to not being explicitly illegal now actually is illegal. To you, I think, a specification is more of a working document. The specification is secondary to the implementations; if two implementations can communicate with each other, it doesn't matter if they actually follow the specification to the letter, because the specification has already done its job at that point. Perhaps the specification should then be reworded to align with the implementations, which I guess would not be a breaking change if you look at it from this perspective, but just a bugfix. It should be brief and easy to read, making use of the background that any user should logically have already in order to do something with the specification; being overly precise is a waste of everyone's time. I push back against that, perhaps too passionately, because I don't think it's going to work. I think everyone will just end up implementing their own dialect of Substrait, sending a lot of the benefits of having a common representation down the drain. Then, when nothing works together out of the box anyway, people will realize it's a whole lot easier to just make your own IR, and Substrait will either die completely or be just another pseudo-standard. But maybe I'm wrong. On behalf of all the people and companies investing time into making consumers and producers, I certainly hope so. To stress just how big of a communication issue this is (and why I'm bringing it up), it took me a good ten minutes to even understand what you were talking about with this:
The way I see it, I am merely proposing an update to the specification here. If the specification would currently have actually specified a grammar at all rather than just saying impl pending, I would have taken that as a baseline instead. Nothing I proposed here had anything to do with substrait-java or any other implementation, because I consider a specification to be independent and distinct from any implementation. In general I reject the notion that an EBNF grammar has anything to do with implementation; EBNF is a means to specify a language, not to implement one, it just happens to also be useful for parser generators like ANTLR. And the grammar isn't even modified from the one in substrait-java, it's a complete rewrite, except I guess the filename because that made my test scripts easier at the time and I saw no reason to name it differently. The only reason I linked to substrait-java and investigated it at all is because I wanted to avoid breaking it as much as I could, for example by continuing to allow It took me this long not because I changed so much, but because it actually was that hard for me to make heads or tails of it and change as little as possible while still having everything be consistent. I want to make sure I actually understand a problem and make sure a proposed solution works before bothering someone else with my walls of text and to avoid breaking things more inadvertently, and I can't do that incrementally when the baseline doesn't make sense to me. Believe me, it would have been a hell of a lot easier to just start over if I truly didn't care about collaboration. The reason I went through the trouble of formulating a complete system, making sure that it actually works in an implementation, and writing extensive documentation for it on my own is so we now actually have a working baseline to discuss, rather than what amounts to little more than a few examples and a Java implementation that throws "not yet implemented" for evaluating all but the most trivial case. I don't know how to point out what's wrong and inconsistent with the various bits and pieces any more specifically than what I already pointed out in the bullet points in the OP. If none of those strike you as "oh, right, that's indeed pretty broken" then I guess you should just close this PR. If it does but you just don't like my all-at-once method, then that's also fine, but I'm sorry to say that I just don't know how to fix those things incrementally. You'll have to find someone else. FWIW, I'll be moving on from VD at the end of the year, so I'll in practice be out of your hair in ~11 weeks anyway (I made that decision a while back already for unrelated reasons, and it's no secret, so I might as well throw it out here). I'll still try to finish the validator, but someone else will need to take over maintenance from me after that. As for this repository, I don't think I'll be making any more contributions unless I really have to for work. It's pretty clear that my methods or background are not desirable, and I'm already pushing myself to be more positive than I naturally am, make smaller PRs than I naturally would, and don't intend to study relational algebra and read database docs in any more detail than I already am. P.S. If this post seems especially rambly or incoherent, I saw your post just before heading home from volunteer work at around 2 AM. I've been writing and editing this since because there was not a snowball's chance in hell I'd fall asleep before answering. It's somehow 9:45 AM now and I'm really starting to fall asleep. |
I'm very saddened by this response. My goal in providing feedback was to improve our collaboration, not end it. You've provided massive positive benefit to the project and it is a bad outcome for you to depart.
I don't think that is true. I do think that we disagree on what the Substrait project is and where it is in it's lifecycle. To me, the specification is an important part of what we're trying to create. I think we're striving to recreate the same balance that Arrow has achieved: a specification + one or more reference client implementations.
I think that it worked well in Arrow and I believe we can do it again here. On this point, we need to be looking for ways to start applying the "reference implementations must exist for proposed specification changes" paradigm.
I think this makes early iteration of the project similar to what we did in Arrow especially challenging. We need people focused on filling on the gaps before auditing is going to be productive (as opposed to frustrating). Compounding the challenge of abstract "specification work" is the goal for Substrait to sit at a new domain-specific abstraction level. Without a point of reference, simplification can easily result in difficulty of consumption. Substrait must get to the point where it is usable and consumable by people without domain expertise--but it isn't there yet.
Your expertise and passion have been greatly appreciated and will be sorely missed. I hope you reconsider. |
I don't see a CoC violation from @jvanstraten here. It looks as if #333 was a much-needed discussion and got out of hand because of discussion threads. Someone should have been a grown-up and called a zoom meeting before it got so heated. I've seen a lot of good work from @jvanstraten, acting in a strict-but-fair "editor" role that was much for the benefit of the project. Frankly I assumed that he was a paid employee of Jacques' because no one would have enough spare time to process each request so diligently. I don't know what is Substrait's governance structure. I hope that he is speaking on behalf of (something like) a PMC. But I fear that he is speaking as a BDFL and if so there is a bigger problem than Jan's tone and positions. |
If you interested, there is pending proposed governance model that was shared several weeks ago on the mailing list and we've incorporated the feedback of many members of the community. Happy to get your feedback as well. WRT to my comments here. The members of the SMC have discussed this topic multiple times over the last several months amid multiple instances. There was consensus amongst us that while @jvanstraten has huge positive impact to the project there were several things that need to be improved. A culture of respect and collaboration is especially important early on in the establishment of any community. Because we had not formally adopted the governance document, the discussion was held in private as there was concern that it would be a jarring discussion to have in public without formally adopting the transparent pattern we outline in our proposed governance.
I have no interest in a BDFL model. I have a track record of the opposite. Please assume good intentions. |
This was definitely not just Jacques here. I am also not interested in BDFL governance models. |
I'm sorry, but I honestly find this downright shocking. Several months? Without radical transparency you could also have told me in private, you know. Instead, everyone in VD anyway has been telling me how good of a job I've been doing. And then you put me personally on the spot in the sync meeting for bouncing ideas off of colleagues before spamming the whole community with questionable ideas? I believe kids practice that governance model by punching someone while yelling "no punch back!" and then acting like the victim when they are indeed punched back. Such a healthy dynamic. But clearly I'm the only problem here. I know I'm not good at collaboration; I find it very difficult not to get carried away with something and when I started this project I was downright phobic of voicing my opinion in public. I voiced these concerns at VD but was placed on the project anyway for various management reasons, and I've been trying my damdnest to improve and act accordingly. Recently this has involved literally spending a workday on each comment with multiple rewrites and edits to tone it down, then asking a colleague to proofread to make sure I'm not out of line, and then editing to tone it down some more. My idle train of thought is nothing but forming arguments for Substrait discussions now and it's really starting to cause problems for my personal life. Apparently this is not and has not been good enough, so I apologize again for my personality hampering your ability to make good use of my "brilliance", and that it's getting in the way of you reviewing my proposals objectively. I will probably end up regretting this comment as much as I regret #333 but I am beyond my breaking point right now and yet again wasting the nice weather on a day off arguing on my damn phone. If only for my personal mental health, I need to take a good number of steps back here. If that goes as far as ending up out of a job for a few months then so be it. I again apologize for stating facts and my opinion. |
@jacques-n is not wrong in suggesting that we want this community to be a place where people can feel comfortable contributing. That doesn't mean they aren't challenged and I agree that cold hard facts are the best way to do this kind of challenging. I think there are times @jvanstraten 's opinions were interpreted (by more than just @jacques-n) as being more critical than needed. That being said, I am certainly guilty of this myself at times. I was quite hostile I felt during an early discussion on type derivation and more recently got frustrated and gave up during a discussion on random seeds. The entire discussion in #333 was not helped by the fact that I did not do a good job discouraging an internal discussion in VD and it led to a sort of brigading. Certainly there is frustration to go around. We wouldn't need consensus if we didn't have plenty of disagreement and compromise. I would hope that we, as a community, can get to a place where we feel comfortable pointing out these offenses as they happen, and not letting things bottle up to a point of explosion. Again, I will share the blame of not doing more of this myself, and will aim to step in earlier in the future. A lengthy post calling out patterns of behavior is both hard to receive and hard to recover from. It shouldn't be the first indication that someone is upset and it is unfair to the person that receives it. Yes, we spoke about this in private. Yes, I agreed that some of @jvanstraten 's comments were over the line but his intentions were good. I should have done a better job bringing up these points I describe above at that time. I do not think this is how we want to handle these things in the future. |
FWIW, I am extremely grateful for all of the hard work that you've put in and I don't think anyone is saying otherwise. The spec is, undoubtedly, in a better place due to your critiques and contribution. |
Jacques, I used the term BDFL because there was a bossiness to your tone in your first and second comments. The power imbalance between you and Jeroen is palpable. In the Apache governance model (and AFAICT in the Substrait governance model also) all committers are peers when it comes to technical contributions. There were times (in other threads) when, it seemed to me, you were justifying technical decisions because you had written the spec, and the spec was final. The spec is always subject to revision, if there is consensus among committers. If the committers (or perhaps SMC) have agreed to freeze the spec, then it is respectful to cite that decision, not to imply that your opinion as a senior committer trumps the opinion of a more junior committer. There is no such thing as committer seniority. That said, in my work in Calcite I've been in your shoes many times, looking at a complex patch that has not been done quite the way I'd have done it, which took a long time to write, and a considerable time even to review. That process is incredibly stressful to both parties. The best solution I have found is to use inclusive language: "This is important work, and I'd really like to get it merged. The main problems I see are as follows... What do you think? I hear you saying you're uncomfortable working in Java. If we can find someone else to handle that, do you think you could handle the other languages?" For what it's worth, I read
as a constructive comment, though it was a little poorly phrased. I hear Jeroen expressing his emotional reaction to your comment (which is healthy) and pledging to work on the underlying issues and find compromise. I looked briefly at the proposed governance model. It looks OK. I have some minor comments, and I'll make them in the doc. Lastly, long threads with many-paragraph replies are a red flag. It's probable that one or more parties in those threads are under stress for days at a time while the threads play out. These threads place an unacceptable burden on people's mental health, not to mention the huge amount of time it takes to write them. |
It seems like this thread has run its course, and since I no longer intend to work on this, I'll close this in the interest of cleaning up after myself. I won't delete the branch on my fork in case someone else wants to pick things up from where I left off. |
I need to just bite the bullet already and submit something for this.
This is the culmination of multiple man-months of work trying to reconcile the conflicting and outdated specifications and requirements surrounding type derivations, function prototype matching, etc.
Breaking change?
There are two "pseudo-breaking changes" that I'm aware of ("pseudo" because the specification is currently self-contradictory), as well as one pretty pedantic breaking change:
?
for nullable compound types is placed before the parameter pack instead of after. This complies with the website, but is inconsistent with the grammar hidden in substrait-java. The core extensions use a mixture of both. I've never seen anyone write the?
after the parameter pack except for a few cases inarithmetic_decimal.yaml
, and the grammar file was never advertised on the website, so?
before pack seems to be the clear winner here.if x then y else z
for the ternary expression, rather than the C-stylex ? y : z
. As above, this complies with the website and the hidden grammar, but is inconsistent witharithmetic_decimal.yaml
. The C-style ternary would also be HIGHLY ambiguous when combined with some of the language features I added, and even without those would probably not result in a grammar that a tool like Bison (LALR1) can parse. (I have to admit though, that I haven't actually checked if the proposed grammar is LALR1, but from experience dealing with Bison I feel like it should be, for what that's worth)true
orassert
can now no longer be used as an alternate name forT
or something. None of the words are obvious names for bindings aka type parameters though.I'm fine with calling this a breaking change if it makes someone happy, but considering how 100% of the users I know about have only been using YAMLs as structured documentation it seems a bit overkill.
Features
T?
. The current ANTLR grammar does not have a rule for this and the documentation makes no mention of this corner case.coalesce(T?, T?n) -> T?n
, i.e. the output is nullable iff the second argument is nullable, and the first argument must always be nullable. The added syntax is also expressive enough to specifyMIRROR
andDECLARED_OUTPUT
nullability explicitly, allowing the nullability behavior enum attached to a function to become a desugaring step rather than a bunch of special cases in the derivation interpreter.something(T, STRUCT<T, S>...)
is currently at the very least weird if the function is marked inconsistent (S
is not bound yet, which seems to mean that eachS
can be different. But what aboutT
? Shouldn't that refer back to the previous use ofT
?). The proposed syntax would besomething(T, STRUCT<T, ?S>...)
ifT
is consistent andS
can match anything, or you can just use?
instead of?S
if you don't useS
elsewhere. Same as for nullability behavior, this makes handling consistency a desugaring step instead of a special case for the derivation interpreter.VARCHAR<N>
but not the allowed values of ani64
). Basically for free, too; the constraint system relies on the same logic that's already necessary to match function argument type patterns.<=
,>=
, and!=
.T
can be bound to any metatype. This entirely avoids having to do the type analysis suggested by the protos here. It makes them way easier to interpret and no more difficult to write. It is still strongly-typed though (no coercions whatsoever).No idea if this list is exhaustive, I've not really been keeping track.
Anti-features
I unfortunately wasn't able to make everything nicely generalized and intuitive without (serious) breaking changes.
T
should capture data types including their nullability, or if it should capture only the class/variation/parameter pack part of the quadruple and refer specifically to the non-nullable variant. I somewhat reluctantly ended up going for the latter in order to be intuitively consistent with something likei32
referring to specifically a non-nullablei32
rather than an non-concretei32
with unspecified nullability, but it would have been So. Much. Cleaner. if non-nullable types had just required a!
suffix or something, rather than non-nullable being the default. Now you need arcane syntax likeT?n
to capture the class/variation/parameter portion inT
and the nullability inn
.MIRROR
/DECLARED_OUTPUT
andINCONSISTENT
variadic consistency desugaring rules though, so unless a user is specifying weird variadic functions or functions with weird nullability semantics, they are unlikely to ever see them.T
is an integer type:true = T == i8 || T == i16 || T == i32 || T == i64
. There are better ways to write it, for exampleassert covers(T, i8) || covers(T, i16) || covers(T, i32) || covers(T, i64)
, and it's grammatically and semantically possible to add syntax for pattern unions later, to make something likeassert T matches i8 | i16 | i32 | i64
work in addition, or to allow such patterns to just be specified in the argument patterns directly. To work around the weirdness somewhat, I split the documentation into a user-focused section that just describes the intuition of how they work, and the exact specification of how the interpreter deals with them. Patterns are, however, very powerful once you do get used to them, and (aside from the other anti-features) pretty trivial to implement. For reference, the validator implementation of patterns (including lots of validator-specific logic and boilerplate code for error messages) is fully contained in this file.Future proofing
trim(string, N) -> varchar<N>
). The only thing missing is the ability to specify any metatype in the protos used to bind arguments.is_integer -> true
andsize -> 8|16|32|64
trait. A function like integer addition could then be written for example asadd(T, T) -> assert is_integer(T); T
instead of duplicating the function for all integers. The practical upshot of this (besides deduplication) is that if a user defines their own integer types and sets the same trait, the coreadd
function would automatically be extended and usable for those types.DECIMAL
cannot currently be expressed with YAML due to theS <= P
constraint, while this is trivial to specify for the derivation interpreter (you'd literally just sayassert S <= P
). In fact, the entire function argument matching semantics could be copypasted to compound type parameter pack matching; the only difference is that they only have the metavalue "arguments" I mentioned before (no value arguments or optional enumeration arguments).NSTRUCT
pseudotype and certainly the protobufNamedStruct
abstraction thereof has never made sense to me; it makes way more sense to me to just use a single struct type with optional names. For the grammar I generalized this even further and just allow every parameter to be given a name, expressed either as an identifier or as a string. This contrasts with the current grammar, which hasNSTRUCT
as a grammatical special case (which also requiredNSTRUCT
to be a keyword). Thus, if we ever do decide to generalize, the grammar is ready for it..
-based namespace separators that would be part of feat: support for simple extensions dependencies #265.Future work