Add related source code locations to errors #13664

eliaperantoni · 2024-12-05T14:52:01Z

After the PoC that I did in https://github.com/eliaperantoni/datafusion/tree/eper/inline-diagnostics, in this PR I'm attempting to build a more robust solution, with the goal of eventually merging it.

Still, so far I've only added diagnostics to the Cannot coerce arithmetic expression .. to valid types error to get feedback and approval on the general implementation. That way I can make tweaks quickly, before I go extend this to the rest of the repo.

Which issue does this PR close?

Closes #13662.

What changes are included in this PR?

I introduce the Diagnostic, DiagnosticEntry, and DiagnosticEntryKind types to enrich errors with better user-facing information, mainly by referring to the portions of the SQL code that caused it.

A DiagnosticEntry is used to refer to one particular portion of SQL code and describe its involvement in the error with a string message and a kind. A Diagnostic is simply a collection of DiagnosticEntry.

For example, in SELECT a_string + an_integer we'd have:

Entry kind=Error message=Incompatible types in binary expression Wrapping the whole a_string + an_integer expression.
Entry kind=Note message=Is of type Utf8 Wrapping just the left side.
Entry kind=Note message=Is of type Int64 Wrapping just the right side.

A new DataFusionError::Diagnostic variant is used to attach Diagnostic to errors. You can call with_diagnostic on any DataFusionError to do so. They can later be retrieved by calling DataFusionError::get_diagnostics.

It is possible to attach multiple Diagnostic while an error is being returned along the function call stack. The rationale here is that different functions might have different levels of detail, but they might all have useful diagnostics to add. e.g. the function that computes the output type of an expression has no idea about why it's being called, but calling functions have no idea about the intricacies of the error. They all know different things.

I also added a span: Span to Column. This is the first step of a (probably) long process of enriching the types used during logical planning with the Spans from the parsed AST nodes.

Added WithSpan<T>. The idea is that we want to get Span information down to functions called deep in the call stack, but without breaking existing code too much so that we can ensure a smooth and incremental transition to Diagnostic-ed errors. WithSpan<T> was my attempt to do so because any T implements Into<WithSpan<T>> by simply attaching Span::empty (but of course, if a Span is available it should be used instead). This means that any time a function wants to start spawning Diagnostic-ed errors, it can change some of its arguments by wrapping their types in WithSpan<T> and existing calls will keep on working.

WithSpan<T> should be used when the argument type is something that can loosely be traced back to a location in the source code, but is not really part of any AST node, or anything that is isomorphic to it. A good example is DataType which is taken by signature(DataType, Operator, DataType) -> DataType to get the output type of a binary expression. That is the function that spawns the error, and the caller doesn't have enough information to provide a good Diagnostic. But at the same time, it would be (in my opinion) cumbersome to add Span arguments to the function. So instead, the function can wrap the DataTypes for the two sides of the expression in WithSpan<T>. Some call points can be updated to pass a Span-enriched WithSpan<DataType>, and all the others will keep on working without changes required.

Are these changes tested?

Not yet, while this a draft to gather feedback.

Are there any user-facing changes?

Yes. The user can call DataFusionError::get_diagnostics to get an iterator over all the Diagnostic that have been attached to the error chain. Those contain source code locations that relate to the error, and each location comes with a message.

alamb

Thank you @eliaperantoni -- this is very cool. I left some comments, let me know what you think

alamb · 2024-12-05T18:27:57Z

datafusion/common/src/error.rs

@@ -131,6 +132,7 @@ pub enum DataFusionError {
    /// Errors from either mapping LogicalPlans to/from Substrait plans
    /// or serializing/deserializing protobytes to Substrait plans
    Substrait(String),
+    Diagnostic(Diagnostic, Box<DataFusionError>),


This is an interesting idea -- one way to think about this is that it adds additional structured information to DataFusionError::Context

If we went with the DataFusion::Diagnostic approach, do you think we would be able to deprecate / remove DataFusionError::Context in a future release?

Thank you! Yes, I think DataFusion::Diagnostic can convey a superset of the information that DataFusion::Context can. Any wrapping such as:

let error = DataFusionError::Context(message, Box::new(error));

can be converted to:

let error = DataFusionError::Diagnostic( Diagnostic { entries: vec![ DiagnosticEntry { span: Span::empty(), kind: DiagnosticEntryKind::Error, message, } ] }, Box::new(error) );

And of course, we can provide a DataFusionError::with_simple_diagnostic function to avoid the boilerplate. At that point, DataFusion::Context could be removed.

This also enables progressively adding Span information to what was previously simply a string message.

alamb · 2024-12-05T18:31:11Z

datafusion/common/src/column.rs

 pub struct Column {
    /// relation/table reference.
    pub relation: Option<TableReference>,
    /// field/column name.
    pub name: String,
+    #[derivative(


this is an interesting pattern -- it is different than the way sqlparser did it (which is to effectively wrap the span with a struct that is ignored when comparing eq, hash, etc

Did you consider a similar approach?

pub struct Column { ... // Wrapped span that does not contribute to PartialEq, Eq, Hash, etc pub span: DiagnosticOnly<Span> }

I think this approach is less verbose and might be easier to understand for the casual reader

I also see WithSpan has a similar approach (but that is for wrapping things with Spans 🤔 )

It is indeed interesting. 😅 I was going to use derivative for this problem in sqlparser initially actually, but I wasn't sure if you would want to add a new dependency. If that is not a concern, I think using derivative for cases like this is pretty nice as it is also clear at the usage site that it is being ignored, unlike AttachedToken which kind of hides the behavior. If a wrapper type is preferred, I now think actually maybe the struct in sqlparser should have been called DiagnosticOnly as well 😂

I think the AttachedToken approach that we took in sqlparser could be a functionally equivalent alternative. And I understand that not everybody might be familiar with derivative.

What I don't 100% love about it is that the PartialEq implementation on AttachedToken is a bit "impure". In that, it serves the one specific purpose of making it ignored when used in a struct field, but prevents you from actually comparing two instances of it because they would always are equal (the implementation returns always true), which is something you might want to do at some point. Especially because the name AttachedToken doesn't clearly convey that intent, and does more than just being a passthrough for PartialEq since its main purpose is tying together a Span and a Token.

When looking at a struct that uses it:

#[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)] pub struct Select { /// Token for the `SELECT` keyword pub select_token: AttachedToken, // ... }

it's not clear to me that AttachedToken is being ignored. I only realise that when I go look at the implementation of PartialEq.

Overall, the derivative approach to me more clearly lets me keep the sensible implementation of PartialEq for Span, while also conveying that "when used as a field in this particular struct, I want it to not influence the struct's comparison". But in other structs, I might want it to influence the comparison.

I personally find this approach more readable and flexible, but I'm open to converting to a wrapper type :)

I also see WithSpan has a similar approach (but that is for wrapping things with Spans 🤔 )

Ah yes good catch! I could definitely have used derivative for WithSpan too. And about WithSpan: do you like the general idea of it?

It's meant to be a "sidecar" to get a Span all the way to where it's needed to create a meaningful diagnostic, by attaching it to values that are sort-of-related, but wouldn't make a lot of sense to add as a struct field.

For example, the two DataTypes in a binary expression come from the two expressions in the SQL query, so they (in some sense) relate to a portion of the code. But putting span as a field of DataType sounds wrong.

Maybe a better name could be SpanRelated or something like that?.

From my perspective some design goals should be:

Make it easy for someone reading the DataFusion source code to understand what is going on

If people are not interested in the span / diagnostic information they can ignore it

From my perspective, using new crates does reduce the amount of code in the DataFusion crate, but can often increase cognitive load (as now to understand the DataFusion code you need to understand the crates)

I think the AttachedToken approach that we took in sqlparser could be a functionally equivalent alternative. And I understand that not everybody might be familiar with derivative.

Yeah, that is why I think AttachedToken might be better in this case

alamb · 2024-12-05T18:36:23Z

datafusion/common/src/with_span.rs

+
+use sqlparser::tokenizer::Span;
+
+/// A wrapper type for entitites that somehow refer to a location in the source


Did you consider a trait like was done in sqlparser that retrieves spans? For example

pub trait Spanned { /// return the span for this object, if known fn span(&self) -> Option<&Span> }

And then we would implement that trait for various DataFusion / Arrow tyoes (like DataType, etc)?

That would likely require implementing specific wrapper types like DataTypeWithSpan or something to attach Span information to DataTypes

I also see the need to somehow plumb through the span information into things like coerce-types 🤔

Though maybe now would be a good time to make a proper structure for coercing types, (struct TypeCoercer or something) where we could attach the spans 🤔

Did you consider a trait like was done in sqlparser that retrieves spans?

Yes, but to implement one such trait I'd have to put the data in DataType. And I think that's a bit pushing the limits of what DataType is meant to do. I quite like the idea of the WithSpan (can rename) functioning like a sidecar, to limit big changes in the code.

Maybe we could have a Spanned trait in datafusion as well, that is automatically implemented for all WithSpan<T>, and also by types that natively carry a Span?

Though maybe now would be a good time to make a proper structure for coercing types, (struct TypeCoercer or something) where we could attach the spans 🤔

Not too familiar about the history of that component. But I think that's just one example, and there's probably many more cases where changes like that would be needed to accommodate the presence of Spans.

eliaperantoni · 2024-12-09T14:26:02Z

@alamb here's a few examples of what the new diagnostics that I implemented so far can do:

What are you thoughts so far?

Note that the rendering here is not part of my PR, but it consumes the Diagnostic data that is now produced by datafusion.

We think that this could be quite nicely integrated with datafusion-dft and datafusion-clito provide a richer experience for end users.

Perhaps this kind of quality of life features can also be related to #13525, as it would make it easier for consumers of Datafusion to provide a nicer experience to their end users.

alamb · 2024-12-12T16:07:09Z

Sorry for my radio silence here -- I am reviewing this now. Thank you for your patience

alamb

Sorry again for the delay

I spent some more time reviewing this PR, but did not get through it entirely

I think one of the largest design decisions is "where to store the spans"?

There is a tension between trying to keep full source provenance information and overburdening the existing code. I am musing here, but I am hoping to find some balance between

Getting something useful in to DataFusion that we can incrementally update
Ensuring we have a good long term roadmap.

The basic idea of attaching the spans to existing LogicalPlan and Exprs (which is what this PR seems to do) makes sense to me in theory, but in practice it feels like it may be hugely disruptive (many changes to existing structs) and the information propagation will likely be to hard to maintain and test -- all code that transforms/rewrites would have to propagate the spans

I also added a span: Span to Column. This is the first step of a (probably) long process of enriching the types used during logical planning with the Spans from the parsed AST nodes.

What do you think about initially focusing on locations from the errors out of the sql planner (SqlToRel) and postpone adding Spans to the plan nodes?

This would allow us to set up a good pattern for attaching Diagnostics to errors and reportin them. Once that infrastructure is solidified we can start figuring out a pattern to plumb the source information into other areas (like type coercions errors)?

alamb · 2024-12-12T16:14:44Z

datafusion/common/src/column.rs

 pub struct Column {
    /// relation/table reference.
    pub relation: Option<TableReference>,
    /// field/column name.
    pub name: String,
+    #[derivative(


From my perspective some design goals should be:

Make it easy for someone reading the DataFusion source code to understand what is going on

If people are not interested in the span / diagnostic information they can ignore it

From my perspective, using new crates does reduce the amount of code in the DataFusion crate, but can often increase cognitive load (as now to understand the DataFusion code you need to understand the crates)

I think the AttachedToken approach that we took in sqlparser could be a functionally equivalent alternative. And I understand that not everybody might be familiar with derivative.

Yeah, that is why I think AttachedToken might be better in this case

alamb · 2024-12-12T16:15:47Z

datafusion/common/src/column.rs

+    /// Attaches a [`Span`] to the [`Column`], i.e. its location in the source
+    /// SQL query.
+    pub fn with_span(mut self, span: Span) -> Self {
+        self.spans = vec![span];


should this perhaps append a span to the Column rather than overwrite it?

alamb · 2024-12-12T16:19:46Z

datafusion/common/src/dfschema/mod.rs

@@ -113,6 +117,10 @@ pub struct DFSchema {
    field_qualifiers: Vec<Option<TableReference>>,
    /// Stores functional dependencies in the schema.
    functional_dependencies: FunctionalDependencies,
+    /// The location in the source code where the fields are defined (e.g. in


This seems somewhat strange to me as the schema may not be defined in the query

Maybe we can avoid adding this as part of V1

alamb and others added 3 commits November 30, 2024 08:22

Temp patch to new sqlparser

e4bbffa

Update for new sqlparser API

3545647

feat: init diagnostics

34c388a

github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Dec 5, 2024

alamb reviewed Dec 5, 2024

View reviewed changes

feat: diagnostics for unknown columns and missing in GROUP BY

dee7959

This was referenced Dec 6, 2024

Add related source code locations to errors #13662

Open

Report multiple errors, not just the first one #13676

Open

eliaperantoni added 2 commits December 9, 2024 15:07

feat: FieldSpans and ambiguous column reference diagnostics

515bb5e

Merge branch 'main' into eper/diagnostics

cdee494

eliaperantoni added 3 commits December 10, 2024 10:04

feat: UNION wrong number of fields diagnostics

03afbf6

feat: collect more errors

2d3e2a1

feat: ignorable errors

0506a80

github-actions bot added the proto Related to proto crate label Dec 11, 2024

alamb reviewed Dec 12, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add related source code locations to errors #13664

Add related source code locations to errors #13664

eliaperantoni commented Dec 5, 2024

alamb left a comment

alamb Dec 5, 2024

eliaperantoni Dec 6, 2024

alamb Dec 5, 2024

alamb Dec 5, 2024

Nyrox Dec 6, 2024

eliaperantoni Dec 6, 2024

eliaperantoni Dec 6, 2024

alamb Dec 12, 2024

alamb Dec 5, 2024

alamb Dec 5, 2024

eliaperantoni Dec 6, 2024

eliaperantoni commented Dec 9, 2024 •

edited

Loading

alamb commented Dec 12, 2024

alamb left a comment

alamb Dec 12, 2024

alamb Dec 12, 2024

alamb Dec 12, 2024


		use sqlparser::tokenizer::Span;

		/// A wrapper type for entitites that somehow refer to a location in the source

Add related source code locations to errors #13664

Are you sure you want to change the base?

Add related source code locations to errors #13664

Conversation

eliaperantoni commented Dec 5, 2024

Which issue does this PR close?

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eliaperantoni commented Dec 9, 2024 • edited Loading

alamb commented Dec 12, 2024

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eliaperantoni commented Dec 9, 2024 •

edited

Loading