- Flink: create OpenLineage configuration based on Flink configuration. #2033 @pawel-big-lebowski
  Flink configuration entries starting with `openlineage.*` are passed to the OpenLineage client.
- Spark: append output dataset name to a job name. #2036 @pawel-big-lebowski
  Solves the problem of multiple jobs writing to different datasets while sharing the same job name. The feature is enabled by default, results in distinct job names, and can be disabled by setting `spark.openlineage.jobName.appendDatasetName` to `false`. Also unifies job names generated on the Databricks platform (using a dot as the job-part separator instead of an underscore); the default behavior can be altered with `spark.openlineage.jobName.replaceDotWithUnderscore`.
- Spark: support Spark 3.4.1. #2057 @pawel-big-lebowski
  Bumps the latest Spark version covered by integration tests.
- Spark: filter `CreateView` events. #1968 #1987 @pawel-big-lebowski
  Clears events generated by logical plans having `CreateView` nodes as the root.
- Spark: fix `MERGE INTO` for Delta tables identified by physical locations. #2026
  Delta tables identified by physical locations were not properly recognized.
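The two job-name options above interact roughly as follows. This is a simplified illustration, not the integration's actual code; the option names are real, but the helper function below is hypothetical:

```python
def spark_job_name(app_name: str, dataset: str,
                   append_dataset_name: bool = True,
                   replace_dot_with_underscore: bool = False) -> str:
    """Simplified sketch of how `spark.openlineage.jobName.appendDatasetName`
    and `spark.openlineage.jobName.replaceDotWithUnderscore` shape job names."""
    # Appending the output dataset keeps two jobs with the same appName distinct.
    name = f"{app_name}.{dataset}" if append_dataset_name else app_name
    # A dot separator is now the default on Databricks; the old underscore
    # style can be restored via replaceDotWithUnderscore.
    return name.replace(".", "_") if replace_dot_with_underscore else name

print(spark_job_name("my_app", "warehouse_output_table"))
# my_app.warehouse_output_table
```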
1.0.0 - 2023-08-01
- Airflow: convert lineage from legacy `File` definition. #2006 @mobuchowski
  Adds coverage for the `File` entity definition to enhance backwards compatibility.
- Spec: remove facet ref from core. #1997 @JDarDagran
  Removes references to facets from the core spec that broke compatibility with the JSON Schema specification.
- Airflow: change log level to `DEBUG` when an extractor isn't found. #2012 @kaxil
  Changes the log level from `WARNING` to `DEBUG` when an extractor is not available.
- Airflow: make sure we cannot fail in thread despite direct execution. #2010 @mobuchowski
  Ensures the listener does not fail tasks, even in unlikely scenarios.
- Airflow: stop using reusable session by default, do not send full event on Snowflake complete. #2025 @mobuchowski
  Fixes the issue of the Snowflake connector clashing with `HttpTransport` by disabling automatic `requests` session reuse and not running `SnowflakeExtractor` again on job completion.
- Client: fix error message to avoid confusion. #2001 @mars-lan
  Fixes the error message in `HttpTransport` in the case of a null URL.
0.30.1 - 2023-07-25
- Flink: support Iceberg sinks. #1960 @pawel-big-lebowski
  Detects output datasets when using an Iceberg table as a sink.
- Spark: column-level lineage for `merge into` on Delta tables. #1958 @pawel-big-lebowski
  Makes column-level lineage support `merge into` on Delta tables. Also refactors column-level lineage to deal with multiple Spark versions.
- Spark: column-level lineage for `merge into` on Iceberg tables. #1971 @pawel-big-lebowski
  Makes column-level lineage support `merge into` on Iceberg tables.
- Spark: add support for Iceberg REST catalog. #1963 @juancappi
  Adds `rest` to the existing options of `hive` and `hadoop` in `IcebergHandler.getDatasetIdentifier()` to support Iceberg's `RestCatalog`.
- Airflow: add possibility to force direct execution based on environment variable. #1934 @mobuchowski
  Adds the option to use the direct-execution method in the Airflow listener when the existence of a non-SQLAlchemy-based Airflow event mechanism is confirmed. This happens when using Airflow 2.6 or when the `OPENLINEAGE_AIRFLOW_ENABLE_DIRECT_EXECUTION` environment variable exists.
- SQL: add support for Apple Silicon to `openlineage-sql-java`. #1981 @davidjgoss
  Expands the OS/architecture checks when compiling to produce a specific file for Apple Silicon, and expands the corresponding checks when loading the binary at runtime from Java code.
- Spec: add facet deletion. #1975 @julienledem
  In order to provide a mechanism for deleting job and dataset facets, adds a `{ _deleted: true }` object that can take the place of any job or dataset facet (but not run or input/output facets, which are valid only for a specific run).
- Client: add a file transport. #1891 @Alexkuva
  Creates a `FileTransport` and its configuration classes supporting append mode or write-new-file mode, which is especially useful when an object store does not support append mode, e.g. in the case of Databricks DBFS FUSE.
- Airflow: do not run plugin if OpenLineage provider is installed. #1999 @JDarDagran
  Sets `OPENLINEAGE_DISABLED` to `true` if the provider is installed.
- Python: rename `config` to `config_class`. #1998 @mobuchowski
  Renames the `config` class variable to `config_class` to avoid a potential conflict with the config instance.
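The `{ _deleted: true }` facet-deletion mechanism added in this release implies merge semantics on the consumer side, which can be sketched as follows (an illustration of the described behavior, not reference code):

```python
def merge_facets(current: dict, update: dict) -> dict:
    """Apply a facet update to the currently known job/dataset facets.
    A facet replaced by {"_deleted": True} is removed; others are upserted."""
    merged = dict(current)
    for name, facet in update.items():
        if isinstance(facet, dict) and facet.get("_deleted") is True:
            merged.pop(name, None)  # deletion marker removes the facet
        else:
            merged[name] = facet
    return merged
```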
- Airflow: add workaround for airflow-sqlalchemy event mechanism bug. #1959 @mobuchowski
  Due to known issues with the fork and thread model in the Airflow-SQLAlchemy-based event-delivery mechanism, a Kafka producer left alone does not emit a `COMPLETE` event. This creates a producer for each event when we detect that we're under Airflow 2.3 - 2.5.
- Spark: fix custom environment variables facet. #1973 @pawel-big-lebowski
  Enables sending the Spark environment variables facet in a non-deterministic way.
- Spark: filter unwanted Delta events. #1968 @pawel-big-lebowski
  Clears events generated by logical plans having a `Project` node as the root.
- Python: allow modification of `openlineage.*` logging levels via environment variables. #1974 @JDarDagran
  Adds `OPENLINEAGE_{CLIENT/AIRFLOW/DBT}_LOGGING` environment variables that can be set according to module logging levels and cleans up some logging calls in `openlineage-airflow`.
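The new logging variables can be wired up roughly like this. The variable names come from the entry above; the mapping code itself is an illustrative sketch, not the library's implementation:

```python
import logging
import os

def configure_openlineage_logging() -> None:
    """Map OPENLINEAGE_{CLIENT,AIRFLOW,DBT}_LOGGING environment variables
    onto the corresponding module loggers, defaulting to WARNING."""
    for env_var, logger_name in (
        ("OPENLINEAGE_CLIENT_LOGGING", "openlineage.client"),
        ("OPENLINEAGE_AIRFLOW_LOGGING", "openlineage.airflow"),
        ("OPENLINEAGE_DBT_LOGGING", "openlineage.dbt"),
    ):
        # logging accepts level names ("DEBUG", "INFO", ...) since Python 3.2
        level = os.environ.get(env_var, "WARNING").upper()
        logging.getLogger(logger_name).setLevel(level)

os.environ["OPENLINEAGE_CLIENT_LOGGING"] = "DEBUG"
configure_openlineage_logging()
```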
0.29.2 - 2023-06-30
- Flink: support Flink version 1.17.1. #1947 @pawel-big-lebowski
  Adds support for Flink versions 1.15.4, 1.16.2 and 1.17.1.
- Spark: support Spark 3.4. #1790 @pawel-big-lebowski
  Introduces support for the latest Spark version, 3.4.0, along with 3.2.4 and 3.3.2.
- Spark: add Databricks platform integration test. #1928 @pawel-big-lebowski
  Adds a Spark integration test to verify behavior on Databricks, to be run manually in CircleCI when needed.
- Spec: add static lineage event types. #1880 @pawel-big-lebowski
  As a first step in implementing static lineage, adds new `DatasetEvent` and `JobEvent` types to the spec, along with support for the new types in the Python client.
- Proxy: remove unused Golang client approach. #1926 @mobuchowski
  Removes the unused Golang proxy, rendered redundant by the fluentd proxy.
- Req: bump minimum supported Python version to 3.8. #1950 @mobuchowski
  Python 3.7 is at EOL. This bumps the minimum supported version to 3.8 to keep the project aligned with the Python EOL schedule.
- Flink: fix `KafkaSource` with `GenericRecord`. #1944 @pawel-big-lebowski
  Extracts the dataset schema from `KafkaSource` when `GenericRecord` deserialization is used.
- dbt: fix security vulnerabilities. #1945 @JDarDagran
  Fixes vulnerabilities in the dbt integration and integration tests.
0.28.0 - 2023-06-12
- dbt: add Databricks compatibility. #1829 @Ines70
  Enables launching OpenLineage with a Databricks profile.
- Fix type-checked marker and packaging. #1913 @gaborbernat
  The client was not marking itself as type-annotated.
- Python client: add `schemaURL` to run event. #1917 @gaborbernat
  Adds the missing `schemaURL` to the client's `RunState` class.
0.27.2 - 2023-06-06
- Python client: deprecate `client.from_environment`, do not skip loading config. #1908 @mobuchowski
  Deprecates the `OpenLineage.from_environment` method and recommends using the constructor instead.
0.27.1 - 2023-06-05
- Python client: add emission filtering mechanism and exact, regex filters. #1878 @mobuchowski
  Adds configurable job-name filtering to the Python client. Filters can be exact-match- or regex-based. Events will not be sent in the case of matches.
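The filtering mechanism above can be pictured as follows. This is a simplified sketch of exact and regex job-name filters; the real client reads these from its configuration, and the function below is hypothetical:

```python
import re

def should_emit(job_name: str, exact_filters=(), regex_filters=()) -> bool:
    """Return False (drop the event) when the job name matches any filter."""
    if job_name in exact_filters:
        return False
    return not any(re.search(p, job_name) for p in regex_filters)

assert should_emit("etl.daily", exact_filters=["scratch_job"])
assert not should_emit("scratch_job", exact_filters=["scratch_job"])
assert not should_emit("tmp_table_123", regex_filters=[r"^tmp_"])
```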
- Spark: fix column lineage for aggregate queries on Databricks. #1867 @pawel-big-lebowski
  Aggregate queries on Databricks did not return column lineage.
- Airflow: fix unquoted `[` and `]` in Snowflake URIs. #1883 @JDarDagran
  Snowflake connections containing one of `[` or `]` were causing `urllib.parse.urlparse` to fail.
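The failure mode is easy to reproduce with the standard library, and percent-encoding the offending characters before parsing is one way around it. This sketches the problem, not the integration's exact fix; the credentials below are made up:

```python
from urllib.parse import quote, urlparse

raw = "snowflake://user:pa[ss@my_account/db"
try:
    urlparse(raw)  # an unbalanced '[' in the netloc raises ValueError
except ValueError as e:
    print("urlparse failed:", e)

# Percent-encoding the bracket-containing password avoids the failure:
safe = "snowflake://user:%s@my_account/db" % quote("pa[ss", safe="")
print(urlparse(safe).hostname)
# my_account
```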
0.26.0 - 2023-05-18
- Proxy: Fluentd proxy support (experimental). #1757 @pawel-big-lebowski
  Adds a Fluentd data collector as a proxy to buffer OpenLineage events and send them to multiple backends (among many other purposes). Also implements a Fluentd OpenLineage parser to validate incoming HTTP events at the beginning of the pipeline. See the readme file for more details.
- Python client: use Hatchling over setuptools to orchestrate Python env setup. #1856 @gaborbernat
  Replaces setuptools with Hatchling for building the backend. Also includes a number of fixes, including to type definitions in `transport` and elsewhere.
- Spark: support single file datasets. #1855 @pawel-big-lebowski
  Fixes the naming of single-file datasets (e.g. `spark.read.csv('file.csv')`) so they are no longer named using the parent directory's path.
- Spark: fix `logicalPlan` serialization issue on Databricks. #1858 @pawel-big-lebowski
  Disables the `spark_unknown` facet by default to turn off serialization of `logicalPlan`.
0.25.0 - 2023-05-15
- Spark: add Spark/Delta `merge into` support. #1823 @pawel-big-lebowski
  Adds support for `merge into` queries.
- Spark: fix JDBC query handling. #1808 @nataliezeller1
  Makes query handling more tolerant of variations in syntax and formatting.
- Spark: filter Delta adaptive plan events. #1830 @pawel-big-lebowski
  Extends the `DeltaEventFilter` class to filter events in cases where rewritten queries in adaptive Spark plans generate extra events.
- Spark: fix Java class cast exception. #1844 @Anirudh181001
  Fixes the error caused by `OpenLineageRunEventBuilder` when it cast the Spark scheduler's `ShuffleMapStage` to boolean.
- Flink: include missing fields of OpenLineage events. #1840 @pawel-big-lebowski
  Enriches Flink events so that missing `eventTime`, `runId` and `job` elements no longer produce errors.
0.24.0 - 2023-05-02
- Support custom transport types. #1795 @nataliezeller1
  Adds a new interface, `TransportBuilder`, for creating custom transport types without having to modify core components of OpenLineage.
- Airflow: dbt Cloud integration. #1418 @howardyoo
  Adds a new OpenLineage extractor for dbt Cloud that uses the dbt Cloud hook provided by Airflow to communicate with dbt Cloud via its API.
- Spark: support dataset name modification using regex. #1796 @pawel-big-lebowski
  It is a common scenario to write Spark output datasets with a location path ending in `/year=2023/month=04`. The Spark parameter `spark.openlineage.dataset.removePath.pattern` introduced here allows removing certain elements from a path with a regex pattern.
- Spark: filter adaptive plan events. #1830 @pawel-big-lebowski
  When a Spark plan is optimized, it is rewritten into an adaptive plan, which led to duplicate OpenLineage events: one per normal plan and one per adaptive plan. This change filters out the latter.
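The effect of a path-removal pattern like the one described above can be illustrated with a plain `re.sub`. This shows the idea only; the pattern below is hypothetical, and the exact semantics expected of `spark.openlineage.dataset.removePath.pattern` are defined by the integration:

```python
import re

# Hypothetical pattern stripping partition subdirectories from a dataset path.
pattern = r"/year=\d{4}/month=\d{2}$"
path = "s3://bucket/warehouse/events/year=2023/month=04"
print(re.sub(pattern, "", path))
# s3://bucket/warehouse/events
```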
- Spark: catch exception when trying to obtain details of a non-existing table. #1798 @pawel-big-lebowski
  This mostly happens when getting table details on a START event while the table is not yet created.
- Spark: LogicalPlanSerializer. #1792 @pawel-big-lebowski
  Changes `LogicalPlanSerializer` to make use of non-shaded Jackson classes in order to serialize `LogicalPlan`s. Note: class names are no longer serialized.
- Flink: fix Flink CI. #1801 @pawel-big-lebowski
  Specifies an older image version that succeeds on CI in order to fix the Flink integration.
0.23.0 - 2023-04-20
- SQL: parser improvements to support `copy into`, `create stage`, `pivot`. #1742 @pawel-big-lebowski
  Adds support for additional syntax available in sqlparser-rs.
- dbt: add support for snapshots. #1787 @JDarDagran
  Adds support for this special kind of table representing type-2 Slowly Changing Dimensions.
- Spark: change custom column lineage visitors. #1788 @pawel-big-lebowski
  Makes the `CustomColumnLineageVisitor` interface public to support custom column lineage.
- Spark: fix null pointer in `JobMetricsHolder`. #1786 @pawel-big-lebowski
  Adds a null check before running `put` to fix an NPE occurring in `JobMetricsHolder`.
- SQL: fix query with table generator. #1783 @pawel-big-lebowski
  Allows `TableFactor::TableFunction` to support queries containing table functions.
- SQL: fix Rust code style bug. #1785 @pawel-big-lebowski
  Fixes a minor style issue in `visitor.rs`.
- Airflow: remove explicit `pass` from several `extract_on_complete` methods. #1771 @JDarDagran
  Removes the code from three extractors.
0.22.0 - 2023-04-03
- Spark: properties facet. #1717 @tnazarew
  Adds a new facet to capture specified Spark properties.
- SQL: SQLParser supports `alter`, `truncate` and `drop` statements. #1695 @pawel-big-lebowski
  Adds support for the statements to the parser.
- Common/SQL: provide public interface for openlineage_sql package. #1727 @JDarDagran
  Provides a `.pyi` public interface file for providing typing hints.
- Java client: add configurable headers to HTTP transport. #1718 @tnazarew
  Adds custom header handling to `HttpTransport` and the Spark integration.
- Python client: create client from dictionary. #1745 @JDarDagran
  Adds a new `from_dict` method to the Python client to support creating it from a dictionary.
- Spark: remove URL parameters for JDBC namespaces. #1708 @tnazarew
  Makes the namespace value from an event conform to the naming convention specified in Naming.md.
- Airflow: make `OPENLINEAGE_DISABLED` case-insensitive. #1705 @jedcunningham
  Makes the environment variable for disabling OpenLineage in the Python client and Airflow integration case-insensitive.
- Spark: fix missing BigQuery class in column lineage. #1698 @pawel-big-lebowski
  The Spark integration now checks if the BigQuery classes are available on the classpath before attempting to use them.
- dbt: throw `UnsupportedDbtCommand` when finding an unsupported entry in `args.which`. #1724 @JDarDagran
  Adjusts the `dbt-ol` script to detect dbt commands in `run_results.json` only.
- Spark: remove unnecessary warnings for column lineage. #1700 @pawel-big-lebowski
  Removes the warnings about `OneRowRelation` and `LocalRelation` nodes.
- Spark: remove deprecated configs. #1711 @tnazarew
  Removes support for deprecated configs.
0.21.1 - 2023-03-02
- Clients: add `DEBUG` logging of events to transports. #1633 @mobuchowski
  Ensures that the `DEBUG` log level on properly configured loggers will always log events, regardless of the chosen transport.
- Spark: add `CustomEnvironmentFacetBuilder` class. #1545 New contributor @Anirudh181001
  Enables the capture of custom environment variables from Spark.
- Spark: introduce the new output visitors `AlterTableAddPartitionCommandVisitor` and `AlterTableSetLocationCommandVisitor`. #1629 New contributor @nataliezeller1
  Adds visitors for extracting table names from the Spark commands `AlterTableAddPartitionCommand` and `AlterTableSetLocationCommand`. The intended use case is a custom transport for the OpenMetadata lineage API.
- Spark: add column lineage for JDBC relations. #1636 @tnazarew
  Adds column lineage information to JDBC events with data extracted from the query by the SQL parser.
- SQL: add linux-aarch64 native library to Java SQL parser. #1664 @mobuchowski
  Adds a Linux-ARM version of the native library. The Java SQL parser interface previously had only Linux-x64 and macOS universal binary variants.
- Airflow: get table database in Athena extractor. #1631 New contributor @rinzool
  Changes the extractor to get a table's database from the `table.schema` field or the operator default if the field is `None`.
- dbt: add dbt `seed` to the list of dbt-ol events. #1649 New contributor @pohek321
  Ensures that `dbt-ol test` no longer fails when run against an event seed.
- Spark: make column lineage extraction in Spark support caching. #1634 @pawel-big-lebowski
  Collects column lineage from Spark logical plans that contain cached datasets.
- Spark: add support for a deprecated config. #1586 @tnazarew
  Maps the deprecated `spark.openlineage.url` to `spark.openlineage.transport.url`.
- Spark: add error message in case of null in url. #1590 @tnazarew
  Improves error logging in the case of undefined URLs.
- Spark: collect complete event for really quick Spark jobs. #1650 @pawel-big-lebowski
  Improves the collection of OpenLineage events on SQL complete in the case of quick operations.
- Spark: fix input/outputs for one-node `LogicalRelation` plans. #1668 @pawel-big-lebowski
  For simple queries like `select col1, col2 from my_db.my_table` that do not write output, the Spark plan contained just a single node, which was wrongly treated as both an input and an output dataset.
- SQL: fix file existence check in build script for openlineage-sql-java. #1613 @sekikn
  Ensures that the build script works if the library is compiled solely for Linux.
- Airflow: remove `JobIdMapping` and update macros to better support Airflow version 2+. #1645 @JDarDagran
  Updates macros to use `OpenLineageAdapter`'s method to generate deterministic run UUIDs, because using the `JobIdMapping` utility is incompatible with Airflow 2+.
0.20.6 - 2023-02-10
- Airflow: add new extractor for `FTPFileTransmitOperator`. #1603 @sekikn
  Adds a new extractor for this Airflow operator serving legacy systems.
- Airflow: make extractors for async operators work. #1601 @JDarDagran
  Sends a deterministic run UUID for Airflow runs.
- dbt: render actual profile only in profiles.yml. #1599 @mobuchowski
  Adds an `include_section` argument for the Jinja render method to include only one profile if needed.
- dbt: make `compiled_code` optional. #1595 @JDarDagran
  Makes `compiled_code` optional for manifest > v7.
0.20.4 - 2023-02-07
- Airflow: add new extractor for `GCSToGCSOperator`. #1495 @sekikn
  Adds a new extractor for this operator.
- Flink: resolve topic names from regex, support 1.16.0. #1522 @pawel-big-lebowski
  Adds support for Flink 1.16.0 and makes the integration resolve topic names from Kafka topic patterns.
- Proxy: implement lineage event validator for client proxy. #1469 @fm100
  Implements logic in the proxy (which is still in development) for validating and handling lineage events.
- CI: use `ruff` instead of flake8, isort, etc., for linting and formatting. #1526 @mobuchowski
  Adopts the `ruff` package, which combines several linters and formatters into one fast binary.
- Airflow: make the Trino catalog non-mandatory. #1572 @JDarDagran
  Makes the Trino catalog optional in the Trino extractor.
- Common: add explicit SQL dependency. #1532 @mobuchowski
  Addresses a 0.19.2 breaking change to the GE integration by including the SQL dependency explicitly.
- dbt: adjust `tqdm` logging in `dbt-ol`. #1549 @JDarDagran
  Adjusts `tqdm` to show the correct number of iterations and adds START events for parent runs.
- dbt: fix typo in log output. #1493 @denimalpaca
  Fixes an 'emittled' typo in log output.
- Great Expectations/Airflow: follow Snowflake dataset naming rules. #1527 @mobuchowski
  Normalizes Snowflake dataset and datasource naming rules among dbt/Airflow/GE; canonizes old Snowflake account paths by making them all full-size with account, region and cloud names.
- Java and Python clients: Kafka does not initialize properties if they are empty; check and notify about confluent-kafka requirement. #1556 @mobuchowski
  Fixes the failure to initialize `KafkaTransport` in the Java client and adds an exception if the required `confluent-kafka` module is missing from the Python client.
- Spark: add square brackets for list-based Spark configs. #1507 @Varunvaruns9
  Adds a condition to treat configs with `[]` as lists. Note: `[]` will be required for list-based configs starting with 0.21.0.
- Spark: fix several Spark/BigQuery-related issues. #1557 @mobuchowski
  Fixes the assumption that a version is always a number; adds support for `HadoopMapReduceWriteConfigUtil`; makes the integration access `BigQueryUtil` and `getTableId` using reflection, which supports all BigQuery versions; makes logs provide the full serialized LogicalPlan on `debug`.
- SQL: only report partial failures. #1479 @mobuchowski
  Changes the parser so it reports partial failures instead of failing the whole extraction.
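Reporting partial failures instead of aborting, as in the entry above, follows a generic pattern that can be sketched as follows (this is not the parser's Rust internals, just the shape of the approach):

```python
def parse_all(statements, parse_one):
    """Parse each statement independently, collecting errors instead of
    failing the whole extraction on the first bad statement."""
    results, errors = [], []
    for i, sql in enumerate(statements):
        try:
            results.append(parse_one(sql))
        except ValueError as e:
            errors.append({"index": i, "statement": sql, "error": str(e)})
    return results, errors

def toy_parser(sql):
    # Stand-in parser that only understands SELECT and INSERT.
    if not sql.lower().startswith(("select", "insert")):
        raise ValueError("unsupported statement")
    return {"sql": sql}

ok, failed = parse_all(
    ["SELECT 1", "FLUSH TABLES", "INSERT INTO t VALUES (1)"], toy_parser
)
print(len(ok), len(failed))
# 2 1
```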
0.19.2 - 2023-01-04
- Airflow: add Trino extractor. #1288 @sekikn
  Adds a Trino extractor to the Airflow integration.
- Airflow: add `S3FileTransformOperator` extractor. #1450 @sekikn
  Adds an `S3FileTransformOperator` extractor to the Airflow integration.
- Airflow: add standardized run facet. #1413 @JDarDagran
  Creates one standardized run facet for the Airflow integration.
- Airflow: add `NominalTimeRunFacet` and `OwnershipJobFacet`. #1410 @JDarDagran
  Adds `nominalEndTime` and `OwnershipJobFacet` fields to the Airflow integration.
- dbt: add support for postgres datasources. #1417 @julienledem
  Adds the previously unsupported postgres datasource type.
- Proxy: add client-side proxy (skeletal version). #1439 #1420 @fm100
  Implements a skeletal version of a client-side proxy.
- Proxy: add CI job to publish Docker image. #1086 @wslulciuc
  Includes a script to build and tag the image plus jobs to verify the build on every CI run and publish to Docker Hub.
- SQL: add `ExtractionErrorRunFacet`. #1442 @mobuchowski
  Adds a facet to the spec to reflect internal processing errors, especially failed or incomplete parsing of SQL jobs.
- SQL: add column-level lineage to SQL parser. #1432 #1461 @mobuchowski @StarostaGit
  Adds support for extracting column-level lineage from SQL statements in the parser, including adjustments to the Rust-Python and Rust-Java interfaces and the Airflow integration's SQL extractor to make use of the feature. Also includes more tests, removal of the old parser, and removal of the common-build cache in CI (which was breaking the parser).
- Spark: pass config parameters to the OL client. #1383 @tnazarew
  Adds a mechanism for making new lineage consumers transparent to the integration, easing the process of setting up new types of consumers.
- Airflow: fix `collect_ignore`, add flags to Pytest for cleaner output. #1437 @JDarDagran
  Removes the `extractors` directory from the ignored list, improving unit testing.
- Spark & Java client: fix README typos. @versaurabh
  Fixes typos in the SPDX license headers.
0.18.0 - 2022-12-08
- Airflow: support `SQLExecuteQueryOperator`. #1379 @JDarDagran
  Changes the `SQLExtractor` and adds support for the dynamic assignment of extractors based on `conn_type`.
- Airflow: introduce a new extractor for `SFTPOperator`. #1263 @sekikn
  Adds an extractor for tracing file transfers between local file systems.
- Airflow: add Sagemaker extractors. #1136 @fhoda
  Creates extractors for `SagemakerProcessingOperator` and `SagemakerTransformOperator`.
- Airflow: add S3 extractor for Airflow operators. #1166 @fhoda
  Creates an extractor for `S3CopyObject` in the Airflow integration.
- Airflow: implement DagRun listener. #1286 @mobuchowski
  The OpenLineage integration will now explicitly emit DagRun start and DagRun complete or DagRun failed events, which allows precise tracking of single DAGs.
- Spec: add spec file for `ExternalQueryRunFacet`. #1262 @howardyoo
  Adds a spec file to make this facet available for the Java client. Includes a README.
- Docs: add a TSC doc. #1303 @merobi-hub
  Adds a document listing the members of the Technical Steering Committee.
- Spark: enable usage of other transports via Spark configuration. #1383 @tnazarew
  Moves OL client argument parsing from the Spark integration to the Java client.
- Spark: improve Databricks to send better events. #1330 @pawel-big-lebowski
  Filters unwanted events and provides a meaningful job name.
- Spark-BigQuery: fix a few of the common errors. #1377 @mobuchowski
  Fixes a few of the common issues with the Spark-BigQuery integration, adds an integration test and configures CI.
- Python: validate `eventTime` field in Python client. #1355 @pawel-big-lebowski
  Validates the `eventTime` of a `RunEvent` within the client library.
- Databricks: handle Databricks Runtime 11.3 changes to `DbFsUtils` constructor. #1351 @wjohnson
  Recaptures lost mount point information from the `DatabricksEnvironmentFacetBuilder` and environment-properties facet by looking at the number of parameters in the `DbFsUtils` constructor to determine the runtime version.
0.17.0 - 2022-11-16
- Spark: support latest Spark 3.3.1. #1183 @pawel-big-lebowski
  Adds support for the latest Spark version, 3.3.1.
- Spark: add Kinesis transport and support configuring Kinesis in the Spark integration. #1200 @yogayang
  Adds support for sending events to Kinesis from the Spark integration.
- Spark: disable specified facets. #1271 @pawel-big-lebowski
  Adds the ability to disable specified facets from generated OpenLineage events.
- Python: add facets implementation to Python client. #1233 @pawel-big-lebowski
  Adds missing facets to the Python client.
- SQL: add Rust parser interface. #1172 @StarostaGit @mobuchowski
  Implements a Java interface in the Rust SQL parser, including a build script, native library loading mechanism, CI support and build fixes.
- Proxy: add helm chart for the proxy backend. #1068 @wslulciuc
  Adds a helm chart for deploying the proxy backend on Kubernetes.
- Spec: include possible facets usage in spec. #1249 @pawel-big-lebowski
  Extends the `facets` definition with a list of available facets.
- Website: publish YML version of spec to website. #1300 @rossturk
  Adds the configuration necessary to make the OpenLineage website auto-generate OpenAPI docs when the spec is published there.
- Docs: update language on nominating new committers. #1270 @rossturk
  Updates the governance language to reflect the new policy on nominating committers.
- Website: publish spec into new website repo location. #1295 @rossturk
  Creates a new deploy key, adds it to CircleCI & GitHub, and makes the necessary changes to the `release.sh` script.
- Airflow: change how pip installs packages in tox environments. #1302 @JDarDagran
  Uses the deprecated resolver and constraints files provided by Airflow to avoid potential issues caused by pip's new resolver.
- Airflow: fix README for running integration test. #1238 @sekikn
  Updates the README for consistency with supported Airflow versions.
- Airflow: add `task_instance` argument to `get_openlineage_facets_on_complete`. #1269 @JDarDagran
  Adds the `task_instance` argument to `DefaultExtractor`.
- Java client: fix up all artifactory paths. #1290 @harels
  Not all artifactory paths were changed in the build CI script in a previous PR.
- Python client: fix Mypy errors and adjust to PEP 484. #1264 @JDarDagran
  Adds a `--no-namespace-packages` argument to the Mypy command and adjusts code to PEP 484.
- Website: release all specs since `last_spec_commit_id`, not just HEAD~1. #1298 @rossturk
  The script now ships all specs that have changed since `.last_spec_commit_id`.
- Deprecate `HttpTransport.Builder` in favor of `HttpConfig`. #1287 @collado-mike
  Deprecates the Builder in favor of `HttpConfig` only and replaces the existing Builder implementation by delegating to `HttpConfig`.
0.16.1 - 2022-11-03
- Airflow: add `dag_run` information to Airflow version run facet. #1133 @fm100
  Adds the Airflow DAG run ID to the `taskInfo` facet, making this additional information available to the integration.
- Airflow: add `LoggingMixin` to extractors. #1149 @JDarDagran
  Adds a `LoggingMixin` class to the custom extractor to make the output consistent with general Airflow and OpenLineage logging settings.
- Airflow: add default extractor. #1162 @mobuchowski
  Adds a `DefaultExtractor` to support the default implementation of OpenLineage for external operators without the need for custom extractors.
- Airflow: add `on_complete` argument in `DefaultExtractor`. #1188 @JDarDagran
  Adds support for running another method on `extract_on_complete`.
- SQL: reorganize the library into multiple packages. #1167 @StarostaGit @mobuchowski
  Splits the SQL library into a Rust implementation and foreign language bindings, easing the process of adding language interfaces. Also contains a CI fix.
- Airflow: move `get_connection_uri` to be an extractor's classmethod. #1169 @JDarDagran
  The `get_connection_uri` method allowed too many params, resulting in unnecessarily long URIs. This changes the logic to whitelisting per extractor.
- Airflow: change `get_openlineage_facets_on_start/complete` behavior. #1201 @JDarDagran
  Splits up the method for greater legibility and easier maintenance.
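The per-extractor whitelisting of `get_connection_uri` query parameters can be sketched with the standard library. This is illustrative only; the parameter names below are made up, and the real method lives on the extractor classes:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def filter_connection_uri(uri: str, allowed: set) -> str:
    """Keep only whitelisted query params, dropping everything else
    (extras, secrets) from the connection URI."""
    parts = urlparse(uri)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in allowed]
    return urlunparse(parts._replace(query=urlencode(kept)))

uri = "postgres://host:5432/db?application_name=af&password=secret"
print(filter_connection_uri(uri, {"application_name"}))
# postgres://host:5432/db?application_name=af
```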
- Airflow: always send SQL in `SqlJobFacet` as a string. #1143 @mobuchowski
  Changes the data type of `query` from array to string to fix an error in the `RedshiftSQLOperator`.
- Airflow: include `__extra__` case when filtering URI query params. #1144 @JDarDagran
  Includes `conn.EXTRA_KEY` in the `get_connection_uri` method to avoid exposing secrets in URIs via the `__extra__` key.
- Airflow: enforce column casing in `SQLCheckExtractor`s. #1159 @denimalpaca
  Uses the parent extractor's `_is_uppercase_names` property to determine if the column should be uppercased in the `SQLColumnCheckExtractor`'s `_get_input_facets()` method.
- Spark: prevent exception when no schema provided. #1180 @pawel-big-lebowski
  Prevents evaluation of column lineage when the `schemaFacet` is `null`.
- Great Expectations: add V3 API compatibility. #1194 @denimalpaca
  Fixes the Pandas datasource to make it V3 API-compatible.
- Airflow: remove support for Airflow 1.10. #1128 @mobuchowski
  Removes the code structures and tests enabling support for Airflow 1.10.
0.15.1 - 2022-10-05
- Airflow: improve development experience. #1101 @JDarDagran
  Adds an interactive development environment to the Airflow integration and improves integration testing.
- Spark: add description for URL parameters in readme, change `overwriteName` to `appName`. #1130 @tnazarew
  Adds more information about passing arguments with `spark.openlineage.url` and changes `overwriteName` to `appName` for clarity.
- Documentation: update issue templates for proposal & add new integration template. #1116 @rossturk
  Adds a YAML issue template for new integrations and fixes a bug in the proposal template.
- Airflow: lazy load BigQuery client. #1119 @mobuchowski
  Moves the import of the BigQuery client from top level to local level to decrease DAG import time.
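The lazy-loading pattern behind the BigQuery change above is small enough to show generically. The stdlib `decimal` module stands in for a heavy dependency such as the BigQuery client; this is a sketch of the pattern, not the integration's code:

```python
import sys

def get_client():
    # Deferred import: the module is loaded only on first call, keeping the
    # top-level import of this module (and thus Airflow DAG parsing) fast.
    # 'decimal' is a stand-in for a heavy dependency like the BigQuery client.
    import decimal
    return decimal.Decimal

client_cls = get_client()
```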
- Airflow: fix UUID generation conflict for Airflow DAGs with the same name. #1056 @collado-mike
  Adds a namespace to the UUID calculation to avoid conflicts caused by DAGs having the same name in different namespaces in Airflow deployments.
- Spark/BigQuery: fix issue with spark-bigquery-connector >=0.25.0. #1111 @pawel-big-lebowski
  Makes the Spark integration compatible with the latest connector.
- Spark: fix column lineage. #1069 @pawel-big-lebowski
  Fixes a null pointer exception error and an error when `openlineage.timeout` is not provided.
- Spark: set log level of `Init OpenLineageContext` to DEBUG. #1064 @varuntestaz
  Prevents sensitive information from being logged unless debug mode is used.
- Java client: update version of SnakeYAML. #1090 @TheSpeedding
  Bumps the SnakeYAML library version to include a key bug fix.
- dbt: remove requirement for `OPENLINEAGE_URL` to be set. #1107 @mobuchowski
  Removes an erroneous check for `OPENLINEAGE_URL` in the dbt integration.
- Python client: remove potentially cyclic import. #1126 @mobuchowski
  Hides imports to remove a potentially cyclic import.
- CI: build macOS release package on medium resource class. #1131 @mobuchowski
  Fixes a failing build due to the resource class being too large.
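The UUID-namespacing fix in this release follows a standard pattern: derive a deterministic UUID from both the namespace and the DAG name, so identically named DAGs in different namespaces no longer collide. A sketch using `uuid.uuid5` (the actual calculation in the integration may differ):

```python
from uuid import NAMESPACE_URL, uuid5

def run_uuid(airflow_namespace: str, dag_id: str, execution_date: str) -> str:
    # Including the namespace in the hashed name keeps DAGs with identical
    # names in different deployments from producing the same run UUID.
    return str(uuid5(NAMESPACE_URL, f"{airflow_namespace}/{dag_id}/{execution_date}"))

a = run_uuid("team-a", "daily_etl", "2022-10-01")
b = run_uuid("team-b", "daily_etl", "2022-10-01")
# a and b differ even though the DAG name and date are identical
```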
0.14.1 - 2022-09-07
- Fix Spark integration issues including error when no `openlineage.timeout`. #1069 @pawel-big-lebowski
  `OpenlineageSparkListener` was failing when no `openlineage.timeout` was provided.
0.14.0 - 2022-09-06
- Support ABFSS and Hadoop Logical Relation in column-level lineage. #1008 @wjohnson
  Introduces an `extractDatasetIdentifier` that uses logic similar to `InsertIntoHadoopFsRelationVisitor` to pull out the path on the HDFS-compliant file system; tested on ABFSS and DBFS (Databricks FileSystem) to prove that lineage can be extracted using non-SQL commands.
- Add Kusto relation visitor. #939 @hmoazam
  Implements a `KustoRelationVisitor` to support lineage for Azure Kusto's Spark connector.
- Add ColumnLevelLineage facet doc. #1020 @julienledem
  Adds documentation for the column-level lineage facet.
- Include symlinks dataset facet. #935 @pawel-big-lebowski
  Includes the recently introduced `SymlinkDatasetFacet` in generated OpenLineage events.
- Add support for dbt 1.3 beta's metadata changes. #1051 @mobuchowski
  Makes projects composed of only SQL models work on 1.3 beta (dbt 1.3 renamed the `compiled_sql` field to `compiled_code` to support Python models). Does not provide support for dbt's Python models.
- Support Flink 1.15. #1009 @mzareba382
  Adds support for Flink 1.15.
- Add Redshift dialect to the SQL integration. #1066 @mobuchowski
  Adds support for Redshift's SQL dialect in OpenLineage's SQL parser, including quirks such as the use of square brackets in JSON paths. (Note: this does not add support for all of Redshift's custom syntax.)
- Make the timeout configurable in the Spark integration
#1050
@tnazarew
Makes timeout configurable by the user. (In some cases, the time needed to send events was longer than 5 seconds, which exceeded the timeout value.)
- Add a dialect parameter to Great Expectations SQL parser calls
#1049
@collado-mike
Specifies the dialect name from the SQL engine. - Fix Delta 2.1.0 with Spark 3.3.0
#1065
@pawel-big-lebowski
Allows delta support for Spark 3.3 and fixes potential issues. (The Openlineage integration for Spark 3.3 was turned on without delta support, as delta did not support Spark 3.3 at that time.)
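For the symlinks facet included in events above, a dataset payload carries the facet roughly as follows. This is a hedged illustration: the `identifiers` list with `namespace`/`name`/`type` entries follows the published facet schema, while every concrete value here is a placeholder:

```python
# Illustrative dataset payload carrying the SymlinksDatasetFacet; only the
# field layout is meaningful, all values are invented placeholders.
dataset = {
    "namespace": "hive://warehouse",
    "name": "db.table",
    "facets": {
        "symlinks": {
            "identifiers": [
                # an alternative name for the same physical dataset
                {"namespace": "s3://bucket", "name": "path/to/table", "type": "LOCATION"}
            ]
        }
    },
}
```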
## 0.13.1 - 2022-08-25
- Rename all `parentRun` occurrences to `parent` in Airflow integration #1037 @fm100
  Changes the `parentRun` property name to `parent` in the Airflow integration to match the spec.
- Do not change the task instance during the `on_running` event #1028 @JDarDagran
  Fixes an issue in the Airflow integration with the `on_running` hook, which was changing the `TaskInstance` object along with the `task` attribute.
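The `parentRun` → `parent` rename is a simple key change in the run-facets dictionary. A hypothetical migration helper (not part of the integration) that a consumer holding pre-0.13.1 payloads might use:

```python
# Hypothetical helper (not shipped with OpenLineage): move a pre-0.13.1
# "parentRun" run-facet key to the spec-compliant "parent" key.
def migrate_parent_facet(run_facets: dict) -> dict:
    facets = dict(run_facets)  # shallow copy so the input is left untouched
    if "parentRun" in facets and "parent" not in facets:
        facets["parent"] = facets.pop("parentRun")
    return facets
```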
## 0.13.0 - 2022-08-22
- Add BigQuery check support #960 @denimalpaca
  Adds logic and support for proper dynamic class inheritance for BigQuery-style operators. (BigQuery's extractor needed additional logic to support the forthcoming `BigQueryColumnCheckOperator` and `BigQueryTableCheckOperator`.)
- Add `RUNNING` `EventType` in spec and Python client #972 @mzareba382
  Introduces a `RUNNING` event state in the OpenLineage spec to indicate a running task and adds a `RUNNING` event type in the Python API.
- Use databases & schemas in SQL extractors #974 @JDarDagran
  Allows the Airflow integration to differentiate between databases and schemas. (There was no notion of databases and schemas when querying and parsing results from `information_schema` tables.)
- Implement event forwarding feature via HTTP protocol #995 @howardyoo
  Adds `HttpLineageStream` to forward a given OpenLineage event to any HTTP endpoint.
- Introduce `SymlinksDatasetFacet` to spec #936 @pawel-big-lebowski
  Creates a new facet, the `SymlinksDatasetFacet`, to support storing alternative dataset names.
- Add Azure Cosmos Handler to the Spark integration #983 @hmoazam
  Defines a new interface, the `RelationHandler`, to support Spark data sources that do not have `TableCatalog`, `Identifier`, or `TableProperties` set, as is the case with the Azure Cosmos DB Spark connector.
- Support OL Datasets in manual lineage inputs/outputs #1015 @conorbev
  Allows Airflow users to create OpenLineage Dataset classes directly in DAGs with no conversion necessary. (Previously, manual lineage definition required users to create an `airflow.lineage.entities.Table`, which was then converted to an OpenLineage Dataset.)
- Create ownership facets #996 @julienledem
  Adds an ownership facet to both Dataset and Job in the OpenLineage spec to capture ownership of jobs and datasets.
- Use `RUNNING` `EventType` in the Flink integration for currently running jobs #985 @mzareba382
  Makes use of the new `RUNNING` event type in the Flink integration, changing events sent by Flink jobs from `OTHER` to this new type.
- Convert task objects to JSON-encodable objects when creating custom Airflow version facets #1018 @fm100
  Implements a `to_json_encodable` function in the Airflow integration to make task objects JSON-encodable.
- Add support for custom SQL queries in the v3 Great Expectations API #1025 @collado-mike
  Fixes support for custom SQL statements in the Great Expectations provider. (The Great Expectations custom SQL datasource was not applied to the support for the v3 checkpoints API.)
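The new `RUNNING` event type slots into the usual run event structure. A minimal sketch of such an event (the `eventType`/`eventTime`/`run`/`job` layout follows the OpenLineage run event model; job names and namespaces are placeholders):

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal RUNNING run event; only the shape matters, values are placeholders.
event = {
    "eventType": "RUNNING",  # the state introduced in #972
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "example", "name": "example_job"},
    "inputs": [],
    "outputs": [],
}
payload = json.dumps(event)
```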
## 0.12.0 - 2022-08-01
- Add Spark 3.3.0 support #950 @pawel-big-lebowski
- Add Apache Flink integration #951 @mobuchowski
- Add ability to extend column-level lineage mechanism #922 @pawel-big-lebowski
- Add ErrorMessageRunFacet #897 @mobuchowski
- Add SQLCheckExtractors #717 @denimalpaca
- Add RedshiftSQLExtractor & RedshiftDataExtractor #930 @JDarDagran
- Add dataset builder for AlterTableCommand #927 @tnazarew
- Limit Delta events #905 @pawel-big-lebowski
- Airflow integration: allow lineage metadata to flow through inlets and outlets #914 @fenil25
- Limit size of serialized plan #917 @pawel-big-lebowski
- Fix noclassdef error #942 @pawel-big-lebowski
## 0.11.0 - 2022-07-07
- HTTP option to override timeout and properly close connections in `openlineage-java` lib #909 @mobuchowski
- Dynamic mapped tasks support in the Airflow integration #906 @JDarDagran
- `SqlExtractor` added to the Airflow integration #907 @JDarDagran
- PMD added to Java and Spark builds in CI #898 @merobi-hub
- When testing extractors in the Airflow integration, make the extractor length assertion dynamic #882 @denimalpaca
- Render templates at the start of integration tests for `TaskListener` in the Airflow integration #870 @mobuchowski
- Fix dependencies bundled with `openlineage-java` lib #855 @collado-mike
- Fix PMD-reported issues #891 @pawel-big-lebowski
- Fix Spark casting error and session catalog support for `iceberg` in the Spark integration #856 @wslulciuc
## 0.10.0 - 2022-06-24
- Add static code analysis tool mypy to run in CI against all Python modules (#802) @howardyoo
- Extend `SaveIntoDataSourceCommandVisitor` to extract schema from `LocalRelation` and `LogicalRdd` in the Spark integration (#794) @pawel-big-lebowski
- Add `InMemoryRelationInputDatasetBuilder` for `InMemory` datasets to the Spark integration (#818) @pawel-big-lebowski
- Add copyright to source files #755 @merobi-hub
- Add `SnowflakeOperatorAsync` extractor support to the Airflow integration #869 @merobi-hub
- Add PMD analysis to the proxy project (#889) @howardyoo
- Skip `FunctionRegistry.class` serialization in the Spark integration (#828) @mobuchowski
- Install the new `rust`-based SQL parser by default in the Airflow integration (#835) @mobuchowski
- Improve overall `pytest` and integration tests for the Airflow integration (#851, #858) @denimalpaca
- Reduce OL event payload size by excluding local data and including the output node in start events (#881) @collado-mike
- Split the Spark integration into submodules (#834, #890) @tnazarew @mobuchowski
- Conditionally import the `sqlalchemy` lib for the Great Expectations integration (#826) @pawel-big-lebowski
- Add check for missing class `org.apache.spark.sql.catalyst.plans.logical.CreateV2Table` in the Spark integration (#866) @pawel-big-lebowski
- Fix static code analysis issues (#867, #874) @pawel-big-lebowski
## 0.9.0 - 2022-06-03
- Spark: Column-level lineage introduced for Spark integration (#698, #645) @pawel-big-lebowski
- Java: Spark to use Java client directly (#774) @mobuchowski
- Clients: Add OPENLINEAGE_DISABLED environment variable which overrides config to NoopTransport (#780) @mobuchowski
- Set log to debug on unknown facet entry (#766) @wslulciuc
- Dagster: pin protobuf version to 3.20 as suggested by tests (#787) @mobuchowski
- Add SafeStrDict to skip failing attributes (#798) @JDarDagran
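The `OPENLINEAGE_DISABLED` override above amounts to an environment check that swaps the configured transport for a no-op one. A hedged sketch of that idea (the helper, the class name, and the exact truthy spellings accepted are assumptions, not the client's actual code):

```python
import os

class NoopTransport:
    """Stand-in transport that silently drops events (assumed shape)."""
    def emit(self, event) -> None:
        pass

def lineage_disabled(env=None) -> bool:
    # Assumption: the real client may accept different truthy spellings.
    env = os.environ if env is None else env
    return env.get("OPENLINEAGE_DISABLED", "").strip().lower() in ("1", "true")
```

A client following this pattern would check `lineage_disabled()` once at startup and construct `NoopTransport` instead of its configured transport.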
0.8.2 - 2022-05-19
openlineage-airflow
now supports getting credentials from Airflows secrets backend (#723) @mobuchowskiopenlineage-spark
now supports Azure Databricks Credential Passthrough (#595) @wjohnsonopenlineage-spark
detects datasets wrapped byExternalRDD
s (#746) @collado-mike
PostgresOperator
fails to retrieve host and conn during extraction (#705) @sekikn- SQL parser accepts lists of sql statements (#734) @mobuchowski
- Missing schema when writing to Delta tables in Databricks (#748) @collado-mike
## 0.8.1 - 2022-04-29
- Airflow integration uses new TaskInstance listener API for Airflow 2.3+ (#508) @mobuchowski
- Support for HiveTableRelation as input source in Spark integration (#683) @collado-mike
- Add HTTP and Kafka Client to `openlineage-java` lib (#480) @wslulciuc, @mobuchowski
- New SQL parser, used by Postgres, Snowflake, Great Expectations integrations (#644) @mobuchowski
- GreatExpectations: Fixed bug when invoking GreatExpectations using v3 API (#683) @collado-mike
## 0.7.1 - 2022-04-19
- Python implements Transport interface - HTTP and Kafka transports are available (#530) @mobuchowski
- Add UnknownOperatorAttributeRunFacet and support in lineage backend (#547) @collado-mike
- Support Spark 3.2.1 (#607) @pawel-big-lebowski
- Add StorageDatasetFacet to spec (#620) @pawel-big-lebowski
- Airflow: custom extractors lookup uses only get_operator_classnames method (#656) @mobuchowski
- README.md created at OpenLineage/integrations for compatibility matrix (#663) @howardyoo
- Dagster: handle updated PipelineRun in OpenLineage sensor unit test (#624) @dominiquetipton
- Delta improvements (#626) @collado-mike
- Fix SqlDwDatabricksVisitor for Spark2 (#630) @wjohnson
- Airflow: remove redundant logging from GE import (#657) @mobuchowski
- Fix Shebang issue in Spark's wait-for-it.sh (#658) @mobuchowski
- Update parent_run_id to be a uuid from the dag name and run_id (#664) @collado-mike
- Spark: fix time zone inconsistency in testSerializeRunEvent (#681) @sekikn
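The `parent_run_id` change above derives a deterministic UUID from the DAG name and run id. In Python that kind of derivation can be sketched with a name-based UUID (the namespace seed and the `.` separator here are assumptions for illustration, not necessarily the integration's exact choice):

```python
import uuid

# Hypothetical sketch: a stable parent run id from (dag_name, run_id).
# The same inputs always yield the same UUID, so retries and separate
# processes agree on the parent run without coordination.
def parent_run_id(dag_name: str, run_id: str) -> str:
    return str(uuid.uuid3(uuid.NAMESPACE_URL, f"{dag_name}.{run_id}"))
```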
## 0.6.2 - 2022-03-16
- CI: add integration tests for Airflow's SnowflakeOperator and dbt-snowflake @mobuchowski
- Introduce DatasetVersion facet in spec @pawel-big-lebowski
- Airflow: add external query id facet @mobuchowski
- Complete Fix of Snowflake Extractor get_hook() Bug @denimalpaca
- Update artwork @rossturk
- Airflow tasks in a DAG now report a common ParentRunFacet @collado-mike
## 0.6.1 - 2022-03-07
- Catch possible failures when emitting events and log them @mobuchowski
- dbt: jinja2 code using do extensions does not crash @mobuchowski
## 0.6.0 - 2022-03-04
- Extract source code of PythonOperator code similar to SQL facet @mobuchowski
- Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski
- Airflow: extract source code from BashOperator @mobuchowski
- Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune
- OpenLineage sensor for OpenLineage-Dagster integration @dalinkim
- Java-client: make generator generate enums as well @pawel-big-lebowski
- Added `UnknownOperatorAttributeRunFacet` to the Airflow integration to record operators that don't produce lineage @collado-mike
- Airflow: increase import timeout in tests, fix exit from integration @mobuchowski
- Reduce logging level for import errors to info @rossturk
- Remove AWS secret keys and extraneous Snowflake parameters from connection uri @collado-mike
- Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski
## 0.5.2 - 2022-02-10
- Proxy backend example using `Kafka` @wslulciuc
- Support Databricks Delta Catalog naming convention with DatabricksDeltaHandler @wjohnson
- Add javadoc as part of build task @mobuchowski
- Include TableStateChangeFacet in non V2 commands for Spark @mr-yusupov
- Support for SqlDWRelation on Databricks' Azure Synapse/SQL DW Connector @wjohnson
- Implement input visitors for v2 commands @pawel-big-lebowski
- Enabled SparkListenerJobStart events to trigger open lineage events @collado-mike
- dbt: job namespaces for given dbt run match each other @mobuchowski
- Fix Breaking SnowflakeOperator Changes from OSS Airflow @denimalpaca
- Made corrections to account for DeltaDataSource handling @collado-mike
## 0.5.1 - 2022-01-18
- Support for dbt-spark adapter @mobuchowski
- New `backend` to proxy OpenLineage events to one or more event streams 🎉 @mandy-chessell @wslulciuc
- Add Spark extensibility API with support for custom Dataset and custom facet builders @collado-mike
- airflow: fix import failures when dependencies for bigquery, dbt, great_expectations extractors are missing @lukaszlaszko
- Fixed openlineage-spark jar to correctly rename bundled dependencies @collado-mike
## 0.4.0 - 2021-12-13
- Spark output metrics @OleksandrDvornik
- Separated tests between Spark 2 & 3 @pawel-big-lebowski
- Databricks install README and init scripts @wjohnson
- Iceberg integration with unit tests @pawel-big-lebowski
- Kafka read and write support @OleksandrDvornik / @collado-mike
- Arbitrary parameters supported in HTTP URL construction @wjohnson
- Increased visitor coverage for Spark commands @mobuchowski / @pawel-big-lebowski
- dbt: column descriptions are properly filled from metadata.json @mobuchowski
- dbt: allow parsing artifacts with version higher than officially supported @mobuchowski
- dbt: dbt build command is supported @mobuchowski
- dbt: fix crash when build command is used with seeds in dbt 1.0.0rc3 @mobuchowski
- spark: increase logical plan visitor coverage @mobuchowski
- spark: fix logical serialization recursion issue @OleksandrDvornik
- Use URL#getFile to fix build on Windows @mobuchowski
## 0.3.1 - 2021-10-21
- fix import in spark3 visitor @mobuchowski
## 0.3.0 - 2021-10-21
- Spark3 support @OleksandrDvornik / @collado-mike
- LineageBackend for Airflow 2 @mobuchowski
- Adding custom spark version facet to spark integration @OleksandrDvornik
- Adding dbt version facet @mobuchowski
- Added support for Redshift profile @AlessandroLollo
- Sanitize JDBC URLs @OleksandrDvornik
- strip openlineage url in python client @OleksandrDvornik
- deploy spec if spec file changes @mobuchowski
## 0.2.3 - 2021-10-07
- Add dbt `v3` manifest support @mobuchowski
## 0.2.2 - 2021-09-08
- Implement OpenLineageValidationAction for Great Expectations @collado-mike
- facet: add expectations assertions facet @mobuchowski
- airflow: pendulum formatting fix, add tests @mobuchowski
- dbt: do not emit events if run_result file was not updated @mobuchowski
## 0.2.1 - 2021-08-27
- Default `--project-dir` argument to current directory in `dbt-ol` script @mobuchowski
## 0.2.0 - 2021-08-23
- Parse dbt command line arguments when invoking `dbt-ol` @mobuchowski. For example: `$ dbt-ol run --project-dir path/to/dir`
- Set `UnknownFacet` for Spark (captures metadata about unvisited nodes from the Spark plan not yet supported) @OleksandrDvornik
- Remove `model` from dbt job name @mobuchowski
- Default dbt job namespace to output dataset namespace @mobuchowski
- Rename `openlineage.spark.*` to `io.openlineage.spark.*` @OleksandrDvornik
- Remove instance references to extractors from DAG and avoid copying log property for serializability @collado-mike
## 0.1.0 - 2021-08-12
OpenLineage is an Open Standard for lineage metadata collection designed to record metadata for a job in execution. The initial public release includes:
- An initial specification. The initial version `1-0-0` of the OpenLineage specification defines the core model and facets.
- Integrations that collect lineage metadata as OpenLineage events:
  - `Apache Airflow` with support for BigQuery, Great Expectations, Postgres, Redshift, Snowflake
  - `Apache Spark`
  - `dbt`
- Clients that send OpenLineage events to an HTTP backend. Both `java` and `python` are initially supported.