From 545021b502551c4855c2c91ef75263f85500c2ea Mon Sep 17 00:00:00 2001 From: Richard Chien Date: Tue, 10 Dec 2024 14:06:28 +0800 Subject: [PATCH 1/4] complete source format table Signed-off-by: Richard Chien --- ingestion/supported-sources-and-formats.mdx | 50 +++++++++++++++------ 1 file changed, 36 insertions(+), 14 deletions(-) diff --git a/ingestion/supported-sources-and-formats.mdx b/ingestion/supported-sources-and-formats.mdx index 7079f957..5d563c72 100644 --- a/ingestion/supported-sources-and-formats.mdx +++ b/ingestion/supported-sources-and-formats.mdx @@ -12,17 +12,25 @@ To ingest data in formats marked with "T", you need to create tables (with conne | Connector | Version | Format | | :------------ | :------------ | :------------------- | -| [Kafka](/integrations/sources/kafka) | 3.1.0 or later versions | [Avro](#avro), [JSON](#json), [protobuf](#protobuf), [Debezium JSON](#debezium-json) (T), [Debezium AVRO](#debezium-avro) (T), [DEBEZIUM\_MONGO\_JSON](#debezium-mongo-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T), [Upsert JSON](#upsert-json) (T), [Upsert AVRO](#upsert-avro) (T), [Bytes](#bytes) | -| [Redpanda](/integrations/sources/redpanda) | Latest | [Avro](#avro), [JSON](#json), [protobuf](#protobuf) | -| [Pulsar](/integrations/sources/pulsar) | 2.8.0 or later versions | [Avro](#avro), [JSON](#json), [protobuf](#protobuf), [Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) | -| [Kinesis](/integrations/sources/kinesis) | Latest | [Avro](#avro), [JSON](#json), [protobuf](#protobuf), [Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) | -| [PostgreSQL CDC](/integrations/sources/postgresql-cdc) | 10, 11, 12, 13, 14 | [Debezium JSON](#debezium-json) (T) | -| [MySQL CDC](/integrations/sources/mysql-cdc) | 5.7, 8.0 | [Debezium JSON](#debezium-json) (T) | -| [CDC via Kafka](/ingestion/change-data-capture-with-risingwave) | | 
[Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) | -| [Amazon S3](/integrations/sources/s3) | Latest | [JSON](#json), CSV | -| [Load generator](/ingestion/generate-test-data) | Built-in | [JSON](#json) | -| [Google Pub/Sub](/integrations/sources/google-pub-sub) | | [Avro](#avro), [JSON](#json), [protobuf](#protobuf), [Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) | -| [Google Cloud Storage](/integrations/sources/google-cloud-storage) | | [JSON](#json) | +| [Kafka](/integrations/sources/kafka) | 3.1.0 or later versions | [JSON](#json), [Protobuf](#protobuf), [Avro](#avro), [Bytes](#bytes), [CSV](#csv), [Upsert JSON](#upsert-json) (T), [Upsert Avro](#upsert-avro) (T), Upsert Protobuf (T), [Debezium JSON](#debezium-json) (T), [Debezium Avro](#debezium-avro) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T), [Debezium Mongo JSON](#debezium-mongo-json) (T) | +| [Redpanda](/integrations/sources/redpanda) | Latest | [JSON](#json), [Protobuf](#protobuf), [Avro](#avro) | +| [Pulsar](/integrations/sources/pulsar) | 2.8.0 or later versions | [JSON](#json), [Protobuf](#protobuf), [Avro](#avro), [Bytes](#bytes), [Upsert JSON](#upsert-json) (T), [Upsert Avro](#upsert-avro) (T), [Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) | +| [Kinesis](/integrations/sources/kinesis) | Latest | [JSON](#json), [Protobuf](#protobuf), [Avro](#avro), [Bytes](#bytes), [Upsert JSON](#upsert-json) (T), [Upsert Avro](#upsert-avro) (T), [Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) | +| [PostgreSQL CDC](/integrations/sources/postgresql-cdc) | 10, 11, 12, 13, 14 | [Debezium JSON](#debezium-json) (T) | +| [MySQL CDC](/integrations/sources/mysql-cdc) | 5.7, 8.0 | [Debezium JSON](#debezium-json) (T) | +| [SQL Server CDC](/integrations/sources/sql-server-cdc) | 2019, 2022 
| [Debezium JSON](#debezium-json) (T) |
+| [MongoDB CDC](/integrations/sources/mongodb-cdc) | | [Debezium Mongo JSON](#debezium-mongo-json) (T) |
+| [Citus CDC](/integrations/sources/citus-cdc) | 10.2 | [Debezium JSON](#debezium-json) (T) |
+| [CDC via Kafka](/ingestion/change-data-capture-with-risingwave) | | [Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) |
+| [Google Pub/Sub](/integrations/sources/google-pub-sub) | | [JSON](#json), [Protobuf](#protobuf), [Avro](#avro), [Bytes](#bytes), [Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) |
+| [Amazon S3](/integrations/sources/s3) | Latest | [JSON](#json), [CSV](#csv), [Parquet](#parquet) |
+| [Google Cloud Storage](/integrations/sources/google-cloud-storage) | | [JSON](#json), [CSV](#csv), [Parquet](#parquet) |
+| [Azure Blob](/integrations/sources/azure-blob) | | [JSON](#json), [CSV](#csv), [Parquet](#parquet) |
+{/* | [POSIX File System]() | | [CSV](#csv) | */}
+| [NATS JetStream](/integrations/sources/nats-jetstream) | | [JSON](#json), [Protobuf](#protobuf), [Bytes](#bytes) |
+| [MQTT](/integrations/sources/mqtt) | | [JSON](#json), [Bytes](#bytes) |
+| [Apache Iceberg](/integrations/sources/apache-iceberg) | | No need to specify `FORMAT` |
+| [Load generator](/ingestion/generate-test-data) | Built-in | [JSON](#json) |
 
 When a source is created, RisingWave does not ingest data immediately. RisingWave starts to process data when a materialized view is created based on the source.
 
@@ -72,7 +80,7 @@ FORMAT PLAIN
 ENCODE BYTES
 ```
 
-### Debezium AVRO
+### Debezium Avro
 
 When creating a source from streams with Debezium Avro, the schema of the source does not need to be defined in the `CREATE TABLE` statement, as it can be inferred from the `SCHEMA REGISTRY`. This means that the schema file location must be specified.
 The schema file location can be an actual Web location, in `http://...`, `https://...`, or `s3://...` format, or a Confluent Schema Registry. For more details about using Schema Registry for Kafka data, see [Read schema from Schema Registry](/integrations/sources/kafka#read-schemas-from-confluent-schema-registry).
@@ -190,11 +198,26 @@ ENCODE JSON [ ( ) ]
 ```
 
+### CSV
+
+To consume data in CSV format, you can use `FORMAT PLAIN ENCODE CSV` with options. Configurable options include `delimiter` and `without_header`.
+
+Syntax:
+
+```sql
+FORMAT PLAIN
+ENCODE CSV (
+    delimiter = 'delimiter',
+    without_header = 'false' | 'true'
+)
+```
+
+The `delimiter` option is required, while the `without_header` option is optional, with a default value of `false`.
+
 ### Parquet
 
 Parquet format allows you to efficiently store and retrieve large datasets by utilizing a columnar storage architecture. RisingWave supports reading Parquet files from object storage systems including Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage.
 
-
 Syntax:
 
 ```sql
@@ -230,7 +253,6 @@ ENCODE PROTOBUF (
 )
 ```
 
 For more information on supported protobuf types, refer to [Supported protobuf types](/sql/data-types/supported-protobuf-types).
 
-
 ## General parameters for supported formats
 
 Here are some notes regarding parameters that can be applied to multiple formats supported by our systems.
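For illustration, the CSV options documented in the new section of this patch would appear inside a full `CREATE TABLE` statement. The sketch below is hypothetical: the table name, column list, bucket, and region are invented for this example and are not part of the patch.

```sql
-- Hypothetical example: ingest comma-delimited, headerless CSV files from S3.
-- Table name, columns, bucket, and region are invented for illustration.
CREATE TABLE user_events (
    user_id INT,
    event_name VARCHAR,
    event_time TIMESTAMP
)
WITH (
    connector = 's3',
    s3.region_name = 'us-east-1',
    s3.bucket_name = 'example-bucket',
    match_pattern = '*.csv'
)
FORMAT PLAIN
ENCODE CSV (
    delimiter = ',',
    without_header = 'true'
);
```

Because `without_header` is `'true'`, column names come from the table definition rather than from a header row in the files.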
From 8db8e602468effd9dda8bdca2bcbc064fe15c7c5 Mon Sep 17 00:00:00 2001 From: Richard Chien Date: Tue, 10 Dec 2024 14:08:41 +0800 Subject: [PATCH 2/4] fix Signed-off-by: Richard Chien --- ingestion/supported-sources-and-formats.mdx | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/ingestion/supported-sources-and-formats.mdx b/ingestion/supported-sources-and-formats.mdx index 5d563c72..7009ff67 100644 --- a/ingestion/supported-sources-and-formats.mdx +++ b/ingestion/supported-sources-and-formats.mdx @@ -26,10 +26,9 @@ To ingest data in formats marked with "T", you need to create tables (with conne | [Amazon S3](/integrations/sources/s3) | Latest | [JSON](#json), [CSV](#csv), [Parquet](#parquet) | | [Google Cloud Storage](/integrations/sources/google-cloud-storage) | | [JSON](#json), [CSV](#csv), [Parquet](#parquet) | | [Azure Blob](/integrations/sources/azure-blob) | | [JSON](#json), [CSV](#csv), [Parquet](#parquet) | -{/* | [POSIX File System]() | | [CSV](#csv) | */} | [NATS JetStream](/integrations/sources/nats-jetstream) | | [JSON](#json), [Protobuf](#protobuf), [Bytes](#bytes) | | [MQTT](/integrations/sources/mqtt) | | [JSON](#json), [Bytes](#bytes) | -| [Apache Iceberg](/integrations/sources/apache-iceberg) | | No need to specify `FORMAT` | +| [Apache Iceberg](/integrations/sources/apache-iceberg) | | No need to specify `FORMA | | [Load generator](/ingestion/generate-test-data) | Built-in | [JSON](#json) | From feab61fa75b55d260645c4ee34b1d9227ee25a96 Mon Sep 17 00:00:00 2001 From: Richard Chien Date: Tue, 10 Dec 2024 14:08:56 +0800 Subject: [PATCH 3/4] fix Signed-off-by: Richard Chien --- ingestion/supported-sources-and-formats.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ingestion/supported-sources-and-formats.mdx b/ingestion/supported-sources-and-formats.mdx index 7009ff67..ba98f083 100644 --- a/ingestion/supported-sources-and-formats.mdx +++ b/ingestion/supported-sources-and-formats.mdx @@ -28,7 +28,7 @@ To ingest 
data in formats marked with "T", you need to create tables (with conne | [Azure Blob](/integrations/sources/azure-blob) | | [JSON](#json), [CSV](#csv), [Parquet](#parquet) | | [NATS JetStream](/integrations/sources/nats-jetstream) | | [JSON](#json), [Protobuf](#protobuf), [Bytes](#bytes) | | [MQTT](/integrations/sources/mqtt) | | [JSON](#json), [Bytes](#bytes) | -| [Apache Iceberg](/integrations/sources/apache-iceberg) | | No need to specify `FORMA | +| [Apache Iceberg](/integrations/sources/apache-iceberg) | | No need to specify `FORMAT` | | [Load generator](/ingestion/generate-test-data) | Built-in | [JSON](#json) | From a9a1c483ab7313b59db8f90ad98dcc4ab52a47d5 Mon Sep 17 00:00:00 2001 From: Richard Chien Date: Tue, 10 Dec 2024 14:10:37 +0800 Subject: [PATCH 4/4] add version for azblob and gcs Signed-off-by: Richard Chien --- ingestion/supported-sources-and-formats.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ingestion/supported-sources-and-formats.mdx b/ingestion/supported-sources-and-formats.mdx index ba98f083..a565f405 100644 --- a/ingestion/supported-sources-and-formats.mdx +++ b/ingestion/supported-sources-and-formats.mdx @@ -24,8 +24,8 @@ To ingest data in formats marked with "T", you need to create tables (with conne | [CDC via Kafka](/ingestion/change-data-capture-with-risingwave) | | [Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) | | [Google Pub/Sub](/integrations/sources/google-pub-sub) | | [JSON](#json), [Protobuf](#protobuf), [Avro](#avro), [Bytes](#bytes), [Debezium JSON](#debezium-json) (T), [Maxwell JSON](#maxwell-json) (T), [Canal JSON](#canal-json) (T) | | [Amazon S3](/integrations/sources/s3) | Latest | [JSON](#json), [CSV](#csv), [Parquet](#parquet) | -| [Google Cloud Storage](/integrations/sources/google-cloud-storage) | | [JSON](#json), [CSV](#csv), [Parquet](#parquet) | -| [Azure Blob](/integrations/sources/azure-blob) | | [JSON](#json), [CSV](#csv), 
[Parquet](#parquet) | +| [Google Cloud Storage](/integrations/sources/google-cloud-storage) | Latest | [JSON](#json), [CSV](#csv), [Parquet](#parquet) | +| [Azure Blob](/integrations/sources/azure-blob) | Latest | [JSON](#json), [CSV](#csv), [Parquet](#parquet) | | [NATS JetStream](/integrations/sources/nats-jetstream) | | [JSON](#json), [Protobuf](#protobuf), [Bytes](#bytes) | | [MQTT](/integrations/sources/mqtt) | | [JSON](#json), [Bytes](#bytes) | | [Apache Iceberg](/integrations/sources/apache-iceberg) | | No need to specify `FORMAT` |
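As a closing illustration of the Debezium Avro behavior described in patch 1 (the schema is inferred from the schema registry, so no column list is declared), a table over a Kafka topic might look like the following sketch. The topic, broker, and registry addresses are invented for this example.

```sql
-- Hypothetical example: no column list is needed because the schema
-- is inferred from the (invented) schema registry address.
CREATE TABLE orders_cdc
WITH (
    connector = 'kafka',
    topic = 'dbserver1.inventory.orders',
    properties.bootstrap.server = 'message_queue:29092'
)
FORMAT DEBEZIUM ENCODE AVRO (
    schema.registry = 'http://message_queue:8081'
);
```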