Merge pull request #6 from jcustenborder/jenkinsfile
Jenkinsfile
jcustenborder authored Sep 7, 2016
2 parents eb377bb + 45a3eac commit efa7c51
Showing 4 changed files with 69 additions and 339 deletions.
65 changes: 65 additions & 0 deletions README.md
@@ -0,0 +1,65 @@
# Overview

This Kafka Connect connector provides the capability to watch a directory for files and read the data as new files are
written to the input directory. The RecordProcessor implementation can be overridden so any file type can be supported.
Currently there is support for delimited files and reading a file line by line.

The CSVRecordProcessor supports reading CSV or TSV files. It can convert a CSV on the fly to the strongly typed Kafka
Connect data types. It currently has support for all of the schema types and logical types that are supported in Kafka 0.10.x.
If you couple this with the Avro converter and Schema Registry by Confluent, you will be able to convert CSV files to
strongly typed Avro data in real time.

The LineRecordProcessor supports reading a file line by line and emitting the line.

# Building on your workstation

```
git clone git@github.com:jcustenborder/kafka-connect-spooldir.git
cd kafka-connect-spooldir
mvn clean package
```

# Running on your workstation


# Schema Configuration

This connector can either infer a schema of nullable strings from the header row or use a schema that you specify in JSON format.
To use the automatic schema generation set ``csv.first.row.as.header=true``, ``csv.schema.from.header=true``, ``csv.schema.from.header.keys=key1,key2``.
To manually define the schema set ``csv.schema`` to a JSON representation of the schema. The example below is for the mock data in the test class.
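
For the automatic option, the three settings named above can be collected as in the following minimal Java sketch. It covers only the header-based settings, not the manual ``csv.schema`` case. The class and method names are hypothetical and not part of the connector, and the required path, topic, and record processor settings are deliberately left out.

```java
import java.util.Properties;

// Hypothetical helper, not part of the connector: it only assembles the
// header-based schema settings described in the paragraph above.
class HeaderSchemaConfigSketch {

    // keyFields is a comma-separated list of header names, e.g. "key1,key2".
    static Properties headerSchemaSettings(String keyFields) {
        Properties props = new Properties();
        // Treat the first row of each file as the header.
        props.setProperty("csv.first.row.as.header", "true");
        // Generate the schema from that header row.
        props.setProperty("csv.schema.from.header", "true");
        // Header fields passed to csv.schema.from.header.keys.
        props.setProperty("csv.schema.from.header.keys", keyFields);
        return props;
    }

    public static void main(String[] args) {
        Properties props = headerSchemaSettings("key1,key2");
        // Print each setting as name=value (iteration order is not guaranteed).
        props.forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

The printed pairs can be merged into the connector configuration alongside the required settings listed under Configuration options.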

# Configuration options

| Name | Description | Type | Default | Valid Values | Importance |
|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|-------------------------|--------------------------------------------------------------------------------------------------------------|------------|
| error.path | The directory in which to place files that have errors. This directory must exist and be writable by the user running Kafka Connect. | string | | | high |
| finished.path | The directory to place files that have been successfully processed. This directory must exist and be writable by the user running Kafka Connect. | string | | | high |
| input.file.pattern | Regular expression to check input file names against. This expression must match the entire filename. The equivalent of Matcher.matches(). | string | | | high |
| input.path | The directory to read files that will be processed. This directory must exist and be writable by the user running Kafka Connect. | string | | | high |
| record.processor.class | Class that implements RecordProcessor. This class is used to process data as it arrives. | class | | | high |
| topic | The Kafka topic to write the data to. | string | | | high |
| halt.on.error | Should the task halt when it encounters an error or continue to the next file. | boolean | true | | high |
| csv.first.row.as.header | Flag to indicate if the first row of data contains the header of the file. | boolean | false | | medium |
| csv.schema | Schema representation in json. | string | "" | | medium |
| batch.size | The number of records that should be returned with each batch. | int | 1000 | | low |
| csv.case.sensitive.field.names | Flag to determine if the field names in the header row should be treated as case sensitive. | boolean | false | | low |
| csv.escape.char | Escape character. | int | 92 | | low |
| csv.file.charset | Character set to read the file with. | string | UTF-8 | | low |
| csv.ignore.leading.whitespace | Sets the ignore leading whitespace setting - if true, white space in front of a quote in a field is ignored. | boolean | true | | low |
| csv.ignore.quotations | Sets the ignore quotations mode - if true, quotations are ignored. | boolean | false | | low |
| csv.keep.carriage.return | Flag to determine if the carriage return at the end of the line should be maintained. | boolean | false | | low |
| csv.null.field.indicator | Indicator to determine how the CSV Reader can determine if a field is null. Valid values are EMPTY_SEPARATORS, EMPTY_QUOTES, BOTH, NEITHER. For more information see http://opencsv.sourceforge.net/apidocs/com/opencsv/enums/CSVReaderNullFieldIndicator.html. | string | NEITHER | ValidEnum{enumClass=CSVReaderNullFieldIndicator, validEnums=[NEITHER, EMPTY_SEPARATORS, EMPTY_QUOTES, BOTH]} | low |
| csv.parser.timestamp.date.formats | The date formats that are expected in the file. This is a list of strings that will be used to parse the date fields in order. The most accurate date format should be the first in the list. Take a look at the Java documentation for more info. https://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html | list | [yyyy-MM-dd' 'HH:mm:ss] | | low |
| csv.parser.timestamp.timezone | The timezone that all of the dates will be parsed with. | string | UTC | | low |
| csv.quote.char | The character that is used to quote a field. This typically happens when the csv.separator.char character is within the data. | int | 34 | | low |
| csv.schema.from.header | Flag to determine if the schema should be generated based on the header row. | boolean | false | | low |
| csv.schema.from.header.keys | csv.schema.from.header.keys | list | [] | | low |
| csv.schema.name | Fully qualified name for the schema. This setting is ignored if csv.schema is set. | string | "" | | low |
| csv.separator.char | The character that separates each field. Typically in a CSV this is a , character. A TSV would use \t. | int | 44 | | low |
| csv.skip.lines | Number of lines to skip in the beginning of the file. | int | 0 | | low |
| csv.strict.quotes | Sets the strict quotes setting - if true, characters outside the quotes are ignored. | boolean | false | | low |
| csv.verify.reader | Flag to determine if the reader should be verified. | boolean | true | | low |
| file.minimum.age.ms | The amount of time in milliseconds after the file was last written to before the file can be processed. | long | 0 | [0,...,9223372036854775807] | low |
| include.file.metadata | Flag to determine if the metadata about the file should be included. | boolean | false | | low |
| processing.file.extension | Before a file is processed, it is renamed to indicate that it is currently being processed. This setting is appended to the end of the file name. | string | .PROCESSING | ValidPattern{pattern=^.*\..+$} | low |
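
The `input.file.pattern` description says the expression must match the entire filename, like `Matcher.matches()`. The sketch below shows what that implies using only `java.util.regex`. The class name, method name, and file names are made up for illustration, and the connector's own matching code is not shown in this README.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: illustrates whole-name matching with Matcher.matches().
class InputFilePatternSketch {

    // Returns true only when the regex matches the complete file name.
    static boolean accepts(String regex, String fileName) {
        return Pattern.compile(regex).matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // The pattern spans the whole name, so it matches.
        System.out.println(accepts("^.*\\.csv$", "orders.csv"));   // true
        // A partial pattern does not match the whole name, even though
        // Matcher.find() would locate ".csv" inside it.
        System.out.println(accepts("\\.csv", "orders.csv"));       // false
        // A trailing extension after .csv breaks the full match.
        System.out.println(accepts("^.*\\.csv$", "orders.csv.PROCESSING")); // false
    }
}
```

In practice this means a pattern such as `\.csv` on its own would reject every file, so the expression should be written to cover the full name.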

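The `csv.separator.char`, `csv.quote.char`, and `csv.escape.char` settings are typed as `int`, and their defaults (44, 34, 92) are character codes. A small sketch, with a hypothetical class name, of how to find the code for a given character, for example when switching to a tab separator for TSV files:

```java
// Hypothetical demo class: prints the integer codes used by the int-typed CSV settings.
class CsvCharCodeSketch {

    // A Java char widens to its numeric code when converted to int.
    static int code(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println("comma     = " + code(','));   // 44, default csv.separator.char
        System.out.println("quote     = " + code('"'));   // 34, default csv.quote.char
        System.out.println("backslash = " + code('\\'));  // 92, default csv.escape.char
        System.out.println("tab       = " + code('\t'));  // 9, a separator value for TSV
    }
}
```

Under this reading, a TSV file would be configured with `csv.separator.char=9`.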