Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LC-501: standardize stage properties to camelCase #203

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

Jozurf
Copy link
Contributor

@Jozurf Jozurf commented Oct 21, 2024

No description provided.

@Jozurf Jozurf marked this pull request as ready for review October 28, 2024 13:57
@Jozurf Jozurf requested a review from rseitz October 28, 2024 13:57
validation-example.conf Outdated Show resolved Hide resolved
import org.slf4j.LoggerFactory;
import org.apache.commons.text.CaseUtils;

public class CamelCaseConfigConverter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need a unit test for this. Can be a simple test that runs the conversion on one input conf file and confirms that the output matches what we expect. The conf file should have some examples of edge cases like multiple instances of the same stage property name on one line (used as a property and as a value) and some other snake case properties in connectors or elsewhere that shouldn't be converted because they're not stage properties

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WIP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created test for this in 1134f0a

@@ -53,25 +53,25 @@
* - "offset" : number of character offset from start of document
* - "length" : number of characters in this chunk
* - "chunk_number" : chunk number
* - "total_chunk_number" : total chunk number produced from parent document
* - "total_chunks" : total chunk number produced from parent document
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] Should these document field names be camel case too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kiratraynor to answer both of your questions. I did have the same thoughts, and short answer is, that we can, but this would make it a nightmare to create a script for conversion. Some external libraries (like Tika for TikaExtractor) produces snake_case for lucille documents. And so any other stages further down the pipeline that takes in the name of the field (like source: "field_produced_by_tika") cannot be changed in the script convertor. Also, there is the case where CSVConnector, SolrConnector, VFSConnector, DatabaseConnector could produce fields with snake_case, meaning that the script would have to know which fields not to convert in cases like this, especially for CSVConnector and DatabaseConnector, which is dependent on the input.

@kiratraynor
Copy link
Contributor

I'm not sure if we'd want to change these or not, but there are a few examples of stages that have values that are in snake case like NormalizeText (with title_case, sentence_case), ChunkText (with the doc fields its adding), Condition (with 'must_not'), or just run_id in general. It seems like the changes that were made were ensuring that the stage properties themselves were in camel case, but just wondering if we would we want the values like these to be as well so that it's more consistent?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants