LC-501: standardize stage properties to camelCase #203

Jozurf · 2024-10-21T21:12:32Z

No description provided.

lucille-core/src/main/java/com/kmwllc/lucille/util/CamelCaseConfigConverter.java

validation-example.conf

rseitz · 2024-11-05T17:44:33Z

lucille-core/src/main/java/com/kmwllc/lucille/util/CamelCaseConfigConverter.java

+import org.slf4j.LoggerFactory;
+import org.apache.commons.text.CaseUtils;
+
+public class CamelCaseConfigConverter {


We do need a unit test for this. It can be a simple test that runs the conversion on one input conf file and confirms that the output matches what we expect. The conf file should include some edge cases, such as multiple instances of the same stage property name on one line (used both as a property and as a value), and some other snake_case properties in connectors or elsewhere that shouldn't be converted because they're not stage properties.

Created a test for this in 1134f0a.
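
To make the edge cases above concrete, here is a minimal sketch of a key-only conversion. It is not the PR's `CamelCaseConfigConverter`: the class `CamelCaseSketch`, its methods, and the sample property names are hypothetical, and it assumes that the stage property names are supplied as a list and that keys are unquoted.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the converter from this PR.
class CamelCaseSketch {

  // Converts a snake_case name to camelCase, e.g. "update_mode" -> "updateMode".
  static String toCamel(String snake) {
    String[] parts = snake.split("_");
    StringBuilder sb = new StringBuilder(parts[0]);
    for (int i = 1; i < parts.length; i++) {
      if (parts[i].isEmpty()) {
        continue; // skip empty segments produced by consecutive underscores
      }
      sb.append(Character.toUpperCase(parts[i].charAt(0)));
      sb.append(parts[i].substring(1));
    }
    return sb.toString();
  }

  // Renames each listed property only where it appears as an unquoted key,
  // i.e. followed by ':' or '=' and not preceded by a word character or a quote.
  // Quoted values and names not in the list are left untouched.
  static String convertLine(String line, List<String> stageProperties) {
    String result = line;
    for (String prop : stageProperties) {
      Pattern key = Pattern.compile("(?<![\\w\"])" + Pattern.quote(prop) + "(?=\\s*[:=])");
      result = key.matcher(result).replaceAll(Matcher.quoteReplacement(toCamel(prop)));
    }
    return result;
  }

  public static void main(String[] args) {
    // Property names here are illustrative assumptions.
    List<String> props = List.of("field_mapping", "update_mode");
    String line = "{ field_mapping: { a: \"field_mapping\" }, update_mode: \"skip\" }";
    System.out.println(convertLine(line, props));
    // Not a listed stage property, so it stays as-is.
    System.out.println(convertLine("connector_setting: true", props));
  }
}
```

A test along the lines requested would run a conversion like this over a whole input file and compare the result with an expected output file. The sketch deliberately ignores quoted keys and dotted paths, which a real converter would need to consider.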

spencersolomon6 · 2024-11-05T17:55:05Z

lucille-core/src/main/java/com/kmwllc/lucille/stage/ChunkText.java

@@ -53,25 +53,25 @@
 *       - "offset" : number of character offset from start of document
 *       - "length" : number of characters in this chunk
 *       - "chunk_number" : chunk number
- *       - "total_chunk_number" : total chunk number produced from parent document
+ *       - "total_chunks" : total chunk number produced from parent document


[question] Should these document field names be camel case too?

@kiratraynor to answer both of your questions: I had the same thought. The short answer is that we can, but it would make writing a conversion script a nightmare. Some external libraries (like Tika, for TikaExtractor) produce snake_case field names on Lucille documents, so any stage further down the pipeline that takes the name of such a field (like source: "field_produced_by_tika") cannot be changed by the conversion script. CSVConnector, SolrConnector, VFSConnector, and DatabaseConnector can also produce snake_case fields, so the script would have to know which fields not to convert, especially for CSVConnector and DatabaseConnector, whose field names depend on the input.
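
As a rough illustration of the concern above, the sketch below shows what happens if a script renames a field-name value in the config while the upstream component keeps emitting the snake_case name. The class `FieldNameMismatchSketch`, the `lookup` helper, and the field name are hypothetical stand-ins, not Lucille APIs.

```java
import java.util.Map;

// Hypothetical sketch: a document modeled as a plain map of field names to values.
class FieldNameMismatchSketch {

  // Returns the value a stage would read for its configured source field,
  // or null when the document has no field with that name.
  static Object lookup(Map<String, Object> docFields, String configuredSource) {
    return docFields.get(configuredSource);
  }

  public static void main(String[] args) {
    // Assumed field name produced upstream in snake_case (e.g. by an extractor or a CSV header).
    Map<String, Object> doc = Map.of("content_type", "application/pdf");

    // Original config value: matches the field the document actually has.
    System.out.println(lookup(doc, "content_type"));

    // Value after a naive snake_case-to-camelCase rewrite: no longer matches.
    System.out.println(lookup(doc, "contentType"));
  }
}
```

The first lookup finds the field and the second returns null, which is why the conversion is limited to stage property names rather than the field names passed as values.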

kiratraynor · 2024-11-12T21:16:41Z

I'm not sure whether we'd want to change these, but a few stages have values in snake case, like NormalizeText (with title_case, sentence_case), ChunkText (with the doc fields it adds), Condition (with 'must_not'), or just run_id in general. It seems the changes ensure that the stage properties themselves are in camel case, but I'm wondering whether we'd want values like these to be camel case as well, so that it's more consistent?

Jozurf added 3 commits October 21, 2024 17:12

LC-501: standardize stage properties to camelCase

b06fcfd

LC-501: added script to convert conf files to camelCase

2f03165

LC-501: update camelcase script

c2cece5

Jozurf commented Oct 23, 2024
