From f22e2432b8ebbd449144a1546064e872f0888356 Mon Sep 17 00:00:00 2001
From: Will Johnson
Date: Thu, 23 Jan 2025 14:39:32 -0500
Subject: [PATCH] docs: add note to note that file extension is required in training data path (#447)

* docs: add note to note that file extension is required in training data path

Signed-off-by: Will Johnson

* docs: clarify what is being checked

Signed-off-by: Will Johnson

* docs: make note, clarification

Signed-off-by: Will Johnson

* docs: clarify specifically what code does

Signed-off-by: Will Johnson

---------

Signed-off-by: Will Johnson
---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index ff257b17c..2add2b404 100644
--- a/README.md
+++ b/README.md
@@ -80,6 +80,11 @@ ARROW | ✅
 
 As iterated above, we also support passing a HF dataset ID directly via `--training_data_path` argument.
 
+**NOTE**: Due to the variety of supported data formats and file types, `--training_data_path` is handled as follows:
+- If `--training_data_path` ends in a valid file extension (e.g., .json, .csv), it is treated as a file.
+- If `--training_data_path` points to a valid folder, it is treated as a folder.
+- If neither of these are true, the data preprocessor tries to load `--training_data_path` as a Hugging Face (HF) dataset ID.
+
 ## Use cases supported with `training_data_path` argument
 
 ### 1. Data formats with a single sequence and a specified response_template to use for masking on completion.
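
The resolution order described in the added NOTE can be sketched in Python as follows. This is a minimal illustration, not the project's actual implementation: the function name `classify_training_data_path` and the `KNOWN_EXTENSIONS` tuple are hypothetical, and only `.json` and `.csv` are named in the patch itself.

```python
import os

# Assumption: an illustrative extension list. The patch names only .json and
# .csv as examples; the real set is defined by the project, not by this sketch.
KNOWN_EXTENSIONS = (".json", ".jsonl", ".csv", ".parquet", ".arrow")


def classify_training_data_path(path: str) -> str:
    """Hypothetical helper mirroring the three rules in the NOTE.

    Returns "file", "folder", or "hf_dataset_id".
    """
    # Rule 1: a recognized file extension means the path is treated as a file.
    _, ext = os.path.splitext(path)
    if ext.lower() in KNOWN_EXTENSIONS:
        return "file"

    # Rule 2: an existing directory is treated as a folder.
    if os.path.isdir(path):
        return "folder"

    # Rule 3: anything else is tried as a Hugging Face dataset ID.
    return "hf_dataset_id"
```

Under these rules, a local data file whose name lacks an extension would fall through to the dataset ID branch, which is presumably why the commit subject states that the file extension is required.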