Finetuning Chronos on ETTh datasets yields poor performance #209

aleksmaksimovic · 2024-11-22T09:49:39Z

Describe the bug

When finetuning chronos-t5-small on the ETTh1 dataset and ETTh2 dataset respectively, the performance drops compared to the zeroshot performance. Could that be the case because the prediction_length is recommended to be <=64?

Expected behavior

If chronos-t5-small is finetuned on, say, the ETTh1 dataset only, the finetuned model should achieve better MAE and MSE than the zeroshot model on that dataset.

How to reproduce

This example is focused on the ETTh1 dataset. For the ETTh2 dataset, the procedure is identical. Please note that the finetuning evaluation for my experiments is done individually for both datasets, so the model is not finetuned and evaluated on both datasets at once.

Standardize ETTh dataset and convert it into arrow format

from pathlib import Path
from typing import List, Union

import numpy as np
import pandas as pd
from gluonts.dataset.arrow import ArrowWriter
from sklearn.preprocessing import StandardScaler


def convert_to_arrow(
    path: Union[str, Path],
    time_series: Union[List[np.ndarray], np.ndarray],
    compression: str = "lz4",
):
    """
    Store a given set of series into Arrow format at the specified path.

    Input data can be either a list of 1D numpy arrays, or a single 2D
    numpy array of shape (num_series, time_length).
    """
    assert isinstance(time_series, list) or (
        isinstance(time_series, np.ndarray) and
        time_series.ndim == 2
    )

    # Set an arbitrary start time
    start = np.datetime64("2016-07-01 00:00:00", "s")

    dataset = [
        {"start": start, "target": ts} for ts in time_series
    ]

    ArrowWriter(compression=compression).write_to_file(
        dataset,
        path=path,
    )


if __name__ == "__main__":
    # Load and preprocess the dataset
    df = pd.read_csv('/path/to/dataset')

    # Keep the first 12194 rows; copy so the column assignments below
    # do not operate on a view of the original frame
    df = df[0:12194].copy()

    # Ensure time column is in datetime format
    time_column = 'date'
    df[time_column] = pd.to_datetime(df[time_column])

    # Define feature columns
    feature_columns = ['HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL', 'OT']

    # Standardize the feature columns
    scaler = StandardScaler()
    df[feature_columns] = scaler.fit_transform(df[feature_columns])
    df['id'] = 0

    # Create the structured DataFrame
    structured_df = df[['id', time_column] + feature_columns].rename(columns={time_column: 'timestamp'})

    # Extract the time series and start times
    # (start_times is not used by convert_to_arrow, which sets its own start)
    time_series = [structured_df[col].to_numpy() for col in feature_columns]
    start_times = [np.datetime64(structured_df['timestamp'].iloc[0], 's')] * len(feature_columns)

    # Presumably followed by the conversion call; the output path is a placeholder
    convert_to_arrow("path/to/ETTh1_train.arrow", time_series=time_series)

Finetune the model

Use the training pipeline in chronos-forecasting/scripts/training/train.py, as shown in the tutorial, with the following config (chronos-t5-small.yaml):

training_data_paths:
- "path/to/ETTh1_train.arrow"
probability:
- 1.0
context_length: 512
prediction_length: 96
min_past: 60
max_steps: 200_000
save_steps: 20_000
log_steps: 500
per_device_train_batch_size: 16
learning_rate: 0.001
optim: adamw_torch_fused
num_samples: 1
shuffle_buffer_length: 50_000
gradient_accumulation_steps: 1
model_id: google/t5-efficient-small
model_type: seq2seq
random_init: true
tie_embeddings: true
output_dir: ./output/etth1/
tf32: true
torch_compile: false
tokenizer_class: "MeanScaleUniformBins"
tokenizer_kwargs:
  low_limit: -15.0
  high_limit: 15.0
n_tokens: 4096
lr_scheduler_type: linear
warmup_ratio: 0.0
dataloader_num_workers: 1
max_missing_prop: 0.9
use_eos_token: true

Evaluate the final checkpoint of the finetuned model with the evaluation pipeline in chronos-forecasting/scripts/evaluation/evaluate.py, with the following modifications to the load_and_split_dataset() function:

def load_and_split_dataset(backtest_config: dict):
    hf_repo = backtest_config["hf_repo"]
    dataset_name = backtest_config["name"]
    offset = backtest_config["offset"]
    prediction_length = backtest_config["prediction_length"]
    num_rolls = backtest_config["num_rolls"]

    # Load the local Arrow file instead of the dataset from the HF repo
    ds = Dataset.from_file("/path/to/ETTh1.arrow")
    ds.set_format("numpy")

    gts_dataset = to_gluonts_univariate(ds)

    # Split dataset for evaluation
    _, test_template = split(gts_dataset, offset=offset)
    test_data = test_template.generate_instances(prediction_length, windows=num_rolls, distance=1)

    return test_data

and the following metrics:

metrics = (
    evaluate_forecasts(
        sample_forecasts,
        test_data=test_data,
        metrics=[
            MAE(),
            MSE(),
        ],
        batch_size=5000,
    )
    .reset_index(drop=True)
    .to_dict(orient="records")
)

The evaluation is performed on the test section of the standardized ETTh1 dataset, hence the offset. For the evaluation pipeline, use the following config:

- name: ETTh
  hf_repo: autogluon/chronos_datasets_extra
  offset: -1742
  prediction_length: 96
  num_rolls: 1135
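
To make the combined effect of offset, prediction_length, num_rolls, and distance=1 concrete, here is a minimal plain-Python sketch that enumerates the evaluation windows. It assumes that window i starts at offset + i * distance relative to the end of the series, which is how GluonTS rolling windows appear to be laid out, and it assumes a series length of 17,420 steps (the usual size of the ETTh CSV files). The helper name rolling_windows is hypothetical and not part of GluonTS or Chronos.

```python
from typing import List, Tuple


def rolling_windows(
    series_length: int,
    offset: int,
    prediction_length: int,
    num_rolls: int,
    distance: int = 1,
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of the forecast windows.

    Assumption: window i starts at series_length + offset + i * distance
    and spans prediction_length steps; everything before start is context.
    """
    if offset >= 0:
        raise ValueError("this sketch only handles negative offsets")
    windows = []
    for i in range(num_rolls):
        start = series_length + offset + i * distance
        end = start + prediction_length
        if end > series_length:
            raise ValueError(f"window {i} runs past the end of the series")
        windows.append((start, end))
    return windows


# Values from the evaluation config; 17_420 is an assumed series length.
windows = rolling_windows(17_420, offset=-1742, prediction_length=96, num_rolls=1135)
first_start, first_end = windows[0]
last_start, last_end = windows[-1]
print(len(windows), first_start, last_end)
```

Under these assumptions, the last window ends 512 steps before the end of the series, and consecutive windows share 95 of their 96 target steps, so the 1135 windows are strongly correlated.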

Compare the evaluation of the finetuned model with the default model chronos-t5-small (which has not been trained on the ETTh1 dataset).

I get the following results:

Zeroshot ETTh1
dataset,model,MAE[0.5],MSE[mean]
ETTh,amazon/chronos-t5-small,0.5081954018184918,0.560689815315581

Zeroshot ETTh2
dataset,model,MAE[0.5],MSE[mean]
ETTh,amazon/chronos-t5-small,0.2625630043626757,0.1391419442914831

Finetuned ETTh1
dataset,model,MAE[0.5],MSE[mean]
ETTh,/path/to/checkpoint-final,0.7746078180721628,1.1865953634689008

Finetuned ETTh2
dataset,model,MAE[0.5],MSE[mean]
ETTh,/path/to/checkpoint-final,0.35415831866543424,0.2516080298962922

As you can see, MAE and MSE are worse for the finetuned checkpoint than for the default model. That shouldn't be the case.
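
For readers unfamiliar with the column names: MAE[0.5] and MSE[mean] appear to correspond to the absolute error against the per-step median of the sample forecasts and the squared error against their per-step mean. The sketch below illustrates that reading in plain Python with made-up numbers. It is not the GluonTS implementation, and the function names are hypothetical.

```python
from statistics import mean, median
from typing import List, Sequence, Tuple


def point_forecasts(samples: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Collapse sample paths (num_samples x horizon) into median and mean paths."""
    per_step = list(zip(*samples))  # transpose to horizon x num_samples
    return [median(s) for s in per_step], [mean(s) for s in per_step]


def mae_median(samples: Sequence[Sequence[float]], target: List[float]) -> float:
    """Mean absolute error of the median forecast."""
    med, _ = point_forecasts(samples)
    return mean(abs(y - f) for y, f in zip(target, med))


def mse_mean(samples: Sequence[Sequence[float]], target: List[float]) -> float:
    """Mean squared error of the mean forecast."""
    _, avg = point_forecasts(samples)
    return mean((y - f) ** 2 for y, f in zip(target, avg))


# Toy data: three sample paths over a horizon of two steps (illustrative only).
samples = [[1.0, 2.0], [2.0, 4.0], [6.0, 6.0]]
target = [2.0, 5.0]
print(mae_median(samples, target), mse_mean(samples, target))
```

Because the data were standardized before conversion, both metrics in the tables above are in units of standard deviations of the training slice rather than in the original units.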

Environment description
Operating system: Ubuntu 22.04.4 LTS
Python version: 3.10.14
CUDA version: 12.4
PyTorch version: 2.4.0
HuggingFace transformers version: 4.44.2
HuggingFace accelerate version: 0.33.0


abdulfatir · 2024-11-22T16:47:10Z

Thank you for opening this, although this is not exactly a bug. When it comes to fine-tuning, there is no one-size-fits-all set of hyperparameters that works for all datasets. Therefore, it is entirely possible that specific settings, such as a large learning_rate or max_steps, cause the model's performance to worsen upon fine-tuning (e.g., due to over-fitting). I would encourage you to try different fine-tuning hyperparameters.
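
One way to act on this suggestion is a small sweep over learning_rate and max_steps. The sketch below only builds the command lines and does not launch anything. It assumes that train.py accepts --config plus CLI overrides such as --learning-rate, --max-steps, and --output-dir, as the fine-tuning instructions in the repository README suggest. The grid values and the config path are placeholders, not recommendations.

```python
import itertools
import shlex
from typing import List

# Hypothetical grid; the values are illustrative, not tuned recommendations.
LEARNING_RATES = [1e-3, 1e-4, 1e-5]
MAX_STEPS = [1_000, 5_000]


def build_commands(config_path: str, script: str = "scripts/training/train.py") -> List[str]:
    """Build one training command per (learning_rate, max_steps) pair."""
    commands = []
    for lr, steps in itertools.product(LEARNING_RATES, MAX_STEPS):
        args = [
            "python", script,
            "--config", config_path,
            # Assumed CLI overrides; check `python train.py --help` before use.
            "--learning-rate", str(lr),
            "--max-steps", str(steps),
            # Separate output directory per run so checkpoints do not collide.
            "--output-dir", f"./output/etth1/lr{lr}_steps{steps}",
        ]
        commands.append(shlex.join(args))
    return commands


commands = build_commands("chronos-t5-small.yaml")
for cmd in commands:
    print(cmd)
```

Each resulting checkpoint can then be scored with the same evaluation config and compared against the zeroshot numbers.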

aleksmaksimovic · 2024-11-28T14:01:55Z

Thanks, drastically reducing the initial learning rate solves the problem: the finetuned model now outperforms the zeroshot model on both datasets.

aleksmaksimovic added the bug label Nov 22, 2024

lostella added the question label and removed the bug label Nov 25, 2024

aleksmaksimovic closed this as completed Nov 28, 2024
