Finetuning Chronos on ETTh datasets yields poor performance #209

Closed
aleksmaksimovic opened this issue Nov 22, 2024 · 2 comments
Labels
question Further information is requested

Comments

@aleksmaksimovic

Describe the bug

When finetuning chronos-t5-small on the ETTh1 and ETTh2 datasets (each separately), performance drops compared to the zeroshot performance. Could this be because prediction_length is recommended to be <= 64?

Expected behavior

If chronos-t5-small is finetuned on a single dataset, say ETTh1, the finetuned model should achieve better MAE and MSE on that dataset than the zeroshot model.

How to reproduce

This example is focused on the ETTh1 dataset. For the ETTh2 dataset, the procedure is identical. Please note that the finetuning evaluation for my experiments is done individually for both datasets, so the model is not finetuned and evaluated on both datasets at once.

  1. Standardize the ETTh dataset and convert it to Arrow format:
from pathlib import Path
from typing import List, Union

import numpy as np
import pandas as pd
from gluonts.dataset.arrow import ArrowWriter
from sklearn.preprocessing import StandardScaler


def convert_to_arrow(
    path: Union[str, Path],
    time_series: Union[List[np.ndarray], np.ndarray],
    compression: str = "lz4",
):
    """
    Store a given set of series into Arrow format at the specified path.

    Input data can be either a list of 1D numpy arrays, or a single 2D
    numpy array of shape (num_series, time_length).
    """
    assert isinstance(time_series, list) or (
        isinstance(time_series, np.ndarray) and
        time_series.ndim == 2
    )

    # Set an arbitrary start time
    start = np.datetime64("2016-07-01 00:00:00", "s")

    dataset = [
        {"start": start, "target": ts} for ts in time_series
    ]

    ArrowWriter(compression=compression).write_to_file(
        dataset,
        path=path,
    )

if __name__ == "__main__":
    # Load and preprocess the dataset
    df = pd.read_csv('/path/to/dataset')

    # Keep only the first 12194 rows
    df = df[0:12194].copy()

    # Ensure time column is in datetime format
    time_column = 'date'
    df[time_column] = pd.to_datetime(df[time_column])

    # Define feature columns
    feature_columns = ['HUFL', 'HULL', 'MUFL', 'MULL', 'LUFL', 'LULL', 'OT']

    # Standardize the feature columns (statistics come from the retained rows only)
    scaler = StandardScaler()
    df[feature_columns] = scaler.fit_transform(df[feature_columns])
    df['id'] = 0

    # Create the structured DataFrame
    structured_df = df[['id', time_column] + feature_columns].rename(columns={time_column: 'timestamp'})

    # Extract one univariate series per feature column, plus the start times
    time_series = [structured_df[col].to_numpy() for col in feature_columns]
    # Note: start_times is not used by convert_to_arrow, which sets an arbitrary start time
    start_times = [np.datetime64(structured_df['timestamp'].iloc[0], 's')] * len(feature_columns)

    # Write the series to Arrow format (the output path is a placeholder)
    convert_to_arrow('/path/to/ETTh1_train.arrow', time_series=time_series)
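
The standardization step above can be sketched without scikit-learn to make explicit which statistics are used. This is a minimal illustration rather than the code from the issue: the helper name `standardize_with_train_stats` and the toy values are hypothetical. It assumes a population standard deviation (ddof=0), as StandardScaler uses, and it computes the statistics on the training rows only before applying them to the whole series.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def standardize_with_train_stats(
    series: Sequence[float], train_length: int
) -> List[float]:
    """Z-score a series using mean/std computed on the first `train_length` points."""
    train = series[:train_length]
    mean = fmean(train)
    std = pstdev(train)  # population std (ddof=0)
    if std == 0.0:
        std = 1.0  # constant training split: avoid division by zero
    return [(x - mean) / std for x in series]


# Toy example (hypothetical values): the first 4 points act as the training split.
values = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
scaled = standardize_with_train_stats(values, train_length=4)
print(scaled)
```

If the evaluation file is standardized separately with statistics from the full series, the training and test data end up on slightly different scales, so it may be worth checking that both files were produced with the same statistics.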
  2. Finetune the model

Use the training pipeline implemented in chronos-forecasting/scripts/training/train.py, as shown in the tutorial, with the following config (chronos-t5-small.yaml):


training_data_paths:
- "path/to/ETTh1_train.arrow"
probability:
- 1.0
context_length: 512
prediction_length: 96
min_past: 60
max_steps: 200_000
save_steps: 20_000
log_steps: 500
per_device_train_batch_size: 16
learning_rate: 0.001
optim: adamw_torch_fused
num_samples: 1
shuffle_buffer_length: 50_000
gradient_accumulation_steps: 1
model_id: google/t5-efficient-small
model_type: seq2seq
random_init: true
tie_embeddings: true
output_dir: ./output/etth1/
tf32: true
torch_compile: false
tokenizer_class: "MeanScaleUniformBins"
tokenizer_kwargs:
  low_limit: -15.0
  high_limit: 15.0
n_tokens: 4096
lr_scheduler_type: linear
warmup_ratio: 0.0
dataloader_num_workers: 1
max_missing_prop: 0.9
use_eos_token: true
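
To see which learning rates this config implies over the course of training, the schedule can be worked out by hand. The function below is an illustration, not code from train.py: it assumes that `lr_scheduler_type: linear` with `warmup_ratio: 0.0` decays linearly from `learning_rate` to zero at `max_steps`, which is how the linear schedule in HuggingFace transformers behaves. The name `linear_lr` is hypothetical.

```python
import math


def linear_lr(step: int, base_lr: float, max_steps: int, warmup_ratio: float = 0.0) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


# Values taken from the config above.
base_lr, max_steps = 0.001, 200_000
for step in (0, 20_000, 100_000, 180_000, 200_000):
    print(step, linear_lr(step, base_lr, max_steps))
```

Under this assumption there is no warmup at all, and the learning rate is still about half its initial value after 100,000 steps.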

  3. Evaluate the finetuned model at the final checkpoint using the evaluation pipeline implemented in chronos-forecasting/scripts/evaluation/evaluate.py, with the following modifications to the load_and_split_dataset() function:
def load_and_split_dataset(backtest_config: dict):
    hf_repo = backtest_config["hf_repo"]  # unused after the modification
    dataset_name = backtest_config["name"]  # unused after the modification
    offset = backtest_config["offset"]
    prediction_length = backtest_config["prediction_length"]
    num_rolls = backtest_config["num_rolls"]

    # Load the local Arrow file instead of the dataset from the HF repo
    # (Dataset is datasets.Dataset, i.e. `from datasets import Dataset`)
    ds = Dataset.from_file("/path/to/ETTh1.arrow")
    ds.set_format("numpy")

    gts_dataset = to_gluonts_univariate(ds)

    # Split dataset for evaluation
    _, test_template = split(gts_dataset, offset=offset)
    test_data = test_template.generate_instances(prediction_length, windows=num_rolls, distance=1)

    return test_data


and the following metrics:

metrics = (
    evaluate_forecasts(
        sample_forecasts,
        test_data=test_data,
        metrics=[
            MAE(),
            MSE(),
        ],
        batch_size=5000,
    )
    .reset_index(drop=True)
    .to_dict(orient="records")
)
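
With their default settings, these two metrics are reported as MAE[0.5] and MSE[mean] (the column names in the results further down): MAE is computed on the median of the sample paths and MSE on their mean. The sketch below illustrates that distinction for a single series in plain Python. It is not the GluonTS implementation, which also aggregates over all series and windows, and the function names and toy numbers are hypothetical.

```python
from statistics import fmean, median
from typing import List, Sequence, Tuple


def point_forecasts(
    samples: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Reduce sample paths (num_samples x horizon) to per-step median and mean."""
    horizon = len(samples[0])
    medians = [median([path[t] for path in samples]) for t in range(horizon)]
    means = [fmean([path[t] for path in samples]) for t in range(horizon)]
    return medians, means


def mae(target: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute error over the forecast horizon."""
    return fmean([abs(y - f) for y, f in zip(target, forecast)])


def mse(target: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean squared error over the forecast horizon."""
    return fmean([(y - f) ** 2 for y, f in zip(target, forecast)])


# Toy data (hypothetical): 3 sample paths over a horizon of 2 steps.
samples = [[1.0, 2.0], [2.0, 4.0], [6.0, 6.0]]
target = [2.0, 5.0]

medians, means = point_forecasts(samples)
mae_median = mae(target, medians)  # MAE evaluated on the median forecast
mse_mean = mse(target, means)      # MSE evaluated on the mean forecast
print(mae_median, mse_mean)
```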

The evaluation is performed on the test section of the standardized ETTh1 dataset, hence the offset. For the evaluation pipeline, use the following config:

- name: ETTh
  hf_repo: autogluon/chronos_datasets_extra
  offset: -1742
  prediction_length: 96
  num_rolls: 1135
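
The offset and num_rolls values determine which parts of the series are scored. The sketch below lays out the forecast windows by hand, assuming that window i starts i * distance steps after the split point and spans prediction_length steps. This is an assumption about how the split is laid out, not GluonTS code. The series length of 17420 is also an assumption (the usual length of the hourly ETT files), and `label_windows` is a hypothetical helper.

```python
from typing import List, Tuple


def label_windows(
    series_length: int,
    offset: int,
    prediction_length: int,
    num_rolls: int,
    distance: int = 1,
) -> List[Tuple[int, int]]:
    """Return [start, end) index pairs of the forecast windows after a negative offset."""
    split_point = series_length + offset  # offset is negative: counted from the end
    windows = []
    for i in range(num_rolls):
        start = split_point + i * distance
        end = start + prediction_length
        if end > series_length:
            raise ValueError(f"window {i} runs past the end of the series")
        windows.append((start, end))
    return windows


# series_length is an assumed value; the other numbers come from the config above.
windows = label_windows(
    series_length=17_420, offset=-1742, prediction_length=96, num_rolls=1135
)
print(windows[0], windows[-1])

# Largest number of windows that fits behind the offset when distance=1
max_rolls = 1742 - 96 + 1
print(max_rolls)
```

With distance=1, consecutive windows share 95 of their 96 steps, so the 1135 windows are far from independent evaluation samples.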


  4. Compare the evaluation results of the finetuned model with those of the default chronos-t5-small model (which has not been trained on the ETTh1 dataset).

I get the following results:

Zeroshot ETTh1
dataset,model,MAE[0.5],MSE[mean]
ETTh,amazon/chronos-t5-small,0.5081954018184918,0.560689815315581

Zeroshot ETTh2
dataset,model,MAE[0.5],MSE[mean]
ETTh,amazon/chronos-t5-small,0.2625630043626757,0.1391419442914831

Finetuned ETTh1
dataset,model,MAE[0.5],MSE[mean]
ETTh,/path/to/checkpoint-final,0.7746078180721628,1.1865953634689008

Finetuned ETTh2
dataset,model,MAE[0.5],MSE[mean]
ETTh,/path/to/checkpoint-final,0.35415831866543424,0.2516080298962922

As you can see, MAE and MSE are worse for the finetuned checkpoint than for the default model. That shouldn't be the case.
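
To put the gap in perspective, the relative change can be computed directly from the four result rows above. The snippet is plain arithmetic on the reported numbers; the dictionary layout and the helper `relative_change` are only for illustration.

```python
# (MAE[0.5], MSE[mean]) exactly as reported above
zeroshot = {
    "ETTh1": (0.5081954018184918, 0.560689815315581),
    "ETTh2": (0.2625630043626757, 0.1391419442914831),
}
finetuned = {
    "ETTh1": (0.7746078180721628, 1.1865953634689008),
    "ETTh2": (0.35415831866543424, 0.2516080298962922),
}


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (positive means the error grew)."""
    return (after - before) / before


changes = {
    name: tuple(
        relative_change(zs, ft) for zs, ft in zip(zeroshot[name], finetuned[name])
    )
    for name in zeroshot
}
for name, (mae_change, mse_change) in changes.items():
    print(f"{name}: MAE {mae_change:+.1%}, MSE {mse_change:+.1%}")
```

After finetuning, MAE and MSE increase by roughly 52% and 112% on ETTh1, and by roughly 35% and 81% on ETTh2.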

Environment description
Operating system: Ubuntu 22.04.4 LTS
Python version: 3.10.14
CUDA version: 12.4
PyTorch version: 2.4.0
HuggingFace transformers version: 4.44.2
HuggingFace accelerate version: 0.33.0

aleksmaksimovic added the bug (Something isn't working) label on Nov 22, 2024
@abdulfatir
Contributor

Thank you for opening this, although this is not exactly a bug. When it comes to fine-tuning, there is no one-size-fits-all set of hyperparameters that works for all datasets. It is therefore entirely possible that specific settings, such as a large learning_rate or max_steps, make the model's performance worse after fine-tuning (e.g., due to over-fitting). I would encourage you to try different fine-tuning hyperparameters.
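
One way to gauge the over-fitting risk mentioned here is to compare how many training windows the posted config draws with how many distinct windows the data can offer. The estimate below is a rough sketch, not a description of the actual sampler in train.py: it assumes that every forecast start position with at least min_past points of history and a full prediction_length of future counts as one distinct window. It uses the 7 series of 12194 points from the preprocessing script, and `distinct_windows` is a hypothetical helper.

```python
def distinct_windows(series_length: int, min_past: int, prediction_length: int) -> int:
    """Forecast start positions with >= min_past history and a full prediction window."""
    return max(0, series_length - min_past - prediction_length + 1)


# Data size from the preprocessing script: 7 feature columns, 12194 rows each.
num_series = 7
series_length = 12_194
available = num_series * distinct_windows(
    series_length, min_past=60, prediction_length=96
)

# Windows drawn during training: max_steps * batch size * gradient accumulation.
drawn = 200_000 * 16 * 1

passes = drawn / available
print(available, drawn, round(passes, 1))
```

Under these assumptions the run revisits each window several dozen times, which is consistent with the over-fitting explanation, although it does not prove it.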

lostella added the question (Further information is requested) label and removed the bug (Something isn't working) label on Nov 25, 2024
@aleksmaksimovic
Author

aleksmaksimovic commented Nov 28, 2024

Thanks, drastically reducing the initial learning rate solves the problem: the finetuned model now performs better than the zeroshot model on both datasets.
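
For anyone repeating the experiment, a small sweep is one way to find a workable setting. The sketch below generates override sets for the training config above. The candidate learning rates and step counts are placeholders chosen for illustration, not the values used in this thread, and `make_overrides` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, List, Sequence


def make_overrides(
    learning_rates: Sequence[float],
    max_steps_options: Sequence[int],
    output_root: str,
) -> List[Dict[str, object]]:
    """Build one set of config overrides per (learning_rate, max_steps) pair."""
    overrides = []
    for lr, steps in product(learning_rates, max_steps_options):
        overrides.append(
            {
                "learning_rate": lr,
                "max_steps": steps,
                # Separate output directory per run so checkpoints do not collide
                "output_dir": f"{output_root}/lr{lr:g}_steps{steps}/",
            }
        )
    return overrides


# Placeholder grid (hypothetical values).
runs = make_overrides([1e-4, 1e-5], [1_000, 10_000], "./output/etth1")
for run in runs:
    print(run)
```

Each override set can then be merged into a copy of the YAML config and evaluated with the same offset and metrics, so that the runs remain comparable with the zeroshot baseline.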
