How to evaluate the models' performance through metrics such as MASE? #75

nate-gillman · 2024-05-18T00:18:21Z

Hi Chronos team--

Howdy!! I'm a PhD student in the States and I'm using this as a baseline for my research... thanks for building this model :)

I'm currently implementing evaluation metrics like in the paper to work for the Chronos model, and I'm starting with MASE. One thing that's unclear to me at the moment: in Appendix D in the arXiv preprint, the authors say that the MASE computation involves some seasonality parameter $S$ from in the seasonal naive model.

What seasonality parameter should I use to obtain metrics similar to how the authors did it in the paper? In other scenarios, I've seen that some people try to automatically compute a seasonality S for each dataset; I've also seen people use information about the original dataset to select $S$ (e.g. if it's a taxi dataset with hourly counts, then choosing $S=7*24$ would be a reasonable heuristic); and I've seen other people just use $S=1$, but that to me seems like a "seasonal very naive model".

Thanks in advance for your help!!

Cheers
Nate

abdulfatir · 2024-05-20T12:48:52Z

Thanks for your interest @nate-gillman!

We used gluonts for computing metrics. Here's an example for the m4_hourly dataset. You can follow this example for your own custom dataset.

Important

While many datasets in GluonTS have the same name as the ones used in the paper, they may be different from the evaluation in the paper in crucial aspects such as prediction length and number of rolls.

You will need to install: pip install gluonts pandas~=2.1.4 (please ensure that the pandas version is 2.1.x)

import numpy as np
import torch
from gluonts.dataset.repository import get_dataset
from gluonts.dataset.split import split
from gluonts.ev.metrics import MASE, MeanWeightedSumQuantileLoss
from gluonts.itertools import batcher
from gluonts.model.evaluation import evaluate_forecasts
from gluonts.model.forecast import SampleForecast
from tqdm.auto import tqdm

from chronos import ChronosPipeline

# Load dataset
batch_size = 32
num_samples = 20
dataset = get_dataset("m4_hourly")
prediction_length = dataset.metadata.prediction_length

# Load Chronos
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
)

# Split dataset for evaluation
_, test_template = split(dataset.test, offset=-prediction_length)
test_data = test_template.generate_instances(prediction_length)

# Generate forecast samples
forecast_samples = []
for batch in tqdm(batcher(test_data.input, batch_size=32)):
    context = [torch.tensor(entry["target"]) for entry in batch]
    forecast_samples.append(
        pipeline.predict(
            context,
            prediction_length=prediction_length,
            num_samples=num_samples,
        ).numpy()
    )
forecast_samples = np.concatenate(forecast_samples)

# Convert forecast samples into gluonts SampleForecast objects
sample_forecasts = []
for item, ts in zip(forecast_samples, test_data.input):
    forecast_start_date = ts["start"] + len(ts["target"])
    sample_forecasts.append(
        SampleForecast(samples=item, start_date=forecast_start_date)
    )

# Evaluate
metrics_df = evaluate_forecasts(
    sample_forecasts,
    test_data=test_data,
    metrics=[
        MASE(),
        MeanWeightedSumQuantileLoss(np.arange(0.1, 1.0, 0.1)),
    ],
)
metrics_df

	MASE[0.5]	mean_weighted_sum_quantile_loss
None	0.857724	0.026105

nate-gillman · 2024-05-21T00:00:17Z

Thanks a ton for your quick reply!! That answers my question.

abdulfatir · 2024-05-21T08:12:39Z

Keeping this open as a FAQ.

yeongnamtan · 2024-06-04T06:36:24Z

I have a dataframe that I tried to convert to a Gluonts dataset so I can use the above code but I keep getting errors. Appreciate any advice....

import pandas as pd
from gluonts.dataset.common import ListDataset
from gluonts.dataset.field_names import FieldName

# Convert DataFrame into GluonTS ListDataset
freq = '10Min'  # Frequency of the data (daily in this case)
start_date = df.index[0]  # Start date of the series

# Convert start date to pandas Timestamp
start_date = pd.Timestamp(df.index[0])

# Convert target values to one-dimensional numpy array
target_values = df['DO'].values.astype(np.float32)


# Print debugging information
print("start_date:", start_date)
print("freq:", freq)
print("df['DO'].values:", df['DO'].values)
print("df.head():\n", df.head())


# Create ListDataset
dataset = ListDataset(
    [{"start": df.index[0], "target": df['DO'].values}],
    freq=freq
)


# Check the type of dataset
if type(dataset) == ListDataset:
    print("dataset is a GluonTS ListDataset object")
else:
    print("dataset is not a GluonTS ListDataset object")

OUTPUT:

start_date: 2023-11-23 10:10:00
freq: 10Min
df['DO'].values: [6.47 6.47 6.49 ... 6.5  6.5  5.72]
df.head():
                        DO
Time                     
2023-11-23 10:10:00  6.47
2023-11-23 10:20:00  6.47
2023-11-23 10:30:00  6.49
2023-11-23 10:40:00  6.29
2023-11-23 10:50:00  6.29
dataset is not a GluonTS ListDataset object

print("Type of start_date:", type(start_date))
print("Type of target_values:", type(target_values))
print("Shape of target_values:", target_values.shape)

OUTPUT:

Type of start_date: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Type of target_values: <class 'numpy.ndarray'>
Shape of target_values: (22800,)

lostella · 2024-06-04T08:07:31Z

@yeongnamtan there's no need to use ListDataset, you can just use a list:

dataset = [
    {
        "start": pd.Period(df.index[0], freq=freq),
        "target": df["DO"].values
    }
]

then you can apply split(dataset, offset=...) as in the example above.

yeongnamtan · 2024-06-04T09:08:19Z

@lostella I tried what you suggested and get this error....

AttributeError Traceback (most recent call last)
in <cell line: 26>()
24
25 # Split dataset for evaluation
---> 26 _, test_template = split(dataset.test, offset=-prediction_length)
27 test_data = test_template.generate_instances(prediction_length)
28

AttributeError: 'list' object has no attribute 'test'

My code below...

Load dataset

batch_size = 32
num_samples = 20
freq = '10Min'

dataset = [
{
"start": pd.Period(df.index[0], freq=freq),
"target": df["DO"].values
}
]

Split dataset for evaluation

_, test_template = split(dataset.test, offset=-prediction_length)
test_data = test_template.generate_instances(prediction_length)

Generate forecast samples

forecast_samples = []
for batch in tqdm(batcher(test_data.input, batch_size=32)):
context = [torch.tensor(entry["target"]) for entry in batch]
forecast_samples.append(
pipeline.predict(
context,
prediction_length=prediction_length,
num_samples=num_samples,
).numpy()
)
forecast_samples = np.concatenate(forecast_samples)

Convert forecast samples into gluonts SampleForecast objects

sample_forecasts = []
for item, ts in zip(forecast_samples, test_data.input):
forecast_start_date = ts["start"] + len(ts["target"])
sample_forecasts.append(
SampleForecast(samples=item, start_date=forecast_start_date)
)

Evaluate

metrics_df = evaluate_forecasts(
sample_forecasts,
test_data=test_data,
metrics=[
MASE(),
MeanWeightedSumQuantileLoss(np.arange(0.1, 1.0, 0.1)),
],
)
metrics_df

lostella · 2024-06-04T11:25:39Z

@yeongnamtan you should do

split(dataset, offset=...)

and not

split(dataset.test, offset=...)

Also, if you wrap your code snippets between triple-backticks, the code will display with a better format, see https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks

yeongnamtan · 2024-06-05T03:35:15Z

thank you very much for your help. It is working now.

kushalkodn-db · 2024-06-17T16:53:52Z

Hi, I ran the exact same code twice that @abdulfatir provided above for m4_hourly and my results differed quite a bit from the paper's results and what was given above. Is there a reason for this difference? The paper reported MASE=0.881 and WQL=0.026 for m4_hourly Benchmark I.

MASE[0.5] = 0.738313
mean_weighted_sum_quantile_loss = 0.024236

MASE[0.5] = 0.734823
mean_weighted_sum_quantile_loss = 0.023885

abdulfatir · 2024-06-17T18:09:40Z

Which model are you using and what's the pandas and gluonts version?

kushalkodn-db · 2024-06-17T19:29:39Z

Which model are you using and what's the pandas and gluonts version?

model: amazon/chronos-t5-small
gluonts: 0.15.1
pandas: 2.1.4

abdulfatir · 2024-06-18T07:30:43Z

Ah, okay. This discrepancy may be due to an off-by-one issue we recently fixed in the inference code (See #73). Fixing this improved our results. We have not updated the paper with the new results yet.

kushalkodn-db · 2024-06-18T16:24:04Z

Got it, totally valid! Appreciate it :)
Any idea when the new results might be released?

abdulfatir · 2024-06-27T19:09:41Z

Update: We have just open-sourced the datasets used in the paper (thanks @shchur!). Please check the updated README. We have also released an evaluation script and backtest configs to compute the WQL and MASE numbers as reported in the paper. Please follow the instructions in this README to evaluate on the in-domain and zero-shot benchmarks.

nate-gillman changed the title ~~Evaluating time series using MASE?~~ Evaluating model performance using MASE? May 18, 2024

lostella added the question Further information is requested label May 18, 2024

nate-gillman closed this as completed May 21, 2024

abdulfatir added the FAQ Frequently asked question label May 21, 2024

abdulfatir changed the title ~~Evaluating model performance using MASE?~~ How to evaluate the models' performance through metrics such as MASE? May 21, 2024

abdulfatir reopened this May 21, 2024

lostella removed the question Further information is requested label May 21, 2024

lostella mentioned this issue Jun 3, 2024

Performance Metric for Chronos #96

Closed

liu-jc mentioned this issue Jun 9, 2024

How to reproduce the results as shown in the paper? #102

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to evaluate the models' performance through metrics such as MASE? #75

How to evaluate the models' performance through metrics such as MASE? #75

nate-gillman commented May 18, 2024 •

edited

Loading

abdulfatir commented May 20, 2024 •

edited

Loading

nate-gillman commented May 21, 2024

abdulfatir commented May 21, 2024

yeongnamtan commented Jun 4, 2024 •

edited by lostella

Loading

lostella commented Jun 4, 2024

yeongnamtan commented Jun 4, 2024

lostella commented Jun 4, 2024

yeongnamtan commented Jun 5, 2024

kushalkodn-db commented Jun 17, 2024 •

edited

Loading

abdulfatir commented Jun 17, 2024

kushalkodn-db commented Jun 17, 2024 •

edited

Loading

abdulfatir commented Jun 18, 2024

kushalkodn-db commented Jun 18, 2024

abdulfatir commented Jun 27, 2024

How to evaluate the models' performance through metrics such as MASE? #75

How to evaluate the models' performance through metrics such as MASE? #75

Comments

nate-gillman commented May 18, 2024 • edited Loading

abdulfatir commented May 20, 2024 • edited Loading

nate-gillman commented May 21, 2024

abdulfatir commented May 21, 2024

yeongnamtan commented Jun 4, 2024 • edited by lostella Loading

lostella commented Jun 4, 2024

yeongnamtan commented Jun 4, 2024

Load dataset

Split dataset for evaluation

Generate forecast samples

Convert forecast samples into gluonts SampleForecast objects

Evaluate

lostella commented Jun 4, 2024

yeongnamtan commented Jun 5, 2024

kushalkodn-db commented Jun 17, 2024 • edited Loading

abdulfatir commented Jun 17, 2024

kushalkodn-db commented Jun 17, 2024 • edited Loading

abdulfatir commented Jun 18, 2024

kushalkodn-db commented Jun 18, 2024

abdulfatir commented Jun 27, 2024

nate-gillman commented May 18, 2024 •

edited

Loading

abdulfatir commented May 20, 2024 •

edited

Loading

yeongnamtan commented Jun 4, 2024 •

edited by lostella

Loading

kushalkodn-db commented Jun 17, 2024 •

edited

Loading

kushalkodn-db commented Jun 17, 2024 •

edited

Loading