inconsistency in hourly vs daily aggregation #99

Open
priamai opened this issue Sep 22, 2021 · 7 comments
Labels
bug Something isn't working

Comments

@priamai

priamai commented Sep 22, 2021

Hi there,
Sorry to bother you, but I noticed something weird with the aggregation.

import pandas as pd
from pandagg.discovery import discover
from pandagg import Search

# es_client is assumed to be an already configured Elasticsearch client

indices = discover(es_client, "filebeat-*")
filebeat = indices['filebeat-7.14.0-2021.08.24-000001']
search = filebeat.search().filter().size(1).groupby(
    'alerts_day', 'date_histogram',
    fixed_interval='1d', field='@timestamp', format="yyyy-MM-dd",
)

response = search.execute()
day_df = response.aggregations.to_dataframe()
day_df.reset_index(inplace=True)
day_df.index = pd.to_datetime(day_df['alerts_day'])
day_df.drop(['alerts_day'], axis=1,inplace=True)

search = filebeat.search().size(1).groupby(
    'alerts_day', 'date_histogram',
    fixed_interval='1h', field='@timestamp', format="yyyy-MM-dd hh:mm:ss",
)
response = search.execute()
hour_df = response.aggregations.to_dataframe()
hour_df.reset_index(inplace=True)
hour_df.index = pd.to_datetime(hour_df['alerts_day'])
hour_df.drop(['alerts_day'], axis=1,inplace=True)

That all runs fine. Then I do:

day_df.doc_count.sum()
hour_df.doc_count.sum()

This produces 7775 vs. 1722 documents. The first number (daily) is correct; the second (hourly) is not.
I am sure there is some detail I am missing, but I can't figure it out!

@alk-lbinet
Contributor

Hi @priamai,
To determine whether the issue comes from pandagg, I would need you to provide:

  • the search query, search.to_dict(), to check whether there is an error while building the search query
  • the response aggregation data, response.aggregations.data, to check whether there is an error while parsing the aggregation result into tabular format

thx
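
For the second item, one way to take the tabular conversion out of the picture is to sum doc_count straight from the raw aggregation payload and compare that with the data frame total. This is only a minimal sketch: the payload below is made up for illustration, and it assumes response.aggregations.data has the usual Elasticsearch date_histogram layout (buckets, key_as_string, doc_count). The helper names are hypothetical.

```python
from collections import Counter

def raw_bucket_total(agg_data, agg_name):
    """Sum doc_count over the buckets of one aggregation in a raw response."""
    return sum(b["doc_count"] for b in agg_data[agg_name]["buckets"])

def duplicated_keys(agg_data, agg_name):
    """Return key_as_string values that appear on more than one bucket."""
    counts = Counter(b["key_as_string"] for b in agg_data[agg_name]["buckets"])
    return sorted(k for k, n in counts.items() if n > 1)

# Hypothetical payload, shaped like an Elasticsearch date_histogram response.
# Two buckets with different epoch keys share the same formatted label.
sample = {
    "alerts_day": {
        "buckets": [
            {"key": 1629766800000, "key_as_string": "2021-08-24 01:00:00", "doc_count": 5},
            {"key": 1629810000000, "key_as_string": "2021-08-24 01:00:00", "doc_count": 7},
            {"key": 1629813600000, "key_as_string": "2021-08-24 02:00:00", "doc_count": 3},
        ]
    }
}

raw_total = raw_bucket_total(sample, "alerts_day")
dupes = duplicated_keys(sample, "alerts_day")
print(raw_total, dupes)
```

If the raw total matches the daily figure while the data frame total does not, the loss happens after the query, and repeated key_as_string labels would be the first thing to look at.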

alk-lbinet added the bug label on Sep 23, 2021
@robomotic

Yes one sec checking!

@robomotic

The first query, search.to_dict():

{'query': {'bool': {'filter': [{'term': {'log.file.path': {'value': 'first-org-conf-2015-eve.json'}}}]}}, 'aggs': {'alerts_day': {'date_histogram': {'field': '@timestamp', 'format': 'yyyy-MM-dd', 'fixed_interval': '1d'}}}, 'size': 1}

The second query, search.to_dict():

{'query': {'bool': {'filter': [{'term': {'log.file.path': {'value': 'first-org-conf-2015-eve.json'}}}]}}, 'aggs': {'alerts_day': {'date_histogram': {'field': '@timestamp', 'format': 'yyyy-MM-dd hh:mm:ss', 'fixed_interval': '1h'}}}, 'size': 1}
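
To make the comparison easier to read, here is a small sketch that diffs the two dicts above leaf by leaf. The diff_dicts helper is hypothetical (it is not part of pandagg); the two dicts are copied from the outputs above.

```python
def diff_dicts(a, b, path=""):
    """Yield (path, value_a, value_b) for every leaf where two nested dicts differ."""
    for key in sorted(set(a) | set(b)):
        here = f"{path}.{key}" if path else str(key)
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            yield from diff_dicts(va, vb, here)
        elif va != vb:
            yield here, va, vb

query_filter = {'bool': {'filter': [
    {'term': {'log.file.path': {'value': 'first-org-conf-2015-eve.json'}}}
]}}

daily = {
    'query': query_filter,
    'aggs': {'alerts_day': {'date_histogram': {
        'field': '@timestamp', 'format': 'yyyy-MM-dd', 'fixed_interval': '1d'}}},
    'size': 1,
}
hourly = {
    'query': query_filter,
    'aggs': {'alerts_day': {'date_histogram': {
        'field': '@timestamp', 'format': 'yyyy-MM-dd hh:mm:ss', 'fixed_interval': '1h'}}},
    'size': 1,
}

differences = list(diff_dicts(daily, hourly))
for path, left, right in differences:
    print(path, repr(left), repr(right))
```

Only fixed_interval and format differ, so the filter and the target field can be ruled out as the source of the mismatch.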

@robomotic

The second query data frame:

[image: screenshot of the second query's data frame]

Raw shape:

raw.shape
(476, 1)

@robomotic

One important detail is that those timestamps are in UTC. I am wondering whether that is somehow interfering with the conversion?
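
On the UTC question, a quick standalone check (no Elasticsearch involved, made-up timestamps): shifting the time zone moves events between buckets, but each event still lands in exactly one bucket, so the grand total should stay the same. This is a sketch of that reasoning, not a statement about what pandagg does internally.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def daily_counts(timestamps, tz):
    """Count events per calendar day as seen in the given time zone."""
    return Counter(ts.astimezone(tz).date() for ts in timestamps)

# Made-up events: one every 5 hours, starting 2021-08-24 00:00 UTC.
start = datetime(2021, 8, 24, tzinfo=timezone.utc)
events = [start + timedelta(hours=5 * i) for i in range(20)]

utc_counts = daily_counts(events, timezone.utc)
shifted_counts = daily_counts(events, timezone(timedelta(hours=-7)))

# Per-day distribution differs, totals do not.
print(utc_counts != shifted_counts)
print(sum(utc_counts.values()), sum(shifted_counts.values()))
```

So a time zone offset alone would change which day or hour a document falls into, but would not be expected to make documents disappear from the sum.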

@cwhaley8288

I realize this is an older issue, but I had a similar problem. I think it is the syntax for "hour" in the format string (hh vs. HH). This appears to fix it:

.groupby('stat_hour', 'date_histogram', fixed_interval='1H',field='@timestamp',format="yyyy-MM-dd HH:mm:ss")\
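
Some background on why the format can matter: in Java-style date patterns, hh is the 12-hour clock hour and HH is the 24-hour one. With hh and no AM/PM marker, 01:00 and 13:00 produce the same label. If that label is then used as a key or index downstream (an assumption, not something confirmed in this thread), buckets sharing a label can overwrite each other. A rough Python analogue with made-up counts, where strftime's %I plays the role of hh and %H the role of HH:

```python
from datetime import datetime, timedelta

# 24 hourly buckets for one day, each holding 10 documents (made-up numbers).
start = datetime(2021, 8, 24)
buckets = [(start + timedelta(hours=h), 10) for h in range(24)]

def total_by_label(buckets, fmt):
    """Key buckets by formatted label; a later bucket overwrites an equal label."""
    by_label = {}
    for ts, doc_count in buckets:
        by_label[ts.strftime(fmt)] = doc_count
    return len(by_label), sum(by_label.values())

twelve_hour = total_by_label(buckets, "%Y-%m-%d %I:%M:%S")
twenty_four_hour = total_by_label(buckets, "%Y-%m-%d %H:%M:%S")

print(twelve_hour)       # distinct labels and summed count with 12-hour labels
print(twenty_four_hour)  # same, with 24-hour labels
```

In this toy case the 12-hour labels keep only half of the buckets. Real data would lose a different share, depending on how documents are spread between morning and afternoon hours.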

@priamai
Author

priamai commented Mar 1, 2022

You are bang on; I am pretty sure it is indeed the format. I will test it and report back here.


4 participants