
One logstash input for Azure blob is slower than other #214

Open
arunp-motorq opened this issue Jan 29, 2020 · 1 comment

@arunp-motorq

arunp-motorq commented Jan 29, 2020

Issue with Logstash input for Azure blob

I have one instance of Logstash reading data from blob storage. Although the logs are in the same container, there are two major folder structures for logs from two different processes. The blob structure is something like this:
Blob

  • Container
    • Process1/Year/Month/Day/Hour/LogFile
    • Process2/Year/Month/Day/Hour/LogFile

My Logstash blob config looks like this:

```
azureblob
{
  storage_account_name => 'folder1'
  storage_access_key => ''
  container => 'logs'
  id => 'jobs1'
  blob_list_page_size => 150
  file_chunk_size_bytes => 8088608
  registry_create_policy => 'resume'
  path_filters => 'folder1/2020 /**/*.csv'
}

azureblob
{
  storage_account_name => 'folder2'
  storage_access_key => ''
  container => 'logs'
  id => 'jobs1'
  blob_list_page_size => 150
  file_chunk_size_bytes => 8088608
  registry_create_policy => 'resume'
  path_filters => 'folder2/2020 /**/*.csv'
}
```

The heap is around 3 GB and CPU usage is at 70-80%.

I run only one instance of Logstash. The issue is that logs from folder2 are processed much faster than logs from folder1; folder2 is days ahead of folder1. (This is a catch-up scenario: I am reading logs from the start of this month.) How do I debug this?
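
One way to quantify the backlog under each prefix is to count the blobs directly. Below is a minimal sketch using the azure-storage-blob (v12) Python SDK; the connection-string environment variable is an assumption, and the container name and prefixes are taken from the config above and may need adjusting:

```python
# Minimal sketch: count the blobs and bytes sitting under each prefix.
# Assumes the azure-storage-blob (v12) package and a connection string in the
# AZURE_STORAGE_CONNECTION_STRING environment variable (an assumption; adjust
# to your setup).
import os
from azure.storage.blob import ContainerClient

conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container = ContainerClient.from_connection_string(conn_str, container_name="logs")

for prefix in ("folder1/2020", "folder2/2020"):
    blobs = list(container.list_blobs(name_starts_with=prefix))
    total_bytes = sum(b.size for b in blobs)
    newest = max((b.last_modified for b in blobs), default=None)
    print(f"{prefix}: {len(blobs)} blobs, {total_bytes} bytes, newest: {newest}")
```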

@pinochioze

Hi Arun, I think your concern is due to the number of blobs in each folder (you can get this number using the CLI or Microsoft Azure Storage Explorer). The procedure of this plugin is:

  1. Get the list of all the blobs in the container.
  2. Compare the list with the patterns in "path_filters" and keep the blobs that match.
  3. Pick one blob from the list of matched blobs, based on the generation algorithm and the blob's offset. This means many blobs in the matched list have to wait for the next loop of the process (see the sketch below).
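
A simplified sketch of that loop may make the effect clearer (illustration only, not the plugin's actual Ruby code; `list_container_blobs`, `registry`, and `process_blob` are hypothetical stand-ins for the plugin's internals):

```python
# Illustration of the per-pass behaviour described above; NOT the plugin's
# actual implementation. list_container_blobs, registry and process_blob are
# hypothetical stand-ins.
import fnmatch
import time

def run_azureblob_input(list_container_blobs, path_filter, registry, process_blob,
                        interval=30):
    while True:
        # 1. Get the list of all the blobs in the container.
        all_blobs = list_container_blobs()
        # 2. Keep only the blobs whose names match path_filters.
        matched = [b for b in all_blobs if fnmatch.fnmatch(b.name, path_filter)]
        # 3. Pick a single matched blob (using the registry's generation/offset
        #    bookkeeping) and read it; every other matched blob waits for the
        #    next pass.
        candidate = next((b for b in matched if registry.has_unread_data(b)), None)
        if candidate is not None:
            process_blob(candidate, offset=registry.offset_for(candidate))
        time.sleep(interval)
```

Under this model, an input whose filter matches far more (or larger) blobs will naturally fall behind the other input, which would explain folder1 lagging folder2.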
