training not able to start #150

Open
langerma opened this issue Oct 5, 2019 · 6 comments

Comments

@langerma

langerma commented Oct 5, 2019

error:

Oct  5 18:51:11 monitor loudmld: INFO:root:job[a6eba68c-30a3-41ec-a404-e4f9b94372da] starting, nice=5

Oct  5 18:51:11 monitor loudmld: Process Process-7:
Oct  5 18:51:11 monitor loudmld: Traceback (most recent call last):
Oct  5 18:51:11 monitor loudmld: File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
Oct  5 18:51:11 monitor loudmld: self.run()
Oct  5 18:51:11 monitor loudmld: File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
Oct  5 18:51:11 monitor loudmld: self._target(*self._args, **self._kwargs)
Oct  5 18:51:11 monitor loudmld: File "/usr/local/lib64/python3.6/site-packages/pebble/pool/process.py", line 389, in worker_process
Oct  5 18:51:11 monitor loudmld: send_result(channel, Result(task.id, result))
Oct  5 18:51:11 monitor loudmld: File "/usr/local/lib64/python3.6/site-packages/pebble/common.py", line 183, in send_result
Oct  5 18:51:11 monitor loudmld: pipe.send(data)
Oct  5 18:51:11 monitor loudmld: File "/usr/local/lib64/python3.6/site-packages/pebble/pool/channel.py", line 98, in unix_send
Oct  5 18:51:11 monitor loudmld: return self.writer.send(obj)
Oct  5 18:51:11 monitor loudmld: File "/usr/lib64/python3.6/multiprocessing/connection.py", line 206, in send
Oct  5 18:51:11 monitor loudmld: self._send_bytes(_ForkingPickler.dumps(obj))
Oct  5 18:51:11 monitor loudmld: File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
Oct  5 18:51:11 monitor loudmld: cls(buf, protocol).dump(obj)
Oct  5 18:51:11 monitor loudmld: AttributeError: Can't pickle local object '_compile_scalar.<locals>.validate_value'

model:

{
    "bucket_interval": "1m",
    "default_bucket": "influx",
    "features": [
        {
            "anomaly_type": "low_high",
            "default": "previous",
            "field": "load1",
            "io": "io",
            "match_all": [
                {
                    "tag": "host",
                    "value": "monitor"
                }
            ],
            "measurement": "system",
            "metric": "mean",
            "name": "mean_load1"
        }
    ],
    "forecast": 5,
    "grace_period": 0,
    "interval": "60s",
    "max_evals": 10,
    "max_threshold": 0,
    "min_threshold": 0,
    "name": "telegraf_system_mean_load1_host_monitor_1m",
    "offset": "10s",
    "run": {
        "detect_anomalies": true,
        "save_prediction": true
    },
    "seasonality": {
        "daytime": true,
        "weekday": true
    },
    "span": 100,
    "type": "donut"
}

command:

train-model --from now-14d --to now telegraf_system_mean_load1_host_monitor_1m

running master
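
Note on the traceback above: the worker process fails while pickling its result to send it back through a pipe, and the object that cannot be pickled is a function defined inside another function (the name `_compile_scalar` looks like a helper from the voluptuous schema library, but that is an inference from the name only). The sketch below reproduces the general Python behaviour with a hypothetical `compile_scalar` stand-in; it is not Loud ML or voluptuous code.

```python
import functools
import operator
import pickle


def compile_scalar(expected):
    """Hypothetical stand-in: returns a closure, like a compiled validator."""
    def validate_value(value):
        if value != expected:
            raise ValueError("not a valid value")
        return value
    return validate_value


def try_pickle(obj):
    """Return None if obj pickles, otherwise the error message."""
    try:
        pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError) as exc:
        return str(exc)
    return None


# A nested function has no importable name, so pickle rejects it.
closure_error = try_pickle(compile_scalar(5))

# A partial over a module-level function is pickled by reference
# to that function plus its bound arguments, so it succeeds.
partial_error = try_pickle(functools.partial(operator.eq, 5))

print("closure:", closure_error)
print("partial:", partial_error)
```

Anything returned from a worker in a process pool has to survive this pickling step, so a result or exception that holds a reference to such a closure would fail in the same way.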

@regel
Owner

regel commented Oct 6, 2019

I could not reproduce. Can you run pip freeze and provide the output? Do you see any major differences compared to the file base/vendor/requirements.txt?

The problem could be the networkx, h5py, or numpy version. If you are in a virtualenv, I recommend running pip install -r base/vendor/requirements.txt to set up the dependencies.

You can also try the following:

  1. generate data (warning: this command will also clear the database)

loudml-wave --from now-3d --to now --shape sin --amplitude 10 --base 5 --clear --tags host:monitor --field load1 influx

  2. train the model:

train-model -f now-3d -t now -m1 telegraf_system_mean_load1_host_monitor_1m
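
The comparison between pip freeze output and a requirements file can be scripted. The following is a minimal sketch: `parse_pins` and `diff_pins` are hypothetical helper names, and the `required` text uses made-up placeholder pins (`0.0.0`, `examplepkg`), not the real contents of base/vendor/requirements.txt.

```python
import re


def parse_pins(text):
    """Map lowercased package name -> pinned version for 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        match = re.match(r"^([A-Za-z0-9_.\-]+)\s*={2,3}\s*(\S+)$", line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def diff_pins(installed, required):
    """Return {name: (installed_version_or_None, required_version)} for mismatches."""
    mismatches = {}
    for name, wanted in required.items():
        have = installed.get(name)
        if have != wanted:
            mismatches[name] = (have, wanted)
    return mismatches


# A few lines in the format printed by pip freeze.
frozen = """networkx==1.11
h5py==2.9.0
numpy==1.16.4
loudml===1.6.0-446132b4
"""

# PLACEHOLDER pins for illustration only (not the project's real requirements).
required = """networkx==1.11
h5py==2.9.0
numpy==0.0.0
examplepkg==1.0
"""

mismatches = diff_pins(parse_pins(frozen), parse_pins(required))
for name, (have, wanted) in mismatches.items():
    print(f"{name}: installed={have} required={wanted}")
```

In practice the two strings would be read from the output of pip freeze and from the requirements file itself.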

@regel
Owner

regel commented Oct 7, 2019

I hope this helps. Let me know if this is still a blocking point.

@langerma
Author

langerma commented Oct 7, 2019

pip freeze output as follows

absl-py==0.8.0
aniso8601==8.0.0
astor==0.8.0
boto3==1.9.238
botocore==1.12.238
cached-property==1.5.1
certifi==2019.9.11
chardet==3.0.4
Click==7.0
cycler==0.10.0
daiquiri==1.6.0
dateutils==0.6.6
decorator==4.4.0
dictdiffer==0.8.0
dill==0.3.1.1
docker==2.6.1
docker-compose==1.18.0
docker-pycreds==0.2.1
dockerpty==0.4.1
docopt==0.6.2
docutils==0.15.2
elasticsearch==6.3.1
Flask==1.1.1
Flask-RESTful==0.3.7
future==0.17.1
gast==0.3.2
gevent==1.4.0
greenlet==0.4.15
grpcio==1.24.0
h5py==2.9.0
hyperopt==0.1.2
idna==2.8
influxdb==5.2.3
itsdangerous==1.1.0
Jinja2==2.10.1
jmespath==0.9.4
jsonschema==2.5.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
loudml===1.6.0-446132b4
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.0.3
mock==3.0.5
networkx==1.11
numpy==1.16.4
pbr==5.4.3
Pebble==4.4.0
protobuf==3.9.2
pycrypto==2.6.1
pymongo==3.9.0
pyparsing==2.4.2
PySocks==1.6.8
python-dateutil==2.8.0
pytz==2019.2
PyYAML==5.1.2
requests==2.22.0
requests-aws4auth==0.9
s3transfer==0.2.1
schedule==0.6.0
scipy==1.3.1
six==1.12.0
stevedore==1.31.0
tensorboard==1.13.1
tensorflow==1.13.2
tensorflow-estimator==1.13.0
termcolor==1.1.0
texttable==1.6.2
tqdm==4.36.1
urllib3==1.25.6
virtualenv==15.1.0
virtualenv-clone==0.5.3
virtualenvwrapper==4.8.4
voluptuous==0.10.5
warp10client==1.0.1
websocket-client==0.47.0
Werkzeug==0.16.0

@langerma
Author

langerma commented Oct 7, 2019

I have an older version running on a Jetson Nano with tensorflow-gpu, which works fine (also with TF 1.13.1, which is the last 1.13 version for the Jetson Nano).

But the newer version (1.6) shows the same error on the Jetson.

@toni-moreno

Hello @regel @langerma, I've reproduced this error when the input bucket lacks measurement and/or annotation_db.

With this configuration it works OK:

  - name: influxdb-linux #input
    type: influxdb
    addr: influxdb:8086
    database: telegraf
    retention_policy: autogen
    measurement: loudml #review
    dbuser: admin #https://github.com/regel/loudml/issues/377 
    dbuser_password: admin1234
    annotation_db: loudml
    create_database: false

With the config below, the error "AttributeError: Can't pickle local object '_compile_scalar.<locals>.validate_value'" occurred:

  - name: influxdb-linux #input
    type: influxdb
    addr: influxdb:8086
    database: telegraf
    retention_policy: autogen
    #measurement: loudml <---------- commented 
    dbuser: admin #
    dbuser_password: admin1234
    #annotation_db: loudml <-----------commented
    create_database: false
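
A quick pre-flight check along these lines could flag the second configuration before training is attempted. This is only a sketch based on the observation in this comment: the `REQUIRED_KEYS` set and the `missing_keys` helper are hypothetical and are not part of Loud ML, and the bucket dictionaries simply mirror the two YAML entries above (credentials omitted).

```python
# Keys whose absence coincided with the pickle error in this thread.
# ASSUMPTION: this set comes from the report above, not from Loud ML itself.
REQUIRED_KEYS = {"measurement", "annotation_db"}


def missing_keys(bucket):
    """Return the sorted list of required keys absent from a bucket entry."""
    return sorted(REQUIRED_KEYS.difference(bucket))


working = {
    "name": "influxdb-linux",
    "type": "influxdb",
    "addr": "influxdb:8086",
    "database": "telegraf",
    "retention_policy": "autogen",
    "measurement": "loudml",
    "annotation_db": "loudml",
    "create_database": False,
}

# Same entry with the two keys commented out, as in the failing config.
failing = {
    key: value
    for key, value in working.items()
    if key not in ("measurement", "annotation_db")
}

for label, bucket in (("working", working), ("failing", failing)):
    print(label, "missing:", missing_keys(bucket))
```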

@langerma I don't know if you have the same config.

Anyway, in this context I have serious doubts about how the data flow works:

[input-bucket] --> [LOUDML] ---> [output-bucket] 

With this flow, measurement and annotation_db (and also create_database) make no sense in the input bucket if it is only queried (related to #377).

@regel, it would be great if we could discuss this data flow.

Thank you very much

@langerma
Author

langerma commented Aug 3, 2020

@toni-moreno I am not using InfluxDB anymore :-( so I cannot contribute any further.
