Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CWLEXEC fails with Hibernate exception for workflows with more than 20 steps #53

Open
mscheremetjew opened this issue Sep 10, 2019 · 3 comments

Comments

@mscheremetjew
Copy link

mscheremetjew commented Sep 10, 2019

We do bump into this issue whenever we try to execute workflows in CWLEXEC with more than 20 steps, so 21 steps for instance. Also sometimes CWLEXEC hangs after it has reported the error. Could it be that sessions are not closed properly after each database transaction? We created a simple test workflow so it becomes easy to reproduce for you guys.

Here is the workflow:
test-workflow.zip

This is the command we are running:

cwlexec -debug -L -p -w <work-dir> -o <output-dir> test-workflow.cwl

CWLEXEC reports the following and exits or sometimes just hangs:

17:14:04.579 default [pool-3-thread-20] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_20) was submitted. Job <6076314> is submitted to default queue <research-rh74>.
17:14:04.579 default [pool-3-thread-16] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_16) was submitted. Job <6076305> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-12] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_12) was submitted. Job <6076312> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-14] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_14) was submitted. Job <6076313> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-21] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_21) was submitted. Job <6076319> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-18] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_18) was submitted. Job <6076322> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-17] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_17) was submitted. Job <6076317> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-13] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_13) was submitted. Job <6076321> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-2] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_2) was submitted. Job <6076316> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-5] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_5) was submitted. Job <6076310> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-1] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_1) was submitted. Job <6076324> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-15] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_15) was submitted. Job <6076318> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-4] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_4) was submitted. Job <6076320> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-19] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_19) was submitted. Job <6076307> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-3] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_3) was submitted. Job <6076325> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-7] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_7) was submitted. Job <6076306> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-6] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_6) was submitted. Job <6076315> is submitted to default queue <research-rh74>.
17:14:04.580 default [pool-3-thread-11] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_11) was submitted. Job <6076309> is submitted to default queue <research-rh74>.
17:14:04.581 default [pool-3-thread-10] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_10) was submitted. Job <6076311> is submitted to default queue <research-rh74>.
17:14:04.581 default [pool-3-thread-9] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_9) was submitted. Job <6076323> is submitted to default queue <research-rh74>.
17:14:04.584 default [pool-3-thread-8] INFO  c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Job (touch_8) was submitted. Job <6076308> is submitted to default queue <research-rh74>.
17:14:04.737 default [pool-3-thread-6] ERROR c.i.s.c.e.e.lsf.LSFBsubExecutorTask - Failed to submit the step touch_6, The internal connection pool has reached its maximum size and no connection is currently available!
17:14:04.743 default [pool-3-thread-6] ERROR c.i.s.c.e.e.lsf.LSFBsubExecutorTask - The exception stacks:
org.hibernate.HibernateException: The internal connection pool has reached its maximum size and no connection is currently available!
	at org.hibernate.engine.jdbc.connections.internal.PooledConnections.poll(PooledConnections.java:82)
	at org.hibernate.engine.jdbc.connections.internal.DriverManagerConnectionProviderImpl.getConnection(DriverManagerConnectionProviderImpl.java:186)
	at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:35)
	at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:106)
	at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getPhysicalConnection(LogicalConnectionManagedImpl.java:136)
	at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.getConnectionForTransactionManagement(LogicalConnectionManagedImpl.java:254)
	at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.begin(LogicalConnectionManagedImpl.java:262)
	at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.begin(JdbcResourceLocalTransactionCoordinatorImpl.java:214)
	at org.hibernate.engine.transaction.internal.TransactionImpl.begin(TransactionImpl.java:56)
	at org.hibernate.internal.AbstractSharedSessionContract.beginTransaction(AbstractSharedSessionContract.java:409)
	at com.ibm.spectrumcomputing.cwl.exec.service.CWLInstanceService.updateCWLProcessInstance(CWLInstanceService.java:83)
	at com.ibm.spectrumcomputing.cwl.exec.executor.lsf.LSFBsubExecutorTask.runStep(LSFBsubExecutorTask.java:108)
	at com.ibm.spectrumcomputing.cwl.exec.executor.lsf.LSFBsubExecutorTask.run(LSFBsubExecutorTask.java:56)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
17:14:04.744 default [pool-3-thread-6] DEBUG c.i.s.c.e.e.lsf.LSFWorkflowRunner - broadcast event EXIT, touch_6
17:14:04.746 default [pool-3-thread-6] ERROR c.i.s.c.e.e.lsf.LSFWorkflowRunner - The workflow (test-wf) exited with <255>.
@skeeey
Copy link
Collaborator

skeeey commented Sep 20, 2019

@mscheremetjew , I think this because the concurrent database connections exceed the Hibernate connection pool size (default 20), as a temporary way, you may increase the pool size at https://github.com/IBMSpectrumComputing/cwlexec/blob/master/src/main/java/com/ibm/spectrumcomputing/cwl/exec/util/DatabaseManager.java#L84, e.g.

properties.put("hibernate.connection.pool_size", "50");

We may need to limit (e.g. use Semaphore) the database concurrent request in future at https://github.com/IBMSpectrumComputing/cwlexec/blob/master/src/main/java/com/ibm/spectrumcomputing/cwl/exec/service/CWLInstanceService.java#L80

@adamsla
Copy link

adamsla commented Sep 20, 2019

Does it make sense to make this a setting in the users flow somewhere? Is that supported in the Specification?

@mr-c
Copy link

mr-c commented Sep 20, 2019

Does it make sense to make this a setting in the users flow somewhere? Is that supported in the Specification?

There is not a specific configuration option for this in the CWL standards. CWLEXEC could detect the number of steps in class: Workflow and configure itself accordingly. Ideally site specific configuration is not in a CWL description. However, if it is needed, then a cwlexec:hibernate.connection_pool_size CWL directive could be implemented by CWLEXEC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants