8_processing notebook process_patient takes extremely long #10

Open
jareducherek opened this issue Feb 28, 2020 · 4 comments
@jareducherek

def process_patient(aid) takes extremely long to run in the notebook: about 30 seconds per patient. Even running the subsequent part with 20 workers would still take several days to complete.

Is there anything I can change to get this to complete in a reasonable amount of time? I'm running this on a 20-core Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz.

import sys
import traceback
import numpy as np

def process_patient(aid):
    # Per-admission log; the admission id is zero-padded to 6 digits.
    with open('admdata/log/adm-{0:06d}.log'.format(aid), 'w') as f:
        try:
            proc = processing(aid, f)
            if len(proc) == 0:
                return
            res = {
                'timeseries': sparsify(proc),
                'general': ageLosMortality(aid, f),
                'icd9': ICD9(aid, f)
            }
            np.save('admdata/adm-{0:06d}'.format(aid), res)
            print('finished {0}!'.format(aid))
        except Exception:
            # Write the traceback both to a per-admission error file and to stdout.
            with open('admdata/log/admerror-{0:06d}.log'.format(aid), 'w') as ferr:
                traceback.print_exc(file=ferr)
            traceback.print_exc(file=sys.stdout)  # was print_exc(sys.stdout): positionally that's the limit arg, not the file
            print('failed at {0}!'.format(aid))

process_patient(136796)

from multiprocessing import Pool, cpu_count

num_workers = cpu_count()
p = Pool(num_workers)
for aid in admission_ids:
    p.apply_async(process_patient, args=(aid,))
p.close()
p.join()
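
A variant of that loop with imap_unordered yields results as workers finish, which makes progress visible while it runs (a sketch, assuming process_patient and admission_ids are defined as above):

from multiprocessing import Pool, cpu_count

with Pool(cpu_count()) as p:
    # chunksize batches admissions per worker to cut down on IPC overhead
    for i, _ in enumerate(p.imap_unordered(process_patient, admission_ids, chunksize=16), 1):
        if i % 100 == 0:
            print('{0}/{1} admissions processed'.format(i, len(admission_ids)))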
@jareducherek (Author)

The SQL queries for chartevents are taking a very long time. My chartevents.csv is around 3.3GB, but I am not sure how to optimize these queries. Please let me know if there is a better way:

cur.execute('SELECT charttime,itemid,valuenum,valueuom FROM mimiciii.chartevents WHERE hadm_id = '+str(aid)+' and itemid in (select * from mengcztemp_itemids_valid_chart)')
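
One way to see why this is slow is to ask Postgres for the query plan; a sequential scan over chartevents for every admission would explain it. A minimal sketch, assuming psycopg2 and the repo's getConnection helper:

conn = getConnection()
cur = conn.cursor()
# EXPLAIN prints the plan without running the query; a "Seq Scan on chartevents_N"
# line means every lookup walks a whole partition instead of using an index.
cur.execute('EXPLAIN SELECT charttime,itemid,valuenum,valueuom '
            'FROM mimiciii.chartevents WHERE hadm_id = %s', (136796,))
for row in cur.fetchall():
    print(row[0])
conn.close()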

@JJnotJimmyJohn

Hey Jared,

Same issue here. I don't have a powerful machine like yours; it took me about 5 days to finish that cell.

Also, the last cell of that notebook was giving me an error: "name 'admission_first_ids_set' is not defined". Are you in the same situation?

@NchemIcaLS

I had the same issue and figured I would share my solution for anyone else in the future. There are two things that greatly improve the performance:

  1. The queries for chartevents and labevents are duplicated 4 times. Deduplicating them, by moving the query outside the function or caching its result (see the sketch after the index code below), saves on database queries. Be careful: one of the queries selects valuenum while the others use value.

  2. When querying the database by hadm_id, an index is crucial. The authors already mention this as the first step in their script, but they implemented it incorrectly (maybe it worked for them and the database has changed since, or Postgres used to work differently?). Since chartevents is just a collection of child tables, the index must be applied to the children, not the parent table. You can add it to the parent table as well, but as far as I can tell it does nothing.

conn = getConnection()
cur = conn.cursor()
# chartevents is split into 17 child tables; index each child on hadm_id.
# INCLUDE makes the index cover the queried columns (requires Postgres 11+).
for i in range(17):
    print(i + 1)
    query = f'''DROP INDEX IF EXISTS chartevents_{i+1}_idx_hadm;
    CREATE INDEX chartevents_{i+1}_idx_hadm ON mimiciii.chartevents_{i+1} (hadm_id) INCLUDE (charttime,itemid,value,valuenum,valueuom)'''
    cur.execute(query)
    conn.commit()

# The same index on the parent table appears to do nothing:
# query = '''DROP INDEX IF EXISTS chartevents_idx_hadm;
# CREATE INDEX chartevents_idx02 ON mimiciii.chartevents (hadm_id) INCLUDE (charttime,itemid,value,valuenum,valueuom);'''
# cur.execute(query)
# conn.commit()
conn.close()
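
And a minimal sketch of point 1, caching the chartevents rows per admission so the duplicated queries hit the database once per worker process (fetch_chart_rows is a hypothetical name, not from the repo; the column list covers both value and valuenum because of the discrepancy noted above):

from functools import lru_cache

@lru_cache(maxsize=None)
def fetch_chart_rows(aid):
    # First call per aid queries the database; repeated calls return the cached rows.
    conn = getConnection()
    cur = conn.cursor()
    cur.execute('SELECT charttime,itemid,value,valuenum,valueuom '
                'FROM mimiciii.chartevents WHERE hadm_id = %s '
                'AND itemid IN (SELECT * FROM mengcztemp_itemids_valid_chart)', (aid,))
    rows = tuple(cur.fetchall())
    conn.close()
    return rows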

@mengcz13 (Collaborator)

mengcz13 commented Jun 8, 2021

We have replaced the notebooks with a single preprocessing script; see #21. It is much faster: with 4 cores it should finish within 1 day.
