8_processing notebook process_patient takes extremely long #10

Open
jareducherek opened this issue Feb 28, 2020 · 4 comments
@jareducherek

def process_patient(aid) takes extremely long to run in the notebook: about 30 seconds per patient. Even running the subsequent part with 20 workers would still take several days to complete.

Is there anything I can change to get this to complete in a reasonable amount of time? I'm running this on a 20-core Intel(R) Core(TM) i9-9900X CPU @ 3.50GHz.

import sys
import traceback
import numpy as np

def process_patient(aid):
    # Per-admission log; the admission id is zero-padded to 6 digits.
    with open('admdata/log/adm-{0:06d}.log'.format(aid), 'w') as f:
        try:
            proc = processing(aid, f)
            if len(proc) == 0:
                return
            res = {
                'timeseries': sparsify(proc),
                'general': ageLosMortality(aid, f),
                'icd9': ICD9(aid, f)
            }
            np.save('admdata/adm-{0:06d}'.format(aid), res)
            print('finished {0}!'.format(aid))
        except Exception:
            # Write the traceback both to a per-admission error file and to stdout.
            with open('admdata/log/admerror-{0:06d}.log'.format(aid), 'w') as ferr:
                traceback.print_exc(file=ferr)
            traceback.print_exc(file=sys.stdout)  # was print_exc(sys.stdout): positionally that's the limit arg, not the file
            print('failed at {0}!'.format(aid))

process_patient(136796)

from multiprocessing import Pool, cpu_count

num_workers = cpu_count()
p = Pool(num_workers)
for aid in admission_ids:
    p.apply_async(process_patient, args=(aid,))
p.close()
p.join()
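
A variant of that loop with imap_unordered yields results as workers finish, which makes progress visible while it runs (a sketch, assuming process_patient and admission_ids are defined as above):

from multiprocessing import Pool, cpu_count

with Pool(cpu_count()) as p:
    # chunksize batches admissions per worker to cut down on IPC overhead
    for i, _ in enumerate(p.imap_unordered(process_patient, admission_ids, chunksize=16), 1):
        if i % 100 == 0:
            print('{0}/{1} admissions processed'.format(i, len(admission_ids)))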
@jareducherek (Author)

The SQL queries for chartevents are taking a very long time. My chartevents.csv is around 3.3GB, but I am not sure how to optimize these queries. Please let me know if there is a better way:

cur.execute('SELECT charttime,itemid,valuenum,valueuom FROM mimiciii.chartevents WHERE hadm_id = '+str(aid)+' and itemid in (select * from mengcztemp_itemids_valid_chart)')
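
One way to see why this is slow is to ask Postgres for the query plan; a sequential scan over chartevents for every admission would explain it. A minimal sketch, assuming psycopg2 and the repo's getConnection helper:

conn = getConnection()
cur = conn.cursor()
# EXPLAIN prints the plan without running the query; a "Seq Scan on chartevents_N"
# line means every lookup walks a whole partition instead of using an index.
cur.execute('EXPLAIN SELECT charttime,itemid,valuenum,valueuom '
            'FROM mimiciii.chartevents WHERE hadm_id = %s', (136796,))
for row in cur.fetchall():
    print(row[0])
conn.close()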

@JJnotJimmyJohn

Hey Jared,

Same issue here. I don't have a powerful machine like yours; it took me about 5 days to finish that cell.

Also, the last cell of that notebook was giving me an error: "name 'admission_first_ids_set' is not defined". Are you in the same situation?

@NchemIcaLS

I had the same issue and figured I would share my solution for anyone else in the future. There are two things that greatly improve the performance:

  1. The queries for chartevents and labevents are duplicated 4 times. Deduplicating them, by moving the query outside the function or caching its result (see the sketch after the index code below), saves on database queries. Be careful: one of the queries selects valuenum while the others use value.

  2. When querying the database by hadm_id, an index is crucial. The authors already mention this as the first step in their script, but they implemented it incorrectly (maybe it worked for them and the database has changed since, or Postgres used to work differently?). Since chartevents is just a collection of child tables, the index must be applied to the children, not the parent table. You can add it to the parent table as well, but as far as I can tell it does nothing.

conn = getConnection()
cur = conn.cursor()
# chartevents is split into 17 child tables; index each child on hadm_id.
# INCLUDE makes the index cover the queried columns (requires Postgres 11+).
for i in range(17):
    print(i + 1)
    query = f'''DROP INDEX IF EXISTS chartevents_{i+1}_idx_hadm;
    CREATE INDEX chartevents_{i+1}_idx_hadm ON mimiciii.chartevents_{i+1} (hadm_id) INCLUDE (charttime,itemid,value,valuenum,valueuom)'''
    cur.execute(query)
    conn.commit()

# The same index on the parent table appears to do nothing:
# query = '''DROP INDEX IF EXISTS chartevents_idx_hadm;
# CREATE INDEX chartevents_idx02 ON mimiciii.chartevents (hadm_id) INCLUDE (charttime,itemid,value,valuenum,valueuom);'''
# cur.execute(query)
# conn.commit()
conn.close()
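
And a minimal sketch of point 1, caching the chartevents rows per admission so the duplicated queries hit the database once per worker process (fetch_chart_rows is a hypothetical name, not from the repo; the column list covers both value and valuenum because of the discrepancy noted above):

from functools import lru_cache

@lru_cache(maxsize=None)
def fetch_chart_rows(aid):
    # First call per aid queries the database; repeated calls return the cached rows.
    conn = getConnection()
    cur = conn.cursor()
    cur.execute('SELECT charttime,itemid,value,valuenum,valueuom '
                'FROM mimiciii.chartevents WHERE hadm_id = %s '
                'AND itemid IN (SELECT * FROM mengcztemp_itemids_valid_chart)', (aid,))
    rows = tuple(cur.fetchall())
    conn.close()
    return rows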

@mengcz13 (Collaborator)

mengcz13 commented Jun 8, 2021

We have replaced the notebooks with a single preprocessing script; see #21. It is much faster: with 4 cores it should finish within 1 day.
