strange behaviors in kvell #9

Open

anoyiuhu opened this issue Sep 3, 2020 · 1 comment

anoyiuhu commented Sep 3, 2020

Hi,

  1. According to scripts/run-aws.sh, YCSB is run several times. On the first run, KVell generates a database (e.g. 100 GB) and then runs the YCSB workload. On subsequent runs, KVell can reuse the database from the previous run and recover it. However, I found that sometimes, after recovering the database, it stops suddenly, which is very confusing. Do you know why this happens?

[Screenshot attached: Screen Shot 2020-09-03 at 10 34 48 AM]

  2. During my test, using 2 disks, 4 workers per disk, and a queue depth of 1, I found that the latency and bandwidth do not match. For example, for ycsb-uniform, the latency is 116 us and the throughput is 409838 req/s. Theoretically, the ideal throughput should be (1/116) * (2 * 4) * 10^6 ≈ 68965 req/s, which is much smaller than 409838. Can you explain this phenomenon?
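
For reference, here is the arithmetic spelled out as a throwaway snippet (the variable names are just for illustration):

```c
#include <stdio.h>

int main(void) {
    /* Numbers from the run described above. */
    double latency_us = 116.0;                      /* measured average latency */
    int nb_disks = 2, workers_per_disk = 4, queue_depth = 1;

    /* With a queue depth of 1, each worker should complete at most
       one request per latency period, so: */
    double ideal_thp =
        (double)(nb_disks * workers_per_disk * queue_depth) * 1e6 / latency_us;

    printf("ideal throughput = %.1f req/s\n", ideal_thp);  /* prints 68965.5 */
    return 0;
}
```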

Best regards
Looking forward to your reply.

BLepers (Owner) commented Sep 3, 2020

Hi,

  1. This has never happened to me. If it happens again, maybe you can gather some info using gdb? I'd be interested in the debug info.

  2. (Edited because I missed that you use a QD of 1)

    If I remember correctly, the latency is computed from the moment a query is inserted into a worker's queue. So there is some degree of "batching" even with a QD of 1, because the queue can contain multiple items. You can try setting MAX_NB_PENDING_CALLBACKS_PER_WORKER to 1; make sure NEVER_EXCEED_QUEUE_DEPTH is set to 1 too (see the configuration sketch after this list).

    Because the latency also includes the time spent waiting in the queue, it also complicates the maximum-bandwidth computation. (Intuitively, if a worker processes 1 request at a time and there is always 1 pending request in the queue, then your throughput is 2x what you would compute based on latency; see the worked example after this list.)

    The latency is also computed only over the first 10M queries (see MAX_STATS in stats.c), so the average might be wrong if your test is long.

    If you want to use the following formula:
    thp = (batch size * number of workers) / avg latency
    then modify the following line https://github.com/BLepers/KVell/blob/master/slabworker.c#L218, replacing 2 by 0 (this will reset the latency measurement at that point in time).
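
For reference, a minimal sketch of the two settings mentioned above. I am assuming here that both are compile-time macros in options.h; adjust if they live elsewhere in your checkout:

```c
/* options.h (sketch; the exact defaults in your checkout may differ) */

/* At most one in-flight callback per worker: no per-worker batching. */
#define MAX_NB_PENDING_CALLBACKS_PER_WORKER 1

/* Never let the number of queued requests exceed the configured queue depth. */
#define NEVER_EXCEED_QUEUE_DEPTH 1
```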
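And a back-of-the-envelope version of the 2x intuition above (the numbers are illustrative, not taken from your run):

```c
#include <stdio.h>

int main(void) {
    /* Illustrative: each request needs s us of actual service time, and
       there is always exactly one request waiting behind the one in flight. */
    double service_us = 58.0;

    /* What the stats report: time waiting in the queue + service time. */
    double measured_latency_us = service_us + service_us;      /* 116 us */

    /* What a worker really sustains vs. what you infer from the latency. */
    double real_thp  = 1e6 / service_us;            /* ~17241 req/s          */
    double naive_thp = 1e6 / measured_latency_us;   /* ~8621 req/s (2x less) */

    printf("real = %.0f req/s, inferred from latency = %.0f req/s\n",
           real_thp, naive_thp);
    return 0;
}
```

With more than one request waiting in the queue, the waiting term grows and the gap goes well beyond 2x, which is consistent with the 409838 vs 68965 difference you measured.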
