Add possibility to run a workload with 'rw=trim' #31
Conversation
Force-pushed from 8d0d2f3 to bc07fc3.
Hi @avikivity, please find the PR that adds the rw=trim workload to diskplorer.
if args.run_sequential_trim:
    out(textwrap.dedent(f'''\
        [prepare_data_for_trim(r_idx={read_iops_step},w_idx={write_bw_step},write_bw={write_bw},r_iops={read_iops})]
        '''))
This preparation step can be run at maximum bandwidth, no?
Can we share the data prepared by write with the trim? This doesn't work if we have to adjust the read range, but perhaps we can pre-calculate it.
diskplorer.py (outdated diff)
read_offset = write_bw*(args.test_step_time_seconds+args.trim_offset_time_seconds)
read_offset_str = f'offset={align_up(read_offset, args.read_buffer_size)}'
write_offset = write_bw*args.trim_offset_time_seconds
write_offset_str = f'offset={align_up(write_offset, args.write_buffer_size)}'
Is this additional write stream part of the normal test? If so it affects it.
I don't understand the runtime part.
Firstly, I will describe what my intention was. If the user specified args.sequential_trim=true, then:
- Before the ordinary read and write jobs are run for (write_bw_step, read_iops_step), a new job called prepare_data_for_trim is executed. It writes the data according to the following requirement:
  > Because we can't trim into the past, we'll need to add a k-second write/read job (that isn't measured) followed by the regular 30-second write/read/trim-behind job. The offset= fio command can be used to orchestrate the second write job wrt the first.
- The new job uses out(group_introducer), i.e. stonewall + new_group, to ensure that all previous jobs finished before it started. From the FIO docs on stonewall:
  > Wait for preceding jobs in the job file to exit, before starting this one. Can be used to insert serialization points in the job file. A stone wall also implies starting a new reporting group, see group_reporting.
- The first ordinary write or read job also contains stonewall + new_group to ensure that prepare_data_for_trim finished its work before the read+write workloads started.

Regarding the runtime and offset parameters:
- runtime={args.trim_offset_time_seconds}s is used for prepare_data_for_trim to write, in k seconds, the data that will be trimmed by the trim workload.
- write_offset = write_bw*args.trim_offset_time_seconds instructs the write workload to start writing right after the data that was written by prepare_data_for_trim.
- read_offset = write_bw*(args.test_step_time_seconds+args.trim_offset_time_seconds) instructs the read workload not to touch any of the data that will be trimmed. I thought that the trim workload would discard (write_bw*args.test_step_time_seconds) bytes; however, I still added the preparation time - I am not sure why.

Regarding prepare_data_for_trim affecting the test case: I thought that waiting until it finishes before running the read+write workloads, together with using separate reporting groups, is sufficient. Also, latency-postprocess.py contains:

if name == 'prepare' or name.startswith('prepare_data_for_trim') or name.startswith('trim_data'):
    continue

There is a chance that I misunderstood how FIO groups the measurements.
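To make the ordering concrete, here is a rough sketch of the job-file layout described above. The job names other than prepare_data_for_trim, read_job and trim_data, and the exact set of options, are placeholders; the generated .fio files included in the repository are authoritative.

[prepare_data_for_trim]
; serialization point: wait for every earlier job and start a new reporting group
stonewall
new_group
readwrite=write
rate={write_bw}
; roughly the k seconds' worth of data that the trim job will later discard
size={pre_written_data_size}

[write_job]
; second wave: the write, read and trim jobs below run together,
; but only once prepare_data_for_trim has exited
stonewall
new_group
readwrite=write
rate={write_bw}
; continue right behind the region filled by prepare_data_for_trim
{write_offset_str}

[read_job]
readwrite=randread
; stay clear of anything that may be trimmed during this step
{read_offset_str}

[trim_data]
readwrite=trim
rate={write_bw}
; trails the writer by k seconds, starting from the prepared region at offset 0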
> Firstly, I will describe what my intention was. If the user specified args.sequential_trim=true, then:
> - Before the ordinary read and write jobs are run for (write_bw_step, read_iops_step), a new job called prepare_data_for_trim is executed. It writes the data according to the following requirement:
> > Because we can't trim into the past, we'll need to add a k-second write/read job (that isn't measured) followed by the regular 30-second write/read/trim-behind job. The offset= fio command can be used to orchestrate the second write job wrt the first.

Ok, makes sense. But please add comments explaining it.

Maybe it should be bounded by write amount, not runtime. It makes sense to use the same write rate to not disrupt the flow.

> - The new job uses out(group_introducer), i.e. stonewall + new_group, to ensure that all previous jobs finished before it started. From the FIO docs on stonewall:
> > Wait for preceding jobs in the job file to exit, before starting this one. Can be used to insert serialization points in the job file. A stone wall also implies starting a new reporting group, see group_reporting.

I think it's correct, but it's best to observe it using iostat -x (which shows discards) and with blktrace (only useful for really short periods). It's important to have trust in the measurement tools.

> - The first ordinary write or read job also contains stonewall + new_group to ensure that prepare_data_for_trim finished its work before the read+write workloads started.
>
> Regarding the runtime and offset parameters:
> - runtime={args.trim_offset_time_seconds}s is used for prepare_data_for_trim to write, in k seconds, the data that will be trimmed by the trim workload.
> - write_offset = write_bw*args.trim_offset_time_seconds instructs the write workload to start writing right after the data that was written by prepare_data_for_trim.

But it assumes that bw*time = what you wanted. It's probably okay but can be subject to rounding. Better to limit it by data size.

> - read_offset = write_bw*(args.test_step_time_seconds+args.trim_offset_time_seconds) instructs the read workload not to touch any of the data that will be trimmed.

Ok. What about the end of the range? Is it shifted into not-written territory, or does it stay?

I forgot how diskplorer ensures reads access previously-written addresses.

> I thought that the trim workload would discard (write_bw*args.test_step_time_seconds) bytes; however, I still added the preparation time - I am not sure why.
>
> Regarding prepare_data_for_trim affecting the test case: I thought that waiting until it finishes before running the read+write workloads, together with using separate reporting groups, is sufficient. Also, latency-postprocess.py contains:

I meant something else - that there's a separate write stream during the measured time period. If there isn't, we're good.

> if name == 'prepare' or name.startswith('prepare_data_for_trim') or name.startswith('trim_data'): continue
>
> There is a chance that I misunderstood how FIO groups the measurements.

It's tricky and therefore important to observe runs and see that what happens matches expectations. That's what I did when writing the thing.
> Ok, makes sense. But please add comments explaining it.
>
> Maybe it should be bounded by write amount, not runtime. It makes sense to use the same write rate to not disrupt the flow.

Sure. I added a comment with the explanation to the code. Moreover, I used the following syntax to ensure that the desired amount of data is written by the job - pre_written_data_size is write_bw*k_seconds, aligned according to the requirements:

time_based=0
runtime=0
size={pre_written_data_size}
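A rough Python sketch of that calculation, assuming align_up rounds its first argument up to a multiple of its second (as suggested by its use in the diff above) and that write_bw is expressed in bytes per second; the names other than align_up are placeholders:

def align_up(value, alignment):
    # Round value up to the nearest multiple of alignment.
    return (value + alignment - 1) // alignment * alignment

def pre_written_data_size(write_bw, trim_offset_time_seconds, write_buffer_size):
    # The prepare job should cover roughly k seconds of writes at the step's
    # write bandwidth, rounded up so that the follow-up write job's offset=
    # stays aligned with its buffer size.
    return align_up(write_bw * trim_offset_time_seconds, write_buffer_size)

# e.g. 500 MB/s for k=10s with a 128 kB write buffer
print(pre_written_data_size(500_000_000, 10, 128 * 1024))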
> Ok. What about the end of the range? Is it shifted into not-written territory, or does it stay?
>
> I forgot how diskplorer ensures reads access previously-written addresses.

The end of the range is not changed, according to the FIO documentation that describes offset:

> Start I/O at the provided offset in the file, given as either a fixed size in bytes, zones or a percentage. [...] Data before the given offset will not be touched. This effectively caps the file size at real_size - offset. Can be combined with size to constrain the start and end range of the I/O workload.
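As an illustration of that last sentence (the values are invented purely for the example), combining the two options restricts a job to a fixed window of the device:

[read_job]
readwrite=randread
; skip the first 10 GiB and only touch the following 20 GiB,
; i.e. the byte range [10 GiB, 30 GiB)
offset=10g
size=20g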
I assumed that either --prefill or --size-limit is used by diskplorer to ensure that read operations access previously written addresses.

With the --prefill flag, diskplorer writes the whole disk as soon as it is invoked. This way we can use rw=randread and not care about where the random reads land:

help='Prefill entire disk, defeats incorrect results due to discard (default)'

With --size-limit, I would assume that the data was written before and we constrain diskplorer to use only that part of the storage.
The prepare job is defined as follows:
[prepare]
readwrite=write
time_based=0
blocksize=2MB
iodepth=4
runtime=0
The read job is defined as:
[read_job]
readwrite=randread
blocksize={args.read_buffer_size}
iodepth={args.read_concurrency}
rate_iops={this_cpu_read_iops}
rate_process=poisson
{read_offset_str}
> Ok, makes sense. But please add comments explaining it.
>
> Maybe it should be bounded by write amount, not runtime. It makes sense to use the same write rate to not disrupt the flow.
>
> Sure. I added a comment with the explanation to the code. Moreover, I used the following syntax to ensure that the desired amount of data is written by the job - pre_written_data_size is write_bw*k_seconds, aligned according to the requirements: time_based=0 runtime=0 size={pre_written_data_size}
>
> The end of the range is not changed, according to the FIO documentation that describes offset.

Ack

> I assumed that either --prefill or --size-limit is used by diskplorer to ensure that read operations access previously written addresses.

Ah, I forgot about prefill. So it's enough to exclude anything that may have been discarded.

> With the --prefill flag, diskplorer writes the whole disk as soon as it is invoked. This way we can use rw=randread and not care about where the random reads land.
>
> With --size-limit, I would assume that the data was written before and we constrain diskplorer to use only that part of the storage.

Yes, the main concern is that the default workload is sensible.

> The prepare job is defined as follows:
> [prepare] readwrite=write time_based=0 blocksize=2MB iodepth=4 runtime=0
>
> The read job is defined as:
> [read_job] readwrite=randread blocksize={args.read_buffer_size} iodepth={args.read_concurrency} rate_iops={this_cpu_read_iops} rate_process=poisson {read_offset_str}

Looks good.
Please include some results in latency-matrix-results (with links in README.md). Also include blktrace traces in the commit log that show a fragment of a run.

Note the .fio directories which contain the generated files.
Force-pushed from bc07fc3 to 624fd4d.
The change-log is as follows:
Please supply an example run in latency-matrix-results/ (with references from README.md). Perhaps even two: one with 32MB trims, and one with much smaller trims to demonstrate how bad it is.
Please find some results from
The first thing is that every 5 seconds I can see
The second thing is that between discard requests, I can see
Interesting.
Hi @avikivity. Please find the answer below. The
I did not specify
What is strange is the fact that
Reference: https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-sync
Now I don't remember why I thought they were odd. About the write-sync part - I have a vague memory that aio+dio sets that as a hint to the drive (or the scheduler) to give them higher priority. Buffered writes were already acknowledged, so they aren't latency sensitive, but dio writes are.
dio->inode = inode;
if (iov_iter_rw(iter) == WRITE) {
    dio->opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
    if (iocb->ki_flags & IOCB_NOWAIT)
        dio->opf |= REQ_NOWAIT;
} else {
    dio->opf = REQ_OP_READ;
}
btw, use blkparse -t to get timing information.
In order to check the impact of trim operations on read and write performance, diskplorer.py is extended with the new argument '--sequential-trim', which allows users to enable an additional workload that trims data written k seconds before (by default k=10). The default time offset can be changed via the new parameter '--trim-offset-time-seconds', which denotes the offset between the write and trim workloads and is expressed in seconds.

The bandwidth of the trim operation is identical to the bandwidth of the write operation used in the given iteration. By default the block size used by trim requests is calculated as 'trim_block_size=(write_bw*(k/2))', where write_bw depends on the iteration. However, the block size cannot be too small; therefore, the minimum block size for the trim operation can be configured via '--min-trim-block-size'. The default minimum block size is 32MB. If the user wants an equal 'trim_block_size' for each iteration, it can be set via '--force-trim-block-size'.

When the sequential trim workload is enabled, the additional write workload that prepares data to be deleted is also run. Moreover, the original write and read workloads use the 'offset=' parameter to ensure that they do not touch trimmed data.

Signed-off-by: Patryk Wrobel <[email protected]>
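Read literally, the block-size rule above amounts to the following sketch; the parameter names are taken from the commit message, the units are assumed to be bytes and seconds, and the real code in diskplorer.py may structure this differently:

def trim_block_size(write_bw, trim_offset_time_seconds,
                    min_trim_block_size=32 * 1024 * 1024,
                    force_trim_block_size=None):
    # A fixed size requested via --force-trim-block-size wins outright.
    if force_trim_block_size is not None:
        return force_trim_block_size
    # Default: half of the k-second write window, clamped from below by
    # --min-trim-block-size (32MB by default).
    return max(min_trim_block_size, write_bw * trim_offset_time_seconds // 2)

# Example: at 100 MB/s and k=10s this yields 500MB per trim request;
# at 5 MB/s it falls back to the 32MB minimum.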
This change adds the FIO input files that were generated while running the read+write+trim workload with a 32MB trim block size on an i3.2xlarge machine.

Signed-off-by: Patryk Wrobel <[email protected]>
This change adds results obtained on an i3.2xlarge machine using the --sequential-trim parameter, with a constant trim block size of 32MB and a trim bandwidth equal to the write bandwidth.

Signed-off-by: Patryk Wrobel <[email protected]>
Force-pushed from 624fd4d to 593c3d5.
New changes since the last version of the PR: