Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add possibility to run a workload with 'rw=trim' #31

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

pwrobelse
Copy link
Contributor

@pwrobelse pwrobelse commented Apr 18, 2024

In order to check the impact of trim operations on
read and write performance diskplorer.py is extended
with the new argument '--sequential-trim' that allows
users to enable the additional workload that
trims data written k seconds before (by default k=10).

The default time offset can be changed via the new
parameter: '--trim-offset-time-seconds' that denotes
the offset between write and trim workloads and is
expressed in seconds.

The bandwidth of trim operation is identical to the
bandwith of write operation used in the given iteration.

By default the block size used by trim requests is
calculated as 'trim_block_size=(write_bw*(k/2))', where
write_bw depends on the iteration.

However, block size cannot be too small. Therefore,
the minimum block size for the trim operation can
be configured via '--min-trim-block-size'. The default
minimum block size is 32MB.

If the user wants to have equal 'trim_block_size' for
each iteration it can be set via '--force-trim-block-size'.

When sequential trim workload is enabled, the additional
write workload that prepares data to be deleted is also run.
Moreover, the original write and read workloads utilize
'offset=' parameter to ensure that they do not use trimmed
data.

diskplorer.py Outdated Show resolved Hide resolved
@pwrobelse pwrobelse force-pushed the diskplorer-add-possibility-to-use-trim branch 2 times, most recently from 8d0d2f3 to bc07fc3 Compare April 23, 2024 13:43
@pwrobelse pwrobelse marked this pull request as ready for review April 23, 2024 13:44
@pwrobelse
Copy link
Contributor Author

Hi @avikivity, please find the PR with addition of rw=trim workload to diskplorer.

if args.run_sequential_trim:
out(textwrap.dedent(f'''\
[prepare_data_for_trim(r_idx={read_iops_step},w_idx={write_bw_step},write_bw={write_bw},r_iops={read_iops})]
'''))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This preparation step can be run at maximum bandwidth, no?

Can we share the data prepared by write with the trim? This doesn't work if we have to adjust the read range, but perhaps we can pre-calculate it.

diskplorer.py Outdated
read_offset = write_bw*(args.test_step_time_seconds+args.trim_offset_time_seconds)
read_offset_str = f'offset={align_up(read_offset, args.read_buffer_size)}'
write_offset = write_bw*args.trim_offset_time_seconds
write_offset_str = f'offset={align_up(write_offset, args.write_buffer_size)}'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this additional write stream part of the normal test? If so it affects it.

I don't understand the runtime part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, I will describe what was my intention. If the user specified args.sequential_trim=true, then:

  1. Before ordinary read and write jobs are run for (write_bw_step, read_iops_step), then a new job called prepare_data_for_trim is executed. It writes the data according to the following requirement:

Because we can't trim into the past, we'll need to add a k-second write/read job (that isn't measured) followed by the regular 30-second write/read/trim-behind job. The offset= fio command can be used to orchestrate the second write job wrt the first.

  1. The new job uses out(group_introducer) == stonewall + new_group to ensure that all previous jobs finished before it started. From FIO docs - stonewall:

Wait for preceding jobs in the job file to exit, before starting this one. Can be used to insert serialization points in the job file. A stone wall also implies starting a new reporting group, see group_reporting.

  1. The first ordinary write or read job also contains stonewall + new_group to ensure that prepare_data_for_trim finished its work before read+write workloads started.

Regarding runtime and offset parameters:

  • runtime={args.trim_offset_time_seconds}s is used in case of prepare_data_for_trim to write the data that will be trimmed by trim workload in k seconds
  • write_offset = write_bw*args.trim_offset_time_seconds instructs the write workload to start writing data right after the data that was written by prepare_data_for_trim
  • read_offset = write_bw*(args.test_step_time_seconds+args.trim_offset_time_seconds) instructs read workload to not touch any of the data that will be trimmed. I thought that trim workload will discard (write_bw*args.test_step_time_seconds) of bytes. However, I still added the preparation time - not sure why.

Regarding affecting the test case by prepare_data_for_trim - I thought that waiting until it finishes before running read+write workloads and usage of different reporting groups is sufficient. Also, latency-postprocess.py contains:

if name == 'prepare' or name.startswith('prepare_data_for_trim') or name.startswith('trim_data'):
            continue

There is a chance, that I misunderstood how FIO groups the measurements.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, I will describe what was my intention. If the user specified args.sequential_trim=true, then:

  1. Before ordinary read and write jobs are run for (write_bw_step, read_iops_step), then a new job called prepare_data_for_trim is executed. It writes the data according to the following requirement:

Because we can't trim into the past, we'll need to add a k-second write/read job (that isn't measured) followed by the regular 30-second write/read/trim-behind job. The offset= fio command can be used to orchestrate the second write job wrt the first.

Ok, makes sense. But please add comments explaining it.

Maybe it should be bounded by write amount, not runtime. It makes sense to use the same write rate to not disrupt the flow.

  1. The new job uses out(group_introducer) == stonewall + new_group to ensure that all previous jobs finished before it started. From FIO docs - stonewall:

Wait for preceding jobs in the job file to exit, before starting this one. Can be used to insert serialization points in the job file. A stone wall also implies starting a new reporting group, see group_reporting.

I think it's correct, but best to observe using iostat -x (which shows discards) and with blktrace (only useful for really short periods). It's important to have trust in the measurement tools.

  1. The first ordinary write or read job also contains stonewall + new_group to ensure that prepare_data_for_trim finished its work before read+write workloads started.

Regarding runtime and offset parameters:

  • runtime={args.trim_offset_time_seconds}s is used in case of prepare_data_for_trim to write the data that will be trimmed by trim workload in k seconds
  • write_offset = write_bw*args.trim_offset_time_seconds instructs the write workload to start writing data right after the data that was written by prepare_data_for_trim

But it assumes the bw*time = what you wanted. It's probably okay but can be subject to rounding. Better to limit it by data size.

  • read_offset = write_bw*(args.test_step_time_seconds+args.trim_offset_time_seconds) instructs read workload to not touch any of the data that will be trimmed.

Ok. What about the end of the range? Is it shifted into not-written territory, or does it stay?

I forgot how diskplorer ensures reads access previously-written addresses.

I thought that trim workload will discard (write_bw*args.test_step_time_seconds) of bytes. However, I still added the preparation time - not sure why.

Regarding affecting the test case by prepare_data_for_trim - I thought that waiting until it finishes before running read+write workloads and usage of different reporting groups is sufficient. Also, latency-postprocess.py contains:

I meant something else - that there's a separate write stream during the measured time period. If there isn't we're good.

if name == 'prepare' or name.startswith('prepare_data_for_trim') or name.startswith('trim_data'):
            continue

There is a chance, that I misunderstood how FIO groups the measurements.

It's tricky and therefore important to observe runs and see that what happens matches expectations. That's what I did when writing the thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense. But please add comments explaining it.

Maybe it should be bounded by write amount, not runtime. It makes sense to use the same write rate to not disrupt the flow.

Sure. I added a comment with explanations to the code. Moreover, I used the following syntax to ensure, that the desired amount of data is written by the job - pre_written_data_size is write_bw*k_seconds aligned according to the requirements.

time_based=0
runtime=0
size={pre_written_data_size}

Ok. What about the end of the range? Is it shifted into not-written territory, or does it stay?

I forgot how diskplorer ensures reads access previously-written addresses.

The end of the range is not changed according to FIO documentation that describes offset.

Start I/O at the provided offset in the file, given as either a fixed size in bytes, zones or a percentage. [...] Data before the given offset will not be touched. This effectively caps the file size at real_size - offset. Can be combined with size to constrain the start and end range of the I/O workload.

I assumed that either --prefill or --size-limit is used by diskplorer to ensure that read operations access previously written addresses.

With --prefill flag diskplorer writes the whole disk as soon as it is invoked. This way we can use rw=randread and don't care about the place where randread is done.
help='Prefill entire disk, defeats incorrect results due to discard (default)'

With --size-limit I would assume, that the data was written before and we constrain diskplorer to use only that part of the storage.

The prepare job is defined as follows:

[prepare]
readwrite=write
time_based=0
blocksize=2MB
iodepth=4
runtime=0

The read job is defined as:

[read_job]
readwrite=randread
blocksize={args.read_buffer_size}
iodepth={args.read_concurrency}
rate_iops={this_cpu_read_iops}
rate_process=poisson
{read_offset_str}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense. But please add comments explaining it.
Maybe it should be bounded by write amount, not runtime. It makes sense to use the same write rate to not disrupt the flow.

Sure. I added a comment with explanations to the code. Moreover, I used the following syntax to ensure, that the desired amount of data is written by the job - pre_written_data_size is write_bw*k_seconds aligned according to the requirements.

time_based=0
runtime=0
size={pre_written_data_size}

Ok. What about the end of the range? Is it shifted into not-written territory, or does it stay?
I forgot how diskplorer ensures reads access previously-written addresses.

The end of the range is not changed according to FIO documentation that describes offset.

Ack

Start I/O at the provided offset in the file, given as either a fixed size in bytes, zones or a percentage. [...] Data before the given offset will not be touched. This effectively caps the file size at real_size - offset. Can be combined with size to constrain the start and end range of the I/O workload.

I assumed that either --prefill or --size-limit is used by diskplorer to ensure that read operations access previously written addresses.

Ah, I forgot about prefill. So it's enough to exclude anything that may have been discarded.

With --prefill flag diskplorer writes the whole disk as soon as it is invoked. This way we can use rw=randread and don't care about the place where randread is done. help='Prefill entire disk, defeats incorrect results due to discard (default)'

With --size-limit I would assume, that the data was written before and we constrain diskplorer to use only that part of the storage.

Yes, the main concern is that the default workload is sensible.

The prepare job is defined as follows:

[prepare]
readwrite=write
time_based=0
blocksize=2MB
iodepth=4
runtime=0

The read job is defined as:

[read_job]
readwrite=randread
blocksize={args.read_buffer_size}
iodepth={args.read_concurrency}
rate_iops={this_cpu_read_iops}
rate_process=poisson
{read_offset_str}

Looks good.

diskplorer.py Outdated Show resolved Hide resolved
@avikivity
Copy link
Member

Please include some results in latency-matrix-results (with links in README.md). Also include blktrace traces in the commitlog that show a fragment of a run.

@avikivity
Copy link
Member

Please include some results in latency-matrix-results (with links in README.md). Also include blktrace traces in the commitlog that show a fragment of a run.

Note the .fio directories which contain the generated files.

@pwrobelse pwrobelse force-pushed the diskplorer-add-possibility-to-use-trim branch from bc07fc3 to 624fd4d Compare May 2, 2024 14:04
@pwrobelse
Copy link
Contributor Author

The change-log is as follows:

  • renamed --run-sequential-trim to --sequential-trim
  • introduced a new parameter --min-trim-block-size used to ensure, that trim uses at least given block size (defaults to 32MiB)
  • added a logic that selects the block size for trim operation as well as calculates the time required to issue at least 4 discard commands
  • adjusted the logic that prepares the data for trim - now it writes at least trim_block_size of data, because as soon as the workload is run a trim request with BS=trim_block_size may be issued
  • replaced a specific trim_data job name with {next(job_names)} - in case when multiple jobs are run as one group the result file contains only the name of the first one - thus, the special name was not so useful

@avikivity
Copy link
Member

Please supply an example run in latency-matrix-results/ (with references from README.md).

Perhaps even two: one with 32MB trims, and one with much smaller trims to demonstrate how bad it is.

@pwrobelse
Copy link
Contributor Author

Please find some results from blktrace+blkparse. In case more data is needed, I can provide the whole file - 60s trace is around 2GB.

blktrace had been started during prepare_data_for_trim job and it was running in parallel with read+write+trim. As far as I understand, the offset printed by blkparse assumes 512B blocks.

The first thing is that in each 5 seconds I can see DS (Discard-Synchronous) request submitted by FIO. It makes sense, because the used trim_block_size was write_bw*k/2, where k=10, where write_bw=120040634, which resulted in 572 MiB block size - greater than default 32MiB. Due to the write bandwidth one request had to be issued per 5 seconds.

259,0    1     9482     7.124364072  8699  Q  DS 0 + 1172272 [fio]
259,0    1     9483     7.124371614  8699  G  DS 0 + 1172272 [fio]
259,0    1     9484     7.124372665  8699  P   N [fio]
259,0    1     9485     7.124374263  8699 UT   N [fio] 1
259,0    1     9486     7.124375053  8699  I  DS 0 + 1172272 [fio]
259,0    1     9487     7.124393062   101  D  DS 0 + 1172272 [kworker/1:1H]
259,0    3     2432     7.125765420  8701  C  DS 0 + 1172272 [0]

[...]

259,0    0   987015    12.124114673  8699  Q  DS 1172272 + 1172272 [fio]

[...]

259,0    0   987017    12.124129310  8699  G  DS 1172272 + 1172272 [fio]

259,0    0   987020    12.124132412  8699  I  DS 1172272 + 1172272 [fio]

[...]

259,0    0   987024    12.124175848    77  D  DS 1172272 + 1172272 [kworker/0:1H]

259,0    1  1032289    12.125648947     0  C  DS 1172272 + 1172272 [0]

[...]

259,0    0   987081    12.125900229  8699  Q  DS 2344544 + 1172272 [fio]

The second thing is that between discard requests, I can see WS (write synchronous) and R (read) requests.

259,0    3   959682    12.124393616  8700  Q   R 177701496 + 1 [fio]
259,0    3   959683    12.124394005  8700  G   R 177701496 + 1 [fio]
259,0    3   959684    12.124394325  8700  D   R 177701496 + 1 [fio]
259,0    3   959685    12.124396938  8700  Q   R 835896982 + 1 [fio]
259,0    3   959686    12.124397542  8700  G   R 835896982 + 1 [fio]
259,0    3   959687    12.124397806  8700  D   R 835896982 + 1 [fio]
259,0    0   987030    12.124536193  8698  Q  WS 3514880 + 256 [fio]
259,0    0   987031    12.124537399  8698  G  WS 3514880 + 256 [fio]
259,0    0   987032    12.124539745  8698  D  WS 3514880 + 256 [fio]
259,0    1  1032289    12.125648947     0  C  DS 1172272 + 1172272 [0]
259,0    2  1017012    12.125691005     0  C   R 1826996167 + 1 [0]
259,0    0   987033    12.125692812  8699  C   R 924043527 + 1 [0]
259,0    2  1017013    12.125694101     0  C   R 376702341 + 1 [0]
259,0    2  1017014    12.125694738     0  C   R 1609579396 + 1 [0]
259,0    2  1017015    12.125695357     0  C   R 439600678 + 1 [0]
259,0    2  1017016    12.125695962     0  C   R 498326998 + 1 [0]
259,0    0   987034    12.125697836  8699  C   R 888870918 + 1 [0]
259,0    0   987035    12.125698584  8699  C   R 1491220998 + 1 [0]
259,0    0   987036    12.125699417  8699  C   R 1800930502 + 1 [0]
259,0    0   987037    12.125700129  8699  C   R 136082322 + 1 [0]
259,0    0   987038    12.125700873  8699  C   R 961352789 + 1 [0]
259,0    3   959688    12.125711768     0  C   R 257796490 + 1 [0]
259,0    2  1017017    12.125714338  8701  C   R 309401678 + 1 [0]

@avikivity
Copy link
Member

Interesting.

  1. The huge discard completes in a millisecond. Clearly it didn't really complete - it was just queued by the disk. If we keep piling them on, the disk will eventually choke on all the queued work. We've seen exactly the same with writes. So pacing is necessary. Not a surprise but good to see confirmation.
  2. The read+write are odd. They aren't part of the workload running in parallel?

@pwrobelse
Copy link
Contributor Author

2. The read+write are odd. They aren't part of the workload running in parallel?

Hi @avikivity. Please find the answer below.

The read+write are part of the workload running in parallel. To compare the output of blktrace for read+write+trim I also collected the traces for read+write. The results look similar.

259,0    7        7     0.000375921 26975  Q  WS 2864640 + 256 [fio]
259,0    7        8     0.000377171 26975  G  WS 2864640 + 256 [fio]
259,0    7        9     0.000379278 26975  D  WS 2864640 + 256 [fio]
259,0    0       19     0.000408372 26981  Q   R 715711729 + 1 [fio]
259,0    0       20     0.000409008 26981  G   R 715711729 + 1 [fio]
259,0    0       21     0.000409402 26981  D   R 715711729 + 1 [fio]
259,0    0       22     0.000412489 26981  Q   R 47439307 + 1 [fio]
259,0    0       23     0.000412909 26981  G   R 47439307 + 1 [fio]
259,0    0       24     0.000413187 26981  D   R 47439307 + 1 [fio]
259,0    0       25     0.000415492 26981  Q   R 609329307 + 1 [fio]
259,0    0       26     0.000415789 26981  G   R 609329307 + 1 [fio]
259,0    0       27     0.000416047 26981  D   R 609329307 + 1 [fio]
259,0    5       40     0.000417876     0  C   R 376955347 + 1 [0]
259,0    6       37     0.000418529     0  C   R 2968317088 + 1 [0]
259,0    7       10     0.000419634     0  C   R 2449824514 + 1 [0]
259,0    5       41     0.000423759     0  C   R 672399320 + 1 [0]
259,0    6       38     0.000423788     0  C   R 615315517 + 1 [0]
259,0    5       42     0.000437485     0  C   R 1335922978 + 1 [0]
259,0    0       28     0.000440929     0  C   R 3697026643 + 1 [0]
259,0    6       39     0.000449629     0  C   R 1474999982 + 1 [0]
259,0    0       29     0.000449808     0  C   R 2064158733 + 1 [0]
259,0    0       30     0.000450330     0  C   R 84724269 + 1 [0]
259,0    6       40     0.000454348     0  C   R 1478299699 + 1 [0]
259,0    5       43     0.000455490 26982  C   R 644217706 + 1 [0]
259,0    0       31     0.000456481     0  C   R 2635200624 + 1 [0]
259,0    2        7     0.000459449 26977  Q   R 1228640516 + 1 [fio]
259,0    2        8     0.000460377 26977  G   R 1228640516 + 1 [fio]
259,0    6       41     0.000460957     0  C   R 38201407 + 1 [0]
259,0    2        9     0.000460965 26977  D   R 1228640516 + 1 [fio]
259,0    2       10     0.000465650 26977  Q   R 787274164 + 1 [fio]
259,0    2       11     0.000466107 26977  G   R 787274164 + 1 [fio]
259,0    2       12     0.000466432 26977  D   R 787274164 + 1 [fio]
259,0    6       42     0.000468552     0  C   R 664170101 + 1 [0]
259,0    5       44     0.000469230 26982  C   R 1186353899 + 1 [0]
259,0    6       43     0.000469243     0  C   R 555552967 + 1 [0]
259,0    2       13     0.000470033 26977  Q   R 2473290180 + 1 [fio]
259,0    2       14     0.000470441 26977  G   R 2473290180 + 1 [fio]
259,0    2       15     0.000470830 26977  D   R 2473290180 + 1 [fio]
259,0    5       45     0.000478178 26982  C   R 3479163011 + 1 [0]
259,0    6       44     0.000484038     0  C   R 2920209455 + 1 [0]
259,0    4       19     0.000485207 26976  Q   R 2002892698 + 1 [fio]
259,0    5       46     0.000492594 26982  C   R 983889880 + 1 [0]
259,0    4       20     0.000493617 26976  C  WS 2864640 + 256 [0]

I did not specify --read-buffer-size. Therefore, diskplorer used the value read from {dev_path}/queue/logical_block_size, which was 512. The same in the case of --write-buffer-size. It used the default value, which was 128*1024. Thus, the blocks from blktrace seem to match the default values - blkparse assumes 512B blocks. Therefore, reads use +1 and writes use +256.

What is strange is the fact, that WS (Write Synchronous) is used. sync is not specified in any of the jobs and by default FIO uses sync=none.

sync=str
Whether, and what type, of synchronous I/O to use for writes. The allowed values are:
none
Do not use synchronous IO, the default.

Reference: https://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-sync

@avikivity
Copy link
Member

Now I don't remember why I thought they were odd.

About the write-sync part - I have vague memory that aio+dio sets that as a hint to the drive (or the scheduler) to give them higher priority. Buffered writes were already acknowledged, so they aren't latency sensitive, but dio writes are.

@avikivity
Copy link
Member

        dio->inode = inode;
        if (iov_iter_rw(iter) == WRITE) {
                dio->opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
                if (iocb->ki_flags & IOCB_NOWAIT)
                        dio->opf |= REQ_NOWAIT;
        } else {
                dio->opf = REQ_OP_READ;
        }

@avikivity
Copy link
Member

btw, use blkparse -t to get timing information.

pwrobelse added 3 commits June 3, 2024 13:53
In order to check the impact of trim operations on
read and write performance diskplorer.py is extended
with the new argument '--sequential-trim' that allows
users to enable the additional workload that
trims data written k seconds before (by default k=10).

The default time offset can be changed via the new
parameter: '--trim-offset-time-seconds' that denotes
the offset between write and trim workloads and is
expressed in seconds.

The bandwidth of trim operation is identical to the
bandwith of write operation used in the given iteration.

By default the block size used by trim requests is
calculated as 'trim_block_size=(write_bw*(k/2))', where
write_bw depends on the iteration.

However, block size cannot be too small. Therefore,
the minimum block size for the trim operation can
be configured via '--min-trim-block-size'. The default
minimum block size is 32MB.

If the user wants to have equal 'trim_block_size' for
each iteration it can be set via '--force-trim-block-size'.

When sequential trim workload is enabled, the additional
write workload that prepares data to be deleted is also run.
Moreover, the original write and read workloads utilize
'offset=' parameter to ensure that they do not use trimmed
data.

Signed-off-by: Patryk Wrobel <[email protected]>
This change posts input files for FIO, that
were generated during running read+write+trim
workload that used 32MB trim block size on
i3.2xlarge machine.

Signed-off-by: Patryk Wrobel <[email protected]>
This change adds results obtained on i3.2xlarge
machine with usage of --sequential-trim parameter
with constant trim block size equals 32MB and
trim bandwidth equals write bandwidth.

Signed-off-by: Patryk Wrobel <[email protected]>
@pwrobelse pwrobelse force-pushed the diskplorer-add-possibility-to-use-trim branch from 624fd4d to 593c3d5 Compare June 5, 2024 10:29
@pwrobelse
Copy link
Contributor Author

New changes since last version of the PR:

  • added --force-trim-block-size parameter to diskplorer.py to allow forcing certain block size for a trim (e.g. 32MB).
  • included generated FIO files in latency-matrix-results/i3.2xlarge.trim32MB.fio
  • updated README.md and added chart + JSON from the run with 32MB trim requests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants