-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the method of creating an empty file with right size when saving binary files #3408
Conversation
Update the method to create an empty file with the right size
for more information, see https://pre-commit.ci
This is cool. Just tagging myself so I can follow this story as Heberto reviews it. |
Hi, thanks for the contribution. I am unable to reproduce your results on Ubuntu 24.04 with kernel 6.8 and an SSD with NVMe: I also kind of expect that one syscall (truncate) would be faster than two (seek and write). Can you give me more context about how you are doing your profiling and the reason for why seek plus truncate should be faster? Maybe I am doing something wrong on my profiling. |
I’m working with Windows 10 and Python 3.10. Could the issue be related to the operating system? |
@zm711 could you try the profile that I wrote above? |
For your convenience: %%timeit
file_size_bytes = 10 * 1024 * 1024 * 1024
file = open("temporary.bin", "wb+")
file.truncate(file_size_bytes)
file.close()
%%timeit
file_size_bytes = 10 * 1024 * 1024 * 1024
file = open("temporary.bin", "wb+")
file.seek(file_size_bytes - 1)
file.write(b'\0')
file.close() |
Sure. Running to meeting, but right after :) |
This could be a windows thing. So maybe the better strategy is actually per OS. I've been telling my lab mates to deal with it, but it is brutally slow waiting for large binary files to write for KS and MS5. |
I'll test on my Mac too and report that here as well. |
Just to be fair I'm also testing python 3.11 on Windows now, but it'll take a little while. |
Sure I'll start with my Mac and do windows later. |
Whenever you can, man, no rush : ) |
Times are exactly the same for 100 GiB on Mac. |
To be honest running the timeit on Windows is causing my computer to freak out... I don't know what was happening for the truncate. It worked instantly for seek..... |
@zm711 Great speed in testing! |
Maybe this is related to the zero filling? https://docs.python.org/3/library/io.html#io.IOBase.truncate |
The 100 GiB test failed on my Windows computer due to lack of storage. (I've only got a small SSD so this test might be hard for me to do). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be my suggestion unless you have a better idea Heberto.
Co-authored-by: Zach McKenzie <[email protected]>
for more information, see https://pre-commit.ci
@h-mayorquin I agree that Python on Windows does zero-filling in Besides, since |
I think this is a fair point but at least for me it's less about differentiating for a sub ms speedup and more about keeping the default for posix systems and then optimizing for Windows systems. Until someone feels comfortable knowing exactly why there is this difference (maybe zero filling but the docs aren't clear on this) we could keep our original implementation for posix and the speedup for Windows. Maybe you have some clearer docs on the situation so we can better understand the differences across OSs? Either way we should probably add a comment (either explaining why we added the Windows difference) or a comment why we changed everything to seek-write. |
This is a good point, maybe it doesn't make sense to differentiate. |
I'm primarily a Windows user. It is totally up to you how to implement it for posix systems. |
I'm between Windows and Mac so I think this Windows update will be wonderful. I could go either way on the posix implementation vs taking a minuscule slow down to reduce complexity. Could you add a comment just linking back to this PR where we did the profiling to indicate why we needed to switch :) for Windows. If you want to revert back to just one implementation that's fine with me. But I do want a comment so we know why we are doing what we are doing for this really low level operation. |
@zm711 I reverted it back. And I found |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works for me. context manager seems safer too. Thanks @jiumao2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this to our attention. This is a great contribution that would improve the lives of many window users.
Thanks @zm711 for reviewing this as well!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good by Heberto good by me.
Should I squeeze it in into the release? Or it can wait for the next one? |
My vote is for waiting on this one. |
I have everyone install from main so for me it's not a big deal/ my group. But this is a huge quality of life. We have so many issues on the tracker for Windows writing binary that we thought was concurrency, but I bet this could contribute. |
The current method of creating an empty file with
file_size_bytes
is significantly slow, especially for large files. This PR addresses the issue by introducing a more efficient approach.Current Implementation in
write_binary_recording
(located insrc/spikeinterface/core/recording_tools.py
):Proposed Improvement:
Creating a 100 GB empty file on SSD: