Some common filetypes are not detected #12
Comments
You are correct: it cannot detect these, because those file types have no magic numbers for file detection and would need additional analysis for a best guess, which I have not written. For example, it does support Python files whose first line is '#!/usr/bin/env python', but it would be better to upgrade this module to do some looser matching or some analysis to give more and better results. (I already tried to capture this idea in #3, but it is better spelled out with your example.) I don't have the time to work on it currently, but I at least remember how I thought about implementing it, which I will capture in this issue:
|
This would not work, because a shebang in Python (and in other common script types as well) is not mandatory. A shebang is only relevant to runnable scripts that you wish to execute without explicitly specifying the program to run them through. You wouldn't typically put a shebang in a Python module that only contains function and class definitions meant to be imported from other modules. Therefore lots of Python files do not have a shebang, and a shebang check alone is not enough to identify a Python file. Also, this would not work for Python running on Windows. |
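To make the limitation concrete, here is a minimal sketch of a first-line shebang check. The function name `looks_like_python_script` is hypothetical and is not part of PureMagic; it only illustrates why an importable module without a shebang goes undetected by this kind of test.

```python
def looks_like_python_script(data: bytes) -> bool:
    """Return True only if the first line is a shebang that mentions python."""
    first_line = data.split(b"\n", 1)[0]
    return first_line.startswith(b"#!") and b"python" in first_line


# A runnable script with a shebang, and a plain importable module without one.
script = b"#!/usr/bin/env python\nprint('hello')\n"
module = b"def add(a, b):\n    return a + b\n"

print(looks_like_python_script(script))  # True
print(looks_like_python_script(module))  # False
```

Both inputs are valid Python, but only the first is caught, which is the gap described above.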
Correct, @ionecum. That is the example that shows where PureMagic could do better with a more in-depth parser, rather than just matching the first lines of a file. |
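Look at File Format: DOCX. The best way to match these would be to regex through the file for the REL string, for example:
<Relationship Id#"rId1" Type#"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target#"word/document.xml"/>
This would provide a solid hit every time, but would need some changes to the PureMagic logic. An idea for this might be:
1. Still use the existing PK match to get the ball rolling
2. Use a modified version of the multi-part match to hold a regex string in hex (this ensures we can safely store the required characters). I left the 0 in as a dummy value for the example; we could of course ditch it, as regex would not require it:
"regex": {
"464f524d": [ ["3C52656C6174696F6E736869702049642322724964312220547970652322687474703A2F2F736368656D61732E6F70656E786D6C666F726D6174732E6F72672F6F6666696365446F63756D656E742F323030362F72656C6174696F6E73686970732F6F6666696365446F63756D656E7422205461726765742322776F72642F646F63756D656E742E786D6C222F3E", 0, ".docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"MS Office Open XML Format Word Document"] ]
}
3. Process and treat in a similar way to a regular multipart match
This would obviously have pros and cons. The con is that it would look at anything with a PK header, potentially needing longer to match and more memory if you have a huge file. The pro would be, in theory, a solid high-confidence match.
@cdgriffith I know you are looking at ways to expand PureMagic's abilities; is this something that would be of interest? EDIT: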
|
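The two-stage idea in the comment above (PK header first, then a deeper pattern scan) could be sketched roughly as follows. This is not PureMagic code: `two_stage_match`, `MARKER_HEX` and `make_zip` are hypothetical names, and the marker used here is a member name rather than the Relationship string. That swap is deliberate: zip headers store member names uncompressed, whereas XML parts inside an Office archive are usually deflate-compressed, so a raw byte scan for XML content may not find it.

```python
import io
import re
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature that starts a zip archive

# Hypothetical rule format: the literal to search for, stored as hex.
MARKER_HEX = b"word/document.xml".hex()


def two_stage_match(data: bytes, marker_hex: str = MARKER_HEX) -> bool:
    """Stage 1: cheap PK header check. Stage 2: regex scan of the whole buffer."""
    if not data.startswith(ZIP_MAGIC):
        return False
    # Decode the hex rule and escape it so it is matched literally.
    pattern = re.compile(re.escape(bytes.fromhex(marker_hex)))
    return pattern.search(data) is not None


def make_zip(member_name: str) -> bytes:
    """Build a tiny in-memory zip with one compressed member, for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, "<xml/>")
    return buf.getvalue()


print(two_stage_match(make_zip("word/document.xml")))  # True
print(two_stage_match(make_zip("notes.txt")))          # False
```

The second stage reads the whole buffer, which is where the extra time and memory cost mentioned above comes from.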
Hello,
I see. Some web forms may accept docx files or pdf (much more important),
but your solution is too specific to Microsoft. What if a user sends a
LibreOffice or OnlyOffice document from Linux or Mac? What if the user
sends something from Android, which is the most common case today?
We should find a more general approach.
Sincerely
DR
…On Sat, May 4, 2024 at 7:08 AM Andy ***@***.***> wrote:
Look at File Format: DOCX
<https://docs.fileformat.com/word-processing/docx/#:~:text=A%20Docx%20file%20comprises%20of,files%20available%20in%20the%20archive>
the best way to match these would be to regex through the file for the REL
string, for example:
<Relationship Id#"rId1" Type#"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target#"word/document.xml"/>
This would provide a solid hit every time, but would need some changes to
the PureMagic logic. An idea for this might be:
1. Still use the existing PK match to get the ball rolling
2. Use a modified version of the multi-part match to hold a regex
string in hex (this ensures we can safely store the required characters). I
left the 0 in as a dummy value for the example; we could of course ditch it, as
regex would not require it:
"regex": {
"464f524d": [ ["3C52656C6174696F6E736869702049642322724964312220547970652322687474703A2F2F736368656D61732E6F70656E786D6C666F726D6174732E6F72672F6F6666696365446F63756D656E742F323030362F72656C6174696F6E73686970732F6F6666696365446F63756D656E7422205461726765742322776F72642F646F63756D656E742E786D6C222F3E", 0, ".docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"MS Office Open XML Format Word Document"] ]
}
3. Process and treat in a similar way to a regular multipart match
This would obviously have pros and cons. This obviously would look at
anything with a PK header, potentially needing longer times to match, and
heavier memory requirements if you have a huge file. The Pro would be in
theory a solid high confidence match.
@cdgriffith <https://github.com/cdgriffith> I know you are looking at
ways to expand PureMagic's abilities; is this something that would be of
interest?
|
@NebularNerd yes, I would love to see a rules engine that can be engaged after the initial fast match and that could do those deeper searches, similar to what was outlined at the start of the issue: #12 (comment) |
My proof-of-concept is mainly to see how we can improve matching for any and all files. PureMagic is pretty awesome and can already match most of the document types you mentioned. For files that are all essentially a .zip, we need to find other markers that are always present to improve confidence rates. I'm sure we could find similar markers inside the LibreOffice and other formats. I picked on .docx/.xlsx because I have plenty of those to test against, not because I'm Microsoft-centric. |
As a proof-of-concept it works. Maybe I'll put up a PR so you can take a look and see what you think (you may even find a better implementation). For the time being I've got a way to drop them into the existing Multi-Match so we don't have to reinvent the wheel. My main concern is that even with such a large confidence match, the lesser matches still 'win' (the confidence 'winner' is fixed if #66 is approved). |
I'll add this here for now as it's mostly on topic, but it is more of a 2.0 goal than an immediate solution. .apk, .docx, .jar, .xlsx and anything else would be almost instantly matchable. If we know a .docx has x files that would always be present, we could test for their presence in the file. This would be a secondary step to byte matching, but it opens up a possible solution for dealing with those files. Scores would still be calculated the same way: just treat the matched file path (preferably paths, to allow even longer matches) as if it were a byte string. Sample file names you could match for:
.docx:
.jar:
.apk:
.odt: (Would need to test more files to confirm contents)
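As a rough illustration of the file-listing idea, the sketch below reads the archive's member names with the standard `zipfile` module and scores each candidate by the total length of its matched paths, treating them like matched byte strings. The `MARKERS` table and the `guess_from_listing` function are assumptions made for this example; they are not the lists from the comment above and not PureMagic's API, and .odt is left out since it needs more testing.

```python
import io
import zipfile
from typing import List, Tuple

# Illustrative marker paths only (assumed for this sketch).
MARKERS = {
    ".docx": {"[Content_Types].xml", "word/document.xml"},
    ".jar": {"META-INF/MANIFEST.MF"},
    ".apk": {"AndroidManifest.xml", "classes.dex"},
}


def guess_from_listing(data: bytes) -> List[Tuple[str, int]]:
    """Return (extension, score) pairs, best first, for a zip payload."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
    results = []
    for ext, required in MARKERS.items():
        if required <= names:  # every marker path is present
            # Longer matched paths score higher, like longer byte matches.
            results.append((ext, sum(len(name) for name in required)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


def make_zip(member_names) -> bytes:
    """Build a small in-memory archive containing the given member names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in member_names:
            zf.writestr(name, "")
    return buf.getvalue()


docx_like = make_zip(["[Content_Types].xml", "word/document.xml"])
print(guess_from_listing(docx_like)[0][0])  # .docx
```

Only the archive's directory is read here, not the member contents, so this step should stay cheap even for large files. A payload that is not a valid zip raises `zipfile.BadZipFile`, which a caller would need to handle.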
|
Pure magic seems to be failing to detect some very common file types, like text files (.py, .txt, .md).
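Because plain text has no magic number, any detection has to be a content heuristic. A minimal sketch, assuming that "text" means "decodes as UTF-8 and contains no NUL bytes", might look like this. The name `looks_like_text` and the 4096-byte sample size are choices made for this example and are not part of PureMagic; the check also cannot tell .py, .txt and .md apart.

```python
import codecs


def looks_like_text(data: bytes, sample_size: int = 4096) -> bool:
    """Best-guess check: a non-empty sample with no NUL bytes that is valid UTF-8."""
    sample = data[:sample_size]
    if not sample or b"\x00" in sample:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text(b"# Title\n\nSome *markdown* text.\n"))  # True
print(looks_like_text(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))  # False
```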