New internal module "extract" #1918
base: dev
Nice! This will be a fun one to build out, as we add support for every compression type and enable recursive extraction (archives within archives). I wrote code a while back to do this in credshed, which might be useful:
…st has been written
I like the mapping of compression types to extraction functions. Probably we'll need to improve on our magic filetype detection, especially […]
Also, we might want to favor shell commands over Python libraries, since CPU resources in the main process are really scarce, and offloading to tools like […]
I wrote a system just like this in credshed, where each file would get extracted, and then its contents recursively searched for more compressed files, which would each get extracted to an auto-named folder (e.g. […])
import os
import magic
import logging
import subprocess as sp
from pathlib import Path

log = logging.getLogger('credshed.filestore.util')

# Each entry: (substring of the libmagic description, command template)
supported_compressions = [
    ('microsoft excel', ['ssconvert', '-S', '{filename}', '{extract_dir}/%s.csv']),
    ('rar archive', ['unrar', 'x', '-o+', '-p-', '{filename}', '{extract_dir}/']),
    ('tar archive', ['tar', '--overwrite', '-xvf', '{filename}', '-C', '{extract_dir}/']),
    ('gzip compressed', ['tar', '--overwrite', '-xvzf', '{filename}', '-C', '{extract_dir}/']),
    ('gzip compressed', ['gunzip', '--force', '--keep', '{filename}']),
    ('bzip2 compressed', ['tar', '--overwrite', '-xvjf', '{filename}', '-C', '{extract_dir}/']),
    ('xz compressed', ['tar', '--overwrite', '-xvJf', '{filename}', '-C', '{extract_dir}/']),
    ('lzma compressed', ['tar', '--overwrite', '--lzma', '-xvf', '{filename}', '-C', '{extract_dir}/']),
    ('7-zip archive', ['7z', 'x', '-p""', '-aoa', '{filename}', '-o{extract_dir}/']),
    ('zip archive', ['7z', 'x', '-p""', '-aoa', '{filename}', '-o{extract_dir}/']),
]


def extract_file(file_path, extract_dir=None):
    file_path = Path(file_path).resolve()
    if extract_dir is None:
        extract_dir = file_path.with_suffix('.extracted')
    extract_dir = Path(extract_dir).resolve()
    # Create the extraction directory if it doesn't exist
    if not extract_dir.exists():
        extract_dir.mkdir(parents=True, exist_ok=True)
    # Determine the file type using magic. The table above matches on the
    # textual description (e.g. "gzip compressed data"), not the MIME type,
    # so mime=True must not be passed here.
    file_type = magic.from_file(str(file_path)).lower()
    # Find the appropriate decompression command
    for magic_type, cmd_list in supported_compressions:
        if magic_type in file_type:
            log.info(f'Compression type "{magic_type}" detected in {file_path}')
            cmd_list = [s.format(filename=file_path, extract_dir=extract_dir) for s in cmd_list]
            log.info(f'>> {" ".join(cmd_list)}')
            try:
                sp.run(cmd_list, check=True)
                log.info(f'Decompression successful for {file_path}')
                # Recursively extract files in the new directory
                for item in extract_dir.iterdir():
                    if item.is_file() and is_compressed(item):
                        extract_file(item, extract_dir / item.stem)
                return True
            except sp.SubprocessError as e:
                log.error(f'Error extracting file {file_path}: {e}')
                return False
    log.warning(f'No supported compression type found for {file_path}')
    return False


def is_compressed(file_path):
    # Same description-based check as in extract_file()
    file_type = magic.from_file(str(file_path)).lower()
    return any(magic_type in file_type for magic_type, _ in supported_compressions)
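The lookup step in the snippet above can be pulled out into a pure function, which makes the type-to-command mapping testable without libmagic or any external tool installed. The following is only a sketch: `build_command` and the trimmed-down `COMMANDS` table are hypothetical names introduced for illustration, and the description strings are assumed to be the human-readable libmagic output (for example "Zip archive data, ...").

```python
# Hypothetical, shortened copy of the mapping; the real table has more entries.
COMMANDS = [
    ("tar archive", ["tar", "--overwrite", "-xvf", "{filename}", "-C", "{extract_dir}/"]),
    ("zip archive", ["7z", "x", "-aoa", "{filename}", "-o{extract_dir}/"]),
]


def build_command(description, file_path, extract_dir, commands=COMMANDS):
    """Return the formatted argv for the first matching entry, or None."""
    description = description.lower()
    for magic_type, template in commands:
        if magic_type in description:
            # Fill the placeholders in every argument of the template
            return [
                part.format(filename=file_path, extract_dir=extract_dir)
                for part in template
            ]
    return None


print(build_command("Zip archive data, at least v2.0 to extract", "/tmp/a.zip", "/tmp/a"))
```

Keeping the matching separate from `subprocess` means the table can be checked against known description strings in a unit test, while the actual extraction still runs in an external process.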
Marked this as ready for review now. This should be a good base for extracting the most popular compression types. I have also removed the jadx-compatible compression types from libmagic so that jadx extracts them instead of this module.
@domwhewell-sage thanks for your work on this. It's looking good! A few things:
This Draft PR adds an internal module "extract" which will contain several functions that can extract certain file types into folders ready for excavate to pull out useful information such as URLs, DNS_NAMEs etc.
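To make the idea of extracting files into folders, including archives within archives, more concrete, here is a minimal sketch. It is not the module's code: it swaps the external tools discussed in the conversation for Python's standard `zipfile` module so that it runs without extra dependencies, it handles zip files only, and the name `extract_recursive` and the `max_depth` guard are assumptions added for illustration.

```python
import zipfile
from pathlib import Path


def extract_recursive(path, max_depth=3):
    """Unpack a zip into '<name>.extracted' next to it, then descend into nested zips.

    Returns the list of directories that were created.
    """
    path = Path(path)
    # Stop on non-archives and when the (assumed) depth limit is reached
    if max_depth <= 0 or not zipfile.is_zipfile(path):
        return []
    out_dir = path.with_suffix(".extracted")
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(path) as zf:
        zf.extractall(out_dir)
    created = [out_dir]
    # Materialise the listing first so folders created during recursion
    # are not walked a second time by this loop
    for item in sorted(out_dir.rglob("*")):
        if item.is_file():
            created += extract_recursive(item, max_depth - 1)
    return created
```

The depth limit is one simple way to keep deeply nested or maliciously crafted archives from recursing without bound; a real implementation would probably also want a cap on total extracted size.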