Scanner ignores Files and Directories with 2- and 3-Byte Unicode Characters in their Name #1305

Open
petethecat-52 opened this issue Jan 22, 2025 · 0 comments


I have been using LMS and its predecessors for more than 15 years, currently version 9.0.1 under Windows 11 with the local music library on an NTFS hard drive.
I had noticed missing tracks and even entire albums before, but only recently realized that more than 1,600 audio files cannot be found in LMS, over 1% of the entire music library.

“Lost” audio files differ from the others in that their file names contain Unicode characters that take 2 or 3 bytes (in UTF-8). The LMS scanner silently ignores such files: it logs neither an error nor a warning.

A Java program collected the paths of all audio files and directories whose names contain characters with a (decimal) code greater than 256 and wrote them to a list, along with the codes of the Unicode characters.
The program found more than 200 different “bad” Unicode characters, for example FULLWIDTH COLON (U+FF1A, 65306), HYPHEN (U+2010, 8208), FULLWIDTH SOLIDUS (U+FF0F, 65295), COMBINING CIRCUMFLEX ACCENT (U+0302, 770), and all Cyrillic characters starting with CYRILLIC SMALL LETTER A (U+0430, 1072).
Audio files (and .cue files) with such characters in their names play without any problems in player apps such as Foobar.
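
For anyone who wants to check their own library, here is a minimal sketch of such a check. It is not the original program: the class name `UnicodeNameFinder` and its method names are hypothetical, and the threshold of 256 is simply taken from the description above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of LMS.
final class UnicodeNameFinder {
    // Assumption: report code points above 256, as in the description above.
    static final int THRESHOLD = 256;

    /** Returns "U+XXXX (decimal)" for every code point in name above THRESHOLD. */
    static List<String> suspiciousCodePoints(String name) {
        return name.codePoints()
                .filter(cp -> cp > THRESHOLD)
                .mapToObj(cp -> String.format(Locale.ROOT, "U+%04X (%d)", cp, cp))
                .collect(Collectors.toList());
    }

    /** Walks root and returns one line per file or directory with a suspicious name. */
    static List<String> scan(Path root) throws IOException {
        List<String> report = new ArrayList<>();
        try (Stream<Path> paths = Files.walk(root)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Path fileName = p.getFileName();
                if (fileName == null) {
                    continue; // a drive root has no name component
                }
                List<String> hits = suspiciousCodePoints(fileName.toString());
                if (!hits.isEmpty()) {
                    report.add(p + "\t" + String.join(", ", hits));
                }
            }
        }
        return report;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        scan(root).forEach(System.out::println);
    }
}
```

With JDK 11 or later this can be run directly, e.g. `java UnicodeNameFinder.java g:\test_unicode`. Each output line holds a path, a tab, and the list of code points found in the last path component.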

How did files and directories with “bad” Unicode characters get into my music library? Most likely with the metadata retrieved from the CDDB when ripping CDs in the Exact Audio Copy tool.

The issue with Unicode characters has been known for a long time, as shown by bug fix “#2475 - Problems with filenames containing non-current-codepage (foreign language, double byte) characters” in Changelog6 for version 6.5.4 (2007-08-15). It has never actually been fixed, though, as my tests with various old versions back to 6.5 showed.

This suggests that the cause of the error lies in Perl, in the way Perl reads directory trees.
The opendir / readdir functions in Perl show exactly this incorrect behavior, see “Scanning directories with Perl” https://www.ralph-schuster.eu/2007/11/27/scanning-directories-with-perl/.
But perhaps only installations under Windows with an NTFS file system are affected?
In https://www.perlmonks.org/?node_id=11149351 a function WinReadDir() is presented as a mixture of Perl and JavaScript, which reads files with Unicode file names without errors. Maybe code like this could be integrated?

What happens in LMS when it scans a file with unusual Unicode characters in its name, e.g. “08. Die Toten Hosen – Industrie‐Mädchen.flac”, where the character between “Industrie” and “Mädchen” is not the ordinary hyphen-minus (U+002D) but a HYPHEN (U+2010)?

[25-01-22 16:01:28.1403] main::main (213) Starting Lyrion Music Server scanner (v9.0.1, 1736238071, Thu Jan 9 17:14:13 CUT 2025) perl 5.032001
...
[25-01-22 16:01:28.4504] Slim::Music::Import::runImporter (581) Starting Slim::Media::MediaFolderScan scan
[25-01-22 16:01:28.4507] Slim::Media::MediaFolderScan::startScan (62) Starting audio-only scan in: ["g:\test_unicode"]
[25-01-22 16:01:28.4509] Slim::Utils::Scanner::Local::rescan (156) Rescanning g:\test_unicode
[25-01-22 16:01:28.4510] Slim::Utils::Scanner::Local::rescan (180) Discovering audio files in g:\test_unicode
...
[25-01-22 16:01:30.0406] Slim::Utils::Scanner::Local::Async::ANON (149) Found G:\test_unicode\Toten Hosen, Die(2012) Ballast der Republik\CD2\08. Die Toten Hosen – Industrie?Mädchen.flac
...
At this point the HYPHEN (%E2%80%90 in UTF-8) has already been replaced by a question mark (%3F), and accordingly a record with the URL
file:///G:/test_unicode/Toten%20Hosen,%20Die/(2012)%20Ballast%20der%20Republik/CD2/08.%20Die%20Toten%20Hosen%20-%20Industrie%3FM%E4dchen.flac
is written to the scanned_files table of library.db.
Of course, a check whether a file with this URL exists fails. Consequently, no entry is written to the tracks table and the audio file is lost to LMS.
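
The substitution can be reproduced outside LMS. The sketch below is only an illustration under assumptions: it uses ISO-8859-1 as a stand-in for the Windows ANSI code page (both encode “ä” as the single byte E4), and `LossyNameDemo` and `percentEncode` are hypothetical names. It shows that converting the file name to a single-byte code page turns U+2010 into “?”, which yields exactly the `%3F` and `%E4` seen in the URL above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of LMS.
final class LossyNameDemo {
    /** Percent-encodes bytes, leaving RFC 3986 unreserved characters as they are. */
    static String percentEncode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'A' && v <= 'Z') || (v >= 'a' && v <= 'z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                sb.append((char) v);
            } else {
                sb.append(String.format("%%%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+2010 HYPHEN between "Industrie" and "M", U+00E4 for the a-umlaut.
        String name = "Industrie\u2010M\u00e4dchen.flac";

        // UTF-8 keeps the HYPHEN as three bytes.
        System.out.println(percentEncode(name.getBytes(StandardCharsets.UTF_8)));
        // prints Industrie%E2%80%90M%C3%A4dchen.flac

        // A single-byte code page cannot represent U+2010; getBytes substitutes '?'.
        System.out.println(percentEncode(name.getBytes(StandardCharsets.ISO_8859_1)));
        // prints Industrie%3FM%E4dchen.flac
    }
}
```

Once the name has gone through such a lossy conversion, the original character cannot be recovered, which would explain why the later existence check fails.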

By the way, I did not observe an exception as described in Issue #1256 “Hyphen in folder name makes scan crash”.

Proposal:
If the problem affects only a minority of users, or cannot be solved with reasonable effort, the scanner should at least log an error or warning for every audio file it skips.
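
To make the proposal concrete, here is a rough sketch of the kind of check I have in mind. It is written in Java only for illustration (LMS itself is Perl), it is not LMS code, and the names `SkippedFileReporter` and `warningsFor` as well as the message text are made up: every discovered path that cannot be found again under the name the scanner received produces a warning instead of being dropped silently.

```java
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the proposal, not LMS code.
final class SkippedFileReporter {
    /** Returns one warning line for every discovered path that cannot be found again. */
    static List<String> warningsFor(List<String> discoveredPaths) {
        List<String> warnings = new ArrayList<>();
        for (String raw : discoveredPaths) {
            boolean reachable;
            try {
                reachable = Files.exists(Paths.get(raw));
            } catch (InvalidPathException e) {
                // For example, '?' is not a legal file name character on Windows.
                reachable = false;
            }
            if (!reachable) {
                warnings.add("WARN: skipping " + raw
                        + " (not found under this name; it may contain characters"
                        + " outside the current code page)");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Pass the paths to check as command-line arguments.
        warningsFor(Arrays.asList(args)).forEach(System.out::println);
    }
}
```

Even without a real fix, a log line like this would have pointed me to the 1,600 missing files years ago.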
