Automatically detect character encoding of YAML files and ignore files #630

Before this change, build_temp_workspace() would always encode a path using UTF-8 and the strict error handler [1]. Most of the time, this is fine, but systems do not necessarily use UTF-8 and the strict error handler for paths [2]. [1]: <https://docs.python.org/3.12/library/stdtypes.html#str.encode> [2]: <https://docs.python.org/3.12/glossary.html#term-filesystem-encoding-and-error-handler>

Before this commit, test_run_default_format_output_in_tty() changed the values of sys.stdout and sys.stderr, but it would never change them back. This commit makes sure that they get changed back. At the moment, this commit doesn’t make a user-visible difference. A future commit will add a new test named test_ignored_from_file_with_multiple_encodings(). That new test requires stdout and stderr to be restored, or else it will fail.

@adrienverge

Before this change, yamllint would open YAML files using open()’s default encoding. As long as UTF-8 mode isn’t enabled, open() defaults to using the system’s locale encoding [1][2]. This can cause problems in multiple different scenarios. The first scenario involves linting UTF-8 YAML files on Linux systems. Most of the time, the locale encoding on Linux systems is set to UTF-8 [3][4], but it can be set to something else [5]. In the unlikely event that someone was using Linux with a locale encoding other than UTF-8, there was a chance that yamllint would crash with a UnicodeDecodeError. The second scenario involves linting UTF-8 YAML files on Windows systems. The locale encoding on Windows systems is the system’s ANSI code page [6]. The ANSI code page on Windows systems is NOT set to UTF-8 by default [7]. In the very likely event that someone was using Windows with a locale encoding other than UTF-8, there was a chance that yamllint would crash with a UnicodeDecodeError. Additionally, using open()’s default encoding is a violation of the YAML spec. Chapter 5.2 says: “On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported. If a character stream begins with a byte order mark, the character encoding will be taken to be as indicated by the byte order mark. Otherwise, the stream must begin with an ASCII character. This allows the encoding to be deduced by the pattern of null (x00) characters.” [8] This change fixes all of those problems by implementing the YAML spec’s character encoding detection algorithm. Now, as long as YAML files begin with either a byte order mark or an ASCII character, yamllint will automatically detect them as being UTF-8, UTF-16 or UTF-32. Other character encodings are not supported at the moment. Credit for the idea of having tests with pre-encoded strings goes to @adrienverge [9]. Fixes adrienverge#218. Fixes adrienverge#238. Fixes adrienverge#347. [1]: <https://docs.python.org/3.12/library/functions.html#open> [2]: <https://docs.python.org/3.12/library/os.html#utf8-mode> [3]: <https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html> [4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale> [5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f> [6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding> [7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page> [8]: <https://yaml.org/spec/1.2.2/#52-character-encodings> [9]: <adrienverge#630 (comment)>

Before this change, yamllint would decode files on the ignore-from-file list using open()’s default encoding [1][2]. This can cause decoding to fail in some situations (see the previous commit message for details). This change makes yamllint automatically detect the encoding for files on the ignore-from-file list. It uses the same algorithm that it uses for detecting the encoding of YAML files, so the same limitations apply: files must use UTF-8, UTF-16 or UTF-32 and they must begin with either a byte order mark or an ASCII character. [1]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.input> [2]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.FileInput>

In general, using open()’s default encoding is a mistake [1]. This change makes sure that every time open() is called, the encoding parameter is specified. Specifically, it makes it so that all tests succeed when run like this: python -X warn_default_encoding -W error::EncodingWarning -m unittest discover [1]: <https://peps.python.org/pep-0597/#using-the-default-encoding-is-a-common-mistake>

The previous few commits have removed all calls to open() that use its default encoding. That being said, it’s still possible that code added in the future will contain that same mistake. This commit makes it so that the CI test job will fail if that mistake is made again. Unfortunately, it doesn’t look like coverage.py allows you to specify -X options [1] or warning filters [2] when running your tests [3]. To work around this problem, I’m running all of the Python code, including coverage.py itself, with -X warn_default_encoding and -W error::EncodingWarning. As a result, the CI test job will also fail if coverage.py uses open()’s default encoding. Hopefully, coverage.py won’t do that. If it does, then we can always temporarily revert this commit. [1]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-X> [2]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-W> [3]: <https://coverage.readthedocs.io/en/7.4.0/cmd.html#execution-coverage-run>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatically detect character encoding of YAML files and ignore files #630

Automatically detect character encoding of YAML files and ignore files #630

Commits on Nov 29, 2024

Automatically detect character encoding of YAML files and ignore files #630

Are you sure you want to change the base?

Automatically detect character encoding of YAML files and ignore files #630

Commits on Nov 29, 2024