Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically detect character encoding of YAML files and ignore files #630

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Commits on Nov 29, 2024

  1. tests: Use correct encoding for path

    Before this change, build_temp_workspace() would always encode a path
    using UTF-8 and the strict error handler [1]. Most of the time, this is
    fine, but systems do not necessarily use UTF-8 and the strict error
    handler for paths [2].
    
    [1]: <https://docs.python.org/3.12/library/stdtypes.html#str.encode>
    [2]: <https://docs.python.org/3.12/glossary.html#term-filesystem-encoding-and-error-handler>
    Jayman2000 committed Nov 29, 2024
    Configuration menu
    Copy the full SHA
    6703bb3 View commit details
    Browse the repository at this point in the history
  2. tests: Restore stdout and stderr

    Before this commit, test_run_default_format_output_in_tty() changed the
    values of sys.stdout and sys.stderr, but it would never change them
    back. This commit makes sure that they get changed back.
    
    At the moment, this commit doesn’t make a user-visible difference. A
    future commit will add a new test named test_ignored_from_file_with_multiple_encodings().
    That new test requires stdout and stderr to be restored, or else it will
    fail.
    Jayman2000 committed Nov 29, 2024
    Configuration menu
    Copy the full SHA
    4881789 View commit details
    Browse the repository at this point in the history
  3. decoder: Autodetect detect encoding of YAML files

    Before this change, yamllint would open YAML files using open()’s
    default encoding. As long as UTF-8 mode isn’t enabled, open() defaults
    to using the system’s locale encoding [1][2]. This can cause problems in
    multiple different scenarios.
    
    The first scenario involves linting UTF-8 YAML files on Linux systems.
    Most of the time, the locale encoding on Linux systems is set to UTF-8
    [3][4], but it can be set to something else [5]. In the unlikely event
    that someone was using Linux with a locale encoding other than UTF-8,
    there was a chance that yamllint would crash with a UnicodeDecodeError.
    
    The second scenario involves linting UTF-8 YAML files on Windows
    systems. The locale encoding on Windows systems is the system’s ANSI
    code page [6]. The ANSI code page on Windows systems is NOT set to UTF-8
    by default [7]. In the very likely event that someone was using Windows
    with a locale encoding other than UTF-8, there was a chance that
    yamllint would crash with a UnicodeDecodeError.
    
    Additionally, using open()’s default encoding is a violation of the YAML
    spec. Chapter 5.2 says:
    
    	“On input, a YAML processor must support the UTF-8 and UTF-16
    	character encodings. For JSON compatibility, the UTF-32
    	encodings must also be supported.
    
    	If a character stream begins with a byte order mark, the
    	character encoding will be taken to be as indicated by the byte
    	order mark. Otherwise, the stream must begin with an ASCII
    	character. This allows the encoding to be deduced by the pattern
    	of null (x00) characters.” [8]
    
    This change fixes all of those problems by implementing the YAML spec’s
    character encoding detection algorithm. Now, as long as YAML files
    begin with either a byte order mark or an ASCII character, yamllint
    will automatically detect them as being UTF-8, UTF-16 or UTF-32. Other
    character encodings are not supported at the moment.
    
    Credit for the idea of having tests with pre-encoded strings goes to
    @adrienverge [9].
    
    Fixes adrienverge#218. Fixes adrienverge#238. Fixes adrienverge#347.
    
    [1]: <https://docs.python.org/3.12/library/functions.html#open>
    [2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
    [3]: <https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html>
    [4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
    [5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
    [6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
    [7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
    [8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>
    [9]: <adrienverge#630 (comment)>
    Jayman2000 committed Nov 29, 2024
    Configuration menu
    Copy the full SHA
    e5ef039 View commit details
    Browse the repository at this point in the history
  4. decoder: Autodetect encoding for ignore-from-file

    Before this change, yamllint would decode files on the ignore-from-file
    list using open()’s default encoding [1][2]. This can cause decoding to
    fail in some situations (see the previous commit message for details).
    
    This change makes yamllint automatically detect the encoding for files
    on the ignore-from-file list. It uses the same algorithm that it uses
    for detecting the encoding of YAML files, so the same limitations apply:
    files must use UTF-8, UTF-16 or UTF-32 and they must begin with either a
    byte order mark or an ASCII character.
    
    [1]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.input>
    [2]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.FileInput>
    Jayman2000 committed Nov 29, 2024
    Configuration menu
    Copy the full SHA
    4f97d1f View commit details
    Browse the repository at this point in the history
  5. tests: Stop using open()’s default encoding

    In general, using open()’s default encoding is a mistake [1]. This
    change makes sure that every time open() is called, the encoding
    parameter is specified. Specifically, it makes it so that all tests
    succeed when run like this:
    
    	python -X warn_default_encoding -W error::EncodingWarning -m unittest discover
    
    [1]: <https://peps.python.org/pep-0597/#using-the-default-encoding-is-a-common-mistake>
    Jayman2000 committed Nov 29, 2024
    Configuration menu
    Copy the full SHA
    a09f5f0 View commit details
    Browse the repository at this point in the history
  6. CI: Fail when open()’s default encoding is used

    The previous few commits have removed all calls to open() that use its
    default encoding. That being said, it’s still possible that code added
    in the future will contain that same mistake. This commit makes it so
    that the CI test job will fail if that mistake is made again.
    
    Unfortunately, it doesn’t look like coverage.py allows you to specify -X
    options [1] or warning filters [2] when running your tests [3]. To work
    around this problem, I’m running all of the Python code, including
    coverage.py itself, with -X warn_default_encoding and
    -W error::EncodingWarning. As a result, the CI test job will also fail
    if coverage.py uses open()’s default encoding. Hopefully, coverage.py
    won’t do that. If it does, then we can always temporarily revert this
    commit.
    
    [1]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-X>
    [2]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-W>
    [3]: <https://coverage.readthedocs.io/en/7.4.0/cmd.html#execution-coverage-run>
    Jayman2000 committed Nov 29, 2024
    Configuration menu
    Copy the full SHA
    a6031a4 View commit details
    Browse the repository at this point in the history