Showing content from https://python.github.io/peps/pep-0706/ below:

PEP 706 – Filter for tarfile.extractall

Author: Petr Viktorin <encukou at gmail.com>
Discussions-To: Discourse thread
Status: Final
Type: Standards Track
Created: 09-Feb-2023
Python-Version: 3.12
Post-History: 25-Jan-2023, 15-Feb-2023
Resolution: Discourse message
Abstract

The extraction methods in tarfile gain a filter argument, which allows rejecting files or modifying metadata as the archive is extracted. Three built-in named filters are provided, aimed at limiting features that might be surprising or dangerous. These can be used as-is, or serve as a base for custom filters.

After a deprecation period, a strict (but safer) filter will become the default.

Motivation

The tar format is used for several use cases, many of which have different needs. For example:

To support all its use cases, the tar format has many features. In many cases, it’s best to ignore or disallow some of them when extracting an archive.

Python allows extracting tar archives using tarfile.TarFile.extractall(), whose docs warn to never extract archives from untrusted sources without prior inspection. However, it’s not clear what kind of inspection should be done. Indeed, it’s quite tricky to do such an inspection correctly. As a result, many people don’t bother, or do the check incorrectly, resulting in security issues such as CVE-2007-4559.

Since tarfile was first written, it’s become more accepted that warnings in documentation are not enough. Whenever possible, an unsafe operation should be explicitly requested; potentially dangerous operations should look dangerous. However, TarFile.extractall looks benign in a code review.

Tarfile extraction is also exposed via shutil.unpack_archive(), which allows the user to not care about the kind of archive they’re dealing with. The API is very inviting for extracting archives without prior inspection, even though the docs again warn against it.

It has been argued that Python is not wrong – it behaves exactly as documented – but that’s beside the point. Let’s improve the situation rather than assign/avoid blame. Python and its docs are the best place to improve things.

Rationale

How do we improve things? Unfortunately, we will need to change the defaults, which implies breaking backwards compatibility. TarFile.extractall is what people reach for when they need to extract a tarball. Its default behaviour needs to change.

What would be the best behaviour? That depends on the use case. So, we’ll add several general “policies” to control extraction. They are based on use cases, and ideally they should have straightforward security implications:

After a deprecation period, the last option – the most limited but most secure one – will become the default.

Even with better general defaults, users should still verify the archives they extract, and perhaps modify some of the metadata. Superficially, the following looks like a reasonable way to do this today:
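A minimal sketch of such a loop, assuming a hypothetical `looks_safe()` check that stands in for whatever verification the caller performs (both helper names are illustrative, not part of tarfile):

```python
import tarfile


def looks_safe(member):
    # Hypothetical, deliberately naive check: real verification is much harder.
    return member.isfile() and not member.name.startswith(("/", ".."))


def extract_checked(tar, location):
    """Verify each member, then extract it individually."""
    extracted = []
    for member in tar:
        if looks_safe(member):
            tar.extract(member, location)
            extracted.append(member.name)
    return extracted
```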

However, there are some issues with this approach:

To solve these issues we’ll:

The hook API will be very similar to the existing filter argument for TarFile.add. We’ll also name it filter. (In some cases “policy” would be a more fitting name, but the API can be used for more than security policies.)

The built-in policies/filters described above will be implemented using the public filter API, so they can be used as building blocks or examples.

Setting a precedent

If and when other libraries for archive extraction, such as zipfile, gain similar functionality, they should mimic this API as much as is reasonable.

To enable this for simple cases, the built-in filters will have string names; e.g. users can pass filter='data' instead of a specific function that deals with TarInfo objects.

The shutil.unpack_archive() function will get a filter argument, which it will pass to extractall.

Adding a function-based API that would work across archive formats is out of scope of this PEP.

Full disclosure & redistributor info

The PEP author works for Red Hat, a redistributor of Python with different security needs and support periods than CPython in general. Such redistributors may want to carry vendor patches to:

The proposal makes this easy to do, and it allows users to query the settings.

Specification

Modifying and forgetting member metadata

The TarInfo class will gain a new method, replace(), which will work similarly to dataclasses.replace. It will return a copy of the TarInfo object with attributes replaced as specified by keyword-only arguments: name, mtime, mode, linkname, uid, gid, uname and gname.

Any of these, except name and linkname, will be allowed to be set to None. When extract or extractall encounters such a None, it will not set that piece of metadata. (If uname or gname is None, it will fall back to uid or gid as if the name wasn’t found.) When addfile or tobuf encounters such a None, it will raise a ValueError. When list encounters such a None, it will print a placeholder string.

The documentation will mention why the method is there: TarInfo objects retrieved from TarFile.getmembers are “live”; modifying them directly will affect subsequent unrelated operations.

Filters

TarFile.extract and TarFile.extractall methods will grow a filter keyword-only parameter, which takes a callable that can be called as:

filter(/, member: TarInfo, path: str) -> TarInfo|None

where member is the member to be extracted, and path is the path to where the archive is extracted (i.e., it’ll be the same for every member).

When used it will be called on each member as it is extracted, and extraction will work with the result. If it returns None, the member will be skipped.

The function can also raise an exception. This can, depending on TarFile.errorlevel, abort the extraction or cause the member to be skipped.

Note

If extraction is aborted, the archive may be left partially extracted. It is the user’s responsibility to clean up.

We will also provide a set of defaults for common use cases. In addition to a function, the filter argument can be one of the following strings: 'fully_trusted', 'tar' or 'data'.

Any other string will cause a ValueError.

The corresponding filter functions will be available as tarfile.fully_trusted_filter(), tarfile.tar_filter(), etc., so they can be easily used in custom policies.

Note that these filters never return None. Skipping members this way is a feature for user-defined filters.

Defaults and their configuration

TarFile will gain a new attribute, extraction_filter, to allow configuring the default filter. By default it will be None, but users can set it to a callable that will be used if the filter argument is missing or None.

Note

String names won’t be accepted here. That would encourage code like my_tarfile.extraction_filter = 'data'. On Python versions without this feature, this would do nothing, silently ignoring a security-related request.

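If both the argument and attribute are None: in Python 3.12 and 3.13, a DeprecationWarning will be emitted and extraction will use the 'fully_trusted' filter; from Python 3.14 on, the 'data' filter will be used.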

Applications and system integrators may wish to change extraction_filter of the TarFile class itself to set a global default. When using a function, they will generally want to wrap it in staticmethod() to prevent injection of a self argument.

Subclasses of TarFile can also override extraction_filter.

FilterError

A new exception, FilterError, will be added to the tarfile module. It’ll have several new subclasses, one for each of the refusal reasons above. FilterError’s member attribute will contain the relevant TarInfo.

In the lists above, “refusing” to extract a file means that a FilterError will be raised. As with other extraction errors, if the TarFile.errorlevel is 1 or more, this will abort the extraction; with errorlevel=0 the error will be logged and the member will be ignored, but extraction will continue. Note that extractall() may leave the archive partially extracted; it is the user’s responsibility to clean up.

Errorlevel, and fatal/non-fatal errors

Currently, TarFile has an errorlevel argument/attribute, which specifies how errors are handled:

A filter refusing to extract a member does not fit neatly into the fatal/non-fatal categories.

To satisfy this, FilterError will be considered a fatal error, that is, it’ll be ignored only with errorlevel=0.

Users that want to ignore FilterError but not other fatal errors should create a custom filter function, and call another filter in a try block.

Hints for further verification

Even with the proposed changes, tarfile will not be suited for extracting untrusted files without prior inspection. Among other issues, the proposed policies don’t prevent denial-of-service attacks. Users should do additional checks.

New docs will tell users to consider:

Also, the docs will note that:

This list is not comprehensive, but the documentation is a good place to collect such general tips. It can be moved into a separate document if it grows too long or if it needs to be consolidated with zipfile or shutil (which is out of scope for this proposal).

TarInfo identity, and offset

With filters that use replace(), the TarInfo objects handled by the extraction machinery will not necessarily be the same objects as those present in members. This may affect TarInfo subclasses that override methods like makelink and rely on object identity.

Such code can switch to comparing offset, the position of the member header inside the file.

Note that both the overridable methods and offset are only documented in source comments.

tarfile CLI

The CLI (python -m tarfile) will gain a --filter option that will take the name of one of the provided default filters. It won’t be possible to specify a custom filter function.

If --filter is not given, the CLI will use the default filter ('fully_trusted' with a deprecation warning now, and 'data' from Python 3.14 on).

There will be no short option. (-f would be confusingly similar to the filename option of GNU tar.)

Other archive libraries

If and when other archive libraries, such as zipfile, grow similar functionality, their extraction functions should use a filter argument that takes, at least, the strings 'fully_trusted' (which should disable any security precautions) and 'data' (which should avoid features that might surprise users).

Standardizing a function-based filter API is out of scope of this PEP.

Shutil

shutil.unpack_archive() will gain a filter argument. If it’s given, it will be passed to the underlying extraction function. Passing it for a zip archive will fail for now (until zipfile gains a filter argument, if it ever does).

If filter is not specified (or left as None), it won’t be passed on, so extracting a tarball will use the default filter ('fully_trusted' with a deprecation warning now, and 'data' from Python 3.14 on).

Complex filters

Note that some user-defined filters need, for example, to count extracted members or do post-processing. This requires a more complex API than a filter callable. However, that complex API need not be exposed to tarfile. For example, with a hypothetical StatefulFilter users would write:

with StatefulFilter() as filter_func:
    my_tar.extractall(path, filter=filter_func)

A simple StatefulFilter example will be added to the docs.

Note

The need for stateful filters is a reason against allowing registration of custom filter names in addition to 'fully_trusted', 'tar' and 'data'. With such a mechanism, API for (at least) set-up and tear-down would need to be set in stone.

Backwards Compatibility

The default behavior of TarFile.extract and TarFile.extractall will change, after raising DeprecationWarning for 2 releases (shortest deprecation period allowed in Python’s backwards compatibility policy).

Additionally, code that relies on tarfile.TarInfo object identity may break, see TarInfo identity, and offset.

Backporting & Forward Compatibility

This feature may be backported to older versions of Python.

In CPython, we don’t add warnings to patch releases, so the default filter should be changed to 'fully_trusted' in backports.

Other than that, all of the changes to tarfile should be backported, so hasattr(tarfile, 'data_filter') becomes a reliable check for all of the new functionality.

Note that CPython’s usual policy is to avoid adding new APIs in security backports. This feature does not make sense without a new API (TarFile.extraction_filter and the filter argument), so we’ll make an exception. (See Discourse comment 23149/16 for details.)

Here are examples of code that takes into account that tarfile may or may not have the proposed feature.

When copying these snippets, note that setting extraction_filter will affect subsequent operations.
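One such pattern, sketched with a hypothetical helper name, uses the 'data' filter when the feature is present and otherwise warns before falling back to unfiltered extraction:

```python
import tarfile
import warnings


def extract_preferring_data(tar, dest):
    """Use the 'data' filter if this Python has it; warn and fall back if not."""
    if hasattr(tarfile, "data_filter"):
        tar.extractall(dest, filter="data")
        return "data"
    warnings.warn(
        "tarfile extraction filters are unavailable; extracting without one",
        RuntimeWarning,
    )
    tar.extractall(dest)
    return "unfiltered"
```

Passing the filter as an argument, rather than assigning extraction_filter, keeps the choice local to this one call.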

Security Implications

This proposal improves security, at the expense of backwards compatibility. In particular, it will help users avoid CVE-2007-4559.

How to Teach This

The API, usage notes and tips for further verification will be added to the documentation. These should be usable for users who are familiar with archives in general, but not with the specifics of UNIX filesystems nor the related security issues.

Reference Implementation

See pull request #102953 on GitHub.

Rejected Ideas

SafeTarFile

An initial idea from Lars Gustäbel was to provide a separate class that implements security checks (see gh-65308). There are two major issues with this approach:

However, many of the ideas behind SafeTarFile were reused in this PEP.

Add absolute_path option to tarfile

Issue gh-73974 asks for adding an absolute_path option to extraction methods. This would be a minimal change to formally resolve CVE-2007-4559. It doesn’t go far enough to protect the unaware, nor to empower the diligent and curious.

Other names for the 'tar' filter

The 'tar' filter exposes features specific to UNIX-like filesystems, so it could be named 'unix'. Or 'unix-like', 'nix', '*nix', 'posix'?

Feature-wise, tar format and UNIX-like filesystem are essentially equivalent, so tar is a good name.

Possible Further Work

Adding filters to zipfile and shutil.unpack_archive

For consistency, zipfile and shutil.unpack_archive() could gain support for a filter argument. However, this would require research that this PEP’s author can’t promise for Python 3.12.

Filters for zipfile would probably not help security. Zip is used primarily for cross-platform data bundles, and correspondingly, ZipFile.extract’s defaults are already similar to what a 'data' filter would do. A 'fully_trusted' filter, which would newly allow absolute paths and .. path components, might not be useful for much except a unified unpack_archive API.

Filters should be useful for use cases other than security, but those would usually need custom filter functions, and those would need an API that works with both TarInfo and ZipInfo. That is definitely out of scope of this PEP.

If only this PEP is implemented and nothing changes for zipfile, the effect for callers of unpack_archive is that the default for tar files is changing from 'fully_trusted' to the more appropriate 'data'. In the interim period, Python 3.12-3.13 will emit DeprecationWarning. That’s annoying, but there are several ways to handle it: e.g. add a filter argument conditionally, set TarFile.extraction_filter globally, or ignore/suppress the warning until Python 3.14.

Also, since many calls to unpack_archive are likely to be unsafe, there’s hope that the DeprecationWarning will often turn out to be a helpful hint to review affected code.

Thanks

This proposal is based on prior work and discussions by many people, in particular Lars Gustäbel, Gregory P. Smith, Larry Hastings, Joachim Wagner, Jan Matejek, Jakub Wilk, Daniel Garcia, Lumír Balhar, Miro Hrončok, and many others.

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.

