Warning: This page is under construction.
This document discusses the Continuous Integration (CI) system for PyTorch.
Currently PyTorch uses GitHub Actions for its various CI build/test configurations. We discuss these in the following sections:
The PyTorch CI system ensures that the proper build/test process passes on both PRs and master commits. There are several terms we would like to clarify:
PyTorch supports many different hardware architectures, operating systems, and accelerator GPUs. Therefore, many different CI workflows run in parallel on each commit to ensure that PyTorch can be built and run correctly in different environments and configurations.
Note: examples are based on the CI workflows at the time of writing.
Not every combination in the Cartesian product of these dimensions is tested. Please refer to the PyTorch CI HUD for more information on all the combinations currently run.
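As a rough illustration of why the full matrix is not exercised, the short sketch below enumerates a hypothetical Cartesian product of configuration dimensions; the dimension values are made-up examples, not the real CI matrix.

```python
# Illustrative only: hypothetical configuration dimensions, not the real CI matrix.
from itertools import product

operating_systems = ["linux", "macos", "windows"]
python_versions = ["3.9", "3.10", "3.11"]
compilers = ["gcc9", "clang12", "msvc"]
accelerators = ["cpu", "cuda", "rocm"]

# The full Cartesian product grows multiplicatively with each dimension...
all_combinations = list(product(operating_systems, python_versions, compilers, accelerators))
print(len(all_combinations))  # 81 combinations for just these small example lists

# ...so CI runs only a curated subset of combinations (see the CI HUD for the real set).
```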
Note: configuration variants can change more rapidly than the basic configurations.
Besides the 4 basic configuration dimensions, we also run some special configuration variants. These variants are created to test specific features or to cover specific test domains. For example:
Currently there are 2 main categories of CI runs for a single commit: CI runs on PR commits and CI runs on commits in main. CI on a PR can be expanded by adding ciflow/<workflow name> tags to the PR. Each category contains a subset of the CI workflows defined in the CI matrix.
We generally consider CI workflows run on commits in main to be the baseline, and we run subsets of this baseline on PRs.
Using labels to change CI on PR
As mentioned in CI Matrix, PyTorch runs different sets of jobs on PRs vs. on main commits. Labels can be added to a PR to control the behavior of CI jobs on it. The most commonly used labels are:
ciflow/trunk: automatically added when @pytorchbot merge is invoked. These tests are run on every commit in master.
ciflow/periodic: runs every 4 hours on master. Includes jobs that are either expensive or slow to run, such as mac x86-64 tests, slow gradcheck, and multigpu.
ciflow/inductor: runs inductor builds and tests. This label may be automatically added by our autolabeler if your PR touches certain files. These jobs are run on every commit in master.
ciflow/slow: runs every 4 hours on master. Runs tests that are marked as slow.
For a complete definition of every job that is triggered by these labels, as well as other labels that are not listed here, search for ciflow/ in the .github/workflows folder or run grep -r 'ciflow/' .github/workflows.
Additional labels include:
keep-going: by default, test jobs stop at the first test failure; use this label to keep going after the first failure.
test-config/<default, distributed, etc>: only run a specific test config.

Our entire test suite takes over 24 hours if run serially, so we shard and parallelize our tests to decrease this time. Test job names generally look like <configuration/architecture information, e.g. OS, Python version, compiler version> / test (<test_config>, <shard number>, <total number of shards>, <machine type>, <optional additional information>). For example, linux-bionic-py3.11-clang9 / test (default, 2, 2, linux.2xlarge) is running the second shard of two shards for the default test config.
Tests are distributed across the shards based on how long they take. Long tests are broken into smaller chunks based on their test times and may show up on different shards as well. Test time information updates every day at around 5PM PT, which can cause tests to move to different shards.
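To make the sharding idea above concrete, here is a minimal sketch of greedy time-based sharding; it is only an illustration of the concept, not PyTorch's actual sharding code, and the test file names and durations are invented.

```python
# Illustrative only: a greedy time-based sharding sketch, not PyTorch's actual
# sharding implementation. Test file names and durations below are invented.
import heapq

def shard_tests(test_times: dict[str, float], num_shards: int) -> list[list[str]]:
    """Assign tests to shards so that total shard runtimes stay roughly balanced."""
    # Min-heap of (accumulated_time, shard_index): always fill the emptiest shard next.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    # Placing the longest tests first gives a better balance.
    for test, duration in sorted(test_times.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        shards[idx].append(test)
        heapq.heappush(heap, (total + duration, idx))
    return shards

if __name__ == "__main__":
    times = {"test_ops.py": 5400.0, "test_nn.py": 3600.0, "test_torch.py": 3000.0,
             "test_jit.py": 2400.0, "test_autograd.py": 1800.0}
    for i, shard in enumerate(shard_tests(times, 2), start=1):
        print(f"shard {i}: {shard}")
```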
Information about which test files are run on each shard can be found by searching the logs for Selected tests.
Some PyTorch tests are currently disabled due to their flakiness, incompatibility with certain platforms, or other temporary brokenness. We have a system where GitHub issues titled “DISABLED test_a_name” disable desired tests in PyTorch CI until the issues are closed, e.g., #62970. If you are wondering what tests are currently disabled in CI, please check out disabled-tests.json, where these test cases are all gathered.
First, you should never disable a test if you're not sure what you're doing, and only contributors with write permissions can disable tests. Tests are important in validating PyTorch functionality, and ignoring test failures is not recommended as it can degrade user experience. When you are certain that disabling a test is the best option, for example, when it is flaky, make plans to fix the test so that it is not disabled indefinitely.
To disable a test, create an issue with the title DISABLED test_case_name (test.ClassName). A real title example would look like: DISABLED test_jit_cuda_extension (__main__.TestCppExtensionJIT). In the body of the issue, feel free to include any details and logs as you normally would with any issue. It is possible to only skip the test for specific platforms (like rocm, windows, linux, or mac), or for specific test configs (available ones are inductor, dynamo, and slow). To do this, include a line (case insensitive) in the issue body like so: "<start of line>Platforms: Mac, Windows<end of line>."
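As a rough illustration of the rule above (a case-insensitive line in the issue body beginning with "Platforms:"), the snippet below shows one way such a line could be parsed; it is not the actual parsing code used by PyTorch's tooling.

```python
# Illustrative only: one way to extract the "Platforms:" line from a DISABLED
# issue body, following the rule described above. Not PyTorch's actual tooling.
import re

def parse_platforms(issue_body: str) -> list[str]:
    """Return platforms listed on a line like 'Platforms: Mac, Windows' (case insensitive)."""
    for line in issue_body.splitlines():
        match = re.match(r"platforms:\s*(.+)", line.strip(), flags=re.IGNORECASE)
        if match:
            return [p.strip().lower() for p in match.group(1).split(",") if p.strip()]
    return []  # no platform filter: the test is disabled everywhere

body = """This test is flaky on macOS and Windows runners.
Platforms: Mac, Windows
See the attached logs for details."""
print(parse_platforms(body))  # ['mac', 'windows']
```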
Currently, you are able to disable whole families of test instantiations by removing the suffixed device type (e.g., cpu, cuda, or meta) AND dtype (e.g., int8, bool, complex) from the test name. So given a test test_case_name_cuda_int64 (test.ClassNameCUDA), you can:
disable only that instantiation with the issue title DISABLED test_case_name_cuda_int64 (test.ClassNameCUDA), or
disable all instantiations with the issue title DISABLED test_case_name (test.ClassName). NOTE: You must get rid of BOTH the device and dtype suffixes AND the suffix in the test suite name as well (see the sketch below).

It is not easy to test these disabled tests with CI since they are automatically skipped. Previous alternatives were to either mimic the test environment locally (often not convenient) or to close the issue and re-enable the test in all of CI (risking breaking trunk CI).
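Tying the suffix rules above together, the hypothetical helper below sketches how the device and dtype suffixes could be stripped from an instantiated test name to form the generalized issue title; the suffix lists are assumed examples, not PyTorch's full set, and this function is not part of any PyTorch tooling.

```python
# Illustrative only: derive a generalized DISABLED issue title from an
# instantiated test name by stripping device/dtype suffixes. The suffix lists
# here are assumed examples, not PyTorch's full set.
import re

DEVICE_SUFFIXES = ("cpu", "cuda", "meta")
DTYPE_SUFFIXES = ("int8", "int64", "bool", "complex", "float32", "float64")

def generalize(test_name: str, suite_name: str) -> str:
    """e.g. ('test_case_name_cuda_int64', 'test.ClassNameCUDA') -> 'DISABLED test_case_name (test.ClassName)'"""
    name = test_name
    for dtype in DTYPE_SUFFIXES:
        name = re.sub(rf"_{dtype}$", "", name)
    for device in DEVICE_SUFFIXES:
        name = re.sub(rf"_{device}$", "", name)
        # Drop the matching device suffix from the suite name too (e.g. ClassNameCUDA -> ClassName).
        suite_name = re.sub(rf"{device.upper()}$", "", suite_name)
    return f"DISABLED {name} ({suite_name})"

print(generalize("test_case_name_cuda_int64", "test.ClassNameCUDA"))
# DISABLED test_case_name (test.ClassName)
```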
PRs with key phrases like “fixes #55555” or “Close https://github.com/pytorch/pytorch/issues/62359” in their PR bodies or commit messages will re-enable the tests disabled by the linked issues (in this example, #55555 and #62359).
More accepted key phrases are defined by the GitHub docs.
We have various sources of reruns in our CI to help deal with flakiness. There may also be other sources of retries, for example retrying network calls.
Which commit is used in CI for your PR?
The code used in your PR is the code in the most recent commit of your PR. However, the workflow file definitions are a merge between the current main branch workflow file definitions and those found in your PR. This can cause failures when the merged workflow file references a file that does not exist yet in your PR; this is generally resolved by rebasing.
A compilation of dashboards and metrics relating to CI can be found on the HUD metrics page.