saved_corpus/data.pkl

- userwarning: loky-backed parallel loops cannot be called in a multiprocessing with num_workers=1 but two dataloaders
- [dtensor] add device_mesh.device_type to make RNGStateTracker support CUDA-like devices
- [dynamo] Better determinism of `ConfigModule` by walking using pytree
- [dynamo] AutogradFunctionMethodHigherOrderVariable check for new guards is broken
- Is it a good time to switch to CXX11_ABI?
- [dynamo] Expand _nonvar_fields names
- Allow to specify specific files for debug info
- New swap function
- [dynamo] add repro for functorch/fx interop issue (`allow_in_graph`)
- [dynamo]: `nn.Module` recursively set `training` mode via `train` and `eval`
- ninja: build stopped: subcommand failed
- WIP Implement channels_last_3d convolution
- Add CSR tensor with non-contiguous values support to CuSparseSpMatCsrDescriptor
- [dynamo] `{*}Tensor.__init__` from list of ndarray as `torch.stack(List[FakeTensor])`
- GPU computation is not equivalent
- grad is inf/nan when using torch.amp
- [dynamo] Implement `set.__contains__` for `Tensor` as object match of `FakeTensor`
- Support calling __torch_function__ attribute access
- Implementation of Lion Optimizer.
- hack hack hack
- Is the index_add_ function differentiable?
- Bug: torch.compile fails to compile torch.func.vmap with reduction functions and raw python numbers
- Pass `ignored_params` at the leaf FSDP wrapping class call
- Support tracing base torch_function impl
- TensorWithTFOverride inheritance from TensorVariable
- An OOM where there should not be any OOM.
- Not Implemented Issue
- [TESTING] Check Triton update after elementwise dedup fix
- [dynamo] Remove VariableTracker.propagate
- [dynamo] Remove VariableTracker.add_options
- [torchx] Do not terminate parent process if exit code from child isn't valid
- Add cudagraph_mark_step_begin in torch.compiler, reference in error message
- Constrain sdpa to fx strides
- add dry run metrics to td strategies
- Wrong way of checking if CustomModule is a subclass of torch.nn.Module
- [dynamo] Lazily construct symbolic_locals
- Cannot pip install torch 2.0.1
- [Export] Don't serialize missing args with default value
- [dynamo] generic `is_` type shortcut is not appropriately guarded
- Re-enable some embedded bag tests
- [aotinductor] 14k models: CppCompileError: C++ compile error
- `fbgemm` update causes failures in `test_embedding.py`
- lintrunner job time keeps growing
- DISABLED test_meta_outplace_fft_ifft_cpu_uint8 (__main__.TestMetaCPU)
- Add more flexibility on print / output console
- Runnings SentenceTransformer encoding step causes Docker containers on Mac (Silicon) to crash with code 139
- [Release/2.1.1][ONNX] Fix aten::new_zeros due to TorchScript behavior change on Pytorch 2.1 Fix #110935
- [export] 14k models: AssertionError: graph-captured input # 2, of type , is not among original inputs of types
- DISABLED test_sigmoid (__main__.TestQuantizedOps)
- [aotinductor] 14k models: TypeError: make_boxed_func..g() missing 1 required positional argument: 'args'
- [Quantization] Add a test for QAT + PTQ selective quantization in
- Document torch.from_file and fix UntypedStorage.from_file docs
- [Release/2.1.1][DCP] Remove _shard_tensor() call in load_sharded_optimizer_state_dict in optimizer.py #111096
- RecursionError for backend='inductor' with a loop
- Disable dynamo when running generated opcheck tests
- [BE]: ruff apply rule PLW1510 to find silent subprocess errors
- Make require_stride_order peek into AliasedLayout
- [pytorch-vulkan] Support zero-dim
- [Release/2.1.1] [Test][ShardedTensor] Add test for corner case for chunk sharding spec #109626
- AOT Inductor Does not Work with minifier
- Dynamo Compile samples should record file/line that raised exception
- [quant][bc-breaking] Remove deprecated QConfigDynamic
- Buffer overflow not prevented on MPS devices
- [Release/2.1] Introduce is_big_gpu condition for test_max_autotune
- torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::binary_cross_entropy' to ONNX opset version 14 is not supported.
- [dynamo] Fix guard for ndarray calling `torch.as_tensor(None)`
- [dynamo] Tracking: object identity
- torch.dynamo (caching?) issues with `Optional[np.ndarray]` arguments
- Higher-level custom op API, V3
- torch.library: Create helper function `is_functional_schema`
- Change torch.library.impl to accept a device string
- [aotinductor] Update test utility to use AOTIModelRunner
- WIP Adding 512 to xblock size config
- Static Linking C++, Op not available at runtime
- s390x vectorization: implement atanh for complex vectorized data
- DISABLED test_meta_outplace_fft_ifft_cpu_int64 (__main__.TestMetaCPU)
- FSDP CPU Offload + fp16 + sharded grad scaler crash / hang
- [dynamo] higher-order ops do not preserve `FakeTensor` for in-place ops
- torchrun: elastic training not restarted on missing keep-alive heartbeat/scale-down event
- ValueError: Using a target size (torch.Size([491])) that is different to the input size (torch.Size([1, 491])) is deprecated. Please ensure they have the same size.
- DISABLED test_nested_tensor_chunk_cpu_float16 (__main__.TestNestedTensorDeviceTypeCPU)
- Can't export a pth model to onnx (RuntimeError: Couldn't lower all tuples)
- [RFC] Enable Int8-Mixed-BF16 PT2E PTQ Quantization with Inductor
- [Kineto][NCCL][1/n] Add the world size info in NCCL metadata
- DISABLED test_meta_outplace_fft_ifft_cpu_float64 (__main__.TestMetaCPU)
- Support fp8 in AOTInductor
- torch2.1.0 DDP+compile+dynamic_shape cause error
- Batched matmul gives incorrect result on MPS devices
- Status Tracker And Summary of Support Needed: Make Dynamo Generated Artifacts Debuggable
- [dynamo][profiler] console spew of ..."torch._dynamo.variables.torch: [WARNING] Profiler function will be ignored" for pages...
- int_mm microbenchmark experiments
- DISABLED test_narrow_cpu_float64 (__main__.TestNestedTensorDeviceTypeCPU)
- Use Dr.CI GitHub checkrun summary when querying its API fails
- [inductor] Implement clone removal for user defined triton kernel via reinplace_scatters
- DISABLED test_meta_outplace_fft_hfft_cpu_uint8 (__main__.TestMetaCPU)
- Missing `ignored_param` when calling wrapper_cls (FSDP) recursively
- [5/N] Make torch context manager a TorchCtxManagerClassVariable
- maximum Python version supported is not indicated
- Add meta support for embedding bag backward
- DISABLED test_cat_nhwc (__main__.TestQuantizedOps)
- Debug trymerge internal
- [dynamo] fix None routing bug during var_getattr on UDO
- [AOTInductor] Enforce no_grad for Run entries
- [HigherOrderOp] don't mannually set input for cond
- [Pytorch][Vulkan] mean.dim
- [UCC][CUDA] Overlap p2p
- DISABLED test_meta_outplace_fft_hfft_cpu_float64 (__main__.TestMetaCPU)
- DISABLED test_narrow_cpu_float32 (__main__.TestNestedTensorDeviceTypeCPU)
- Add aot inductor test for dynamic batch size
- Revert "Revert "Nvfuser code removal (#111093)""
- `Enum` used as a key of the input raises guards error
- Add testing for foreach scalar Tensor overloads in inductor
- Pass `BUILD_ENVIRONMENT` to MPS tests
- [functorch] support lstm on cuda
- Apply same 'pick_grad' on generating fp64 reference outputs
- [inductor][easy] skip test_extension_backend.py in fbcode
- Add decomp for `replication_pad2d` and use for CUDA deterministic
- Updated new README styling
- wip
- Use 'device' argument in test_sparse.py::TestSparseAnyCUDA::test_as_sparse_gradcheck_*
- DISABLED test_vmapjvpall_linalg_det_singular_cpu_float32 (__main__.TestOperatorsCPU)
- [C10D] C++ Callbacks part 1
- Place local_used_map_dev_ on CPU for MTIA
- Dynamic shapes doesn't work for torch.diff / resize__symint in some cases
- Prolonged network hiccup preventing retrieval of workflow job id
- `illegal memory access` for `torch.sparse.mm(src, other) / deg.view(-1, 1).clamp_(min=1)`
- DISABLED test_meta_outplace_fft_hfft_cpu_complex64 (__main__.TestMetaCPU)
- Tensor `.cuda()` very slow with specific array sizes
- [dynamo] so-called global state guard is installed on global, when in fact values are thread-local
- Enable cupti
- DISABLED test_narrow_cpu_float16 (__main__.TestNestedTensorDeviceTypeCPU)
- build: failure when building pytorch with TBB
- misusing percision value in test_cuda function in torch/testing/_internal/common_nn.py.
- Higher-order derivatives extremely slow, increasing exponentially
- [dynamo] `not aliased -> aliased` Guard only implemented for Tensors
- DISABLED test_meta_outplace_addmm_decomposed_cpu_complex64 (__main__.TestMetaCPU)
- [RFC] Add GradScaler on CPU
- [dynamo] Implement `set.__contains__` for tensors based on object identity
- Fix inconsistency of max_split_size between DeviceStats and CUDAAllocatorConfig
- AOTAutograd: handle set_(), detect metadata mutations that cancel out
- [Bug]: some parameters' grad is None when using FSDP with torch2.1.0
- Custom `ModuleDict.__getitem__(key: tuple)` produces a graph break
- [dynamo] Implement full `is_` checking
- DISABLED test_detach_cpu_float64 (__main__.TestNestedTensorDeviceTypeCPU)
- Bug with as_strided_tensorimpl for MPS devices
- [dynamo] `set.__contains__` is not properly implemented for tensors, by virtue of `eq(Tensor, Tensor)` being inconsistently implemented
- Enhance the unit testing doc: add one more example
- Propose to add constant padding mode to the `torch.nn.functional.grid_sample` function
- [Pytorch][CPU] Switch building compiler to Clang
- DISABLED test_Conv2d_naive_groups_cuda_float16 (__main__.TestConvolutionNNDeviceTypeCUDA)
- DISABLED test_meta_outplace_addmm_decomposed_cpu_complex128 (__main__.TestMetaCPU)
- [dynamo] Fix context wrapping grad mode variable
- DISABLED test_Conv2d_groups_nobias_v2 (__main__.TestConvolutionNN)
- DISABLED test_Conv2d_groups_nobias (__main__.TestConvolutionNN)
- Add compile support for NT unbind
- [dynamo] `no_grad`, `enable_grad` - `_NoParamDecoratorContextManager` are not handled correctly
- Functorch FCD breaks with tensor subclasses
- [vision hash update] update the pinned vision hash
- Insufficient hasattr guards on user defined objects
- [pt2+profiler] attach aot_id to CompiledFunction
- MPS Performance regressions on Sonoma 14.0
- [ci] Save various json files from test infra into folder
- DISABLED test_detach_cpu_float32 (__main__.TestNestedTensorDeviceTypeCPU)
- DISABLED test_meta_outplace_addmm_cpu_complex128 (__main__.TestMetaCPU)
- [re-land][inductor] Refactor and optimize allocation calls (#111117)
- [WIP][TD] Historical edited files and profiling heuristics
- Sparse Tensor Sum Still Does Not Work for PyTorch Geometric
- LBFGS accuracy difference between CPU and GPU
- [BE] Enable Ruff's Flake8 PYI036
- XLA Tensor creation fails on functionalization inside dynamo.
- Dynamo runner: add FSDP handcrafted module wrapping policy
- Fix iphoneos compilation
- Add unit test for ONNX models with torch.distributions.normal.Normal
- Add support to ExportedProgram as input to torch.onnx.dynamo_export
- [ONNX][dynamo] Parameter to export flat graphs
- [dynamo] support comparing LHS constant with tensor
- Use more performant bsr_scatter_mm within bsr_dense_mm when blocksize is 16.
- BFloat16 datatype support in Quantization
- Supports ROCm6.0 reorganization and cleanup
- Incorrect and inconsistent outputs from CrossEntropyLoss(reduction="none") with torch.float16 dtype
- When keep_inference_input_mutations=True is set, one dynamic shape test fails
- BUG: fix np.typecodes under Dynamo
- torch.jit.script persistently changes default from utf-8 to ascii
- multi_head_attention_forward generates different values on MPS compared to CPU
- [caffe2] avoid variable shadowing
- [qnnpack] suppress empty translation unit warning
- Rephrase sentence in "Why and when to use sparsity" for better understanding.
- test_learnable_forward_per_channel fails due to integer overflow
- Use lru_cache to cache indices data for bsr_scatter_mm.
- Set `CAFFE2_STATIC_LINK_CUDA` in installed cmake files
- yolov5_train
- new_qtensor support privateuseone allocator.
- Add tests for strided layout in factory functions
- DISABLED test_meta_inplace_addmm_decomposed_cpu_complex128 (__main__.TestMetaCPU)
- [dynamo] allow DeviceMesh variable desugar ProcessGroup
- [not for review] testing memory planning
- torch.autocast() hangs on CPUs
- [ONNX][dynamo] Failed to export cumsum with dtype=float16
- Turn keep_inference_input_mutations on in aot_eager
- [FX Quant] operator.matmul (@ operator ) is not converted to torch.ops.quantized.matmul
- DISABLED test_compile_dtensor_redistribute_backward (__main__.TestDTensorCompileE2E)
- Nvfuser code base nuke
- [vision hash update] update the pinned vision hash
- [MPS] Add torch.cummin and torch.cummax
- torch.compile of simple loop takes 34 seconds
- Add support for sym_ite
- Avoid c++ exception and stack trace
- [dynamo] Inlining Translator will compile partial subgraphs
- [Inductor] Support user defined triton kernels in inductor
- [vision hash update] update the pinned vision hash
- [fix] accounting for dilation in pool padding assertion
- HSTU large model loading using in-tensor multi-threading
- Multi-node torchrun training job does not use IB Network
- torch.compile x autograd.Function: Make the backward strict mode less srict
- [RFC][inductor] FX graph cache: Add support for symbolic shapes
- wrong dependency version required
- nonnull error
- AOTAutograd generates wrong strides for view+inplace op
- The results of masked.log_softmax on MPS are inconsistent with those on CPU
- [dynamo] Eagerly install guards
- Minifier doesn't transfer execution states like @torch.no_grad to repro
- [inductor] Adding a way to force fusion of int_mm with mul
- [2/N] Apply clang-tidy to c10 CUDA files
- AOTAutograd: avoid intermediate_base logic when all aliased outputs came from a multi_output_view
- Use of -Wl,--as-needed in cmake config files can leak into third-party users' code and modify their own private libraries
- [vision hash update] update the pinned vision hash
- adding way to force int_mm fusion
- [HigherOrderOp] Move map_impl to torch.ops.higher_order
- [inductor] Defer memory operation lowering to wrapper
- DISABLED test_fused_int_mm_mul_gating (__main__.TestPaternMatcher)
- [inductor][BE] split triton_meta and inductor_meta
- Optimise vector allocations
- Add an explicit _shutdown method to ProcessGroupNCCL
- Torch Compile Dynamic fails on sample on diffusers VAE
- [quant][pt2] Default batchnorm aten op has poor numerics during QAT
- [WIP] Testing
- Rename name->qualname in torch.library.impl_abstract
- Add `torch.utils.deterministic.fill_uninitialized_memory` flag
- Tensor.lerp inconsistent when using -Infinity between MPS and CPU
- Tracker for torch._numpy errors under dynamo
- Implement device parameter in Dropout2d
- RuntimeError: CUDAPluggableAllocator does not yet support cacheInfo
- Failed to import transformer.
- Simulating lower memory on GPU does not indicate simulated memory in error message
- fix: Flake8-BugBear code B-026 for PyTorch
- I have a trouble with to_symmetric
- Couldn't export yolov7 quantized model to onnx
- Segmentation fault (crash) running demucs on RX 6700 XT using ROCm 5.6.1
- _foreach_copy_ supports fast copy between cpu and cuda devices.
- CUDA version 12.2 has differential accuracy when executing CPU and GPU
- Custom Tensor Instances Do Not Work With DDP
- DISABLED test_fused_int_mm_mul (__main__.TestPaternMatcher)
- Reenable sgd benchmark
- [WIP] 2D optimizer change
- No speedup using semi-structured sparsity
- Additions to functorch_scan
- [dtensor] simply some padding logic
- [dynamo] Proposal: `@init_values_once` API for initializing tensors and constants - without tracing the function in Dynamo
- "InternalTorchDynamoError: source code not available" when using `torch.compile` in ipynb or google colab
- [AOTInductor] Add UpdateConstants for AOTInductorModel
- [AOTInductor] Add test for AOTInductorModel's interface
- [AOTInductor] Rename model_runner to model_container_runner
- Device index3
- test_max_pool1d reliably OOMs after https://github.com/pytorch/pytorch/pull/111216/
- [dynamo] Make configs more deterministic
- Torch 2.1 compile + FSDP (mixed precision) + LlamaForCausalLM: `RuntimeError: attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with dtype 'float'.`
- Remove dynamo suported check for Windows.
- Rewrite torch.library's documentation
- [dynamo] Add LazyVariableTracker
- [dynamo] annotate config with `@compile_ignored`
- Deprecated verbose parameter in LR schedulers
- [dynamo] Cache size calc for differing config
- [dynamo] guarded config
- [dynamo] md5 hash non `compile_ignored` configs
- [Quant] [PT2] Add ConvBNAdd(ReLU) Annotation into X86InductorQuantizer
- [Quant] [PT2] Enable QAT Quantization flow in X86InductorQuantizer
- Cannot use compiled model together with the ddp strategy
- [vision hash update] update the pinned vision hash
- Specialize on symint if used as a dict key
- [pytorch] bfloat16 support in erfinv
- [AOTInductor] 14k models: AssertionError: Dynamo attempts to add additional input during export
- [AOTInductor] 14k models: AssertionError: Failed to produce a graph during tracing. Tracing through 'f' must produce a single graph.
- [AOTInductor] 14k models: UserError: Tried to use data-dependent value in the subsequent computation
- [AOTInductor] 14k models: AssertionError: original output #2 is None, but only the following types are supported
- pytorch index_select is too slow
- [C10D] address PR feedback for test_hooks.py
- [C10D] callback cleanups and comments
- Add tensor parallel sharding APIs
- [dynamo] `ConfigModule`: Implement mechanism to hash non-`compile_ignored` configs quickly
- torch.export: cannot instantiate Dim from REPL
- [TEST] Test is_allowed at the end
- nanogpt_generate: C++ compile times out, because the generated .cpp file is too large.
- [AOTInductor] 14K models: TypeError: make_boxed_func..g() missing 1 required positional argument: 'args'
- [dynamo] Investigate interop issues with torch_scatter/torch_sparse/pyg_lib
- Test factory functions with layout=torch.strided
- [dynamo] annotate configs with `@compile_ignored`
- [dynamo] Tracking: improve `ConfigModule`
- [TEST] reverse sort signatures by default
- RuntimeError: Expected is_sm80 || is_sm90 to be true, but got false. (Using Google Colab)
- [dynamo] annotate `allow_in_graph` with soft constraints
- `torch.utils.checkpoint` drops custom Tensor attributes
- [WIP] Support tensors as Dict keys
- Trace dynamic batch size with make_fx
- Fix flake8-bugbear B019
- Guards elimination for unused variables
- [inductor][dynamic] fused_attention pattern could not be matched due to sym_size
- torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331, unhandled cuda error (run with NCCL_DEBUG=INFO for details)
- [ONNX][Exporter] Maintain support for exporter arguments export_params and keep_initializers_as_inputs
- Reshape mask for triu to have the same shape as input.
- [BE] Enable Ruff's Flake8 PYI056
- Training iresnet with torch.compile is slower than eager mode for torch 2.1.0
- supported_dtypes(self, device_type) function in torch/testing/_internal/opinfo/core.py cannot enter the cuda branch
- [symnode] support symbool -> symint casting for symint arithmetic
- unique(return_counts=True) fails on MPS for unsorted tensors with 1M+ elements
- add test for consecutive aot inductor compiles
- Result of adding noise is very different in mps vs cuda or cpu
- Regression on CUDA 12.1 for vanilla transformer layer
- [dtensor] fix dtype/device conversion on nn.Modules
- Wrong onnx model from `torch.onnx.export` when using `index_add_` function with duplicate `index` values.
- RuntimeError in run_streaming_llama.py When Using Accelerate with Streaming LLMa Model on A4500 GPU
- overloads can perhaps be more performant?
- [wip] Add wanda pruner to torch.ao.pruning
- [HigherOrderOp] Move _map.py to _higher_order_ops
- [dynamo] `ConfigModule` and `config.patch` are not thread safe
- [HigherOrderOp] make MapHigherOrderOp error out when graph break
- [pytorch][PR] [Inductor][FX passes] Pre grad batch relu fusion
- 'torch._C.Node' object has no attribute 'cs'
- [dynamo/higher order op] fix flaky / disabled tests - context fn is not `None` when a `noop_context_fn`
- Module states cannot be fully synchronized due to the DDP broadcast_buffers breaking change
- No op for aten::where with argument types: Tensor, Tensor, bool.
- Mismatch results of index_add_ between torch.compile Inductor backend and eager mode
- "Invalid Scalar type" when using bf16 allreduce with Gloo backend
- Update ROCm triton pin
- type promotion test for torch.div variants is broken
- RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument state_steps in method wrapper_CUDA___fused_adamw_)
- torch::serialize::OutputArchive::save_to crash if save on C:\
- [MPS] add Hardshrink to MPS backend
- Running inference with model.compile gives us wrong predictions on a X3D regression model trained using PyTorch 1.0
- Add comment to keep PYI041 disabled
- Reduce overhead in cudagraph trees
- [HigherOrderOp] Move map_impl to torch.ops.higher_order
- [AOTInductor] Wrapper codegen fixes
- Build failure with Xcode 15 linker
- Getting "master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified" warning when using rdzv.
- There is a performance drop because we have not yet implemented the batching rule for aten::mkldnn_rnn_layer_backward.
- There is a performance drop because we have not yet implemented the batching rule for aten::mkldnn_rnn_layer_backward.
- AOTAutograd perf: avoid as_strided() calls when we have intermediate bases
- dump_operator_names.cc uses std::cout but dose not include iostream
- Export swallows exception
- Option to disable fastpath in MHA
- DISABLED test_unfuse_bias_addmm (__main__.TestPaternMatcher)
- DISABLED test_uint4x2_mixed_mm_gating_works (__main__.TestPaternMatcher)
- DISABLED test_uint4x2_mixed_mm_fail_to_match (__main__.TestPaternMatcher)
- DISABLED test_uint4x2_mixed_mm_epi (__main__.TestPaternMatcher)
- DISABLED test_uint4x2_mixed_mm (__main__.TestPaternMatcher)
- DISABLED test_splitwithsizes_cat (__main__.TestPaternMatcher)
- [dynamo] Allow autograd.Function tracing to smuggle attributes through lifting
- DISABLED test_mm_plus_mm (__main__.TestPaternMatcher)
- DISABLED test_mixed_mm_gating (__main__.TestPaternMatcher)
- DISABLED test_mixed_mm_epi_works (__main__.TestPaternMatcher)
- DISABLED test_mixed_mm (__main__.TestPaternMatcher)
- DISABLED test__int_mm (__main__.TestSelectAlgorithm)
- [sparse] semi-structured sparse + torch.compile support
- Remove the dependency on kineto when not using kineto
- [HigherOrderOp] cond should accept pytree inputs
- [WIP] Persist copy_ in training graph for inputs that don't require grad
- [ROCm] Skip failing tests on ROCm
- [ROCm] Centos stream9 pytorch image support
- DISABLED test_multi_return_some_unused (__main__.TestSerialize)
- [experiment] see the effect of freezing on AOTInductor
- [ONNX] Export of `torch.distributions.normal.Normal` fails in functionalization
- `CapabilityBasedPartitioner` returns invalid partitions.
- [dynamo] allow_in_graph decorator doesn't work on autograd.Function
- [JIT] Error when scripting wrapper of `matrix_norm` using `p: Union[str, int]`
- [PT2.1] SIGSEGV seen with view + sgn operator inside torch.compile
- Unprompted UserWarning
- ncu python conv2d.py runs indefinitely after activating cudnn.benchmark
- [Dynamo] Error in speculate_subgraph doesn't report inner user stack trace
- [Dynamo] Support more argument types for autograd Function speculate: HigherOrderOperator with body that accepts non-Tensors as input
- [optim] be explicit about CPU scalar tensor dtypes
- Disable FlashAttenion for is_causal=True when seqlen q not equal kv
- Dynamo inlining should compile partial subgraphs
- Increase ROCm test shards to 6
- DISABLED test_mem_leak (__main__.TestProfilerCUDA)
- DISABLED test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_float16 (__main__.TestMatmulCudaCUDA)
- DISABLED test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_float16 (__main__.TestMatmulCudaCUDA)
- DISABLED test_cublas_baddbmm_large_input_2_100_100_100_cuda_float16 (__main__.TestMatmulCudaCUDA)
- DISABLED test_jvp_linalg_det_singular_cpu_float32 (__main__.TestOperatorsCPU)
- `pip install deepspeed` fails if number of GPUs greater than a certain small number?
- [Nested tensor]support nested tensor in hook
- [dynamo] add infinite generators `itertools.{count, repeat, cycle}`
- `torch.is_autocast_enabled()` always False on CPU
- [ONNX] Update ACPT to support Python 3.11
- increase CPU memory requirement for test_nll_loss_large
- [v2.1.1] Release Tracker
- nccl flight recorder
- `model.named_buffers()` fails if module not hashable.
- [sparse] Add wanda pruner to torch.ao.pruning
- kBackendDefaultTimeout is causing a timeout exception when rank 0 process exceeds 30 minutes preparing a dataset.
- Enable more flake8-pyi ruff checks
- RuntimeError: out_ptr == out_accessor[thread_count_nonzero[tid + 1]].data() INTERNAL ASSERT FAILED
- remove \ in cache_dir
- gdb core dump when enable DEBUG mode to compile cpu torch in centos!!!
- Noisy warnings from broadcasting in boolean operations
- Optim.Adam 'step' default setting bug.
- Issue with torch.distributed.launch
- [inductor][cpu] [dynamic shapes][cppwrapper] performance regression
- torch.compile failed. "g++: error: /tmp/torchinductor_**** file not found"
- tried removing extraneous faketensorprop
- Solving pickle error when saving CyclicLR state_dict
- [dynamo] add config to report guard failure values
- Expose CompPhoto library to Fbcode for Consumption by IQCloud Service
- [inductor][fx pass] Add new split cat pattern detection
- The NCCL kernel did not start as expected
- [inductor]: improve support for pow codegen to reciprocal powers of 2
- [inductor] WIP: cumsum template kernel
- RuntimeError: !needs_dynamic_casting::check(iter) INTERNAL ASSERT FAILED at "../aten/src/ATen/native/cpu/Loops.h":349, ... please report a bug to PyTorch.
- [Inductor] `ConstantFolder` Utility Breaking in Recent Nightly
- Adds support for fp8 torch.add
- Reland "[C10] PG observability hooks. (#108815)"
- [Inductor CUTLASS backend] Epilogue fusion codegen (Step 1)
- Skip resizing storage in FSDP during dynamo trace
- Depthwise conv3d slower than normal conv3d
- MAX_JOBS ignored when compiling pytorch from source
- [PyTorch] AOTI: add CPU fast path in aoti_torch_empty_strided
- [PyTorch] -DNDEBUG in inductor codecache builds
- [torch op][xs] verbose error message for type mismatch in toList()
- [dynamo] Add asserts to prevent user defined objects/classes from going into ConstantVariable
- Gradients (Jacobian) in inference
- BCEWithLogitsLoss: Check if labels / targets are within zero and one
- Add batch decomposition for torch.unsafe_chunk
- Verify the indices and provided values for index put meta
- Broadcasting matmul is much slower than corresponding einsum
- Enable Wno-unused-private-field,Wunused-lambda-capture and fix CUDA warnings
- Matmul failure after dtype change on mixed AMD setup
- Improve Best Practices to Edit and Compile Pytorch Source Code On Windows
- [MPS] add complex_out to MPS backend
- Fix num_batches_tracked of BatchNorm when load_state_dict
- Hash mismatch when installing torch
- TMP
- AOTAutograd: set_ under no_grad still triggers "a view of a leaf Variable that requires grad is being used in an in-place operation"
- AOTAutograd: set_ on input that ultimately no-ops fails in runtime_wrapper copy_
- [Inductor] [cpu][amp] Eager model failed to run for some torchbench models
- Use aligned_alloc
- [TP][Inference] Add decompose for matmul op
- [dynamo]: Guard for global function by checking code object match
- Enable flake8-bugbear B020 lint
- "Expected a 'mps:0' generator device but found 'cpu'" using shuffle=True on DataLoader
- Segmentation fault on aarch64 (Rpi4) using Pytorch 2.1.0 & torchaudio
- The derivation of swish activation function is wrong.
- [WIP] unskip and xfail more opinfo tests
- M2 Failing to build example-app in c++
- feat(dynamo): remove inconsistent tracing histories by acknowledging possibility of inconsistent side-effects
- Toggling model.train() causes guard failures every time
- [Easy] Eagerly propagate guards for functools partial
- ONNX converter does not properly trace dynamic axis through graph
- torch.compile does not know .half() has side-effects
- Build process failure with torch_shm_manager
- [dynamo] Properly track user-defined types for `type()`
- Use hidden visibility in OBJECTCXX files
- functools.partial can result in KeyError
- Unrecognized attribute: axes for operator ReduceMean
- [Inductor] remove extra buffer creation in realize into
- .lldbinit formatters only work when building with clang
- [DTensor] Fix DTensor.from_local() returns DTensor with wrong size for uneven sharded tensor
- backward and grad behave inconsistently w.r.t. set_ on leaf variable
- Repro for non-deterministic "operation not permitted when stream is capturing" crash
- torch.distributed.pipeline skip module throws assert error that portal.grad is not None
- Remove requires_grad_info from AOTDispatch
- [PyTorch] AOTI: add CPU fast path in aoti_torch_empty_strided
- [PyTorch] -DNDEBUG in inductor codecache builds
- Dynamo forward hooks registered before compile region (inlined) w/o tensor ops, don't respect is_compiling control flow
- [WIP] Added consistent default values for torch.dropout and related functions.
- [DTensor] DTensor.from_local() creates DTensor with wrong .shape/.size() for un-even sharding case
- Using ChainedScheduler with ReduceLROnPlateau leads to unexpected keyword argument error
- Fused Adamw RuntimeError: params, grads, exp_avgs, and exp_avg_sqs must have same dtype, device, and layout
- [PyTorch] AOTI: add CPU fast path in aoti_torch_empty_strided
- [PyTorch] -DNDEBUG in inductor codecache builds
- Add padding support for dense matrices in (semi-structured sparse @dense) matmul
- [transform] consistent update of constraints
- [inductor] decomposition for complex addition
- Support using SymBool in arithmetics
- pytorch consuming all cpu cores 100% on ARM
- [dynamo][guard-refactor] TupleIteratorGetItemAccessor using LambdaAccessor
- Debug Mode c++ macro
- [dynamo]: `assert counter.frame_count == 1` is a bad practice when checking for no graph breaks
- Only set grad_fn when mirroring if original was not leaf.
- Make torch._check work in Dynamo
- fix(dynamo): `Optimizer._init_group` did not handle return value
- feat(optim): use `has_complex` shortcut flag for all applicable optimizers, use `_view_as_real` auxiliary function
- sliding_window attention in scaled_dot_product
- Error with monai SwinUNETR and checkpointing
- More informative variable names in AOTAutograd
- max_pool3d_with_indices_backward_cuda and avg_pool3d_backward_cuda does not have a deterministic implementation
- onnx export jit.script ShapeInferenceError Unexpected axis value: 1. Expected range [-1, 1)
- [dtensor] reuse dtensor spec as much as possible
- Dynamo guard on global configuration
- [RFC] Scaled Dot Product Attention API Changes
- Backward pass for Nested Tensors using flash attention in sdpa fails
- [sparse] Shape mismatch when doing matmul with semi-strutctured sparse and non-contiguous dense input
- [WIP][DDP] Use compiled_autograd to trace DDP backward allreduce
- [ROCm] Properly set atol/rtol settings for test_Conv2d_groups tests
- opinfo split is confusing
- Pytorch 2.1.0 CUDA 12.x docker image missing
- Raise TypeErrors if Tensor cannot be cast to scalar
- `pytest test/dynamo -v ` fails locally
- [discussion] Have PyTorch functions support python scalars (like NumPy) + introduce convenience constants like `torch.pi` and `torch.e`
- [DO_NOT_MERGE] test torchxla's xla pin update
- Memory efficient attention for tensors where the last dimension is not divisible by 8
- [aimp][pt2] allow FakeTensor to find non-meta common devices
- [experiment] Shard in build?
- [CUDA] Errors when building with cuda 12.2
- torch.compile CPU backend is slower than eager for several transcendental functions
- DISABLED test_type_promotion__foreach_sub (__main__.ForeachTests)
- [skip ci] voz/fsdp_autograd3 tracker PR
- ValueError issued instead of TypeError when tensor is cast to a scalar
- AOTAutograd logging: log autograd graphs
- Add mixed dtypes MM implementation based on CUTLASS upstream
- [torch.compile] Multiple set operations don't work
- [PyTorch 2.1 regression] TorchScript behavior changed from 2.0.1 (and older) to 2.1
- Incorrect docstring / documentation for torch.nn.functional.scaled_dot_product_attention in 2.1
- Multiprocessing takes forever after on .get() with mp.Queue() (Possible Deadlock)
- DISABLED test_cond_with_quantization (__main__.MiscTests)
- wip - Hook new guard system
- [dynamo][guard-refactor] TypeGuardAccessor
- libtorch.so: error adding symbols: file in wrong format
- gh-110507 Add Dtype Support and Clarify Constraints in `torch.nn.softshrink` Documentation
- Switch eigen to GitHub mirror
- Added a unittest for ModuleWrapPolicy callable.
- [WIP] [TD] New heuristic: Historical correlation with TestClass failures
- Support de-functionalizing _c10d_functional.all_reduce in AOTInductor
- Replace int with DeviceIndex for device indices
- Unify torch.SymInt and torch.types.SymInt
- Enable more mypy import following for torch/_inductor/
- Native c10d_functional ops
- automate the full source tarball release asset (sdist)
- [Inductor] ABI-fy some aten fallback kernels
- [Optimus][pt2] Initial opportunity finder
- Call RECORD_KERNEL_FUNCTION_DTYPE
- [For jansel] [Do not review] .data -> set data fn
- Create nested _sdpa_flash
- [CI] Add inductor workflow for rocm
- Clean way to distinguish python subclass NT vs. C++ NT
- On the correctness of torch.signal.windows.cosine
- Add bandwidth to extern kernel calc
- [MPS] Unsupported operand type for * with complex tensors
- performance drop because batching rule for aten::_scaled_dot_product_attention_math is not yet implemented
- [dynamo] Implement set in terms of dict
- [dynamo] Simplify add_dict in preparation to refactor it with call_set
- [dynamo] [easy] Move Set to dicts.py
- Torch Nested Issue With Backward Pass In Transpose
- DynamicQuantizedLinear shows incorrect qscheme after applying eager mode dynamic quantization
- Remove some CUDA nvcc suppression
- [ROCM][CI] Introduce tests-to-include as rocm-test workflow input
- doc modification of torch.nn.softshrink api
- [dynamo] Slow compile times for optimizers due to for loops
- scaled_dot_product returns NaN arrays with eval()
- [DEMO] cached allocate across devices
- [DO NOT MERGE][CUDNN][CUDNN V8 API] Testing submodule 1.0 update
- Fix resume issue in CosineAnnealingWarmRestarts (#88791)
- expose sdpa helpers to python
- expose mem-eff to autograd
- [export] `torch.tensor(0)` should not get burned in as a constant
- [export] Constant tensors should not get lifted to buffers
- [FSDP] [Checkpointing] Loading optimizer state dict with use_orig_params True causes OOM
- [ONNX] Export and runtime error minifier
- [ONNX] Figure out aot inline strategy for Dort / onnxrt backend
- Support mutating constant attributes in export
- [optim] Better support subclassing usecase
- Upgrade CI to ROCm5.7
- Custom tensor attributes not preserved with registered functions
- Local build breakage on AWS cluster
- [fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-buffer-overflow-far-from-bounds (size 4) in c10::IValue::IValue()
- [pytorch][PR][inductor] Change log to debug for Optimus
- `test_pytorch_onnx_onnxruntime_cuda.py` is not run in CI
- [C10D] Split watchdog into CUDA and non-cuda calls.
- Explore Hybrid (CPU+GPU) Graphs in Scalar
parametersrCXTUsing `torch.onnx.export` from file named `onnx.py` results in cryptic error messagerDXATorch.onnx.export of module used positional and keyword argumentsrEXT[fx] Add cache_result option in ShapeProp which can help cache interm tensor in metarFX%Pytorch for Python 3.12 not availablerGXjacrev Issue when Using CudarHX1bypass nvml for torch.cuda.device_count() if rocmrIX7dynamo support for functorch `grad` with pytree inputs.rJXODynamicShapesExportTests::test_retracibility_dynamic_shapes times out with ASANrKX5Add support for `torch.Generator` type in TorchScriptrLX2WIP / TST: allow testing torch._numpy under DynamorMX&[experiment] do not use slow test jsonrNX-Add scatter_mm and bsr_scatter_mm operations.rOXFDifferent results for forward pass of two equal tensors through Conv2drPX>Allow storages to alias even when the deleter is deleteNothingrQX+Removed special case for reshape on the IPUrRXPytorch LoadNativeLibrary issuerSX#[3/N] Clean up CMake target linkingrTX,[xla hash update] update the pinned xla hashrUX6[dynamo][nn_module] Save the nn module object as valuerVX<Categorical Simplex constraint throws error for valid valuesrWXnn.BatchNorm2d (track_running_stats = True) causes "modified by an in-place operation" error when in torch.nn.parallel.DistributedDataParallelrXXbBatchNorm layer 'num_batches_tracked' overwritten with default value when loading empty state_dictrYXlDropout signature inconsistent between `torch.dropout`, `torch.nn.Dropout` and `torch.nn.functional.dropout`rZX<[dynamo] Added support for type for generated custom object r[XRfeat(inductor): Improve compilation speed and add `SGD` Optimizer back to Inductorr\Xbfeat(inductor): Add `RAdam` to Inductor by converting data-dependent control-flow to `torch.where`r]XAONNX export: TransformerEncoder is exported with fixed input dimsr^XMfeat(inductor): Improve `Adamax` to be better fused by Inductor and enable itr_XU[inductor]: Not handling `ConcatKernel/NopKernel` fusions leads to suboptimal 
fusionsr`X)[dynamo] Add support for itertools.repeatraXgfeat(optimizer): `Adagrad` will use `device` when `capturable` - True always when compiling with dynamorbX/[cpu] explicitly vectorize trigamma & polygammarcXXfix(inductor): `ForeachKernelSchedulerNode` group shape should be opaque for graph debugrdX=Perf-Drop (factor=2) Ubuntu-vs-Windows on same PC (dual-boot)reXn`torch.jit.load()` might unresponsive in IBM s390x when loading some certain torchscript saved by x86 machine.rfXktorch.Tensor.__repr__ causes torch.compile to error: "got an unexpected keyword argument 'tensor_contents'"rgX2[vision hash update] update the pinned vision hashrhX3[pytorch] add should_deepcopy flag to AveragedModelriX7[wip, dynamo] run all guard fail hooks after cache missrjXntorch._dynamo.exc.Unsupported: unexpected sourceless type bases: (,)rkXB[ONNX][DO NOT REVIEW] Experiment passing 'dynamic' to dort backendrlX1[inductor] Decompose boolean min/max into all/anyrmX/[ATen] Support multi dim any and all reductionsrnXnpytorch_stargan: "_inductor/fx_passes/joint_graph.py", line 166, in constant_fold_uniform_value KeyError "val"roXFFix for out of bounds read in mobile interpreter FORMAT opcode handlerrpXNFix for out of bounds read in mobile interpreter INTERFACE_CALL opcode handlerrqXIFix for out of bounds registers_ access in mobile TorchScript interpreterrrX3Tests modify global state cause later tests to failrsXAOT Autograd Neg View SupportrtXFPYTORCH_TEST_WITH_DYNAMO=1 pytest -n 4 test/test_nestedtensor.py failsruX.TypeError: Got unsupported ScalarType BFloat16rvXXfeat(Pipeline Parallelism): use mincut optimization for local communication optimizationrwX,Allow nn.Transformer to be exported as ONNX.rxX.Update hipify mappings and rocm_version.h pathryX%[DO NOT MERGE] Fix CMake static buildrzX7[S366352] Avoid creating ncclStartEvent for deadlockingr{X;[FX/const_fold][bugfix] Set codegen on wrapping GraphModuler|X.[dynamo][guard-refactor] Global Guard Accessorr}XGCannot avoid 
kineto_LIBRARY-NOTFOUND error when using pre-built pytorchr~X/ONNX export of torch.nn.Transformer still failsrX[fbcode] s,\bnp\.bool\b,bool,rXcuda/tf32 docs are outdatedrX.Accessing Particular Nightly Builds Don't WorkrXY`torch.func.functional_call` does not work with `__torch_function__ ` Tensor-like objectsrXIDISABLED test_detach_cpu_float16 (__main__.TestNestedTensorDeviceTypeCPU)rX2[inductor] Triton mm codegen uses `make_block_ptr`rX2[ONNX] Replace torchscript with new graph buildingrXKPyTorch with non-shared build (building a single shared lib) is unsupportedrXadd full permutedrX?[fbcode] small fix on dataloader and content_understanding/mainrX![RELAND] Disallow skipping dynamorXcDISABLED test_noncontiguous_samples__native_batch_norm_legit_cuda_float32 (__main__.TestCommonCUDA)rXPRuntimeError: Expected packed scalar Tensor to be of dimension 1. Got 0 instead.rX-cudaMallocAsync cause too much fragmentation.rX7CI: rocm `(default, 1, 3, linux.rocm.gpu)` is very slowrX&Squeeze and unsqeeze 3d to 4d for sdparX+[ONNX] Add sanity check in CI for onnxbenchrXUModule-level bufferization for torch dynamo module spanning multiple `fx.GraphModule`rX:[inductor] Added decomposition for _upsample_bilinear2d_aarXAdd cuSPARSELt to the nightliesrX@[DeviceMesh] Record PT-D API usage on parent mesh dim in MeshEnvrX<Fix for PyTorch mobile flatbuffer loader out of bounds readsrX&Add _worker_end_fn_t to the DataLoaderrX+[TESTING] Capture scalar/dynamic by defaultrXValueError: args contained 2 None's after flattening. 
When exporting a ScriptModule or ScriptFunction, no args may be None because that breaks type propagation.rX)Torch.onnx.dynamo_export stuck at reshaperX0[DONT MERGE][ROCm] Update magma to 2.7.2 versionrX> custom_ops._destroy("test::foo") doesn't remove abstract_implrXFUnbacked SymInts get reallocated whenever you repropagate fake tensorsrX&logging stack_info doesn't do anythingrXHSome ONNX tests have been disabled because of new tensor.split signaturerXXCreate a new heuristic TD rule for failures coming from base commit of the pull requestsrX-[manual] Rewrite remaining mock external_depsrX@[MPS] Fix output stride holistically when input isn't contiguousrX%[WIP] Cache `FakeTensor` propagation.rXASkip cuda kernel launch with torch.sum when dimension length is 0rX8torch._dynamo.reset() before and after running each testrXRtorch.export.export does not support already-fakefied inputs within FakeTensorModerX+Dynamo tests in CI seem to not run at timesrXGPT2ForSequenceClassification, LayoutLMForSequenceClassification: "torch._dynamo.exc.Unsupported: call_function BuiltinVariable(setattr) [HFPretrainedConfigVariable(), ConstantVariable(str), ConstantVariable(str)] {}"rX.[inductor][Optimus]Improve logging for OptimusrXcsam: AssertionError at torch/_inductor/graph.py `assert isinstance(value, (TensorBox, sympy.Expr))`rX+scatter_add: Mixing 0-dim and 1-dim tensorsrX Devices APIrX:Fix linalg_vector_norm ONNX export with wrong output dtyperXVImportError: libc10_cuda.so: cannot open shared object file: No such file or directoryrXBtest_complex_div_underflow_overflow: increase amount of input datarX1Fix clang-tidy errors misc-definitions-in-headersrX#Add IWYU pragma for umbrella headerrXA[BUG] Elastic cannot kill all subprocesses after sending sigterm.rXQDISABLED test_tags_recomputed_rand (__main__.ActivationCheckpointingViaTagsTests)rX@[dynamo] Fix comparison between SymNodeVariable and ListVariablerX[torch.onnx.export causes floating point exception with core dump for empty 
slice assignmentrX/[Dynamo][Test] reland testcase with state againrXFDISABLED test_tags_rand (__main__.ActivationCheckpointingViaTagsTests)rX2[vision hash update] update the pinned vision hashrXGrouped query attentionrX:DISABLED test_circular_dependencies (__main__.TestImports)rXVDISABLED test_tags_multiple_checkpoints (__main__.ActivationCheckpointingViaTagsTests)rX2[experiment] rename distributed tests with backendrX@Update caffe2 with LLVM-18 API change to CreateMalloc/CreateFreerX=Race condition on shutdown involving PThreadPool and autogradrXCDataloader resetting with num_workers=1 and persistent_workers=TruerX[ignore] placeholder PRrXJDISABLED test_raises_mesh_dim_less_than_2 (__main__.TestDeviceMeshGetItem)rX(SummaryWriter.add_figure: add type hintsrX)[aotinductor] Pass TorchIR to AOTInductorrX4tan/tanh discrepancies with complex due to jiteratorrXHDISABLED test_tags_module (__main__.ActivationCheckpointingViaTagsTests)rX=Change signature of CompilerFn for register_backend decoratorrXEPlease offer packages with local version `torch==2.1.0+cpu` for macOSrXB[aotinductor] support _scaled_dot_product_flash_attention fallbackrXERuntimeError: device >= 0 && device < num_gpus INTERNAL ASSERT FAILEDrX3DISABLED test_mandelbrot_numpy (__main__.MiscTests)rXVDISABLED test_tags_function_with_kwargs (__main__.ActivationCheckpointingViaTagsTests)rXODISABLED test_mandelbrot_numpy_dynamic_shapes (__main__.DynamicShapesMiscTests)rX$add test cases for gradscaler on CPUrXadd gradscaler on CPUrX'Implmenet kthvalue for bfloat16 on CUDArXgStatic quantization for Transformer block : AttributeError 'function' object has no attribute 'is_cuda'rX`DISABLED test_tags_function_via_global_checkpoint (__main__.ActivationCheckpointingViaTagsTests)rX2DISABLED test_cat_addmm (__main__.TestMaxAutotune)rX2[vision hash update] update the pinned vision hashrXYtorch-.dist-info WHEEL file contains incorrect metadata for M1/M2 macOS platformrX3Dtype hard-coded in DataLoader (for python 
floats).rX5[WIP] Make ONNX OpSchema function matcher more robustrX4[2/N] Cleanup header inclusions in torch_cpu by iwyurXJWelfordReduction seems to have invalid/dead code when reduction_numel <= 1rX/How to compile torch 2.0.1 version from source?rX2[vision hash update] update the pinned vision hashrX.Simple script segfaulting when grad is enabledrX<Indexed batch matrix multiplication to support MoEs and FFFsrX]Problems when loading PT files und Linux - Duda which are created under Mac Apple Silicon MPSrXpytorch XLA document errorrX6Need latest NCCL support to reduce GPU HBM consumptionrXBatching for is_inrXFix S367052 to unblock ICVR MC3rX=test test_2d_fsdp_integration_fsdp_nested_param_groups failedrX2[vision hash update] update the pinned vision hashrX!Memory access fault with AMD RocmrX((pytorch) add List[float] type to get_lrrX$cm3leon_generate failing compilationrXQImport order issue with torch and pybind11 Library Statically Linked to libstdc++rX9[torch] Defer resolution of allowed/disallowed decoratorsrX0[dynamo] fix reconstruct of ConvertSymintSource.rX0Decompose to native_dropout in eval mode as wellrX)[inductor] Avoid bool being upcast to intrX"Dynamo error for autograd functionrX>Large Discrepancies between PyTorch and ONNXRuntime Inference rX=Add `endpoint` argument in `linspace` to match numpy behaviorrX7[core IR] Add decompositions for _assert_async to no-oprX$Error using torch.onnx.dynamo_exportrXJDISABLED test_tags_function (__main__.ActivationCheckpointingViaTagsTests)rX>moco: torch._dynamo.exc.Unsupported: hasattr: TensorVariable()rXRevert D49433268: Multisect successfully blamed "D49433268: [pytorch][PR] [Inductor] Extend Pattern Matcher to Match Equivalent Function Invocation" for test or build failuresrX<Support torch.export.export through torch.onnx.dynamo_exportrX9Fix CPU bitwise shifts for out-of-limit values in VSX-vecrXZDALLE2_pytorch: "torch._dynamo.exc.Unsupported: call_method NNModuleVariable() eval [] {}"rXPbasic_gnn_gcn: 
ERROR:common:TypeError: object of type 'GreaterThan' has no len()rX)Move at::{Refcounted,}MapAllocator to c10rXg[FSDP ]How to convert sharded_state_dict files into full_state_dict offline without distributed processrX&[inductor][cpu] performance regressionrX8Allow try except check for numpy bfloat16 representationrXIDISABLED test_tags_dropout (__main__.ActivationCheckpointingViaTagsTests)rXHWrongly returns nan for vectorized complex numbers division on PPC/ZArchrXIDISABLED test_tags_decomps (__main__.ActivationCheckpointingViaTagsTests)rX0[BUG?] Why Allocator use stream to manage Block?rXMDISABLED test_symints_location (__main__.ActivationCheckpointingViaTagsTests)rXkCannot use constrain_as_size from fake tensor implementations: RuntimeError: tried to get Int out of SymIntrX;Move InputDim to torch.export instead of defining in a passrXESevere performance regression on deterministic algorithm in torch 2.0rX<Directly support assert on Scalar, instead of forcing TensorrXFix S367052 to unblock ICVR MC3rXtorch._export has no loggingrXV[dynamo][stream] Stream runtime operation in FX graph is ignored by remaining compilerrX[Reland2] Update NVTX to NVTX3rX[MPS] add support for heavisiderX2[vision hash update] update the pinned vision hashr X%Implement Copy-on-write (COW) tensorsr XCDISABLED test_kwargs (__main__.ActivationCheckpointingViaTagsTests)r X"PIN disabled tests for the releaser XGValueError: only one element tensors can be converted to Python scalarsr X<[Inductor CUTLASS backend] Epilogue fusion codegen prototyperX-Incompatible dimensions error for FusedMatMulrX*[Not for Land] Add verbose all-gather inforX2Bits types cannot be used under deterministic moderX0Surround num-destroyed-communicators with spacesrXFix tensor unpicklingrX-Heap-buffer-overflow during tensor unpicklingrX%[inductor] Add lowering for aten.takerXAtest/test_static_runtime.py: test_fork_wait_4 sometimes deadlocksrXj`torch.embedding`, `weight[indices]`, `torch.index_select` returns random data 
with indices on meta devicerXYan issue occurs while `loss.backward()`: You are trying to call the hook of a dead modulerX%Wrong vector shift results on PowerPCrX$[DDP + Dynamo] Tracing DDP AllReducerXcSlow performance when running torch.jit traced model with Flash Attention using libtorch on WindowsrX6LLaMA-2 70b model convert from PyTorch to ONNX format rX/[ONNX] Remove the deprecated function `_export`rX DTensor: summon full tensor API?rX2[vision hash update] update the pinned vision hashrX)fp16 parity issue with traced code on GPUr X=[Quantization] Add "quantization_tag" as metadata to fx proxyr!X8[RFC][TorchElastic] topology info in training apps/ranksr"XC[caffe2/torch] Package Importer with compatibility for Lazy Importsr#XD[C10D] Report detected failures when emitting collective end events.r$XK[dynamo] lift the constraint that cannot make_fx a dynamo compiled functionr%XHistogram Fixes for QATr&X3Profiler should implicitly synchronize gpu devices r'X&assert_is_valid_input_type is too weakr(XMMake torch.cuda.graphs.is_current_stream_capturing() available in TorchScriptr)XQRegression on 2.1 RC RoCm: data parallel error on `torch._C._broadcast_coalesced`r*X<Make standard container classes satisfy container Protocols.r+X&[inductor][cpu] performance regressionr,XI[TorchScript] Support ScriptFunction arguments in torch.jit.script calls.r-X"[DDP + Dynamo] Traceable DDP hooksr.X/Extends the functionality of `nn.BatchNorm1d`.r/X@[RFC]: Moving most torch.compile backends out of core by 12/1/23r0XV[pytree] Make `optree` optional and populate members from `_pytree` when not availabler1X&[WIP] Dynamo CPU backend under Windowsr2XF[FSDP] UnpicklingError when calling save_state_dict in distributed runr3X1FSDP: ShardedStateDict support for world_size = 1r4X.Add Pass to move constructors from cpu to cudar5X2[vision hash update] update the pinned vision hashr6X@Inductor lowering error for aten fallbacks with multiple outputsr7Xsupport for fp8 allgather FSDPr8X.[WIP] compiled 
autograd on inductor torchbenchr9X(InstanceNorm does not catch dim mismatchr:X,Add ``onnx`` backend to ``torch.export`` APIr;X+Add `backend` concept to `torch.export` APIr<X+Enable masked_scatter_backward for inductorr=X0dynamo: break graph when "out" has complex dtyper>XA[Inductor] Move fake_tensors to the same device as example_inputsr?X%[ONNX] Enable more OpInfo tests in fxr@X=AsyncCompile loses useful exception backtrace in __get_resultrAXn"RuntimeError: (*bias): last dimension must be contiguous" with F.scaled_dot_product_attention + torch.compilerBX[inductor] Update triton pinrCX5ensure uint8 is honoured for cpu operations in dynamorDX:test_memory_timeline fails on PPC due to extra temoprariesrEX$[WIP] Trace model attribute mutationrFX\Max pool with negative integer inputs and channels_last memory layout gives the wrong valuesrGXl[Torch-Onnx] Exporting the operator 'quantized::conv_transpose2d' to ONNX opset version 13 is not supported.rHXT[dynamo][jagged tensor] Slow compilation time for a helper function of jagged tensorrIX$Make Dropout take a dim=... 
argumentrJXtorch.optim.AdafactorrKX1Support FloatFunctional subclasses in eager moderLX[[Android: React Native] couldn't find DSO to load: libtorch-code-gen.so when loading model rMX%Add Half support for AvgPool2d on CPUrNXONNX Export errorrOXC[Not for merge][Repro] Unbacked symint in Inductor size_hint outputrPXT[WIP] fix: added check for convolution output shape wrt kernel_size and input lengthrQX-Adding T4 GPUs to inductor nightly benchmarksrRX2[vision hash update] update the pinned vision hashrSX&[fake/meta] Bad meta kernel for conv1drTX[wip]: fsspec remote code cacherUX;[foreach] check for empty tensors before dispatching to MTArVXUpdate torchbench pinrWX-Torch FX SubgraphMatcher Any / Oneof PatternsrXX+attn_output_weights sometimes rerurn `None`rYXDRAFTrZXO`TORCH_DISTRIBUTED_DEBUG=DETAIL` raises a RuntimeError on `_start_coalescing()`r[X"_assert_bound_is_rational can failr\X%[torch.optim/C++] Add NAdam optimizerr]X[dynamo] torch._dynamo.exc.Unsupported: call_function BuiltinVariable(setattr) [TensorVariable(), ConstantVariable(str), ConstantVariable(bool)] {}r^Xk[dynamo] torch._dynamo.exc.Unsupported: comparison SymNodeVariable() ListVariable()r_XDReport NameError when name is not defined, rather than unimplementedr`XVery big differences in output of `torch.lobpcg` (values and run-time) compared to SciPy on a very ill-conditioned Laplacian matrixraX?Performance degradation on AMD + A800 when computation is smallrbX(Avoid cuda stubs libraries being RPATHedrcX$[inductor] Remove `is_big_gpu` checkrdX,Fix MultiProcess failure on nodes with 1 GPUreX4Investigate Strictness of torch.compile `is_big_gpu`rfX=[bug] FALLBACK path has been taken inside: runCudaFusionGrouprgXMFix access to unitialized memory in VSX vector functions for quantized valuesrhX,[POC] Add caching for faketensor propagationriXG[dynamo][symbolic shapes] Long compilation time for KJT helper functionrjX,[xla hash update] update the pinned xla hashrkXl[Docs][Distributed] Add migration notes for 
`--local-rank` option style change for `torchrun` in PyTorch 2.0rlXBProcessGroup is not automatically destroyed when the process exitsrmX5[DTensor] optimizer step performance is still too badrnX2[vision hash update] update the pinned vision hashroX@Revert "[inductor] let codegen not rely on node order (#107320)"rpX>Revert "[inductor] Fix inputs with existing offsets (#108168)"rqX$[Decomposition] hann_window.periodicrrXIInconsistent behavior for in-place operations on coalesced sparse tensorsrsXU[BUG][pytree] treespec serialization for locally defined classes and namedtuple typesrtXITraining results from using MPS backend are poor compared to CPU and CUDAruX:Inconsistent Behavior of `ConvTranspose2d` on CPU and CUDArvX>[dynamo]Scuba log some debug info about list of integer inputsrwX:Backward pass of inverse FFT is sometimes incorrect on GPUrxX2[vision hash update] update the pinned vision hashryX-torch pollutes libgomp symbols when import _CrzXfMemory usage steadily increasing when using back propagation with sparse CSR parameter matrices on CPUr{X&RNN Documentation is Confusing / Wrongr|XYCPU memory cannot get released after `torch.compile` (caused by importing `AsyncCompile`)r}X7Improve IDE Type Hinting for torch.Tensor class methodsr~X [FSDP] supports QLora finetuningrX6Can dtensor flexibly modify the layout via devicemesh?rXVgh-108197 Update in `AdaptiveMaxPool2d` func of `pytorch/torch/nn/modules/pooling.py`rX"[Dynamo] Match closures by code IDrXHCannot export a quantized model that permutes a quantized tensor to ONNXrXUse weakref in fast tracebacksrX*Supporting Block_Ptrs in inductor code genrX2Move eval_frame global variables into module staterX2[vision hash update] update the pinned vision hashrX5[export] Support tracing constant attribute mutationsrXRevert D49284640: Multisect successfully blamed "D49284640: [inductor][Optimus]Improve logging for group batch fusion" for test or build failuresrXtimeout for send() / recv()rX'Interleaved isend and irecv 
causes hangrXD[FSDP] Implement additional check for turn on 2D TP + FSDP extensionrX1Make Fx Generating Incorrect Graph For GPTQ modelrX?FSDP crashes when submodule calls method that isn't `forward()`rX1cuda rng state for 2.0.1 cannot be used for 2.1.0rXCDISABLED test_ddp_activation_checkpointing (__main__.TestMultiProc)rXJDISABLED test_compute_local_shape_and_global_offset_1D (__main__.UtilTest)rX7TCPStore() RuntimeError: unmatched '}' in format stringrX[PT2.1/ PT2.2(Nightly)][torch.compile][dynamic shape enabled]: TorchDynamo failed with Dynamic shape gives runtime error in 'pow' operation.rX[DISABLED test_nondeterministic_alert_median_cuda_float64 (__main__.TestTorchDeviceTypeCUDA)rXUpdate Android to R21erXMinimize protobuf dependencyrX-Adding T4 GPUs to inductor nightly benchmarksrX]DISABLED test_nondeterministic_alert_kthvalue_cuda_float64 (__main__.TestTorchDeviceTypeCUDA)rX2[vision hash update] update the pinned vision hashrXDDISABLED test_backend_match_guard_multi_threads (__main__.MiscTests)rX!Add decomp rule for div and truncrX5Add a multiprocess CI job to torchbench dynamo runnerrXRDISABLED test_nondeterministic_alert_histc_cuda (__main__.TestTorchDeviceTypeCUDA)rXASupport the `ExitStack` context manager (or a simplified version)rX:Create static analysis tool to improve ONNX export successrXtAttribute 'kernel_shape' is expected to have field 'ints' when exporting a module with `List[Tensor]` inputs/outputsrX.aten::squeeze exported to ONNX as an `If` noderXUDISABLED test_nondeterministic_alert_bincount_cuda (__main__.TestTorchDeviceTypeCUDA)rX$PyTorch 2.1 smoke test requirements rX[inductor][cpu] perf regressionrXAdding index function for listsrX\add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPUrX`DISABLED test_nondeterministic_alert_MaxUnpool3d_cuda_float64 (__main__.TestTorchDeviceTypeCUDA)rX`DISABLED test_nondeterministic_alert_MaxUnpool3d_cuda_float32 (__main__.TestTorchDeviceTypeCUDA)rX,Add a unittest for 
ModuleWrapPolicy callablerX2[vision hash update] update the pinned vision hashrX<[profiler] Show shapes for lists of tensors in chrome tracesrXHAdd .item() and .tolist() support in Dynamo/Inductor without graph breakrXU[dynamo] Disable DDPOptimizer or error out if DDPOptimizer + static_graph is detectedrX1[FSDP] Simplify `_fully_sharded_module_to_handle`rXDecompose div.Tensor_moderX)Increase tolerances for atanh opinfo testrXRemove det_singular OpInforX metric tablerXBAOTAutograd should put keep mutations in the graph during trainingrXXAOTAutograd should track view chains so it can replay them, instead of using as_strided.rXEReport name of defining class along side function name in Dynamo logsrXLONNX exporter issue: fails to add conversions exporting T5 Transformer modelrX`DISABLED test_nondeterministic_alert_MaxUnpool3d_cuda_float16 (__main__.TestTorchDeviceTypeCUDA)rX0inductor/test_max_autotune having timeout issuesrXEST] Release only changesrX"Apply release only changes to corerX%TST: add a numpy reproducibility testrXPytorch ROCM windows buildsrX&[torch.optim/C++] Add Adamax optimizerrX`DISABLED test_nondeterministic_alert_MaxUnpool2d_cuda_float64 (__main__.TestTorchDeviceTypeCUDA)rX`torch.mean` not supported for `torch.sparse_coo_tensor`, but `torch.sum` is supported (`scipy.sparse.coo_matrix` does support both `mean` and `sum`)rX`F.conv2d(input, weight, bias, self.stride, RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERRORrXDGradients across different ranks are not synchronized when using DDPrX`DISABLED test_nondeterministic_alert_MaxUnpool2d_cuda_float32 (__main__.TestTorchDeviceTypeCUDA)rX FSDP vs. 
MiCSrXTSparseSemiStructuredTensors are constructed differently from the original dense onesrX>NAN appears in the backward results of masked.cumprod on macosrX`DISABLED test_nondeterministic_alert_MaxUnpool2d_cuda_float16 (__main__.TestTorchDeviceTypeCUDA)rX2[vision hash update] update the pinned vision hashrXC[pytorch][PR] [pytorch-vulkan] add aten::randn_like & aten::normal_rXFCPU Publish: Fix Assign device error, when module has multiple devicesrX\[dynamo] `torch.no_grad` doesn't preserve the name of the wrapped function (eager mode does)rXE[ONNX] Provide an option to not generate `report_dynamo_export.sarif`rXDecomp div.Tensor_moderXJFSDP should have tests for partial state_dict and optim state_dict loadingrX-[inductor] Parameterize ir.Scan on combine_fnrX6Introduce 'backend' concept to torch.export.export APIrX`DISABLED test_nondeterministic_alert_MaxUnpool1d_cuda_float64 (__main__.TestTorchDeviceTypeCUDA)rX3[2D] Add deprecation warning to enable_2d_with_fsdprX Decomp div.Tensor_mode and truncrXZThe API "torch::jit::_load_for_mobile" is limited to create an object living on the stack.rX<Unable to install the latest version of PyTorch using mamba.rXCannot construct `torch.sparse_coo_tensor` (but `scipy.sparse.coo_matrix` works fine): `TypeError: only integer tensors of a single element can be converted to an index`rXpDDP - "No backend type associated with device type cpu" with new Model Phi 1.5 despite everything loaded on GPUsrXMFSDP do not support `ignored_parameters` when `auto_wrap_policy` is specifiedrXCan't initializa NVMLrX?Parameters of cuda module zero out when used in multiprocessingrX:[WIP] Use user directed names for variables where possiblerX2[vision hash update] update the pinned vision hashrXW[TEST][pytorch] Use cpuinfo to determine c10::ThreadPool thread number + internal patchrX|torch.compile/triton holding GIL during compilation and CompiledKernel call results in deadlocks during distributed trainingrX>DISABLED test_out_randn_cuda_float32 
(__main__.TestCommonCUDA)rX%torch.argmax fails for device='mps:0'rXOFaster gc_count update for CUDACachingAllocator (and avoid nullptr dereference)rX7CollectiveFunctionRewriteVariable for all_to_all_singlerXDDISABLED test_out__refs_randn_cuda_float32 (__main__.TestCommonCUDA)rXUUserError: Can't call type() on generated custom object. Please use __class__ insteadrX>AOTAutograd view-replay logic does not handle dtype views wellrX,Allow reductions to write into pinned memoryrXTest triton conda buildrX-[FX] Fix an issue in constructing GraphModulerX%Disable extreme test (fix OOMs maybe)re(XwiprXwiprXwiprX+Batching rule not implemented for `aten::t`rXVGPT2ForSequenceClassification fails accuracy in cpu inference after HF version upgraderXEnable bugbear B023 lintrXtorch.sparse_coo_tensor argname quirks + [feature request] `.numpy()`/`from_numpy` method for sparse_coo_tensor/sparse_csr_tensor (or maybe name them as `.scipy()`/`.from_scipy()` or at least under some `torch.utils.*` namespacerXpr build failures in inductor dynamic shape test for operation tests with simple tensors. 
Side effect of current test framework
Cannot install torchmetrics - ERROR 403
The following will always fail on NixOS
TypeError: mask must have dtype bool
[xla hash update] update the pinned xla hash
fix(test): `test_ops::test_out` previous uncaught error
[FSDP] How can I wrap a model that has both nn.Parameter and nn.Module
Incorrect strides and accuracy when combining `torch.compile` with `op(out=out)` having complex number outputs, `test_ops::test_out` is bugged
[inductor][cpu] perf regression
fix(test, inductor): use deterministic manual seed
DISABLED test_maml_omniglot_mechanism_make_functional_cpu (__main__.TestExamplesCorrectnessCPU)
DISABLED test_complex_half_reference_testing_pow_cuda_complex32 (__main__.TestCommonCUDA)
[vision hash update] update the pinned vision hash
`PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_ops.py -k test_out_{warnings_, *}{_refs_, *}randn_cuda_float32` fails on main
Tighten exclusion of floating point mask to enable most common mask types (those with 0 and -inf only)
MPA linalg cholesky ex implementation
PPC64le: GCC 11.2.1 Linker Error in bin/torch_shm_manager
ZeroTensor (and probably neg/conj) doesn't play well with wrapper tensor subclasses
feat(inductor): Accumulate fp16 for bmm and mm
[WIP] Test pinning a100 docker
[feature request] Provide some sparse eigen solver(s) for PyTorch (maybe via `ARPACK` as in scipy) + SPD sparse / laplace linear system solver - maybe NVidia AMGx library?
About FSDP
Exporting the operator 'aten::_convolution_mode' to ONNX opset version 14 is not supported.
Dtensor and jacrev
Really slow compilation times for torch.compile causing distributed training errors
Unnecessary cuda synchronizations that we should remove in PyTorch
torch.compile graph breaks should be independent of DDP buckets
[vision hash update] update the pinned vision hash
Update FlashAttention guard message for attention mask
[dynamo] Improve recompilation reason infra
Add Lambert W function as torch.special.lambertw
Dynamo's eval_frame.c is not thread/subinterpreter safe
PPC64le: vsx_helpers.h errors
cuda.cmake: Fix Identity Check
cuda.cmake: Found Conflicting, Identical CUDA Install
[vision hash update] update the pinned vision hash
Support ONNX export for aten::select_backward and aten::slice_backward
Add missing ops for new keyboard model
DISABLED test_nondeterministic_alert_MaxUnpool1d_cuda_float32 (__main__.TestTorchDeviceTypeCUDA)
[c10d] fix functional collective reduce op naming convention
[dynamo] Missing guard on global function
Run CI for oneDNN Graph Inductor Integration
RuntimeError: Unrecognized tensor type ID: ZeroTensor
[dynamo] fix functools.wraps on nested functions
nanogpt_generate and clip pass regression
Inconsistent any( ) between cuda and cpu - Incorrect complex to bool conversion
[optimize_ddp] moco - NameError: name 's2' is not defined
Test a100 fixes
Validate that storage have enough memory allocated
Inconsistent, platform-dependent torch.ones_like behavior on metatensors
undefined symbol: cuModuleGetFunction
Get libcudnn from CUDA_HOME
[Inductor][Compile]torch.compile remembers the old weight for evaluation if grad is disabled and it runs on CPU
A100 runners down: apt-get install nvidia-docker2, Could not get lock /var/lib/dpkg/lock-frontend
RuntimeError: DataLoader worker (pid 11011) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit
Certain torch functions are not handled by torch func wrapper
Ubuntu vs Windows: torch.cuda.OutOfMemoryError only happens on Ubuntu
DISABLED test_nondeterministic_alert_MaxUnpool1d_cuda_float16 (__main__.TestTorchDeviceTypeCUDA)
Fix the max pool kernel with channels_last memory layout
about nccl not work
Dependencies.cmake: support building against CUPTI outside of CUDA_SOURCE_DIR
WIP
Tensor Parallel doesn't work with torch.compile
[dynamo][guard refactor] C++ Guard data structure
CompileId in Dynamo log messages should include restart analysis count
Export should never unspec NN module
torch._export.pass_base.ExportPassBaseError: Unsupported target type:
torch._dynamo.export produces object that is not pickleable
DISABLED test_lstm_packed (__main__.CPUReproTests)
[Decomposition] squeeze
[Export] Utils functions for GraphSignature
Export torchvision detection model retinanet_resnet50_fpn
[PT2] Make masked_scatter_backward traceable without unbacked symints
Dynamo Swallowing Exception In Lambda
Back out "[Dynamo x FSDP] Add support for params, buffers, submodules on FSDPManagedNNModuleVariable (#107923)"
[C++ Frontend] Simple Changes for Cleaner Options
Enable BF16 datatype on ROCm
[state_dict][2/N] Implement the flattening and unflattening of optimizer state_dict feature
Collect more env variables in `collect_env.py`
Tracing interpolate with tensor scale_factor is cursed
RuntimeError: tried to get Double out of SymFloat
Sourceforge outage causing multiple CI failures
SDPA with nested backend: expose a way to avoid recomputing data layout information
[ghexport] reintroduce old way of non diff train way of finding oss base
[ghexport] dynamically get github repo id
[Quantization] Add "quantization_tag" as metadata to fx proxy
Update torch.export() doc
[temp] enabled benchmark fusion
[inductor][cpu] performance regression
Fix MASTER_ADDR bug
updated test cases to use MultithreadTestCase
Unable to compile the function which contains dict of types
[dtensor] Add debug tool to visualize sharding
[PT2.0] [.Compile] [Dynamic] Pytorch FX/JIT graph's inputs/nodes ordering is changed when FX recompiles even though the graph operations are same
switch more test cases to use MultithreadTestCase
DISABLED test_complex_half_reference_testing_fft_hfft2_cuda_complex32 (__main__.TestCommonCUDA)
[dtensor] enable tensor metadata check across ranks when run_check=True
test commit 2
test commit 1
DDP Elastic "master_addr" resolution error in environment variables.
[Decomposition] sum
[fx][split][testing] Add testing for #107981
[Decomposition] rand_like
[Decomposition] lift_fresh
[vision hash update] update the pinned vision hash
Support benchmark fusion for TemplateKernel
[TGIF Inplace] [xlv2][1/n] Expose a couple APIs from inline_container that will be used for chunk read
doctr_reco_predictor: ERROR:common:call_function groupby in skip_files Builtin groupby
[DeprecatedAPI][iOS 2][stringWithCString:] - xplat caffe2
Adding Maximal Update Parametrization (µP) to torch.nn.init
Move negative index checking to common.py - Fix issue 97365
Simplify symbolize choice
RuntimeError when calling conv_transpose2d with groups
avg_pool3d_backward fails on meta with grad_input parameter
torch.jit.script produces incorrect gradients
Lower MEMORY_LIMIT_MAX_JOBS to avoid oom during conda builds
[TEST] Try larger instances for conda builds
INTERNAL ASSERT FAILED in `shape_type_inference.cpp`
Add aten::trunc to core IR
libtorch: runtime error when iterating batch of dataloader
Unsupported: inline in skipfiles: Logger.info
Heap buffer overflow with `torch::load` on fuzzy data
uninformative OOM error
torch.topk returned values and indices are reordered if sorted=False
torch.onnx.export does not trace all outputs for the HF BLOOM model
use reduced_precision_reduction flags in Triton matmul
torch.compile operation benchmark result is poor
Back out "Faster gc_count update for CUDACachingAllocator"
autocast not consistent across different GPUs (A100 and RTX A6000)
[inductor] Triton matmul templates should use reduced_precision_reduction flags
[pytorch] Test key ET models export to core aten ir
[codemod] Del `(object)` from 10 inc caffe2/fb/nas_profiler/lookups/xtensa_lookup.py
[WIP] Test threaded multi compile
[export] Fix getattr node issues with custom obj
[torch][cse]fix cse pass for hashing slice
Move sequential partition utils to fx/passes/utils
torchrun fails to run on Windows 11
Introduce triton_jit decorator to simplify defining triton.jittable kernels.
[POC] Avoid `recordStream` for `_reduce_scatter_base`
[POC] Avoid `recordStream` for `_allgather_base`
TorchInductor workers use "fork" which doesn't work in a multithreaded environment
[dynamo] scalar tensor not supported
subclasses <> compile <> dynamic shapes: assume only first inner tensor gets dynamic dims
Call for a deterministic implementation of scatter_add_cuda_kernel
Allow slicing of Nested Tensors along constant dimensions
`bytes(...)` support of torch tensor does not match numpy + it would be nice to support tensor.tobytes() as alias
[1/N] Elimates c10::to_string
Fix permuted sum precision issue for lower precision on CPU
[Decomposition] unbind
[Decomposition] uniform_
[Decomposition] split.Tensor
[Decomposition] resize
[Decomposition] randn_like
[vision hash update] update the pinned vision hash
[Decomposition] randint
[Decomposition] full_like
[Decomposition] exponential_
[Decomposition] bernoulli
Breaking incompatibility with Cuda 12.2, pytorch stable, torchvision
nn.Transformer has dropout layers that BERT / GPT-2 do not have
resutl of (torch.mm(a,b) does not match result of (a[:part,:], b)
[inductor] CPU int32 overflow behavior differs between clang and gcc
Pytorch profiler with Tensorboard example not working
Multi-Head Attention: Only require attn_mask if actually needed
S390x complex division
torch model to onnx conversion success but failed when inference
Eliminate calls of c10::guts::conjunction, c10::guts::disjunction, c10::guts::negation, c10::guts::void_t, c10::invoke and c10::guts::apply
[xla hash update] update the pinned xla hash
Refactor FindSanitizer.cmake
[Dynamo][Guard]expose guard code
[complex][cpu] nansum & nanmean
[cond] cache size limit exceeeded
The CPU version of `torch.cummax` is slow
backend-friendly distributions
RWKV + Adam exp_avg_sq will change from positive to negative after loss.backward()
Suppport Fused AdamW on CPU
DistributedDataParallel to support __getattr__
Efficient and robust calculation of diag(sparse @ diag @ sparse)
Don't call release_lock_on_cudamalloc on ROCm
CNN w variable sized input performance regression 1.10.2 cu113 -> 2.0.1 cu117
[unwind.cpp] In process unwind symbolization
[WIP] lazy list length guarding
`SymInt` input doesn't get optimized out from `torch.compiled()` graph even if unused
_foreach_copy_ with scalar second arg
[quant][pt2e] Refactor annotate functions for binary ops
Torch compile generates incorrect graph on Llama model
Wrong result of first run with torch.compile() when backend is using torch.jit.trace() and model has inplace operators
Revert D48801487: Multisect successfully blamed "D48801487: [export] Copy gm before calling PassManager" for test or build failures
[dynamo] Add DictView variable tracker
torch.einsum() computes different results on cpu and cuda on A100 GPU.
enhance documentation around the developer build
multiple AMD GPUs
[dynamo] Reduce cache size to 4
Crash on converting circular padding to onnx
quantized module serialization through prepack function registration
Generalize weight prepacking during quantized model deserialization
[vision hash update] update the pinned vision hash
FSDP always puts parameters to fp32 when loading state_dict, even if state_dict has bf16 params
NCCL ISend is not asynchronous
[ONNX] dort to inline onnx model before running ort
[dynamorunner] init FSDP on meta device to fix OOM on large models
Use fmt::print to format exception messages
Cross Entropy doesn't work with the specific batch, but works with each sample from this batch
ONNX export constant folding messes up with shared weight deduplication
Added attention mechanism error, Need to modify torch.use_deterministic_algorithms(True)
RuntimeError: dims.value().size() == self->getMaybeRFactorDomain().size()
[inductor][cpu] Perf regression
[dynamo][stream]support device-agnostic stream in dynamo and capture stream/event method in fx graph
[vision hash update] update the pinned vision hash
[optim] Make casting to match params a hook (2nd try)
[ez] Make internal linter happy
Can't run Test/Inductor test: test_compiled_optimizers.py
torchbench mfu mem bandwidth+ refactors
Pass 'dynamic' flag from `torch.compile` to backends
[do not land] [FSDP] deepspeed benchmark
Transformer performance drop due to slow PyTorch GEMMs
ONNX-FX based exporter documentation/tutorial topics for PyTorch 2.1
Back out "Serialize pytree to json string (#106116)"
[REDO][WIP][cuDNN][cuDNN V8 API] Add experimental cuDNN MHA/Flash Attention support
pack_padded_sequence on GPU device
Integrate cutlass headers and scripts in pytorch package
Pytorch versions without the abi3 flag
Unrecognized attribute: axes for operator ReduceMean during onnx model conversion
RuntimeError: NYI: Named tensors are not supported with the tracer
[FSDP] New rate limiter and memory reuse system
DistributedSampler class: Change total_size into num_samples
torch.nn.functional.pad() with value type bool
[docs] F.interpolate(uint8_input, mode = 'bicubic', ...) overshoot behavior: adjust the note in docs to explain that for uint8 saturating store is done and no manual clamp is needed or mention that bicubic is not supported for uint8 inputs
qnnpack quantized model can not be traced
Is there a standard procedure to check the consistency of environment across all nodes in PyTorch DDP training?
[Compile] Running Llama2 with torch.compile and FSDP results in Type mismatch assert in LlamaRotaryEmbedding
Using distributed RPC and DDP together triggers error.
ninja: build stopped: subcommand failed.
AdaptiveMaxPool documentation is not detailed
[inductor] benchmark fusion
[FSDP] incorrect backward prefetch order when using BackwardPrefetch.BACKWARD_POST
Add FakeTensor support to torch._utils._rebuild_tensor
[Performance] Pass in head_size_og to FlashAttentionV2
'test_index_add_correctness' - "AssertionError: RuntimeError not raised by "
Properly skip fast path in TransformerEncoder/MHA if autocast is enabled on CPU
Enable FlashAttentionV2 on Windows
FlashAttentionV2 will OOM when building on ci/cd with default settings
TorchInductor Opinfo fixes for rng ops
[inductor] Replace empty_strided with empty in simple cases
TORCHELASTIC_RESTART_COUNT doesn't seem to be broadcasted to all worker
Automate release only changes in https://github.com/pytorch/pytorch/pull/108053
`Tensor.uniform_` uses illegal argument name `from`
Problems hit when upgrading the version of HF used in CI
Enable Mypy Checking in torch/_inductor/scheduler.py
nccl:all_reduce is not profiled correctly
[inductor] Minifier fails on resnet50_quantized_qat
[BC BREAKING] Change default behavior of scaled_dot_product_attention's causal masking alignment
[inductor] soft_actor_critic training is slower than eager
_sampled_addmm_kernel cause 'misaligned address' with new triton pin
[fx] Show original user stack trace on GraphModule exception if it's available
DISABLED test_multilayer_var_cpu (__main__.CpuTests)
[inductor] minifier fails on moco
[Optimizer Perf] Improve speed of _init_group to c++
Aliased Input/Output Requirement in `aot_export_joint_simple`
Update to newest CUTLASS version 3.2.1
testing out_dtype_int_mm
[POC] Add `frozen_offload`
enhance argument checking for at::embedding_bag
DDP training can not accept subnet address in IPV6
Enhanced Available Backend Discovery and Selection in PyTorch 2
RuntimeError with nn.ConstantPad when using torch.compile in max-autotune mode
[xla hash update] update the pinned xla hash
Undefined Symobl: pybind11::detail::type_caster::load(pybind11::handle, bool)
Script to compare measured (trace) runtimes with estimated runtimes
DISABLED test_autocast_flash_attention (__main__.ActivationCheckpointingViaTagsTests)
[FSDP] Ignored modules on meta device seem to be initialized on CUDA device
ShapeEnv produce_guards AssertionError Triggered when tensor is resized
(#66813) Adding support for slots in subclasses of Module.
Failure in Initiating Pyotch DDP-style code ( Multi-machine multi-card environment)
NameError: name 's1' is not defined
Installation with rocm5.6 results in error: assert len(weights) == expected_node_count AssertionError
Intra-graph communication reordering pass on Inductor scheduler IR (based on #100762)
`upsample_bilinear2d_backward_out_cuda` is nondeterministic
[vision hash update] update the pinned vision hash
fix: adam(w) ignore stride mismatch when dim is size 1
DISABLED test_predispatch_with_for_out_dtype_nested_dynamic_shapes (__main__.DynamicShapesExportTests)
enable in-place buffer mutation
Batching rule for aten::_scaled_dot_product_attention_math not yet implemented
[PyTorch] torch.empty_permuted: rename param name from 'physical_layout' to 'dim_order'
aten.lift throws error in dynamo backends -> RuntimeError: !at::functionalization::impl::isFunctionalTensor(self) INTERNAL ASSERT FAILED at "../aten/src/ATen/FunctionalizeFallbackKernel.cpp":167
Torch compile: libcuda.so cannot found
Improve Error Message in MultiMarginLoss for Inconsistent Target Size
PyTorch profile issues summary
add torch.float16 support for xla amp
add torch.float16 support for xla amp
DISABLED test_redundant_clone_for_layout_convert_cuda (__main__.FreezingCudaTests)
Exporting the operator 'aten::linalg_inv' to ONNX opset version 18 is not supported.
Remove parameter `self` in `typeConvertIndices`
Torch 1.13 Onnx Scope constant name not correct!
Export to onnx error: RuntimeError: ArrayRef: invalid index Index = 3; Length = 3
DISABLED test_conv_weight_layout_convert_cuda (__main__.FreezingCudaTests)
onnx export error
[BUG] "weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'
Provide a `reset_parameters()` method for MultiheadAttention to support FSDP meta device initializtion
Rzou/out dtype
Support values backward on sparse CSR, CSC, BSR, and BSC tensors
[FakeTensor] fake tensor mode not working with inference mode on Tensor.item()
wip add a test
[feature request] [ux proposal] Min-max linear normalization to be supported in F.normalize (or in a new function)
Fail to build C++ test_aot_inductor
DISABLED test_conv_with_as_strided_cpu (__main__.FreezingCpuTests)
[POC][HSDP] Add option to disable all-reduce only
FakeMode should not fakify non persistent buffer
[BE] Consolidation of SymNode methods constant_int, maybe_as_int, etc
[clang-tidy] Get rid of WarningsAsErrors
Graph break: call_function partial in skip_files
`C10_HOST_DEVICE` for `std::isnan(c10::complex)`?
About the multi-node example not working properly
"file_descriptor" multiprocessing sharing strategy works incorrectly in dataloading
Throw error if setting static grads to `None` in `zero_grad()`
nn.AdaptiveMaxPool2d returns identical results within a batch
Got Expand nodes with static shape input when exporting onnx model with dynamic shape
[Inductor] Extend Pattern Matcher to Match Equivalent Function Invocation
FSDP custom args per module
torch.compile() fails when an `autograd.Function` gets called and torch.no_grad() is *not* being used
`torch.distributions.Pareto.sample` sometimes gives `inf`
`add_image_with_boxes` method from `torch.utils.tensorboard.writer.SummaryWriter` is broken
doc stuff
[WIP/CI Test] Try to tighten up VT stack invariant
[feature request] [discussion] Include basic `ctypes` bindings for `cudart`/`cublasLt`/`cublas`/`nvrtc`/`cudnn` with stock PyTorch
Fake Tensor error 'lengths' argument should be a 1D CPU int64 tensor, but got 1D meta Long tensor
Back out "[inductor] make thread order consistent with loop order (#106827)"
Add caffe2 ideep/onednn tests to OSS CI
Add ceil to core IR
DISABLED test_conv_stride_constraints (__main__.CPUReproTests)
libtorch infer error : CUDNN_STATUS_INTERNAL_ERROR
libtorch vs (onnx+tensorRT) show different object detection results
Enable Mypy Checking in torch/_inductor/bounds.py
DISABLED test_make_fx_symbolic_exhaustive_special_bessel_y0_cpu_float32 (__main__.TestProxyTensorOpInfoCPU)
Add Half support for range, logspace, logit, median, nanmedian, kthvalue, poisson, cummax, cummin, prod, cumprod, histc, logcumsumexp, vander, cross, aten2, logaddexp, logaddexp2, hypot, and nextafter on CPU
DISABLED test_make_fx_symbolic_exhaustive_special_bessel_j1_cpu_float32 (__main__.TestProxyTensorOpInfoCPU)
conv cudnn support integers
DISABLED test_make_fx_symbolic_exhaustive_special_airy_ai_cpu_float32 (__main__.TestProxyTensorOpInfoCPU)
[ONNX] Retire FXSymbolicTracer in FX exporter
Bump Triton version
DISABLED test_multilayer_var_dynamic_shapes_cpu (__main__.DynamicShapesCpuTests)
Hardtanh docs are inaccurate/incomplete, since hardtanh behaves like clamp
Inconsistencies when handling scalars that are out of the range relative to the input tensor's dtype
arange.out produces incorrect output when out tensor has dtype long
where.self_out doesn't fail gracefully when inputs have different dtypes
index.Tensor_out & index_put.out errors or segfaults with indices list containing only null tensors
Enable thp(transparent huge pages) for buffer sizes >=2MB
New variables in torch._ops.py pollute the torch.ops namespace
masked_fill_ outputs incorrect results for 'mps' tensor after transpose
Inconsistencies when casting to integral types
torch._dynamo.exc.Unsupported: call_function BuiltinVariable(zip) [ListVariable(), ListVariable(), ListVariable(), UserDefinedObjectVariable(KJTList)] {}
Error in ONNX during Export GLU with Opset 18
[Dynamo] 'NoneType' object is not subscriptable from torchrec (bad error message)
Actually raise an error on all graph breaks with fullgraph=True
torch.nn.functional.cross_entropy different loss when providing one_hot_target and class weights
Enable Mypy Checking in torch/_inductor/debug.py
[Torch.fx] Torch fx failed to trace torch extension library
Add mha to Autocast CPU
torch.dot gives wrong result on Macos
adding _int_mm to out_dtype mm WIP
`RuntimeError: expected scalar type BFloat16 but found Float` with `torch.nn.TransformerEncoder`
A backward bug of dtensor seems to be caused by new_empty_strided
fullgraph=True doesn't actually raise error when you don't manage full graph inside DDP
[DDP PT2] TypeError: convert_frame_assert.._convert_frame_assert() missing 2 required positional arguments: 'hooks' and 'frame_state'
Support integer implementations for max_pool1d/2d/3d (cpu and cuda)
nvfuser does not respect CMAKE_INSTALL_PREFIX when build (cmake) libtorch
add user frame to shape guard
torch.fx.Interpreter modules don't get compiled
torch._dynamo.exc.InternalTorchDynamoError: 'NoneType' object has no attribute 'guards'
ModuleNotFoundError: No module named 'torchgen.code_template'
[ao] updating embedding_bag support for fx and eager
Add dynamo support for `autograd.Function` with multiple return values.
[Inductor] Autotuner Model Training
Previous version not found
[Fix] add validation logics to TCPStore queries
Support AMD Ryzen Unified Memory Architecture (UMA)
torch._dynamo.exc.Unsupported: call_method UserDefinedObjectVariable(defaultdict) items [] {}
dynamo: don't graph break on ctx.mark_dirty
`repeat_interleave` does not support tensor indexes on different devices while `repeat` does
Select on a coalesced COO tensor returns COO tensor with coalesce flag set to False.
Run transformers.OPTForCausalLM(config=config) occurs 'GraphModule' object has no attribute 'compile_subgraph_reason'
Add support for float8_e4m3fnuz and _e5m2fnuz
Enable Mypy Checking in torch/_inductor/triton_heuristics.py
[FakeTensor] `to` doesn't error with `allow_non_fake_inputs=False`
[LibTorch/iOS] Building with METAL support script is freezing
Doc is unclear on how to install pytorch with Cuda via pip
halo, I continue pretrain llama2-13B model, but save state_dict is about 50GB file
[xla hash update] update the pinned xla hash
caching keys+values in TransformerDecoderLayer for faster inference
RuntimeError: Unsupported value kind: Tensor while torch.jit.script nn.Module
Hacked up SHAPE_ENV provenance
[pt2] enable meta tests for `foreach` ops
Dynamo guards on unused Tensor variables
[pt2] raise an error in `test_meta` when `OpInfo` is wrong
Integer multiplication overflow when running torch.nn.AdaptiveAvgPool2d
Integer multiplication overflow when running torch.nn.MaxUnpool3d
Integer multiplication overflow when running torch.diagflat
Storage size calculation overflowed when torch.nn.Upsample
Storage size calculation overflowed when running torch.nn.functional.interpolate
Integer multiplication overflow when running torch.eye
Integer calculation overflow when running torch.nn.functional.adaptive_avg_pool2d
Integer overflow when running torch.nn.functional.upsample_bilinear
Integer overflow when running torch.nn.functional.upsample
Integer overflow when running torch.nn.ReplicationPad3d
Integer overflow when running torch.nn.AdaptiveAvgPool2d
Integer overflow when running torch.nn.MaxUnpool2d
Index out of bound when running torch.gather
Integer overflow when running torch.nn.functional.max_unpool2d
[fx] tracing function with in-place mutation results in unexpected behaviour due to local vars becoming persisted in `GraphModule(nn.Module)`
Appending new logs to existing tbevent files when using tensorboard
NNPACK slow down M1/M2 Mac CPU
Inconsistent results when running torch.arctanh
Export of `quantized::linear_relu` operator not supported with `torch.onnx.export`
[cpu] vectorize asinh
Register AutoRT backend for Windows DirectX
conv2d wrong results on 3090/3090ti
[nightly][jit] bad constant exponent (e+38.f) in default_program fused_mul_div_add
[Test] Upgrade NVTX to NVTX3 - no gated include
Mismatch in type of error raised when reducing along empty slice between eager and primtorch
[Inductor] Autotuner Data Collection
[Inductor] Autotuner Integration
Add drop_remainder & redistribute to torch.chunk and drop_remainder for torch.split
[ROCm] support caffe2 operator export
Adding batched CSR tensors with different sparsities produces an invalid tensor
[ROCm] Add gcnArchName to collect_env and torch.cuda.get_device_properties
Reference cycles involving code -> co_extra -> compiled output -> reference to code
[cpu] vectorize acosh
[torch.optim/C++] Add Adadelta optimizer
Moving tensor to MPS using .to(torch.device('mps') deletes entries from tensor
Create tensors from sets
Conversion from COO with two sparse dimensions to CSR with dense_dim specified fails
[testing] dynamo testing: we should call `dynamo.reset` before running each test with dynamo.
Determinism by using datapipes shuffle
[fbcode][RFC] Parallel fast cat on cpu in cat op
The generated triton MaxPool2d kernel has poor performance on amd vega20/60
[FSDP]coding to multi-node save optimizer error
make backward function explicit in a layer which is a combination of some ops
No checks when running torch.nn.functional.ctc_loss with bogus inputs
Inconsistent results when running torch.nn.functional.embedding_bag on CPU (1.12.0, 1.13.0)
Abort when running torch.set_num_interop_threads
DISABLED test_memory_format_nn_ConvTranspose1d_cuda_complex32 (__main__.TestModuleCUDA)
DISABLED test_wait_i_6 (__main__.TestMultiThreadedWait)
DISABLED test_wait_i_5 (__main__.TestMultiThreadedWait)
[vision hash update] update the pinned vision hash
[Test] add python binding for oneDNN Graph API
max_pool1d, max_pool2d, max_pool3d Integers for cpu and cuda
Problems of torch.multinomial
Multiple runners shutdown for an autoupdate while still running jobs
[regression] Not getting `CUDA error: device-side assert triggered` on main for CUDA_KERNEL_ASSERT2
[Test] Get rid of special case meta registration logic
[LibTorch/iOS] Unknown custom class type quantized.Conv2dPackedParamsBase. Please ensure it is registered
Overly strict type hints for `torch.utils.data.random_split`
caffe does not respect CUDNN_LIB_DIR when building from source (cmake)
Incorrect type hint for `torch.library.Library.define`
DISABLED test_kineto_profiler_with_environment_variable (__main__.TestProfiler)
sparse_mask method ignores masked-in elements of sparse compressed input tensors
[feature request] Support native ONNX export of FFT-related ops in opset17 (with `inverse=True`, it also includes inverse DFT)
DataParallel scatter method split tensor wrong
torch compile error with SyncBatchNorm
[test_nn] add custom device support for convolution tests, embedding tests, pooling tests and nn tests.
Regression in text encoding
[Inductor] Run compiled model failed on 2023_08_17 nightly
Getting more human-readable input and output names in the onnx model exported by torch
dist.scatter is incompatible with transpose/permute operation
torch/testing: Update test command for TorchInductor pytorch#106377
[WIP] test docker builds
using Union[str, Tensor] as an argument to a torch.jit.script function
commit for testing internal tools
[codemod] Use C++17 [[fallthrough]] in 1 file inc caffe2/caffe2/opt/optimizer.cc
NumPy 2.0 Support
dist.destroy_process_group did not destroy the process group well
'MPS' training Issue(s) with NanoGPT: -Inf, NaN's
Enable flake8-bugbear B028 lint
Sparse compressed tensor values autograd support is not implemented
DISABLED test_find_or_create_pg (__main__.TestPgTag)
Translation layer (similar to torch_np) that can reliably lift Python operations into Tensor operations
CUBLAS_STATUS_NOT_SUPPORTED
model.forward() get error with torch.compile() when using huggingface llama
Fix batch_norm_cpu to check the sizes of tensors
`torch.float8_e4m3fn` does not support `torch.cat`
Set debug=0 for nccl build
Conda configuration shouldn't pollute $PATH variable
Fallback random in test
torch.compile not tracing ops on tensor subclass
How to export GNN with dict inputs correctly?
[dynamo] sdp_kernel ctx manager support
Type Annotations inside `torch.compile(fullgraph=True)`
Inconsistency between CPU and GPU for Linear() layer with input size 0
[CPP API] Add Adadelta, Adamax, ASGD, NAdam, RAdam and Rprop
DISABLED test_RNN_input_size_zero (__main__.TestNN)
Documenting `__getitems__` for slicing support in `torch.utils.data`
Documenting `IterableDataset`'s needing `StopIteration` for finite data
The difference between input grad computed by channels last backward and the input grad computed by channels first backward of Hardswish on MPS is too large
[ONNX] ONNX doesn't support exporting non-persistent buffer included models in FakeMode
The difference between channels last backward and channels first backward of AvgPool2d on CUDA is too large
[inductor] [dynamic shape] 5 HF models fails with `Constraints violated` using transformers v4.31.0
DISABLED test_make_fx_symbolic_exhaustive_special_entr_cpu_float32 (__main__.TestProxyTensorOpInfoCPU)
[vision hash update] update the pinned vision hash
[xla hash update] update the pinned xla hash
Can't construct a tensor from List[SymFloat]
DISABLED test_RNN_dropout_state (__main__.TestNN)
Timeout during NCCL initialization due to store
sdp_kernel causes dynamo error on torch.compile(model, fullgraph=True)
Surface NCCL and CUDA version incompatibility
Dynamo test_vmap failures on Python-3.8
Torch randn cannot take symbol shapes as shape argument.
jit compilation returns an int rather than a bool when using math.isnan()
Create fastpath backend context manager, similar to SDPA kernel backend manager
DISABLED test_learnable_forward_per_channel_cpu (quantization.core.test_workflow_ops.TestFakeQuantizeOps)
Drop c10/util/string_view.hpp
[dynamo] calling __torch_function__ with dynamically created subclass of torch.Tensor fails compilation
[xla hash update] update the pinned xla hash
Updating cpuinfo to the latest
[opinfo] Add cases to slice_scatter to improve coverage
torch.inverse throws error when DP but not in DDP or single GPU
[docs] Document dtype conversions dtype.to_complex() dtype.to_real()
combining `vmap` with NN containing `MaxPool2d' leads to discrepancies in output
H100 works differently than rtx4090 on same model
DISABLED test_make_fx_symbolic_exhaustive_special_bessel_y1_cpu_float32 (__main__.TestProxyTensorOpInfoCPU)
from_blob python api
[vision hash update] update the pinned vision hash
WIP: [pt2] add metas for `foreach` ops
Error when using sparse_coo tensor with optimizer
memoryview support for `torch._C.import_ir_module_from_buffer`
Revert "Revert "Add `_foreach_clamp` (#106574)""
[vision hash update] update the pinned vision hash
RuntimeError with operations on torch.float8_e5m2 and torch.float_e4m3fn data types
[FSDP] summon_full_params won't change parameters
[Dynamo] Unable to Trace AdamW Optimizer when there is LR Scheduler
[xla hash update] update the pinned xla hash
[vision hash update] update the pinned vision hash
[xla hash update] update the pinned xla hash
Build a check we can defer to runtime, potentially add to the graph
Extend dict and by extension __dict__ modeling in dynamo to support `setdefault`, `get`
Dynamo x FSDP - Issue Tracking Master Task
UNSTABLE pull / linux-bionic-py3.8-clang9 / test (dynamo)
[RFC] Option to check eager if compile fails
I want to calculate the matrix multiplication of two Boolean matrices, but torch.mm will report an error. Is there any more efficient alternative?
Dynamo not handling a NamedTuple
[TEST] Revert "NumPy support in torch.compile (#106211)"
Vectorized operation on quantized tensors returns wrong values (different rounding)
Doc for dtensor?
a bug about tensor stride
[feature request] [onnx] Support QuantLinear/DequantLinear float16 inputs (opset19 and maybe "backport"-support them for opset17)
torchrun: RendezvousConnectionError when use C10d on multi nodes
Memory tracker does not report the module name correctly.
cov to onnx error
Create fastpath backend context manager, similar to SDPA kernel backend manager
Add flake8-bugbear code B007
max_pool3d_with_indices_backward_cuda does not have a deterministic implementation
[WIP][CI Test] Inline through skipfiles
Apply fusion more aggressively in NAdam and Adagrad compilation
Dynamic shapes support for inductor foreach codegen
tmp test
RuntimeError: 0 INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/alias_analysis.cpp":615, please report a bug to PyTorch. We don't have an op for aten::full but it isn't a special case. Argument types: int[], bool, NoneType, NoneType, Device, bool,
[vision hash update] update the pinned vision hash
Only check for fusions within a node distance of 64
Use aot_compile instead of compile_fx_aot in _export.aot_compile
Add cutlass as an alternative backend of PT2 Inductor
`ray` multiprocessing interference by torch import
Reland "Make adding buffers more like adding parameters (#104069)" (take #2)
Sdpa higher order op
Facing error while using onnx from scatterelements
RuntimeError: _Map_base::at when exporting squeeze
Found two conflicting CUDA installs
ONNX Model Producing Different Results Compared to Original PyTorch and JIT Traced Model
Enable Mypy Checking in torch/_inductor/kernel/conv.py
`tensor.repeat` quirks: has no `torch.` variant, no `out=` variant, no inplace variant | `torch.tile` also does not have `out=` variant and uses `dims=` instead of `dim=`
Enable Mypy Checking in torch/_inductor/fx_passes/mkldnn_fusion.py
Readily available python wheels for windows ARM
stride of gradient is not same as the corresponding tensor
Enable Mypy Checking in torch/_inductor/freezing.py
Enable Mypy Checking in torch/_inductor/exc.py
Enable Mypy Checking in torch/_inductor/codegen/triton_foreach.py
[Minor Bug] Should consume_prefix_in_state_dict_if_present change ordering of keys?
Cannot export MiVOLO model into `onnx` format using `torch.onnx.export`
Other overloads of `_foreach_clamp`
fix B017 lints
[Dynamo x FSDP][5/x] Fix bug in __class__ sources
Port "Fix kDefaultTimeout multiple definition build failure (#97270)" to release/2.0 branch
[torch.optim/C++] Add Adagrad state initialization
[autograd.Function] freevar lifting is too aggressive?
[autograd.Function] torch.compile w/ once_differentiable leads to opaque graph break
[Dynamo x FSDP][3/x] TypedStorage and storage_offset
Dynamo graph break when using pyton module `heapq` (manipulates with `list`s)
[ONNX] Float8
supportr9XJ[Dynamo] Integration exporter's diagnostic system into ONNXRuntime backendr:XG[Dynamo] revise ONNXRuntime backend's use of CapabilityBasedPartitionerr;XA[Dynam] a graph pass in Dynamo-ONNXRuntime backend needs revisionr<XT[Dyanmo] Pre-allocate flag should be a ONNXRuntime inference session level attributer=X\[Dynamo] ONNXRuntime backend (DORT) requires some guards to re-partition extracted by Dynamor>X;[Dynamo] ONNXRuntime Backend Shold Allow External Allocatorr?X#Add Half support for aminmax on CPUr@XERPC all_gather doesn't work with dynamic world size (world_size=None)rAXuntimeError: The following operation failed in the TorchScript interpreter. Traceback of TorchScript (most recent call last): RuntimeError: nvrtc: error: invalid value for --gpu-architecture (-arch)rBX*`1/torch.inf` produce inconsistent resultsrCXeMultiprocess DataLoader doesn't work with sparse tensor as it'll try to access the underlying storagerDXFix warning in CUDAJitLoops.cuhrEX!Use expect tests for error inputsrFX3Ablate TORCH_CUDA_ARCH_LIST from torchaudio installrGXSimplify TypeIndex.hrHXBroadcasting semantics notesrIX7Please verify 1.14.1 ONNX release candidate on TestPyPIrJX<[MPS] Make several ops unranked to avoid graph recompilationrKX?Optimizers should use learning rates passed as tensors directlyrLX\Timer benchmark stores only one time value, and therefore has broken mean/median/etc metricsrMXkDISABLED test_conversions_all_patterns_backend_cutlass_cuda_float16 (__main__.TestSparseSemiStructuredCUDA)rNXfDISABLED test_linear_inference_mode_False_backend_cutlass_cuda (__main__.TestSparseSemiStructuredCUDA)rOXlDISABLED test_conversions_all_patterns_backend_cutlass_cuda_bfloat16 (__main__.TestSparseSemiStructuredCUDA)rPX0Fix prod double backward when there are 2+ zerosrQX7Binary op support for (B, *, D) NT with (B, 1, 1) denserRX&[ux] Suppot torch.tensor(set([1,2,3]))rSXLinf and nan are mapped to quant_min in 
torch.fake_quantize_per_tensor_affinerTXa`torch::nn::MultiheadAttention`, `F::multi_head_attention_forward`: check embed_dim and num_headsrUX9Enable Mypy Checking in torch/_inductor/codegen/triton.pyrVXD[Inductor][cpu] torchbench model doctr_det_predictor perf regressionrWX/[Feature request] Add new API Tensor.device_as rXXC[Inductor][amp] clip: expected scalar type BFloat16 but found FloatrYXC[FX][ONNX][exporter] Failed to export traced fx graph to onnx modelrZXpytorch with ROCM on Windowsr[X&[FSDP][WIP] [Do not review] Trace FSDPr\X2[vision hash update] update the pinned vision hashr]XExpose dcp utilsr^X9Hugging Face safetensor does not work with FakeTensorModer_X'Add AMD image to the .devcontainer specr`X/Provide .devcontainer PyTorch - MPS environmentraX,Add reset_parameters to nn.Module base classrbX!Dev Container Support for PyTorchrcX_'CUDA out of memory' when using a GPU services for reinforcement learning in Torch rpc tutorialrdX[WIP] FSDP rate limiterreX/Dataloader extremely slow on in-memory datasetsrfXCC++ API `torch::nn::MultiheadAttention` Crashes by division by zerorgX3torch.jit.script: scripting doesn't work with wrapsrhX;Enable Mypy Checking in torch/_inductor/fx_passes/pad_mm.pyriXDtorch.polygamma inconsistent with scipy.special.polygamma for n >= 1rjX@Enable Mypy Checking in torch/_inductor/fx_passes/joint_graph.pyrkXQDDP grads not synced when static_graph=True and module output is a dict subclass?rlX&add register backend for custom devicermXAEnabling Transformer fast path for not batch_first (MHA, TE, TEL)rnXJ[docs] Idea collection of examples of custom ops / inline torch extensionsroXHInconsistency between CPU and GPU for `Linear()` layer with input size 0rpXK[docs] URL and link format proposal to make function page URLs more conciserqX>Will torch.sparse.mm support multiplying two boolean matrices?rrX3Question about garbage collection without GPU sync rsX?Enable transformer.py fastpath for not batch_first for TE & TELrtXFAdd functional 
collective all_to_all_single and support it in InductorruX7Dynamo graph break on triplet_margin_with_distance_lossrvX*add is_complex op for ShardedTensor #93886rwX*Using retain_graph in backward() with FSDPrxX1Hackable distributed filesystem reader and writerryXNConfusing error message for DataLoader with num_workers=0 and non-zero timeoutrzXFRefcount problem for torch.distributed.Store objects defined in Pythonr{XHno_grad() changes output of TransformerDecoder module during evaluation r|XDynamo debug decoratorr}Xe[feature request] [ux] Frontend methods for fused elementwise affine transform: mul+add+dtype convert + support `integer_tensor.mul_(float_constant)` and `float_tensor.mul(some_constant, out = integer_tensor)` maybe via new args `rounding_mode=...` and `dtype=...` + maybe support OpenCV-style saturated dtype conversions (e.g. `clamp_` before conversion)r~XBMeta implementations of FFT operators often have incorrect stridesrX2FFT Samples Inputs with More than Three DimensionsrXwCase study of torch.compile / cpp inductor on CPU: min_sum / mul_sum with 1d / matmul-like with static / dynamic shapesrX*Stopgap patch for avoiding polluting cacherXROCm & Windows SupportrXAMore Performant CachingHostAllocator for Pinned Memory AllocationrX$Relu6 not able to process nan valuesrX onednn ops supported in pytorchrX?[ONNX] Keep functional ops as functions in dynamo exported onnxrX'[discussion] move-semantics for tensorsrX@[ROCm] Add summary warning about no write permissions for hipifyrX=Lacking commutativity of `tensor.expand` and `tensor.flatten`rX4[inductor] Add ir.Scan and lower aten.cumsum on CUDArX=[dynamo] Unsupported to trace through Boolean Tensor indexingrXBoolean valued images loaded from disk, when converted to torch int/float tensor, the True valued pixels gets converted to 255 instead of 1rX Enable more flake8-bugbear lintsrX!DTensor Sharding prop cache statsrX'install cuda version always get cpuonlyrX^NotImplementedError: Could not run 'aten::multinomial' 
with arguments from the 'Meta' backend.rX SerializerX9DISABLED test_cpp_wrapper_cpu (__main__.FreezingCpuTests)rX-Pytorch: torch.autograd.grad returns NoneTyperX(UFMT utils tensorboard, data, benchmark.rX>[dynamo] teach dynamo about `pytree._broadcast_to_and_flatten`rX.Can't build PyTorch 1.13.1 with Vulkan supportrX%Potential Issue with Pandas DataframerX2`softmax` to handle dimensions comprised of `-inf`rXBranch name in double quotes ""rXDataset with Queue issuerX>Enable Mypy Checking in torch/_inductor/fx_passes/split_cat.pyrXICUDA device support does not register allocator to c10::GetAllocator(...)rXPytorch + ROCm+ WindowsrX*Add ModuleInfo for torch.nn.ChannelShufflerXLDistributed torch.linalg.eigh (and other functions) on cuda using cuSOLVERMGrX'Use FastCat in PT Concat implementationrXFix reset_parameters for nn.MHArX+Add ModuleInfo testing for reset_parametersrX=Increasing batch size makes network forward 1000 times slowerrXHExtreme slowdown of torch.mm for certain sizes and strides with bfloat16rXznn.CrossEntropyLoss with invalid target generates corrups memory eventualy leading to CUDA error: an illegal memory accessrX2Enabling Transformer fast path for not batch_firstrX)AOTAutograd should detect false aliasing.rX0vmap, jacrev, jacfwd, hessian, etc., in libTorchrX(Add half specializations for load of sumrXwip docker issue debugrX.DISABLED test_cat_addmm (__main__.TestDoBench)rX>Check for output_padding <= stride/dilation in ConvTranspose1drX+Enable optional tensorList fallback to cpu.rX5Fix Clang compilation error with Lib ATen for ppc64lerXB[dynamo] Support dict.get with no default specified, and dict.copyrXmDISABLED test_aot_sequence_nr_dynamic_shapes (dynamo.test_aot_autograd.DynamicShapesAotAutogradFallbackTests)rX.[PG NCCL][RFC] Pause before throwing exceptionrX[JIT] .item() dict keys cause `RuntimeError: Cannot create dict for key type 'Scalar', only int, float, complex, Tensor, device and string keys are supported`rX7inconsistent dtype of scale and 
zero_point in observersrXD`torch.nn.utils.clip_grad_norm_()` causes H2D sync with foreach ops.rX0[WIP][Not for landing now] TP benchmark for perfrXAPyTorchMPS not showing up in Instruments for `torch.mps.profiler`rX[Pytorch][Vulkan] aten::gtrX=[torch.compile] autograd.Function with multiple return valuesrX*Better export story for autograd.Function?rXJ[torch.compile] autograd.Function where we assign a Tensor directly to ctxrXInstalling torchvision for CPU leads to unwanted upgrade of torch + pip would not install nightly as considers that release is the latest (?)rXU[dynamo] can't compile if tensor subclass implements __torch_function__ using super()rX'Command to reproduce error is incorrectrX>nll_loss reference shouldn't be registered as a decomposition.rX;[ONNX] scatter_reduce does not support `include_self=False`rX7Link compiled protobuf files to `protobuf::libprotobuf`rX:Calling ops.aten.embedding_bag() function got silent crashrX\DISABLED test_ddp_apply_optim_in_backward_ignored_params (__main__.TestDistBackendWithSpawn)rXPImproving save_on_cpu's performance by overlapping memory transfers with computerX0backwards compatibility about class _LRSchedulerrX9[PyTorch][export] type check utils for Basic Python typesrXK[CoDev Test] Pay no attention to this, just a noisy pr for testing ghimportrXHTransformer.generate_square_subsequent_mask has nan values on MPS devicerXTReduceLROnPlateau increases learning rate exponentially, causing training to divergerX8Inefficient code generated - does not use 256b registersrXPadd private API for generating all CompositeImplicit decomps from the dispatcherrX3pre_dispatch tracing: fix for nn.MultiheadAttentionrX7Don't set CUDA_HOME when not compiled with CUDA supportrXgDISABLED test_cuda_assert_should_not_stop_common_distributed_test_suite_cuda (__main__.TestTestingCUDA)rXW`torch.nn.modules.MultiheadAttention` yields different graph under pre_dispatch tracingrX=Torch.onnx.export a fp16 model but get the output tensor fp32rX$Can't 
build with non-static protobufrX,[xla hash update] update the pinned xla hashrX%torch DDP oom caused by weak protocolrX/Added More Information About Adadelta OptimizerrXQMany tests in test/dynamo fail if run in the context of just 'pytest test/dynamo'rXURuntimeError: GlooDeviceFactory::makeUVDevice(): interface or hostname can't be emptyrXJMake our source attribution debug prints more useful for Compiler ExplorerrXlRuntimeError: Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'grad_y'rXFix sym node printingrX"Type annotations for functional.pyrXRuntime Error: Empty tensorrX=Network module momory is not released in C++ libtorch 2.0.1 rXF Improve Error Message in MultiMarginLoss for Inconsistent Target SizerXUChange behavior of .users and .args of SchedulerNode to match the same API in fx.NoderXBOneCycleLR's state_dict includes a full reference to the optimizerrX2[vision hash update] update the pinned vision hashrX'[inductor] AOTInductor w/ saved weightsrX!ghstack + mergebot race conditionrX6[Compiled Autograd] Refactor duplicate symint handlingrXRProp improvement trackerrX>torch compile does not work with torch.nn.functional.softmax ?rX/Decomposition of bmm, addmm, mm for dot productrXI[dynamo.export] Assertion Error: Mutating module attribute during export.rX-dynamo struggles with list of tensor elementsrX)Build failure due to C++ version mismatchrX4Enable mypy check for torch/_inductor/codegen/cpp.pyrX5Scalar Tensor lowering to Fake Tensor inside InductorrX5[caffe2] Clean up platform-specific fbobjc deps/flagsrX'[Reland] fix building errors on FreeBSDrXeRevert "[quant][pt2e] store scale/zero_point as tensor attributes to support serialization (#105894)"rX;[dynamo.export] symbolic_shapes.GuardOnDataDependentSymNoderX)[ONNX] Set dynamic_shapes default to TruerXH`torch.ops.aten.split.Tensor._schema` return alias annotations are wrongrX"torch compile changes model outputrX"Automated submodule update: FBGEMMrXAdd foreach functions 
to docsrXEdistributed.batch_isend_irecv() crash when send/recv refers to itselfrXr[RFC][WIP] Extend convert_to_unspecialized for module attr mutation to module fields mutated through BINARY_SUBSCRrXrocm support for windowsrX"Automated submodule update: kinetorX&Pytorch nighlty and openAI/triton cudarX8Sparse COO indices are torch.Int64 -- is this necessary?rXR`export(..., pre_dispatch=True)` for model in eval mode still inserts autograd opsrX)bc-linter false positive with TypeAliasesrXpRegistering function that takes const std::vector& to SymInt[] schema gives confusing error messagerXOtorch._subclasses.fake_tensor.DynamicOutputShapeException: aten.nonzero.defaultrX=Allow settingsetGraphExecutorOptimize default for all threadsrXsTorchscript optimizer incorrectly applies constant propagation to convert prim::ListConstruct() into prim::ConstantrX:Libtorch report C10 error when compiling on my own projectrXP[feature request] Better argument checks and error messaging for `tensor.repeat`rXFGot error when train models with more than one param_group in torch2.0rX)Documentation fix in contributing.md filerX?Removal of Object Check in CUDAPluggableAllocator::raw_delete()rXKMPS cumprod gradient is broken even when using cpu fallback on macos 13.2.1rX2Fix test flakiness in test_sparse_triangular_solverX)llama model failed for dynamic shape pathrXEMFORMER_RNNT not compilabler X#Hack FSDP to work with CPU trainingr X_DO NOT MERGE! 
Test if backwards cumprod is broken even when falling back on cpu on macos 13.2.1r XN[do not review][Dynamo] Wait for lazy accumaultion of guards for input tensorsr X0Potential lack of CI testing on older NVIDIA GPUr XFTensors always get 0/1 specialization guards, even if they're not usedrX4"Graph break: inline in skipfiles:" is a bad messagerX9Avoid incrementing refcount of `grad_fn` in `unpack_list`rX9torch._dynamo.export does not support symbolic int inputsrXD[FSDP] Investigate sharded GPU gradient lifetime when CPU offloadingrX?DISABLED test_profiler_cuda_sync_events (__main__.TestProfiler)rXUDISABLED test_triton_template_with_epilogues_and_dynamic_shape (__main__.TestDoBench)rXMake Intel GPUs availablerX%enable torch.device TorchFunctionModerX^Misleading error message in multilabel_margin_loss when passing incompatible tensor dimensionsrX3[quant][fx] Fix node deletion bug during fx convertrXA[torch.compile] assertion sometimes ignored with inductor backendrX%[FSDPxMTPG] Migrate TestFSDPTraversalrX([FSDPExecOrder] Migrate one test to MTPGrX[FSDP test][ez] remove setUp()rX7[ComposablexMTPG] Migrate some composable tests to MTPGrX![FSDPxMTPG] Migrate one more testrXEFlatbuffer torchscipt files don't load in PyTorch Android Lite 1.13.1rX)[aarch64][xplat/caffe2] Fix aarch64 buildr XIvmap and rnn/lstm "accessing '.data' under vmap transform is not allowed"r!X%[DONOTMERGE][ROCm]Test MI210 CI Nodesr"XTestr#X[Error in Profiler : RuntimeError: Expected !config.profile_memory to be true, but got falser$X;Ensure PRs are rebased on top of a recent commit (CI check)r%XIReplayRecordTests.test_fn_call_args and others fail on my local devserverr&XPT2 is not thread safer'X@Differences in the results of conv2d calculations in PyTorch 1.8r(X=[FSDP] Support `OptimStateDictConfig.offload_to_cpu` for realr)XJFlip default on `add_zero_attn` in `torch.nn.MultiheadAttention` to `True`r*X[DISABLED test_cross_entropy_large_tensor_reduction_sum_cuda 
(__main__.TestNNDeviceTypeCUDA)r+X)Dynamo silently ignores TorchDispatchModer,X3enabling fused A16W8 mm through prologue fusion WIPr-X*Possible speed up of nn.MultiheadAttentionr.X4test_linalg: triangular_solve - set higher precisionr/XPTorch.jit : RuntimeError: Unable to extract string literal index for ModuleDictr0XaTorch.jit.frontend.NotSupportedError: not supporting functions with variable number of arguments.r1X>Missing coalesced flag from `torch.autograd.Function.backward`r2X3torch.compile(cpu) does not handle float16 properlyr3X!Batching rule for aten::bincount.r4X,Enable xpu backend in totchdynamo benchmarksr5X*Libtorch linking Error:undefined referencer6XGDefault parameters missing of maxpool2d node generated by dynamo exportr7X?torch.jit.frontend.NotSupported when compiling stable-diffusionr8X2[vision hash update] update the pinned vision hashr9X4Add oneDNN Graph fusion pass in Inductor CPP backendr:XdFakeTensor detach() gives meta tensor other than FakeTensor under `torch._C._DisableTorchDispatch()`r;XWPyTorch 2.0.x `CUDA error: operation not supported` when `Tensor.to` a different devicer<X5torch.compile does not respect branching in forward()r=X9Programmation error enabling unlegal memory access on gpur>X0Report model flop utilization (mfu) in benchmarkr?X:[WIP][Experiment] Avoid real computation for dynamo exportr@X&Bug when dealing with fallbacks on CPUrAX-Strange backward behavior with sparse tensorsrBXo[FSDP] FSDP doesn't work (random accuracy performance) when using `param_init_fn` and `sync_module_states=True`rCXPMPS memory issue, MPS backend out of memory, but works if I empty the MPS cacherDXNExporting the operator 'aten::grad' to ONNX opset version 18 is not supported.rEX&Remove TORCH_API from OpaqueTensorImplrFX1Add UT for NEON implementation of vec_reduce_all rGX@Redundant kernels for torch.scatter() when using torch.compile()rHX/JIT input aliasing does not support aten::fill_rIX(Conversion Error to ComplexDouble on MPSrJXHChanged 
documentation of DataLoadder to match class signature in sour…rKX;Errors while trying to finetune compiled transformers modelrLXH[MPS] aten::erfinv bug fix: add storage offset buffers to handle slicingrMX=inconsistent signature for dataloader in docs/source/data.rstrNXEnable the RUF013 rule in ruffrOX5[BUG] Fix grad_ready_order_indices' error type in DDPrPXGtorch.sparse.mm() with reduce operator for GPU support and COO matricesrQXoDDP , error . [c10d] The client socket has timed out after 900s while trying to connect to (XX.XX.XX.XX, 8514).rRX"[WIP] Fix Prims as_strided_scatterrSX7Mode to warm up PT2 with a regular eager mode executionrTXRhttps://pytorch.org/docs/stable/backends.html does not describe torch.backends.cpurUX3Add automated minifier script for Dynamo benchmarksrVX9torch.compile uses more memory when using less parametersrWX$I just need to ssh into a CI machinerXX*Improve error messaging on EmbeddingBag.curYX_Revisit checkpoint naming mismatch with torch name (and ONNX initializer name as a consequence)rZX@`torch.unique()` messes around with order even if `sorted=False`r[X.[FAILING] Make guard after freeze a hard errorr\XPypi is missing dependenciesr]X8[FSDP] using CPUOffload cannot make the code runing stopr^X(Compile error PyTorch 2.0.1 / GCC 13.1.0r_X`There is a big precision error between A100 and 3090 when using torch.matmul with fp16 precisionr`X<[pytorch] replace __FILE__ with __FILE_NAME__ for exceptionsraX,Parameter ... 
has been marked as ready twicerbXJDynamo test pipeline failed on MaxPool2d test when changed to use f-stringrcX,Fail to build c++ extension in pytorch 2.0.0rdXUnable to build documentsreX=Slightly improve AOTAutograd logging with ViewAndMutationMetarfXKDon't use weak ref finalization for freeing resources when code objects diergXSupport symmetry in einsumrhXAn experiemnt proflingriX%[WIP] low mem max_pool2d_with_indicesrjXJError using torch.compile with HF transformers and model `mosaicml/mpt-7b`rkX3`torch.autocast(bfloat16)` runs bwd matmuls in fp16rlXGRunning Llama 2 on Apple Silicon GPUs - missing MPS types and operatorsrmX7[LTC] Fix type inference for native_layer_norm_backwardrnX*Pytorch - cpu only & caffe2 build failingroX1add Half support for interpolate operators on CPUrpXLTensor subclass is not preserved during backward with gradient checkpointingrqX\Turn indexing with a scalar tensor into an copy into a view and avoid a D2H synchronization.rrX+Add z3-solver as dependency to dynamo testsrsX'[MPS] Add mps support for max unpool2d rtX0Syntax error when compileing Megatron-LM models.ruXMFSDP with gradient checkpointing lead to redundant allgathers during backwardrvXG[inductor] unexpected dynamic shape error encountered in TritonTemplaterwXAtorch.nn.TransformerDecoderLayer lacks parameter validation checkrxX4F.pad will accept 0 and negative values as parameterryXCExtend the device verification of the RPC module on the Python siderzXY[ONNX] fix `test_fx_op_consistency.py` test failure when running on torch built with cudar{X%Enable Mypy checking for scheduler.pyr|X-Out of bounds error with `nn.MultiMarginLoss`r}XAdd sdpa op prototyper~X/Change default autograd fallback mode to "Warn"rX?[Inductor] Add support for NEON ISA in the Inductor C++ backendrXqRFC: Integrating oneDNN Graph Compiler into Inductor C++/OpenMP Backend for Enhanced Graph Fusion and PerformancerX2Add color-coding to fx graph readable printouts :)rX Using scansrX9Lowering topk to reductions and 
pointwise when k is smallrXMMove Inductor-specific decompositions to general decomposition registrations.rXreplication_pad1drXReflection_pad1drX$aten.multilabel_margin_loss_backwardrXaten._cdist_backwardrXaten._trilinearrXaten._cdist_forwardrXNAvoid calling AOTAutograd from AOTInductor, since Export has already done thatrX+[easy] Add an option to force recompilationrXCtorch.sparse.sampled_addmm doesn't compute gradients for 3D tensorsrX [MPS] Lerp tensor implementationrXUserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::index_add_. Please file us an issue on GitHub so that we can prioritize its implementation.rX[POC] DynamicTensorrX!test_torchinductor_opinfo trackerrXptts_angular: fail_to_run, torch._dynamo.exc.Unsupported: call_method NNModuleVariable() flatten_parameters [] {}rXQconvit_base: AssertionError: Mutating module attribute rel_indices during export.rX&Efficient BMM for sparse-dense tensorsrXXDISABLED test_conv_with_as_strided_dynamic_shapes_cuda (__main__.DynamicShapesCudaTests)rXtorch.onnx.export errorrXV[ONNX] Exporting the operator 'aten::exponential' to opset version 13 is not supportedrXMaten.bernoulli.p is missing in core aten IR opset but does not get decomposedrX7Avoid synchronization when using scalar tensor as indexrX,[ONNX] FX produce valid node names in modelsrXxport] Serialize SymFloatrX,[FSDP] Revisit mixed-precision casting logicrXItorch.save throws an error when the path uses mixed separators on WindowsrX/Specifying `FakeTensorMode` for Custom BackendsrX[OpInfo] index.TensorrX0[benchmark] Rename the count field FunctionCountrX[proposal] Bit ops: e.g. 
setbit/getbit/togglebit/byteswap + introduce well-standardized unsigned dtypes (uint16, uint32, uint64)rXA[ONNX] Support Fake Tensor Mode on new Dynamo based ONNX exporterrXSpecify versionrX,Adding documentation an diagram on code baserX/Top level Glossary for users (not contributers)rXtorch.onnx.export failed: torch.onnx.errors.SymbolicValueError: Unsupported: ONNX export of convolution for kernel of unknown shaperX=Will nn.unfold support non-4D-tensor input in future version?rX\DISABLED test_cross_entropy_large_tensor_reduction_none_cuda (__main__.TestNNDeviceTypeCUDA)rXISilent Error of torch.fx.symbolic_trace when forward hooks are registeredrXB`vmap` causes unpredictable behavior when combined with `autocast`rX3Need support and testing for Adam optimizer for MPSrX3FSDP loading with a partial state triggers KeyErrorrX Quadric LayerrXK[pytorch][codev] add current test finding logic to find_matching_merge_rulerX%Set dir for aot_inductor output filesrX@torch.onnx.export does not support divisor_override in AvgPool2drX+FSDP Full Shard compatibility with BF16 AMPrX+[ONNX] Refactor `test_fx_op_consistency.py`rXEnable SLEEF on ARMrX9DISABLED test_super_resolution_cuda (__main__.TestModels)rX:Softmax doesn't support sparse tensors with the CSR layoutrX%TorchInductor Hack-a-Day on July 19thrX(Can't vmap over torch.tensor constructorrXPadded tensor subclassrX,DeadKernel when training GNN for Cora on MPSrX-Implementation of torch.sparse.sampled_baddmmrXS[docs] torch.sigmoid to make clear equivalence relations to other sigmoid functionsrX2Failed to convert model that has LeakyReLU to ONNXrX4Batching rule not implemented for aten::unsafe_chunkrXzBackward pass with sparse parameters results in error "Sparse division requires a scalar or zero-dim dense tensor divisor"rX5Support ONNX opset 20 to export GELU to one single oprXTorch.compile Error: RuntimeError: aten::_conj() Expected a value of type 'Tensor' for argument 'self' but instead found type 'complex'.rX@Optimize PyTorch 
C++ part with Profile-Guided Optimization (PGO)rX>[Dynamo][Compile]Torch compile with dynamic shapes not workingrXL[DO NOT MERGE] Testing to see if CUDA API call is allowed in watchdog threadrX2added some more codegen files from inductor modulerX<Inductor generates incorrect CPU code for `uint8` operationsrXrecompile fx.GraphModule lazilyrXK[discussion] Integrate widely used utilities from fvcore into the core reporXNDISABLED test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim)rX\DISABLED test_cross_entropy_large_tensor_reduction_mean_cuda (__main__.TestNNDeviceTypeCUDA)rXHMultiple linux jobs are failing with version `GLIBCXX_3.4.30' not found rX'Enable Mypy Checking in torch/_inductorrX\Significant time difference of calculating Jacobian matrix using jacrev and oracle functionsrX Export+AOTInductor issue trackerrXG[DTensor] Dtensor API should report the correct device when GPU is usedrXQ[DTensor] Module parallelized with ColwiseParallel should return a sharded tensorre(XHautocast + torch.no_grad inference cause backward graph nodes to be lostrXIPytorch dataloader not loading first-available data with multiple workersrXKError loading TorchScript model with torchvision::nms operation in libtorchrX9Repro str could be displayed with slightly wrong env varsrX|[DO NOT MERGE][NCCL][CUDA][CUDA Graphs] Set watchdog runtime capture mode to thread local to handle cleaning straggling workrXUtorch.compile leaks memory after compiled object is deleted, no apparent way to cleanrX,[Dynamo]`__torch_function__` tracing supportrX.[PyTorch-TB] Write full tensor as tensor protorX4PT2 custom ops does not work with future annotationsrX7[ROCm] Add ROCm AMDGPU support for inductor cpp codegenrXTypeError: 'NoneType' object is not subscriptable (Occurred when translating col2im). 
Can't translate torch.nn.functional.fold in opset_version 18.rXGDISABLED test_conv (quantization.jit.test_quantize_jit.TestQuantizeJit)rXQDISABLED test_conv_transpose (quantization.jit.test_quantize_jit.TestQuantizeJit)rXaDISABLED test_observer_with_ignored_function (quantization.jit.test_quantize_jit.TestQuantizeJit)rXPDISABLED test_single_linear (quantization.jit.test_quantize_jit.TestQuantizeJit)rXIDISABLED test_nested (quantization.jit.test_quantize_jit.TestQuantizeJit)rX6DISABLED test_unary_ops (__main__.TestTensorExprFuser)rX+MacOS arm64 runners are not available in CIrX.Remaining functions without meta registrationsrX?workaround for using vmap when .item() is being used internallyrX&[RFC] Proposal to upgrade LLVM versionrX<Fix kwargs for `checkpoint`; composition with `fully_shard`rX4torch.load fails under FakeTensorMode for GPT2 modelrX[ONNX] Support aten::var_meanrXO[linalg] test_ops.py::test_python_ref_meta__refs_linalg_svd_cpu_complex failingrX7test_view_dynamic_zero_dim no longer testing zero inputrXc[feature request] make the input k in rot90 a list of int to rotate tensors individually in a batchrX!Add more mac messages to setup.pyrX;extra information messages for mac in setup.py would help. 
Support Delay Loading of c10.dll in when using libtorch as a thirdparty library.
Multiple dimensions support for `torch.max`
`assert has_same_metadata(inpt_new, inpt_old)` fails when capturing forwards + backwards in train_step with resnet18
DISABLED test_homogeneous_attributes (__main__.TestFSDPMiscMultiThread)
DISABLED test_compile_vmap_hessian_cuda (__main__.TestCompileTransformsCUDA)
Move tools/autograd to torchgen/autograd
[export] tensor creation ops burn in device
NotImplementedError: Could not run 'aten::_spdiags' with arguments from the 'CUDA' backend.
[not ready for review yet] torch.compile support for parseSemiStructuredTensor
Add a diagram showing the code structure to CONTRIBUTING.md
Saving a LightningModule torch.jit.ScriptModule is incompatible with torch.amp.autocast
fix Type Hinting Annotations
RuntimeError: DataLoader worker (pid(s) 9036, 10492) exited unexpectedly
[Inductor] [CPU] performance regression with TORCHINDUCTOR_FREEZING=1
ONNX export process failed to keep consistence of input_names specified
[torch.compile] RuntimeError during Gradient Computation in torch.compile()
torch version compare
Unnecessary record_stream call for backend:cudaMallocAsync
fix for documentation links
StableDiffusion with dynamic=True still recompiles
torch.jit.frontend.NotSupportedError: keyword-arg expansion is not supported: for dgl.nn.HeteroGraphConv()
Errors when converting LLaMA to ONNX using dynamo export
Refactor Adam and AdamW by abstracting out common code
[dynamo][higher_order_op] assert in check_kwargs leads to error instead of graph-break
torch.onnx.export does not respect nn.Module.forward API when using export_modules_as_functions=True
DISABLED test_custom_op_cuda_cuda_wrapper (__main__.TestCudaWrapper)
torch/testing/_comparison.py: If you are a user and see this message during normal operation please file an issue
errors in CONTRIBUTING.md
Conversion of a CSR tensor with batches to COO tensor fails
rfftn and irfftn operations in pt2 return different results compared to v1.12.1
torch.nn.Conv2d's padding mode circular cannot accept 3-dim input
Torch's `LayerNorm` and Adam optimizer vs those in tensorflow
DISABLED test_custom_op_cuda (__main__.CudaTests)
DISABLED test_custom_op_cpu_dynamic_shapes_cpp_wrapper (__main__.DynamicShapesCppWrapperCpuTests)
torch.norm inconsistency?
[feature request] torch.mix function to generalize/symmetrize addcmul
Implement `diag` method for sparse COO tensors
MPS matmul with sliced (strided) out argument produces wrong output, may corrupt memory
Unrelated error messages with torch.nn.AdaptiveAvgPool3d
torch.func.jvp fails with BERT training
[RFC] Let in-place foreach functions return a list of Tensors
[compile][dynamic] dsplit is seeing a list of mixed ints and symints
PyTorch built with CuDNN-8.8.1 crashes if CuDNN-8.9.2 is installed on the system
Regression in Dalle2 due to dynamic shapes
have inductor fallback for fp16.view(dtype=torch.int16)
inductor/triton fails on `view(..., dtype=torch.int16)`
[BE] Evaluate and improve eager for-loop optimizer memory perf
Use `isinstance` instead of `type` when checking for `torch.nn.Parameter`
torch.nn.CrossEntropyLoss: class weighting changes label_smoothing
Subgraph matcher returned a false match
Support for `eval` in functional_call
Torch Filename Storage hangs on "file_system" sharing strategy after in-place fill
fsdp load model causing insufficient CPU memory
torch._dynamo.exc.InternalTorchDynamoError: Could not run 'aten::_local_scalar_dense' with arguments from the 'Meta' backend
Error reporting uses formal parameter names of downstream C++ function
torch.jit.trace says "Arguments for call are invalid" on torch.ops.aten.sub(3, x, alpha=3)
Correcting error message for invalid output_size input in nn.AdaptiveAvgPool2d
Add support for NEON ISA in the Inductor C++ backend
Nondeterministic segfault in test_content_store.py under Dynamo config
torch.jit slicing error (styleganv2)
New Loss Function Add In Pytorch
generate_vmap_rule=True sometimes gives batched grad_output
[feature request] Specialized memory layouts and wide blocked/tiled dtypes for cublasLt/onednn: e.g. torch.float16x32 / torch.int8x32 (akin to torch.quint2x4)
System memory leak when using different input size of torch.nn.Conv3d
Incorrect Error Message Ordering for nn.AdaptiveAvgPool2d with Incorrect output_size
LSTM built-in dropout not reproducible on GPU
DISABLED test_cuda_memory_leak_detection (__main__.TestCudaMultiGPU)
torch._dynamo.export does not work with bert model
[compile] DDPOptimizer + activation checkpointing not supported
Extend ATen op benchmarks
vision_maskrcnn: AssertionError: expected size 368==368, stride 156==28 at dim=0
I propose a new overview section in the documentation
`torch.distributed.rpc.backend_registry.register_backend` fails to update `BackendType` enum
torch.compile fails with "INTERNAL ASSERT FAILED" when compiling GPT-2
Failure in optimize_for_mobile when using conv1d(..., padding='same')
F.adaptive_avg_pool3d(input, 1) returns infinity in half precision
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE.
[ONNX] Exprted Graph has Different Behavior from Eager Mode for CUDA FP16 Tensor Times a Number
Bug in Conv/BN fuser with torch.fx
version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
Issue with FSDP does not reduce memory footprint when scaling up GPUs
Conv1d step-by-step numerical error
cpu reduce: output accumulate type when input is bfloat16 or float16
Updated the documentation of torch.sparse to make it more readable.
Remove cpp_custom_type_hack
Init_rpc() errors when running the test code in the TorchPRC document on two different machines
torch compile for jacrev'ed function
Remove deprecated fbgemm operators
[RFC] Optional Modular Representation for FX Graph from `torch.compile()`
DISABLED test_sparse_all_reduce_sum_cuda (__main__.TestDistBackendWithSpawn)
vec_test_all_types_xxx with dtype c10::complex and c10::complex has failures on division
Using the latest version of Torch, when the code executes tcpstore, there is no response
[error] while Implementation of pytorch DistributedParallel
TImeout in NCCL doesn't work
Wrong functionalization of as_strided leads to wrong results
Get errors after compiling and running PyTorch MINIMAL EXAMPLE for c++ Mac M1 with make
Add inverse gamma distribution and fix `sign` bug in `PowerTransform`.
[NCCL][CUDA][CUDA Graphs] Flush enqueued work before starting a graph capture
FSDP Optimizer Overlap - follow ups
Investigate numerical stability of forward-mode AD of some foreach functions
[test-only] Tensor load endianness default value
`torch.view_as_real(tensor)` should return `nn.identity(tensor)` if its not complex instead of raising an error
[Feature Request] Add a new overload of torch::jit::load to restore traced shape and type
Deprecated the device usage without device_type
add tsan workflow
DISABLED test_nnc_correctness_frac_cpu_bfloat16 (__main__.TestNNCOpInfoCPU)
[ONNX][TypePromo] Automate codegen type promotion rules
Numpy/scipy module works fine with Torch modules, but not TorchScript. How to torchscript a numpy/scipy module?
Functorch scan
Silent incorrect result for addmm for noncontiguous input
torch.compiled model output gets overwritten despite tensor.detach()
Make decomps opt-in for upsample_nearest 1D / 2D / 3D
LibTorch 2.0.1 scripting in Debug mode on Windows
Support CUDA 12.2
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at HIPGuardImplMasqueradingAsCUDA.h:60, please report a bug to PyTorch
Detailed error: Tensor-likes are not close! When use torch.jit.trace
Inconsistencies in ONNX exporting of operation `torch.full()`
distributed hooks want to support custom device
FakeTensor can't handle meta impls that perform device conversion
DISABLED test_conv3d_64bit_indexing_cuda (__main__.TestConvolutionNNDeviceTypeCUDA)
ReduceLROnPlateau will throw IndexError: list index out of range with modified optimizer's param_groups.
[testing only] Enable inlining modules by default
Segmentation error while using F.cross_entropy with mps(for code that works fine with device= "cpu")
DISABLED test_backward_ddp_inside (__main__.TensorPipeDdpUnderDistAutogradTest)
Illegal Memory Access on H100 `TestSparseCompressedTritonKernelsCUDA.test_triton_sampled_addmm_block_size_16_cuda_bfloat16`
Torch randperm with device mps does not sample exactly uniformly from all possible permutations
Attempt to use minifier on sam model fails
torch.distributed.all_to_all_single & alltoall_base, size limit INT_MAX
affine_grid and grid_sample operators merge/accelleration
getattr on `__slots__` object potentially suspicious
`F.conv1d` and `F.conv2d` propagate `nan`'s incorrectly when minibatch > 15
Rename `topic: not user facing` to `release notes: not user facing`
torch._dynamo.exc.TorchRunTimeError in get_fake_value while performing quantization aware training
ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory
[FSDP] `ignored_states` is broken with auto wrap
[RFC] Make `_HYBRID_SHARD_ZERO2` public as `HYBRID_SHARD_GRAD_OP`
[inductor] Updated upsample_bicubic2d decomposition
[proposal] "Name" string attribute for modules, parameters, buffers, tensors for more pleasant debugging (especially for graph printouts / export / studying compiled generated code)
DISABLED test_mem_get_info (__main__.TestCudaMultiGPU)
Enable quantization dispatch for backend QuantizedPrivateUse1
[ONNX] Investigate `nn.functional.nll_loss` skip/xfail reason
Torchscript with dynamic quantization produces inconsistent model outputs
View ops on fake tensors can dispatch `detach`es to backend kernels
Conversion from strided to batched sparse compressed tensor with a non-constant number of zeros in batches fails
torch.embedding: Trying to convert BFloat16 to the MPS backend but it does not have support for that dtype.
Add memory managemenet information for Apple silicon mps backend
[inductor] Updated upsample_bilinear2d decomposition
No document for parameter `load_debug_files` in `torch::jit::load` in C++ API
distributed.scatter memory leak in source rank
Incorrect Reduce collective result with `_coalescing_manager`
DDP enhancement
Nested Tensor with PyG dataset custom class
Network does not return any thing, not even None and breaks loops
add dist hooks support for custom device
Numbers bigger than the range should be inf while the implementation just keeps the original.
Error 101: invalid device ordinal (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
[RFC] TorchInductor with X86 CPU as backend of Quantization in PyTorch 2.0 Export
PyTorch2.0 ROCM LayerNorm HIP error: invalid configuration
make_fx: torch.where scalar promotion burns in device
[ONNX] Support symbolic tracing without using external `FakeTensorMode` on public API
add fsdp checkpoint support for custom device
Python Crashes When Importing Torch With C API
Re-enable `test_typing`
Documentation building fails due to torchgen
Tensor to_sparse fails on large matrices
batch size unexpectedly affects model inference on Mac M1
Inductor dynamic shapes output: NameError: name 's2' is not defined
DISABLED test_graph_breaks (__main__.LoggingTests)
(Possible) Memory leak on deleting a compiled model
RuntimeError: _ivalue_ INTERNAL ASSERT FAILED
Regressions with torch.compile + amp + ddp with recent nightly builds
Distributing HSDP checkpoint writing for load balancing
[WIP]Add mutliple CUDA streams support to TorchInductor
Tracking issue for optimizer graph not being an inference graph
[torch.compile] torch._dynamo.exc.TorchRuntimeError: Failed running call_function
[PT2.0][compile] torch._dynamo.config.log_level does not exist
[torch.compile] `permute_linear_fusion` ignores the inplace operation for the tensor
DISABLED test_backward_ddp_outside_uneven_inputs (__main__.TensorPipeDdpUnderDistAutogradTest)
Several Torchbench models don't run with float16 or bfloat16 in the inference eager mode
[Inductor] Freezing Add support for Caching Parameter Conversions
[dynamo] Update Unsupported to raise from fake tensor exceptions
gfx906 ROCM print black images all ai torch: 2.0.1+rocm5.4.2/rocm5.5 only works with torch=1.13.0+rocm5.2
'MPS' Issue Running HuggingFace Transformer Pix2Struct Model
[ONNX] Isolate TorchScript-based code-base from Dynamo-based ONNX exporter for easier deprecation
How to unwrap after auto_wrap in FSDP?
CODEOWNERS file has errors due to non existent people being referred to
Need the full "Release Compatibility Matrix" of torch
How to modify gradients of an FSDP model?
when train with multi GPUS
torch.save() fails if path contains multibyte characters
[torch.fx] Deserialization Error - TypeError: ones() received an invalid combination of arguments - got (tuple, device=Attribute)
DISABLED test_gather_state_dict_dtensor (__main__.TestShardUtilsDistributedDTensor)
[dynamo] functools.wraps : graph-break when wrapping nested functions.
Remove setting eval mode in observed custom LSTM
Issue with loading similar checkpoints in a distributed fashion
Docker images: faster linker for `torch.compile`
Runtime Error outerNode->outputs().size() == node->inputs().size() INTERNAL ASSERT FAILED when exporting custom operator
Can ``torch.vmap`` add ``grad_fn``= SelectBackward when maping over some dimension of the inputs?
row.device().is_cpu() INTERNAL ASSERT FAILED at "csrc/cpu/diag_cpu.cpp":7
DISABLED test_create_chunk_dtensor (__main__.TestShardUtilsDistributedDTensor)
FSDP full precision `model.eval` silently failing
Long PR description leads to "Argument list too long" error from docker
[ONNX] FX exporter: replace `aten::copy_` with out-place version
Segmentation fault when tensorrt is imported before torch
torch compile aten::floor_divide error
Some parameters are missing type descriptions
The document style is inconsistent with other documents, and the parameter type is not clearly highlight
Missing examples in some API docs
[question] [docs] Short/mid/long-term status of TorchScript / JIT / torch.jit.trace / FX / symbolic tracing and its replacement by Dynamo
Gradient operations (zero_grad and gradient accumulations) as graphs
type conflict
Please consider the SCFA/dynamic flash attention for your implementation of scaled dot product attention
Torch 1.13 for GPU breaks if libcublas is already present.
[dynamo] AssertionError for custom iterable nn.Module
RPC Framework support custom backend
Upgrading SpGEMM algorithm to resolve Cusparse SpGEMM insufficient resources problem
abnormal behavior in function "scatter"
Add alias support for sparse tensors.
Error when building with USE_TENSORRT=1
Support `Sequence` type in JIT
Eager PTDQ Performs Worse Than Non-Quantized Linear Layer on CPU(in Terms of Speed)
Mis-annotated return for `F._no_grad_embedding_renorm_` (also JIT related)
Type misalignments in `nn.functional` (also JIT related)
[Torch Mlir] avg_pool1d function padding init value should be (0,)
Generate complete annotations for `torch._C._nn`
[PT2] Return int32 indices in max_pool2d_with_indices
[ONNX] Handle absence of `onnxscript` module in PyTorch requirements.txt
Merge type stubs for `torch.nn.functional`
dlrm and hf_T5_generate fails aot_eager with bfloat16+dynamic_shapes
libtorch > 1.9.1 produces segfault on Qt5 gui application exit
Pytorch not calling to C code from a docker container
SDPA produces NaN with padding mask
[FSDP] train throughput become slow down when loaded shard optimizer dict
[FSDP] save model checkpoint with StateDictType.LOCAL_STATE_DICT and LocalStateDictConfig(offload_to_cpu=True) fail
torch.compile() bug in AOTAutograd or Dynamo
DataParallel interfering with TorchDispatchMode
Non actionable perf hint: reduction over non-contiguous dims
[ONNX] Discuss improvements to Diagnostic public API
TorchDynamo assertion with `try: return; finally`
[dtensor] introduce experimental dmap
fairseq distributed training dumps core with flash attention
(fsdp) Support for accessing unsharded parameters for methods other than `forward()`
Exported model with dropout incorrectly applies dropout during eval
[dynamo] Add config that turns on tracing through nn modules
detectron2_fcos_r_50_fpn and other models have enough graph breaks that we end up with multiple cache entries on module blocks
"Y.getIntrusivePtr()->set_storage(X.getIntrusivePtr()->storage()); " in C++ is not supported
MultiheadAttention should split embed_dim into four parameters
inductor: support horizontal reduction with vec_transpose to improve TIMM swin_base_patch4_window7_224 dynamic shape performance
AOT autograd: Avoid dependency on strides for manual regeneration of outputs that are aliased to inputs
Insert nvtx markers into generated triton kernels
"addmm_out_sparse_csr_impl_mkl" not implemented for 'Byte'
Disclose C++ ATen ops type promotion rules under OpOverload in Python
2D model checkpointing hangs on a ViT model
DISABLED test_backward_ddp_outside (__main__.TensorPipeDdpUnderDistAutogradTest)
ARM based GPU support for Distributed Data Parallelism Module
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant on DynamicShapesMiscTests.test_slice_input
Can't call allow_in_graph inside of a function being torch.compile'd
Installing Torch on AMD Platform Leads to Huge Docker Image
test_fstrings2 fails with dynamic
`interpolate` with `antialias=True` on CUDA doesn't work if the difference of spatial size is large
LSTM/RNN operation agnostic
torch.cuda.mem_get_info to return 0 if CUDA context isn't initialized
[Inductor] Constant folding support for FX module captured by Dynamo Export
Passing dict in datapipe/dataset will have memory leak problem
Support ByteTensor and ShortTensor for nn.Embedding and nn.EmbeddingBag
ImportError: undefined symbol: cublasSetWorkspace_v2, version libcublas.so.11
add default argument device type api
[ONNX] Support aten::mT
[ONNX] Support aten::linalg_solve_triangular
[ONNX] Support aten::linalg_cholesky_ex
File Missing When i build with C++
Request: flag to know model is compiled after torch.compile()
Inject detailed NVTX markers into the Inductor Triton generated kernels
torch.fx.passes.split_module.split_module doesn't support dynamic shapes
Deduplicate the operands passed into torch.cond after dynamo tracing.
`gradcheck` produces false positives with sparse inputs when `masked=False`.
[functorch] [FakeTensorMode, meta tensor] + aot_autograd Bug.
CUBLAS_WORKSPACE_CONFIG can not be parsed
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6047) of binary: /home/win10-ubuntu/anaconda3/envs/vicuna-7b/bin/python
DISABLED test_mem_get_info (__main__.TestCuda)
No backward implementation for `torch._native_multi_head_attention`
torch._dynamo.exc.Unsupported: Tensor.backward with aten_graph=True
Document CI retry rules
[Inductor] Optimize More Cases of Int32 -> Int64
Error encountered when tracing model with Dynamo/Functorch for export with trilinear interpolation
[inductor] multi-kernel support
[ao] making hist_obs handle torch.inf and closeby values
Memory efficient SDP yields wrong gradients
Asynchronous CUDA AveragedModel
Deprecation warning on lr_scheduler.step(num_steps)
test_generate_tensor_from_list_of_numpy_primitive_type fails if run under pytest
The document does not emphasize Illegal value in nn.Bilinear
The document does not emphasize hidden range in nn.Embedding
The document does not emphasize hidden range in nn.MaxPool2d
Possible memory leak when using Torch and Torchvision in conjunction with XGBoost
Torch model compile error "/usr/bin/ld: cannot find -lcuda" though cuda is installed via run file
[inductor][cpp_wrapper] Support rand fallback
[Distributed] Limit world_size to 8 for FSDP Unit tests
[decomp][bad accuracy] AlbertForQuestionAnswering
LayerNorm freeze processes using torch multiprocessing
Typing missing on arithmetic ops on `Tensor`
NotImplementedError Could not run 'c10d::alltoall_' with arguments from the 'Meta' backend.
Inplace binary ops on tensor subclasses can cause mypy error
ImportError: cannot import name 'Store' from 'torch.distributed'
torchgen/gen_backend_stubs.py compatibility with DispatchStubs
test_workspace_allocation_error fails on my local devgpu
RuntimeError: CUDA error: unknown error
Libtorch compile error when defining D_GLIBCXX_DEBUG
Add a requirements.txt for windows pip packages
[feature request] Native method for iterating Python items of tensors: `iteritems()` and a new `tensor.item(i, j, k, ...)` method
mps and cpu give far different results when training a transformer.
python test/inductor/test_split_cat_fx_passes.py -k test_consecutive_split_merge fails, but running all tests together succeeds
Improve `_group_tensors_by_device_and_dtype`
RuntimeError: torch.vmap a function that includes in-place arithmetic operations on a zero-initialized tensor, an error "vmap: inplace arithmetic(self, *extra_args) is not possible" is raised.
Disabling ALL TestOptim on the dynamo config
Custom autograd function causes a graph break
binary_cross_entropy (loss) seems to be giving incorrect values for very negative logits
Add Half support for softmax and log_softmax on CPU
Fast kernels for low rank matrix multiplication
setup.py fails to pass USE_ROCM to CAFFE2 build
DTensor uneven sharding corner cases.
distributed.gather shape constraints
Dynamo trouble shooting dead link
oneDNN kernel fails to compile
Misaligned address error with torch.cat
Warn / deprecate / remove ProcessGroupNCCL._group_start(), _group_end() APIs
Unexpected High PCIe traffic in Distributed Training since PT 2
Issue-103101: Refactor dimensionality check in tuned_mm_plus_mm to pattern matching phase.
torch.jit.script mean(keepdim=True) segfaults on GPU
torch.cuda.memory_reserved always returns 0 bytes
Image Processing with Pytorch
Benchmark --quick with huggingface runs almost indefinitely on CPU
compilation fails `error: invalid argument '-std=c++17' not allowed with 'C'`
[help] did torch.distributed.launch can be applied on k8s cluster with pytorch-operator
Undeterministic behavior in testing in dynamo.
PyTorch can not be compiled with MKLDNN if system compiler is clang
[inductor] test_fft_real_inputs fails with dynamic shapes
(fsdp - maybe a bug) SHARDED_STATE_DICT returns tensor with no data
[RFC] Emit better Telemetry in PyTorch
breakpoint() in torch.compile region behaves oddly
Calling jacrev with LSTM and functional_call gives error
Allow overriding __repr__ to call dataclass_repr (infinite recursion right now)
Build fails at linking torch_shm_manager on aarch64
Optimize the copy of Half to Float and Float to Half on CPU
error: 'aligned_alloc' was not declared in this scope static_cast(aligned_alloc(FLATBUFFERS_MAX_ALIGNMENT, size)), free);
Observed regress in DataLoader spawn from PyTorch1.13 to PyTorch2.0
Turn on Inductor Max Pool2d Backward Lowering For Channels Last
Increased / more verbose type aliases for improved readability of user defined content
PyTorch should not use `windows.8xlarge.nvidia.gpu` to test binary builds
Refactor mm_plus_mm to check conditions upfront
torch.compile specializes on output name
Inconsistent memory allocation using FSDP between PT 2.0 and Nightlies
[OOM] Unable to convert 30B model to ONNX, using 4x A100's
Ambiguitiy in causal-mask in scaled_dot_product_attention
torch.compile crash for tensor computing when tensor size is bigger
Unexpected failure in LLVM JIT when running TorchScript model in C++
Symbolic trace error about torch.nn.functional.pad
[Pytorch 2.0] torch::nn::Dropout output is incorrect on Windows
lit-llama lora fine tuning NetworkXUnbounded: Infinite capacity path, flow unbounded above
MPS bug: padding_idx in nn.Embedding does not prevent gradient accumulation
Preserve weight_g/weight_v accessors on new weight_norm
raise `RuntimeError` faster when loading an object with a torch CUDA tensor on a CPU-only machine
Discussion and Design for Masked Loss Functions which can be used with PackedSequence training (but not exclusively)
how to workaround the error "don't have an op for vulkan_prepack::create_linear_context" ?
torch.svd fails on large matrices
[Inductor] add debugging tools
TypeError: (): incompatible function arguments
[onnx] aten::cumprod cannot be exported to ONNX
torch.onnx.export error ------RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Support for efficiently processing categorical distributions with varying dimensions
torch.cuda.is_available() returns False on GTX 1650 with cuda 11.7 and torch==2.0.0+cpu
Unbox expectedml
PackedSequences on MPS accelerator yields `grad_y` missing or crashes the kernel.
nn.ChannelShuffle1d
Unable to checkpoint model and optimizer state when using Hybrid Sharding Strategy
After dynamo minifier generates repros that don't entirely match what we minified over
BCELoss and BCEWithLogitsLoss differ when one of the input logits is float("inf")
[dynamo] Diffusers - Graph break on OrderedDict
Inductor: delete code that extracts out sizevars by inspecting tensor inputs to find a size that handled it
pdb but for dynamo (and time travel debugging)
The operator 'aten::poisson' is not currently implemented for the MPS device
Dynamo should only unroll loops by a preset factor (unless otherwise explicitly instructed)
LibTorch-Lite 1.13.0.1 Crash on iOS 12 on app startup
TypeError: (): incompatible function arguments
Unknow error when using `make_graphed_callables`
Unable to resume job using FSDP with 64 nodes, errors appeared during loading sharded optimizer state dict
mark_dynamic may error too aggressively
[DTensor] Error in distribute_module with module._apply
`torch.poisson(torch.tensor([torch.inf))` returns 0
Do smarter layout decisions with concatenate.
Improve shape padding in training.
DISABLED test_make_fx_symbolic_exhaustive_special_bessel_j0_cpu_float32 (__main__.TestProxyTensorOpInfoCPU)
Introduce OptimizerInfos to test_optim (pt1)
Support In-place Triangular Matrix Multiplication
Followup on the extra graph breaks for yolov3 model caused by layout optimization
Pytorch Build images for RISCV64 Devices in the nightly builds
Error: no matching constructor for initialization of 'at::OptionalIntArrayRef'
DISABLED test_ddp_has_finalized (__main__.TestDistBackendWithSpawn)
Unlocking PyTorch's Power: README.md in Multiple Languages!
parameterizations.orthogonal does not work as intended with nn.GRU or nn.LSTM
Building NCCL with `make -l $MAX_JOBS` slows down builds
[FSDP] When amp is enabled, there is a noticeable difference during training between `FSDP `and `DDP`
Best practices clarification for initialization strategies
DISABLED test_Conv2d_dilated_cuda_tf32 (__main__.TestNN)
Exporting the operator 'aten::fused_moving_avg_obs_fake_quant' to ONNX opset version 13 is not supported
Fix dynamo-related debug Python 3.11 failures
Investigate the perf drop on timm for dynamic shape when layout optimization is enabled
Duplicate parameters (_flat_params and original params) in the state_dict when using `use_orig_params=True` and `StateDictType.LOCAL_STATE_DICT`
Test test_vjp_nn_functional_scaled_dot_product_attention_cuda_float32 fails with `query: last dimension must be contiguous` on H100
torchscript dataclasses have bad support for class types as fields
Error when exporting to onnx for albert-base-v2, issue with attention_mask
[inductor] Memory planning
Extremely slow will_fusion_create_cycle on nanogpt_generate
Cannot invoke prims.sum with output_dtype
[prims] torch.ops.aten.le decomposition confuses scalars and tensors
Fix sparse windows
Support for activation checkpoint on demand in custom function
Jetson NX with torch 1.12.0 :cannot import name 'ProcessGroup' from 'torch.distributed'.
Dynamo should feed optimized upstream graph's output to downstream graph for DDP
Mergebot should merge non-stacked PR
test_functional_autograd_benchmark.py::TestFunctionalAutogradBenchmark::test_fast_tasks passes with all NaNs
[RFC] Add third-party malloc library to improve pytorch memory performance on Windows
Segfault when running vulkan program linked against libtorch
[MPS] Fix MPS sorting issue with strided view tensors
[feature request] PyTorch support for sub-interpreters with PEP 684 accepted and release in Python 3.12
pytorch java api documentation is not clear and does not cover example
Faster BatchSampler with big batch size
[Inductor] [CPU] hf_Longformer performance regression > 10% on 2023-05-28 nightly release
[Utils][tensorboard]Enhancement: Include 'max_outputs' parameter in torch.utils.tensorboard.summary's 'image' method
[REQUEST] - Update Multiprocessing best practices with CPU device
torch.onnx.errors.CheckerError: The model does not have an ir_version set properly.
Matrix multiplication performance regression in case of an additional dimension of size 1
Batching rule for `aten::_scaled_dot_product_efficient_attention`
RuntimeError using torch.nn.functional.pad when using MPS
Add additional "sigmoid" approximation to GeLu activation?
Add pin_memory and is_pinned to NT
DDP multi node multi gpu inconsistent params
discuss.pytorch.org signup issue
multiple mps for base X86 Mac with multiples gpus
torch.distributed.all_reduce() has inconsistent behavior
Add support ONNX Opset 19
Add default device function interface for device-aware apis
_view_func but without keeping original view tensor alive
[Composable] Unified way to check if modules are managed by composable API
Unexpected Behavior when using torch.isclose()
Hooks not working in version 2.0.1+cu118
[cuda] Switching CI to CUDA 12.1 timing out linux-bionic-cuda12.1-py3.10-gcc7 / test (distributed, 3, 3, linux.8xlarge.nvidia.gpu)
Issue with ShufflerIterDataPipe in torch 1.13.1
PyTorch hangs at import when used together with TensorFlow
Data type mismatch in `batch_isend_irecv` docstring example
"Examples" in "batch_isend_irecv" should be modified to get the correct results
[Dynamo] Better graph-break message for unsupported ctx managers
Tensors that share same underlying storage to also share gradient storage
There is a memory leak in torch.load
transformer encoder-layer, the sample-Independent attn_mask(dim=3) has different behaviors when training and validating
Re-enable Low Memory Dropout
[Dynamo]Outdated logging setting
after add /path_to_libtorch/libtorch/lib to LD_LIBRARY_PATH, I can't import torch_scatter.
Import of torch breaks standard multiprocessing
ExponentialLR unexpectedly calls `step()` when init argument `last_epoch` is larger than -1
lintrunner should fail on badly formatted docstrings
[ONNX] test_op_consistency.py doesn't support constant inputs
skipIfTorchInductor Tracking Issue
copy_'s functionalized operator keeps copied into tensor live
aot_export_joint_simple on plain callable (not graph module) doesn't attach stack traces
scipy.ndimage.find_objects
torch.func.jvp fails when acting on a DistributedDataParallel model
Extend fake fast path to more situations
torch/distributed/_spmd/api.py should aot_module_export instead of make_fx directly
Calling pin_memory() fails for nested tensor
NotImplementedError in backprop on on dense-sparse matrices
DISABLED test_build_tuple_unpack_dynamic_shapes (torch._dynamo.testing.DynamicShapesMiscTests)
Add mps support for maxpool3d
Pytorch CXX11 ABI version
DISABLED test_inplace_grad_index_put_cuda_complex128 (__main__.TestBwdGradientsCUDA)
DISABLED test_inplace_grad_div_trunc_rounding_cuda_float64 (__main__.TestBwdGradientsCUDA)
DISABLED test_fn_grad_div_trunc_rounding_cuda_float64 (__main__.TestBwdGradientsCUDA)
Enable DEBUG asserts for C++ builds
BatchNorm can't be symbolically traced with torch.fx as a standalone module
Documentation Error of torch.onnx
CPU Fallback does not convert Tensor?[]
add Half support for AdaptiveAvgPool2d and AdaptiveMaxPool2d on CPU
AddressSanitizer: heap-buffer-overflow in test_comprehensive_nn_functional_embedding_bag_cpu_bfloat16
DISABLED test_build_tuple_unpack (__main__.DynamicShapesMiscTests)
Can't vmap over a slice expression
[Inductor] nanogpt_generate failed due to RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
[dynamo][BE] Revisit call_method of NNModuleVariable
DISABLED test_comprehensive_empty_strided_cuda_int64 (__main__.TestInductorOpInfoCUDA)
DISABLED test_call_parent_non_class_methods_from_child (torch._dynamo.testing.DynamicShapesMiscTests)
DISABLED test_comprehensive_empty_strided_cuda_float64 (__main__.TestInductorOpInfoCUDA)
torch.compile FakeTensor tracing fails with foreach ops with multiple devices
[FSDP] Ensure full precision checkpoints
[FSDP] Summon buffers in full precision
[PyTorch] Redirect c10::optional to std::optional
[dynamo] BackendCompilerFailed: backend='inductor' raised: NetworkXUnbounded: Infinite capacity path, flow unbounded above.
Request for adding support for `torch.rand_like`, `torch.randn_like`, `torch.randint_like` with `torch.Generator`
No pytorch_android 2.0.x builds
CrossEntropyLoss output difference on Windows vs. Linux
scaled_dot_product_attention produces NaN when input has NaN in masked-out positions
SummaryWriter background thread holds the GIL for too long
torch.flip is inplaced too aggressively in torch inductor
mps device bug - a weird inconsistency on tensor indexing operations
Parameter gradient is not moved parameter is moved across devices
[feature request] [minor] Inplace torch.flip_
can not find tensorrt
[compile] Tracker for `torchrec_dlrm` issues
Can't reproduce/non-deterministic results with CUDA
Crash on Python // PyArrow //
torch.quantile on MPS doesn't sort values when dim is not None
Can group convolution support other grouping methods?
torch.compile makes transformers model (llama) generating different outputs compared with the native
Error, attribute exists on the Python module, but we failed to convert Python type: 'list' to a TorchScript type
Observing negative number in PyTorch profiling
torch.jit.trace() Floating point exception
Unexpected modification to CPU affinity of Dataloader workers
Update minimum supported gcc to gcc-9
Fix annotation update code in qat_utils.py
[ATen][Sparse] Use Third-Party Eigen for sparse addmm
pytorch-nightly not have torch/version.py.tpl:cuda specified
Fix absolute links in pytorch repository and allow it to be proxied
Implement `to_numpy` method to speed up matplotlib with PyTorch arrays
DISABLED test_decoder_padding_and_src_mask_bool_cpu (__main__.TestTransformersCPU)
DISABLED test_encoder_padding_and_src_mask_bool_cpu (__main__.TestTransformersCPU)
Add support for bfloat16 in torch.from_numpy()
DISABLED test_fn_grad_remainder_cuda_float64 (__main__.TestBwdGradientsCUDA)
DISABLED test_fn_grad___rmod___cuda_float64 (__main__.TestBwdGradientsCUDA)
DISABLED test_fn_gradgrad_remainder_cuda_float64 (__main__.TestBwdGradientsCUDA)
DISABLED test_fn_grad_index_put_cuda_complex128 (__main__.TestBwdGradientsCUDA)
[TorchScript] aten::__and__ with argument types: Tensor, bool not supported
[discussion] [feature request] Native tensor-backed string array and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure)
[inductor][cpu] Cache for VecISA compilation results
[refs] inplace references resize the input to match the broadcasted input shape
Unexpected behavior of fmod op in some float32 input
Unexpected behavior comparing uint8 tensor to value greater than 255
torch.profiler.profile has an empty python replay stack under certain circumstances
Fake Tensor multithreading
DISABLED test_compare_cpu__refs_empty_strided_cuda_float32 (__main__.TestCommonCUDA)
[Dynamo + DDP] If DDP partitions FX graph generated by Dynamo correctly
[Dynamo] Can't inline functions under torch.nn.parallel
[BE]: pyupgrade Python to 3.8 - remove extraneous parentheses only
multiple values for argument `softmax_scale`
/Users/davidlaxer/pytorch/third_party/tensorpipe/third_party/libuv/src/unix/getaddrinfo.c:165:10: error: implicit declaration of function 'uv__idna_toascii' [-Werror,-Wimplicit-function-declaration] rc = uv__idna_toascii(hostname,
Unable to do tensor comparison on Metal Performance Shaders (MPS)
torch.cuda.set_device cannot use to set cpu device, but give an ambiguity hint
Exporting the operator 'aten::scatter_reduce' to ONNX opset version 15 is not supported
torch.nn.functional.scaled_dot_product_attention() : support both attn_mask and is_causal
Inconsistent performance degradation of 3x3 convolution (torch 2.0.1+cu118)
DISABLED test_noncontiguous_samples_matmul_cuda_float32 (__main__.TestCommonCUDA)
Support pipeline parallelism with PyG
Investigate random sequence number broadcast initially incorrect
onnx runtime error
[bazel] add inductor to bazel build
Regression in NCCL error handling
Do automatic replacement of scaled dot product with
fast sdpa implementation in HF models. r XqRuntimeError: Triton Error [CUDA]: device-side assert triggered when trying torch.compile max-autotune on nanoGPTr XEnhance FSDP debugabilityr! X>not yet implemented the batching rule for torchaudio::_lfilterr" XP2D inputs to linear layers run up to 25% slower than 4D ones on some Nvidia GPUsr# X(import functorch.dim monkeypatches torchr$ XSymIntify first class dimsr% XDelete old vmap prototyper& X#problem of compilation for torch2.0r' XDataParallel for nested modulesr( X4ONNX model different to pytorch and jit trace outputr) X?torch.Tensor.is_sparse returns false for non-COO sparse tensorsr* XSall_to_all_single seems to be missing a check for checkSplitSizes when splitsize=0.r+ XR[torch.compile] CRASH with segmentation fault when assign cuda value to cpu tensorr, XISparseAdam: working with dense parameters but sparse gradients - usecase r- X Theme updater. X:RuntimeError in Scaled Dot Product Attention Tutorial Coder/ Xcinductor: inductor conv2d get a different size and stride with eager mod when input channel is zeror0 XJfsdp training with the seq2seqTranier module gets stuck during evaluation.r1 X0Functions for Calculating Skewness and Kurtosis r2 XTerminate handlerr3 X,Pytorch 2.1.0.dev20230512 cuda not availabler4 X:Speed when installing from source is very low with CUDA 11r5 XDeprecated File bugr6 X_Shared library loading logic breaks when CUDA packages are installed in a non-standard locationr7 XcDocs suggestion `FullyShardedDataParallel.summon_full_params` must be called on all ranks/processesr8 XMOperations to shared tensors in the forked process could lead to silent crashr9 X=fused torch.optim.AdamW isn't faster than the unfused versionr: X)Support fake tensor real inputs in dynamor; XJShould be ok to call _dynamo.export and torch.compile under FakeTensorModer< X1IPEX as TorchDynamo Backend Performance Dashboardr= X|Noisy warning - torch.fx.experimental.symbolic_shapes: [WARNING] Ignored guard (...), this could 
result in accuracy problemsr> XQtorch._dynamo.exc.UserError: Dynamic control flow is not supported at the moment.r? XMac m2 MPSNDArray.mm:78: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: dimension index (2) not within number of dimensions (2) Dimension indices are 0-based'r@ XJ`einsum` is about 40x slower on CUDA than manually multiplying and summingrA XJTool for identifying where in eager model an operation is nondeterministicrB X7Different results with vmap when using torch.jit.scriptrC XKHow the [WARNING] using triton random, expect difference from eager arises?rD X/GLCM implementation in pytorch C++ api and cudarE X2Migrate windows runners to non-ephemeral instancesrF XgAOTAutograd export path does not support training graphs with parameters that do not receive gradients.rG Xcustom_op API follow-upsrH X8`dense -> sparse compressed` to work with empty batches.rI XOWeird dataloader performance degradation caused by torch and numpy import orderrJ XVPure virtual function call exception on Python interpreter exit when using debug wheelrK X;Fork run CI from upstream remote (more than 10,000 emails) rL X%version 4.26.1 to 4.29.0 has two bugsrM X\[torch.compile] torch._dynamo.exc.Unsupported: setattr(UserDefinedObjectVariable) for yolov7rN X.round float16 calculation error in mps backendrO XMDISABLED test_fsdp_tp_checkpoint_integration (__main__.TestTPFSDPIntegration)rP X8Fine-tuning HuggingFace wav2vec 2.0 with `torch.compile`rQ XRInconsistency between GPU memory usage in torch.cuda.memory_summary and nvidia-smirR X$[Dynamo] TB hf_Reformer graph breaksrS X>Importing torch after TensorFlow results in std::runtime_errorrT Xh[ONNX] OnnxFunction of aten_index_put_bool operation isn't consistent to aten::index_put inx FX exporterrU XSFault and vauge error when invoking nvcc: The system cannot find the file specifiedrV X[PyTorch/Triton with ROCm 5.5] torch._dynamo.exc.BackendCompilerFailed: debug_wrapper raised TypeError: 'NoneType' object is 
not subscriptablerW XSPytorch compile failure on Windows with CUDA 12.1 because of lacking NVTX componentrX X@Barriers to using torch.compile directly in PyTorch library coderY X+Tensorboard graph tracing with torch fx APIrZ X!Make compiled models serializabler[ X*addmv doesn't do type promotion correctly,r\ X([BE] Refactor logic for MultiTensorApplyr] XNCannot export quantized model to onnx: cannot call qscheme on UnknownQuantizerr^ X?Multiple Learning Rate Scheduler for Specific Parameters Groupsr_ X*Sequence annotation in type hints is wrongr` XPtorch.lobpcg producing different largest eigenvalue than scipy and np.linalg.eigra X/Lazily format C++ stack trace if it is not usedrb Xtorch.autograd.detect_anomaly should report the original forward trace as part of the error, rather than as out of band warningrc X/Tensor __getitem__ not documented, sparse grad?rd X1DISABLE libtorch-2.0.0+cu117 destructor exceptionre X6[torch.compile] the sum of `softmax` isn't `1` on cudarf XPExporting the operator 'prim::is_cuda' to ONNX opset version 14 is not supportedrg X5[PT2] torch.compile doesn't perform horizontal fusionrh XDUnsupported: ONNX export of operator interpolate (with scales) errorri X8[Placeholder] PyTorch 2.0 Dynamo/Inductor Hack{day/week}rj XUONNX TorchDynamo Exporter - Ability to export and load ONNX files without parametersrk X#Extending compatibility of LibTorchrl XkRuntimeError: nonzero is not supported for tensors with more than INT_MAX elements, file a support requestrm XFnative_batch_norm has different size results on "CPU" vs "META" devicern XBPytorch 2.0.1 pypi wheel does not install dependent cuda librariesro XFAssertionError: slice.Tensor is not supported with cpp wrapper (llama)rp X#Issues building with caffe2 enabledrq X[PyTorch installs the file mkldnn.cmake that looks for the package MKLDNN that doesn't existrr XJtorch.concat fails with float16 input in autocast(device_type=cpu) contextrs XVDISABLED test_vmapjvpvjp_linalg_lu_factor_ex_cuda_float32 
(__main__.TestOperatorsCUDA)rt X6[MPS] Track failures of test_module.py for MPS backendru Xq[onnx] UnsupportedOperatorError: Exporting the operator 'aten::l1_loss' to ONNX opset version 17 is not supportedrv XRevise glossaryrw XS`torch.distributions.categorical.Categorical` samples indices with zero probabilityrx X,MPS backend is not supported on MacOS 12.6.3ry X.onnx.export fails if do_constant_folding=Falserz X([BUG] Poor torch.bmm performance on H100r{ X*logger instead of print in lr_scheduler.pyr| XZAccuracy issues with Jitterated complex kernels for acos, acosh, asin, asinh, tan and tanhr} XBDynamo infers different return type vs. eager for `torch.ops.aten`r~ X8add new private operator copy-on-write torch._lazy_cloner X;implement a function to materialize a copy-on-write storager X;hf_LongFormer failing eval with inductor and dynamic shapesr XE[torch.compile] returns output with WRONG SHAPE after `cat_slice_cat`r X/Wrong type for `get_lr` inside lr_scheduler.pyir XsThere is a performance drop because we have not yet implemented the batching rule for aten::native_dropout_backwardr X.[Quant][pt2e] Failed to run pt2e flow on LLaMAr X5Quickstart notebook fails to train properly with ROCmr X7inductor cpp wrapper: crash when disable lowmem_dropoutr XDONNX Opset 16 GridSample Does Not Support 5D Volumetric Input Tensorr Xcompile torch2.0 in debug moder Xk[CUDA RPC] Incorrect results of GPU Tensor transferring using RPC when parallelized with other GPU programsr XB[torch.compile] returns NaN for `tensor.mul(big_number).softmax()`r X<[MPS] Unary ops yield wrong results if striding is differentr XHNightly torch.compile fails with dynamically patched `nn.module.forward`r XQ`torch::jit::EliminateExceptions` lowering pass never completes on specific modelr X[[CUDA RPC] Incorrect messages in CUDA Support RPC when parallelized with other GPU programsr X)torch.cuda.amp.GradScaler initialization r XH[Discussion] Investigate possibilities for Windows Arm64 BLAS and LAPACKr 
XUDISABLED test_inplace_gradgrad_remainder_cuda_float64 (__main__.TestBwdGradientsCUDA)r X,Add support for MaxPool3D on the MPS backendr XQOn UMA systems, pytorch fails to reserve memory exceeding the initial memory sizer XZUserWarning: must run observer before calling calculate_qparams. Returning default values.r XUOptimal Batch Size Selection in Torchdynamo Benchmarks for Different GPU Memory Sizesr X:tracing does not work when torch.distributions is involvedr XFWill Deep Implicit Models ever become first class citizens in PyTorch?r XFix static libr XWGPU VRAM usage significantly higher for Lenet5 models when compared to other frameworksr X([doc] torch.scalar_tensor doc is missingr XKSynchronization issue when combining DPP and RPC - "Parameter marked twice"r X3Add support for aten::tril_indices for MPS backend r XPundocumented error on torch.autograd.Function.jvp for non-Tensor forward returnsr X;Use a label instead of body text for merge blocking CI SEVsr X0[ONNX] Opset 18 support for TorchScript exporterr XGBackward hook execution order changes when input.requires_grad is Falser XZDISABLED test_inplace_grad_div_floor_rounding_cuda_float64 (__main__.TestBwdGradientsCUDA)r XGAccuracy repro extraction, constants in graph are not preserved exactlyr XvArithmetic of single-element Tensors with different dtypes on 'cpu' and 'mps' results in obscure/unhelpful `TypeError`r X7DISABLED test_wait_i_3 (__main__.TestMultiThreadedWait)r XT[Inductor] Fuse Attention pattern match doesn't work with masking or dropout or FP16r X7DISABLED test_wait_i_4 (__main__.TestMultiThreadedWait)r XTHigher GPU consumption for Lenet-5 and LSTM models when compared to other frameworksr X0Subgraph rewriter: Unable to match constant argsr X1Can't export onnx model from a torch script modelr X7Sparse Matrix nnz Overflow when casting from COO to CSRr X<Stop importing HuggingFace transformers in DataClassVariabler XyImport setuptools.command.build_ext from torch.utils.cpp_extension somehow 
indirectly imports Cython when it is installedr X>VecISA.__bool__ is very expensive (nearly a second) on startupr X__sfdp_init is extremely expensive for startup time, even on networks that don't benefit from itr X2[BUG] add 1 to different tensor but get same valuer XLsome of the enteries in the previous version of pytorch section are invalid r XJTensor on shared memory is set to 0 when using concurrent.futures and CUDAr XHnn.MultiheadAttention doesn't use efficient scaled_dot_product_attentionr X'[torch.compile] `sum` out-of-bound readr X#MPS device inference all same valuer X*[inductor] Move loop ordering after fusionr XCompiled function inside vmapr Xv[torch.compile] raises RuntimeError in `sdfp_pattern_1` that `Expected size for first two dimensions of batch2 tensor`r X)Using ddp training with different machiner XMgraph._export_onnx() incorrect data types in the binary string representationr X2Dynamo capture for HigherOrderOperators, followupsr X9[Performance] Potential Performance optimization for SDPAr XXprofiler.export_stacks doesn't return stack trace unless experimental_config is providedr X[discussion] "TensorList" as first-class abstraction (including python frontend) and as key for dispatch for merging `torch._foreach_*` into regular `torch.*` functionsr Xltorch.utils._content_store will deduplicate storage with identical contents; may be problematic for mutationr XSInteraction of torch.no_grad and torch.autocast context managers with torch.compiler X]torch.compile is not compatible with DPP with torch.nn.SyncBatchNorm.convert_sync_batchnorm()r X#Add missing `OpInfo`s for prims opsr Xwtensor with dims marked with torch._dynamo.mark_dynamic loses dynamic dim marks after being moved to a different devicer Xv`torch.sparse_csc_tensor` matrix multiplication produces MKL error SPARSE_STATUS_ALLOC_FAILED when density is too highr X(Illegal instruction in ARM64 (ver 2.0.0)r XQDISABLED test_open_device_registration (__main__.TestCppExtensionOpenRgistration)r 
X>This flag not work : torch.backends.cudnn.allow_tf32 = False r X(Error saving MONAI pytorch model to ONNXr X"Error building Pytorch from sourcer X='pip install triton' from pinned hash gives unreliable tritonr XX[pt2-functorch] torch.func.functional_call works with func.vmap but breaks for func.gradr X#Inductor origins still not accurater X2detectron2_fcos_r_50_fpn shape error with inductorr XRTransformerEncoderLayer behavior inconsistent between training and evaluation moder XC[regression] torch.norm with out dtype bfloat16 cause runtime errorr X6[Indexing] Incoherent Tensor indexing for nested listsr X*[compile] output does not match eager moder XPDISABLED test_checkpointing_resets_persistent_refs (__main__.CudaGraphTreeTests)r X&Issue with FSDP + HuggingFace generater X*add github check that diffs generated coder Xhtorch.compile() drops the performance of validation / Dynamo is not guarding on attributes on NN modulesr XFpre_autograd `make_fx` broken with simple F.linear with symbolic shaper X4Add compile option -Werror=return-type compile errorr XTnn.Transformer out[0:-1] not precisely equal to last_out when predicting in tgt maskr X3Issue of HistogramObserver to handle abnormal valuer X[Tensor Parallel] Clarify docsr XNDataloader multiprocess loading with num_worker > 0 calls __main__ file to runr XRevive multigpu testingr X/torch.triu() may returns wrong values using MPSr X Runtime Errorr X/OpInfo missing for `prims.convert_element_type`r X<Copying an MPS tensor to a CPU tensor using a for loop failsr XEtorch.cuda.is_available() crashes python in systems with disabled gpur X<Group Norm crashes on Apple M1/MPS devices for versions 2.0+r XXI encountered an error while trying to save the stylegan2 network as torch. onnx. 
exportr XUtorch.jit.trace can not trace buffer by Module.register_buffer() when use DDP Module.r X&`print` statement causes inplace errorr X.[inductor] Autotuning leads to non determinismr XSFSDP + gradient clipping raises an odd warning with the simplest model on torch 2.0r Xcbenchmarks/dynamo/ci_expected_accuracy/update_expected.py truncates file if only one shard succeedsr XNciflow/inductor should run both inference and training even if inference failsr X[RFC] DebugModer X9Deprecate torch.distributed.algorithms._optimizer_overlapr XECan the CUDA device LUID be exposed as part of _CudaDeviceProperties?r XSMany models are failing on periodic dynamic shape benchmark tests dynamic_aot_eagerr X(HashTest.Scalar from test_lazy is brokenr Xf[torch.compile] unsupported operand type(s) for @: 'Tensor' and 'Tensor' when enabling `shape_padding`r X-Dynamo config patching in our code is brittler X9Logs output_code and inductor do not interact as expectedr X[Slight numerical divergence between torch.compile and eager; shows up in practice on yolov3r X>NTK notebook calculates wrong object - wrong output dimensionsr XzWhen backend is nccl, the distribution group type generated by Pytorch 2.0 shoule be ProcessGroupNCCL, but is ProcessGroupr X+Tracer cannot infer type of Seq2SeqLMOutputr Xcuda.is_available() errorr X&AOTAutograd/Inductor file system cacher X`cat` gradgrad tests failingr X0torch.multinomial() always returns [0] using MPSr XIWindows fatal exception: stack overflow while using pytorch for computingr X-Automatic broadcasting for sparse csr tensorsr XYApple metal (MPS) device returning incorrect keypoints for YOLOv8 pose estimation model r X'Cannot compile torch 1.10 in CentOS 7.3r X<2.0.0+cu118 package missing proper libnvrtc-builtins.so.11.8r XyRuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides w/ `dynamo.export`, `make_fx` and `functionalize`r X%Deformable Convolution export to onnxr X?cuda 12.0 support request for building pytorch from source 
coder XDno-duplicate-decl-specifier as a invalid compile flag for CXX in GCCr XCpca_lowrank and svd_lowrank broken under automatic mixed precision.r Xwhen convert to onnx with dynamix_axis, the Reshape op value is always the same as static, dynamic_axis is useless, it cant't inference right shape dynamicallyr XWARNING: The shape inference of prim::PadPacked type is missing, so it may result in wrong shape inference for the exported graph. Please consider adding it in symbolic function.r X2gpu training work well, but cpu training not workr X`[torch.compile] can't multiply sequence by non-int of type 'float' when enabling `shape_padding`r XJIn torchelastic support running worker rank 0 on agent rank 0 consistentlyr Xf`torch.ops.aten.empty` is not discoverable from `dir(torch.ops.aten)` until explicitly calling getattrr XKConda MacOS installation install pytorch-1.13 rather than 2.0 as of Apr 4thr X9DistributedDataParallel doesn't work with complex buffersr X^[torch.compile] raises an error that expanded size doesn't match when enabling `shape_padding`r X+Ban GradScaler scale from being less than 1r X5Torch hangs at import if tensorflow is imported firstr XdParameterisation of MultivariateNormal distribution using Cholesky decomposition of precision matrixr XJConda Pytorch set processor affinity to the first physical core after forkr XCUPTI Initialization error r X1Make broadcast_coalesced to a op for processgroupr X&add Half support for layer_norm on CPUr XQTraining Faster R-CNN model with COCO dataset has been consistently unsuccessful.r X%lintrunner mypy raises error in numpyr X=Pytorch mobile crashes on Android when loading a custom modelr X<torch.func.jacrev fails if model contains full_backward_hookr X5Batching rule not implemented for aten::narrow.Tensorr X&Cross compile Pytorch for ARM in Bazelr XgJacfwd become slower after update pytorch ("We’ve integrated functorch into PyTorch---Documentation")r X)Inserting observer bug in FX quantizationr XSupport 
polyphase channelizerr XW'Illegal instruction (core dumped)' for gpt-j bf16 generation task using greedy search r X;Not Preserving Grad For Tensor Created Inside torch.compiler XIPrint the index and summary of the SampleInput that failed an OpInfo testr X@vision_maskrcnn failing on periodic dynamic_aot_eager_torchbenchr XERuntimeError: "replication_pad1d_cuda" not implemented for 'BFloat16'r! XU[DTensor] parallelize_module failed with nn.Transformer and the PairwiseParallel planr" X*Question about GRU(RNN/LSTM) outputs shaper# X-Extend TorchInductor to support more backendsr$ X<The meta implementation of `index_put` does not do any checkr% XPtorch.nn.functional.multilabel_margin_loss cuda lacks checking of "out of bound"r& XGTorch.fx.symbolic_trace removes some of the keys from module state_dictr' X6FakeTensor lacks support for sparse compressed tensorsr( XJSupport `cond` branches that reference variables defined in an outer scoper) XOInternal errors with cuda graph (CUBLAS_STATUS_NOT_INITIALIZED and jit failure)r* Xtorch.compile errorr+ XNPyTorch 2.0.0 encountered CUDA error: an illegal memory access was encounteredr, X,[c++17] Replace lock_guard with scoped_lock r- X"add `-std=c++20` build-only CI jobr. 
X`we should make semantically meaningless positional arguments positional only in our operator APIr/ X&torch.linalg.lstsq doc arguments errorr0 X'Functorch pytrees with custom iterablesr1 X"Torch func Documentation for treesr2 X CI for s390xr3 X/Need TransformerEncoder to output attention mapr4 X>There has implmenet bug in LTC IrBuilder's MakeSizeMul method.r5 X+Slicing and indexing support negative stepsr6 XSAutomatically set dropout for SDPA depending on training mode / `training` argumentr7 X>Add `TORCH_ASSERT_ONLY_METHOD_OPERATORS` to functorch codebaser8 X6Build error on libstc++ header stl_alogbase.h on riscvr9 X&[MPS] Add support for autocast in MPS r: XRemove lr_scheduler.print_lrr; X[MPS] Add lu_factorr< XEmbedding layer tensor shaper= X+the error message of torch.addcmul is wrongr> X<tools PYTHONPATH trick in run_test.py does not work reliablyr? X(Broken link for torch dynamo FAQ in docsr@ X&Adding MPS support for 3D convolutionsrA X)vision_maskrcnn failing in periodic/trunkrB X*Libtorch consumes too much memory as 16225rC X3intermittent inductor segfault in CI (test_ops.py)rD XASporadic CUDA error in `test_nccl_warn_not_in_group_debug_detail`rE X,opacus_cifar10 fails in dynamo due to hooks rF XzUnused `import torch` followed by `cuml.NearestNeighbors` leads to nondeterministic segfault (during Python process exit?)rG X(Add Debug builds for python with pydebugrH XtRun ChatRWKV on MBP(intel CPU)+eGPU[rx6800 16G], returna a very big number -9223372036854775808, looks like overflowrI XTORCH_COMPILE_ABLATE envvarrJ X9Spectral Normalization can not be applied to Conv{1,2,3}drK XF`torch.sparse.sum` backward fails when reducing over dense dimensions.rL XGNo documentation to show how to implement aten::view for custom backendrM XSMore Nested Tensor Functionality (layer_norm, cross_entropy / log_softmax&nll_loss)rN XWhy nn.Upsample/F.interpolate followed by nn.InstanceNorm2d will report error "Unsupported: ONNX export of instance_norm for unknown channel size."rO 
X&torch.cuda.is_available() return FalserP XaDISABLED test_fake_crossref_backward_no_amp_index_fill_cuda_float32 (__main__.TestFakeTensorCUDA)rQ XInvalid Reference to ClassrR X,Look into test coverage for `UntypedStorage`rS XPMemory allocation issues in distributions.multivariate_normal.MultivariateNormalrT XXAttributeError: type object 'torch._C._profiler.ProfilerActivity' has no attribute 'MPS'rU XSIssue on building from source: Remove -mfpu=neon option on MacOS with Apple siliconrV XJIs there a way to get the full call stack of pytorch from python to C/C++?rW X6Dtype changes while going from FX graph -> TorchscriptrX XB[BUG]Float32 attention mask not working with torch.autocast("cpu")rY XHcreate_graph_input and add_grapharg should be combined into one functionrZ Xc[torch.compile] makes `linear(permute(input))` succeed for integer input in `torch.no_grad` contextr[ XP[BE] Dedup the functorch skipOps mechanism and the common_method_invocations oner\ XLSparse Tensor: in-place operation on detached tensors no longer raised errorr] X[torch.compile] `replace_fx` r^ X8Please verify 1.14.0 ONNX release candidate on TestPyPI r_ XHbehaviour of `torch.tensor()` changes after editing `Tensor.__getitem__`r` XHAdd `torch.cat` support for torch native sparse tensors. 
(Need for PyG)ra X[torch.fx] Upgrade on node inforb XPtorch.dist with minus norm returns tensor(0.), while with -inf can return resultrc XITracingContext.get().frame_summary_stack doesn't produce full stack tracerd X)torch.sparse_csr_tensor() stops gradientsre X8Changing module attributes doesn't retrigger compilationrf Xadd gradscaler on CPUrg XDRequest for deterministic support for reflection_pad2d_backward_cudarh X@Integrate open device privateuse1 customized method registrationri X>Unable to load MultiStepLR with torch.load(weights_only=True) rj XAChange module to module_ in torch/csrc/api/include/torch/python.hrk XMove template code to headerrl XCTest failure: TestCommonCPU.test_python_ref__refs_abs_cpu_complex32rm XPChanges to TorchScript autodiff changing default behavior are no longer acceptedrn X9[PT2] AOTAutograd de-dups but skips de-dup guards for DDPro X3Expand component configurable logging system to C++rp XCDocument the user-facing API for the component-level logging systemrq X!Support SPDA on non-CUDA backendsrr X.Problem with instalation torch2 on a100+cu12.1rs X)Sparse Tensor not working for `torch.cat`rt X!Sharded Grad Scaler Issue Trackerru X1[PT2] Some errors with `cond` and `torch.compile`rv XxPyTorch's packaged libgomp causes significant performance penalties on CPU when used together with other Python packagesrw X4[PT2.0] empty output shape causes Segmentation faultrx X7[functorch] vmap_hessian_fc - fails under torch.compilery X?[functorch] functorch_maml_omniglot - fails under torch.compilerz X<[functorch] torch.compile - functorch transforms Interactionr{ X>[FSDP] summon_full_params with_grad=True CPU offload can crashr| XFile-level retry enhancementsr} X3autocast does not work properly on embedding moduler~ X#[FSDP] move up the first all gatherr XvDiscrepancy of supported Python versions between Get Started page and index of pre-built binaries for PIP installationr X6DataLoader doesn't accept non-cpu device for loading. 
r X2[SPMD] DistCompiler graph optimization improvementr X2[triton hash update] update the pinned triton hashr XCPytorch member variable not working after converting to onnx formatr XGConflict between ``torch.func`` transformations and ``torch.jit.trace``r XaUbuntu 22.04 LTS issue returned NULL without setting an exceptionr X3Torchscript: Name Mangling prevents Type Refinementr X,Linking ResNeXt PyTorch Hub in Pipeline docsr XJDISABLED test_gradgrad_nn_GroupNorm_cuda_float64 (__main__.TestModuleCUDA)r XFDISABLED test_grad_nn_GroupNorm_cuda_float64 (__main__.TestModuleCUDA)r X$torch.matmul with batched CSR matrixr X%[ux] Non-blocking tensor constructorsr X7Cannot use `checkpoint_sequential` with `torch.compile`r X:DISABLED test_transpose_with_norm (__main__.CPUReproTests)r X%Add test/distributed/test_c10d_mpi.pyr XWrong illustration in README.mdr X.Cannot use AT_CUDA_DRIVER_CHECK from user coder XA`F.interpolate` and `F.grid_sample` - documentation error and bugr X7Tracker - Failing models in the torch.compile dashboardr X3torch.jit.script codegen warning with cuda and vmapr X9Training runs 50% slower when using 2 GPUs comparing to 1r XDMemory corruption using torch.ops.* to access re-registered operatorr X:Segfault when using torch.ops.* to access de-registered opr XeDynamo compiled graph gets overwritten by eager in a data dependent branch when False branch is emptyr X8torch.cond should work with expressions involving SymIntr X(Power VSX vectorization support disabledr X?`torch.nn.utils.rnn.unpad_sequence` modifies arguments in-placer XYHigher order derivatives not working when setting compute device to `torch.device("mps")`r XI[onnx]Unsupported: ONNX export of convolution for kernel of unknown shaper XStrided to batch BSR/BSC conversion fails when the number of zeros per block varies while the number of blocks per patch is constantr Xxtorch.fx.GraphModule inside custom backend has `training` attribute always set to `True` regardless of the user settingsr X/Options 
are not forwarded to the custom backendr X!Improvements to FSDP debugabilityr XHBring CudaPluggableAllocator to feature parity with the Native Allocatorr Xtacotron2 times outr XCNeed better error message when a merge cancelled because of timeoutr X9Fail to pass test HAVE_XXX_REGEX while building pytorch r X#README could use link to governancer X1Torch Compile is slightly slower than eager mode.r Xassert callable(unaltered_fn)r X2[FX] Symbolic trace over `torch.Tensor.${fn}` APIsr X(Support backward hook optimizers in FSDPr X8Backwards graph is labeled incorrectly when dynamic=Truer X9PyTorch 1.12, high failure rate for test_optim/test_nadamr X1TORCH_COMPILE_DEBUG and TORCH_LOGS interact badlyr X'`torch.Tensor.layout` is not documentedr X&Contribute to the privateuse1 backend.r XK[PTD][Checkpoint] Enable single_file_per_rank for fsspec storage read/writer XKpip doesn't install the right version of pytorch when torchtext is involvedr X*Intermittent failure of mobilenet_v3_larger XG[functorch] [vmap] tests fail when `_set_vmap_fallback_enabled(False)`.r X@[cpu] Fix div with rounding_mode="floor" when division overflowsr Xn"We don't have an op for aten::bitwise_and but it isn't a special case." 
when exporting NMS operation as ONNX.r X2Make BetterTransformer implementation non-blockingr XWhen I use the DDP model, I use a custom loss function, when the batch size changes during training, the process will be stuck.r X[Inductor] [CPU] Huggingface model BartForCausalLM & MBartForCausalLM & OPTForCausalLM & PLBartForCausalLM performance regression > 10% on 2023-04-02 nightly releaser e(XPInconsistent nn.KLDivLoss behavior: 0s in target OK on cpu, but gives nan on mpsr XPhf_Longformer regression caused by https://github.com/pytorch/pytorch/pull/98119r XIBroken mypy check in test_type_hints.py::TestTypeHints::test_doc_examplesr X3DISABLED test_doc_examples (__main__.TestTypeHints)r XH[CUDA][MAGMA][Linalg] Remove MAGMA from CUDA linear algebra dependenciesr XG[Dynamo] Enable `dynamo.export` for huggingface models w/ `ModelOutput`r X8inductor `compile_fx_inner()` segfaults on `torch.isinf`r XGaten::_linalg_solve_ex.result' is not currently implemented for the MPSr XaWrong results for GELU forward pass (CPU vs MPS) while inferencing a GLPN model from huggingfacer X@torch.jit.script + legacy executor mode has diff in some patternr X=Add a deterministic version of reflection_pad2d_backward_cudar X$NaN appears when initializing tensorr XEAssertionError: was expecting embedding dimension of 22, but got 1320r X1torch.nn.init functions with `generator` argumentr X add register_default_collate_forr XsRuntimeError: CUDA error: an illegal memory access was encountered, torch/cuda/streams.py", line 94, in synchronizer X9[onnx] AdaptiveMaxPool2d can not convert to GlobalMaxPoolr X*how can i load seperate pytorch_model.bin?r X\The operator 'aten::_weight_norm_interface' is not currently implemented for the MPS device.r XDforward AD implimentation : _scaled_dot_product_efficient_attention r X(Compiling complex-valued functions failsr X#double free or corruption (fasttop)r XGA Segment Fault can be triggered in torch._grid_sampler_2d_cpu_fallbackr X>[interoperability] zero-size 
cuda arrays do not look supportedr X*PyTorch Profiler fails recording functionsr X3[pt2] `movedim` + `add_` + `cat` triggers exceptionr X9Request to cherrypick a fix into v1.13.1 (v1.8 has a CVE)r XFUnable to run session using exported ONNX model using dictionary inputr XAGroupNorm cpu/gpu parity tests fail with pretty large differencesr X0[dynamo] hf_Reformer's graph break has increasedr XGIs there a recommended implementation of yuv2RGB for the current torch?r X6Unexpected results with torch.nn.functional.layer_normr X( Add PrivateUse1 folder in aten/src/ATenr X/Request custom backend device memory Allocator.r X1Module 'Sequential' has no attribute '_modules' :r X<DISABLED test_scatter_1d (__main__.DeviceMeshCollectiveTest)r X*OpClassTensorOp for fp32 torch.bmm(NT, NT)r XAutomate aarch64 buildsr XK[Nova] Add metadata validation step to the smoke tests for core and domainsr X"Write Binary Builds oncall runbookr X5Create release checklist template for the Launch Dater X5Create a plan on removing conda dependency from CI/CDr X<matmul with CSR matrix in inference mode throws an exceptionr XZDataLoader with collate_fn that returns tensors in GPU memory raises warnings when deletedr X6torch.compile not compatible with multiprocessing poolr X3functional collective should respect the whole meshr X.Relax version dependencies on CUDA pip wheels?r X?Can libtorch be used for quantization-aware training of models?r XKDynamo doesn't report accurate line numbers for in some situationsr X*torch.randn signature is missing generatorr X0[CI/Infra] Record keeping: runner shutdown spiker X8Investigate Lazy{*}Norm{*}d modules no batch dim supportr X=BUG torch.jit.annotate on List + torch.stack give wrong DTYPEr X>`torch.func.functional_call` doesn't work with compiled modelsr X9Multiple model init using OpenMP in c++ does not speed upr X.Dropout traces poorly with AotAutograd/make_fxr X6A parameterized fill value for triu and tril functionsr X%Type conversion between float/complexr 
XFMissing torch import in _contextlib.py when using torch.jit._recursiver Xnn.linear not support bfloat16r X(Unable to install torch on python 3.8.16r Xotorch.onnx.errors.OnnxExporterError: Unsupported: ONNX export of operator unsafe_chunk, unknown dimension size.r Xmake tensor data const correctr X/Functionalize crashes on train_step GraphModuler XmTORCH_LIBRARIES variable leads to undefined reference function error in compiling while using libtorch in c++r X)Document _wrap_fx_args_as_onnxscript_argsr X&CUDA 10.2 cudnn 8.2.4 run Conv2d errorr X<[WIP] _nested_view_from_buffer.cont, torch.cat([NTs], dim=0)r XbMemory leak when saving an input tensor returned as-is if mark_dirty and running with dual tensorsr XLSupport sparse COO/CSR/CSC/BSR/BSC return values in gradcheck input functionr XHUsing `param in param_list` can trigger `non-singleton dimension` error?r X(Compile error when using consecutive padr X7Some c++ library docstrings incorrectly linked/repeatedr XAdd SSIM as Loss Functionr Xotorch.compile fails with torch._dynamo.exc.TorchRuntimeError on a function that contains a torch script moduler X5The first epoch is very slow when using torch.compiler XWinductor: NameError: name 'math_floor' is not defined when running fx_graph_runnable.pyr X#consider bumping `DEFAULT_PROTOCOL`r XDtorch.testing.assert_close: allow check to fail on part on the inputr XXTest Failure: TestUnaryUfuncsCPU.test_reference_numerics_normal_cos_cpu_float32 on s390xr XoneDNN 3.0+ supportr X*irrelevant error output for Minified repror XBug on Minified repro example r X8TypeError: 'torch._C._TensorMeta' object is not iterabler XNDynamo generates invalid frame when graph-breaking due to opacus_cifar10 hooksr XVdynamo sometimes hits the cache size limit due to the foreach flag in optimizer.step()r X= Compile targts cuda:0 rather than the device the model is onr X*[FSDP] Consolidate test_fsdp_state_dict.pyr X-Pytorch 2 compile + fsdp + transformers crashr X3[FSDP] test model.eval() + 
keep_low_precision_gradsr X1sparse_csr_tensor matmul wrong output in bfloat16r XGHow do I get the original object wrapped by the torch.fx.proxy class?r X,[bug] Internal assert failed when using pyror X#transposed 2d copy bfloat16 supportr X.torch.onnx.export support sparse tensor formatr X.Regression in jit for f-strings with new linesr XSJAX + PyTorch produces `OMP: Error #13: Assertion failure at kmp_affinity.cpp(532)`r XFtorch.zeros_like on a zero-sized BSR/BSC tensor results invalid tensorr X4Compile dynamic does not support GroupNorm in moduler XiMPS: grid_sampler_2d falls back to CPU, even though warning says it is natively supported on macOS >=13.1r XInsufficient MPS Documentationr X5can get_submodule be called within a ScriptFunction ?r XTorch 2.0 import hangs foreverr X=Multi-output derivative formulas can save unnecessary tensorsr XPackedSequence failure with MPSr! X%InfoNCE loss for contrastive learningr" XATransformerEncoder fast path raises incorrect mask dtype warning r# XBurn benchmark suites into CI docker image. Not only this saves test time, but also it will get rid of occasional model installation failures. (@weiwangmeta )r$ X'torch.cppExtension won't work with wsl2r% Xtorch.compile not work in WSLr& X.set_ operation on a view (detach()) of the view tensor changes grad_fn of the original view tensor from ViewBackward0 to AsStridedBackward0r' X `onnxrt` fails with compilationsr( X*Function Registry for extending collate_fnr) XE[DTensor] Add a unittest to cover default PG condition for DeviceMeshr* XImake_fx(functionalize(f), tracing_mode='symbolic') breaks on torch.matmulr+ X"Improve collectives fingerprintingr, X:pytorch dynamic quantized model failed to convert to onnx r- XChange progressbar for hubr. 
X5torch.compile not working with gradient checkpointingr/ XWsuspicious memory leak when increase DataLoader's prefetch_factor and enable pin_memoryr0 XDUnsupported: ONNX export of operator group_norm, unknown input rank.r1 XKAfter the release of pytorch 2.0.0, the compilation of ACLs is problematic.r2 XUstreamNone = get_cuda_stream(None) RuntimeError: invalid argument to getCurrentStreamr3 X,Further memcopy improvement at FX body levelr4 X:DISABLED test_checkpoint_trigger (__main__.TestCheckpoint)r5 XEImport fails when both `USE_TENSORPIPE=OFF` and `USE_DISTRIBUTED=ON`.r6 XExpanded weights tests brokenr7 XFDISABLED test_vmapjvpvjp_svd_cuda_float32 (__main__.TestOperatorsCUDA)r8 X6inductor illegal memory access on indirect load on cpur9 X@torch.compile()'d optimizer.step() has too many arguments in C++r: X"Sparse is not available on Windowsr; XBtorch.onnx.export crashes on ReduceMax operator with onnx opset 18r< X7Traced module shows non-deterministic behaviour on CUDAr= XD`torch.fmod` produces inconsistent results in eager and compile moder> XDtorch.ops.aten.pow(2.0, 3) return unexpected value with complex typer? 
X<`torch.compile` + `torch.no_grad` not working for Mask R-CNNr@ XOMPS: `unique` and `unique_consecutive` extremely slow when `return_counts=True`rA XHTorch Dynamo allow_in_graph doesn't capture the custom function in graphrB XR`jacrev` and `jacfwd` raise an error that `Sparse CSR tensors do not have strides`rC X\test_sparse_addmm fails on linux-bionic-py3.11-clang9 / test (crossref, 1, 2, linux.2xlarge)rD XE`jacfwd` fails when computing the gradient for `channels_last` tensorrE X [composable FSDP] clip_grad_normrF X)Desync debugger encounters traceMap errorrG XMDISABLED test_vmapjvpvjp_linalg_svd_cuda_float32 (__main__.TestOperatorsCUDA)rH Xfunctorch roll-up issue for 2.1rI X<Inductor "Original ATen:" doesn't work for backwards kernelsrJ XGNon-deterministic results when training a model on GPU with MPS backendrK X$Incompatibility with complex tensorsrL XINTERNAL ASSERT FAILED at "../c10/cuda/CUDAGraphsC10Utils.h":73, please report a bug to PyTorch. Unknown CUDA graph CaptureStatus32729rM X=aten::sym_size is not using torch._ops.OpOverload in FX graphrN X3Sequential/Partial unpickling and loading of modelsrO X5torch.randint range for torch.int64 dtype seems wrongrP X'Building LibTorch on Ubuntu with Mac M1rQ X1Support 0-sized batches in SyncBatchNorm cuda opsrR X0warn on future reshape alias mutation violationsrS X=Nightly conda binaries failed to pass tests since 2023-03-17 rT XU[FSDP][optim_state_dict] Need more comprehensive tests for optim_state_dict interfacerU X-Implement `torch.distributions.Poisson.cdf()`rV XFCustom recurrent network takes very long to compile for long sequencesrW XNRPC Tutorial can not profile the rpc operations communication between workersrX X<Problem with Hugging Face model that is not in training looprY XBRuntimeError: NYI: Named tensors are not supported with the tracerrZ X6Errors using torch.compile() on Megatron-LM GPT model r[ X;Incorrect gradient calculation for upsample nearest on CUDAr\ XTInconsistent results when mutating a 
tensor with shared storage in a nested functionr] XPMultiHeadAttention, fast path broken with `bias=False` or uneven number of headsr^ XH[Compile] NameError: name 'buf0' is not defined (raised in ddp-training)r_ X(nn.Conv function to compute conv formular` X;[Dynamo] symbolic_convert returns ValueError: Cell is emptyra XF[Feature Proposal: New Distributed Training Algorithms] LSGD and EASGDrb XoTransformerEncoder truncates output when some token positions are masked by `src_key_padding_mask` across batchrc X"Adaptive pool MPS: input sizes must be divisible by output sizes", I keep getting this error even when I try to adjust for sizerd Xslow torch import on macos re Xbtorch.cuda.FloatTensor().normal_() generate (partially) different sample on different gpu machinesrf X3A Segment Fault can be triggered in torch.embeddingrg X1A Segment Fault can be triggered in torch.adjointrh XNA crash due to Floating Point Exception can be triggered in torch.index_selectri XU[MPS] Fix and refactor unary/binary ops with non-zero offset or non-contiguous outputrj XStimm models that are instantiated using timm's fast norm layer trigger graph breaksrk XYInconsitent results before/after compilation for squeeze + tensor mutation + if statementrl X'[Dynamo] aot_autograd throws IndexErrorrm X,[Dynamo] compile_check_fn throws IndexError rn X![compile] KeyError: example_valuero XW[compile] TypeError: __init__() missing 1 required positional argument: 'parent_module'rp X9[RFC] CPU float16 performance optimization on eager mode.rq XWOptimize for mobile produces incorrect result with INSERT_FOLD_PREPACK_OPS optimizationrr X'DDP static graph fails for static modelrs X%How to get list of all valid devices?rt XConvnext breaks torch.compile ru X+[Inductor] atomic_add does not support bf16rv XPdeprecate integral and boolean dtype support torch.logit and torch.special.logitrw XN[Feature Request] Compile compatible Neighborhood Algorithms for large Tensorsrx X]Small learning rate with 
`capturable=True` causes Adam optimizer to blow up model parameters.ry Xh[Inductor] [CPU] Torchbench model hf_Reformer performance regression > 10% on 2023-03-15 nightly releaserz XVGet error: "tuple index with non-constant index" when exporting a model to ONNX formatr{ X[mps] conv1d outputs zerosr| X8[ONNX] Export failed for Module with Keyword-only inputsr} XWAdding sparse `addmv` and `triangular_solve` support on CPU - Mac OS - Apple Silicon M2r~ X*GPU:7900xtx Pytorch2.0.0 rocBLAS error:r XFTensorStorage error when deepcopying FX graph_module of Adam optimizerr X;torch.onnx.export failed for models with Bernoulli operatorr XTDoing inplace on a inplace view of tensor that retains_grad triggers internal assertr XUExpected scalar type Half but found Float when running nn.MultiheadAttention with AMPr X4Performance Drop for linalg_ldl_factor and ldl_solver X\`cumprod` triggers INTERNAL ASSERT FAILED when `out` is a tensor on cuda but input is on cpur XJ Segmentation fault (core dumped) during Torch finetuning (at random step)r X[MPS] pinverse dtype errorr X<`sparse.mm` triggers INTERNAL ASSERT FAILED when backwardingr X/Follow-ups to do after adding nested checkpointr X Improve checkpoint thread-safetyr X*[inductor] flaky rexnet_100 accuracy testsr X7[ONNX] FX exporter 'test_models_onnxruntime.py' trackerr X"Pruning under channels_last formatr XPytorch2.0 compile errorr X,Many padding Module fail memory_format testsr Xzwhen run python run_test.py -i test_ops_jit error like this. 
ValueError: option names {'--junit-xml-reruns'} already addedr X)Memory not release after jit.trace/freezer XH[MPS] `.to('mps')` zeroes out elements in tensors taking up >=2^32 bytesr X|[Inductor] [CPU] Huggingface model MobileBertForQuestionAnswering performance regression > 10% on 2023-03-12 nightly releaser Xq`logical_xx` operations trigger INTERNAL ASSERT FAIL when `input` is complex tensor on cuda and `other` is on cpur X?torch.compile mode="max-autotune" precision appears to be lowerr X[H100] `test_ops.py::TestFakeTensorCUDA.test_fake_crossref_backward_amp_nn_functional_scaled_dot_product_attention_cuda_float32` failedr XVNo GPU found, using CPU during preprocessing Error processing dataset with NsfHifiGAN r X3[FSDP] Make FSDP support local optimizer state_dictr X(Harden composable fully_shard: Checklistr Xtorch.compile gets stuckr X<PyTorch SGEMV is using 1 single core on AMD CPUs (very slow)r X0Completely different output between .pt and .ptlr XENot allow force merge when lint fails and not because of broken trunkr XURequest for adding Warning/Error feature when dropout set to 1.0 in Transformer layerr X:torch.cuda.graph "Invalid capture" with torch.linalg.solver X<Dataloader should kill & restart workers when timeout is hitr X[fix] angle for -0.0r X Build errors in two Vulkan filesr X#Tensor Permutation Along Given Axisr X4[MPS] Incorrect results for cumsum with bool tensorsr X;The output of torch.histc is incorrect on both CPU and CUDAr X5Triton compile error for pad + broadcast + pad on GPUr X7 No matching distribution found for torch==1.13.1+cu117r X\[MPS] softmax returns NaN attention probabilities for large tensors, in float16 and float32.r X>Why doesn't PyTorch install the REAL nvidia cuDNN pip package?r X<Proposal: Disable GC in test suite; GC after every test caser XIWrong return type from operation on custom tensor inside registered hook r X!Enable functorch testing for rocmr X3tests for linearize fail under the dynamo CI configr Xearly stoppingr 
X=[ONNX] FX exporter 'test_pytorch_onnx_onnxruntime.py' trackerr X6MPS Backend Doc, model = YourFavoriteNet() not definedr X#fft should ignore dims with shape 1r XOThe sign of torch.distributions.transforms.PowerTransform seems to be incorrectr XE[Inductor] C++ compile error when using integer type lower than int32r X.Make aten.rand and aten.empty as core aten opsr X:Torch Dynamo backend compilation error with dynamic = Truer XU[MINIFIER] Running code snippet with TORCHDYNAMO_REPRO_AFTER="dynamo" leads to errorr XF[torch.compile] Warning using BuiltinVariable.call_dict but not for {}r X7Shape Error when training HF deberta-base with Inductorr XG'aten::affine_grid_generator' to ONNX opset version 14 is not supportedr X6Unable to move torch.jit.load-ed models to XLA devicesr X5Information about CPU in `collect_env` is too verboser XCNot implemented error for `aten.quantize_per_tensor.tensor_qparams`r XxCompressed sparse constructor allows mixed `int32/int64` indices which leads to dtype promotion/demotion in conversions.r X]Add location information when exception are thrown in `torch.jit.annotations.try_ann_to_type`r XProxy Options for Pytorch Hubr XInitialization on `meta` device failing for models containing `nn.utils.weight_norm`, with `NotImplementedError: Could not run 'aten::_weight_norm_interface' with arguments from the 'Meta' backend.`r XO[dynamo] add hook to modify instructions before/after instructions be generatedr X9Mnist model training with "reduce-overhead" mode is flakyr X`[export] "strict subset of traced input/output" error when huggingface `ModelOutput` is returnedr Xo`dynamo.export` "input not consistent with traced input" error when input default value type is `torch.Tensor`.r X-[BE] Avoid .data usage in FSDP buffer castingr XD[minifier] hf_Longformer fp32 accuracy pass error cannot be minifiedr XIviews created in __torch_dispatch__ share storage but not version_counterr X<Support managed memory backed dlpack with torch.from_dlpackr 
XD`FractionalMaxPool3d` INTERNAL ASSERT FAILED when computing `jacrev`r XBReuse autograd.grad graph for rapid, repeated gradient calculationr X@Inductor guards are not propagated to Dynamo with dynamic shapesr X;Better error message when trying to run fp16 weights on CPUr X=Pytorch 2.0 Segmentation error / IMA on model compile on GPT2r XOA Segment Fault can be triggered in torch.adaptive_max_pool1d with an edge caser XAA Segment Fault can be triggered in torch.geqrf with an edge caser X2A Segment Fault can be triggered in torch.pinverser XMDynamic batch size support when combine `torchdynamo.export` and `compile_fx`r XlRuntimeError: one of the variables needed for gradient computation has been modified by an inplace operationr X=nn.interpolate scale_factor floors output size with floating r XN[MPS] F.conv1d and F.conv2d produce incorrect gradients when minibatch >= 2^16r XL[Dynamo] HuggingFace transformers configuration_utils graph break workaroundr XCdynamo + dict subclass + tensor instance check: NotImplementedErrorr X1`gradgradcheck` does not work with sparse inputs.r XL`ld: error: unknown argument '-force_load'` when linking libtorch on Androidr X5[torchdistx] Future of the large model initializationr Xmps bug: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: subRange.start (6) is not less than length of dimension[0] (6)'r XTmkldnn matmul kernel may be slower than openblas kernel for very small tensor shapesr XI`torch.utils.checkpoint` should avoid updating BatchNorm statistics twicer XFtorch.compile fails when compiling a T5-style model with HF interfacesr XJmake_fx tracing with dynamic shapes should also disable_slice_optimizationr XGActivation Checkpointing PT2 - AOTAutograd cannot handle set_rng_state r XStatic size boolean maskingr X9torch.where behaves differently from in place replacementr XKError during inference on iOS: INTERNAL ASSERT FAILED at it_type_base.h:535r X6[RFC] Extend CPU AMP to add FP16 support on eager moder 
XdDISABLED test_nn_sequential_invocation_dynamic_shapes (torch._dynamo.testing.DynamicShapesMiscTests)r XMAdd support for `__collate__` attrib on dataset elements in `default_collate`r X=Pytorch 2.0 installation tutorial does not work under Macbookr XOLinking libtorch with QT5 OpenGL application using llvmpipe mesa opengl crashesr X3MPS device throws error for `F.adaptive_avg_pool2d`r X'No speedup and a null pointer exceptionr X7Add arm64 builds for libtorch on MacOS with mps supportr XoCannot access data pointer of Tensor that doesn't have storage when using `torch.func.jvp` with `torch.compile`r XQuestions and Possible Features: Pytorch RPC 'future.wait()' will not release GIL which will block other thread's execution when using multithreading.r X`Encourage dynamo.export users to assume static by default if they call nonzero / unbacked SymIntr X([inductor][cpp] xcit_large_24_p8_224 OOMr XCGraphstate checkpointing doesn't checkpoint ShapeEnv / shape guardsr XK`@torch.jit.unused` does not properly ignore unsupported function signaturer X2FSDP fails to load state dict under inference_moder X4[vulkan] missing aten::reflection_pad1d.out operatorr X&The torch.sparse document's typo errorr Xadd debug moder XbBuild from source, Undefined symbol: c10::detail::maybe_wrap_dim_slow(long long, long long, bool)r X CPU time performance is unstabler X/training hangs at line torch.cuda.synchronize()r X arange bugr X0ROCm distributed flaky on test_distributed_spawnr X'torchvision Caltech101 collate_fn errorr XTautograd.functional.jacobian : tensor instead of function as input for reverse mode?r XM[PTD] dist.barrier() unreliable when using collectives from multiple threads.r X(The Benchmark test for dynamo runs stuckr X!corrupted size vs prev size errorr X>Input names provided three but onnx recognizes two inputs onlyr X4[JIT] Support string type annotations in NamedTuplesr X&weakref.proxy issue with torch.compiler XQ`AssertionError: Activation` when compile spconv structure like 
`BaseBEVBackbone`r X+Let Nested Tensor Metadata be cached on GPUr XgTorch._dynamo.optimize: The tensor has a non-zero number of elements, but its data is not allocated yetr XMdrastic speed regression of torch.jit.load starting with the 20230301 nightlyr X$Static asserts on accessor templatesr XnFully quantized model (`torch.quantization.convert`) produces incorrect output compared to analytical solutionr X"SymInt'ify _gather_sparse_backwardr XX`torch.Tensor.is_set_to` raises `NotImplementedError` when inputs contain sparse tensor r X:Implementing the batching rule for aten::bucketize.Tensor.r X%Inconsistent behaviour of torch.all()r XAchange stacksize_analysis to worklist algorithm for better resultr X>`torch.nanmedian` return a negative value when input is empty r XRFaketensor issue when using torch inductor as backend with Huggingface Trainer APIr X:dist.barrier() should be able to go through custom backendr XJdistributed training: lots of "Exception ignored" at the end of each epochr X/[BE] [cuDNN] Always build assuming cuDNN >= 8.0r Xfunctorch.compile.memory_efficient_fusion errors with: RuntimeError: forward() Expected a value of type 'Tensor (inferred)' for argument 'primals_356' but instead found type 'int'. 
r XLMultiheadattention module doesn't implement the function about kdim and vdimr X7`copy.deepcopy` does not copy gradients of nn.Parameterr X3Dynamo + MacOS: fatal error: 'omp.h' file not foundr X>Can only import torch after Tensorflow accessed its gpu devicer XCleanup redundant CMake coder XbTorchinductor backend fails to compile a model with index_put_(accumulate=True) with dtype float64r X!Unable to import ``torch.linalg``r X<torch.compile compilation time on huggingface regressed ~11%r XPtorch needs to SHOW that it support sm_89 even if functionally the same as sm_86r XXCreate a new Docker image with all inductor benchmarks and pre-trained models downloadedr X([inductor] Accuracy issue on Nvidia V100r XFPytorch Home Page does not specify which version of python it requiresr XTesting InvokeAI 2.3.1.post1, using mps, with PyTorch nightly dev20230226 yields RuntimeError cross-device copies are not allowed!)r XQ[onnx] sort / argsort with `stable` argument specified cannot be exported to onnxr XuPerformance bugs exists in multiple convolution operations(e.g., `Convtranspose2d`) when useing the `groups` argumentr! 
XTorchInductor fails with memoy violations in `test_comprehensive_grid_sampler_2d_cuda_float16` and `test_reflection_pad2d_dynamic_shapes_cuda`r" XJConfusing error messages from `torch.nn.LazyLinear` in different versions.r# X][Reproducibility]replication_pad2d_backward_cuda does not have a deterministic implementationr$ X_Support datatype argument for torch.distributed.all_gather() (And the whole distributed module)r% XRtest_layer_norm_backward and test_layer_norm_backward_5d run OOM in slow gradcheckr& XZtorch.jit.load documentation doesn't specify if it is safe to load untrusted models or notr' XX torch.distributions.kumaraswamy.Kumaraswamy generates samples outside its support (0,1)r( XATensor.all() fails on MPS for tensors with more than 4 dimensionsr) X1dynamo+aot improperly handles dupe args via *argsr* XImport parameters from jitr+ X8Torch RPC on multiple nodes with GPU returns a EOF errorr, X+Enrich shape operations with nested tensorsr- X-[BE] Make ActivationWrapper an abstract classr. X6hf_GPT2_large CPU inference shows random failure on CIr/ XM`add/add_` for CSC: errors when trying to access non-existent `crow_indices`.r0 XAExtend docs - Fixing out of memory with python garbage collectionr1 XItorch.profiler.tensorboard_trace_handler Generates an incorrect JSON filer2 XwIt seems that `torch.Tensor.addmv` and `torch.Tensor.addr` will check some inputs' dtype if and only if in `backward()`r3 XtRegression bug in `torch.nn.ReLU6` and `torch.nn.Hardtanh` that `inplace=True` doesn't work in PyTorch 1.10.0~1.13.1r4 XDynamo or FakeTensor bug: reshape(): argument 'shape' (position 1) must be tuple of ints, but found element of type FakeTensor at pos 0r5 X5PT2 Computes Multi Device Backward in a Single Threadr6 XVDISABLED test_variant_consistency_jit_linalg_lstsq_cpu_complex64 (__main__.TestJitCPU)r7 XParallel Associative Scanr8 XNtest_ddp_apply_optim_in_backward in distributed_test.py fails for gloo backendr9 X*Add log highlights to Dr. 
CI's failed jobsr: X1Investigate/add Windows Arm64 support for cpuinfor; X0Add oscillating activation functions to PyTorch.r< X3`argmin` + `view` Trigger Exception in compile moder= X3build failed when strictly following the guidelinesr> X\Changing behavior of module.to() to better support mixed real- and complex-valued parametersr? X$Circular padding error for 3D arraysr@ XD`torch.distributed.Store` triggers INTERNAL ASSER FAILED when setingrA XJ`torch.cartesian_prod` returns inconsistent dimensions with only one inputrB XContinuous dropout layerrC Xrtabulate is used by `torch.fx.graph_module.GraphModule.print_tabular` but is not installed when installing pytorchrD X=`Tensor.copy_` + `moveaxis` Trigger Exception in Compile ModerE X<Make this ridiculously long error message more user friendlyrF X.Pytorch profiler stack exporting does not workrG X+test_foreach failing cuda memory leak checkrH X:ONNX Exporter for circular padding mode in convolution opsrI X-Remove conda virtualenv from the docker imagerJ XiAdd parallel attention layers and Multi-Query Attention (MQA) from PaLM to the fast path for transformersrK X&new backend privateuseone with "to" oprL XEPytorch 2.0 [compile] scatter_add bf16 Compiled Fx GraphModule failedrM X:High Cuda Memory Consumption for Simple ResNet50 InferencerN X6Pytorch 2.0 [compile] index_add bf16 compilation errorrO XJPytorch 2.0 [compile] as_strided inplace causes out of bounds for storage rP XWDISABLED test_memory_format_nn_ConvTranspose2d_cuda_complex32 (__main__.TestModuleCUDA)rQ X7COO @ COO tries to allocate way too much memory on CUDArR XPAOTAutograd based torch.compile doesn't capture manual seed setting in the graphrS X/Reversing along a dimension, similarly to numpyrT X3Whether to consider native support for intel gpu?rU X0Add local version identifier to wheel file namesrV X/Differentiate with regard a subset of the inputrW XXDefault value of `validate_args` is set to `True` when passed as `None` in `Multinomial`rX Xv`INTERNAL 
ASSERT FAILED` -When using the PyTorch docker environment released by pytorch, a Vulcan support issue occursrY XCCosineAnnealingWarmRestarts but restarts are becoming more frequentrZ Xcuda 12 support request.r[ XKWhen using `ceil_mode=True`, `torch.nn.AvgPool1d` could get negative shape.r\ X]Proposal: `@capture`: Unified API for capturing functions across `{fx, proxy_tensor, dynamo}`r] XX`torch.nn.LazyLinear` crash when using torch.bfloat16 dtype in pytorch 1.12.0 and 1.13.0r^ XLAOTAutograd can add extra as_strided() calls when graph outputs alias inputsr_ X}RuntimeError: view_as_complex is only supported for half, float and double tensors, but got a tensor of scalar type: BFloat16r` X-test_torchinductor.py test isolation problemsra X:Implement a `torch.cuda.visible_device_indexes` function. rb X(Make artifacts easier to discover on HUDrc XRA100 Perf Job artifact zipfiles unzip to generic folder that loses job informationrd XS`torch.cuda.device_count` cached return value does not reflect environment changes.re X$Upsampling ResBlock GPU memory spikerf X_[Inductor] [CPU] Huggingface model AllenaiLongformerBase performance regression > 10% on ww07.4rg X[Inductor] [CPU] Huggingface model MT5ForConditionalGeneration &T5ForConditionalGeneration & T5Small performance regression > 10% on ww07.4rh XV[Inductor] [CPU] Torchbench model hf_Longformer performance regression > 10% on ww07.4ri Xi[Inductor] [CPU] Torchbench model hf_T5 & hf_T5_large & hf_T5_base performance regression > 10% on ww07.4rj XkcuDNN doesn't support convolutions with more than `INT_MAX` elements and native kernel uses too much memoryrk XCustom operations in inductorrl X>NCCL backend can't be used with a dataset that is IterDataPiperm XOinteractions between views + autograd.Function + AOTAutograd causes memory leakrn X4Internal Assert During Distributed Autograd Backpropro X-[libtorh]Consistency problem of gpu computingrp X*Slow inference of torchscript model in C++rq XlCSR matrix add_ error with 
RuntimeError: CUDA error: kernel launch failure when calling cusparseXcsrgeam2Nnzrr X%PR #88607 breaks build for POWER9 CPUrs X0[numpy] mean & nanmean should support int dtypesrt X=ASSERT(initialized()) Debug Error after JIT fusion on Windowsru XBOptimizer "Lion" in Symbolic Discovery of Optimization Algorithmsrv XMemory leak in torch.fft.rfftrw X)torch.sum does not return the sum on ROCmrx Xq[Inductor] [CPU] as_strided is much slower than empty_strided in single-thread single-batch mode in lennard_jonesry XUaten::cudnn_convolution chooses different conv implementation given the same inputs. rz X9[FSDP] Gradients not propagating for mixed precision caser{ X$torch.compile breaks reproducibilityr| XR`torch.compile` produces `RuntimeError` on function wrapped with `torch.func.grad`r} XDDynamo.export should support formatting tensor value within a stringr~ X)Rationalize specialize_int_float handlingr XGAllow Dynamo backends to use Inductor as fallback instead of eager moder XLinking error with Libtorchr X=Make `torch.onnx.utils._optimize_graph` use several CPU coresr X9`tag` parameter is ignored from NCCL P2P isend/irecv pairr Xgrid_sample with relative gridr X4Memory Corruption in torch.lstm caused by edge casesr X^ImportError: cannot import name 'Backend' from 'torch._C._distributed_c10d' (unknown location)r XrBuild Error: no matching function for call to ‘dnnl::graph::stream::stream()’r X'Compiling PyTorch from Source on Xavierr X4Compiling libtorch from Source on Mac Beyond v1.11.0r X[mta] Implement fused SGDr X!pytorch log level API and env varr X9Better Numpy API (interoperability between ML frameworks)r XH`torch.compile` doesn't consider the alias tensor created by `tensor[:]`r XQMPS internal error in `torch.gather` when last dimension is a singleton dimensionr X3Update PyTorch's default C standard to C17 from C11r X$Add nvml.dll search path for Windowsr X*Option to bypass NOLA check in torch.istftr XIInvestigate queue disparity between `windows.4xlarge` and 
`linux.4xlarge`r X8Split getitem OpInfo into dynamic and non-dynamic inputsr XV`where` triggers INTERNAL ASSERT FAILED when `out` is a long tensor due to mixed typesr X4A segment fault can be triggered in torch.avg_pool1dr XAA segment fault can be triggered in torch.max_pool1d_with_indicesr XNinductor `compile_fx_inner` output is incorrect on graph with trailing copy_()r XNan is output by GRU on mpsr X6[kineto] Enable CUPTI metrics profiling in pytorch …r X`UnsupportedOperatorError`, `OnnxExporterError` and `SymbolicValueError` related to MultiheadAttention export to onnx with torch.jit.scriptr X-A segment fault can be triggered in torch.svdr X>A segment fault can be triggered in torch.lstm with edge casesr XBcannot create weak reference to 'weakproxy' object in compile moder X%Missing FX documents for some modulesr X'dynamo: handle contiguous graph breaks r X&[RFC] Add a static_graph mode for FSDPr XJetson CI needs Updatesr XILots of different `nn.Sequence` instances trigger the Dynamo cache limitsr XSaving a `torch.nn.HuberLoss` using `torch.jit.script().save()` doesn't seem to implicitly convert from `int` type to `float` type.r XPyTorch 2.0: AttributeError: __torch__.torch.classes.c10d.ProcessGroup (of Python compilation unit at: 0) does not have a field with name 'shape'r X5A segment fault can be triggered in torch.histogramddr X>Memory corruptions can be triggered in torch._remove_batch_dimr X-Issue with `upsample_nearest2d` decompositionr X?A Segment Fault can be triggered in torch.affine_grid_generatorr X`permute` for named tensorsr XB[Dynamo] Key Mismatch When Loading Checkpoints Trained with Dynamor X Abort Caused by Virtual Functionr Xtorch.lgamma CUDA driver errorr XLDISABLED test_pickle_nn_RNN_eval_mode_cuda_float64 (__main__.TestModuleCUDA)r XpPerformance does not meet expectations when training OPT-30 with FSDP, there may be problems with cpu offloadingr XA[mypy] skipping mypy for a few torch/fx and torch/_subclass filesr X-Dynamo captures only CUDA 
streams in FX graphr X.pybind11 SymNode binding is a footgun py::castr XK[Functionalization] `index_reduce_` op tests with functionalization enabledr XKLSTM on CPU is significantly slower on PyTorch compared to other frameworksr XlDocument and promise reproducibility torch.randn / torch.rand / torch.randint family behavior on CPU devicesr Xb`jacrev` raise "Cannot access storage of TensorWrapper" error when computing the grad of `storage`r XXPickling OneCycleLR.state_dict() with an unpickleable optimizer will result in an error.r X>A better error msg for `cuda.jiterator` when input is on `cpu`r XA`get_debug_state` a script function causes INTERNAL ASSERT FAILEDr XgExporting the operator 'aten::_transformer_encoder_layer_fwd' to ONNX opset version 13 is not supportedr XX[RFC]FSDP API should make limit_all_gathers and forward_prefetch both default to be Truer X`nn.TransformerEncoderLayer fastpath (BetterTransformer) is much slower with src_key_padding_maskr Xu[fake_tensor] torch._subclasses.fake_tensor.DynamicOutputShapeException when calling torch.nonzero using aot_functionr X=jacfwd and jacrev are fundamentally broken for complex inputsr XH`func.jacrev()` should be implemented as `func.jacfwd().mT.contiguous()`r XD[pt20][eager] Lamb optimizer cannot be used in the compiled functionr XaInconsistent results when using torch.Tensor.bernoulli with float instead of Tensor probabilitiesr X;[dynamo] equivalent conditions get different optimized coder X:[fx] const_fold.split_const_subgraphs leads to UserWarningr XcQAT + torch.autocast does not work with default settings, missing fused fake_quant support for halfr XX`scatter` fails the gradient computation in reverse mode for `src` when `index` is emptyr X*cpu log1p for bfloat16 gives wrong result.r X<RFC: Enabling AVX512 dispatch for compute-intensive ATen opsr X)Unimplemented lowering - torch.jit.scriptr XRuntimeError: p.block != nullptr && p.block->ptr != nullptr INTERNAL ASSERT FAILED at 
"../c10/cuda/CUDACachingAllocator.cpp":1275, please report a bug to PyTorch.r X4CUBLAS_STATUS_NOT_SUPPORTED when calling cublasDgemvr X9torchdynamo.export doesn't work with float multiplicationr X7What type of attributes does symbolic function support?r Xqwhen group number is 2,and channel is 2, dim H and dim W is 1, N is 10,the result should be 0,but now it is not 0r X bugs when try parallel test coder XONNX export produces hundreds of weight/bias/Matmul/etc. files alongside the `.onnx` file, and the `.onnx` file seems to be incorrect.r X4GroupNorm ONNX export does not reproduce same outputr X;`PyTorchFileWriter` should drop the GIL while writing filesr X+unsqueeze a single dimension multiple timesr XJ`zeros_like` + `fill_` makes the gradient computation in forward mode failr XHAddition of hybrid CSR tensors produces incorrect and invalid CSR tensorr X>Addition of CSC/BSR/BSC tensors raises RuntimeError exceptionsr XGAddition of batch CSR tensors produces incorrect and invalid CSR tensorr XK[pt2] The min and max parameters of torch.clamp do not support numpy formatr XVFaster `pad_sequence` and `tensor_split` function with CUDA kernel, are they possible?r XYPytorch 2.0: Detection models from torchvision don't work with onnx and tensorrt backendsr X8DISABLED test_index_select_scalar (__main__.TestNLLLoss)r X6JIT: Dropout fails codegen on the third forward passesr X<Subclassed Tensors Decrease Training GPU Throughput by ~40% r X$Asking for a LAZYMODULEMIXIN warningr XAfaster WeightedRandomSampler implementation based on alias methodr XGA Floating Point Exception can be trigerred in torch._C._nn.slow_conv3dr Xb`cat` fails the gradient computation in forward mode with empty tensors when used with legacy vmapr X*dynamo crashes on optimizer initializationr XM`svd` triggers INTERNAL ASSERT FAILED when computing jacobian in forward moder XI`MSELoss` fails to compute the gradients when inputs have different dtyper X=`unfold` fails in forward mode when unfolding a scalar 
tensorr XBTracker for `scatter_reduce` additional reduction options requestsr X`[dynamo] enable export path to preserve a meaningful parameter name in the exported graph moduler X;Set AVX2 is minimum supported instruction set for Linux X86r XMType promotion for accumulate operation differs between eager and CPP dynamo r X6Type promotion mismatch between eager and inductor powr X1test_nccl_warn_not_in_group_debug_detail is flakyr X=`linalg.lstsq` fails the gradient computation in forward moder X`Enable Link Time Optimization in PyTorch 2.0 Release Binaries - Smaller, Faster, Better Binariesr XO[RFC] Support Huge Model Init Without mallocs for Compile/Distributed Use Casesr XGerror: no member named 'residual_with_sum_zero_point' in 'ideep::attr_tr X`torch.jit.trace` memory usage increase although forward is constant, and gets much slower than forward with model depth increaser X<[FSDP] `summon_full_params(writeback=True, rank0_only=True)`r X;onnx_torch.ModelProto exceeded maximum protobuf size of 2GBr XY[pt20][aot_eager] Exceed Python recursion limit with huge model or frequent recompilationr XCannot export models which access int/float stored as module attributes (they get unspecialized into inputs, which makes export choke)r X3Dynamo uses CONSTANT_MATCH guards for string inputsr X6[BUG] jit.trace not working for torchvision ViT modelsr X<[dynamo]: Unsupported: call_method ListVariable() copy [] {}r X$[Dynamo] Don't graph break on einopsr XsWhy does the torch model have no memory leaks under gpu, but there is a memory leak under cpu, torch version 1.10.1r XD[Dynamo] torch.autocast context manager doesn't support graph break r XDImporting tensorflow (2.12) before torch (2.0) hangs at import torchr XQ`PYTORCH_DEBUG_MODE`, better invalid index embedding lookup error message on cudar XCInductor miscompilation with dynamic shapes from Background_Mattingr X_Minifier related: perhaps same_two_models should reseed between the regular and optimized runs?r X<Bitwise-perfect 
method for (de)serializing tensors in base64r XWMinifier has trouble correctly setting up requires_grad'ness of inputs for forward onlyr X Enable CUPTIr X7torchdim can not be compiled for Python-3.11 on WindowsrXQsave_config/load_config for torch._dynamo.config and friends hardcodes file pathsrX@Failures in cuda11.7-py3.10-gcc7-sm86-periodic-dynamo-benchmarksrXRlarge number of temporary files generated when using dataloader with num_workers>0rX1EmbeddingBag to support mini-batches with offsetsrX9ONNX Export Fails: Model input type is Dict[str, Tensor] rX[pt2] MMDet meets Exception: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode error with aot_eager backendrX0torch.jit.script does not work with DataParallelrX=`log_softmax` + `pad` triggers assertion fail in compile moderXHMaskRCNN with `torch.compile` fails with `CUDA error: an illegal memory`r X>[pt2] cannot compile function having `gt`, `expand` and `add_`r XQ(DDP) RoBERTa_large training with `torch.compile` results in OOM and other issuesr X6Aot accuracy minifier with dynamic shapes doesn't workr XJOption for minifier to dump the actual tensor inputs/parameters to be usedr XcMinifier should also dump compilation artifacts from the real execution for ease of sanity checkingrX>Make torch.testing functions overrideable with torch_function?rX@Inductor miscompilation with dynamic shapes from LearningToPaintrX_Minifier launcher incorrectly runs backwards even when original reproducer didn't run backwardsrXrminifier_launcher.py silently swallows "ran into runtime exception which is likely an unrelated an issue" warningsrXATensorboard SummaryWriter with cloud storage does not work on MacrXRwhen I want to use a new backend, how to deal with the op with 'device' argument? 
rX'Quantized Transformer ONNX Export FailsrX1aten::int_repr not supported in torch.onnx.exportrXCMinifier should not use pickle to save state into minifier launcherrX+Minifier doesn't save/load functorch configrXhConvention for printing the "internal representation" of compiled functions from inductor/other backendsrX?[CI] PyTorch Windows Test AMIs contains CUDA-11.3 installationrX`torch.compile()` failed on Huggingface Flan-T5 `torch._dynamo.exc.Unsupported: call_function UserDefinedObjectVariable(forward) [] OrderedDict()`rX=Errors when running the fsdp benchmarks for hf_Bert and hf_T5rX8Estimate effort needed to bring PyTorch to Windows Arm64rXBug in torch.linalg.svd rX]Bad conversion from torch.split(2d_tensor,splitsize_list) to SplitToSequence OP (onnx export)rXJ`torch.compile` produce wrong result in `interpolate` when `mode=bilinear`r XFMaskRCNN model loaded fail with torch::jit::load(model_path) (C++ API)r!X:`min` reduction on float16 tensor failed on certain shapesr"X=USE_CUDNN=1 doesn't force cmake to fail if cudnn is not foundr#X;Well known way to request user backtrace when inside Dynamor$XxMinifier produces minifier script that doesn't fail accuracy on Background_Matting (dynamic shapes, inductor, inference)r%XNMinifier does not run on LearningToPaint (dynamic shapes, inductor, inference)r&Xssqueezenet1_1 fails accuracy with AMP (but not on CI and dashboard); minifier does not work (when not using cuDNN?)r'X.Build from Source Issues on MacOS Ventura 13.2r(X(Add Support for RockChip NPUs (RKNN(2)) r)X>Why is AvgPool2D taking longer than Conv2D for the same input?r*XF[RFC] PT2-Friendly Traceable, Functional Collective Communication APIsr+X+TorchDynamo Performance Dashboard (float32)r,X<Segmentation fault between Numpy and Pytorch using torch.bmmr-XSupport for VeLO optimizer.r.X*Dynamo doesn't support dict(list_argument)r/X"Dynamo doesn't support OrderedDictr0X(Failed to Open libnvrtc-builtins.so.11.7r1X[RFC] Flop counters in PyTorchr2X+[Releng] [Conda] 
Optimize PyTorch packagingr3XQDISABLED test_inplace_grad_index_put_cuda_float64 (__main__.TestBwdGradientsCUDA)r4XaDISABLED test_forward_mode_AD_linalg_det_singular_cuda_complex128 (__main__.TestFwdGradientsCUDA)r5XYDISABLED test_fn_grad_linalg_det_singular_cuda_complex128 (__main__.TestBwdGradientsCUDA)r6X5numpy v1.24 does not work with `writer.add_histogram`r7Xptxas segfault with PT2r8X8Replace pattern fails on incompatible function argumentsr9X"[BE] Improve FSDP <> AC Unit Testsr:X#Feature request: access to variabler;XiTest Failure: TestUpgraders.test_aten_div_scalar_at_3 on a big-endian machine (issue in torch.jit.load())r<X9ONNX export of batch_norm for unknown channel size issue.r=XDTracking issue for segfaults and floating point exceptions on 1.12.0r>X<test_jit_fuser_te SIGIOT's frequently during dynamo testing r?XInplace fused (leaky)relu+(leaky)dropout for memory savings (I think, can be made fully allocation-less if never fully allocating random mask in FlashAttention style and recover the mask from the output)r@X$Add Stride Argument For ConstructorsrAXX[Functionalization] Some ops need additional meta tensor support after functionalizationrBX?functorch.functionalize doesn't error out with logcumsumexp.outrCXTriton MLIR benchmarksrDXWtorch.jit.save() generates different contents in a file among different endian machinesrEX7[RFC] XLA Lazy Backend Support In DistributedTensor APIrFXPUnable to find an engine to execute when using pip to install but not with condarGXM[LibTorch] pickle_save output cannot be reloaded using pickle_load in WindowsrHX?[RFC] Make more operations inplace (GELU, BatchNorm, LayerNorm)rIX)JIT Function Fails when run a second timerJXJProfiler documentation doesn't mention some exports are mutually exclusiverKX#Enable OnDemand for Open Source CI rLX/Double free when running torch.linalg.ldl_solverMX"segfault when running torch.igammarNXeAbility to manually set the gradient in FSDP while inside `summon_full_params` and make it 
persistentrOX!Segfault when running torch.atan2rPXRtorch.fx fails to trace through "+" op between torch.Size and torch.fx.proxy.ProxyrQX?[complex] Jacobian of a non-holomorphic complex valued functionrRXZDynamo graph break due to context manager do not resume inside/outside the context managerrSXV[BE] move _apply_to_tensors from FSDP to torch.distributed.utils, use in _recursive_torTX(Segmentation fault when running torch.gerUX2Process get killed when running torch.combinationsrVX@Floating point exception when running torch.nn.AdaptiveMaxPool3drWX,Process get killed when running torch.normalrXX%segfault when running torch.lu_unpackrYXEno attribute torch._dynamo unless you explicitly import torch._dynamorZXA'MPS' issue: torch.multinomial() returning [-9223372036854775808]r[X3[JIT] Consecutive use of `addmm` Leads to Exceptionr\X9[JIT] Applying `conv2d` over Constants Leads to Exceptionr]X2Dynamo can not trace 'int(a_scalar_tensor.item())'r^X8[FSDP] Add `foreach` support to `FSDP.clip_grad_norm_()`r_Xiter(TensorVariable) failr`XPset_default_device/torch.device has performance impact for non-factory functionsraX0API to check for errors in c10d.ProcessGroupNCCLrbX+DDP+inductor+profiler crashes on toy modelrcXVTorchscript troubles with complex values. 
RuntimeError: isInt() INTERNAL ASSERT FAILEDrdXO[JIT] `Linear` + `BatchNorm2d` Trigger Inconsistency between Eager Mode and JITreX@14k github models TorchDynamo + TorchInductor bugs umbrella taskrfX-Traced model output differs on C++ and PythonrgXBUpdate quantization to make source files complient with /Zc:lambdarhX7INTERNAL ASSERT FAILED when mixed dtypes for `addcmul_`riX0Some tests in test_torchinductor.py fail locallyrjX3Improve Fake Tensor Error When Data Ptr is AccessedrkXD[JIT] INTERNAL ASSERT FAILED when `Conv2d` and `clamp` used togetherrlXTSpurious side effect diff when cond branches call different functions in outer scopermXV[JIT][TracingCheckError] inplace ops incompatible with `contiguous(.., channels_last)`rnX Major bug in Transformers' masksroX?[JIT] Inconsistency in tensor shape between eager mode and JITrpXPytorch AMP performance issue.rqX multiprocessing not work on WSL2rrXlINTERNAL ASSERT FAILED: Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.rsX![Inductor] support complex dtypesrtX,operations failed in TorchScript interpreterruXTypeError: no implementation found for 'torch._ops.aten.max.default' on types that implement __torch_dispatch__: []rvX;support setattr of arbitrary user provided types in tracingrwX1fft.fftshift, fft.ifftshift, roll not implementedrxX?backward(inputs= does not need to execute grad_fn of the inputsryX>Simplify module backward hooks to use multi-grad hooks insteadrzX3[Releng] Windows AMI needs to be pinned for releaser{X;Cost & performance estimation for Windows Arm64 compilationr|X*jit.fork stalls multiprocessing dataloaderr}XlRuntimeError: one of the variables needed for gradient computation has been modified by an inplace operationr~Xd"Get Started" tells us to use the anaconda installer for PyTorch 3.x - but this should be python 3.xrX0InstanceNorm operator support for Vulkan devicesrX(Always install cpu version automaticallyrX9distributions.Beta returning incorrect results at 0 and 
1rX[discussion] Fused MLPsrX4`model.to("cuda:0")` does not release all CPU memoryrXM`torch.load(..., map_location="cuda:0")` allocates memory on both CPU and GPUrXRtorch.cuda.is_available() returns True even if the CUDA hardware can't run pytorchrXtest_qnnpack_add failsrXhCapabilityBasedPartitioner incorrectly sorts the graph, causing optimizer return/output node to be firstrXLInfinite recursion when tracing through lift_fresh_copy OP in Adam optimizerrX+Add torch::jit::ScriptModule to the C++ APIrX.Hijacked package names from nightly repositoryrXImprove make_fx tracing speedrXfalse INTERNAL ASSERT FAILED at "../c10/cuda/CUDAGraphsC10Utils.h":73, please report a bug to PyTorch. Unknown CUDA graph CaptureStatus32680rXIRuntimeError: derivative for aten::mps_linear_backward is not implementedrX;Triton Autotuning Cache-Clearing Adds 256MB Memory OverheadrX&test_fx_passes generate bad test namesrX/"multi device" tests get skipped in standard CIrX@PyTorch 1.13.1 hangs with `torch.distributed.init_process_group`rXcException in distributed context doesn't propagate to child processes launched with multiprocessingrX1Occassional OverflowError with mps running yolov7rXF[PT2.0 Feature Proposal] TorchInductor CPU FP32 Inference OptimizationrXk[Bug][Dataloader] unable to mmap 2048 bytes from file : Cannot allocate memory (12)rX7Torchrun seems to have problem with virtual environmentrXBDISABLED test_cuda_variable_sharing (__main__.TestMultiprocessing)rX/Unable to export timm models with torch._dynamorX) Forward arguments are not updated in DDPrX8Error while building pytorch mobile binaries from sourcerX2DISABLED test_cdist_large_batch (__main__.TestMPS)rXcompilig MultiHeadAttentionrX*Implement forward AD with grid_sampler_2d rX%jit testing fails on 3.11 debug buildrXNUpdate docs URLs in torch/_functorch/autograd_function.py to stable before 2.0rXN[Releng] Add repo dispatch via webhook to trigger domain builds after the corerX:Add plots of LRSchedulers to doc to make it easier to 
readrXgautograd.functional.jacobian : Imaginary part is lost for functions with real input and complex output.rX/export does not support boolean tensor indexingrX>Torch's affinity setting lead to openvino using only one core.rX5An error happend when I convert pytorch model to onnxrX/sympy failure on model when dynamic_shapes=TruerX%Unknown CUDA graph CaptureStatus21852rX=Torchdynamo with onnxrt backend generating fake tensor errorsrX*Pytorch is using system-installed mkl-dnn.rXAHessian produces wrong results, but works if I add a perturbationrXpProxy/cache server option/hooks for downloading model checkpoints and dataset archive files in cloud environmentrX+CUDA error `CUBLAS_STATUS_NOT_INITIALIZED` rXG[PT2.0 Feature Proposal] GNN inference and training optimization on CPUrXTRuntimeError: philox_cuda_state for an unexpected CUDA generator used during capturerXN_pack_padded_sequence fails in dynamo due to requiring a non-fake 2nd argumentrX"elastic job failed when scale downrX:torchrun elastic always “address already in use” errorrX$Fails to build on ppc64le with clangrX*from torch import * does not import dtypesrXTProfiling with stack enabled results in error when Python's cProfile is also runningrXMONNXRuntime outputs numerically incorrect results for mixed precision models.rX2Lazily start worker threads in the autograd enginerXToTensor deadlock in subprocessrX5No setting to allow collecting the first trace early.rXIOnly the first logged trace in a given log dir is visible in tensorboard.rX ddp vs fsdprXRtorch.Categorical samples indexes with 0 probability when given logits as argumentrXtorchrun --help is too slowrX.torchrun default value of command line optionsrXjacrev over huber functionrXOAdding a page for subfolder/subfile overview/descriptions in the developer wikirX\torch.onnx.export is throwing RuntimeError: prim::TupleUnpack not matched to tuple constructrX0Missing python 3.11 on anaconda for torch 1.13.1rXB[inductor] `triton.runtime.jit` does not provide 
`get_cuda_stream`rX[Bug/functorch] Cannot use `tensor.detach().numpy()` for `GradTrackingTensor`: Cannot access data pointer of Tensor that doesn't have storagerX-Better API for `torch.cov` (and `Tensor.cov`)rX'Codegen for in_out_ptr seems suboptimalrX4Inconsistent rank among torch.distributed primitivesrX\Error while building pytorch from source on windows - Ninja Build Stopped, Subcommand FailedrX CUDA error: initialization errorrX;SymIntType gets translated to int when going through pybindrX<[bazel] replace //c10:headers dependency by //c10 dependencyrX7tracing torchvision detection model results in an errorrX/[MPS] Improve the performance of torch.linear()rX.Errors using torch.compile() on wav2vec2 modelrX8linspace (and arange) behaves differently on GPU and CPUrXADynamo minifier fails with false internal assert on torch-nightlyrXG`@torch.compile` fails with `InternalTorchDynamoError` on torch-nightlyrX(Add vmap support for torch.linalg.vanderrXDSegmentation fault after trying to create a tensor with float valuesrXCBuild from source fails: undefined reference to caffe2::DeviceQueryrXm[discussion] Analyzing a list of tensors stored as intermediate values / saved_for_backward in autograd graphrX[quantization fuse in convert_fx leave a wrong dequantize node when fuse multiple-input noderX.Sparse tensor not supported (Minkowski Engine)rX#Wrong in building torch from sourcerX^AssertionError: tensor's device must be `meta` when trying to export a fake-initialized modulerX>[FSDP][BE] Add check that compute device equals current devicerX?FakeTensors not moving between device properly on Module.cuda()rXEStochastic Illegal Memory Access error mid-epoch on AWS p4d instancesrXYSegmentation fault when running torch.nn.functional.fractional_max_pool3d on torch 1.13.1rX)Periodic ROCM distribtued jobs are brokenrXtrainerrXGInvestigate CUDA enabled build-time difference between MSVC and GCC+WSLrX.Cross-compiled libtorch Windows Arm64 binariesrXSThere is no developer 
documentation about getting started with MPS native debuggingrXMMPS: `torch.sub` erroneously returns 0 on outputs of `chunk` via `layer_norm`rX<sparse.mm(coo, dense) produces wrong results on T4/V100 GPUsrXSSL: CERTIFICATE_VERIFY_FAILED while trying to download pretrained model within a company that transforms SSL certificates for security purposesrXwrong assert messagerX.vmap + nn.SyncBatchNorm.convert_sync_batchnormrXM`mul(CSC, CSC)` fails with layout mismatch between the inputs and the output.rXADivision by zero error when running torch.nn.functional.lp_pool1drXUCrashes of linalg.ldl_solve on different edge cases not coming from linalg.ldl_factorrX5Softmax function slows down for data with large rangerX2LBFGS wolfe exceeds the maximum allowed iterationsrX'[RFC] FP8 dtype introduction to PyTorchrXEAdd BlockWise Distribution Support to the torch.distributions PackagerX8Security policy impractical / lacks contact information?rX:torch.compiled mish function is x5 slower than eager (CPU)rXiBuild Error: OpenMP library could not be found. 
Proceeding might lead to highly sub-optimal performance.rX+min/max not supported for Long dtype on MPSrX\`torch::jit::optimize_for_inference` doesn't preserve exported methods when calling `freeze`rX:Segmentation fault when running torch.nn.AdaptiveMaxPool3drXKOverflow when running torch.nn.AdaptiveMaxPool3d on torch 1.12.0 and 1.13.1rX:Segmentation fault when running torch.nn.AdaptiveMaxPool2drX0Overflow when running torch.nn.AdaptiveMaxPool2drXR[Inductor] The Way of Input Mutation Handing Conflicts with CPP Kernel DeclarationrXJAdding label smoothing option to `nn.BCELoss` and `nn.BCEWithLogitsLoss`?rXePython 3.11.1 , even with nightly version of PyTorch: ERROR: No matching distribution found for torchrXj`torch.compile` frees computation graph in a GAN training setup and tries to call `backward` a second timerX6Unclear how to change compiler used by `torch.compile`rXHThe speed of matrix inversion is relatively slow for many small matricesrXMWhen dist.broadcast float32 to int64, it will silently generate wrong resultsrXCannot cast float64 to float32rX8functorch.so is installed back into the source directoryrX8[functorch] make batch norm docs point to UX limitationsrX3Update map_nt to take into account size and stridesrXbtorch.jit.script ERR: RuntimeError: Can't redefine method: forward on class: __torch__.SoSadModulerX8DISABLED test_tensor_requires_grad (test_jit.TestScript)rX(DISABLED test_rand (test_jit.TestScript)rX3DISABLED test_optional_tensor (test_jit.TestScript)rX7DISABLED test_prim_grad_undefined (test_jit.TestScript)rX6DISABLED test_requires_grad_loop (test_jit.TestScript)rXBDISABLED test_successful (jit.test_freezing.TestMKLDNNReinplacing)r XPDISABLED test_switch_inputs_to_inplace (jit.test_freezing.TestMKLDNNReinplacing)r XKDISABLED test_always_alive_values (jit.test_freezing.TestMKLDNNReinplacing)r X1DISABLED test_optional_list (test_jit.TestScript)r X?DISABLED test_tensor_as_tensor_shape_prop (test_jit.TestScript)r XFDISABLED test_merge_liveness 
(jit.test_freezing.TestMKLDNNReinplacing)rX)Clean up nt impl duplicates where one canrX0torch.compile loud error on functorch transformsrX@torch.compile with aotautograd does not support double backwardsrX>torch.compile incorrect when imperative autograd APIs are usedrX@DISABLED test_fs_preserve_sharing (__main__.TestMultiprocessing)rX;Degenerate ranges are allowed in NumPy, but not in PyTorch.rX6Pytorch2.0 doesn't support compiling GRU and RNN modelrX)using Tensor subclass between vmap layersrX7Batch_first attribute in quantizable multiheadattentionrXD[bazel] error: use of undeclared identifier 'cudaGraphDebugDotPrint'rX0Pytorch clang-tidy header-filter is still brokenrXI[JIT] Zero-channel conv2d cannot be applied with `optimize_for_inference`rX8PyObject preservation and resurrection for `StorageImpl`rXbgetting issue 'typeindex' file not found in Littorch-Lite/install/include/ATen/core/custom_class.hrXInternal Assert failedrX<[RFC] `quantile` should work for `float16`/`half` on the GPUrX;`NotImplementedError` when using `torch.distributed.launch`rX8PyTorch memory leak reference cycle in for loop, Mac M1 r X=MPS backend does not accept int64 model weights or input datar!X9Offer `get_buffer` and `get_submodule` in `ScriptModule`?r"X,[JIT] .backward() not supported by JIT tracer#Xnop_partitioner for AOTAutogradr$X=Perf reduction due to munmap with dataloader pinning thread ?r%X7Internal error during ONNX export, diagnostic unusable r&XRemove redundant logicsr'XLPytorch 1.13 conda package with cuda requires too many unneccessary packagesr(XTSupport for saving multiple storages/tensors that view same data as different dtypesr)X#Expand torch.utils._pytree.tree_mapr*XQProfiler is not properly recording seq number when any key above autograd is usedr+XBAOT Autograd should allow backend compilers to see input mutationsr,XPiSTFT produces RuntimeError with center=False and Blackman/Bartlett/Hann windowsr-Xq`nn.TransformerEncoderLayer` fastpath (BetterTransformer) is slower 
than the normal path when no mask is providedr.XG[🚀 Feature Request] pdf and sampling from Alpha-stable distribution r/Xmtorch.fx tracer emits type error when tracing module that directly contains and uses the torch.cat() functionr0XFcustom Function that supports functorch jvp doesn't work with in-placer1X#Keep getting ChildFailedError Errorr2X(torch.compile for calling func(**kwargs)r3XHtensor.to_sparse() handling indices incorrectly under dynamo/fake tensorr4X9Make quant_min/quant_max required for observer/fake_quantr5XpOpen file leak when dataloader is using persistent_workers and pin_memory AND you create multiple dataloaders. r6X<Potential bug found with pybind11 dec_ref while gil releasedr7X7Use dynamo to detect incorrect op schemas automaticallyr8X<Segmentation faults in DataLoader (in latest torch version).r9Xfirst class dims leak memeoryr:X(ONNX export question (using torchdynamo)r;X.JIT mishandles torch.round() in PyTorch 1.10.1r<X.Odd/hand-wavy mathematical notation for Conv2Dr=X5dcp resharding does not work for optimizer state_dictr>X=functorch.functionalize doesn't work with torch.autograd.gradr?X8DISABLED test_index_add_correctness (__main__.TestTorch)r@X6onednn(mkldnn) backend support for quantized operatorsrAX<not able to import pipelines as torch.distributed is missingrBXq[FSDP] FSDP with CPU offload consumes `1.65X` more GPU memory when training models with most of the params frozenrCX,`quantile` fails for `float16`/`half` inputsrDX6[Composable] Enable summon_full_params for fully_shardrEX'[BE] Investigate FSDP test _zero_model rFXIs there a way to write passes?rGXF[Dynamo] Graph Re-compilation Invoked by Changes of Unused Dict ValuesrHXR[TorchScript] Failed to Forward Correct Number of Arguments to Different FunctionsrIX\Remove redundant memory copy for HF multi-attention submodule for cpu path using MKL prepackrJXN[Inductor] `test_tmp_not_defined_issue1_cuda` raises `RuntimeError` but passesrKXYImplement L1 and L2 gradient as hooks with the 
option of changing the weight decay value.rLX2Unexpected behavior when running torch.max in cudarMXHIf minifier test fails, stderr/stdout of subprocess calls is not printedrNX4Simple deleting from the sys cache fails on reimportrOX%A Simple Function Causing Graph BreakrPX3node.stack_trace does not handle escaping correctlyrQX7overflow (?) on cuda tensor after matrix multiplicationrRXDCrash in `index_select` with singleton `self`, non-singleton `index`rSXWas_strided_scatter : INTERNAL_ASSERT_FAILED for requires_grad=True and non-config inputrTXZTorchDynamo doesn't inline modified nn.Modules forward - Fails with Huggingface AcceleraterUX+[Composable] Enable setting state_dict_typerVX;Add support for torch.zero_grad in dynamo w/ dynamic shapesrWXJdynamo.optimizations.training.aot_autograd does not trace correct overloadrXX=Support for Transformer Models on Android with Vulkan BackendrYX<Functorch does not work with CrossEntropyLoss and label=-100rZXMTorch SummaryWriter import fails with torch 2.0 with an error on numpy.objectr[XrError in guard code crashes process NULL ERROR: /Users/ezyang/Dev/pytorch-metal/torch/csrc/dynamo/eval_frame.c:251r\X&Retrieve Tensor from Tensor.data_ptr()r]X1Check that SymPy semantics match Python semanticsr^X^ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a packager_XYDISABLED test_numpy_ref_mps_nn_functional_group_norm_mps_float32 (__main__.TestCommonMPS)r`XP[inductor] Add more matmul configurations to `TORCHINDUCTOR_MAX_AUTOTUNE=1` moderaX5TorchScript with complex abs doesn't work in backwardrbX-Exporter for ONNX GroupNormalization operatorrcXEUmbrella issue for weakref related Dynamo PyTorch test suite failuresrdXLUmbrella issue for only populate real_value_cache in export test suite failsreXMSupport arbitrary masks for _nested_tensor_from_mask in nn.TransformerEncoderrfXdUmbrella issue for PyTorch test suite failures from torch.* returned non-Tensor output unimplementedrgX>[FSDP] Prepare to 
deprecate `FullyShardedDataParallel.`
[FSDP] Investigate the need for public `check_is_root()` method
[Distributed] Destruction order fiasco in ProcessGroupNCCL workCleanupLoop()
AOT Autograd doesn't respect no_grad() during input mutations
nn.MultiheadAttention softmax inconsistent in training mode
[FSDP] `fully_shard` Follow-Ups & Known Issues
Inference Mode docs
Error when using torch.compile with Pytorch2.0
Compare oneDNN and OpenBLAS backend of PyTorch on arm64 architecture
Support for Pylint
Support `divmod` for tensors
nn.CrossEntropyLoss error out when the sample size is large
[Composable API] Add `fully_shard` state dict unit test after manual "wrapping" is supported
[FSDP] Investigate `test_fsdp_pure_fp16.py` inaccuracy
Extend "torch.utils.cpp_extension.load" for both lib64 and **lib**
Cannot compile torchtext model
PyTorch 2.0 not working on Windows
Large slow down by not calling `torch.set_num_threads`
Pytorch 2.0 document issue
Adam (fused=True) issues
pytorch prune in libtorch
Importing torch causes segfault when using python installed with conda
[Inductor] [CPU] optimize thread parallel and loop collapse
Adopt full_backward_pre_hook in DDP
test_copy_broadcast
PixelShuffle/Unshuffle Channels Last Support on GPU
RuntimeError: kind_.is_prim() INTERNAL ASSERT FAILED. Only prim ops are allowed to not have a registered operator but aten::mul doesn't have one either. We don't know if this op has side effects.
`torch.empty` produces incorrect tensors with `layout=sparse_csr|sparse_csc` on the CPU
[ONNX] Exporting the operator ::concat to ONNX opset version 13 is not supported.
In distributed get SIGTERM and run crash
RuntimeError: Error in dlopen: libnvJitLink.so.12: cannot open shared object file: No such file or directory
[Feature Request] An alternative sampling routine for Dirichlet to fix Dirichlet and Beta sampling bugs
[FSDP][BE] Post-backward hook and `FlatParamHandle` dtype cleanup
[FSDP][BE] `test_fsdp_comm_hooks.py` cleanup
torch.min document not up to date
`torch.inverse` multi-threading RuntimeError: lazy wrapper should be called at most once
Operator overload priority should not rely on static initialization order
Export to ONNX of `as_strided()` hard codes stride in the graph, although it should be dynamic
[threaded pg] All threads share one Random Number Generator
AttributeError: 'tuple' object has no attribute 'grad'
Multiprocessing "Error Propagation" doesn't work for FullyShardedDataParallelism.
Bfloat16 tensor .numpy() support
[discussion, idea] Batched, vectorized base64 decoding / encoding + maybe RLE decoding / encoding
[RFC] Add torch.backends.tbb.is_available()
Embedding dynamic quantization is not documented and hard to use
AOT Autograd should differentiate intermediate leaves.
Could not run 'aten::as_strided' with arguments from the 'Metal' backend.
Abort called in FSDP tests
Unable to link LibTorch against CUDA and CUDNN statically
[Dispatchable Collectives] Follow up tasks
torch.compile() BackendCompilerFailed: _compile_fn raised RuntimeError
Bugs about BART of Hugging Face using Pytorch 2.0
Illegal hardware instruction following Real Time Inference on Raspberry Pi 4 tutorial
Illegal hardware instruction using torch.nn.Conv2d on aarch64 (Raspberry Pi 4)
valgrind failure `Conditional jump or move depends on uninitialised value(s)`
[A ERROR in Docker] RuntimeError: CUDA error: no kernel image is available for execution on the device
Can torchrun have a shell completion?
Functionalization on inplace_views should properly reflect autograd metadata
Tensor indexing and slicing documentation should explicitly state that indexing follows numpy semantics and link to the numpy indexing documentation.
Internal assert when ctx.saved_tensors fails when saving results of an intermediate view tensor with torch.utils.checkpoint and use_reentrant=False
Dynamo and cond with free variables creates malformed graph
Saving a scripted module to a buffer does not work.
[FSDP] Revisit meta device initialization
PR #89436 looks like it causes or enables a memory leak
Assertion failed: scales.is_weights() && "Resize scales must be an initializer!"
Strange issue with tensor asyncio and RPC
Different behavior for complex numbers operations with numpy
RuntimeError: Placeholder storage has not been allocated on MPS device!
Torch 1.13 Onnx Scope name not correct!
A few functions in fbgemm_utils.cpp are defined in global namespace
Importing numpy makes Tensor min max crash
IValue(c10::List) constructor is confusing and undocumented
Cannot add target-level dependencies to non-existent target "gloo_cuda".
FX graph mode quant: backendconfig configuration missing for torch.nn.GRU
torch.utils.tensorboard import fails if a new protobuf > 3.20 is installed (bug in tensorboard/tensorflow but better guard against it)
"Reached a code path in Module.get_extra_state() that should never be called."
[JIT] Wrong type inference leads to misleading error message
Get the error: AttributeError: Can't pickle local object 'convert_frame.._convert_frame'
[JIT] INTERNAL ASSERT FAILED `torch.add` with boolean primitive constant
[JIT] INTERNAL ASSERT FAILED `torch.mul` with boolean primitive constant
[JIT] INTERNAL ASSERT FAILED when dispatching for `torch.Tensor.view`
[ONNX] test_mask_rcnn in test_models_onnxruntime.py failed with ONNX version==1.13.0
[RFC] Allow FSDP mixed precision for only certain type of submodules
[Tracking Issue] Mixed precision does not work with ignored modules
Inconsistent Hash of IValue between aten/src/ATen/core/ivalue.cpp and aten/src/ATen/core/Dict_inl.h
Unknown buildin op: aten::pad
torch._dynamo.exc.Unsupported: dynamic shapes: arange
quantization qconfig: can we set per-channel quant as default for qnnpack?
quantization observers: can we relax the default epsilon value?
Public API definition is not compatible with `torch.testing`
cannot backward()
make_fx loses node.stack_trace / turn on AOTAutograd by default for all backends
Is it possible to add a parameter in torch.onnx.export to skip the prim::PythonOp subgraph process when exporting the autograd function?
Why torch.mode return different value between CPU and GPU
LibTorch static build from source missing libshm.so
[Distributed] `Invalid scalar type` when `dist.scatter()` boolean tensor
Strategy for optimizing away transient dynamic shapes / device syncs
Graph breaks with HuggingFace Stable Diffusion
Unexpected behaviour of 1.13.0
Graph is renamed in torch.jit
wav2vec2 model: error trying to do inference
Option to let DistributedDataParallel know in advance unused parameters at each forward pass
Unable to export CFlow model to ONNX
[dynamo] CommandLine Error: Option 'amdgpu-assume-external-call-stack-size' registered more than once
[Feature Proposal] Extend torch hub to better support cloud serving and edge deployment
p.block != nullptr && p.block->ptr != nullptr INTERNAL ASSERT FAILED
torch._dynamo.exc.BackendCompilerFailed: compile_fx raised TypeError: tqdm.__init__() got an unexpected keyword argument 'desc'
Couldn't install pytorch 2.0
documentation need to be as per python version
AOTAutograd input dedup needs a strategy for fake tensor args
no matches found: torch[dynamo]
Tensor.uniform_ fails to compile when using torch._dynamo
[GradScaler] Inconsistent scale values across different GPUs caused by uneven inputs for AMP DDP training
forward-mode AD formula for torch.add (and possibly others) accidentally upcasts float32 to float64
DDP overlapped optimizer: set grads to None enhancements
[feature request] Need dtype torch.complex64 support on MPS Device
Traceable tensor subclasses cannot actually be used with AOTAutograd
TensorWithTFOverrideVariable unwraps too early
Can not use x=torch.tensor(b), to create a Tensor out of a List[List[Tensor]] (A List of Lists of Tensors)
Support getattr/setattr user properties on Tensor
Error in Adam.step(): If capturable=True, params and state_steps must be CUDA tensors.
minified code can not produce fp64_ref result
Calling item() on symbolic shape fake tensor should give more clear error message
Random sampling from a tensor constructed on MPS device, results in elements returning as torch.zeros(tensor[i].shape)
Random K compression hook in PyTorch DDP
Export to ONNX with export_modules_as_functions works wrong
nn.CrossEntropy/nn.NLLLoss : Request for option to specify invalid ignore_index for perf. optimization
[Dynamo] Examples that recompile beyond cache size limit
Way to run accuracy minifier on only one particular subgraph
[RFC] PyTorch Tensor Parallel(TP) User API for Distributed Training
Performance regression on interpolation in Kornia
No pytorch_jni.dll file in libtorch 1.13.0 lib folder
torch1.13 quantized model export onnx error
Wrong output type hint for `F.one_hot`
update transformer init function
The current example for `torch.mode` is IMHO confusing and has room for improvement.
Basic math operations produce a "floating point exception"
InvokeAI using MPS is broken by torch nightlies since torch-1.14.0.dev20221104 inclusive
addcmul on CUDA does not have the correct FMA behavior
DISABLED test_hf_bert_ddp_inductor (__main__.TestFakeDistributedSingleProc)
MMDet 3.x cannot run successfully in inductor mode
third-order gradient of torch.pow with tensor args and certain input returns NaN
[MPS] Add support for aten::repeat_interleave.self_Tensor for MPS backend
torch.addbmm throws different exception differences on CPU and GPU.
Sample Weighted BatchNorm1d
`torch.Tensor.flatten` Trigger Segmentation Fault when trying to provide and output named dim
DDP hangs on forward pass of transformer
Segfault on torch.nn.functional.one_hot with large tensor on Python 3.9
M1 mps issue
amd windows
TensorWithTFOverrideVariable don't store fake tensor (they store real tensor)
Enable NCCL for PyTorch on Windows
Dynamo is over-guarding on Tensor locals
MultiProcess tests fail when run on nodes with 1 GPU
PTX codegen race?
`positive_semidefinite` constraint fails on CUDA 11.7
[ONNX] torch.onnx.export snapshots the grads as constants in onnx when op is in cuda device
MPS bug on `torch.transpose` and `torch.log`
MPS device ComplexFloat
torchinductor tests attempt to access internet
[ONNX] torch.onnx.export can not export the grad of conv when the op is in CPU
[dynamo] RuntimeError: Failed running call_function aten.nll_loss_backward(*(FakeTensor(FakeTensor(...
[dynamo] RuntimeError: Failed running call_function aten.convolution_backward(*(FakeTensor(FakeTensor(..
[dynamo] RuntimeError: Failed running call_function aten.lift_fresh_copy(*(FakeTensor(FakeTensor(...
Can not access to "sbgemm" routine with user-defined OpenBLAS
NVFuser failing masked.{amax|amin|sum} extremal and correctness tests
Building PyTorch with Vulkan backend fails (1.13 and master)
Caching a model's weights and state_dict to disk to save RAM
Finish deprecation of autograd decorator over class objects
[Inductor] [CPU] LSTM is not using oneDNN in tts_angular
[Inductor] [CPU] Vectorization not supporting python pass-in scalar double in speech_transformer
[accuracy] [aot_eager] mobilenet_v2_quantized_qat fails accuracy
[Inductor] [CPU] Maxpooling is not vectorized in shufflenet_v2_x1_0
Partitioner generates useless constant SymInt edges between forward-backwards
AOTAutograd generates useless tangent inputs for SymInt outputs
Unable to launch CUDA Graph with DDP model
Feature Request: deterministic CUDA cumsum
build: cmake: functorch.so not installed at expected location
build: cmake: ability to disable -Werror* (-Werror considered harmful)
build: cmake: need to uniformize installation of libraries in CMAKE_INSTALL_LIBDIR (not lib)
kind_.is_prim() INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/ir.cpp":1098
Unexpected behavior from torchscript (mixing trace with script)
torch.split: argument 'split_sizes' (position 1) must be tuple of ints, not list
Higher order derivatives of sinc explode
Partitioner that doesn't require functionalized graph
Accuracy minifier can find spurious accuracy failures involving uninitialized memory
Accuracy minifier should also work even if an exception is raised
Allow `low` and `high` to be tensors in `torch.randint`
The problem caused by the parameter dim of torch.norm
fx.wrap is ignored with make_fx proxy tensor tracer
Edge case: torch.baddbmm supports double x int8 x int8 inputs on CPU but not CUDA
torch.equal can still run successfully when the parameter types are different.
torch.floor_divide: The dividend of torch.floor_divide is set to 0, but it can still run on the GPU.
OSError: libcublas.so.11: cannot open shared object file: No such file or directory
When the torch.masked_select operator passes in the same parameters, it behaves differently on CPU and GPU.
torch.nn.MultiLabelMarginLoss has different performance on CPU and GPU.
[MPS] Using unsqueeze in inference mode returns anomalous result
stacks file from profiler is empty
DISABLED test_coalesce_reference_cycle_cpu_float64 (__main__.TestSparseCPU)
torch.nn.TransformerEncoderLayer missing exception description information.
Edge case: CPU bool abs is not supported
How can i patch the torch.jit in the second solution? Could not figure out entrypoint ?
torch.nn.ReplicationPad1d:The description of the exception information thrown is not accurate
jit.trace dost not support nested dict outputs
prod_cpu not implemented for 'BFloat16'
torch.nn.functional.normalize: whether true is equal to 1
RuntimeError: CUDA error: device-side assert triggered
torch.nn.functional.embedding_bag throws an exception when it runs on a CPU, but it runs successfully on a GPU.
Documentation: torch.nn.functional.embedding docs could more clearly state the requirement that weight be a 2D tensor
Quantizable LSTM has different behavior than LSTM in bidirectional setting
Per-sample input xfail / test generation
AdaptiveAvgPool1d failed in the lower version
AdaptiveAvgPool1d throws different exceptions when using the gpu
torch.mm: Exceptions thrown on the CPU and GPU are inconsistent
Conv2d error on M1 mac, RuntimeError: NNPACK SpatialConvolution_updateOutput failed
Segmentation Fault in Triton PTX codegen with cuDNN V8 API and `eca_halonext26ts`
[inductor][Seg fault] Inductor segfaulting with few AMP models
Should torchdynamo specialize on nn.Module
`masked_fill` with `FloatTensor` mask will never mask but fails silently.
code sharing for fundamental ops in quantization
Meta implementation for copy_ is wrong
[dynamic shapes] detectron2 dynamic shapes fails
fbgemm_avx512 build failure
[Inductor] [CPU] Crash failure in torchbench model mobilenet_v2_quantized_qat & resnet50_quantized_qat
torch.randn and torch.normal sometimes produce NaN on mps device
NotImplementedError: The operator 'aten::upsample_nearest1d.out' is not current implemented for the MPS device
torch.addcdiv: input, tensor1, and tensor2 parameters should be of the same type
[aot_eager] [hf_Longformer] Cannot view a tensor with shape
torch.lobpcg should support black-box linear operators like SciPy
`torch.nn.ReplicationPad2D` Report "invalid configuration argument" Error under Compute Sanitizer
sm_80 support
Can't use JIT modules traced with AMP autocast, with Triton Server (or any C++ environment) - freeze() issue ?
Dynamo + NNC: incorrect results with in-place ops on inputs
`torch.nn.LayerNorm` Abort with "invalid device ordinal" Error
[BF16] Visit all the type cast from integer to BF16 type for potential accuracy loss
`torch.nn.CTCLoss` Trigger out-of-bound Read under Compute Sanitizer
Libtorch's CPU inference is much slower on Windows than on Linux
[Inductor] [CPU] accuracy failure in torchbench model detectron2_fcos_r_50_fpn
Collective operations do not work with `torch.BoolTensor`s on `gloo` and raise `Invalid scalar type`
[aot-autograd] [hf_BigBird] Output 0 of CompiledFunctionBackward is a view and is being modified inplace
[feature request] Add ability to preserve traced shape during torch.jit.save and torch.jit.load
Complex Not Supported in Torchinductor
Got many TestDTensorOpsCUDA.test_dtensor_op_db_X test failures
Support disallowing calls to certain instance methods in TorchDynamo
[FSDP] Adam Gives Different Results Where Only Difference Is Flattening
[FSDP] Investigate Unit Testing when Gradient Computation Differs on CPU/GPU
[`NotImplementedError: AutogradFunctionVariable() is not a constant`] using xFormers
torch.normal(...) on MPS sometimes produces NaN's
binary_cross_entropy/bce_with_logits (+ other loss functions) for nested_tensor
Zero-copy way to make flat tensor into a nested_tensor given a shape
Implement generic batch normalization layer.
jit.script() fails to resolve/cast Optional[Tensor] fields of sub-modules or base classes of the object being scripted
Bad string in GLSL shader
pytreeify decorators
Unable to backprop through dense weighted sum of sparse_coo_tensors
Transformers model tracing not working
view_copy out= does not reshape zero element tensors
A more systematic API for resolving the "vmap-incompatible in-place operation" error
Improve clarity of meaning of `torch.jit.trace`'s `example_inputs`
UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`.
Extend test_proxy_tensor tests to support ops test non floating point types
Add a `device` keyword argument to `torch.manual_seed`
caffe2_interface_library CMake macro prevents linking to LibTorch as a transitive dependency
torch.distributed can't establish connection.
cross compile pytoch using cmake , get an error : protobuf::protoc: command not found
[PT][1.13] torch .numpy() fn broke for some scenario
Add smoke-tests for CPP extensions compilations
Fix fake tensor propagation for nvprims
Incorrect version in the instructions on official website
nvprims.native_batch_norm doesn't support fake tensor inputs
Glog macro redefinition problem when including headers from both libtorch and glog
M1 runner i-090e1df32b6f48a20 run out of disk space
[Inductor] Constant folding support
`torch.nn.functional.embedding_bag` Trigger RuntimeError under UndefinedBehaviorSanitizer
`torch.set_rng_state` Trigger RuntimeError under UndefinedBehaviorSanitizer
torch.linalg.matrix_rank memory leak
`torch.Tensor.msort` Trigger RuntimeError under UndefinedBehaviorSanitizer
`torch.linalg.eigvals` Trigger RuntimeError under UndefinedBehaviorSanitizer
`torch.topk` Trigger RuntimError under UndefinedBehaviorSanitizer
`torch.vander` Trigger RuntimeError with UndefinedBehaviorSanitizer
`torch.svd_lowrank` Trigger RuntimeError under UndefinedBehaviorSanitizer
`torch.linalg.lstsq` Trigger RuntimeError under UndefinedBehaviorSanitizer
INTERNAL ASSERT FAILED. Missing scalar type infromation.
torchdynamo is not properly setting up input tracking (e.g., for symbolic shape guards) for view bases
MPS test_numpy_ref_mps_nn_functional_group_norm_mps_float32 is flaky?
[dynamo+ddp+symbolic-shapes] Issue Tracker
RuntimeError: derivative for aten::mps_max_pool2d_backward is not implemented
Investigate why `test_aot_autograd_symbolic_exhaustive_masked_median_cpu_float32` is flaky
Can't import torch --> OSError related to libcublasLt.so.11
Add alphatensor support for faster matrix multiplication?
[Inductor] Input Buffers Should Be Representable As Storage And Layout
test/test_ops.py is segfaulting on master build with DEBUG assets
[RFC] PyTorch DistributedTensor
Inductor may merge two output tensors into one
Return the attention weights using the Transformer Encoder class.
[Inductor] Vectorize Embedding Lookup in CPP
[feature request] Get/set fastmath CPU bit (and some other FPU flags?)
ImportError: libcupti.so.11.2: cannot open shared object file: No such file or directory
[ONNX] Convert to onnx scatter op and LSTMCell op and for Loop
Quantization error between fake-quantized model and quantized model using the new observer
Potential bug in torch.optim.lr_scheduler.CosineAnnealingWarmRestarts
Batched Random Number Generators
torch.jit.trace() - AttributeError: 'NoneType' object has no attribute '__module__
RuntimeError: method '__torch__.___torch_mangle_0.MyModule.sin' already defined.
scatter_ op convert onnx exception
DISABLED test_extract_gradients_from_optimizer_set_to_none (__main__.TestIdentifyGradients)
forward AD for _euclidean_dist
Consolidate binary build matrix for core and validation workflows
API For Registering Stride Preferences For User Fallback Kernels
nn.Linear allocate too many space which lead to CPUAllocator "allocate memory failure" if it's BF16. good for FP32.
Minifier crash
`MultiMarginLoss` doesn't check the value of `target` on CUDA
`ConvTranspose` fails on CPU but returns an empty tensor on CUDA
pack_sequence() always fail after set_default_tensor_type to CUDA
CUDA unknown error after suspend during debugging
GitHub first-time contributors box pops up unexpectedly
Cloud-based rendezvous backend / distributed store?
[FSDP] FSDP produces different gradient norms vs DDP, and w/ grad norm clipping creates different training results
[inductor] Accuracy failure in torchbench hf_T5
The libtorch tests Simplify.{SimplifySymbolicMinMax,SimplifyNestedMax,SimplifyNestedMin} fail on Apple Silicon
The libtorch test SequentialTest.ModuleForwardMethodOptionalArg fails on Apple Silicon
The libtorch test TestScalarTensor.TestScalarTensorMPS fails on Apple Silicon
The libtorch test ConstantPropagation.CustomClassesCanBePropagated fails on Apple Silicon
Bernoulli uses legacy contiguous memory format
quantization convert should warn the user if calibration has not happened
Despite having aten::diag_embed.out, torch.diag_embed doesn't support out= argument
`pack_padded_sequence` not compatible with deterministic mode it calls `torch.scatter`
cpp_extension CUDA library path hard-coded as "lib64" but may be "lib"
[Quant] Validate FixedQParams observers in eager mode
Dynamo handling for all methods of torch.Generator
Add support for `torch.Generator` in the FX IR
What causes CPU to degrade when I load the weight with torch.hub.load()
`nn.functional.embedding_bag` Trigger out-of-bound Read under Compute Sanitizer
We probably are allowing mutations to happen on fake tensor in VariableTracker
quantization: error message when using `convert_fx` on a model on cuda should be better
Don't store example_value on FX node meta
torch.set_grad_enabled results in RuntimeError with torch.jit.script
DISABLED test_module_attribute_mutation_violation_negative_2 (__main__.MutationExportTests)
Mixed precision training fails due to NaN in batch norm running_mean
DISABLED test_index_put_accumulate_large_tensor_cpu (__main__.TestIndexingCPU)
DISABLED test_module_attribute_mutation_violation_negative_1 (__main__.MutationExportTests)
DISABLED test_module_attribute_mutation_violation_negative_4 (__main__.MutationExportTests)
DISABLED test_module_attribute_mutation_violation_negative_3 (__main__.MutationExportTests)
MaxPool1D output shapes can be negative when ceil_mode=True
linear mm weight and bias dtypes mismatch bypasses
`unique` will reverse the input when `sort=False` on cpu (not sorting)
torch._dynamo.exc.Unsupported: call_function UserDefinedClassVariable() [] {} ([Feature request] Allow custom classes with custom __setattr__ method in torchdynamo)
Hang: sampling VonMises distribution gets stuck in rejection sampling for small kappa
view_as_real and split_with_sizes links in Tensor Views docs are broken
Enable AMP for MPS devices
Flaky dynamo test_indexing flaky with SIGKILL
benchmark cache persist
Unit test with `--subprocess` command doesn't respect the `-k` filter flag and runs all available sub tests
Whether to support libtorch source code compilation of C++11 ?
Summary of inductor issues observed on master
cuDNN error (CUDNN_STATUS_NOT_SUPPORTED) for torch.nn.functional.grid_sample()
[PrimTorch] Functionalization pass removes Instance Norm / Batch Norm running stats transformations
TorchBench - moco - RuntimeError: Tensors must be CUDA and dense
Move Dropout to LowMemDropout Replacement To PyDispatcher
MSE documentation is weak
Group losses in a common namespace
`torch.load()` cannot load data saved at non-zero position in a file (`failed finding central directory`)
ASAN shard 4 started to OOM after unrelated commit
AttributeError: module 'tensorboard.compat.tensorflow_stub.io.gfile' has no attribute 'MakeDirs'
1.12.1 incompatible with c++ built for 1.12.0 and vice versa
Diffuser pipeline device attribute broken when using optimized model
[Feature Request][XLA] Support fallback for the dynamo-xla bridge
Blanket disable torch function/dispatch mode/subclass in dynamo
build: failure when upgrade oneTBB to 2021.7.0
Hessian is (incorrectly) zero when using MPS on M1 Mac, but not on cpu
[ONNX] Flaky CI test failures with different random seed
Weird random SIGTERM occurance
Add `gloo` support for `all_to_all`
Inductor - resnet18 - large batch size - CUDA error: an illegal memory access was encountered
Enable `torch.topk` to support `stable` flag
Add torch.tensor replacement and int_tensor prim
[Inductor] incorrect result of vision_maskrcnn
cudagraphify Dynamo's nvFuser backend
Add a config option to raise errors instead of warnings in nvFuser integration
[docs] torch.is_neg/torch.Tensor.is_neg not documented
`torch.nn.RReLU` not reporting `lower > upper` on CUDA
Moving tensor to GPU by .cuda() gets stucked when AMD Secure Encripted Virtualization (SEV) is activated
`torch.mm` Trigger RuntimeError with UndefinedBehaviorSanitizer
☂️ Issues that trigger crashes due to corner-case API usages
Conv2d is not deterministic when input tensor has different strides
AvgPool2D output shapes are inconsistent when ceil_mode=True
Refactor `torch.return_types.topk` to behave like a `namedtuple` or a `dict`
Add eq, to, masked_select, index_select, narrow to nested tensors
Placing LSTM model on bfloat16 on GPU causes error
Python Dispatcher registrations beyond BackendSelect do nothing
WIP: feat: LARS optimizer
ProcessGroupNCCL watchdog can't catch NCCL comm initialization issues
Add nondeterministic alert to `torch.Tensor.scatter()`
out of memory with pytorch version after 1.8.1
convert torch.jit.script model to ONNX get wrong result
Cannot import `traverse_dps` from torch.data.utils.graph
Different behaviour in sparse matmul
`torch.nn.CTCLoss` Trigger heap-buffer-overflow under AddressSanitizer
Minifier doesn't work on DebertaForQuestionAnswering
Inductor gives obscure error when FX graph to be compiled returns tuple
Turning on minifier causes bug to go away (on DebertaForMaskedLM)
A segment fault can be triggered in fbgemm_pack_gemm_matrix_fp16
getting error error: namespace "cub" has no member "Debug" when try to build v1.8.2 with CUDA 11.6
[WIP] Composable FSDP Follow-Ups
C++ Extensions can't import c10d/reducer.hpp
Einsum Optimization Tracker
multi-node distributed training rank0 hang at dataloader after a few epochs
torch.rand(...) is not consistent for large shape dimensions across GPUs (with the same random seed)
amp with `bf16`: backward happens in `f16` when using `@torch.cuda.amp.custom_bwd`
`torch.distributed` crash with abort only inside if
crash in `torch.package.PackageExporter`
crash when call `torch.set_num_interop_threads` twice
VS2022Preview ParallelCommon.cpp.obj : fatal error LNK1161: invalid export specification
Autograd doesn't stop executing backward graph early enough in situations involving set_
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute 'next'
link error happen when intergrate libtorch to other tool
test_conv_large_cuda: RuntimeError: CUDA error: an illegal memory access was encountered
test_batchnorm_eval_cuda_float32: AttributeError: 'NoneType' object has no attribute 'clone'
test_LSTM_grad_and_gradgrad_cuda_float64: ValueError: gradcheck expects at least one input tensor to require gradient, but none of the them have requires_grad=True.
test_Bilinear_empty_cuda: IndexError: pop from empty list
test_memory_format_ao_nn_quantized_MaxPool2d_cuda_float32: assert not memory_format, "TODO"
test_cpu_gpu_parity_nn_AdaptiveAvgPool2d_cuda_float32: networkx.exception.NetworkXError: node sink not in graph
Permute
"No CUDA GPUs are available" coming from GHA g5 runners
Add aten::empty.memory_format for SparseMPS
Failure to export scripted models to ONNX when input is a list of tensor
RuntimeError: unable to mmap 29764 bytes from file : Cannot allocate memory (12)
M1 Mac, MPS: Buffer is not large enough
[Inductor] Support deterministic parallel reduction in CPP backend
`max_unpool3d` will trigger an assertion fail under compute sanitizer
[ONNX] Graph passes analysis
CUDA error: operation not permitted when stream is capturing (2 GPUs)
`AvgPool` and `MaxPool` will crash in JIT w/o profiling executor
`BatchNorm` a 0-shape tensor will crash in JIT trace w/o profiling executor on cuda
ONNX-exported model cannot output Dict[str, X] or str
AssertionError: Unknown expression s2
Libtorch windows binaries publishing
`torchtyping` annotations make saving to Torchscript fail
Improvements to fuse optimization
Autograd precision for CONV + BN between pytorch version 1.11.0 and 1.12.1
`torch.min`/`torch.max` returns bogus values for default int tensors on MPS
TorchDynamo: there has a accuracy issue for conv+unary(binary) post ops for gpu path
Checkpointing Support for Modularized Optimizers
FakeTensorMode doesn't support two Scalar inputs, if we use prims' impl as the meta function
C++ Adagrad optimizer doesn't initialize parameter state
pytorch/pytorch cpu official Docker images
Get https://github.com/pytorch/benchmark working
Enable PostLocalSGDOptimizer on CUDA tensors
Investigate possibilities of automation for build pipeline
Performance issue on Windows with a "benchmark" comparing to Linux and WLS
INTERNAL ASSERT FAILED !(has_different_input_dtypes && !config.promote_inputs_to_common_dtype_ && (has_undefined_outputs || config.enforce_safe_casting_to_output_ || config.cast_common_dtype_to_outputs_))
`libtorch_cpu.so` is exposing some LLVM symbols
Add tests for ProcessGroup cpp extensions
torchdynamo.export doesn't work with data-dependent control flow
eca_botnext26ts_256 fails with TORCHDYNAMO_DYNAMIC_SHAPES=1: sympy infinite loop
Minifier should try forward only, and if it fails set fwd_only=True
Minifier should save config variables so you don't have to replicate env vars
ninja: build stopped: subcommand failed
sebotnet33ts_256 fails with TORCHDYNAMO_DYNAMIC_SHAPES=1: sympy infinite loop
#error "Expected GLOO_USE_CUDA to be defined"
Crash on backwards step when using `batch_first=True` for LSTMs on MPS (1.14 nightly build)
Dynamic shapes exhaustive tests should fail (not xfail) if data mismatch
Functionalization does something wrong with pad backward when it uses as_strided
DCE produced obviously wrong graph for pad, but test did not catch it
Testing insufficient to catch incorrect dispatch key for bernoulli.p re functionalization
diagonal of Jacobian matrix
The behavior of cast `NaN` is different on cpu and cuda
Improve `c10d::ReduceOp` & `torch.distributed.distributed_c10d.ReduceOp`
`bmm` will return wrong result on cpu with in-place
[onnx] export repeat_interleave TypeError: z_(): incompatible function arguments
DISABLED test_numpy_ref_mps_nn_functional_conv_transpose1d_mps_float32 (__main__.TestCommonMPS)
RAM leak when copying tensor from cpu to cuda
invalid_arguments.cpp is busted
Loading model trained on MPS cannot be opened on non MPS system
Synchronize domain builds to be executed after core build have completed
built from source windows static library with multiple "unresolved external symbol"
Gloo errors when process's batch only indexes padding_idx of sparse embedding
Missing docstring for resize_as
TorchDynamo fails to trace the graph when custom op is being used
[JIT] Inconsistent handling of tracing dict output leads to assertion
Categorical fails simplex validation after its own normalisation on CUDA
[BUG] moco fails without suppress errors: RuntimeError: Tensors must be CUDA and dense
Placeholder tensor is empty
Some operations do not keep `channels_last` memory format which yields accuracy drop
pytorch could not build from source with cudnn 8.0.5
Semantics of sparse operations clarification - Sparsity of the gradient with respect to a sparse tensor input
ipykernel crash importing torch after scipy in .ipynb file
[Bug]: AssertionError: ABCMeta
index_select() applied in sparse tensor can't backprop
[be] Change the structure of BackendConfig so that we don't need to write helper functions
`lower_cholesky` constraint incorrectly fails on MPS
[Bug]: TorchInductor Input As_Strided Calls Dont Compose With Offset Inputs
Quantized Inference on GPU summary of resources
`chunk` a 0-dim tensor will crash in JIT script w/o profiling executor
Installing PyTorch with BUILD_SPLIT_CUDA=ON and CUDNN fails on linker error
Document dist.new_subgroups
Better type annotations for `torch.Tensor` subclasses
[Bug]: OPTForCausalLM failing with TORCHDYNAMO_DYNAMIC_SHAPES=1: UNPACK_SEQUENCE AssertionError: assert len(seq.items) == inst.argval
[Bug]: speech_transformer failing with TORCHDYNAMO_DYNAMIC_SHAPES=1: RuntimeError: expand(CUDABoolType{[10, 1, 204, 320]}, size=[-1, 204, -1]): the number of sizes provided (3) must be greater or equal to the number of dimensions in the tensor (4)
Implementation of CG, and BICGSTAB methods
test_ao_sparsity fails when build without FBGEMM
Triangular solver for sparse matrices
[Inductor] Task Tracker for CPU Backend Optimization
Speed of torch.istft
RuntimeError: Tensors of type TensorImpl do not have numel
buffer is not large enough when running pytorch on M1 mps
OpenCL 3.0 support: support every GPU on earth through rusticl
Tracker for manually running pytorch/examples
LSTM and RNN fail dynamo lowering in eval mode due to FakeTensor issues
[Bug]: the custom op cannot be included in the FX graph captured by torchdynamo
Better error message when attempting to `torch.save` an optimized model
[Quant] There is no default_qconfig_mapping for dynamic quantization
'str' object has no attribute '__module__' in jit is_final
Missing string parsing for some parameter types in python arg parsing logic
Enable TorchInductor to support more vectorization operators
[Bug]: GoogleFnet failed to load with amp
torch.save throws ValueError: ctypes objects containing pointers cannot be pickled
Use new input clearing mechanism for aot_eager
dynamo/aot fails when run with autograd.detect_anomaly context manager
register_package has no further documentation
The installation commands given on the pytorch website will not install properly
nvprims.div doesn't work with FakeTensor cpu scalars
Dynamo+FSDP overall triage
`custom_jvp` and `custom_vjp`
Reproducible "CUDA error: an illegal memory access was encountered"
Missing `docker` directory in `tools/`
The autogenerated out variants via `autogen:` do not check that the dtype of the `out` kwarg via `canCast`.
Unstable results in sin/arcsin/arccos calls
torch.linalg.cond gives inconsistent results on CPU/CUDA
New APIs for cuda graph inspection and manipulation
DDPOptimizer+inductor OOMs with hf_GPT2_large and timm_vision_transformer_large
torch/csrc/utils/python_arg_parser.h:424:94: error: format ‘%ld’ expects argument of type ‘long int’, but argument 7 has type ‘int’
DISABLED test_expanded_reduction_cpu (__main__.CpuTests)
Unrecognized data format when using release libtorch libraries in debug build
torch.clamp does not clamp out of -0 from 0 when ran on the CPU
[MPS] sum on a size=1 dim is ~5x slower than squeeze
Bug in Histogram Observer Implementation
MPS memory usage significantly higher than on CPU
Failing periodic test: test_comprehensive_masked_cumprod_cuda_float16 (__main__.TestInductorOpInfoCUDA)
Failing periodic tests: test_dense_mask_index_cpu (__main__.CpuTests) & est_expanded_reduction_cpu (__main__.CpuTests)
gradcheck failure with sparse matrix multiplication
cppextension host compiler check ignores executable symbolic link in CUDA bin directory
Nandense layer for missing values
DISABLED test_variant_consistency_jit_linalg_lu_cuda_complex64 (__main__.TestJitCUDA)
more cudagraphs tests
Pipe conveys inconsistent value in GPU env
Segmentation fault: 11 when running "import torch" on Mac OS X
Saving and loading from physical storage
Improve Readability of error(s) when provided unexpected keyword arguments.
Rewrite `narrow_copy_dense_cpu_out` using `copy_` and `narrow`
Multiprocessing DataLoader pickles multiprocessing.Queues incorrectly
Error: unknown architecture `armv7-a;' and Error: selected processor does not support `command' in ARM mode
Drop deprecated behavior from NumPy-style `T`
Upgrade to a newer llvm-openmp version to avoid `/dev/shm` pollution
PyTorch RPC crashed when using IB
DISABLED test_vmapvjpvjp_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA)
[opbench] NameError: name 'tensor' is not defined
Importing torch 1.12.0 breaks subprocess module
torch.cat on empty tensor is bogus
[FSDP] Investigate `torch.cuda.current_stream()` usage in post-backward
View-based advanced indexing (Integer array/LongTensor indexing) of nested_tensor
Broadcasting add for nested_tensor
DISABLED test_variant_consistency_jit_linalg_lu_factor_ex_cuda_complex64 (__main__.TestJitCUDA)
compile torch from source
TorchInductor CPU Performance Dashboard
`torch.distributed.all_reduce` allocates excess GPU memory when using NCCL backend
.view(dtype) on a quantized tensor throws SegmentationFault
Distributed collective ops fail in `inference_mode` for CPU-only
Could not run select_backward [vmap] [dlrm] [functorch]
Forward hooks for ScriptModules
JIT model returns different value on cpu with uniform-initialized input
Expanding the parameters of `torch.svd_lowrank`
[MPS] Add support for aten::erfinv.out for MPS backend
JIT model will have a different jacobian after the first computation
TF32 conv_transpose2d with groups has bad precision compared to fp32
We don't have an op for vulkan_prepack::conv2d_clamp_prepack but it isn't a special case.
Poisson sampling on GPU fails for high rates
DISABLED test_vmapjvpall_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA)
DISABLED test_vmapjvpvjp_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA)
Autograd doc does not mention torch.autograd.set_grad_enabled
NVFuser `FusionRootMappingMultipleBroadcast_CUDA` raises exception on sm_80+
NVFuser `FusionComputeAtMultiBCast_CUDA` and `FusionDetectSelfMappedDomains_CUDA` does not raise exception on sm_80+
DISABLED test_attn_cuda (__main__.TestMin)
Performance tests mnist_hogwild-cpu_memory CPU memory increase by 30%
Feature request: Deterministic test input generation
[ONNX] AssertionError: A mismatch between the number of arguments (5) and their descriptors (4) was found at symbolic function 'scatter'
Documentation and typing hints for RProp
"upsample_nearest2d_out_frame" not implemented for 'BFloat16'
Pytorch built for Jetson errors if CUDA is not found
[TorchInductor] Add support for Pascal GPUs (P100, GTX 1080, etc)
Adding a linear layer leads to failure of `optimize_for_mobile`
libtorch throws `required keyword attribute 'profiled_view_size' has the wrong type` on Linux
libtorch make failed
[NvFuser] INTERNAL ASSERT FAIL "ScalarType should be static for Tensors in fusion for amp optimization"
RFC(from users): nn.Module behavior with in-place changes
[ONNX] CSE pass in export pollutes Scope information
Move functorch tests from functorch/test/* to test/*; delete functorch CI configs
JIT returns different values for a model on cuda and returns a strange error message on cpu
Decomposition table is ignored with use_functionalize=True in AOT Autograd
Nonoptimal trace of silu_backward with AOT Autograd
NVFuser batch norm with prims: internal assert failure from test suite
`squeeze_` fails with JIT but succeeds
without itrX4JIT returns different values for `cos + frac` on cpurX4`CTCLoss` returns a different value with JIT on cudarXDJIT model with `relu+div+sgn` will crash when computing the gradientrXCJIT model with mean will crash when computing the gradients on cudarX7Easy way to "freeze" BatchNorm running_mean/running_varrX:Instructions for Selective Build for Mobile Linux PlatformrXj[functorch] colab links on functorch 0.2.0 website should be linked to a permalinked version of the colabsrXCData conversion ops ignore `memory_format=torch.contiguous_format` rX;[NvFuser] would change the output for some inaccurate dtyperXN`topk` will return the wrong value and could read out-of-bound value after jitrXD`max_unpool` and `max_pool` will trigger INTERNAL ASSERT FAIL in JITrXV`MultiLabelMarginLoss` will return incorrect values in JIT after the first run on cudarXAbout autocastrX+Segmentation fault (core dumped) in RTX3090rX(Compile failed at allreduce without gloorXDcuda.list_gpu_processes() uses the 'wrong' device order (PCI_BUS_ID)rX%Test for multiple instances inferencerX([functorch] [vmap] [SymInt][fake tensor]rX-Running JIT trace for many times leads to OOMrX&Conv2d will crash by using `jit.trace`rXb[NvFuser] JIT model with `mul+atan+sgn` will access illegal memory on cuda when computing gradientrXSupport cpp wrapper coderXl [Distributed: RPC] Sending `nn.Parameter` as RPC argument automatically detaches from the computation graphrXTooling Issue TrackingrX[ONNX] Memory leakrX8[DDPOptimizer] Compiled subgraphs sometimes return listsrXCInductor doesn't fuse outer dimension softmax into a single kernel.rXM[ONNX] Create an input adapter for suppling torch module input to onnxruntimerX<Automatic broadcasting for batch addition for sparse tensorsrXG`mem_get_info` reserves memory and can not be destroyed / deallocated. 
rX=onnx.export make size operations return Tensor instead of intrX*FSDP support to load DDP optim checkpointsrXItorch.tensor obj automatically moved to shared memory upon Process launchrXGWrong results with torch.linalg.inv on batched matrices when using cudarX<`SyncBatchNorm` doesn't work with subclass of `torch.Tensor`rXX(JIT) x:Optional[T] cannot not expect content type after `if x is None or x.shape[0]==1`rX'End-to-End AMP training with GradScalerrX'torch.cuda.empty_cache() is not workingrX#Improve FX naming for getitem callsrX.Dedicated function for shallow_copy_and_detachrXJStack trace preservation should work on plain use of make_fx / AOTAutogradrX2DISABLED test_rmsprop (optim.test_optim.TestOptim)rX [minifier] Accuracy minificationrXunctorch] [aot_autograd] rX>Use opinfo segfaulting list to protect inductor run internallyrXQOpInfo Tests To Validate that All Operators Are Being Tested With Strided TensorsrXp`conv_transpose` is not similar to `nn.grad.conv_input` when `output_padding` is passed with non-default values.rXAJetson JIT: Memory Leak on inference after optimize_for_inferencerX7Add complex support for SparseAdam and LBFGS optimizersrX)Add `maximize` support to LBFGS optimizerrXF`torch.special.round` doesn't support the same dtypes as `torch.round`rXGFeature request: Tests for `int` should be tests for `numbers.Integral`rX AOT Autograd Device PartitioningrXCJIT `lgamma` will return `inf` only with dual input in forward moderXP`torch.multinomial` on MPS crashes with `Error: total bytes of NDArray > 2**32'`rX2JIT miss the argument `as_tuple` for API `nonzero`rXPTransformerEncoder/TransformerDecoder has same initial parameters for all layersrXAUTOGRAD is not working on IOSrX7Autocast with BF16 on CPU slows down model more than 2XrX6TORCH_WARN is executed just once per set of parametersrXONNX export of any TorchScript submodule (scripted or traced) fails with "Modules that are called during a trace must be registered as submodules of the thing being 
traced" rXSupport guard on thread numberrXC[Inductor] Support float32 accumulation type for float16 sum on CPUrX5[Feature] Dispatching PyTorch Distributed CollectivesrX*How to perform unstructured interpolation rXpath in WORKSPACErXUfmt/src/os.cc: error: unknown type name 'error_code'; did you mean 'std::error_code'?rXxDynamo shouldn't name getitem variables getitem; instead it should derive the name from the variable that was getitem'edrXLWhen you call tensor.size(), dynamo returns a tuple, instead of a torch.SizerX0torch.nn.functional.one_hot only works for int64rXIMPSNDArray.mm:782: failed assertion; bufer is not large enough Mac M1 MPSrXIDebuggability++: Share instructions for building exotic CI configurationsr X@[TorchDispatch] Scalar Only Inputs Gets Matched To Tensor Schemar X>torch.jit.trace throwing Invalid name for qualified name eror r X?TransformerEncoder src_key_padding_mask does not work in eval()r X@JIT fails to trace binary cross entropy with a strange error msgr X1`cdist` should succeed when `p` is integer in JITrX,When will the torch.sparse module be usable?rX\JIT return a tensor with different datatype from the tensor w/o gradient and normal functionrX`F.affine_grid` crashes on MPSrXA[Activation Checkpointing] Investigate pin_memory for CPU offloadrXAFigure out the future of Metal backend given the existence of MPSrX4torch.remainder and torch.fmod produce wrong resultsrXpartial view/reshapingrXaSignificantly worse MPS performance between torch 1.13.0.dev20220922 and torch 1.13.0.dev20220930rX?Functorch memory_efficient_fusion gives wrong output batch sizerX:Quantized version of Sigmoid doesn't have _get_name methodrXNDiscrepancy in output shape for batch_norm inference mode between CUDA and CPUrXs[Distributed] Loading distributed checkpoint with FSDP fails with varying key errors (pos.embedding, shared.weight)rX'CUDA OOM issue when running tests in CIrXSetup ssh sometimes failrXSteam Deck Core DumprX7TorchScript does not recognize mix-in types 
with `Enum`rX<High occupation on GPU 0 when converting Tensor to multi GPUrXCollect Operator Coverager XhDISABLED test_aot_autograd_exhaustive_as_strided_scatter_cpu_float32 (__main__.TestEagerFusionOpInfoCPU)r!X?JIT model could return 'NaN' gradient after the first executionr"XG`torch.mm` produces wrong result on cpu when using in-place computationr#XmPrint a warning when user specifies a qconfig for some node and the qconfig is not supported by BackendConfigr$XlSetting the cuda device when using start_processes in Jupyter on Ampere leads to CUDA reinitialization errorr%X6[primTorch] Need to update data-dependent check policyr&X7[FSDP] `use_orig_params=True` Follow-Ups & Known Issuesr'X.string interning for dispatcher operator namesr(X,TorchScript error for `Enum` inside a moduler)X\`vector_norm` will trigger "Tracing failed sanity checks" for JIT when ord is boolean tensorr*X3JIT fails to trace `sparse.mm` with a strange errorr+XVTorchScript causes range_check error after a few iterations of forward-backward passesr,X4nn.CrossEntropyLoss overflow with FP16 and minibatchr-XHTimed out receiving the shared seed from the distribtued store on Rank 2r.XQConda Pytorch (Pytorch channel) in WSL2 Ubuntu can't find libcudnn shared objectsr/X&Replace same with TestCase assertEqualr0Xretro inductor OOMr1Ximagen inductor errorsr2X*Inductor stable baselines assertion errorsr3XF[ONNX] Conversion failed when using dict as input to a scripted moduler4X\Minifier should not produce repro with backward call if it is not necessary to trigger errorr5XComposer inductor errorsr6XCMinifier dumps checkpoints which don't actually reproduce the errorr7XG[Quant] Remove or clarify the meaning of Nones in QConfig/BackendConfigr8XRuntimeError: [enforce fail at CPUAllocator.cpp:68] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4860032 bytes. 
Error code 12 (Cannot allocate memory)r9XfAdd vector-Jacobian-products for a subset of nvFuser-supported prims; add backward support for nvprimsr:X$AMP consumes 30x gpu memory with bmmr;XHTorchInductor CUDA memory leak / memory corruption debugging master taskr<Xsnn.Embedding weights are not synced across processes with DistributedDataParallel when other parameters are presentr=XR[ONNX] torch/onnx is using rank to differentiate between ScalarType and TensorTyper>X/[functorch] CUDA Graph failure with AOTAutogradr?X/[functorch] conv.{1, 2, 3}d should raise errorsr@XCUDA allocator feature requestsrAXgCould not run 'aten::native_batch_norm' with arguments from the 'SparseCUDA' backend. using batch_normrBX<How to install pytorch with cuda 11.7 in anaconda envirment?rCX'Gloo DDP SocketTimeout error on WindowsrDXzBuild from source failed with error of different gpu architecture (compiler shows sm_30-related error but I use sm_86 GPU)rEX][MPS?] .to(memory_format=contiguous_format) behaves incorrectly; differently to .contiguous()rFX<[Distributed: RPC] Failed to initialize RPC with >18 workersrGXACreating NumPy array with `dtype=object` of PyTorch tensors failsrHX7Multiple GPUs get "errno: 98 - Address already in use"rIX`Solve default argument induced include cycles by not using defaults / moving the defaults to inlrJXB`linalg.norm` cannot compute the grad in forward mode after scriptrKX5`as_tensor` will return a different dtype with scriptrLX?FSDP's FlattenParamsWrapper breaks dynamo's faketensor wrappingrMX:`jit` could make some undifferentiable APIs differentiablerNX5`mvlgamma_` will fail when compiling with trace `jit`rOXS`detach_` behaves differently when computing the gradients in forward mode w/ `jit`rPXQtorch.Tensor.transpose().contiguous() on dimension of size 1 gives wrong stride rQX&NvFuser single mode changes the outputrRX.Iterative Global Pruning Cause GPU Memory LeakrSXO[functorch] transforms like jacrev, jacfwd, grad, etc don't work with 
BatchNormrTXGImplement `rand_like` ref and implement nvfuser_impl for `uniform` primrUXGThe reload `MultiLabelMarginLoss` will have different gradients on cudarVX;Measure impact of JIT decompositions, reconsider the designrWXdThe reload model has different (and strange) forward computation from original model with `LSTMCell`rXX2Execute smoke test for Better Transformer feature rYX2Unexpected assertionError when export tensor.numelrZX9[AO] In sparisty schedulers, rename `last_epoch` to stepsr[XO`max_pool2d_with_indices(self, ...)` shouldn't need to save `self` for backwardr\X<Issue with converting Comet model to ONNX. Split-node error.r]X4Can we rewrite numpy operators to pytorch operators?r^XSCannot index into a tensor using indices from another device - regression from 1.12r_X@`aminmax` will trigger INTERNAL ASSERT if input is empty on cudar`X4Prim Output Spec is Not Always Consistent With EagerraX6Feature Request: Deterministic Algorithm for MaxPool3drbXLtorch.nn.utils.prune.remove reorders the parameters of a module unexpectedlyrcX@please report a bug to PyTorch. 
Expected Object but got PyObjectrdX4Please put back missing rocm builds of Torch Vision.reX:very strange speed of torch.bmm with specific tensor shaperfXL[FSDP] MixedPrecision, CPUOffload, BackwardPrefetch etc should be documentedrgXdCI fails for test_compare_cpu_nn_functional_embedding_cuda_float32 which is not reproducible locallyrhX-Inconsistency between geometric distributionsriX0More windows for filtering and spectral analysisrjXPoint community docs to masterrkX6JIT fuser issues with {ceil,floor,round,trunc}(int8) rlX-functorch aten::scatter_add_ not implementedrmX(Crash in `torch.package.PackageExporter`rnX?AOT Autograd traces have instability in defining the same GraphroX3Remove `TypedStorage` and use only `UntypedStorage`rpX0torchrun substitutes host names for IP addressesrqXJHave NVIDIA driver and other related dependencies as part of the Linux AMIrrX1nvFuser support for {ceil,floor,round,trunc}(int)rsXAFX Graph Mode Quantization - Generate static quantization networkrtX/Add `persistent` option to `nn.Module.buffers`.ruXFc10d all_gather aborts with Signal 8 (SIGFPE) when tensor.numel() == 0rvX<[asan] ctc_loss fails test_make_fx_fake_exhaustive with ASANrwXB[MPS] load checkpoints gives zero weights when map_location is mpsrxXsTracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. ryXBtopk returns different results with the same input in cuda and cpurzX'Segmentation fault in native_batch_normr{X8Floating point exception in gather gradient computation.r|XSSegmentation fault in mkldnn_reorder_conv2d_weight and mkldnn_reorder_conv3d_weightr}X OSError libstdc++.so.6 at importr~X?performance between manually created graph and CUDAGraph.replayrX~pytorch core test failure: RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.rXNestedTensor 2.0 issue trackingrXRPytorch on iOS (iPhone X & XR) throwing can't allocate memory exception. 
Ref Logs:rXtorch::quantile performance?rXz[ONNX] Using values from a different tensor to index a tensor returns a tensor with incorrect shape in exported ONNX modelrXWPT Dispatcher confusing error message "There were no tensor arguments to this function"rX.make_traced() doesn't respect setting the seedrXNeed operator fallback statsrXD[ONNX][bug] `nn.Transformer` contains unsupported tensor scalar typerXRNeed easier way to tell which step of the optimized path fails (dynamo + inductor)rXd[ONNX] Produce error message for incorrect number of dummy inputs instead of Internal assert failurerX!Minifier improvements/consistencyrX<Python dispatch for PyOps needs to respect tensor subclassesrXAinstall libtorch cxx11 ABI as default in PyTorch pip installationrX<[ddp] must set `static_graph=False` when running with dynamorXzUpdate `use_deterministic_algorithms` documentation and tests to include `nn.functional` counterparts for all `nn` modulesrXLMemoizing AOT Autograd Input Conversion Breaks Training with Tied ParametersrXFreentrant torch.utils.checkpoint does not work with NamedTuple outputsrX@[NNC] loop vectorization fails, `Ramp` and `Broadcast` undefinedrXEprimTorch/nvfuser: have a way to check that refs are added to __all__rX:libtorch create a tensor is very slow, who can tell me whyrX&Segmentation fault in `torch.jit.wait`rXMSelectively sync internal Meta discussions / posts to dev-discuss.pytorch.orgrX<Add an opaque epilogue in AOTAutograd for aliasing/mutationsrXPCustom autograd functions are not inlined when export mode is ONNX_ATEN_FALLBACKrX%[CheckpointWrapper] Revamp API designrX=Cuda tensor is zero when passed through multiprocessing queuerX1Segmentation fault in `torch.futures.collect_all`rX"Add unit tests for test decoratorsrXYtest_warp_softmax_64bit_indexing_cuda_float16 takes ~147GB of CPU memory and is very slowrX8DISABLED test_random_seed (__main__.TestDataLoaderUtils)rXCCPU and MPS floating point math is different (in a significant way)rX1[FSDP] Test 
optimizer state dict with CPU offloadrXPRuntimeError: input_shape.size() > 0 || reshape.size() > 0INTERNAL ASSERT FAILEDrXSeparate doc and binaries buildrX2`is_pinned()` support in PrimTorch and FakeTensor.rX9Functorch functionalization causes increased memory usagerXQRe-Running PR Sanity Check after Adding `skip-pr-sanity-checks` Label Still FailsrXetorch.utils.checkpoint (with use_reentrant=False) doesn't work with all PyTorch features that set TLSrX,View consistency for PrimTorch+nvFuser testsrX@Feature Request: deterministic adaptive_avg_pool2d_backward_cudarX614k github models on PyTorch 2.0 pass rates dashboard rXONNX exporter errorrX7Support different NSE in batches of CSR and CSC tensorsrXQTypeError: finfo(): argument 'type' (position 1) must be torch.dtype, not HFProxyrXAProfiler Hangs on Non-Blocking H2D Transfer in Non-Default StreamrX;Batch multiplication for torch.sparse matrix multiplicationrXNINTERNAL ASSERT FAILED for _jit_pass_vulkan_optimize_for_mobile (Google Colab)rXNMPS: allow selecting specific MTLDevice by registryID via environment variablerXcompiling failed from sourcerXmacOS Pyinstaller: libc++abi: terminating with uncaught exception of type c10::Error: Type c10::intrusive_ptr> could not be converted to any of the known typesrX?Allow passing dict (as opposed to OrderedDict) to nn.SequentialrXYUse graph partitioner to remove ops that can be captured with cudagraphs in TorchInductorrXPyTorch-DirectML RFCrX6Test aten decompositions match their alias informationrX[ONNX] Speed up unit testsrX)Explore TBB for TorchInductor C++ backendrX2Add documentation about backward graph gc behaviorrX(About the different ways to print modelsrX&Set dtype if tensor converted to numpyrXBNotImplementedError: The operator aten::native_group_norm_backwardrX<Dynamo eager with sparse tensors gives wrong numeric resultsrX/Error when trying to export MONAI model to ONNXrX;test_public_bindings is not robust to various build optionsrXk[TensorExpr] applying `rfactor` for 
a Mul Reducer with init value different than 1 results in wrong resultsrX8JIT will affect the gradient computation of forward moderXBAutograd will take `init` module API into account when using `jit`rX3[ONNX] Track non-exportable pattern as diagnostics.rXvSupport FP16 with torch._fake_quantize_learnable_per_channel_affine & torch._fake_quantize_learnable_per_tensor_affinerXJJIT script calculation/dtype inconsistent depending on operator expressionrXUtorch.nn.functional.interpolate fails on some degenerate shapes, but passes on othersrXBINTERNAL ASSERT when the type of argument is not considered in JITrX:Beta distribution behaves incorrectly for small parametersrXtorch.hub.load local modelrXAAutogenerated out functions are missing at::cpu:: and co bindingsrX*Serialize the warmed up torchscript modulerXqCapture scalar outputs / dynamically sized outputs by default, partition graphs for backends that can't handle itrX.Accept SymInts and SymFloats For Scalar InputsrX+Uneven and/or Dynamically sized collectivesrX4torch.jit.script IndentationError: unexpected indentrXmodule: multiprocessing SimpleQueue put cannot bigger 716 in windows.And it is not has any info.The program is blocked and does not move.rX8Tensor slice copy across multiple devices fails silentlyrXWTensor Subclass that doesn't require grad may wrap a Tensor subclass that requires gradrXE[optim] asgd : handling of complex params as real params (NaN vs inf)rX&Pytorch does not recognize GPU in WSL2rX%Add nvfuser support for prims.copy_torXrlist of tensors can't be converted to a torch tensor while list of lists gets easily converted to a pytorch tensorrX6OpInfo tests should compare gpu to cpu implementationsrX*Minimal example for torch.optim.SparseAdamrX``tensordot` not working for dtype int32 and lower when there is only 1 element in the given axisrX4test_prims.py:test_nvfuser_no_args_cuda, memory leakrXKnn.Softmax should not allow default/implicit/unset dim constructor argumentrXdIssue with MPS ops lead to 
make_grid broken with mps device Tensors, whole grid is the 'first' imagerX,MPS backend appears to be limited to 32 bitsrX$Torch.FX work with autograd.FunctionrXa[NVFuser] RuntimeError: ref_id_it != replayed_concrete_ids_.vector().end() INTERNAL ASSERT FAILEDrXMfunctionalize: Does not compose cleanly with torch.jit.script/torch.jit.tracerXBEmbedding scale_grad_by_freq should probably shrink by sqrt(count)rXfFor PyTorch Nightly, failure when changing MPS device to CPU after PYTORCH_ENABLE_MPS_FALLBACK occurs.rX0A little improvement to torch.nn.ReflectionPad2drX6Install LibTorch by Conan or other C++ package managerrX=[c10d] Support a public API to retrieve default process grouprX,Strange cuda illegal memory allocation errorrX8Set up tests to run periodically and surface them on HUDrXCDeepcopy of FX graph fails with nested make_fx and constant tensorsrX#several questions about pytorch DDPrX'Odd type-casting behaviour in prims.divrX8Installation prefix is not passed to CMake appropriatelyrX3explain() has confusing explanation of graph breaksrX.torch.Size should convert all elements to intsrX:RuntimeError: "slow_conv2d_cpu" not implemented for 'Half'rXLack of newly raised optimizersrX"fix ATen tests that do not compilerXFloordiv is deprecated.rX=torch 1.12.1 cuda 10.2 runs slower than torch 1.8.2 cuda 10.2rX,Should enable skipped tests for `to` OpInfo rX9[Profiler] Snapshot CudaCachingAllocator on profile beginrX![Profiler] Generic Tensor summaryrX:Torch.fx tracing bug with dictionary.update calls on inputrX.DecompositionInterpreter creates invalid graphrXEUnable to run a single convolutional layer in different CUDA-contextsrX/op for aten::bitwise_and during torch.jit.tracerX;Fix convert path for fixed qparam ops (sigmoid and softmax)rX@torch.Tensor.to.dtype_layout overload is not available in PythonrX>relu-gru mse is 0.022 much greater than 0.003 with half dtype.rXCwould you like upload to the cpp libtorch to vcpkg package repo?rX6Support dict inputs and outputs when 
exporting to ONNXrX*Ensure ops account for offsets and stridesrXPRandomness should be consistent across devices with use_deterministic_algorithmsrX8Gradient value calculation error in MultiLabelMarginLossrXiPytorch gets small bias on the result of different types of divisors while doing floating point division.rX1Attach execution time to each node in an fx tracerX?CUDA 11.6 linux-bionic-cuda11.6-py3-gcc7-slow-gradcheck failurerXRuntimeError: outputs_[i]->uses().empty() INTERNAL ASSERT FAILED at /pytorch/torch/csrc/jit/ir.cpp:1027, please report a bug to PyTorch. (eraseOutput at /pytorch/torch/csrc/jit/ir.cpp:1027)rX)Sparse jagged tensor support for inductorrX[[jit] WithInsertPoint can't get back to the prev_ node if the prev_ node has been destroyedrXhSession of Google Colab crashes when `torch.utils::SummaryWriter` is called after importing `torchaudio`rX9Support setting strides on quantized weights of Embeddingr X1FSDP Forward order differs from that of first runr X<`linux-bionic-cuda10.2-py3.9-gcc7` multigpu test are brokenr XI[nvfuser] view size is not compatible with input tensor's size and strider X?[Quant] Reference get_default_qconfig_mapping in docs/tutorialsr X<Please include virtual/physical batch sizes in the tutorialsrX=MPS convolution is sometimes returning NaNs for valid inputs.rX>[jit] ignored method calling static method results in an errorrXBMove self.subtest calls in FSDP test suite to run_subtests utilityrX(Better error message for qlinear_prepackrX9Expose API for Registering Post-Gradient-Computation HookrX@scripted fasterRCNN model cannot be loaded with libtorch c++ APIrX8 Index out of bounds Error with PerChannel Quantization rXmodel.load_state_dict won't work in Child process if a sufficiently large tensor was padded in the Parent (even if empty padded)rXKI cannot install pytorch by Bad CRC-32 for file 'torch/lib/libtorch_cpu.so'rX:TorchScript unsupport tuple unpacking as function inputs.rXCOREMLTOOLs/NNPACK Python 
IssuerX$hipErrorNoBinaryForGpu, but reversedrX:[MPS] MPSNDArray error: product of dimension sizes > 2**31rX1Large number of WONT CONVERTs on detectron2 modelrX]fill_ OpInfo code not used, also, doesn't test the case where the second argument is a TensorrX8[Nested Tensor] Enable Nestedtensor to work with OpInfosrXHLinux cuda-11.x binary build jobs intermittently take more than 4 hoursrX/General NestedTensor op coverage tracking issuer X9PyTorch EC2 runners can not be used with standard actionsr!XGScatter min/max reduce operation that returns the corresponding indicesr"XIUndefined reference in libtorch_cpu.so `...std::__cxx11::basic_string...`r#X,pytorch 1.12.1 doesn't build with ffmpeg 5.0r$XOPython3 Depletes 2021 M1 Mac Memory Running Training Ops For Model's M, L and Xr%X1[FSDP] Make sharded / unsharded check more robustr&XBAre PyTorch Android nightly builds getting automatically publishedr'X6empty_quantized should probably be new_empty_quantizedr(X3Add torch nightly builds pipeline for aarch64 linuxr)X)Hitting rate limits for pytorchbot token r*XHprimTorch: support refs and decompositions when ATen and Python disagreer+XJ ModuleNotFoundError: No module named 'torch.ao.quantization.experimental'r,X/Support primtorch view ops in functionalizationr-X]RAM not free when deleting a model in CPU? 
worse after inference, is there some cache hidden?r.XUTracking nested tensor functions with backward kernels registered in derivatives.yamlr/X-Grad strides do not match bucket view stridesr0XIBug in batch names with matmul (result tensor has names=('i', 'i', 'k')).r1X,pytorch 1.12.1 Adam Optimizer Malfunction!!!r2X+Improve FSDP error msg on wrong attr accessr3X>bfloat16 matmul gives incorrect result on CPU (without mkldnn)r4XEPytorch/Nova CI should monitor service outages for major dependenciesr5X+torch fx cannot trace assert for some casesr6X5test_lazy spuriously fails if LAPACK is not installedr7XERuntimeError: Interrupted system call when doing distributed trainingr8X@Explore TorchInductor optimization pass to reorder kernel bodiesr9XRtorch.linalg.eigh crashe for matrices of size 2895×2895 or larger on eigen and M1r:X2[feature request] Add new device type works on CPUr;X(torch.var_mean is slower than layer normr<XError on installationr=XC[Nested Tensor] Move nested tensor specific ops to nested namespacer>X!Inductor Error: aten.fill_.Tensorr?X-[Nested Tensor] view + inplace for autograd. 
HuggingFace Slow Operators
Timm Model Slow Operators
[TorchTidy] Check if `set_to_none` would change optimizer semantics.
Missing header file
tabulate.tabulate causes a lot of memory to be allocated in yolov3
[Nested Tensor] Update TestCase.AssertEqual
Profiler reports different # of Calls depending on group_by_stack_n
BCELoss results in autocast CUDA warning
nvfuser + prim stack generated illegal PTX code on hardware with sm <= 70
How to export a simple model using List.__contains__ to ONNX
Build from source failed on MacOS 10.6 with CUDA 10.1
[Bug] Circular Import
Inconsistency between index_select and __get_item__
distributed tests take a long time
botorch dynamo errors
Huggingface Transformers Trainer Test
quantization: unexpected casting of tensor min and max to int in histogram observer
[inductor] Lower aten.cumsum
[Discussion] Add custom device
[feature request] PyTorch vmap for efficient Evolutionary Strategies
Unhelpful error message from torch.linalg.ldl_factor
test_profiler_experimental_tree_cuda_detailed is too unstable, and as its CUDA only difficult to regen
argmax/argmin returns the last index instead of the first when there are equally max/min elements
Speedup for adding images to tensorboard
Segfault when profiling with_stack=True on model with jit.optimize_for_inference
Profiler can only print first 5 entries in stack traces because of hard-coded limit
Silent promotion of bool to int in the dispatcher
Conv1d: NNPACK SpatialConvolution_updateOutput failed when batchsize or padding is too large
libtorch malloc cause coredump
KL-divergence of two Generalized Dirichlet distributions
OpenJDK libtorch_cpu.so stack guard warning
PyTorch test suite regression test_module_backward_global_hook_writeable
I have the same issue as @samgelman on my MacOS.
Add a new argument `check_inf=True` (by default) or check_pos_inf / check_neg_inf to anomaly mode
Adding Levenberg-marquardt optimizer in PyTorch
quantize_per_tensor/quantize_per_channel operators should honor the quant_min/quant_max from observer
Cdist backward dependent on compute_mode
Build and Run QNNPACK on X86
[Installation] conda installation hangs on "Solving environment"
`torch.pinverse` produces wrong output!
Calling torch.linalg.cholesky on a CPU tensor requires compiling PyTorch with LAPACK.
`Frozen` module for transfer learning.
Move TorchInductor Triton autotuner to compile time, use common cache
Find way to add comments to merge_rules json
[ONNX] Convert GFPGANv1.3.pth to onnx
Test public bindings in CI gives weird output on error
How to turn off determinism just for specific operations, e.g. upsampling through bilinear interpolation?
zero-numel tensor has "RuntimeError: strides[cur - 1] == sizes[cur] * strides[cur] INTERNAL ASSERT FAILED" in multi-thread.
PyTorch profiler is spammy
`test_profiler_experimental_tree_cuda_detailed` fails with mismatches in the profile output
[caffee2] Windows build / 'metanet_pb2' (a circular import) Anaconda
Complex-Valued Gaussian distributions
KeyError `shape,stack,cos` on pennylane quantum circuit
Replace decompositions in torchinductor/decompositions with refs/decomps in pytorch proper
DDP + FSDP: Investigate behavior for nn.Module APIs
checkpoint function is not jit compatible
Torch1.10.2 is slower than torch1.9.1
dataparallel function doesn't work
torch.Tag doesn't have accurate mypy info
Long test time for PyTorch test_fx::TestVisionTracing with dynamo enabled
TorchVision testing in CI + test_fx
Improvements to ProcessGroupGloo monitored_barrier
addcdiv_ (in and out of place) not implemented for torch.float16 and cpu
bmm operator in bfloat16 has low TFLOPS for some tensor shapes with CUDA 11.6
cannot import name 'ProcessGroup' from 'torch.distributed'
Emulating FP64 and increased precisions on Apple silicon
PyYAML not listed as a dependency
Using Pytorch and Mapbox in the same project
During DDP training timm densenet121, mobilenetv2(v3) models do not save state_dict correctly.
torch.nn.Upsample's error message is inconsistent with the documentation
RPC: wait method of Future object return 0 sometimes in rpc framework
torch.nn.TripletMarginLoss margin can be less than 0
The type of parameter 'p' in torch.nn.TripletMarginLoss wrong
torch.nn.ReplicationPad{1|2}d supports more input dimension than are written on documentation
Enable pyre
torch.nn.PixelShuffle error message wrong
torch.nn.MaxUnpool2d get negative size tensor
torch.nn.InstanceNorm{1|2|3}d doesn't verify the value type of parameter num_features
torchgen.model.FunctionSchema.parse fails with following ops' schema
[VDD] unique_by_key from _embedding_bag_dense_backward isn't blocklisted by CUDA graphs
Enable freezing parts of the model in Fully Sharded Data Parallel
Check support of FSDP + set_materialize_grads(False)
module 'torch.distributed' has no attribute 'pipeline' - macOS, PyTorch 1.12.1
torch.nn.GRU runs long time, when num_layers is large
torch.nn.functional.softplus / torch.nn.Softplus parameter beta can be set to zero
deepcopy of LazyLinear fails
torch.nn.functional.log_softmax parameter '_stacklevel' undocumented
Optimize for mobile metal model
Expand Learning rate scheduling to any optimization hyperparameter
Fail to install torch for source
torch.nn.Hardtanh allows min_val > max_val
When padding is big int, torch.nn.functional.fold runs too long and can't return result
Make FSDP easier to debug when erroring in backward pass
bf16 strided tensor wrong calculation
Cannot call CUDAGeneratorImpl::current_seed during CUDA graph capture
[MPS] Bug on training CNN+LSTM
Bug in building pytorch deploy from source in macos USE_DEPLOY=1
torch.nn.functional.avg_pool{1|2|3}d error message does not match what is described in the documentation
One dlpack to rule them all
[FSDP] `test_summon_single_param()` is misleading
FSDP crash if no parameters are used in fwd pass
Direct use of torchdynamo.optimizations.analysis fails if you pass in None as an input
Redirect the old metrics.pytorch.org url to the new page
[CI] Create periodic fuzzy testing for PyTorch build flags
[CI] Split up periodic.yml into forward-fixable.yml and periodic.yml
DPP training incompatibility with checkpoint and detach
make_fx + aot_autograd segfaults
Updating the LTS version of the torch (1.8.2 -> 1.10.2\1.11.2?)
torch.empty_strided argument 'size'and 'stride' documentation wrong
FSDP init can crash with shared parameters
[JIT] Scripting modules fails for modules that contain nested NamedTuples
Support for CSR Tensor with NN layers
New PR template suggests a pattern that does not close PR
'Wav2Vec2ForCTC' object has no attribute 'conv'
TestCommon.test_dtypes error message is confusing
Incorrect tensor conversion to m1 MPS.
Implement refs.var as a real reference
torch.bitwise_xor argument 'other' documentation wrong
torch.profiler's FLOPs measure only counts operations involving '+' and '*' .
torchinductor fallback cannot deal op that returns tuple of list of tensors
Slice operation on "ragged" dimension in NestedTensor
Adding a warning of non-compatibility with forward hooks for the fast path of TransformerEncoderLayer
DISABLED test_tensorboard_trace_handler (__main__.TestProfiler)
functorch slow tests not being run in slow CI
linalg and lu tests fail when run in parallel on linux cuda
CUDA graph capturing fails for nn.Embedding and large batch sizes
`torch.tensor` and `torch.as_tensor` keyword argument `device` documentation wrong
Unknown builtin op: torchvision::deform_conv2d
GPU arch 8.6 is not covered by the `TORCH_CUDA_ARCH_LIST = All` option
Tensor operation hangs when used with multiprocessing
Error building Pytorch 13.1 from Source on OS X 12.5
getDLContext in DLConvertor.h cannot be found
functionalize and make_fx are not composable resulting in segfault and cuda error
[ROCm] build instruction is haphazard missing information unclear, build does not work
Profiling results on CPU is not reliable
[LibTorch] the C++ api needs detailed error reports like pytorch
UnaryUfuncInfo Sample Generation Ignores sample_kwarg function
Subclass of Tensor doesn't support __format__
Fill in a bool Tensor not supported in jit
torch.Tensor.bag() should automatically implement bagging
Met bugs ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0
Refactor how errors decide whether to append C++ stacktrace
DecompositionInterpreter creates invalid graphs for FX graph modules created with torch.fx.symbolic_trace
torchdynamo backend failure suppression is insufficient when backend fails at runtime
Automating release process - Binary validation, Automatically generating get started page
cur_dim == dimINTERNAL ASSERT FAILED at
tensor.unfold don't check the parameter size value, that maybe less than 0.
Tensorboard py-profiler shows no device info in Operator view
build fail when using lto with gcc
Move nested-tensor tutorial from prototype
SequentialLR does not work correctly with multiple ConstantLR
RReLU doc doesn't specify the eval mode behaving just like LeakyReLU
unittest.subTest and way to selectively mark subTests as expected failures
Schema information for torch.* operations
in-place variants should get their own OpInfos
[Torchscript] torch.min returns wrong gradient when inputs are equal
[Torchscript] some activations backward are not fused when used with linear
PyTorch crashes when running with OpenACC
FakeTensor Support For Pickling
contiguous() not work for rank 1 length 1 tensor.
Deep copy models with `create_feature_extractor` produces different parameters
DataLoader parameter pin_memory_device should accept torch.device type
RFC: Add flag for RNN decomposition to all RNN modules
PyTorch for quantum mechanics
`torch.cat` can break `torch.jit.ScriptModule` when in inference mode
make_fx is broken for all tracing modes
Libtorch C++ torch::stack error
Incorrect CPU implementation of CTCLoss backward step
Is there Doc that explains how to call an extension op in another extension implementation?
Use NestedTensor in RNN models
[ONNX] Memory leak when exporting a jit model to onnx
Split up `common_methods_invocations.py`?
Symbolic tensors are not printable
Complex addition result in NaN when it shouldn't
Implement torch.clamp() on sparse tensors with SparseCPU backend
Cloning conjugate tensor in torch_dispatch context produces non equality.
Guide for diagnosing excess graph breaks
Does torch.utils.checkpoint compatible with torch.cuda.make_graphed_callables?
SyncBatchNorm does not work on CPU
add support for bitwise operations with floating point numbers
Quantization issue in transformers
Minor inconsistency in description of `attn_output_weights` in MultiheadAttention docs
The torch::deploy document is not updated
[JIT] _unsafe_view returns alias when size(input) = size argument
Bilinear interpolation with antialiasing is slow in performance
Problems in built-from-source pytorch with USE_DEPLOY=1 in Ubuntu
masked_scatter_ is very lacking
ufmt and flake8 lints race
Offer a way to really force merges via pytorchbot
[JIT] SchemaInfo warning appears out in the wild
test_make_fx_symbolic_exhaustive should pass dynamic ints for shape arguments
Add more Vulkan operations
A/libc: Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 9792 (Background), pid 9674 (ample.testtorch)
torch.einsum gets wrong results randomly when training with multi-gpu
when distribute training load pretrain model error
Race condition between torch.tensor's view and /= (/= returns incorrect result)
pytorch's checkpoint_wrapper does not save memory while fairscale's checkpoint_wrapper saves huge memory
`torch.matrix_exp` doesn't handle NaN properly
DEBUG=1 env var doesn't actually set DEBUG preprocessor macro
[Reproducibility] Make tests say when unusual environment variables are set that change behavior of the test
logspace inconsistently casts inputs to int before performing computation
primtorch refs should be composite compliant
logspace and linspace off by one on cuda for integer dtypes for some inputs
[Profiler] Allow profiler to gracefully fail without interrupting workflow.
[Profiler] Allow profiler to gracefully fail without interrupting workflow.
Reordering test in PyTorch test suite induces dynamo failure
[feature request] DataLoader to accept num_threads argument to auto-set number of threads for OpenMP / intra-op parallelism
OOM during backward() leads to memory leaks
backward not available for index and mask
iOS TestApp from mobile performance recipes tutorial doesn't build on macOS
RuntimeError: "reflection_pad2d" not implemented for 'Half' in autocast enabled region
[FSDP] Error when wrapping FSDP inside `checkpoint_wrapper`
model.to(device) takes time forever on A40-8Q, NVIDIA. cuda11.1, torch1.9.1.
Provide error handling for ops that don't yet support Dynamic Shape
DataLoader: `pin_memory` should respect object attributes before object collection type
`torch.sum` promotes integral tensors to `int64`.
[Checkpoint] Support multiple unpack in saved tensor hooks
DistributedDataParallel hangs when not using GPU 0
set_grad_enabled not respected when running on a web server
Stop manually binding sparse factory functions
Re-enable DynamicQuantModule in iOS simulator tests
External libraries cannot have a requirements.txt that needs to install a cpp_extension
Move functorch tests to under test/
UserWarning: operator() sees varying value in profiling
[feature request] Discover actually loaded shared libraries at runtime
torch.concat type hints fail for keyword argument
When using libtorch v1.10.2, calling at::slow_conv_dilated3d directly returns wrong results on cpu backend
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms
linear.matrix_power is not composite compliant
Untangle TorchScript prim ops in aten namespace
Could be clearer that Cross Entropy takes logits as input
Using DDP with num_workers > 0 hangs before entering the first training epoch loop
Autocast documentation examples would break
CUDACachingAllocator should be cuda memory merge/compact friendly
cant build with USE_VULKAN=1
[FSDP] deepcopy FSDP model for EMA results in error
upsample_bilinear2d() received an invalid combination of arguments
optimize_for_mobile vulkan_prepack::conv2d_clamp_prepack
Documentation for torch.cuda.Event(blocking=True) is wrong
Inconsistent implementation of quant_utils:: ChooseQuantizationParams compared with fbgemm:: ChooseQuantizationParams
[Misleading] The doc started using Tensorflow terminology in the document to explain how to use the Pytorch code.
[PyTorch/XLA] Improve the XLA PR landing process
Issues with custom types defining `__new__`
linspace cpu and sometimes cuda is wrong on integral types
Unify c10::Event and at::cuda::CUDAEvent
nn.InstanceNorm and nn.GroupNorm are affected by padding, so they need to masking
backwards compatibility ALLOWLIST is misused
test_sparse_matmul_cpu_complex128 fails on my local copy
test_sparse_spdiags_cpu_bool fails on my local working copy
Tensor.backward type hints clarification
Overloading multiple signatures for a single ref
Investigate adding shell linter/checker to CI
Investigate adding Dockerfile linter hadolint to CI
Investigate if it's okay to throw a RuntimeError instead of TypeError here : https://github.com/pytorch/pytorch/pull/79560/files#diff-415017bcad4fa6cd6d3dfe5f6ea1caffcd7122b46b8c1e4825f7d889efc80a62R1816
Devirtualize sym_sizes, virtualize sym_sizes_custom
Add more autograd tests with symints
implement sym_numel
Make sure we always redispatch through a dispatcher for all SymInt ops
Unknown builtin op: aten::broadcast_shapes
Dependency header directory is not properly expanded in the utils.cpp_extention in ninja mode
RuntimeError: CUDA error: no kernel image is available for execution on the device
dtype mismatch when after using auto mixed precision
grid_sample and mode='bilinear' induces errors at discrete pixel locations
Compatibility with newest MKL
Enable jit error when using FSDP
Workflows fail silently when the workflow file is invalid
Rename DispatchKey Dense/Sparse/etc to DenseFunctionality/SparseFunctionality, use original name for alias
TestTagsCPU.test_tags__refs_constant_pad_nd_cpu_float32 flaky with dynamo & pytest
Modernize logging tensor in torch.testing._internal
BatchNorm for complex tensor
DISABLED test_non_contiguous_tensors_nn_ConvTranspose1d_cuda_complex32 (__main__.TestModuleCUDA)
Support JaggedTensor/KeyedJaggedTensor from TorchRec in TorchDynamo
Inconsistent naming convention for end of enum in DispatchKey
PyTorch Embedding Op with max_norm is not working as expected
Dispatcher debug/logging mode
Failed to static link latest cuDNN while compiling
Message exchange failure when perform alltoallv (cpus)
Python operator registration API for subclasses
FakeTensor consolidated strategy for in_kernel_invocation and dispatch keys
Provide an option to disable CUDA_GCC_VERSIONS
Export quantized shufflenet_v2_x0_5 to ONNX
Register refs for CompositeImplicitAutograd ops as decompositions
[Tracker] AO migration of quantization from `torch.nn` to `torch.ao.nn`
[packaging] Conda install missing python local version label (+cu123 or +cpu)
optimize_for_mobile has an issue with constant operations at the end of a loop
RFC: auto-generated plain Tensor argument only sparse primitives
Idiom for PrimTorch refs for Tensor methods
`sparse_coo.to_dense()` produces different results between CPU and CUDA backends for boolean non-coalesced inputs.
Windows Debug binaries crash on forward: assert fail on IListRefIterator destructor
DISABLED test_profiler (test_jit.TestJit)
[bug] the output shape from torch::mean and torch::var is different in libtorch
[Distributed] test_dynamic_rpc_existing_rank_can_communicate_with_new_rank_cuda fails in caching allocator
PyTorch 1.12 cu113 Illegal Memory Access or Internal Error instead of Out of Memory cases
FakeTensorMode cannot handle non-fake tensor, but non-fake tensors can arise from non-interposable Tensor construction calls
Improve interaction of PyTorch downstream libraries and torchdeploy
__getitem__ is returned as an OverloadPacket instead of an OpOverload in __torch_dispatch__
[Profiler] Defer thread assignment for python startup events.
float' object is not callable when using scheduler.step() with MultiplicativeLR
Support Swift Package Manager (SPM) for iOS
Precision error from torch.distributed.send() to recv()
Torch does not build with Lazy TS disabled
Linking pytorch libraries causes sstream behavior to be overridden globally
[vulkan]compiling VulkanOpContext.cpp with some errors
CapabilityBasedPartitioner treats non-compute ops inconsistently
forward program terminated from __cxa_pure_virtual
CapabilityBasedPartitioner doesn't support horizontal (vertical?) fusion
[onnx] Add support for prim::DictConstruct in pytorch-ONNX converter
[onnx] support more combinations of args/kwargs as model inputs for pytorch-onnx converter
jit gives surprising results with lists of objects
[JIT] Request Constant Propagation to keep fake_quantize_per_tensor_affine and fake_quantize_per_channel_affine on the graph
Missing corner case handling in ATen ctc_loss implementation
torch.utils.checkpoint optimization opportunity
torch.randint should accept high=2**63
torch.stft does not normalize non-rectangular windows correctly
[FSDP] `test_mp_embedding_reduce()` fails with `transformer_auto_wrap_policy()`
Add a check to detect mutation of the inputs during backward
torch.searchsorted error message and documentation is unclear
num_worker and prefetch_factor in DataLoader do not scale
Implement shape/size functions for nestedtensor
Global lambda function is not properly guarded
"Attempted to resize a view tensor to a larger size. This is not allowed in the functionalization pass" reported on non view tensor
Investigate ncclRedOpCreatePreMulSum operator for gradient reduction
quantization: QConfigMapping should be easy to print
Segfault with fake tensor
[Prims+NvFuser] Issue with aten.where.ScalarSelf
JIT trace takes forever on a simple method
Reductions on tensors larger than GPU memory
`torch.overrides.get_testing_overrides` does not function as intended for native tensor methods/operations
Incorrect results for mean or sum kernels on aarch64 when building with gcc-7
[Prims+NvFuser] Non-fusible ops Tracker
Files downloaded with torch.hub should respect umask
Runtime error in Libtorch cpp project (Didn't find engine for operation quantized::conv2d_prepack NoQEngine)
Refactor linter adapters to avoid code duplication
High GPU context memory on Torch 1.11.0 but none on Torch 1.10.0
[FSDP] Avoid explicit replace of activation checkpoint prefixes
Libtorch cannot load TrochScript Module correctly, when a network contains conv2d(inchannels=64, outchannels=128, kernelsize=1) .
CapabilityBasedPartitioner does not work correctly with mutating operations
Functionalization and fake tensors failure in torture test
torch.fx.node.map_aggregate and torch.utils._pytree.tree_map do the same thing
DISABLED test_trace_dependencies (test_analyze.TestAnalyze)
torch._weight_norm with specified dim returns wrong output
grad not preserved during copying or pickling
[Mac M1] `torch.mm` sometimes produces incorrect results
build libtorch with the same mkl as Matlab
move bazel files out of pytorch repo root
SparseAdam performance issue during optimizer step
libprotobuf version compatibility
Docker updates cause subsequent builds to fail
torch.package can not be used to serialize `resnet18` from TorchVision-0.12
CI: Run cpu tests in parallel processes?
Resize/reshape of sparse compressed tensors - design
[discussion] Consolidation of audio-visual I/O in a new package
[jit] Failed to load a saved scripted function
RuntimeError: required keyword attribute 'value' is undefined
[ONNX] Exporting the operator `::svd` to ONNX opset version 13 is not supported.
[ONNX] Support opset 17 operators
[Releng] Improve the tutorials release process
three typing inconsistencies on Tensor methods
[Prims+NVFuser] nvFuser running into "Tensors of type SparseTensorImpl do not have strides"
Nested tensor: Support Noncontiguous Buffer
[Prims+NVFuser] aten.to.dtype refs impl causing segfault
[ONNX] Tool to find mismatch in exported ONNX model
[Prims+NVFuser] Aten2Aten decomp hurting performance
ExpandedWeights sometimes fail silently and doesn't compute .grad_sample attribute
ExpandedWeights can't handle modules with tied weights
torch.nn.functional.linear fails for multi-dimensional bias from torch 1.12
[LTC] OOM on mnist example
Wrong example of sliced computation in doc page Numerical Accuracy
[jit] script backward wrong gradient
Position embedding aware global circular convolution
Interpolation artifacts when using nn.interpolate, trilinear mode for 3D label images
[primTorch] `|` operator does not work with FakeTensor in _refs
make_fx doesn't work with truly dynamic argument functions (e.g. fx.Interpreter)
slow test infra cannot handle nested suites
C++ extensions inject a bunch of compilation flags
[BE] Refactor FSDP Unit Tests
SummaryWriter add_embedding issue with label_img
jit.freeze throws RuntimeError: stack_out && stack_out->size() == 1 INTERNAL ASSERT FAILED at "../torch/csrc/jit/passes/frozen_conv_folding.cpp":281
Compatibility List
[bug][nvfuser] Applying nvfuser to the model leads to runtime error
[DDP] doesn't support multiple backwards when static_graph=True
Can torchscript dump backward graph?
Inconsistent computation of gradient in MaxUnPooling
Ne op does not behaves as expected with nan
When running GPT trainning with megatron, the program quit due to torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
Add typing support to ModuleList and ModuleDict
The result of doing a dot product between two vectors, using einsum, depends on another unrelated vector
torch.einsum results in segfault
`torch.renorm` gives wrong gradient for 0-valued input when `p` is even and `maxnorm=0`.
`hardshrink` gives wrong gradient for 0 input when `lambd` is 0.
`torch.inverse()` crash in cuda
RPC: Make RRefProxy callable
Anaconda is not a package manager
Let torch.utils.tensorboard support multiprocessing
`atan2` will gradcheck fail when `other` is a tensor with `int8` dtype
`det` will return wrong gradient for `1x1` matrix with 0 value.
[ONNX] RuntimeError: 0 INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/jit/ir/ir.cpp":518
CapabilityBasedPartitioner requires is node supported to only return true for CALLABLE_NODE_OPS but no assertion for this invariant exists
Unable to use vmap atop torch.distribution functionality
Add TorchDynamo as a submodule to Pytorch?
Output for `aten::_native_multi_head_attention` appears inconsistent with entry in `native_functions.yaml`
[jit.script] jit.script give uncertain results using torch.half
pad_sequence and pack_sequence should support length zero tensors
Overlapping Optimizer.step() with DDP backward
RuntimeError: DataLoader worker (pid 22822) is killed by signal: Aborted.
Semi-reproducible random torch.baddbmm NaNs
`torch.ops.aten.find` inconsistent with `str.find`
2-dimensional arange
`bmm_sparse_cuda` kernel for `bfloat16`
Cannot run scripted BERT_Pytorch
Nonliner conjugate gradient optimizer + Hager-Zhang line search
NVFuser should extend caching to remove necessity for PrimTorch's executor to Provide Tensor Contiguity Info
Allow parameterization of Layouts
[Prims+NVFuser] Prims with missing NVFuser ops
DDP find_unused_parameters=True does not work for Sparse gradients
[bug] libtorch bug in nn::MultiheadAttention and nn::Transformer
Negative values still produced by torch.nn.functional.kl_div
Revisit OpInfo samples for nn.functional.max_poolNd
scatter_reduce choosed indices
CMake Error: File /opt/pytorch/build_variables.bzl does not exist.
Torch fx print line number of each node
Guard Failures in T5 Model
[DDP] output_device argument appears completely unused
[c10d] Async object-based collectives
Tracker: Slow gradcheck failures possibly indicating incorrect gradients
Support for learnable p Values in LPPOOL like Pool
Modify _add_docstr to also set the correct module for the APIs
[BE] Update ProcessGroupWrapper tests to test other collective message
Distributed Store `get` doesn't work well with `add`
DISABLED test_lobpcg (__main__.TestAutograd)
Illegal Memory Access from nonzero method when Tensor is Too Large
java.lang.ExceptionInInitializerError at org.pytorch.NativePeer.initHybrid(Native Method)
Add support for torch.nn.quantized.modules.FloatFunctional
CosineAnnealingWarmRestarts with initial warm up and weight decay applied on consecutive cycles without warm up
AttributeError: 'LinearPackedParams' object has no attribute '_modules'
Need "valid" and "same" padding mode for convTranspose2d
Sort tensors inplace
Cudnn batch norm kernel (batchnorm_bwtr_nhwc_semiPersist) gets blocked by overlapping NCCL all_reduce calls
[complex] dropout and it's variants should support complex tensors
Write some torch.distributed.nn.* tests for the new dispatcher passable ops
Change c10d APIs in ProcessGroup to accept const std::vector&
test_conv_backend tests OOMing in 10.2 slow_gradcheck CI
[Prims+NVFuser] Supports 0-sized inputs
[Prims+NVFuser] Aten2Prim refs tracking items
Support tensor subclasses as `UninitializedParameter`s
OpInfos for torch.ops.aten operations
F.binary_cross_entropy_with_logits unexpected behaviour
`soft_margin_loss` gives wrong gradient when `target` with dtype uint8
`max_unpool` gives wrong gradient when `indices` has duplicate
[NVFuser] Investigate models without any fusion groups found
[NVFuser] Choose partitioner op list based on supported prim decompositions
[NVFuser] Investigate modules with bad performance relative to eager
Torch.fx: add reporting of the name of a module not found during tracing
Catch value errors if cell in match_nested_cell is empty
GEGLU activation
AMP step() enforce synchronization
[RFC] Module specific workflows
Elliptic Functions and Integrals
[primTorch] No _refs support for torch.Tensor.requires_grad.__get__
Orthogonal Polynomials
activation checkpointing with non_reentrant implementation memory leaks
CPUProfilingAllocator greedy allocation plan generation failed
[feature request] Add support for a custom DatasetFetcher in DataLoader
Expose more MAGMA backends for solve_triangular
Allow a user provided "test name - test time" mapping file work with pytorch's test sharding mechanism
Provide error message when thread pool is exhausted in RPC
Complex support in DDP
FakeTensor: Support torch.tensor([FakeTensor, 0])
pow CUDA tensor raised to CPU scalar tensor result can't backward properly
Support `antialias` option on `torch.interpolate` for ONNX export
`torch.special.gammainc` backward pass with respect to the first argument
memory leaking when doing all_to_all_single communication
RPC init fails and crashes when world_size is greater than 18
[ONNX] Input node deleted when converting a Conditional random field model
static builds are broken by MKL_DNN
when forward use **kwargs,how to construct the example_ Inputs parameter in jit.trace?
Comprehensive documentation for Tensor indexing?
Deterministic `index_put` on CUDA fails when broadcasting is required
[CI] Do we run all cpp tests on CI?
PrimTorch burns in static shapes
[feature request] LazyTensor that provides/loads/computes its contents only upon request to be returned from torch.load
Modify update-viable-strict GHA to use internal version of checkout
Write lint for isGreen
`CosineAnnealingWarmRestarts` does not update parameters added with `add_param_group`
test_meta_vstack_cuda_int16 (__main__.TestMetaCUDA) Fails with DEBUG=1
A/libc: Fatal signal 6 (SIGABRT), code -6 in tid 11742 (objectdetection)
All {view}_scatter variants should support all (or most) dtypes
[bazel] [ci] `//:lazy_tests` Could not run 'aten::mul.Tensor' with arguments from the 'Lazy' backend
[bazel] [ci] `//:module_test` CUDA error: CUDA driver version is insufficient for CUDA runtime version
Automatically calculate output_shape of sequential model (or any other fCNN)
Multi-node training meets unknown error
Automatically use CUDA
[ONNX] Replace test inheritance for `test/onnx/test_models.py` with parameterizing
Parameter.__deepcopy__ doesn't preserve view relationships
Improve clarity by making sharding a static nightly update
android-tests is often flaky
[FSDP] Test that module using mixed precision can be loaded into non-mp module
[JIT] failures with nested with blocks + loop continuation
Compliance with PEP-0523
quantization: misleading backend config for linear_dynamic_fp16
[FX] TypeError when tracing cat taking split's output as input
ONEDNN testing is not done properly in quantization codebase
gradgradcheck fails for torch.native_layer_norm
Float and double tensors randomly initialized with the same seed get different values for size >= 16
Does Torch JIT Support Trace High-level Custom Op?
tensorboard SummaryWriter.add_graph fails when model uses empty tuples
[FSDP] Progress of ParamExecOrderWrapPolicy
Missing the time unit in duration time of DDP logging
[FSDP] Verify that FSDP-managed parameters are the same across ranks
PyTorch Preview (Nightly) version number does not comply with Conda conventions
Some unit tests are failing
[LTC] Introduce a `MetricsReport` python binding and allow backend to add their report as string
__torch__dispatch does not return new output in inplace function
Unable to use a parameter with torch.sparse_coo layout with DDP
test_ops.py extremely slow on cuda11.3
Display a "reference" link for ops that points to primTorch implementations
DISABLED test_checkpoint_wrapper_parity (__main__.CheckpointWrapperTest)
DISABLED test_caching_pinned_memory (__main__.TestCuda)
Implement NestedTensor size function
[META] Sign up to discuss significantly modifying CI
Add a new _broadcast_coalesced op for DDP
Ensure the guards in distributed_c10d.py wrappers get executed in the replay of the graph
Add autograd support for dispatch passable c10d ops
Iteration # 1-offset in DDP logging
[BE][ZeRO] Enable multigpu unit tests
[LTC] Make `torch::lazy::BackendImplInterface::ExecuteComputation` takes `ComputationPtr` instead of `Computation`
Use c10d broadcast_object in Zero
Guards for a linked list will be `O(n^2)`
API for accessing SymIntNode mandates refcount bump even when it is unnecessary
Add doc formatting check to lintrunner
Conda enviroment
SymInt equality tests are unsound
Init connect timeout when use torch.distributed.run
caffe2_nvrtc is produced even when it won't be used
Incorrect image upscaling on MPS backend
torch failure to open libcuda.so.1 on macOS
TorchScript bidirectional lnlstm from example doesn't work
[build] No documented way to install C++ binaries for pure-python development of pytorch
[bazel] build spams warnings
Adam not optimally implemented: unnecessary torch.div
[bazel] ability to run gpu tests on gpu machines in RBE
PyTorch get positive log_prob of a multivariate normal distribution
Conda install from pytorch-nightly channel does not install the expected version on macOS
Batches are being duplicated from go http call
[ONNX] Internal assert error during export
[NVFuser] hitting fallbacks on demucs (from torchbench + lazy tensor)
`prepare_qat_fx` docstring doesn't run
PyTorch gets stuck when using an NVLink/A6000 and more than two GPUs
allowed_functions_module_string_ignorelist doesn't work very well
testSerializationInterop in test/cpp/jit/torch_python_test.cpp has not run in over two years
PyTorch leaks a macro definition called "CHECK" in the C++ version
[NVFuser] bad performance on pyhpc_isoneutral_mixing
[BE] Generalize recursive wrapping utility
[NVFuser] bad performance on mobilenet_v2 and mobilenet_v3_large
[NVFuser] bad performance on pyhpc_equation_of_state
scripted fft Convolutions are faster than nn.Conv1d with large kernels
[ONNX] Enable more operators to support data propagation
out-of-place functional optimizers: functional optimizers may not be composite compliant
[bug] Device dispatcher can choose CPU path for CUDA tensors.
[feature request] Support dataclass derivations of nn.Module
[bug] fill_, masked_fill_ : fill ops allow lossy downcasting of fill value
Mismatch in clang toolchain lead to binary incompatibilities on M1 between torch and torchvision
Triangular solve fails on batches of matrices of size > (*, 524280)
_make_elementwise_unary_reference and other function factories in torch._refs don't set __name__ correctly
DistributedDataParallel `static_graph=True` fails to handle unused parameters
PyTorch/XLA's DDP XLABackend is broken by upstream change
Redundant info are saved when using torch.save to save part of torch.tensor
[AUTOGRAD] support implicit reductions with SymInts in autograd.
[AUTOGRAD] codegen to use sym_sizes for ops w/ symint overloads in derivative formulas
torchvision.models.mobilenetv3 can't save pre-trained model to custom dir?
Hide or fuse TupleConstruct / TupleUnpack from tensorboard graph
[ONNX] `.squeeze(1)` on the B X T (not B X 1 X T) tensor causes export error in masking
About the source code of
Error with Named Tensors and multiple threads
Improve PrimTorch testing for view consistency.
CI workflow creates too many tags in RSS feed
Multi-node, Multi-GPU set up tutorial for Slurm cluster
inspect.signature.bind is not supported
[MetaIssue] Propagating SymInts through Autograd
Mirror and implement `SymbolicIntNode` API for `SymInt` so we can trace in C++
[MetaIssue] Investigate if we should be reusing primtorch formulas for `is_dynamic`
DDP Freezes w/ No Output for PyTorch Geometric GNN Multi-GPU Node Classification
Add Autograd Support for Nested Tensor
Add a test that shows that lazy_ir reuse breaks SizeNodes
Implement SymbolicIntNode interface for lazy (i.e. lazy::SymbolicIntNode)
Devirtualize `sym_sizes`. It still has to work for python tensor subclasses and LTC/Xla
Building PyTorch from Source with BUILD_LAZY_TS_BACKEND_ON
When setting sizes and strides on a tensor subclass in `THPVariable_make_wrapper_subclass`, also make offset symbolic
Random functions should infer device from user-specified Generator
Corner cases of ShardedTensor checkpoint when using TorchRec
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Multi30k can't be downloaded the destination domain can't be reached
torchscript jit trace support custom op without specific csrc and .so
Doc on index of CPU Device seems wrong
Libtorch C++ mobile build linking error
TorchInductor failing inference models tracker
DataLoader leaking resources?
[forwardAD] torch.no_grad has no effect under forward_ad
Can we have Additive Attention?
library libshm.dylib is missing
Add type() support for mps backend
If large enough tensor is being cloned, parallel dataloading hangs on M1 Mac
Do we really need sampler for IterableDataset?
Strange tracing result with torchscript
LambdaLR changes the learning rate in an undesired way
TorchInductor missing ops tracker
torch.fx deepcopy does not copy attributes added to GraphModule or Nodes
[distributed_test.py] Improve `test_barrier`
Abnormal GPU memory usage when using CUDA tensors with multiprocessing
Cannot build master on AWS cluster: error: ‘__fatDeviceText’ was not declared in this scope
fatal_signal_asan_no_sig_test in current master hang.
Improving clarity in the docs of different losses
Remove const from function return type if returning const value
[Profiler] Capture more information about inputs
[RecordFunction] Hold a durable schema reference
MPS: Adding int64 tensor does not work on AMD GPU
[Modes] no_dispatch is not the same as DisableTorchFunction, causing differences in modes
Add TORCH_SHOW_CPP_STACKTRACES when TORCH_DISTRIBUTED_DEBUG = detail
[ONNX] Re-design `torch.onnx.export`
Mac M1 Build Failure on DEBUG=1
Certain import order triggers segmentation fault
TorchScript inference get intermediate result?
Feature Request: Hessenberg and Schur decompositions
Feature request: Integer system decompositions
torch.jit.script segmentation fault (pytorch debayer module) 1.10, 1.11 and nightly
Efficiency of unary operations on CPU for large tensors
Deprecate hardtanh type promotion behavior.
[FSDP] Customizable gradient pre-divide for mixed precision training
Extend tag testing for aliases
Add `inplace_view` tag for `resize_`
Getting NotImplementedError when trying to implement E2E support for `prim::is_nested` Op in torch-mlir.
Unable to programmatically update models using references from model.named_modules()...requires additional parsing
Expose docs from the yaml for each torch.Tag in Python
Add a gallery of examples with sphinx-gallery
Test approximation and numerical stability of numerical operators
[primTorch] Sensible Error Messages
New c10 constants
[Better Engineering] Make OpInfo-based test failures easy to reproduce
AlBert quantization
[ONNX] Scripted `reshape` incorrect if shape is dynamically calculated
ValueError during `yaml.dump(dtype)`
BuildExtension does not choose correct CUDA installation
Unable to install Preview (Nightly) on M1 macOS: "Symbol not found"
Allow batch_norm_backward_elemt and batch_norm_gather_stats_with_counts handle 0 counts
torch.distributed.init_process_group(backend="nccl") NCCL version error
Debug job does not build in debug mode
linalg.pinv_singular tests are slow
Module parameters/submodules can be shadowed by class attributes silently
[FSDP] Enhance sync_module_states for auto wrapping
torch.svd_lowrank fails for complex matrices
RFC: Improve the performance and usability of linear algebra on CUDA devices
[JIT] autodiff implementation of rand_like function is outdated
LibTorch cannot be used without nvcc
test_python_dispatch fails on
DEBUG=1rX7Exponentiating floating number with cuda tensor is slowrXBclear input shape declaration on pytorch model inputs and outputsrXHParallel execution of multiple unrelated statements written sequentiallyrXV[1.9.1] [collect_env] collect_env does not collect actual runtime-loaded cudnn versionrX+New feature requested: vmap for torch.histcrXMtorch.fx: symbolic_trace: ones() received an invalid combination of argumentsrXRException in torch.jit.script doesn't indicate where in the code the problem lies.rXCtorch.lerp: discrepancy between CUDA and CPU (with extremal inputs)rX6is the issue resolved? windows not pytorch_jni in pathrXXRuntimeError: Event device type CUDA does not match blocking stream’s device type CPU rXF[onnx] RuntimeError: Attribute 'axes' is expected to have field 'ints'rXm`with torch.backends.cudnn.flags(deterministic=True)` doesn't give an exception for ctc_loss backward on CUDArX*Softmax, LogSoftmax are over parameterizedrXV`layer_norm` triggers INTERNAL ASSERT with input requiring grad + zero-size int tensorrXW`index_fill` will trigger INTERNAL ASSERT when float tensor requiring grad + int tensorrXLfx.Tracer with param_shapes_constant=True not working for RobertaForMaskedLMrXPermutation of Sparse TensorrX.lldbinit for lldb debugerrX)torch.angle differs from np.angle for -0.rX"Torchdynamo for Deepspeed and FSDPrX!Split up and reorganize RPC testsrXI`gradcheck` fails for `torch.distribution.transform` APIs in forward moderXaTRACK: integral + floating inputs to an op with floating requiring grad result in INTERNAL_ASSERTrX Memory allocation errors when attempting to initialize a large number of small feed-forward networks in RAM with shared memory despite having enough memory rXARequest for adding the possibility for training on sparse tensorsrXepytorch-android-lite use its own libfbjni.so, which is not compatible with any other version at all..rX4[CI] Detect when tests are no longer running from CIrX-Floating point exception in _conv_depthwise2drX 
Any plan to add Noam scheduling?rX#`max_unpool2d` is not deterministicrXY USE_NATIVE_ARCH flag causes nvcc build failure due to "'arch=native': expected a number"rX3Performance with MPS on AMD GPUs are worse than CPUrXhDISABLED test_complex_half_reference_testing_as_strided_scatter_cuda_complex32 (__main__.TestCommonCUDA)rX?nn.Sequential causes fx.replace_pattern to not find any match. rX5test_to (__main__.TestTorch) fails with multiple gpusrX0Allow specifying pickle module for torch.packagerX@[chalf] reference_testing: low quality test for fast growing opsrX+[Optimizer Overlap] Parameter group supportrX0[Optimizer Overlap] Proper checkpointing supportrX1[Optimizer Overlap] Custom optimizer registrationrX`pack_sequence` crashrX`ctc_loss` will backward crashrXA`baddmm` triggers INTERNAL ASSERT FAILED when input requires gradrXE`matmul, mm` triggers INTERNAL ASSERT FAILED when input requires gradrX5Enhancements to AliasDB to handle in-place operationsrX Segfault in _pad_packed_sequencerX)Segfault in _grid_sampler_2d_cpu_fallbackrX'Segfault in _embedding_bag_forward_onlyrX$Segfault in torch._C._nn.thnn_conv2drX)Segfault in torch._C._nn.reflection_pad2drXSegfault in max_unpool3drXSegfault in grid_sampler_3drXSegfault in bincountrXGDoesn't work when register hook to torch.nn.MultiheadAttention.out_projrX8[ONNX] Support tensors as scale and zero_point argumentsrX(RFC: Move functorch into pytorch/pytorchrXBtorch.multiprocessing.spawn raise PicklingError inside a decoratorrX2[primTorch] item prim can't return a bool properlyrX8[primTorch] Meta function for item creates a dummy valuer XPDISABLED test_init_from_local_shards (__main__.TestShardedTensorFromLocalShards)r X#Installation on Jetson target boardr XGamma and Related Functionsr X/nn.CosineSimilarity returns value larger than 1r X+Adam is 30% slower than SGD on Apple Metal.rXgPython memory allocator called without holding the GIL when running torchrun under Python debug versionrXXtoleranceOverride should override atol 
and rtol even when explicitly specified in a testrX3RFC: [primTorch] Stride-agnostic Operator SemanticsrX%DDP multi host with single GPU each. rXJ[FSDP] .modules() return original modules instead of FSDP prefixed modulesrX-FFT operators are not supported on MPS devicerXBError occurred , when compile source code setting BUILD_CAFFE2=ONrX2Three memory copies of every dataloader cpu tensorrX2Override sym_sizes to create LTC IR for SymIntNoderX8forward-mode support for "logically composite" operatorsrX9Inference Tensors should not be allowed to hold `grad_fn`rX`logaddexp2` fails to backwardrXBOperating on boolean torch tensor and numpy array casts to `unit8`rXKExporting the operator isinstance to ONNX opset version 13 is not supportedrXJNaN tensor values problem for GTX16xx users (no problem on other devices)rXB`topk` returns different results with the same input twice in cudarX?[failing test] test_foreach::test_binary_op_scalarlist_fastpathrX Fails to compile with GCC 12.1.0r X(Heap corruption in slow_conv_transpose3dr!X'Floating point exception in slow_conv3dr"X2Floating point exception in native_channel_shuffler#X+Floating point exception in channel_shuffler$X'Segmentation fault in _remove_batch_dimr%X@Make the appropriate backend `DimensionNode` visible to LTC corer&X2Throw warning if python optimise flags are enabledr'XcConv2D with large different number of input and output channels gives a CUDNN_STATUS_INTERNAL_ERRORr(X2ONNX export of CumSum produces different data typer)X.Legacy model format is not supported on mobiler*XhBUG: reference count leak when using `THPLayout_New` and `THPMemoryFormat_New` (static analyzer reports)r+X*Sporadic convolution error with dilation=0r,XETorchScript attempts to compile dead branch of torch.jit.is_scriptingr-XEcannot convert to channels last format for conv2d conv3d hybrid modelr.Xtorch.nn.Conv3D on MPS backendr/XI`addmv, mv` will trigger INTERNAL ASSERT FAILED when input requiring gradr0X7FSDP should work for model outputs that 
are dataclassesr1X4`Could not start gRPC server` flakiness in XLA testsr2XU`torch.utils.benchmark.examples.blas_compare` can not be parsed by Python-3.7 runtimer3X&General MPS op coverage tracking issuer4Xstrange behaviour in torch.divr5X%net_observer_reporter_print.h missingr6Xttorchrun leads to `ModuleNotFoundError: No module named 'tensorboard'`, but python -m torch.distributed.launch is okr7X"TimeSeriesDataset retrieve columnsr8XAdding Vulkan Support r9XKcomplex abs strides are wrong on empty tensors and tensors with 1 dimensionr:X'FSDP: enhanced shared parameter supportr;XKPrimTorch refs do not match argument naming with their PyTorch counterpartsr<X<Extend BC test to test for __torch_function__ overridabilityr=X&PrimTorch decomps for random functionsr>X0Werror=nonnull in dataloader.cpp (part of tests)r?X,PyTorch fails to build on gcc 12 due to gloor@X;How to handle __module__ attribute for Public API bindingsrAX*FSDP: test mixed precision with checkpointrBX?`stateless.functional_call` doesn't work with `nn.DataParallel`rCX3Investigate sharded gradscaler OOM on CPU workloadsrDX2Functional Jacobian does not work with TorchdiffeqrEXElintrunner doesn't give good error message suggesting lintrunner initrFX:Build check for AVX512 fails with AMD CPU and march=nativerGXModernize LoggingTensorModerHXCFailed to run on iOS - Couldn't find an operator for `aten::conv1d`rIXbatch Kronecker product rJXFsoftmarginloss should use `log1p` and has an incorrect out= behaviour.rKX5CUDA: Illegal memory access in `torch.linalg.solve()`rLXKDDP window TCP bug [socket.cpp:558] [c10d] The client socket has failed to rMX<Inplace Bool API + `sum` will trigger INTERNAL ASSERT FAILEDrNXK`max_pool1d` can succeed when padding is negative for tensor requiring gradrOX+Standalone unittests for checkpoint_wrapperrPX<conda CPU installation for LTS fails with UnsatisfiableErrorrQX2More clarity in doc for `torch.cuda.Event.record`?rRX5FSDP: Mixed precision should not cast ignored 
buffersrSX9Suboptimal error message - nn.Linear with double argumentrTXEProcess hangs after calling conv2d() in pytorch 1.11.0 with CUDA 11.3rUX%Allow force building with/without AVXrVXDtorch.onnx.export does not track Tensor.data.size() for dynamic axesrWXELarge numerical inconsistency for `torch.einsum` on RTX30 series GPU.rXXmicrobenchmark-style testsrYX%[distributed] c10d crashing on assertrZXoutputs_[i]->uses().empty()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/jit/ir/ir.cpp":1314, please report a bug to PyTorch. r[XQDISABLED test_ddp_profiling_autograd_profiler (__main__.TestDistBackendWithSpawn)r\X7Disable issue doesn't disable multiple dtypes correctlyr]XEwrong overload resolved for `torch.mul(x, 4)` in `__torch_dispatch__`r^XIDISABLED test_DistributedDataParallel (__main__.TestDistBackendWithSpawn)r_XDnon-rentrant checkpointing uses same memory as non-checkpointed coder`XCSubclasses with unwrapping `__torch_dispatch__` impls as parametersraXDCrossEntropyLoss computes SoftMax always across the second dimensionrbXlintrunner not workingrcXMThe codegen unconditionaly generate code even when it is not going to be usedrdX#libtorch1.8 torch::sigmoid is wrongreX0`tensordot` does check the dtype of empty tensorrfXQWrite decomposition conditionals in a way that leads to simpler shape expressionsrgXE`torch.scatter_add` will succeed when the `index` is a complex tensorrhXTfast `gradcheck` fails when outputs that do not require grad precede outputs that doriX:torch.ops.aten.ceil(1.5) returns Tensor rather than scalarrjX\[primTorch] Reduction references don't return views consistent with their original operatorsrkXG[RFC] Allow device override during Tensor unpickling without torch.loadrlX'Don't populate f_locals to check guardsrmX?Undefined symbol error when compiling and loading C++ extensionrnX+Improve the overall design of MPSGraphCacheroX-Allow users to express fused matmul/bias/relurpX9Move the MPSGuardImpl to inherit from 
NoOpDeviceGuardImplrqXVnn.functional.pad accepts bool values but raises internal assert when converted to JITrrXtorch.cholesky has been deprecated in favour of torch.linalg.cholesky. However, torch.cholesky_inverse remains as is. It should also be moved to torch.linalgrsX#Automate cleanup of header includesrtX%SymInt shouldn't be in dynamic_type.hruXA somewhat cryptic error message (for newcomers) - "Cannot re-initialize CUDA in forked subprocess" - report and suggestion for a possible solutionrvX"FSDP: ability to ignore parametersrwXDistributed Weighted Sampler.rxX*Add Tensor compare support for MPS backendryX#[FSDP] `ignored_modules` follow-upsrzX4torch.randperm uses too much cpu, but not efficient.r{X0__name__ on OpOverload should not contain periodr|XRbroadcast_object_list with GPU tensors can lead to deadlock on PyTorch CI machinesr}XMUnable to continue adding modules to `nn.Sequential` after using `del` methodr~X7Incorrect documentation in ``gumble_softmax`` function.rX2Building from source results in broken __version__rX.ENORMOUS OVERHEAD from mp.get_context('spawn')rXAPeak GPU-memory usage extremely huge when sorting with torch.sortrX?Stop calling sizes/numel/dim/is_contiguous on undefined tensorsrXDtorch.stack test_conj_view and test_neg_view are failing after 77043rX`Kill use of TensorImpl::ShareExternalPointer in torch/csrc/jit/tensorexpr/external_functions.cpprX$Where is fx2trt fx to tensorrt tool?rXDisplay EC2 informationrXI[bug] `NATIVE` and `OMP` `parallel_for` implementations are inconsistent.rXODISABLED test_comprehensive_linalg_ldl_factor_ex_cuda (__main__.TestDecompCUDA)rXNWhen using Rsqrt, the output of the 1/x process is very likely to have nan/infrXOWhen using Lambda, the output of the 1/x process is very likely to have nan/infrXEGlobalAvgPool2d causes the inconsistency of output between frameworksrXCGaussianNoise causes the inconsistency of output between frameworksrX?ReduceSum causes the inconsistency of output between frameworksrXBprimTorch 
references don't handle scalar x scalar inputs correctlyrX>Private API for accessing all "internal" attributes on TensorsrX8NVFuser opinfos - check for CudaFusionGroup in the graphrXGCan't pickle model torch._C._distributed_c10d.ProcessGroupNCCL' object rXb[RFC] Upstream current implementation of ssd_offload from fairscale FSDP to Torch Distributed FSDPrX#Avoid Self-loops on Module CreationrXAPytorch return TCPStore( RuntimeError: Connection reset by peer) rXXtorch.nn.functional.linear sometimes incorrectly accepts arguments of the different typerX6Unwanted behavior with some in-place operations on CPUrXmMultiprocessing DataLoader hangs on exception inside iterator when using a simple queue and a producer threadrX_EmbeddingBag: Does CUDA calculate error in EmbeddingBag forward when include_last_offset=True ?rX?RecursionError when running torch.jit.script inside JitTestCaserXGPrimTorch binary refs do not handle CUDA + CPU scalar tensors correctlyrX<Object-base collectives create tensors at unexpected devicesrX*Feature requests for optimizer overlappingrX\Inconsistent results between Pow and Float Pow with their numpy references for complex typesrX:Unify torch.ops argument parsing code with PythonArgParserrXWindows CUDA TTS tracking taskrX-TYPEIGNORE lint run locally disagrees with CIrXsThere is a bug with latest stable torch version and the following Nightly versions related to `optimize_for_mobile`rXEtorch.Tensor.__rdiv__ long x scalar float type promotion is incorrectrXWtorch.add bool x bool allows integer alpha, inconsistent with other dtype type checkingrX@`gradcheck` for `torch.solve` may trigger INTERNAL ASSERT FAILEDrXb`cumprod, prod` will backward fail if `dtype` argument is different than the dtype of input tensorrXY`addr, baddmm, dist, l1_loss` will backward fail when input tensors have different dtypesrX#`gradcheck` fails for `torch.trace`rX0`gradcheck` should support the comparison of NaNrX+Strange warning from `matmul(..., out=...)`rX`torch.addmv` 
backward failsrXV[JIT] Infinite RecursionError with self-referential models (also affects `__repr__`)!!rX(Support Positional-only Arguments in JITrX0[typing] distribution.lazy_property is not typedrXHhanging process with init_process_group(backend='mpi') cannot be killed rXQuantization in LibtorchrX<Default repr of __get__ methods in __torch_function__ is badrX(Bug in dataloader iterator found by mypyrX/[checkpoint] Stable file format for checkpointsrX<[checkpoint] Handle overlapping storage during save and loadrX'Supporting torch.tensor.apply_ over GPUrX4__torch_function__ callers should always pass kwargsrX:test_python_reference_meta_functions takes too long to runrXAdding Polyloss to `torch`rX5[JIT] magic methods do not work after reloading modelrX2Torchscript model Runtime Error after quantizationrX`reshape` for distributions.rX7Torch `x += y.bmm(z)` is faster than `x.baddbmm_(y, z)`rX.Performance bad on ARM AArch64 for PyTorch C++rX`[torch::deploy] Remove `manager_` from the constructor and deconstructor of `InterpreterSession`rX:[torch::deploy] move create_movable to interpreter_managerrX6[torch::deploy] remove reliance on manager_ for unloadrX2[torch::deploy] Remove manager_ from AquireSessionrX2C++ CUDA assign existing memory to forward method.rXARemove all docstrings when python is running in optimization moderX6Clarify dependency on NumPy (related to maskedtensor?)rX:[feature request] no-param sort to exploit parallelizationrXW`torch.sort` does not exploit parallelization when invoked without the `dim` parameter.rXAdd `ldl_unpack` functionalityrX0`torch.nn.HuberLoss` backwards unexpectedly failrX4`torch.smm` backward fail with strange error messagerX%Pytorch can't process special unicoderX.Design API for accessing sparse tensor indicesrX0WeightNorm: Add reset_parameters Linear overriderX4RestrictPtrTraits in CUDA potentially has no effect.rX7`Tensor.logit`'s signature in doc misses `eps` argumentrXpylint segfaultrXQImprove error message for `unfold` when 
generating tensor with negative dimensionrXMeanVarianceNormalizationrXWOpInfo incorrectly advertises lu_solve support on CUDA even when compiled without magmarX/OpInfo CUDA bfloat16 support detection is buggyrXwAttributeError: '_thread._local' object has no attribute 'rel_tol' (cannot use TestCase.assertEqual from other threads)rXsupport FSDP with AMPrXG`torch.linalg.cond` has different results for tensor requiring autogradrXRandom Generator for DropoutrXM[JIT][Autocast] Batchnorm folding pass during freezing doesn't preserve typesrX?torch.unique() nondeterministic behavior on nan inputs (on GPU)rXE[jiterator] perf regression when jiterating few ops for complex dtyperX^Pip packaging and publishing improvements in pytorch wheels for better integration with poetryrX>[numpy] Missing Tensor-Scalar support for multiple binary ops rXBetter handling for execrXSegfault in ~PyFunctionPreHookrXKExpose the compute function that supports the output stride at the frontendrX1Add ability to add custom suffixes to tensor reprrX5Discrepancy in einsum when done in batch vs non-batchrXENon target rank receives result of 'reduce' op when backend is 'gloo'rXG`torch.clamp` does not distribute gradients as element-wise`min/max` dorXBFix layout of masked output when all sparse dimensions are reducedrX$Clean up PyTorch's private operatorsrX@Make Torch FX function `_torchscript_type_to_python_type` publicrX+`Tensor.register_hook()` Source Link BrokenrX-[JIT] make IRAttributeError extend jit::ErrorrX1[NVFuser] Automated generation of microbenchmarksrX=`F.interpolate` uses incorrect size when `align_corners=True`rXC[Tracer] RuntimeError: _Map_base::at when tracing fake quantizationrX<Expand pow and float_pow sampling function for more coveragerXPyre type checking fails rX1Add post-AccumulateGrad hook as a nice public APIrX=[checkpoint] Extension hooks to support logging and telemetryrX,Enhance _verify_param_shape_across_processesrX7[checkpoint] Switch away from pickle-base 
serializationrXU[NHWC] Extend the lowering function to support explicitly defining the output stridesrXH[NHWC] Propagate the channels-last memory layout within the fusion grouprXT[NHWC] Optimize channels-last contiguous tensor vectorization by dimension collapse.rXNRemove _log_softmax/_softmax in favor of log_softmax and softmax respectively.rXGTorchFunction handling and overload resolution very slow in `torch.ops`rX5[checkpoint] SPMD distributed checkpoint coordinationrX2Error in DistributedDataParallel with 'CPU' devicerXU[checkpoint] Make prepare_sharded_tensor_read and prepare_sharded_tensor_write publicrXRSuggestion to throw a UserWarning when a user forgot .eval() mode during inferencerXU[ONNX] Intermediate values are encoded when exporting operators with custom namespacerX>onnx export fails when using torchvision.transforms.CenterCroprXZWhether 'targetSize' in inferExpandGeometryImpl needs to be checked when it is less than 0rX NVFuser failing extremal opinfosrXL`index_select` allows negative `index` for sparse but not for strided `self`rX2[ONNX] Use topk to export max(dim,keepdim) to onnxrX?torch.bucketize doc typo on the left boundary when 'right=True'rX8[checkpoint] Avoid loading whole tensor when resharding rXM[checkpoint] Add extension points to avoid the default serialization behaviorrX>[checkpoint] Support models with different cross-rank metadatarX1[checkpoint] Use fsspec to support object storagerXBessel and Related FunctionsrX%[Checkpoint] Add module documentationrXEMore understandable name column of the table of the profiling result.rX9Quantization fails when padding parameter given as stringr X2Unexpected _LinAlgError appeared only on my devicer XETorchScript: jit.script fails when using 'tolist' with int32 tensors.r Xdtorch._remove_batch_dim is interceptable by __torch_function__ / batch tensors don't print correctlyr X-Multi-GPU distributed training reports errorsr X9torch.elastic fails to shutdown despite crashed 
processesrXb`torch.cuda.amp.GradScaler` may skip parameter synchronization required by post localSGD optimizerrX>[ONNX] About custom operator convert PreciseRoIPooling to ONNXrX5Initial integration of ZenDNN as backend into PyTorchrX.Observing a strange behavior - Row parallelismrXRuntimeError: bucket_count == per_bucket_sizes.size()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1646755853042/work/torch/csrc/distributed/c10d/reducer.cpp":980, please report a bug to PyTorch. rXUpdate NCCL to 2.12rXT[Feature request] Exclusive prefix sum, `torch.cumsum(input, dim=0, exclusive=True)`rXCfx: cannot find module when using apex.amprX=Ensure custom Function are correct in double backward settingrX.Allow `torch.fx` tracing on TorchScript modelsrXLIndexing assignment can have no effect on CUDA with deterministic algorithmsrX3Many dispatch keys do not print to string correctlyrX at::real and at::imag as methodsrX*test_wishart_log_prob fails locally for merXWReplace `RuntimeError` by custom exception for unsupported ONNX operators during exportrX>Coverage test is only checking packages and not all submodulesrX8test_license_for_wheel always fails on my local dev copyrX,run_test.py option to write out failed testsr X%Deprecation warning from SequentialLRr!XRuntimeError: [enforce fail at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\caffe2\serialize\inline_container.cc:300]r"XQAllow any operation that takes a Storage to also take a contiguous Tensor insteadr#X;Failed to build on Ubuntu 18.04 due to bad MPI linker flagsr$X*Some test failed when running in parallel.r%X-Eliminate uses of deprecated `FindCUDA.cmake`r&X-HIPFFT_EXEC_FAILED when using AMD GPU run FFTr'X:NVFuser microbenchmark classifier - hash on memory formatsr(X<`init_process_group` hanging on HPC multi-node system w GPU r)X!NNC failing opinfo accuracy testsr*XLRuntimeError: bucket_count == per_bucket_sizes.size() INTERNAL ASSERT FAILEDr+XdSSL certificate error: urlopen error [SSL: 
WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1091)>r,X[RFC] NPU device for PyTorchr-X-__torch_function__ and generator input hazardr.X2Computer using CPU instead of GPU nvidia with CUDAr/X"Dirichlet with small concentrationr0XLMobile assets upload could break third party mirrors due to binary data sizer1X4A bug in instructions for building PyTorch with ASANr2XLJit torchscript for prediction is missing 'forward' when using forward hooksr3XRNumerical instability: matrix multiplication got different results on cpu and gpu r4X>The prediction results of different equipment are inconsistentr5XDtest_jit.py TestWarn.test_warn and friends doesn't work under pytestr6Xjtorch.nn.LayerNorm is very slow on GPU (much slower than a custom LayerNorm version in the ConvNext model)r7X'backcompat tests in test_nn.py are slowr8XIBuild a default NVFuser comparison callback, e.g. for use with torchbenchr9X$gql_mocks.json has really long linesr:XrDISABLED test_zero_model_parallel_parameters_as_bucket_view_True (__main__.TestZeroRedundancyOptimizerDistributed)r;X8API to determine if a torch.return_type is a "structseq"r<XAdd build support for GCC 11.2r=Xjit/_trace.py", line 71, in _unique_state_dict filtered_dict[k] = v.detach() AttributeError: 'torch.dtype' object has no attribute 'detach'r>Xl[JIT] [Autocast] JIT Autocast Pass operations' list should be extendable and consistent with imperative pathr?X:Potential memory leak in Adam optimizer in AMD chips (CPU)r@X9FSDP remove the requirement of all trainable parameters rAXAdd nesting of nested TensorrBX-AllGather with backward support async_op=TruerCXEtorch.jit.trace error when custom autograd function used in the modelrDX.Disable TracerWarnings on NVFuser opinfo testsrEX1autogen-58 microbenchmark fails on NNC gpu fusionrFX<aten::_softmax.out doesn't work with non-contiguous Tensors rGXkinteraction with psychopy during imports, script exits with: free(): invalid pointer. 
Aborted (core dumped)rHXi'python setup.py build' failed but succeed using 'pip install -v .' which calls 'python setup.py build'.rIX"[FSDP] Verify buffer checkpointingrJX.Add batching rules for `{view}_copy` operatorsrKX3Move _SKIP_PYTHON_BINDINGS to native_functions.yamlrLXKtorch.jit.script'd function very slow on first invocation on latest nightlyrMX'add -D_GLIBCXX_ASSERTIONS in debug moderNX2INTERNAL ASSERT FAILED at "vulkan_rewrite.cpp":272rOX8LayerNorm and GroupNorm with num_groups=1 not equivalentrPXAFix workaround `__module__` used to appease public binding checksrQXDifferent result with JITrRXh`torch.jit.script` Script functions do return `requires_grad = False` if `torch.no_grad()` has been usedrSXLExpected quantizer->qscheme() == kPerTensorAffine to be true, but got false.rTX[`torch.matmul` produces wrong results on A4000 for matrices (n*m) with large m and small n rUX8Handle noncontiguous inputs in distributed backend layerrVX3[Autograd] Queued Callback Does Not Propagate ErrorrWX@Depthwise Conv1d performance (a naive CUDA kernel is 10x faster)rXXJLarge numerical error when applying nn.Linear in RTX A6000 with cuda>=11.1rYXtorch.device missing doctringrZXC`torch.sum, prod, cumsum, cumprod, sparse.sum` INTERNAL ASSERT FAILr[XcWarning originating in C10 backend does not get translated to Python warning if run from subprocessr\X<Support batch indexing with sparse tensors with torch.sparser]X,Let's host NVIDIA dependencies in our own S3r^X&Einsum should have an `out=` parameterr_XIAddressing skips in OpInfo nn.functional.binary_cross_entropy_with_logitsr`XMTensorboard Issue with visualizing the connections of encoder-decoder networkraX#Implement histc for bfloat16 on CPUrbX#Off main thread symbolic evaluationrcX>multiprocessing and torch.tensor, Cannot allocate memory errorrdX-Misleading documentation for cholesky_inversereX.1.11.0 distribution train different with 1.8.1rfX+`jit(Function)` results in double executionrgX8jit fails when trying to assign values 
to model via hookrhX<Op segfaults with ForwardAD and Subclassed Tensor as TangentriX<Navi 21 GPU hang when passing wrong input to embedding layerrjXJ[ONNX] Enable stacktrace print for TORCH_INTERNAL_ASSERT errors in export.rkXI[ONNX] Support unit tests in scripting that we already support in tracingrlXkthvalue 20x slower than sort rmXAdd ZeroTensor support for `mm`rnX$Add `balance` flag to `random_split`roXiCannot use socks5h proxy because of urllib: `urlopen error Remote end closed connection without response`rpe.PKUo??PK=saved_corpus/byteorderFB9ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZlittlePK=PK8saved_corpus/versionFB4ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ3 PKўgUPK#-saved_corpus/.data/serialization_idFB)ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ0361162702521463308600000023439188243573PK$((PKUo??saved_corpus/data.pklPK=saved_corpus/byteorderPKўgUsaved_corpus/versionPK$((#saved_corpus/.data/serialization_idPK,-8PKRPK8