Explicit horizontal fusion with foreach_map and torch.compile
Author: Michael Lazos
Horizontal fusion is a key optimization in ML compilers. In eager mode, it is typically expressed using the torch._foreach* ops, which parallelize operations across a list of tensors. However, supporting every possible combination of arguments (for example, mixtures of scalars and lists) is quite difficult. foreach_map allows conversion of any pointwise op in torch to a horizontally fused foreach variant. In this tutorial, we will demonstrate how to implement the Adam optimizer with foreach_map to generate a fully fused kernel.
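To make the idea concrete, the following is a minimal pure-Python sketch of the mapping semantics described above: a pointwise function is applied element by element across aligned lists, and any scalar argument is repeated for every element. The name foreach_map_reference and the helper scaled_add are hypothetical and exist only for illustration; this is not the PyTorch implementation and it performs no fusion, which in the real feature comes from torch.compile.

```python
import itertools
import operator


def foreach_map_reference(op, *operands):
    """Hypothetical reference semantics: map `op` over aligned list elements."""
    has_list = any(isinstance(arg, (list, tuple)) for arg in operands)
    if not has_list:
        # No list operands: simply call the op once.
        return op(*operands)
    # Scalars are repeated so that they pair with every list element.
    columns = [
        arg if isinstance(arg, (list, tuple)) else itertools.repeat(arg)
        for arg in operands
    ]
    return [op(*row) for row in zip(*columns)]


def scaled_add(x, y, alpha):
    # A pointwise function mixing list elements (x, y) with a scalar (alpha).
    return x + alpha * y


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]

summed = foreach_map_reference(operator.add, xs, ys)
mixed = foreach_map_reference(scaled_add, xs, ys, 0.5)
print(summed)
print(mixed)
```

The mix of list and scalar arguments in the second call is the case that is hard to cover with a fixed set of hand-written foreach ops.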
Note
This recipe describes a prototype feature. Prototype features are typically at an early stage for feedback and testing and are subject to change.
Prerequisites
PyTorch v2.7.0 or later
Model Setup
For this example, we’ll use a simple sequence of linear layers. We instantiate an independent copy to compare the two optimizer implementations.
import torch

# exit cleanly if we are on a device that doesn't support ``torch.compile``
if torch.cuda.get_device_capability() < (7, 0):
    print("Exiting because torch.compile is not supported on this device.")
    import sys
    sys.exit(0)

# Create simple model
model = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
model_copy = torch.nn.Sequential(
    *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
)
input = torch.rand(1024, device="cuda")

# run forward pass
output = model(input)
output_copy = model_copy(input)

# run backward to populate the grads for our optimizer below
output.sum().backward()
output_copy.sum().backward()
Helper functions for foreach_map implementation
In this section, we’ll begin our implementation of the Adam optimizer.
from torch._higher_order_ops.foreach_map import foreach_map

# Helper function to extract optimizer states from a torch.optim.Adam instance
def get_inputs(optim):
    steps = []
    params = []
    grads = []
    exp_avgs = []
    exp_avg_sqs = []
    for group in optim.param_groups:
        for p in group["params"]:
            params.append(p)
            grads.append(p.grad)
            state = optim.state[p]
            exp_avgs.append(state["exp_avg"])
            exp_avg_sqs.append(state["exp_avg_sq"])
            steps.append(state["step"])

    return steps, params, exp_avgs, exp_avg_sqs


# Functions to update the different optimizer states
def apply_weight_decay(grad, param, weight_decay):
    # L2 weight decay: grad + weight_decay * param
    return grad.add(param, alpha=weight_decay)

def update_exp_avg_sq(exp_avg_sq, grad, beta2):
    return exp_avg_sq.mul(beta2).addcmul(grad, grad, value=1 - beta2)

def update_param(param, step, exp_avg, exp_avg_sq, beta1, beta2, lr, eps):
    bias_correction1 = 1 - torch.pow(beta1, step)
    bias_correction2 = (1 - torch.pow(beta2, step)).sqrt()
    step_size = (lr / bias_correction1).neg()
    denom = (exp_avg_sq.sqrt() / (bias_correction2 * step_size)).add(eps / step_size)
    return torch.add(param, torch.div(exp_avg, denom))

# Our full Adam implementation
def foreach_map_adam(
    steps,
    params,
    exp_avgs,
    exp_avg_sqs,
    weight_decay=0,
    beta1=0.9,
    beta2=0.999,
    lr=1e-3,
    eps=1e-8,
):
    with torch.no_grad():
        grads = [param.grad for param in params]
        # update step
        updated_steps = foreach_map(lambda x: x + 1, steps)
        torch._foreach_copy_(steps, updated_steps)

        if weight_decay != 0:
            # The original call passed only the grads to torch.add and discarded
            # the result; keep the decayed grads so the updates below use them.
            grads = foreach_map(apply_weight_decay, grads, params, weight_decay)

        # Higher-order operators (HOPs) cannot have multiple outputs at the moment
        # need to call foreach_map once for each output
        exp_avgs_updated = foreach_map(torch.lerp, exp_avgs, grads, 1 - beta1)
        exp_avgs_sq_updated = foreach_map(update_exp_avg_sq, exp_avg_sqs, grads, beta2)
        params_updated = foreach_map(
            update_param,
            params,
            steps,
            exp_avgs_updated,
            exp_avgs_sq_updated,
            beta1,
            beta2,
            lr,
            eps,
        )
        # Higher-order operators (HOPs) don't support input mutation today
        # so manually update the states in-place
        torch._foreach_copy_(exp_avgs, exp_avgs_updated)
        torch._foreach_copy_(exp_avg_sqs, exp_avgs_sq_updated)
        torch._foreach_copy_(params, params_updated)
    return
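The arithmetic in update_param can be checked against the textbook Adam update using plain Python floats. The sketch below is an illustration under assumptions: it works on single scalar values with made-up numbers, and the names lerp, adam_textbook, and adam_rearranged are hypothetical helpers, not part of PyTorch. It suggests that negating step_size and folding it into denom yields the same parameter as the usual param - (lr / bias_correction1) * m / (sqrt(v) / sqrt(bias_correction2) + eps) form.

```python
import math


def lerp(start, end, weight):
    # Linear interpolation: start + weight * (end - start)
    return start + weight * (end - start)


def adam_textbook(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard formulation of one Adam step on scalars.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    bc1 = 1 - beta1 ** step
    bc2 = 1 - beta2 ** step
    denom = math.sqrt(v) / math.sqrt(bc2) + eps
    return param - (lr / bc1) * m / denom, m, v


def adam_rearranged(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Same step, written in the order used by the helper functions above.
    m = lerp(m, grad, 1 - beta1)
    v = v * beta2 + (1 - beta2) * grad * grad
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = math.sqrt(1 - beta2 ** step)
    step_size = -(lr / bias_correction1)
    denom = math.sqrt(v) / (bias_correction2 * step_size) + eps / step_size
    return param + m / denom, m, v


# Made-up scalar state for illustration only.
state = dict(param=0.5, grad=0.2, m=0.01, v=0.0004, step=3)
expected = adam_textbook(**state)
actual = adam_rearranged(**state)
print(expected)
print(actual)
```

Writing the update as a single add of exp_avg / denom keeps every operation pointwise, which is what lets each call to foreach_map be expressed as one mapped function.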
Setting up and running the compiled kernel
In this section, we’ll run our Adam optimizer and compare the results.
Note
torch.compile is only supported on CUDA devices that have a compute capability of 7.0 or higher.
opt_eager = torch.optim.Adam(model.parameters(), lr=torch.tensor(0.01))
opt_eager_copy = torch.optim.Adam(model_copy.parameters(), lr=torch.tensor(0.01))

# warm up the optimizer state dict
opt_eager.step()
opt_eager_copy.step()

inputs = get_inputs(opt_eager_copy)
compiled_adam = torch.compile(foreach_map_adam)

# optionally view the output code
torch._logging.set_logs(output_code=True)

# Warmup runs to compile the function
for _ in range(5):
    opt_eager.step()
    compiled_adam(*inputs)

# Compare the parameters of the two optimizers. The result is computed but not
# asserted; note that model and model_copy are initialized independently.
for eager_p, compile_p in zip(opt_eager.param_groups[0]["params"], opt_eager_copy.param_groups[0]["params"]):
    torch.allclose(eager_p, compile_p)

# Benchmark performance

# Let's define a helpful benchmarking function:
import torch.utils.benchmark as benchmark

def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
    )
    return t0.blocked_autorange().mean * 1e6

eager_runtime = benchmark_torch_function_in_microseconds(opt_eager.step)
compiled_runtime = benchmark_torch_function_in_microseconds(lambda: compiled_adam(*inputs))

assert eager_runtime > compiled_runtime

print(f"eager runtime: {eager_runtime}us")
print(f"compiled runtime: {compiled_runtime}us")
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] Output code: V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] # AOT ID: ['0_inference'] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] import torch V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] import math V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] import random V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] import os V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] import tempfile V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from math import inf, nan V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from cmath import nanj V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.utils import maybe_profile V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch import device, empty_strided V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.async_compile import AsyncCompile V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton.language as tl V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime.triton_heuristics import start_graph, end_graph V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] aten = torch.ops.aten V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] inductor_ops = torch.ops.inductor V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] _quantized = torch.ops._quantized V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] async_compile = 
AsyncCompile() V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] # kernel path: /tmp/torchinductor_ci-user/ej/cejr7t4zzqo7llcoxga7clgyc6gs3676lsm4dvilpfw64kudp2ns.py V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] # Unsorted Source Nodes: [], Original ATen: [] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] # Source node to ATen node mapping: V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] triton_for_fused_0 = async_compile.triton('triton_for_fused_0', ''' V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] import triton.language as tl V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime import triton_helpers, triton_heuristics V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime.triton_helpers import libdevice, math as tl_math V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, DeviceProperties V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] @triton_heuristics.foreach( V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_warps=8, V0707 22:50:04.702000 
29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] triton_meta={'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'in_ptr2': '*fp32', 'in_ptr3': '*fp32', 'in_ptr4': 'fp32', 'in_ptr5': '*fp32', 'in_ptr6': '*fp32', 'in_ptr7': '*fp32', 'in_ptr8': '*fp32', 'in_ptr9': 'fp32', 'in_ptr10': '*fp32', 'in_ptr11': '*fp32', 'in_ptr12': '*fp32', 'in_ptr13': '*fp32', 'in_ptr14': 'fp32', 'in_ptr15': '*fp32', 'in_ptr16': '*fp32', 'in_ptr17': '*fp32', 'in_ptr18': '*fp32', 'in_ptr19': 'fp32', 'in_ptr20': '*fp32', 'in_ptr21': '*fp32', 'in_ptr22': '*fp32', 'in_ptr23': '*fp32', 'in_ptr24': 'fp32', 'in_ptr25': '*fp32', 'in_ptr26': '*fp32', 'in_ptr27': '*fp32', 'in_ptr28': '*fp32', 'in_ptr29': 'fp32', 'in_ptr30': '*fp32', 'in_ptr31': '*fp32', 'in_ptr32': '*fp32', 'in_ptr33': '*fp32', 'in_ptr34': 'fp32', 'in_ptr35': '*fp32', 'in_ptr36': '*fp32', 'in_ptr37': '*fp32', 'in_ptr38': '*fp32', 'in_ptr39': 'fp32', 'in_ptr40': '*fp32', 'in_ptr41': '*fp32', 'in_ptr42': '*fp32', 'in_ptr43': '*fp32', 'in_ptr44': 'fp32', 'in_ptr45': '*fp32', 'in_ptr46': '*fp32', 'in_ptr47': '*fp32', 'in_ptr48': '*fp32', 'in_ptr49': 'fp32', 'out_ptr6': '*fp32', 'out_ptr7': '*fp32', 'out_ptr8': '*fp32', 'out_ptr15': '*fp32', 'out_ptr16': '*fp32', 'out_ptr17': '*fp32', 'out_ptr24': '*fp32', 'out_ptr25': '*fp32', 'out_ptr26': '*fp32', 'out_ptr33': '*fp32', 'out_ptr34': '*fp32', 'out_ptr35': '*fp32', 'out_ptr42': '*fp32', 'out_ptr43': '*fp32', 'out_ptr44': '*fp32', 'out_ptr51': '*fp32', 'out_ptr52': '*fp32', 'out_ptr53': '*fp32', 'out_ptr60': '*fp32', 'out_ptr61': '*fp32', 'out_ptr62': '*fp32', 'out_ptr69': '*fp32', 'out_ptr70': '*fp32', 'out_ptr71': '*fp32', 'out_ptr78': '*fp32', 'out_ptr79': '*fp32', 'out_ptr80': '*fp32', 'out_ptr87': '*fp32', 'out_ptr88': '*fp32', 'out_ptr89': '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, multi_processor_count=80, cc=86, major=8, regs_per_multiprocessor=65536, max_threads_per_multi_processor=1536, warp_size=32), 'constants': {}, 'configs': [{(0,): 
[['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]], (17,): [['tt.divisibility', 16]], (18,): [['tt.divisibility', 16]], (20,): [['tt.divisibility', 16]], (21,): [['tt.divisibility', 16]], (22,): [['tt.divisibility', 16]], (23,): [['tt.divisibility', 16]], (25,): [['tt.divisibility', 16]], (26,): [['tt.divisibility', 16]], (27,): [['tt.divisibility', 16]], (28,): [['tt.divisibility', 16]], (30,): [['tt.divisibility', 16]], (31,): [['tt.divisibility', 16]], (32,): [['tt.divisibility', 16]], (33,): [['tt.divisibility', 16]], (35,): [['tt.divisibility', 16]], (36,): [['tt.divisibility', 16]], (37,): [['tt.divisibility', 16]], (38,): [['tt.divisibility', 16]], (40,): [['tt.divisibility', 16]], (41,): [['tt.divisibility', 16]], (42,): [['tt.divisibility', 16]], (43,): [['tt.divisibility', 16]], (45,): [['tt.divisibility', 16]], (46,): [['tt.divisibility', 16]], (47,): [['tt.divisibility', 16]], (48,): [['tt.divisibility', 16]], (50,): [['tt.divisibility', 16]], (51,): [['tt.divisibility', 16]], (52,): [['tt.divisibility', 16]], (53,): [['tt.divisibility', 16]], (54,): [['tt.divisibility', 16]], (55,): [['tt.divisibility', 16]], (56,): [['tt.divisibility', 16]], (57,): [['tt.divisibility', 16]], (58,): [['tt.divisibility', 16]], (59,): [['tt.divisibility', 16]], (60,): [['tt.divisibility', 16]], (61,): [['tt.divisibility', 16]], (62,): [['tt.divisibility', 16]], (63,): [['tt.divisibility', 16]], (64,): [['tt.divisibility', 16]], (65,): [['tt.divisibility', 16]], (66,): [['tt.divisibility', 16]], (67,): [['tt.divisibility', 16]], (68,): [['tt.divisibility', 16]], (69,): 
[['tt.divisibility', 16]], (70,): [['tt.divisibility', 16]], (71,): [['tt.divisibility', 16]], (72,): [['tt.divisibility', 16]], (73,): [['tt.divisibility', 16]], (74,): [['tt.divisibility', 16]], (75,): [['tt.divisibility', 16]], (76,): [['tt.divisibility', 16]], (77,): [['tt.divisibility', 16]], (78,): [['tt.divisibility', 16]], (79,): [['tt.divisibility', 16]]}]}, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] inductor_meta={'grid_type': 'SequentialComboKernelGrid', 'combo_grid_meta': {'num_kernels': 10, 'min_blocks': 0, 'default_config': {'XBLOCK': 1024}, 'no_x_dim_0': False, 'xnumel_0': 1048576, 'no_x_dim_1': False, 'xnumel_1': 1048576, 'no_x_dim_2': False, 'xnumel_2': 1048576, 'no_x_dim_3': False, 'xnumel_3': 1048576, 'no_x_dim_4': False, 'xnumel_4': 1048576, 'no_x_dim_5': False, 'xnumel_5': 1048576, 'no_x_dim_6': False, 'xnumel_6': 1048576, 'no_x_dim_7': False, 'xnumel_7': 1048576, 'no_x_dim_8': False, 'xnumel_8': 1048576, 'no_x_dim_9': False, 'xnumel_9': 1048576}, 'kernel_name': 'triton_for_fused_0', 'mutated_arg_names': ['in_ptr1', 'in_ptr11', 'in_ptr12', 'in_ptr13', 'in_ptr16', 'in_ptr17', 'in_ptr18', 'in_ptr2', 'in_ptr21', 'in_ptr22', 'in_ptr23', 'in_ptr26', 'in_ptr27', 'in_ptr28', 'in_ptr3', 'in_ptr31', 'in_ptr32', 'in_ptr33', 'in_ptr36', 'in_ptr37', 'in_ptr38', 'in_ptr41', 'in_ptr42', 'in_ptr43', 'in_ptr46', 'in_ptr47', 'in_ptr48', 'in_ptr6', 'in_ptr7', 'in_ptr8', 'out_ptr15', 'out_ptr16', 'out_ptr17', 'out_ptr24', 'out_ptr25', 'out_ptr26', 'out_ptr33', 'out_ptr34', 'out_ptr35', 'out_ptr42', 'out_ptr43', 'out_ptr44', 'out_ptr51', 'out_ptr52', 'out_ptr53', 'out_ptr6', 'out_ptr60', 'out_ptr61', 'out_ptr62', 'out_ptr69', 'out_ptr7', 'out_ptr70', 'out_ptr71', 'out_ptr78', 'out_ptr79', 'out_ptr8', 'out_ptr80', 'out_ptr87', 'out_ptr88', 'out_ptr89'], 'backend_hash': '1E2C16421D4C3DBA4AD92BFC4278A3CB24C43DEDA6EE7FF9E3FBB1DBB80802DB', 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 
'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False}, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] ) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] @triton.jit V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] def triton_for_fused_0(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, in_ptr8, in_ptr9, in_ptr10, in_ptr11, in_ptr12, in_ptr13, in_ptr14, in_ptr15, in_ptr16, in_ptr17, in_ptr18, in_ptr19, in_ptr20, in_ptr21, in_ptr22, in_ptr23, in_ptr24, in_ptr25, in_ptr26, in_ptr27, in_ptr28, in_ptr29, in_ptr30, in_ptr31, in_ptr32, in_ptr33, in_ptr34, in_ptr35, in_ptr36, in_ptr37, in_ptr38, in_ptr39, in_ptr40, in_ptr41, in_ptr42, in_ptr43, in_ptr44, in_ptr45, in_ptr46, in_ptr47, in_ptr48, in_ptr49, out_ptr6, out_ptr7, out_ptr8, out_ptr15, out_ptr16, out_ptr17, out_ptr24, out_ptr25, out_ptr26, out_ptr33, out_ptr34, out_ptr35, out_ptr42, out_ptr43, out_ptr44, out_ptr51, out_ptr52, out_ptr53, out_ptr60, out_ptr61, out_ptr62, out_ptr69, out_ptr70, out_ptr71, out_ptr78, out_ptr79, out_ptr80, out_ptr87, out_ptr88, out_ptr89): V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid = tl.program_id(0) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] XBLOCK: tl.constexpr = 1024 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_0 = tl.cdiv(1048576, XBLOCK) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_2 = num_xblocks_1 + tl.cdiv(1048576, XBLOCK) V0707 
22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_3 = num_xblocks_2 + tl.cdiv(1048576, XBLOCK) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_4 = num_xblocks_3 + tl.cdiv(1048576, XBLOCK) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_5 = num_xblocks_4 + tl.cdiv(1048576, XBLOCK) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_6 = num_xblocks_5 + tl.cdiv(1048576, XBLOCK) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_7 = num_xblocks_6 + tl.cdiv(1048576, XBLOCK) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_8 = num_xblocks_7 + tl.cdiv(1048576, XBLOCK) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] num_xblocks_9 = num_xblocks_8 + tl.cdiv(1048576, XBLOCK) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] if pid < num_xblocks_0: V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x0 = xindex V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp5 = tl.load(in_ptr0 + (x0), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] 
[__output_code] tmp6 = tl.load(in_ptr1 + (x0), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp11 = tl.load(in_ptr2 + (x0), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp18 = tl.load(in_ptr3 + (x0), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp20 = in_ptr4 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp0 = 0.09999999999999998 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp1 = 0.5 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp2 = tmp0 >= tmp1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp3 = -0.9 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp4 = tl.where(tmp2, tmp3, tmp0) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp7 = tmp5 - tmp6 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp8 = tmp4 * tmp7 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp9 = tl.where(tmp2, tmp5, tmp6) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp10 = tmp8 + tmp9 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp12 = 0.999 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp13 = tmp11 * tmp12 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp14 = 0.0010000000000000009 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp15 = tmp5 * tmp14 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp16 = tmp15 * tmp5 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp17 = tmp13 + tmp16 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] 
[__output_code] tmp19 = libdevice.sqrt(tmp17) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp21 = 1.0 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp22 = tmp20 + tmp21 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp23 = libdevice.pow(tmp12, tmp22) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp24 = tmp21 - tmp23 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp25 = libdevice.sqrt(tmp24) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp26 = 0.9 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp27 = libdevice.pow(tmp26, tmp22) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp28 = tmp21 - tmp27 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp29 = tl.full([1], 1, tl.int32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp30 = (tmp29 / tmp28) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp31 = 0.001 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp32 = tmp30 * tmp31 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp33 = -tmp32 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp34 = tmp25 * tmp33 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp35 = (tmp19 / tmp34) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp36 = (tmp29 / tmp33) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp37 = 1e-08 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp38 = tmp36 * tmp37 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp39 = 
tmp35 + tmp38 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp40 = (tmp10 / tmp39) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp41 = tmp18 + tmp40 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr6 + (x0), tmp41, None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr7 + (x0), tmp10, None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr8 + (x0), tmp17, None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_1: V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_0 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x1 = xindex V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp47 = tl.load(in_ptr5 + (x1), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp48 = tl.load(in_ptr6 + (x1), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp53 = tl.load(in_ptr7 + (x1), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp60 = tl.load(in_ptr8 + (x1), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] 
[__output_code] tmp62 = in_ptr9 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp42 = 0.09999999999999998 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp43 = 0.5 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp44 = tmp42 >= tmp43 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp45 = -0.9 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp46 = tl.where(tmp44, tmp45, tmp42) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp49 = tmp47 - tmp48 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp50 = tmp46 * tmp49 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp51 = tl.where(tmp44, tmp47, tmp48) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp52 = tmp50 + tmp51 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp54 = 0.999 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp55 = tmp53 * tmp54 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp56 = 0.0010000000000000009 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp57 = tmp47 * tmp56 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp58 = tmp57 * tmp47 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp59 = tmp55 + tmp58 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp61 = libdevice.sqrt(tmp59) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp63 = 1.0 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp64 = tmp62 + tmp63 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp65 = 
libdevice.pow(tmp54, tmp64)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp66 = tmp63 - tmp65
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp67 = libdevice.sqrt(tmp66)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp68 = 0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp69 = libdevice.pow(tmp68, tmp64)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp70 = tmp63 - tmp69
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp71 = tl.full([1], 1, tl.int32)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp72 = (tmp71 / tmp70)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp73 = 0.001
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp74 = tmp72 * tmp73
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp75 = -tmp74
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp76 = tmp67 * tmp75
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp77 = (tmp61 / tmp76)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp78 = (tmp71 / tmp75)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp79 = 1e-08
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp80 = tmp78 * tmp79
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp81 = tmp77 + tmp80
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp82 = (tmp52 / tmp81)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp83 = tmp60 + tmp82
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr15 + (x1), tmp83, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr16 + (x1), tmp52, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr17 + (x1), tmp59, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_2:
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_1
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x2 = xindex
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp89 = tl.load(in_ptr10 + (x2), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp90 = tl.load(in_ptr11 + (x2), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp95 = tl.load(in_ptr12 + (x2), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp102 = tl.load(in_ptr13 + (x2), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp104 = in_ptr14
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp84 = 0.09999999999999998
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp85 = 0.5
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp86 = tmp84 >= tmp85
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp87 = -0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp88 = tl.where(tmp86, tmp87, tmp84)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp91 = tmp89 - tmp90
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp92 = tmp88 * tmp91
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp93 = tl.where(tmp86, tmp89, tmp90)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp94 = tmp92 + tmp93
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp96 = 0.999
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp97 = tmp95 * tmp96
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp98 = 0.0010000000000000009
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp99 = tmp89 * tmp98
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp100 = tmp99 * tmp89
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp101 = tmp97 + tmp100
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp103 = libdevice.sqrt(tmp101)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp105 = 1.0
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp106 = tmp104 + tmp105
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp107 = libdevice.pow(tmp96, tmp106)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp108 = tmp105 - tmp107
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp109 = libdevice.sqrt(tmp108)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp110 = 0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp111 = libdevice.pow(tmp110, tmp106)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp112 = tmp105 - tmp111
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp113 = tl.full([1], 1, tl.int32)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp114 = (tmp113 / tmp112)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp115 = 0.001
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp116 = tmp114 * tmp115
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp117 = -tmp116
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp118 = tmp109 * tmp117
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp119 = (tmp103 / tmp118)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp120 = (tmp113 / tmp117)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp121 = 1e-08
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp122 = tmp120 * tmp121
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp123 = tmp119 + tmp122
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp124 = (tmp94 / tmp123)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp125 = tmp102 + tmp124
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr24 + (x2), tmp125, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr25 + (x2), tmp94, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr26 + (x2), tmp101, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_3:
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_2
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x3 = xindex
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp131 = tl.load(in_ptr15 + (x3), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp132 = tl.load(in_ptr16 + (x3), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp137 = tl.load(in_ptr17 + (x3), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp144 = tl.load(in_ptr18 + (x3), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp146 = in_ptr19
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp126 = 0.09999999999999998
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp127 = 0.5
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp128 = tmp126 >= tmp127
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp129 = -0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp130 = tl.where(tmp128, tmp129, tmp126)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp133 = tmp131 - tmp132
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp134 = tmp130 * tmp133
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp135 = tl.where(tmp128, tmp131, tmp132)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp136 = tmp134 + tmp135
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp138 = 0.999
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp139 = tmp137 * tmp138
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp140 = 0.0010000000000000009
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp141 = tmp131 * tmp140
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp142 = tmp141 * tmp131
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp143 = tmp139 + tmp142
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp145 = libdevice.sqrt(tmp143)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp147 = 1.0
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp148 = tmp146 + tmp147
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp149 = libdevice.pow(tmp138, tmp148)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp150 = tmp147 - tmp149
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp151 = libdevice.sqrt(tmp150)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp152 = 0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp153 = libdevice.pow(tmp152, tmp148)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp154 = tmp147 - tmp153
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp155 = tl.full([1], 1, tl.int32)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp156 = (tmp155 / tmp154)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp157 = 0.001
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp158 = tmp156 * tmp157
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp159 = -tmp158
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp160 = tmp151 * tmp159
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp161 = (tmp145 / tmp160)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp162 = (tmp155 / tmp159)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp163 = 1e-08
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp164 = tmp162 * tmp163
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp165 = tmp161 + tmp164
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp166 = (tmp136 / tmp165)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp167 = tmp144 + tmp166
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr33 + (x3), tmp167, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr34 + (x3), tmp136, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr35 + (x3), tmp143, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_4:
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_3
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x4 = xindex
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp173 = tl.load(in_ptr20 + (x4), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp174 = tl.load(in_ptr21 + (x4), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp179 = tl.load(in_ptr22 + (x4), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp186 = tl.load(in_ptr23 + (x4), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp188 = in_ptr24
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp168 = 0.09999999999999998
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp169 = 0.5
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp170 = tmp168 >= tmp169
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp171 = -0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp172 = tl.where(tmp170, tmp171, tmp168)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp175 = tmp173 - tmp174
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp176 = tmp172 * tmp175
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp177 = tl.where(tmp170, tmp173, tmp174)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp178 = tmp176 + tmp177
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp180 = 0.999
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp181 = tmp179 * tmp180
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp182 = 0.0010000000000000009
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp183 = tmp173 * tmp182
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp184 = tmp183 * tmp173
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp185 = tmp181 + tmp184
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp187 = libdevice.sqrt(tmp185)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp189 = 1.0
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp190 = tmp188 + tmp189
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp191 = libdevice.pow(tmp180, tmp190)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp192 = tmp189 - tmp191
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp193 = libdevice.sqrt(tmp192)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp194 = 0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp195 = libdevice.pow(tmp194, tmp190)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp196 = tmp189 - tmp195
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp197 = tl.full([1], 1, tl.int32)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp198 = (tmp197 / tmp196)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp199 = 0.001
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp200 = tmp198 * tmp199
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp201 = -tmp200
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp202 = tmp193 * tmp201
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp203 = (tmp187 / tmp202)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp204 = (tmp197 / tmp201)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp205 = 1e-08
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp206 = tmp204 * tmp205
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp207 = tmp203 + tmp206
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp208 = (tmp178 / tmp207)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp209 = tmp186 + tmp208
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr42 + (x4), tmp209, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr43 + (x4), tmp178, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr44 + (x4), tmp185, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_5:
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_4
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x5 = xindex
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp215 = tl.load(in_ptr25 + (x5), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp216 = tl.load(in_ptr26 + (x5), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp221 = tl.load(in_ptr27 + (x5), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp228 = tl.load(in_ptr28 + (x5), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp230 = in_ptr29
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp210 = 0.09999999999999998
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp211 = 0.5
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp212 = tmp210 >= tmp211
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp213 = -0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp214 = tl.where(tmp212, tmp213, tmp210)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp217 = tmp215 - tmp216
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp218 = tmp214 * tmp217
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp219 = tl.where(tmp212, tmp215, tmp216)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp220 = tmp218 + tmp219
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp222 = 0.999
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp223 = tmp221 * tmp222
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp224 = 0.0010000000000000009
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp225 = tmp215 * tmp224
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp226 = tmp225 * tmp215
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp227 = tmp223 + tmp226
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp229 = libdevice.sqrt(tmp227)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp231 = 1.0
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp232 = tmp230 + tmp231
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp233 = libdevice.pow(tmp222, tmp232)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp234 = tmp231 - tmp233
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp235 = libdevice.sqrt(tmp234)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp236 = 0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp237 = libdevice.pow(tmp236, tmp232)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp238 = tmp231 - tmp237
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp239 = tl.full([1], 1, tl.int32)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp240 = (tmp239 / tmp238)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp241 = 0.001
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp242 = tmp240 * tmp241
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp243 = -tmp242
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp244 = tmp235 * tmp243
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp245 = (tmp229 / tmp244)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp246 = (tmp239 / tmp243)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp247 = 1e-08
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp248 = tmp246 * tmp247
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp249 = tmp245 + tmp248
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp250 = (tmp220 / tmp249)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp251 = tmp228 + tmp250
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr51 + (x5), tmp251, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr52 + (x5), tmp220, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr53 + (x5), tmp227, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_6:
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_5
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x6 = xindex
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp257 = tl.load(in_ptr30 + (x6), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp258 = tl.load(in_ptr31 + (x6), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp263 = tl.load(in_ptr32 + (x6), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp270 = tl.load(in_ptr33 + (x6), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp272 = in_ptr34
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp252 = 0.09999999999999998
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp253 = 0.5
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp254 = tmp252 >= tmp253
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp255 = -0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp256 = tl.where(tmp254, tmp255, tmp252)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp259 = tmp257 - tmp258
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp260 = tmp256 * tmp259
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp261 = tl.where(tmp254, tmp257, tmp258)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp262 = tmp260 + tmp261
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp264 = 0.999
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp265 = tmp263 * tmp264
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp266 = 0.0010000000000000009
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp267 = tmp257 * tmp266
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp268 = tmp267 * tmp257
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp269 = tmp265 + tmp268
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp271 = libdevice.sqrt(tmp269)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp273 = 1.0
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp274 = tmp272 + tmp273
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp275 = libdevice.pow(tmp264, tmp274)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp276 = tmp273 - tmp275
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp277 = libdevice.sqrt(tmp276)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp278 = 0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp279 = libdevice.pow(tmp278, tmp274)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp280 = tmp273 - tmp279
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp281 = tl.full([1], 1, tl.int32)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp282 = (tmp281 / tmp280)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp283 = 0.001
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp284 = tmp282 * tmp283
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp285 = -tmp284
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp286 = tmp277 * tmp285
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp287 = (tmp271 / tmp286)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp288 = (tmp281 / tmp285)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp289 = 1e-08
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp290 = tmp288 * tmp289
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp291 = tmp287 + tmp290
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp292 = (tmp262 / tmp291)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp293 = tmp270 + tmp292
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr60 + (x6), tmp293, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr61 + (x6), tmp262, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr62 + (x6), tmp269, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_7:
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_6
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x7 = xindex
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp299 = tl.load(in_ptr35 + (x7), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp300 = tl.load(in_ptr36 + (x7), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp305 = tl.load(in_ptr37 + (x7), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp312 = tl.load(in_ptr38 + (x7), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp314 = in_ptr39
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp294 = 0.09999999999999998
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp295 = 0.5
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp296 = tmp294 >= tmp295
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp297 = -0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp298 = tl.where(tmp296, tmp297, tmp294)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp301 = tmp299 - tmp300
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp302 = tmp298 * tmp301
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp303 = tl.where(tmp296, tmp299, tmp300)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp304 = tmp302 + tmp303
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp306 = 0.999
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp307 = tmp305 * tmp306
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp308 = 0.0010000000000000009
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp309 = tmp299 * tmp308
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp310 = tmp309 * tmp299
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp311 = tmp307 + tmp310
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp313 = libdevice.sqrt(tmp311)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp315 = 1.0
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp316 = tmp314 + tmp315
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp317 = libdevice.pow(tmp306, tmp316)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp318 = tmp315 - tmp317
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp319 = libdevice.sqrt(tmp318)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp320 = 0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp321 = libdevice.pow(tmp320, tmp316)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp322 = tmp315 - tmp321
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp323 = tl.full([1], 1, tl.int32)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp324 = (tmp323 / tmp322)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp325 = 0.001
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp326 = tmp324 * tmp325
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp327 = -tmp326
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp328 = tmp319 * tmp327
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp329 = (tmp313 / tmp328)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp330 = (tmp323 / tmp327)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp331 = 1e-08
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp332 = tmp330 * tmp331
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp333 = tmp329 + tmp332
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp334 = (tmp304 / tmp333)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp335 = tmp312 + tmp334
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr69 + (x7), tmp335, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr70 + (x7), tmp304, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr71 + (x7), tmp311, None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_8:
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_7
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x8 = xindex
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp341 = tl.load(in_ptr40 + (x8), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp342 = tl.load(in_ptr41 + (x8), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp347 = tl.load(in_ptr42 + (x8), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp354 = tl.load(in_ptr43 + (x8), None)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp356 = in_ptr44
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp336 = 0.09999999999999998
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp337 = 0.5
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp338 = tmp336 >= tmp337
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp339 = -0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp340 = tl.where(tmp338, tmp339, tmp336)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp343 = tmp341 - tmp342
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp344 = tmp340 * tmp343
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp345 = tl.where(tmp338, tmp341, tmp342)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp346 = tmp344 + tmp345
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp348 = 0.999
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp349 = tmp347 * tmp348
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp350 = 0.0010000000000000009
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp351 = tmp341 * tmp350
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp352 = tmp351 * tmp341
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp353 = tmp349 + tmp352
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp355 = libdevice.sqrt(tmp353)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp357 = 1.0
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp358 = tmp356 + tmp357
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp359 = libdevice.pow(tmp348, tmp358)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp360 = tmp357 - tmp359
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp361 = libdevice.sqrt(tmp360)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp362 = 0.9
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp363 = libdevice.pow(tmp362, tmp358)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp364 = tmp357 - tmp363
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp365 = tl.full([1], 1, tl.int32)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp366 = (tmp365 / tmp364)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp367 = 0.001
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp368 = tmp366 * tmp367
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp369 = -tmp368
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp370 = tmp361 * tmp369
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp371 = (tmp355 / tmp370)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp372 = (tmp365 / tmp369)
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp373 = 1e-08
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp374 = tmp372 * tmp373
V0707
22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp375 = tmp371 + tmp374 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp376 = (tmp346 / tmp375) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp377 = tmp354 + tmp376 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr78 + (x8), tmp377, None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr79 + (x8), tmp346, None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr80 + (x8), tmp353, None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] elif pid < num_xblocks_9: V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pid_offset = pid - num_xblocks_8 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xnumel = 1048576 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] r0_numel = 1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xoffset = pid_offset * XBLOCK V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] x9 = xindex V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp383 = tl.load(in_ptr45 + (x9), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp384 = tl.load(in_ptr46 + (x9), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp389 = tl.load(in_ptr47 + (x9), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp396 = 
tl.load(in_ptr48 + (x9), None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp398 = in_ptr49 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp378 = 0.09999999999999998 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp379 = 0.5 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp380 = tmp378 >= tmp379 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp381 = -0.9 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp382 = tl.where(tmp380, tmp381, tmp378) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp385 = tmp383 - tmp384 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp386 = tmp382 * tmp385 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp387 = tl.where(tmp380, tmp383, tmp384) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp388 = tmp386 + tmp387 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp390 = 0.999 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp391 = tmp389 * tmp390 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp392 = 0.0010000000000000009 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp393 = tmp383 * tmp392 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp394 = tmp393 * tmp383 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp395 = tmp391 + tmp394 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp397 = libdevice.sqrt(tmp395) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp399 = 1.0 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] 
[0/0] [__output_code] tmp400 = tmp398 + tmp399 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp401 = libdevice.pow(tmp390, tmp400) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp402 = tmp399 - tmp401 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp403 = libdevice.sqrt(tmp402) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp404 = 0.9 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp405 = libdevice.pow(tmp404, tmp400) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp406 = tmp399 - tmp405 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp407 = tl.full([1], 1, tl.int32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp408 = (tmp407 / tmp406) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp409 = 0.001 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp410 = tmp408 * tmp409 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp411 = -tmp410 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp412 = tmp403 * tmp411 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp413 = (tmp397 / tmp412) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp414 = (tmp407 / tmp411) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp415 = 1e-08 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp416 = tmp414 * tmp415 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp417 = tmp413 + tmp416 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp418 = (tmp388 / tmp417) V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] tmp419 = tmp396 + tmp418 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr87 + (x9), tmp419, None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr88 + (x9), tmp388, None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] tl.store(out_ptr89 + (x9), tmp395, None) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] else: V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] pass V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] ''', device_str='cuda') V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] cpp_fused__foreach_copy_1 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], ''' V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] #include "/tmp/torchinductor_ci-user/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h" V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] extern "C" void kernel(const float* in_ptr0, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr1, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr2, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr3, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* 
in_ptr4, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr5, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr6, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr7, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr8, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] const float* in_ptr9, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr1, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr3, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr5, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr7, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr9, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr11, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr13, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr15, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr17, V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] float* out_ptr19) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr0[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr1[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr1[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr3[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr2[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr5[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr3[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr7[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr4[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr9[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr5[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr11[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr6[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr13[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr7[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr15[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr8[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr17[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] { V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp0 = in_ptr9[static_cast<int64_t>(0L)]; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp1 = static_cast<float>(1.0); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1); V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] out_ptr19[static_cast<int64_t>(0L)] = tmp2; V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] } V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] ''') V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] 
async_compile.wait(globals()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del async_compile V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] def call(args): V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = args V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] args.clear() V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg0_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg1_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg2_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg3_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg4_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg5_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg6_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg7_1, (1024, 1024), (1024, 1)) V0707 
22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg8_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg9_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg10_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg11_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg12_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg13_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg14_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg15_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg16_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg17_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg18_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg19_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg20_1, (), ()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg21_1, (), ()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg22_1, (), ()) V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg23_1, (), ()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg24_1, (), ()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg25_1, (), ()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg26_1, (), ()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg27_1, (), ()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg28_1, (), ()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg29_1, (), ()) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg30_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg31_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg32_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg33_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg34_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg35_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg36_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg37_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg38_1, (1024, 1024), (1024, 1)) 
V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg39_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg40_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg41_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg42_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg43_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg44_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg45_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg46_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg47_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg48_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] assert_size_stride(arg49_1, (1024, 1024), (1024, 1)) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] with torch.cuda._DeviceGuard(0): V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] torch.cuda.set_device(0) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] # Unsorted Source Nodes: [], Original ATen: [] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] stream0 = get_raw_stream(0) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] 
[__output_code] triton_for_fused_0.run(arg1_1, arg30_1, arg40_1, arg0_1, arg20_1.item(), arg3_1, arg31_1, arg41_1, arg2_1, arg21_1.item(), arg5_1, arg32_1, arg42_1, arg4_1, arg22_1.item(), arg7_1, arg33_1, arg43_1, arg6_1, arg23_1.item(), arg9_1, arg34_1, arg44_1, arg8_1, arg24_1.item(), arg11_1, arg35_1, arg45_1, arg10_1, arg25_1.item(), arg13_1, arg36_1, arg46_1, arg12_1, arg26_1.item(), arg15_1, arg37_1, arg47_1, arg14_1, arg27_1.item(), arg17_1, arg38_1, arg48_1, arg16_1, arg28_1.item(), arg19_1, arg39_1, arg49_1, arg18_1, arg29_1.item(), arg0_1, arg30_1, arg40_1, arg2_1, arg31_1, arg41_1, arg4_1, arg32_1, arg42_1, arg6_1, arg33_1, arg43_1, arg8_1, arg34_1, arg44_1, arg10_1, arg35_1, arg45_1, arg12_1, arg36_1, arg46_1, arg14_1, arg37_1, arg47_1, arg16_1, arg38_1, arg48_1, arg18_1, arg39_1, arg49_1, stream=stream0) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg0_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg10_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg11_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg12_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg13_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg14_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg15_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg16_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg17_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg18_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg19_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg1_1 V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg2_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg30_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg31_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg32_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg33_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg34_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg35_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg36_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg37_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg38_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg39_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg3_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg40_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg41_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg42_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg43_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg44_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg45_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg46_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg47_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg48_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] 
[__output_code] del arg49_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg4_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg5_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg6_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg7_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg8_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg9_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] cpp_fused__foreach_copy_1(arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg20_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg21_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg22_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg23_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg24_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg25_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg26_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg27_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg28_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] del arg29_1 V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] return () V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 
torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10): V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._dynamo.testing import rand_strided V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.utils import print_performance V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] 
[0/0] [__output_code] arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg10_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg11_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg12_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg13_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg14_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg15_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg16_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg17_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg18_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg19_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg20_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] 
[__output_code] arg21_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg22_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg23_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg24_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg25_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg26_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg27_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg28_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg29_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg30_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg31_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg32_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg33_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] 
[__output_code] arg34_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg35_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg36_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg37_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg38_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg39_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg40_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg41_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg42_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg43_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg44_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg45_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] 
[0/0] [__output_code] arg46_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg47_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg48_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1]) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] return print_performance(fn, times=times, repeat=repeat) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] if __name__ == "__main__": V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] from torch._inductor.wrapper_benchmark import compiled_module_main V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] compiled_module_main('None', benchmark_compiled_module) V0707 22:50:04.702000 29476 torch/_inductor/graph.py:2104] [0/0] [__output_code] V0707 22:50:04.751000 29476 torch/_inductor/graph.py:2115] [0/0] 
[__output_code] Output code written to: /tmp/torchinductor_ci-user/bx/cbxwuspm7iljtlkypwgm5a6rrandaew4wqmdmng4lzas4ogomxpw.py I0707 22:50:06.299000 29476 torch/_inductor/graph.py:2149] [0/0] [__output_code] Output code written to: /tmp/torchinductor_ci-user/bx/cbxwuspm7iljtlkypwgm5a6rrandaew4wqmdmng4lzas4ogomxpw.py V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] Output code: V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] # AOT ID: ['1_inference'] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from ctypes import c_void_p, c_long, c_int V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] import torch V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] import math V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] import random V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] import os V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] import tempfile V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from math import inf, nan V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from cmath import nanj V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.hooks import run_intermediate_hooks V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.utils import maybe_profile V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.codegen.memory_planning import _align as align V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch import device, empty_strided V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.async_compile import AsyncCompile 
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.select_algorithm import extern_kernels V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton.language as tl V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime.triton_heuristics import start_graph, end_graph V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._C import _cuda_getCurrentRawStream as get_raw_stream V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] aten = torch.ops.aten V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] inductor_ops = torch.ops.inductor V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] _quantized = torch.ops._quantized V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] 
[__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] async_compile = AsyncCompile() V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] # kernel path: /tmp/torchinductor_ci-user/ej/cejr7t4zzqo7llcoxga7clgyc6gs3676lsm4dvilpfw64kudp2ns.py V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] # Unsorted Source Nodes: [], Original ATen: [] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] # Source node to ATen node mapping: V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] triton_for_fused_0 = async_compile.triton('triton_for_fused_0', ''' V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] import triton.language as tl V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime import triton_helpers, triton_heuristics V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime.triton_helpers import libdevice, math as tl_math V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.runtime.hints import AutotuneHint, ReductionHint, TileHint, 
DeviceProperties V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] @triton_heuristics.foreach( V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_warps=8, V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] triton_meta={'signature': {'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'in_ptr2': '*fp32', 'in_ptr3': '*fp32', 'in_ptr4': 'fp32', 'in_ptr5': '*fp32', 'in_ptr6': '*fp32', 'in_ptr7': '*fp32', 'in_ptr8': '*fp32', 'in_ptr9': 'fp32', 'in_ptr10': '*fp32', 'in_ptr11': '*fp32', 'in_ptr12': '*fp32', 'in_ptr13': '*fp32', 'in_ptr14': 'fp32', 'in_ptr15': '*fp32', 'in_ptr16': '*fp32', 'in_ptr17': '*fp32', 'in_ptr18': '*fp32', 'in_ptr19': 'fp32', 'in_ptr20': '*fp32', 'in_ptr21': '*fp32', 'in_ptr22': '*fp32', 'in_ptr23': '*fp32', 'in_ptr24': 'fp32', 'in_ptr25': '*fp32', 'in_ptr26': '*fp32', 'in_ptr27': '*fp32', 'in_ptr28': '*fp32', 'in_ptr29': 'fp32', 'in_ptr30': '*fp32', 'in_ptr31': '*fp32', 'in_ptr32': '*fp32', 'in_ptr33': '*fp32', 'in_ptr34': 'fp32', 'in_ptr35': '*fp32', 'in_ptr36': '*fp32', 'in_ptr37': '*fp32', 'in_ptr38': '*fp32', 'in_ptr39': 'fp32', 'in_ptr40': '*fp32', 'in_ptr41': '*fp32', 'in_ptr42': '*fp32', 'in_ptr43': '*fp32', 'in_ptr44': 'fp32', 'in_ptr45': '*fp32', 'in_ptr46': '*fp32', 'in_ptr47': '*fp32', 'in_ptr48': '*fp32', 'in_ptr49': 'fp32', 'out_ptr6': '*fp32', 'out_ptr7': '*fp32', 'out_ptr8': '*fp32', 'out_ptr15': '*fp32', 'out_ptr16': '*fp32', 'out_ptr17': '*fp32', 'out_ptr24': '*fp32', 'out_ptr25': '*fp32', 'out_ptr26': '*fp32', 'out_ptr33': '*fp32', 'out_ptr34': '*fp32', 'out_ptr35': '*fp32', 'out_ptr42': '*fp32', 'out_ptr43': '*fp32', 'out_ptr44': '*fp32', 'out_ptr51': '*fp32', 'out_ptr52': '*fp32', 'out_ptr53': '*fp32', 'out_ptr60': '*fp32', 'out_ptr61': '*fp32', 'out_ptr62': '*fp32', 'out_ptr69': '*fp32', 'out_ptr70': '*fp32', 'out_ptr71': '*fp32', 'out_ptr78': '*fp32', 'out_ptr79': 
'*fp32', 'out_ptr80': '*fp32', 'out_ptr87': '*fp32', 'out_ptr88': '*fp32', 'out_ptr89': '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, multi_processor_count=80, cc=86, major=8, regs_per_multiprocessor=65536, max_threads_per_multi_processor=1536, warp_size=32), 'constants': {}, 'configs': [{(0,): [['tt.divisibility', 16]], (1,): [['tt.divisibility', 16]], (2,): [['tt.divisibility', 16]], (3,): [['tt.divisibility', 16]], (5,): [['tt.divisibility', 16]], (6,): [['tt.divisibility', 16]], (7,): [['tt.divisibility', 16]], (8,): [['tt.divisibility', 16]], (10,): [['tt.divisibility', 16]], (11,): [['tt.divisibility', 16]], (12,): [['tt.divisibility', 16]], (13,): [['tt.divisibility', 16]], (15,): [['tt.divisibility', 16]], (16,): [['tt.divisibility', 16]], (17,): [['tt.divisibility', 16]], (18,): [['tt.divisibility', 16]], (20,): [['tt.divisibility', 16]], (21,): [['tt.divisibility', 16]], (22,): [['tt.divisibility', 16]], (23,): [['tt.divisibility', 16]], (25,): [['tt.divisibility', 16]], (26,): [['tt.divisibility', 16]], (27,): [['tt.divisibility', 16]], (28,): [['tt.divisibility', 16]], (30,): [['tt.divisibility', 16]], (31,): [['tt.divisibility', 16]], (32,): [['tt.divisibility', 16]], (33,): [['tt.divisibility', 16]], (35,): [['tt.divisibility', 16]], (36,): [['tt.divisibility', 16]], (37,): [['tt.divisibility', 16]], (38,): [['tt.divisibility', 16]], (40,): [['tt.divisibility', 16]], (41,): [['tt.divisibility', 16]], (42,): [['tt.divisibility', 16]], (43,): [['tt.divisibility', 16]], (45,): [['tt.divisibility', 16]], (46,): [['tt.divisibility', 16]], (47,): [['tt.divisibility', 16]], (48,): [['tt.divisibility', 16]], (50,): [['tt.divisibility', 16]], (51,): [['tt.divisibility', 16]], (52,): [['tt.divisibility', 16]], (53,): [['tt.divisibility', 16]], (54,): [['tt.divisibility', 16]], (55,): [['tt.divisibility', 16]], (56,): [['tt.divisibility', 16]], (57,): [['tt.divisibility', 16]], (58,): [['tt.divisibility', 16]], (59,): [['tt.divisibility', 16]], 
(60,): [['tt.divisibility', 16]], (61,): [['tt.divisibility', 16]], (62,): [['tt.divisibility', 16]], (63,): [['tt.divisibility', 16]], (64,): [['tt.divisibility', 16]], (65,): [['tt.divisibility', 16]], (66,): [['tt.divisibility', 16]], (67,): [['tt.divisibility', 16]], (68,): [['tt.divisibility', 16]], (69,): [['tt.divisibility', 16]], (70,): [['tt.divisibility', 16]], (71,): [['tt.divisibility', 16]], (72,): [['tt.divisibility', 16]], (73,): [['tt.divisibility', 16]], (74,): [['tt.divisibility', 16]], (75,): [['tt.divisibility', 16]], (76,): [['tt.divisibility', 16]], (77,): [['tt.divisibility', 16]], (78,): [['tt.divisibility', 16]], (79,): [['tt.divisibility', 16]]}]}, V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] inductor_meta={'grid_type': 'SequentialComboKernelGrid', 'combo_grid_meta': {'num_kernels': 10, 'min_blocks': 0, 'default_config': {'XBLOCK': 1024}, 'no_x_dim_0': False, 'xnumel_0': 1048576, 'no_x_dim_1': False, 'xnumel_1': 1048576, 'no_x_dim_2': False, 'xnumel_2': 1048576, 'no_x_dim_3': False, 'xnumel_3': 1048576, 'no_x_dim_4': False, 'xnumel_4': 1048576, 'no_x_dim_5': False, 'xnumel_5': 1048576, 'no_x_dim_6': False, 'xnumel_6': 1048576, 'no_x_dim_7': False, 'xnumel_7': 1048576, 'no_x_dim_8': False, 'xnumel_8': 1048576, 'no_x_dim_9': False, 'xnumel_9': 1048576}, 'kernel_name': 'triton_for_fused_0', 'mutated_arg_names': ['in_ptr1', 'in_ptr11', 'in_ptr12', 'in_ptr13', 'in_ptr16', 'in_ptr17', 'in_ptr18', 'in_ptr2', 'in_ptr21', 'in_ptr22', 'in_ptr23', 'in_ptr26', 'in_ptr27', 'in_ptr28', 'in_ptr3', 'in_ptr31', 'in_ptr32', 'in_ptr33', 'in_ptr36', 'in_ptr37', 'in_ptr38', 'in_ptr41', 'in_ptr42', 'in_ptr43', 'in_ptr46', 'in_ptr47', 'in_ptr48', 'in_ptr6', 'in_ptr7', 'in_ptr8', 'out_ptr15', 'out_ptr16', 'out_ptr17', 'out_ptr24', 'out_ptr25', 'out_ptr26', 'out_ptr33', 'out_ptr34', 'out_ptr35', 'out_ptr42', 'out_ptr43', 'out_ptr44', 'out_ptr51', 'out_ptr52', 'out_ptr53', 'out_ptr6', 'out_ptr60', 'out_ptr61', 'out_ptr62', 
'out_ptr69', 'out_ptr7', 'out_ptr70', 'out_ptr71', 'out_ptr78', 'out_ptr79', 'out_ptr8', 'out_ptr80', 'out_ptr87', 'out_ptr88', 'out_ptr89'], 'backend_hash': '1E2C16421D4C3DBA4AD92BFC4278A3CB24C43DEDA6EE7FF9E3FBB1DBB80802DB', 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': None, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False}, V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] ) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] @triton.jit V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] def triton_for_fused_0(in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, in_ptr8, in_ptr9, in_ptr10, in_ptr11, in_ptr12, in_ptr13, in_ptr14, in_ptr15, in_ptr16, in_ptr17, in_ptr18, in_ptr19, in_ptr20, in_ptr21, in_ptr22, in_ptr23, in_ptr24, in_ptr25, in_ptr26, in_ptr27, in_ptr28, in_ptr29, in_ptr30, in_ptr31, in_ptr32, in_ptr33, in_ptr34, in_ptr35, in_ptr36, in_ptr37, in_ptr38, in_ptr39, in_ptr40, in_ptr41, in_ptr42, in_ptr43, in_ptr44, in_ptr45, in_ptr46, in_ptr47, in_ptr48, in_ptr49, out_ptr6, out_ptr7, out_ptr8, out_ptr15, out_ptr16, out_ptr17, out_ptr24, out_ptr25, out_ptr26, out_ptr33, out_ptr34, out_ptr35, out_ptr42, out_ptr43, out_ptr44, out_ptr51, out_ptr52, out_ptr53, out_ptr60, out_ptr61, out_ptr62, out_ptr69, out_ptr70, out_ptr71, out_ptr78, out_ptr79, out_ptr80, out_ptr87, out_ptr88, out_ptr89): V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid = tl.program_id(0) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] XBLOCK: tl.constexpr = 1024 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_0 = 
tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_1 = num_xblocks_0 + tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_2 = num_xblocks_1 + tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_3 = num_xblocks_2 + tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_4 = num_xblocks_3 + tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_5 = num_xblocks_4 + tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_6 = num_xblocks_5 + tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_7 = num_xblocks_6 + tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_8 = num_xblocks_7 + tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] num_xblocks_9 = num_xblocks_8 + tl.cdiv(1048576, XBLOCK) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] if pid < num_xblocks_0: V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = 
tl.full([XBLOCK], True, tl.int1) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x0 = xindex V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp5 = tl.load(in_ptr0 + (x0), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp6 = tl.load(in_ptr1 + (x0), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp11 = tl.load(in_ptr2 + (x0), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp18 = tl.load(in_ptr3 + (x0), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp20 = in_ptr4 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp0 = 0.09999999999999998 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp1 = 0.5 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp2 = tmp0 >= tmp1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp3 = -0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp4 = tl.where(tmp2, tmp3, tmp0) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp7 = tmp5 - tmp6 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp8 = tmp4 * tmp7 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp9 = tl.where(tmp2, tmp5, tmp6) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp10 = tmp8 + tmp9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp12 = 0.999 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp13 = tmp11 * tmp12 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp14 = 0.0010000000000000009 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] 
[__output_code] tmp15 = tmp5 * tmp14 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp16 = tmp15 * tmp5 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp17 = tmp13 + tmp16 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp19 = libdevice.sqrt(tmp17) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp21 = 1.0 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp22 = tmp20 + tmp21 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp23 = libdevice.pow(tmp12, tmp22) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp24 = tmp21 - tmp23 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp25 = libdevice.sqrt(tmp24) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp26 = 0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp27 = libdevice.pow(tmp26, tmp22) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp28 = tmp21 - tmp27 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp29 = tl.full([1], 1, tl.int32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp30 = (tmp29 / tmp28) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp31 = 0.001 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp32 = tmp30 * tmp31 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp33 = -tmp32 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp34 = tmp25 * tmp33 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp35 = (tmp19 / tmp34) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] 
tmp36 = (tmp29 / tmp33) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp37 = 1e-08 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp38 = tmp36 * tmp37 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp39 = tmp35 + tmp38 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp40 = (tmp10 / tmp39) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp41 = tmp18 + tmp40 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr6 + (x0), tmp41, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr7 + (x0), tmp10, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr8 + (x0), tmp17, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_1: V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_0 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x1 = xindex V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp47 = tl.load(in_ptr5 + (x1), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp48 = tl.load(in_ptr6 + (x1), None) V0707 
22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp53 = tl.load(in_ptr7 + (x1), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp60 = tl.load(in_ptr8 + (x1), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp62 = in_ptr9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp42 = 0.09999999999999998 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp43 = 0.5 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp44 = tmp42 >= tmp43 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp45 = -0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp46 = tl.where(tmp44, tmp45, tmp42) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp49 = tmp47 - tmp48 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp50 = tmp46 * tmp49 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp51 = tl.where(tmp44, tmp47, tmp48) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp52 = tmp50 + tmp51 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp54 = 0.999 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp55 = tmp53 * tmp54 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp56 = 0.0010000000000000009 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp57 = tmp47 * tmp56 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp58 = tmp57 * tmp47 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp59 = tmp55 + tmp58 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp61 = 
libdevice.sqrt(tmp59) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp63 = 1.0 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp64 = tmp62 + tmp63 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp65 = libdevice.pow(tmp54, tmp64) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp66 = tmp63 - tmp65 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp67 = libdevice.sqrt(tmp66) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp68 = 0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp69 = libdevice.pow(tmp68, tmp64) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp70 = tmp63 - tmp69 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp71 = tl.full([1], 1, tl.int32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp72 = (tmp71 / tmp70) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp73 = 0.001 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp74 = tmp72 * tmp73 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp75 = -tmp74 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp76 = tmp67 * tmp75 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp77 = (tmp61 / tmp76) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp78 = (tmp71 / tmp75) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp79 = 1e-08 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp80 = tmp78 * tmp79 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp81 = tmp77 + tmp80 V0707 
22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp82 = (tmp52 / tmp81) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp83 = tmp60 + tmp82 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr15 + (x1), tmp83, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr16 + (x1), tmp52, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr17 + (x1), tmp59, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_2: V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x2 = xindex V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp89 = tl.load(in_ptr10 + (x2), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp90 = tl.load(in_ptr11 + (x2), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp95 = tl.load(in_ptr12 + (x2), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp102 = tl.load(in_ptr13 + (x2), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp104 = 
in_ptr14 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp84 = 0.09999999999999998 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp85 = 0.5 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp86 = tmp84 >= tmp85 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp87 = -0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp88 = tl.where(tmp86, tmp87, tmp84) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp91 = tmp89 - tmp90 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp92 = tmp88 * tmp91 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp93 = tl.where(tmp86, tmp89, tmp90) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp94 = tmp92 + tmp93 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp96 = 0.999 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp97 = tmp95 * tmp96 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp98 = 0.0010000000000000009 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp99 = tmp89 * tmp98 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp100 = tmp99 * tmp89 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp101 = tmp97 + tmp100 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp103 = libdevice.sqrt(tmp101) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp105 = 1.0 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp106 = tmp104 + tmp105 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp107 = 
libdevice.pow(tmp96, tmp106) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp108 = tmp105 - tmp107 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp109 = libdevice.sqrt(tmp108) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp110 = 0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp111 = libdevice.pow(tmp110, tmp106) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp112 = tmp105 - tmp111 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp113 = tl.full([1], 1, tl.int32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp114 = (tmp113 / tmp112) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp115 = 0.001 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp116 = tmp114 * tmp115 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp117 = -tmp116 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp118 = tmp109 * tmp117 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp119 = (tmp103 / tmp118) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp120 = (tmp113 / tmp117) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp121 = 1e-08 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp122 = tmp120 * tmp121 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp123 = tmp119 + tmp122 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp124 = (tmp94 / tmp123) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp125 = tmp102 + tmp124 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] 
[__output_code] tl.store(out_ptr24 + (x2), tmp125, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr25 + (x2), tmp94, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr26 + (x2), tmp101, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_3: V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_2 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x3 = xindex V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp131 = tl.load(in_ptr15 + (x3), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp132 = tl.load(in_ptr16 + (x3), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp137 = tl.load(in_ptr17 + (x3), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp144 = tl.load(in_ptr18 + (x3), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp146 = in_ptr19 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp126 = 0.09999999999999998 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp127 = 0.5 V0707 22:50:09.128000 29476 
torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp128 = tmp126 >= tmp127 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp129 = -0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp130 = tl.where(tmp128, tmp129, tmp126) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp133 = tmp131 - tmp132 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp134 = tmp130 * tmp133 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp135 = tl.where(tmp128, tmp131, tmp132) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp136 = tmp134 + tmp135 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp138 = 0.999 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp139 = tmp137 * tmp138 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp140 = 0.0010000000000000009 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp141 = tmp131 * tmp140 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp142 = tmp141 * tmp131 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp143 = tmp139 + tmp142 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp145 = libdevice.sqrt(tmp143) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp147 = 1.0 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp148 = tmp146 + tmp147 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp149 = libdevice.pow(tmp138, tmp148) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp150 = tmp147 - tmp149 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp151 = 
libdevice.sqrt(tmp150) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp152 = 0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp153 = libdevice.pow(tmp152, tmp148) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp154 = tmp147 - tmp153 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp155 = tl.full([1], 1, tl.int32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp156 = (tmp155 / tmp154) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp157 = 0.001 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp158 = tmp156 * tmp157 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp159 = -tmp158 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp160 = tmp151 * tmp159 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp161 = (tmp145 / tmp160) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp162 = (tmp155 / tmp159) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp163 = 1e-08 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp164 = tmp162 * tmp163 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp165 = tmp161 + tmp164 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp166 = (tmp136 / tmp165) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp167 = tmp144 + tmp166 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr33 + (x3), tmp167, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr34 + (x3), tmp136, None) V0707 22:50:09.128000 29476 
torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr35 + (x3), tmp143, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_4: V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_3 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x4 = xindex V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp173 = tl.load(in_ptr20 + (x4), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp174 = tl.load(in_ptr21 + (x4), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp179 = tl.load(in_ptr22 + (x4), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp186 = tl.load(in_ptr23 + (x4), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp188 = in_ptr24 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp168 = 0.09999999999999998 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp169 = 0.5 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp170 = tmp168 >= tmp169 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp171 = -0.9 V0707 22:50:09.128000 29476 
torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp172 = tl.where(tmp170, tmp171, tmp168) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp175 = tmp173 - tmp174 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp176 = tmp172 * tmp175 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp177 = tl.where(tmp170, tmp173, tmp174) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp178 = tmp176 + tmp177 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp180 = 0.999 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp181 = tmp179 * tmp180 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp182 = 0.0010000000000000009 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp183 = tmp173 * tmp182 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp184 = tmp183 * tmp173 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp185 = tmp181 + tmp184 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp187 = libdevice.sqrt(tmp185) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp189 = 1.0 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp190 = tmp188 + tmp189 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp191 = libdevice.pow(tmp180, tmp190) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp192 = tmp189 - tmp191 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp193 = libdevice.sqrt(tmp192) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp194 = 0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] 
tmp195 = libdevice.pow(tmp194, tmp190) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp196 = tmp189 - tmp195 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp197 = tl.full([1], 1, tl.int32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp198 = (tmp197 / tmp196) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp199 = 0.001 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp200 = tmp198 * tmp199 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp201 = -tmp200 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp202 = tmp193 * tmp201 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp203 = (tmp187 / tmp202) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp204 = (tmp197 / tmp201) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp205 = 1e-08 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp206 = tmp204 * tmp205 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp207 = tmp203 + tmp206 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp208 = (tmp178 / tmp207) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp209 = tmp186 + tmp208 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr42 + (x4), tmp209, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr43 + (x4), tmp178, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr44 + (x4), tmp185, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_5: V0707 
22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_4 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x5 = xindex V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp215 = tl.load(in_ptr25 + (x5), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp216 = tl.load(in_ptr26 + (x5), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp221 = tl.load(in_ptr27 + (x5), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp228 = tl.load(in_ptr28 + (x5), None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp230 = in_ptr29 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp210 = 0.09999999999999998 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp211 = 0.5 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp212 = tmp210 >= tmp211 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp213 = -0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp214 = tl.where(tmp212, tmp213, tmp210) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp217 = tmp215 - tmp216 V0707 22:50:09.128000 29476 
torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp218 = tmp214 * tmp217 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp219 = tl.where(tmp212, tmp215, tmp216) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp220 = tmp218 + tmp219 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp222 = 0.999 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp223 = tmp221 * tmp222 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp224 = 0.0010000000000000009 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp225 = tmp215 * tmp224 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp226 = tmp225 * tmp215 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp227 = tmp223 + tmp226 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp229 = libdevice.sqrt(tmp227) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp231 = 1.0 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp232 = tmp230 + tmp231 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp233 = libdevice.pow(tmp222, tmp232) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp234 = tmp231 - tmp233 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp235 = libdevice.sqrt(tmp234) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp236 = 0.9 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp237 = libdevice.pow(tmp236, tmp232) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp238 = tmp231 - tmp237 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp239 
= tl.full([1], 1, tl.int32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp240 = (tmp239 / tmp238) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp241 = 0.001 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp242 = tmp240 * tmp241 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp243 = -tmp242 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp244 = tmp235 * tmp243 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp245 = (tmp229 / tmp244) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp246 = (tmp239 / tmp243) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp247 = 1e-08 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp248 = tmp246 * tmp247 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp249 = tmp245 + tmp248 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp250 = (tmp220 / tmp249) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp251 = tmp228 + tmp250 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr51 + (x5), tmp251, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr52 + (x5), tmp220, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr53 + (x5), tmp227, None) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_6: V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_5 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576 V0707 22:50:09.128000 29476 
torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x6 = xindex
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp257 = tl.load(in_ptr30 + (x6), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp258 = tl.load(in_ptr31 + (x6), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp263 = tl.load(in_ptr32 + (x6), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp270 = tl.load(in_ptr33 + (x6), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp272 = in_ptr34
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp252 = 0.09999999999999998
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp253 = 0.5
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp254 = tmp252 >= tmp253
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp255 = -0.9
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp256 = tl.where(tmp254, tmp255, tmp252)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp259 = tmp257 - tmp258
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp260 = tmp256 * tmp259
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp261 = tl.where(tmp254, tmp257, tmp258)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp262 = tmp260 + tmp261
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp264 = 0.999
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp265 = tmp263 * tmp264
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp266 = 0.0010000000000000009
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp267 = tmp257 * tmp266
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp268 = tmp267 * tmp257
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp269 = tmp265 + tmp268
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp271 = libdevice.sqrt(tmp269)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp273 = 1.0
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp274 = tmp272 + tmp273
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp275 = libdevice.pow(tmp264, tmp274)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp276 = tmp273 - tmp275
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp277 = libdevice.sqrt(tmp276)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp278 = 0.9
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp279 = libdevice.pow(tmp278, tmp274)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp280 = tmp273 - tmp279
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp281 = tl.full([1], 1, tl.int32)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp282 = (tmp281 / tmp280)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp283 = 0.001
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp284 = tmp282 * tmp283
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp285 = -tmp284
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp286 = tmp277 * tmp285
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp287 = (tmp271 / tmp286)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp288 = (tmp281 / tmp285)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp289 = 1e-08
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp290 = tmp288 * tmp289
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp291 = tmp287 + tmp290
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp292 = (tmp262 / tmp291)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp293 = tmp270 + tmp292
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr60 + (x6), tmp293, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr61 + (x6), tmp262, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr62 + (x6), tmp269, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_7:
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_6
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x7 = xindex
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp299 = tl.load(in_ptr35 + (x7), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp300 = tl.load(in_ptr36 + (x7), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp305 = tl.load(in_ptr37 + (x7), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp312 = tl.load(in_ptr38 + (x7), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp314 = in_ptr39
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp294 = 0.09999999999999998
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp295 = 0.5
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp296 = tmp294 >= tmp295
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp297 = -0.9
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp298 = tl.where(tmp296, tmp297, tmp294)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp301 = tmp299 - tmp300
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp302 = tmp298 * tmp301
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp303 = tl.where(tmp296, tmp299, tmp300)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp304 = tmp302 + tmp303
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp306 = 0.999
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp307 = tmp305 * tmp306
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp308 = 0.0010000000000000009
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp309 = tmp299 * tmp308
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp310 = tmp309 * tmp299
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp311 = tmp307 + tmp310
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp313 = libdevice.sqrt(tmp311)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp315 = 1.0
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp316 = tmp314 + tmp315
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp317 = libdevice.pow(tmp306, tmp316)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp318 = tmp315 - tmp317
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp319 = libdevice.sqrt(tmp318)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp320 = 0.9
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp321 = libdevice.pow(tmp320, tmp316)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp322 = tmp315 - tmp321
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp323 = tl.full([1], 1, tl.int32)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp324 = (tmp323 / tmp322)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp325 = 0.001
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp326 = tmp324 * tmp325
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp327 = -tmp326
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp328 = tmp319 * tmp327
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp329 = (tmp313 / tmp328)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp330 = (tmp323 / tmp327)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp331 = 1e-08
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp332 = tmp330 * tmp331
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp333 = tmp329 + tmp332
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp334 = (tmp304 / tmp333)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp335 = tmp312 + tmp334
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr69 + (x7), tmp335, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr70 + (x7), tmp304, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr71 + (x7), tmp311, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_8:
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_7
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x8 = xindex
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp341 = tl.load(in_ptr40 + (x8), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp342 = tl.load(in_ptr41 + (x8), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp347 = tl.load(in_ptr42 + (x8), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp354 = tl.load(in_ptr43 + (x8), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp356 = in_ptr44
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp336 = 0.09999999999999998
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp337 = 0.5
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp338 = tmp336 >= tmp337
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp339 = -0.9
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp340 = tl.where(tmp338, tmp339, tmp336)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp343 = tmp341 - tmp342
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp344 = tmp340 * tmp343
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp345 = tl.where(tmp338, tmp341, tmp342)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp346 = tmp344 + tmp345
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp348 = 0.999
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp349 = tmp347 * tmp348
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp350 = 0.0010000000000000009
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp351 = tmp341 * tmp350
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp352 = tmp351 * tmp341
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp353 = tmp349 + tmp352
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp355 = libdevice.sqrt(tmp353)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp357 = 1.0
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp358 = tmp356 + tmp357
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp359 = libdevice.pow(tmp348, tmp358)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp360 = tmp357 - tmp359
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp361 = libdevice.sqrt(tmp360)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp362 = 0.9
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp363 = libdevice.pow(tmp362, tmp358)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp364 = tmp357 - tmp363
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp365 = tl.full([1], 1, tl.int32)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp366 = (tmp365 / tmp364)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp367 = 0.001
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp368 = tmp366 * tmp367
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp369 = -tmp368
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp370 = tmp361 * tmp369
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp371 = (tmp355 / tmp370)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp372 = (tmp365 / tmp369)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp373 = 1e-08
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp374 = tmp372 * tmp373
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp375 = tmp371 + tmp374
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp376 = (tmp346 / tmp375)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp377 = tmp354 + tmp376
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr78 + (x8), tmp377, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr79 + (x8), tmp346, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr80 + (x8), tmp353, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] elif pid < num_xblocks_9:
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pid_offset = pid - num_xblocks_8
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xnumel = 1048576
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] r0_numel = 1
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xoffset = pid_offset * XBLOCK
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xindex = xoffset + tl.arange(0, XBLOCK)[:]
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] xmask = tl.full([XBLOCK], True, tl.int1)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] x9 = xindex
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp383 = tl.load(in_ptr45 + (x9), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp384 = tl.load(in_ptr46 + (x9), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp389 = tl.load(in_ptr47 + (x9), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp396 = tl.load(in_ptr48 + (x9), None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp398 = in_ptr49
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp378 = 0.09999999999999998
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp379 = 0.5
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp380 = tmp378 >= tmp379
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp381 = -0.9
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp382 = tl.where(tmp380, tmp381, tmp378)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp385 = tmp383 - tmp384
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp386 = tmp382 * tmp385
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp387 = tl.where(tmp380, tmp383, tmp384)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp388 = tmp386 + tmp387
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp390 = 0.999
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp391 = tmp389 * tmp390
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp392 = 0.0010000000000000009
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp393 = tmp383 * tmp392
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp394 = tmp393 * tmp383
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp395 = tmp391 + tmp394
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp397 = libdevice.sqrt(tmp395)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp399 = 1.0
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp400 = tmp398 + tmp399
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp401 = libdevice.pow(tmp390, tmp400)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp402 = tmp399 - tmp401
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp403 = libdevice.sqrt(tmp402)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp404 = 0.9
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp405 = libdevice.pow(tmp404, tmp400)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp406 = tmp399 - tmp405
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp407 = tl.full([1], 1, tl.int32)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp408 = (tmp407 / tmp406)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp409 = 0.001
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp410 = tmp408 * tmp409
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp411 = -tmp410
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp412 = tmp403 * tmp411
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp413 = (tmp397 / tmp412)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp414 = (tmp407 / tmp411)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp415 = 1e-08
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp416 = tmp414 * tmp415
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp417 = tmp413 + tmp416
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp418 = (tmp388 / tmp417)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tmp419 = tmp396 + tmp418
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr87 + (x9), tmp419, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr88 + (x9), tmp388, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] tl.store(out_ptr89 + (x9), tmp395, None)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] else:
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] pass
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] ''', device_str='cuda')
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] cpp_fused__foreach_copy_1 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] #include "/tmp/torchinductor_ci-user/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] extern "C" void kernel(const float* in_ptr0,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr1,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr2,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr3,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr4,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr5,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr6,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr7,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr8,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] const float* in_ptr9,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr1,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr3,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr5,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr7,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr9,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr11,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr13,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr15,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr17,
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] float* out_ptr19)
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr1[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr1[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr3[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr2[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr5[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr3[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr7[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr4[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr9[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr5[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr11[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr6[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr13[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr7[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr15[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr8[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr17[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] {
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp0 = in_ptr9[static_cast<int64_t>(0L)];
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp1 = static_cast<float>(1.0);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] out_ptr19[static_cast<int64_t>(0L)] = tmp2;
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] }
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] ''')
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] async_compile.wait(globals())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del async_compile
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code]
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] def call(args):
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1 = args
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] args.clear()
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg0_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg1_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg2_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg3_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg4_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg5_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg6_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg7_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg8_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg9_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg10_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg11_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg12_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg13_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg14_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg15_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg16_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg17_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg18_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg19_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg20_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg21_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg22_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg23_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg24_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg25_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg26_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg27_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg28_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg29_1, (), ())
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg30_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg31_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg32_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg33_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg34_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg35_1, (1024, 1024), (1024, 1))
V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code]
assert_size_stride(arg36_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg37_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg38_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg39_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg40_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg41_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg42_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg43_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg44_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg45_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg46_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg47_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg48_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] assert_size_stride(arg49_1, (1024, 1024), (1024, 1)) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] with torch.cuda._DeviceGuard(0): V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] 
torch.cuda.set_device(0) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] # Unsorted Source Nodes: [], Original ATen: [] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] stream0 = get_raw_stream(0) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] triton_for_fused_0.run(arg1_1, arg30_1, arg40_1, arg0_1, arg20_1.item(), arg3_1, arg31_1, arg41_1, arg2_1, arg21_1.item(), arg5_1, arg32_1, arg42_1, arg4_1, arg22_1.item(), arg7_1, arg33_1, arg43_1, arg6_1, arg23_1.item(), arg9_1, arg34_1, arg44_1, arg8_1, arg24_1.item(), arg11_1, arg35_1, arg45_1, arg10_1, arg25_1.item(), arg13_1, arg36_1, arg46_1, arg12_1, arg26_1.item(), arg15_1, arg37_1, arg47_1, arg14_1, arg27_1.item(), arg17_1, arg38_1, arg48_1, arg16_1, arg28_1.item(), arg19_1, arg39_1, arg49_1, arg18_1, arg29_1.item(), arg0_1, arg30_1, arg40_1, arg2_1, arg31_1, arg41_1, arg4_1, arg32_1, arg42_1, arg6_1, arg33_1, arg43_1, arg8_1, arg34_1, arg44_1, arg10_1, arg35_1, arg45_1, arg12_1, arg36_1, arg46_1, arg14_1, arg37_1, arg47_1, arg16_1, arg38_1, arg48_1, arg18_1, arg39_1, arg49_1, stream=stream0) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg0_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg10_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg11_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg12_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg13_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg14_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg15_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg16_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg17_1 V0707 
22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg18_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg19_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg1_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg2_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg30_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg31_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg32_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg33_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg34_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg35_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg36_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg37_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg38_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg39_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg3_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg40_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg41_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg42_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg43_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg44_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg45_1 V0707 22:50:09.128000 29476 
torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg46_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg47_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg48_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg49_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg4_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg5_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg6_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg7_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg8_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg9_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] cpp_fused__foreach_copy_1(arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg20_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg21_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg22_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg23_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg24_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg25_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg26_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg27_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] 
[__output_code] del arg28_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] del arg29_1 V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] return () V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] def benchmark_compiled_module(times=10, repeat=10): V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._dynamo.testing import rand_strided V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from torch._inductor.utils import print_performance V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg0_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg1_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg2_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg3_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg4_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg5_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg6_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] 
[__output_code] arg7_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg8_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg9_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg10_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg11_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg12_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg13_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg14_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg15_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg16_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg17_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg18_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] 
[0/1] [__output_code] arg19_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg20_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg21_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg22_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg23_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg24_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg25_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg26_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg27_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg28_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg29_1 = rand_strided((), (), device='cpu', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg30_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg31_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] 
arg32_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg33_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg34_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg35_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg36_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg37_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg38_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg39_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg40_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg41_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg42_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg43_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] 
[__output_code] arg44_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg45_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg46_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg47_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg48_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] arg49_1 = rand_strided((1024, 1024), (1024, 1), device='cuda:0', dtype=torch.float32) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1, arg11_1, arg12_1, arg13_1, arg14_1, arg15_1, arg16_1, arg17_1, arg18_1, arg19_1, arg20_1, arg21_1, arg22_1, arg23_1, arg24_1, arg25_1, arg26_1, arg27_1, arg28_1, arg29_1, arg30_1, arg31_1, arg32_1, arg33_1, arg34_1, arg35_1, arg36_1, arg37_1, arg38_1, arg39_1, arg40_1, arg41_1, arg42_1, arg43_1, arg44_1, arg45_1, arg46_1, arg47_1, arg48_1, arg49_1]) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] return print_performance(fn, times=times, repeat=repeat) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] if __name__ == "__main__": V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] from 
torch._inductor.wrapper_benchmark import compiled_module_main V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] compiled_module_main('None', benchmark_compiled_module) V0707 22:50:09.128000 29476 torch/_inductor/graph.py:2104] [0/1] [__output_code] V0707 22:50:09.174000 29476 torch/_inductor/graph.py:2115] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/65/c655isihixkazmceuwbfqagiscwkui2zsppjfrucnr3s5l4gahqw.py I0707 22:50:09.218000 29476 torch/_inductor/graph.py:2149] [0/1] [__output_code] Output code written to: /tmp/torchinductor_ci-user/65/c655isihixkazmceuwbfqagiscwkui2zsppjfrucnr3s5l4gahqw.py
eager runtime: 1212.3726350000652us
compiled runtime: 755.7724566140678us
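To put the two timings at the end of the output in perspective, the short sketch below computes the speedup they imply. The two values are copied from the output above; the variable names are illustrative, and the result describes only this particular run.

```python
# Runtimes reported at the end of the script output, in microseconds.
eager_us = 1212.3726350000652
compiled_us = 755.7724566140678

# Speedup of the compiled foreach_map optimizer step over the eager step.
speedup = eager_us / compiled_us

# Fraction of the eager step time saved by the compiled version.
saved_fraction = 1.0 - compiled_us / eager_us

print(f"speedup: {speedup:.2f}x")
print(f"time saved per step: {saved_fraction:.1%}")
```

Numbers like these depend on the GPU, the PyTorch version, and the model size, so treat them as a single data point rather than a general guarantee.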
Conclusion
In this tutorial, we implemented a custom, fully fused Adam optimizer using foreach_map. Compiling it with torch.compile fused the parameter and optimizer-state updates into a single horizontally fused Triton kernel (triton_for_fused_0 in the output above). In the run shown above, the reported eager runtime was about 1212 us and the compiled runtime about 756 us, a speedup of roughly 1.6x; exact numbers will vary with hardware and PyTorch version. The same approach applies to other pointwise updates over lists of tensors that benefit from horizontal fusion.
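As a reminder of what foreach_map computes, here is a minimal pure-Python sketch of its semantics as used in this tutorial: list arguments are iterated in lockstep and non-list arguments are broadcast to every element. The names `reference_foreach_map` and `lerp` are hypothetical helpers written for illustration. This is not the library's implementation, and it uses plain floats in place of tensors so that it runs without PyTorch.

```python
import itertools


def reference_foreach_map(op, *operands):
    """Apply `op` across lists in lockstep, broadcasting non-list operands.

    Illustrative model of the behavior only; the real foreach_map is a
    higher-order operator that torch.compile lowers to fused kernels.
    """
    iterables = [
        arg if isinstance(arg, (list, tuple)) else itertools.repeat(arg)
        for arg in operands
    ]
    if not any(isinstance(arg, (list, tuple)) for arg in operands):
        # No list operands: just call the op once.
        return op(*operands)
    return [op(*unpacked) for unpacked in zip(*iterables)]


def lerp(start, end, weight):
    # Scalar stand-in for torch.lerp: start + weight * (end - start).
    return start + weight * (end - start)


# Mirrors the shape of foreach_map(torch.lerp, exp_avgs, grads, 1 - beta1):
# two lists plus one scalar that is shared by every element.
exp_avgs = [0.0, 10.0]
grads = [1.0, 20.0]
updated = reference_foreach_map(lerp, exp_avgs, grads, 0.5)
print(updated)
```

In eager mode this amounts to one call per list element; the benefit shown in this tutorial comes from torch.compile turning those calls into one fused kernel.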
See also:
Compiled optimizer tutorial - an introduction to the compiled optimizer.
Compiling the optimizer with PT2 - deeper technical details on the compiled optimizer.
Total running time of the script: (0 minutes 12.715 seconds)