
Conversation

@zoooo0820 (Contributor) commented Dec 28, 2023

PR types

Performance optimization

PR changes

Others

Description

Pcard-66985

Speed up the basic setitem path: avoid calling to_tensor to construct a tensor from the scalar value only to dispatch to the set_value_with_tensor op. The scalar is now passed directly to the set_value op.
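For context, a minimal sketch of the user-facing path this change targets. The op names follow the description above; the before/after comments summarize the intent rather than trace the exact internal dispatch.

```python
import paddle

x = paddle.zeros([4, 4])

# Basic setitem with a Python scalar.
# Before this PR: the scalar was first materialized with paddle.to_tensor
# so that the assignment could go through the set_value_with_tensor op.
# After this PR: the scalar is passed to the set_value op directly,
# skipping the intermediate tensor construction.
x[1:3, :] = 2.0

print(x)  # rows 1 and 2 are filled with 2.0
```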

@paddle-bot (bot) commented Dec 28, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@zoooo0820 force-pushed the set_value_with_scalar branch from 51ccd89 to 44dbc33 on December 29, 2023 06:21
@zoooo0820 force-pushed the set_value_with_scalar branch from 44dbc33 to f15e3e5 on December 29, 2023 08:28
@jeff41404 (Contributor) left a comment

LGTM

@jeff41404 jeff41404 merged commit 7c7446f into PaddlePaddle:develop Jan 2, 2024
@zoooo0820 zoooo0820 deleted the set_value_with_scalar branch January 2, 2024 07:28
Wanglongzhi2001 pushed a commit to Wanglongzhi2001/Paddle that referenced this pull request Jan 7, 2024
* set_value with scalar
* fix ut
Xinyu302 added a commit to Xinyu302/Paddle that referenced this pull request Jan 15, 2024
* [DimExpr] DimExpr support hash (PaddlePaddle#60471)
* open warning with `paddle.utils.deprecated` (PaddlePaddle#60458): open_warning; update unittest; update; fix typos; fix warning in test runner; uncomment; cleanup todo; using VisibleDeprecationWarning; update comment; fix typo; fix indentation; fix (×2); fix indent level and test; update (Co-authored-by: SigureMo <sigure.qaq@gmail.com>)
* [AutoParallel] Auto Trans PP to VPP (PaddlePaddle#60467): add comment
* 【PIR OpTest Fix No.23】 fix test_distribute_fpn_proposals_op (PaddlePaddle#60335): fix (×2)
* fix test_lookup_table_v2_bf16_op (PaddlePaddle#60332)
* Fix shape error in combined-indexing setitem (PaddlePaddle#60447): add ut; fix shape error in combine-indexing; fix ut
* [auto parallel] Add pp lazy init, bug fix for xavier (PaddlePaddle#60441)
* [PIR] add slice_array_dense api (PaddlePaddle#60433): fix (×2)
* Set value with scalar (PaddlePaddle#60452): set_value with scalar; fix ut
* [PIR] Support custom op in PIR (PaddlePaddle#59790): support custom op in pir; fix compile bugs; fix bugs; delete code; fix windows bugs (×2); add symbol to paddle lib; fix windows bugs; revert code; fix bugs (×2); perfect code according comment; fix py3; revert third party; fix bugs; fix bug; fix compile bugs; fix windows
* [Prim][PIR] support roll, gather, scatter, scatter_nd_add op backward in pir prim (PaddlePaddle#60481): prim gather op backward; prim scatter op backward; prim roll op backward; prim scatter_nd op backward
* [PIR] delete dense_tensor mem_desc_ (PaddlePaddle#60024)
* [PIR] Complement op defs (PaddlePaddle#60475): complement translation of legacy matmul; complement op mappings in translation for deformable_conv_v1
* [pir] Supporting constant_folding_pass for train (PaddlePaddle#60355): fix; Update constant_folding_pass.cc
* [Dynamic Shape] Fuse shape ops into generate shape op pass (PaddlePaddle#60490): add shape.generate_shape op; rename shape.generate_shape to cinn_op.generate_shape; refactor GenerateShapeOp::SymbolBinding; move GenerateShapeOp related helper functions into generate_shape_util.cc; minor fix (×2); backup; refine signature of ConvertDimExprToAttribute; minor fix for signature of ConvertDimExprToAttributes; remove SubstituteDimExpr from generate_shape_util.h; Fix compile error; Fix unittest compile error; Code format (×2)
* Fix _hiden_size to _hidden_size (PaddlePaddle#60485)
* [DimExpr] Add substitute DimExpr util (PaddlePaddle#60493): add SubstituteDimExpr; Fix compile error; Code format; Polish DimExprUtilTest; Change namesapce; Fix unittest; Polish DimExprUtilTest
* [xpu] add sine_pos fuse pass and sine_pos xpu kernel (PaddlePaddle#60025)
* add split with variable in factors and rewrite vectorize, unroll, bind error handling mechanism (PaddlePaddle#60449)
* [CodeStyle] Fix regression of Ruff in sot (PaddlePaddle#60483)
* support cast op from FP32 to low precision (PaddlePaddle#60385)
* test=document_fix (PaddlePaddle#60399)
* [XPU] refine flash attention ut (PaddlePaddle#60474): refine tolerance
* [Inference] support collect shape in sub block (PaddlePaddle#60451): udpate (×2)
* fix process mesh incorrect set in converter (PaddlePaddle#60504)
* 【CMake opt No.13】Remove CINN DEPS in test/cpp/pir/shape_dialect/CMakeLists.txt (PaddlePaddle#60517): Update CMakeLists.txt; Apply suggestions from code review (×2); Update CMakeLists.txt (×2)
* 【pir】 add tensorarray op createarrylike, add_n (PaddlePaddle#60460): optimize backward; [PIR] add vjp interface for while op; [PIR] fix ci error; modify while stopgradient; merge; modify while grad bug; modify while grad op; modify; increment vp; [PIR] add get_used_external_value interface for block; while case; delete print (×2); Update python/paddle/autograd/ir_backward.py; [PIR] add unit_test for get_used_external_value; modify while_loop; code_style; modofy ci bug; modify while api; modify ci; modify array; Update python/paddle/autograd/ir_backward.py; Update test/legacy_test/test_cond.py; update; modify array_write grad info; merge; add_n and createarraylike; conflict; modify exe bug; modify kernel choose (Co-authored-by: winter-wang <1030748926@qq.com>)
* Add align iter space tactic (PaddlePaddle#60498)
* [Dynamic Shape] Add helper function MakeGenerateShapeOpAttribute (PaddlePaddle#60512): add helper function MakeGenerateShapeOpAttribute; fix complier complaint; Code format
* [Prim][PIR] Set prim gflag for pure cpp (PaddlePaddle#60505): inference support decomp; polish code; add decomp base define; add decomp base define2; change decomp infer; fix symbol overload; fix test case; debug (×2); decomp add debug info; add cpp flag; revert; remove unused flag
* [PIR] Refine and fix pir exe (PaddlePaddle#60443): fix (×10); update
* update 2023 security advisory, test=document_fix (PaddlePaddle#60527)
* [Inference] refine common/*.h for inference lib (PaddlePaddle#60513)
* 【complex op】No.19 add complex support for triangular_solve (PaddlePaddle#59529)
* fix reshard dist_attr (PaddlePaddle#60535)
* 【auto parallel】Remove the dependency of sharding-inference-related header files on proto (PaddlePaddle#60543): decouple proto; format (×2); strcuct pre def
* [PIR] Support Operation::Clone Interface (PaddlePaddle#60536): modify into shared_ptr
* [Dynamic Shape] Add FullyInsertBroadcastPass and Broadcast Op (PaddlePaddle#60511): add ShapeBroadcastOp; add pass FullyInsertBroadcastPass; InferSymbolicShape of BroadcastShape Op; Delete unit test; Fix return error; Code format; Fix error message; Update paddle/cinn/hlir/dialect/operator/transforms/fully_insert_broadcast_pass.cc (Co-authored-by: Bo Zhang <105368690+zhangbopd@users.noreply.github.com>)
* Fix OpTranslatorTest name (PaddlePaddle#60518): fix name (×4)
* [PIR] migrate DataFeeder into pir (PaddlePaddle#60434)
* 【PIR API adaptor No.90,92】Migrate some ops into pir (PaddlePaddle#59801)
* [DimExpr] Convert Broadcast to BroadcastTree (PaddlePaddle#60440): backup BroadcastTree; add SubstituteDimExpr; add helper function ConstructBroadcastTree; Fix compile error; Code format; Polish DimExprUtilTest; Add cmake file; Change namesapce; Fix compile error; Fix unittest; reconstruct BroadcastTree; Polish DimExprUtilTest; Reconstruct BroadcastTree; Finish BroadcastBranch (×3); Add Unittest; Remove unnecessary dim_expr_util; Add header file
* [Dynamic Shape] Erase expand (PaddlePaddle#60525): EraseExpandOp; minor fix (×2); Code format
* [inference] Support wint4 groupwise with cutlass gemm (PaddlePaddle#60422): support gemv-groupwise func && weightQuanter-groupwise && weightDeQuanter-groupwise; fix build bug; add unit_test && fix bug; delete useless code; fix ci build bug; fix ci && optimize; fix merge conflict; add op change info; fix weight_only_linear_pass; fix format; solve ci unit_test; init; support cutlass gemm with groupwise; add unit test; fix strange bug; delete random bug; fix sm70 build bug; try to fix ci build bug; fix bug; fix volta build bug; skip sm70 in groupwise mode; change cutlass branch
* simplify extent of loop after fuse and add corresponding test case (PaddlePaddle#60538)
* fix bug of put_along_axis (PaddlePaddle#60551)
* remove clearPass to allow custom device use fusion under fp16 (PaddlePaddle#60541)
* fix fleetutil get_online_pass_interval bug2; test=develop (PaddlePaddle#60544)
* fix vs2017 limit (PaddlePaddle#60528)
* 【Hackathon 5th No.20】Add Exponential and Gamma APIs to Paddle (PaddlePaddle#57899): add exponential; add gamma distribution; refine docs; add kl_divergence and test; resolve conflicts (×2); fix bug; refine test; fix test timeout; refine code; add standard_gamma kernel; fix comments; fix tests (×2); fix comments; fix tests; fix gamma grad; fix yaml; fix bugs; fix tests; fix standard_gamma_grad; fix test (×2); add cdf & icdf (×2); refine comments; fix (×2); fix head file; fix; fix cuda op; fix (×2); refine test; fix test; refine comments; fix comments; fix (×2); fix type check; fix docs; delete useless comments
* [CINN] Add IntrinsicOps into ir_codes_collector (PaddlePaddle#60556): fixes a bug running ResNet in PaddleClas; vectorize introduced an intrinsic GetAddr whose tensor was not collected in ir_node_collector, so the tensor alias was not created in the CUDA code; TODO: IntrinsicOp may be modified in the near future
* 【auto parallel】custom op spmd rule register (PaddlePaddle#60509): custom op spmd rule register (×3); polish
* 【AutoParallel】Add master grad in AMP-O2 of AutoParallel (PaddlePaddle#59987): add master_grad in auto-parallel; reset third_party; fix coverage; support bf16 master_grad; fix bug in master_grad; change code according to review; change the way to find optimizer op
* [Dy2St] Fix `NameloadJstTransformer` missing transform call kwargs (PaddlePaddle#60515) (Co-authored-by: gouzil <66515297+gouzil@users.noreply.github.com>)
* cinn(backends): generate infer shape kernel to infer shape of output tensor (PaddlePaddle#60519): the backend infer-shape result is returned through a two-level pointer; in the generated CINN IR, tensor_shape_args is a two-level pointer and infer_shape_set_value(0, 0, S1, tensor_shape_args) sets dim 0 of output tensor 0 to S1
* fix tensor math method inplace converter (PaddlePaddle#60546)
* [xpu] Add vis_decoder_attention_xpu_pass && modify qkv_attention_xpu_kernel (PaddlePaddle#60361)
* [Prim][PIR] support abs, instance_norm op backward in prim pir (PaddlePaddle#60444): abs op backward; add test case; update code (×5); instance_norm op backward; add instance_norm_v2 test cast; custom op
* [PIR] remove log simply name mechnism from phi to common (PaddlePaddle#60507)
* [InferSymbolicShape] Delete redundent value_id_to_shapeordata_ (PaddlePaddle#60554)
* 【Hackathon 5th No.25】add gammaln api (PaddlePaddle#60553)
* fix (PaddlePaddle#60570)
* [CINN] Add tile tactic and bind cuda tactic (PaddlePaddle#60534): Add tile tactic; Add bind cuda tactic
* 【PIR OpTest Fix No.8】 fix test_shuffle_batch_op (PaddlePaddle#59631): fix test_shuffle_batch_op; fix
* 【PIR OpTest Fix No.14】 fix test_nce (PaddlePaddle#60255): fix test_nce (×2); Update ops.yaml; fix; Update utils.cc; Update ops.yaml
* 【PIR OpTest Fix No.19】 fix test_ftrl_op (PaddlePaddle#60329): fix test_ftrl_op; fix
* [auto parallel] Lazy init for MP. Add reshard infer shape. (PaddlePaddle#60563)
* [PIR] Add unittest for Operation::Clone and Group::Clone (PaddlePaddle#60577)
* [PIR] dce pass disable custom op (PaddlePaddle#60578)
* [Inference] Fix bug of RunWithExternalStream API in new executor (PaddlePaddle#60122): fix bug of RunWithExternalStream API in new executor; add test; fix bug of RunWithExternalStream API in new executor; reset flage in RunWithExternalStream; fix bug; add param swith_stream; fix bug; modify python api; fix bug
* Resubmit PR-58859 (PaddlePaddle#60310): allow multiple rng state in generator; Fix 60142; Fix some comments from sneaxiy; Overwrite copy constructors; add api; pre-commit
* tensor_array slice in PIR (PaddlePaddle#60503): use slice_array, now will meet error of destory opresult still in use; disable the pir test until the bug fixed
* Set DistModel state_dict keys to structure_names (PaddlePaddle#60478): exclude xpu; check structure name mapping; test pp; polish; support dynamic save static load; support dygraph save static load; polish (×2); use structured_name as key in DistModel state_dict; polish (×2); fix checkpoint path conflict; test get_rank_to_files; static save dynamic load test
* fix sm75 build bug (PaddlePaddle#60583): replace LOG(INFO) with VLOG(6)
* Add CanProveDivisible for symbolic calculation (PaddlePaddle#60572): delete extra cout for debug; fix according to some comments
* [PIR][DynamicShape] make shape pass default and fix some bugs (PaddlePaddle#60548)
* Fix words (PaddlePaddle#60603)
* 【auto parallel】custom op use spmd rule (PaddlePaddle#60571): custom op use smpd rule (×2)
* [auto parallel] add lazy init ut to llama (PaddlePaddle#60585)
* 【pir】 modify array_write and array_read vjp, add a simple while with array_write (PaddlePaddle#60575): optimize backward; [PIR] add vjp interface for while op; [PIR] fix ci error; modify while stopgradient; merge; modify while grad bug; modify while grad op; modify; increment vp; [PIR] add get_used_external_value interface for block; while case; delete print (×2); Update python/paddle/autograd/ir_backward.py; [PIR] add unit_test for get_used_external_value; modify while_loop; code_style; modofy ci bug; modify while api; modify ci; modify array; Update python/paddle/autograd/ir_backward.py; Update test/legacy_test/test_cond.py; update; modify array_write grad info; merge; add_n and createarraylike; conflict; modify array_write vjp (×2); Update paddle/fluid/pybind/manual_static_op_function.h; modify array_write vjp; modify ci bug; modify (×2); Update test/legacy_test/test_while_loop_op.py; modify inplace array_read; Update test/legacy_test/test_while_op.py; Update test/ir/pir/test_while_api.py (Co-authored-by: winter-wang <1030748926@qq.com>)
* [Prim][PIR] add leaky_relu, sigmoid, instance_norm op forward prim (PaddlePaddle#60564): hardswish op prim sink; hardswish op prim; add composite; add leaky_relu, sigmoid op forward prim; remove hardswish op forward; add instance_norm op forward prim
* [CINN] Add bucket context (PaddlePaddle#60549): Add tile tactic; Add bind cuda tactic; Add bucket contexts; fix group output args bug
* Add CUDNNv8 max pooling (PaddlePaddle#59413): Add CUDNNv8 version of pool2d; Minor fix; Fix build failure; Remove dygraph API; Fix CI failure (×2); Fix timeout (×2); Add comments; Minor fix
* update lbfgs to avoid the randomness caused by paddle.dot() temporarily (PaddlePaddle#60591): add note
* set_pir_tests_properties for some tests (PaddlePaddle#60401): fix; Update CMakeLists.txt; Update pir_op_test_white_list (×3)
* Add tests to whitelist (PaddlePaddle#60522): fix; add
* fix double grad without convert inplace (PaddlePaddle#60614)
* fix fleetutil get_online_pass_interval bug3 (PaddlePaddle#60615): fix fleetutil get_online_pass_interval bug3; test=develop (×3)
* [PIR][DynamicShape] Add an example for broadcast in dynamic shape infer (PaddlePaddle#60608)
* fix_convert_all_blocks (PaddlePaddle#60613)
* [Paddle-TRT] support set_value dynamic shape (PaddlePaddle#60508)
* fix (PaddlePaddle#60625)
* [PIR] Support Region Clone in Operation::Clone (PaddlePaddle#60590)
* deg2rad test passed (PaddlePaddle#60619)
* [PIR+CINN] Fix Pool2d Variant Attibute for kernel_size (PaddlePaddle#60623): fix padding_size; fix pooling_type
* [SOT] move_gpu_pinned_to_gpu (PaddlePaddle#60395)
* 【PIR API adaptor No.35、40】 Migrate paddle.nn.ChannelShuffle/ClipGradByNorm into pir (PaddlePaddle#60445): fix some bugs; fix bugs; Update clip.py; Update test_channel_shuffle.py; Update test_clip_by_norm_op.py (×2)
* add param name for dist_tensor parameter (PaddlePaddle#60574)
* Fix (PaddlePaddle#60631)
* [PIR] Reify InferSymbolicShapeInterface (PaddlePaddle#60438)
* [Dynamic Shape] Remove ShapeBroadcastOp redundant codes (PaddlePaddle#60609)
* [Dy2St] fix `test_grad` in PIR mode (PaddlePaddle#60621) (Co-authored-by: xiaoguoguo626807 <100397923+xiaoguoguo626807@users.noreply.github.com>)
* reconstruct llama ci cases (PaddlePaddle#60637)
* 【AutoParallel】Unify the fp16 and bf16 in auto-parallel (PaddlePaddle#60514): unify the fp16 and bf16; change white_list in AMP; add dtype support; fix bug in dtype
* [Dynamic Shape] Add SplitGenerateShapeIntoShapeOpsPass (PaddlePaddle#60624): Fix compile error (×2)
* update pdsa-2023-019, test=document_fix (PaddlePaddle#60646)
* [SOT] sot export test files (PaddlePaddle#60547)
* Improve the performence of put_along_axis (PaddlePaddle#60618): fix bug of put_along_axis; improve performence of put_along_axis
* [AutoParallel] Fit vpp for gradient_merge pass (PaddlePaddle#60560): add dist attr; add op namescope
* add test_semi_auto_parallel_hybrid_strategy (PaddlePaddle#60537)
* [PIR] Open uts for AdaptiveAvgPool3D (PaddlePaddle#60636)
* test (PaddlePaddle#60654)
* [CINN] Add OptimizeReductionTactic (PaddlePaddle#60661)
* [Paddle-Trt] update set_value cmakelist (PaddlePaddle#60664)
* [auto parallel] fix reshape infer shape (PaddlePaddle#60632)
* [CINN+PIR] Clean Old GroupScheduler logic and switch into new_group_scheduler (PaddlePaddle#60642)
* [CINN] Fix HasDynamicShape Bug while Type is NULL (PaddlePaddle#60658)
* [PIR] pir onednn support legact istruction and lrn (PaddlePaddle#60502)
* c_softmax_with_cross_entropy support bf16 for xpu (PaddlePaddle#60472)
* enable custom device to use silu_fuse_pass (PaddlePaddle#60595): move SetUseCustomDevice to all platform
* [XPU] add empty_like op and test, update XHPC to 20240105 (PaddlePaddle#60617)
* [XPU] update XHPC date and refine FA ut (PaddlePaddle#60598): update comments for ut
* correct adamw bf16 unit test and the way to get data type (PaddlePaddle#60565)
* Fix some PADDLE_THROW error type and change test cases (PaddlePaddle#60487): fix error type; fix TypeError; fix typo
* as_complex as_real check_grad (PaddlePaddle#60666)
* [Fix Bug] Fix Bugs of Two Pass (PaddlePaddle#60626): Fix GenerateShapeOp bug; Modify unit test; Fix MakeGetterDimExpr4SymbolName
* 【Hackathon 5th No.34】Add bitwise_right_shift / bitwise_right_shift_ / bitwise_left_shift / bitwise_left_shift_ APIs to Paddle (PaddlePaddle#58092)
* This PR enable offset of generator for custom device (PaddlePaddle#60616)
* [SOT] Convert dtype to `DataType` in PIR mode (PaddlePaddle#60627)
* [PIR] Change output to block_arg from copy to a shared for the execution of while (PaddlePaddle#60607): test; fix (×3)
* 【auto parallel】custom op spmd infer add args check (PaddlePaddle#60633): add bound check (×2)
* [PIR] Open PIR flag for test_ifelse (PaddlePaddle#60685): open pir flag for test_ifelse; Update test_ifelse.py (×2)
* [CIN+PIR] Fix SplitOpPattern Bug in pd_to_cinn_pass (PaddlePaddle#60669): fix index error; refine pir_all_path UT; fix bug
* fix uncontiguous tensor resize bug (PaddlePaddle#60684)
* [PIR] Support inplace custom op in pir (PaddlePaddle#60529): support inplace in pir; fix inference ut; fix win bugs; fix win bug; fix; polish code (×2); print log (×2); debug; fix win bugs; fix windows
* fix (PaddlePaddle#60634)
* [Docs] Update latest release version in README (PaddlePaddle#60691)
* [CINN] Refine cmake for pass in cinn (PaddlePaddle#60683): refine cmake for pass in cinn; add dependency in cmake (×2)
* [PIR] Open uts for PReLU (PaddlePaddle#60645)
* [PIR] Open uts for ReLU6 (PaddlePaddle#60650)
* [PIR] Open uts for RReLU (PaddlePaddle#60660)
* [NPU] fix storage_properties type mismatch with OneDNN and NPU (PaddlePaddle#60566)
* fix ttfnet_darknet53_1x_coco in pir mode (PaddlePaddle#60663)
* [auto parallel] shard tensor stop gradient support (PaddlePaddle#60699)
* [PIR][DynamicShape] Polish some codes (PaddlePaddle#60651)
* [PIR] fix onednn double reg (PaddlePaddle#60720)
* 【pir】modify add_n in while use blockarg instead of input value (PaddlePaddle#60668): test; fix (×3); modify add_n block_arg; modify increment return value; merge; modfiy whiel_op.py (Co-authored-by: zhangbo9674 <zhangbo54@baidu.com>)
* [PIR] Open test_case ut (PaddlePaddle#60721): fix (×2)
* [PIR] rename data_layout (PaddlePaddle#60678)
* [xpu]: check op is null (PaddlePaddle#60656)
* 【Hackathon 5th No.1】Add copysign API to Paddle (PaddlePaddle#57785): add copysign op; fix codestyle; codestyle; fix test; fix std bug; merge init (×3); add static cast; add std; static cast (×2); copysignf; static cast to float input; float input; static cast to double input; fix; add inplace test; fix api; fix cast when grad; modify paddle.cast_ to cast_; remove cast in python api; support fp16 && bf16; set grad y to zero; fix en doc; support number input; add hostdevice; refactor kernel; fix nan when backward; add broadcast unit test; modify .cu; Update __init__.py (×2); for ci test; static float; codestyle; static double; fix broadcast, try coverage; Delete paddle/phi/kernels/funcs/broadcast_function.h; remove unused; Update math.py (×2); fix en doc; add test for output dtype, integer unsupported for now; update (×2); fix (×2); add cast for input; fix; add pir test; fix doc (×3); detail doc; adjust for MSVC; fix; Update python/paddle/tensor/math.py (×2); fix doc output dtype, fix Equation; codestyle (×2); Update math.py (Co-authored-by: zachary sun <70642955+sunzhongkai588@users.noreply.github.com>)
* rms_norm_infer_spmd (PaddlePaddle#60709)
* [PIR] Open more tests for bernoulli and celu (PaddlePaddle#60706): bernoulli && celu; celu test_error
* [PIR] Open uts for scatter_nd_add (PaddlePaddle#60698): Fix ut
* [PIR] Open uts for sinh (PaddlePaddle#60714)
* [PIR] Open uts for Softshrink and Softsign (PaddlePaddle#60716)
* [PIR] polish the ir_mapping implimentation (PaddlePaddle#60675)
* [PIR] fix onednn layout transform yaml format (PaddlePaddle#60680)
* 【CINN】Complete error handler mechanism of dynamic schedule (PaddlePaddle#60718): fix some output info
* fix windows C++17 bug (PaddlePaddle#60736)
* [XPU] fc pass and delete pass nodes check (PaddlePaddle#60314)
* fix_local_windows_compile (PaddlePaddle#60682)
* [PIR] fix onednn dialect name (PaddlePaddle#60665)
* 【pir】add tesnor to array kernel etc (PaddlePaddle#60703): merge; modfiy kernel; modify net; modify print
* Fix defition definition (PaddlePaddle#60679)
* cholesky and cholesky_solve tests (PaddlePaddle#60726)
* [PIR] Open uts for searchsorted (PaddlePaddle#60700)
* [PIR] Open uts for selu (PaddlePaddle#60702): Fix ut
* [PIR] Open uts for sequence_mask (PaddlePaddle#60704)
* [PIR] adjust pir pass log printing (PaddlePaddle#60723): update (×3); fix compile
* Fix Throughtput Throughput (PaddlePaddle#60741)
* please last md (PaddlePaddle#60749)
* [CINN+PIR] Fix Fetch XShape Variable logic (PaddlePaddle#60722)
* [PIR][DynamicShape] Remove redundant code for shapeAnalysis and shapedTypeInterface (PaddlePaddle#60744)
* 【PIR Dist Op Reg No.1】 reg push_sparse_v2 (PaddlePaddle#60473): code reg push_sparse_v2
* [Dynamic Shape] Provide operator<< For BroadcastTree (PaddlePaddle#60730)
* [PIR] change IR clone to const and support clone operation successors (PaddlePaddle#60752): refine ir_mapping; refine region clone
* [CINN] Refine fully_insert_broadcast_pass (PaddlePaddle#60676): fix complie bug; fix complie; fix conflict
* [PIR] einsum's inner_cache and xshape set to optional (PaddlePaddle#60748): einsum's inner_cache and xshape set to intermediate; Update paddle/fluid/pir/dialect/operator/ir/ops.yaml (Co-authored-by: kangguangli <kangguangli@hotmail.com>)
* reduce runtime of unit-tests in windows-trt (PaddlePaddle#60731): modify trt test to deal with Timeout; windows
* [Paddle-TRT] upgrade EnqueueV2 to EnqueueV3 (PaddlePaddle#59950)
* 【Hackathon 5th No.110】Enhance the sparse.matmul API of Paddle (PaddlePaddle#59890)
* Fix rank_relatvie rank_relative (PaddlePaddle#60770)
* add graph_key to specific graph's varmap (PaddlePaddle#60567): fix inpalce case (×2)
* 【Hackathon 5th No.38】Add FractionalMaxPool2d / FractionalMaxPool3d APIs (kernel) to Paddle (PaddlePaddle#59847): [Init] add fractional max pool kernel and api; [Fix] pooling.cu seed offset; [Change] remove adaptive from fractional max pool; [Change] fractional max 2d gpu pooling.cu grad; [Change] fractional max 2d gpu pooling.cu grad with dim3; [Change] use UnchangedInferMeta; [Change] test api with uint16; [Change] wrap test disable_static; [Change] regiester float16/bfloat16; [Change] remove bfloat16 from cpu kernrl; [Change] test dtypes in cpu and gpu; [Change] test_fractional_max_pool3d_2d/3d timeout to 30s; [Fix] resolve conflict; [Change] win32 cannot detect bfloat16 correctly; [Change] force set_device; [Add] test random_u is None; [Change] use kernel_size for overlapping mode; [Change] clean headers; [CodeStyle] pooling; [Change] rename op; [Change] rename func without index
* [Prim][PIR] Recover pir bn (PaddlePaddle#60689): reopen bn prim pir; fix atol; decomp support batch_norm_; fix test case; fix bug; fix code
* [PIR] fc_with_special_op_fuse_pass bug fix (PaddlePaddle#60751): bug fix update; update; delete all debug message; add code deleted wrong at last commit; delete createAutoMixedPrecisionPass in analysis_predictor.cc

Co-authored-by: HongyuJia <jiahongyu@baidu.com>, ooo oo <106524776+ooooo-create@users.noreply.github.com>, SigureMo <sigure.qaq@gmail.com>, zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com>, xingmingyyj <135400902+xingmingyyj@users.noreply.github.com>, JYChen <zoooo0820@qq.com>, Yuang Liu <liuyuang@baidu.com>, zhangbo9674 <82555433+zhangbo9674@users.noreply.github.com>, YuanRisheng <yuanrisheng@baidu.com>, kevin <chengyf112@gmail.com>, wanghuancoder <wanghuan29@baidu.com>, kangguangli <kangguangli@hotmail.com>, zhangyuqin1998 <75946871+zhangyuqin1998@users.noreply.github.com>, co63oc <co63oc@users.noreply.github.com>, NeroLoh <745827440@qq.com>, 傅剑寒 <Xs1580802568@gmail.com>, lzydev <lizhiyu02@baidu.com>, tianshuo78520a <707759223@qq.com>, houj04 <35131887+houj04@users.noreply.github.com>, Yuanle Liu <yuanlehome@163.com>, LiYuRio <63526175+LiYuRio@users.noreply.github.com>, 张春乔 <83450930+Liyulingyue@users.noreply.github.com>, xiaoguoguo626807 <100397923+xiaoguoguo626807@users.noreply.github.com>, winter-wang <1030748926@qq.com>, BiynXu <62832681+BiynXu@users.noreply.github.com>, cyber-pioneer <116002591+cyber-pioneer@users.noreply.github.com>, Vigi Zhang <VigiZhang@users.noreply.github.com>, zbt78 <1095497213@qq.com>, liuzhenhai93 <liuzhenhai93@outlook.com>, Aurelius84 <zhangliujie@baidu.com>, Bo Zhang <105368690+zhangbopd@users.noreply.github.com>, Lu Qi <61354321+MarioLulab@users.noreply.github.com>, LoneRanger <836253168@qq.com>, freeliuzc <lzc842650834@gmail.com>, YibLiu <68105073+YibinLiu666@users.noreply.github.com>, engineer1109 <jialiang.wang@xdxct.com>, danleifeng <52735331+danleifeng@users.noreply.github.com>, xuxinyi389 <104957571+xuxinyi389@users.noreply.github.com>, MayYouBeProsperous <ljmhz@outlook.com>, Huihuang Zheng <zhhsplendid@163.com>, gouzil <66515297+gouzil@users.noreply.github.com>, 6clc <chaoliu.lc@foxmail.com>, Terry <38135104+TR666@users.noreply.github.com>, winter-wang <78149749+winter-wang@users.noreply.github.com>, Wang Xin <xinwang614@gmail.com>, ming1753 <61511741+ming1753@users.noreply.github.com>, Frank Lin <eee4017@gmail.com>, pangengzheng <117730991+pangengzheng@users.noreply.github.com>, lanxianghit <47554610+lanxianghit@users.noreply.github.com>, Tian Zheng <tizheng@nvidia.com>, lijialin03 <124568209+lijialin03@users.noreply.github.com>, Wangzheee <634486483@qq.com>, zhink <33270771+zhink@users.noreply.github.com>, huangjiyi <43315610+huangjiyi@users.noreply.github.com>, Chen Zhiyang <1792266893@qq.com>, feifei-111 <2364819892@qq.com>, fsczz <57291768+fsczz@users.noreply.github.com>, Haohongxiang <86215757+haohongxiang@users.noreply.github.com>, Sonder <55493212+AndSonder@users.noreply.github.com>, Liujie0926 <44688141+Liujie0926@users.noreply.github.com>, WangZhen <23097963+0x45f@users.noreply.github.com>, risemeup1 <62429225+risemeup1@users.noreply.github.com>, bukejiyu <52310069+bukejiyu@users.noreply.github.com>, zhangyikun02 <48021248+zhangyk0314@users.noreply.github.com>, Jianbang Yang <yangjianbang112@gmail.com>, enzodechine <enzo9533@hotmail.com>, Zhan Rongrui <46243324+zrr1999@users.noreply.github.com>, coco <69197635+cocoshe@users.noreply.github.com>, zhaohaixu <49297029+zhaohaixu@users.noreply.github.com>, chen2016013 <111894720+chen2016013@users.noreply.github.com>, zyfncg <zhangyunfei07@baidu.com>, Qi Li <qili93@qq.com>, zhangbo9674 <zhangbo54@baidu.com>, Liuyinfeng <30849840+gitliuyf@users.noreply.github.com>, zachary sun <70642955+sunzhongkai588@users.noreply.github.com>, wendaxiao <113992173+wenxiaohahaha@users.noreply.github.com>, cyberslack_lee <luhputu0815@gmail.com>, lizexu123 <39205361+lizexu123@users.noreply.github.com>, GGBond8488 <33050871+GGBond8488@users.noreply.github.com>, megemini <megemini@outlook.com>
zoooo0820 added a commit to zoooo0820/Paddle that referenced this pull request Jan 18, 2024
* set_value with scalar
* fix ut
XiaoguangHu01 pushed a commit that referenced this pull request Jan 19, 2024
* Fix set value grad (#59034): first fix the UT; fix set value grad; polish code; add static mode backward test; always has input valuetensor; add dygraph test
* Fix shape error in combined-indexing setitem (#60447): add ut; fix shape error in combine-indexing; fix ut
* Set value with scalar (#60452): set_value with scalar; fix ut
* remove test_pir
* remove one test since 2.6 not support uint8-add
xiaoguoguo626807 pushed a commit that referenced this pull request Sep 30, 2024
* fix windows bug for common lib (#60308): fix windows bug (×6); Update inference_lib.cmake
* [Dy2St] Disable `test_bert` on CPU (#60173) (#60324) (Co-authored-by: gouzil <66515297+gouzil@users.noreply.github.com>)
* [Cherry-pick] fix weight quant kernel bug when n div 64 != 0 (#60184): fix weight-only quant kernel error for n div 64 !=0; code style fix
* tile (#60261)
* add chunk allocator posix_memalign return value check (#60208) (#60495): fix chunk allocator posix_memalign return value check; test=develop (×3)
* update 2023 security advisory, test=document_fix (#60532)
* fix fleetutil get_online_pass_interval bug2; test=develop (#60545)
* fix fused_rope diff (#60217) (#60593)
* [cherry-pick] fix fleetutil get_online_pass_interval bug3 (#60620): fix fleetutil get_online_pass_interval bug3; test=develop (×3)
* [cherry-pick] update pdsa-2023-019 (#60649): update 2023 security advisory, test=document_fix; update pdsa-2023-019, test=document_fix
* [Dy2St][2.6] Disable `test_grad` on release/2.6 (#60662)
* fix bug of ci (#59926) (#60785)
* [Dy2St][2.6] Disable `test_transformer` on `release/2.6` and update README (#60786): [Docs] Update latest release version in README (#60691); restore order
* [Dy2St][2.6] Increase `test_transformer` and `test_mobile_net` ut time (#60829) (#60875)
* [Cherry-pick] fix set_value with scalar grad (#60930): Fix set value grad (#59034) (first fix the UT; fix set value grad; polish code; add static mode backward test; always has input valuetensor; add dygraph test); Fix shape error in combined-indexing setitem (#60447) (add ut; fix shape error in combine-indexing; fix ut); Set value with scalar (#60452) (set_value with scalar; fix ut); remove test_pir; remove one test since 2.6 not support uint8-add
* [cherry-pick] This PR enable offset of generator for custom device (#60616) (#60772)
* fix core dump when fallback gather_nd_grad and MemoryAllocateHost (#61067)
* fix qat tests (#61211) (#61284)
* [Security] fix draw security problem (#61161) (#61338)
* fix _decompress security problem (#61294) (#61337)
* Fix CVE-2024-0521 (#61032) (#61287): uses shlex for safe command parsing to fix arbitrary code injection (Co-authored-by: ndren <andreien@proton.me>)
* [Security] fix security problem for prune_by_memory_estimation (#61382): OS Command Injection prune_by_memory_estimation fix; Fix StyleCode
* [Security] fix security problem for run_cmd (#61285) (#61398)
* [Security] fix download security problem (#61162) (#61388)
* check eval for security (#61389)
* [cherry-pick] adapt c_embedding to phi namespace for custom devices (#60774) (#61045) (Co-authored-by: Tian <121000916+SylarTiaNII@users.noreply.github.com>)
* [CherryPick] Fix issue 60092 (#61427): fix issue 60092; update (×3)
* Fix unique (#60840) (#61044): fix unique kernel, row to num_out
* cinn(py-dsl): skip eval string in python-dsl (#61380) (#61586)
* remove _wget (#61356) (#61569): remove _wget (×2); remove wget test
* fix layer_norm decompose dtyte bugs, polish codes (#61631)
* fix doc style (#61688)
* merge (#61866)
* [security] refine _get_program_cache_key (#61827) (#61896)
* repeat_interleave support bf16 dtype (#61854) (#61899): support bf16 on cpu
* Support Fake GroupWise Quant (#61900)
* fix launch when elastic run (#61847) (#61878)
* [Paddle-TRT] fix solve (#61806)
* [Cherry-Pick] Fix CacheKV Quant Bug (#61966): fix cachekv quant problem; add unittest
* Sychronized the paddle2.4 adaptation changes
* clear third_part dependencies
* change submodules to right commits
* build pass with cpu only
* build success with maca
* build success with cutlass and fused kernels
* build with flash_attn and mccl
* build with test, fix some bugs
* fix some bugs
* fixed some compilation bugs
* fix bug in previous commit
* fix bug with split when col_size biger than 256
* add row_limit to show full kernel name
* add env.sh
* add shape record
* modify paddle version
* wuzhao optimized the performance of elementwise kernel
* fix split when dtype is fp16
* fix bug in previous commit
* adapt flash_attn new capi
* change eigen path
* modify mcname -> replaced_name
* fix some build bugs
* add PADDLE_ENABLE_SAME_RAND_A100
* remove redundant warning, add patch from 2.6.1
* improve VectorizedBroadcastKernel (cherry picked from commit 19069b26c0bf05a80cc834162db072f6b8aa2536)
* fix bugs (cherry picked from commit b007853a75dbd5de63028f4af82c15a5d3d81f7c)
* split ElementwiseDivGrad (cherry picked from commit eb6470406b7d440c135a3f7ff68fbed9494e9c1f)
* in VectorizedElementwiseKernel, it can now use vecSize = 8 (cherry picked from commit a873000a6c3bc9e2540e178d460e74e15a3d4de5)
* improve ModulatedDeformableCol2imCoordGpuKernel: 1. block size 512->64; 2. FastDivMod; 3. fix VL1; 4. remove DmcnGetCoordinateWeight divergent branches (cherry picked from commit 82c914bdd29f0eef87a52b229ff84bc456a1beeb)
* Optimize depthwise_conv2d_grad compute (InputGrad): 1. use shared memory to optimize data load from global memory; 2. different blocksize for different input shape; 3. FastDivMod for input shape div, >> and & for stride div (cherry picked from commit b34a5634d848f3799f5a8bcf884731dba72d3b20)
* improve VectorizedBroadcastKernel with LoadType = 2 (kMixed) (cherry picked from commit 728b9547f65e096b45f39f096783d2bb49e8556f)
* fix ElementwiseDivGrad (cherry picked from commit 5f99c31904e94fd073bdd1696c3431cccaa376cb)
* Revert "Optimize depthwise_conv2d_grad compute (InputGrad)" (reverts commit b34a5634d848f3799f5a8bcf884731dba72d3b20; cherry picked from commit 398f5cde81e2131ff7014edfe1d7beaaf806adbb)
* improve ElementwiseDivGrad and ElementwiseMulGrad (cherry picked from commit fe32db418d8f075e083f31dca7010398636a6e67)
* improve FilterBBoxes (cherry picked from commit fe4655e86b92f5053fa886af49bf199307960a05)
* improve deformable_conv_grad op: 1. adaptive block size; 2. FastDivMod; 3. move ldg up (cherry picked from commit a7cb0ed275a3488f79445ef31456ab6560e9de43)
* improve ModulatedDeformableIm2colGpuKernel: 1. adaptive block size; 2. FastDivMod; 3. move ldg up (cherry picked from commit 4fb857655d09f55783d9445b91a2d953ed14d0b8)
* improve KeBNBackwardData: replace 1.0/sqrt with rsqrt (cherry picked from commit 333cba7aca1edf7a0e87623a0e55e230cd1e9451)
* Improve KeBNBackwardData and FilterGradAddupGpuKernel kernels; improve nonzero and masked_select (forward only) OP (cherry picked from commit c907b40eb3f9ded6ee751e522c2a97a353ac93bd)
* Optimize depthwise_conv2d: 1. 256 blocksize launch for small-shape inputgrad; 2. FastDivMod in inputgrad and filtergrad; 3. shared memory to put output_grad_data in small shape (cherry picked from commit f9f29bf7b8d929fb95eb1153a79d8a6b96d5b6d2)
* Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors (cherry picked from commit 3bd200f262271a333b3947326442b86af7fb6da1)
* Revert "Improve CheckFiniteAndUnscaleKernel by splitting the kernel into multiple tensors." (reverts commit 3bd200f262271a333b3947326442b86af7fb6da1; cherry picked from commit 86ed8adaa8c20d3c824eecb0ee1e10d365bcea37)
* improve ScatterInitCUDAKernel and ScatterCUDAKernel (cherry picked from commit cddb01a83411c45f68363248291c0c4685e60b24)
* fix bugs and make the code easier to read (cherry picked from commit 07ea3acf347fda434959c8c9cc3533c0686d1836)
* Optimize FilterGard and InputGradSpL: use tmp to store ldg data in the loop so calculate and ldg time can fold each other (cherry picked from commit 7ddab49d868cdb6deb7c3e17c5ef9bbdbab86c3e)
* Improve CheckFiniteAndUnscaleKernel by putting address access to shared memory and making single thread do more tasks (cherry picked from commit 631ffdda2847cda9562e591dc87b3f529a51a978)
* Optimize SwinTransformer: 1. LayerNormBackward: remove if statement, always loop VPT times for ldg128 in compiler, bool flag to control whether the write action is taken; 2. ContiguousCaseOneFunc: tmp saving division result for less division (cherry picked from commit 422d676507308d26f6107bed924424166aa350d3)
* Optimize LayerNormBackwardComputeGradInputWithSmallFeatureSize: set BlockDim.z so blockSize is always 512 and each block handles several batches; all threads then loop 4 times for better performance (cherry picked from commit 7550c90ca29758952fde13eeea74857ece41908b)
* improve KeMatrixTopK: 1. fix private memory; 2. modify max grid size; 3. change it to 64 warp reduce (cherry picked from commit a346af182b139dfc7737e5f6473dc394b21635d7)
* Modify LayerNorm Optimization: might have lossdiff with old optimization without atomicAdd (cherry picked from commit 80b0bcaa9a307c94dbeda658236fd75e104ccccc)
* improve roi_align op: 1. adaptive block size; 2. FastDivMod (cherry picked from commit cc421d7861c359740de0d2870abcfde4354d8c71)
* add workaround for parameters dislocation when calling BatchedGEMM<float16>
* fix McFlashAttn string
* [C500-27046] fix wb issue
* Support compiling external ops
* support flash attn varlen api and support arm build
* Add a copyright notice
* Modify some third-party dependency addresses to public network addresses

Signed-off-by: m00891 <Zequn.Yang@metax-tech.com>. Co-authored-by: risemeup1 <62429225+risemeup1@users.noreply.github.com>, Nyakku Shigure <sigure.qaq@gmail.com>, gouzil <66515297+gouzil@users.noreply.github.com>, Wang Bojun <105858416+wwbitejotunn@users.noreply.github.com>, lizexu123 <39205361+lizexu123@users.noreply.github.com>, danleifeng <52735331+danleifeng@users.noreply.github.com>, Vigi Zhang <VigiZhang@users.noreply.github.com>, tianhaodongbd <137985359+tianhaodongbd@users.noreply.github.com>, zyfncg <zhangyunfei07@baidu.com>, JYChen <zoooo0820@qq.com>, zhaohaixu <49297029+zhaohaixu@users.noreply.github.com>, Spelling <33216444+raining-dark@users.noreply.github.com>, zhouzj <41366441+zzjjay@users.noreply.github.com>, wanghuancoder <wanghuan29@baidu.com>, ndren <andreien@proton.me>, Nguyen Cong Vinh <80946737+vn-ncvinh@users.noreply.github.com>, Ruibin Cheung <beinggod@foxmail.com>, Tian <121000916+SylarTiaNII@users.noreply.github.com>, Yuanle Liu <yuanlehome@163.com>, zhuyipin <yipinzhu@outlook.com>, 6clc <chaoliu.lc@foxmail.com>, Wenyu <wenyu.lyu@gmail.com>, Xianduo Li <30922914+lxd-cumt@users.noreply.github.com>, Wang Xin <xinwang614@gmail.com>, Chang Xu <molixu7@gmail.com>, wentao yu <yuwentao126@126.com>, zhink <33270771+zhink@users.noreply.github.com>, handiz <35895648+ZhangHandi@users.noreply.github.com>, zhimin Pan <zhimin.pan@metax-tech.com>, m00891 <Zequn.Yang@metax-tech.com>, shuliu <shupeng.liu@metax-tech.com>, Yanxin Zhou <yanxin.zhou@metax-tech.com>, Zhao Wu <zhao.wu@metax-tech.com>, m00932 <xiangrong.yi@metax-tech.com>, Fangzhou Feng <fangzhou.feng@metax-tech.com>, junwang <jun.wang@metax-tech.com>, m01097 <qimeng.du@metax-tech.com>