Software Network Data Plane – Satisfying the Need for Speed
FastData.io – VPP and CSIT Projects
16th of April 2018, Out of the Box Net Dev Meetup, London
Maciek Konstantynowicz, lf-id/irc: mackonstan, mkonstan@cisco.com
CSIT-CPL – Continuous System Integration and Testing, a.k.a. “Continuous Performance Lab”: https://wiki.fd.io/view/CSIT
VPP – Vector Packet Processing: https://wiki.fd.io/view/VPP
Multiparty: Broad Membership – FD.io Foundation: Service Providers, Network Vendors, Chip Vendors, Integrators
Multiparty: Broad Contribution – FD.io Foundation: Universitat Politècnica de Catalunya (UPC), Yandex, Qiniu
Topics • What is FD.io • The “Magic” of Vectors • SW Data Plane Benchmarking • Deployment Applicability • Addressing the Continuity Problem • Some Results, Reports and Analysis
Breaking the Barrier of Software Defined Network Services – 1 Terabit Services on a Single Intel® Xeon® Server! EFFICIENCY, PERFORMANCE: Software Defined Networking, Cloud Network Services, Linux Foundation. A Universal Terabit Network Platform for Cloud-native Network Services – Superior Performance, Most Efficient on the Planet, Flexible and Extensible, Open Source, Cloud Native.
FD.io VPP – Vector Packet Processor: Compute Optimized SW Network Platform. A packet processing software platform: • High performance • Linux user space • Runs on compute CPUs – and “knows” how to run them well! Components: Management Agent, Packet Processing Dataplane, Network IO – deployable on bare-metal / VM / container.
FD.io VPP – The “Magic” of Vectors: Compute Optimized SW Network Platform.
1. Packet processing is decomposed into a directed graph of nodes ...
2. ... packets (Packet 0 ... Packet 10 ...) move through the graph nodes as a vector ...
3. ... graph nodes are optimized to fit inside the microprocessor's instruction cache ...
4. ... packets are pre-fetched into the data cache.
Example graph nodes: dpdk-input, af-packet-input, vhost-user-input, ethernet-input, l2-input, ip4-input, ip6-input, mpls-input, arp-input, cdp-input, lldp-input, ...-no-checksum, ip4-lookup*, ip4-lookup-mulitcast, ip4-load-balance, ip4-rewrite-transit, ip4-midchain, mpls-policy-encap, interface-output.
* Each graph node implements a “micro-NF”, a “micro-NetworkFunction” processing packets.
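To make the vector idea concrete, here is a minimal, hypothetical Python sketch contrasting scalar per-packet processing with vector processing through graph nodes. This is not VPP source code (VPP is written in C), and the node and function names are illustrative assumptions only.

```python
# Illustrative sketch of vector (batch) processing through graph nodes.
# NOT VPP code; names and data structures are hypothetical.

def ethernet_input(packets):
    """Classify a whole vector of packets while this node's code stays hot in the I-cache."""
    return [{"payload": p, "next": "ip4-input"} for p in packets]

def ip4_input(frames):
    """Process the whole vector again before handing it to the next node."""
    return [{**f, "next": "ip4-lookup"} for f in frames]

GRAPH = [ethernet_input, ip4_input]   # simplified "graph": a straight chain of two nodes

def process_vector(packets):
    work = packets
    for node in GRAPH:                # each node runs once per vector, not once per packet,
        work = node(work)             # so its instructions are re-used from the I-cache
    return work

if __name__ == "__main__":
    vector = [f"packet-{i}" for i in range(256)]   # VPP vectors hold up to 256 packets
    print(len(process_vector(vector)), "packets processed")
```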
FD.io Benefits from Intel® Xeon® Processor Developments – Increased Processor I/O Improves Packet Forwarding Rates.
YESTERDAY – Intel® Xeon® E5-2699v4 (Broadwell server CPU): 22 cores, 2.2 GHz, 55 MB cache; (1) Network I/O: 160 Gbps; (2) Core ALU: 4-wide parallel µops; (3) Memory: 4 channels at 2400 MHz; (4) Max power: 145 W (TDP).
TODAY – Intel® Xeon® Platinum 8168 (Skylake server CPU): 24 cores, 2.7 GHz, 33 MB cache; (1) Network I/O: 280 Gbps; (2) Core ALU: 5-wide parallel µops; (3) Memory: 6 channels at 2666 MHz; (4) Max power: 205 W (TDP).
(Block diagrams omitted: 2-socket servers with per-socket DDR4 channels, QPI/UPI inter-socket links, and PCIe-attached x8 50GE / x16 100GE NICs; the Skylake platform adds the Lewisburg PCH.)
PCIe packet forwarding rate [Gbps], Intel® Xeon® v3/v4 processors vs. Intel® Xeon® Platinum 8180 processors: server [1 socket] 160 → 280; server [2 sockets] 320 → 560; 2x server [2 sockets] 640 → 1,120* – +75% in each case.
* On compute platforms with all PCIe lanes from the processors routed to PCIe slots.
Breaking the Barrier of Software Defined Network Services – 1 Terabit Services on a Single Intel® Xeon® Server! FD.io takes full advantage of faster Intel® Xeon® Scalable processors – no code change required. https://goo.gl/UtbaHy
FD.io VPP – The “Magic” Behind the Equation. FD.io takes full advantage of faster Intel® Xeon® Scalable processors – no code change required.
Machine with Intel® Xeon® Platinum 8168: per processor 24 cores / 48 threads at 2.7 GHz, on-board LBG-NS 100G QAT crypto; per 2 CPUs: Network I/O 490 Gbps, Crypto I/O 100 Gbps. (Block diagram omitted: 2-socket Skylake server with DDR4 channels, UPI links, PCIe-attached x8 50GE / x16 100GE NICs and Lewisburg PCH.)
FD.io Data Plane Efficiency Metrics ({ + } higher is better, { - } lower is better), YESTERDAY Intel® Xeon® E5-2699v4 vs. TODAY Intel® Xeon® Platinum 8168:
{ + } 4-socket forwarding rate [Gbps]: 560 → 948* (+69%)
{ - } Cycles / Packet: 180 → 158 (-12%)
{ + } Instructions / Cycle (HW max.): 2.8 (4) → 3.28 (5) (+17%)
{ - } Instructions / Packet: 499 → 497 (~0%)
* Measured 4-socket forwarding rate is limited by the PCIe I/O slot layout on the tested compute machines; nominal forwarding rate for the tested FD.io VPP configuration is 280 Gbps per Platinum processor. Not all cores are used.
Breaking the Barrier of Software Defined Network Services – 1 Terabit Services on a Single Intel® Xeon® Server! https://goo.gl/UtbaHy
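As a quick sanity check, the internal metrics above are mutually consistent: cycles per packet should roughly equal instructions per packet divided by instructions per cycle. A small sketch using only the numbers reported on this slide:

```python
# Cross-check of the slide's efficiency metrics: CPP ~= IPP / IPC.
metrics = {
    "Xeon E5-2699v4 (Broadwell)":   {"ipp": 499, "ipc": 2.8,  "cpp_reported": 180},
    "Xeon Platinum 8168 (Skylake)": {"ipp": 497, "ipc": 3.28, "cpp_reported": 158},
}
for cpu, m in metrics.items():
    cpp_derived = m["ipp"] / m["ipc"]
    print(f"{cpu}: derived CPP = {cpp_derived:.0f}, reported CPP = {m['cpp_reported']}")
# Derived values (~178 and ~152) land close to the reported 180 and 158;
# the small gaps come from rounding in the reported IPC figures.
```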
DP Benchmarking Metrics – External and Internal. Compute CPP from PPS or vice versa:
throughput[bps] = throughput[pps] * packet_size[bits]
program_unit_execution_time[sec] = #instructions/program_unit * #cycles/instruction * cycle_time[sec]
packet_processing_time[sec] = #instructions/packet * #cycles/instruction * cycle_time[sec]
#cycles_per_packet = #instructions/packet * #cycles/instruction
throughput[pps] = 1 / packet_processing_time[sec] = CPU_freq[Hz] / #cycles_per_packet
Treat the software network data plane as one would any program, with the instructions per packet being the program unit, and arrive at the main data plane benchmarking metrics. External metrics: PPS, BPS. Internal metrics: CPP, IPP, CPI (or 1/IPC).
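A minimal sketch of the same conversions in code, assuming the data plane worker runs at a fixed nominal frequency (turbo and frequency scaling ignored); the function names are illustrative only:

```python
# External <-> internal data plane metrics, following the formulas above.

def pps_from_cpp(cpu_freq_hz: float, cycles_per_packet: float) -> float:
    """throughput[pps] = CPU_freq[Hz] / #cycles_per_packet"""
    return cpu_freq_hz / cycles_per_packet

def cpp_from_pps(cpu_freq_hz: float, throughput_pps: float) -> float:
    """#cycles_per_packet = CPU_freq[Hz] / throughput[pps]"""
    return cpu_freq_hz / throughput_pps

def bps_from_pps(throughput_pps: float, packet_size_bytes: int) -> float:
    """throughput[bps] = throughput[pps] * packet_size[bits]"""
    return throughput_pps * packet_size_bytes * 8

if __name__ == "__main__":
    # Example with the slide's Skylake numbers: 2.7 GHz core, 158 cycles/packet.
    pps = pps_from_cpp(2.7e9, 158)             # ~17.1 Mpps per core
    print(f"{pps / 1e6:.1f} Mpps per core, "
          f"{bps_from_pps(pps, 64) / 1e9:.1f} Gbps of L2 traffic at 64B frames")
```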
Metrics – Mapping Them to Resources. Main architecture resources used for packet-centric operations:
1. Packet processing operations – how many CPU core cycles are required to process a packet?
2. Memory bandwidth – how many memory read and write accesses are made per packet?
3. I/O bandwidth – how many bytes are transferred over the PCIe link per packet?
4. Inter-socket transactions – how many bytes are accessed from the other socket, or from another core in the same socket, per packet?
(Illustrated on a 2-socket Broadwell server block diagram: per-socket DDR4 channels, QPI inter-socket links, PCIe-attached 50GE/100GE Ethernet NICs, PCH.)
In-depth introduction to SW data plane performance benchmarking: https://fd.io/resources/performance_analysis_sw_data_planes.pdf
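Each of the four resources above reduces to a simple per-packet ratio once raw counter deltas are collected over a measurement interval (for example with Linux perf or the pmu-tools referenced at the end). A hedged sketch; the counter names below are placeholders, not actual PMU event names:

```python
# Normalize raw telemetry counter deltas to per-packet resource metrics.
# Counter names are placeholders; map them to whatever events your PMU tool exposes.

def per_packet_metrics(counters: dict, packets_forwarded: int) -> dict:
    """Divide each counter delta (taken over one measurement interval) by packets forwarded."""
    return {
        "cycles_per_packet":        counters["core_cycles"] / packets_forwarded,
        "mem_accesses_per_packet":  (counters["dram_reads"] + counters["dram_writes"]) / packets_forwarded,
        "pcie_bytes_per_packet":    (counters["pcie_rd_bytes"] + counters["pcie_wr_bytes"]) / packets_forwarded,
        "xsocket_bytes_per_packet": counters["inter_socket_bytes"] / packets_forwarded,
    }

# Usage: feed measured counter deltas and the packet count reported by the traffic generator.
```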
Applicability – SW Network Services within a Node • Start simple • Benchmark the NIC-to-NIC packet path • Use the right test and telemetry tools and approaches • Analyze all key metrics: PPS, CPP, IPC, IPP and more • Find the performance ceilings of the Network Function data plane • Then apply the same methodology to other packet paths and services
The Continuity Problem • FD.io VPP works today • Great external and internal performance metrics • The world keeps moving on • New functions and features are being added continuously • New generations of hardware are showing up periodically • So, how do you keep the world happy, i.e.: • maintain the best-in-class performance? • prevent rogue patches going in? • qualify further optimizations of existing code? • quantify HW accelerators, processors, device setting changes?
Addressing the Continuity Problem with FD.io CSIT-CPL (Continuous Performance Lab) • CSIT-CPL goals and aspirations • FD.io VPP benchmarking • VPP functionality per specifications (RFCs1) • VPP performance and efficiency (PPS2, CPP3) • Network data plane – throughput (Non-Drop Rate), bandwidth, PPS, packet delay • Network Control Plane, Management Plane interactions (memory leaks!) • Performance baseline references for HW + SW stack (PPS, CPP) • Range of deterministic operation for HW + SW stack (NDR, PDR4) • Provide testing platform and tools to the FD.io VPP dev and user community • Automated functional and performance tests • Automated telemetry feedback with conformance, performance and efficiency metrics • Help drive good practice and engineering discipline into the FD.io dev community • Drive innovative optimizations into the source code – verify they work • Enable innovative functional, performance and efficiency additions & extensions • Prevent unnecessary code “harm” Legend: 1 RFC – Request For Comments – IETF specs, basically 2 PPS – Packets Per Second 3 CPP – Cycles Per Packet (metric of packet processing efficiency) 4 NDR, PDR – Non-Drop Rate, Partial Drop Rate
FD.io CSIT-CPL Continuous Performance Lab – CI/CD for SW network data planes • Continuous Testing and Reporting • Functional – Pass/Fail • Device Drivers – Pass/Fail • Performance Benchmarking – Throughput and Latency • no Pass/Fail, but a spectrum of data that needs to be analyzed and classified further • Continuous Analysis • Performance Trending, spotting progressions and regressions • Anomaly Detection and Notification • All in open source and published • Tools and code • Results and analytics
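Trending without pass/fail thresholds needs some form of automated classification. CSIT's actual anomaly detection is more sophisticated than this; the sketch below is only a minimal illustration of the idea, flagging a run whose throughput deviates from the recent trend by more than a chosen number of robust deviations. All values are synthetic.

```python
# Minimal illustration of trend-based anomaly flagging (not CSIT's production algorithm).
from statistics import median

def classify(history, new_result, window=10, k=3.0):
    """Compare a new throughput result against the trailing median, using MAD as a robust spread."""
    recent = history[-window:]
    med = median(recent)
    mad = median(abs(x - med) for x in recent) or 1e-9   # avoid divide-by-zero on a flat trend
    deviation = (new_result - med) / mad
    if deviation < -k:
        return "regression"
    if deviation > k:
        return "progression"
    return "normal"

runs_mpps = [11.9, 12.1, 12.0, 12.2, 12.0, 12.1, 11.9, 12.0, 12.1, 12.0]   # synthetic trend data
print(classify(runs_mpps, 10.4))   # -> "regression"
```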
FD.io CSIT-CPL Per release test and performance reports https://docs.fd.io/csit/rls1801/report/index.html
Measuring and Trending Performance – a Spectrum of Data https://docs.fd.io/csit/master/trending/
CSIT-CPL - Getting the “C” right in “CI/CD”.. • Need “bare metal” to execute tests • Many functional and all performance tests need to run on physical servers • Lots of tests, many combinations – they take time • Physical resources, testbeds, servers are always in short supply! • Dealing with scarce physical resources • Focus on efficiency and execution time • Reduce infra overhead • Speed up build time for per-patch tests • Reduce execution time • smarter NDR/PDR throughput rate search algorithms (see the sketch below) • Parallelize • Keep optimizing..
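CSIT's own rate-search algorithms are more elaborate than this, but the sketch below illustrates the basic interval-halving idea behind NDR/PDR throughput search, with the traffic-generator trial abstracted behind a callback. The function names, the precision value, and the 0.5% PDR loss tolerance in the usage comment are assumptions for the example, not CSIT defaults.

```python
# Interval-halving throughput rate search - an illustration only, not CSIT's production search.

def rate_search(run_trial, max_rate_pps, loss_tolerance=0.0, precision_pps=10_000):
    """Find the highest offered rate whose measured loss ratio stays within loss_tolerance.

    run_trial(rate_pps) must send traffic at rate_pps for a fixed duration and return
    the observed loss ratio (lost / offered). loss_tolerance=0.0 approximates NDR;
    a small non-zero tolerance (e.g. 0.005) approximates PDR.
    """
    lo, hi = 0.0, max_rate_pps
    best = 0.0
    while hi - lo > precision_pps:
        mid = (lo + hi) / 2
        if run_trial(mid) <= loss_tolerance:
            best, lo = mid, mid          # trial passed: search the upper half
        else:
            hi = mid                     # trial failed: search the lower half
    return best

# Hypothetical usage, with tg_trial provided by your traffic-generator wrapper:
# ndr = rate_search(tg_trial, max_rate_pps=148_809_523)                       # 100GE 64B line rate
# pdr = rate_search(tg_trial, max_rate_pps=148_809_523, loss_tolerance=0.005)
```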
CSIT-CPL – Testbeds Today: 2-Node Topology and 3-Node Topology. Systems Under Test are x86 servers, each with two Xeon E5-2699v3 processors (sockets connected over QPI, DDR4 memory) and three x8 PCIe NICs per socket; the “SW Devices” Under Test run on these servers. (Physical topology diagrams omitted.)
CSIT-CPL – Where we got to.. • Enabled per-patch performance tests • In POC phase due to limited physical testbed capacity, to be fixed shortly • Growing physical performance lab • 20 2-socket Xeon Skylake servers • Each Skylake server can do 280 Gbps of I/O full-duplex per socket! • https://goo.gl/UtbaHy
CSIT-CPL – .. and where we are going with this.. • Every patch performance-benchmarked • Because once it is merged, it is gone • Results summarized and abstracted for a meaningful feedback loop • To humans: contributors, committers, testers, users • To downstream projects • To trending analytics for anomaly detection and notification • To telemetry analytics for efficiency verification
Future: Planned Summary Data Views – Results and Analysis: #cycles/packet (CPP) and Throughput (Mpps). #cycles/packet = cpu_freq[MHz] / throughput[Mpps]. See Kubecon Dec-2017, Benchmarking and Analysis.., https://wiki.fd.io/view/File:Benchmarking-sw-data-planes-Dec5_2017.pdf
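A tiny worked example of that formula, with illustrative (not measured) values:

```python
# CPP from measured throughput, per the formula above.
cpu_freq_mhz = 2_700          # nominal worker core frequency (example value)
throughput_mpps = 12.0        # measured forwarding rate (example value)
cycles_per_packet = cpu_freq_mhz / throughput_mpps
print(f"{cycles_per_packet:.0f} cycles/packet")   # -> 225 cycles/packet
```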
Future: Planned Summary Data Views – Xeon Telemetry Analytics. (Chart omitted: TMAM Level 1 distribution and IPC with Hyper-Threading for CoreMark, DPDK-Testpmd L2 Loop, DPDK-L3Fwd IPv4 Forwarding, VPP L2 Patch Cross-Connect, VPP L2 MAC Switching, OVS-DPDK L2 Cross-Connect and VPP IPv4 Routing; buckets: %Retiring, %Bad_Speculation, %Frontend_Bound, %Backend_Bound; IPC as compute usage efficiency.)
Observations:
• IPC – good IPC for all network workloads due to code optimization; HT makes it even better.
• Retiring – instructions retired, drives IPC.
• Bad Speculation – minimal bad branch speculation, attributed to architecture logic and software pragmas.
• Backend Stalls – major contributor causing low IPC in no-HT cases; HT hides backend stalls.
• Frontend Stalls – become a factor with HT as more instructions are executed by both logical cores.
See Kubecon Dec-2017, Benchmarking and Analysis.., https://wiki.fd.io/view/File:Benchmarking-sw-data-planes-Dec5_2017.pdf
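The Level-1 TMAM buckets shown above come from the standard decomposition of pipeline slots described in Yasin's ISPASS 2014 paper (referenced on the next slide). A hedged sketch of that arithmetic, assuming a 4-wide issue core; the raw counter values must be supplied by your PMU tool (typically from events such as CPU_CLK_UNHALTED.THREAD, UOPS_ISSUED.ANY, UOPS_RETIRED.RETIRE_SLOTS, IDQ_UOPS_NOT_DELIVERED.CORE and INT_MISC.RECOVERY_CYCLES).

```python
# Top-down Microarchitecture Analysis Method (TMAM), Level 1 decomposition.
# A sketch per Yasin (ISPASS 2014); a 4-wide issue core is assumed by default.

def tmam_level1(clk_thread, uops_issued, uops_retired_slots,
                idq_uops_not_delivered, recovery_cycles, width=4):
    slots = width * clk_thread                      # total pipeline slots in the interval
    frontend_bound = idq_uops_not_delivered / slots
    bad_speculation = (uops_issued - uops_retired_slots + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {"Retiring": retiring, "Bad_Speculation": bad_speculation,
            "Frontend_Bound": frontend_bound, "Backend_Bound": backend_bound}
```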
Breaking the Barrier of Software Defined Network Services – 1 Terabit Services on a Single Intel® Xeon® Server! Superior Performance, Most Efficient on the Planet, Flexible and Extensible, Open Source, Cloud Native. EFFICIENCY, PERFORMANCE: Software Defined Networking, Cloud Network Services, Linux Foundation. A Universal Terabit Network Platform for Cloud-native Network Services.
Summary • Terabit-level SW network services are within reach • FD.io is here, available to all • And it is continuously improving.. • Next is to make use of them in the cloud • Birth of Cloud-native Network Services • E.g. integration into the k8s eco-system • Industry collaboration in open source is essential • Code development, benchmarking • Publishing all work and results, dev and test • Benchmarking automation tools • Automated telemetry data analytics
References FD.io VPP, CSIT-CPL and related projects • VPP: https://wiki.fd.io/view/VPP • CSIT-CPL: https://wiki.fd.io/view/CSIT • pma_tools - https://wiki.fd.io/view/Pma_tools Benchmarking Methodology • Kubecon Dec-2017, Benchmarking and Analysis.., https://wiki.fd.io/view/File:Benchmarking-sw-data-planes-Dec5_2017.pdf • “Benchmarking and Analysis of Software Network Data Planes” by M. Konstantynowicz, P. Lu, S.M. Shah, https://fd.io/resources/performance_analysis_sw_data_planes.pdf Benchmarks • EEMBC CoreMark® - http://www.eembc.org/index.php • DPDK testpmd - http://dpdk.org/doc/guides/testpmd_app_ug/index.html • FDio VPP – Fast Data IO packet processing platform, docs: https://wiki.fd.io/view/VPP, code: https://git.fd.io/vpp/ Performance Analysis Tools • “Intel Optimization Manual” – Intel® 64 and IA-32 architectures optimization reference manual • Linux PMU-tools, https://github.com/andikleen/pmu-tools TMAM • Intel Developer Zone, Tuning Applications Using a Top-down Microarchitecture Analysis Method, https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win • Technion presentation on TMAM , Software Optimizations Become Simple with Top-Down Analysis Methodology (TMAM) on Intel® Microarchitecture Code Name Skylake, Ahmad Yasin. Intel Developer Forum, IDF 2015. [Recording] • A Top-Down Method for Performance Analysis and Counters Architecture, Ahmad Yasin. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2014, https://sites.google.com/site/analysismethods/yasin-pubs
Opportunities to Contribute – We invite you to participate in FD.io • Get the Code, Build the Code, Run the Code • Try the vpp user demo • Install vpp from binary packages (yum/apt) • Install Honeycomb from binary packages • Read/Watch the Tutorials • Join the Mailing Lists • Join the IRC Channels • Explore the wiki • Join FD.io as a member. Areas to contribute to: • Container Integration • Firewall • IDS • Hardware Accelerators • Control plane – support your favorite SDN Protocol Agent • DPI • Test tools • Packaging • Testing
THANK YOU !
