2266 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 71, NO. 5, MAY 2024

A Heterogeneous RISC-V Based SoC for Secure Nano-UAV Navigation

Luca Valente, Alessandro Nadalini, Asif Hussain Chiralil Veeran, Mattia Sinigaglia, Bruno Sá, Nils Wistoff, Graduate Student Member, IEEE, Yvan Tortorella, Simone Benatti, Rafail Psiakis, Ari Kulmala, Baker Mohammad, Senior Member, IEEE, Sandro Pinto, Daniele Palossi, Luca Benini, Fellow, IEEE, and Davide Rossi, Senior Member, IEEE

Abstract—The rapid advancement of energy-efficient parallel ultra-low-power (ULP) microcontroller units (MCUs) is enabling the development of autonomous nano-sized unmanned aerial vehicles (nano-UAVs). These sub-10 cm drones represent the next generation of unobtrusive robotic helpers and ubiquitous smart sensors. However, nano-UAVs face significant power and payload constraints while requiring advanced computing capabilities akin to standard drones, including real-time Machine Learning (ML) performance and the safe co-existence of general-purpose and real-time OSs. Although some advanced parallel ULP MCUs offer the necessary ML computing capabilities within the prescribed power limits, they rely on small main memories (<1 MB) and microcontroller-class CPUs with no virtualization or security features, and hence only support simple bare-metal runtimes. In this work, we present Shaheen, a 9 mm², 200 mW SoC implemented in 22 nm FDX technology. Differently from state-of-the-art MCUs, Shaheen integrates a Linux-capable RV64 core, compliant with the v1.0 ratified Hypervisor extension and equipped with timing-channel protection, along with a low-cost, low-power memory controller exposing up to 512 MB of off-chip low-cost, low-power HyperRAM directly to the CPU. At the same time, it integrates a fully programmable, energy- and area-efficient multi-core cluster of RV32 cores optimized for general-purpose DSP as well as reduced- and mixed-precision ML. To the best of the authors' knowledge, it is the first silicon prototype of a ULP SoC coupling RV64 and RV32 cores in a heterogeneous host+accelerator architecture fully based on the RISC-V ISA. We demonstrate the capabilities of the proposed SoC on a wide range of benchmarks relevant to nano-UAV applications, including general-purpose DSP as well as inference and online learning of quantized DNNs. The cluster can deliver up to 90 GOp/s and up to 1.8 TOp/s/W on 2-bit integer kernels, and up to 7.9 GFLOp/s and up to 150 GFLOp/s/W on 16-bit FP kernels.

Index Terms—Heterogeneous, Linux, low-power, autonomous nano-UAVs, RISC-V.

Manuscript received 30 July 2023; revised 3 November 2023 and 7 January 2024; accepted 18 January 2024. Date of publication 7 February 2024; date of current version 30 April 2024. This work was supported in part by the Technology Innovation Institute, Secure Systems Research Center, Abu Dhabi, United Arab Emirates; in part by the Spoke 1 on Future High-Performance Computing (HPC) of the Italian Research Center on High-Performance Computing, Big Data and Quantum Computing (ICSC), which received funding from the Ministry of University and Research (MUR) for the Mission 4–Next Generation EU programme; and in part through the TRISTAN (101095947) project, which received funding from the HORIZON CHIPS Joint Undertaking programme. This article was recommended by Associate Editor Y. Tang. (Corresponding author: Luca Valente.)

Luca Valente, Alessandro Nadalini, Mattia Sinigaglia, Yvan Tortorella, Simone Benatti, and Davide Rossi are with the Department of Electrical, Electronic and Information Engineering, University of Bologna, 40136 Bologna, Italy (e-mail: [Link]@[Link]). Asif Hussain Chiralil Veeran and Baker Mohammad are with the Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates. Bruno Sá and Sandro Pinto are with Centro ALGORITMI, University of Minho, 4800-058 Guimarães, Portugal. Nils Wistoff is with the Integrated Systems Laboratory (IIS), ETH Zürich, 8092 Zürich, Switzerland. Rafail Psiakis and Ari Kulmala are with the Secure Systems Research Center, Technology Innovation Institute, Abu Dhabi, United Arab Emirates. Daniele Palossi is with the Integrated Systems Laboratory (IIS), ETH Zürich, 8092 Zürich, Switzerland, and also with the Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI, 6900 Lugano, Switzerland. Luca Benini is with the Department of Electrical, Electronic and Information Engineering, University of Bologna, 40136 Bologna, Italy, and also with the Integrated Systems Laboratory (IIS), ETH Zürich, 8092 Zürich, Switzerland.

Color versions of one or more figures in this article are available at [Link]

Digital Object Identifier 10.1109/TCSI.2024.3359044

1549-8328 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See [Link] for more information.

I. INTRODUCTION

THE number of Internet-of-Things (IoT) devices and the spectrum of IoT applications are rapidly growing: from home automation, robotics, industrial gateways, and building automation to smart cities, digital signage, medical equipment, and more [1]. In this context, nano-sized unmanned aerial vehicles (nano-UAVs) can be considered the "ultimate" IoT node, thanks to their ability to navigate, sense, analyse, and understand the surrounding environment. Nano-UAVs have a form factor of a few centimeters in diameter and a weight of only tens of grams, which allows them to safely operate near humans and in narrow, cramped spaces [2], [3]. They have a total power envelope of a few Watts, of which only 5-15% is available for computation [4], and their small physical footprint and limited payload restrain the maximum battery and printed circuit board size and exclude any form of active cooling. Nowadays, microcontroller units (MCUs) are the only computing platforms that meet the nano-UAV's power and form-factor constraints.

MCUs feature simple RISC host processors (e.g., ARM Cortex-M) with low computational capabilities and no virtualization support, to which they expose just a few hundred kBytes of on-chip SRAM scratchpad memory (SPM) [5], [6], [7], [8], [9], [10], [11]. To deliver more advanced computational capabilities, state-of-the-art (SoA) MCUs integrate accelerators with high data processing capabilities [5], [6], [7], [8], [9], [10], [11]. Usually, ultra-low-power (ULP) devices' accelerators are hardwired application-specific datapaths [9], [10], which achieve the best energy efficiency but are tailored to a single application domain, leading to poor programmability and a high non-recurring engineering cost [12] while occupying a considerable part of the scarce area resources. To improve the overall versatility of the SoC, recent works replace ASIC

Authorized licensed use limited to: Zhejiang University. Downloaded on September 07,2024 at [Link] UTC from IEEE Xplore. Restrictions apply.
VALENTE et al.: HETEROGENEOUS RISC-V BASED SoC FOR SECURE NANO-UAV NAVIGATION 2267

accelerators with fully programmable parallel accelerators [7], [8] that achieve competitive energy efficiency while maintaining significant flexibility, and hence make the most out of the available power and area.

TABLE I
UAVs Taxonomy by Vehicle Class-Size [2]
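A rough way to see why this power and area frugality matters: as discussed later in this paper, only around 5-15% of a UAV's total power envelope is available for computation [4]. The sketch below turns a vehicle's power envelope into its compute budget; the example envelopes are illustrative assumptions, not values taken from Table I.

```python
def compute_budget_w(total_power_w: float,
                     share: tuple = (0.05, 0.15)) -> tuple:
    """Return the (min, max) power available for onboard computation,
    assuming 5-15% of the total envelope goes to compute [4]."""
    lo, hi = share
    return total_power_w * lo, total_power_w * hi

# Illustrative envelopes (assumed for this sketch, not Table I data):
for vehicle, envelope_w in [("standard-size UAV", 100.0),
                            ("micro-size UAV", 20.0),
                            ("nano-UAV", 5.0)]:
    lo_w, hi_w = compute_budget_w(envelope_w)
    print(f"{vehicle}: {lo_w:.2f}-{hi_w:.2f} W for computation")
```

For a nano-UAV with a few-Watt envelope this leaves roughly 0.25-0.75 W for compute, which is why a 200 mW SoC such as Shaheen fits the budget while embedded computers drawing several Watts do not.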
The increase in computing capabilities of SoA MCUs
has enabled nano-UAVs to achieve autonomous flight while
executing intelligent auxiliary tasks. For example, Quantized
Neural Networks (QNNs) have been proposed to carry out
obstacle avoidance [2] or human pose estimation and object
detection [3]. At the same time, floating-point (FP) digital-signal-processing (DSP) computation has been proposed for path planning or structural build monitoring [13], [14]. Also, a recent trend for edge devices is online learning, which enables a small portion of the Neural Network (NN) training to happen on the edge, increasing accuracy and reliability directly in the field [15], [16]. Nevertheless, even the most advanced MCUs supporting this new class of applications lag behind in terms of software support. Due to the small amount of available memory and the simplicity of their host CPU, SoA MCUs only provide close-to-metal software environments, based on minimal real-time operating systems (RTOSs) or simple bare-metal runtimes. However, enabling the execution of full-fledged OSs (like Linux) securely along with the real-time control applications would allow nano-UAVs to leverage an existing, mature, and solid software stack and hence ease software development [17]. In this context, this work presents a step forward in the current and future generation of autonomous nano-UAVs. We present Shaheen, a 9 mm², 200 mW heterogeneous System-on-Chip (SoC) implemented in 22 nm FDX technology that couples an application-class RV64 host processor with a low-power HyperRAM memory controller and with a flexible cluster of eight RV32 cores, providing best-in-class energy efficiency and performance for IoT applications. The design is fine-tuned to accommodate the requirements of emerging nano-UAV applications.

The host includes hardware virtualization support [18]: to the best of our knowledge, it is the first silicon implementation fully compliant with the ratified RISC-V Instruction Set Architecture (ISA) Hypervisor extension,¹ enabling the secure coexistence of an RTOS and a full-blown OS on the same host core. In particular, the Hypervisor extension aims to provide confidentiality and integrity of virtual machines (VMs) by enforcing isolation (via two-stage virtual memory) between multiple consolidated guest OSes, i.e., a General-Purpose OS (GPOS) and an RTOS. To further isolate the execution of these coexisting software stacks (trusted and untrusted), prevent security threats, and ensure multi-domain operation, the host core features Physical Memory Protection (PMP) [19] and ISA and micro-architecture extensions for timing-channel mitigation [20]. Namely, the PMP aims to provide confidentiality and integrity by limiting the physical addresses accessible by software running on CVA6. PMP enforces the separation between the bare-metal firmware (running in machine mode) and everything else through a set of additional registers, which specify the physical memory access privileges (read, write, execute) for each physical memory region. Lastly, timing-channel mitigation aims to provide confidentiality by eliminating side-channel attacks.

¹SiFive, Ventana, and StarFive have announced RISC-V CPU designs with Hypervisor extension support, but we are not aware of any silicon available on the market yet.

Apart from a 1 MB on-chip SRAM SPM, Shaheen connects to up to 512 MB of off-chip low-power HyperRAMs [21] directly on the main interconnect, through a low-power, low-cost, 0.27 mm², 1.6 Gbps fully-digital memory controller. Relying on HyperRAMs instead of the high-end LPDDR4/5 memories typically integrated into embedded Linux-capable systems frees Shaheen from expensive and proprietary memory controllers with large mixed-signal PHYs, while still exposing hundreds of MB to the host processor and matching the tight power, form-factor, and cost requirements of nano-UAVs.

The cluster integrates the so-called Flex-V cores. The Flex-V core extends the RISC-V ISA with custom instructions for reduced-precision single-instruction-multiple-data (SIMD) FP-based computation and byte and sub-byte mixed-precision QNN inference, achieving State-of-the-Art (SoA) software power and energy efficiency. Thanks to these aggressive optimizations, the cluster achieves up to 22.5 Giga-Operations per second (GOp/s) and 90 GOp/s on 8-bit and 2-bit integer kernels, enabling low-latency mixed-precision QNN-based autonomous navigation [2], [3]. Furthermore, the cluster achieves up to 4 Giga-Floating-Point Operations per second (GFLOp/s) and 7.9 GFLOp/s on FP32 and FP16 kernels, enabling DSP and online training of neural networks (NNs) [15], [16].

To sum up, compared to State-of-the-Art MCUs for nano-UAVs, Shaheen is the first one coupling:
• an RV64 host with Hypervisor support and security features,
• a low-power memory controller exposing hundreds of MB at up to 1.6 Gbps to the host core,
• a fully-programmable parallel RV32 cluster providing SoA software performance for IoT,
while keeping the overall power envelope within 200 mW.

The structure of this manuscript is as follows: in Section II, we present an overview of State-of-the-Art SoCs for UAVs. Sections III and IV delve into Shaheen's architecture, its implementation, and the measurements obtained from the silicon prototype. Sections V and VI address the software stack and benchmark the cluster's performance and energy efficiency on a relevant set of applications for nano-UAVs. In Section VII, we compare Shaheen with similar silicon prototypes from both industry and academia. Finally, Section VIII summarizes our results and offers insights into potential future research directions.

II. RELATED WORK

Table I shows the four categories of UAV systems according to size, weight, power budget, and onboard processing platform. The latter two characteristics are tightly coupled, as only around 5-15% of the power budget is allocated to computation [2]. Across all the categories of UAVs, autonomous


TABLE II
State-of-the-Art SoCs for UAVs

navigation is achieved by the combination of two components: mission control and flight control. Mission control is the high-level decisional part of the navigation algorithm, e.g., path planning [23], optimization-based control [24], etc. To carry out these types of tasks, SoA drones mostly rely on machine learning (ML) algorithms [2], [17]. Flight control, on the other hand, is the actuation of the output decisions of mission control: it collects data from the sensors to determine the vehicle's state and generates the control law, which manages the actuators [25]. Flight control is often based on cascaded PID control [26], especially in the case of nano-UAVs [27], [28], and it is not as computationally intensive as mission control, but it requires low-latency guarantees. As a consequence, also in the context of standard and micro-drones, flight control is usually carried out by simple MCUs with a predictable execution time, like the STM32-H7 [5] integrated into the Pixhawk board [25]. Table II shows some mainstream SoCs successfully deployed on drones of standard, micro, and nano size. For each SoC, it highlights the computational capabilities and power envelope, as well as the specific tasks and UAV platforms they are suited for, detailed in the sections below. Sections II-A and II-B describe the state of the art of standard, micro, and nano UAV SoCs.

A. SoCs for Standard and Micro-Sized UAVs

As Table I shows, micro-size drones integrate embedded computers, while standard-sized drones can even accommodate desktop processors. Nevertheless, embedded processors can nowadays deliver performance in the order of hundreds of TOp/s and hundreds of GFLOp/s, which has proven to be sufficient to support the full flight stack for mission control, both for micro [17] and standard-size UAVs [22].

Embedded computers integrate high-end SoCs with application-class cores, supporting virtualization and various privilege levels (and hence full-fledged OSs), embedded GPUs, and GBytes of high-performance off-chip LPDDR/DDR4/5 memories, connected through expensive, large, and power-hungry mixed-signal DDR controllers, all within a power envelope of a few watts [29], [31], [32], [33]. The NVIDIA Jetson TX2 is claimed by NVIDIA to be "the fastest, most power-efficient embedded artificial intelligence (AI) computing device" [29], and it is the board of choice for the Agilicious drone [17]. It features a quad-core Cortex-A57 running at up to 2 GHz and a Pascal CUDA GPU, which can deliver up to 1.33 TFLOp/s, resulting in an overall power consumption of more than 7.5 W. The Intel Atom x7 is the heart of the Intel UpBoard platforms, as big as a credit card. It features 4 Intel Atom processors running at up to 2 GHz and an Intel HD 505 GPU delivering up to 230 GFLOp/s while consuming roughly 10 W. Another compute platform commonly used on autonomous UAVs is the NanoPi Neo Air, which integrates an Allwinner H7 SoC with a quad-core Cortex-A7 and a Mali-400 MP2 GPU, delivering up to 10 GFLOp/s. All these SoCs offer a mainstream Ubuntu-ready software stack and virtualization capabilities and can handle very sophisticated and complex applications. However, due to their power envelope, size, and the necessity for high-end off-chip memories, these SoCs can only be integrated into standard and micro-UAVs.

Naturally, Shaheen cannot compete with these architectures in terms of performance, but our approach borrows the best of their characteristics while targeting a much smaller power envelope. Firstly, to mimic high-end SoCs with their heterogeneous GPU-based architecture, Shaheen integrates an RV32-based parallel programmable cluster along with an RV64 CPU. Secondly, it exposes a significant amount of off-chip main memory to the CPU. However, instead of high-performance DRAMs (LPDDR3/4/5), which are connected through large, proprietary, and expensive mixed-signal PHYs with a high pin count (>30), Shaheen leverages HyperRAMs, which are fully-digital, low-power, small-area DRAMs with fewer than 14 pins, feasible to deploy on nano-UAVs. A similar approach is adopted in Cheshire [34], which is not optimized for nano-UAV applications. Cheshire revolves around CVA6, as Shaheen does, and exposes up to 1 GB of Reduced Pin Count (RPC) DRAM memory, which uses a minimum number of signals to deliver


DDR3-level in-system bandwidth at the cost of 22 switching signals for a 16-bit-wide data bus [34], [35]. While RPC and the related controller offer higher bandwidth than HyperBUS, the RPC protocol is more convoluted, leading to higher design complexity and a bigger area, mostly due to the four 8 kB buffers [34]. More importantly, Cheshire's CVA6 does not feature hardware virtualization support or micro-architectural extensions for timing-channel mitigation. Lastly, while being easily extensible through the AXI4 interface, Cheshire's silicon prototype does not integrate a parallel accelerator, heavily limiting the offered performance. To sum up, Shaheen is the first silicon implementation of a heterogeneous MCU coupling an RV64 core with a cluster of eight RV32 cores and tens of MB of main memory.

B. SoCs for Nano-UAVs

A state-of-the-art MCU for nano-UAV platforms is the STM32-F4 [6]. The STM32-F4 is the computational unit of the Crazyflie [36] platform, integrating a Cortex-M4 core and 192 kB of on-chip SRAM with a maximum operating frequency of 180 MHz. Its low performance and small memory capacity limit the autonomous navigation capabilities of the nano-drone when compared to embedded computers. To this extent, two kinds of approaches have been proposed: minimization of the workload [37], or offloading of the mission-control computation to an external base station [38], limiting the MCU to flight control. The latter approach presents severe drawbacks: in the first instance, it introduces network-dependent latency, limiting the maximum distance from the workstation to a few tens of meters. Also, the transmitted data are subject to noise on the transmission channel, limiting reliability, and to eavesdropping on confidential data [39].

To offer enhanced computational capabilities within a small power budget, recent works also propose SoCs featuring hardwired ASIC accelerators designed for specific UAV applications, such as motion control [9], visual-inertial odometry (VIO) [10], simultaneous localization and mapping (SLAM) [40], or QNN inference [7], [8]. These accelerators achieve impressive energy efficiency, in the order of hundreds of TOp/s/W, by carefully mapping the target algorithm to the hardware. For example, many accelerators exploit the inherent parallelism of the target application, such as using a systolic array for motion control [9]. Another common approach exploits reduced-precision arithmetic, as in SLAM [40] and VIO [10], to reduce the memory footprint and the datapath size. Exploiting both parallelism and reduced-precision computation is also a well-established technique to accelerate QNN inference and training, due to the nature of such algorithms. For example, accelerators like the NE16 in GAP9 [7], the HWCE in GAP8 [7], and the ternary-weight neural-network (so-called CUTIE) accelerator in Kraken [8], able to reach peaks of 11.6 TMAC/s, have been proposed to speed up QNN inference. However, due to poor flexibility and programmability, these accelerators still have to rely on general-purpose CPUs to achieve end-to-end flight. Furthermore, the high area cost per device makes them hard to adopt, as they risk becoming obsolete due to the rapid evolution of the target nano-UAV applications.

To overcome these limitations, recent MCUs integrate parallel, fully-programmable, flexible accelerators that have successfully proved to enable autonomous navigation [2], [3]. Namely, GAP8, GAP9 [7], and Kraken [8] are MCUs with enhanced computational capabilities, based on parallel programmable accelerators. GAP8 and GAP9 are commercial products by GreenWaves Technologies compliant with the so-called Crazyflie AI-deck [41] board, which is meant as a companion of the Crazyflie to offload the mission-control tasks [2]. GAP8 embeds the so-called Ri5cy [7] core as host CPU and 1.5 MB of on-chip SRAM memory, accompanied by a parallel programmable cluster of eight more Ri5cy cores delivering up to 150 GOp/s on 8-bit data. Ri5cy is a 4-pipeline-stage core compliant with the so-called XpulpV2 ISA, a custom RISC-V ISA based on RV32 with extensions for DSP and ML applications, with support for 16/8-bit SIMD operations and hardware loops. GAP9 is an improved version of the GAP8 processor. It is fabricated in a more advanced node than GAP8, halving the power envelope, and it also features 2 MB of non-volatile SVM memory. Also, differently from GAP8, GAP9's cluster includes 4 FPUs with FP16/32 support. Lastly, Kraken [8] is a research prototype based on the same heterogeneous architecture as GAP8 and GAP9, i.e., an RV32 CPU along with an eight-core RV32 cluster, which delivers up to 90 GOp/s on 2-bit data. Kraken's RV32 cores are a more advanced version of Ri5cy, i.e., the Ri5cyNN cores [8], with support for sub-byte SIMD operations and fused Mac&Load instructions, which enable the concurrent execution of SIMD dot-products and memory accesses, increasing the computational efficiency up to 94%. Kraken embeds 1.5 MB of on-chip SRAM memory and the CUTIE accelerator, able to achieve up to roughly 90k ternary MACs per cycle. Furthermore, it provides an event-based camera, tightly coupled with a Spiking Neural Engine accelerator. When compared to traditional cameras, event-based cameras offer high temporal resolution (in the order of µs), very high dynamic range (140 dB vs. 60 dB), low power consumption, and high pixel bandwidth (on the order of kHz), resulting in reduced motion blur [42].

Shaheen's approach leverages the best from these advanced AI IoT SoCs, integrating its own fully-programmable parallel 8-core RV32 cluster accelerator. Shaheen's RV32 cores stem from the Ri5cyNN cores and are further enhanced with mixed-precision support to eliminate the massive software overhead necessary for packing and unpacking data when executing mixed-precision sub-byte kernels, providing up to 8.5x speed-up over Kraken at less than 5.6% extra area over the baseline core without extensions. In addition, Shaheen addresses a major limitation of SoA MCUs: the software stack based on lightweight RTOSs or simple bare-metal runtimes. Programming applications on these stacks is hard, owing to (i) the lack of virtualization capabilities of the host CPUs and (ii) the small amount of memory directly accessible through loads and stores, which limits the maximum software memory footprint. In the context of MCUs, memory resources coincide with on-chip SRAMs and off-chip DRAMs. The former provide high bandwidth but are limited to a few hundred kB due to area and power constraints [21]. The latter offer one order of magnitude more capacity but are much slower and are typically accessed only through explicit input-output copy functions. Thus, going beyond the SoA, to support a richer software stack while offering the advanced computing performance and energy efficiency of the RV32-based cluster, Shaheen integrates an RV64 core with advanced virtualization and security features, along with up to 512 MB of main memory. This enables the secure coexistence of a rich and mature general-purpose OS and a bare-metal RTOS on the same platform and


eases porting of feature-rich software stacks for robotics, such as ROS [43].

TABLE III
RISC-V ISA Privilege Modes With the Hypervisor Extension

Fig. 1. Shaheen architecture block diagram.

Fig. 2. RISC-V privilege levels.

III. SHAHEEN ARCHITECTURE

Shaheen consists of 4 clock domains, as illustrated in Fig. 1: (i) the CVA6 domain, where the host core is; (ii) the host domain, including the main interconnect and 4 256 kB interleaved SRAM banks; (iii) the cluster domain, served by 16 16 kB interleaved SRAM banks and 8 specialized RV32 cores; and (iv) the peripheral domain.

A. CVA6 Host Core

CVA6 [44] is the heart of Shaheen. It is an open-source, 6-stage, single-issue, in-order, 64-bit, Linux-capable RISC-V core, supporting the RV64GC ISA variant, SV39 virtual memory with a dedicated Memory Management Unit (MMU), three levels of privilege (Machine, Supervisor, User), and PMP [19]. In the context of this work, the baseline version of CVA6 has been enhanced with 2 extra features to provide high-assurance isolation between the different applications co-existing on the core:
1) hardware support for virtualization, compliant with the ratified 1.0 version of the RISC-V Hypervisor specification [18];
2) a temporal fence instruction, namely fence.t [20], which flushes µarchitectural state and enables the OS to close covert channels with a low increase in context-switch costs and negligible hardware overhead.
Such new features are relevant to many UAV applications, such as the co-existence of full-fledged and real-time OSes (both custom and legacy), as well as isolation for safety and security reasons.

1) H Extension: Tab. III shows the different privilege modes when implementing the Hypervisor extension, and Fig. 2 the resulting software stack. The nominal privilege modes are machine (M), supervisor (S), and user (U). The Hypervisor extension adds the virtualization mode (V), indicating whether the hart is currently executing in a guest (V=1) or not (V=0). When V=0, the S-mode is modified into the hypervisor-extended supervisor (HS) mode, well suited to host both type-1 and type-2 hypervisors. Other than in HS-mode, when V=0, the hart can either be in M-mode or in U-mode atop an OS running in HS-mode. When V=1, two new privilege levels are added, namely the virtual supervisor (VS) mode and the virtual user (VU) mode. Also, the Hypervisor extension defines a second stage of translation (the so-called "G-stage") to virtualize the guest memory by translating guest-physical addresses (GPA) into host-physical addresses (HPA).

To enable these new execution modes, the Control Status Register (CSR) and Decode modules have been modified. The CSR module was extended to implement the first three building blocks that comprise the hardware virtualization logic, specifically: (i) access logic and permission checks for VS-mode and HS-mode CSRs, (ii) delegation and triggering of exceptions and interrupts, and (iii) handling of trap entry and exit. The Decode module underwent changes to enable the decoding of hypervisor instructions (such as hypervisor load/store instructions and memory-management fence instructions), as well as the execution of all VS-mode-related instructions and the triggering of access exceptions.

The MMU's page table walker (PTW) and translation lookaside buffer (TLB) have been modified to support the second stage of translation. The PTW features a new control state to monitor the current stage of translation and facilitate the switching of contexts between VS-Stage and G-Stage translations. Finally, the TLB entries have been extended to store both VS-Stage and G-Stage Page Table Entries (PTEs), as well as their corresponding permissions and virtual machine identifiers. Overall, all these modifications account for less than 6% extra area and hardware while enabling the safe co-existence of a full-fledged guest OS (executed in VS-mode) together with a bare-metal RTOS (executed in U-mode).

2) fence.t: The CVA6 core also implements the fence.t instruction [20]. This instruction is added to CVA6's ISA to prevent timing channels: exploitable hardware resources holding state that depends on the execution history (caches, TLBs, branch predictors, and prefetchers) can leak information if not properly reset during a context switch. Timing channels can be exposed by prime-and-probe attacks [20]. In these kinds of attacks, the spy first brings the target hardware resource


Fig. 3. Channel matrices on the CHANNEL BENCH test. Fig. 4. HyperRAM memory controller architecture.

into a known state (prime). In the following time slice, the OS switches to an application containing a Trojan, which accesses a subset of the hardware resource to encode a secret. Finally, when the execution switches back to the spy, it again traverses (probe) the whole buffer and observes an execution time t, correlated with the encoded secret. For data caches, the spy traverses a large buffer of n lines so that the Trojan can then transmit a secret s ≤ n by touching s lines: in the last time slice, the spy decodes s from the measured execution time t.

In this context, the fence.t extends the control that the OS, or the Hypervisor, has over the hardware. Namely, it provides the capability of clearing vulnerable microarchitectural state to enable a history-independent context-switch latency by flushing the caches and the TLB and resetting the internal FSMs of the core. The fence.t has been validated against prime-and-probe attacks from the MASTIK toolkit [20], [45]. These attacks are implemented within Ge's CHANNEL BENCH [46], [47] suite, which provides a minimal OS and data collection infrastructure, running on an experimental version of seL4 supporting timing protection. To visualize the correlation between s and t, we use channel matrices. A channel matrix represents the conditional probability of measuring an execution time t, given an input secret s. In Fig. 3, we represent the channel matrix as heatmaps: s (the secret encoded by the Trojan by touching s ≤ n data cache lines) varies horizontally, and t (the execution time measured by the spy) varies vertically; bright colours indicate a high probability and dark colours a low probability of measuring such a t, given a certain s.

Fig. 3 shows the channel matrices on the CHANNEL BENCH test for CVA6's write-through L1 data cache. On the left, the matrix obtained without the fence.t is shown: the correlation between the Trojan's secret and the spy's probe time indicates a covert channel. On the right, when using the fence.t, there is no correlation. Adding less than 320 clock cycles to the context-switch latency (insignificant at typical switch rates of 1 kHz), the fence.t requires a low implementation effort and negligible hardware costs.

B. Host & Peripheral Domain

The host domain leverages the popular AXI4 protocol [48] for the main interconnect. Namely, it includes a 64-bit AXI4 crossbar delivering up to 32Gbps on each AXI4 port, respectively on read and write channels. It also includes 4 256kB SRAM banks, composing a 1MB L2 ScratchPad Memory (L2SPM) delivering up to 64Gbps, either for writing or reading. The L2SPM is meant (i) to store data to be shared with off-chip peripherals, (ii) to store the cluster code, (iii) to enable fast communication between CVA6 and the cluster, and, more in general, (iv) to provide low-latency (<10 clock cycles) and predictable accesses.

To enable independent data transfer from peripherals to the SoC, Shaheen includes in the peripheral domain the so-called "µDMA subsystem", a controller intended to autonomously serve a set of I/O interfaces popular in critical applications. Such interfaces include, for instance, HyperBUS, I2C, (Q)SPI, CPI, SDIO, UART, CAN, PWM, and I2S. The µDMA exports two ports, one for receiving and one for sending data, to read/write data from/to the L2SPM SRAM memory to/from the off-chip peripherals [7]. Shaheen also features an open-source Linux-compliant Ethernet IP, to be fully compliant with the Pixhawk standard [25], a popular set of open-source hardware specifications and guidelines for drone systems development.

1) HyperRAM Memory Controller: Fig. 4 depicts Shaheen's HyperRAM controller, which provides a configuration APB port and an AXI4 subordinate port. It connects the SoC with off-chip HyperRAMs, compliant with the HyperBUS protocol, a fully digital protocol counting 11+n pins: 3 control pins, n Chip Select (CS) pins, and 8 Double-Data-Rate pins used both for commands and data [21]. Depending on the off-chip memory models, the controller exposes between 32MB and 512MB to the interconnect, and it provides up to 1.6 Gbps. HyperRAMs are the main memory of choice for Shaheen because, differently from high-end DDR DRAM memories, they target a much lower power consumption and silicon footprint while guaranteeing enough bandwidth for advanced AI IoT applications and enough capacity to boot embedded SPM Linux [21].

There are two distinct modules within the HyperRAM controller, i.e., the PHY controller (back-end) and the front-end, operating in different frequency domains. The front-end module consists of an AXI4-to-PHY converter and a specialized µDMA engine channel accessible through APB to execute software-programmed DMA transfers. The AXI4 and µDMA transactions are multiplexed towards the PHY, which translates the incoming data packets into HyperRAM transactions and vice versa. The AXI4 front-end enqueues the AXI4 transactions individually, lets through only one read or one write at a time, and converts it into a request for the PHY. At this point, the back-end translates the request into a command for the HyperRAMs and issues it over the HyperBUS. Following, in the case of a write, the W channel transactions get converted into multiple PHY data packets. For reads, the PHY back-end sends data packets to a converter that then populates the R channel. The µDMA engine directly connects the L2SPM and the back-end and can generate both 1D and 2D burst transactions. These features are highly valuable for the efficient execution of ML algorithms on the cluster, as it is achieved through explicit orchestration of the data movement between the off-chip memory, the L2SPM, and the L1SPM [49].

Fig. 5. Instruction decoding during the status-based execution.

To double the bandwidth and the capacity, Shaheen's back-end module controls 2 HyperBUS interfaces in parallel, and it controls 2 memories on each HyperBUS, with 2 dedicated CS. Each memory is seen as a memory block of 16-bit width and N rows, programmable at runtime according to the onboard memories available. The pair of memories on the same CS of the two different buses is mapped as interleaved, hence occupying the first 2 · 2 · N Bytes. The other pair of memories is placed contiguously on top.
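One possible reading of this interleaved mapping can be sketched in a few lines of C. This is an illustrative model under stated assumptions, not the controller's RTL: the hypothetical `hyper_decode` helper assumes 16-bit (2-byte) rows, `n_rows` rows per die, the CS0 pair word-interleaved across the two buses, and the CS1 pair placed contiguously, one die after the other.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decode of a flat byte address into (bus, cs, row) for
 * the 2-bus / 2-CS HyperRAM configuration described above. Assumed
 * layout: the CS0 pair is interleaved at 16-bit-word granularity
 * across the two buses, covering the first 2*2*n_rows bytes; the
 * CS1 pair sits contiguously on top, one die after the other. */
typedef struct { int bus; int cs; uint64_t row; } hyper_loc_t;

static hyper_loc_t hyper_decode(uint64_t addr, uint64_t n_rows) {
    hyper_loc_t loc;
    uint64_t word = addr / 2;                 /* 16-bit word index */
    if (word < 2 * n_rows) {                  /* interleaved CS0 pair */
        loc.cs  = 0;
        loc.bus = (int)(word & 1);            /* even word -> bus 0 */
        loc.row = word >> 1;
    } else {                                  /* contiguous CS1 pair */
        uint64_t off = word - 2 * n_rows;
        loc.cs  = 1;
        loc.bus = (int)(off / n_rows);        /* first die, then second */
        loc.row = off % n_rows;
    }
    return loc;
}
```

Under this reading, sequential accesses to the low region naturally alternate between the two HyperBUS interfaces, which is what doubles the usable bandwidth.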

C. Parallel Programmable Cluster

While the host core supports advanced virtualization, security, and isolation features, it is not optimized for number crunching: when running computation-intensive kernels is needed, it invokes the cluster. The cluster domain is a programmable parallel accelerator connected to the main host interconnect through a controller and a subordinate AXI4 port. The cluster is composed of 8 70kGE, 4-pipeline-stage RV32 cores, optimized for general-purpose DSP and ML applications, described below. The cores share 16 16kB interleaved SRAM banks, composing a 256kB L1 SPM, accessible through a single-clock-cycle-latency logarithmic interconnect providing up to 256 Gbps at 500MHz. A hierarchical instruction cache, composed of 8 512-Byte private caches and a 4kB 2-cycle-latency shared cache, assists the cores. It is implemented with latch-based SCMs to improve energy efficiency over energy-expensive SRAM cuts. The cluster also includes a DMA with one 64-bit AXI4 port and 4 32-bit ports towards the L1SPM for high-bandwidth, low-latency transactions to/from the L1 SPM. Leveraging explicit DMA memory transfers and scratchpad memories, double-buffering, and custom ISA extensions, the cluster avoids the hardware overhead of expensive data caches while maximizing the utilization of memory and computing resources [49].

1) Flex-V Cores: The 8 RV32 cores are the so-called Flex-V cores [50]. Each core has a dedicated FPU supporting the FP32, FP16, and bfloat16 types, with SIMD instructions on lower-precision data. Also, all the cores share a single Floating-Point division and square-root unit (DIV/SQRT). The Flex-V core is an aggressively optimized version of the Ri5cy core [51], which supports the XPulpV2 ISA extension, considered as the baseline. The Ri5cy core already provides custom instructions to accelerate the execution of ML and DSP workloads; namely, it supports post-increment LD/ST, hardware loops, and SIMD instructions down to 8-bit precision.

To enhance the performance of sub-byte uniform linear kernels, the XPulpNN ISA has been proposed [8], which extends the XPulpV2 ISA with 4- and 2-bit SIMD operations. Additionally, it introduces fused Mac&Load instructions enabling simultaneous execution of SIMD dot-product operations alongside memory accesses, almost doubling the computation efficiency. More precisely, the fused Mac&Load (mlsdotp) instruction combines a SIMD dot-product-like operation with a load operation performed during the writeback stage. Doing so enables replacing the non-stationary data in a register to directly feed the next Mac&Load instruction with it. To decouple and simplify the Mac&Load execution, the XPulpNN core integrates six additional 32-bit registers, forming the so-called Neural Network Register File (NN-RF), enabling the load operations (of weights and activations) during the Mac&Load write-back stage, which could not be performed otherwise on the general-purpose register file (GP-RF). However, when dealing with mixed-precision inputs, the performance of XPulpNN degrades significantly because of the substantial software overhead required for packing and unpacking data.

Fig. 6. Dotp unit datapath.

Fig. 7. Execution flow of a mixed-precision sumdotp instruction between 8-bit operand A and 4-bit operand B.

To overcome this limitation and maximize the computational unit utilization, Flex-V further extends XPulpNN with mixed-precision operation support. To efficiently enable arbitrary mixed-precision operations while avoiding the proliferation of extra instructions, Flex-V exploits a dynamic bit-scalable execution mode: the ISA instruction only encodes the type of the operation, while the format is specified by a CSR in the core. Figure 5 illustrates the relative decoding process: the decoder retrieves all the necessary information from the instruction and transmits it to the EX stage. If the received op-code corresponds to a Virtual SIMD instruction, such as a (ml)sdotp, the decoder activates the SIMD functional unit, which will execute the instruction according to the CSR values and to signals from the dedicated MAC&Load and Mixed-precision controllers (MCD). Figure 6 shows the mixed-precision Dot Product (Dotp) unit. This unit integrates a Slicer&Router, responsible for the extraction of the 4- and 2-bit operands from a 32-bit input word, along with the two dedicated units for the sub-byte operations. For example, as shown in Fig. 7, the slicer is needed in the case of a sum-of-dot-product (sdotp) operation between an 8-bit operand A and a 4-bit operand B. Since the single instruction can consume just four of the eight 4-bit elements inside the register, it selects either the 16 MSBs or the 16 LSBs, according to the value of the MPC_CNT signal from the MCD. Subsequently, the Router directs the desired elements to the Dotp units following the SIMD_FMT signal coming from the CSR, i.e., the DOTP-8b unit (operating on 8b inputs) for this example.

In the case of mixed-precision kernels, by re-arranging the data natively in hardware, Flex-V alleviates the substantial software overhead for pointer management and explicit data unpacking that would otherwise be needed.

TABLE IV
FLEX-V'S PERFORMANCE [MAC/CYCLE] ON MATMUL KERNELS, AGAINST XPULPNN AND XPULPV2 [50]

Fig. 8. Die micrograph (3mm × 3mm) and area breakout.

TABLE V
SHAHEEN SOC FEATURES

Table IV shows Flex-V's performance gain over XPulpV2 and XPulpNN on dense matrix multiplication kernels with weights and activations (operands A and B in Figure 7, respectively) of different bit widths. It expresses performance in terms of MAC/cycle, isolating the inner kernel and excluding the non-idealities that arise when running complex real-world applications. While on uniform kernels Flex-V and XPulpNN achieve the same performance, on mixed-precision kernels Flex-V outperforms XPulpNN by up to 6.8×, for only 5.6% extra area resources.
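To make the slicing concrete, the following toy C model (ours, not the Flex-V RTL; the `mpc_cnt` argument mirrors the MPC_CNT signal above) emulates one 8b×4b sdotp step: the 16 LSBs or 16 MSBs of the 4-bit operand word are sliced, sign-extended to 8 bits, and fed to an 8-bit dot unit together with the four 8-bit elements of operand A.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the mixed-precision sum-of-dot-product described
 * above: a packs four signed 8-bit elements, b packs eight signed
 * 4-bit elements. mpc_cnt selects the 16 LSBs (0) or 16 MSBs (1) of
 * b, whose nibbles are sign-extended and routed to the 8-bit dot
 * unit; acc accumulates across the two steps of a full 8b4b MAC. */
static int32_t sdotp_8b4b(uint32_t a, uint32_t b, int mpc_cnt, int32_t acc) {
    uint32_t half = mpc_cnt ? (b >> 16) : (b & 0xFFFFu);
    for (int i = 0; i < 4; i++) {
        int8_t ai = (int8_t)(a >> (8 * i));
        /* sign-extend the i-th nibble to 8 bits */
        int8_t bi = (int8_t)((int8_t)((half >> (4 * i)) << 4) >> 4);
        acc += (int32_t)ai * (int32_t)bi;
    }
    return acc;
}
```

Two back-to-back calls with `mpc_cnt` 0 and 1 (and two different A words) consume the whole 4-bit register, which is exactly why the hardware tracks the slice position in a counter rather than in extra instructions.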
2) IOTLB: Accesses from the cluster towards the host interconnect are mediated by an IO TLB (IOTLB) unit [52]. Since the Flex-V cores cannot perform virtual-to-physical address translation, the IOTLB unit is meant to ease pointer sharing with CVA6 and further prevent unauthorized cluster accesses towards the shared memory. The latter is a fundamental feature for critical applications: without any control, malicious or buggy applications running on the cluster could potentially cause denial-of-service to the host core or break confidentiality (i.e., get unauthorised access to sensitive data).

The IOTLB provides 32 entries. For each entry, CVA6 has to specify the starting and ending virtual addresses, the physical base address, and the characteristics of the region: whether the cluster can access it, and whether it is readable or writeable. Before offloading a task to the cluster, the host statically reserves the portions of the main memory to be shared with the cluster and then programs the entries. Then, once a transaction from the cluster arrives, its address is compared against the 32 virtual address ranges. If it is within one of the available ranges and the cluster has the right permissions, the address is translated through simple subtraction of the virtual base address and addition of the physical base address. Otherwise, the IOTLB sends an interrupt to CVA6 to notify the cluster's attempt at accessing memory outside the expected regions. Then, the IOTLB behaves as a simple AXI4 subordinate so as not to break the AXI4 protocol: for write transactions, it accepts the incoming data on the write channel without propagating them, while for read transactions, it serves as many read beats as needed, providing an arbitrary value set at design time. At this point, the cluster is not aware that the transaction was not allowed and continues the execution until it receives an interrupt from CVA6. If, in this scenario, the cluster's runtime has not been compromised by the malicious/buggy application, the cluster will gracefully interrupt its execution and resume from a known state. If this is not the case and it is not possible to shut down the cluster, the IOTLB will anyway prevent denial-of-service attacks and unauthorised access to sensitive data.

IV. IMPLEMENTATION AND MEASUREMENTS

Fig. 9. Shaheen test-board, top and bottom.

Fig. 8 shows the microphotograph of the Shaheen SoC, highlighting the main building blocks described in Section III. The SoC is implemented in Global Foundries 22nm CMOS FD-SOI technology. It was synthesized with Synopsys Design Compiler 2019.12, while Place & Route was performed with Cadence Innovus 19.10. Shaheen's 4 different clocks are generated by 4 Frequency-Locked Loops (FLLs), taking as input a 32kHz clock from an off-chip ring oscillator. The FLLs' maximum achievable output frequency at 0.8V is 600MHz. The different peripheral PHYs (I2C, SPI, HyperBUS, . . . ) internally feature clock division to further scale down the input clock when needed.

Figure 9 shows the test board developed for the bring-up and measurements. It provides four 8MB HyperRAM chips and a socket to test different chips easily. It also exposes the interfaces required to debug the chip, such as JTAG and UART, as well as pin headers connected to all the other interfaces for testing purposes. Finally, it exposes the pin headers to regulate the voltage supply of the two power domains: (i) one for the core logic and the SRAM macros, which we vary between 0.625V and 0.8V, and (ii) one for the IOs, fixed at 1.8V. While the SoC fits all the requirements for nano-UAV navigation, the board described above has not been designed for flying, but specifically for the testing and characterization of Shaheen.

First, we measure the idle power. To do so, we reduce the frequency of the SoC to 32kHz and clock-gate the cluster while CVA6 is in a wait-for-interrupt state, i.e., a for loop of nop operations. As reported in Table V, idle power consumption is between 9mW and 19mW, depending on the supply voltage.


Lastly, we performed post-layout, parasitic-annotated simulations of the HyperRAM memory controller's netlist to characterize its power consumption. At 0.8V and a speed of 1.6Gbps, it consumes 1.25mW, 70% of which is consumed by the IOs. At 0.625V, delivering 1.1Gbps, the power consumption is 0.8mW, with the IOs accounting for 75%. More details about the HyperRAM controller's performance and power characterization, as well as comparisons with traditional DDR controllers, can be found in [53].
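The explicit data orchestration mentioned for the cluster DMA in Sec. III follows a classic double-buffering (ping-pong) pattern: while the cores compute on one L1 tile, the DMA fills the other. The sketch below is ours, not the PULP runtime; the "DMA" is modeled as a plain memcpy and the kernel as an accumulation, just to show the structure.

```c
#include <assert.h>
#include <string.h>

#define TILE 256  /* elements per tile; illustrative size */

/* Double-buffering sketch: tile t is processed while tile t+1 is
 * being "transferred" into the other ping-pong buffer. On the real
 * hardware the memcpy would be a non-blocking DMA transfer, and the
 * comment below would be a DMA-completion wait. */
static long process_tiles(const int *l2, int n_tiles) {
    static int l1[2][TILE];                    /* ping-pong L1 buffers */
    long sum = 0;
    if (n_tiles <= 0) return 0;
    memcpy(l1[0], l2, sizeof l1[0]);           /* preload tile 0 */
    for (int t = 0; t < n_tiles; t++) {
        int cur = t & 1;
        if (t + 1 < n_tiles)                   /* "DMA" fetches tile t+1 */
            memcpy(l1[cur ^ 1], l2 + (t + 1) * TILE, sizeof l1[0]);
        for (int i = 0; i < TILE; i++)         /* compute on tile t */
            sum += l1[cur][i];
        /* on hardware: wait for the DMA to complete, then swap */
    }
    return sum;
}
```

When compute time per tile exceeds transfer time, the DMA latency is fully hidden, which is the regime behind the >95% compute/transfer overlap reported for Dory in Sec. V.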

Fig. 10. Maximum frequency and power envelope varying VDD.

Fig. 11. PULP cluster energy efficiency on dense matrix multiplication. 1 MAC = 2 Op.

Fig. 10 (a) shows the measured maximum frequency varying the voltage supply of the host domain, the cluster domain, and CVA6. The cluster and the host domain can run at up to 280MHz at 0.625V and up to 500MHz at 0.8V. Thanks to the more aggressive pipelining, the CVA6 core can reach up to 310MHz at 0.625V and up to 600MHz at 0.8V. Fig. 10 (b) also shows the measured maximum power consumption at the highest achievable frequency for each voltage supply. For these tests, CVA6 runs a dense FP64 matrix multiplication, and the cluster runs a dense INT32 matrix multiplication, both within an infinite loop. Then, we measure and sum the average currents consumed by the two power domains. To get the average power consumption, we measure the power consumption of Shaheen when the cluster is clock-gated, which matches the power consumption of CVA6, the peripherals, and the host domain together. Then, we also enable the cluster and measure the resulting total power, which coincides with the maximum power consumed by Shaheen. Varying VDD and frequency, the power consumption of CVA6 and the host domain varies from 45mW to 130mW. On the other hand, the cluster consumes from 30mW to 70mW.

Fig. 11 shows the cluster domain energy efficiency varying frequency, VDD, and data width. On 2-bit data, the cluster can achieve up to 90GOp/s and up to 1.8TOp/s/W. On 8-bit data, the cluster can achieve up to 26.9GOp/s and up to 540GOp/s/W. All the experiments were performed running on Shaheen the various n-bit matrix-multiplication kernels extracted from the PULP-NN library on the software-programmable cores and extracting the MAC/cycle of the inner loops, excluding the initial data arrangement overhead [50].

V. HETEROGENEOUS SOFTWARE STACK

A. Software Stack and Programming Model

Shaheen comes with a mature software stack for heterogeneous programming. On the cluster side, we provide a lightweight bare-metal runtime that allows low programming overhead and fast hardware functionality validation and performance profiling. On the host side, CVA6 can either run a full-fledged Buildroot-based Linux distribution (v5.16.9) on top of the Bao Hypervisor [18] or a bare-metal runtime, and both are equipped with a dedicated driver for the cluster management. The APIs provided by the cluster runtime and CVA6's driver are already sufficient to run heterogeneous code on the platform. However, one must write two different codes for the host and the cluster. To avoid this, Shaheen adapts the OpenMP 5 framework from HERO [52], allowing users to use a high-level, directive-based, intuitive programming interface to efficiently offload the computationally intensive part of a program to the cluster within one single heterogeneous source code. Also, to map the execution of QNNs on the cluster, we adopt the data and execution flow presented in Dory [49]. Dory is a tool that, given the description of a QNN in input, generates the corresponding C code to be executed on parallel programmable clusters. Dory calculates data tiling solutions fitting the available L1SPM (where it puts the data to be processed by the cluster), and it schedules the DMA data transfers from the main memory to the L1SPM and vice-versa. Thanks to the efficiency of tiling and double-buffering, when the execution is not memory-bound, data movements overlap with computation for more than 95% of the execution time [49].

B. Offload Mechanism & Performance

To perform the offload, CVA6 lazily (at first occurrence) loads the cluster code into the L2SPM and then communicates to the cluster where the code to execute is. Such a mechanism requires a few thousand clock cycles, depending on the length of the code. Hence, when the cluster execution time is very short (<100k cycles), the cluster's offload overhead (i.e., loading the code) dominates the total execution time and reduces the speedup. Based on our (empirical) experience, this is a very uncommon case.

Figure 12 shows the offload speedup and overhead over an FP matrix multiplication. It is important to notice how the cluster can run such a benchmark with reduced precision (down to FP16), exploiting the SIMD extensions otherwise unavailable on the CVA6 core.

The plot on the left in Fig. 12 shows CVA6 and cluster performance at the maximum frequency at 0.8V (600MHz for CVA6, 500MHz for the cluster), and it also highlights the speed-up. The figure shows the acceleration when executing the accelerated kernel once or 1000 times on the cluster; the


first case represents low code utilization, while the second represents high code utilization. In each execution, the cluster performs the computation on a different pair of input matrices. At the same time, it fetches the input matrices for the next execution and writes back the result of the previous computation. Data movement is performed through the DMA and overlaps with computation. On a simple FP32 matrix multiplication, the cluster can deliver up to 4.3 GFLOp/s, which is roughly 27 times more than CVA6. Furthermore, the plot shows once again the benefit of scaling down the numerical precision, which is not a possibility on CVA6.

The plot on the right in Figure 12 compares the energy efficiency achieved by CVA6 and the host domain against the cluster on the same benchmark, with the IPs working at the maximum frequency at 0.65V (280MHz for the cluster and 310MHz for CVA6). On the reduced-precision matrix multiplication, the cluster can reach up to 157 GOp/s/W, while CVA6 can only provide 2 GOp/s/W, ≈ 80× less.

Fig. 12. Offload performance breakout on an FP MM.

Fig. 13. Shaheen's autonomous mission & flight functional blocks.

VI. BENCHMARKING

Figure 13 presents the loop that Shaheen executes to achieve autonomous flight while executing other auxiliary tasks, and how it maps on Shaheen's hardware. In the first instance, it collects and filters the sensors' input data, which are subsequently used to estimate the current state. Then, in the "intelligence" block, it has to independently determine the next state (i.e., what to do next) and carry out the target auxiliary tasks such as object detection, recognition, or monitoring [3], [14]. Once the next state is determined, the control part actuates the change. In Shaheen, the first three phases (filtering, state estimation, and intelligence) are mapped on the cluster hardware, while the real-time control is left to the host core, also executing the general-purpose OS, to leverage its legacy software stack for non-real-time tasks (e.g., transmitting the classification of a detected object to the cloud through the legacy network stack). In this section, we focus on benchmarking the cluster on its target tasks, as they are the most computing-intensive phases and potential bottlenecks of the autonomous flight and mission loop.

Filtering of the input data and state estimation, as well as intelligence tasks like path planning or structural health monitoring, usually rely on general-purpose DSP primitives [13], [14]. At the same time, QNN inference is widely adopted for tasks like classification or recognition for obstacle avoidance or object recognition and localization [2], [3], [17], [54], [55]. However, one limitation of QNN inference at the edge is the mainstream adoption of the train-once-deploy-everywhere approach, which trains the networks offline and then deploys them later on the embedded devices, where no further modifications to the weights happen. This approach prevents the models from adapting to the deployment environment and possibly leads to accuracy degradation and unreliability [15]. On-device learning potentially overcomes this limitation by enabling small portions of the training to happen in the field, directly on the MCU [16]. Thus, we benchmark the proposed SoC on three sets of kernels representative of the different tasks described above, i.e., (i) general-purpose DSP, (ii) DNN, and (iii) online learning benchmarks.

A. General Purpose DSP

Figure 14 shows the cluster performance and energy efficiency over seven open-source FP benchmarks [56] representative of DSP applications for filtering, feature extraction, classification, and basic linear algebra functions, relevant for input data filtering and state estimation but also for intelligence tasks such as path planning or structural health monitoring [13], [14]. To show the advantage given by parallelism and reduced-precision computation, such benchmarks are executed both at full precision (FP32) on a single Flex-V core, and then on 8 cores at full and lower precision (FP16 & bfloat16) to exploit the available packed-SIMD support.

Some of the benchmark kernels are representative of digital data acquisition and analysis, such as the Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters. To characterize the cluster on frequency-domain applications, we run a decimation-in-frequency radix-2 variant of the Fast Fourier Transform (FFT) and a Discrete Wavelet Transform (DWT), a standard kernel used for feature extraction. As for more state-estimation-oriented kernels, we provide the performance results when executing a K-Means classifier kernel. Lastly, we also benchmark two classical basic linear algebra kernels, such as a Matrix Multiplication and a 1D Convolution. As the plots in Fig. 14 show, on all these benchmarks, thanks to the ISA extensions not available on the host core, a single Flex-V core running at 500MHz provides from 1× to 3× the performance delivered by CVA6 on a dense and regular matrix multiplication at 600MHz. Furthermore, the parallel execution of the benchmarks on 8 Flex-V cores can give an additional speed-up between 5.9 and 7.9 times when compared to single-core execution. Leveraging the reduced-precision arithmetic can further provide almost a 2× speed-up, allowing the cluster to reach up to 7.9GFLOp/s and up to 157 GFLOp/s/W.

B. QNN Inference

In this subsection, we focus on two real-world 8-bit QNNs fine-tuned for nano-UAV application scenarios, namely Tiny-PULP-Dronet [2] and FrontNet [3], as well as two aggressively quantized mixed-precision QNNs for object detection and classification. Tiny-PULP-Dronet is a lightweight QNN based on the ResNet architecture [55], and it enables autonomous


navigation within tight spaces, avoiding obstacle collisions. FrontNet, on the other hand, is based on the MobileNet [54] architecture, and it is used for Human-Robot Interaction (HRI): it allows the nano-drone to recognize a face and follow it. The cluster is able to achieve 320FPS on Tiny-PULP-Dronet and 260FPS on an optimized 6.7MMAC FrontNet, which is well above the 20FPS needed to achieve autonomous flight [2], [3]. This means that more than 90% of the cluster's computational capabilities are actually available to carry out other activities.

Fig. 14. Performance and en. eff. delivered by the cluster on general-purpose DSP benchmarks.

Fig. 15. Performance and en. eff. delivered by the cluster on FP training benchmarks.

TABLE VI
ACCURACY, MEMORY FOOTPRINT, PERF. & EN. OF END-TO-END NETWORKS

Stemming the analysis from the QNNs mentioned above, we first benchmark the cluster on a relatively big (325MMAC) 8-bit MobileNetV1 [54] for object classification. Then, we extend the analysis to a mixed-precision MobileNetV1 with 8-bit activations and 4-bit weights (8b4b) and an aggressively quantized 4b2b ResNet-20 [57] for object detection. The two MobileNetV1 networks have been trained on ImageNet, while the 4b2b ResNet-20 targets CIFAR10. As Table VI shows, reducing the operands' precision does not automatically jeopardize the accuracy: in the case of the MobileNetV1, there is a 47% memory footprint reduction for a negligible 3% accuracy loss, from 69% to 66% [50], while the ResNet-20 achieves 90.2% accuracy [55]. As shown in Table VI, Flex-V is the only version of Ri5cy able to efficiently deal with mixed-precision networks: in terms of MAC/cycle, on the 4b2b mixed-precision ResNet-20, it achieves speedups of 2.3× and 2.5× with respect to XPulpNN and XPulpV2.

Table VI also compares the latency and energy consumed by Shaheen over the three networks when running at the maximum frequency at 0.8V, compared to two other 8-core clusters respectively implementing the baseline XPulpV2 instructions or the XPulpNN ones, namely GAP9 [7] and Kraken [8]. On the uniform-precision MobileNetV1, thanks to the higher frequency and optimized ISA, Shaheen's cluster provides the smallest latency, but not the lowest energy, due to a higher power consumption (70mW) when compared to GAP9 (50mW), the latter being tuned for energy-efficient operation. As soon as the mixed-precision extensions can be exploited, Shaheen's cluster emerges as both the fastest and the most energy-efficient one.

C. Online Training

In this subsection, we benchmark the cluster against a set of open-source kernels enabling online learning on MCU controllers [16]. In particular, we benchmark three very popular layers, 2D Convolution, Pointwise, and Fully-Connected, which are the building blocks of Convolutional NNs (CNNs), used to find patterns in images. Convolutional and pointwise layers are the core building blocks of CNNs, where most of the computation happens, and are used to perform feature extraction. The fully-connected layer connects the information extracted from the previous steps (i.e., convolution and pooling layers) to the output layer and eventually classifies the input into the desired label. For each layer, we consider the three phases of training: (i) the forward pass, to compute the output result and hence the loss, (ii) the backward computation of the gradients with respect to the activations, and (iii) the backward computation of the gradients with respect to the weights. The kernels we leverage map each of these computation phases directly to one matrix multiplication containing all the matrix multiplications needed to obtain the output [16]. Depending on the matrices' shapes, the amount of parallelizable work changes, and hence the performance [58]. Figure 15 shows performance and energy efficiency over such benchmarks. As for the DSP benchmarks, the parallelization provides a significant speed-up for most of them. Except for the weight-gradient computation on the convolution kernel, which achieves a 4.7× speedup, the parallelization provides between 6.1× and 7.5× faster execution. At the same time, leveraging the bfloat16 format (providing a wide dynamic range explicitly designed for ML training) and the dedicated SIMD extensions provides up to 1.8× more performance. Overall, the cluster is able to achieve up to 6.2 GFLOp/s and 120 GFLOp/s/W on this class of benchmarks.

VII. COMPARISON WITH STATE-OF-THE-ART

Table VII shows Shaheen against 6 SoCs for UAVs, both from industry and academia. To have a thorough comparison, we extend it also with SoCs not explicitly optimized for UAVs

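As a back-of-envelope check on the parallel speed-ups and efficiency figures reported above, Amdahl's law [58] relates an 8-core speed-up to the parallelizable fraction of a kernel, and dividing a throughput by an efficiency yields the implied cluster power. A minimal Python sketch; the parallel fractions below are illustrative assumptions, not values measured in the paper:

```python
def amdahl_speedup(parallel_fraction: float, num_cores: int) -> float:
    """Amdahl's law [58]: ideal speed-up on num_cores when only a
    fraction of the kernel is parallelizable."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / num_cores)

def implied_power_w(gops: float, gops_per_watt: float) -> float:
    """Power (W) implied by a throughput (GOp/s) at a given
    efficiency (GOp/s/W)."""
    return gops / gops_per_watt

# Speed-ups of 6.1x-7.5x on 8 cores correspond to kernels that are
# roughly 96%-99% parallelizable (illustrative fractions):
print(round(amdahl_speedup(0.96, 8), 2))  # 6.25
print(round(amdahl_speedup(0.99, 8), 2))  # 7.48

# 6.2 GFLOp/s at 120 GFLOp/s/W implies the cluster sustains these
# kernels at roughly 50 mW:
print(round(implied_power_w(6.2, 120) * 1000, 1))  # 51.7 (mW)
```

Consistently, the peak integer figures reported for the cluster (90 GOp/s at 1.8 TOp/s/W) imply a similar ~50 mW cluster power, well within the 200 mW system envelope.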
Authorized licensed use limited to: Zhejiang University. Downloaded on September 07,2024 at [Link] UTC from IEEE Xplore. Restrictions apply.
VALENTE et al.: HETEROGENEOUS RISC-V BASED SoC FOR SECURE NANO-UAV NAVIGATION 2277

TABLE VII: Comparison With SoA SoCs

but with similar general-purpose software performance and functionalities that could fit the purpose, namely Cheshire [34], the work from Jia et al. [31], the STM32-H7 [5], and the work by Ju et al. [11]. From an architectural viewpoint, the design by Ju et al. [11] consists of a homogeneous systolic array of RV32 cores, while Jia et al. [31] instantiates a cluster of four RV64 cores along with a set of hardwired ASIC accelerators. More advanced nano-UAV SoCs, such as GAP9 [7] and Kraken [8], incorporate an RV32 CPU that can offload compute-intensive tasks to a parallel cluster of cores with the same ISA.

In this context, Shaheen is the first silicon demonstrator of a heterogeneous RV64/RV32 architecture. When offloading compute-intensive tasks to the fully-programmable parallel cluster of Flex-V cores, performance can be improved by up to two orders of magnitude, achieving state-of-the-art performance with up to 90 GOp/s on heavily quantized integer tasks and up to 7.9 GFLOp/s/W on 16-bit floating-point tasks. Shaheen stands out as the only nano-UAV SoC that provides Linux, hypervisor, and security capabilities to the host, enabling the secure co-existence of user applications running on full-fledged OSes and control tasks running on real-time OSes, while providing up to 512 MB of low-cost and low-power off-chip memory within the power envelope of 200 mW.

VIII. CONCLUSION

We presented Shaheen: a heterogeneous and flexible SoC implemented in 22 nm FDX technology. Shaheen features a Linux-capable RV64 core, compliant with the v1.0 ratified Hypervisor extension. To the best of our knowledge, it is the first silicon implementation fully compliant with the ratified RISC-V ISA Hypervisor extension. It features support for timing-channel protection to isolate concurrent execution of multiple software stacks (trusted and untrusted), preventing security threats and ensuring multi-domain operations. It provides up to 512 MB of main off-chip HyperRAM memory, large enough to host general-purpose OSes as well as RTOSs. Also, it is the first silicon implementation of a heterogeneous MCU coupling an RV64 host together with a multi-core RV32 cluster, achieving up to 90 GOp/s and up to 1.8 TOp/s/W on 2-bit integer kernels, and up to 26.9 GOp/s and up to 540 GOp/s/W on 8-bit integer kernels.

After this thorough evaluation, we envision the miniaturization of the testing PCB (see Fig. 9) and the development of ad-hoc control software, tightly coupled with the physical characteristics of the board, to achieve real-world nano-UAV flight, exploiting Shaheen's secure and scalable architecture with host/cluster decoupling and advanced virtualization. Overall, Shaheen is the first prototype SoC providing support for general-purpose OSes within a 200 mW power envelope while offering state-of-the-art performance over a wide spectrum of applications, thanks to the programmable multi-core cluster. All the IPs integrated within Shaheen are released as open source² under a liberal license to foster future research in the area of AI-IoT computing devices.

² [Link]

REFERENCES

[1] M. O. Ojo, S. Giordano, G. Procissi, and I. N. Seitanidis, "A review of low-end, middle-end, and high-end IoT devices," IEEE Access, vol. 6, pp. 70528–70554, 2018.
[2] L. Lamberti et al., "Tiny-PULP-Dronets: Squeezing neural networks for faster and lighter inference on multi-tasking autonomous nano-drones," in Proc. IEEE 4th Int. Conf. Artif. Intell. Circuits Syst. (AICAS), Jun. 2022, pp. 287–290.
[3] E. Cereda et al., "Deep neural network architecture search for accurate visual pose estimation aboard nano-UAVs," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2023, pp. 6065–6071.
[4] R. J. Wood et al., "Progress on 'Pico' air vehicles," in Robotics Research. Cham, Switzerland: Springer, 2017, pp. 3–19, doi: 10.1007/978-3-319-29363-9_1.
[5] STMicroelectronics. (2020). STM32H7. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[6] STMicroelectronics. (2020). STM32F4. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[7] GreenWaves Technologies. (2023). GAP8/9. Accessed: Jul. 30, 2023. [Online]. Available: [Link]


[8] A. Di Mauro, M. Scherer, D. Rossi, and L. Benini, "Kraken: A direct event/frame-based multi-sensor fusion SoC for ultra-efficient visual processing in nano-UAVs," in Proc. IEEE Hot Chips 34 Symp. (HCS), Aug. 2022, pp. 1–19.
[9] I.-T. Lin et al., "2.5 A 28 nm 142 mW motion-control SoC for autonomous mobile robots," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2023, pp. 1–3.
[10] A. Suleiman, Z. Zhang, L. Carlone, S. Karaman, and V. Sze, "Navion: A 2-mW fully integrated real-time visual-inertial odometry accelerator for autonomous navigation of nano drones," IEEE J. Solid-State Circuits, vol. 54, no. 4, pp. 1106–1119, Apr. 2019.
[11] Y. Ju and J. Gu, "A systolic neural CPU processor combining deep learning and general-purpose computing with enhanced data locality and end-to-end performance," IEEE J. Solid-State Circuits, vol. 58, no. 1, pp. 216–226, Jan. 2023.
[12] W. J. Dally, Y. Turakhia, and S. Han, "Domain-specific hardware accelerators," Commun. ACM, vol. 63, no. 7, pp. 48–57, Jun. 2020, doi: 10.1145/3361682.
[13] P. Tsiotras, D. Jung, and E. Bakolas, "Multiresolution hierarchical path-planning for small UAVs using wavelet decompositions," J. Intell. Robotic Syst., vol. 66, no. 4, pp. 505–522, Jun. 2012.
[14] A. Khadka, B. Fick, A. Afshar, M. Tavakoli, and J. Baqersad, "Non-contact vibration monitoring of rotating wind turbines using a semi-autonomous UAV," Mech. Syst. Signal Process., vol. 138, Apr. 2020, Art. no. 106446.
[15] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, "Concrete problems in AI safety," 2016, arXiv:1606.06565.
[16] D. Nadalini, M. Rusci, G. Tagliavini, L. Ravaglia, L. Benini, and F. Conti, "PULP-TrainLib: Enabling on-device training for RISC-V multi-core MCUs through performance-driven autotuning," in Embedded Computer Systems: Architectures, Modeling, and Simulation. Cham, Switzerland: Springer, 2022, pp. 200–216.
[17] P. Foehn et al., "Agilicious: Open-source and open-hardware agile quadrotor for vision-based flight," Sci. Robot., vol. 7, no. 67, Jun. 2022, Art. no. eabl6259.
[18] B. Sá, L. Valente, J. Martins, D. Rossi, L. Benini, and S. Pinto, "CVA6 RISC-V virtualization: Architecture, microarchitecture, and design space exploration," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 31, no. 11, pp. 1713–1726, Nov. 2023.
[19] M. Schneider, A. Dhar, I. Puddu, K. Kostiainen, and S. Čapkun, "Composite enclaves: Towards disaggregated trusted execution," IACR Trans. Cryptograph. Hardw. Embedded Syst., vol. 2022, no. 1, pp. 630–656, Nov. 2021.
[20] N. Wistoff, M. Schneider, F. K. Gürkaynak, G. Heiser, and L. Benini, "Systematic prevention of on-core timing channels by full temporal partitioning," IEEE Trans. Comput., vol. 72, no. 5, pp. 1420–1430, May 2023.
[21] B. John. (2020). HyperRAM As a Low Pin-Count Expansion Memory for Embedded Systems. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[22] A. Das, P. Kol, C. Lundberg, K. Doelling, H. E. Sevil, and F. Lewis, "A rapid situational awareness development framework for heterogeneous manned-unmanned teams," in Proc. IEEE Nat. Aerosp. Electron. Conf. (NAECON), Jul. 2018, pp. 417–424.
[23] B. Forsberg, D. Palossi, A. Marongiu, and L. Benini, "GPU-accelerated real-time path planning and the predictable execution model," Proc. Comput. Sci., vol. 108, pp. 2428–2432, Jan. 2017. [Online]. Available: [Link]
[24] S. A. Quintero and J. P. Hespanha, "Vision-based target tracking with a small UAV: Optimization-based control strategies," Control Eng. Pract., vol. 32, pp. 28–42, Nov. 2014. [Online]. Available: [Link]
[25] Pixhawk. (2023). PX4. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[26] M. Idrissi, M. Salami, and F. Annaz, "A review of quadrotor unmanned aerial vehicles: Applications, architectural design and control algorithms," J. Intell. Robot. Syst., vol. 104, no. 2, p. 22, Jan. 2022.
[27] C. Budaciu, N. Botezatu, M. Kloetzer, and A. Burlacu, "On the evaluation of the crazyflie modular quadcopter system," in Proc. 24th IEEE Int. Conf. Emerg. Technol. Factory Autom. (ETFA), Sep. 2019, pp. 1189–1195.
[28] O. H. Zekry, T. Attia, A. T. Hafez, and M. M. Ashry, "PID trajectory tracking control of crazyflie nanoquadcopter based on genetic algorithm," in Proc. IEEE Aerosp. Conf., Mar. 2023, pp. 1–8.
[29] NVIDIA. (2023). NVIDIA Jetson TX2. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[30] Intel. (2020). Intel Atom X7-E3950. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[31] T. Jia et al., "A 12 nm agile-designed SoC for swarm-based perception with heterogeneous IP blocks, a reconfigurable memory hierarchy, and an 800 MHz multi-plane NoC," in Proc. IEEE 48th Eur. Solid State Circuits Conf. (ESSCIRC), Sep. 2022, pp. 269–272.
[32] C.-H. Lin et al., "7.1 A 3.4-to-13.3 TOPS/W 3.6 TOPS dual-core deep-learning accelerator for versatile AI applications in 7 nm 5G smartphone SoC," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 134–136.
[33] C. Schmidt et al., "An eight-core 1.44-GHz RISC-V vector processor in 16-nm FinFET," IEEE J. Solid-State Circuits, vol. 57, no. 1, pp. 140–152, Jan. 2022.
[34] A. Ottaviano, T. Benz, P. Scheffler, and L. Benini, "Cheshire: A lightweight, Linux-capable RISC-V host platform for domain-specific accelerator plug-in," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 70, no. 10, pp. 3777–3781, Oct. 2023.
[35] Etron. (2022). 256 MB High Bandwidth RPC DRAM. [Online]. Available: [Link]
[36] Bitcraze. (2023). Crazyflie. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[37] G. Shi, W. Hönig, Y. Yue, and S.-J. Chung, "Neural-swarm: Decentralized close-proximity multirotor control using learned interactions," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 3241–3247.
[38] F. Candan, A. Beke, and T. Kumbasar, "Design and deployment of fuzzy PID controllers to the nano quadcopter crazyflie 2.0," in Proc. Innov. Intell. Syst. Appl. (INISTA), Jul. 2018, pp. 1–6.
[39] B. Nassi, R. Bitton, R. Masuoka, A. Shabtai, and Y. Elovici, "SoK: Security and privacy in the age of commercial drones," in Proc. IEEE Symp. Secur. Privacy (SP), May 2021, pp. 1434–1451.
[40] J.-H. Yoon and A. Raychowdhury, "31.1 A 65 nm 8.79 TOPS/W 23.82 mW mixed-signal oscillator-based NeuroSLAM accelerator for applications in edge robotics," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 478–480.
[41] Bitcraze. (2023). AI-Deck. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[42] G. Gallego et al., "Event-based vision: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 1, pp. 154–180, Jan. 2022.
[43] ROS. (2022). Robot Operating System. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[44] OpenHW. (2023). CVA6. [Online]. Available: [Link]
[45] Y. Yarom. (2016). Mastik: A Micro-Architectural Side-Channel Toolkit. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[46] Q. Ge, "Principled elimination of microarchitectural timing channels through operating-system enforced time protection," Ph.D. dissertation, School Comput. Sci. Eng., UNSW, Sydney, NSW, Australia, 2019.
[47] Sel4. (2023). Timing Channel Benchmarking Tool. [Online]. Available: [Link]
[48] ARM. (2022). AMBA AXI Protocol Specification. [Online]. Available: [Link]
[49] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi, and F. Conti, "DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs," IEEE Trans. Comput., vol. 70, no. 8, pp. 1253–1268, Aug. 2021.
[50] A. Nadalini et al., "A 3 TOPS/W RISC-V parallel cluster for inference of fine-grain mixed-precision quantized neural networks," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI), Jun. 2023, pp. 1–6.
[51] OpenHW. (2023). CV32E40P. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[52] A. Kurth, B. Forsberg, and L. Benini, "HEROv2: Full-stack open-source research platform for heterogeneous computing," IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 12, pp. 4368–4382, Dec. 2022, doi: 10.1109/TPDS.2022.3189390.
[53] L. Valente et al., "HULK-V: A heterogeneous ultra-low-power Linux capable RISC-V SoC," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Apr. 2023, pp. 1–6.


[54] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[55] Z. Dong, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer, "HAWQ: Hessian aware quantization of neural networks with mixed-precision," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct./Nov. 2019, pp. 293–302.
[56] PULP Platform. (2023). TransLib. Accessed: Jul. 30, 2023. [Online]. Available: [Link]
[57] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," 2015, arXiv:1512.03385.
[58] G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," in Proc. Spring Joint Comput. Conf. (AFIPS). New York, NY, USA: Association for Computing Machinery, Apr. 1967, pp. 483–485, doi: 10.1145/1465482.1465560.

Luca Valente received the M.Sc. degree in electronic engineering from the Polytechnic University of Turin in 2020. He is currently pursuing the Ph.D. degree with the Department of Electrical, Electronic and Information Technologies Engineering (DEI), University of Bologna. His main research interests are hardware-software co-design of multi-processor heterogeneous systems-on-chip, parallel programming, and FPGA prototyping.

Alessandro Nadalini received the B.Sc. and M.Sc. degrees in electronic engineering from the University of Bologna, Bologna, Italy, in 2018 and 2021, respectively. He currently holds a research grant from the University of Bologna. His research regards lightweight extensions to the RISC-V ISA to boost the efficiency of heavily quantized neural network inference on microcontroller-class cores. He received the Mukherjee Best Paper Award at the 2023 IEEE Computer Society Annual Symposium on VLSI.

Asif Hussain Chiralil Veeran received the B.S. degree in electronics and communication engineering from the University of Calicut and the M.S. degree in VLSI design from Anna University, India. He is a Researcher in electrical engineering and computer science (EECS) with Khalifa University. His research interests encompass VLSI: developing efficient and effective methodologies, designing and optimizing the layout of ICs, and ensuring their functionality, performance, and manufacturability.

Mattia Sinigaglia received the master's degree in electronic engineering from the University of Bologna, where he is currently pursuing the Ph.D. degree in digital systems design with the group of Prof. Luca Benini, Department of Electrical and Information Engineering (DEI). His research interest is semiconductor engineering for ultra-low-power devices, mainly targeting hardware-software co-design of heterogeneous systems-on-chip.

Bruno Sá received the master's degree in electronics and computer engineering, with a specialization in embedded systems and automation, control, and robotics. He is currently pursuing the Ph.D. degree with the Embedded Systems Research Group, University of Minho, Portugal. His areas of interest include operating systems and virtualization for embedded critical systems, computer architectures, and hardware-software co-design.

Nils Wistoff (Graduate Student Member, IEEE) received the B.Sc. and M.Sc. degrees from RWTH Aachen University in 2017 and 2020, respectively. He is currently pursuing the Ph.D. degree with the Integrated Systems Laboratory, ETH Zürich. His research interests include processor and system-on-chip design and secure computer architecture.

Yvan Tortorella received the master's degree in electronic engineering from the University of Bologna in October 2021, where he is currently pursuing the Ph.D. degree in digital systems design with the group of Prof. Luca Benini, Department of Electrical and Information Engineering (DEI). His research interests include the design of parallel ultra-low-power (PULP)-based hardware accelerators for ultra-low-power machine learning and the design of RISC-V-based computer architectures for satellite applications.

Simone Benatti received the Ph.D. degree in electrical engineering and computer science from the University of Bologna, Bologna, Italy, in 2016. He has collaborated with several international research institutes and companies. Previously, he was an electronic designer and a research and development engineer of electromedical devices for eight years. He has authored or coauthored more than 50 papers in international peer-reviewed conferences and journals. His research interests focus on energy-efficient embedded wearable systems, signal processing, sensor fusion, and actuation systems.

Rafail Psiakis received the bachelor's and M.S. joint diploma degree from the Department of Electrical and Computer Engineering, University of Patras, Greece, in 2015, and the Ph.D. degree from the University of Rennes, France, in 2018. Currently, he is a Lead Silicon Security Researcher with the Technology Innovation Institute, Abu Dhabi, United Arab Emirates. Previously, he was an embedded research and development security engineer and a research assistant for five years. His research interests include embedded systems, security, confidential computing, computer architecture, RISC-V, and fault tolerance.

Ari Kulmala received the Ph.D. degree from the Tampere University of Technology (TUT) in 2009. Currently, he is leading the SoC development with the Technology Innovation Institute (TII), Secure Systems Research Center, Abu Dhabi, and holds the position of Professor of Practice with TUT. His experience in system-on-chip design ranges from low-power mobile devices to large-scale processing infrastructure devices and datacenter applications.

Baker Mohammad (Senior Member, IEEE) received the B.S. degree in ECE from the University of New Mexico, Albuquerque, the M.S. degree in ECE from Arizona State University, Tempe, and the Ph.D. degree in electrical and computer engineering (ECE) from The University of Texas at Austin. He is currently a Professor of electrical engineering and computer science (EECS) with Khalifa University and the Director of the SOCL. He has authored or coauthored over 200 refereed journal and conference papers, more than three books, and over 20 U.S. patents. He is a Distinguished Lecturer of IEEE CAS.

Sandro Pinto received the Ph.D. degree in electronics and computer engineering. He is an Associate Research Professor with the University of Minho, Portugal. He has a deep academic background and several years of industry collaboration focusing on operating systems, virtualization, and security for embedded, CPS, and IoT systems. He has published more than 80 peer-reviewed articles and is a skilled presenter with speaking experience at several academic and industrial conferences.

Daniele Palossi received the Ph.D. degree from ETH Zürich, Zürich, Switzerland, in 2019. He is currently a Senior Researcher with the Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI, Lugano, Switzerland, and the Integrated Systems Laboratory (IIS), ETH Zürich. His work has resulted in more than 30 publications in international peer-reviewed conferences and journals. His research focuses on energy-efficient embedded platforms, algorithms for autonomous navigation, and miniaturized cyber-physical systems. He was a recipient of the Swiss National Science Foundation (SNSF) Spark Grant and the Second Prize at the Design Contest held at ACM/IEEE ISLPED'19, and he led the winning team of the first "Nanocopter AI Challenge" at the IMAV'22 International Conference.

Luca Benini (Fellow, IEEE) received the Ph.D. degree from Stanford University. He holds the Chair of Digital Circuits and Systems with ETH Zürich and is a Full Professor with the University of Bologna. He has published more than 1000 peer-reviewed articles and five books. His research interests are in energy-efficient parallel computing systems, smart sensing micro-systems, and machine learning hardware. He is a Fellow of the ACM and a member of the Academia Europaea.

Davide Rossi (Senior Member, IEEE) received the Ph.D. degree from the University of Bologna, Bologna, Italy, in 2012. He has been a Post-Doctoral Researcher with the Department of Electrical, Electronic and Information Engineering "Guglielmo Marconi," University of Bologna, since 2015, where he is currently an Assistant Professor. His research interests focus on energy-efficient digital architectures. In this field, he has published more than 100 papers in international peer-reviewed conferences and journals. He was a recipient of the Donald O. Pederson Best Paper Award 2018, the 2020 IEEE TCAS Darlington Best Paper Award, and the 2020 IEEE Transactions on Very Large Scale Integration (VLSI) Systems Prize Paper Award.
