
Approximate ADCs for In-Memory Computing

Arkapravo Ghosh1, Hemkar Reddy Sadana1, Mukut Debnath1, Panthadip Maji1, Shubham Negi2, Sumeet Gupta2, Mrigank Sharad1, Kaushik Roy2
1Indian Institute of Technology Kharagpur, 2Purdue University

Abstract—In-memory computing (IMC) architectures for deep learning (DL) accelerators leverage energy-efficient and highly parallel matrix-vector multiplication (MVM) operations, implemented directly in memory arrays. Such IMC designs have been explored based on CMOS as well as emerging non-volatile memory (NVM) technologies like RRAM. IMC architectures generally involve a large number of cores consisting of memory arrays storing the trained weights of the DL model. Peripheral units like DACs and ADCs are also used for applying inputs and reading out the output values. Recently reported designs reveal that the ADCs required for reading out the MVM results consume more than 85% of the total compute power and also dominate the area, thereby undermining the benefits of the IMC scheme. Mitigation of imperfections in the ADCs, namely non-linearity and variations, incurs significant design overhead due to dedicated calibration units. In this work we present peripheral-aware design of IMC cores to mitigate such overheads. It involves incorporating the non-idealities of the ADCs, along with those of the memory units, into the training of the DL models. The proposed approach applies equally well to both current-mode and charge-mode MVM operations demonstrated in recent years, and can significantly simplify the design of mixed-signal IMC units.

Keywords—in memory computing, deep learning, low power, VLSI, mixed signal

I. INTRODUCTION

Deep Neural Networks (DNN) involve a large number of dot-product calculations between multiple pre-trained convolution kernels and input pixels at each layer [1, 2]. Such kernel multiplications over pixels from multiple input channels can be organized in the form of matrix-vector multiplication (MVM) operations. Conventional DNN accelerator architectures incur significant performance and power overhead due to extensive data movement between multiple processing units and the associated local and shared memory blocks storing these kernel weights and partial results [3, 4].

The in-memory computing (IMC) scheme has gained significant attention for energy-efficient DNN accelerator design, owing to reduced memory-to-processor data traffic [5-7]. IMC involves storing the kernel weights for each layer and its input channels in memory arrays. The input data can be applied along the word-lines. Based on the circuit-level interaction between the applied signal level and the memory cell, a current-mode [11-13] or charge-mode [14, 15] dot product is computed. The result of the dot-product operation is generated along the column-wise bit-lines of the memory array, which accumulate the current or charge outputs from multiple memory cells. The resulting outputs are inevitably analog values, which need to be quantized into digital outputs before feeding them to the next layer. The number of quantization levels depends upon several parameters, such as the number of input levels, the number of weight-bits and the size of the memory array. Larger values of each of these parameters translate to a larger number of possible output levels and hence mandate more quantization levels for overall higher inference accuracy [37]. Larger levels of output quantization require higher resolution analog-to-digital converters (ADC) for digitizing the result [26, 27]. The ADC resolution requirement as a function of the aforementioned design parameters has been estimated in recent work [38, 39]. The power and area overhead associated with ADCs grows sharply with increasing bit resolution, while the performance drops [38]. Recent designs ascribe more than 85% of the power consumption to the ADCs, and only a smaller fraction is used for the actual analog-mode IMC operation in the compute core [40, 41].

For a target bit resolution, designing for higher linearity, measured in terms of INL and DNL (integral and differential non-linearity), and for reduced variation effects results in higher power, area and performance penalties [16-18]. Ideally, for the highest possible performance from IMC cores, the number of ADCs should equal the number of memory array columns. However, the large area footprint of ADCs necessitates their sharing among multiple columns.

To overcome the challenges related to ADCs in IMC units, we propose an ADC-aware DNN training method, which involves the use of non-ideal and imperfect ADC characteristics, with variations and non-linearity, in place of the ideal ReLU functions of the convolution layers (fig. 1). The DNN model is first trained with ideal ReLU (perfectly linear and variation-free) and weight values, with bit-truncation, to estimate the minimum number of bits required to retain the inference accuracy close to that obtained with high-resolution floating-point ReLU and weights. The non-linearity and variation estimates obtained from an un-calibrated ADC of the required bit-resolution, estimated from circuit design, are then incorporated in the model re-training by replacing the ReLU with such imperfect characteristics. The ADC design constraints are tightened until the re-training achieves the desired accuracy, close to the ideal ReLU case. The tolerance towards weight variations and the corresponding bit-cell sizing constraints are obtained in parallel, through addition of estimated weight-noise from bit-cell and array simulations. A variation-aware training (VAT) approach for weight variations has been explored earlier [47]. Incorporation of ADC imperfections in VAT, proposed in this work, can significantly simplify the associated design constraints for low power and high-performance IMC architectures.
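To make the idea concrete, the sketch below shows one way the ideal ReLU of a convolution layer could be swapped for a non-ideal ADC transfer characteristic during training. It is a minimal PyTorch illustration and not the exact implementation used in this work: the ADCActivation module name, the straight-through gradient, the 0-to-4 clipping range and the lookup table of simulated output levels are all assumptions made for the example.

import torch
import torch.nn as nn

class ADCActivation(nn.Module):
    """Stand-in for an ideal ReLU: clip, sample a (possibly non-linear,
    variation-affected) ADC transfer curve, and keep a usable gradient."""
    def __init__(self, adc_lut: torch.Tensor, full_scale: float = 4.0):
        super().__init__()
        self.register_buffer("adc_lut", adc_lut.float())  # simulated output level per input code
        self.n_levels = adc_lut.numel()
        self.full_scale = full_scale

    def forward(self, x):
        x = torch.clamp(x, 0.0, self.full_scale)                        # ReLU plus clipping
        idx = torch.round(x / self.full_scale * (self.n_levels - 1)).long()
        y = self.adc_lut[idx]                                           # non-ideal ADC output
        return x + (y - x).detach()    # straight-through estimator for back-propagation

# usage: replace nn.ReLU() after a Conv2d with a 7-bit simulated characteristic
ideal = torch.linspace(0.0, 4.0, 128)            # ideal 7-bit transfer curve
measured = ideal + 0.02 * torch.randn(128)       # stand-in for simulated INL/DNL and offset
act = ADCActivation(measured)
out = act(torch.rand(8, 16, 32, 32) * 4.0)       # drop-in use on a feature map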



Fig. 1a. Mapping of DNN to an NVM IMC unit
Fig. 1b. IMC mapping of CNN

The rest of the paper is organized as follows. A brief description of the IMC unit along with the peripheral circuits is given in Section II. Section III presents the design analysis and characterization of an ADC based on ring oscillators. The variation aware training framework is described in Section IV. Section V presents a comparison with the conventional approach of calibration. Conclusions are given in Section VI.

II. IN MEMORY COMPUTING MACRO

In this section we present a brief description of the IMC scheme based on SRAM. Both current-mode [11, 19, 20] and charge-mode [14, 15, 21] MVM computation based on SRAM are discussed.

A. Mapping of DNN onto IMC SRAM Macro with ADC Based Readout

As shown in fig. 1b, the convolution operation involved in a DNN architecture can be represented as MVM operations [22, 23]. To obtain the output pixels of a particular layer, convolution of all kernels of that layer with the pixels of all input channels of the previous layer is needed. A fully parallel implementation of such a compute step would require large IMC crossbars of dimension M×N×S²×P, where M is the number of input channels to the layer, N is the number of kernels, S is the size of the kernels and P is the number of output pixels, assuming that all output pixels of a particular layer are computed in the same IMC array. An IMC core in general consists of a memory array (SRAM/DRAM/NVM) along with input and output (readout) peripheral circuits. For mixed-signal implementations, digital inputs to a particular IMC core are applied to row-parallel DACs, as shown in fig. 1. The DAC outputs drive the word-lines. Each column accumulates the result of the dot products between the input values and the kernel weights stored in the IMC memory elements of the respective column. The analog-mode output thus produced needs to be digitized by ADCs before being applied to the next layer operation. In terms of degree of parallelism, and hence performance, larger array size and higher bit resolution for inputs, weights and outputs are favorable. However, larger array sizes and weight bit resolutions translate to the requirement of higher resolution ADCs [24-27]. This is because the ADC needs to resolve more levels in the analog-mode MVM output for such cases. High precision ADCs can overwhelm the area and power cost and hence are not suitable for high-performance column-parallel operation. Use of larger IMC arrays also suffers from degradation in compute precision due to the effect of parasitics and device variations [8, 33, 34]. The MVM operation therefore needs to be partitioned and allocated to smaller memory arrays.

Fig. 2. (a) IMC unit based on SRAM, with (b) current-based and (c) charge-based bit cell operation for MVM
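The mapping from convolution to MVM that motivates these array dimensions can be visualized with an im2col-style rearrangement: each kernel is flattened into a length M·S² weight vector, and every output pixel corresponds to one input vector. The NumPy sketch below is only illustrative (unit stride, no padding, hypothetical function name) and is not tied to any specific IMC macro.

import numpy as np

def conv_as_mvm(x, w):
    """Express a convolution as a sequence of matrix-vector multiplications.

    x: input activations, shape (M, H, W)  -- M input channels
    w: kernels, shape (N, M, S, S)         -- N kernels of size SxS
    Returns the output feature map of shape (N, H-S+1, W-S+1), stride 1, no padding.
    """
    N, M, S, _ = w.shape
    Ho, Wo = x.shape[1] - S + 1, x.shape[2] - S + 1
    W_mat = w.reshape(N, M * S * S)     # each kernel flattened; stored column-wise on chip
    out = np.zeros((N, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[:, i:i + S, j:j + S].reshape(M * S * S)  # one input vector
            out[:, i, j] = W_mat @ patch                       # one MVM per output pixel
    return out

# a crossbar with M*S*S rows and N columns evaluates one such MVM at a time;
# evaluating all P output pixels concurrently is what leads to the
# M x N x S^2 x P sizing mentioned above.
y = conv_as_mvm(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3))  # -> shape (4, 6, 6)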
Bit-wise computation schemes for IMC cores have been proposed to reduce the requirements on ADC resolution [24-27]. This is achieved by allocating different memory arrays to each of the weight-bits, feeding inputs in a bit-wise manner and combining the partial sums obtained from each of the arrays using digital peripherals through shift-add operations [28, 29]. While the use of multi-level RRAM cells has been proposed for implementing multi-bit weights in a single array [42], such weight-bit slicing facilitates IMC implementation using 1-bit SRAM memory cells, as shown in Fig. 2a. Streaming inputs in a bit-wise manner also eliminates the need for multi-bit DACs at the row-input points. Even for bit-wise operations, a moderate-size memory array of dimension 64x64 leads to a 6-to-8-bit resolution requirement for the ADCs [9, 35, 36], as shown in a subsequent section. Partial word-line activation has also been used to compute lower-bit partial-sums corresponding to smaller sets of inputs applied sequentially [30-32]. However, such approaches transfer a significant amount of computation to the digital peripheral circuits that combine the partial results obtained from multiple crossbars. This results in increased area and power overhead and also incurs a performance penalty due to the reduced degree of parallelism.

B. Current-mode IMC Operation

For SRAM-based IMC cores, two different modes of MVM operation have been proposed in recent years, namely current mode [11, 19, 20] and charge mode [14, 15, 21]. As shown in Fig. 2b, current-mode operation for an SRAM core involves conditionally drawing a static current from the bit-line when the weight-bit (stored in the SRAM bit cell) and the input-bit are both high. The net current flowing through the bit-line represents the dot-product between the weight bits stored along the column and the input bit-vector. As shown in Fig. 2c, the charge-mode scheme involves an additional capacitor in each bit-cell to conditionally store and transfer charge to the bit-line, depending upon the weight and input bit values [15, 21]. It essentially involves charge-sharing between the individual bit-cell capacitors and the bit-line capacitance to implement the summation of the bit-wise dot products obtained through the individual cells along the column. Though charge-mode operation offers some benefits, such as reduced static power, faster compute speed and lower variation, it may suffer from poorer scalability due to the added capacitor per bit cell [15, 43]. The VAT scheme proposed in this work is applicable to both cases. However, we limit our discussion to current-mode computation in this work.

C. ADC for IMC operation

For current-mode IMC, a transimpedance stage can be used to convert the output current into a proportional voltage. Alternatively, a current integrator stage can also be used to produce a proportional output voltage [44]. The resulting voltage signal can be applied to a voltage-mode ADC. Single-slope ADCs (SS-ADC) [46], as well as Successive Approximation Register ADCs (SAR-ADC) [45, 46], have been used for current-mode IMCs. While SS-ADCs are relatively compact, they need a large number of clock cycles per conversion (2^N cycles for an N-bit output). SAR ADCs incur significant area overhead due to the capacitive DAC, but take only N clock cycles for a single N-bit conversion.

Fig. 3. ADC interface for a current-mode IMC column using (a) TIA and voltage-mode ADC, (b) current-mode ADC, and (c) for charge-mode IMC using a voltage-mode ADC

An alternate approach for current-mode IMC is the use of current-mode ADCs, which can help eliminate the transimpedance operation and the associated circuit overhead (fig. 3b). For instance, a current-controlled oscillator can be used to generate oscillations with a frequency proportional to the input current, which can be converted into the output digital code with the help of a counter [11]. Though the current-mode ADC overcomes the need for the transimpedance stage, it does require a biasing stage at the input, which clamps the bit-line voltage to a desired level. In this work we present a detailed analysis of a current-mode ADC; however, similar benefits are expected from voltage-mode ADCs as well.

III. CCO-BASED ADC FOR IMC MACRO

The current-mode ADC used in this work consists of a ring-oscillator based current-controlled oscillator (CCO) [11]. Due to its simple and compact structure, it is amenable to column-parallel operation, i.e., each column of the SRAM IMC array having a dedicated ADC for the highest possible read-out speed. The CCO oscillates at a frequency which has a near-linear dependence on the input current I_BL received from the bit-line (Fig. 4). Following the CCO is a ripple counter which produces the final digital code at the end of the evaluation period T_eval. For interfacing the CCO with the bit-line, an OPAMP-based negative feedback loop is used to maintain a constant bit-line voltage, ensuring a fixed voltage drop V_BL = V_ref across the current-sinking devices of all bit-cells, as shown in fig. 2b. For the nominal supply voltage, the frequency versus input current characteristics are nearly linear. The influence of supply-voltage scaling on the frequency versus control current characteristics of the CCO in the nominal (TT) corner is presented in Fig. 5(a).

Fig. 4. Block diagram of the CCO-based ADC

We choose the maximum CCO frequency (f_max), and thus the conversion speed, based on the maximum I_BL (I_max) obtained when all the dot-products are HIGH in a column. This can be simplified for an N-bit (= 7) ADC as shown in (1).

T_eval = 2^N / f_max    (1)

The ADC output as a function of the column current (I_BL) is shown in (2).

CODE = T_eval · f_CCO(I_BL) ≈ 2^N · I_BL / I_max    (2)

The average ADC computation power and energy is found to reduce with the supply voltage (fig. 5b). This is because the input current, which determines the oscillation frequency, is independent of VDD. Hence, supply voltage scaling is advantageous in terms of power and energy efficiency (fig. 5c).

Fig. 5. Trends in performance metrics of the CCO-ADC: (a) CCO frequency characteristics, (b) ADC compute power vs. maximum BL current, (c) ADC compute energy vs. maximum BL current, (d) INL characteristics for I_max = 10 μA, (e) DNL characteristics for I_max = 10 μA, (f) ADC characteristics across worst-case process corners
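A small numerical model of equations (1) and (2) is given below, assuming a perfectly linear frequency-versus-current characteristic; the f_max and I_max values are placeholders chosen only to make the example concrete (I_max = 10 μA echoes the full-scale current used in Fig. 5).

import numpy as np

def cco_adc_code(i_bl, i_max=10e-6, f_max=1e9, n_bits=7):
    """Idealized CCO-ADC conversion following eqs. (1) and (2).

    The evaluation window T_eval is sized so that the ripple counter reaches
    full scale (2**n_bits counts) for the maximum bit-line current i_max.
    """
    t_eval = (2 ** n_bits) / f_max                 # eq. (1)
    f_cco = f_max * (np.asarray(i_bl) / i_max)     # assumed linear f vs. I_BL
    code = np.floor(t_eval * f_cco).astype(int)    # eq. (2): counter output
    return np.clip(code, 0, 2 ** n_bits - 1)

# a column sinking half of I_max should land near half of the full-scale code
print(cco_adc_code([0.0, 5e-6, 10e-6]))   # -> [  0  64 127]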

However, the linearity of the CCO, and hence of the ADC, drops with reducing supply voltage. This is due to increasingly imperfect mirroring of the input current into the CCO unit from the OPAMP-controlled input branch, resulting from the reduced voltage headroom of the current mirrors. This is evident from the INL and DNL plots of the ADC shown in fig. 5d and fig. 5e. In general, a similar trend of power consumption and linearity is observed in other ADC topologies.

The change in the slope of the characteristics across corners, as shown in fig. 5f, can be ascribed to the shift in the device threshold-voltage mean values and hence the stage delays of the CCO. Such corner-wise PVT shifts, which would affect all ADCs on a die in a similar fashion, can be compensated through global calibration, which involves corner detection circuits and applying corner-dependent scaling of the digitized outputs [49].

Random variations on the same die can cause significant differences between ADC characteristics on the same chip. The effect of random variations obtained through Monte Carlo simulations is shown in fig. 6 for different supply voltages. It indicates that the spread in the CCO characteristics due to random variations increases with the down-scaling of the supply voltage. This is due to the devices being increasingly pushed towards near-threshold operation, resulting in reduced current density and higher percentage variations. Hence, both variation and power consumption trade off with energy efficiency. In general, mitigating such random variations would require dedicated calibration of each ADC, which can incur significant design, area and performance overhead [11].

Fig. 6. Effect of random variations on CCO frequency characteristics for different supply voltages, showing higher spread and non-linearity for lower supply
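The kind of per-column spread described above can be emulated, for training purposes, by drawing an ensemble of perturbed transfer curves. The sketch below is a stand-in for circuit-level Monte Carlo data: the gain, offset and curvature terms and their magnitudes are assumptions chosen only to illustrate how 200 samples could be generated and assigned to the columns of a 128x128 array.

import numpy as np

rng = np.random.default_rng(0)

def mc_adc_curves(n_samples=200, n_bits=7, sigma_gain=0.05, sigma_offset=0.5):
    """Generate a surrogate ensemble of non-ideal ADC transfer curves.

    Each curve maps the ideal input code (0 .. 2**n_bits - 1) to a perturbed
    output code with a random gain error, offset and mild curvature, mimicking
    the spread observed in per-column Monte Carlo simulations."""
    codes = np.arange(2 ** n_bits, dtype=float)
    gain = 1.0 + sigma_gain * rng.standard_normal((n_samples, 1))
    offset = sigma_offset * rng.standard_normal((n_samples, 1))
    bow = 0.02 * rng.standard_normal((n_samples, 1))               # curvature (INL-like) term
    curves = gain * codes + offset + bow * (codes - codes.mean()) ** 2 / codes.max()
    return np.clip(np.round(curves), 0, 2 ** n_bits - 1)

# during variation-aware training, one curve would be drawn per column so that
# every output neuron of the 128x128 array sees a different ADC characteristic
ensemble = mc_adc_curves()
column_curves = ensemble[rng.integers(0, len(ensemble), size=128)]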
IV. VARIATION AWARE TRAINING FRAMEWORK

In this section, we describe the VAT scheme, which addresses the non-idealities of the ADCs along with those of the SRAM and crossbar interconnects.

A. Modeling of crossbar with non-idealities

The 8T SRAM unit cell, depicted in Fig. 7, has its output port constituted by M7, driven by the input word-line, and M8, driven by the bit-cell weight value (high or low). The current sunk by M7-M8 in a sizable SRAM array can suffer from inaccuracy due to parasitic resistances arising from the read-bit-line (RBL) interconnect. The dynamic characteristics of the output signal are affected by the parasitic capacitances of the transistors as well as of the RBL. The parasitic resistance and capacitance per unit length of the word-line (WL) are designated as r_wl and c_wl, while the bit-line (BL) parasitics are represented by r_bl and c_bl. For a given technology node, these parameters equate to the wire resistance (r_w) and wire capacitance (c_w) per unit length. The total parasitic resistance and capacitance between two unit cells are expressed as R = r·L and C = c·L, where r = r_wl or r_bl, and c = c_wl or c_bl. Notably, the effect of the parasitics becomes more pronounced for increasingly larger array sizes. The effect of these parameters has been incorporated in the training model using the open-source GENIEx framework [49]. GENIEx stands for Generalized Approach to Emulating Non-Ideality in memory arrays by using Neural Networks (NN). Using a trained NN-based model, it estimates the output signal of the RBL based on the conductance values of the bit-cell elements (in this case, M7-M8), the parasitics and the applied input.
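As a rough first-order illustration of why the RBL parasitics matter (GENIEx itself uses a trained neural network rather than a closed-form model), the sketch below estimates one column current with a fixed wire resistance added per cell position; the conductance and resistance values are arbitrary placeholders.

import numpy as np

def column_current(weights, inputs, v_bl=0.3, g_cell=20e-6, r_seg=2.0):
    """Crude estimate of a bit-line dot-product current with RBL resistance.

    weights, inputs: binary vectors along the column (1 = cell pulling current).
    Ideally the column sinks v_bl * g_cell * popcount(w & x); here each active
    cell k rows away from the read-out sees k extra segments of wire
    resistance r_seg in series, so distant cells contribute slightly less.
    """
    active = (np.asarray(weights) & np.asarray(inputs)).astype(bool)
    currents = [v_bl / (1.0 / g_cell + k * r_seg) for k, on in enumerate(active) if on]
    return sum(currents)

w = np.random.randint(0, 2, 64)
x = np.random.randint(0, 2, 64)
ideal = 0.3 * 20e-6 * np.sum(w & x)
print(column_current(w, x), "vs ideal", ideal)   # parasitic value is slightly lower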

Fig. 7. Modeling of interconnect parasitics in the crossbar

B. Description of Training process used

In this work, we train the ResNet20 model on the CIFAR10 dataset. This model comprises 19 convolution layers along with a fully-connected classifying layer at the output. Employing the optimal 32-bit floating-point configuration within the ResNet20 PyTorch architecture [], a classification accuracy of 93.26% was obtained. Subsequently, the minimum weight-bit precision and the ideal ADC resolution required to retain an inference accuracy close to the ideal case were estimated using iterative training.

(i) PyTorch training of the selected CNN model with target bit precision of inputs and weights

Fig. 8 depicts a boxplot, excluding outliers, illustrating the distribution of the weights and Conv2d outputs under ideal 32-bit floating-point arithmetic. Notably, the weights predominantly fall within the range of 0 to 1, while the Conv2d layer outputs span from 0 to 4. Based on this observation, 0 integer bits for the weights and 2 integer bits for the ADC outputs were allocated, and the number of fractional bits was estimated by iteratively computing the training accuracy, as shown in Table-I.

Fig. 8. Distribution of the pretrained ResNet20 model for (a) Conv2d (ReLU) outputs, (b) Conv2d weights

Table-I. Training accuracies (%) for 0 integer bits for weights and 2 integer bits for activations

Weight bits \ Activation bits | 5 | 6 | 7 | 8
6 | 40.33 | 84.62 | 91.06 | 92.21
7 | 59 | 81.34 | 92.05 | 92.32
8 | 50.42 | 80.79 | 92.96 | 92.14
9 | 41.56 | 81.28 | 92.97 | 92.36

The table presents training accuracies corresponding to various total numbers of bits, while maintaining the fixed 0 integer bits for weights and 2 integer bits for activations dictated by the observed ranges. Notably, the findings reveal that employing a total of 7 bits for both weights and activations yields only a marginal decrease in accuracy compared to the accuracy achieved with the ideal floating-point model. Consequently, we opt to proceed with 7 bits for the subsequent training and testing phases.
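A minimal sketch of the fixed-point format behind this sweep is given below, assuming signed weights with a sign bit and no integer bits, and unsigned activations with 2 integer bits; the helper name and the way the total bit budget is split are illustrative assumptions, not the exact training code.

import torch

def quantize(x, int_bits, frac_bits, signed=True):
    """Uniform fixed-point quantization with the given integer/fractional bits."""
    scale = 2.0 ** frac_bits
    if signed:
        lo, hi = -(2.0 ** int_bits), 2.0 ** int_bits - 1.0 / scale
    else:
        lo, hi = 0.0, 2.0 ** int_bits - 1.0 / scale
    return torch.clamp(torch.round(x * scale) / scale, lo, hi)

# Table-I style sweep: weights keep 0 integer bits (plus sign), activations keep
# 2 integer bits, and the rest of each total-bit budget goes to fractional bits.
for w_bits in (6, 7, 8, 9):
    for a_bits in (5, 6, 7, 8):
        wq = quantize(torch.randn(1000), 0, w_bits - 1)                  # sign + fractional bits
        aq = quantize(torch.rand(1000) * 4.0, 2, a_bits - 2, signed=False)
        # ... retrain/evaluate the model with these precisions and log the accuracy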
(ii) Retraining with real ADC characteristics to mitigate effects of non-linearity

We first test the impact of non-ideal ADC characteristics on the inference accuracy. Here, the ideal ADC refers to the 32-bit floating-point (FP) ReLU transfer function, which acts as a perfectly linear ADC with no variations. First, the model trained on ideal ADC characteristics is tested with non-ideal ADCs, the results for which are shown in Table-II. Since the 7-bit ideal ADC characteristics were found to produce inference accuracy close to the 32-bit case, 7-bit ADCs were used for the subsequent analysis. The first case is an ideal 7-bit ADC, obtained by replacing the 32-bit ReLU function with a 7-bit counterpart. We also test the model accuracy with the 7-bit CCO ADC characteristics obtained in the typical corner. For comparison, we also used single-slope ADC characteristics. The second column of Table-II shows the degradation in inference accuracy due to the incorporation of non-ideal ADC characteristics. This is due to the non-zero INL and DNL of such ADCs (as observed earlier in the case of the CCO ADC).

Next, retraining (RT) of the ResNet20 model was performed with the non-ideal ADC characteristics. This was done by replacing the ideal ReLU operation with the simulated ADC characteristics for a particular corner. Subsequently, the model thus trained was tested with the same ADC characteristics to evaluate the impact of retraining. The third column of Table-II shows that, post retraining, the inference accuracy was restored for the real ADC characteristics. This shows that the retraining process makes the model robust towards the non-linearity and offsets present in real ADC characteristics.

Table-II. Test results of different ADCs

(iii) Variation Aware Training for ADC

Upon establishing the efficacy of retraining with real ADC characteristics to maintain classification accuracy, we proceeded to assess the trained model's performance under random variations, captured by Monte Carlo (MC) simulations. 200-run MC samples were used to model the variation of the column-parallel ADCs for a 128x128 IMC array. First, one randomly selected sample was allocated for training and the rest were utilized for testing, to evaluate the impact of random variations in the ADCs on inference accuracy. Our findings reveal a substantial drop in accuracy due to these variations across a large number of tests, as shown in fig. 9a.

To address this, we introduce Variation Aware ADC Training (VAT), wherein all 200 MC characteristic samples are incorporated during the training process, randomly allocated to each output node of a neuron. This approach ensures that the training process captures the circuit variations derived from Monte Carlo sampling and restores the model accuracy close to the mean value of the inference accuracy for the case without variation, as shown in fig. 9b. Consequently, the proposed approach achieves a significant reduction in the accuracy drop due to random variations in ADCs, as the neural network learns to adapt to such characteristics.

Fig. 9. Test results on CCO ADC variations

(iv) Modelling of weight variation

Finally, we use the model proposed by [47] for the modeling of the weight variations. These variations are incorporated during the training process itself. The noise is added to the 7-bit quantized weights of the Conv2d layers of the CNN model just before the application of the forward pass in each iteration. First, for a given γ, we analyze the CNN performance under the following conditions: (i) non-noisy weights and ideal ADC (uniform quantization), (ii) noisy weights and ideal ADC, (iii) noisy weights and a single CCO ADC, (iv) noisy weights and CCO ADC MC VAT. Note that γ = 0.1 effectively translates to a σ/µ of 10%, where σ is the standard deviation and µ is the mean of the effective weight value. This directly translates to the distribution of the bit-cell compute current sunk by the output device, as in fig. 2a.

Table-III. Incorporating weight noise in training

Condition | Accuracy (%)
Ideal ADC, γ = 0.0 | 93.05
Ideal ADC, γ = 0.1 | 91.84
CCO ADC, γ = 0.1 | 91.50
CCO ADC VAT, γ = 0.1 | 90.38

Table-III shows that weight variations have a significant impact on accuracy post-training. The results may be improved by upsizing the SRAM output device to reduce the spread in the output current, and hence in the effective weight, specifically for the SRAM arrays computing the higher-value weight-bits (increasing the number of weight bits). It is important to note that only addressing the weight-bit variation and ignoring the ADC non-idealities would lead us back to the results shown in Table-II without retraining.
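The weight-noise injection described in (iv) can be sketched as follows: a zero-mean Gaussian perturbation with σ = γ·|w| is applied to the quantized Conv2d weights before each forward pass. The function name and the in-place update are illustrative; in practice a noise-free master copy of the weights would be kept and restored around each iteration.

import torch
import torch.nn as nn

def add_weight_noise(model: nn.Module, gamma: float = 0.1):
    """Inject bit-cell variation into Conv2d weights (sigma/mu = gamma)."""
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, nn.Conv2d):
                noise = gamma * m.weight.abs() * torch.randn_like(m.weight)
                m.weight.add_(noise)   # a clean copy would normally be restored afterwards

# typical use inside the training loop, after the weights have been quantized
# to the 7-bit fixed-point format chosen earlier:
#   add_weight_noise(model, gamma=0.1)
#   output = model(images); loss = criterion(output, labels); loss.backward(); ...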
V. COMPARISON WITH CALIBRATED ADC

In this section we describe a more conventional scheme of single-point calibration for the multiple ADCs associated with an SRAM-based MVM unit, for mitigating the effects of random variations. We compare this approach with the proposed VAT scheme. We also present the variation results for another popular ADC topology for IMC, namely the SAR-ADC, to justify the need for the proposed scheme in IMC design, irrespective of the ADC topology.

A. Comparison with Single Point Calibration Scheme

As compared to the more complex calibration scheme used in [11] for the CCO-based ADC, we adopt a relatively simpler, single-point calibration scheme in this work. This scheme forces the different ADC characteristics to coincide at the point of calibration. We observe that the random variations affect the slope of the CCO characteristics (fig. 5f and fig. 6). Assuming that the slope remains roughly constant (which is a significant simplification for low-voltage operation), we provide a correction factor for it by tuning the current-mirroring ratio between the input branch of the CCO and the current-source devices of the starved inverters. The detailed circuit-level implementation of the proposed single-point calibration scheme is shown in Fig. 10. The combinational logic unit produces a 9-bit digital output that serves to selectively enable or disable a bank of equally sized transistors connected in parallel within the tuning-bit circuit. This method of tuning is referred to as the thermometer coding scheme.

Fig. 10. Single-point calibration scheme for the CCO ADC

The calibration starts with resetting the serial-in-parallel-out (SIPO) shift register. At every clock cycle, one bit of the shift register turns HIGH, which turns ON one MOSFET in the tuning block. The moment the digital output becomes less than our reference count, the combinational logic circuit disables the register, thus fixing its content. The logic circuit of the calibration unit can be shared across multiple ADCs of different columns using a proper bus connection to calibrate the ADCs sequentially. However, since each ADC can have different tuning bits owing to random mismatch, the tuning block needs to be separate.

The ADC characteristics post calibration are shown in Fig. 11. They show a significant reduction in spread as compared to the plot in fig. 5f. Here, the calibration point is chosen around 1/3 of the maximum targeted current. It can be observed that the calibrated characteristics tend to diverge towards higher current values, while they are well matched for current values up to the middle of the range. This is purposefully done, considering the output statistics of each convolution layer. More than 75% of the outputs are found to be less than 25% of the maximum value for all layers, and hence matching the ADCs better at lower values is advantageous.

Fig. 11. ADC characteristics across process corners post-calibration

For comparing the area penalty of this calibration scheme, the layouts of the 7b CCO-ADC with and without the calibration block are shown in Fig. 12. The proposed calibration scheme results in an area penalty of around 3x, due to additional components like switches, registers and control signals. The training accuracy post calibration was found to be around 88.5%, which was lower than the case of variation aware training. It may be possible to improve the training accuracy by employing more sophisticated calibration schemes as in [11]; however, that would come at a further higher area penalty.

Fig. 12. 7b CCO-ADC layouts (a) with and (b) without calibration
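The tuning loop described above can be summarized behaviorally as follows; the toy transfer function and the 3% gain trim per enabled device are assumptions used only to show the stopping condition of the SIPO-based thermometer search.

def calibrate_single_point(adc_code_fn, i_cal, ref_count, n_tuning_bits=9):
    """Behavioral model of the single-point calibration loop.

    adc_code_fn(i, k): digital code for input current i with k thermometer-coded
    tuning devices enabled (each enabled device is assumed to reduce the
    effective CCO gain).  One device is enabled per clock; the register is
    frozen as soon as the code at the calibration current drops below ref_count.
    """
    for k in range(n_tuning_bits + 1):
        if adc_code_fn(i_cal, k) < ref_count:
            return k                      # SIPO register frozen with k bits set
    return n_tuning_bits                  # tuning range exhausted

# toy 7-bit transfer: each enabled device trims the gain by about 3 %
code = lambda i, k: int(127 * (i / 10e-6) * (1.0 - 0.03 * k))
print(calibrate_single_point(code, i_cal=10e-6 / 3, ref_count=40))   # -> 2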
B. Variations in SAR ADC and need for VAT

In order to interface a SAR ADC with an SRAM-based IMC array with current-mode operation, we need to convert the output current into a proportional voltage. This can be achieved using a transimpedance amplifier or an integrator, as shown in fig. 13. Hence, we need to incorporate the effects of variations and non-linearities in both the ADC and the integrator.

Fig. 13. SAR ADC interface with the IMC output

The SAR ADC consists of variation-sensitive units like the capacitive digital-to-analog converter (DAC) and the comparator. Both can contribute to significant variations, resulting from capacitive mismatch and offset respectively. Fig. 15 shows the 200-run MC simulation results for a 7-bit SAR ADC in 65nm CMOS technology. The efficacy of the proposed VAT scheme was tested for the SAR ADC characteristics as well. Without VAT, the model accuracy was less than 81%, which improved to 91.2% after the variation aware training process. This approach can therefore allow smaller unit capacitors and smaller devices for the comparator and the integrating amplifier, as it allows the IMC core to work with cruder and less precise ADCs, without the need for per-device calibration. Smaller devices can facilitate lower area, higher speed as well as lower power operation for the large number of ADCs interfaced with the IMC units.

Fig. 14. Schematic diagram of the SAR ADC

Fig. 15. MC results for the 7-bit SAR characteristics without any calibration

VI. CONCLUSION

In this work we proposed a variation aware training approach for mitigating the impact of ADC non-idealities in in-memory computing. Conventional approaches involve expensive calibration schemes and mandate relatively robust and accurate ADC designs for integration with IMC cores. The proposed approach, on the other hand, can significantly relax the design constraints for the read-out ADCs and allow relatively more compact units. The VAT scheme proposed in this work can be extended to different IMC schemes, including charge-mode and current-mode methods. It is also applicable to different ADC topologies suitable for IMC integration. The proposed approach can be integrated into a system-level design flow for DNN accelerators based on IMC, which would result in circuit-architecture-algorithm co-design.

REFERENCES
1. Chen, X., et al., "Energy-efficient offloading for DNN-based smart IoT systems in cloud-edge environments", IEEE TPDS, 33(3), 683-697, 2021.
2. Whatmough, P. N., et al., "DNN engine: A 28-nm timing-error tolerant sparse deep neural network processor for IoT applications", IEEE JSSC, 53(9), 2722-2731, 2018.
3. Shin, D., et al., "The heterogeneous deep neural network processor with a non-von Neumann architecture", Proceedings of the IEEE, 108(8), 1245-1260, 2019.
4. Ganguly, et al., "Towards energy efficient non-von Neumann architectures for deep learning", 20th ISQED, IEEE, 2019.
5. Verma, N., et al., "In-memory computing: Advances and prospects", IEEE SSC Magazine, 11(3), 43-55, 2019.
6. Ma, Y., et al., "In-memory computing: The next-generation AI computing paradigm", GLSVLSI, 2020.
7. Zou, X., et al., "Breaking the von Neumann bottleneck: architecture-level processing-in-memory technology", SCIS, 64(6), 160404, 2021.
8. Sebastian, A., et al., "Memory devices and applications for in-memory computing", Nat. Nano., 15(7), 529-544, 2020.
9. Ankit, A., et al., "Circuits and architectures for in-memory computing-based machine learning accelerators", IEEE Micro, 40(6), 8-22, 2020.
10. Khoram, S., et al., "Challenges and opportunities: From near-memory computing to in-memory computing", ACM ISPD, 2017.
11. Khaddam-Aljameh, R., et al., "HERMES-core—A 1.59-TOPS/mm2 PCM on 14-nm CMOS in-memory compute core using 300-ps/LSB linearized CCO-based ADCs", IEEE JSSC, 57(4), 1027-1038, 2022.
12. Liu, Z., et al., "NS-CIM: A current-mode computation-in-memory architecture enabling near-sensor processing for intelligent IoT vision nodes", IEEE TCAS I, 67(9), 2909-2922, 2020.
13. Chen, W-H., et al., "A 16Mb dual-mode ReRAM macro with sub-14ns computing-in-memory and memory functions enabled by self-write termination scheme", IEDM, IEEE, 2017.
14. Yin, G., et al., "Enabling lower-power charge-domain nonvolatile in-memory computing with ferroelectric FETs", IEEE TCAS II: Express Briefs, 68(7), 2262-2266, 2017.
15. Valavi, H., et al., "A 64-tile 2.4-Mb in-memory-computing CNN accelerator employing charge-domain compute", IEEE JSSC, 54(6), 1789-1799, 2019.
16. Mroszczyk, P., et al., "Energy efficient flash ADC with PVT variability compensation through advanced body biasing", IEEE TCAS II, 66(11), 1775-1779, 2019.
17. Palani, et al., "High linearity PVT tolerant 100MS/s rail-to-rail ADC driver with built-in sampler in 65nm CMOS", CICC, IEEE, 2014.
18. Wang, C-C., et al., "Anti-PVT-variation low-power time-to-digital converter design using 90-nm CMOS process", IEEE TVLSI, 28(9), 2069-2073, 2020.
19. Liu, Z., et al., "NS-CIM: A current-mode computation-in-memory architecture enabling near-sensor processing for intelligent IoT vision nodes", IEEE TCAS I, 67(9), 2909-2922, 2020.
20. Yoon, J-H., et al., "A 40nm 64kb 56.67 TOPS/W read-disturb-tolerant compute-in-memory/digital RRAM macro with active-feedback-based read and in-situ write verification", ISSCC, IEEE, 2021.
21. Genov, R., et al., "Charge-mode parallel architecture for vector-matrix multiplication", IEEE TCAS II, 48(10), 930-936, 2001.
22. Fowers, J., et al., "A configurable cloud-scale DNN processor for real-time AI", ACM/IEEE 45th ISCA, IEEE, 2018.
23. Papistas, I. A., et al., "A 22 nm, 1540 TOP/s/W, 12.1 TOP/s/mm2 in-memory analog matrix-vector-multiplier for DNN acceleration", CICC, IEEE, 2021.
24. Wan, W., et al., "A compute-in-memory chip based on resistive random-access memory", Nature, 608(7923), 504-512, 2022.
25. Seshadri, V., et al., "Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology", 50th IEEE/ACM ISM, 2017.
26. Li, S., et al., "Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories", 53rd DAC, IEEE, 2016.
27. Sinangil, M. E., et al., "A 7-nm compute-in-memory SRAM macro supporting multi-bit input, weight and output and achieving 351 TOPS/W and 372.4 GOPS", IEEE JSSC, 56(1), 188-198, 2020.
28. Si, X., et al., "A dual-split 6T SRAM-based computing-in-memory unit-macro with fully parallel product-sum operation for binarized DNN edge processors", IEEE TCAS I: Regular Papers, 66(11), 4172-4185, 2019.
29. Sun, X., et al., "Computing-in-memory with SRAM and RRAM for binary neural networks", 14th IEEE ICSICT, 2018.
30. Heimel, M., et al., "Hardware-oblivious parallelism for in-memory column-stores", VLDB, 6(9), 709-720, 2013.
31. Imani, M., et al., "Resistive configurable associative memory for approximate computing", DATE, IEEE, 2016.
32. Choi, B., et al., "DeNovo: Rethinking the memory hierarchy for disciplined parallelism", ICPACT, IEEE, 2011.
33. Ielmini, D., et al., "Device and circuit architectures for in-memory computing", AIS, 2(7), 2000040, 2020.
34. Jeong, Y., et al., "Parasitic effect analysis in memristor-array-based neuromorphic systems", IEEE Trans. Nanotech., 17(1), 184-193, 2017.
35. Chen, Z., et al., "CAP-RAM: A charge-domain in-memory computing 6T-SRAM for accurate and precision-programmable CNN inference", IEEE JSSC, 56(6), 1924-1935, 2021.
36. Chen, Z., et al., "DCT-RAM: A driver-free process-in-memory 8T SRAM macro with multi-bit charge-domain computation and time-domain quantization", CICC, IEEE, 2022.
37. Yin, S., et al., "High-throughput in-memory computing for binary deep neural networks with monolithically integrated RRAM and 90-nm CMOS", IEEE TED, 67(10), 4185-4192, 2020.
38. Roy, K., et al., "In-memory computing in emerging memory technologies for machine learning: An overview", DAC, IEEE, 2020.
39. Kneip, A., et al., "Impact of analog non-idealities on the design space of 6T-SRAM current-domain dot-product operators for in-memory computing", IEEE TCAS I, 68(5), 1931-1944, 2021.
40. Kim, H., et al., "A 16K SRAM-based mixed-signal in-memory computing macro featuring voltage-mode accumulator and row-by-row ADC", A-SSCC, IEEE, 2019.
41. Murali, G., et al., "Heterogeneous mixed-signal monolithic 3-D in-memory computing using resistive RAM", IEEE TVLSI, 29(2), 386-396, 2020.
42. Xue, C-X., et al., "A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN based AI edge processors", ISSCC, pp. 388-389, 2019.
43. Mu, J., et al., "SRAM-based in-memory computing macro featuring voltage-mode accumulator and row-by-row ADC for processing neural networks", IEEE TCAS I: Regular Papers, 69(6), 2412-2422, 2022.
44. Liu, Q., et al., "A fully integrated analog ReRAM based 78.4 TOPS/W compute-in-memory chip with fully parallel MAC computing", ISSCC, IEEE, 2020.
45. Moon, K-J., et al., "A 9.1-ENOB 6-mW 10-bit 500-MS/s pipelined-SAR ADC with current-mode residue processing in 28-nm CMOS", IEEE JSSC, 54(9), 2532-2542, 2019.
46. Su, J-W., et al., "A 28nm 384kb 6T-SRAM computation-in-memory macro with 8b precision for AI edge chips", IEEE ISSCC, 2021.
47. Long, Y., et al., "Design of reliable DNN accelerator with un-reliable ReRAM", DATE, pp. 1769-1774, IEEE, 2019.
48. Ankit, A., et al., "PUMA: A programmable ultra-efficient memristor-based accelerator for machine learning inference", ASPLOS, pp. 715-731, IEEE, 2019.
49. Chakraborty, I., et al., "GENIEx: A generalized approach to emulating non-ideality in memristive xbars using neural networks", DAC, IEEE, 2020.
50. He, K., et al., "Deep residual learning for image recognition", CVPR, IEEE, pp. 770-778, 2016.
