
Rethinking Generic Camera Models for Deep Single Image Camera Calibration
to Recover Rotation and Fisheye Distortion

Nobuhiko Wakai1   Satoshi Sato1   Yasunori Ishii1   Takayoshi Yamashita2
1 Panasonic Corporation   2 Chubu University
{lastname.firstname}@jp.panasonic.com   takayoshi@isc.chubu.ac.jp

arXiv:2111.12927v1 [cs.CV] 25 Nov 2021

Abstract

Although recent learning-based calibration methods can predict extrinsic and intrinsic camera parameters from a single image, the accuracy of these methods is degraded in fisheye images. This degradation is caused by mismatching between the actual projection and expected projection. To address this problem, we propose a generic camera model that has the potential to address various types of distortion. Our generic camera model is utilized for learning-based methods through a closed-form numerical calculation of the camera projection. Simultaneously to recover rotation and fisheye distortion, we propose a learning-based calibration method that uses the camera model. Furthermore, we propose a loss function that alleviates the bias of the magnitude of errors for four extrinsic and intrinsic camera parameters. Extensive experiments demonstrated that our proposed method outperformed conventional methods on two large-scale datasets and images captured by off-the-shelf fisheye cameras. Moreover, we are the first researchers to analyze the performance of learning-based methods using various types of projection for off-the-shelf cameras.

[Figure 1. Concept illustrations of our work. Our network predicts parameters in our proposed generic camera model to obtain fully recovered images using remapping. Red lines indicate horizontal lines in each of the images. (Diagram: input fisheye image → feature extractor → regressors for θ, ψ, f, and k1 → proposed generic camera model γ = f(η + k1 η³) → remap → fully recovered image.)]

1. Introduction

Learning-based perception methods are widely used for surveillance, cars, drones, and robots. These methods are well established for many computer vision tasks. Most computer vision tasks require undistorted images; however, fisheye images have the superiority of a large field of view in visual surveillance [11], object detection [34], pose estimation [5], and semantic segmentation [26]. To use fisheye cameras through removing distortion, camera calibration is a desirable step before perception. Camera calibration is a long-studied topic in areas of computer vision, such as image undistortion [24, 47], image remapping [42], virtual object insertion [17], augmented reality [3], and stereo measurement [28]. In camera calibration, we cannot escape the trade-off between accuracy and usability that we need a calibration object; hence, tackling the trade-off has been an open challenge, which we explain further in the following.

Calibration methods are classified into two categories: geometric-based and learning-based methods. Geometric-based calibration methods achieve high accuracy, but they require a calibration object, such as a cube [41] and planes [48], to obtain a strong geometric constraint. By contrast, learning-based methods can calibrate cameras without a calibration object from a general scene image [24, 47], which is called deep single image camera calibration. Although learning-based methods do not require a calibration object, the accuracy of these methods is degraded for fisheye images because of the mismatch between the actual projection and expected projection in conventional methods. In particular, calibration methods [29, 42] that predict both camera rotation and distortion have much room for improvement regarding addressing complex fisheye distortion. López-Antequera's method [29] was designed for non-fisheye cameras with radial distortion and cannot process fisheye distortion. Although four standard camera models are used for fisheye cameras, Wakai's method [42] supports only one fisheye camera model.
Based on the observations above, we propose a new generic camera model for various fisheye cameras. The proposed generic camera model has the potential to address various types of distortion. For the generic camera model, we propose a learning-based calibration method that predicts extrinsic parameters (tilt and roll angles), focal length, and a distortion coefficient simultaneously from a single image, as shown in Figure 1. Our camera model is utilized for learning-based methods through a closed-form numerical calculation of camera projection. To improve the prediction accuracy, we use a joint loss function composed of one loss for each of the four camera parameters. Unlike the heuristic approaches in conventional methods, our loss function makes significant progress; that is, we can determine the optimal joint weights based on the magnitude of errors for these camera parameters instead of relying on heuristics.

To evaluate the proposed method, we conducted extensive experiments on two large-scale datasets [6, 30] and images captured by off-the-shelf fisheye cameras. This evaluation demonstrated that our method meaningfully outperformed conventional geometric-based [37] and learning-based methods [7, 24, 29, 42, 47]. The major contributions of our study are summarized as follows:

• We propose a learning-based calibration method for recovering camera rotation and fisheye distortion using the proposed generic camera model that has an adaptive ability for off-the-shelf fisheye cameras. To the best of our knowledge, we are the first researchers to calibrate extrinsic and intrinsic parameters of generic camera models from a single image.

• We propose a new loss function that alleviates the bias of the magnitude of errors between the ground-truth and predicted camera parameters for four extrinsic and intrinsic parameters to obtain accurate camera parameters.

• We are the first to analyze the performance of learning-based methods using various off-the-shelf fisheye cameras. In previous studies, these conventional learning-based methods were evaluated using only synthetic images.

2. Related work

Camera calibration: Camera calibration estimates parameters composed of extrinsic parameters (rotation and translation) and intrinsic parameters (image sensor and distortion parameters). Geometric-based calibration methods have been developed using a strong constraint based on a calibration object [41, 48] or line detection [2, 37]. This constraint explicitly represents the relation between world coordinates and image coordinates for the stable optimization of calibration. By contrast, learning-based methods based on convolutional neural networks calibrate cameras from a single image in the wild. In this study, we focus on learning-based calibration methods and describe them below.

Calibration methods for only extrinsic parameters have been proposed that are aimed at narrow-view cameras [19, 32, 38, 39, 44, 45] and panoramic 360° images [10]. These methods cannot calibrate intrinsic parameters, that is, they cannot remove distortion. For extrinsic parameters and focal length, narrow-view camera calibration was developed with depth estimation [8, 15] and room layout [35]. These methods are not suitable for fisheye cameras because fisheye distortion is not negligible when the projected field of view exceeds 180°.

To address large distortion, calibration methods for only undistortion have been proposed that use specific image features, that is, segmentation information [47], straight lines [46], and ordinal distortion of part of the images [24]. Furthermore, Chao et al. [7] proposed undistortion networks based on generative adversarial networks [13]. These methods can process only undistortion and image remapping tasks.

For both extrinsic and intrinsic parameters, López-Antequera et al. [29] proposed a pioneering method for non-fisheye cameras. This method estimates distortion using a polynomial function model of perspective projection similar to Tsai's quartic polynomial model [41]. This polynomial function of the distance from a principal point has two coefficients for the second- and fourth-order terms. The method is only trainable for the second-order coefficient, and the fourth-order coefficient is calculated using a quadratic function of the second-order one. This method does not calibrate fisheye cameras effectively because the camera model does not represent fisheye camera projection. Additionally, Wakai et al. [42] proposed a calibration method for extrinsic parameters and focal length in fisheye cameras. Although four types of standard fisheye projection are used for camera models, for example, equisolid angle projection, the method of Wakai et al. [42] only expects equisolid angle projection. As discussed above, conventional learning-based calibration methods do not fully calibrate extrinsic and intrinsic parameters of generic camera models from a single image.

Exploring loss landscapes: To optimize networks effectively, loss landscapes have been explored after training [9, 14, 23] and during training [16]. In learning-based calibration methods, we have the problem that joint weights are difficult to determine without training. To stabilize training or the merging of heterogeneous loss components, a joint loss function was often defined [24, 29, 42, 47]. However, these joint weights were defined using experiments or set to the same values, that is, unweighted joints. These joint weights are hyperparameters that depend on networks and datasets. A hyperparameter search method was proposed by Akiba et al. [1].
However, hyperparameter search tools require a large computational cost because they execute many conditions. Additionally, to analyze optimizers, Goodfellow et al. [14] proposed an examination method for loss landscapes that uses linear interpolation from the initial network weights to the final weights. To overcome the saddle points of loss landscapes, Dauphin et al. [9] proposed an optimization method based on Newton's method. Furthermore, Li et al. [23] developed a method for visualizing loss landscapes. Although these methods can explore high-order loss landscapes, the optimal values of joint loss weights have not been determined in learning-based calibration methods. Moreover, the aforementioned methods cannot explore loss landscapes without training because they require training results.
3. Proposed method

First, we describe our proposed camera model based on a closed-form solution for various fisheye cameras. Second, we describe our learning-based calibration method for recovering rotation and fisheye distortion. Finally, we explain a new loss function, with its notation and mechanism.

3.1. Generic camera model

Camera models are composed of extrinsic parameters [ R | t ] and intrinsic parameters, and these camera models represent the mapping from world coordinates p̃ to image coordinates ũ in homogeneous coordinates. This projection can be expressed for radial distortion [33] and fisheye models [40] as

          ⎡ γ/du    0     cu ⎤
    ũ  =  ⎢   0   γ/dv    cv ⎥ [ R | t ] p̃ ,        (1)
          ⎣   0     0      1 ⎦

where γ is the distortion, (du, dv) is the image sensor pitch, (cu, cv) is the principal point, R is a rotation matrix, and t is a translation vector. The subscripts u and v denote the horizontal and vertical directions, respectively.

The generic camera model including fisheye lenses [12] is defined as

    γ = k̃1 η + k̃2 η³ + · · · ,        (2)

where k̃1, k̃2, . . . are distortion coefficients, and η is an incident angle. Note that the focal length is not defined explicitly; that is, the focal length is set to 1 mm, and the distortion coefficients represent both distortion and an implicit focal length.

3.2. Proposed camera model

A generic camera model with high order has the potential to achieve high calibration accuracy. However, this high-order function leads to unstable optimization, particularly for learning-based methods. Considering this problem, we propose a generic camera model for learning-based fisheye calibration using the explicit focal length, given by

    γ = f (η + k1 η³),        (3)

where f is the focal length and k1 is a distortion coefficient.

Evaluating our camera model: Our generic camera model is a third-order polynomial function that corresponds to the Taylor expansion of the trigonometric functions used in fisheye cameras, that is, stereographic projection, equidistance projection, equisolid angle projection, and orthogonal projection. In the following, we show that our model can express the trigonometric function models with only slight errors.

We compared the projection functions, γ = g(η), of the four trigonometric function models and our generic camera model, as shown in Table 1. In this comparison, we calculated the mean absolute error ε between pairs of projection functions g1 and g2, defined as ε = (1/(π/2)) ∫₀^{π/2} |g1(η) − g2(η)| dη. This mean absolute error simply represents the mean distance error in image coordinates. Our model is useful for various fisheye models because it had the smallest mean absolute errors among the camera models in Table 1.

Table 1. Comparison of absolute errors in fisheye camera models

                              Mean absolute error [pixel]
Reference model¹            STG      EQD      ESA      ORT
Stereographic (STG)          –       9.33    13.12    93.75
Equidistance (EQD)          9.33      –       3.79    23.58
Equisolid angle (ESA)      13.12     3.79      –      14.25
Orthogonal (ORT)           93.75    23.58    14.25      –
Proposed generic model      0.54     0.00     0.02     0.35

¹ Each reference model is compared with the other fisheye models
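As an illustration of this comparison, the following minimal sketch fits the coefficient k1 of the proposed model to each trigonometric projection by least squares over η ∈ [0, π/2] and reports the mean absolute difference. It is not the evaluation code used for Table 1: the focal length, the sampling grid, and the fitting procedure are assumptions, and the errors are reported in millimetres on the image plane rather than pixels.

```python
import numpy as np

# Trigonometric fisheye projections gamma = g(eta) (image height vs. incident angle),
# all evaluated with the same focal length f for comparability.
def stereographic(eta, f):  return 2.0 * f * np.tan(eta / 2.0)
def equidistance(eta, f):   return f * eta
def equisolid(eta, f):      return 2.0 * f * np.sin(eta / 2.0)
def orthogonal(eta, f):     return f * np.sin(eta)

def proposed(eta, f, k1):   # generic model of Equation (3)
    return f * (eta + k1 * eta ** 3)

f = 9.0                                     # assumed focal length [mm], illustrative only
eta = np.linspace(0.0, np.pi / 2.0, 1001)   # incident angles up to 90 degrees

for name, g in [("STG", stereographic), ("EQD", equidistance),
                ("ESA", equisolid), ("ORT", orthogonal)]:
    target = g(eta, f)
    # Least-squares fit of k1 so that the proposed model mimics this projection.
    k1 = np.linalg.lstsq((f * eta ** 3)[:, None], target - f * eta, rcond=None)[0][0]
    mae = np.mean(np.abs(proposed(eta, f, k1) - target))
    print(f"{name}: fitted k1 = {k1:+.4f}, mean |error| = {mae:.4f} mm")
```

As a sanity check, the fitted k1 for equidistance projection is exactly zero, so its error vanishes, consistent with the corresponding entry in Table 1.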
 
Calculation easiness: For our generic camera model, it is easy to calculate back-projection, which converts image coordinates to the corresponding incident angles. When using back-projection, we must solve the generic camera model of Equation (3) for the incident angle η. Practically, such equations can be solved either iteratively or in closed form. Non-fisheye cameras often use iterative approaches [4]. By contrast, we cannot use iterative approaches for fisheye cameras because large distortion prevents us from obtaining initial values close to the solutions. We therefore use a closed-form approach because the Abel–Ruffini theorem [49] shows that algebraic equations of fourth order or less are solvable.
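For illustration, the incident angle can be recovered from an image height by solving the cubic k1 η³ + η − γ/f = 0. The sketch below is a minimal example, not the implementation used in this work; it takes the real root in the valid range using NumPy's cubic root finder (a Cardano-style closed form gives the same root), and the example parameter values are assumptions.

```python
import numpy as np

def back_project(gamma, f, k1, eta_max=np.pi / 2.0):
    """Solve f * (eta + k1 * eta**3) = gamma for the incident angle eta.

    The cubic k1*eta**3 + eta - gamma/f = 0 is solvable in closed form
    (fourth order or less, per Abel-Ruffini); here we take the real root
    in [0, eta_max] for brevity.
    """
    if abs(k1) < 1e-12:                       # model degenerates to eta = gamma / f
        return gamma / f
    roots = np.roots([k1, 0.0, 1.0, -gamma / f])
    real = roots[np.isclose(roots.imag, 0.0)].real
    valid = real[(real >= 0.0) & (real <= eta_max)]
    return float(valid.min()) if valid.size else None

# Round trip: project an angle with the generic model, then back-project it.
f, k1 = 9.0, -0.1                             # assumed values for illustration
eta_true = np.deg2rad(60.0)
gamma = f * (eta_true + k1 * eta_true ** 3)
print(np.rad2deg(back_project(gamma, f, k1)))  # ~60.0
```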
3.3. Proposed calibration method

To calibrate various fisheye cameras, we propose a learning-based calibration method that uses our generic camera model. We use DenseNet-161 [18] pretrained on ImageNet [36] to extract image features; the details are as follows. First, we convert the image features using global average pooling [25] for the regressors. Second, four individual regressors predict the normalized parameters (from 0 to 1) of the tilt angle θ, roll angle ψ, focal length f, and a distortion coefficient k1. Each regressor consists of a 2208-channel fully connected (FC) layer with Mish activation [31] and a 256-channel FC layer with sigmoid activation; batch normalization [20] is applied to these FC layers. Finally, we predict a camera model by recovering the ranges of the normalized camera parameters to their original ranges. We scale the input images to 224 × 224 pixels following conventional studies [17, 29, 42].
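A minimal PyTorch sketch of this architecture is shown below. It reflects one reading of the description above (DenseNet-161 features, global average pooling, and four independent heads with Mish and sigmoid activations); the exact layer widths, the placement of batch normalization, and all names are assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
import torchvision

class CalibrationNet(nn.Module):
    """DenseNet-161 features followed by four independent parameter regressors."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.densenet161(weights="IMAGENET1K_V1")  # ImageNet-pretrained
        self.features = backbone.features              # 2208-channel feature map
        self.pool = nn.AdaptiveAvgPool2d(1)             # global average pooling
        # One head per parameter: tilt, roll, focal length, distortion k1.
        self.heads = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(2208, 256), nn.BatchNorm1d(256), nn.Mish(),
                nn.Linear(256, 1), nn.Sigmoid(),         # normalized output in (0, 1)
            )
            for name in ("tilt", "roll", "focal", "k1")
        })

    def forward(self, x):                                # x: (B, 3, 224, 224)
        feat = self.pool(self.features(x)).flatten(1)    # (B, 2208)
        return {name: head(feat).squeeze(1) for name, head in self.heads.items()}

# The normalized outputs are then rescaled to the original parameter ranges (Table 2).
```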
3.4. Harmonic non-grid bearing loss

Unlike loss functions based on image reconstruction, Wakai et al. proposed the non-grid bearing loss function L [42] based on projecting image coordinates to world coordinates as

    Lα = (1/n) Σᵢ₌₁ⁿ ||pi − p̂i||₂ ,        (4)

where n is the number of sampling points; α is a parameter, α ∈ {θ, ψ, f, k1}; and p̂ is the ground-truth value of the world coordinates p. Note that Lθ indicates that the loss function uses a predicted θ and ground-truth parameters for ψ, f, and k1; Lψ, Lf, and Lk1 are determined in the same manner. We obtain the world coordinates p from the image coordinates of sampled points. The sampled points are projected from a unit sphere. For sampling on the unit sphere, we use a uniform distribution within the valid incident angles, which depend on k1. This loss function achieves stable camera calibration using the unit sphere. The joint loss is defined as

    L = wθ Lθ + wψ Lψ + wf Lf + wk1 Lk1 ,        (5)

where wθ, wψ, wf, and wk1 are the joint weights of θ, ψ, f, and k1, respectively. Although this loss function can effectively train networks, we need to determine the joint weights for each camera parameter. Wakai et al. [42] set the joint weights to the same values. To determine the optimal joint weights, they would have needed to repeat training and validation. However, they did not search for the optimal joint weights because of the large computational cost.

To address this problem, we surprisingly found that numerical simulations, instead of training, can analyze the loss landscapes. The loss function can be divided into two steps: predicting camera parameters from an image and projecting sampled points using the camera parameters. The latter step requires only the sampled points and camera parameters. Therefore, we focused on the latter step, which is independent of input images. Figure 2 (a) shows the loss landscapes along the normalized camera parameters. The landscapes show that the magnitude of the loss values for the focal length is the smallest among θ, ψ, f, and k1, and the focal length is therefore relatively hard to train. Our investigation suggests that the optimal joint loss weights w can be estimated as follows. We calculate the areas S under the loss function L for θ, ψ, f, and k1. Assuming practical conditions, we set the ground-truth values to 0.5, the center of the normalized parameter range from 0 to 1 in Figure 2 (a). This area S, obtained by integrating L over the interval 0 to 1, is illustrated in Figure 2 (b) and is given by

    Sα = ∫₀¹ Lα dα = ∫₀¹ (1/n) Σᵢ₌₁ⁿ ||pi − p̂i||₂ dα .        (6)

These areas S represent the magnitude of each loss for θ, ψ, f, and k1. Therefore, we define the joint weights w in Equation (5) using normalization as

    wα = w̃α / W ,        (7)

where w̃α = 1/Sα and W = Σα w̃α. We call the loss function using the weights in Equation (7) "harmonic non-grid bearing loss (HNGBL)". As stated above, our joint weights can alleviate the bias of the magnitude of the loss for the camera parameters. Remarkably, we determine these weights without training.

[Figure 2. Difference between the non-grid bearing loss functions [42] for the camera parameters. (a) Each loss landscape (Lθ, Lψ, Lf, and Lk1) along the normalized camera parameter, using a predicted value for the subscripted parameter and ground-truth values (set to 0.5) for the remaining parameters. (b) Areas S (Sθ, Sψ, Sf, and Sk1) obtained by integrating the curves in (a) over the interval 0 to 1.]
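Because Equation (7) needs only the loss landscapes of the sampled points, the joint weights can be computed by a small offline simulation. The following sketch illustrates the procedure under stated assumptions: the non-grid bearing loss of Equation (4) is supplied as a placeholder callable, each parameter is swept over [0, 1] with the others fixed at the ground-truth value 0.5, and Sα is approximated by averaging the sampled losses. It is not the authors' simulation code.

```python
import numpy as np

def joint_weights(loss_fn, names, gt=0.5, samples=101):
    """Harmonic joint weights of Equation (7), computed without any training.

    loss_fn(params) must return the non-grid bearing loss of Equation (4)
    for a dict of normalized parameters in [0, 1]; it is a placeholder here.
    """
    grid = np.linspace(0.0, 1.0, samples)
    areas = {}
    for name in names:
        losses = []
        for value in grid:
            params = {n: gt for n in names}   # remaining parameters at ground truth
            params[name] = value              # sweep the parameter of interest
            losses.append(loss_fn(params))
        # S_alpha: integral of L_alpha over [0, 1] (interval length 1 -> mean value).
        areas[name] = float(np.mean(losses))
    inv = {n: 1.0 / areas[n] for n in names}  # w~_alpha = 1 / S_alpha
    total = sum(inv.values())                 # W = sum of w~_alpha
    return {n: inv[n] / total for n in names} # w_alpha = w~_alpha / W

# Usage (with a user-supplied non-grid bearing loss):
# weights = joint_weights(ngbl, ("tilt", "roll", "focal", "k1"))
```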

4. Experiments

To validate the adaptiveness of our method to various types of fisheye cameras, we conducted extensive experiments using large-scale synthetic images and off-the-shelf fisheye cameras.

4.1. Datasets
We used two large-scale datasets of outdoor panoramas called the StreetLearn dataset (Manhattan 2019 subset) [30] and the SP360 dataset [6]. First, we divided each dataset into train and test sets following [42]: 55,599 train and 161 test images for StreetLearn, and 19,038 train and 55 test images for SP360. Second, we generated image patches, with a 224-pixel image height (Himg) and an image width of Wimg = Himg · A, where A is the aspect ratio, from the panorama images: 555,990 train and 16,100 test image patches for StreetLearn, and 571,140 train and 16,500 test image patches for SP360. Table 2 shows the random distributions used for the train set when generating image patches with camera models up to the maximum incident angle ηmax. The test set used uniform distributions instead of the mixed and varying distributions used for the train set. During the generation step, we set the minimum image circle diameter to the image height, assuming practical conditions.

Table 2. Distribution of the camera parameters for our train set

Parameter           Distribution   Range or values¹
Pan φ               Uniform        [0, 360)
Tilt θ              Mix            Normal 70% (µ = 0, σ = 15), Uniform 30% [−90, 90]
Roll ψ              Mix            Normal 70% (µ = 0, σ = 15), Uniform 30% [−90, 90]
Aspect ratio        Varying        {1/1 9%, 5/4 1%, 4/3 66%, 3/2 20%, 16/9 4%}
Focal length f      Uniform        [6, 15]
Distortion k1       Uniform        [−1/6, 1/3]
Max angle ηmax      Uniform        [84, 96]

¹ Units: φ, θ, ψ, and ηmax [deg]; f [mm]; k1 [dimensionless]
full-size image sensor. We ignored the arbitrary translation
ID Camera body Camera lens
vector t. Because the origin of the pan angle is arbitrary, we
1 Canon EOS 6D Mark II Canon EF8-15mm F4L Fisheye USM provided the pan angle for training and evaluation. There-
2 Canon EOS 6D Mark II Canon EF15mm F2.8 Fisheye
fore, we focused on four trainable parameters, that is, tilt
Panasonic LUMIX
3 Panasonic LUMIX GM1 angle θ, roll angle ψ, focal length f , and a distortion co-
G FISHEYE 8mm F3.5
4 FLIR BFLY-U3-23S6C FIT FI-40 efficient k1 , in our method. Note that we considered cam-
5 FLIR FL3-U3-88S2 FUJIFILM FE185C057HA-1 era rotation based on the horizontal line, unlike calibration
6 KanDao QooCam8K Built-in methods [21, 22] under the Manhattan world assumption.
We optimized our network for a 32 mini-batch size us-
ing a rectified Adam optimizer [27], whose weight decay
and 55 test images for SP360. Second, we generated im- was 0.01. We set the initial learning rate to 1 × 10−4 and
age patches, with a 224-pixel image height (Himg ) and im- multiplied the learning rate by 0.1 at the 50th epoch. Ad-
age width (Wimg = Himg · A), where A is the aspect ditionally, we set the joint weights in Equation (5) using
ratio, from panorama images: 555, 990 train and 16, 100 wθ = 0.103, wψ = 0.135, wf = 0.626, and wk1 = 0.136.
test image patches for StreetLearn, and 571, 140 train and 4.4. Experimental results
16, 500 test image patches for SP360. Table 2 shows the
random distribution of the train set when we generated im- In Table 4, we summarize the features of the conven-
age patches using camera models with the maximum inci- tional methods. We implemented the methods accord-
dent angle ηmax . The test set used the uniform distribution ing to the implementation details provided in the corre-
instead of the mixed and varying distribution used for the sponding papers, with the exception that StreetLearn and
train set. During the generation step, we set the minimum SP360 were used for training the methods of Chao [7],
image circle diameter to the image height, assuming practi- López-Antequera [29], and Wakai [42]. For the method
cal conditions. of Santana-Cedrés [37], we excluded test images with few
lines because this method requires many lines for calibra-
tion.
4.4. Experimental results

In Table 4, we summarize the features of the conventional methods. We implemented the methods according to the implementation details provided in the corresponding papers, with the exception that StreetLearn and SP360 were used for training the methods of Chao [7], López-Antequera [29], and Wakai [42]. For the method of Santana-Cedrés [37], we excluded test images with few lines because this method requires many lines for calibration.

Table 4. Feature summarization of the conventional methods and our method

Method                   DL²   Rot²   Dist²   Projection
Santana-Cedrés [37]                    ✓      Perspective
Liao [24]                 ✓            ✓      Perspective
Yin [47]                  ✓            ✓      Generic camera
Chao [7]¹                 ✓            ✓      –
López-Antequera [29]      ✓     ✓      ✓      Perspective
Wakai [42]                ✓     ✓      ✓      Equisolid angle
Ours                      ✓     ✓      ✓      Proposed generic camera

¹ Using a generator for undistortion
² DL denotes learning-based method; Rot denotes rotation; Dist denotes distortion

4.4.1 Parameter and reprojection errors

To validate the accuracy of the predicted camera parameters, we compared the methods that can predict both rotation and distortion parameters. We evaluated the mean absolute errors of the camera parameters and the mean reprojection error (REPE) on the test set for our generic camera model.
Table 5. Comparison of the absolute parameter errors and reprojection errors on the test set for our generic camera model

StreetLearn                 Mean absolute error ↓                           REPE ↓
Method                      Tilt θ [deg]  Roll ψ [deg]  f [mm]   k1        [pixel]
López-Antequera [29]        27.60         44.90         2.32     –         81.99
Wakai [42]                  10.70         14.97         2.73     –         30.02
Ours w/o HNGBL¹              7.23          7.73         0.48     0.025     12.65
Ours                         4.13          5.21         0.34     0.021      7.39

SP360                       Mean absolute error ↓                           REPE ↓
Method                      Tilt θ [deg]  Roll ψ [deg]  f [mm]   k1        [pixel]
López-Antequera [29]        28.66         44.45         3.26     –         84.56
Wakai [42]                  11.12         17.70         2.67     –         32.01
Ours w/o HNGBL¹              6.91          8.61         0.49     0.030     12.57
Ours                         3.75          5.19         0.39     0.023      7.39

¹ "Ours w/o HNGBL" refers to replacing HNGBL with non-grid bearing loss [42]

Table 5 shows that our method achieved the lowest mean absolute errors and REPE among all methods. The REPE reflects the errors of both extrinsic and intrinsic parameters. To calculate the REPE, we generated 32,400 uniform world coordinates on the unit sphere within an incident angle of less than 90° because of the lack of calibration points for the image-based calibration methods. López-Antequera's method [29] did not seem to work well because it expects non-fisheye input images. Our method substantially reduced the focal length errors and camera rotation errors (tilt and roll angles) by 86% and 66%, respectively, on average for the two datasets compared with Wakai's method [42]. Furthermore, our method reduced the REPE by 76% on average for the two datasets compared with Wakai's method [42]. Therefore, our method predicted accurate extrinsic and intrinsic camera parameters.
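The REPE can be illustrated with a short routine that projects unit-sphere bearings with both the ground-truth and predicted parameters and averages the pixel distances. The sketch below makes several simplifying assumptions (rotation composed from tilt and roll only, a 24 mm sensor height with 224-pixel images for the pixel pitch, and a simple hemisphere sampling); it is not the evaluation code used for Table 5.

```python
import numpy as np

def project(points, tilt, roll, f, k1, d=24.0 / 224.0, c=(112.0, 112.0)):
    """Project unit bearings to pixels with the generic model (Equations (1) and (3)).

    points: (N, 3) unit vectors. d is the pixel pitch [mm/pixel] assumed from a
    24 mm sensor height and 224-pixel images; c is the principal point [pixel].
    """
    cx, sx = np.cos(tilt), np.sin(tilt)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])   # tilt about the x-axis
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])   # roll about the optical axis
    pc = points @ (Rz @ Rx).T                                # camera coordinates
    eta = np.arccos(np.clip(pc[:, 2], -1.0, 1.0))            # incident angle
    phi = np.arctan2(pc[:, 1], pc[:, 0])                     # azimuth
    gamma = f * (eta + k1 * eta ** 3) / d                    # image height [pixel]
    return np.stack([c[0] + gamma * np.cos(phi), c[1] + gamma * np.sin(phi)], axis=1)

def repe(points, gt, pred):
    """Mean reprojection error [pixel] between two parameter sets (tilt, roll, f, k1)."""
    return float(np.mean(np.linalg.norm(project(points, *gt) - project(points, *pred), axis=1)))

# 32,400 bearings within a 90-degree incident angle (illustrative sampling).
rng = np.random.default_rng(0)
v = rng.normal(size=(32400, 3))
v /= np.linalg.norm(v, axis=1, keepdims=True)
v[:, 2] = np.abs(v[:, 2])                                    # hemisphere in front of the camera
print(repe(v, gt=(0.10, 0.00, 9.0, -0.10), pred=(0.12, 0.01, 9.2, -0.09)))
```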
We also evaluated our method with our loss function replaced by the non-grid bearing loss [42], referred to as "Ours w/o HNGBL", to analyze the contribution of our loss function, as shown in Table 5. The results demonstrate that our loss function effectively reduced the rotation errors in the tilt and roll angles by 3.05° on average for the two datasets compared with the "Ours w/o HNGBL" case. In addition to the rotation errors, the REPE for our method with HNGBL was 5.22 pixels smaller on average for the two datasets than that for "Ours w/o HNGBL". These results suggest that our loss function enabled the networks to accurately predict not only the focal length but also the other camera parameters.

4.4.2 Comparison using PSNR and SSIM

To demonstrate validity and effectiveness on images, we used the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [43] for the intrinsic parameters. When performing undistortion, the extrinsic camera parameters are arbitrary because we consider only the intrinsic camera parameters, image coordinates, and incident angles. Table 6 shows the undistortion performance on the test set for our generic camera model. Our method notably improved the image quality of undistortion by 7.28 in PSNR and 0.206 in SSIM on average for the two datasets compared with Wakai's method [42].

Table 6. Comparison of mean PSNR and SSIM on the test set for our generic camera model

                        StreetLearn          SP360
Method                  PSNR ↑   SSIM ↑      PSNR ↑   SSIM ↑
Santana-Cedrés [37]     14.65    0.341       14.26    0.390
Liao [24]               13.71    0.362       13.85    0.404
Yin [47]                13.91    0.349       14.03    0.390
Chao [7]                16.13    0.409       15.88    0.449
López-Antequera [29]    17.88    0.499       16.24    0.486
Wakai [42]              21.57    0.622       20.98    0.639
Ours w/o HNGBL¹         27.41    0.801       26.49    0.801
Ours                    29.01    0.838       28.10    0.835

¹ "Ours w/o HNGBL" refers to replacing HNGBL with non-grid bearing loss [42]

To validate the dependency on the four types of fisheye camera models, we also evaluated the performance on the trigonometric function models in Table 7. Although orthogonal projection decreased the PSNR, our method addressed all the trigonometric function models; hence, our method had the highest PSNR in all cases. This suggests that our generic camera model precisely behaved like a trigonometric function model. Therefore, our method has the potential to calibrate images from various fisheye cameras.
Table 7. Comparison of mean PSNR on the test set for the trigonometric function models

StreetLearn             Stereographic  Equidistance  Equisolid angle  Orthogonal  All
Santana-Cedrés [37]     14.68          13.20         12.49            10.29       12.66
Liao [24]               13.63          13.53         13.52            13.74       13.60
Yin [47]                13.81          13.62         13.59            13.77       13.70
Chao [7]                15.86          15.12         14.87            14.52       15.09
López-Antequera [29]    17.84          16.84         16.43            15.15       16.57
Wakai [42]              22.39          23.62         22.91            17.79       21.68
Ours w/o HNGBL¹         26.49          29.08         28.56            23.97       27.02
Ours                    26.84          30.10         29.69            23.70       27.58

SP360                   Stereographic  Equidistance  Equisolid angle  Orthogonal  All
Santana-Cedrés [37]     14.25          12.57         11.77             9.34       11.98
Liao [24]               13.76          13.66         13.67            13.92       13.75
Yin [47]                13.92          13.74         13.72            13.94       13.83
Chao [7]                15.60          15.02         14.83            14.69       15.03
López-Antequera [29]    15.72          14.94         14.68            14.52       14.97
Wakai [42]              22.29          22.65         21.79            17.54       21.07
Ours w/o HNGBL¹         25.35          28.53         28.26            23.85       26.50
Ours                    25.74          29.28         28.95            23.93       26.98

¹ "Ours w/o HNGBL" refers to replacing HNGBL with non-grid bearing loss [42]

4.4.3 Qualitative evaluation

We evaluated the performance of undistortion and full recovery not only for synthetic images but also for off-the-shelf cameras to describe the image quality after calibration.

Synthetic images: Figure 3 shows the qualitative results on the test set for our generic camera model. Our results are the most similar to the ground-truth images in terms of undistortion and of fully recovering rotation and fisheye distortion. Our method worked well for various types of distortion and scaling. By contrast, it was difficult to calibrate circumferential fisheye images with large distortion using Santana-Cedrés's method [37], Liao's method [24], Yin's method [47], and Chao's method [7]. Furthermore, López-Antequera's [29] and Wakai's [42] methods did not remove distortion, although the scale was close to the ground truth. When fully recovering rotation and distortion, López-Antequera's [29] and Wakai's [42] methods tended to predict camera rotation with large errors in the tilt and roll angles. As shown in Figure 3, our synthetic images consisted of zoom-in images of parts of buildings and zoom-out images of skyscrapers. Our method processed both types of images, that is, it demonstrated scale robustness.

[Figure 3. Qualitative results on the test images for our generic camera model. (a) Undistortion results, showing the input image, the results of the compared methods (Santana-Cedrés [37], Liao [24], Yin [47], Chao [7], López-Antequera [29], and Wakai [42]), our method, and the ground-truth image from left to right. (b) Fully recovered rotation and distortion, showing the input image, the results of the compared methods (López-Antequera [29] and Wakai [42]), our method, and the ground-truth image from left to right.]
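For completeness, a fully recovered image such as those in Figure 3 can be produced by inverse remapping: for each pixel of the output view, cast a ray, undo the predicted rotation, convert the incident angle with the generic camera model, and sample the fisheye image there. The sketch below is a minimal OpenCV-based illustration under assumed conventions (pixel pitch from a 24 mm sensor height, tilt-then-roll rotation order, and an assumed output focal length); it is not the remapping code used in this work.

```python
import cv2
import numpy as np

def fully_recover(img, tilt, roll, f, k1, f_out=112.0):
    """Remap a fisheye image to an upright perspective view using predicted parameters."""
    h, w = img.shape[:2]
    d = 24.0 / h                                         # pixel pitch [mm], 24 mm sensor height
    cu, cv_ = w / 2.0, h / 2.0
    # Rays of the output perspective camera (focal length f_out in pixels).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.stack([(u - cu) / f_out, (v - cv_) / f_out, np.ones_like(u, float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Apply the predicted rotation so the output view is gravity-aligned.
    cx, sx = np.cos(tilt), np.sin(tilt)
    cz, sz = np.cos(roll), np.sin(roll)
    R = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]]) @ \
        np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rays = rays @ R.T
    # Generic camera model: incident angle -> image height in the fisheye image.
    eta = np.arccos(np.clip(rays[..., 2], -1.0, 1.0))
    phi = np.arctan2(rays[..., 1], rays[..., 0])
    gamma = f * (eta + k1 * eta ** 3) / d                # [pixel]
    map_x = (cu + gamma * np.cos(phi)).astype(np.float32)
    map_y = (cv_ + gamma * np.sin(phi)).astype(np.float32)
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
```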
Off-the-shelf cameras: We also validated the calibration methods using off-the-shelf fisheye cameras to analyze the performance on actual complex fisheye distortion. Note that the studies on the conventional learning-based methods in Table 4 reported evaluation results using only synthetic fisheye images. Figure 4 shows the qualitative results of fully recovering rotation and fisheye distortion for the methods that predict intrinsic and extrinsic camera parameters. These methods were trained using the StreetLearn [30] or SP360 [6] datasets. The results of López-Antequera's method had distortion and/or rotation errors. Our method outperformed Wakai's method [42], which often recovered only distortion, for all our cameras. Our fully recovered images demonstrate the effectiveness of our method for off-the-shelf fisheye cameras with various types of projection.

[Figure 4. Qualitative results of fully recovering rotation and fisheye distortion for the off-the-shelf cameras, showing the input image, the results of the compared methods (López-Antequera [29] and Wakai [42]), and our method from left to right for each image. The IDs correspond to the IDs in Table 3, and the projection names are attached to the IDs from the specifications (IDs 3–5) and our estimation (IDs 1, 2, and 6): IDs 1–3 equisolid angle, ID 4 orthogonal, ID 5 equidistance, and ID 6 stereographic. (a) and (b) show the results of the methods trained using the StreetLearn [30] and SP360 [6] datasets, respectively.]

In all the calibration methods, images captured by off-the-shelf cameras seemingly degraded the overall performance in the qualitative results compared with synthetic images. This degradation probably occurred because of the complex distortion of off-the-shelf fisheye cameras and the dataset domain mismatch between the two panorama datasets and our captured images. Overall, our method outperformed the conventional methods in the qualitative evaluation of off-the-shelf cameras. As described above, our method precisely recovered both rotation and fisheye distortion using our generic camera model.

5. Conclusion

We proposed a learning-based calibration method using a new generic camera model to address various types of camera projection. Additionally, we introduced a novel loss function that has optimal joint weights determined without training. These weights can alleviate the bias of the magnitude of each loss for the four camera parameters. As a result, we enabled networks to precisely predict both extrinsic and intrinsic camera parameters. Extensive experiments demonstrated that our proposed method substantially outperformed conventional geometric-based and learning-based methods on two large-scale datasets. Moreover, we demonstrated that our method fully recovered rotation and distortion using various off-the-shelf fisheye cameras. To improve the calibration performance for off-the-shelf cameras, in future work, we will study the dataset domain mismatch.
References

[1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD), 2019.
[2] M. Alemán-Flores, L. Alvarez, L. Gomez, and D. Santana-Cedrés. Automatic lens distortion correction using one-parameter division models. Image Processing On Line (IPOL), 4:327–343, 2014.
[3] H. A. Alhaija, S. K. Mustikovela, L. Mescheder, and A. Geiger. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision (IJCV), 126(9):961–972, 2018.
[4] L. Alvarez, L. Gómez, and J. R. Sendra. An algebraic approach to lens distortion by line rectification. Journal of Mathematical Imaging and Vision, 35:36–50, 2009.
[5] Z. Cao, T. Simon, S. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1302–1310, 2017.
[6] S. Chang, C. Chiu, C. Chang, K. Chen, C. Yao, R. Lee, and H. Chu. Generating 360 outdoor panorama dataset with reliable sun position estimation. In Proceedings of SIGGRAPH Asia, 2018.
[7] C. Chao, P. Hsu, H. Lee, and Y. Wang. Self-supervised deep learning for fisheye image rectification. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 2248–2252, 2020.
[8] Y. Chen, C. Schmid, and C. Sminchisescu. Self-supervised learning with geometric constraints in monocular video: Connecting flow, depth, and camera. In Proceedings of International Conference on Computer Vision (ICCV), pages 8976–8985, 2019.
[9] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2018.
[10] B. Davidson, M. S. Alvi, and J. F. Henriques. 360° camera alignment via segmentation. In Proceedings of European Conference on Computer Vision (ECCV), pages 579–595, 2020.
[11] Z. Fu, Q. Liu, Z. Fu, and Y. Wang. Template-free visual tracking with space-time memory networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[12] D. B. Gennery. Generalized camera calibration including fish-eye lenses. International Journal of Computer Vision (IJCV), 68:239–266, 2006.
[13] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2014.
[14] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. In Proceedings of International Conference on Learning Representations (ICLR), 2015.
[15] A. Gordon, H. Li, R. Jonschkowski, and A. Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In Proceedings of International Conference on Computer Vision (ICCV), pages 8976–8985, 2019.
[16] R. Groenendijk, S. Karaoglu, T. Gevers, and T. Mensink. Multi-loss weighting with coefficient of variations. In Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1469–1478, 2020.
[17] Y. Hold-Geoffroy, K. Sunkavalli, J. Eisenmann, M. Fisher, E. Gambaretto, S. Hadap, and J. Lalonde. A perceptual measure for deep single image camera calibration. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2354–2363, 2018.
[18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
[19] Z. Huang, Y. Xu, J. Shi, X. Zhou, H. Bao, and G. Zhang. Prior guided dropout for robust visual localization in dynamic environments. In Proceedings of International Conference on Computer Vision (ICCV), pages 2791–2800, 2019.
[20] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of International Conference on Machine Learning (ICML), volume 37, pages 448–456, 2015.
[21] J. Lee, H. Go, H. Lee, S. Cho, M. Sung, and J. Kim. CTRL-C: Camera calibration TRansformer with line-classification. In Proceedings of International Conference on Computer Vision (ICCV), 2021.
[22] J. Lee, M. Sung, H. Lee, and J. Kim. Neural geometric parser for single image camera calibration. In Proceedings of European Conference on Computer Vision (ECCV), 2020.
[23] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein. Visualizing the loss landscape of neural nets. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2018.
[24] K. Liao, C. Lin, and Y. Zhao. A deep ordinal distortion estimation approach for distortion rectification. IEEE Transactions on Image Processing (TIP), 30:3362–3375, 2021.
[25] M. Lin, Q. Chen, and S. Yan. Network in network. In Proceedings of International Conference on Learning Representations (ICLR), 2014.
[26] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 82–92, 2019.
[27] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the adaptive learning rate and beyond. In Proceedings of International Conference on Learning Representations (ICLR), 2020.
[28] A. Locher, M. Perdoch, and L. V. Gool. Progressive prioritized multi-view stereo. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3244–3252, 2016.
[29] M. López-Antequera, R. Marí, P. Gargallo, Y. Kuang, J. Gonzalez-Jimenez, and G. Haro. Deep single image camera calibration with radial distortion. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11809–11817, 2019.
[30] P. Mirowski, A. Banki-Horvath, K. Anderson, D. Teplyashin, K. M. Hermann, M. Malinowski, M. K. Grimes, K. Simonyan, K. Kavukcuoglu, A. Zisserman, and R. Hadsell. The StreetLearn environment and dataset. arXiv preprint arXiv:1903.01292, 2019.
[31] D. Misra. Mish: A self regularized non-monotonic neural activation function. In Proceedings of British Machine Vision Conference (BMVC), 2020.
[32] Y. Nie, X. Han, S. Guo, Y. Zheng, J. Chang, and J. J. Zhang. Total3DUnderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 52–61, 2020.
[33] G. V. Puskorius and L. A. Feldkamp. Camera calibration methodology based on a linear perspective transformation error model. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pages 1858–1860 vol. 3, 1988.
[34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You Only Look Once: Unified, real-time object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
[35] L. Ren, Y. Song, J. Lu, and J. Zhou. Spatial geometric reasoning for room layout estimation via deep reinforcement learning. In Proceedings of European Conference on Computer Vision (ECCV), pages 550–565, 2020.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[37] D. Santana-Cedrés, L. Gomez, M. Alemán-Flores, A. Salgado, J. Esclarín, L. Mazorra, and L. Alvarez. An iterative optimization algorithm for lens distortion correction using two-parameter models. Image Processing On Line (IPOL), 6:326–364, 2016.
[38] M. R. U. Saputra, P. Gusmao, Y. Almalioglu, A. Markham, and N. Trigoni. Distilling knowledge from a deep pose regressor network. In Proceedings of International Conference on Computer Vision (ICCV), pages 263–272, 2019.
[39] L. Sha, J. Hobbs, P. Felsen, X. Wei, P. Lucey, and S. Ganguly. End-to-end camera calibration for broadcast videos. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 13624–13633, 2020.
[40] S. Shah and J. K. Aggarwal. A simple calibration procedure for fish-eye (high distortion) lens camera. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pages 3422–3427 vol. 4, 1994.
[41] R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal of Robotics and Automation, 3(4):323–344, 1987.
[42] N. Wakai and T. Yamashita. Deep single fisheye image camera calibration for over 180-degree projection of field of view. In Proceedings of International Conference on Computer Vision Workshops (ICCVW), pages 1174–1183, 2021.
[43] Z. Wang and A. C. Bovik. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing (TIP), 13(4):600–612, 2004.
[44] W. Xian, Z. Li, N. Snavely, M. Fisher, J. Eisenman, and E. Shechtman. UprightNet: Geometry-aware camera orientation estimation from single images. In Proceedings of International Conference on Computer Vision (ICCV), pages 9973–9982, 2019.
[45] F. Xue, X. Wang, Z. Yan, Q. Wang, J. Wang, and H. Zha. Local supports global: Deep camera relocalization with sequence enhancement. In Proceedings of International Conference on Computer Vision (ICCV), pages 2841–2850, 2019.
[46] Z. Xue, N. Xue, G. Xia, and W. Shen. Learning to calibrate straight lines for fisheye image rectification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1643–1651, 2019.
[47] X. Yin, X. Wang, J. Yu, M. Zhang, P. Fua, and D. Tao. FishEyeRecNet: A multi-context collaborative deep network for fisheye image rectification. In Proceedings of European Conference on Computer Vision (ECCV), pages 475–490, 2018.
[48] Z. Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(11):1330–1334, 2000.
[49] H. Żoładek. The topological proof of Abel–Ruffini theorem. Topological Methods in Nonlinear Analysis, 16:253–265, 2000.
