Dataset creation for Deep Learning-based Geometric Computer Vision problems

DatasetcreationforDeepLearning-basedGeometric ComputerVisionproblems

Purpose of this presentation ● This is the more ‘pragmatic set’ accompanying the slideset for analyzing SfM-Net architecture from Google. ● The main idea in the dataset creation is to have multiple sensor quality levels in the same rig, in order for us to obtain good quality reference data (ground truth, gold standard) with terrestrial laser scanner that can be used for image restoration deep learning networks - In order to get more out from the inference of lower quality sensors as well. Think of Google Tango, iPhone 8 with depth sensing, Kinect, etc. ● Presentation tries to address the typical problem of finding the relevant “seed literature” for a new topic helping fresh grad students, postdocs, software engineers and startup founders. - Answer to “Do you know if someone has done some work on the various steps involved in SfM” to identify what wheels do not need to be re-invented ●● Some of the RGB image enhancement/styling slides are not the most relevant when designing the hardware pipe per se, but are there highlighting the need for systems engineering approach for the design of the whole pipe rather than just obtaining the data somewhere and expecting the deep learning software to do all the magic for you. Deep Learning for Structure-from-Motion (SfM) https://www.slideshare.net/PetteriTeikariPhD/deconstructing-sfmnet

Future Hardware and dataset creation

Pipeline Dataset creation #1A The Indoor Multi-sensor Acquisition System (IMAS) presented in this paper consists of a wheeled platform equipped with two 2D laser heads, RGB cameras, thermographic camera, thermohygrometer, and luxmeter. http://dx.doi.org/10.3390/s16060785 Inspired by the system of Armesto et al., one could have a custom rig with: ● high-quality laser scanner giving the “gold standard” for depth, ● accompanied with smart phone quality RGB and depth sensing, ● accompanied by DSLR gold standard for RGB ● and some mid-level structured light scanner? Rig config would allow multiframe exposure techniques to be used easily than a handheld system (see next slide) We saw previous that the brightness constancy assumption might be tricky with some materials, and polarization measurement for example can help distinguishing materials (dielectric materials polarize light, whereas conductive do not), or if there is some other way of estimating Bidirectional Reflectance Distribution Function (BRDF) Multicamera rig calibration by double-sided thick checkerboard Marco Marcon; Augusto Sarti; Stefano Tubaro IET Computer Vision 2017 http://dx.doi.org/10.1049/iet-cvi.2016.0193

Pipeline Dataset creation #2a : Multiframe Techniques Note! In deep learning, the term super-resolution refers to “statistical upsampling” whereas in optical imaging super-resolution typically refers to imaging techniques. Note2! Nothing should stop someone marrying them two though In practice anyone can play with super-resolution at home by putting a camera on a tripod and then taking multiple shots of the same static scene, and post-processing them through super-resolution that can improve modulation transfer function (MTF) for RGB images, improve depth resolution and reduce noise for laser scans and depth sensing e.g. with Kinect. https://doi.org/10.2312/SPBG/SPBG06/009-015 Cited by 47 articles (a) One scan. (b) Final super-resolved surface from 100 scans. “PhotoAcute software processes sets of photographs taken in continuous mode. It utilizes superresolution algorithms to convert a sequence of images into a single high-resolution and low-noise picture, that could only be taken with much better camera.” Depth looks a lot nicer when reconstructed using 50 consecutive Kinec v1 frames in comparison to just one frame. [Data from Petteri Teikari[ Kinect multiframe reconstruction with SiftFu [Xiao et al. (2013)] https://github.com/jianxiongxiao/ProfXkit

Pipeline Dataset creation #2b : Multiframe Techniques Boring to take manually e.g. 100 shots of the same scene involving even 360 rotation of the imaging devices, in practice this would need to be automated in some way with a step motor driven by Arduino or if no good commercial systems are not available. Multiframe techniques would allow another level of “nesting” of ground truths for a joint image enhancement block along with the proposed structure and motion network. ● The reconstructed laser scan / depth image / RGB from 100 images would the target, and the single-frame version the input that need to be enhanced Meinhardt et al. (2017) Diamond et al. (2017)

Pipeline Dataset creation #3 A Pipeline for Generating Ground Truth Labels for Real RGBD Data of Cluttered Scenes Pat Marion, Peter R. Florence, Lucas Manuelli, Russ Tedrake Submitted on 15 Jul 2017, last revised 25 Jul 2017 https://arxiv.org/abs/1707.04796 In this paper we develop a pipeline to rapidly generate high quality RGBD data with pixelwise labels and object poses. We use an RGBD camera to collect video of a scene from multiple viewpoints and leverage existing reconstruction techniques to produce a 3D dense reconstruction. We label the 3D reconstruction using a human assisted ICP- fitting of object meshes. By reprojecting the results of labeling the 3D scene we can produce labels for each RGBD image of the scene. This pipeline enabled us to collect over 1,000,000 labeled object instances in just a few days. We use this dataset to answer questions related to how much training data is required, and of what quality the data must be, to achieve high performance from a DNN architecture. Overview of the data generation pipeline. (a) Xtion RGBD sensor mounted on Kuka IIWA arm for raw data collection. (b) RGBD data processed by ElasticFusion into reconstructed pointcloud. (c) User annotation tool that allows for easy alignment using 3 clicks. User clicks are shown as red and blue spheres. The transform mapping the red spheres to the green spheres is then the user specified guess. (d) Cropped pointcloud coming from user specified pose estimate is shown in green. The mesh model shown in grey is then finely aligned using ICP on the cropped pointcloud and starting from the user provided guess. (e) All the aligned meshes shown in reconstructed pointcloud. (f) The aligned meshes are rendered as masks in the RGB image, producing pixelwise labeled RGBD images for each view. Increasing the variety of backgrounds in the training data for single-object scenes also improved generalization performance for new backgrounds, with approximately 50 different backgrounds breaking into above- 50% IoU on entirely novel scenes. Our recommendation is to focus on multi-object data collection in a variety of backgrounds for the most gains in generalization performance. We hope that our pipeline lowers the barrier to entry for using deep learning approaches for perception in support of robotic manipulation tasks by reducing the amount of human time needed to generate vast quantities of labeled data for your specific environment and set of objects. It is also our hope that our analysis of segmentation network performance provides guidance on the type and quantity of data that needs to be collected to achieve desired levels of generalization performance.

Pipeline Dataset creation #4 A Novel Benchmark RGBD Dataset for Dormant Apple Trees and Its Application to Automatic Pruning Shayan A. Akbar, Somrita Chattopadhyay, Noha M. Elfiky, Avinash Kak; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016 https://doi.org/10.1109/CVPRW.2016.50 Extending of the Kinect device functionality and the corresponding database Libor Bolecek ; Pavel Němec ; Jan Kufa ; Vaclav Ricny Radioelektronika (RADIOELEKTRONIKA), 2017 https://doi.org/10.1109/RADIOELEK.2017.7937594 One of the possible research directions is use of infrared version of the investigated scene for improvement of the depth map. However, the databases of the Kinect data which would contain the corresponding infrared images do not exist. Therefore, our aim was to create such database. We want to increase the usability of the database by adding stereo images. Moreover, the same scenes were captured by Kinect v2. It was also investigated the impact of simultaneous use Kinect v1 and Kinect v2 to improve depth map investigated the scene. The database contains sequences of objects on turntable and simple scenes containing several objects. The depth map of the scene obtained by a) Kinect v1, b) Kinect v2. The comparison of the one row of the depth map obtained by a) Kinect v1 b) Kinect v2 with true depth map. Kinect inrared image after change of the dynamics of brightness

Pipeline Multiframe Pipe #1 100 Multiframe reconstruction enhancement block 1 2 100 3 ... ... Depth image (e.g. Kinect) Laser scan (e.g. Velodyne) RGB Image Target Learn to improve image quality from single image when the system is deployed Reconstruction could be done using traditional algorithms (e.g. OpenCV) to start with need then to save all individual frames when reconstruction algorithms improve, and all blocks can be iterated then ad infinitum Mix different image qualities and sensor qualities then in the training set to build invariance to scan quality

Pipeline Multiframe Pipe #2 You could cascade different levels of quality If you want to make things complex in deeply supervised fashion LOWEST QUALITY Just with RGB HIGHEST QUALITY Depth map with professional laser scanner 2 1 3 4 5 6 The following step in the cascade is closer in quality to the previous one, and one could assume that this enhancement would be easier to learn, and the pipeline would output the enhanced quality as a “side effect” which is good for visualization purposes.

Pipeline acquisition example with Kinect https://arxiv.org/abs/1704.07632 KinectFusion (Newcombe et al. 2011), one of the pioneering works, showed that a real-world object as well as an indoor scene canbe reconstructed in real-time with GPU acceleration. It exploits the iterative closest point (ICP) algorithm (Besl and McKay 1992) to track 6-DoF poses and the volumetric surface representation scheme with signed distance functions (Curless and Levoy, 1996) to fuse 3D measurements. A number of following studies (e.g. Choi et al. 2015) have tackled the limitation of KinectFusion; as the scale ofa scene increases, it is hard to completely reconstruct thescene due to the drift problem of the ICP algorithm as wellas the large memory consumption of volumetric integration. To scale up the KinectFusion algorithm, Whelan et al . (2012)] presented a spatially extended KinectFusion, named as Kintinuous, by incrementally adding KinectFusion results as the form of triangular meshes. Whelan et al . (2015) also proposed ElasticFusion to tackle similar problems as well as to overcome the problem of a pose graph optimization by using the surface loop closure optimization and the surfel-based representation. Moreover, to decrease the space complexity, ElasticFusion deallocates invisible surfels from the memory; invisible surfels are allocated in the memory again only if they are likely to be visible in the near future.

Pipeline Multiframe Pipe into Sfm-Net

Pipeline Multiframe Pipe Quality simulation Simulated Imagery Rendering Workflow for UAS- Based Photogrammetric 3D Reconstruction Accuracy Assessments Richard K. Slocum and Christopher E. Parrish Remote Sensing 2017, 9(4), 396; doi :10.3390/rs9040396 “Here, we present a workflow to render computer generated imagery using a virtual environment which can mimic the independent variables that would be experienced in a real-world UAS imagery acquisition scenario. The resultant modular workflow utilizes Blender Python API, an open source computer graphics software, for the generation of photogrammetrically-accurate imagery suitable for SfM processing, with explicit control of camera interior orientation, exterior orientation, texture of objects in the scene, placement of objects in the scene, and ground control point (GCP) accuracy.” Pictorial representation of the simUAS (simulated UAS) imagery rendering workflow. Note: The SfM-MVS step is shown as a “black box” to highlight the fact that the procedure can be implemented using any SfM-MVS software, including proprietary commercial software. The imagery from Blender, rendered using a pinhole camera model, is postprocessed to introduce lens and camera effects. The magnitudes of the postprocessing effects are set high in this example to clearly demonstrate the effect of each. The fullsize image (left) and a close up image (right) are both shown in order to depict both the large and small scale effects. A 50 cm wide section of the point cloud containing a box (3 m cube) is shown with the dense reconstruction point clouds overlaid to demonstrate the effect of point cloud dense reconstruction quality on accuracy near sharp edges. The points along the side of a vertical plane on a box were isolated and the error perpendicular to the plane of the box were visualized for each dense reconstruction setting, with white regions indicating no point cloud data. Notice that the region with data gaps in the point cloud from the ultra-high setting corresponds to the region of the plane with low image texture, as shown in the lower right plot.

Data fusion combining multimodal data

Pipeline data Fusion / Registration #1 “Rough estimates for 3D structure obtained using structure from motion (SfM) on the uncalibrated images are first co- registered with the lidar scan and then a precise alignment between the datasets is estimated by identifying correspondences between the captured images and reprojected images for individual cameras from the 3D lidar point clouds. The precise alignment is used to update both the camera geometry parameters for the images and the individual camera radial distortion estimates, thereby providing a 3D-to- 2D transformation that accurately maps the 3D lidar scan onto the 2D image planes. The 3D to 2D map is then utilized to estimate a dense depth map for each image. Experimental results on two datasets that include independently acquired high-resolution color images and 3D point cloud datasets indicate the utility of the framework. The proposed approach offers significant improvements on results obtained with SfM alone.” Fusing structure from motion and lidar for dense accurate depth map estimation Li Ding ; Gaurav Sharma Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on https://doi.org/10.1109/ICASSP.2017.7952363 https://arxiv.org/abs/1707.03167 “In this paper, we present RegNet, the first deep convolutional neural network (CNN) to infer a 6 degrees of freedom (DOF) extrinsic calibration between multimodal sensors, exemplified using a scanning LiDAR and a monocular camera. Compared to existing approaches, RegNet casts all three conventional calibration steps (feature extraction, feature matching and global regression) into a single real-time capable CNN.” Development of the mean absolute error (MAE) of the rotational components over training iteration for different output representations: Euler angles are represented in red, quaternions in brown and dual quaternions in blue. Both quaternion representations outperform the Euler angles representation. “Our method yields a mean calibration error of 6 cm for translation and 0.28◦ for rotation with decalibration magnitudes of up to 1.5 m and 20◦ , which competes with state-of-the-art online and offline methods.”

Pipeline data Fusion / Registration #2 Depth refinement for binocular Kinect RGB-D cameras Jinghui Bai ; Jingyu Yang ; Xinchen Ye ; Chunping Hou Visual Communications and Image Processing (VCIP), 2016 https://doi.org/10.1109/VCIP.2016.7805545

Pipeline data Fusion / Registration #3 Used Kinects inexpensive ~£29.95 (eBay) Use multiple Kinects at once for better occlusion handling Tanwi Mallick ; Partha Pratim Das ; Arun Kumar Majumdar IEEE Sensors Journal ( Volume: 14, Issue: 6, June 2014 ) https://doi.org/10.1109/JSEN.2014.2309987 Characterization of Different Microsoft Kinect Sensor Models IEEE Sensors Journal (Volume: 15, Issue: 8, Aug. 2015) https://doi.org/10.1109/JSEN.2015.2422611 An ANOVA analysis was performed to determine if the model of the Kinect, the operating temperature, or their interaction were significant factors in the Kinect's ability to determine the distance to the target. Different sized gauge blocks were also used to test how well a Kinect could reconstruct precise objects. Machinist blocks were used to examine how well the Kinect could reconstruct objects setup on an angle and determine the location of the center of a hole. All the Kinect models were able to determine the location of a target with a low standard deviation (<;2 mm). At close distances, the resolutions of all the Kinect models were 1 mm. Through the ANOVA analysis, the best performing Kinect at close distances was the Kinect model 1414, and at farther distances was the Kinect model 1473. The internal temperature of the Kinect sensor had an effect on the distance reported by the sensor. Using different correction factors, the Kinect was able to determine the volume of a gauge block and the angles machinist blocks were setup at, with under a 10% error.

Pipeline data Fusion / Registration #4 A Generic Approach for Error Estimation of Depth Data from (Stereo and RGB-D) 3D Sensors Luis Fernandez, Viviana Avila and Luiz Gonçalves Preprints | Posted: 23 May 2017 | http://dx.doi.org/10.20944/preprints201705.0170.v1 “We propose an approach for estimating the error in depth data provided by generic 3D sensors, which are modern devices capable of generating an image (RGB data) and a depth map (distance) or other similar 2.5D structure (e.g. stereo disparity) of the scene. We come up with a multi-platform system and its verification and evaluation has been done, using the development kit of the board NVIDIA Jetson TK1 with the MS Kinects v1/v2 and the Stereolabs ZED camera. So the main contribution is the error determination procedure that does not need any data set or benchmark, thus relying only on data acquired on- the-fly. With a simple checkerboard, our approach is able to determine the error for any device” In the article of Yang [16], an MS Kinect v2 structure is proposed to improve the accuracy of the sensors and the depth of capture of objects that are placed more than four meters apart. It has been concluded that an object covered with light-absorbing materials, may cause less reflected IR light back to the MS Kinect and therefore erroneous depth data. Other factors, such as power consumption, complex wiring and high requirement for laptop computer also limits the use of the sensor. The characteristics of MS Kinect stochastic errors are presented for each direction of the axis in the work by Choo [17]. The depth error is measured using a 3D chessboard, similarly the one used in our approach. The results show that, for all three axes, the error should be considered independently. In the work of Song [18] it is proposed an approach to generate a per-pixel confidence measure for each depth map captured by MS Kinect in indoor scenes through supervised learning and the use of artificial intelligence Detection (a) and ordering (b) of corners in the three planes of the pattern. It would make sense to combine versions 1 and 2 for the same rig as Kinect v1 is more accurate for close distances, and Kinect v2 more accurate for far distances

Pipeline data Fusion / Registration #5 Precise 3D/2D calibration between a RGB-D sensor and a C-arm fluoroscope International Journal of Computer Assisted Radiology and Surgery August 2016, Volume 11, Issue 8, pp 1385–1395 https://doi.org/10.1007/s11548-015-1347-2 “A RMS reprojection error of 0.5 mm is achieved using our calibration method which is promising for surgical applications. Our calibration method is more accurate when compared to Tsai’s method. Lastly, the simulation result shows that using a projection matrix has a lower error than using intrinsic and extrinsic parameters in the rotation estimation.” While the color camera has a relative high resolution (1920 px × 1080 px for Kinect 2.0), the depth camera is mid-resolution (512 px × 424 px for Kinect 2.0) and highly noisy. Furthermore, RGB-D sensors have a minimal distance to the scene from which they can estimate the depth. For instance, the minimum optimal distance of Kinect 2.0 is 50 cm. On the other hand, C-arm fluoroscopes have a short focus, which is typically 40 cm, and a much narrower field of view than the RGB-D sensor with also a mid-resolution image (ours is 640 px × 480 px). All these factors lead to a high disparity in the field of view between the C-arm and the RGB-D sensor if the two were to be integrated in a single system. This means that the calibration process is crucial. We need to achieve high accuracy for the localization of 3D points using RGB-D sensors, and we require a calibration phantom which can be clearly imaged by both devices. Workflow of the calibration process between the RGB-D sensor and a C-arm. The input data include a sequence of infrared, depth, and color images from the RGB-D sensor and X-ray images from the C-arm. The output of the calibration pipeline is the projection matrix, which is calculated by the 3D/2D correspondences detected from the input data

Pipeline data Fusion / Registration #6 Fusing Depth and Silhouette for Scanning Transparent Object with RGB-D Sensor Yijun Ji, Qing Xia, and Zhijiang Zhang System overview; TSDF: truncated signed distance function; SFS: shape from silhouette. Results on noise region. (a) Color images captured by stationary camera with a rotating platform. (b) The noisy voxels detected by multiple depth images are in red. (c) and (d) show the experimental results done by a moving Kinect; the background is changing in these two cases.

Pipeline data Fusion / Registration #7 Intensity Video Guided 4D Fusion for Improved Highly Dynamic 3D Reconstruction Jie Zhang, Christos Maniatis, Luis Horna, Robert B. Fisher (Submitted on 6 Aug 2017) https://arxiv.org/abs/1708.01946 Temporal tracking of intensity image points (of moving and deforming objects) allows registration of the corresponding 3D data points, whose 3D noise and fluctuations are then reduced by spatio-temporal multi-frame 4D fusion. The results demonstrate that the proposed algorithm is effective at reducing 3D noise and is robust against intensity noise. It outperforms existing algorithms with good scalability on both stationary and dynamic objects. The system framework (using 3 consecutive frames as an example) Static Plane (first row): (a) mean roughness; (b) std of roughness vs. number of frames fused. Falling ball (second row): (c) mean roughness; (d) std of roughness vs. number of frames fused Texture-related 3D noise on a static plane: (a) 3D frame; (b) 3D frame with textures. The 3D noise is closely related to the textures in the intensity image. Illustration of 3D noise reduction on the ball. Spatial-temporal divisive normalized bilateral filter (DNBF)

Pipeline data Fusion / Registration #8 Utilization of a Terrestrial Laser Scanner for the Calibration of Mobile Mapping Systems Seunghwan Hong, Ilsuk Park, Jisang Lee, Kwangyong Lim, Yoonjo Choi and Hong-Gyoo Sohn Sensors 2017, 17(3), 474; doi :10.3390/s17030474 Configuration of mobile mapping system: network video cameras (F: front, L: left, R: right), mobile laser scanner, and Global Navigation Satellite System (GNSS)/Inertial Navigation System (INS). To integrate the datasets captured by each sensor mounted on the Mobile Mapping System (MMS) into the unified single coordinate system, the calibration, which is the process to estimate the orientation (boresight) and position (lever-arm) parameters, is required with the reference datasets [Schwarz and El-Sheimy 2004, Habib et al. 2010, Chan et al. 2010]. When the boresight and lever-arm parameters defining the geometric relationship between each sensing data and GNSS/INS data are determined, georeferenced data can be generated. However, even after precise calibration, the boresight and lever- arm parameters of an MMS can be shaken and the errors that deteriorate the accuracy of the georeferenced data might accumulate. Accordingly, for the stable operation of multiple sensors, precise calibration must be conducted periodically. (a) Sphere target used for registration of terrestrial laser scanning data; (b) sphere target detected in a point cloud (the green sphere is a fitted sphere model). Network video camera: AXIS F1005-E GNSS/INS unit: OxTS Survey+ Terrestrial laser scanner (TLS): Faro Focus 3D Mobile laser scanner: Velodyne HDL 32-E

Pipeline data Fusion / Registration #9 Dense Semantic Labeling of Very-High-Resolution Aerial Imagery and LiDAR with Fully-Convolutional Neural Networks and Higher-Order CRFs Yansong Liu, Sankaranarayanan Piramanayagam, Sildomar T. Monteiro, Eli Saber http://openaccess.thecvf.com/content_cvpr_2017_workshops/w18/papers/Liu_Dense_Semantic_Labeling_CVPR_2017_paper.pdf Our proposed decision-level fusion scheme: training one fully-convolutional neural network on the color-infrared image (CIR) and one logistic regression using hand-crafted features. Two probabilistic results: PFCN and PLR are then combined in a higher-order CRF framework Main original contributions of our work are: 1) the use of energy based CRFs for efficient decision- level multisensor data fusion for the task of dense semantic labeling. 2) the use of higher-order CRFs for generating labeling outputs with accurate object boundaries. 3) the proposed fusion scheme has a simpler architecture than training two separate neural networks, yet it still yields the state-of-the- art dense semantic labeling results. Guiding multimodal registration with learned optimization updates Gutierrez-Becker B, Mateus D, Peter L, Navab N Medical Image Analysis Volume 41, October 2017, Pages 2-17 https://doi.org/10.1016/j.media.2017.05.002 Training stage (left): A set of aligned multimodal images is used to generate a training set of images with known transformations. From this training set we train an ensemble of trees mapping the joint appearance of the images to displacement vectors. Testing stage (right): We register a pair of multimodal images by predicting with our trained ensemble the required displacements δ for alignment at different locations z. The predicted displacements are then used to devise the updates of the transformation parameters to be applied to the moving image. The procedure is repeated until convergence is achieved. Corresponding CT (left) and MR-T1 (middle) images of the brain obtained from the RIRE dataset. The highlighted regions are corresponding areas between both images (right). Some multimodal similarity metrics rely on structural similarities between images obtained using different modalities, like the ones inside the blue boxes. However in many cases structures which are clearly visible in one imaging modality correspond to regions with homogeneous voxel values in the other modality (red and green boxes).

Future Image restoration Natural Images (RGB)

PipelineRGB image Restoration #1 https://arxiv.org/abs/1704.02738 Our method includes a sub- pixel motion compensation (SPMC) layer that can better handle inter-frame motion for this task. Our detail fusion (DF) network that can effectively fuse image details from multiple images after SPMC alignment “Hardware Super-resolution” of course all via deep learning too https://petapixel.com/2015/02/21/a-pract ical-guide-to-creating-superresolution-p hotos-with-photoshop/

PipelineRGB image Restoration #2A “Data-driven Super-resolution” what super-resolution typically means in the deep learning space Output of the “hardware super-resolution” can be used as a target for the “data-driven super-resolution” External Prior Guided Internal Prior Learning for Real Noisy Image Denoising Jun Xu, Lei Zhang, David Zhang (Submitted on 12 May 2017) https://arxiv.org/abs/1705.04505 Denoised images of a region cropped from the real noisy image from DSLR “Nikon D800 ISO 3200 A3”, Nam et al. 2016 (+video) by different methods. The scene was shot 500 times with the same camera and camera setting. The mean image of the 500 shots is roughly taken as the “ground truth”, with which the PSNR index can be computed. The images are better viewed by zooming in on screen Benchmarking Denoising Algorithms with Real Photographs Tobias Plötz, Stefan Roth (Submitted on 5 Jul 2017) https://arxiv.org/abs/1707.01313 “We then capture a novel benchmark dataset, the Darmstadt Noise Dataset (DND), with consumer cameras of differing sensor sizes. One interesting finding is that various recent techniques that perform well on synthetic noise are clearly outperformed by BM3D on photographs with real noise. Our benchmark delineates realistic evaluation scenarios that deviate strongly from those commonly used in the scientific literature.” Image formation process underlying the observed low-ISO image xr and high-ISO image xn . They are generated from latent noise-free images yr and yn , respectively, which in turn are related by a linear scaling of image intensities (LS), a small camera translation (T), and a residual low- frequency pattern (LF). To obtain the denoising ground truth yp , we apply post-processing to xr aiming at undoing these undesirable transformations. Mean PSNR (in dB) of the denoising methods tested on our DND benchmark. We apply denoising either on linear raw intensities, after a variance stabilizing transformation (VST, Anscombe), or after conversion to the sRGB space. Likewise, we evaluate the result either in linear raw space or in sRGB space. The noisy images have a PSNR of 39.39 dB (linear raw) and 29.98 dB (sRGB). Difference between blue channels of low- and high-ISO images from Fig. 1 after various post- processing stages. Images are smoothed for display to highlight structured residuals, attenuating the noise.

PipelineRGB image Restoration #2b “Data-driven Super-resolution” what super-resolution typically means in the deep learning space MemNet: A Persistent Memory Network for Image Restoration Ying Tai, Jian Yang, Xiaoming Liu, Chunyan Xu (Submitted on 7 Aug 2017) https://arxiv.org/abs/1708.02209 https://github.com/tyshiwo/MemNet. Output of the “hardware super-resolution” can be used as a target for the “data-driven super-resolution” The same MemNet structure achieves the state-of-the-art performance in image denoising, super-resolution and JPEG deblocking. Due to the strong learning ability, our MemNet can be trained to handle different levels of corruption even using a single model. Training Setting: Following the method of Mao et al. (2016), for image denoising, the grayscale image is used; while for SISR and JPEG deblocking, the luminance component is fed into the model. Deep Generative Adversarial Compression Artifact Removal Leonardo Galteri, Lorenzo Seidenari, Marco Bertini, Alberto Del Bimbo (Submitted on 8 Apr 2017) https://arxiv.org/abs/1704.02518 In this work we address the problem of artifact removal using convolutional neural networks. The proposed approach can be used as a post-processing technique applied to decompressed images, and thus can be applied to different compression algorithms (typically applied in YCrCb color space) such as JPEG, intra-frame coding of H.264/AVC and H.265/HEVC. Compared to super resolution techniques, working on compressed images instead of down-sampled ones, is more practical, since it does not require to change the compression pipeline, that is typically hardware based, to subsample the image before its coding; moreover, camera resolutions have increased during the latest years, a trend that we can expect to continue.

PipelineRGB image Restoration #3 An attempt to improve smartphone camera quality with DSLR high quality image as the ‘gold standard’ with deep learning https://arxiv.org/abs/1704.02470 Andrey Ignatov, Nikolay Kobyshev, Kenneth Vanhoey, Radu Timofte, Luc Van Gool Computer Vision Laboratory, ETH Zurich, Switzerland “Quality transfer”

Pipelineimage enhancement #1 Aesthetics enhancement: “AI-driven Interior Design” “Re-colorization” of scanned indoor scenes or intrinsic decomposition based editing Limitations. We have to manually correct inaccurate segmentation, though seldom encountered. This is a limitation of our method. However, segmentation errors are seldom encountered during experiments. Since our method is object-based, our segmentation method does not consider the color patterns among similar components of an image object. Currently, our system is not capable of segmenting the mesh according to the colored components with similar geometry for this kind of objects. This is another limitation of our method. An intrinsic image decomposition method could be helpful to our image database, for extracting lighting-free textures to be further used in rendering colorized scenes. However, such methods are not so robust that can be directly applied to various images in a large image database. On the other hand, intrinsic image decomposition is not essential to achieve good results in our experiments. So we did not incorporate it in our work, but we will further study it to improve our database.

Pipelineimage enhancement #2 “Auto-adjust” RGB texture maps for indoor scans with user interaction We use the CIELab color space for both the input and output images. We can use 3-channel Lab color as the color features. However, it generates color variations in smooth regions since each color is processed independently. To alleviate this issue, we add the local neighborhood information by concatenating the Lab color and the L2 normalized first-layer convolutional feature maps of ResNet-50. Although the proposed method provides the users with automatically adjusted photos, some users may want their photos to be retouched by their own preference. In the first row of Fig. 2 for example, a user may want only the color of the people to be changed. For such situations, we provide a way for the users to give their own adjustment maps to the system. Figure 4 shows some examples of the personalization. When the input image is forwarded, we substitute the extracted semantic adjustment map with the new adjustment map from the user. As shown in the figure, the proposed method effectively creates the personalized images adjusted by user’s own style. Deep Semantics-Aware Photo Adjustment Seonghyeon Nam, Seon Joo Kim (Submitted on 26 Jun 2017) https://arxiv.org/abs/1706.08260

Pipelineimage enhancement #3 Aesthetic-Driven Image Enhancement by Adversarial Learning Yubin Deng, Chen Change Loy, Xiaoou Tang (Submitted on 17 Jul 2017) https://arxiv.org/abs/1707.05251 GAN GAN GAN Pro Pro Pro Examples of image enhancement given original input (a). The architecture of our proposed EnhanceGAN framework. ResNet module is the feature extractor (for image in CIELab color space); in this work, we use the ResNet-101 and removed the last average pooling layer and the final fc layer. The switch icons in the discriminator network represent zero-masking during stage-wise training “Auto-adjust” RGB texture maps for indoor scans with GANs

Pipelineimage enhancement #4 “Auto-adjust” RGB texture maps for indoor scans with GANs for “auto-matting” Creatism: A deep-learning photographer capable of creating professional work Hui Fang, Meng Zhang (Submitted on 11 Jul 2017) https://arxiv.org/abs/1707.03491 https://google.github.io/creatism/ Datasets were created that contain ratings of photographs based on aesthetic quality [Murray et al., 2012] [Kong et al., 2016] [Lu et al., 2015]. Using our system, we mimic the workflow of a landscape photographer, from framing for the best composition to carrying out various post-processing operations. The environment for our virtual photographer is simulated by a collection of panorama images from Google Street View. We design a "Turing- test"-like experiment to objectively measure quality of its creations, where professional photographers rate a mixture of photographs from different sources blindly. We work with professional photographers to empirically define 4 levels of aesthetic quality: ● 1: point-and-shoot photos without consideration. ● 2: Good photos from the majority of population without art background. Nothing artistic stands out. ● 3: Semi-pro. Great photos showing clear artistic aspects. The photographer is on the right track of becoming a professional. ● 4: Pro-work. Clearly each professional has his/her unique taste that needs calibration. We use AVA dataset [Murray et al., 2012] wto bootstrap a consensus among them. Assume there exists a universal aesthetics metric, Φ. By definition, needs to incorporate all aesthetic aspects, suchΦ as saturation, detail level, composition... To define withΦ examples, number of images needs to grow exponentially to cover more aspects [Jaroensri et al., 2015]. To make things worse, unlike traditional problems such as object recognition, what we need are not only natural images, but also pro- level photos, which are much less in quantity.

Pipelineimage enhancement #5 “Auto-adjust” images based on different user groups (or personalizing for different markets for indoor scan products) Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models Ardavan Saeedi, Matthew D. Hoffman, Stephen J. DiVerdi, Asma Ghandeharioun, Matthew J. Johnson, Ryan P. Adams CSAIL, MI; Adobe Research; Media Lab, MIT; Harvard and Google Brain (Submitted on 17 Apr 2017) https://arxiv.org/abs/1704.04997 The main goals of our proposed models: (a) Multimodal photo edits: For a given photo, there may be multiple valid aesthetic choices that are quite different from one another. (b) User categorization: A synthetic example where different user clusters tend to prefer different slider values. Group 1 users prefer to increase the exposure and temperature for the baby images; group 2 users reduce clarity and saturation for similar images. Predictive log-likelihood for users in the test set of different datasets. For each user in the test set, we compute the predictive log-likelihood of 20 images, given 0 to 30 images and their corresponding sliders from the same user. 30 sample trajectories and the overall average ± s.e. is shown for casual, frequent and expert users. The figure shows that knowing more about the user (up to around 10 images) can increase the predictive log-likelihood. The log-likelihood is normalized by subtracting off the predictive log-likelihood computed given zero images. Note the different y-axis in the plots. The rightmost plot is provided for comparing the average predictive log-likelihood across datasets.

Pipelineimage enhancement #6 Combining semantic segmentation for higher quality “Instagram filters” Exemplar-Based Image and Video Stylization Using Fully Convolutional Semantic Features Feida Zhu ; Zhicheng Yan ; Jiajun Bu ; Yizhou Yu IEEE Transactions on Image Processing ( Volume: 26, Issue: 7, July 2017 ) https://doi.org/10.1109/TIP.2017.2703099 Color and tone stylization in images and videos strives to enhance unique themes with artistic color and tone adjustments. It has a broad range of applications from professional image postp-rocessing to photo sharing over social networks. Mainstream photo enhancement softwares, such as Adobe Lightroom and Instagram, provide users with predefined styles, which are often hand-crafted through a trial-and-error process. Such photo adjustment tools lack a semantic understanding of image contents and the resulting global color transform limits the range of artistic styles it can represent. On the other hand, stylistic enhancement needs to apply distinct adjustments to various semantic regions. Such an ability enables a broader range of visual styles. Traditional professional video editing softwares (Adobe After Effects, Nuke, etc.) offer a suite of predefined operations with tunable parameters that apply common global adjustments (exposure/color correction, white balancing, sharpening, denoising, etc). Local adjustments within specific spatiotemporal regions are usually accomplished with masking layers created with intensive user interaction. Both parameter tuning and masking layer creation are labor intensive processes. An example of learning semantics-aware photo adjustment styles. Left: Input image. Middle: Manually enhanced by photographer. Distinct adjustments are applied to different semantic regions. Right: Automatically enhanced by our deep learning model trained from image exemplars. (a) Input image. (b) Ground truth. (c) Our result. Given a set of exemplar image pairs, each representing a photo before and after pixel-level color (in CIELab space) and tone adjustments following a particular style, we wish to learn a computational model that can automatically adjust a novel input photo in the same style. We still cast this learning task as a regression problem as in Yan et al. (2016). For completeness, let us first review their problem definition and then present our new deep learning based architecture and solution.

Pipelineimage enhancement #7A Combining semantic segmentation for higher quality “Instagram filters” Deep Bilateral Learning for Real-Time Image Enhancement Michaël Gharbi, Jiawen Chen, Jonathan T. Barron, Samuel W. Hasinoff, Frédo Durand MIT CSAIL, Google Research, MIT CSAIL / Inria, Université Côte d’Azur (Submitted on 10 Jul 2017) https://arxiv.org/abs/1707.02880 | https://github.com/mgharbi/hdrnet | https://groups.csail.mit.edu/graphics/hdrnet/ https://youtu.be/GAe0qKKQY_I Our novel neural network architecture can reproduce sophisticated image enhancements with inference running in real time at full HD resolution on mobile devices. It can not only be used to dramatically accelerate reference implementations, but can also learn subjective effects from human retouching (“copycat” filter). By performing most of its computation within a bilateral grid and by predicting local affine color transforms, our model is able to strike the right balance between expressivity and speed. To build this model we have introduced two new layers: a data-dependent lookup that enables slicing into the bilateral grid, and a multiplicative operation for affine transformation. By training in an end-to-end fashion and optimizing our loss function at full resolution (despite most of our network being at a heavily reduced resolution), our model is capable of learning full-resolution and non-scale- invariant effects.

Pipelineimage enhancement #8 Blind Image Quality assessment e.g. for quantifying RGB scan quality real-time RankIQA: Learning from Rankings for No- reference Image Quality Assessment Xialei Liu, Joost van de Weijer, Andrew D. Bagdanov (Submitted on 26 Jul 2017) https://arxiv.org/abs/1707.08347 The classical approach trains a deep CNN regressor directly on the ground-truth. Our approach trains a network from an image ranking dataset. These ranked images can be easily generated by applying distortions of varying intensities. The network parameters are then transferred to the regression network for finetuning. This allows for the training of deeper and wider networks. Siamese network output for JPEG distortion considering 6 levels. This graphs illustrate the fact that the Siamese network successfully manages to separate the different distortion levels. Blind Deep S3D Image Quality Evaluation via Local to Global Feature Aggregation Heeseok Oh ; Sewoong Ahn ; Jongyoo Kim ; Sanghoon Lee IEEE Transactions on Image Processing ( Volume: 26, Issue: 10, Oct. 2017 ) https://doi.org/10.1109/TIP.2017.2725584

Pipelineimage Styling #1 Aesthetics enhancement: High Dynamic Range from SfM Large scale structure-from-motion (SfM) algorithms have recently enabled the reconstruction of highly detailed 3-D models of our surroundings simply by taking photographs. In this paper, we propose to leverage these reconstruction techniques to automatically estimate the outdoor illumination conditions for each image in a SfM photo collection. We introduce a novel dataset of outdoor photo collections, where the ground truth lighting conditions are known at each image. We also present an inverse rendering approach that recovers a high dynamic range estimate of the lighting conditions for each low dynamic range input image. Our novel database is used to quantitatively evaluate the performance of our algorithm. Results show that physically plausible lighting estimates can faithfully be recovered, both in terms of light direction and intensity. Lighting Estimation in Outdoor Image Collections Jean-Francois Lalonde (Laval University); Iain Matthews (Disney Research) 3D Vision (3DV), 2014 2nd International Conference on https://www.disneyresearch.com/publication/lighting-estimation-in-outdoor-image-collections/ https://doi.org/10.1109/3DV.2014.112 The main limitation of our approach is that it can recover precise lighting parameters only when lighting actually creates strongly visible effects—such as cast shadows, shading differences amongst surfaces of different orientations—on the image. When the camera does not observe significant lighting variations, for example when the sun is shining on a part of the building that the camera does not observe, or when the camera only see a very small fraction of the landmark with little geometric details, our approach recovers a coarse estimate of the full lighting conditions. In addition, our approach is sensitive to errors in geometry estimation, or to the presence of unobserved, nearby objects. Because it does not know about these objects, our method tries to explain their cast shadows with the available geometry, which may result in errors. Our approach is also sensitive to inter-reflections. Incorporating more sophisticated image formation models such as radiosity could help alleviating this problem, at the expense of significantly more computation. Finally, our approach relies on knowledge of the camera exposure and white balance settings, which might be less applicable to the case of images downloaded on the Internet. We plan to explore these issues in future work. Exploring material recognition for estimating reflectance and illumination from a single image Michael Weinmann; Reinhard Klein MAM '16 Proceedings of the Eurographics 2016 Workshop on Material Appearance Modeling https://doi.org/10.2312/mam.20161253 We demonstrate that reflectance and illumination can be estimated reliably for several materials that are beyond simple Lambertian surface reflectance behavior because of exhibiting mesoscopic effects such as interreflections and shadows. Shading Annotations in the Wild Balazs Kovacs, Sean Bell, Noah Snavely, Kavita Bala (Submitted on 2 May 2017) https://arxiv.org/abs/1705.01156 http://opensurfaces.cs.cornell.edu/saw/ We use this data to train a convolutional neural network to predict per-pixel shading information in an image. We demonstrate the value of our data and network in an application to intrinsic images, where we can reduce decomposition artifacts produced by existing algorithms.

Pipelineimage Styling #2A Aesthetics enhancement: High Dynamic Range #1 Learning High Dynamic Range from Outdoor Panoramas Jinsong Zhang, Jean-François Lalonde (Submitted on 29 Mar 2017 (v1), last revised 8 Aug 2017 (this version, v2)) https://arxiv.org/abs/1703.10200 http://www.jflalonde.ca/projects/learningHDR Qualitative results on the synthetic dataset. Top row: the ground truth HDR panorama, middle row: the LDR panorama, and bottom row: the predicted HDR panorama obtained with our method. To illustrate dynamic range, each panorama is shown at two exposures, with a factor of 16 between the two. For each example, we show the panorama itself (left column), and the rendering of a 3D object lit with the panorama (right column). The object is a “spiky sphere” on a ground plane, seen from above. Our method accurately predicts the extremely high dynamic range of outdoor lighting in a wide variety of lighting conditions. A tonemapping of γ = 2.2 is used for display purposes. Real cameras have non-linear response functions. To simulate this, we randomly sample real camera response functions from the Database of Response Functions (DoRF) [Grossberg and Nayar, 2003], and apply them to the linear synthetic data before training. Examples from our real dataset. For each case, we show the LDR panorama captured by the Ricoh Theta S camera, a consumer grade point-and-shoot 360º camera (left), and the corresponding HDR panorama captured by the Canon 5D Mark III DSLR mounted on a tripod, equipped with a Sigma 8mm fisheye lens (right, shown at a different exposure to illustrate the high dynamic range). We present a full end-to-end learning approach to estimate the extremely high dynamic range of outdoor lighting from a single, LDR 360º panorama. Our main insight is to exploit a large dataset of synthetic data composed of a realistic virtual city model, lit with real world HDR sky light probes [Lalonde et al. 2016 http://www.hdrdb.com/] to train a deep convolutional autoencoder

Pipelineimage Styling #2b High Dynamic Range #2: Learn illumination for relighting purposes Learning to Predict Indoor Illumination from a Single Image Marc-André Gardner, Kalyan Sunkavalli, Ersin Yumer, Xiaohui Shen, Emiliano Gambaretto, Christian Gagné, Jean-François Lalonde (Submitted on 1 Apr 2017 (v1), last revised 25 May 2017 (this version, v2)) https://arxiv.org/abs/1704.00090

Pipelineimage Styling #3a Improving photocompositing and relighting of RGB textures Deep Image Harmonization Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, Ming-Hsuan Yang (Submitted on 28 Feb 2017) https://arxiv.org/abs/1703.00069 Our method can adjust the appearances of the composite foreground to make it compatible with the background region. Given a composite image, we show the harmonized images generated by Xue et al. (2012), Zhu et al. (2015) and our deep harmonization network. The overview of the proposed joint network architecture. Given a composite image and a provided foreground mask, we first pass the input through an encoder for learning feature representations. The encoder is then connected to two decoders, including a harmonization decoder for reconstructing the harmonized output and a scene parsing decoder to predict pixel-wise semantic labels. In order to use the learned semantics and improve harmonization results, we concatenate the feature maps from the scene parsing decoder to the harmonization decoder (denoted as dot-orange lines). In addition, we add skip links (denoted as blue-dot lines) between the encoder and decoders for retaining image details and textures. Note that, to keep the figure clean, we only depict the links for the harmonization decoder, while the scene parsing decoder has the same skip links connected to the encoder. Given an input image (a), our network can adjust the foreground region according to the provided mask (b) and produce the output (c). In this example, we invert the mask from the one in the first row to the one in the second row, and generate harmonization results that account for different context and semantic information.

Pipelineimage Styling #3b Sky is not the limit: semantic-aware sky replacement YH Tsai, X Shen, Z Lin, K Sunkavalli; Ming-Hsuan Yang ACM Transactions on Graphics (TOG) - Volume 35 Issue 4, July 2016 https://doi.org/10.1145/2897824.2925942 In order to find proper skies for replacement, we propose a data-driven sky search scheme based on semantic layout of the input image. Finally, to re-compose the stylized sky with the original foreground naturally, an appearance transfer method is developed to match statistics locally and semantically. Sample sky segmentation results. Given an input image, the FCN generates results that localize the sky well but contain inaccurate boundaries and noisy segments. The proposed online model refines segmentations that are complete and accurate, especially around the boundaries (best viewedin color with enlarged images). Overview of the proposed algorithm. Given an input image, we first utilize the FCN to obtain scene parsing results and semantic response for each category. A coarse-to-fine strategy is adopted to segment sky regions (illustrated as the red mask). To find reference images for sky replacement, we develop a method to search images with similar semantic layout. After re-composing images with the found skies, we transfer visual semantics to match foreground statistics between the input image and the reference image. Finally, a set of composite images with different stylized skies are generated automatically. GP-GAN: Towards Realistic High-Resolution Image Blending Huikai Wu, Shuai Zheng, Junge Zhang, Kaiqi Huang (Submitted on 21 Mar 2017 (v1), last revised 25 Mar 2017 (this version, v2)) https://arxiv.org/abs/1703.07195 Qualitative illustration of high-resolution image blending. a) shows the composited copy-and-paste image where the inserted object is circled out by red lines. Users usually expect image blending algorithms to make this image more natural. b) represents the result based on Modified Poisson image editing [32]. c) indicates the result from Multi-splines approach. d) is the result of our method Gaussian-Poisson GAN (GP-GAN). Our approach produces better quality images than that from the alternatives in terms of illumination, spatial, and color consistencies. We advanced the state-of-the-art in conditional image generation by combining the ideas from the generative model GAN, Laplacian Pyramid, and Gauss-Poisson Equation. This combination is the first time a generative model could produce realistic images in arbitrary resolution. In spite of the effectiveness, our algorithm fails to generate realistic images when the composited images are far away from the distribution of the training dataset. We aim to address this issue in future work. Improving photocompositing and relighting of RGB textures

Pipelineimage Styling #3c Live User-Guided Intrinsic Video for Static Scenes Abhimitra Meka ; Gereon Fox ; Michael Zollhofer ; Christian Richardt ; Christian Theobalt IEEE Transactions on Visualization and Computer Graphics ( Volume: PP, Issue: 99 ) https://doi.org/10.1109/TVCG.2017.2734425 Improving photocompositing and relighting of RGB textures User constraints, in the form of constant shading and reflectance strokes, can be placed directly on the real-world geometry using an intuitive touch- based interaction metaphor, or using interactive mouse strokes. Fusing the decomposition results and constraints in three-dimensional space allows for robust propagation of this information to novel views by re-projection. We propose a novel approach for live, user-guided intrinsic video decomposition. We first obtain a dense volumetric reconstruction of the scene using a commodity RGB-D sensor. The reconstruction is leveraged to store reflectance estimates and user-provided constraints in 3D space to inform the ill-posed intrinsic video decomposition problem. Our approach runs at real-time frame rates, and we apply it to applications such as relighting, recoloring and material editing. Our novel user-guided intrinsic video approach enables real-time applications such as recoloring, relighting and material editing. Constant reflectance strokes improve the decomposition by moving the high-frequency shading of the cloth to the shading layer. Comparison to state-of-the-art intrinsic video decomposition techniques on the ‘girl’ dataset. Our approach matches the real-time performance of Meka et al. (2016), while achieving the same quality as previous off-line techniques

Pipelineimage Styling #4 Beyond low-level style transfer for high-level manipulation Generative Semantic Manipulation with Contrasting GAN Xiaodan Liang, Hao Zhang, Eric P. Xing (Submitted on 1 Aug 2017) https://arxiv.org/abs/1708.00315 Generative Adversarial Networks (GANs) have recently achieved significant improvement on paired/unpaired image-to-image translation, such as photo sketch and artist painting style transfer. However, existing models→ can only be capable of transferring the low-level information (e.g. color or texture changes), but fail to edit high- level semantic meanings (e.g., geometric structure or content) of objects. Some example semantic manipulation results by our model, which takes one image and a desired object category (e.g. cat, dog) as inputs and then learns to automatically change the object semantics by modifying their appearance or geometric structure. We show the original image (left) and manipulated result (right) in each pair. Although our method can achieve compelling results in many semantic manipulation tasks, it shows little success for some cases which require very large geometric changes, such as car truck and↔ car bus. Integrating spatial transformation layers for explicitly learning pixel-wise offsets may help↔ resolve very large geometric changes. To be more general, our model can be extended to replace the mask annotations with the predicted object masks or automatically learned attentive regions via attention modeling. This paper pushes forward the research of unsupervised setting by demonstrating the possibility of manipulating high-level object semantics rather than the low-level color and texture changes as previous works did. In addition, it would be more interesting to develop techniques that are able to manipulate object interactions and activities in images/videos for the future work.

Pipelineimage Styling #5A Aesthetics enhancement: Style Transfer | Introduction #1 Neural Style Transfer: A Review Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Mingli Song (Submitted on 11 May 2017) https://arxiv.org/abs/1705.04058 A list of mentioned papers in this review, corresponding codes and pre-trained models are publicly available at: https://github.com/ ycjing/Neural-Style-Transfer-Papers One of the reasons why Neural Style Transfer catches eyes in both academia and industry is its popularity in some social networking sites (e.g., Twitter and Facebook). A mobile application Prisma [36] is one of the first industrial applications that provides the Neural Style Transfer algorithm as a service. Before Prisma, the general public almost never imagines that one day they are able to turn their photos into art paintings in only a few minutes. Due to its high quality, Prisma achieved great success and is becoming popular around the world. Another use of Neural Style Transfer is to act as user-assisted creation tools. Although, to the best of our knowledge, there are no popular applications that applied the Neural Style Transfer technique in creation tools, we believe that it will be a promising potential usage in the future. Neural Style Transfer is capable of acting as a creation tool for painters and designers. Neural Style Transfer makes it more convenient for a painter to create an artifact of a specific style, especially when creating computer-made fine art images. Moreover, with Neural Style Transfer algorithms it is trivial to produce stylized fashion elements for fashion designers and stylized CAD drawings for architects in a variety of styles, which is costly to produce them by hand.

Pipelineimage Styling #5b Aesthetics enhancement: Style Transfer | Introduction #2 Neural Style Transfer: A Review Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Mingli Song (Submitted on 11 May 2017) https://arxiv.org/abs/1705.04058 A list of mentioned papers in this review, corresponding codes and pre-trained models are publicly available at: https://github.com/ ycjing/Neural-Style-Transfer-Papers Promising directions for future research in Neural Style Transfer mainly focus on two aspects. The first aspect is to solve the existing aforementioned challenges for current algorithms, i.e., problem of parameter tuning, problem of stroke orientation control and problem existing in “Fast” and “Faster” Neural Style Transfer algorithms. The second aspect of promising directions is to focus on new extensions to Neural Style Transfer (e.g., Fashion Style Transfer and Character Style Transfer). There are already some preliminary work related with this direction, such as the recent work of Yang et al. (2016) on Text Effects Transfer. These interesting extensions may become trending topics in the future and related new areas may be created subsequently.

Pipelineimage Styling #5C Aesthetics enhancement: Video Style Transfer DeepMovie: Using Optical Flow and Deep Neural Networks to Stylize Movies Alexander G. Anderson, Cory P. Berg, Daniel P. Mossing, Bruno A. Olshausen (Submitted on 26 May 2016) https://arxiv.org/abs/1605.08153 https://github.com/anishathalye/neural-style Coherent Online Video Style Transfer Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, Gang Hua (Submitted on 27 Mar 2017 (v1), last revised 28 Mar 2017 (this version, v2)) https://arxiv.org/abs/1703.09211 The main contribution of this paper is to use optical flow to initialize the style transfer optimization so that the texture features move with the objects in the video. Finally, we suggest a method to incorporate optical flow explicitly into the cost function. Overview of Our Approach: We begin by applying the style transfer algorithm to the first frame of the movie using the content image as the initialization. Next, we calculate the optical flow field that takes the first frame of the movie to the second frame. We apply this flow-field to the rendered version of the first frame and use that as the initialization for the style transfer optimization for the next frame. Note, for instance, that a blue pixel in the flow field image means that the underlying object in the video at that pixel moved to the left from frame one to frame two. Intuitively, in order to apply the flow field to the styled image, you move the parts of the image that have a blue pixel in the flow field to the left. We propose the first end-to-end network for online video style transfer, which generates temporally coherent stylized video sequences in near real-time. Two key ideas include an efficient network by incorporating short- term coherence, and propagating short-term coherence to long-term, which ensures the consistency over larger period of time. Our network can incorporate different image stylization networks. We show that the proposed method clearly outperforms the per-frame baseline both qualitatively and quantitatively. Moreover, it can achieve visually comparable coherence to optimization-based video style transfer, but is three orders of magnitudes faster in runtime. There are still some limitations in our method. For instance, limited by the accuracy of ground-truth optical flow (given by DeepFlow2 [Weinzaepfel et al. 2013] ), our results may suffer from some incoherence where the motion is too large for the flow to track. And after propagation over a long period, small flow errors may accumulate, causing blurriness. These open questions are interesting for further exploration in the future work.

Pipelineimage Styling #6A Aesthetics enhancement: Texture synthesis and upsampling TextureGAN: Controlling Deep Image Synthesis with Texture Patches Wenqi Xian, Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, James Hays (Submitted on 9 Jun 2017) https://arxiv.org/abs/1706.02823 TextureGAN pipeline. A feed-forward generative network is trained end-to-end to directly transform a 4- channel input to a high- res photo with realistic textural details. Photo-realistic Facial Texture Transfer Parneet Kaur, Hang Zhang, Kristin J. Dana (Submitted on 14 Jun 2017) https://arxiv.org/abs/1706.04306 Overview of our method. Facial identity is preserved using Facial Semantic Regularization which regularizes the update of meso-structures using a facial prior and facial semantic structural loss. Texture loss regularizes the update of local textures from the style image. The output image is initialized with the content image and updated at each iteration by back-propagating the error gradients for the combined losses. Content/style photos: Martin Scheoller/Art+Commerce. Identity-preserving Facial Texture Transfer (FaceTex). The textural details are transferred from style image to content image while preserving its identity. FaceTex outperforms existing methods perceptually as well as quantitatively. Column 3 uses input 1 as the style image and input 2 as the content. Column 4 uses input 1 as the content image and input 2 as the style image. Figure 3 shows more examples and comparison with existing methods. Input photos: Martin Scheoller/Art+Commerce.

Pipelineimage Styling #6B Aesthetics enhancement: Texture synthesis with style transfer Stable and Controllable Neural Texture Synthesis and Style Transfer Using Histogram Losses Eric Risser, Pierre Wilmot, Connelly Barnes Artomatix, University of Virginia (Submitted on 31 Jan 2017 (v1), last revised 1 Feb 2017 (this version, v2)) https://arxiv.org/abs/1701.08893 Our style transfer and texture synthesis results. The input styles are shown in (a), and style transfer results are in (b, c). Note that the angular shapes of the Picasso painting are successfully transferred on the top row, and that the more subtle brush strokes are transferred on the bottom row. The original content images are inset in the upper right corner. Unless otherwise noted, our algorithm is always run with default parameters (we do not manually tune parameters). Input textures are shown in (d) and texture synthesis results are in (e). For the texture synthesis, note that the algorithm synthesizes creative new patterns and connectivities in the output. Different statistics that can be used for neural network texture synthesis.

Pipelineimage Styling #6C Aesthetics enhancement: Enhancing texture maps Depth Texture Synthesis for Realistic Architectural Modeling Félix Labrie-Larrivée ; Denis Laurendeau ; Jean-François Lalonde Computer and Robot Vision (CRV), 2016 13th Conference on https://doi.org/10.1109/CRV.2016.77 In this paper, we present a novel approach that improves the resolution and geometry of 3D meshes of large scenes with such repeating elements. By leveraging structure from motion reconstruction and an off-the-shelf depth sensor, our approach captures a small sample of the scene in high resolution and automatically extends that information to similar regions of the scene. Using RGB and SfM depth information as a guide and simple geometric primitives as canvas, our approach extends the high resolution mesh by exploiting powerful, image-based texture synthesis approaches. The final results improves on standard SfM reconstruction with higher detail. Our approach benefits from reduced manual labor as opposed to full RGBD reconstruction, and can be done much more cheaply than with LiDAR- based solutions. In the future, we plan to work on a more generalized 3D texture synthesis procedure capable of synthesizing a more varied set of objects, and able to reconstruct multiple parts of the scene by exploiting several high resolution scan samples at once in an effort to address the tradeoff mentioned above. We also plan to improve the robustness of the approach to a more varied set of large scale scenes, irrespective of the lighting conditions, material colors, and geometric configurations. Finally, we plan to evaluate how our approach compares to SfM on a more quantitative level by leveraging LiDAR data as ground truth. Overview of the data collection and alignment procedure. Top row: a collection of photos of the scene is acquired with a typical camera, and used to generate a point cloud via SfM [Agarwal et al. 2009] and dense multi-view stereo (MVS) [ Furukawa and Ponce, 2012]. Bottom row: a repeating feature of the scene (in this example, the left-most window) is recorded with a Kinect sensor, and reconstructed into a high resolution mesh via the RGB-D SLAM technique KinectFusion [ Newcombe et al. 2011]. The mesh is then automatically aligned to the SfM reconstruction using bundle adjustment and our automatic scale adaptation technique (see sec. III-C). Right: the high resolution Kinect mesh is correctly aligned to the low resolution SfM point cloud

Pipelineimage Styling #6D Aesthetics enhancement: Towards photorealism with good maps One Ph.D. position (supervision by Profs Niessner and Rüdiger Westermann) is available at our chair in the area of photorealistic rendering for deep learning and online reconstruction Research in this project includes the development of photorealistic realtime rendering algorithms that can be used in deep learning applications for scene understanding, and for high-quality scalable rendering of point scans from depth sensors and RGB stereo image reconstruction. If you are interested in applying, you should have a strong background in computer science, i.e., efficient algorithms and data structures, and GPU programming, have experience implementing C/C++ algorithms, and you should be excited to work on state-of-the-art research in the 3D computer graphics. https://wwwcg.in.tum.de/group/joboffers/phd-position-photorealistic-rendering-for-deep-le arning-and-online-reconstruction.html Ph.D. Position – Photorealistic Rendering for Deep Learning and Online Reconstruction Photorealism Explained Blender Guru Published on May 25, 2016 http://www.blenderguru.com/tutorials/photorealism-explained/ https://youtu.be/R1-Ef54uTeU Stop wasting time creating texture maps by hand. All materials on Poliigon come with the relevant normal, displacement, reflection and gloss maps included. Just plug them into your software, and your material is ready to render. https://www.poliigon.com/ How to Make Photorealistic PBR Materials - Part 1 Blender Guru Published on Jun 28, 2016 http://www.blenderguru.com/tutoria ls/pbr-shader-tutorial-pt1/ https://youtu.be/V3wghb Z-Vh4?t=24m5s Physically Based Rendering (PBR)

Pipelineimage Styling #7 Styling line graphics (e.g. floorplans, 2D CADs) and monochrome images e.g. for desired visual identity Real-Time User-Guided Image Colorization with Learned Deep Priors Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S. Lin, Tianhe Yu, Alexei A. Efros (Submitted on 8 May 2017) https://arxiv.org/abs/1705.02999 Our proposed method colorizes a grayscale image (left), guided by sparse user inputs (second), in real-time, providing the capability for quickly generating multiple plausible colorizations (middle to right). Photograph of Migrant Mother by Dorothea Lange, 1936 (Public Domain). Network architecture We train two variants of the user interaction colorization network. Both variants use the blue layers for predicting a colorization. The Local Hints Network also uses red layers to (a) incorporate user points Ul and (b) predict a color distribution bZ. The Global Hints Network uses the green layers, which transforms global input Uд by 1 × 1 conv layers, and adds the result into the main colorization network. Each box represents a conv layer, with vertical dimension indicating feature map spatial resolution, and horizontal dimension indicating number of channels. Changes in resolution are achieved through subsampling and upsampling operations. In the main network, when resolution is decreased, the number of feature channels are doubled. Shortcut connections are added to upsampling convolution layers. Style Transfer for Anime Sketches with Enhanced Residual U-net and Auxiliary Classifier GAN Lvmin Zhang, Yi Ji, Xin Lin (Submitted on 11 Jun 2017 (v1), last revised 13 Jun 2017 (this version, v2)) https://arxiv.org/abs/1706.03319 Examples of combination results on sketch images (top-left) and style images (bottom-left). Our approach automatically applies the semantic features of an existing painting to an unfinished sketch. Our network has learned to classify the hair, eyes, skin and clothes, and has the ability to paint these features according to a sketch. In this paper, we integrated residual U-net to apply the style to the grayscale sketch with auxiliary classifier generative adversarial network (AC-GAN, Odena et al. 2016). The whole process is automatic and fast, and the results are creditable in the quality of art style as well as colorization Limitation: the pretrained VGG is for ImageNet photograph classification, but not for paintings. In the future, we will train a classification network only for paintings to achieve better results. Furthermore, due to the large quantity of layers in our residual network, the batch size during training is limited to no more than 4. It remains for future study to reach a balance between the batch size and quantity of layers. +

Future Image restoration Depth Images (Kinect, etc.)

PipelineDepth image enhancement #1a Image Formation #1 Pinhole Camera Model: ideal projection of a 3D object on a 2D image. Fernandez et al. (2017) Dot patterns of a Kinect for Windows (a) and two Kinects for Xbox (b) and (c) are projected on a flat wall from a distance of 1000 mm. Note that the projection of each pattern is similar, and related by a 3-D rotation depending on the orientation of the Kinect diffuser installation. The installation variability can clearly be observed from differences in the bright dot locations (yellow stars), which differ by an average distance of 10 pixels. Also displayed in (d) is the idealized binary replication of the Kinect dot pattern [Kinect Patter Uncovered] , which was used in this project to simulate IR images. - Landau et al. (2016) Landau et al. (2016)

PipelineDepth image enhancement #1b Image Formation #2 Characterizations of Noise in Kinect Depth Images: A Review Tanwi Mallick ; Partha Pratim Das ; Arun Kumar Majumdar IEEE Sensors Journal ( Volume: 14, Issue: 6, June 2014 ) https://doi.org/10.1109/JSEN.2014.2309987 Kinect outputs for a scene. (a) RGB Image. (b) Depth data rendered as an 8- bit gray-scale image with nearer depth values mapped to lower intensities. Invalid depth values are set to 0. Note the fixed band of invalid (black) pixels on left. (c) Depth image showing too near depths in blue, too far depths in red and unknown depths due to highly specular objects in green. Often these are all taken as invalid zero depth. Shadow is created in a depth image (Yu et al. 2013) when the incident IR from the emitter gets obstructed by an object and no depth can be estimated. PROPERTIES OF IR LIGHT [Rose]

Pipeline Depth image enhancement #1c Image Formation #3 Authors’ experiments on structural noise using a plane in 400 frames. (a) Error at 1.2m. (b) Error at 1.6m. (c) Error at 1.8m. Smisek et al. (2013) calibrate a Kinect against a stereo-rig (comprising two Nikon D60 DSLR cameras) to estimate and improve its overall accuracy. They have taken images and fitted planar objects at 18 different distances (from 0.7 to 1.3 meters) to estimate the error between the depths measured by the two sensors. The experiments corroborate that the accuracy varies inversely with the square of depth [2]. However, even after the calibration of Kinect, the procedure still exhibits relatively complex residual errors (Fig. 8). Fig. 8. Residual noise of a plane. (a) Plane at 86cm. (b) Plane at 104cm. Authors’ experiments on temporal noise. Entropy and SD of each pixel in a depth frame over 400 frames for a stationary wall at 1.6m. (a) Entropy image. (b) SD image. Authors’ experiments with vibrating noise showing ZD samples as white dots. A pixel is taken as noise if it is zero in frame i and nonzero in frames i±1. Note that noise follows depth edges and shadow. (a) Frame (i−1). (b) Frame i. (c) Frame (i+1). (d) Noise for frame i.

PipelineDepth image enhancement #1d Image Formation #4 The filtered intensity samples generated from unsaturated IR dots (blue dots) were used to fit the intensity model (red line), which follows an inverse square model for the distance between the sensor and the surface point Landau et al. (2016) (a) Multiplicative speckle distribution is unitless, and can be represented as a gamma distribution Γ (4.54, 0.196). (b) Additive detector noise distribution can be represented as a normal distribution Ν (−0.126, 10.4), and has units of 10-bit intensity. Landau et al. (2016) The standard error in depth estimation (mm) as a function of radial distance (pix) is plotted for the (a) experimental and (b) simulated data sets of flat walls at various depths (mm). The experimental standard depth error increases faster with an increase in radial distance due to lens distortion. Landau et al. (2016)

PipelineDepth image enhancement #2A Metrological Calibration #1 A New Calibration Method for Commercial RGB-D Sensors Walid Darwish, Shenjun Tang, Wenbin Li and Wu Chen Sensors 2017, 17(6), 1204; doi:10.3390/s17061204 Based on these calibration algorithms, different calibration methods have been implemented and tested. Methods include the use of 1D [Liu et al. 2012] 2D [Shibo and Qing 2012] , and 3D [Gui et al. 2014] calibration objects that work with the depth images directly; calibration of the manufacture parameters of the IR camera and projector [Herrera et al. 2012] ; or photogrammetric bundle adjustments used to model the systematic errors of the IR sensors [ Davoodianidaliki and Saadatseresht 2013; Chow and Lichti 2013] . To enhance the depth precision, additional depth error models are added to the calibration procedure [7,8,21,22,23]. All of these error models are used to compensate only for the distortion effect of the IR projector and camera. Other research works have been conducted to obtain the relative calibration between an RGB camera and an IR camera by accessing the IR camera [24,25,26]. This can achieve relatively high accuracy calibration parameters for a baseline between IR and RGB cameras, while the remaining limitation is that the distortion parameters for the IR camera cannot represent the full distortion effect for the depth sensor. This study addressed these issues using a two-step calibration procedure to calibrate all of the geometric parameters of RGB-D sensors. The first step was related to the joint calibration between the RGB and IR cameras, which was achieved by adopting the procedure discussed in [27] to compute the external baseline between the cameras and the distortion parameters of the RGB camera. The second step focused on the depth sensor calibration. Point cloud of two perpendicular planes (blue color: default depth; red color: modeled depth): highlighted black dashed circles shows the significant impact of the calibration method on the point cloud quality. The main difference between both sensors is the baseline between the IR camera and projector. The longer the sensor’s baseline, the longer working distance can be achieved. The working range of Kinect v1 is 0.80 m to 4.0 m, while it is 0.35 m to 3.5 m for Structure Sensor.

PipelineDepth image enhancement #2A Metrological Calibration #2 Photogrammetric Bundle Adjustment With Self- Calibration of the PrimeSense 3D Camera Technology: Microsoft Kinect IEEE Access ( Volume: 1 ) 2013 https://doi.org/10.1109/ACCESS.2013.2271860 Roughness of point cloud before calibration. (Bottom) Roughness of point cloud after calibration. The colours indicate the roughness as measured by the normalized smallest eigenvalue. Estimated Standard Deviation of the Observation Residuals To quantify the external accuracy of the Kinect and the benefit of the proposed calibration, a target board located at 1.5–1.8 m away with 20 signalized targets was imaged using an in- house program based on the Microsoft Kinect SDK and with RGBDemo. Spatial distances between the targets were known from surveying using the FARO Focus3D terrestrial laser scanner with a standard deviation of 0.7 mm. By comparing the 10 independent spatial distances measured by the Kinect to those made by the Focus3D, the RMSE was 7.8 mm using RGBDemo and 3.7 mm using the calibrated Kinect results; showing a 53% improvement to the accuracy. This accuracy check assesses the quality of all the imaging sensors and not just the IR camera-projector pair alone. The results show improvements in geometric accuracy up to 53% compared with uncalibrated point clouds captured using the popular software RGBDemo. Systematic depth discontinuities were also reduced and in the check-plane analysis the noise of the Kinect point cloud was reduced by 17%.

PipelineDepth image enhancement #2B Metrological Calibration #3 Evaluating and Improving the Depth Accuracy of Kinect for Windows v2 Lin Yang ; Longyu Zhang ; Haiwei Dong ; Abdulhameed Alelaiwi ; Abdulmotaleb El Saddik IEEE Sensors Journal (Volume: 15, Issue: 8, Aug. 2015) https://doi.org/10.1109/JSEN.2015.2416651 Illustration of accuracy assessment of Kinect v2. (a) Depth accuracy. (b) Depth resolution. (c) Depth entropy. (d) Edge noise. (e) Structural noise. The target plates in (a- c) and (d-e) are parallel and perpendicular with the depth axis, respectively. Accuracy error distribution of Kinect for Windows v2.

PipelineDepth image enhancement #2c A Comparative Error Analysis of Current Time-of-Flight Sensors IEEE Transactions on Computational Imaging (Volume: 2, Issue: 1, March 2016) https://doi.org/10.1109/TCI.2015.2510506 For evaluating the presence of wiggling, ground truth distance information is required. We calculate the true distance by setting up a stereo camera system. This system consists of the ToF camera to be evaluated and a high resolution monochrome camera (IDS UI-1241LE7) which we call the reference camera. The cameras are calibrated with Zhang (2000)’s algorithm with point correspondences computed with ROCHADE ( Placht et al. 2014). Ground truth is calculated by intersecting the rays of all ToF camera pixels with the 3D plane of the checkerboard. For higher accuracy, we compute this plane from corners detected in the reference image and transform the plane into the coordinate system of the ToF camera This experiment aims to quantify the so- called amplitude-related distance error and also to show that this effect is not related to scattering. This effect can be observed when looking at a planar surface with high reflectivity variations. With some sensors the distance measurements for pixels with different amplitudes do not lie on the same plane, even though they should. To the best of our knowledge no evaluation setup has been presented for this error source so far. In the past this error has been typically observed with images of checkerboards or other high contrast patterns. However, the analysis of single images allows no differentiation between amplitude-related errors andinternal scattering Metrological Calibration #4

PipelineDepth image enhancement #2c Metrological Calibration #5 Low-Cost Reflectance-Based Method for the Radiometric Calibration of Kinect 2 IEEE Sensors Journal ( Volume: 16, Issue: 7, April1, 2016 ) https://doi.org/10.1109/JSEN.2015.2508802 In this paper, a reflectance-based radiometric method for the second generation of gaming sensors, Kinect 2, is presented and discussed. In particular, a repeatable methodology generalizable to different gaming sensors by means of a calibrated reference panel with Lambertian behavior is developed. The relationship between the received power and the final digital level is obtained by means of a combination of linear sensor relationship and signal attenuation, into a least squares adjustment with an outlier detector. The results confirm that the quality of the method (standard deviation better than 2% in laboratory conditions and discrepancies lower than 7% b) is valid for exploiting the radiometric possibilities of this low-cost sensor, which ranges from the pathological analysis (moisture, crusts, etc.…); to agricultural and forest resource evaluation. 3D data acquired with Kinect 2 (left) and digital number (DN) distribution (right) for the reference panel at 0.7 m (units: counts). Visible-RGB view of the brick wall (a), intensity-IR digital levels (DN) (b-d) and calibrated reflectance values (e-g) for the three acquisition distances The objective of this paper was to develop a radiometric calibration equation of an IR projector- camera for the second generation of gaming sensors, Kinect 2, to convert the recorded digital levels into physical values (reflectance). By the proposed equation, the reflectance properties of the IR projector-camera set of Kinect 2 were obtained. This new equation will increase the number of application fields of gaming sensors, favored by the possibility of working outdoors. The process of radiometric calibration should be incorporated as part of an integral process where the geometry obtained is also corrected (i.e., lens distortion, mapping function, depth errors, etc.). As future perspectives, the effects of the diffuse radiance, which does not belong to the sensor footprint and contaminate the received signal, will be evaluated to determine the error budget in the active sensor.

PipelineDepth image enhancement #3 ‘Old-school’ depth refining techniques Depth enhancement with improved exemplar-based inpainting and joint trilateral guided filtering Liang Zhang ; Peiyi Shen ; Shu'e Zhang ; Juan Song ; Guangming Zhu Image Processing (ICIP), 2016 IEEE International Conference on https://doi.org/10.1109/ICIP.2016.7533131 In this paper, a novel depth enhancement algorithm with improved exemplar-based inpainting and joint trilateral guided filtering is proposed. The improved examplar-based inpainting method is applied to fill the holes in the depth images, in which the level set distance component is introduced in the priority evaluation function. Then a joint trilateral guided filter is adopted to denoise and smooth the inpainted results. Experimental results reveal that the proposed algorithm can achieve better enhancement results compared with the existing methods in terms of subjective and objective quality measurements. Robust depth enhancement and optimization based on advanced multilateral filters Ting-An ChangYang-Ting ChouJar-Ferr Yang EURASIP Journal on Advances in Signal Processing December 2017, 2017:51 https://doi.org/10.1186/s13634-017-0487-7 Results of the depth enhancement coupled with hole filling results obtainedby a noisy depth map, b joint bilateral filter (JBF) [16 ], c intensity guided depth superresolution (IGDS) [39], d compressive sensing based depth upsampling (CSDU) [40], e adaptive joint trilateral filter (AJTF) [18], and f the proposed AMF for Art, Books, Doily, Moebius, RGBD_1, and RGBD_2

PipelineDepth image enhancement #4A Deep learning-based depth refining techniques DepthComp : real-time depth image completion based on prior semantic scene segmentation Atapour-Abarghouei, A. and Breckon, T.P. 28th British Machine Vision Conference (BMVC) 2017 London, UK, 4-7 September 2017. http://dro.dur.ac.uk/22375/ Exemplar results on the KITTI dataset. S denotes the segmented images [3] and D the original (unfilled) disparity maps. Results are compared with [1, 2, 29, 35, 45]. Results of cubic and linear interpolations are omitted due to space. Comparison of the proposed method using different initial segmentation techniques on the KITTI dataset [27]. Original color and disparity image (top-left), results with manual labels (top-right), results with SegNet [3] (bottom-left) and results with mean-shift [26] (bottom-right). Fast depth image denoising and enhancement using a deep convolutional network Xin Zhang and Ruiyuan Wu Acoustics, Speech and Signal Processing (ICASSP), 2016 IEE https://doi.org/10.1109/ICASSP.2016.7472127

PipelineDepth image enhancement #4b Deep learning-based depth refining techniques Guided deep network for depth map super-resolution: How much can color help? Wentian Zhou ; Xin Li ; Daryl Reynolds Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE https://doi.org/10.1109/ICASSP.2017.7952398 https://anvoy.github.io/publication.html Depth map upsampling using joint edge-guided convolutional neural network for virtual view synthesizing Yan Dong; Chunyu Lin; Yao Zhao; Chao Yao Journal of Electronic Imaging Volume 26, Issue 4 http://dx.doi.org/10.1117/1.JEI.26.4.043004 Depth map upsampling. Input: (a) low-resolution depth map and (b) the corresponding color image. Output: (c) recovered high-resolution depth map. When the depth edges become unreliable, our network tends to rely on color-based prediction network (CBPN) for restoring more accurate depth edges. Therefore, contribution of color image increases when the reliability of the LR depth map decreases (e.g., as noise gets stronger). We adopt the popular deep CNN to learn non-linear mapping between LR and HR depth maps. Furthermore, a novel color-based prediction network is proposed to properly exploit supplementary color information in addition to the depth enhancement network. In our experiments, we have shown that deep neural network based approach is superior to several existing state-of-the-art methods. Further comparisons are reported to confirm our analysis that the contributions of color image vary significantly depending on the reliability of LR depth maps.

Future Image restoration Depth Images (Laser scanning)

PipelineLaser range Finding #1a Versatile Approach to Probabilistic Modeling of Hokuyo UTM-30LX IEEE Sensors Journal ( Volume: 16, Issue: 6, March15, 2016 ) https://doi.org/10.1109/JSEN.2015.2506403 When working with Laser Range Finding (LRF), it is necessary to know the principle of sensor’s measurement and its properties. There are several measurement principles used in LRFs [Nejad and Olyaee 2006], [ Łabęcki et al. 2012], [Adams 1999] : ● Triangulation ● Time of flight (TOF) ● Frequency modulation continuous wave (FMCW) ● Phase shift measurement (PSM) The geometry of terrestrial laser scanning; identification of errors, modeling and mitigation of scanning geometry Soudarissanane, S.S.. TU Delft. Doctoral Thesis (2016) http://doi.org/10.4233/uuid:b7ae0bd3-23b8-4a8a-9b7d-5e494ebb54e5 Distance measurement principle of time-of-flight laser scanners (top) and phase based laser scanners (bottom). Laser Range Finding : Image formation #1

PipelineLaser range Finding #1b Laser Range Finding : Image formation #2 The geometry of terrestrial laser scanning; identification of errors, modeling and mitigation of scanning geometry Soudarissanane, S.S.. TU Delft. Doctoral Thesis (2016) http://doi.org/10.4233/uuid:b7ae0bd3-23b8-4a8a-9b7d-5e494ebb54e5 Two ways link budget between the receiver (Rx) and the transmitter (Tx) in a Free Space Path (FSP) propagation model. Schematic representation of the signal propagation from the transmitter to the receiver. Effect of increasing incidence angle and range to the signal deterioration. (left) Plot of the the signal deterioration due to increasing incidence angle α, (right) plot of the signal deterioration due to increasing ranges ρ, with ρmin = 0 m and ρmax = 100 m Relationship between scan angle and normal vector orientation used for the segmentation of point cloud with respect to planar features. A point P = [ , , ]θ ϕ ρ is measured on the plane with the normal parameters N = [ , , ]α β γ . Different angles used for the range image gradients are plotted Theoretical number of points. Practical example of a plate of 1×1 m placed at 3 m, oriented at 0º and being rotated at 60º. Theoretical number of points. (left) Number of points with respect to the orientation of the patch and the distance. Reference plate measurement set-up. A white coated plywood board is mounted on a tripod via a screw clamp mechanism provided with a 2º precision goniometer.

PipelineLaser range Finding #1c Laser Range Finding : Image formation #3 The geometry of terrestrial laser scanning; identification of errors, modeling and mitigation of scanning geometry Soudarissanane, S.S.. TU Delft. Doctoral Thesis (2016) http://doi.org/10.4233/uuid:b7ae0bd3-23b8-4a8a-9b7d-5e494ebb54e5 Terrestrial Laser Scanning (TLS) good practice of survey planning Future directions At the time this research started, terrestrial laser scanners were mainly being used by research institutes and manufacturers. However, nowadays, terrestrial laser scanners are present in almost every field of work, e.g. forensics, architecture, civil engineering, gaming industry, movie industry. Mobile mapping systems, such as scanners capturing a scene while driving a car, or scanners mounted on drones are currently making use of the same range determination techniques used in terrestrial laser scanners. The number of applications that make use of 3D point clouds is rapidly growing. The need for a sound quality product is even more significant as it impacts the quality of a huge panel of end-products.

PipelineLaser range Finding #1D Laser Range Finding : Image formation #4 Ray-Tracing Method for Deriving Terrestrial Laser Scanner Systematic Errors Derek D. Lichti, Ph.D., P.Eng. Journal of Surveying Engineering | Volume 143 Issue 2 - May 2017 https://www.doi.org/10.1061/(ASCE)SU.1943-5428.0000213 Error model of direct georeferencing procedure of terrestrial laser scanning Pandžić, Jelena; Pejić, Marko; Božić, Branko; Erić, Verica Automation in Construction Volume 78, June 2017, Pages 13-23 https://doi.org/10.1016/j.autcon.2017.01.003

PipelineLaser range Finding #2A Calibration #1 Statistical Calibration Algorithms for Lidars Anas Alhashimi, Luleå University of Technology, Control Engineering Licentiate thesis (2016), ORCID iD: 0000-0001-6868-2210 A rigorous cylinder-based self-calibration approach for terrestrial laser scanners Ting On Chan; Derek D. Licht; David Belton ISPRS Journal of Photogrammetry and Remote Sensing; Volume 99, January 2015 https://doi.org/10.1016/j.isprsjprs.2014.11.003 The proposed method and its variants were first applied to two simulated datasets, to compare their effectiveness, and then to three real datasets captured by three different types of scanners are presented: a Faro Focus 3D (a phase-based panoramic scanner); a Velodyne HDL-32E (a pulse-based multi spinning beam scanner); and a Leica ScanStation C10 (a dual operating-mode scanner). In situ self-calibration is essential for terrestrial laser scanners (TLSs) to maintain high accuracy for many applications such as structural deformation monitoring (Lindenbergh, 2010) . This is particularly true for aged TLSs and instruments being operated for long hours outdoors with varying environmental conditions. Although the plane-based methods are now widely adopted for TLS calibration, they also suffer from the problem of high parameter correlation when there is a low diversity in the plane orientations (Chow et al., 2013). In practice, not all locations possess large and smooth planar features that can be used to perform a calibration. Even though planar features are available, their planarity is not always guaranteed. Because of the drawbacks to the point- based and plane-based calibrations, an alternative geometric feature, namely circular cylindrical features (e.g. Rabbani et al., 2007), should be considered and incorporated in to the self-calibration procedure. Estimate d without being aware of the mode hopping, i.e., assuming a certain λ0 without actually knowing that the average λ jumps between different lasing modes, reflects thus in a multimodal measurement of d Potential temperature-bias dependencies for the polynomial model. The plot explaining the cavity modes, gain profile and lasing modes for typical laser diode. The upper drawing shows the wavelength v1 as the dominant lasing mode while the lower drawing shows how both wavelengths v1 and v2 are competing; this latter case is responsible for the mode-hopping effects.

PipelineLaser range Finding #2b Calibration #2 Calibration of a multi-beam Laser System by using a TLS-generated Reference Gordon, M.; Meidow, J. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume II-5/W2, 2013, pp.85-90 http://dx.doi.org/10.5194/isprsannals-II-5-W2-85-2013 Extrinsic calibration of a multi-beam LiDAR system with improved intrinsic laser parameters using v-shaped planes and infrared images Po-Sen Huang ; Wen-Bin Hong ; Hsiang-Jen Chien ; Chia-Yen Chen IVMSP Workshop, 2013 IEEE 11th https://doi.org/10.1109/IVMSPW.2013.6611921 Velodyne HDL-64E S2, the LiDAR system studied in this work , for example, is a mobile scanner consisting of 64 pairs of laser emitter-receiver which are rigidly attached to a rotating motor and provides real-time panoramic range data with measurement errors of around 2.5 mm. In this paper we propose a method to use IR images as feedbacks in finding optimized intrinsic and extrinsic parameters of the LiDAR-vision scanner. First, we apply the IR-based calibration technique to a LiDAR system that fires multiple beams, which significantly increases the problem's complexity and difficulty. Second, the adjustment of parameters is applied to not only the extrinsic parameters, but also the laser parameters as well as the intrinsic parameters of the camera. Third, we use two different objective functions to avoid generalization failure of the optimized parameters. It is assumed that the accuracy of this point clouds is considerably higher than that from the multi- beam LIDAR and that the data represent faces of man-made objects at different distances. We inspect the Velodyne HDL-64E S2 system as the best-known representative for this kind of sensor system, while Z+F Imager 5010 serve as reference data. Beside the improvement of the point accuracy by considering the calibration results, we test the significance of the parameters related to the sensor model and consider the uncertainty of measurements w.r.t. the measured distances. Standard deviation of the planar misclosure is nearly halved from 3.2 cm to 1.7 cm. The variance component estimation as well as the standard deviation of the range residuals reveal that the manufactures standard deviation of the distance accuracy with 2 cm is a bit too optimistic. The histograms of the planar misclosures and the residuals reveal that this quantities are not normal distributed. Our investigation of the distance depending misclosure variance change is one reason. Other sources were investigated by Glennie and Lichti (2010): the incidence angle and the vertical angle. A further possibility is the focal distance, which is different for each laser and the average is at 8 m for the lower block and at 15 m for the upper block. This may introduce a distance depending—but nonlinear— variance change. Further research is needed to find the sources of these observations.

PipelineLaser range Finding #2c Calibration #3 Towards System Calibration of Panoramic Laser Scanners from a Single Station Tomislav Medić, Christoph Holst and Heiner Kuhlmann Sensors 2017, 17(5), 1145; doi: 10.3390/s17051145 “Terrestrial laser scanner measurements suffer from systematic errors due to internal misalignments. The magnitude of the resulting errors in the point cloud in many cases exceeds the magnitude of random errors. Hence, the task of calibrating a laser scanner is important for applications with high accuracy demands. In order to achieve the required measurement quality, manufacturers put considerable effort on the production and assembly of all instrument components. However, these processes are not perfect and remaining mechanical misalignments need to be modeled mathematically. That is achieved by a comprehensive factory calibration (e.g., [4]). In general, manufacturers do not provide complete information about the functional relations between the remaining mechanical misalignments and the observations, the number of relevant misalignments, as well as the magnitude and precision of the parameters describing those misalignments. This data is treated as a company secret. At the time of purchase, laser scanners are expected to be free of systematic errors caused by mechanical misalignments. Additionally, their measurement quality should be consistent with the description given in the manufacturers specifications. However, many factors can influence the performance of the particular scanner, such as long-term utilization, suffered stresses and extreme atmospheric conditions. Due to that, instruments must be tested and recalibrated at certain time intervals in order to maintain the declared measurement quality. There are certain alternatives, but they lack in comprehensiveness and reliability. For example, some manufacturers like Leica Geosystems and FARO Inc. provide user calibration approaches, which can reduce systematic errors in the measurements due to misalignments to some extent (e.g., Leica’s “Check and Adjust” and Faro’s “On-site compensation”). However, those approaches do not provide detailed information about all estimated parameters, their precision and influence on the resulting point cloud. Perfect panoramic terrestriallaser scanner geometry. (a) Local Cartesian coordinate system of the scanner with a respect to the main instrument axes; (b) Local coordinate system of the scanner transformed to the polar coordinates. Rotational mirror related mechanical misalignments: (a) mirror offset; (b) mirror tilt. Horizontal axis related mechanical misalignments: (a) axis offset; (b) axis tilt.

PipelineLaser range Finding #3A Scan Optimization #1 Optimal Placement of a Terrestrial Laser Scanner with an Emphasis on Reducing Occlusions Morteza Heidari Mozaffar, Masood Varshosaz Photogrammetric Record, Volume 31, Issue 156 December 2016 Pages 374–393 https://www.doi.org/10.1111/phor.12162 Planning Complex Inspection Tasks Using Redundant Roadmaps Brendan Englot, Franz Hover, Robotics Research Pp 327-343 Part of the Springer Tracts in Advanced Robotics book series (STAR, volume 100) https://doi.org/10.1007/978-3-319-29363-9_19 A randomized art-gallery algorithm for sensor placement H. González-Banos SCG '01 Proceedings of the seventeenth annual symposium on Computational geometry https://doi.org/10.1145/378583.378674 An autonomous mobile robot with a 3D laser range finder for 3D exploration and digitalization of indoor environments H Surmann, A Nüchter, J Hertzberg Robotics and Autonomous Systems. Volume 45, Issues 3–4, 31 December 2003, Pages 181-198 https://doi.org/10.1016/j.robot.2003.09.004 Near-optimal sensor placements in Gaussian processes Carlos Guestrin; Andreas Krause; Ajit Paul Singh ICML '05 Proceedings of the 22nd international conference on Machine learning https://doi.org/10.1145/1102351.1102385 When monitoring spatial phenomena, which are often modeled as Gaussian Processes (GPs), choosing sensor locations is a fundamental task. A common strategy is to place sensors at the points of highest entropy (variance) in the GP model.

PipelineLaser range Finding #3B Scan Optimization #2: Automatic drone scanning UAV-Based Autonomous Image Acquisition With Multi-View Stereo Quality Assurance by Confidence Prediction Christian Mostegel, Markus Rumpler, Friedrich Fraundorfer, Horst Bischof; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016 https://arxiv.org/abs/1605.01923 Autonomous Image Acquisition. After a manual initialization, our system loops between view planning and autonomous execution. Within the view planning procedure, we leverage machine learning to predict the best camera constellation for the presented scene and a specific dense MVS algorithm. This MVS algorithm will use the recorded high resolution images to produce a highly accurate and complete 3D reconstruction off-site in the lab. Consistency voting. A positive vote (center) is only cast if the reference measurement is within the uncertainty boundary of the query measurement. A negative vote is either cast if a reference measurement blocks the line of sight of the query camera (left) or the other way around (right). View planning. Our algorithm tries to find the k next best camera triplets for improving the acquisition quality. Next to the arrows, we show the data communication between our submodules (M1-M4) in red and in black we show how often this data is computed. S is the set of surrogate cameras, T the set of considered unfulfilled triangles and C3 the set of camera triplets generated from the surrogate cameras. In this paper we presented a novel autonomous system for acquiring close-range high-resolution images that maximize the quality of a later-on 3D reconstruction. We demonstrated that this quality strongly depends on the planarity of the scene structure (complex structures vs. smooth surfaces), the camera constellation and the chosen dense MVS algorithm. We learn these properties from unordered image sets without any hard ground truth and use the acquired knowledge to constrain the set of possible camera constellations in the planning phase. In using these constraints, we can drastically improve the success of the image acquisition, which finally results in a high-accuracy 3D reconstruction with a significantly higher scene coverage compared to traditional acquisition techniques.

PipelineLaser range Finding #3C Scan Optimization #3 Active Image-based Modeling Rui Huang, Danping Zou, Richard Vaughan, Ping Tan (Submitted on 2 May 2017) https://arxiv.org/abs/1705.01010 Plan3D: Viewpoint and Trajectory Optimization for Aerial Multi-View Stereo Reconstruction Benjamin Hepp, Matthias Nießner, Otmar Hilliges (Submitted on 25 May 2017) https://arxiv.org/abs/1705.09314 We propose an end-to-end system for 3D reconstruction of building-scale scenes with commercially available quadrotors. (A) A user defines the region of interest (green) on a map-based interface and specifies a pattern of viewpoints (orange), flown at a safe altitude. (B) The pattern is traversed and the captured images are processed resulting in an initial reconstruction and occupancy map. (C) We compute a viewpoint path that observes as much of the unknown space as possible adhering to characteristics of a purposeful designed camera model. The viewpoint path is only allowed to pass through known free space and thus the trajectory can be executed fully autonomously. (D) The newly captured images are processed together with the initial images to attain the final high-quality reconstruction of the region of interest. The method is capable of capturing concave areas and fine geometric detail. Comparison several iterations of our method for the church scene. Note that some prominent structures get only resolved in the second iteration. Also the overall sharpness of the walls and ornaments significantly improves with the second iteration. Modeling of an Asian building after the initial data capturing (a) and actively capturing more data (b). The figures (a.1), (b.1) show the SfM point clouds and camera poses, (a.2), (b.2) are color-coded model quality evaluation, where red indicates poor results, (a.3), (b.3) are the final 3D models generated from those images. Our system consists of an online front end and an offline back end. The front end captures images automatically to ensure good coverage of the object. The back end takes an existing method [Jancosek and Pajdla, 2011] to build a high quality 3D model.

PipelineLaser range Finding #3D Scan Optimization #4 Submodular Trajectory Optimization for Aerial 3D Scanning Mike Roberts, Debadeepta Dey, Anh Truong, Sudipta Sinha, Shital Shah, Ashish Kapoor, Pat Hanrahan, Neel Joshi Stanford University, Microsoft Research, Adobe Research Submitted on 1 May 2017 last revised 4 Aug 2017 https://arxiv.org/abs/1705.00703 3D reconstruction results obtained using our algorithm for generating aerial 3D scanning trajectories, as compared to an overhead trajectory. Top row: Google Earth visualizations of the trajectories. Middle and bottom rows: results obtained by flying a drone along each trajectory, capturing images, and feeding the images to multi- view stereo software. Our trajectories lead to noticeably more detailed 3D reconstructions than overhead trajectories. In all our experiments, we control for the flight time, battery consumption, number of images, and quality settings used in the 3D reconstruction. Overview of our algorithm for generating camera trajectories that maximize coverage. (a) Our goal is to find the optimal closed path of camera poses through a discrete graph. (b) We begin by solving for the optimal camera orientation at every node in our graph, ignoring path constraints. (c) In doing so, we remove the choice of camera orientation from our problem, coarsening our problem into a more standard form. (d) The solution to the problem in (b) defines an approximation to our coarsened problem, where there is an additive reward for visiting each node. (e) Finally, we solve for the optimal closed path on the additive approximation definedin (d).

PipelineLaser range Finding #4E Scan Optimization #5 A Reinforcement Learning Approach to the View Planning Problem Mustafa Devrim Kaba, Mustafa Gokhan Uzunbas, Ser Nam Lim Submitted on 19 Oct 2016, last revised 18 Nov 2016 https://arxiv.org/abs/1610.06204 (a) View planning for UAV terrain modeling, (b) Given a set of initial view points, (c) The goal is to find minimum number of views that provide sufficient coverage. Here, color code represents correspondence between selected views and the coverage. Visual results of coverage and sample views on various models. In the top row, lines represent location and direction of the selected cameras. Colors represent coverage by different cameras. Data Driven Strategies for Active Monocular SLAM using Inverse Reinforcement Learning Vignesh Prasad, Rishabh Jangir, Ravindran Balaraman, K. Madhava Krishna AAMAS '17 Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems http://www.aamas2017.org/proceedings/pdfs/p1697.pdf Gazebo [Koenig and Howard, 2004] is a framework that accurately simulates robots and dynamic environments. Experiments were performed in simulated environments on a Turtlebot using a Microsoft Kinect for the RGB camera input. We use PTAM (Parallel Tracking and Mapping) [Klein and Murray, 2007] for the Monocular SLAM framework.

PipelineLaser range Finding SUMMARY Previous slides highlight the need to keep your scanner calibrated as proper calibration reduces standard deviation for depth measurement. If you are not familiar with optics metrology, for example in LEDs, the increased temperature (T) will decrease light intensity even with constant forward current (If ) and shift peak wavelength (λ). Exact effects then depend on the laser type, and reading the datasheet might be helpful if available for your laser scanner And the light sensor (photodiode) has a wavelength and temperature-dependent sensitivity response: For deep learning purposes, lack of calibration of mid-quality targets (not the highest quality gold standard) could be used for “physics-true” data augmentation in addition to synthetic data augmentation Prof E. Fred Schubert "Light-Emitting Diodes" Second Edition (ISBN-13: 978-0-9863826-1-1) https://www.ecse.rpi.edu/~schubert/Ligh t-Emitting-Diodes-dot-org/chap06/chap06 .htm Ma et al. (2015) Spectral response characteristics of various IR detectors of different materials/technologies. Detectivity (vertical axis) is a measure of signal-to-noise ratio (SNR) of an imager normalized for its pixel area and noise bandwidth [Hamamatsu]. Sood et al. (2015) Room temperature emission spectra under He-Cd 325 nm laser excitation of ZnO:Pr (0.9%) fi lms prepared at di ff erent deposition temperatures and after annealing. The inset provides a closer picture of the main Pr 3+ emission peak, without the contribution Balestrieri et al. (2014)

Future COMMERCIAL LASER SCANNERS for ground truth

Pipeline Commercial Laser Scanners Low-cost LiDARs #1 A low-cost laser distance sensor Kurt Konolige ; Joseph Augenbraun ; Nick Donaldson ; Charles Fiebig ; Pankaj Shah Robotics and Automation, 2008. ICRA 2008. IEEE https://doi.org/10.1109/ROBOT.2008.4543666 Revo LDS. Approximate width is 10cm. Round carrier spins, holds optical module with laser dot module, imager, and lens. The Electrolux Trilobite, one of the only cleaners to make a map, relies on sonar sensors [Zunino and Christensen, 2002]. The barrier to using laser distance sensor (LDS) technology is the cost. The two most common devices, the SICK LMS 200 [ Alwan et al. 2005] and the Hokuyo URG- 04LX [Alwan et al. 2005], cost an order of magnitude more than the simplest robot In this paper we describe a compact, low-cost (~$30 cost to build) LDS that is as capable as standard LDS devices, yet is being manufactured for a fraction of their cost: the Revo LDS. Comparing low-cost 2D scanning Lidars May 28, 2017 https://diyrobocars.com/2017/05/28/comparing-low-cost-2d-scanning-lidars/ The RP Lidar A2 (left) and the Scanse Sweep (right). The RP Lidar A2 is the second lidar from Slamtec, a Chinese company with a good track record. Sweep is the first lidar from Scanse, a US company https://youtu.be/yLPM2BVQ2Ws

Pipeline Commercial Laser Scanners Low-cost LiDARs #2 Low-Cost 3D Systems: Suitable Tools for Plant Phenotyping Stefan Paulus, Jan Behmann, Anne-Katrin Mahlein, Lutz Plümer and Heiner Kuhlmann. Sensors 2014, 14(2), 3001-3018; doi: 10.3390/s140203001 The David laser scanning system (DAVID Vision Systems GmbH, Koblenz, Germany) is a low-cost scanning system [Winkelbach et al. 2006] consisting of a line laser pointer, a printed calibration field (Type CP-SET01, size DinA3) and a camera. A comparison study of different 3D low-cost laser scanners needs a reliable validation measurement. For this purpose, a commercial 3D laser triangulation system was used with a line laser scanner (Perceptron Scan Works V5, Perceptron Inc., Plymouth, MI, USA), coupled to an articulated measuring arm (Romer Infinite 2.0 (1.4 m), Hexagon Metrology Services Ltd., London UK Validation of a low-cost 2D laser scanner in development of a more-affordable mobile terrestrial proximal sensing system for 3D plant structure phenotyping in indoor environment Computers and Electronics in Agriculture Volume 140, August 2017, Pages 180-189 https://doi.org/10.1016/j.compag.2017.06.002 The principal components of the RPLIDAR laser scanner Illustrations of the two sets of RPLIDAR- collected points ((a) and (b)) and their registration (c).

Pipeline Commercial Laser Scanners Hand-held LiDARs GeoSLAM ZEB-REVOFARO Scanner Freestyle 3D X Price: $18k Sampling frequency: 43,2k points/ s 3D Measurement Accuracy: +/- 0.1% Maximum Range: Up to 30m (15m outdoors) http://informedinfrastructure. com/20812/next-generation-hand held-zeb-revo-scanner-adds-spe ed-and-simplicity/ Price: $13k Range: 0.5 – 3m Accuracy: <1.5mm Sampling frequency: 88k points / s

Pipeline Commercial Laser Scanners Drone/robot-based LiDARs http://www.riegl.com/products/unmanned-scanning/ Photogrammetry without a LiDAR with DJI Phantom https://www.youtube.com/watch?v=SATijfXnshg https://www.youtube.com/watch?v=BhHro_rcgHo Phoenix Aerial AL3-16 LiDAR Mapping system. The AL3-16 can be mounted on any UAV that can lift 2.5kgs. In this video, a DJI S1000 is used and for the demonstration, we flew over an over an open pit. You can see how quickly we generate a dense point cloud. You'll also see how LiDAR can pick up points underneath vegetation vs Photogrammetry will only map tree canopy. SCAN RATE: 300k shots/s, up to 600k points/s RECOMMENDED SCANNING HEIGHT: 60 m ACCURACY POSITION: 1cm + 1ppm RMS Laser Range: 107 m @ 60% Reflectivity Absolute Accuracy: 25 / 35 mm RMSE @ 50m Range PP Attitude Heading RMS Error: 0.009 / 0.017° IMU options https://www.phoenixlidar.com/al3-16/ Modular upgrade options: Dual LiDAR Sensors, DSLR, GeniCam, GigEVision, thermal, multispectral, hyperspectral & custom sensors

Pipeline Commercial Laser Scanners Mid-range LiDARs Velodyne Puck LITE Range: 100m Accuracy: +/- 30mm Sampling frequency: 300k points / s Angular Resolution (Vertical): 2.0° (Horizontal/Azimuth): 0.1° – 0.4° Integrated Web Server for Easy Monitoring and Configuration http://velodynelidar.com/vlp-16-lite.html Price: $8k 100m +/- 30mm 300k points / s 1.33° 0.1° – 0.4° http://velodynelidar.com/vlp-16-hi-res.html https://www.sli deshare.net/Fun k98/autonomous- vehicles-608454 49 https://www.recode.ne t/2015/10/27/11620026 /meet-the-companies-b uilding-self-driving- cars-for-google-and-t esla Gordon and Meidow (2013): “Standard deviation of the planar misclosure is nearly halved from 3.2 cm to 1.7 cm. The range residuals reveal that the manufactures standard deviation of the distance accuracy with 2 cm is a bit too optimistic.”

Pipeline Commercial Laser Scanners High-end LiDARs #1 FARO Focus S70 Riegl Scanner Selection Guide Leica Geosystems 08 Jun 2017— New FARO® Laser Scanner Focus3D X 30/130/330 firmware update for version 5.5.8.50259 enhances scan- quality and decreases noise on individual scans. Price: “More than the M70 (~$25k) and less than the S130 (~$60k)” Range: 0.6-70m Accuracy: +/- 1mm Sampling frequency: 1.0M points / s https://www.spar3d.com/news/lidar/faro-focus-s70-li dar-hits-1-mm-accuracy/ RIEGL VZ-400i ~$122k http://www.riegl.com/nc/products/terrestrial-scanning/produktdetail/product/scanner/48/ Accuracy of single measurement 3D Position Accuracy: 3 mm at 50 m; 6 mm at 100 m Linearity error: ≤ 1 mm Angular accuracy: 8º horizontal; 8º vertical Sampling frequency: 1.0M points / s Range: 0.4-120m

Pipeline Commercial Laser Scanners High-end LiDARs #2 Z+F Profile 5010C Gordon and Meidow (2013): “We inspect the Velodyne HDL-64E S2 system as the best-known representative for this kind of sensor system, while Z+F Imager 5010 serve as reference data.” Range: 0.3-119m Accuracy: +/- 1mm Sampling frequency: 1.0M points / s The benefits of the terrestrial 3D laser scanner Z+F IMAGER® 5010C: ● Integrated calibrated camera ● Best balanced colour imagery through ● Z+F High Dynamic Range Technology ● 80 Mpixel full panorama generation ● Extreme high speed laser sensor ● Total eyesafety guaranteed by laser class 1 ● IP 53 conformity: dust & water resistant Zoller + Fröhlich GmbH Price: ~ $82k

Future optomechanical considerations for the rig

OPTOmechanical components Reduce re-calibration efforts by having a high-quality tripod/rig Optomechanics Optomechanics are used to provide a variety of mounting or positioning options for a wide range of optical applications. Optomechanics include optical mounts, optical breadboards or tables, translation stages, or other components used to construct optical systems. Optical mounts are used to secure optical components, while optical breadboards, laboratory tables, or translation stages can create work spaces in which components may be stabilized or accurately positioned. Optomechanical components provide stability as well as accuracy to mounted optical components to increase the efficiency of optical systems. https://www.edmundoptics.com/optomechanics/ https://www.thorlabs.com/navigation.cfm?guide_id=2262 https://www.newport.com/c/opto-mechanics Manfrotto Virtual Reality Range https://www.manfrotto.co.uk/collections/supports/360- virtual-reality 360Precision Professional panoramic tripod head for virtual tours, real estate tours, professional photographers, HDRi, CGI matte backdrops, landscape http://www.360precision.com/360/index.cfm?precision=products. home&pageid=virtualtour www.aliexpress.com

RIG Rotating Mechanics for expensive sensors New FARO 3D Documentation Workflow Video - FARO Focus 3D Laser Scanner https://www.youtube.com/watch?v=4wcd7erim9U Terrestrial laser scanners typically can rotate by themselves When stacking multiple sensors on top of each other then becomes trickier Or alternatively when Kinects for example are so inexpensive, hey don’t need to rotate. Just use multiple Kinects at the same time on fixed positions Zaber™ Motorized Rotary Stage System €2,050.00 ● High Resolution with 360° Continuous Rotation ● Integrated Motor and Controller ● Controlled Manually or via RS-232 Serial Interface Lab-quality motorised rotary stages tend to be expensive Photography - Tripods & Support - Tripod Heads - Panoramic & Time Lapse Heads Consumer-quality heads cheaper static indoor scenes do not need exact syncing anyway Syrp Genie Mini Panning Motion Control System $249.00 Syrp Pan Tilt Bracket for Genie and Genie Mini allows you to create dynamic 2- axis and 3-axis time- lapse and real-time videos with varied combinations of Genie and Genie Minis. $89.00 Really Right Stuff PG-02 FG Pano-Gimbal Head with PG- CC Cradle Clamp $860.00 Manfrotto MH057A5 Virtual Reality and Panoramic Head (Sliding) $559.88

inexpensive sensors Kinect properties #1 Kinect v1 http://wiki.ros.org/kinect_calibration/technical 3D with Kinect Smisek et al. (2011): Kinect v2 Lachat et al. (2015) Since individual frame acquisitions suffer from noise inherent to the sensor and its technology, averaging successive frames is a possible improvement step to overcome this phenomenon. (a) Acquired scene; (b) Standard deviations calculated for each pixel of 100 successive depthmaps (colorbar in cm) Pre-heating Time test shows that the distance varies from 5 mm up to 30 minutes and becomes then almost constant (more or less 1 mm). A common way to assess the distance inhomogeneity consists on positioning the camera parallel to a white planar wall at different well- known distances. In our study, the wall has been surveyed beforehand with a laser scanner (FARO Focus3D) in order to assess his planarity with a device of assumed higher accuracy as the investigated sensor. FOV º px (per º) IR v1 58.5 x 46.6 5 x 5 IR v2 70.6 x 60 7 x 7 RGB v1 62 x 48.6 10 x 10 RGB v2 84.1 x 53.8 22 x 20 http://smeenk.com/kinect-field-of-view-comparison/ → http://www.smeenk.com/webgl/kinectfovexplorer.html Lens distortions of (a) IR camera and (b) RGB camera. The principal points are marked by x and the image centers by +. Accuracy and Resolution of Kinect Depth Data for Indoor Mapping Applications. Sensors (Basel). 2012; 12(2): 1437–1454. doi: 10.3390/s120201437 http://pterneas.com/2014/02/08/kinect-for-windows-version-2-overview/

inexpensive sensors Kinect properties #2 Kinect v1 https://openkinect.org/wiki/Hardware_info#Datasheets RGB Camera - MT9M112 RGB Camera – MT9v112 Depth Camera - MT9M001 Infrared emitter wavelengths v1: λmax ~ 827nm [blogs.msdn.microsoft.com] v2: λmax ~ 860nm [social.msdn.microsoft.com, Lembit Valgma BSc. Thesis] Kinect v2 depth camera From Payne et al. (2014) Camera sensor size: Why does it matter and exactly how big are they? http://newatlas.com/camera-sensor-size-guide/26684/ fm multiplier: 36/22.2 ~1.6 fm multiplier: 36/4.54 ~7.9 fm ~2.54 fm ~5.41 ~10.0 fm https://msdn.microsoft.com/en-us/library/hh973074.aspx http://wiki.ipisoft.com/User_Guide_for_Dual_Depth_Sensor_Configuration

Kinect Placement Cover the scene with multiple sensors #1 4 x Kinect V2s (~$160) 4 x Kinect V2s (45º rotation) Full overlap of the scene could be achieved with already 6 sensors (6 x 70.6º ~ 423.6º). For less distorted panorama stitch, some extra Kinects might be helpful though 3D human reconstruction and action recognition from multiple active and passive sensors http://mmv.eecs.qmul.ac.uk/mmgc2013/ Berkeley Multimodal Human Action Database (MHAD) http://tele-immersion.citris-uc.org/berkeley_mhad Four Microsoft Kinects for Windows v2.0 https://kinectingoculus.wordpress.com/

Kinect Placement for multimodal tweaks Capture setup. In (a) a standard camera with a polarizing filter (linear polarizer with quarter-wave plate, model Hoya CIR-PL) is used to photograph a diffuse sphere under different filter rotations. The captured photographs in the bottom row look similar, but in (b), a sinusoidal pattern is observed when a single pixel is plotted against filter angle. The phase encodes azimuth angle and the amplitude and offset encode zenith angle. http://www.media.mit.edu/~achoo/polar3D | Kadambi et al. (2015) The fixed layout gets tricky with the advanced polarization-enhanced Kinect depth sensing as the polarization filter would need to be manually rotated for each polarizer angle for each Kinect http://www.hoyafilter.com/hoya/products/hdfilters/hdfiltercirpl/ 8MPR16-1 motorized polarizer mount provides smooth, high accuracy, high repeatability and stability continuous 360° rotation of the polarization optical components with diameters up to 1’ (25.4mm) or 1/2’ (12.7mm), Motorized polarizer rotator increases cost

Kinect mount commercial options A mount designed to attach a Kinect v2 sensor to a video camera (DSLR) for calibrated filming. The kit includes an aluminum base with an attached quick release assembly and two lasercut acrylic arms that are designed to accommodate the Kinect v2. The bottom of the aluminum cube and the top of the quick release plate both use a standard threaded 1/4"-20 tripod mount. http://www.depthkit.tv/hardware/kinect-v2 $75.00 Kinect V2 DepthKit Mount Xbox One Kinect wall mount December 28, 2013 by Tim Greenfield https://programmerpayback.com/2013/12/28/xbox-one-kinect-wall-mount/ VideoSecu 1/4" x 20 Threads Swivel Security Camera Mount. $8.89 RGBDToolkit Aluminum Mount for Kinect & DSLR/video Camera byalexicon3000 in xbox http://www.instructables.com/id/RGBD-Toolkit-Alumi num-Mount-for-Kinect-DSLRvide/ ORB 2-in-1 Kinect Sensor and PlayStation Move Camera Support (for PS3 and Xbox 360) http://www.game.co.uk/en/orb-2-in-1-kinect-sensor-and-p laystation-move-camera-support-for-ps3-and-xbox-360-632 500?pageSize=40&vendorShopId=2255&cm_mmc=google-shoppin g-_-atronica-_-allproducts-_-1p $28 Geerse et al. (2015)

Multiple sensor Mount with motorized rotation? 1 2 3

Multiple sensor Mount with motorized rotation #2 Luckily we are interested in scanning static indoor scenes, and the scanning does not have to happen exactly at same time. We are okay with pushing a button once, leaving the scan to be completed automatically in sequential order, e.g. 1) TLS, 2) Matterport, and then 3) the low-cost sensor + DSLR unit. We could register the scans in the post-processing phase with known transformation (calibrated) between the sensors Weights (approx.) Faro Focus S70 4.2 kg (9.3 lb) Velodyne Puck Lite 0.59 kg (1.3 lb) Matterport Pro2 3.4 kg (7.5 lb) Canon EOS 1300D 0.44 kg (0.97 lb) RBB150A/M Customer Inspired! Ø150 mm Rotating Breadboard £488.00 ($630) RBB300A/M Customer Inspired! Ø300 mm Rotating Breadboard£638.00 ($822) RBB450A/M Customer Inspired! Ø450 mm Rotating Breadboard£713.00 ($920) RBBA1/M Ø91 mm, SM1-Threaded Center Panel, M6 Taps£41.00 https://www.thorlabs.com/newgrouppage9.cfm? objectgroup_id=1313 https://www.thorlabs.com/newgrouppage9.cfm?objec tgroup_id=6421 https://www.thorlabs.com/newgrouppage9.cfm?objectgroup _id=10242 MSB30/M NEW!300 mm x 300 mm Mini-Series Aluminum Breadboard, M4 and M6 High-intensity Taps £199.19 https://www.thorlabs.com/newgrouppage9.cfm?objectgroup_id=6327

Multiple sensor Mount with motorized rotation #3 http://www.panoramic-photo-guide.com/p anoramic-heads-on-the-market.html Starworks Panoramic Head http://www.star-works.it/360_pan_head_pro_ eng.html $470 Control: Browser-based control via WLAN/LAN Max. Payload: 20.0 kg (44.1 lbs) http://dr-clauss.de/en/applications/virtual-tours STEVENBRACE.CO.UK DIY Arduino Motorised Time-Lapse head http://stevenbrace.co.uk/2012/06/diy-arduino-motorised-time-lapse-head/

Multiple sensor Mount with motorized rotation #4 Pan+Tilt Moving Heads if you are into bricolage As suggested by Dr. Benjamin Lochocki https://www.thomann.de/gb/moving_heads_spot.html?oa=pra Open Source Lighting Control Software The Open Lighting Project's goal is to provide high quality, open source lighting control software for the entertainment lighting industry. https://www.openlighting.org/ Described as the Travel Adaptor for the Lighting Industry, the Open Lighting Architecture provides a flexible way to interconnect different lighting equipment, allowing you to spend less time dealing with low-level communication issues and more time designing the show. It supports all the major DMX-over-Ethernet protocols including ArtNet and Streaming ACN, and over 20 different USB DMX dongles. The Open Lighting Project has developed a number tools for debugging & testing RDM implementations. From the Automated RDM Responder Tests to RDM Packet Analyzers, testing RDM gear has never been easier. A step-by-step tutorial is available on installing OLA on a Raspberry Pi. With the right hardware, you can be running OLA within minutes. Open Lighting Project would allow motor control over DMX512 / RDM interface via USB (e.g. Enttec DMX USB Pro) /Arduino using Python for example if programmatic control becomes an issue to you

Dataset creation for Deep Learning-based Geometric Computer Vision problems

More Related Content

What's hot

Similar to Dataset creation for Deep Learning-based Geometric Computer Vision problems

More from PetteriTeikariPhD

Recently uploaded

Dataset creation for Deep Learning-based Geometric Computer Vision problems