A survey of structure from motion

Published online by Cambridge University Press: 05 May 2017

Onur Özyeşil
Affiliation:
INTECH Investment Management LLC, One Palmer Square, Suite 441, Princeton, NJ 08542, USA E-mail: oozyesil@intechjanus.com
Vladislav Voroninski
Affiliation:
Helm.ai, Menlo Park, CA 94025, USA E-mail: vlad@helm.ai
Ronen Basri
Affiliation:
Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, 76100, Israel E-mail: ronen.basri@weizmann.ac.il
Amit Singer
Affiliation:
Department of Mathematics and PACM, Princeton University, Princeton, NJ 08544-1000, USA E-mail: amits@math.princeton.edu

Abstract

The structure from motion (SfM) problem in computer vision is to recover the three-dimensional (3D) structure of a stationary scene from a set of projective measurements, represented as a collection of two-dimensional (2D) images, via estimation of motion of the cameras corresponding to these images. In essence, SfM involves the three main stages of (i) extracting features in images (e.g. points of interest, lines, etc.) and matching these features between images, (ii) camera motion estimation (e.g. using relative pairwise camera positions estimated from the extracted features), and (iii) recovery of the 3D structure using the estimated motion and features (e.g. by minimizing the so-called reprojection error). This survey mainly focuses on relatively recent developments in the literature pertaining to stages (ii) and (iii). More specifically, after touching upon the early factorization-based techniques for motion and structure estimation, we provide a detailed account of some of the recent camera location estimation methods in the literature, followed by discussion of notable techniques for 3D structure recovery. We also cover the basics of the simultaneous localization and mapping (SLAM) problem, which can be viewed as a specific case of the SfM problem. Further, our survey includes a review of the fundamentals of feature extraction and matching (i.e. stage (i) above), various recent methods for handling ambiguities in 3D scenes, SfM techniques involving relatively uncommon camera models and image features, and popular sources of data and SfM software.
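To make the objective in stage (iii) concrete, the following is a minimal sketch of a reprojection error, assuming a calibrated pinhole camera model in which a 3D point X projects to pixel coordinates via K(RX + t). The function names `project` and `reprojection_error`, the data layout, and the toy numbers are illustrative assumptions and are not taken from the survey.

```python
def project(K, R, t, X):
    """Project a 3D point X to pixel coordinates with an assumed pinhole model."""
    # Camera-frame coordinates: Xc = R X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Apply the intrinsic matrix K, then divide by depth
    x = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    return (x[0] / x[2], x[1] / x[2])


def reprojection_error(K, cameras, points, observations):
    """Sum of squared pixel distances between observed and predicted features.

    cameras:      list of (R, t) pairs, one per image
    points:       list of 3D points
    observations: list of (camera_index, point_index, (u, v)) feature matches
    """
    total = 0.0
    for cam_idx, pt_idx, (u, v) in observations:
        R, t = cameras[cam_idx]
        pu, pv = project(K, R, t, points[pt_idx])
        total += (u - pu) ** 2 + (v - pv) ** 2
    return total


# Toy example (hypothetical values): two cameras observing one point.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [(I3, [0.0, 0.0, 0.0]), (I3, [-1.0, 0.0, 0.0])]
points = [[0.0, 0.0, 4.0]]
# The second observation is off by one pixel from its exact projection (25, 50).
observations = [(0, 0, (50.0, 50.0)), (1, 0, (26.0, 50.0))]
error = reprojection_error(K, cameras, points, observations)
```

In a full pipeline this quantity would typically be minimized jointly over the camera parameters and the 3D points with a nonlinear least-squares solver; the sketch only evaluates the cost for fixed estimates.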

Type
Research Article
Copyright
© Cambridge University Press, 2017 
