Fusion of GPS and Visual SLAM
to improve localization of autonomous vehicles
in urban environments.
by Adam Kalisz
Degree of Doctor of Philosophy (207 pages)
Degree of Doctor of Science in Technology (181 pages)
Degree of Master of Science (135 pages)
Master of Applied Science (124 pages)
but: Some other time!
Motivation:
Increasing number of pictures
Contribution:
Fully open source (!); integration with Blender; an evaluation tool for research
Term "matchmoving":
Used in the film industry: recover the real camera's motion from footage so that CG elements can be composited convincingly
Requirements:
Free, easy, documented, auto-testing, automated
Related work:
Commercial (Boujou, PFTrack, Syntheyes, ...)
and Open (VXL, OpenCV)
Evaluation:
real (range scanning vs. optical flow)
and synthetic (Maya vs. Blender) data sets
Problem:
given a video of a scene -> determine the pose of the camera and the 3D locations of some points
Solution:
at the highest level, see the next picture
Find "good" features (2D)
Track features across frames (2D)
Discussion: Points vs. lines or curves or planes
No clear "best" strategy to recover
cameras and structure
Closed-form solutions
If perfect data: Use (B) then (D).
But...!
I) Tracks are unreliable: Outliers and drift
-> Solution: robust fitting (e.g. RANSAC)
-> Example: randomly pick six correspondences across three frames (six 2D-2D-2D triplets) and solve the closed form; increase the score of the hypothesis for every track whose reprojected 3D point lands close to its measurement; repeat and keep the best score (see the sketch after this list)
II) Tracks are short: no longer than a few seconds
-> Solution: merge multiple "triplets" of tracks into one
-> Pollefeys et al. (2004): grow tracks by alternating closed-form solutions
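A minimal sketch of that RANSAC loop, with the closed-form solver and the reprojection error passed in as callables (solve_fn, error_fn, and all names here are my assumptions, not libMV's API):

import numpy as np

def ransac(tracks, solve_fn, error_fn, rounds=100, threshold=2.0, seed=0):
    """Generic RANSAC driver: solve_fn maps 6 sampled correspondences to a
    candidate model (or None if degenerate); error_fn gives the reprojection
    error of one track under a model."""
    rng = np.random.default_rng(seed)
    best_model, best_score = None, -1
    for _ in range(rounds):
        sample = rng.choice(len(tracks), size=6, replace=False)
        model = solve_fn([tracks[i] for i in sample])
        if model is None:
            continue
        # Score the hypothesis: one vote per track that reprojects closely.
        score = sum(error_fn(model, t) < threshold for t in tracks)
        if score > best_score:
            best_model, best_score = model, score
    return best_model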
Critical issues:
For end user: Set of command line tools (9 to 10 steps)
$ mplayer -vo pnm:pgm video.avi
$ track_sequence --pattern="%08d.pgm" --output="video.ts.json" --num_features=500 --start=1 --end=100
$ correct_radial_distortion --track="video.ts.json" --output="video_corrected.ts.json" --intrinsics="CanonSD500.k.json"
$ pick_keyframes --track="video_corrected.json" --output="video.kf.json"
$ reconstruct_subsets --track="video_corrected.json" --bundle_subsets=true --keyframes_file="video.kf.json" --output="video.prs.json" --ransac_rounds=100
$ merge_subsets --track="video_corrected.json" --subsets="video.prs.json" --output="video.pr.json" --bundle_subsets=true
$ resection --track="video_corrected.json" --reconstruction="video.prs.json" --output="video_resectioned.pr.json" --bundle_subsets=true
$ metric_upgrade --track="video_corrected.json" --reconstruction="video_resectioned.prs.json" --intrinsics="CanonSD500.k.json" --output="video.mr.json"
$ export_blender --track="video_corrected.json" --metric_reconstruction="video.mr.json" --intrinsics="CanonSD500.k.json" --output="video.py"
Projective Geometry
It's really just homogeneous coordinates
Points at infinity (line intersection)
Represent each point of ${\rm I\!R}^2$ by a vector in ${\rm I\!R}^3$
(up to a scale factor)
=> $[4x; 4y; 4]^T$ is the same point as $[x; y; 1]^T$ (a last coordinate of 0 denotes a point at infinity)
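A worked example of the line-intersection point (my own numbers): the parallel lines $x = 1$ and $x = 2$ are $l_1 = (1, 0, -1)^T$ and $l_2 = (1, 0, -2)^T$ in homogeneous form; their intersection $l_1 \times l_2 = (0, 1, 0)^T$ has last coordinate 0, i.e. the lines meet at a point at infinity.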
Standard Euclidean projection
$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = K[R(\hat{X}-C)]$, $\begin{bmatrix} \hat{x} \\ \hat{y} \end{bmatrix} = \begin{bmatrix} x / z \\ y / z \end{bmatrix}$
translate and rotate the 3D points $\hat{X}$
scale by the intrinsics matrix $K$
$K = \begin{bmatrix} \mathrm{focal}_x & \mathrm{skew} & \mathrm{center}_x \\ 0 & \mathrm{focal}_y & \mathrm{center}_y \\ 0 & 0 & 1 \end{bmatrix}$
Without zooming, K is fixed for all images.
$x^{''} = x^{'}(1 + k_1 r^2 + k_2 r^4) + 2p_1x^{'}y^{'} + p_2(r^2 + 2x^{'2})$
$y^{''} = y^{'}(1 + k_1 r^2 + k_2 r^4) + 2p_2x^{'}y^{'} + p_1(r^2 + 2y^{'2})$
$(r^2 = x^{'2} + y^{'2})$
Normalized image coords: $x^{'}$, $y^{'}$
radial distortion $k_{1,2}$
tangential distortion $p_{1,2}$
Perspective division -> distortion -> intrinsics
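A sketch of the full projection in exactly that order, using the definitions above (the function and parameter names are mine, not libMV's):

import numpy as np

def project(X, R, C, K, k1=0.0, k2=0.0, p1=0.0, p2=0.0):
    """Project a 3D point X: rotate/translate, perspective-divide,
    distort, then apply the intrinsics matrix K."""
    Xc = R @ (X - C)                            # camera coordinates
    xp, yp = Xc[0] / Xc[2], Xc[1] / Xc[2]       # perspective division
    r2 = xp**2 + yp**2
    radial = 1 + k1 * r2 + k2 * r2**2
    xpp = xp * radial + 2*p1*xp*yp + p2*(r2 + 2*xp**2)   # distortion
    ypp = yp * radial + 2*p2*xp*yp + p1*(r2 + 2*yp**2)
    u = K @ np.array([xpp, ypp, 1.0])           # intrinsics
    return u[:2]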
KLT is a more efficient way to minimize:
$\epsilon(\mathbf{d}) = \frac{1}{2} \iint_{W} [J(\mathbf{x} + \mathbf{d}) - I(\mathbf{x})]^2 \, w(\mathbf{x}) \, d\mathbf{x}$
$I(\cdot)$ gray value in frame 1, $J(\cdot)$ gray value in frame 2, $\mathbf{x}$ feature position in frame 1, $\mathbf{d}$ displacement, $W$ the window, $w(\cdot)$ a weighting function
Cross-correlation search: extremely slow!
KLT uses a Taylor expansion to reduce the problem
Remaining task: solve a 2x2 linear system
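A rough single-iteration sketch of that 2x2 system, assuming grayscale numpy images I, J and a feature whose window lies inside the image (my helper, not libMV's tracker):

import numpy as np

def klt_step(I, J, x, y, half_window=7):
    """One Lucas-Kanade iteration: solve the 2x2 system G d = e for the
    displacement d of the window centered at (x, y)."""
    win = np.s_[y - half_window:y + half_window + 1,
                x - half_window:x + half_window + 1]
    # Spatial gradients of the first frame (central differences).
    gy, gx = np.gradient(I.astype(float))
    gx, gy = gx[win], gy[win]
    # Temporal difference between the two frames.
    gt = (J.astype(float) - I.astype(float))[win]
    # 2x2 structure matrix G and right-hand side e from the Taylor expansion.
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    e = -np.array([np.sum(gx * gt), np.sum(gy * gt)])
    return np.linalg.solve(G, e)                # displacement (dx, dy)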
When camera calibration information is available: correct the radial distortion up front (cf. correct_radial_distortion above)
If not: incorporate the distortion terms into the final bundle adjustment
(Calibration itself was not part of libMV; OpenCV was used instead)
Two metrics used by libMV:
the three-frame, six-point closed-form solution described by Schaffalitzky et al. [2000]
it fails miserably if even one of the six points is in error (an "outlier")
therefore the closed-form solution is used as a RANSAC driver
Assuming the tracked points have isotropic additive Gaussian noise, the Rayleigh distribution is a good model of the magnitude of the reprojection errors.
$p(x; \sigma_i, \sigma_o, \gamma) = \gamma\, p(x; \sigma_i) + (1 - \gamma)\, p(x; \sigma_o)$
$x$ = magnitude of the residual error, $\sigma_i$ variance of the inliers, $\sigma_o$ variance of the outliers, $\gamma$ inlier fraction
AMLESAC fits the above mixture via expectation maximization (EM) and scores on the inlier variance
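A minimal EM sketch for this two-component Rayleigh mixture (initialization values are my own guesses; this is not AMLESAC's actual procedure):

import numpy as np

def rayleigh_pdf(x, sigma):
    return (x / sigma**2) * np.exp(-x**2 / (2 * sigma**2))

def fit_rayleigh_mixture(x, iters=50):
    """EM for the mixture above; returns inlier scale, outlier scale,
    and inlier fraction gamma."""
    s_in, s_out, gamma = 1.0, 10.0, 0.5        # rough initial guesses
    for _ in range(iters):
        # E-step: responsibility of the inlier component per residual.
        pi = gamma * rayleigh_pdf(x, s_in)
        po = (1 - gamma) * rayleigh_pdf(x, s_out)
        r = pi / (pi + po)
        # M-step: weighted Rayleigh maximum-likelihood updates.
        s_in = np.sqrt(np.sum(r * x**2) / (2 * np.sum(r)))
        s_out = np.sqrt(np.sum((1 - r) * x**2) / (2 * np.sum(1 - r)))
        gamma = np.mean(r)
    return s_in, s_out, gamma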
After the coarse reconstruction from the closed-form solutions, all reconstruction parameters are adjusted simultaneously to minimize the reprojection error
=> nothing more than nonlinear minimization
Many details are skipped here; they could fill a separate talk.
Mostly it comes down to an intelligent way of penalizing errors, as sketched below
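A conceptual sketch of that minimization as generic nonlinear least squares (the parameter packing, the 6-dof camera model, and project() are my placeholders, not libMV's internals):

import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, observations, project):
    """Unpack all cameras and points from one flat vector and stack the
    2D reprojection residuals of every observation."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)   # e.g. 3 rotation + 3 translation
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    res = []
    for cam_i, pt_i, uv in observations:            # (camera idx, point idx, 2D point)
        res.extend(project(cams[cam_i], pts[pt_i]) - uv)
    return np.asarray(res)

# A robust loss such as 'soft_l1' is one way to penalize errors
# "intelligently" (down-weighting outliers):
# result = least_squares(reprojection_residuals, x0, loss='soft_l1',
#                        args=(n_cams, n_pts, observations, project))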
This is an important step for reconstruction.
libMV implements the one-frame overlap method
(one camera in common)
In practice, merging projective reconstructions is quite unreliable.
Hence, score subsets of 2 points from the set of 3D correspondences and test repeatedly.
Again, LMedS, RANSAC, MLESAC, etc. are used
Nonlinear refinement is done with Levenberg-Marquardt
Hierarchical merging (a binary tree, bottom up) instead of sequential merging, so that errors do not accumulate at the end of the reconstruction
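A sketch of that binary-tree merge order, with merge_pair standing in for the (assumed) one-frame-overlap merge of two reconstructions:

def merge_hierarchically(subsets, merge_pair):
    """Merge pairwise, level by level (binary tree, bottom up), instead of
    folding subsets in one by one; errors then spread evenly instead of
    accumulating at the end."""
    while len(subsets) > 1:
        next_level = [merge_pair(a, b) for a, b in zip(subsets[::2], subsets[1::2])]
        if len(subsets) % 2:                    # odd element is carried up a level
            next_level.append(subsets[-1])
        subsets = next_level
    return subsets[0]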
Recover remaining cameras to complete reconstruction
Each correspondence produces two equations expressed as a matrix
Compute missing camera matrix via SVD
Use RANSAC again to remove outliers
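A sketch of that resection step as a standard direct linear transform (my helper, not libMV's code); at least six 2D-3D correspondences are needed:

import numpy as np

def resection_dlt(points3d, points2d):
    """Estimate a 3x4 camera matrix from 2D-3D correspondences (DLT).
    Each correspondence contributes two rows to the design matrix A;
    the camera is the null vector of A, found via SVD."""
    A = []
    for X, x in zip(points3d, points2d):
        Xh = np.append(X, 1.0)                  # homogeneous 3D point
        u, v = x
        A.append(np.concatenate([Xh, np.zeros(4), -u * Xh]))
        A.append(np.concatenate([np.zeros(4), Xh, -v * Xh]))
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)                 # smallest singular vector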
"Projective reconstruction" is superset
of "Euclidean (metric) reconstruction"
Projective camera: $(K,R, \textbf(t))$
Euclidean camera: $P = K[R|\textbf(t)]$
(homogeneous matrix)
Goal: Find upgrading transformation ($H$) from projective reconstruction ($P$) and camera intrinsics ($K$).
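For context, the ambiguity this resolves can be stated in one line (a standard multiple-view-geometry fact, not from the slides): for any invertible $4 \times 4$ matrix $H$, $(P_i H)(H^{-1} X_j) = P_i X_j$, so every choice of $H$ reproduces the same images; the metric upgrade seeks the $H$ for which each $P_i H$ factors as $K_i [R_i \mid \mathbf{t}_i]$.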
Not supported back then: autocalibration (included now).
The metric upgrade alone produces bad quality
Reason: the rotation components of the camera matrices were forced to be orthogonal
Effect: this drastically changes the projected image coordinates
Solution: bundle adjustment (same as projective BA, but parameterized with Euler angles)
Evaluation against ground truth is hard because of the translation, scale, and rotation ambiguity (a similarity transform)
Example: Photo of car against white background -> is this a real car or a 10cm model?
Solution: Scale and align with ground truth data
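A sketch of that alignment as a standard similarity (Umeyama) fit, assuming 3xN point sets X (reconstruction) and Y (ground truth) in correspondence; my helper, not the evaluation tool itself:

import numpy as np

def align_similarity(X, Y):
    """Find scale s, rotation R, translation t minimizing ||s R X + t - Y||."""
    mx, my = X.mean(axis=1, keepdims=True), Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mx, Y - my
    U, S, Vt = np.linalg.svd(Yc @ Xc.T)
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:               # avoid reflections
        D[2, 2] = -1
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / np.sum(Xc**2)
    t = my - s * R @ mx
    return s, R, t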
Synthetic: Point cloud from 3D scene, projected into image space, perturbed
Autocalibration: in practice, the final reconstruction is more important to evaluate than the recovered intrinsics
Rendered images: Advantages like ground truth and control parameters (exposure, noise, distortion)
Blender community helped to provide renderings for evaluation
N-view instead of 3-view reconstruction gives better results
Manual selection of keyframes gives better results than automatic selection
Pressure is growing!