Technological Innovation, Transforming the Future

Image and Video Processing with Deep Learning
See More Clearly, Understand More Deeply

Contents
1. Night-Scene Enhancement
2. Image and Video Deblurring
3. Video Super-Resolution

1. Night-Scene Image Enhancement

Image Enhancement
Taking photos is easy.
Amateur photographers typically create underexposed photos.
Photo enhancement is required.

Existing Photo Editing Tools
[Figure: Input | "Auto Enhance" on iPhone | "Auto Tone" in Lightroom | Ours]

Previous Work
Retinex-based methods
  LIME: TIP 17
  WVM: CVPR 16
  JieP: ICCV 17
Learning-based methods
  HDRNet: SIGGRAPH 17
  White-Box: ACM TOG 18
  Distort-and-Recover: CVPR 18
  DPE: CVPR 18

Limitations of Previous Methods
[Figure: Input | WVM (CVPR 16) | JieP (ICCV 17) | HDRNet (SIGGRAPH 17) | DPE (CVPR 18) | White-Box (TOG 18) | Distort-and-Recover (CVPR 18) | Ours]

Illumination maps for natural images typically have relatively simple forms with known priors. The model enables customizing the enhancement
results by formulating constraints on the illumination.

Why This Model?
Advantage: effective learning and efficient learning

Network Architecture

Ablation Study
[Figure: Input | Naïve Regression | Expert-retouched]

Our Dataset
Motivation: The benchmark dataset is collected for enhancing general photos rather than underexposed photos, and it contains only a small number of underexposed images that cover limited lighting conditions.

Quantitative Comparison: Our Dataset
Method                          PSNR    SSIM
HDRNet                          26.33   0.743
DPE                             23.58   0.737
White-Box                       21.69   0.718
Distort-and-Recover             24.54   0.712
Ours w/o …, w/o …, w/o …        27.02   0.762
Ours with …, w/o …, w/o …       28.97   0.783
Ours with …, with …, w/o …      30.03   0.822
Ours                            30.97   0.856

Quantitative Comparison: MIT-Adobe FiveK
Method                          PSNR    SSIM
HDRNet                          28.61   0.866
DPE                             24.66   0.850
White-Box                       23.69   0.701
Distort-and-Recover             28.41   0.841
Ours w/o …, w/o …, w/o …        28.81   0.867
Ours with …, w/o …, w/o …       29.41   0.871
Ours with …, with …, w/o …      30.71   0.884
Ours                            30.80   0.893

Visual Comparison: Our Dataset
[Figure: Input | JieP | HDRNet | DPE | White-Box | Distort-and-Recover | Our result | Expert-retouched]

Visual Comparison: MIT-Adobe FiveK
[Figure: Input | JieP | HDRNet | DPE | White-Box | Distort-and-Recover | Our result | Expert-retouched]

More Comparison Results: User Study
[Figure: Input | WVM | JieP | HDRNet | DPE | White-Box | Distort-and-Recover | Our result]

Limitation
[Figure: Input | Our result]
Presenter notes (2019-05-08 03:51:53): Our work also has some limitations. The first involves regions that are almost black, without any trace of texture; see the top two images. The second is that our method does not remove noise in the enhanced result.
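The role of the illumination map can be illustrated with a small, self-contained sketch. It assumes the standard Retinex image model (observed image = well-exposed image × illumination), which the slides do not spell out, and it uses a hand-written illumination guess in place of any learned network. The function names, the tiny test image, and the particular constraint (keeping the illumination between the brightest channel of each pixel and 1) are illustrative assumptions, not the method presented in this talk. PSNR, one of the two metrics in the tables above, is used to score the result.

```python
import math

def underexpose(reference, illumination):
    """Assumed Retinex model: observed I = S * L, applied per pixel."""
    return [[tuple(c * l for c in px) for px, l in zip(row, l_row)]
            for row, l_row in zip(reference, illumination)]

def constrain(image, illumination):
    """Illustrative constraint: keep L in [max channel of I, 1] so I / L stays in [0, 1]."""
    return [[min(max(l, max(px), 1e-3), 1.0) for px, l in zip(row, l_row)]
            for row, l_row in zip(image, illumination)]

def enhance(image, illumination):
    """Invert the model, S = I / L, using the constrained illumination."""
    safe = constrain(image, illumination)
    return [[tuple(c / l for c in px) for px, l in zip(row, l_row)]
            for row, l_row in zip(image, safe)]

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two same-sized RGB images."""
    errors = [(x - y) ** 2
              for row_a, row_b in zip(a, b)
              for px_a, px_b in zip(row_a, row_b)
              for x, y in zip(px_a, px_b)]
    mse = sum(errors) / len(errors)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

# Hypothetical 2x2 RGB reference image with values in [0, 1].
reference = [[(0.8, 0.4, 0.2), (0.6, 0.6, 0.3)],
             [(0.2, 0.5, 0.9), (0.7, 0.1, 0.4)]]
true_light = [[0.20, 0.25], [0.30, 0.35]]   # smooth, simple illumination map
dark = underexpose(reference, true_light)    # simulated underexposed photo
guess = [[0.25, 0.25], [0.30, 0.30]]         # imperfect illumination estimate
restored = enhance(dark, guess)

print("PSNR of underexposed input:", round(psnr(dark, reference), 2))
print("PSNR after enhancement:    ", round(psnr(restored, reference), 2))
```

Because the enhancement is a per-pixel division, the quality of the result depends entirely on the illumination estimate, and a constraint on that estimate directly bounds the output range.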
More Results
[Figures, four slides: Input | White-Box | Distort-and-Recover | Our result | Expert-retouched | JieP | HDRNet | DPE]
[Figures, two slides: Input | WVM | JieP | HDRNet | DPE | White-Box | Distort-and-Recover | Our result]
[Figures, two slides: Our result | iPhone | Lightroom | Input]

2. Video Super-Resolution

Motivation
Old and fundamental
  Several decades ago (Huang et al., 1984) to recent years
Many applications
  HD video generation from low-res sources
  Video enhancement with details
  Text/object recognition in surveillance videos
Presenter notes: The target of video super-resolution is to increase the resolution of videos with rich details. It is an old and fundamental problem that has been studied for several decades. Video SR enables many applications, such as high-definition video generation from low-res sources. Another is video enhancement with details: in this example, the characters on the roof and the textures of the tree are much clearer in the SR result than in the input. It can also benefit text or object recognition in low-quality surveillance videos: in this example, the numbers on the car become recognizable only in the super-resolved result.

Previous Work
Image SR
  Traditional: Freeman et al., 2002; Glasner et al., 2009; Yang et al., 2010; etc.
  CNN-based: SRCNN (Dong et al., 2014); VDSR (Kim et al., 2016); FSRCNN (Dong et al., 2016); etc.
Video SR
  Traditional: 3DSKR (Takeda et al., 2009); BayesSR (Liu et al., 2011); MFSR (Ma et al., 2015); etc.
  CNN-based: DESR (Liao et al., 2015); VSRNet (Kappeler et al., 2016); Caballero et al., 2016; etc.
Presenter notes: Many methods have previously been proposed for super-resolution. We list several representative ones here.

Remaining Challenges
Effectiveness
  How to make good use of multiple frames?
[Figure: Bicubic x4, annotated with misalignment, large motion, and occlusion. Data from Vid4, Ce Liu et al.]
Presenter notes: Although video SR has long been studied, challenges remain in this task. The most important one is effectiveness: how do we make good use of multiple frames? As shown in this example, objects in neighboring frames are not aligned. And in some
extreme cases, there is even large motion or occlusion, which is very hard to handle. So are multiple frames useful or harmful to super-resolution?

Remaining Challenges
Effectiveness
  How to make good use of multiple frames?
  Are the generated details real?
[Figure: Image SR | Bicubic x4]
Presenter notes: On the other hand, are the generated details real? CNN-based SR methods incorporate external data. Using only one frame, they can also produce sharp structures. In this example, on the right-hand side, one SR method generates clear window patterns on the building, but they are far from the real ones shown on the left. The problem is that details from external data may not be true for the input image.
[Figure, same slide: Image SR | Truth]

Remaining Challenges
Effectiveness
  How to make good use of multiple frames?
  Are the generated details real?
Model issues
  One model for one setting
  Intensive parameter tuning
  Slow
[Figure: VDSR (Kim et al., 2016) | ESPCN (Shi et al., 2016) | VSRNet (Kappeler et al., 2016)]
Presenter notes: There are also model issues in current methods. For all recent CNN-based SR methods, model parameters are fixed for a certain scale factor or number of frames. If you want to change the scale factor, you need to change the network configuration and train another model. In addition, most traditional video SR methods involve intensive parameter tuning and may be slow. All the issues mentioned above prevent these methods from practical use.

Our Method
Advantages
  Better use of sub-pixel motion
  Promising results both visually and quantitatively
Fully scalable
  Arbitrary input size
  Arbitrary scale factor
  Arbitrary temporal frames
Presenter notes: The goals of our method are as follows. We try to make better use of sub-pixel motion between frames and to produce high-quality results with real details. We also want the framework to be fully scalable in terms of input image size, scale factor, and number of frames.

[Video example. Data from Vid4, Ce Liu et al.]
Presenter notes: Here is one video example. Characters, numbers, and textures are hard to recognize in the bicubic result
and much clearer in ours.

Our Method
Motion Estimation
[Diagram: ME]
Presenter notes: Due to the time limit, we describe our method only briefly here. The audience is welcome to visit our poster session for more details. Our method contains 3 components. The first module is a motion estimation network. It takes 2 low-res images as input and outputs a low-res motion field.

Our Method
Sub-pixel Motion Compensation (SPMC) Layer
[Diagram: ME | SPMC]
Presenter notes: The second module is newly designed. We call it the sub-pixel motion compensation layer. It takes as input the i-th low-res frame and the estimated motion field, and it outputs a high-res image. Unlike previous methods, this layer achieves resolution enhancement and motion compensation simultaneously, which better preserves sub-pixel information in the frames.

Our Method
Detail Fusion Net
[Diagram: ME | SPMC | Encoder | ConvLSTM | Decoder | skip connections]
Presenter notes: In the last stage, we design a Detail Fusion Network to combine all frames. We use an encoder-decoder structure in this
module, since it has proved very effective in image regression tasks. Skip connections are used for better convergence. The important change is that we insert a ConvLSTM module inside the network. It is a natural choice, since we are handling sequential inputs and want to utilize temporal information. The ConvLSTM takes in information from the previous time step and passes its hidden state to the next time step.

Arbitrary Input Size
Input size: fully convolutional
[Diagram: ME | SPMC | Encoder | ConvLSTM | Decoder | skip connections]
Presenter notes: Our proposed framework has the advantage of full scalability. Input videos may be of different sizes in practice. Since our network is fully convolutional, it handles this naturally.
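The SPMC layer described earlier (a low-res frame plus a motion field in, a high-res image out) can be sketched as forward warping onto a scaled grid. This is a minimal sketch under stated assumptions, not the layer as implemented by the authors: the function name `spmc_forward_warp`, the `(dx, dy)` flow convention, the bilinear splatting kernel, and the choice to return un-normalized values together with their accumulated weights are all illustrative. Pure Python lists are used so the example runs without dependencies.

```python
import math

def spmc_forward_warp(frame, flow, scale):
    """Splat a low-res frame onto a high-res grid along its motion field.

    frame: 2D list of floats (H x W), one gray-level value per pixel.
    flow:  2D list of (dx, dy) displacements, in low-res pixels.
    scale: upsampling factor; it is only a coordinate multiplier,
           so non-integer values work as well.
    Returns (values, weights): accumulated intensities and bilinear weights.
    """
    h, w = len(frame), len(frame[0])
    out_h, out_w = round(h * scale), round(w * scale)
    values = [[0.0] * out_w for _ in range(out_h)]
    weights = [[0.0] * out_w for _ in range(out_h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Motion compensation and upscaling in one coordinate transform.
            tx, ty = scale * (x + dx), scale * (y + dy)
            x0, y0 = math.floor(tx), math.floor(ty)
            # Distribute the pixel over its 4 nearest high-res neighbours.
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= yy < out_h and 0 <= xx < out_w:
                        wgt = (1 - abs(tx - xx)) * (1 - abs(ty - yy))
                        values[yy][xx] += wgt * frame[y][x]
                        weights[yy][xx] += wgt
    return values, weights

frame = [[1.0, 2.0], [3.0, 4.0]]
still = [[(0.0, 0.0)] * 2 for _ in range(2)]
quarter = [[(0.25, 0.0)] * 2 for _ in range(2)]   # quarter-pixel shift to the right

for name, flow in (("no motion", still), ("0.25 px motion", quarter)):
    values, _ = spmc_forward_warp(frame, flow, scale=2)
    print(name, values[0])
```

In this sketch a quarter-pixel motion at low resolution becomes a half-pixel offset on the x2 grid, so the sub-pixel position is kept instead of being rounded away before upsampling. The function has no trainable weights, and the scale factor enters only through the coordinate transform.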

Arbitrary Scale Factors
Scale factors: 2, 3, 4
Parameter free
[Diagram: ME | SPMC | Encoder | ConvLSTM | Decoder | skip connections]
Presenter notes: When dealing with different scale factors, previous networks need to change network parameters. Our network is different because the resolution increase happens in the SPMC layer, which is parameter free. This property enables us to use one single model configuration to handle all scale factors, including non-integer values.

Arbitrary Temporal Length
[Diagram: 3 frames | 5 frames; ME | SPMC | Encoder | ConvLSTM | Decoder | skip connections]
Presenter notes: For practical systems, we may want to choose the number of frames in the testing phase in order to balance quality and efficiency. Our framework uses ConvLSTM to handle frames sequentially. Therefore, it can accept an arbitrary temporal length.

Analysis
Details from multi-frames
[Figure: 3 identical frames | Output (identical)]
Presenter notes: We analyze our method with several experiments. First, are our recovered details real? Here we use three identical frames as input to our network. The information contained in this input is no more than that of a single low-res image. As expected, the output is sharper but contains no more details, and the characters and logo are still unrecognizable.

Analysis
Details from multi-frames
[Figure: 3 consecutive frames | Output (consecutive) | Output (identical)]
Presenter notes: However, if we take 3 consecutive frames from the video as input, our network produces much better results. The characters and logo are clear enough to read. This experiment shows that the sharp structures recovered come from real information in the inputs, rather than from external information in the network. It is therefore
safe to trust the SR results.

Analysis
Ablation Study: SPMC Layer vs. Baseline
[Figure: Output (baseline); baseline = backward warping + resize]
Presenter notes: In the next experiment, we perform an ablation study of our SPMC layer. We substitute the SPMC layer with a baseline module: backward warping followed by upsampling. This baseline can also compensate for motion and increase resolution, and it is widely adopted in previous CNN-based methods. In this example, the tiles on the roof contain severely false structures due to aliasing.

Analysis
Ablation Study: SPMC Layer vs. Baseline
[Figure: Output (SPMC) | Output (baseline)]
Presenter notes: With our SPMC layer, the structures of the tiles in the result are very faithful to the ground truth. We believe that only by properly handling motion at sub-pixel precision can we recover good results.

Comparisons
[Figure: Bicubic x4]
Presenter notes: We further compare with the current state of the art. This is the bicubic-interpolated version of the input. The windows and glass of the building are severely blurred.

Comparisons
[Figure: BayesSR (Liu et al., 2011; Ma et al., 2015)]
Presenter notes: The result of Bayesian SR is sharp, but the structures are still missing.

Comparisons
[Figure: DESR (Liao et al., 2015)]
Presenter notes: Draft-ensemble SR recovers a few details, but with artifacts.

Comparisons
[Figure: VSRNet (Kappeler et al., 2016)]
Presenter notes: The recent CNN-based VSRNet produces a smooth result.

Comparisons
[Figure: Ours]
Presenter notes: Visually, our result is much better. The edges of the buildings and windows are easy to distinguish. We then go back to the input.

Comparisons
[Figure: Bicubic x4]
Presenter notes: Then our results.

Comparisons
[Figure: Ours]
Presenter notes: The changes are obvious.

Running Time
Presenter notes: We compare running time with most of the current methods.

Running Time
BayesSR (Liu et al., 2011): 2 hours / frame (Frames: 31, Scale Factor: 4)
Presenter notes: The BayesianSR method needs 2 hours to produce one frame, as reported in their paper.

Running Time
MFSR (Ma et al., 2015): 10 min / frame (Frames: 31, Scale Factor: 4)
Presenter 201