CarlosColodro-Conde,F.JavierToledo-Moreo⇑,RafaelToledo-Moreo,J.JavierMartínez-Álvarez,JavierGarrigósGuerrero,J.ManuelFerrández-Vicente
Dpto.ElectrónicayTecnologíadeComputadoras,UniversidadPolitécnicadeCartagena,Spainarticleinfoabstract
Theaccuracyofstereovisionhasbeenconsiderablyimprovedinthelastdecade,butreal-timestereomatchingisstillachallengeforembeddedsystemswherethelimitedresourcesdonotpermitfastoper-ationofsophisticatedapproaches.Thisworkpresentsanevaluationofarea-basedalgorithmsusedforcalculatingdistanceinstereoscopicvisionsystems,theirhardwarearchitecturesforimplementationonFPGAandthecostoftheiraccuraciesintermsofFPGAhardwareresources.Theresultsshowthetrade-offbetweenthequalityofsuchmapsandthehardwareresourceswhicheachsolutiondemands,sotheyserveasaguideforimplementingstereocorrespondencealgorithmsinreal-timeprocessingsystems.Ó2013ElsevierB.V.Allrightsreserved.Articlehistory:Received7December2012Receivedinrevisedform19August2013Accepted19November2013Availableonline27November2013Keywords:StereovisionFPGAEmbeddedandreal-timesystems1.IntroductionBinocularvisionallowshumanbeingstonavigateinthethree-dimensionalworldwelivein.Thesmalldifferencesintheimagesacquiredbyoureyes,duetotheseparationbetweenthem,arepro-cessedbythebraintoperceivethevolumeoftheobjectsandthedistancetothem.Inspiredbybinocularvision,stereovisionsys-temsaimtoemulatethiscapacity.Thetypicalapplicationfieldsrangefromroboticstohuman–machineinteraction,butnowadaysitexpandstoanyapplicationwhereinformationaboutdistancemaybeofinterest.Becauseofthis,stereovisionisoneofthemostimportantandactivelinesofworkincomputervisionatpresent.Theprecisionofthedepthestimationisusuallyanissueofmajorconcerninstereovision,soresearchinthisfieldisfocusedondevelopingnewalgorithmsthatobtainasaccurate3Dinforma-tionaspossible.Unfortunately,thehigheraccuracyresultsinacomputationalcomplexitythat,inpractice,limitstheintegrationofthesealgorithmsinreal-timeembeddedsystems.Itisafactthat,althoughmanystereovisionalgorithmshavebeenproposed,itisstillachallengingtasktoachieverealtimespeedathighdefinitionresolutionwhilemaintaininghighaccuracy.Asaconsequence,researchinterestisalsofocusedonalgorithmsandimplementa-tionsforreal-timestereovisionsystems.Areviewofrecentreal-timeimplementationsofstereovisionalgorithms[1,2]revealthatFieldProgrammableGateArray(FPGA)deviceshavebeenthepredominanthardwareplatform.Becauseof⇑Correspondingauthor.Tel.:+34968326513.E-mailaddress:javier.toledo@upct.es(F.J.Toledo-Moreo).1383-7621/$-seefrontmatterÓ2013ElsevierB.V.Allrightsreserved.http://dx.doi.org/10.1016/j.sysarc.2013.11.006itsinternalstructure,anFPGAiswellsuitedforsuccessfullyexploitingpixel-levelparallelisminherenttothelow-levelimageprocessingperformedinstereovisionalgorithms.Instruction-levelparallelismcanalsobeexploitedbymeansofpipelining,aswellas,athigherlevel,taskparallelism.Low-levelimageandvideoprocessing,includingstereocorre-spondencealgorithms,havebeensignificantapplicationdriversforreconfigurablecomputingsincetheearly90s[3].ThehugenumberofpapersfoundintheliteraturewhichdescribeFPGA-basedimplementationstoachievereal-timestereo-visionsystemscorroboratesthisstatement.Evenrecently,stereovisionalgo-rithmshavebeenusedforbenchmarkinginacomparisonoftheperformanceofFPGA,GPUandCPUs[4–6].Takingarea-basedmethodsasthereferencedesigns,thesepapersconcludethatFPGAsarethebestsolution.Mostofthereal-timestereovisionworkdescribedintheliter-atureimplementsalgorithmswithamanageablecomplexity,focusingontheoptimizationoftheimplementation[6–12]aresomerecentsampleswhichwehaveconsideredrepresentative.However,itisdifficulttocomparethemintermsofperformancebecauseeachoneusesadifferentconfiguration.Authors’decisionsregardingtheselectedcorrespondencealgorithm,matchingmetric,windowsize,disparityrange,etc.togetherwithapplicationsspe-cificrequirementssuchasimagesize,depthestimationrangeorframerate,resultnotonlyindifferentqualitiesofdepthestima-tion,butalsoindifferentresourceneedsandtimingperformances.Theaccuracyofdisparitymaps,processingspeedanddiesizearecloselyinterrelated.Betteraccuracyrequireshighercomputa-tionalload,whilefasterprocessing,attainablebyexploitingtheC.Colodro-Condeetal./JournalofSystemsArchitecture60(2014)22–3123parallelisminherenttoFPGAs,resultsinhigherresourcesneedswhich,inturn,implyhigherpowerconsumptionandeconomiccost.Forexample,theXC4VSX55fromXilinxis2.4timesthesizeoftheXC4VSX25,butfourtimesmoreexpensive.Asaresult,theperformanceofthestereosystemmayneedtobesignificantlyreducedinordertomeetthecost,physicalsize,and/orpowerconstraintsofthetargetapplication.Withtheaimofmaximizingtheaccuracyandthroughputofthestereosystemwhileminimizingtheresourcerequirements,someefforthasbeenrecentlyputonthedevelopmentofFPGA-optimizedalgorithms(e.g.,[1,13–16,27]).Nevertheless,thereisnocomprehensivesurveyofthecostofimprovingtheaccuracyofareal-timestereosystemthatcanhelpthedesignertochoosethesmallestFPGApartpossiblewhilestillsatisfyingtheaccuracyandtimingrequirements.TheneedtobalancetheexecutiontimeandthequalityoftheresultingdisparitymaphasbeenaddressedbyHumenbergeretal.[2],butitsanalysisdoesnotconsideranimplementationinreconfigurablehardware.Therecentpaper[17]studiesthesuitabilityofstereovisionalgorithmsinreal-timeimplementationsbut,inspiteofbeingspecificallyfocusedonresource-limitedplatformssuchasFPGAs,itdoesnotconsidertheareaoccupancyasaparameterforthereportedcomparisontables.LongfieldandChang[18]presentsaparameterizedstereovisioncorewhichenablesfastexplorationofthedesignspacethatallowseasiertuningofthestereovisioncoretocomplywiththeneedsofthetargetapplicationandtheavailablehardwareresources,butitonlyusesthecensusalgorithm.Aimingtoachieveabalancebetweentheaccuracyofthestereosystemandtheresourcerequirementsofitsimplementation,[16]proposesarangeofcensus-basedimagetransformsthatreducethehardwareresourcerequirements.Inordertoobtainanefficientimplementation,itisnecessarytoevaluatethecostofimprovingaccuracyoverotherdesignparam-eterslikeprocessingtimeandresourceutilization.Thisworkdescribesapracticalmethodologyforevaluatingthehardwarecostoftheaccuracyofarea-basedstereoscopicvisionalgorithmswhenimplementedinaFPGAforreal-timeapplications.Differenthard-warearchitecturesaredescribedandananalysisoftheirlogicandmemoryresourceneedsispresented.Additionally,theimpactofamedian-basedpost-processingstageonthequalityoftheresultingdisparitymapsandonresourceusagehasbeenstudied.Thefocusisonthestereomatchingalgorithms,soimagerectifica-tionandotherpre-processingstagesarenotconsidered.Thisarticleisorganizedasfollows.Section2describesthebasicconceptsofstereocorrespondencealgorithms.Section3presentstheresultsobtainedwithasoftwareevaluationofarangeofcon-figurations,takingintoaccountthemostwidelyusedarea-basedalgorithms.Section4evaluatestheresourceutilizationofeachconfiguration,establishingarelationshipbetweenaccuracyandFPGAresources.Finally,theconclusionsareoutlinedinSection5.2.StereoscopicvisionAbasicstereovisionsystem(Fig.1)consistsontwocameraslocatedonthesamehorizontalplaneandseparatedagivendistance,calledbaselineb,alonganaxiscontainedintheimageplaneofbothcameras.Theopticalaxesofthecamerasareparalleltoeachotherandperpendiculartotheimageplane.Inthissetup,theimagesIandI0capturedbythecamerashavesomesmalldiffer-encesduetothedistancebetweenthem,andasinglepointPinthesceneisprojectedatdifferentcoordinatesIðx1;y1ÞandI0ðx2;y2Þateachacquiredimage.Duetothearrangementofthecameras,theirverticalcomponentsarethesame(y1¼y2),sothepositionswillonlydifferinalateraldisplacement,calleddisparity(d¼x2Àx1).Givenacamerasetupwithafixedgeometry,thedistancefromFig.1.Geometryofastereovisionsystem.thefocalplanetoP,i.e.,thedepthzofPinspace,maybecomputedbytriangulationas:z¼fbdð1Þwherefisthefocallengthofthecameras.Therefore,inordertodeterminethedepthoftheobjectsinascene,aprocessoffindingcorrespondingpixelsintheimagesmustbecarriedout.Thebasisofthestereocorrespondenceproblemistofindpairsofcorrespond-ingpixelsintheimagesacquiredbythecamerasandtocalculateadisparitymap:animageofthesamesizeasthestereopairimageswheredisparityisexpressedasanintensityvalue.Withthispur-pose,stereocorrespondencealgorithmsconsiderasetofpointsononeoftheimagesandsearchtheirpositionintheotherimage.Thisprocess,referredtoasdisparityestimationorstereomatching,isdonealonganhorizontalline,andthesearchrangeisusuallylim-itedtoamaximumdisparityvaluecalledmaximumdisparity(dmax).Mostcorrespondencealgorithmscanbeclassifiedintotwobasiccategories[19]:feature-matchingandblock-matching(alsocalledarea-basedmethods).Nowadays,therearemoreapplica-tionsandresearchaboutarea-basedmethodsthanfeature-basedmethods.Thebiggerdensityofdisparitymapsobtainedwitharea-basedmethodsismoreinterestinginmanyfields[20].Atthesametime,processingtasksareeasilyparallelizedinarea-basedmethods,whichallowstoimplementthesealgorithmsinreal-timeembeddedsystems.Becauseofthesetworeasons,thisworkfocusesontheevaluationofarea-basedmethodsforthecom-putationofdisparitymaps.Area-basedmethodsconsiderawindowcenteredinapixelinoneoftheimagesanddeterminecorrespondencebysearchingthemostsimilarwindowintheotherimage.Thedegreeofsimilar-itybetweentwowindowsisestimatedwithacertainmatchingcostfunctionC,whichiscalculatedoverthefullsearchrangeI0ðx2þd;y2Þlimitedbythemaximumdisparityd2½0;dmaxÀ1.OnceallthecostfunctionsofIðx1;y1Þhavebeencalculated,themostsimilarwindowisusuallydeterminedwiththeWTA(WinnerTakeAll)algorithm,andthevalueofdisparityisestablishedasthedifferencebetweenthehorizontalcoordinatesofthereferencepix-elandthecenterofthewindowintheotherimagewhichmini-mizesC.3.Evaluationofarea-basedstereoalgorithmsThehighdegreeofinterestinstereovisionapplicationshasledtoahugenumberofworksproposingdifferentarea-basedstereoalgorithms.Nevertheless,thebibliographicreviewwhichhasbeenperformed[1,20–22]hasrevealedthatthemostwidelyusedmatchingmetricsareSAD(SumofAbsoluteDifferences),SSD(Sum24C.Colodro-Condeetal./JournalofSystemsArchitecture60(2014)22–31ofSquaredDifferences)andNCC(NormalizedCross-Correlaion),andalsothatitisacommonpracticetoprocesstheinputimagesasaprevioussteptotheevaluationofcostfunctionsusingthenon-parametrictransformscensusandrank.Forreal-timeimplementa-tionsofstereovision,SADisthemostpopularmatchingmetric[4,6,9,10,15].Theaimoftheranktransform[23]istoproviderobustnesstoilluminationvariations.ItconsidersaRÂSwindowcenteredaroundapixelIðy;xÞandassignsthispixelavaluewhichisequaltothenumberofpixelsofthewindowwhicharelowerthanthecenterpixelitself.Fig.2showsanexampleof3Â3ranktransform.Oncethetransformationhasbeenperformed,thedegreeofsimi-laritybetweenwindowscanbeevaluatedwithanycostfunction,withSADbeingthemostcommonone[24].Thewindowsizeoftheranktransformdoesnotdependonthesizeofthematchingwindow,i.e.,thewindowwhichisdefinedforestimatingsimilarity.Withthesameaimofimprovingreliabilityagainstilluminationvariations,thecensustransform[23]mapstheneighborhoodsur-roundingapixeltoabitvectorrepresentingthelocalinformation.ItconsidersanMÂNwindowcenteredaroundapixelIðy;xÞfromoneoftheimages,andencodestheinformationofthiswindowinanMNÀ1bitstringwherethevalue‘1’isassociatedwitheverypixelwhosevalueisgreaterthanthecenterpixel,andthevalue‘0’isassociatedotherwise.Instereovision,oncethecensustrans-formhasbeenappliedtothewindowscenteredinIðx1;y01ÞandIðx2;y2Þandthecorrespondingbitstringshavebeenobtained,thecensusalgorithmusestheHammingdistanceasthecostfunc-tionforevaluatingthesimilaritybetweenthem.Fig.3showsanexampleof3Â3censustransform.Becauseitsoperationsconsistofbitoperationsandsimplearithmetics,thecensusalgorithmiswell-suitedforaccelerationinFPGAsandthushasbeenwidelyimplementedinitsoriginalproposal[7,11,12,18],withsomehard-ware-orientedalgorithmicoptimizations[2,8,16]orevencom-binedwithSAD[1,13].Allthepreviouslymentionedmethodshavebeenstudiedinordertoevaluatethequalityoftheirdisparityestimations.Forthispurpose,thedatasetsTsukuba,Venus,TeddyandConeshavebeenusedasinputimagesforthetests.Thesesetsofimagesareavail-ableintheUniversityofMiddleburywebpage[25],andtheycur-rentlyconstituteacommonreferenceforevaluatingstereoalgorithms.Figs.4–6showthesuccessrateasafunctionofthematchingwindowsize.Thesepercentshavebeencalculatedcon-sideringa5%toleranceintheestimateddisparitywithrespecttothegroundtruth.Fig.7showsthedisparitymapsobtainedwiththedifferentsolutionsusingtheConesdatasetastheinput.Fig.4showsthesuccessrateforSAD,SSDandNCCasafunctionofthewindowsizeN.ThisgraphrevealsthatSAD,SSDandNCCpresentaverysimilarbehavior.Furthermore,therearenotsignif-icantdifferencesbetweenthedisparitymapsshowninFigs.7.3and7.4.Therefore,wecanconcludethatthehighercomputationalcostofcomputingthesquareddifferencesinSSDorthenormaliza-tionsinNCCdonotproducesignificantimprovementsoverSAD,whichonlyrequirestocomputeabsolutedifferences.Itisalsoconcluded,asobservedinFig.7.6,thatresultsareworsebeyondagivenwindowsizeduetotheexcessivesmoothingattheborders.TheimpactofperformingtheranktransformationovertheinputimagesisshowninFig.5.Inallthetestedcases,itispossibletoobtainhighersuccessratesthanwithoriginalSADstartingfromFig.2.ExampleranktransformwithR=S=3.Fig.3.ExamplecensustransformwithM=N=3.Fig.4.SuccessratesoftheSAD,SSD,NCC,censusande-censusalgorithms,withdifferentwindowsizesN.Fig.5.SuccessratesoftheSADalgorithmwithpriorranktransformsofdifferentwindowsizes,fordifferentsizesofthematchingwindowN.acertainmatchingwindowsizeN,whichisaslowasN¼5for7Â7rankwindows.ThisimprovementcanbecheckedvisuallycomparingthedisparitymapofFig.7.5withtheoneofFig.7.3.Itcanalsobeseenthatstartingfromarankwindowsizeof7Â7,makingthiswindowbiggerdoesnotresultinbetterqualityoftheresultingdisparitymap.AccordingtotheresultsinFig.4,censusrequiresawindowsizeofatleast17Â17inordertogetclosetotheresultsproducedbySAD.Incomparisonwiththedisparitymapsobtainedfromconfig-urationswithsimilarsuccessrates,onecanseethatcensusmapspresentsomeerrorswhicharesimilartoimpulsivenoise.ThiseffectcanbeeasilyseencomparingthemapsgeneratedwithSAD11Â11andcensus21Â21inFigs.7.3and7.7,respectively.Inordertoimprovetheresults,itispossibletoextendtheoriginalcensustransformwithadditionalinformation,sothattheresultingbitstringdoesnotdependexclusivelyonthevalueofthecenterpixel.OnewaytodothatistoconsidertheinformationaboutC.Colodro-Condeetal./JournalofSystemsArchitecture60(2014)22–3125Fig.6.Successratesofcensusalgorithmwithmedianfilteringofdifferentwindowsizes,fordifferentsizesofthematchingwindowN.thegradientinthehorizontalandverticaldirections,computedwiththeSobeloperator.Thismodificationwillbereferredtoase-census(Extendedcensus),andanexampledisparitymapobtainedwitha21Â21windowisshowninFig.7.8.AsitcanbeseeninFig.4,e-censusoutperformscensusforanywindowsize,andstartingfromN¼11italsooutperformsSAD.However,disparitymapsobtainedwithe-censusalsopresentthekindoferrorswhichappearincensus,albeittoalesserextent.Inordertoreducetheseerrors,apost-processingofthedisparitymapisproposed.Thepost-processingstageconsistsofapplyingamedianfiltertotheresultingdisparitymap.Fig.6showsthattheuseofmedianfilterswithdifferentwindowsizesimprovesthesuccessrateofbothcensusande-census.ThisfactcanbevisuallyperceivedinthemapsofFigs.7.9and7.10.Finally,Fig.6comparescensusande-censuswiththecombina-tionofrankandSAD,showingthatstartingfromacertainvalueofthemedianfiltersize,censusande-censusproducebetterresults.4.HardwareimplementationinFPGAThestudyintheprevioussectionanalyzesthequalityoftheevaluatedalgorithmsasafunctionoftheircustomizableparame-ters,withoutconsideringimplementation-relatedissues.However,whenhavingtoselectwhichalgorithmtoimplement,onemusttakeintoaccounttherequirementsofthespecificapplicationandassessthecostofeachofthem.Inthissection,thecomputa-tionalcostoftheFPGA-basedimplementationofSAD,census,rankandmedianfilteringisevaluatedinordertodeterminethehard-warecostofachievinganspecificaccuracy.SSDandNCCarenotincludedinthisanalysis,asthesealgorithmswerediscardedinSection3.Alltheimplementedstereo-visionalgorithmsprocessinputimagesindatastreaming.Toassurevideoprocessingrate,thealgo-rithmshavebeendesignedtomaximizethethroughputexploitingthedegreeofparallelismasmuchaspossible,sothatalloperationsofthecostfunctionareperformedinparallelandthedmaxcostfunctionsareevaluatedalsoinparallel.Withthissetup,thecalcu-lationofthedisparityofonepixelrequiresasingleclockcycle.Therangeofdisparitieswhichneedtobemeasureddependsonthegeometryofthestereovisionsystemandultimatelyontheapplication.Nevertheless,asaruleofthumbforthepurposeofestimatingresources,thesearchrangedmaxischosentobe15%oftheimagewidth.Aswithanylocal-neighbourhoodalgorithm,lineandwindowbuffersareusedtocacheinputdata[26].Inordertoprovideacostfunctionmodulethatacceptsthedatainparallel,MÀ1Q-depthlinebuffersandawindowbufferwithMðNÀ1Þregistersarerequiredforeachinputimage,asshowninFig.8.Theregistersofthewindowbuffersareimplementedwithflip-flops,whilelinebuffersareimplementedwithspecificmemoryresourcesbecausetheyaremuchlarger.Toextendparallelismtothedmaxinstancesofthecostfunction,thewindowbuffershavebeenreplicated.Thus,thisconfigurationuses2ÂdmaxÂMðNÀ1Þregistersforthewindowbuffers.ItisclearthatthewindowbuffersforimageIateachinstantiationofthecostfunctionstorethesamedataandthereforetheycouldbetrimmedwithoutaffectingfunctionality.Besides,itisalsopossibletodesignawindowbufferforI0withMðNÀ1þdmaxÞregisters,eachonefeedingNinstancesofthecostfunction.However,thereductionoftheflip-flopsusagewiththeseoptimizationsraisesthecomplexityoftheroutingprocessconsid-erably,thusincreasingthecombinationalpathdelay.Furthermore,norealbenefitisobtainedsincetheresourceneedsarelimitedbythenumberofutilizedLUTs,notbythenumberofflip-flops,asthenumberofrequiredLUTsismuchhigher.Therefore,theschemeshowninFig.8hasbeentheadoptedarchitecturesoastomaxi-mizethethroughput.4.1.SADcomputationaloptimizationThestraightforwardfull-parallelimplementationoftheSADalgorithmrequiresdmaxÂMÂNabsolutedifferencemodulesandthecorrespondingdmaxÂMÂNÀ1adders.Adetailedanalysisoftheoperationsrevealsthatitispossibletooptimizethiscomputa-tion.AsshowninFig.9,forawindowsizeof5Â5,theSADcom-putationinanypixelsharesoperationsanddatawithcomputationsofneighbourpixels.Thisimpliesthatintermediateresultsthathavebeencalculatedforapixelcanbereusedforthefollowingpixels.Therearetwowaysinwhichthisoptimizationcanbeperformed:horizontalorrow-based,andverticalorcol-umn-based.Thefirstoneconsistsonusingthetopandbottomrowsumsandthesumofthewindowonthepreviousrowtocom-putethenewwindowsum,whilethesecondoneconsistsonusingtheleftandrightcolumnsumsandthesumofthepreviouswin-dowonthesamerowtocomputethenewwindowsum.Lookingatthecomputationatðy;xÞandðyþ1;xÞinFig.9,itcanbeseenthattheMÀ1lowerrowsofCSADðyþ1;x;dÞareexactlythesameastheMÀ1upperrowsofCSADðy;x;dÞ.Mathematically,ifCSADyðy;x;dÞisdefinedasthesummationoftheNabsolutediffer-encesintheyrowCSADXi¼wyðy;x;dÞ¼ADðy;xþi;dÞi¼ÀwthenCSADðy;x;dXj¼wÞ¼CSADyðyþj;x;dÞj¼ÀwwhichcanberewrittenasCSADðyþ1;x;dÞ¼CSADðy;x;dÞþCSADyðyþ1þw;x;dÞÀCSADyðyÀw;x;dÞð2ÞTherefore,tocomputeCSADðyþ1;x;dÞonlyCSADyðyþ1þw;x;dÞhastobecalculated,sincetheremainingCSADywerealreadycalcu-latedforCSADðy;x;dÞ.Withthismethod,computationalcomplexityisreducedbyMtimeswithrespecttothestraightforwardimplementationsinceonlyNADmodulesandNþ1addersplus26C.Colodro-Condeetal./JournalofSystemsArchitecture60(2014)22–31(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)Fig.7.ExamplesofresultingdisparitymapsfortheConesdataset.asubtracterarerequiredpereachdisparitycomputation.BothCSADðy;x;dÞandCSADyðyÀw;x;dÞmustbestoredintemporalbuf-fersassoonastheyarecalculatedinordertobeavailablelaterwhenneeded,sothismethodtradescomputationalloadformem-ory.Thehorizontalscanlineoftypicalvideoprocessingapplica-tionsestablishthedepthofthesebuffers,whichdependontheC.Colodro-Condeetal./JournalofSystemsArchitecture60(2014)22–3127Fig.10.Simplifiedschemeofthehardwarearchitecturefortherow-basedoptimizationapproachwith5Â5windowsize.Fig.8.Schemeofthearchitecturefora3Â3windowsize.imagewidthQ.CSADðy;x;dÞisreusedwhenthecomputationisatthepixeljustbelowðyþ1;xÞ,whichhappensQpixelslaterindatastreaming,soaQ-depthbufferisrequired.CSADyðyÀw;x;dÞiscal-culatedexactlyMrowsbeforethecomputationisatpixelðyþ1;xÞ,soanMÂQ-depthbufferisneeded.ThesimplifiedschemeofsuchhardwarearchitectureisshowninFig.10.Sincethecomputationsarecalculatedoverarow,theinputlinebufferisnotnecessaryandthewindowbufferisreducedtoNÀ1registers.Ontheotherhand,ifCSADxðy;x;dÞisdefinedasFig.11.Simplifiedschemeofthehardwarearchitectureforthecolumn-basedoptimizationapproachwith5Â5windowsize.CSADxðy;x;dÞ¼andj¼wXj¼ÀwADðyþj;x;dÞCSADðy;x;dÞ¼i¼wXi¼ÀwCSADxðy;xþi;dÞthenthecomputationatanypixelðy;xþ1ÞforanydisparitydcanbecalculatedfromtheresultinthepreviouspixelinthesamerowCSADðy;x;dÞbysubstractingitsleftcolumnCSADxðy;xÀw;dÞandaddingthenewrightcolumnCSADxðy;xþ1þw;dÞ,asshowninFig.9:CSADðy;xþ1;dÞ¼CSADðy;x;dÞþCSADxðy;xþ1þw;dÞÀCSADxðy;xÀw;dÞð3ÞThus,thecomputationalloadisreducedbyNtimesandagainthetrade-offisthatmemoryresourcesarerequiredinordertobuf-fertwopreviousresults.Inthiscase,however,thesizeofthebuf-fersisnotdeterminedbytheimagewidthQ,butbythewindowwidthN.Fig.11showsthesimplifiedschemeofthishardwarearchitecture.Withthecomputationcalculatedoveracolumn,thereisnoneedforthewindowbuffers.BothapproachescanbemergedintoaTotal-ReuseTRconfigura-tion,whichmakesuseofthehorizontalrecursiontoupdatethesummationoftheabsolutedifferencesinarowandthenofverticalrecursiontoupdatethesummationofthecolumns.AsitcanbeobservedinFig.9,thecomputationatðyþ1;xþ1Þcanbecarriedoutbyonlycalculatingoneabsolutedifferenceandreusingdataobtainedinpreviouscomputations.SuchhardwarearchitectureisshowninFig.12.Itallowstokeepthecomputationalloadtoaminimumandindependentofthesizeofthematchingwindow,sinceonly5elementaryarithmeticoperationsareneededtoobtainCSADðy;x;dÞateachnewpixel.Theadoptionofeithertherow-basedorthecolumn-basedopti-mizationapproacheshasasignificantimpactontheamountofmemoryresourcesrequired,asshowninFig.13.Inasoftwareimplementationthismaynotaffecttheoverallperformance,butinanFPGAimplementationeachapproachleadstodifferentFig.9.SADalgorithmoptimizationthroughreutilizationofintermediatedatawithM¼N¼5.28C.Colodro-Condeetal./JournalofSystemsArchitecture60(2014)22–31Fig.12.SimplifiedschemeofthehardwarearchitectureforTotal-Reuseapproachwith5Â5windowsize.Fig.13.EstimationoftheKbitsofmemorynecessarytostorethepartialresultswithcolumn-basedandrow-basedSADoptimizationsasafunctionofNwithQ¼0anddmax¼95.resourcesneeds.ThenumberofusedLUTsisthesamebecausethearithmeticlogicisidentical.Thebuffersforthecolumn-basedopti-mizationcanbeefficientlyimplementedwithflip-flops,whilethebuffersfortherow-basedonemustbeimplementedwiththeblockRAM(BRAM)specificmemoryresources,whicharemuchlarger.TheXilinxblockRAMisan18Kbitmemorycustomizableas4KÂ4,2KÂ9,1KÂ18and512Â36memories.Considering8bitinputdata,whichisthetypicalrepresentationforluminanceinthestandardcameraformats,theblockRAMswhichimplementtheintermediatelinebuffersintherow-basedoptimizationmustbeconfiguredas1KÂ18toensurefullrepresentationofthesum-mationofabsolutedifferencesfortherangeofanalyzedwindowsizes.Fortheinputlinebuffersthatfeedthewindowbuffersinthestraightforwardandinthecolumn-basedimplementations,andagainconsidering8bitinputdata,bothIðy;xÞandI0ðy;xÞcanbestoredinthesameaddresswithinthe1KÂ18configuration,thushalvingthenumberofblockRAMs.Therow-basedsolutiondoesnotrequiretheseMÀ1Q-depthinputlinebuffers,butdmaxmodulesoftheintermediateonesmustbeinstantiatedtocomputeallthedisparitiesinparallel.Withtheseassumptions,Fig.14showstheresourcespaceofXilinxVirtex-4SXandFXFPGAfami-lies,andtheestimatedresourceneedsforrow-basedandcolumn-basedSADoptimizations.Itplotsthemaximumvalueofgeneralpurposeresourceusage(LUTs,FFs)foreachapproachversustheblockRAMusage,withdifferentconfigurationsandinputimagesizes.TheSXandFXfamilieshavethehighestslices:BRAMratioinVirtex-4.Itisobservedthatfortherow-basedapproachthemaximumwindowsizeislimitedbytheblockRAMs,andthatnoconfigurationcanbeimplementedfor720HDvideo.RegardingLUTusage,Fig.15plotsthenumberofLUTsasafunc-tionofthewindowssizeforPartial-ReusePRandTotal-ReuseTR.PRconfigurations,eithercolumn-basedorrow-based,requirethesamenumberofLUTs,andthenumberofLUTsisreducedbyoneorderofmagnitudewithrespecttothestraightforwardimplemen-tation.ForTR,thenumberofLUTsiskeptaslowaspossiblebecauseofSADisperformedwithasingleADmodule,anditvarieswiththewindowsizeonlyduetotheresultingdatasize.ThisconfigurationistheoppositeendofthedesignspacetothestraightforwardSADimplementation.Asaconclusion,thecolumn-basedapproachoffersthebestbalanceinresourceneeds.Ifthehardwareplatformcanhosttherow-basedoptimization,theTRconfigurationneedslessgeneralFig.14.SpaceresourceforSXandFXVirtex-4familiesandestimationofresourcesusageofrow-andcolumn-basedSADoptimizations,asafunctionofwindowsizeNandfordifferentinputimagesize.Fig.15.EstimationofthenumberofnecessaryLUTsasafunctionofNfortheevaluatedalgorithmswithQ¼0anddmax¼95.purposeresourceswithoutextramemoryrequirements.Obvi-ously,thesuccessratedoesnotdependonthechoice,asallimple-mentationsoftheSADalgorithmareequivalent.4.2.EstimationofhardwareresourcesInsteadofstartingwithafullimplementationofthecorrespon-dencealgorithms,varyingeverypossibleparameterinapredefinedrangeofvalues,itisagoodideatomakesomecalculationsinordertopredictwhichalgorithmismoreefficientbeforehand.Thatwayonecanexploredifferentsolutionswithouthavingtogothroughthecompletesynthesis,placeandrouteprocesses,whichcantakeseveralhoursforthelargestdesigns.Thecomputationalcostofthearea-basedstereocorrespon-dencealgorithmsisbasicallydeterminedbythreeparameters:thecomplexityofthematchingmetricsbetweenapairofC.Colodro-Condeetal./JournalofSystemsArchitecture60(2014)22–3129windows,thesizeoftheaggregationwindowandthemaximumdisparity.ThematchingmetricswhichhavebeenevaluatedarethesumofabsolutedifferencesandtheHammingdistance.Bothofthemcanbecomputedwithsimplelogicandarithmeticopera-tions.Forb-bitinputdata,anabsolutedifferenceusesanb-bitsubstractor,ab-bittwo’scomplementerforarithmeticnegationanda2-inputb-bitmultiplexer.Hammingdistancedeterminesthenumberofbitsthatdifferbetweentwob-bitbitstringswithab-bitXORandanaddertreeforbitcountingitsresult.ThesizeofthewindowMÂNsetsthenumberofoperationstocomputeinordertoestimatethedegreeofsimilaritybetweentwowindowsusingagivencostfunctionC.ForSAD,thisrequirestocomputethesumofMÂNabsolutedifferences:CSADðXj¼wy;x;dÞ¼Xi¼wADðyþj;xþi;dÞj¼Àwi¼ÀwwithADðy;x;dÞ¼jIðy;xÞÀI0ðy;xþdÞjassumingsquarewindows(M¼N¼2wþ1).Forcensus,itisnecessarytocomputetheHammingdistancebetweentwoMNÀ1bitstrings:CCENSUSðy;x;dÞ¼dHAMMINGðTðy;xÞ;T0ðy;xþdÞÞwhereTðy;xÞandT0ðy;xÞarethecensustransformsofaMÂNwin-dowineachimage.ThedisparityrangeimpliestheevaluationofCatdmaxpixelsI0ðy;xþiÞwithi2½0;dmaxÀ1foreachIðy;xÞ.Finally,theestimateddisparitydforeachIðy;xÞisthevalueofiwhichminimizesfCðy;x;iÞg.InahardwareimplementationonFPGA,thiscomputationalcostlimitsthenumberofpixelsIðy;xÞwhosedisparitycanbeestimatedperunitoftime,anddeterminestheareautilizedtoperformthenecessarycalculations.Bothvaluesdependonthequantityandcomplexityofthecalculationsoftheselectedalgorithm.Besides,ineveryalgorithmthethroughputisdirectlyrelatedtotheresourceutilizationbymeansofthedegreeofparallelismofitsimplementation.Inordertobeabletodeterminewhichalgorithmismoreeffi-cient,theresourceutilizationofvarioussuccessrateshavebeenevaluatedforeachalgorithm.Thisway,itispossibletoachievethebestcompromisebetweenthequalityoftheresultsandtheimplementationcostforaspecificapplication.Forthispurpose,thenumberofnecessaryresourceshavebeenestimatedusingXilinxSystemGenerator,withthehelpoftheResourceEstimatorblock.Theestimatednumberof4-inputLUTsnecessarytoimplementcensus,e-censusandSADfordifferentwindowsizesisshowninFig.15.Becauseofthefactthatthedesignshavebeenoptimizedformaximumparallelism,thereisanexponentialgrowthoftheresourceswiththewindowsize.ThisbehaviorisparticularlypronouncedinSADbecauseitsarithmeticoperationsusemoreresourcesthanthebinaryoperationsofbothversionsofcensus.Itcanalsobeseenthattheinclusionofthehorizontalandverticalgradientine-censusmakesitconsumethreetimesmoreresourcesthancensus.Thenumberof6-inputLUTswhichwouldbeneededispracti-callyidentical,duetothenatureofthearithmetic-logicoperationsinvolved.4.3.ImplementationresultsFig.16presentsthesuccessrateofeachalgorithmasafunctionofthenumberofestimatedLUTsforitsimplementation.ForSAD,onlythePRcolumn-basedconfigurationisshown.AccordingtotheconclusionsobtainedfromFig.6,amedianfilterof5Â5hasbeenselectedforcensusande-census,and7Â7ranktransformsforSAD.Allofthesecomputations,includingthe3Â3Sobelfiltersforthecalculationofthegradientrequiredine-census,havebeendesignedwithmaximumparallelismandeverypixelisprocessedinasingleclockcycle.TheconclusionextractedfromFig.16isthatcensuswithmedianfiltering,e-censuswithmedianfilteringandPRSADwithrankprovidethebestqualityofthedisparitymaps,withsuccessratesalmostidenticalforagivennumberofLUTs.Theadditionofranktransformsandmedianfilteringcontrib-utestoimprovetheresultswithaminimumcostofresources.Cen-susande-censuscanreachsimilarsuccessratesbytheirown(i.e.,withoutmedianpost-processing),butconsuminglotsofresources.Forexample,accordingtoFig.16,census(withoutmedian)needsatleast59968LUTstogetaccuraciesover84%.Ontheotherhand,censuscombinedwitha5Â5medianfilterneedsonly7755LUTstogetaccuraciesbetterthan84%,whichmeansthatthenumberofLUTshasbeenreducedby87%withrespecttooriginalcensus.Thisresultisspeciallyrelevant,asitmeansthatthemedianpost-pro-cessingstageallowstoreducedramaticallytheareaoccupancyofacensus-basedstereosystem,whichultimatelyleadstomulti-plebenefits:smallerdiesize,lowerpowerconsumption,lowereconomiccost,morespaceavailableforotherdigitalcircuitsinthesameFPGA,etc.Inthisstudy,dmaxhasbeenfixedto63,whichisanappropriatevaluefortheMiddleburydatasetswhichhavebeenused.DuetothefactthatallcostfunctionsCiarecalculatedinparallel,dmaxaffectsthenecessaryresourcesinalinearway,butitdoesnotaltertheconclusionsdrawnuptonow.Table1isasummaryoftherealimplementationresultsfor11Â11SADwith7Â7rank,for9Â9censuswith5Â5medianandfor5Â5e-censuswiththesamemedianfilter.Theseconfigu-rationsaretheoneswhichreacha90%ofsuccessratewithfewerresources,accordingtoFig.16.ThedesignshavebeenmadewithXilinxSystemGenerator.Table1showstheresourceutilizationofthedifferentparts,soitispossibletocheckthatthesmallnum-berofresourcesusedbytheranktransformsandthemedianfiltersallowsasignificantimprovementofthedisparitymaps.Themini-mumperiodreportedbythePlace&RouteperformedbyXilinxISEisthetimerequiredtocalculatethedisparityofonepixel.Fortheconsidered352Â288images,thismeansthateachframecanbeprocessedinlessthan1ms,andmore6:4Â109disparitiesareesti-matedpersecondwithanyoftheimplementedconfigurations.Neithertheranktransformsorthemedianfilterscompromisethesemaximumoperatingfrequencies.SincelatencyisacriticalFig.16.SuccessrateofeachalgorithmasafunctionofthenumberofestimatedLUTsfortheirimplementation.30C.Colodro-Condeetal./JournalofSystemsArchitecture60(2014)22–31Table1
ImplementationresultsfordifferentconfigurationsofthecorrespondencealgorithmsinaVirtex-4target.
7Â7Rank5Â5MedianFlip-flops585197-inputLUTs5332747Area(slices)5551468BlockRAMsTmin(ns)6.704.60factorintruereal-timeapplications,thisparameteriskepttotheminimum.UsingbiggerinputimageswouldimplymoreblockRAMstostorethenecessarydataineachlineandlargervaluesofdmaxinordertokeepthesamerangeofmeasurabledistances.5.ConclusionsInthelastdecade,asignificantprogresshasbeendoneinimprovingtheaccuracyofstereodisparitymapsbythedevelop-mentofsophisticatedalgorithms.Unfortunately,thecomputation-allyintensivenatureofthesealgorithmscanmaketheirreal-timeimplementationdifficultinresource-limitedsystems,andperfor-manceintermsofqualityorspeedmayneedtobereducedinordertomeettheconstraintsofmanyembeddedapplications.Efficientimplementationofreal-timestereovisionsystemsrequirestomanageatrade-offamongaccuracy,timingperfor-manceandresources.Thesolutionsproposedintheliteraturearehardware-orientedalgorithmicoptimizationsoradhocimplementationsthatdonotaddressthismanagement.Inordertofindsuchbalance,thispaperproposesapracticalmethodologyforestimatingthesuccessrateofthemostusualreal-timearea-basedstereomatchingalgorithmsasafunctionofitsresourceconsumptioninFPGA-basedimplementa-tions.Additionally,ahardwarestudyofalternativesfortheoptimi-zationofSADcomputationhasbeenpresented.Theuseofamedianfilterinthepost-processingstageimprovessignificantlythequalityofthedisparitymapwithamarginalhardwarecost.ThebenefitofthisapproachisdramaticintermsofFPGAoccu-pancywhencomparedtoasolutionwithoutthisstage.Inalltheconsideredcases,thealgorithmshavebeendesignedwiththemaximumdegreeofparallelism,whichassuresreal-timeprocessing.References[1]K.Ambrosch,C.Zinner,W.Kubinger,Accuratehardwarebasedstereovision,Comput.VisionImageUnderst.114(2010)1303–1316.[2]M.Humenberger,C.Zinner,M.Weber,W.Kubinger,M.Vincze,Afaststereomatchingalgorithmsuitableforembeddedreal-timesystems,Comput.VisionImageUnderst.114(2010)1180–1202.[3]M.Gokhale,P.S.Graham,ReconfigurableComputing:AcceleratingComputationwithField-ProgrammableGateArrays,Springer,2005.[4]S.Asano,T.Maruyama,Y.Yamaguchi,PerformancecomparisonofFPGA,GPUandCPUinimageprocessing,in:Int.Conf.onFieldProgrammableLogicandApplicationsFPL,2009,pp.126–131.[5]R.Kalarot,J.Morris,ComparisonofFPGAandGPUimplementationsofreal-timestereovision,in:IEEEComputerSocietyConferenceonComputerVisionandPatternRecognitionWorkshops(CVPRW),2010,pp.9–15.[6]E.Gudis,G.VanderWal,S.Kuthirummal,S.Chai,Multi-resolutionreal-timedensestereovisionprocessinginFPGA,in:Int.Symp.onField-ProgrammableCustomComputingMachinesFCCM,2012,pp.29–32.[7]S.Jin,J.Cho,X.D.Pham,K.M.Lee,S.-K.Park,M.Kim,J.W.Jeon,FPGAdesignandimplementationofareal-timestereovisionsystem,IEEETrans.CircuitsSyst.VideoTechnol.20(1)(2010)15–26.[8]L.Zhang,K.Zhang,T.S.Chang,RealtimehighdefinitionstereomatchingonFPGA,in:ACM/SIGDAInt.Symp.onFieldprogrammableGateArraysFPGA,2011,pp.55–.9Â9Census5Â5E-census11Â11PRSAD1824717347169651838417204245581380412988153341630229.999.859.91[9]P.Zicari,S.Perri,P.Corsonello,G.Cocorullo,Low-costFPGAstereovisionsystemforrealtimemapscalculation,Microprocess.Microsyst.36(2012)281–288.[10]C.Y.Villalpando,A.Morfopolous,L.Matthies,S.Goldberg,FPGAimplementationofstereodisparitywithhighthroughputformobilityapplications,in:IEEEAerospaceConference,2011,pp.1–10.[11]M.A.Ibarra-Manzano,D.L.Almanza-Ojeda,M.Devy,J.L.Boizard,J.Y.Fourniols,StereovisionalgorithmimplementationinFPGAusingCensusTransformforeffectiveresourceoptimization,in:EuromicroConferenceonDigitalSystemDesign–Architectures,MethodsandTools,2009,pp.799–805.[12]M.A.Ibarra-Manzano,M.Devy,J.L.Boizard,P.Lacroix,J.Y.Fourniols,Anefficientreconfigurablearchitecturetoimplementdensestereovisionalgorithmusinghigh-levelsynthesis,in:Int.Conf.onFieldProgrammableLogicandApplicationsFPL,2009,pp.444–447.[13]M.Jin,T.Maruyama,AfastandhighqualitystereomatchingalgorithmonFPGA,in:Int.Conf.onFieldProgrammableLogicandApplicationsFPL,2012,pp.507–510.[14]H.Calderon,J.Ortiz,J.G.Fontaine,HighparalleldisparitymapcomputingonFPGA,in:Int.ConferenceonMechatronicsandEmbeddedSystemsandApplications,2010,pp.307–312.[15]D.S.Kim,S.S.Lee,B.H.Choi,Arealtimestereodepthextractionhardwareforintelligenthomeassistantrobot,IEEETrans.Consum.Electron.56(3)(2010)1782–1787.[16]W.Fife,J.Archibald,ImprovedCensusTransformforresourceoptimizedstereovision,IEEETrans.CircuitsSyst.VideoTechnol.(99)(2012).[17]B.Tippetts,D.Jye,K.Lillywhite,J.Archibald,Reviewofstereovisionalgorithmsandtheirsuitabilityforresource-limitedsystems,J.Real-TimeImageProcess.(2013)1–21.[18]S.Longfield,M.Chang,AparameterizedstereovisioncoreforFPGAs,in:IEEESymposiumonFieldCustomComputingMachines(FCCM),2009,pp.263–266.[19]R.Szeliski,ComputerVision:AlgorithmsandApplications,Springer,2010.[20]D.Scharstein,R.Szeliski,Ataxonomyandevaluationofdensetwo-framestereocorrespondencealgorithms,Int.J.Comput.Vision47(1–3)(2002)7–42.[21]M.Z.Brown,D.Burschka,G.D.Hager,Advancesincomputationalstereo,IEEETrans.PatternAnal.Mach.Intell.25(8)(2003)993–1008.[22]H.Hirschmuller,D.Scharstein,Evaluationofstereomatchingcostsonimageswithradiometricdifferences,IEEETrans.PatternAnal.Mach.Intell.31(9)(2009)1582–1599.[23]R.Zabih,J.Woodfill,Non-parametriclocaltransformsforcomputingvisualcorrespondence,in:Int.ConferenceonComputerVisionECCV,1994,pp.151–158.[24]J.Banks,M.Bennamoun,Reliabilityanalysisoftheranktransformforstereomatching,IEEETrans.Syst.ManCybern.PartBCybern.31(6)(2001)870–880.[25]MiddleburyCollegeStereoVisionResearchPage.
因篇幅问题不能全部显示,请点此查看更多更全内容
Copyright © 2019- cepb.cn 版权所有 湘ICP备2022005869号-7
违法及侵权请联系:TEL:199 18 7713 E-MAIL:2724546146@qq.com
本站由北京市万商天勤律师事务所王兴未律师提供法律服务