Special Forum Big Data Courseware.ppt - 163文库

Resource description

Big Data vs Smart Model: Beauty and the Beast
Prof. Yike Guo
Department of Computing, Imperial College London

Model: Mathematical Representation of a Simplified Physical World
Modelling is an essential and inseparable part of all scientific activity. A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.
To understand the world or an object (called a target T), we build a model M: a simplified mathematical representation of it. A model is the result of abstraction from observations, and it is used to make predictions.
[Diagram labels: Human/Sensor, Human/Machine, Human/Machine]
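The cycle described above, abstracting a model M from observations of a target T and then using M to predict, can be sketched in a few lines. This is only an illustration under stated assumptions: the straight-line model form, the function names `fit_line` and `predict`, and the sample observations are all hypothetical and do not come from the slides.

```python
def fit_line(observations):
    """Abstract a model (slope, intercept) from (x, y) observations by least squares."""
    n = len(observations)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = sum(x for x, _ in observations) / n
    mean_y = sum(y for _, y in observations) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in observations)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in observations)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the model to predict the target's value at an unobserved point x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical sensor readings of the target T: (time, value) pairs.
readings = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model = fit_line(readings)
print(model, predict(model, 4.0))
```

The model keeps two numbers in place of the full set of readings, which is the sense in which it is a simplified representation of the target.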

No Model Is Perfect
- Inherent uncertainty: The targets consist of continuous phenomena (in both time and space), and they typically produce rich signals. Because the target is continuous in both time and space, the signals are in principle infinite. But observations (e.g. sensor readings) are made at discrete points in time and space, so they are incomplete and approximate, which brings the "uncertainty".
- Overfitting or underfitting: When learning a model from observations, such as a nonlinear regression model, we need to choose parameters such as K. Because the information from observations is partial, it is hard to make a perfect choice of K. This imperfection causes model error, such as underfitting (small K) and overfitting (large K).
- Simplification: From observations, we project a multi-dimensional world onto a simplified model with significantly reduced dimensionality, to focus on the features or properties we are interested in.
[Figure: nonlinear regression with a K-order polynomial]

George Box (statistician): "All models are wrong, but some are useful." Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. - 1980
Peter Norvig (Google): All models are wrong, and increasingly you can succeed without them. - 2008
Chris Anderson (Wired): There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. (The Data Deluge Makes the Scientific Method Obsolete) - 2012

So, Why Model? The Google Argument
At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later.
For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising; it just assumed that better data, with better analytical tools, would win the day. And Google was right.
Google's founding philosophy is that we don't know why this page is better than that one: if the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually knowing them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Model Free Sensor Informatics: Query Driven
[Diagram: a sensor network feeds a database table "raw-data", which the user processes]
time   id    temp
10am   1     20
10am   2     21
...    ...   ...
10am   7     29
User workflow:
1. Extract all readings into a file
2. Run MATLAB/R/other data processing tools
3. Write output to a file/back to the database
4. Write data processing tools to process/aggregate the output (maybe using the DB)
5. Decide new data to acquire
Repeat
Model-free sensing treats the sensory system as a database, and sensing as querying to fetch data from the physical world. Crossbow, one of the leading vendors, is bundling a query processor with its devices.

WikiSensing: A Model Free Sensor Informatics System Based on Big Data Architecture

Model Free Sensing is Super Inefficient
- Data misrepresentation without a model
- Latent information missing without a model
- High demand for computation/storage without a model
- Requires too much interoperability between sensors and analytics

Bayesian: Data Is Not the Enemy of Models, Rather a Great Supporter!
Bayesian probability is a formalism that allows us to reason about beliefs in models under conditions of uncertainty, based on the observations (data).
If we have observed that a particular event has happened, such as Britain coming 10th in the medal table at the 2004 Olympics, then there is no uncertainty about it.
However, suppose a is the statement "Britain sweeps the boards at the 2012 London Olympics, winning more than 30 Gold Medals!", made before 28th of July. Since this is a statement about a future event, nobody can state with any certainty whether or not it is true. Different people may have different beliefs in the statement, depending on their specific knowledge of factors that might affect its likelihood.
The beliefs in the model were changing daily, based on the performance data available each day. By the 10th of August, most people's belief in this model should have been almost 80%.
Thus, in general, a person's subjective belief in a statement a will depend on some body of knowledge K. We write this as P(a|K). Henry's belief in a is different from Marcel's because they are using different Ks. However, even if they were using the same K, they might still have different beliefs in a.
The expression P(a|K) thus represents a belief measure. Sometimes, for simplicity, when K remains constant we just write P(a), but you must be aware that this is a simplification.

Model and Data Interaction: Bayesian Inference
Bayes Rule: Interaction between data and model
Learning as a Sequence of Interactions
P(θ|Y) = p(Y|θ) p(θ) / p(Y)

Big Data Meets Smart Models: A Bayesian Approach towards Sensor Informatics
- We need models: a model is the representation of our knowledge so far
- Data: the observations, which may revise our belief in the models we have
- Analysis: assessing our belief and updating our models to make them more believable
- Sensing: acquiring the data needed to update (enrich) models
Models are learned from data (observations) by scientists (theoretical abstraction) or by machine (machine learning).
Models are hypotheses (when making new observations).
Models are knowledge (when belief is established).

Sensor Informatics
- Sensing management: managing the "neediness", i.e. when and where to sense
- Sensing analytics: managing model updating, i.e. how to enrich models with observations
- Reasoning: decision making based on integration of trusted models
P(M|D) = P(D|M) P(M) / P(D)

Surprising Event: When an Observation Does Not Fit a Known Model
When the posterior P(M|D) and the prior P(M) differ greatly: surprise!
How great is great? A surprise threshold, measured by Kullback-Leibler divergence. Other methods: significance level, Chebyshev's Theorem, ...
From the model, we get C(A,B) (e.g. a multivariate Gaussian distribution):
- A: 100mm, B: 50mm: model consistent
- A: 100mm, B: 500mm: surprise!

Camera example: Image - Analog Signal - Digital Data - Compressed Data - Information
Why sense so much data and then throw it away? Why not sense information directly?

Using Compressive Sensing Technology to Optimize Observations
Compressive sensing: take advantage of sparseness to solve under-determined signals with just a small number of measurements.
Unobserved behavior (behavior not captured by the current model) is typically sparse.
Reconstruction methods: L1-min, Bayesian CS.
Sensing data is enough when we can recover the needed information through compressive sensing.
CS matrix: built from the model. Placement matrix.

How to Update Model: Parameter Estimation
[Figure: nodal solution plot of TEMP (AVG), STEP=360, SUB=1, TIME=1800, SMN=131.03, SMX=646.41]
Estimating parameters to maximize the likelihood of the data given the model.

Model: An Example in Digital City
Modelling City Life via Causality: C(eA,eB) is used to predict the current value at location A when the value at another location B is given.
- Location: physical/logical locations with causality (through sensory cortex) (city areas A, B)
- Relationship: topology (geo topology between A and B: diffusion structure)
- Event: events, which are the dynamics of the observable signal S=f(E) (heavy rainfall)
Ontologies are adopted to represent locations L, relationships R, events E, and signals S.
Diffusion: an event e1 ∈ E in n1 causes another event e2 ∈ E in n2 when the two nodes n1, n2 in G are linked.

Digital City Model: looking into the details
System T=(L,R,E)
Model M(T)=(G,B)
Training for causality: use a Bayesian network to represent the conditional independencies between cause and target variables:
1. Gaussian Mixture Models (GMMs), estimated via expectation maximization (EM)
2. Gaussian Process with Bayesian Inference
When the surprise exceeds the surprise threshold: diversity is detected; identify the incorrect causality C(el,ep), which is sparse; apply the compressive sensing approach.
New observation: a measurement that could revise the model in model space to maximize the likelihood of observations.
[Diagram labels: Focusing on diversity, Placement, Model Updating]

Model Driven Sensing: No Surprise!
The dynamics of model update: Surprise - Sensing - Model Updating
The goal of sensing: capturing surprise
The goal of analysis: revising the model
A model cannot overfit/underfit: when there is diversity, it can be updated to stay consistent with the universe (target).

Model Update
It's Bayesian: P(M,?|D) = P(D|M,?) P(M,?) / P(D)
T: target, M: model, ?: top-down parameter
* When ? is fixed: P(M|D) = P(D|M) P(M) / P(D). The variance between posterior and prior is the "surprise".
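The update P(M|D) = P(D|M) P(M) / P(D), and the idea of measuring surprise as the gap between posterior and prior, can be sketched for a small discrete set of candidate models. This is a minimal illustration, not the talk's implementation: the slides name Kullback-Leibler divergence as one possible surprise measure, while the model names, probabilities, threshold value, and function names below are assumptions made up for the example.

```python
import math


def bayes_update(prior, likelihood):
    """Return the posterior P(M|D) given prior P(M) and likelihood P(D|M) per model."""
    unnormalised = {m: likelihood[m] * prior[m] for m in prior}
    evidence = sum(unnormalised.values())  # P(D)
    if evidence == 0:
        raise ValueError("the data has zero probability under every model")
    return {m: v / evidence for m, v in unnormalised.items()}


def kl_divergence(posterior, prior):
    """KL(posterior || prior) in nats; terms with zero posterior mass contribute 0."""
    return sum(p * math.log(p / prior[m]) for m, p in posterior.items() if p > 0)


def is_surprise(prior, likelihood, threshold):
    """Flag an observation whose posterior moves further from the prior than threshold."""
    posterior = bayes_update(prior, likelihood)
    return kl_divergence(posterior, prior) > threshold, posterior


# Hypothetical beliefs in two candidate models before the observation.
prior = {"normal_rain": 0.9, "flood": 0.1}
# Hypothetical P(D|M) for a reading of 500mm at location B.
likelihood = {"normal_rain": 0.01, "flood": 0.99}

surprised, posterior = is_surprise(prior, likelihood, threshold=0.5)
print(surprised, posterior)
```

An observation that every model explains equally well leaves the posterior equal to the prior, so its divergence is zero and it triggers no further sensing.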

With ? fixed, this is bottom-up attention, and the model update is data assimilation: combining observations of the current state of a system with the results from a model (the forecast) to produce an analysis. The model is then advanced in time, and its result becomes the forecast in the next analysis cycle.
* When ? is updated: P(M,?) = P(M|?) P(?). This is top-down attention (alertness), followed by a model update.

Adaptive Observation: Sensing and Numerical Modelling
CityGML Ontology - GIS - Geometry mesh

Building an Initial Model and Making Predictions by Simulation
Setting up boundary conditions, numerical schemas, model parameters, etc.
Simulation: 24 Building Case (Fine Mesh, 600000 Nodes): 20 Processors
Simulation: Moving Vehicles and Scalar Dispersions in Street Canyons

Using Sensors to Verify the Prediction Results of the Model
Sensing: acquiring data to get the posterior of the model, in order to validate the model (if consistent) or update it.
P(M|D) = P(D|M) P(M) / P(D)
[Diagram labels: Data, sensing, Model, validate, update]

New WikiSensing: Elastic Sensing Environment for Large Scale Sensor Informatics
- Elastic sensing theory based on Bayesian inference
- Big Data architecture for large scale sensory data management
- Ontology for background knowledge management
- Model driven adaptive observation support
- Digital City and digital life applications

The architecture of the New WikiSensing System

Ontology Used to Organise Complex Knowledge Management
- Using ontology to represent the targets, signals, sensing methods, measurements, etc.
- Ontology to support flexible resolution
- Upper ontology for unified operation
- OntoSensor

Conclusion
- Big data offers a great opportunity for building smart models
- Big data provides a new methodology for model research
- New informatics comes from the closely coupled integration of the data and model worlds
- Bayesian theory provides a natural foundation for such an integration
- Sensor Informatics is a good example of such a paradigm
- A new uniform framework of sensor informatics can be developed based on Bayesian theory, where the dynamics of data and model capture the essence of building a sensory system
- We are developing the WikiSensing system to realise this paradigm

Thank you

Understanding Big Data

Haixun Wang

Data Explosion
- MB = 10^6 bytes: a typical book in text format
- GB = 10^9 bytes: a one hour video is about 1 GB; data produced by a biology experiment in one day
- TB = 10^12 bytes: astronomy data in one night; the US Library of Congress has 1000 TB of data; the search log of Bing is 20 TB per day (2009)

The Arecibo Telescope
World's largest radio telescope
Diameter: 305 m (1,000 ft)
Area: 18 acres
Location: Arecibo, Puerto Rico
http://www.naic.edu
The P-ALFA surveys: 800 Terabytes in 5 years

Software Driven Telescope
- From few, large, expensive, directional dishes to many, small, cheap, omnidirectional antennae
- A large number of high-speed input streams (2 Gbps per antenna, 25,000 antennae in an area 340 km in diameter)

Challenge 1: It's the data, stupid!
[Diagram: data size vs. data complexity, covering key/value stores, column stores, document stores, and graph systems]
Big data drives tomorrow's economy.
The value of big data lies in its degree of connectedness.
Existing systems cannot handle the rich connectedness of big data.

RDBMS and Rich Relationships
- Performance of multi-way joins is very poor in an RDBMS
- Managing data of rich connectedness requires multi-way joins in an RDBMS

Trinity
- A general purpose, distributed, in-memory graph system
- Online graph query processing
- Offline graph analytics

Trinity Performance Highlight
- Online query processing: visiting 2.2 million users (3 hop neighborhood) on Facebook: ≤ 100 ms; a foundation for graph-based services, e.g. entity search
- Offline graph analytics: one iteration on a 1 billion node graph: ≤ 60 sec; a foundation for analytics, e.g. social analytics

People Search Demo

Multi-way Join vs. Graph Traversal
[Diagram: Company, Incident, and Problem records linked by IDs (ID1 to ID4), shown as joined tables in an RDBMS and as a graph in Trinity]

Challenge 2: Interpretation of Big Data
- IBM Watson: runs on 2,880 cores, 15 terabytes of RAM, and 80 kW of power
- A human brain: runs on a tuna fish sandwich and a glass of water
[Diagram: "the Eternal Quest", plotting understanding the question against answering the question. Understanding the question ranges from domain specific language to unconstrained natural language; answering the question ranges from simple calculation to inferencing & reasoning. Systems shown: calculator, SQL, Google/Bing?, WolframAlpha, Watson, SIRI, Human (Turing Test)]

Turning the Web into a Database

What you see when you look at my homepage
Haixun Wang
Microsoft Research Asia
Email: haixunw
Tel: +86-10-58963289
Tel: +1-914-902-0749
I joined Microsoft Research Asia in 2009. I was with IBM T.J. Watson Research Center from 2000 to 2009. I received the B.S. and M.S. degrees in Computer Science from Shanghai Jiao Tong University in 1994 and 1996, and the Ph.D. degree in Computer Science from University of California, Los Angeles in June 2000.

What a machine sees when it looks at my homepage
- A JPEG image (a jpeg file)
- Text in a big bold font
- 4 lines of text
- Another dozen lines of text with two embedded URLs

Semantic Web?
- "Number 1 trend in 2008" (Richard MacManus)
- "The infrastructure to power the Semantic Web is already here." (Tim Berners-Lee)
- "Unstructured information will give way to structured information, paving the road to intelligent computing." (Alex Iskold)

More data beats better algorithms
Banko and Brill 2001
[Chart: English-Spanish translation quality on Microsoft technical texts, 2001 to 2007. Y axis: mean translation quality (1 = incomprehensible, 4 = perfect). Series: Systran, Logos, MSRMT. Annotations: "Off-the-shelf rule-based system"; "Rule-based system with expensive customizations for Microsoft"; "Improve algorithms, scale system, and add data!"]
From Rick Rashid's talk: It's a data driven world, get over it!

Probase
isA(concept, entities) isPr
