Invited speakers

Abstracts

  • Prof Michael Jordan, "On Computational Thinking, Inferential Thinking and Data Science"

    The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the inferential and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in "Big Data" is apparent from their sharply divergent nature at an elementary level: in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. On a formal level, the gap is made evident by the lack of a role for computational concepts such as "runtime" in core statistical theory and the lack of a role for statistical concepts such as "risk" in core computational theory. I present several research vignettes aimed at bridging computation and statistics, including the problem of inference under privacy and communication constraints, and methods for trading off the speed and accuracy of inference.

  • Dr Mark Briers, "Turing and Bayes"

    The use of Bayesian statistics within data science is a key enabler to data-driven decisions. The defence and security sector has made use of Bayesian statistics for many years. This talk will demonstrate the utility of this approach within the defence and security context, and highlight some of the current research challenges facing the field.

  • Prof Alastair Young, "Principled statistical inference in data science"

    We consider the requirements and challenges of principled statistical inference in modern data science. Topics to be discussed include the meaning of validity of an inference and the appropriate form of post-selection inference, which takes into account the additional uncertainty arising from using the sample data to choose a statistical model. We will also consider issues related to the possibility of inference with black-box learning algorithms.

  • Dr Heather Battey, "Large-scale and supersaturated studies: statistical considerations with examples"

    I will discuss in non-technical terms a number of statistical issues related to big data in its various forms. Examples and some new work on high dimensional regression will be presented.

  • Prof David Leslie, "Uncertainty matters"

    A central theme of much of our early statistical training is that we must keep track of uncertainties in the quantities that we care about. However, much of applied machine learning simply makes point predictions, and perhaps acts on them; indeed, the uncertainties of the estimated quantities are often not reported at all. In this talk I will present results showing that keeping track of uncertainties is as important as ever. In particular, in sequential decision-making an estimate of current uncertainties leads very naturally to solutions to the exploration-exploitation dilemma, and I will present results demonstrating why this is the case.
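The link from tracked uncertainty to the exploration-exploitation dilemma can be made concrete with Thompson sampling on a two-armed Bernoulli bandit. This is an illustrative sketch under my own assumptions (Beta posteriors, arbitrary true payoff rates), not necessarily the method presented in the talk:

```python
import random

def thompson_step(successes, failures):
    """Draw a plausible success rate for each arm from its Beta posterior,
    then play the arm whose draw is largest. Arms we are uncertain about
    produce occasional high draws, so uncertainty itself drives exploration."""
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

# Simulate: arm 1 truly pays off more often than arm 0 (rates are made up).
random.seed(0)
true_rates = [0.3, 0.6]
successes, failures = [0, 0], [0, 0]
for _ in range(2000):
    arm = thompson_step(successes, failures)
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

pulls = [s + f for s, f in zip(successes, failures)]
```

As the posterior for the weaker arm concentrates, it is sampled less and less, so play shifts toward the better arm without any hand-tuned exploration schedule.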

  • Prof David Hand, "Evaluating statistical and machine learning classification methods"

    The origins of statistical classification methods stretch back to the early decades of the twentieth century, and such methods now lie at the core of many modern machine learning algorithms. Constructing effective classifiers constitutes a canonical data science challenge, requiring consideration of measurement issues, variable selection, model building, and performance assessment. That last step is critical: to determine if a method is good enough for some application, to determine if it is better than alternatives, and as a criterion for optimisation in constructing the algorithms. It means that we need to know how “accurate”, in some suitable sense, the classifications are. Different measures of classification performance implicitly define that sense in different ways, and a wide variety of measures have been proposed and used. After presenting an overview of such measures, I examine several common measures and show them to be wanting. In particular, I show that misclassification rate, the area under the ROC curve, and the F-measure have fundamental flaws and should be used only after careful consideration and in particular circumstances. Poor choice of performance measure can result in a poor choice of classifier, with potentially disastrous implications.
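That different measures implicitly define "accurate" in different ways can be seen in a toy sketch of my own (not taken from the talk; the 95/5 class imbalance is an assumed illustration): a classifier that always predicts the majority class looks excellent on misclassification rate yet useless on the F-measure.

```python
def misclassification_rate(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall on the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 95 negatives, 5 positives: the trivial "always predict 0" classifier.
y_true = [0] * 95 + [1] * 5
always_zero = [0] * 100

err = misclassification_rate(y_true, always_zero)  # 0.05 — looks good
f1 = f_measure(y_true, always_zero)                # 0.0  — exposes it
```

The two measures rank this trivial classifier at opposite extremes, which is exactly why the choice of measure must be matched to the application.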

  • Prof Mark Girolami, "Retail Planning in Future Cities: A stochastic formulation of a dynamical singly constrained spatial interaction model"

    One of the challenges of 21st-century science is to model the evolution of complex systems. One example of practical contemporary importance is urban structure, for which the dynamics may be described by a series of non-linear first-order ordinary differential equations. Whilst this approach provides a reasonable model of spatial interactions relevant in areas as diverse as public health and urban retail structure, it is somewhat restrictive owing to uncertainties arising in the modelling process.

    We address these shortcomings by developing a dynamical singly constrained spatial interaction model, based on a system of nonlinear stochastic differential equations. The model is ergodic and the invariant distribution encodes our prior knowledge of spatio-temporal interactions. We proceed by performing inference and prediction in a Bayesian setting, and explore the resulting probability measures with a position-specific Metropolis-adjusted Langevin algorithm. Studies of the retail structure of the city of London are used as illustration.

    Joint work with Louis Ellam, Greg Pavliotis, and Sir Alan Wilson
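The position-specific (preconditioned) sampler of the abstract is beyond a short sketch, but the basic Metropolis-adjusted Langevin step it builds on can be illustrated on an assumed one-dimensional standard normal target with a fixed step size (my own illustration, not the authors' algorithm):

```python
import math
import random

def mala_step(x, log_p, grad_log_p, eps):
    """One Metropolis-adjusted Langevin step: a gradient-informed Gaussian
    proposal, corrected by a Metropolis accept/reject test so the chain
    targets p exactly despite the discretised Langevin dynamics."""
    mean_fwd = x + 0.5 * eps**2 * grad_log_p(x)
    y = random.gauss(mean_fwd, eps)
    mean_bwd = y + 0.5 * eps**2 * grad_log_p(y)
    # log q(x | y) - log q(y | x) for the two Gaussian proposal kernels
    log_q_ratio = (-(x - mean_bwd)**2 + (y - mean_fwd)**2) / (2 * eps**2)
    if math.log(random.random()) < log_p(y) - log_p(x) + log_q_ratio:
        return y   # accept the proposal
    return x       # reject: stay put

# Target: standard normal, log p(x) = -x^2/2 up to an additive constant.
random.seed(1)
log_p = lambda x: -0.5 * x * x
grad_log_p = lambda x: -x

x, samples = 0.0, []
for i in range(20000):
    x = mala_step(x, log_p, grad_log_p, eps=0.9)
    if i >= 2000:            # discard burn-in
        samples.append(x)

mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples)
```

A position-specific variant would replace the fixed `eps` with a state-dependent preconditioner, adapting the proposal to the local geometry of the target.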

  • GCHQ speaker, "Does Data Science need Statistics?"

    With modern cloud services offering elastic storage and compute, anyone with an online account can train models on data and extract insights, visualisations and predictions. For organisations with data science requirements, a question therefore arises: should they hire statisticians, or focus on hiring skilled computer scientists and developers? This talk will present some of our experience with data science in GCHQ. It will explore a few examples of analyses where pitfalls can arise from a naive application of powerful tools, and will draw attention to some challenges for the statistical community.