Invited speakers

Contributed Talks

Abstracts

Santiago Gil, Network services as risk factors: a genetic epidemiology approach to cybersecurity

Given the multiplicity of factors that play a role in the susceptibility of a host to attacks and intrusions, calculating probabilities of compromise can be infeasible in real-world scenarios. Even the most sophisticated models will fail to account for unknown vulnerabilities and attack vectors. In this talk I’ll discuss the approach of using data to measure probabilities of compromise in the wild, and to estimate the risk associated with any given factor. In particular, I’ll present the case study of a university network, where, using data on threat activity in conjunction with network-wide scans, we can establish associations between the network services a host is running and the kinds of threats to which it is susceptible. I’ll propose a methodology to associate services with threats inspired by the tools used in genetics to identify statistical associations between mutations and diseases. The proposed approach allows us to determine probabilities of infection directly from observation, offering an automated high-throughput strategy to develop comprehensive metrics for cyber-security, as well as to discover unknown vectors of attack.
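
As a toy illustration of this methodology, in the spirit of genome-wide association testing and with entirely hypothetical hosts, services and threats, one could score every service-threat pair with a Fisher exact test and a Bonferroni correction:

```python
from itertools import product
from scipy.stats import fisher_exact

# Hypothetical observations: which hosts run which service (from scans)
# and which hosts were flagged for which threat (from threat-activity data).
runs_service = {
    "ssh":  {"h1", "h2", "h3", "h7"},
    "smtp": {"h2", "h4", "h5"},
}
hit_by_threat = {
    "bruteforce": {"h1", "h2", "h7"},
    "spam-bot":   {"h4", "h5"},
}
hosts = {f"h{i}" for i in range(1, 9)}

tests = []
for service, threat in product(runs_service, hit_by_threat):
    s, t = runs_service[service], hit_by_threat[threat]
    a = len(s & t)          # runs service, hit by threat
    b = len(s - t)          # runs service, not hit
    c = len(t - s)          # hit, does not run service
    d = len(hosts - s - t)  # neither
    odds_ratio, p = fisher_exact([[a, b], [c, d]])
    tests.append((service, threat, odds_ratio, p))

# Bonferroni correction, as in genome-wide association scans
alpha = 0.05 / len(tests)
for service, threat, odds_ratio, p in sorted(tests, key=lambda r: r[-1]):
    flag = "*" if p < alpha else " "
    print(f"{flag} {service} ~ {threat}: OR={odds_ratio:.2f}, p={p:.3f}")
```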


Vanessa Didelez, Graphical Models for Event Histories and their Application in Survival Analysis

In this talk I will first briefly review various notions of (conditional) (in)dependence among processes, how they can be represented graphically, and what certain manipulations on these graphs (e.g. separations) reveal. This will then be put into the context of causality, with a focus on identifiability of the effects of interventions. I will then focus on local independence graphs for point processes, which are directed graphs allowing cycles. It will be shown that various notions familiar from, e.g., (causal) DAGs can be generalised to this case, such as interventions and graphical criteria for identifiability, as well as inverse probability weighting. The theory will be illustrated with original data from a Norwegian screening project for cervical cancer; the aim is to compare two types of HPV test which can be used in the screening. Local independence graphs and appropriate weighting procedures turn out to be useful for the analysis of these data.


Des Higham, Matrix Functions for Temporal Networks

I will describe a range of concepts, and resulting algorithms, that allow us to extract useful information from time-dependent networks, especially those arising from on-line human behaviour. For example, we may wish to identify key players or critical moments, or to monitor some macro-level network properties in real time. I will pay particular attention to the case of large-scale, sparse networks. Ideas that will be discussed include the exponential, resolvent and logarithm of a matrix, and the factorization of a three-dimensional tensor.
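
One concrete instance of the resolvent idea is the running resolvent product behind Katz-style dynamic communicability. The sketch below uses synthetic snapshots; the parameter a is assumed to satisfy a < 1/ρ(A_k) for every slice:

```python
import numpy as np

# Accumulate Q = prod_k (I - a*A_k)^{-1} over time slices; row/column sums
# of Q then rank nodes by ability to broadcast/receive through time.
rng = np.random.default_rng(0)
n, T, a = 50, 10, 0.05

Q = np.eye(n)
for _ in range(T):
    A = (rng.random((n, n)) < 0.04).astype(float)  # snapshot adjacency
    np.fill_diagonal(A, 0)
    Q = Q @ np.linalg.inv(np.eye(n) - a * A)       # resolvent of this slice

broadcast = Q.sum(axis=1)  # ability to send messages forward in time
receive   = Q.sum(axis=0)  # ability to receive them
print("top broadcaster:", broadcast.argmax(), "top receiver:", receive.argmax())
```

For the large, sparse networks emphasised in the talk one would solve sparse linear systems rather than forming the dense inverse.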


Patrick Rubin-Delanchy, Hypothesis testing questions related to Big Data, networks, and cyber-security

In this talk, we outline a general modus operandi under which to perform intrusion detection at scale. The over-arching principle is this: a network monitoring tool has access to large stores of data on which it can learn 'normal' network behaviour. On the other hand, data on intrusions are relatively rare. This imbalance invites us to frame intrusion detection as an anomaly detection problem where, under the null hypothesis that there is no intrusion, the data follow a machine-learnt model of behaviour, and, under the alternative that there is some form of intrusion, certain anomalies in that model will be apparent. This approach to cyber-security poses some new statistical questions which are, at a high level, how to interpret Bayesian p-values in a frequentist way, how to cope with their 'revisionist' property, and how to combine different anomaly scores, a problem also called 'higher criticism'.
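
The combination question can be made concrete with two classical recipes, shown here on synthetic p-values (generic statistics, not the speaker's method): Fisher's method and the higher-criticism statistic.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
p = rng.uniform(size=200)
p[:5] = rng.uniform(0, 1e-4, size=5)   # a few genuinely anomalous events
p = np.sort(p)

# Fisher's method: -2 * sum(log p_i) ~ chi-squared with 2n dof under the null
fisher_stat = -2 * np.log(p).sum()
fisher_p = chi2.sf(fisher_stat, df=2 * len(p))

# Higher criticism: maximal standardised deviation of the empirical p-value
# distribution from uniformity, conventionally taken over the smallest p-values
n = len(p)
i = np.arange(1, n + 1)
hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
hc_star = hc[: n // 2].max()

print(f"Fisher p-value: {fisher_p:.2e}, HC*: {hc_star:.2f}")
```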


Alex Kent, Cyber-Security Data Sources for Dynamic Network Research

The importance of using real-world data to enable and validate dynamic network research for the purposes of cyber-security cannot be overstated. Unfortunately, the use of real-world data in cyber-security research, whether for motivation or validation, is rare. The majority of useful cyber data sources were intended for operational monitoring rather than for security purposes, and access for research purposes is even more problematic. From a dynamic network point of view, they often lack comprehensive coverage, are difficult to integrate across disparate sources, and are likely to carry significant noise in various forms. Nonetheless, there is rich and abundant potential in cyber-security datasets. We will discuss a variety of data sources, their usefulness and value, and their potential problems.


Melissa Turcotte, Modelling user behaviour in a network using computer event logs

Computer event logs can be a critical resource in detecting cyber-security threats on a computer network. One important research problem associated with such data sets is that of user credential theft or misuse, either by a malicious insider or an external adversary. Once compromised, a user credential can be used by an adversary to advance through the network and further their goals. Little attention is currently given to treating computer event logs as an aggregated multivariate data stream. Here, the aim is to model user credentials on the network by considering independently the time series of events generated by each credential. Simple Bayesian models are fit to the data for each user credential, providing a flexible framework for monitoring credentials on an enterprise network and identifying those that are potentially compromised.
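
A minimal sketch of the kind of per-credential Bayesian model this suggests (not the authors' actual models; the event names and vocabulary size are hypothetical): a Dirichlet-multinomial over event types, scoring each new event by its posterior predictive probability.

```python
from collections import Counter

def predictive_prob(history: Counter, event: str, alpha: float = 0.5,
                    vocab_size: int = 100) -> float:
    """P(next event = `event` | history) under a symmetric Dirichlet prior.
    `vocab_size` is the assumed number of possible event types."""
    n = sum(history.values())
    return (history[event] + alpha) / (n + alpha * vocab_size)

# Hypothetical event history for one credential
history = Counter({"logon:desktop-17": 240, "logon:mail-server": 55})
for event in ["logon:desktop-17", "logon:domain-controller"]:
    prob = predictive_prob(history, event)
    print(f"{event}: predictive prob {prob:.4f}")
    # In practice one would monitor a running surprise score (e.g. the sum of
    # negative log predictive probabilities) per credential.
```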


Joshua Neil, Statistics to the rescue: data science needs in industry cyber security

Cyber-security represents an enormous problem for every sector of industry. Since the subject itself involves computers, data is abundant. Since human beings (both users and attackers) are generating the data, and since the underlying structure is a network, analysis is complex, leading to many unsolved problems in the application of statistical and machine learning approaches to cyber-security. The authors represent a collaboration between statisticians and industry cyber-security practitioners with many years' experience in both fields. It is our intention to provide a comprehensive discussion of common attacks and their effects upon the data. We will provide an overview of the phases of computer network attack seen most commonly in industry, and describe the data that can be measured and that carries signal for each phase. We will then discuss various approaches to detecting each phase in the data. Finally, we will describe options for combining the results of each detector in order to make overall detection decisions.


Mauricio Barahona, Finding communities in directed graphs: flows and roles in Twitter networks

A current challenge in the study of real-world, complex networks is to extract intelligible, simplified descriptions in terms of relevant subgraphs (or communities), which can provide insight into the structure and function of the overall system. Directionality is a key ingredient in many such complex networks in which information, energy or influence are transmitted. In such cases, analysing flows (and not only the strength of connections) is crucial to reveal features that might go undetected if directionality is ignored. In this talk, we will showcase a flow-based approach for the analysis of directed networks based on Markov diffusions on graphs, and will illustrate it on the follower network of the most influential Twitter users during the 2011 riots in England. Our analysis reveals flow communities at different levels of coarseness within which information diffusion is contained and reinforced. Such interest communities are linked to user groupings according to location, profession, employer, and topic. The study of flows also allows us to generate an interest distance, which affords a personalised view of the attention in the network from the vantage point of any given user. In addition, using the profiles of incoming and outgoing long-range flows, we reveal that Twitter users in the network can be classified into five roles that go beyond the standard leader/follower dichotomy. We then show how the interest communities are characterised by a different mix of user roles reflecting the quality of dialogue within them. We also briefly discuss current work to incorporate retweet networks and textual analysis of tweets.
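
One simple way to make the flow-based idea concrete (a stand-in under simplifying assumptions, not the talk's exact method): diffuse a PageRank-style Markov process on a synthetic directed graph with planted groups, and cluster the symmetrised stationary flow matrix.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(2)
n = 60
blocks = np.repeat(np.arange(4), 15)                 # planted flow communities
prob = np.where(blocks[:, None] == blocks[None, :], 0.2, 0.02)
A = (rng.random((n, n)) < prob).astype(float)
np.fill_diagonal(A, 0)

# Row-stochastic transition matrix with teleportation (handles dead ends)
out = A.sum(axis=1, keepdims=True)
P = np.where(out > 0, A / np.maximum(out, 1), 1.0 / n)
P = 0.85 * P + 0.15 / n

# Stationary distribution by power iteration
pi = np.full(n, 1.0 / n)
for _ in range(200):
    pi = pi @ P

# F[i, j] = probability flow i -> j at stationarity; cluster its symmetrisation
F = pi[:, None] * P
labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                            random_state=0).fit_predict(F + F.T)
print("community sizes:", np.bincount(labels))
```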


Alberto Caimo, Bayesian modelling of static and longitudinal social networks

Exponential random graph models (ERGMs) are a widely used class of exponential family models for social network data. The topological structure of an observed network is modelled by the relative prevalence of network statistics, which are regarded as random variables. Computing the 'doubly intractable' posterior distribution of ERGMs is a very challenging problem. We overview some computational methods for estimating Bayesian ERGMs. In dynamic networks, the presence or absence, and creation or deletion, of ties between nodes is subject both to endogenous network dependencies and to dependencies stemming from the temporal embedding of the nodes. We present longitudinal ERGMs and a further extension that allows for simultaneous inference of the changes over time and the initial conditions.
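
The 'doubly intractable' issue can be illustrated with the exchange algorithm on the simplest possible ERGM, whose only statistic is the edge count, so that the auxiliary network can be simulated exactly. This is a sketch; real Bayesian ERGM software (e.g. the Bergm R package) handles richer statistics.

```python
import numpy as np

rng = np.random.default_rng(3)
n_nodes, n_dyads = 30, 30 * 29 // 2

# "Observed" network summarised by its sufficient statistic: the edge count
theta_true = -1.0
s_obs = rng.binomial(n_dyads, 1 / (1 + np.exp(-theta_true)))

def sample_suff_stat(theta):
    """Exact draw of s(y') for the edge-only ERGM: dyads are iid Bernoulli."""
    return rng.binomial(n_dyads, 1 / (1 + np.exp(-theta)))

theta, sigma_prior, samples = 0.0, 5.0, []
for _ in range(5000):
    theta_prop = theta + rng.normal(0, 0.1)   # symmetric random walk proposal
    s_aux = sample_suff_stat(theta_prop)      # auxiliary network draw
    # The intractable normalising constants cancel in the exchange ratio:
    log_alpha = ((theta_prop - theta) * (s_obs - s_aux)
                 + (theta**2 - theta_prop**2) / (2 * sigma_prior**2))
    if np.log(rng.random()) < log_alpha:
        theta = theta_prop
    samples.append(theta)

print(f"posterior mean of theta: {np.mean(samples[1000:]):.2f}")
```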


Aric Hagberg, Temporal reachability in dynamic networks

We construct temporal random graph models using Markov chains that preserve the random graph structure and have tunable dynamic properties. We analyse these models to determine the time it takes, when starting from a random node, to reach all of the other nodes by traversing temporal edges. The models we study are chosen for their simplicity and their ability to be generalised to more complex models of threats in cyber-security authentication systems.
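
A minimal simulation of the reachability quantity in question, with i.i.d. edge occurrences standing in for the Markov edge dynamics of the talk:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 0.005   # nodes; probability an edge is active at a given instant

reached = np.zeros(n, dtype=bool)
reached[rng.integers(n)] = True   # random starting node
t = 0
while not reached.all():
    t += 1
    active = rng.random((n, n)) < p   # edges present at this time step
    # A node becomes reachable if some already-reached node links to it now,
    # i.e. the temporal path respects edge timestamps
    reached |= (reached[:, None] & active).any(axis=0)
print(f"all {n} nodes reached after {t} steps")
```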


William Oxbury, Mathematical problems in network data analysis

This talk presents a view of some mathematical problems posed by the modelling of large-scale IP networks for the detection and understanding of botnets and other malicious activity. We will discuss:

  • host characterisation (from flow-based heuristics to social structures and change-point detection; a toy change-point sketch follows this list)
  • links and information pathways (what is the right way to define a network edge?)
  • dynamic graph models (from sequences of graphs to populations of point processes).
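
As a toy example of the change-point ingredient in the first bullet (synthetic flow counts, hand-picked parameters), a one-sided CUSUM detector:

```python
import numpy as np

rng = np.random.default_rng(5)
# Per-minute flow counts for one host: rate 10, jumping to 25 at t = 300
counts = np.concatenate([rng.poisson(10, 300), rng.poisson(25, 100)])

mu0, k, h = 10.0, 5.0, 30.0   # in-control mean, allowance, alarm threshold
cusum, alarm_at = 0.0, None
for t, x in enumerate(counts):
    cusum = max(0.0, cusum + (x - mu0 - k))   # accumulate upward deviations
    if cusum > h:
        alarm_at = t
        break
print(f"change injected at t=300, alarm raised at t={alarm_at}")
```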

Alex Bolton, Regime detection in multivariate stochastic processes

This talk addresses the problem of detecting malware families using dynamic instruction trace data. The dynamic instruction trace of a piece of software is the sequence of instructions that it calls when it is run in a virtual environment. A reversible jump MCMC (RJMCMC) sampler, an MCMC sampler in which the dimension of the model is allowed to change, is applied to the problem. Treating each dynamic instruction trace as a multivariate stochastic process, the RJMCMC sampler is used to detect change points and regimes in pairs of processes, revealing patterns common to both.
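
A heavily simplified, collapsed relative of such a sampler (not the talk's method): birth/death moves on a changepoint set for a single synthetic count stream, with the piecewise-constant Poisson rates integrated out under a Gamma(a, b) prior so that the trans-dimensional acceptance ratio needs no Jacobian.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(6)
x = np.concatenate([rng.poisson(3, 200), rng.poisson(12, 150), rng.poisson(5, 150)])
n, a, b, q = len(x), 1.0, 1.0, 0.01   # q: prior changepoint probability

cs = np.concatenate([[0], np.cumsum(x)])   # prefix sums for fast segment scores

def seg_logml(i, j):
    """Log marginal likelihood of x[i:j] with the rate integrated out,
    up to a constant that cancels between segmentations."""
    S, L = cs[j] - cs[i], j - i
    return a * np.log(b) - gammaln(a) + gammaln(a + S) - (a + S) * np.log(b + L)

taus = []   # sorted interior changepoints
for _ in range(20000):
    k = len(taus)
    if rng.random() < 0.5:                # birth: add a changepoint
        t = int(rng.integers(1, n))
        if t in taus:
            continue                      # invalid proposal, rejected
        bounds = [0] + taus + [n]
        j = int(np.searchsorted(bounds, t))
        lo, hi = bounds[j - 1], bounds[j]
        dlog = (seg_logml(lo, t) + seg_logml(t, hi) - seg_logml(lo, hi)
                + np.log(q / (1 - q)) + np.log((n - 1) / (k + 1)))
        if np.log(rng.random()) < dlog:
            taus = sorted(taus + [t])
    elif k > 0:                           # death: remove a changepoint
        j = int(rng.integers(k))
        t = taus[j]
        bounds = [0] + taus + [n]
        lo, hi = bounds[j], bounds[j + 2]
        dlog = (seg_logml(lo, hi) - seg_logml(lo, t) - seg_logml(t, hi)
                + np.log((1 - q) / q) + np.log(k / (n - 1)))
        if np.log(rng.random()) < dlog:
            taus = taus[:j] + taus[j + 1:]

# One posterior draw (the final chain state); real use would average many
print("changepoints in final state:", taus)
```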


Alex Gibberd, Inferring Dependency Networks from Dynamic Data

Often in cyber-security we observe relational connections between entities (i.e. source-destination pairs) which can be interpreted as a network. However, there are many kinds of data for which we do not directly observe such connections, for example CPU/memory usage. In this talk I will present two computationally tractable methods for learning conditional dependency graphs from multivariate data-streams. We discuss how such attributional graphs highlight dependencies between variables and demonstrate this in an applied setting, studying network activity throughout a denial of service (DoS) attack. In order to model activity on a network statistically, one should consider the possibility of the process being non-stationary. For example, CPU/network activity is often seen to occur in bursts relating to the processes being undertaken on a given system. To account for such behaviour within our models we propose and investigate two classes of dynamic exploratory graphical models (dEGM). The first method assumes smooth evolution of the underlying dependency graph, whereas our second method looks at piecewise graphical evolution. Inferring such a piecewise structure allows us to identify changepoints at key points where the graphs encoding the data-stream change. Not only can the inferred dEGM provide a probabilistic model for network activity, but they can also directly highlight key dependencies between variables. We give a simple example relating to the DARPA 1998 dataset, where we demonstrate how extracted changepoints can be used to highlight the start and end of a DoS attack. Whilst there are many ways to detect such attacks, it is interesting to consider how such dependency graphs may aid our understanding of network activity and perhaps be incorporated within other dynamic network modelling approaches.
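
A minimal sketch of the basic step, estimating a sparse conditional dependency graph on one window of a synthetic multivariate stream with the graphical lasso (the dynamic variants in the talk additionally penalise changes between windows):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(7)
n_samples, n_vars = 200, 6
# Synthetic window: variables correlated through a shared latent factor;
# in the DoS setting the columns might be per-host CPU/memory/traffic series
latent = rng.normal(size=(n_samples, 1))
X = 0.8 * latent + 0.6 * rng.normal(size=(n_samples, n_vars))

model = GraphicalLasso(alpha=0.1).fit(X)
precision = model.precision_

# Nonzero off-diagonal precision entries = edges of the dependency graph
edges = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)
         if abs(precision[i, j]) > 1e-4]
print("estimated conditional dependencies:", edges)
```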


Mariano Beguerisse-Diaz, Beyond metadata: Using content to reveal the evolution of narratives in social media

The wealth of data available from a variety of sources presents attractive opportunities in academia and beyond. Analysing large datasets and extracting useful information from them is not a trivial task. Often, collections of data have several layers of structure and are complex and noisy. Data from social media and other sources can be processed in many ways; recently we have studied a dataset of relationships among Twitter users who were prominent during the 2011 riots in England. These data consist of the names and descriptions of the users and their mutual relationships (i.e., who follows whom). Although the data did not include the actual messages that passed through these links during the riots, we are able to study the structure of the relationships to reveal information about the users, their interests, hierarchies and roles. Analyses of the network structures created by relationships or interactions between the data-generating agents, however, cannot answer questions such as: what topics do users of social media talk about, and how do these topics and their user participation change in time? To find answers we must go beyond the metadata and look at the content produced by the users. We have developed a method to study large, longitudinal collections of textual data that allows us to understand the evolution of discourse and group narratives. Our method uses topic timelines, a concept we have recently introduced. Topic timelines are networks whose nodes are content-units (such as topics) that appear in a given time interval; the edges may depend on the shared authors between the nodes, topical similarity, etc. Handling data in this way creates tractable networks that, for example, not only reveal what topics appear when, but also help to understand the relationship among the different topics in terms of agent participation or similarity. These new networks can be explored using standard tools from network science. For example, by extracting their communities we can track the origin, evolution, and decline of collective narratives; we can identify which seemingly disparate topics are related through their common users, obtain the user turnover of topics, or know when a topic becomes exhausted for some groups of users but not for others. This method provides a way to make large and complex sets of longitudinal textual data tractable and amenable to analysis with the rich palette of tools from network science, offering a new point of view from which collective discourse can be studied. We showcase our method on collections of Twitter status updates which include conversations about obesity, diabetes and the UK's National Health Service. This methodology is applicable not only to social media but to any collection of longitudinal data generated by large numbers of agents.
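
A small sketch of the topic-timeline construction (all author and topic names hypothetical): nodes are (interval, topic) pairs, and edges link topics in consecutive intervals whose author sets overlap sufficiently.

```python
import networkx as nx

# Hypothetical authors per (time-interval, topic) node
authors = {
    ("week1", "clean-up"):   {"ann", "bob", "carol"},
    ("week1", "policing"):   {"dave", "erin"},
    ("week2", "clean-up"):   {"ann", "carol", "frank"},
    ("week2", "sentencing"): {"dave", "erin", "gina"},
}

G = nx.Graph()
G.add_nodes_from(authors)
for u in authors:
    for v in authors:
        if u[0] == "week1" and v[0] == "week2":
            # Edge weight: Jaccard similarity of the two author sets
            jac = len(authors[u] & authors[v]) / len(authors[u] | authors[v])
            if jac > 0.3:
                G.add_edge(u, v, weight=jac)

print(sorted(G.edges(data="weight")))
```

Communities of such a graph then trace narratives through time, e.g. "clean-up" persisting across weeks and "policing" flowing into "sentencing" via shared authors.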


Rodrigo A. Collazo, Modelling the radicalisation process of a prison population using Chain Event Graphs

In this work we have developed methods to identify groups of individuals who are most likely to engage with specific criminal organisations in British prisons. Because of the sensitive nature of data in this field, we have based this study on a data set some of whose variables have been simulated. However, we have chosen simulations that have been calibrated to be consistent with real figures and real hypotheses in the public domain concerning the British prison population. Our explanatory variables try to capture the ethnological and criminal background of prisoners as well as their prison networks. As we will show, this example is very challenging because the classes of each variable are remarkably unbalanced and the percentage of radical prisoners, those units of special interest, is tiny. Furthermore, if expressed in terms of a Bayesian network (BN), any plausible generating model would need to be highly context-specific: general BN model selection methods could therefore not be expected to work well. A more flexible family of models really does need to be used. Recently, tree-based graphical models have been successfully used to describe various phenomena in a way easily appreciated by domain experts. The tree provides a powerful graphical support through which time sequences can be easily incorporated. Each path in the tree describes the various possible sequences of events an individual can experience. One such alternative tree-based model is the Chain Event Graph (CEG). CEGs have been a useful class of graphical models, especially for capturing context-specific conditional independences. A CEG retains most of the useful properties of a BN and provides an expressive framework for various tasks associated with representing, propagating and learning, especially when the tree of the underlying sample space is asymmetric. This model class contains all discrete context-specific BNs as a special case. Being built from a tree, a CEG has a huge number of free parameters, which makes the class extremely expressive but also very large. In order to search the massive CEG model space it is necessary to design bespoke efficient algorithms. In other contexts, Bayes factor scored search algorithms using non-local priors (NLPs) have recently proved very successful for searching huge model spaces. Here, to ensure parsimony and stability of selection, we define three new families of NLPs designed to be applied specifically to discrete processes defined through trees. We also develop a formal framework that enables us to employ NLPs for CEG model search within our modified heuristic approach based on a hierarchical agglomerative clustering algorithm. In the prison radicalisation context, this framework for CEG model search enforces parsimony over model selection in a direct and simple way, keeping computational time and memory costs under control. As expected for these applications, NLPs tend to select sparser, and hence more easily explained, graphs than conventional methods, whilst retaining the ability to identify the individuals most prone to radicalisation. The empirical results also indicate that CEG model search using NLPs is robust, in the sense that model selection is similar across wide intervals of values of the nuisance hyperparameters. Most recently, dynamic versions of CEGs, called Dynamic Chain Event Graphs (DCEGs), have been developed to model longitudinal discrete processes that have highly asymmetric unfoldings and context-specific structures.
We have shown that any infinite tree can be rewritten as a DCEG, which represents the originally elicited tree in a much more compact and easily interpretable form. We have further demonstrated that this framework can be extended by attaching holding-time distributions to the nodes in the graph, so that we can model processes observed at irregular time intervals.
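
The core score behind agglomerative stage-merging can be sketched as follows, using an ordinary Dirichlet (local) prior rather than the non-local priors developed in the talk; the counts are hypothetical.

```python
import numpy as np
from scipy.special import gammaln

def dirmult_logml(counts, alpha):
    """Log Dirichlet-multinomial marginal likelihood of one stage."""
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

alpha = np.array([1.0, 1.0, 1.0])   # prior over 3 outgoing edges of a position
u = np.array([40, 5, 5])            # edge counts at tree position u
v = np.array([38, 7, 6])            # edge counts at tree position v

# Log Bayes factor for putting u and v in the same stage (pooling counts)
log_bf = dirmult_logml(u + v, alpha) - (dirmult_logml(u, alpha)
                                        + dirmult_logml(v, alpha))
print(f"log Bayes factor for merging u and v: {log_bf:.2f}")
# Hierarchical agglomerative clustering greedily merges the pair of positions
# with the largest positive log Bayes factor, then repeats.
```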


Silvia Metelli, Bayesian Model-Based Clustering for Computer Network Data

Clustering methods aim to group together similar objects. Different approaches have been developed in the literature, including Bayesian model-based hierarchical clustering, which is the method of main interest here. A model-based approach defines clusters as groups of objects which have been generated by the same underlying probability distribution, whilst the Bayesian formulation enables the partition of items into subsets to be a parameter of a probability model for the data, subject to prior assumptions. We present a simple algorithm for Bayesian model-based hierarchical clustering, with an application to computer network data. Computer networks are complex, and detecting any changes in their structure represents an important security application. In particular, we are interested in analysing new edge formation in the network, as it can be a signal of malicious behaviour. Within this context, clustered data can be used to subsequently perform new edge prediction on peer groups of similar profiles. Specifically, we perform bi-clustering, which aims at identifying subsets of both the rows and the columns of a data matrix which are similar, revealing an underlying structure which may be obscured when clustering either rows or columns independently. However, identifying those clusters can be a challenging task, as computer networks generally show so-called scale-free behaviour, where most nodes have very few connections while some are extremely highly connected. In this respect, we will discuss the importance of including an informative prior distribution in the model.
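
As a quick stand-in for the bi-clustering task (spectral co-clustering rather than the Bayesian model-based method of the talk; the count matrix is synthetic), the following recovers planted row and column groups:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(8)
# Rows might be source hosts and columns destination services, with entries
# counting observed connections; planted blocks have elevated rates
row_groups = np.repeat([0, 1, 2], 20)
col_groups = np.repeat([0, 1, 2], 10)
rate = np.where(row_groups[:, None] == col_groups[None, :], 5.0, 0.2)
counts = rng.poisson(rate)

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(counts)
print("row labels:", model.row_labels_)
print("col labels:", model.column_labels_)
```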


Matt Price-Williams, Detecting limited dependency in point processes

The aim of this talk is to investigate methods of detecting triggering between edges in a large directed interaction network. Edges between nodes in a network can be interpreted and modelled as point processes where events in the process indicate information being sent along that edge. A test will be introduced to detect situations when only a subset of these events exhibit a triggering effect on another process. This test will allow us to detect correlated edges within a network, enabling locally joint modelling. This talk will consider the asymptotics of this problem as the number of events goes to infinity, whilst the proportion exhibiting dependence goes to zero, and examine the performance of different tests that are consistent in this framework.
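
A hedged sketch of the triggering question (a generic permutation test, not the test introduced in the talk): count how many events on edge B fall within a window delta of a preceding event on edge A, and compare against uniformly re-sampled B event times.

```python
import numpy as np

rng = np.random.default_rng(9)
T, delta = 1000.0, 1.0
a = np.sort(rng.uniform(0, T, 100))   # events on edge A
# Edge B: mostly background events, plus a small triggered subset that
# follows A events after a short exponential delay
b = np.concatenate([rng.uniform(0, T, 80),
                    rng.choice(a, 20) + rng.exponential(0.3, 20)])

def stat(b_times):
    """Number of B events within `delta` after their preceding A event."""
    idx = np.searchsorted(a, b_times) - 1   # latest A event before each B event
    valid = idx >= 0
    return np.sum((b_times[valid] - a[idx[valid]]) < delta)

obs = stat(b)
null = np.array([stat(rng.uniform(0, T, len(b))) for _ in range(2000)])
p = (1 + np.sum(null >= obs)) / (1 + len(null))
print(f"observed {obs} triggered-looking events, permutation p = {p:.4f}")
```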