
Extracting binary signals from microarray time-course data


In 32nd Conference on Uncertainty in Artificial Intelligence, Jersey City, New Jersey, USA, June. Abstract: We introduce the Mondrian kernel, a fast random feature approximation to the Laplace kernel. It is suitable for both batch and online learning, and admits a fast kernel-width-selection procedure, as the random features can be re-used efficiently for all kernel widths. This link provides a new insight into the relationship between kernel methods and random forests.

Understanding probabilistic sparse Gaussian process approximations. arXiv preprint. Abstract: Good sparse approximations are essential for practical inference in Gaussian processes, as the computational cost of exact methods is prohibitive for large datasets. The Fully Independent Training Conditional (FITC) and the Variational Free Energy (VFE) approximations are two recent popular methods. Despite superficial similarities, these approximations have surprisingly different theoretical properties and behave differently in practice. We thoroughly investigate the two methods for regression, both analytically and through illustrative examples, and draw conclusions to guide practical application.

Fabular: Regression formulas as probabilistic programming. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), New York, NY, USA. ACM. Abstract: Regression formulas are a domain-specific language adopted by several R packages for describing an important and useful class of statistical models: hierarchical linear regressions. Formulas are succinct, expressive, and clearly popular, so are they a useful addition to probabilistic programming languages? And what do they mean? We propose a core calculus of hierarchical linear regression, in which regression coefficients are themselves defined by nested regressions (unlike in R). We explain how our calculus captures the essence of the formula DSL found in R. We describe the design and implementation of Fabular, a version of the Tabular schema-driven probabilistic programming language, enriched with formulas based on our regression calculus.

Manifold Gaussian processes for regression. In International Joint Conference on Neural Networks. Abstract: Off-the-shelf Gaussian process (GP) covariance functions encode smoothness assumptions on the structure of the function to be modeled. To model complex and non-differentiable functions, these smoothness assumptions are often too restrictive. One way to alleviate this limitation is to find a different representation of the data by introducing a feature space. This feature space is often learned in an unsupervised way, which might lead to data representations that are not useful for the overall regression task. In this paper, we propose Manifold Gaussian Processes, a novel supervised method that jointly learns a transformation of the data into a feature space and a GP regression from the feature space to the observed space. The Manifold GP is a full GP and makes it possible to learn data representations that are useful for the overall regression task. As a proof of concept, we evaluate our approach on complex, non-smooth functions where standard GPs perform poorly, such as step functions and robotics tasks with contacts.
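The Manifold GP composes a learned map M with a standard kernel, so that k(x, x') = k_RBF(M(x), M(x')). A minimal numpy sketch of that composition, using a fixed random tanh network as the feature map purely for illustration (the paper learns M jointly with the GP hyperparameters by maximising marginal likelihood):

```python
# Sketch of the Manifold GP covariance: an RBF kernel applied to features M(x).
# The feature map here is fixed and random; this is an assumption for brevity,
# not the paper's joint training procedure.
import numpy as np

rng = np.random.default_rng(0)

def feature_map(X, W1, W2):
    """Two-layer tanh network mapping inputs into the feature space."""
    return np.tanh(np.tanh(X @ W1) @ W2)

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

# Toy non-smooth target: a step function, where plain smooth-kernel GPs struggle.
X = np.linspace(-1, 1, 50)[:, None]
y = np.where(X[:, 0] > 0, 1.0, -1.0) + 0.05 * rng.standard_normal(50)

W1 = rng.standard_normal((1, 16))
W2 = rng.standard_normal((16, 4))
H = feature_map(X, W1, W2)

# Standard GP regression equations, applied in the feature space.
noise = 0.05**2
K = rbf(H, H) + noise * np.eye(len(X))
alpha = np.linalg.solve(K, y)

Xs = np.linspace(-1, 1, 200)[:, None]
Ks = rbf(feature_map(Xs, W1, W2), H)
mean = Ks @ alpha  # posterior predictive mean
print(mean[:5])
```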
Jes Frellsen, Ole Winther, Zoubin Ghahramani, and Jesper Ferkinghoff-Borg. Bayesian generalised ensemble Markov chain Monte Carlo. In 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, May. Abstract: Bayesian generalised ensemble (BayesGE) is a new method that addresses two major drawbacks of standard Markov chain Monte Carlo algorithms for inference in high-dimensional probability models: inapplicability to estimating the partition function and poor mixing properties. BayesGE uses a Bayesian approach to iteratively update the belief about the density of states (the distribution of the log likelihood under the prior) for the model, with the dual purpose of enhancing the sampling efficiency and making the estimation of the partition function tractable. We benchmark BayesGE on Ising and Potts systems and show that it compares favourably to existing state-of-the-art methods.

Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, May. Abstract: The variational framework for learning inducing variables (Titsias, 2009a) has had a large impact on the Gaussian process literature. The framework may be interpreted as minimizing a rigorously defined Kullback-Leibler divergence between the approximating and posterior processes. To our knowledge this connection has thus far gone unremarked in the literature. In this paper we give a substantial generalization of the literature on this topic. We give a new proof of the result for infinite index sets which allows inducing points that are not data points and likelihoods that depend on all function values. We then discuss augmented index sets and show that, contrary to previous works, marginal consistency of augmentation is not enough to guarantee consistency of variational inference with the original model. We then characterize an extra condition where such a guarantee is obtainable. Finally we show how our framework sheds light on interdomain sparse approximations and sparse approximations for Cox processes.

Ofer Meshi, Mehrdad Mahdavi, Adrian Weller, and David Sontag. Train and test tightness of LP relaxations in structured prediction. In 33rd International Conference on Machine Learning, New York, NY, June. Abstract: Structured prediction is used in areas such as computer vision and natural language processing to predict structured outputs such as segmentations or parse trees. In these settings, prediction is performed by MAP inference or, equivalently, by solving an integer linear program. Because of the complex scoring functions required to obtain accurate predictions, both learning and inference typically require the use of approximate solvers. We propose a theoretical explanation for the striking observation that approximations based on linear programming (LP) relaxations are often tight on real-world instances. In particular, we show that learning with LP relaxed inference encourages integrality of training instances, and that tightness generalizes from train to test data.
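To make the object under study concrete: a toy version of the LP relaxation of MAP inference for a binary pairwise model, written with the generic local-consistency constraints and scipy (this is the standard relaxation, not the paper's learning procedure). On a frustrated triangle the relaxation returns fractional values (not tight); on an attractive model it returns an integral optimum (tight):

```python
# LP relaxation of binary pairwise MAP: variables mu_i = P(x_i = 1) and
# mu_ij = P(x_i = 1, x_j = 1); maximise sum_i th_i mu_i + sum_ij th_ij mu_ij
# subject to the pairwise consistency constraints. A hypothetical toy example.
import numpy as np
from scipy.optimize import linprog

def map_lp(theta_node, theta_edge, edges):
    n, m = len(theta_node), len(edges)
    c = -np.concatenate([theta_node, theta_edge])  # linprog minimises
    A, b = [], []
    for e, (i, j) in enumerate(edges):
        for node in (i, j):                  # mu_ij <= mu_i and mu_ij <= mu_j
            row = np.zeros(n + m); row[n + e] = 1; row[node] = -1
            A.append(row); b.append(0.0)
        row = np.zeros(n + m)                # mu_ij >= mu_i + mu_j - 1
        row[i] = 1; row[j] = 1; row[n + e] = -1
        A.append(row); b.append(1.0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(0, 1)] * (n + m))
    return res.x[:n]

edges = [(0, 1), (1, 2), (0, 2)]               # a triangle
print(map_lp([1, 1, 1], [-2, -2, -2], edges))  # frustrated: fractional, not tight
print(map_lp([-1, -1, -1], [2, 2, 2], edges))  # attractive: integral, tight
```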
The multivariate generalised von Mises: Inference and applications. arXiv preprint. Abstract: Circular variables arise in a multitude of data-modelling contexts, ranging from robotics to the social sciences, but they have been largely overlooked by the machine learning community. This paper partially redresses this imbalance by extending some standard probabilistic modelling tools to the circular domain. First we introduce a new multivariate distribution over circular variables, called the multivariate Generalised von Mises (mGvM) distribution. This distribution can be constructed by restricting and renormalising a general multivariate Gaussian distribution to the unit hyper-torus. Previously proposed multivariate circular distributions are shown to be special cases of this construction. Second, we introduce a new probabilistic model for circular regression that is inspired by Gaussian processes, and a method for probabilistic principal component analysis with circular hidden variables. These models can leverage standard modelling tools (e.g. kernel functions and automatic relevance determination). Third, we show that the posterior distribution in these models is a mGvM distribution, which enables development of an efficient variational free-energy scheme for performing approximate inference and approximate maximum-likelihood learning.

Consistent kernel mean estimation for functions of random variables. In Advances in Neural Information Processing Systems 30. Abstract: We provide a theoretical foundation for non-parametric estimation of functions of random variables using kernel mean embeddings. Our results extend to functions of multiple random variables. If the variables are dependent, we require an estimator of the mean embedding of their joint distribution as a starting point; if they are independent, it is sufficient to have separate estimators of the mean embeddings of their marginal distributions. In either case, our results cover both mean embeddings based on i.i.d. samples as well as reduced-set expansions in terms of dependent expansion points. The latter serves as a justification for using such expansions to limit memory resources when applying the approach as a basis for probabilistic programming.

However, there are many types of non-sequential data for which good compression techniques are still largely unexplored. This paper contributes insights and concrete techniques for compressing various kinds of non-sequential data via arithmetic coding, and derives re-usable probabilistic data models from fairly generic structural assumptions. Near-optimal compression methods are described for certain types of permutations, combinations and multisets, and the conditions for optimality are made explicit for each method.

Uprooting and rerooting graphical models. Abstract: We show how any binary pairwise graphical model may be "uprooted" to a fully symmetric model, and then "rerooted" to a new model on the original number of variables. The new model is essentially equivalent to the original model, with the same partition function and allowing recovery of the original marginals or a MAP configuration, yet may have very different computational properties that allow much more efficient inference. This meta-approach deepens our understanding, may be applied to any existing algorithm to yield improved methods in practice, generalizes earlier theoretical results, and reveals a remarkable interpretation of the triplet-consistent polytope.

Characterizing tightness of LP relaxations by forbidding signed minors. In 32nd Conference on Uncertainty in Artificial Intelligence, Jersey City, NJ, June. Abstract: We consider binary pairwise graphical models and provide an exact characterization (necessary and sufficient conditions observing signs of potentials) of tightness for the LP relaxation on the triplet-consistent polytope of the MAP inference problem, by forbidding an odd-K5 (complete graph on 5 variables with all edges repulsive) as a signed minor in the signed suspension graph. This captures signs of both singleton and edge potentials in a compact and efficiently testable condition, and improves significantly on earlier results. We provide other results on tightness of LP relaxations by forbidding minors, draw connections and suggest paths for future research. Comment: Supplementary Material
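The signed-minor characterizations above (and the clamping and almost-balanced results that follow) revolve around frustrated cycles: cycles with an odd number of repulsive edges. A small self-contained sketch, using union-find with parities, of checking whether a signed graph contains any frustrated cycle; this is a hypothetical helper for intuition, not code from the papers:

```python
# Balance check for a signed graph: no frustrated cycle exists iff the nodes
# can be two-coloured so attractive edges join equal labels and repulsive
# edges join different labels. Union-find with path parities detects conflicts.
def has_frustrated_cycle(n, signed_edges):
    """signed_edges: list of (i, j, sign) with sign +1 attractive, -1 repulsive."""
    parent = list(range(n))
    parity = [0] * n  # parity of the path from each node to its root

    def find(x):
        if parent[x] == x:
            return x, 0
        root, p = find(parent[x])
        parity[x] = (parity[x] + p) % 2  # path compression keeps parity correct
        parent[x] = root
        return root, parity[x]

    for i, j, sign in signed_edges:
        want = 0 if sign > 0 else 1      # repulsive edges flip the label
        ri, pi = find(i)
        rj, pj = find(j)
        if ri == rj:
            if (pi + pj) % 2 != want:
                return True              # conflicting parity along a cycle
        else:
            parent[ri] = rj
            parity[ri] = (pi + pj + want) % 2
    return False

print(has_frustrated_cycle(3, [(0, 1, -1), (1, 2, -1), (0, 2, -1)]))  # True
print(has_frustrated_cycle(3, [(0, 1, 1), (1, 2, 1), (0, 2, 1)]))     # False
```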
Adrian Weller and Justin Domke. Clamping improves TRW and mean field approximations. In 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, May. Abstract: We examine the effect of clamping variables for approximate inference in undirected graphical models with pairwise relationships and discrete variables. For any number of variable labels, we demonstrate that clamping and summing approximate sub-partition functions can lead only to a decrease in the partition function estimate for TRW, and an increase for the naive mean field method, in each case guaranteeing an improvement in the approximation and bound. We next focus on binary variables, add the Bethe approximation to consideration and examine ways to choose good variables to clamp, introducing new methods. We show the importance of identifying highly frustrated cycles, and of checking the singleton entropy of a variable. We explore the value of our methods by empirical analysis and draw lessons to guide practitioners.

Adrian Weller, Mark Rowland, and David Sontag. Tightness of LP relaxations for almost balanced models. In 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, May. Abstract: Linear programming (LP) relaxations are widely used to attempt to identify a most likely configuration of a discrete graphical model. In some cases, the LP relaxation attains an optimum vertex at an integral location and thus guarantees an exact solution to the original optimization problem. When this occurs, we say that the LP relaxation is tight. An almost balanced (sub-)model is one that contains no frustrated cycles except through one privileged variable.

Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence. Abstract: Autonomous learning has been a promising direction in control and robotics for more than a decade, since data-driven learning allows one to reduce the amount of engineering knowledge that is otherwise required. However, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers, which is a practical limitation in real systems, such as robots, where many interactions can be impractical and time consuming. In this article, we follow a different approach and speed up learning by extracting more information from data. In particular, we learn a probabilistic, non-parametric Gaussian process transition model of the system. By explicitly incorporating model uncertainty into long-term planning and controller learning, our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL, our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks.
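The core of the model-based idea above is to learn a GP model of the one-step dynamics from observed transitions and plan with it. A minimal sketch using scikit-learn on a toy system; note the full method propagates entire predictive distributions through time by moment matching, whereas this sketch rolls out only the posterior mean:

```python
# Learn a GP transition model (x_t, u_t) -> x_{t+1} and roll it forward.
# Toy system and mean-only rollout are assumptions made for brevity.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

def step(x, u):
    """Toy nonlinear dynamics standing in for a real system."""
    return 0.9 * x + 0.5 * np.sin(x) + 0.2 * u

X = rng.uniform(-2, 2, size=(100, 2))            # columns: state, control
y = step(X[:, 0], X[:, 1]) + 0.01 * rng.standard_normal(100)

model = GaussianProcessRegressor(RBF(1.0) + WhiteKernel(1e-4), normalize_y=True)
model.fit(X, y)

# Roll out the learned mean dynamics under a fixed control sequence.
x = np.array([1.5])
for u in np.zeros(10):
    x = model.predict(np.array([[x[0], u]]))
print("state after 10 steps:", x)
```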
Roy, and Zoubin Ghahramani. Training generative neural networks via Maximum Mean Discrepancy optimization. In 31st Conference on Uncertainty in Artificial Intelligence, Amsterdam, The Netherlands, July. Abstract: We consider training a deep neural network to generate samples from an unknown distribution given i.i.d. data. We frame learning as an optimization minimizing a two-sample test statistic; informally speaking, a good generator network produces samples that cause a two-sample test to fail to reject the null hypothesis. As our two-sample test statistic, we use an unbiased estimate of the maximum mean discrepancy, which is the centerpiece of the nonparametric kernel two-sample test proposed by Gretton et al. We compare to the adversarial nets framework introduced by Goodfellow et al., in which learning is a two-player game between a generator network and an adversarial discriminator network, both trained to outwit the other. In this framing, the MMD statistic plays the role of the discriminator. In addition to empirical comparisons, we prove bounds on the generalization error incurred by optimizing the empirical MMD.
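The training criterion referenced above is Gretton et al.'s unbiased U-statistic estimate of squared MMD between generated samples and data. A minimal numpy sketch of that estimator with an RBF kernel (the kernel choice and toy data are assumptions for illustration):

```python
# Unbiased MMD^2 between samples X and Y:
#   1/(m(m-1)) sum_{i!=j} k(x_i,x_j) + 1/(n(n-1)) sum_{i!=j} k(y_i,y_j)
#   - 2/(mn) sum_{i,j} k(x_i,y_j)
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))  # drop i = j terms
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.standard_normal((200, 2)), rng.standard_normal((200, 2)))
diff = mmd2_unbiased(rng.standard_normal((200, 2)), 2 + rng.standard_normal((200, 2)))
print(same, diff)  # near zero vs. clearly positive
```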
Bayesian time series learning with Gaussian processes. PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK. Abstract: The analysis of time series data is important in fields as disparate as the social sciences, biology, engineering and econometrics. In this dissertation, we present a number of algorithms designed to learn Bayesian nonparametric models of time series. The goal of these kinds of models is twofold. First, they aim at making predictions which quantify the uncertainty due to limitations in the quantity and the quality of the data. Second, they are flexible enough to model highly complex data whilst preventing overfitting when the data does not warrant complex models. We begin with a unifying literature review on time series models based on Gaussian processes. Then, we centre our attention on the Gaussian Process State-Space Model (GP-SSM): a Bayesian nonparametric generalisation of discrete-time nonlinear state-space models. We present a novel formulation of the GP-SSM that offers new insights into its properties. We then proceed to exploit those insights by developing new learning algorithms for the GP-SSM based on particle Markov chain Monte Carlo and variational inference. Finally, we present a filtered nonlinear auto-regressive model with a simple, robust and fast learning algorithm that makes it well suited to application by non-experts on large datasets. Its main advantage is that it avoids the computationally expensive and potentially difficult-to-tune smoothing step that is a key part of learning nonlinear state-space models.

Yarin Gal, Yutian Chen, and Zoubin Ghahramani. Latent Gaussian processes for distribution estimation of multivariate categorical data. In Proceedings of the 32nd International Conference on Machine Learning (ICML). Abstract: Multivariate categorical data occur in many applications of machine learning. One of the main difficulties with these vectors of categorical variables is sparsity. The number of possible observations grows exponentially with vector length, but dataset diversity might be poor in comparison. Recent models have gained significant improvement in supervised tasks with this data. These models embed observations in a continuous space to capture similarities between them. Building on these ideas we propose a Bayesian model for the unsupervised task of distribution estimation of multivariate categorical data. We model vectors of categorical variables as generated from a non-linear transformation of a continuous latent space. Non-linearity captures multi-modality in the distribution. The continuous representation addresses sparsity. Our model ties together many existing models, linking the linear categorical latent Gaussian model, the Gaussian process latent variable model, and Gaussian process classification. We derive inference for our model based on recent developments in sampling-based variational inference. We show empirically that the model outperforms its linear and discrete counterparts in imputation tasks of sparse data.

Yarin Gal and Richard Turner. Improving the Gaussian process sparse spectrum approximation by representing uncertainty in frequency inputs. In Proceedings of the 32nd International Conference on Machine Learning (ICML). Abstract: Standard sparse pseudo-input approximations to the Gaussian process (GP) cannot handle complex functions well. Sparse spectrum alternatives attempt to answer this but are known to over-fit. We suggest the use of variational inference for the sparse spectrum approximation to avoid both issues. We model the covariance function with a finite Fourier series approximation and treat it as a random variable. The random covariance function has a posterior, on which a variational distribution is placed. The variational distribution transforms the random covariance function to fit the data. We study the properties of our approximate inference, compare it to alternative ones, and extend it to the distributed and stochastic domains. Our approximation captures complex functions better than standard approaches and avoids over-fitting.

Probabilistic machine learning and artificial intelligence. Nature, 521:452-459, doi:10.1038/nature14541. Abstract: How can a machine learn from experience? Probabilistic modelling provides a framework for understanding what learning is, and has therefore emerged as one of the principal theoretical and practical approaches for designing machines that learn from data acquired through experience. The probabilistic framework, which describes how to represent and manipulate uncertainty about models and predictions, has a central role in scientific data analysis, machine learning, robotics, cognitive science and artificial intelligence. This Review provides an introduction to this framework, and discusses some of the state-of-the-art advances in the field, namely, probabilistic programming, Bayesian optimization, data compression and automatic model discovery.

Neural adaptive sequential Monte Carlo. Abstract: Sequential Monte Carlo (SMC), or particle filtering, is a popular class of methods for sampling from an intractable target distribution using a sequence of simpler intermediate distributions. Like other importance sampling-based methods, performance is critically dependent on the proposal distribution: a bad proposal can lead to arbitrarily inaccurate estimates of the target distribution. This paper presents a new method for automatically adapting the proposal using an approximation of the Kullback-Leibler divergence between the true posterior and the proposal distribution. The method is very flexible, applicable to any parameterised proposal distribution and it supports online and batch variants. We use the new framework to adapt powerful proposal distributions with rich parameterisations based upon neural networks, leading to Neural Adaptive Sequential Monte Carlo (NASMC). Experiments indicate that NASMC significantly improves inference in a non-linear state-space model, outperforming adaptive proposal methods including the Extended Kalman and Unscented Particle Filters. Experiments also indicate that improved inference translates into improved parameter learning when NASMC is used as a subroutine of Particle Marginal Metropolis-Hastings. Finally we show that NASMC is able to train a neural network-based deep recurrent generative model, achieving results that compete with the state-of-the-art for polyphonic music modelling. NASMC can be seen as bridging the gap between adaptive SMC methods and the recent work in scalable, black-box variational inference.
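For context on what NASMC adapts: the baseline it improves upon is the bootstrap particle filter, whose proposal is simply the transition prior. A minimal numpy sketch of that baseline on a standard one-dimensional nonlinear state-space model (NASMC would replace the bootstrap proposal with a learned neural one; the model below is an illustrative assumption):

```python
# Bootstrap particle filter: propose from the transition prior, weight by the
# observation likelihood, resample. Toy 1-D nonlinear state-space model.
import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 500
f = lambda x: 0.5 * x + 25 * x / (1 + x**2)   # transition mean
q_std, r_std = 1.0, 1.0                       # process / observation noise

# Simulate observations from the model.
x_true, ys = 0.0, []
for t in range(T):
    x_true = f(x_true) + q_std * rng.standard_normal()
    ys.append(x_true + r_std * rng.standard_normal())

particles = rng.standard_normal(N)
means = []
for y in ys:
    particles = f(particles) + q_std * rng.standard_normal(N)  # prior proposal
    logw = -0.5 * ((y - particles) / r_std) ** 2               # obs. weights
    w = np.exp(logw - logw.max()); w /= w.sum()
    means.append(w @ particles)                                # filtering mean
    particles = particles[rng.choice(N, N, p=w)]               # resample
print(means[-5:])
```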
MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems 28, pages 1-9, Montreal, Canada, December. Abstract: Gaussian process (GP) models form a core part of probabilistic machine learning. Considerable research effort has been made into attacking three issues with GP models: how to compute efficiently when the number of data is large; how to approximate the posterior when the likelihood is not Gaussian; and how to estimate covariance function parameter posteriors. This paper simultaneously addresses these, using a variational approximation to the posterior which is sparse in support of the function but otherwise free-form. The result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approximation over the function values and covariance parameters simultaneously, with efficient computations based on inducing-point sparse GPs. Code to replicate each experiment in this paper will be available shortly.

Scalable Variational Gaussian Process Classification. In 18th International Conference on Artificial Intelligence and Statistics, pages 1-9, San Diego, California, USA, May. Abstract: Gaussian process classification is a popular method with a number of appealing properties. We show how to scale the model within a variational inducing point framework, out-performing the state of the art on benchmark datasets. Importantly, the variational formulation can be exploited to allow classification in problems with millions of data points, as we demonstrate in experiments.

Adams, and Zoubin Ghahramani. Predictive entropy search for Bayesian optimization with unknown constraints. In 32nd International Conference on Machine Learning. Abstract: Unknown constraints arise in many types of expensive black-box optimization problems. Several methods have been proposed recently for performing Bayesian optimization with constraints, based on the expected improvement (EI) heuristic. However, EI can lead to pathologies when used with constraints. For example, in the case of decoupled constraints (i.e., when one can independently evaluate the objective or the constraints), EI can encounter a pathology that prevents exploration. Additionally, computing EI requires a current best solution, which may not exist if none of the data collected so far satisfy the constraints. By contrast, information-based approaches do not suffer from these failure modes. In this paper, we present a new information-based method called Predictive Entropy Search with Constraints (PESC). We analyze the performance of PESC and show that it compares favorably to EI-based approaches on synthetic and benchmark problems, as well as several real-world examples. We demonstrate that PESC is an effective algorithm that provides a promising direction towards a unified solution for constrained Bayesian optimization.

Unsupervised many-to-many object matching for relational data. IEEE Transactions on Pattern Analysis and Machine Intelligence. Abstract: We propose a method for unsupervised many-to-many object matching from multiple networks, which is the task of finding correspondences between groups of nodes in different networks. For example, the proposed method can discover shared word groups from multi-lingual document-word networks without cross-language alignment information. We assume that multiple networks share groups, and each group has its own interaction pattern with other groups. Using infinite relational models with this assumption, objects in different networks are clustered into common groups depending on their interaction patterns, discovering a matching.
EP approximates the full intractable posterior distribution through a set of local approximations that are iteratively refined for each datapoint. EP can offer analytic and computational advantages over other approximations, such as variational inference (VI), and is the method of choice for a number of models. The local nature of EP appears to make it an ideal candidate for performing Bayesian learning on large models with large-scale datasets. However, EP has a crucial limitation in this context: the number of approximating factors needs to increase with the number of data points, N, which often entails a prohibitively large memory overhead. This paper presents an extension to EP, called stochastic expectation propagation (SEP), that maintains a global posterior approximation like VI but updates it in a local way like EP. Experiments on a number of canonical learning problems using synthetic and real-world datasets indicate that SEP performs almost as well as full EP, but reduces the memory consumption by a factor of N. SEP is therefore ideally suited to performing approximate Bayesian learning in the large model, large dataset setting.

Representation, learning, description and criticism of probabilistic models with applications to networks, functions and relational data. PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK. Abstract: This thesis makes contributions to a variety of aspects of probabilistic inference. Specifying the details of a probability distribution can be a difficult task in many situations, but when expressing beliefs about complex data structures it may not even be apparent what form such a distribution should take. This thesis starts by demonstrating how representation theorems due to Aldous, Hoover and Kallenberg can be used to specify appropriate models for data in the form of networks. These theorems are then extended in order to reveal appropriate probability distributions for arbitrary relational data or databases. A simpler data structure for which to specify probability distributions is that of functions; many probability distributions for functions have been used for centuries. We demonstrate that many of these distributions can be expressed in a common language of Gaussian process kernels constructed from a few base elements and operators. The structure of this language allows for the effective automatic construction of probabilistic models for functions. Furthermore, the formal mathematical language of kernels can be mapped neatly onto natural language, allowing for automatic descriptions of the automatically constructed models. By further automating the construction of statistical models, the need to be able to effectively check or criticise these models becomes greater. This thesis demonstrates how kernel two sample tests can be used to demonstrate where a probabilistic model most disagrees with data, allowing for targeted improvements to the model. In proposing a new method of model criticism this thesis also briefly discusses the philosophy of model criticism within the context of probabilistic inference.

Statistical model criticism using kernel two sample tests. In Advances in Neural Information Processing Systems 29, pages 1-9, Montreal, Canada, December. Abstract: We propose an exploratory approach to statistical model criticism using maximum mean discrepancy (MMD) two sample tests. Typical approaches to model criticism require a practitioner to select a statistic by which to measure discrepancies between data and a statistical model.
MMD two sample tests are instead constructed as an analytic maximisation over a large space of possible statistics, and therefore automatically select the statistic which most clearly shows any discrepancy. We demonstrate on synthetic data that the selected statistic, called the witness function, can be used to identify where a statistical model most misrepresents the data it was trained on. We then apply the procedure to real data where the models being assessed are restricted Boltzmann machines, deep belief networks and Gaussian process regression, and demonstrate the ways in which these models fail to capture the properties of the data they are trained on.

A causal perspective on domain adaptation. arXiv preprint. Abstract: Given training data from several related domains (or tasks), methods of domain adaptation try to combine knowledge to improve performance. This paper discusses an approach to domain adaptation which is inspired by a causal interpretation of the multi-task problem. We assume that a covariate shift assumption holds true for a subset of predictor variables: the conditional of the target variable given this subset of predictors is invariant with respect to shifts in those predictors (covariates). We propose to learn the corresponding conditional expectation in the training domains and use it for estimation in the target domain. We further introduce a method which allows for automatic inference of the above subset in regression and classification. We study the performance of this approach in an adversarial setting, i.e., the case where no additional examples are available in the test domain. If a labeled sample is available, we provide a method for using both the transferred invariant conditional and task-specific information. We present results on synthetic data sets and a sentiment analysis problem.

Practical probabilistic programming with monads. In Proceedings of the 8th ACM SIGPLAN Symposium on Haskell. Association for Computing Machinery. Abstract: The machine learning community has recently shown a lot of interest in practical probabilistic programming systems that target the problem of Bayesian inference. Such systems come in different forms, but they all express probabilistic models as computational processes using syntax resembling programming languages. In the functional programming community, monads are known to offer a convenient and elegant abstraction for programming with probability distributions, but their use is often limited to very simple inference problems. We show that it is possible to use the monad abstraction to construct probabilistic models for machine learning, while still offering good performance of inference in challenging models. We use a GADT as an underlying representation of a probability distribution and apply Sequential Monte Carlo-based methods to achieve efficient inference. We define a formal semantics via measure theory. We demonstrate a clean and elegant implementation that achieves performance comparable with Anglican, a state-of-the-art probabilistic programming system.
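To illustrate the monadic abstraction in the entry above without Haskell, here is a toy probability monad in Python, representing a distribution by its sampler with return/bind. This is a deliberately simplified sketch: the paper uses a GADT representation and SMC-based inference, neither of which is attempted here:

```python
# A toy probability monad: distributions as samplers, composed with bind.
import random

class Dist:
    def __init__(self, sampler):
        self.sample = sampler                 # () -> value

    @staticmethod
    def unit(x):                              # monadic return
        return Dist(lambda: x)

    def bind(self, f):                        # f: value -> Dist
        return Dist(lambda: f(self.sample()).sample())

def bernoulli(p):
    return Dist(lambda: random.random() < p)

def normal(mu, sigma):
    return Dist(lambda: random.gauss(mu, sigma))

# A tiny mixture model written monadically: choose a component, then emit.
mixture = bernoulli(0.3).bind(lambda z: normal(5.0 if z else 0.0, 1.0))
print([round(mixture.sample(), 2) for _ in range(5)])
```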
Multisets are a generalisation of sets where members are allowed to occur multiple times. A multiset can be encoded naively by simply storing its elements in some sequential order, but then information is wasted on the ordering. We propose a technique that transforms the multiset into an order-invariant tree representation, and derive an arithmetic code that optimally compresses the tree. Our method achieves compression even if the sequences in the multiset are individually incompressible (such as cryptographic hash sums). The algorithm is demonstrated practically by compressing collections of SHA-1 hash sums, and multisets of arbitrary, individually encodable objects.

Christian Steinruecken, Zoubin Ghahramani, and David MacKay. Improving PPM with dynamic parameter updates. In Storer, editors, Proceedings of the Data Compression Conference. IEEE Computer Society, April. Abstract: This article makes several improvements to the classic PPM algorithm, resulting in a new algorithm with superior compression effectiveness on human text. The key differences of our algorithm to classic PPM are that (A) rather than the original escape mechanism, we use a generalised blending method with explicit hyper-parameters that control the way symbol counts are combined to form predictions; (B) different hyper-parameters are used for classes of different contexts; and (C) these hyper-parameters are updated dynamically using gradient information. The resulting algorithm (PPM-DP) compresses human text better than all currently published variants of PPM, CTW, DMC, LZ, CSE and BWT, with runtime only slightly slower than classic PPM.

Learning stationary time series using Gaussian processes with nonparametric kernels. Abstract: We introduce the Gaussian Process Convolution Model (GPCM), a two-stage nonparametric generative procedure that models stationary signals as the convolution between a continuous-time white-noise process and a continuous-time linear filter drawn from a Gaussian process. The GPCM is a continuous-time nonparametric-window moving average process and, conditionally, is itself a Gaussian process with a nonparametric kernel defined in a probabilistic fashion. The generative model can be equivalently considered in the frequency domain, where the power spectral density of the signal is specified using a Gaussian process. One of the main contributions of the paper is to develop a novel variational free-energy approach based on inter-domain inducing variables that efficiently learns the continuous-time linear filter and infers the driving white-noise process. In turn, this scheme provides closed-form probabilistic estimates of the covariance kernel and the noise-free signal in both denoising and prediction scenarios. Additionally, the variational inference procedure provides closed-form expressions for the approximate posterior of the spectral density given the observed data, leading to new Bayesian nonparametric approaches to spectrum estimation. The proposed GPCM is validated using synthetic and real-world signals.

Unsupervised state-space modeling using reproducing kernels. IEEE Transactions on Signal Processing. Abstract: A novel framework for the design of state-space models (SSMs) is proposed, whereby the state-transition function of the model is parametrized using reproducing kernels. The nature of SSMs requires learning a latent function that resides in the state space and for which input-output sample pairs are not available, thus prohibiting the use of gradient-based supervised kernel learning. To this end, we propose to learn the mixing weights of the kernel estimate by sampling from their posterior density using Monte Carlo methods. We first introduce an offline version of the proposed algorithm, followed by an online version which performs inference on both the parameters and the hidden state through particle filtering. The accuracy of the estimation of the state-transition function is first validated on synthetic data. Next, we show that the proposed algorithm outperforms kernel adaptive filters in the prediction of real-world time series, while also providing probabilistic estimates, a key advantage over standard methods.
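The parametrisation in the entry above writes the state-transition function as a kernel expansion f(x) = sum_j w_j k(x, c_j). The paper infers the mixing weights w by Monte Carlo sampling of their posterior; the sketch below instead uses a simple ridge-regression point estimate on one-step pairs, purely to show the parametrisation (toy system and centres are assumptions):

```python
# Fit a kernel-expansion transition model f(x) = sum_j w_j k(x, c_j)
# from one-step pairs of a toy scalar system.
import numpy as np

rng = np.random.default_rng(2)
k = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

# One-step training pairs (x_t, x_{t+1}).
x = rng.uniform(-3, 3, 200)
x_next = np.tanh(2 * x) + 0.05 * rng.standard_normal(200)

centers = np.linspace(-3, 3, 25)          # kernel centres c_j
Phi = k(x, centers)                       # design matrix of kernel features
w = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(25), Phi.T @ x_next)

f = lambda xq: k(np.atleast_1d(xq), centers) @ w
print(f(0.5), np.tanh(1.0))               # learned vs. true transition at x=0.5
```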
High-dimensional kernel regression: A guide for practitioners. In Kwan, editor, Trends in Digital Signal Processing: A Festschrift in Honour of A. G. Constantinides.

Design of positive-definite quaternion kernels. IEEE Signal Processing Letters. Abstract: Quaternion reproducing kernel Hilbert spaces (QRKHS) have been proposed recently and provide a high-dimensional feature space alternative to the real-valued multikernel approach for general kernel-learning applications. The current challenge within quaternion-kernel learning is the lack of general quaternion-valued kernels, which are necessary to exploit the full advantages of the QRKHS theory in real-world problems. This letter proposes a novel way to design quaternion-valued kernels; this is achieved by transforming three complex kernels into quaternion ones and then combining their real and imaginary parts. Building on this general construction, our emphasis is on a new quaternion kernel of polynomial features, which is assessed in the prediction of body-sensor network applications.

Modelling of complex signals using Gaussian processes. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Abstract: In complex-valued signal processing, estimation algorithms require complete knowledge (or accurate estimation) of the second-order statistics; this makes Gaussian processes (GPs) well suited for modelling complex signals, as they are designed in terms of covariance functions. Dealing with bivariate signals using GPs requires four covariance matrices, or equivalently, two complex matrices. We propose a GP-based approach for modelling complex signals, whereby the second-order statistics are learnt through maximum likelihood; in particular, the complex GP approach allows for circularity coefficient estimation in a robust manner when the observed signal is corrupted by circular white noise. The proposed model is validated using climate signals, for both circular and noncircular cases. The results obtained open new possibilities for collaboration between the complex signal processing and Gaussian processes communities towards an appealing representation and statistical description of bivariate signals.

Revisiting the limits of MAP inference by MWSS on perfect graphs. In 18th International Conference on Artificial Intelligence and Statistics, San Diego, California, USA, May. Abstract: A recent, promising approach to identifying a configuration of a discrete graphical model with highest probability (termed MAP inference) is to reduce the problem to finding a maximum weight stable set (MWSS) in a derived weighted graph, which, if perfect, allows a solution to be found in polynomial time. Weller and Jebara investigated the class of binary pairwise models where this method may be applied. However, their analysis made a seemingly innocuous assumption which simplifies analysis but led to only a subset of possible reparameterizations being considered. Here we introduce novel techniques and consider all cases, demonstrating that this greatly expands the set of tractable models. We provide a simple, exact characterization of the new, enlarged set and show how such models may be efficiently identified, thus settling the power of the approach on this class. Comment: Also accepted for presentation at the 21st International Conference on Principles and Practice of Constraint Programming (CP).
Adrian Weller. Bethe and related pairwise entropy approximations. In 31st Conference on Uncertainty in Artificial Intelligence, Amsterdam, July. Abstract: For undirected graphical models, belief propagation often performs remarkably well for approximate marginal inference, and may be viewed as a heuristic to minimize the Bethe free energy. Focusing on binary pairwise models, we demonstrate that several recent results on the Bethe approximation may be generalized to a broad family of related pairwise free energy approximations with arbitrary counting numbers. We explore approximation error and shed light on the empirical success of the Bethe approximation. Comment: Supplementary Material

B. Policy search for learning robot control using sparse data. In IEEE International Conference on Robotics and Automation, Hong Kong, China. IEEE. Abstract: In many complex robot applications, such as grasping and manipulation, it is difficult to program desired task solutions beforehand, as robots are within an uncertain and dynamic environment. In such cases, learning tasks from experience can be a useful alternative. To obtain a sound learning and generalization performance, machine learning, especially reinforcement learning, usually requires sufficient data. However, in cases where only little data is available for learning, due to system constraints and practical issues, reinforcement learning can act suboptimally. In this paper, we investigate how model-based reinforcement learning, in particular the probabilistic inference for learning control method (PILCO), can be tailored to cope with the case of sparse data to speed up learning. The basic idea is to include further prior knowledge into the learning process. As PILCO is built on the probabilistic Gaussian processes framework, additional system knowledge can be incorporated by defining appropriate prior distributions, e.g. for the system dynamics. The resulting PILCO formulation remains in closed form and analytically tractable. The proposed approach is evaluated in simulation as well as on a physical robot, the Festo Robotino XT. For the robot evaluation, we employ the approach for learning an object pick-up task. The results show that by including prior knowledge, policy learning can be sped up in presence of sparse data.

Wouter Boomsma, Pengfei Tian, Jes Frellsen, Jesper Ferkinghoff-Borg, Thomas Hamelryck, Kresten Lindorff-Larsen, and Michele Vendruscolo. Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts. Proceedings of the National Academy of Sciences. Abstract: Methods of protein structure determination based on NMR chemical shifts are becoming increasingly common. The most widely used approaches adopt the molecular fragment replacement strategy, in which structural fragments are repeatedly reassembled into different complete conformations in molecular simulations. Although these approaches are effective in generating individual structures consistent with the chemical shift data, they do not enable the sampling of the conformational space of proteins with correct statistical weights. Here, we present a method of molecular fragment replacement that makes it possible to perform equilibrium simulations of proteins, and hence to determine their free energy landscapes. This strategy is based on the encoding of the chemical shift information in a probabilistic model in Markov chain Monte Carlo simulations. First, we demonstrate that with this approach it is possible to fold proteins to their native states starting from extended structures.
Second, we show that the method satisfies the detailed balance condition and hence it can be used to carry out an equilibrium sampling from the Boltzmann distribution corresponding to the force field used in the simulations. Third, by comparing the results of simulations carried out with and without chemical shift restraints, we describe quantitatively the effects that these restraints have on the free energy landscapes of proteins. Taken together, these results demonstrate that the molecular fragment replacement strategy can be used in combination with chemical shift information to characterize not only the native structures of proteins but also their conformational fluctuations.

Scalable Gaussian process structured prediction for grid factor graph applications. In 31st International Conference on Machine Learning. Abstract: Structured prediction is an important and well-studied problem with many applications across machine learning. However, the memory demand of GPstruct is quadratic in the number of latent variables and training runtime scales cubically. This prevents GPstruct from being applied to problems involving grid factor graphs, which are prevalent in computer vision and spatial statistics applications. Here we explore a scalable approach to learning GPstruct models based on ensemble learning, with weak learners (predictors) trained on subsets of the latent variables and bootstrap data, which can easily be distributed. We show experiments with 4M latent variables on image segmentation. Our method outperforms widely-used conditional random field models trained with pseudo-likelihood. Moreover, in image segmentation problems it improves over recent state-of-the-art marginal optimisation methods in terms of predictive performance and uncertainty calibration. Finally, it generalises well on all training set sizes.

Tree-structured Gaussian process approximations. Abstract: Gaussian process regression can be accelerated by constructing a small pseudo-dataset to summarize the observed data. This idea sits at the heart of many approximation schemes, but such an approach requires the number of pseudo-datapoints to be scaled with the range of the input space if the accuracy of the approximation is to be maintained. This presents problems in time-series settings or in spatial datasets where large numbers of pseudo-datapoints are required, since computation typically scales quadratically with the pseudo-dataset size. In this paper we devise an approximation whose complexity grows linearly with the number of pseudo-datapoints. This is achieved by imposing a tree or chain structure on the pseudo-datapoints and calibrating the approximation using a Kullback-Leibler (KL) minimization. Inference and learning can then be performed efficiently using the Gaussian belief propagation algorithm. We demonstrate the validity of our approach on a set of challenging regression tasks including missing data imputation for audio and spatial datasets. We trace out the speed-accuracy trade-off for the new method and show that the frontier dominates those obtained from a large number of existing approximation techniques.

Alex Davies and Zoubin Ghahramani. The random forest kernel and other kernels for big data from random partitions. arXiv preprint. Abstract: We present Random Partition Kernels, a new class of kernels derived by demonstrating a natural connection between random partitions of objects and kernels between those objects. We show how the construction can be used to create kernels from methods that would not normally be viewed as random partitions, such as Random Forest.
To demonstrate the potential of this method, we propose two new kernels, the Random Forest Kernel and the Fast Cluster Kernel, and show that these kernels consistently outperform standard kernels on problems involving real-world datasets. Finally, we show how the form of these kernels lends itself to a natural approximation that is appropriate for certain big data problems, allowing O(N) inference in methods such as Gaussian processes, support vector machines and kernel PCA.

Avoiding pathologies in very deep networks. In 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, April. Abstract: Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance. To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions. Specifically, we study the deep Gaussian process, a type of infinitely-wide, deep neural network. We show that in standard architectures, the representational capacity of the network tends to capture fewer degrees of freedom as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture which does not suffer from this pathology. We also examine deep covariance functions, obtained by composing infinitely many feature transforms. Lastly, we characterize the class of models obtained by performing dropout on Gaussian processes.

Jes Frellsen, Thomas Hamelryck, and Jesper Ferkinghoff-Borg. Combining the multicanonical ensemble with generative probabilistic models of local biomolecular structure. In Proceedings of the 59th World Statistics Congress of the International Statistical Institute, Hong Kong. Abstract: Markov chain Monte Carlo is a powerful tool for sampling complex systems such as large biomolecular structures. However, the standard Metropolis-Hastings algorithm suffers from a number of deficiencies when applied to systems with rugged free-energy landscapes. Some of these deficiencies can be addressed with the multicanonical ensemble. In this paper we will present two strategies for applying the multicanonical ensemble to distributions constructed from generative probabilistic models of local biomolecular structure. In particular, we will describe how to use the multicanonical ensemble efficiently in conjunction with the reference ratio method.

Variational Gaussian process state-space models. In Weinberger, editors, Advances in Neural Information Processing Systems 27. Abstract: State-space models have been successfully used for more than fifty years in different areas of science and engineering. We present a procedure for efficient variational Bayesian learning of nonlinear state-space models based on sparse Gaussian processes. The result of learning is a tractable posterior over nonlinear dynamical systems. In comparison to conventional parametric models, we offer the possibility to straightforwardly trade off model capacity and computational cost whilst avoiding overfitting. Our main algorithm uses a hybrid inference approach combining variational Bayes and sequential Monte Carlo. We also present stochastic variational inference and online learning approaches for fast learning with long time series.
Identification of Gaussian process state-space models with particle stochastic approximation EM. In Proceedings of the 19th World Congress of the International Federation of Automatic Control (IFAC). Abstract: Gaussian process state-space models (GP-SSMs) are a very flexible family of models of nonlinear dynamical systems. They comprise a Bayesian nonparametric representation of the dynamics of the system and additional (hyper-)parameters governing the properties of this nonparametric representation. The Bayesian formalism enables systematic reasoning about the uncertainty in the system dynamics. We present an approach to maximum likelihood identification of the parameters in GP-SSMs, while retaining the full nonparametric description of the dynamics. The method is based on a stochastic approximation version of the EM algorithm that employs recent developments in particle Markov chain Monte Carlo for efficient identification.

Yarin Gal and Zoubin Ghahramani. Pitfalls in the use of parallel inference for the Dirichlet process. In Proceedings of the 31st International Conference on Machine Learning (ICML). Abstract: Recent work done by Lovell, Adams, and Mansingka and by Williamson, Dubey, and Xing has suggested an alternative parametrisation for the Dirichlet process in order to derive non-approximate parallel MCMC inference for it, work which has been picked up and implemented in several different fields. In this paper we show that the approach suggested is impractical due to an extremely unbalanced distribution of the data. We characterise the requirements of efficient parallel inference for the Dirichlet process and show that the proposed inference fails most of these requirements, while approximate approaches often satisfy most of them. We present both theoretical and experimental evidence, analysing the load balance for the inference and showing that it is independent of the size of the dataset and the number of nodes available in the parallel implementation. We end with suggestions of alternative paths of research for efficient non-approximate parallel inference for the Dirichlet process.

Distributed variational inference in sparse Gaussian process regression and latent variable models. In Weinberger, editors, Advances in Neural Information Processing Systems 27. Curran Associates, Inc. Abstract: Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions. They have been applied to both regression and non-linear dimensionality reduction, and offer desirable properties such as uncertainty estimates, robustness to over-fitting, and principled ways for tuning hyper-parameters. However the scalability of these models to big datasets remains an active topic of research. We introduce a novel re-parametrisation of variational inference for sparse GP regression and latent variable models that allows for an efficient distributed algorithm. This is done by exploiting the decoupling of the data given the inducing points to re-formulate the evidence lower bound in a Map-Reduce setting. We show that the inference scales well with data and computational resources, while preserving a balanced distribution of the load among the nodes. We further demonstrate the utility in scaling Gaussian processes to big data. We show that GP performance improves with increasing amounts of data in regression on flight data with 2 million records and latent variable modelling on MNIST. The results show that GPs perform better than many common models often used for big data.
Gaussian process volatility model. In Advances in Neural Information Processing Systems 28, pages 1-9, Montreal, Canada, December. Abstract: The prediction of time-changing variances is an important task in the modeling of financial data. Standard econometric models are often limited as they assume rigid functional relationships for the evolution of the variance. Moreover, functional parameters are usually learned by maximum likelihood, which can lead to overfitting. To address these problems we introduce GP-Vol, a novel non-parametric model for time-changing variances based on Gaussian processes. This new model can capture highly flexible functional relationships for the variances. Furthermore, we introduce a new online algorithm for fast inference in GP-Vol. This method is much faster than current offline inference procedures and it avoids overfitting problems by following a fully Bayesian approach. Experiments with financial data show that GP-Vol performs significantly better than current standard alternatives.

Knowles, and Zoubin Ghahramani. In 31st International Conference on Machine Learning, Beijing, China, June. Abstract: We define the beta diffusion tree, a random tree structure with a set of leaves that defines a collection of overlapping subsets of objects, known as a feature allocation. With the beta diffusion tree, however, multiple copies of a particle may exist and diffuse to multiple locations in the continuous space, resulting in a random number of possibly overlapping clusters of the objects. We demonstrate how to build a hierarchically-clustered factor analysis model with the beta diffusion tree and how to perform inference over the random tree structures with a Markov chain Monte Carlo algorithm. We conclude with several numerical experiments on missing data problems with data sets of gene expression arrays, international development statistics, and intranational socioeconomic measurements.

Beta diffusion trees and hierarchical feature allocations. Abstract: A generative process for the tree structure is defined in terms of particles (representing the objects) diffusing in some continuous space, analogously to the Dirichlet diffusion tree (Neal, 2003b), which defines a tree structure over partitions (i.e., non-overlapping subsets) of the objects. Unlike in the Dirichlet diffusion tree, multiple copies of a particle may exist and diffuse along multiple branches in the beta diffusion tree, and an object may belong to multiple subsets of particles. We conclude with several numerical experiments on missing data problems with data sets of gene expression microarrays, international development statistics, and intranational socioeconomic measurements.

The combinatorial structure of beta negative binomial processes. Abstract: In Bayesian nonparametric applications, such processes have served as models for unknown multisets of a measurable space. Previous work has characterized random subsets arising from conditionally-i.i.d. sequences of Bernoulli processes. In this case, the combinatorial structure is described by the Indian buffet process. Our results provide a count analogue of the Indian buffet process, which we call a negative binomial Indian buffet process. As an intermediate step toward this goal, we provide constructions for the beta negative binomial process that avoid a representation of the underlying beta process base measure.
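For contrast with the count-valued analogue derived in the entry above, here is the classic binary Indian buffet process as a generative sampler: customer i takes each existing dish k with probability m_k / i, then tries Poisson(alpha / i) new dishes. A minimal sketch (not the paper's negative binomial construction):

```python
# Sample a binary feature-allocation matrix Z from the Indian buffet process.
import numpy as np

def sample_ibp(num_customers, alpha, rng):
    dishes = []                                  # popularity counts m_k
    rows = []
    for i in range(1, num_customers + 1):
        row = []
        for kidx, m_k in enumerate(dishes):
            take = rng.random() < m_k / i        # existing dish with prob m_k/i
            row.append(int(take))
            dishes[kidx] += int(take)
        new = rng.poisson(alpha / i)             # new dishes for this customer
        row.extend([1] * new)
        dishes.extend([1] * new)
        rows.append(row)
    Z = np.zeros((num_customers, len(dishes)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

print(sample_ibp(8, 2.0, np.random.default_rng(3)))
```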
Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems 28, pages 1-9, Montreal, Canada, December. Abstract: We propose a novel information-theoretic approach for Bayesian optimization called Predictive Entropy Search (PES). At each iteration, PES selects the next evaluation point that maximizes the expected information gained with respect to the global maximum. PES codifies this intractable acquisition function in terms of the expected reduction in the differential entropy of the predictive distribution. This reformulation allows PES to obtain approximations that are both more accurate and efficient than other alternatives such as Entropy Search (ES). Furthermore, PES can easily perform a fully Bayesian treatment of the model hyperparameters while ES cannot. We evaluate PES in both synthetic and real-world applications, including optimization problems in machine learning, finance, biotechnology, and robotics. We show that the increased accuracy of PES leads to significant gains in optimization performance.

Probabilistic matrix factorization with non-random missing data. In 31st International Conference on Machine Learning, Beijing, China, June. Abstract: We propose a probabilistic matrix factorization model for collaborative filtering that learns from data that is missing not at random (MNAR). Matrix factorization models exhibit state-of-the-art predictive performance in collaborative filtering. However, these models usually assume that the data is missing at random (MAR), and this is rarely the case. For example, the data is not MAR if users rate items they like more than ones they dislike. When the MAR assumption is incorrect, inferences are biased and predictive performance can suffer. Therefore, we model both the generative process for the data and the missing-data mechanism. By learning these two models jointly we obtain improved performance over state-of-the-art methods when predicting the ratings and when modeling the data observation process. We present the first viable MF model for MNAR data. Our results are promising and we expect that further research on MNAR models will yield large gains in collaborative filtering.

Stochastic inference for scalable probabilistic modeling of binary matrices. In 31st International Conference on Machine Learning, Beijing, China, June. Abstract: Fully observed large binary matrices appear in a wide variety of contexts. To model them, probabilistic matrix factorization (PMF) methods are an attractive solution. However, current batch algorithms for PMF can be inefficient because they need to analyze the entire data matrix before producing any parameter updates. We derive an efficient stochastic inference algorithm for PMF models of fully observed binary matrices. Our method exhibits faster convergence rates than more expensive batch approaches and has better predictive performance than scalable alternatives. The proposed method includes new data subsampling strategies which produce large gains over standard uniform subsampling. We also address the task of automatically selecting the size of the minibatches of data used by our method. For this, we derive an algorithm that adjusts this hyper-parameter online.
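As a baseline for the stochastic inference described above, here is plain minibatch SGD for a logistic matrix factorization of a fully observed binary matrix, with uniform entry subsampling. The paper's contribution is precisely to go beyond this uniform scheme (better subsampling strategies, online minibatch-size adaptation); the sketch only shows the model and the stochastic-update pattern:

```python
# Logistic matrix factorization of a binary matrix X via SGD on random entries.
import numpy as np

rng = np.random.default_rng(4)
n, m, d = 200, 150, 5
U_true, V_true = rng.standard_normal((n, d)), rng.standard_normal((m, d))
X = (rng.random((n, m)) < 1 / (1 + np.exp(-U_true @ V_true.T))).astype(float)

U, V = 0.1 * rng.standard_normal((n, d)), 0.1 * rng.standard_normal((m, d))
lr, reg = 0.05, 1e-3
for step in range(20000):
    i, j = rng.integers(n), rng.integers(m)      # uniform entry subsampling
    p = 1 / (1 + np.exp(-U[i] @ V[j]))
    g = p - X[i, j]                              # gradient of logistic loss
    U[i], V[j] = (U[i] - lr * (g * V[j] + reg * U[i]),
                  V[j] - lr * (g * U[i] + reg * V[j]))

P = 1 / (1 + np.exp(-U @ V.T))
print("train log-likelihood per entry:",
      np.mean(X * np.log(P) + (1 - X) * np.log(1 - P)))
```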
In 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, April. Abstract: We address the problem of finding the maximizer of a nonlinear function that can only be evaluated, subject to noise, at a finite number of query locations. Further, we will assume that there is a constraint on the total number of permitted function evaluations. We introduce a Bayesian approach for this problem and show that it empirically outperforms both the existing frequentist counterpart and other Bayesian optimization methods. The Bayesian approach places emphasis on detailed modelling, including the modelling of correlations among the arms. As a result, it can perform well in situations where the number of arms is much larger than the number of allowed function evaluations, whereas the frequentist counterpart is inapplicable. This feature enables us to develop and deploy practical applications, such as automatic machine learning toolboxes. The paper presents comprehensive comparisons of the proposed approach with many Bayesian and bandit optimization techniques, the first comparison of many of these methods in the literature. Neil Houlsby and Massimiliano Ciaramita. A scalable Gibbs sampler for probabilistic entity linking. In 36th European Conference on Information Retrieval. Springer. Abstract: Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities, in the millions, and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with the large number of topics we propose a novel efficient Gibbs sampling scheme which can also incorporate side information, such as the Wikipedia graph. This conceptually simple probabilistic approach achieves state-of-the-art performance in entity linking on the AIDA-CoNLL dataset. Cold-start active learning with robust ordinal matrix factorization. In 31st International Conference on Machine Learning, Beijing, China, June. Abstract: We present a new matrix factorization model for rating data and a corresponding active learning strategy to address the cold-start problem. Cold-start is one of the most challenging tasks for recommender systems: what to recommend for new users or items about which one has little or no data. One approach is to use active learning to collect the most useful initial ratings. However, the performance of active learning depends strongly upon having accurate estimates of (i) the uncertainty in model parameters and (ii) the intrinsic noisiness of the data. To achieve these estimates we propose a heteroskedastic Bayesian model for ordinal matrix factorization. We also present a computationally efficient framework for Bayesian active learning with this type of complex probabilistic model. This algorithm successfully distinguishes between informative and noisy data points. Our model yields state-of-the-art predictive performance and, coupled with our active learning strategy, enables us to gain useful information in the cold-start setting from the very first active sample. Peter Kerpedjiev, Jes Frellsen, Stinus Lindgreen, and Anders Krogh. Adaptable probabilistic mapping of short reads using position specific scoring matrices.
BMC Bioinformatics. Abstract: BACKGROUND: Modern DNA sequencing methods produce vast amounts of data that often requires mapping to a reference genome. Most existing programs use the number of mismatches between the read and the genome as a measure of quality. This approach lacks a statistical foundation and can, for some data types, result in many wrongly mapped reads. Our method achieves better results than Bowtie and BWA on simulated and real ancient and PAR-CLIP reads, as well as on simulated reads from the AT-rich organism P. For simulated Illumina reads, the method has consistently higher sensitivity for both single-end and paired-end data. We also show that our probabilistic approach can limit the problem of random matches from short reads of contamination and that it improves the mapping of real reads from one organism D. CONCLUSION: The presented work is an implementation of a novel approach to short read mapping where quality scores, prior mismatch probabilities and mapping qualities are handled in a statistically sound manner. Comment: Peter Kerpedjiev and Jes Frellsen contributed equally. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In 31st International Conference on Machine Learning, Beijing, China, June. Abstract: Can we make Bayesian posterior MCMC sampling more efficient when faced with very large datasets? We argue that computing the likelihood for N datapoints in the Metropolis-Hastings (MH) test to reach a single binary decision is computationally inefficient. We introduce an approximate MH rule based on a sequential hypothesis test that allows us to accept or reject samples with high confidence using only a fraction of the data required for the exact MH rule. While this method introduces an asymptotic bias, we show that this bias can be controlled and is more than offset by a decrease in variance due to our ability to draw more samples per unit of time. However, it is unclear exactly which envelope modulation rates and statistics are the most important for the rhythm percept. Here, the hypothesis that the phase relationship between "Stress" rate (Hz) and "Syllable" rate (Hz) AMs provides a perceptual cue for speech rhythm is tested. The results confirmed these predictions. Listeners' judgments of rhythm systematically followed Stress-Syllable AM phase-shifts, but were unaffected by phase-shifts between the Syllable AM and the Sub-beat AM (Hz) in a control condition. It is concluded that the Stress-Syllable AM phase relationship is an envelope-based modulation statistic that supports speech rhythm perception. Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI), July. Abstract: This paper presents the beginnings of an automatic statistician, focusing on regression problems. Our system explores an open-ended space of statistical models to discover a good explanation of a data set, and then produces a detailed report with figures and natural-language text. Our approach treats unknown regression functions nonparametrically using Gaussian processes, which has two important consequences. First, Gaussian processes can model functions in terms of high-level properties (e.g., smoothness, trends, periodicity, changepoints). Second, the use of flexible nonparametric models and a rich language for composing them in an open-ended manner also results in state-of-the-art extrapolation performance evaluated over 13 real time series data sets from various domains.
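As a rough illustration of the kind of search such a system performs, the following sketch (an illustrative reconstruction under simplifying assumptions, not the authors' code) builds a handful of kernel compositions from base kernels and scores each by the GP log marginal likelihood, the quantity that guides structure discovery; hyperparameters are fixed here, whereas the real system optimises them and explores the space greedily and open-endedly.

import numpy as np

def se(x, y, ell=1.0):          # squared-exponential base kernel
    return np.exp(-0.5 * (x[:, None] - y[None, :]) ** 2 / ell ** 2)

def per(x, y, p=1.0, ell=1.0):  # periodic base kernel
    d = np.pi * np.abs(x[:, None] - y[None, :]) / p
    return np.exp(-2.0 * np.sin(d) ** 2 / ell ** 2)

def lin(x, y):                  # linear base kernel
    return x[:, None] * y[None, :]

def log_marginal_likelihood(K, y, noise=0.1):
    # log p(y | X) for a zero-mean GP with kernel matrix K and Gaussian noise
    n = len(y)
    L = np.linalg.cholesky(K + noise ** 2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * n * np.log(2 * np.pi)

# Toy data with a periodic component on top of a quadratic trend.
x = np.linspace(0, 10, 100)
y = np.sin(2 * np.pi * x) + 0.1 * x ** 2 + 0.1 * np.random.randn(100)

# One round of scoring over hand-built compositions of base kernels.
candidates = {
    "SE": se(x, x),
    "PER": per(x, x),
    "SE + PER": se(x, x) + per(x, x),
    "LIN * LIN": lin(x, x) * lin(x, x),
    "LIN * LIN + PER": lin(x, x) * lin(x, x) + per(x, x),
}
for name, K in candidates.items():
    print(f"{name:>16}: {log_marginal_likelihood(K, y):.1f}")

On data like this the composite structure (a product of linear kernels plus a periodic kernel) should score highest, which is exactly the kind of interpretable decomposition the report-generating system describes in natural language.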
Randomized nonlinear component analysis. Classical methods such as principal component analysis (PCA) and canonical correlation analysis (CCA) are ubiquitous in statistics. However, these techniques only reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, they are computationally prohibitive in the large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving dramatic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas also extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. Code in R implementing our methods is provided in the Appendix. Matthews and Zoubin Ghahramani. Classification using log Gaussian Cox processes. arXiv preprint. Abstract: McCullagh and Yang suggest a family of classification algorithms based on Cox processes. We further investigate the log Gaussian variant which has a number of appealing properties. Conditioned on the covariates, the distribution over labels is given by a type of conditional Markov random field. In the supervised case, computation of the predictive probability of a single test point scales linearly with the number of training points and the multiclass generalization is straightforward. We show new links between the supervised method and classical nonparametric methods. We give a detailed analysis of the pairwise graph representable Markov random field, which we use to extend the model to semi-supervised learning problems, and propose an inference method based on graph min-cuts. We give the first experimental analysis on supervised and semi-supervised datasets and show good empirical performance. G Matthews, James Hensman, and Zoubin Ghahramani. Comparing lower bounds on the entropy of mixture distributions for use in variational inference. In NIPS Workshop on Advances in Variational Inference, Montreal, Canada, December. Andrew McHutchon. Nonlinear modelling and control using Gaussian processes. PhD thesis, University of Cambridge, Department of Engineering, Cambridge, UK. Abstract: In many scientific disciplines it is often required to make predictions about how a system will behave or to deduce the correct control values to elicit a particular desired response. In the most interesting situations, the model needs to capture strongly nonlinear effects and deal with the presence of uncertainty and noise. Building models for such systems purely based on a theoretical understanding of underlying physical principles can be infeasibly complex and require a large number of simplifying assumptions. An alternative is to use a data-driven approach, which builds a model directly from observations. A powerful and principled approach to doing this is to use a Gaussian process (GP). In this thesis we start by discussing how GPs can be applied to data sets which have noise affecting their inputs. We present the "Noisy Input GP", which uses a simple local-linearisation to refer the input noise into heteroscedastic output noise, and compare it to other methods both analytically and empirically. We show that this technique leads to an effective model for nonlinear functions with input and output noise. We then consider the broad topic of GP state space models for application to dynamical systems.
We discuss a very wide variety of approaches for using GPs in state space models, including introducing a new method based on moment-matching, which consistently gave the best performance. We analyse the methods in some detail, including providing a systematic comparison between approximate-analytic and particle methods. To our knowledge such a comparison has not been provided before in this area. Finally, we investigate an automatic control learning framework, which uses Gaussian processes to model the system for which we wish to design a controller. Controller design for complex systems is a difficult task and thus a framework which allows an automatic design directly from data promises to be extremely useful. We demonstrate that the previously published framework cannot cope with the presence of observation noise but that the introduction of a state space model dramatically improves its performance. This contribution, along with some other suggested improvements, opens the door for this framework to be used in real-world applications. A reversible infinite HMM using normalised random measures. In 31st International Conference on Machine Learning, Beijing, China, June. Abstract: We present a nonparametric prior over reversible Markov chains. We use completely random measures, specifically gamma processes, to construct a countably infinite graph with weighted edges. By enforcing symmetry to make the edges undirected we define a prior over random walks on graphs that results in a reversible Markov chain. The resulting prior over infinite transition matrices is closely related to the hierarchical Dirichlet process but enforces reversibility. A reinforcement scheme has recently been proposed with similar properties, but the de Finetti measure is not well characterised. We take the alternative approach of explicitly constructing the mixing measure, which allows more straightforward and efficient inference at the cost of no longer having a closed form predictive distribution. We use our process to construct a reversible infinite HMM which we apply to two real datasets, one from epigenomics and one from an ion channel recording. The inverse regression topic model. While effective, multinomial inverse regression (MNIR) has no way to exploit structure in the corpus to improve its predictions or facilitate exploratory data analysis. On the other hand, traditional probabilistic topic models like latent Dirichlet allocation capture natural heterogeneity in a collection but do not account for external variables. In this paper, we introduce the inverse regression topic model (IRTM), a mixed-membership extension of MNIR that combines the strengths of both methodologies. We present two inference algorithms for the IRTM: an efficient batch estimation algorithm and an online variant, which is suitable for large corpora. We apply these methods to a corpus of 73K Congressional press releases and another of 150K Yelp reviews, demonstrating that the IRTM outperforms both MNIR and supervised topic models on the prediction task. Further, we give examples showing that the IRTM enables systematic discovery of in-topic lexical variation, which is not possible with previous supervised topic models. Michael Schober, David Duvenaud, and Philipp Hennig. Probabilistic ODE solvers with Runge-Kutta means. arXiv preprint, June. Abstract: Runge-Kutta methods are the classic family of solvers for ordinary differential equations (ODEs), and the basis for the state of the art. Like most numerical methods, they return point estimates.
We construct a family of probabilistic numerical methods that instead return a Gauss-Markov process defining a probability distribution over the ODE solution. In contrast to prior work, we construct this family such that posterior means match the outputs of the Runge-Kutta family exactly, thus inheriting their proven good properties. Remaining degrees of freedom not identified by the match to Runge-Kutta are chosen such that the posterior probability measure fits the observed structure of the ODE. Our results shed light on the structure of Runge-Kutta solvers from a new direction, provide a richer, probabilistic output, have low computational cost, and raise new research questions. Student-t processes as alternatives to Gaussian processes. In AISTATS, JMLR Proceedings. We derive closed form expressions for the marginal likelihood and predictive distribution of a Student-t process, by integrating away an inverse Wishart process prior over the covariance kernel of a Gaussian process model. We show surprising equivalences between different hierarchical Gaussian process models leading to Student-t processes, and derive a new sampling scheme for the inverse Wishart process, which helps elucidate these equivalences. Overall, we show that a Student-t process can retain the attractive properties of a Gaussian process - a nonparametric representation, analytic marginal and predictive distributions, and easy model selection through covariance kernels - but has enhanced flexibility, and predictive covariances that, unlike a Gaussian process, explicitly depend on the values of training observations. We verify empirically that a Student-t process is especially useful in situations where there are changes in covariance structure, or in applications like Bayesian optimization, where accurate predictive covariances are critical for good performance. These advantages come at no additional computational cost over Gaussian processes. Adrian Weller and Tony Jebara. Clamping variables and approximate inference. In Weinberger, editors, Advances in Neural Information Processing Systems 28. Curran Associates, Inc. Abstract: It was recently proved using graph covers that the Bethe partition function is upper bounded by the true partition function for a binary pairwise model that is attractive. Here we provide a new, arguably simpler proof from first principles. We make use of the idea of clamping a variable to a particular value. For an attractive model, we show that summing over the Bethe partition functions for each sub-model obtained after clamping any variable can only raise (and hence improve) the approximation. In fact, we derive a stronger result that may have other useful implications. Repeatedly clamping until we obtain a model with no cycles, where the Bethe approximation is exact, yields the result. We also provide a related lower bound on a broad class of approximate partition functions of general pairwise multi-label models that depends only on the topology. We demonstrate that clamping a few wisely chosen variables can be of practical value by dramatically reducing approximation error. Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. PhD thesis, University of Cambridge, Cambridge, UK. Abstract: Truly intelligent systems are capable of pattern discovery and extrapolation without human intervention. Bayesian nonparametric models, which can uniquely represent expressive prior information and detailed inductive biases, provide a distinct opportunity to develop intelligent systems, with applications in essentially any learning and prediction task.
Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. A covariance kernel determines the support and inductive biases of a Gaussian process. In this thesis, we introduce new covariance kernels to enable fast automatic pattern discovery and extrapolation with Gaussian processes. In the introductory chapter, we discuss the high level principles behind all of the models in this thesis: (1) we can typically improve the predictive performance of a model by accounting for additional structure in data; (2) to automatically discover rich structure in data, a model must have large support and the appropriate inductive biases; (3) we most need expressive models for large datasets, which typically provide more information for learning structure; and (4) we can often exploit the existing inductive biases (assumptions) or structure of a model for scalable inference, without the need for simplifying assumptions. In the context of this introduction, we then discuss, in chapter 2, Gaussian processes as kernel machines, and my views on the future of Gaussian process research. In chapter 3 we introduce the Gaussian process regression network (GPRN) framework, a multi-output Gaussian process method which scales to many output variables, and accounts for input-dependent correlations between the outputs. Underlying the GPRN is a highly expressive kernel, formed using an adaptive mixture of latent basis functions in a neural network like architecture. The GPRN is capable of discovering expressive structure in data. We use the GPRN to model the time-varying expression levels of genes, the spatially varying concentrations of several distinct heavy metals, and multivariate volatility (input-dependent noise covariances) between returns on equity indices and currency exchanges, which is particularly valuable for portfolio allocation. We generalise the GPRN to an adaptive network framework, which does not depend on Gaussian processes or Bayesian nonparametrics; and we outline applications for the adaptive network in nuclear magnetic resonance (NMR) spectroscopy, ensemble learning, and change-point modelling. In chapter 4 we introduce simple closed form kernels for automatic pattern discovery and extrapolation. These spectral mixture (SM) kernels are derived by modelling the spectral density of a kernel (its Fourier transform) using a scale-location Gaussian mixture. SM kernels form a basis for all stationary covariances, and can be used as a drop-in replacement for standard kernels, as they retain simple and exact learning and inference procedures. We use the SM kernel to discover patterns and perform long range extrapolation on atmospheric CO2 trends and airline passenger data, as well as on synthetic examples. We also show that the SM kernel can be used to automatically reconstruct several standard covariances. The SM kernel and the GPRN are highly complementary; we show that using the SM kernel with adaptive basis functions in a GPRN induces an expressive prior over non-stationary kernels. In chapter 5 we introduce GPatt, a method for fast multidimensional pattern extrapolation, particularly suited to image and movie data. Without human intervention - no hand crafting of kernel features, and no sophisticated initialisation procedures - we show that GPatt can solve large scale pattern extrapolation, inpainting and kernel discovery problems, including a problem with 383,400 training points.
GPatt exploits the structure of a spectral mixture product (SMP) kernel, for fast yet exact inference procedures. We find that GPatt significantly outperforms popular alternative scalable Gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits existing model structure are useful in combination for modelling large scale multidimensional patterns. The models in this dissertation have proven to be scalable, with greatly enhanced predictive performance over the alternatives: the extra structure being modelled is an important part of a wide variety of real data - including problems in econometrics, gene expression, geostatistics, nuclear magnetic resonance spectroscopy, ensemble learning, multi-output regression, change point modelling, time series, multivariate volatility, image inpainting, texture extrapolation, video extrapolation, acoustic modelling, and kernel discovery. Gladden, and Andrew Blake. Bayesian inference for NMR spectroscopy with applications to chemical quantification. arXiv preprint. Abstract: Nuclear magnetic resonance (NMR) spectroscopy exploits the magnetic properties of atomic nuclei to discover the structure, reaction state and chemical environment of molecules. We propose a probabilistic generative model and inference procedures for NMR spectroscopy. Specifically, we use a weighted sum of trigonometric functions undergoing exponential decay to model free induction decay (FID) signals. We discuss the challenges in estimating the components of this general model - amplitudes, phase shifts, frequencies, decay rates, and noise variances - and offer practical solutions. We compare with conventional Fourier transform spectroscopy for estimating the relative concentrations of chemicals in a mixture, using synthetic and experimentally acquired FID signals. We find the proposed model is particularly robust to low signal-to-noise ratios (SNR) and overlapping peaks in the Fourier transform of the FID, enabling accurate predictions. Jacob Andreas and Zoubin Ghahramani. A generative model of vector space semantics. This model reformulates earlier tensor-based approaches to vector space semantics as a top-down process, and provides efficient algorithms for transformation from natural language to vectors and from vectors to natural language. We describe procedures for estimating the parameters of the model from positive examples of phrases, and from distributional representations, then use these procedures to obtain similarity judgments for a set of adjective-noun pairs. Konstantinos Bousmalis, Stefanos Zafeiriou, Louis-Philippe Morency, Maja Pantic, and Zoubin Ghahramani. Variational hidden conditional random fields with coupled Dirichlet process mixtures. Hidden conditional random fields (HCRFs) are discriminative latent variable models which have been shown to successfully learn the hidden structure of a given classification problem. An infinite HCRF is an HCRF with a countably infinite number of hidden states, which rids us not only of the necessity to specify a priori a fixed number of hidden states available but also of the problem of overfitting. Markov chain Monte Carlo (MCMC) sampling algorithms are often employed for inference in such models. However, convergence of such algorithms is rather difficult to verify, and as the complexity of the task at hand increases, the computational cost of such algorithms often becomes prohibitive. These limitations can be overcome by variational techniques.
In this paper, we present a generalized framework for infinite HCRF models, and a novel variational inference approach on a model based on coupled Dirichlet process mixtures, the HCRF-DPM. We show that the variational HCRF-DPM is able to converge to a correct number of represented hidden states, and performs as well as the best parametric HCRFs (chosen via cross-validation) for the difficult tasks of recognizing instances of agreement, disagreement, and pain in audiovisual sequences. Bayesian structured prediction using Gaussian processes. arXiv preprint. Abstract: We introduce a conceptually novel structured prediction model, GPstruct, which is kernelized, non-parametric and Bayesian, by design. We motivate the model with respect to existing approaches, among others, conditional random fields (CRFs), maximum margin Markov networks (M3N), and structured support vector machines (SVMstruct), which embody only a subset of its properties. We present an inference procedure based on Markov chain Monte Carlo. The framework can be instantiated for a wide range of structured objects such as linear chains, trees, grids, and other general graphs. As a proof of concept, the model is benchmarked on several natural language processing tasks and a video gesture segmentation task involving a linear chain structure. We show prediction accuracies for GPstruct which are comparable to or exceeding those of CRFs and SVMstruct. Structure discovery in nonparametric regression through compositional kernel search. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June. Abstract: Despite its importance, choosing the structural form of the kernel in nonparametric regression remains a black art. We define a space of kernel structures which are built compositionally by adding and multiplying a small number of base kernels. We present a method for searching over this space of structures which mirrors the scientific discovery process. The learned structures can often decompose functions into interpretable components and enable long-range extrapolation on time-series datasets. Our structure search method outperforms many widely used kernels and kernel combination methods on a variety of prediction tasks. Frederik Eaton and Zoubin Ghahramani. Model reductions for inference: Generality of pairwise, binary, and planar factor graphs. Neural Computation. Abstract: We offer a solution to the problem of efficiently translating algorithms between different types of discrete statistical model. We investigate the expressive power of three classes of model (those with binary variables, with pairwise factors, and with planar topology) as well as their four intersections. We formalize a notion of "simple reduction" for the problem of inferring marginal probabilities and consider whether it is possible to "simply reduce" marginal inference from general discrete factor graphs to factor graphs in each of these seven subclasses. We characterize the reducibility of each class, showing in particular that the class of binary pairwise factor graphs is able to simply reduce only positive models. We also exhibit a continuous "spectral reduction" based on polynomial interpolation, which overcomes this limitation. Experiments assess the performance of standard approximate inference algorithms on the outputs of our reductions. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Weinberger, editors, Advances in Neural Information Processing Systems. Curran Associates, Inc.
We present a fully Bayesian approach to inference and learning in nonlinear nonparametric state-space models. We place a Gaussian process prior over the transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. However, to enable efficient inference, we marginalize over the dynamics of the model and instead infer directly the joint smoothing distribution through the use of specially tailored particle Markov chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. We make use of sparse Gaussian process models to greatly reduce the computational complexity of the approach. Integrated pre-processing for Bayesian nonlinear system identification with Gaussian processes. In Decision and Control (CDC), IEEE 52nd Annual Conference on. Abstract: We introduce GP-FNARX: a new model for nonlinear system identification based on a nonlinear autoregressive exogenous (NARX) model with filtered regressors (F), where the nonlinear regression problem is tackled using sparse Gaussian processes (GP). We integrate data pre-processing with system identification into a fully automated procedure that goes from raw data to an identified model. Both pre-processing parameters and GP hyper-parameters are tuned by maximizing the marginal likelihood of the probabilistic model. The automated approach, the modeling of uncertainty and its relatively low computational cost make GP-FNARX a good candidate for applications in robotics and adaptive control. Yarin Gal and Phil Blunsom. A systematic Bayesian treatment of the IBM alignment models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. Abstract: The dominant yet ageing IBM and HMM word alignment models underpin most popular Statistical Machine Translation implementations in use today. Though beset by the limitations of implausible independence assumptions, intractable optimisation problems, and an excess of tunable parameters, these models provide a scalable and reliable starting point for inducing translation systems. In this paper we build upon this venerable base by recasting these models in the non-parametric Bayesian framework. By replacing the categorical distributions at their core with hierarchical Pitman-Yor processes, and through the use of collapsed Gibbs sampling, we provide a more flexible formulation and sidestep the original heuristic optimisation techniques. The resulting models are highly extendible, naturally permitting the introduction of phrasal dependencies. Bayesian nonparametrics and the probabilistic approach to modelling. Philosophical Transactions of the Royal Society A. Abstract: Modelling is fundamental to many fields of science and engineering. A model can be thought of as a representation of possible data one could predict from a system. The probabilistic approach to modelling uses probability theory to express all aspects of uncertainty in the model. The probabilistic approach is synonymous with Bayesian modelling, which simply uses the rules of probability theory in order to make predictions, compare alternative models, and learn model parameters and structure from data. This simple and elegant framework is most powerful when coupled with flexible probabilistic models. Flexibility is achieved through the use of Bayesian nonparametrics.
This article provides an overview of probabilistic modelling and an accessible survey of some of the main tools in Bayesian nonparametrics. The survey covers the use of Bayesian nonparametrics for modelling unknown functions, density estimation, clustering, time series modelling, and representing sparsity, hierarchies, and covariance structure. Scaling multidimensional Gaussian processes using projected additive approximations. In 30th International Conference on Machine Learning. Abstract: Exact Gaussian process (GP) regression has O(N^3) runtime for data size N, making it intractable for large N. Many algorithms for improving GP scaling approximate the covariance with lower rank matrices. Other work has exploited structure inherent in particular covariance functions, including GPs with implied Markov structure, and equispaced inputs (both enable O(N) runtime). However, these GP advances have not been extended to the multidimensional input setting, despite the preponderance of multidimensional applications. This paper introduces and tests novel extensions of structured GPs to multidimensional inputs. To achieve optimal accuracy-complexity tradeoff, we extend this model with a novel variant of projection pursuit regression. Our primary result - projection pursuit Gaussian process regression - shows orders of magnitude speedup while preserving high accuracy. The natural second and third steps include non-Gaussian observations and higher dimensional equispaced grid methods. We introduce novel techniques to address both of these necessary directions. We thoroughly illustrate the power of these three advances on several datasets, achieving close performance to the naive full GP at orders of magnitude less cost. Scaling multidimensional inference for structured Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence. Comment: to appear. Creighton Heaukulani and Zoubin Ghahramani. Dynamic probabilistic models for latent feature propagation in social networks. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June. Abstract: Current Bayesian models for dynamic social network data have focused on modelling the influence of evolving unobserved structure on observed social interactions. However, an understanding of how observed social relationships from the past affect future unobserved structure in the network has been neglected. In this paper, we introduce a new probabilistic model for capturing this phenomenon, which we call latent feature propagation, in social networks. Learning feature selection dependencies in multi-task learning. In Advances in Neural Information Processing Systems 27, pages 1-9, Lake Tahoe, California, USA, December. Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients.
Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. Journal of Machine Learning Research, July. Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using expectation propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination, and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful for carrying out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy. In NIPS Workshop on Randomized Methods for Machine Learning. Abstract: Fully observed large binary matrices appear in a wide variety of contexts. However, current batch algorithms for PMF can be inefficient since they need to analyze the entire data matrix before producing any parameter updates. We also address the task of automatically selecting the size of the minibatches of data and we propose an algorithm that adjusts this hyper-parameter in an online manner. Gaussian process conditional copulas with applications to financial time series. In Advances in Neural Information Processing Systems 27, pages 1-9, Lake Tahoe, California, USA, December. Abstract: The estimation of dependencies between multiple variables is a central problem in the analysis of financial time series. A common approach is to express these dependencies in terms of a copula function. Typically the copula function is assumed to be constant but this may be inaccurate when there are covariates that could have a large influence on the dependence structure of the data. To account for this, a Bayesian framework for the estimation of conditional copulas is proposed. In this framework the parameters of a copula are non-linearly related to some arbitrary conditioning variables. We evaluate the ability of our method to predict time-varying dependencies on several equities and currencies and observe consistent performance gains compared to static copula models and other time-varying copula methods. Neil Houlsby and Guy Houlsby. Statistical fitting of undrained strength data. Geotechnique, doi: geot.
Abstract: We describe an approach, based on Bayesian statistical methods, that allows the fitting of a design profile to a set of measurements of undrained strengths. In particular we allow for the automatic determination of not only the positions of boundaries between geological units, but also the selection of the number of units to model the data in an appropriate way. Humans develop rich mental representations that guide their behavior in a variety of everyday tasks. However, it is unknown whether these representations, often formalized as priors in Bayesian inference, are specific for each task or subserve multiple tasks. Current approaches cannot distinguish between these two possibilities because they cannot extract comparable representations across different tasks. Here, we develop a novel method, termed cognitive tomography, that can extract complex, multi-dimensional priors across tasks. We apply this method to human judgments in two qualitatively different tasks, familiarity and odd-one-out, involving an ecologically relevant set of stimuli, human faces. We show that priors over faces are structurally complex and vary dramatically across subjects, but are invariant across the tasks within each subject. The priors we extract from each task allow us to predict with high precision the behavior of subjects for novel stimuli both in the same task as well as in the other task. Our results provide the first evidence for a single high-dimensional structured representation of a naturalistic stimulus set that guides behavior in multiple tasks. Moreover, the representations estimated by cognitive tomography can provide independent, behavior-based regressors for elucidating the neural correlates of complex naturalistic priors. Tomoharu Iwata, David Duvenaud, and Zoubin Ghahramani. Warped mixtures for nonparametric cluster shapes. In 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, Washington, July. Abstract: A mixture of Gaussians fit to a single curved or heavy-tailed cluster will report that the data contains many clusters. To produce more appropriate clusterings, we introduce a model which warps a latent mixture of Gaussians to produce nonparametric cluster shapes. The possibly low-dimensional latent mixture model allows us to summarize the properties of the high-dimensional clusters (or density manifolds) describing the data. The number of manifolds, as well as the shape and dimension of each manifold, is automatically inferred. We derive a simple inference scheme for this model which analytically integrates out both the mixture parameters and the warping function. We show that our model is effective for density estimation, performs better than infinite Gaussian mixture models at recovering the true number of clusters, and produces interpretable summaries of high-dimensional datasets. Tomoharu Iwata, Neil Houlsby, and Zoubin Ghahramani. Active learning for interactive visualization. In 16th International Conference on Artificial Intelligence and Statistics. Abstract: Many automatic visualization methods have been proposed. However, a visualization that is automatically generated might be different to how a user wants to arrange the objects in visualization space. By allowing users to re-locate objects in the embedding space of the visualization, they can adjust the visualization to their preference. We propose an active learning framework for interactive visualization which selects objects for the user to re-locate so that they can obtain their desired visualization by re-locating as few objects as possible.
The framework is based on an information theoretic criterion, which favors objects that reduce the uncertainty of the visualization. We present a concrete application of the proposed framework to the Laplacian eigenmap visualization method. We demonstrate experimentally that the proposed framework yields the desired visualization with fewer user interactions than existing methods. Tomoharu Iwata, Amar Shah, and Zoubin Ghahramani. Discovering latent influence in online social activities via shared cascade Poisson processes. In KDD. Association for Computing Machinery. Abstract: Many people share their activities with others through online communities. For example, users are likely to become interested in items that are adopted, e.g., by their friends. In this paper, we propose a probabilistic model for discovering latent influence from sequences of item adoption events. An inhomogeneous Poisson process is used for modeling a sequence, in which adoption by a user triggers the subsequent adoption of the same item by other users. For modeling adoption of multiple items, we employ multiple inhomogeneous Poisson processes, which share parameters, such as influence for each user and relations between users. The proposed model can be used for finding influential users, discovering relations between users and predicting item popularity in the future. We present an efficient Bayesian inference procedure of the proposed model based on the stochastic EM algorithm. The effectiveness of the proposed model is demonstrated by using real data sets in a social bookmark sharing service. Bringing representativeness into social media monitoring and analysis. In 46th Hawaii International Conference on System Sciences, Manoa, Hawaii. Abstract: The opinions, expectations and behavior of citizens are increasingly reflected online - therefore, mining the internet for such data can enhance decision-making in public policy, communications, marketing, finance and other fields. However, to come closer to the representativeness of classic opinion surveys there is a lack of knowledge about the sociodemographic characteristics of those voicing opinions on the internet. This paper proposes to calibrate online opinions aggregated from multiple and heterogeneous data sources with traditional surveys enhanced with rich socio-demographic information to enable insights into which opinions are expressed on the internet by specific segments of society. The goal of this research is to provide professionals in citizen- and consumer-centered domains with more concise real-time intelligence on online opinions. To become effective, the methodologies presented in this paper must be integrated into a coherent decision support system. Experimental adaptive Bayesian tomography. Physical Review A. Abstract: We report an experimental realization of an adaptive quantum state tomography protocol. Our method takes advantage of a Bayesian approach to statistical inference and is naturally tailored for adaptive strategies. Experiments are performed for polarization qubits, but the approach is readily adapted to any dimension. Simon Lacoste-Julien, Konstantina Palla, Alex Davies, Gjergji Kasneci, Thore Graepel, and Zoubin Ghahramani. SiGMa: simple greedy matching for aligning large knowledge bases. In KDD. Association for Computing Machinery. Abstract: The Internet has enabled the creation of a growing number of large-scale knowledge bases in a variety of domains containing complementary information.
Tools for automatically aligning these knowledge bases would make it possible to unify many sources of structured knowledge and answer complex queries. However, the efficient alignment of large-scale knowledge bases still poses a considerable challenge. Here, we present Simple Greedy Matching (SiGMa), a simple algorithm for aligning knowledge bases with millions of entities and facts. SiGMa is an iterative propagation algorithm which leverages both the structural information from the relationship graph as well as flexible similarity measures between entity properties in a greedy local search, thus making it scalable. We provide additional experiments on benchmark datasets which demonstrate that SiGMa can outperform state-of-the-art approaches both in accuracy and efficiency. GEFCom hierarchical load forecasting: Gradient boosting machines and Gaussian processes. International Journal of Forecasting. Abstract: This report discusses methods for forecasting hourly loads of a US utility as part of the load forecasting track of the Global Energy Forecasting Competition hosted on Kaggle. The randomized dependence coefficient. We introduce the randomized dependence coefficient (RDC), a measure of non-linear dependence between random variables of arbitrary dimension. RDC is defined in terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper. Gaussian process vine copulas for multivariate dependence. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June. Abstract: Copulas allow one to learn marginal distributions separately from the multivariate dependence structure (copula) that links them together into a density function. Vine factorizations ease the learning of high-dimensional copulas by constructing a hierarchy of conditional bivariate copulas. However, to simplify inference, it is common to assume that each of these conditional bivariate copulas is independent from its conditioning variables. In this paper, we relax this assumption by discovering the latent functions that specify the shape of a conditional copula given its conditioning variables. We learn these functions by following a Bayesian approach based on sparse Gaussian processes with expectation propagation for scalable, approximate inference. Experiments on real-world datasets show that, when modeling all conditional dependencies, we obtain better estimates of the underlying copula of the data. Rasmussen, and Anders Krogh. On the accuracy of short read mapping. In Deep Sequencing Data Analysis. Springer. Abstract: The development of high-throughput sequencing technologies has revolutionized the way we study genomes and gene regulation. In a single experiment, millions of reads are produced. To gain knowledge from these experiments the first thing to be done is finding the genomic origin of the reads, i.e., mapping the reads to a reference genome. In this new setting, conventional alignment tools are obsolete, as they cannot handle this huge amount of data in a reasonable amount of time. Thus, new mapping algorithms have been developed, which are fast at the expense of a small decrease in accuracy. In this chapter we discuss the current problems in short read mapping and show that mapping reads correctly is a nontrivial task. Through simple experiments with both real and synthetic data, we demonstrate that different mappers can give different results depending on the type of data, and that a considerable fraction of uniquely mapped reads is potentially mapped to an incorrect location.
Furthermore, we provide simple statistical results on the expected number of random matches in a genome (E-value) and the probability of a random match as a function of read length. Finally, we show that quality scores contain valuable information for mapping and why mapping quality should be evaluated in a probabilistic manner. In the end, we discuss the potential of improving the performance of current methods by considering these quality scores in a probabilistic mapping program. The supervised IBP: Neighbourhood preserving infinite latent feature models. In 29th Conference on Uncertainty in Artificial Intelligence, Bellevue, USA, July. Abstract: We propose a probabilistic model to infer supervised latent variables in the Hamming space from observed data. Our model allows simultaneous inference of the number of binary latent variables and their values. The latent variables preserve neighbourhood structure of the data in the sense that objects in the same semantic concept have similar latent values, and objects in different concepts have dissimilar latent values. We formulate the supervised infinite latent variable problem based on an intuitive principle of pulling objects together if they are of the same type, and pushing them apart if they are not. We then combine this principle with a flexible Indian buffet process prior on the latent variables. We show that the inferred supervised latent variables can be directly used to perform a nearest neighbour search for the purpose of retrieval. We introduce a new application of dynamically extending hash codes, and show how to effectively couple the structure of the hash codes with the continuously growing structure of the neighbourhood preserving infinite latent feature space. Colorado Reed and Zoubin Ghahramani. Scaling the Indian buffet process via submodular maximization. We perform approximate MAP inference for latent feature models with an Indian buffet process (IBP) prior within a maximization-expectation framework. This formulation yields a submodular function of the features that corresponds to a lower bound on the model evidence. By adding a constant to this function, we obtain a nonnegative submodular function that can be maximized via a greedy algorithm that obtains at least a one-third approximation to the optimal solution. Our inference method scales linearly with the size of the input data, and we show the efficacy of our method on the largest datasets currently analyzed using an IBP model. Amar Shah and Zoubin Ghahramani. Determinantal clustering processes - a nonparametric Bayesian approach to kernel based semi-supervised clustering. In UAI. Abstract: Semi-supervised clustering is the task of clustering data points into clusters where only a fraction of the points are labelled. The true number of clusters in the data is often unknown and most models require this parameter as an input. Dirichlet process mixture models are appealing as they can infer the number of clusters from the data. However, these models do not deal with high dimensional data well and can encounter difficulties in inference. We present a novel nonparametric Bayesian kernel based method to cluster data points without the need to prespecify the number of clusters or to model complicated densities from which data points are assumed to be generated. The key insight is to use determinants of submatrices of a kernel matrix as a measure of how close together a set of points are. We explore some theoretical properties of the model and derive a natural Gibbs based algorithm with MCMC hyperparameter learning. The model is implemented on a variety of synthetic and real world data sets.
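The determinant-based closeness measure at the heart of this model is easy to see in a toy computation. The following sketch (illustrative only; the RBF kernel and synthetic points are assumptions, not the paper's experimental setup) shows that the determinant of the kernel submatrix over a tight cluster collapses toward zero, because the corresponding kernel rows become nearly linearly dependent, while the determinant over well-separated points stays near one.

import numpy as np

def rbf_kernel(X, ell=1.0):
    # pairwise squared distances, then the Gaussian/RBF kernel
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell ** 2)

rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.1, size=(5, 2))   # five nearby points: a plausible cluster
spread = rng.normal(0.0, 3.0, size=(5, 2))  # five scattered points: implausible as one cluster

K = rbf_kernel(np.vstack([tight, spread]))
print(f"det over tight cluster:  {np.linalg.det(K[:5, :5]):.2e}")   # close to 0
print(f"det over spread points:  {np.linalg.det(K[5:, 5:]):.2e}")   # close to 1

The paper builds its nonparametric prior from exactly this quantity; the sketch only shows why submatrix determinants track how close together a set of points is.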
Viktoriia Sharmanska, Novi Quadrianto, and Christoph Lampert. Learning to rank using privileged information. In International Conference on Computer Vision. Abstract: Many computer vision problems have an asymmetric distribution of information between training and test time. In this work, we study the case where we are given additional information about the training data, which however will not be available at test time. This situation is called learning using privileged information (LUPI). We introduce two maximum-margin techniques that are able to make use of this additional source of information, and we show that the framework is applicable to several scenarios that have been studied in computer vision before. Experiments with attributes, bounding boxes, image tags and rationales as additional information in object classification show promising results. Gaussian process kernels for pattern discovery and extrapolation. In 30th International Conference on Machine Learning, February. Abstract: Gaussian processes are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. We introduce simple closed form kernels that can be used with Gaussian processes to discover patterns and enable extrapolation. These kernels are derived by modelling a spectral density - the Fourier transform of a kernel - with a Gaussian mixture. The proposed kernels support a broad class of stationary covariances, but Gaussian process inference remains simple and analytic. We demonstrate the proposed kernels by discovering patterns and performing long range extrapolation on synthetic examples, as well as atmospheric CO2 trends and airline passenger data. We also show that we can reconstruct standard covariances within our framework. GPatt: Fast multidimensional pattern extrapolation with Gaussian processes. arXiv preprint. Abstract: Gaussian processes are typically used for smoothing and interpolation on small datasets. We introduce a new Bayesian nonparametric framework, GPatt, enabling automatic pattern extrapolation with Gaussian processes on large multidimensional datasets. GPatt unifies and extends highly expressive kernels and fast exact inference techniques. Without human intervention - no hand crafting of kernel features, and no sophisticated initialisation procedures - we show that GPatt can solve large scale pattern extrapolation, inpainting, and kernel discovery problems, including a problem with 383,400 training points. We find that GPatt significantly outperforms popular alternative scalable Gaussian process methods in speed and accuracy. Moreover, we discover profound differences between each of these methods, suggesting expressive kernels, nonparametric representations, and scalable inference which exploits model structure are useful in combination for modelling large scale multidimensional patterns. Dynamic covariance models for multivariate financial time series. In 30th International Conference on Machine Learning, Atlanta, Georgia, USA, June. Abstract: The accurate prediction of time-changing covariances is an important problem in the modeling of multivariate financial data. However, some of the most popular models suffer from (a) overfitting problems and multiple local optima, (b) failure to capture shifts in market conditions, and (c) large computational costs. To address these problems we introduce a novel dynamic model for time-changing covariances.
Over-fitting and local optima are avoided by following a Bayesian approach instead of computing point estimates. Changes in market conditions are captured by assuming a diffusion process in parameter values, and finally computationally efficient and scalable inference is performed using particle filters. Experiments with financial data show excellent performance of the proposed method with respect to current standard models. The basic assumption in beamforming is that the sources are uncorrelated, which allows for estimating each source independently of the others. In this paper, we incorporate the independence assumption of the standard beamformer in a linear dynamical system, thereby introducing the dynamic beamformer. Using empirical data, we show that the dynamic beamformer outperforms the standard beamformer in predicting the condition of interest, which strongly suggests that it also outperforms the standard method in localizing the active neural generators. Gaussian Processes for time-marked time-series data. In 15th International Conference on Artificial Intelligence and Statistics. Abstract: In many settings, data is collected as multiple time series, where each recorded time series is an observation of some underlying dynamical process of interest. These observations are often time-marked with known event times, and one desires to do a range of standard analyses. When there is only one time marker, one simply aligns the observations temporally on that marker. When multiple time-markers are present and are at different times on different time series observations, these analyses are more difficult. We describe a Gaussian process model for analyzing multiple time series with multiple time markings, and we test it on a variety of data. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, doi: TAC. Abstract: We propose a principled algorithm for robust Bayesian filtering and smoothing in nonlinear stochastic dynamic systems when both the transition function and the measurement function are described by nonparametric Gaussian process (GP) models. GPs are gaining increasing importance in signal processing, machine learning, robotics, and control for representing unknown system functions by posterior probability distributions. This modern way of "system identification" is more robust than finding point estimates of a parametric function representation. Our numerical evaluations demonstrate the robustness of the proposed approach in situations where other state-of-the-art Gaussian filters and smoothers can fail. Modelling and control of nonlinear systems using Gaussian processes with partial model information. In 51st IEEE Conference on Decision and Control. Abstract: Gaussian processes are gaining increasing popularity among the control community, in particular for the modelling of discrete time state space systems. However, it has not been clear how to incorporate model information, in the form of known state relationships, when using a Gaussian process as a predictive model. An obvious example of known prior information is position and velocity related states. Incorporation of such information would be beneficial both computationally and for faster dynamics learning. This paper introduces a method of achieving this, yielding faster dynamics learning and a reduction in computational effort from O(Dn) to O((D-F)n) in the prediction stage for a system with D states, F known state relationships and n observations.
Collaborative Gaussian processes for preference learning. In Advances in Neural Information Processing Systems 26. Curran Associates, Inc. Abstract: Inference is simplified by using a preference kernel for GPs, which allows us to combine supervised GP learning of user preferences with unsupervised dimensionality reduction for multi-user systems. The model not only exploits collaborative information from the shared structure in user behavior, but may also incorporate user features if they are available. Approximate inference is implemented using a combination of expectation propagation and variational Bayes. Finally, we present an efficient active learning strategy for querying preferences. The proposed technique performs favorably on real-world data against state-of-the-art multi-user preference learning algorithms.

Optimally-weighted herding is Bayesian quadrature. In 28th Conference on Uncertainty in Artificial Intelligence, Catalina Island, California, July. Abstract: Herding and kernel herding are deterministic methods of choosing samples which summarise a probability distribution. A related task is choosing samples for estimating integrals using Bayesian quadrature. We show that the criterion minimised when selecting samples in kernel herding is equivalent to the posterior variance in Bayesian quadrature. We then show that sequential Bayesian quadrature can be viewed as a weighted version of kernel herding which achieves performance superior to any other weighted herding method. Our results also imply an upper bound on the empirical error of the Bayesian quadrature estimate. (A short numerical sketch of Bayesian quadrature weights appears after this group of entries.)

Adaptive Bayesian quantum tomography. Physical Review A. Abstract: In this paper we revisit the problem of optimal design of quantum tomographic experiments. In contrast to previous approaches, where an optimal set of measurements is decided in advance of the experiment, we allow for measurements to be adaptively and efficiently re-optimised depending on data collected so far.

Hyun-Chul Kim and Zoubin Ghahramani. Bayesian classifier combination. In 15th International Conference on Artificial Intelligence and Statistics. Abstract: Bayesian model averaging linearly mixes the probabilistic predictions of multiple models, each weighted by its posterior probability. This is the coherent Bayesian way of combining multiple models only under certain restrictive assumptions, which we outline. We explore a general framework for Bayesian model combination, which differs from model averaging, in the context of classification. The framework does not require that the models be probabilistic (they can even be human assessors), that they share prior information or receive the same training data, or that they be independent in their errors. Finally, the Bayesian combiner does not need to believe any of the models is in fact correct. We test several variants of this classifier combination procedure starting from a classic statistical model proposed by Dawid and Skene and using MCMC to add more complex but important features to the model. Comparisons on several data sets to simpler methods like majority voting show that the Bayesian methods not only perform well but result in interpretable diagnostics on the data points and the models.
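As referenced in the Bayesian quadrature entry above: with a squared-exponential kernel and a Gaussian integration measure, the kernel mean has a closed form and the quadrature weights come from solving a linear system. A minimal sketch under those assumptions follows; the lengthscale, measure and integrand are illustrative choices, not taken from the paper.

```python
import numpy as np

def bq_weights(X, lengthscale=1.0, sigma=1.0):
    """Bayesian quadrature weights for integrating against N(0, sigma^2 I),
    using the kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2)).

    The weights are w = K^{-1} z, where z_i = E_{x ~ N(0, sigma^2 I)}[k(x_i, x)],
    which is available in closed form for this kernel/measure pair.
    """
    n, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * lengthscale**2))
    s2 = lengthscale**2 + sigma**2
    z = (lengthscale**2 / s2) ** (d / 2.0) * np.exp(-np.sum(X**2, axis=1) / (2.0 * s2))
    return np.linalg.solve(K + 1e-10 * np.eye(n), z)

# Estimate E[f(x)] under N(0, 1) for an illustrative integrand:
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
w = bq_weights(X)
estimate = w @ np.cos(X[:, 0])  # compare with exp(-1/2) ~= 0.6065
```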
Bayesian correlated clustering to integrate multiple datasets. Bioinformatics. Abstract: Motivation: The integration of multiple datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct but often complementary information. We present a Bayesian method for the unsupervised integrative modelling of multiple datasets, which we refer to as MDI (Multiple Dataset Integration). MDI can integrate information from a wide range of different datasets and data types simultaneously, including the ability to model time series data explicitly using Gaussian processes. Each dataset is modelled using a Dirichlet-multinomial allocation (DMA) mixture model, with dependencies between these models captured through parameters that describe the agreement among the datasets. Results: Using a set of six artificially constructed time series datasets, we show that MDI is able to integrate a significant number of datasets simultaneously, and that it successfully captures the underlying structural similarity between the datasets. We also analyse a variety of real Saccharomyces cerevisiae datasets. We then move beyond the capabilities of current approaches and integrate gene expression, chromatin immunoprecipitation-chip and protein-protein interaction data, to identify a set of protein complexes for which genes are co-regulated during the cell cycle. Comparisons to other unsupervised data integration techniques, as well as to non-integrative approaches, demonstrate that MDI is competitive, while also providing information that would be difficult or impossible to extract using other methods. Comment: This paper is available from the Bioinformatics site and a Matlab implementation of MDI is available from this site.
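To make the agreement mechanism above concrete: in MDI-style models, the cluster labels of one item in two datasets are coupled through a parameter that upweights configurations where the labels agree. The following is a minimal sketch of such a coupled allocation for two datasets; the mixture weights, the agreement strength and the restriction to a single item are illustrative simplifications, not the model as specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_coupled_labels(pi1, pi2, phi, size=1):
    """Sample cluster labels (c1, c2) for one item in two datasets, with
    p(c1 = k, c2 = l) proportional to pi1[k] * pi2[l] * (1 + phi * [k == l]).

    phi >= 0 controls dataset agreement: phi = 0 gives independent
    allocations; large phi makes the two labels agree more often.
    """
    K = len(pi1)
    joint = np.outer(pi1, pi2) * (1.0 + phi * np.eye(K))
    joint /= joint.sum()
    idx = rng.choice(K * K, p=joint.ravel(), size=size)
    return np.stack(np.unravel_index(idx, (K, K)), axis=1)

pi = np.array([0.5, 0.3, 0.2])
pairs = sample_coupled_labels(pi, pi, phi=10.0, size=5000)
print("agreement rate:", np.mean(pairs[:, 0] == pairs[:, 1]))
```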
Random function priors for exchangeable arrays with applications to graphs and relational data. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December. Abstract: A fundamental problem in the analysis of structured relational data like graphs, networks, databases, and matrices is to extract a summary of the common structure underlying relations between individual entities. Relational data are typically encoded in the form of arrays; invariance to the ordering of rows and columns corresponds to exchangeable arrays. Results in probability theory due to Aldous, Hoover and Kallenberg show that exchangeable arrays can be represented in terms of a random measurable function which constitutes the natural model parameter in a Bayesian model. We obtain a flexible yet simple Bayesian nonparametric model by placing a Gaussian process prior on the parameter function. Efficient inference utilises elliptical slice sampling combined with a random sparse approximation to the Gaussian process. We demonstrate applications of the model to network data and clarify its relation to models in the literature, several of which emerge as special cases.

Semi-supervised domain adaptation with non-parametric copulas. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December. Abstract: A new framework based on the theory of copulas is proposed to address semi-supervised domain adaptation problems. The presented method factorizes any multivariate density into a product of marginal distributions and bivariate copula functions. Therefore, changes in each of these factors can be detected and corrected to adapt a density model across different learning domains. Importantly, we introduce a novel vine copula model, which allows for this factorization in a non-parametric manner. Experimental results on regression problems with real-world data illustrate the efficacy of the proposed approach when compared to state-of-the-art techniques.

Heller, and Zoubin Ghahramani. Bayesian and L1 approaches for sparse unsupervised learning. In 29th International Conference on Machine Learning. Abstract: The use of L1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L1 methods, in this paper we find that L1 regularisation often dramatically under-performs in terms of predictive performance when compared to other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L1-minimising factor models, Bayesian variants of "L1", and Bayesian models with a stronger L0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods out-perform L1 minimisation, even on a computational budget. We thus highlight the need to re-assess the wide use of L1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.
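To illustrate the spike-and-slab construction in the entry above: each latent loading is the product of a Bernoulli "spike" (is the value exactly zero?) and a continuous "slab" (its value when non-zero), which induces exact zeros without shrinking the non-zero values. A minimal generative sketch of such a sparse factor model, with assumed hyperparameters, follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_and_slab_factor_model(n=100, d=20, k=5, pi=0.2, slab_sd=1.0, noise_sd=0.1):
    """Draw data from a sparse factor model with spike-and-slab loadings.

    Each loading W[i, j] = Z[i, j] * V[i, j], where Z ~ Bernoulli(pi) is the
    spike (exact zeros) and V ~ N(0, slab_sd^2) is the slab (unshrunk values).
    """
    Z = rng.random((d, k)) < pi            # spike: which loadings are non-zero
    V = rng.normal(0.0, slab_sd, (d, k))   # slab: values of non-zero loadings
    W = Z * V
    F = rng.normal(size=(k, n))            # latent factors
    X = W @ F + rng.normal(0.0, noise_sd, (d, n))
    return X, W

X, W = sample_spike_and_slab_factor_model()
print(f"fraction of exactly-zero loadings: {np.mean(W == 0.0):.2f}")
```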
A nonparametric Bayesian model for multiple clustering with overlapping feature views. In 15th International Conference on Artificial Intelligence and Statistics. Abstract: Most clustering algorithms produce a single clustering solution. This is inadequate for many data sets that are multi-faceted and can be grouped and interpreted in many different ways. Moreover, for high-dimensional data, different features may be relevant or irrelevant to each clustering solution, suggesting the need for feature selection in clustering. Features relevant to one clustering interpretation may be different from the ones relevant for an alternative interpretation or view of the data. In this paper, we introduce a probabilistic nonparametric Bayesian model that can discover multiple clustering solutions from data and the feature subsets that are relevant for the clusters in each view. In our model, the features in different views may be shared, and the sets of relevant features are therefore allowed to overlap. We model feature relevance to each view using an Indian Buffet Process and the cluster membership in each view using a Chinese Restaurant Process. We provide an inference approach to learn the latent parameters corresponding to this multiple partitioning problem. Our model not only learns the features and clusters in each view but also automatically learns the number of clusters, number of views and number of features in each view.

Roberts, and Zoubin Ghahramani. Active learning of model evidence using Bayesian quadrature. Abstract: Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model.

An infinite latent attribute model for network data. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June. Abstract: Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a "flat" clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks.

A nonparametric variable clustering model. In Advances in Neural Information Processing Systems 26, pages 1-9, Lake Tahoe, California, USA, December. Abstract: Factor analysis models effectively summarise the covariance structure of high dimensional data, but the solutions are typically hard to interpret. This motivates attempting to find a disjoint partition, i.e. a clustering, of the variables. We introduce a Bayesian non-parametric approach to this problem, and demonstrate advantages over heuristic methods proposed to date. Our Dirichlet process variable clustering (DPVC) model can discover block-diagonal covariance structures in data. We evaluate our method on both synthetic and gene expression analysis problems.
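A small generative sketch of the block-diagonal idea behind the DPVC entry above: variables are partitioned into clusters (here by a Chinese restaurant process draw), and variables in the same cluster share a single latent factor, which yields a block-diagonal covariance up to variable ordering. The noise levels and CRP concentration below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def crp_partition(n_vars, alpha=1.0):
    """Sample a partition of variables from a Chinese restaurant process."""
    assignments = [0]
    for i in range(1, n_vars):
        counts = np.bincount(assignments)
        probs = np.append(counts, alpha) / (i + alpha)
        assignments.append(rng.choice(len(probs), p=probs))
    return np.array(assignments)

def sample_dpvc_style_data(n_obs=500, n_vars=12, noise_sd=0.3):
    """Each variable loads on exactly one cluster-specific latent factor,
    producing a block-diagonal covariance structure (up to variable order)."""
    z = crp_partition(n_vars)
    n_clusters = z.max() + 1
    factors = rng.normal(size=(n_obs, n_clusters))
    loadings = rng.normal(1.0, 0.2, size=n_vars)
    X = factors[:, z] * loadings + rng.normal(0.0, noise_sd, (n_obs, n_vars))
    return X, z

X, z = sample_dpvc_style_data()
print("cluster assignments:", z)
```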
Barnabas Poczos, Zoubin Ghahramani, and Jeff Schneider. Copula-based kernel dependency measures. In 29th International Conference on Machine Learning. Abstract: The paper presents a new copula-based method for measuring dependence between random variables. Our approach extends the Maximum Mean Discrepancy to the copula of the joint distribution. We prove that this approach has several advantageous properties. Similarly to Shannon mutual information, the proposed dependence measure is invariant to any strictly increasing transformation of the marginal variables. This is important in many applications, for example in feature selection. The estimator is consistent, robust to outliers, and uses rank statistics only. We derive upper bounds on the convergence rate and propose independence tests too. We illustrate the theoretical contributions through a series of experiments in feature selection and low-dimensional embedding of distributions.

Novi Quadrianto, Chao Chen, and Christoph Lampert. The most persistent soft-clique in a set of sampled graphs. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June. Abstract: When searching for characteristic subpatterns in potentially noisy graph data, it appears self-evident that having multiple observations would be better than having just one. However, it turns out that the inconsistencies introduced when different graph instances have different edge sets pose a serious challenge. In this work we address this challenge for the problem of finding maximum weighted cliques. We introduce the concept of a most persistent soft-clique. This is a subset of vertices that (1) is almost fully or at least densely connected, (2) occurs in all or almost all graph instances, and (3) has the maximum weight. We present a measure of clique-ness that essentially counts the number of edges missing to make a subset of vertices into a clique (a small sketch of this count appears after this group of entries). With this measure, we show that the problem of finding the most persistent soft-clique can be cast either as: (a) a max-min two person game optimization problem, or (b) a min-min soft margin optimization problem. Both formulations lead to the same solution when using a partial Lagrangian method to solve the optimization problems. By experiments on synthetic data and on real social network data, we show that the proposed method is able to reliably find soft cliques in graph data, even if it is distorted by random noise or unreliable observations.

In 12th European Conference on Computer Vision. Abstract: We propose a new learning method to infer a mid-level feature representation that combines the advantage of semantic attribute representations with the higher expressive power of non-semantic features. The idea lies in augmenting an existing attribute-based representation with additional dimensions for which an autoencoder model is coupled with a large-margin principle. This construction allows a smooth transition between the zero-shot regime with no training example, the unsupervised regime with training examples but without class labels, and the supervised regime with training examples and with class labels. The resulting optimization problem can be solved efficiently, because several of the necessary steps have closed-form solutions. Through extensive experiments we show that the augmented representation achieves better results in terms of object categorization accuracy than the semantic representation alone.
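As referenced in the soft-clique entry above, the clique-ness measure counts missing edges. Here is a minimal sketch of that count for a single graph given as an adjacency matrix; the extension to edge weights and to multiple sampled graphs, which the paper's formulation requires, is omitted.

```python
import numpy as np

def missing_edges(adjacency, subset):
    """Count edges missing to make `subset` a clique in an undirected graph.

    `adjacency` is a symmetric 0/1 numpy array; `subset` is a list of vertex
    indices. A clique on k vertices needs k * (k - 1) / 2 edges.
    """
    sub = adjacency[np.ix_(subset, subset)]
    k = len(subset)
    present = np.triu(sub, k=1).sum()
    return int(k * (k - 1) // 2 - present)

# Toy example: a 4-cycle; vertices {0, 1, 2, 3} miss the two diagonals.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
print(missing_edges(A, [0, 1, 2, 3]))  # -> 2
```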
Robust estimation of local genetic ancestry in admixed populations using a non-parametric Bayesian approach. Genetics. Abstract: We present a new haplotype-based approach for inferring local genetic ancestry of individuals in an admixed population. Most existing approaches for local ancestry estimation ignore the latent genetic relatedness between ancestral populations and treat them as independent. In this paper, we exploit such information by building an inheritance model that describes both the ancestral populations and the admixed population jointly in a unified framework. Based on an assumption that the common hypothetical founder haplotypes give rise to both the ancestral and admixed population haplotypes, we employ an infinite hidden Markov model to characterize each ancestral population and further extend it to generate the admixed population. Through an effective utilization of the population structural information under a principled nonparametric Bayesian framework, the resulting model is significantly less sensitive to the choice and the amount of training data for ancestral populations than state-of-the-art algorithms. We also improve the robustness under deviation from common modeling assumptions by incorporating population-specific scale parameters that allow variable recombination rates in different populations. Our method is applicable to an admixed population from an arbitrary number of ancestral populations and also performs competitively in terms of spurious ancestry proportions under a general multi-way admixture assumption. We validate the proposed method by simulation under various admixing scenarios and present empirical analysis results on a worldwide distributed dataset from the Human Genome Diversity Project.

Jacob Steinhardt and Zoubin Ghahramani. Flexible martingale priors for deep hierarchies. In 15th International Conference on Artificial Intelligence and Statistics. Abstract: When building priors over trees for Bayesian hierarchical models, there is a tension between maintaining desirable theoretical properties such as infinite exchangeability and important practical properties such as the ability to increase the depth of the tree to accommodate new data. We resolve this tension by presenting a family of infinitely exchangeable priors over discrete tree structures that allows the depth of the tree to grow with the data, and then showing that our family contains all hierarchical models with certain mild symmetry properties.

Tatiana Tommasi, Novi Quadrianto, Barbara Caputo, and Christoph Lampert. Beyond dataset bias: Multi-task unaligned shared knowledge transfer. In 11th Asian Conference on Computer Vision. Abstract: Many visual datasets are traditionally used to analyze the performance of different learning techniques. The evaluation is usually done within each dataset, therefore it is questionable if such results are a reliable indicator of true generalization ability. We propose here an algorithm to exploit the existing data resources when learning on a new multiclass problem. Our main idea is to identify an image representation that decomposes orthogonally into two subspaces: a part specific to each dataset, and a part generic to, and therefore shared between, all the considered source sets. This allows us to use the generic representation as un-biased reference knowledge for a novel classification task. By casting the method in the multi-view setting, we also make it possible to use different features for different datasets. We call the algorithm MUST, Multitask Unaligned Shared knowledge Transfer. Through extensive experiments on five public datasets, we show that MUST consistently improves the cross-datasets generalization performance.
The UKF suffers from arbitrary parameters necessary for sigma point placement, potentially causing it to perform poorly in nonlinear problems. We show how to treat sigma point placement in a UKF as a learning problem in a model-based view. We demonstrate that learning to place the sigma points correctly from data can make sigma point collapse much less likely. Learning can result in a significant increase in predictive performance over default settings of the parameters in the UKF and other filters designed to avoid the problems of the UKF, such as the GP-ADF. At the same time, we maintain a lower computational complexity than the other methods. We call our method UKF-L.

Turner and Maneesh Sahani. Decomposing signals into a sum of amplitude and frequency modulated sinusoids using probabilistic inference. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March. Abstract: There are many methods for decomposing signals into a sum of amplitude and frequency modulated sinusoids. In this paper we take a new estimation-based approach. Identifying the problem as ill-posed, we show how to regularize the solution by imposing soft constraints on the amplitude and phase variables of the sinusoids. Estimation proceeds using a version of Kalman smoothing. We evaluate the method on synthetic and natural, clean and noisy signals, showing that it outperforms previous decompositions, but at a higher computational cost.

Modelling input varying correlations between multiple responses. Abstract: The GWP can naturally scale to thousands of response variables, as opposed to competing multivariate volatility models which are typically intractable for greater than 5 response variables. The GWP can also naturally capture a rich class of covariance dynamics (periodicity, Brownian motion, smoothness) through a covariance kernel.

Gaussian process regression networks. In 29th International Conference on Machine Learning, Edinburgh, Scotland, June. Abstract: We introduce a new regression framework, Gaussian process regression networks (GPRN), which combines the structural properties of Bayesian neural networks with the nonparametric flexibility of Gaussian processes. GPRN accommodates input (predictor) dependent signal and noise correlations between multiple output (response) variables, input dependent length-scales and amplitudes, and heavy-tailed predictive distributions. We derive both elliptical slice sampling and variational Bayes inference procedures for GPRN. We apply GPRN as a multiple output regression and multivariate volatility model, demonstrating substantially improved performance over eight popular multiple output (multi-task) Gaussian process models and three multivariate volatility models on real datasets, including a high-dimensional gene expression dataset.

Storkey, and Zoubin Ghahramani. Continuous relaxations for discrete Hamiltonian Monte Carlo. In Weinberger, editors, NIPS. Abstract: Continuous relaxations play an important role in discrete optimization, but have not seen much use in approximate probabilistic inference. Here we show that a general form of the Gaussian Integral Trick makes it possible to transform a wide class of discrete variable undirected models into fully continuous systems. The continuous representation allows the use of gradient-based Hamiltonian Monte Carlo for inference, results in new ways of estimating normalization constants (partition functions), and in general opens up a number of new avenues for inference in difficult discrete systems. We demonstrate some of these continuous relaxation inference algorithms on a number of illustrative problems.
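For a sense of how the Gaussian Integral Trick mentioned above can work: for a binary pairwise model p(s) proportional to exp(0.5 s'Ws + b's) with W positive definite (a diagonal shift, which only changes a constant for binary s, can enforce this), a standard auxiliary-variable identity lets the discrete variables be summed out analytically, leaving a smooth density over a continuous vector to which gradient-based HMC applies. The sketch below follows the general Hubbard-Stratonovich form under these assumptions, not necessarily the paper's exact parameterization.

```python
import numpy as np

def relaxed_log_density(x, W_inv, b):
    """Log-density (up to a constant) of the continuous relaxation of
    p(s) ~ exp(0.5 * s^T W s + b^T s), s in {0, 1}^d, obtained by introducing
    a Gaussian auxiliary variable and summing out s analytically:

        log p(x) = -0.5 * x^T W^{-1} x + sum_i softplus(x_i + b_i) + const.
    """
    softplus = np.logaddexp(0.0, x + b)  # log(1 + exp(x_i + b_i)), computed stably
    return -0.5 * x @ W_inv @ x + softplus.sum()

def relaxed_grad(x, W_inv, b):
    """Gradient of the relaxed log-density, usable inside an HMC leapfrog step."""
    sigmoid = 1.0 / (1.0 + np.exp(-(x + b)))
    return -W_inv @ x + sigmoid

# Toy 2-variable model; W must be positive definite for the trick to apply.
W = np.array([[2.0, 0.5], [0.5, 2.0]])
W_inv = np.linalg.inv(W)
b = np.array([-1.0, 0.5])
x = np.zeros(2)
print(relaxed_log_density(x, W_inv, b), relaxed_grad(x, W_inv, b))
```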
Testing a Bayesian measure of representativeness using a large image database. In Advances in Neural Information Processing Systems 24, Cambridge, MA, USA. The MIT Press. Abstract: How do people determine which elements of a set are most representative of that set? We show that this measure is formally related to a machine learning method known as Bayesian Sets. Building on this connection, we derive an analytic expression for the representativeness of objects described by a sparse vector of binary features (a small sketch of such a score appears after this group of entries). We then apply this measure to a large database of images, using it to determine which images are the most representative members of different sets. Comparing the resulting predictions to human judgments of representativeness provides a test of this measure with naturalistic stimuli, and illustrates how databases that are more commonly used in computer vision and machine learning can be used to evaluate psychological theories.

Ortega, Evangelos Theodorou, and Stefan Schaal. Path integral control and bounded rationality. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning. Abstract: Path integral methods have recently been shown to be applicable to a very general class of optimal control problems. Here we examine the path integral formalism from a decision-theoretic point of view, since an optimal controller can always be regarded as an instance of a perfectly rational decision-maker that chooses its actions so as to maximize its expected utility. The problem with perfect rationality is, however, that finding optimal actions is often very difficult due to prohibitive computational resource costs that are not taken into account. In contrast, a bounded rational decision-maker has only limited resources and therefore needs to strike some compromise between the desired utility and the required resource costs.
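As referenced in the representativeness entry above: with independent Beta-Bernoulli models per binary feature, the Bayesian Sets score p(x | D) / p(x) has a closed form that is linear in the feature vector. The following is a minimal sketch under those assumptions; the hyperparameters and toy data are illustrative.

```python
import numpy as np

def bayesian_sets_log_score(X_set, x, alpha=2.0, beta=2.0):
    """Log of the Bayesian Sets score p(x | D) / p(x) for binary features.

    Each feature j has an independent Beta(alpha, beta)-Bernoulli model.
    Conditioning on the set D updates the Beta parameters with the feature
    counts, and the score measures how well x fits the updated model.
    """
    N, _ = X_set.shape
    n = X_set.sum(axis=0)                      # per-feature counts in the set
    alpha_t, beta_t = alpha + n, beta + N - n  # posterior hyperparameters
    log_p1 = np.log(alpha_t / (alpha_t + beta_t)) - np.log(alpha / (alpha + beta))
    log_p0 = np.log(beta_t / (alpha_t + beta_t)) - np.log(beta / (alpha + beta))
    return float(x @ log_p1 + (1 - x) @ log_p0)

# Toy example: items sharing feature 0 make a new item with feature 0 score highly.
D = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 0]])
print(bayesian_sets_log_score(D, np.array([1, 0, 0])))  # high (representative)
print(bayesian_sets_log_score(D, np.array([0, 0, 1])))  # low (unrepresentative)
```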