How can we learn high-quality models when data is inherently distributed across sites and cannot be shared or pooled? In federated learning, the solution is to iteratively train models locally at each site and share these models with a server, which aggregates them into a global model. Since only models are shared, the data itself usually remains undisclosed. This process, however, requires sufficient data to be available at each site for the locally trained models to achieve a minimum quality – even a single bad model can render the aggregation arbitrarily bad.
In healthcare settings, however, we often have as few as a few dozen samples per hospital. How can we still collaboratively train a model across a federation of hospitals without infringing on patient privacy?
At this year’s ICLR, my colleagues Jonas Fischer, Jilles Vreeken, and I presented a novel building block for federated learning called daisy-chaining. This approach trains models consecutively on local datasets, much like a daisy chain. Daisy-chaining alone, however, violates privacy, since a client could infer information about the data of the client it received a model from. Moreover, performing daisy-chaining naively leads to overfitting, which can cause learning to diverge. In our paper “Federated Learning from Small Datasets”, we propose to combine daisy-chaining of local datasets with aggregation of models, both orchestrated by the server, and term this method Federated Daisy-Chaining (FedDC).
This approach allows us to train models successfully from as few as 2 samples per client. Our results on image data (Table 1) show that FedDC not only outperforms standard federated averaging (FedAvg), but also state-of-the-art federated learning approaches, achieving a test accuracy close to centralized training.
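To make the mechanics a bit more concrete, here is a toy numpy sketch of the general recipe – local training, periodic daisy-chaining of models between clients, and periodic averaging, all orchestrated by the server. It is purely illustrative: the model, the schedule, and every parameter choice below are placeholders of my own, not the actual FedDC implementation from the paper.

```python
# Toy sketch (not the FedDC implementation from the paper): clients take local
# gradient steps on a tiny logistic-regression model; every d rounds the server
# passes each model on to a different client (daisy-chaining), every b rounds
# it averages all models (aggregation). All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_clients, dim, d, b, lr = 5, 10, 2, 10, 0.1

# each client holds only a handful of labelled samples
data = [(rng.normal(size=(3, dim)), rng.integers(0, 2, size=3)) for _ in range(n_clients)]
models = [np.zeros(dim) for _ in range(n_clients)]

def local_step(w, X, y, lr):
    """One logistic-regression gradient step on a client's local data."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return w - lr * X.T @ (p - y) / len(y)

for t in range(1, 101):
    models = [local_step(w, X, y, lr) for w, (X, y) in zip(models, data)]
    if t % b == 0:                       # aggregation round: average all models
        avg = np.mean(models, axis=0)
        models = [avg.copy() for _ in range(n_clients)]
    elif t % d == 0:                     # daisy-chaining round: permute the models
        models = [models[i] for i in rng.permutation(n_clients)]

print("final global model:", np.round(models[0], 3))
```

The point of the interleaving is that each model effectively sees many small local datasets before it is averaged, instead of overfitting to a single one.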
Discovering causal relationships enables us to build more reliable, robust, and ultimately trustworthy models. It requires large amounts of observational data, though. In healthcare, for most diseases the amount of available data is large, but it is scattered over thousands of hospitals worldwide. Since in most cases this data must not be pooled for privacy reasons, we need a way to learn a structural causal model in a federated fashion.
At this year’s AISTATS, my co-authors Osman Mian, David Kaltenpoth, Jilles Vreeken, and I presented the paper “Nothing but Regrets – Privacy-Preserving Federated Causal Discovery”, in which we show that you can discover causal relationships by sharing only regret values with a server: the server sends a candidate causal model to each client, and the clients reply with how much worse single-edge extensions of this model are on their local data compared to the model itself. From this information alone, the server can compute the best extension of the current global model.
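To give a feeling for the flavor of such a protocol, here is a toy re-implementation of the general regret-exchange idea. It uses a simple BIC-style stand-in score for linear-Gaussian models and a greedy edge search – not the scoring function or the protocol from the paper – but it illustrates that the clients only ever report score differences, never their data.

```python
# Toy sketch of regret-based federated causal discovery (stand-in BIC score,
# not the method from the paper): the server proposes a graph, each client
# reports one regret value per single-edge extension, the server adds the
# best edge and repeats until no extension improves the global model.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_vars, n_clients = 4, 3

def sample(n=200):
    """Toy local data following X0 -> X1 -> X2, with X3 independent."""
    X0 = rng.normal(size=n)
    X1 = 0.8 * X0 + 0.3 * rng.normal(size=n)
    X2 = 0.7 * X1 + 0.3 * rng.normal(size=n)
    X3 = rng.normal(size=n)
    return np.column_stack([X0, X1, X2, X3])

clients = [sample() for _ in range(n_clients)]

def local_score(data, parents_of):
    """BIC-style score of a linear-Gaussian model on local data (lower is better)."""
    n, score = len(data), 0.0
    for j, parents in parents_of.items():
        X = data[:, parents] if parents else np.ones((n, 1))
        resid = data[:, j] - X @ np.linalg.lstsq(X, data[:, j], rcond=None)[0]
        score += n * np.log(resid.var() + 1e-12) + len(parents) * np.log(n)
    return score

def creates_cycle(parents_of, i, j):
    """Would adding the edge i -> j close a directed cycle?"""
    stack, seen = [j], set()
    while stack:
        node = stack.pop()
        if node == i:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(k for k, ps in parents_of.items() if node in ps)
    return False

parents_of = {j: [] for j in range(n_vars)}      # current global model: empty graph
while True:
    base = [local_score(d, parents_of) for d in clients]
    candidates = [(i, j) for i, j in itertools.permutations(range(n_vars), 2)
                  if i not in parents_of[j] and not creates_cycle(parents_of, i, j)]
    # each client reports one regret value per candidate single-edge extension
    regrets = {}
    for i, j in candidates:
        ext = {k: list(v) for k, v in parents_of.items()}
        ext[j].append(i)
        regrets[(i, j)] = sum(local_score(d, ext) - b for d, b in zip(clients, base))
    if not regrets:
        break
    best = min(regrets, key=regrets.get)
    if regrets[best] >= 0:                       # no extension improves the model
        break
    parents_of[best[1]].append(best[0])          # the server adds the best edge

print("recovered parent sets:", parents_of)
```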
In practice, the environments at the local clients are not the same. We should expect local differences that can be modeled as interventions on the global causal structure. In our AAAI paper “Information-Theoretic Causal Discovery and Intervention Detection over Multiple Environments” we have shown how to discover a global causal structure as well as local interventions in a centralized setting. Our current goal is to combine these two works into an approach for federated causal discovery from heterogeneous environments.
At AAAI 2023, my colleagues Osman Mian, Jilles Vreeken, and I presented our paper “Information-Theoretic Causal Discovery and Intervention Detection over Multiple Environments”, in which we learn a global structural causal model over multiple environments and discover potential local interventions that change some causal relationships within particular environments.
For medical data this has an enormous impact: being able to reliably detect causal relationships in medical data, such as gene expressions or patient records, allows us not only to build more reliable and trustworthy models, but also to gain novel insights into diseases and risk factors.
Reliably detecting causal relationships requires large amounts of observational data, though. Therefore, it is paramount to develop privacy-preserving methods to tap into the large, but inherently distributed medical datasets in hospitals all over the world. What we need is federated causal discovery.
To explain the kernel trick to students at Monash University, I once created a Jupyter notebook that explains the basics with a few illustrations. I embed the notebook below, but you can also view it directly, or check it out on GitHub. It is all part of my Machine Learning Explained repository, where I collect little bits of information on machine learning.
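As a tiny teaser (not taken from the notebook itself), here is the core idea in a few lines of numpy: evaluating a kernel on two inputs gives exactly the inner product of explicit features – here the degree-2 polynomial features – without ever computing those features.

```python
# The degree-2 polynomial kernel k(x, z) = (x.z)^2 equals an inner product of
# explicit quadratic features, so we never need to compute those features.
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2D input)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(x, z) ** 2)          # kernel evaluation in the input space
print(np.dot(phi(x), phi(z)))     # same value via the explicit feature map
```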
Linara Adilova, Amr Abourayya, Jianning Li, Amin Dada, Henning Petzka, Jan Egger, Jens Kleesiek, Michael Kamp: FAM: Relative Flatness Aware Minimization. In: Proceedings of the ICML Workshop on Topology, Algebra, and Geometry in Machine Learning (TAG-ML), 2023.
Linara Adilova, Michael Kamp, Gennady Andrienko, Natalia Andrienko: Re-interpreting Rules Interpretability. In: International Journal of Data Science and Analytics, 2023.
Michael Kamp, Jonas Fischer, Jilles Vreeken: Federated Learning from Small Datasets. In: International Conference on Learning Representations (ICLR), 2023.
Osman Mian, David Kaltenpoth, Michael Kamp, Jilles Vreeken: Nothing but Regrets - Privacy-Preserving Federated Causal Discovery. In: International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.
Osman Mian, Michael Kamp, Jilles Vreeken: Information-Theoretic Causal Discovery and Intervention Detection over Multiple Environments. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023.
Linara Adilova, Siming Chen, Michael Kamp: Informed Novelty Detection in Sequential Data by Per-Cluster Modeling. In: ICML workshop on Artificial Intelligence & Human Computer Interaction, 2023.
Osman Mian, David Kaltenpoth, Michael Kamp: Regret-based Federated Causal Discovery. In: The KDD'22 Workshop on Causal Discovery, pp. 61–69, PMLR 2022.
Junhong Wang, Yun Li, Zhaoyu Zhou, Chengshun Wang, Yijie Hou, Li Zhang, Xiangyang Xue, Michael Kamp, Xiaolong Zhang, Siming Chen: When, Where and How does it fail? A Spatial-temporal Visual Analytics Approach for Interpretable Object Detection in Autonomous Driving. In: IEEE Transactions on Visualization and Computer Graphics, 2022.
Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, Mario Boley: Relative Flatness and Generalization. In: Advances in Neural Information Processing Systems, Curran Associates, Inc., 2021.
Florian Linsner, Linara Adilova, Sina Däubener, Michael Kamp, Asja Fischer: Approaches to Uncertainty Quantification in Federated Deep Learning. Machine Learning and Principles and Practice of Knowledge Discovery in Databases: International Workshops of ECML PKDD 2021, vol. 2, Springer, 2021.
Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, Qi Dou: FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. In: Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021.
Lukas Heppe, Michael Kamp, Linara Adilova, Nico Piatkowski, Danny Heinrich, Katharina Morik: Resource-Constrained On-Device Learning by Dynamic Averaging. Proceedings of the Workshop on Parallel, Distributed, and Federated Learning (PDFL) at ECMLPKDD, 2020.
Disclaimer: this article is an advertisement for our workshop on parallel, distributed, and federated learning (PDFL’20) at ECMLPKDD this year and a call for contributions. First I wanted to just post the workshop description and be done with it, but then I thought I might add some content to give this article actual value. Let’s see how that works out.
Distributed Machine Learning
Distributed machine learning has been a huge topic since the Big Data hype. However, the landscape of machine learning applications is changing rapidly: large centralized datasets are replaced by high-volume, high-velocity data streams generated by a vast number of geographically distributed, loosely connected devices, such as mobile phones, smart sensors, autonomous vehicles, or industrial machines. Current learning approaches centralize the data and process it in parallel in a cluster or cloud. This way, you again have a centralized dataset and can run your off-the-shelf, efficient machine learning algorithm – think of Spark’s MLlib, for example. And of course, a lot of really good research is done on how to improve machine learning in such a high-performance setup (e.g., the work of Janis Keuper). However, this has three major disadvantages: (i) it does not scale well with the number of data-generating devices, since their growth exceeds that of computing centers, (ii) the communication costs for centralizing the data are prohibitive in many applications, and (iii) it requires sharing potentially privacy-sensitive data. Pushing computation towards the data-generating devices alleviates these problems and allows us to employ their otherwise unused computing power.
Distributed Gradient Computation
The first more general parallelization scheme, i.e., one that is applicable to several ML algorithms at once, was the distributed mini-batch algorithm of Ofer Dekel and his colleagues from Microsoft (I think the idea precedes him, but he wrote the seminal paper). The idea is very simple: in any gradient-based ML algorithm, e.g., stochastic gradient descent, you can calculate the gradients locally and send them to a coordinator node. The coordinator sums them up (which is theoretically sound, since the sum of local gradients is the gradient with respect to all local data points) and performs one update step with the summed-up gradient. It then sends the updated model back to the local nodes, which compute the next round of gradients. There’s been extensive work on showing why this is theoretically sound, and it actually works in practice. The scalability of the approach is seemingly perfect, and it has even been called “embarrassingly parallel”, to underline how easy and effective the parallelization is. Since a lot of machine learning relies on gradient-based methods, this technique is widespread, e.g., it is used in Spark’s MLlib and in many implementations of the parameter server.
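In code, the whole scheme fits in a few lines. The following is an illustrative single-process simulation (not a particular system’s implementation): workers compute gradients on their local shards, the coordinator combines them and takes one step.

```python
# Minimal sketch of distributed mini-batch gradient descent: each worker
# computes the gradient on its local shard, the coordinator sums the local
# gradients (normalized by the total number of samples) and takes one step.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr = 4, 5, 0.1
w_true = rng.normal(size=dim)

# each worker holds its own shard of the data
shards = []
for _ in range(n_workers):
    X = rng.normal(size=(50, dim))
    shards.append((X, X @ w_true + 0.1 * rng.normal(size=50)))

def local_gradient(w, X, y):
    """Gradient of the summed squared loss on one worker's shard."""
    return X.T @ (X @ w - y)

w = np.zeros(dim)
n_total = sum(len(y) for _, y in shards)
for _ in range(200):
    grads = [local_gradient(w, X, y) for X, y in shards]   # computed in parallel
    w = w - lr * np.sum(grads, axis=0) / n_total            # one coordinator update
print("distance to true model:", np.linalg.norm(w - w_true))
```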
The Problem of Distributed Gradient Computation for ML
At this point, we could be done. Actually, during my PhD I asked myself many times why I should work on other methods, since this one is so effective. But the devil lies in the detail, and there are two large, connected issues with this technique. The first one is the amount of communication: for every update step, the locally computed gradients need to be sent to the coordinator and the updated model has to be sent back. This is no problem in a tightly connected system, like a high-performance cluster, but it can already become an issue in a cloud, and it is prohibitive for physically distributed devices, like cars and cellphones. So why not compute more local gradients each iteration and update a bit less frequently? On paper, this should solve the problem. Here the second, more severe issue comes into play. And for this one, I have to give a bit of background.
Two very prominent optimization algorithms used for machine learning are gradient descent (GD) and stochastic gradient descent (SGD). In gradient descent, you compute the gradient with respect to all samples in your dataset, then update your model, and keep iterating this until you converge. In stochastic gradient descent, you compute the gradient with respect to only a single sample and then update. In standard optimization settings, GD requires fewer updates, but each update is more costly, since you have to calculate all the gradients. SGD instead requires more updates, but each update is dirt cheap. There is also something in between, mini-batch SGD: compute the gradient with respect to a handful of samples (the mini-batch) and then update. The larger the mini-batch, the more this resembles GD; the smaller it is, the more it behaves like SGD.
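In fact, all three are the same update rule with a different batch size, as this small illustrative sketch for a linear model with squared loss shows:

```python
# One update rule covers GD, SGD and mini-batch SGD; only the batch size
# changes: batch_size=1 gives SGD, batch_size=len(X) gives full GD.
import numpy as np

def sgd_step(w, X, y, lr, batch_size, rng):
    """One update on a randomly drawn mini-batch (squared loss, linear model)."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return w - lr * Xb.T @ (Xb @ w - yb) / batch_size
```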
Figure 1: Support vector regression on a one-dimensional example (a sine with an upward trend), trained on a sample of 20 points. The higher the regularization parameter lambda, the less complex the model gets, up to a horizontal line for lambda = 100. The smaller the parameter, the more complex the model becomes, eventually being able to fit nearly every sample point, but behaving wildly wrong in between samples.
With machine learning, however, there is an interesting twist to the story, and this one becomes more theoretical, so please bear with me. In machine learning, you want to optimize the risk, or true error, of your model. That is, you want to minimize the error the model makes on novel, unseen data. This is also called the generalization error. All that you can do, however, is minimize the empirical error, that is, the error you make on your training set. But simply finding the model that minimizes the empirical risk can lead to an effect called overfitting: if the model is complex enough, it can memorize the training set and achieve zero error there, but be completely wrong on everything else. What is typically done to avoid this is called regularization: you restrict the model complexity a bit to find a nice trade-off between empirical error and generalization. In figure 1 I illustrate this for support vector regression. Its complexity can be controlled with the regularization parameter lambda. If we regularize too much, the model is very simple and cannot fit the data, so both empirical and true error are high. If we regularize too little, the model becomes very complex and can fit each training point nearly perfectly, but in between it shows crazy behavior. The optimum is somewhere in the middle. Of course, finding this optimum is not trivial and has to be done empirically for each new dataset.
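If you want to play with this yourself, here is a sketch of the kind of experiment behind figure 1 (illustrative, not the code that produced the figure): note that scikit-learn’s SVR exposes C, which roughly plays the role of 1/lambda, so strong regularization means small C.

```python
# Fit an SVR on 20 noisy samples of a sine with an upward trend for several
# regularization strengths; the training error shrinks as lambda decreases.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 20)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * X.ravel() + 0.1 * rng.normal(size=20)

for lam in [100, 1, 0.01]:
    model = SVR(kernel="rbf", C=1.0 / lam, gamma=2.0).fit(X, y)
    mse = np.mean((model.predict(X) - y) ** 2)
    print(f"lambda = {lam:>6}: training MSE = {mse:.3f}")
```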
Ok, so why this lengthy digression into generalization? Because SGD has a very cool property: if you draw a new sample for every update, then in expectation, SGD optimizes the true error directly, not the empirical error (see Chapter 14.5.1 in this awesome book). So while GD is better at finding the model that minimizes the empirical error, SGD will find a model that does not minimize the empirical error, but generalizes a lot better. And indeed, this has been observed over and over, especially with neural networks. And now, finally, we get back to the problem with distributed mini-batching. When you calculate more gradients on each local device, the update is made with respect to more samples. So your “mini-batch size” becomes larger. But then your learning algorithm behaves more like GD and less like SGD, so your model’s true error will become worse. And it’s not only the communication frequency, it’s also the number of nodes. If you have 10 nodes, and each one computes a single gradient, then the effective mini-batch size for your updates is 10. If you use 1000 nodes each computing a single gradient, the mini-batch size becomes 1000. So here is the gist: with distributed mini-batching you cannot scale up your system arbitrarily and you cannot reduce communication that much, because otherwise it will behave too much like GD and will produce models that do not generalize well. Short disclaimer at this point: I am sure that lots of people have worked on clever practical tweaks to alleviate this problem, and I hope we will see some of these at our workshop. However, there is no principled way around it – at least I don’t know of one, so please correct me if I am wrong.
Model Averaging and Federated Learning
Figure 2: Illustration of a non-convex error surface with four models (blue) each having reached a local minimum. The average (red) of these models has a substantially higher error than any of the four local models.
A different approach is to not parallelize the learning algorithm itself at all. Instead, you train a local model, send the model parameters to the coordinator, and aggregate them into a better model (and then you could send that better model back and iterate the process). In my PhD thesis I called this black-box parallelization, since it treats the learning algorithm as a black box. A simple but effective method to aggregate model parameters is to just average them. A lot of people were researching in this direction, but it was Google that brought the breakthrough in 2017 and gave this approach popularity, when they did averaging for neural networks and called it federated learning. This has created quite a hype around this topic, especially in the deep learning community. And indeed, it works. Google is using it to train recommendation models in their keyboard app for Android phones. It uses less communication and is more resilient than distributed mini-batching. Moreover, this approach is highly privacy-preserving, and thus a hot candidate for machine learning in the medical domain. However, there is one big issue: the model you get by averaging is not necessarily the one you would get from centralized training. Instead, you always lose in model quality. And this time not only in generalization ability, but actually also in empirical error. It gets even worse: for neural networks there can be multiple minima of the empirical error, so the average of a bunch of models can even be a lot worse than the local models (see figure 2 for an illustration). In practice, this still works very well. Moreover, the communication can be further reduced by only communicating with a random subset of nodes (federated averaging), or by deciding in a data-driven manner when to communicate (dynamic averaging). And under some strict assumptions, one can even understand how it works theoretically.
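A tiny numpy demo makes the pitfall from figure 2 tangible. This is my own toy construction, not from any paper: two one-hidden-layer ReLU networks that compute exactly the same function, only with their hidden units in a different order. Averaging their parameters yields a model that computes something else entirely.

```python
# Two networks with permuted hidden units compute the same function, but
# their parameter average does not.
import numpy as np

def net(x, W, v):
    return np.maximum(W @ x, 0.0) @ v       # one hidden ReLU layer, linear output

W1 = np.array([[2.0, -1.0], [0.5, 1.0]]); v1 = np.array([1.0, -2.0])
W2, v2 = W1[::-1], v1[::-1]                  # same network, hidden units swapped
W_avg, v_avg = (W1 + W2) / 2, (v1 + v2) / 2

x = np.array([1.0, 1.0])
print(net(x, W1, v1), net(x, W2, v2))        # identical outputs
print(net(x, W_avg, v_avg))                  # the averaged model disagrees
```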
Still, there is one subtle issue which might be a deal breaker. Let’s assume we learn using our off-the-shelf SGD, which requires us to set a learning rate; let’s call it l. It is the step size each update makes. To achieve good performance, this l has to be set correctly, not too small and not too large. If we choose it too small, the algorithm will take ages to converge to a good model. If we choose it too large, the algorithm will jump around wildly and might ultimately diverge, breaking the learning entirely. Now, what happens when you train two models separately, each with learning rate l, for exactly one update step and then average their parameters? You will get the same model you would get when training on both local datasets jointly, but with half the learning rate. If instead you used ten models, it would be one tenth of the learning rate (see Prop. 3 in this paper on dynamic averaging for deep learning). So the learning rate goes down with the number of nodes in your distributed system. Now this is a huge problem! If you use too many nodes, your effective learning rate will be tiny and the model barely converges to a good one. One can compensate by setting higher learning rates at the local nodes – and this actually works – but it only goes so far; if you set the local learning rates too high, local training breaks completely and nothing works. This seems to put a limit on the scalability of the approach. I would love to see some analysis and ideas to overcome this in our workshop!
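Here is a quick numerical sanity check of that effect for a single gradient step on a squared loss. I use the convention that each local loss is a sum (not an average) over the local samples, so the joint loss is simply the sum of the local ones; the example is purely illustrative.

```python
# Averaging two models after one local step with rate l equals the joint
# update with rate l/2 (sum-of-losses convention, single gradient step).
import numpy as np

rng = np.random.default_rng(0)
dim, lr = 3, 0.1
w0 = rng.normal(size=dim)
X1, y1 = rng.normal(size=(10, dim)), rng.normal(size=10)
X2, y2 = rng.normal(size=(10, dim)), rng.normal(size=10)

def grad(w, X, y):
    """Gradient of the *summed* squared loss over one local dataset."""
    return X.T @ (X @ w - y)

averaged   = ((w0 - lr * grad(w0, X1, y1)) + (w0 - lr * grad(w0, X2, y2))) / 2
joint_half = w0 - (lr / 2) * (grad(w0, X1, y1) + grad(w0, X2, y2))
print(np.allclose(averaged, joint_half))   # True: the effective rate is l/2
```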
The Radon Machine
Finally, there is a more esoteric approach which does not average model parameters but instead iteratively computes a form of high-dimensional median of them – called the Radon point. The approach is called the Radon machine and has some very nice theoretical properties. It scales well with the number of nodes, it gives a guarantee on the model quality, it is also privacy preserving, and it only requires a single round of communication. However, it currently only works for linear models and the guarantees only hold for convex learning problems. So its practical use is still quite limited. Still, going beyond averaging could be the way to overcome all these issues. I hope we can see some novel approaches along this line, as well.
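For the curious, here is how a single Radon point can be computed with the classical construction (an illustrative sketch, not the code from the paper); the Radon machine then applies this aggregation recursively over several layers of nodes.

```python
# Radon point of d+2 points in R^d: solve sum_i a_i x_i = 0, sum_i a_i = 0
# for a nontrivial a, split the points by the sign of a_i, and take the
# convex combination induced by the positive part.
import numpy as np

def radon_point(points):
    """Radon point of n >= d+2 points in R^d (uses the first d+2 of them)."""
    d = points.shape[1]
    P = points[: d + 2]
    # homogeneous system: columns are [x_i; 1]; find a null-space vector
    A = np.vstack([P.T, np.ones(len(P))])
    _, _, Vt = np.linalg.svd(A)
    a = Vt[-1]
    pos = a > 0
    return (a[pos] @ P[pos]) / a[pos].sum()

rng = np.random.default_rng(0)
models = rng.normal(size=(6, 4))    # e.g. 6 linear models with 4 parameters each
print(radon_point(models))
```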
Conclusion
So here we are: there are traditional distributed ML approaches that produce the same model as the centralized computation, but they require a prohibitive amount of communication and don’t scale well. Then there is distributed mini-batching, which computes gradients in parallel. This works well in tightly coupled systems like high-performance clusters, but not so well on physically distributed devices, and it does not scale arbitrarily. Then there is model averaging, which works well in practice, but is not well understood theoretically and seems to have an in-built limit to its scalability. And then there is the Radon machine, which is theoretically interesting, but limited to convex methods and linear models, so at its current stage it is not very useful in practice. Thus, despite the surge of papers in this phase of hype, there is still a lot to do. And with this, I will now blatantly advertise our workshop.
Workshop on Parallel, Distributed, and Federated Learning – PDFL’20
Pascal Welke, Florian Seiffarth, Michael Kamp, Stefan Wrobel: HOPS: Probabilistic Subtree Mining for Small and Large Graphs. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1275–1284, Association for Computing Machinery, Virtual Event, CA, USA, 2020, ISBN: 9781450379984.
Henning Petzka, Linara Adilova, Michael Kamp, Cristian Sminchisescu: Feature-Robustness, Flatness and Generalization Error for Deep Neural Networks. 2020.
Linara Adilova, Livin Natious, Siming Chen, Olivier Thonnard, Michael Kamp: System Misuse Detection via Informed Behavior Clustering and Modeling. 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE 2019.
Linara Adilova, Julia Rosenzweig, Michael Kamp: Information Theoretic Perspective of Federated Learning. NeurIPS Workshop on Information Theory and Machine Learning, 2019.
Henning Petzka, Linara Adilova, Michael Kamp, Cristian Sminchisescu: A Reparameterization-Invariant Flatness Measure for Deep Neural Networks. Science meets Engineering of Deep Learning workshop at NeurIPS, 2019.
Sven Giesselbach, Katrin Ullrich, Michael Kamp, Daniel Paurat, Thomas Gärtner: Corresponding Projections for Orphan Screening. Proceedings of the ML4H workshop at NeurIPS, 2018.
Michael Kamp, Linara Adilova, Joachim Sicking, Fabian Hüger, Peter Schlicht, Tim Wirtz, Stefan Wrobel: Efficient Decentralized Deep Learning by Dynamic Model Averaging. In: Machine Learning and Knowledge Discovery in Databases, Springer, 2018.
Ioannis Flouris, Nikos Giatrakos, Antonios Deligiannakis, Minos Garofalakis, Michael Kamp, Michael Mock: Issues in Complex Event Processing: Status and Prospects in the Big Data Era. In: Journal of Systems and Software, 2017.
Michael Kamp, Mario Boley, Olana Missura, Thomas Gärtner: Effective Parallelisation for Machine Learning. In: Advances in Neural Information Processing Systems, pp. 6480–6491, 2017.
Katrin Ullrich, Michael Kamp, Thomas Gärtner, Martin Vogt, Stefan Wrobel: Co-regularised support vector regression. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 338–354, Springer 2017.
Michael Kamp, Sebastian Bothe, Mario Boley, Michael Mock: Communication-Efficient Distributed Online Learning with Kernels. In: Frasconi, Paolo; Landwehr, Niels; Manco, Giuseppe; Vreeken, Jilles (Ed.): Machine Learning and Knowledge Discovery in Databases, pp. 805–819, Springer International Publishing, 2016.
Katrin Ullrich, Michael Kamp, Thomas Gärtner, Martin Vogt, Stefan Wrobel: Ligand-based virtual screening with co-regularised support Vector Regression. In: 2016 IEEE 16th international conference on data mining workshops (ICDMW), pp. 261–268, IEEE 2016.
Michael Kamp, Mario Boley, Thomas Gärtner: Parallelizing Randomized Convex Optimization. Proceedings of the 8th NIPS Workshop on Optimization for Machine Learning, 2015.
Michael Kamp, Mario Boley, Daniel Keren, Assaf Schuster, Izchak Sharfman: Communication-Efficient Distributed Online Prediction by Dynamic Model Synchronization. In: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery (ECMLPKDD), Springer 2014.
Thomas Gärtner, Michael Kamp, Mario Boley: Beating Human Analysts in Nowcasting Corporate Earnings by using Publicly Available Stock Price and Correlation Features. In: Proceedings of the SIAM International Conference on Data Mining, pp. 641–649, SIAM 2014.
Michael Kamp, Mario Boley, Michael Mock, Daniel Keren, Assaf Schuster, Izchak Sharfman: Adaptive Communication Bounds for Distributed Online Learning. Proceedings of the 7th NIPS Workshop on Optimization for Machine Learning, 2014.
Michael Kamp, Andrei Manea: STONES: Stochastic Technique for Generating Songs. Proceedings of the NIPS Workshop on Constructive Machine Learning (CML), 2013.
Michael Kamp, Christine Kopp, Michael Mock, Mario Boley, Michael May: Privacy-preserving mobility monitoring using sketches of stationary sensor readings. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 370–386, Springer 2013.
Mario Boley, Michael Kamp, Daniel Keren, Assaf Schuster, Izchak Sharfman: Communication-Efficient Distributed Online Prediction using Dynamic Model Synchronizations. First International Workshop on Big Dynamic Distributed Data (BD3) at the International Conference on Very Large Data Bases (VLDB), 2013.
Michael Kamp, Mario Boley, Thomas Gärtner: Beating Human Analysts in Nowcasting Corporate Earnings by Using Publicly Available Stock Price and Correlation Features. 2013 IEEE 13th International Conference on Data Mining Workshops, IEEE 2013.