This is a research entry associated with a third-party research project or paper. We are not responsible for the contents of any files associated with this submission, or for the accuracy of any code / results. Any questions should be directed to the author(s) of the work.
Hatch: Self-Distributing Systems for Data Centers
Abstract: Designing and maintaining distributed systems remains highly challenging: there is a high-dimensional design space of potential ways to distribute a system’s sub-components over a large-scale infrastructure; and the deployment environment for a system tends to change in unforeseen ways over time. For engineers, this is a complex prediction problem to gauge which distributed design may best suit a given environment. We present the concept of self-distributing systems, in which any local system built using our framework can learn, at runtime, the most appropriate distributed design given its perceived operating conditions. Our concept abstracts distribution of a system’s sub-components to a list of simple actions in a reward matrix of distributed design alternatives to be used by reinforcement learning algorithms. By doing this, we enable software to experiment, in a live production environment, with different ways in which to distribute its software modules by placing them in different hosts throughout the system’s infrastructure. We implement this concept in a framework we call Hatch, which has three major elements: (i) a transparent and generalized RPC layer that supports seamless relocation of any local component to a remote host during execution; (ii) a set of primitives, including relocation, replication and sharding, from which to create an action/reward matrix of possible distributed designs of a system; and (iii) a decentralized reinforcement learning approach to converge towards more optimal designs in real time. Using an example of a self-distributing webserving infrastructure, Hatch is able to autonomously select the most suitable distributed design from among 700,000 alternatives in about 5 minutes.
Status: Paper has no replication package, or it's hosted elsewhere; see paper for details.
Venue: FGCS 2022.