With the deluge of deep learning libraries and software innovation in the field over the last few years, it is an exciting time to be working on machine learning problems. Most of the available libraries evolved from fairly specialized computational code for large dense problems, such as image classification, into general frameworks for neural-network-based models that offer only marginal support for sparse models.
At Netflix, our machine learning scientists deal with a wide variety of problems across a broad spectrum of areas: from tailoring TV and movie recommendations to your taste to optimizing encoding algorithms. A subset of our problems involve dealing with extremely sparse data; the total dimensionality of the problem at hand can easily reach tens of millions of features, even though every observation may only have a handful of non-zero entries. For these cases, we felt the need for a minimalist library that is specifically optimized for training shallow feedforward neural nets on sparse data in a single-machine, multi-core environment. We wanted something small and easy to hack, so we built Vectorflow, one of the many tools our machine learning scientists use.
Design considerations
A few months after the project’s inception, we’ve seen a wide variety of use cases for the library, and multiple research projects and production systems are now using Vectorflow for problems as diverse as causal inference, survival regression, density estimation or ranking algorithms for recommendation. In fact, we’re testing using Vectorflow to power part of the Netflix home page experience. It is also included in the default toolbox installed on basic instances used by Netflix machine learning practitioners.
As an example, we investigate the performance of the library on a marketing problem Netflix faces related to promoting our originals. In this case, we want to perform weighted Maximum Likelihood Estimation using an exponential survival distribution. To implement this, a custom callback function computing the per-observation gradient of the negative log-likelihood is passed to Vectorflow.
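The original snippet is not reproduced here, so the following is a minimal D sketch of the gradient such a callback computes. The observation fields (timeToEvent, censored, weight), the function signature, and the parameterization of the rate as lambda = exp(score) are illustrative assumptions, not Vectorflow’s documented API.

```d
import std.math : exp;

// Hypothetical observation layout; the actual field names used in the
// Netflix pipeline are not shown in the post.
struct Obs
{
    float timeToEvent; // observed (possibly right-censored) duration
    bool censored;     // true if the event was not observed
    float weight;      // per-observation weight for the weighted MLE
}

// Gradient of the weighted negative log-likelihood under an exponential
// survival model, where the single net output s parameterizes the rate
// as lambda = exp(s):
//   uncensored: -loglik = lambda * t - s  =>  d/ds = lambda * t - 1
//   censored:   -loglik = lambda * t      =>  d/ds = lambda * t
void expSurvivalGrad(float[] nnOut, ref Obs o, ref float[] grads)
{
    immutable lambda = exp(nnOut[0]);
    immutable g = lambda * o.timeToEvent - (o.censored ? 0.0f : 1.0f);
    grads[0] = o.weight * g;
}
```

Each observation contributes either the log-density of its event time or, if censored, the log of its survival probability, scaled by its weight; the callback only has to return the derivative of that term with respect to the net output.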
Using this callback for training, we can easily compare 3 models:
The data source is a Hive table stored on S3 in the columnar Parquet format. We train directly against this data by streaming it to a c4.4xlarge instance and building the training set in memory on that machine.
The results are as follows:
Both decompression and feature encoding happen on a single thread, so there is room for improvement, but the end-to-end runtime demonstrates that a distributed solution is unnecessary for medium-scale sparse datasets and shallow architectures. Notice that the training time scales roughly linearly with the number of non-zero entries per row as well as with the number of rows. One factor preventing fully linear scalability is the CPU memory hierarchy: when multiple asynchronous SGD threads access the same weights, the cache lines holding those weights are repeatedly invalidated, which breaks the theoretical assumptions of Hogwild whenever the access pattern over the model parameters is not sparse enough (see this paper for details).
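To make the contention argument concrete, here is a minimal Hogwild-style sketch, not Vectorflow’s implementation: several worker threads run SGD over a shared weight vector with no locking, so when the non-zero features of concurrent observations overlap heavily, the same cache lines are written by multiple cores and scaling degrades.

```d
import std.parallelism : parallel;

// Hypothetical sparse observation: non-zero feature ids/values plus a label.
struct SparseObs
{
    uint[] ids;
    float[] vals;
    float label;
}

void hogwildSgd(SparseObs[] data, float[] w, float lr)
{
    // Each worker updates the shared weight vector without any locking.
    // When observations touch mostly disjoint weights, races are rare and
    // convergence is preserved; when many threads hit the same "hot"
    // weights, the cache lines bounce between cores and throughput drops.
    foreach (ref o; parallel(data))
    {
        float pred = 0;
        foreach (i, id; o.ids)
            pred += w[id] * o.vals[i];
        immutable g = pred - o.label; // squared-loss gradient w.r.t. the score
        foreach (i, id; o.ids)
            w[id] -= lr * g * o.vals[i]; // racy, lock-free update
    }
}
```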
Future work
In the future, we plan to broaden the possible topologies supported beyond simple linear, polynomial or feedforward architectures, develop more specialized layers (such as recurrent cells) and explore new parallelism strategies while maintaining the “minimalist” philosophy of Vectorflow.