Machine learning algorithms, especially deep learning neural networks, often produce models that improve prediction accuracy. But that accuracy comes at the expense of higher computation and memory consumption. A deep learning model consists of layers of computation: at each layer, computations involving thousands of parameters are performed and the results are passed to the next layer, iteratively. The higher the dimensionality of the input data (e.g., a high-resolution image), the higher the computational need. GPU farms in the cloud are often used to meet these computational requirements.
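To make the effect of input dimensionality concrete, here is a minimal sketch, assuming PyTorch is installed, of a small fully connected model; the layer sizes are illustrative only. It prints how the parameter count grows with the input dimensionality and runs one forward pass, layer by layer.

```python
# Minimal sketch: parameter count grows with input dimensionality,
# and inference passes data through the layers one after another.
import torch
import torch.nn as nn

def build_model(input_dim: int) -> nn.Sequential:
    # A small fully connected model: each layer's output feeds the next.
    return nn.Sequential(
        nn.Linear(input_dim, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 10),
    )

for input_dim in (28 * 28, 224 * 224):          # low- vs. high-resolution input
    model = build_model(input_dim)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"input_dim={input_dim:>6}  parameters={n_params:,}")

# A single inference pass over one dummy input.
x = torch.randn(1, 224 * 224)
with torch.no_grad():
    y = build_model(224 * 224)(x)
print(y.shape)  # torch.Size([1, 10])
```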
When machine learning is used for use cases such as detecting product quality in manufacturing, predicting the health of a critical piece of equipment, or video surveillance, inference is expected to happen in near real time. Inferencing in the cloud requires moving data from the source to the cloud and introduces several challenges: (a) it is costly to bring data to the cloud for real-time inference, (b) moving data from the edge to the cloud adds network latency, (c) sending data from the edge to the cloud raises scalability issues as the number of connected devices increases, and (d) the security of user data is put at risk when it is sent to the cloud.
Edge computing is a distributed computing paradigm that brings computation and data storage closer to the location where they are needed, improving response times and saving bandwidth. Though edge computing addresses the connectivity, latency, scalability, and security challenges, the computational resources that deep learning models require are hard to provide on smaller edge devices. Before determining the type of hardware for edge devices, it is important to establish key performance metrics for the inference. At a high level, the key performance metrics for machine learning at the edge can be summarized as
latency, throughput, energy consumption by the device, and accuracy. Latency refers to the time it takes to infer one data point, throughput is the number of inference calls per second, and accuracy is the confidence level of the prediction output required by the use case. Depending on these requirements, one can take one or more of the following approaches to speed up inference on the resource-constrained edge device.
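As a rough illustration of the first two metrics, the sketch below, using only the Python standard library, measures latency and throughput for an arbitrary inference call; run_inference and sample_input are hypothetical placeholders for the real model call and input data.

```python
# Minimal sketch: measure latency (time per data point) and
# throughput (inference calls per second) for a given inference function.
import time

def benchmark(run_inference, sample_input, n_calls: int = 100):
    # Warm-up call so one-time setup cost is not counted.
    run_inference(sample_input)

    start = time.perf_counter()
    for _ in range(n_calls):
        run_inference(sample_input)
    elapsed = time.perf_counter() - start

    latency_ms = (elapsed / n_calls) * 1000.0   # average time per inference
    throughput = n_calls / elapsed              # inference calls per second
    return latency_ms, throughput

# Example with a trivial stand-in for a model:
latency_ms, throughput = benchmark(lambda x: sum(x), list(range(1000)))
print(f"latency: {latency_ms:.3f} ms, throughput: {throughput:.1f} calls/s")
```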
The right machine learning model for the edge device
Researchers have found that reducing the number of parameters in deep neural network models helps decrease the computational resources needed for model inference. Some popular models that use such techniques with minimal (or no) accuracy degradation are YOLO, MobileNets, Single Shot Detector (SSD), and SqueezeNet. Many of these pre-trained models are available to download and use through open-source platforms such as TensorFlow or PyTorch.
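For example, a pre-trained MobileNetV2 can be loaded and run in a few lines. This is a minimal sketch, assuming PyTorch and torchvision are installed; the exact weights argument depends on the torchvision version.

```python
# Minimal sketch: load a pre-trained MobileNetV2 and run one inference
# on a dummy input. Newer torchvision releases use the `weights=` argument;
# older ones use `pretrained=True`.
import torch
from torchvision import models

model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
model.eval()                        # inference mode: disables dropout, etc.

x = torch.randn(1, 3, 224, 224)     # one dummy RGB image, 224x224
with torch.no_grad():
    logits = model(x)
print(logits.shape)                 # torch.Size([1, 1000]) -- ImageNet classes
```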
Model compression is another technique used to execute models on the edge device. A compressed model may lose some accuracy compared to the original, but in many cases this is acceptable. By combining several compression techniques and caching intermediate results for reuse across iterations, researchers have improved the execution speed of deep neural network models.
DeepMon is one such machine learning framework for continuous computer vision applications on edge devices. Similar techniques are used by TensorFlow Lite to run models at the edge.
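As an illustration, a typical TensorFlow Lite workflow applies post-training quantization when converting a trained model for the edge device. The sketch below assumes TensorFlow is installed and uses a placeholder path, saved_model_dir, for an already-trained SavedModel.

```python
# Minimal sketch: compress a trained model with TensorFlow Lite
# post-training quantization. "saved_model_dir" is a placeholder path.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# On the edge device, the compressed model runs through the TFLite interpreter.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
```

With Optimize.DEFAULT alone, TensorFlow Lite applies dynamic-range quantization of the weights, which typically shrinks the model to roughly a quarter of its original size at the cost of a small accuracy drop.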
Edge hardware
To speed up inference at the edge, hardware vendors recommend increasing the number of CPU cores or GPUs in the edge devices. There are also specialized hardware components such as the Edge TPU from Google, or FPGA (field-programmable gate array) based deep learning accelerators from Intel, Microsoft, and others.
Software
Hardware vendors are increasingly entering the machine learning space and providing tools and SDKs that support their existing investments and innovations. These toolkits help use the hardware resources efficiently for deep learning and thus speed up execution. Intel’s OpenVINO toolkit leverages Intel chips, including CPUs, GPUs, FPGAs, and vision processing units. The EGX platform from Nvidia and the Neural Processing SDK from Qualcomm support their respective hardware. There are also generic libraries such as RSTensorFlow, which uses the GPU to accelerate matrix manipulations in deep learning models.
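As an example of such a vendor toolkit, the sketch below shows one way to run a model with the OpenVINO Python runtime. It assumes the OpenVINO runtime is installed, that a model has already been converted to the IR format (the model.xml file name is a placeholder), and that the 2022-and-later Python API is available; details may differ between releases.

```python
# Rough sketch: inference with the OpenVINO Python runtime (2022+ API).
# "model.xml" is a placeholder for an IR file produced by the model converter.
import numpy as np
from openvino.runtime import Core

core = Core()
print(core.available_devices)                  # e.g. ['CPU', 'GPU']

model = core.read_model("model.xml")
compiled = core.compile_model(model, "CPU")    # or "GPU" if one is available

shape = tuple(compiled.inputs[0].shape)        # model-defined (static) input shape
x = np.random.rand(*shape).astype(np.float32)  # dummy input
result = compiled([x])[compiled.output(0)]
print(result.shape)
```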
In general, many challenges remain in improving the performance of deep learning models, but it is apparent that model inference is moving toward edge devices. To speed up execution on a resource-constrained device, it is important to understand the business use case and the key performance requirements of the model.
SAP Edge Services and SAP Data Intelligence together provide an end-to-end tool for training machine learning models in the cloud and managing their life cycle and execution on edge devices.