This blog outlines the advantages of attaching compute to storage and a reference architecture for implementing it in the cloud with FPGAs.
Almost all deep learning algorithms are very memory intensive, and moving data into and out of the CPU/GPU takes more energy (power) than the computation itself. Optimal DMA scheduling is challenging because several threads compute different parts of the network, and the dependencies between them leave a lot of dead cycles. Moreover, 90% of the weights in a fully connected layer are zeros, so transferring those zeros wastes energy. FPGAs offer a healthy compromise between CPUs/GPUs and ASICs for implementing fixed functions close to the memory. Attaching compute near the storage in the cloud is therefore an ideal choice.
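To make the cost of moving zero weights concrete, the minimal sketch below compares the bytes needed to move a dense weight vector with the bytes needed for a simple index/value encoding of only its non-zero entries. The 16-bit weight size matches the figure used later in this post. The 32-bit index width, the encoding itself, the synthetic data, and all function names are assumptions chosen for illustration; they do not describe any particular FPGA design.

```python
import random

WEIGHT_BYTES = 2  # 16-bit weights
INDEX_BYTES = 4   # assumed 32-bit position index per non-zero weight


def dense_bytes(weights):
    """Bytes moved when every weight, zero or not, is transferred."""
    return len(weights) * WEIGHT_BYTES


def sparse_bytes(weights):
    """Bytes moved when only (index, value) pairs of non-zero weights are sent."""
    nonzero = sum(1 for w in weights if w != 0)
    return nonzero * (INDEX_BYTES + WEIGHT_BYTES)


def make_sparse_weights(n, zero_fraction, seed=0):
    """Build a synthetic weight list with the requested share of zeros."""
    rng = random.Random(seed)
    n_zero = round(n * zero_fraction)
    weights = [0] * n_zero + [rng.randint(1, 127) for _ in range(n - n_zero)]
    rng.shuffle(weights)
    return weights


# Synthetic layer: 10,000 weights, 90% of them zero.
weights = make_sparse_weights(10_000, zero_fraction=0.9)
print("dense transfer bytes: ", dense_bytes(weights))
print("sparse transfer bytes:", sparse_bytes(weights))
```

Under these assumptions the sparse encoding moves well under half of the dense byte count, even though each surviving weight carries an extra index.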
There are several advantages to FPGA technology.
- The reconfigurability makes them more flexible than a custom-built ASIC
- The performance of an FPGA is higher than that of CPUs and GPUs owing to the hardened pipeline
- FPGAs can perform inline processing with low latency, enabling real-time response
- GPUs depend on batch processing for performance, which increases latency
- An FPGA can sit close to the data source/sink, avoiding DMA overheads
Case for FPGA acceleration near Storage
The graph below shows the roofline for a CPU (as reported by Google in its 2017 TPU-1 paper). CPU compute utilization reaches its maximum only for workloads above 20 operations per byte. Fully connected networks require 1 operation per 2 weight bytes (16-bit weights), so the CPU spends most of its time waiting for loads. Database processing tasks are likewise IO-bound, with very few operations performed per word accessed when the database is not held in memory. In many cases, databases are too large to fit fully into memory when the number of nodes is small.
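The gap between these two figures can be estimated with a simple roofline calculation. The sketch below assumes the basic roofline model, in which attainable throughput grows linearly with operational intensity until it reaches peak compute at the ridge point. It takes the 20 operations per byte quoted above as the ridge point; the function name is hypothetical.

```python
def attainable_fraction(ops_per_byte, ridge_ops_per_byte=20.0):
    """Fraction of peak compute reachable under a basic roofline model.

    Below the ridge point the workload is memory-bound, so throughput
    scales with operational intensity; above it, compute is the limit.
    """
    if ops_per_byte <= 0 or ridge_ops_per_byte <= 0:
        raise ValueError("intensities must be positive")
    return min(1.0, ops_per_byte / ridge_ops_per_byte)


# Fully connected layer: 1 operation per 2 weight bytes (16-bit weights).
fc_intensity = 1 / 2
print(f"FC layer reaches {attainable_fraction(fc_intensity):.1%} of peak")
```

Under this model a fully connected layer at 0.5 operations per byte can use only a few percent of peak compute, which is why it is attractive to move the computation closer to the data instead.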
When the FPGA does the processing near the storage, the host processor receives only the results while the data stays near the drives. Attaching a memory cache to the FPGA allows recently accessed data to be reused. This makes it possible to implement very large databases/datasets with very few nodes and very high performance, reducing the overhead of implementing clusters of processors.
Moreover, data can move back and forth between the SSD store and the FPGA without going over the network, relieving IOPS pressure on the processor and network bandwidth at the spine switches.
Some operations on the data can be offloaded to the accelerator as data is written to and read from storage. Low latency and in-line processing are an advantage for such functionality. For example, functions such as compression can be implemented in the accelerator to reduce the footprint of the database.
As with deep learning algorithms, compute-attached storage can be used to accelerate many database and visualization functions. To process data in the FPGA, the data is fetched directly from storage and only the result is sent to the CPU. This alleviates the IOPS requirement on the CPU, which only needs to issue a command to the FPGA.
FPGA processing latency is very low, and without DMA transfers or passes through network traffic it is lower still, enabling highly responsive applications. Moreover, an FPGA can implement multiple threads with micro-engines, enabling very high levels of parallelism for database tasks. An FPGA can accelerate search and other common database functions such as filtering and arithmetic operations.
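As a rough software illustration of compression on the write and read path, the sketch below wraps a toy block store so that data is compressed as it is written and decompressed as it is read. Python's `zlib` stands in for the accelerator's compression engine, and the class name, block identifiers, and sample record are hypothetical.

```python
import zlib


class InlineCompressingStore:
    """Toy block store that compresses on write and decompresses on read."""

    def __init__(self):
        self._blocks = {}  # block id -> compressed bytes

    def write(self, block_id, data):
        """Compress the block on its way to storage; return the stored size."""
        compressed = zlib.compress(data)
        self._blocks[block_id] = compressed
        return len(compressed)

    def read(self, block_id):
        """Decompress the block on its way back to the caller."""
        return zlib.decompress(self._blocks[block_id])


store = InlineCompressingStore()
record = b"status=OK;region=us-east;" * 200  # repetitive sample data
stored_size = store.write("block-0", record)
print("original bytes:", len(record), "stored bytes:", stored_size)
```

The caller reads and writes uncompressed data and never invokes the compressor directly, which is the property an in-line accelerator would provide in hardware.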
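The effect on data movement can be sketched with a small simulation. The code below compares a conventional path, where every stored byte crosses to the host before filtering, with an offloaded path, where the scan runs next to the storage and only matching rows are returned. The row layout, class name, method names, and threshold are all hypothetical, and the "accelerator" here is ordinary Python standing in for FPGA logic.

```python
import struct

ROW = struct.Struct("<iI")  # hypothetical row layout: int32 id, uint32 amount


class SimulatedStorageNode:
    """Stand-in for a storage node with compute attached to the drives."""

    def __init__(self, rows):
        self._raw = b"".join(ROW.pack(row_id, amount) for row_id, amount in rows)

    def read_all(self):
        """Conventional path: every stored byte crosses to the host."""
        return self._raw

    def filter_amount_above(self, threshold):
        """Offloaded path: scan next to the storage, return matching rows only."""
        out = bytearray()
        for row_id, amount in ROW.iter_unpack(self._raw):
            if amount > threshold:
                out += ROW.pack(row_id, amount)
        return bytes(out)


node = SimulatedStorageNode((i, i % 100) for i in range(1000))

# Host-side filtering needs the full table; the offloaded call returns results only.
host_side = [row for row in ROW.iter_unpack(node.read_all()) if row[1] > 97]
offloaded = list(ROW.iter_unpack(node.filter_amount_above(97)))

print("bytes to host, conventional:", len(node.read_all()))
print("bytes to host, offloaded:   ", len(node.filter_amount_above(97)))
```

Both paths produce the same rows, but in the offloaded case the host issues one command and receives only the matches.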
Cloud Support for FPGA Acceleration
FPGAs in the cloud can be reprogrammed for different applications as compute requirements change. Bing search, for instance, is a well-known example of their use and efficiency.
Microsoft Azure has FPGAs that connect directly to the TOR (top-of-rack) switch to fetch data without interrupting the host processor, enabling in-line acceleration.
Amazon AWS provides F1 instances, which can be built with SSD storage for fast access. However, AWS instances require the host processor to schedule the DMA transfers. Here, a processor + FPGA + SSD storage can form a storage node that delivers very high performance, low latency, and less network traffic.
Something's got to give
One of the main disadvantages of FPGA-accelerated compute near storage in the cloud is that new APIs may be needed to use the FPGA functions. For deep learning this is relatively straightforward: functions can be offloaded layer by layer, or the complete network can be offloaded.
For database acceleration, however, either special offloading functions need to be called in lieu of standard SQL commands, or a complete interpreter can be built, similar to MemSQL, with the backend implemented using FPGA functions.
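The interpreter option can be illustrated with a very small front end that accepts one restricted SQL form and translates it into an offload call. This is only a sketch under stated assumptions: the supported query shape, the function names, and the in-memory table are hypothetical, the "kernel" is plain Python standing in for an FPGA function, and nothing here reflects how MemSQL or any real product is built.

```python
import re

# Only this one query shape is supported: SELECT col FROM table WHERE col > N
QUERY = re.compile(
    r"SELECT\s+(\w+)\s+FROM\s+(\w+)\s+WHERE\s+(\w+)\s*>\s*(\d+)\s*;?\s*$",
    re.IGNORECASE,
)


def offload_filter_gt(table, where_col, threshold, select_col):
    """Stand-in for an FPGA filter kernel; here it runs as plain Python."""
    return [row[select_col] for row in table if row[where_col] > threshold]


def execute(sql, tables):
    """Translate the restricted SELECT form into a single offload call."""
    match = QUERY.match(sql.strip())
    if match is None:
        raise ValueError("query form not supported by this sketch")
    select_col, table_name, where_col, threshold = match.groups()
    return offload_filter_gt(
        tables[table_name], where_col, int(threshold), select_col
    )


tables = {"orders": [{"id": 1, "amount": 50}, {"id": 2, "amount": 500}]}
print(execute("SELECT id FROM orders WHERE amount > 100", tables))
```

The application keeps issuing SQL text, while the choice of which backend function runs is hidden behind `execute`. A full interpreter would need a real parser and planner, which is the cost this section refers to.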
Storage acceleration for deep learning gave about a 5x improvement in performance compared to running the model on a GPU, and about a 25x improvement compared to running the model on a CPU. Storage acceleration for SQL/ETL applications gave an improvement of over 25x compared to a CPU.
Firstly, moving compute closer to the storage reduces overall power consumption and improves efficiency. It also provides a very scalable architecture, obviating the need for high-performance GPUs and switches for simple tasks. Secondly, moving compute closer to the storage is not confined to private clouds; it is also feasible in the current cloud architectures at Amazon AWS and Microsoft Azure. Gyrus has implemented several such FPGA offloading projects successfully.