High-speed data processing in distributed storage
Fujitsu Laboratories has developed a technology that offers both high-speed data processing and high-capacity storage in distributed storage systems, in order to accelerate the processing of ever-increasing volumes of data.
Recently, customers have looked for processing speed improvements in storage systems that handle data analysis. This is in response to a growing need for technologies such as AI and machine learning to analyse rapidly growing volumes of data, including unstructured data such as video and log data. Conventionally, data has been analysed on separate processing servers, but processing it in the same systems where it is stored is expected to increase the speed of data processing.
Conventional data processing requires the processing server to read the data from the storage system. As the volume of data flowing between the storage system and the processing server increases, the time required to read the data can become a bottleneck when utilising large volumes of data. High-speed data processing becomes possible when the processing is done on the storage system itself, without moving the data. However, this approach makes it difficult to analyse unstructured data distributed across the storage system, and to maintain stable operation of the system’s original storage functionality.
Fujitsu Laboratories has now developed Dataffinic Computing, a technology for handling data processing in distributed storage systems, which distribute and collect data across multiple servers connected through a network, without reducing the original storage functionality of the system. With this technology, storage systems can process large volumes of data at high speeds, including unstructured data, enabling the efficient utilisation of ever-increasing amounts of data, including security camera video, logs from ICT systems, sensor data from cars and genetic data.
In order to improve access performance, distributed storage systems do not store large amounts of data in the same place, but break the data into pieces of a size that is easy to manage for storage. In the case of unstructured data such as videos and log data, however, individual pieces cannot be processed on their own when the file is mechanically broken down into pieces of a specified size and stored separately. It has therefore been necessary to gather the distributed data together again for processing, placing a significant load on the system.
By breaking down unstructured data along the natural breaks within the data, the technology stores the data in a state in which the individual pieces can still be processed. In addition, information essential for processing (such as header information) is attached to each piece of data. This means that the pieces of data scattered across the distributed storage can be processed individually, maintaining the scalability of access performance and improving the performance of the system as a whole.
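The idea of splitting along natural breaks and attaching header information to every piece can be illustrated with a minimal sketch. It assumes line-oriented log data in which each line is one record and a single header line describes the fields. The `Chunk` class, the `split_on_record_boundaries` function and the size limit are hypothetical names chosen for illustration, not part of Dataffinic Computing or Ceph.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """One independently processable piece of a larger file (hypothetical type)."""
    header: str          # information needed to interpret the records
    records: List[str]   # whole records only, never a partial one


def split_on_record_boundaries(header: str, records: List[str], max_bytes: int) -> List[Chunk]:
    """Group whole records into chunks of roughly max_bytes each.

    A chunk is closed only between records, so no record is cut in half,
    and every chunk carries its own copy of the header.
    A single record larger than max_bytes still gets a chunk of its own.
    """
    chunks: List[Chunk] = []
    current: List[str] = []
    size = 0
    for record in records:
        record_size = len(record.encode("utf-8"))
        if current and size + record_size > max_bytes:
            chunks.append(Chunk(header, current))
            current, size = [], 0
        current.append(record)
        size += record_size
    if current:
        chunks.append(Chunk(header, current))
    return chunks
```

Because each chunk holds complete records plus the header, a storage node could in principle parse and analyse its chunk without fetching neighbouring pieces. Video data would need a format-specific notion of a natural break, which this sketch does not attempt.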
In addition to the ordinary reading and writing of data, storage nodes face a variety of system loads to safely maintain data, including automatic recovery processing after an error, data redistribution processing after more storage capacity is added and disk checking processing as part of preventive maintenance. The technology models the types of system loads that occur in storage systems, predicting resources that will be needed in the near future. Based on this, the technology controls data processing resources and their allocation, so as not to reduce the performance of the system’s storage functionality. This enables high-speed data processing while still delivering stable operations for the original storage functionality.
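The article does not describe the load model itself, so the following is only a sketch of the general idea under stated assumptions: storage-side load is forecast from recent measurements with an exponentially weighted moving average (a stand-in technique, not Fujitsu's model), and data processing receives only what remains after the forecast and a safety reserve. The function names, the `alpha` smoothing factor and the `reserve` fraction are all hypothetical.

```python
from typing import Sequence


def predict_storage_load(samples: Sequence[float], alpha: float = 0.5) -> float:
    """Forecast near-future storage load from recent samples (oldest first).

    Uses an exponentially weighted moving average: newer samples count more.
    This is an illustrative stand-in for the load model in the article.
    """
    if not samples:
        raise ValueError("at least one load sample is required")
    estimate = samples[0]
    for sample in samples[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate


def processing_budget(total: float, predicted_storage: float, reserve: float = 0.1) -> float:
    """Resources (for example CPU cores) that data processing may use.

    Storage functionality gets its predicted need plus a reserve fraction
    of the total; data processing gets the remainder, never less than zero.
    """
    budget = total - predicted_storage - reserve * total
    return max(0.0, budget)
```

For example, on a node with 8 cores where recent storage load was 2 and then 4 cores, the forecast is 3 cores, and with a 25% reserve the processing budget is 3 cores. When recovery or redistribution work pushes the forecast up, the budget shrinks accordingly, which is the behaviour the paragraph above describes.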
Fujitsu Laboratories implemented the technology in Ceph, an open source distributed storage software solution, and evaluated its effects. Five storage nodes and five processing servers were connected with a 1 Gbps network, and data processing performance was measured when extracting objects such as people and cars from 50 GB of video data. With the conventional method, processing took 500 seconds to complete. With the newly developed technology, the data processing could be done on the storage nodes, without the need to bring the data together, and was completed in 50 seconds, 10 times faster than the conventional method.
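The figures above can be checked with a short back-of-the-envelope calculation. The speedup follows directly from the two measured times. The transfer estimate is an assumption-laden illustration only: it supposes decimal gigabytes and that all 50 GB cross a single 1 Gbps link, whereas the article does not state the network topology or how much traffic ran in parallel. Both function names are hypothetical.

```python
def transfer_seconds(data_gigabytes: float, link_gbps: float) -> float:
    """Time to move data over one link, ignoring protocol overhead.

    Assumes decimal units: 1 gigabyte = 8 gigabits.
    """
    return data_gigabytes * 8 / link_gbps


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """How many times faster the new method is than the baseline."""
    return baseline_seconds / new_seconds


single_link_estimate = transfer_seconds(50, 1)   # 400.0 seconds under the single-link assumption
measured_speedup = speedup(500, 50)              # 10.0, matching the reported figure
```

Under that single-link assumption, moving the data alone would account for a large share of the 500 seconds reported for the conventional method, which is consistent with the article's point that reading data across the network becomes the bottleneck.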
Fujitsu Laboratories will continue to verify the technology for commercial applications, and plans to develop it into a product within fiscal 2019.