BIOS IT Blog
Unlocking Data Analytics Excellence with Neuroblade: Insight from BIOS IT's Laurence Sawbridge
In this video, we're diving deep into the transformative technology of Neuroblade with none other than Laurence Sawbridge. Get ready to explore the future of data processing as Laurence unveils the secrets behind Neuroblade's groundbreaking solutions. If you're passionate about data analytics and innovation, you're in for a treat. Let's jump in and discover how Neuroblade is reshaping the landscape of data analytics.
If you would like any more information on Neuroblade's offering, or are interested in testing its capabilities, please don't hesitate to contact us at [email protected]
Video transcript below:
Can you explain the concept of separating storage and compute in modern data analytics, and what benefits does it offer to organizations?
Modern data is big data, and there are very few compute configurations that have the space to store big data. That’s when you need to separate your data from your compute by using a storage array with a separate database engine to do the processing of queries.
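As a minimal sketch of that separation, the snippet below uses SQLite purely as a stand-in: the data lives in its own file (the storage layer), and a separate connection acts as the database engine that processes the query (the compute layer). The table and column names are invented for illustration.

```python
import os
import sqlite3
import tempfile

# Storage layer: the data lives in its own file, independent of any engine.
db_path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
store = sqlite3.connect(db_path)
store.execute("CREATE TABLE sales (region TEXT, amount REAL)")
store.executemany("INSERT INTO sales VALUES (?, ?)",
                  [("eu", 120.0), ("us", 340.0), ("eu", 80.0)])
store.commit()
store.close()

# Compute layer: a separate engine attaches to the same storage
# and runs the query; the data never had to live on this "node".
engine = sqlite3.connect(db_path)
rows = engine.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
engine.close()

print(rows)  # [('eu', 200.0), ('us', 340.0)]
```

In a real deployment the storage layer would be a storage array or object store rather than a local file, but the division of labour is the same.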
How does Neuroblade's special-purpose compute architecture address the performance inefficiencies introduced by the separation of storage and compute in data analytics systems?
The Neuroblade SPU has been designed from the ground up to process data in a fast and efficient manner. Separating storage from compute solves the problem of space, yet as with most engineering issues, fixing one problem often causes another, and here what we're talking about is latency. One of my favourite sayings this year is that data is cheap to generate, but expensive to move. The expense in this instance is the time it takes to read large amounts of data and present it to a user or program in a reasonable time frame.
Could you describe the key tiers involved in modern data analytics solutions, and why each of them is important for efficient data processing?
The first layer is the storage layer, which encompasses databases, cloud storage, and local storage, among others. Fast access to this data is always beneficial, but these benefits are magnified when dealing with big data queries.
Then there's the compute layer. This is where the magic happens. Data queries are executed; data is manipulated, sub-queried, and processed before being handed over to the application or user.
Then there's the visualisation layer. Once data has been retrieved, visualisations are made to help display it to the user in a human-readable format, or it is passed to a reporting service or interactive dashboard.
Then there’s the new tier on the block, which is machine learning. Given the high cost of training a new model, you want to ensure that the data you’re feeding into it has been properly sanitised, verified, and complete. This is where query engines can help by ensuring the right data is passed to the trainer, and no erroneous information is used which may skew the accuracy of the AI model.
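To make the sanitisation idea concrete, here is a small sketch of the kind of filtering a query engine can apply before data reaches a trainer. The records, field names, and plausibility bounds are all invented assumptions for illustration.

```python
# Hypothetical raw records pulled by a query engine; some are
# incomplete or clearly erroneous and would skew a model.
raw = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},   # incomplete: missing value
    {"age": 29, "income": -5},        # erroneous: impossible value
    {"age": 41, "income": 75000},
]

def sanitise(records):
    """Keep only complete rows with plausible values."""
    clean = []
    for r in records:
        if r["age"] is None or r["income"] is None:
            continue  # drop incomplete rows
        if not (0 < r["age"] < 120) or r["income"] < 0:
            continue  # drop implausible rows
        clean.append(r)
    return clean

training_set = sanitise(raw)
print(len(training_set))  # 2
```

In practice these checks would be expressed as predicates in the query itself, so the bad rows are filtered out close to the data rather than in application code.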
What challenges does the rapid growth of data in terms of volume, variety, and velocity pose to traditional analytics approaches, and how does Neuroblade address these challenges?
These days data is ingested from multiple sources and usually in many different data formats. Gone are the days when you’re just querying a single database or data source. Nowadays you might need to query data from a flat file, an SQL database, a time series database, and more. The Neuroblade SPU has been designed from the ground up to speed up data access and overcome the latency introduced by gathering the data from these disparate sources.
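The snippet below sketches that kind of federated access in miniature: one source is a flat file (CSV) and the other is an SQL database, and the results are joined after both have been gathered. The data and schema are invented; a real engine would push this work down to each source where possible.

```python
import csv
import io
import sqlite3

# Source 1: a flat file (CSV) of orders.
csv_text = "order_id,customer\n1,acme\n2,globex\n"
orders = list(csv.DictReader(io.StringIO(csv_text)))

# Source 2: an SQL database of payments.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (order_id INTEGER, amount REAL)")
db.executemany("INSERT INTO payments VALUES (?, ?)", [(1, 9.5), (2, 4.0)])

# The 'federated' join happens only after both sources are gathered.
amounts = dict(db.execute("SELECT order_id, amount FROM payments"))
joined = [(o["customer"], amounts[int(o["order_id"])]) for o in orders]
db.close()

print(joined)  # [('acme', 9.5), ('globex', 4.0)]
```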
So it's common knowledge that general-purpose processors like CPUs struggle to keep up with the speed of I/O. Can you elaborate on how Neuroblade's SPU bridges this gap?
Traditional CPUs are great at performing a multitude of tasks, hence the moniker 'general purpose'. All the new processing units that have been released in the last few years have been introduced to solve specific problems that CPUs aren't great at dealing with, and the Neuroblade SPU is no different. By being specialised towards a particular task, the SPU can perform this task in a much more efficient way, with regard to both power and processing time.
Can you explain the typical query flow in a data analytics system, and how the compute layer interacts with the storage layer to retrieve and process data?
Step one: Query submission
The first step in the process is query submission, where the user or program sends a query to the processing engine. The query specifies the data manipulation or computations to be performed on the dataset.
Step two: Query optimisation
Next, the compute layer optimises the query by considering factors such as query complexity, available resources, and data distribution, and determines the most efficient way to process the query by considering techniques such as query rewriting and join reordering.
Step three: Data retrieval
The compute layer interacts with the storage layer to retrieve the relevant data needed for the query execution. This may involve collecting data from multiple disparate sources.
Step four: Parallel processing
The compute layer then processes the retrieved data in parallel, so multiple processes are split off from the main process to compute significant parts of the query result set. These processes can be run from multiple CPU cores or even separate nodes.
Step five: Data transformation and aggregation
The compute layer applies transformations, aggregations, filters, and other operations specified in the query. It processes the data according to the query logic, performing tasks such as data joins, groupings, sorting, and calculations.
Step six: Result assembly
As the compute layer completes the query execution, it gathers the intermediate results generated by different compute nodes. It combines and merges these intermediate results to form the result set.
Step seven: Result delivery
The compute layer then presents the results of the query back to the user or program, or stores the result in a dedicated location for future retrieval.
Throughout this query flow, the compute layer interacts closely with the storage layer, leveraging parallel processing, data transformations, and optimisations to efficiently execute the query. The result is obtained through the coordination of various computational and storage resources, providing users with the desired data.
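The parallel-processing and result-assembly steps above (steps four through seven) can be sketched with a simple grouped aggregation. The partitions and the SUM-by-region query are invented for illustration; each worker computes a partial aggregate over its partition, and the intermediate results are then merged into the final result set.

```python
from concurrent.futures import ThreadPoolExecutor

# Storage layer: two "partitions" of the dataset, as if held on separate nodes.
partitions = [
    [("eu", 10), ("us", 20), ("eu", 5)],
    [("us", 7), ("eu", 3)],
]

# Steps four and five: each worker processes its partition in parallel,
# computing a partial aggregate (SUM grouped by region).
def partial_sum(partition):
    acc = {}
    for region, amount in partition:
        acc[region] = acc.get(region, 0) + amount
    return acc

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_sum, partitions))

# Step six: result assembly - merge the intermediate results.
result = {}
for p in partials:
    for region, total in p.items():
        result[region] = result.get(region, 0) + total

# Step seven: result delivery.
print(sorted(result.items()))  # [('eu', 18), ('us', 27)]
```

A real engine would distribute these workers across CPU cores or nodes, but the partial-aggregate-then-merge shape is the same.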
What are the two major architectural bottlenecks that hinder efficient query execution, and how does Neuroblade tackle them?
The two main bottlenecks are computation and memory access, and data retrieval.
Computation and memory access are bottlenecks that don't just affect big data queries, but many other areas of compute as well. Many accelerators have been developed to solve these problems in their respective areas, such as GPUs for graphics and AI processing, DPUs for cloud computing and high-speed networking, and now the SPU for large-scale data processing.
Data retrieval from disparate storage engines and locations has one large issue: you can't run a query until all the data has been gathered, so if one of your sources is slow, it slows down the overall execution time of the query.
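A quick back-of-the-envelope illustration of why the slowest source gates the query. The per-source latencies are invented numbers: even if all fetches overlap perfectly, the query still can't start until the last one finishes.

```python
# Hypothetical per-source fetch latencies, in milliseconds.
latencies = {"object_store": 400, "sql_db": 100, "slow_archive": 2500}

# Sequential gathering: total wait is the sum of all fetches.
sequential_wait = sum(latencies.values())

# Concurrent gathering: fetches overlap, but the query still cannot
# start until the last source finishes - the slowest one gates it.
concurrent_wait = max(latencies.values())

print(sequential_wait)  # 3000
print(concurrent_wait)  # 2500
```

So concurrency helps, but the only way to remove the remaining 2.5-second floor is to make the slow source itself faster, which is where accelerated data access comes in.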
Could you provide examples of software integrations, such as Presto with Velox or Apache Spark with Gluten plugin, and explain how they benefit from Neuroblade's hardware acceleration?
Velox, as a plugin for PrestoDB, enables it to utilise accelerators for distributed processing of large-scale queries. Velox provides a seamless interface that allows Presto to utilise the full power offered by the Neuroblade SPU.
Similarly, the Gluten plugin for Apache Spark enables it to offload computationally intensive tasks to the Neuroblade SPU, speeding up processing, delivering faster time to results, and ensuring a faster ROI on your investment in big data hardware.
What are some real-world use cases where organizations can benefit from Neuroblade's accelerated data analytics capabilities, such as in genomics research or A/B testing?
An organisation that employs data warehouses, data lakes, or data oceans will already know the pain of querying these structures efficiently to gain insights. Current workarounds involve simplifying queries, which means you're not taking full advantage of the data you have, or indexing more of your data, which can be hard to do and incurs significant performance and storage costs with a large dataset.
Neuroblade offers different deployment options for its solution. Can you describe these deployment options and the advantages of each in various scenarios?
There are three ways of deploying the Neuroblade SPU. Simply put, it is packaged on a standard PCIe card which can be used in three ways:
- In the Neuroblade appliance, which contains eight SPUs, storage, and compute, plus a NeuroLink fabric that allows the SPUs to coordinate data with low latency and high bandwidth
- Installed as a PCIe card in a compute node
- Installed as a PCIe card in a storage server