How to Configure High-Performance BLAS/LAPACK for Breeze on Amazon EMR, EC2

In today’s data-driven world, the speed of computation can be a game-changer. If you’re a data scientist or software engineer working with large datasets, you may already be familiar with Breeze, a powerful library for numerical processing in Scala. Breeze leverages lower-level libraries such as BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) to achieve high-performance computation.

But how can we configure these libraries for optimal performance on Amazon Elastic MapReduce (EMR) or Elastic Compute Cloud (EC2)? In this blog post, we’ll guide you through the steps to achieve this.

Step 1: Install BLAS/LAPACK

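First, let’s install optimized versions of BLAS and LAPACK. Intel’s MKL (Math Kernel Library) is a good choice on Intel-based instances, as it is tuned for those CPUs. The commands below assume a Debian- or Ubuntu-based EC2 instance whose repositories provide the libmkl-dev package. EMR nodes run Amazon Linux, which uses yum rather than apt-get, so there you would install MKL from Intel’s own repository instead, typically in a bootstrap action so that every node gets it.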

sudo apt-get update
sudo apt-get install -y libmkl-dev
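
After installation, it can help to confirm that the MKL shared library is actually on the machine before wiring it into the JVM. The sketch below is a minimal check in Scala. It assumes the single dynamic library is named libmkl_rt.so and lives in one of a few typical directories; the directory list and the object name MklLocator are illustrative assumptions, so adjust them for your image.

```scala
import java.nio.file.{Files, Path, Paths}

object MklLocator {
  // Assumed search locations; these vary by distribution and installer.
  val defaultDirs: Seq[String] = Seq(
    "/usr/lib/x86_64-linux-gnu",
    "/usr/lib64",
    "/opt/intel/mkl/lib/intel64"
  )

  // Returns the first existing path of the named library among the given directories.
  def find(libName: String = "libmkl_rt.so", dirs: Seq[String] = defaultDirs): Option[Path] =
    dirs.iterator
      .map(d => Paths.get(d, libName))
      .find(p => Files.exists(p))
}
```

Calling MklLocator.find() returns Some(path) when the library is found and None otherwise, which is a quick signal that the install step needs another look.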

Step 2: Install Breeze

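Next, add Breeze to your project: in build.sbt if you build with sbt, or in pom.xml if you build with Maven. In sbt, %% appends your Scala binary version to the artifact name; in Maven you write the suffix yourself (breeze_2.12 below targets Scala 2.12). Check Maven Central for the releases available for your Scala version before copying the version number.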

// For sbt
libraryDependencies += "org.scalanlp" %% "breeze" % "2.0.2"

<!-- For Maven -->
<dependency>
  <groupId>org.scalanlp</groupId>
  <artifactId>breeze_2.12</artifactId>
  <version>2.0.2</version>
</dependency>

Step 3: Configure Breeze to Use MKL

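By default, Breeze delegates its BLAS/LAPACK calls to netlib-java. netlib-java tries to load a native system BLAS/LAPACK first, then a bundled native reference implementation, and finally falls back to a pure-Java translation (F2J), which is much slower than MKL on large matrices. On Breeze releases built on netlib-java, the native wrappers are packaged separately in the breeze-natives artifact, so add it alongside breeze if only the pure-Java fallback ever loads. The com.github.fommil.netlib class names used here come from netlib-java; if your Breeze release pulls in a different netlib binding, the names will differ. To see which implementation was loaded, run the following: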

import com.github.fommil.netlib.BLAS

// Print the class name of the BLAS implementation that netlib-java loaded
println(BLAS.getInstance().getClass.getName)

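The class name of the active BLAS instance tells you which implementation netlib-java loaded. com.github.fommil.netlib.F2jBLAS means the pure-Java fallback is in use and no native library was loaded. com.github.fommil.netlib.NativeSystemBLAS means netlib-java loaded the system BLAS, which is MKL only if the system’s libblas.so.3 and liblapack.so.3 resolve to MKL (for example, through update-alternatives on Debian/Ubuntu or a symbolic link to libmkl_rt.so). To request the system implementation explicitly, set the com.github.fommil.netlib.BLAS property to the implementation class name before the first BLAS call:

System.setProperty("com.github.fommil.netlib.BLAS", "com.github.fommil.netlib.NativeSystemBLAS")

The property value is a class name, not a file path. LAPACK has the equivalent com.github.fommil.netlib.LAPACK property, with com.github.fommil.netlib.NativeSystemLAPACK as the system implementation. The location of MKL itself is resolved by the dynamic linker, so make sure libblas.so.3 and liblapack.so.3 point at your MKL installation.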
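
The check above can be wrapped in a small helper that also explains what each result means. This is a minimal sketch, assuming netlib-java and its three implementation classes (NativeSystemBLAS, NativeRefBLAS, F2jBLAS); the object name BlasBackendCheck and the wording of the descriptions are illustrative. The lookup is done reflectively so that the snippet still compiles and runs when netlib-java is missing from the classpath.

```scala
import scala.util.Try

object BlasBackendCheck {
  // Implementation class names that netlib-java can return from BLAS.getInstance().
  private val SystemNative = "com.github.fommil.netlib.NativeSystemBLAS"
  private val RefNative    = "com.github.fommil.netlib.NativeRefBLAS"
  private val PureJava     = "com.github.fommil.netlib.F2jBLAS"

  // Looks up the active implementation reflectively; returns None when
  // netlib-java is not on the classpath or the lookup fails.
  def activeImplementation(): Option[String] =
    Try {
      val blas = Class.forName("com.github.fommil.netlib.BLAS")
      blas.getMethod("getInstance").invoke(null).getClass.getName
    }.toOption

  // Maps an implementation class name to a short human-readable description.
  def describe(implName: String): String = implName match {
    case SystemNative => "system BLAS via JNI (MKL only if the system library resolves to it)"
    case RefNative    => "bundled reference BLAS via JNI (not MKL)"
    case PureJava     => "pure-JVM fallback (no native library loaded)"
    case other        => s"unrecognized implementation: $other"
  }

  def main(args: Array[String]): Unit =
    println(activeImplementation().map(describe).getOrElse("netlib-java not found on classpath"))
}
```

Note that even the NativeSystemBLAS result only shows that a system library was loaded. Whether that library is MKL depends on what libblas.so.3 resolves to on the machine.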

Step 4: Optimize for EC2/EMR

When running on Amazon EC2 or EMR, there are some additional optimizations you can make. For EC2, you can choose an instance type optimized for compute-intensive tasks, such as the C5 series.

For EMR, size the Spark executors so that together they use the cores and memory available on your nodes. Also consider EMRFS (the EMR File System), which lets the cluster read and write data directly in Amazon S3. Executor settings go in spark-defaults.conf (on EMR, through the spark-defaults configuration classification) or on the command line with --conf. The values below are small placeholders, not recommendations:

spark.executor.instances  2
spark.executor.memory     2g
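
A system property set in the driver does not reach the executor JVMs, where most Breeze calls in a Spark job run. The sketch below shows one way the BLAS selection could be passed to every JVM. It assumes netlib-java, which reads the com.github.fommil.netlib.BLAS and com.github.fommil.netlib.LAPACK properties, and it only builds a map of Spark settings from the standard spark.driver.extraJavaOptions, spark.executor.extraJavaOptions, and spark.executorEnv.* keys. The object name, the helper functions, and the default of one MKL thread per executor process are assumptions for illustration.

```scala
object BreezeSparkConf {
  // JVM flags that select netlib-java's system-native BLAS and LAPACK.
  val netlibJavaOptions: String = Seq(
    "-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS",
    "-Dcom.github.fommil.netlib.LAPACK=com.github.fommil.netlib.NativeSystemLAPACK"
  ).mkString(" ")

  // Builds Spark settings; executor count and memory are supplied by the caller.
  def settings(executorInstances: Int, executorMemory: String, mklThreads: Int = 1): Map[String, String] = {
    require(executorInstances > 0, "executorInstances must be positive")
    require(mklThreads > 0, "mklThreads must be positive")
    Map(
      "spark.executor.instances" -> executorInstances.toString,
      "spark.executor.memory" -> executorMemory,
      "spark.driver.extraJavaOptions" -> netlibJavaOptions,
      "spark.executor.extraJavaOptions" -> netlibJavaOptions,
      // Assumption: cap MKL's own threads so they do not compete with Spark tasks.
      "spark.executorEnv.MKL_NUM_THREADS" -> mklThreads.toString
    )
  }

  // Renders the settings as spark-submit arguments, sorted by key.
  def toSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

For example, BreezeSparkConf.toSubmitArgs(BreezeSparkConf.settings(2, "2g")) yields a list of --conf key=value pairs that can be appended to a spark-submit command. Whether limiting MKL to one thread is the right trade-off depends on how many tasks each executor runs at once, so treat it as a starting point to measure against.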

Remember, the key to high-performance computation isn’t just in the software, but also in the infrastructure. By carefully choosing and configuring your BLAS/LAPACK libraries and your EC2/EMR settings, you can greatly speed up your numerical computations with Breeze.
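
Whether a given configuration actually helps is best settled by measurement. The sketch below is a small, library-independent timing helper in Scala. The names Timing and medianMillis are hypothetical, and the warm-up and run counts are arbitrary defaults.

```scala
object Timing {
  // Runs `work` a few times to warm up the JIT, then returns the median
  // wall-clock time of `runs` timed executions, in milliseconds.
  def medianMillis(warmups: Int = 3, runs: Int = 5)(work: => Any): Double = {
    require(runs > 0, "runs must be positive")
    (1 to warmups).foreach(_ => work)
    val samples = (1 to runs).map { _ =>
      val start = System.nanoTime()
      work
      (System.nanoTime() - start) / 1e6
    }.sorted
    samples(samples.length / 2)
  }
}
```

With Breeze on the classpath, you could wrap a large matrix multiplication, for instance Timing.medianMillis() { a * b } for two DenseMatrix[Double] values a and b, and compare the number before and after switching the BLAS backend or instance type.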

Conclusion

Configuring an optimized BLAS/LAPACK for Breeze on Amazon EMR or EC2 can substantially speed up workloads that are heavy on linear algebra. The steps above cover installing MKL, adding Breeze to your build, checking which BLAS implementation is loaded, and sizing your instances and executors.

As always, it’s important to iterate and experiment with different configurations to find what works best for your specific use case. Happy computing!


