This post was originally published on the Quansight Labs blog and has been modified and republished here with Quansight's permission.
Web applications are rapidly emerging as a new frontier for high-performance scientific computation and AI-enabled end-user experiences. Underpinning the ML/AI revolution is linear algebra, a branch of mathematics concerning linear equations and their representations in vector spaces and via matrices. LAPACK ("Linear Algebra Package") is a fundamental software library for numerical linear algebra, providing robust, battle-tested implementations of common matrix operations. Despite LAPACK being a foundational component of most numerical computing programming languages and libraries, a comprehensive, high-quality LAPACK implementation tailored to the unique constraints of the web has yet to materialize. That is...until now.
Earlier this year, I had the great fortune of being a summer intern at Quansight Labs, the public benefit division of Quansight and a leader in the scientific Python ecosystem. During my internship, I worked to add initial LAPACK support to stdlib, a fundamental library for scientific computation written in C and JavaScript and optimized for use in web browsers and other web-native environments, such as Node.js and Deno. In this blog post, I'll discuss my journey, some expected and unexpected (!) challenges, and the road ahead. My hope is that this work, with a little bit of luck, provides a critical building block in making web browsers a first-class environment for numerical computation and machine learning and portends a future of more powerful AI-enabled web applications.
Sound interesting? Let's go!
Readers of this blog who are familiar with LAPACK are likely not intimately familiar with the wild world of web technologies. For those coming from the world of numerical and scientific computation who are familiar with the scientific Python ecosystem, the easiest way to think of stdlib is as an open source scientific computing library in the mold of NumPy and SciPy. It provides multi-dimensional array data structures and associated routines for mathematics, statistics, and linear algebra, but uses JavaScript, rather than Python, as its primary scripting language. As such, stdlib is laser-focused on the web ecosystem and its application development paradigms. This focus necessitates some interesting design and project architecture decisions, which make stdlib rather unique when compared to more traditional libraries designed for numerical computation.
To take NumPy as an example, NumPy is a single monolithic library, where all of its components, outside of optional third-party dependencies such as OpenBLAS, form a single, indivisible unit. One cannot simply install NumPy routines for array manipulation without installing all of NumPy. If you are deploying an application which only needs NumPy's ndarray object and a couple of its manipulation routines, installing and bundling all of NumPy means including a considerable amount of "dead code". In web development parlance, we'd say that NumPy is not "tree shakeable". For a normal NumPy installation, this implies at least 30MB of disk space, and at least 15MB of disk space for a customized build which excludes all debug statements. For SciPy, those numbers can balloon to 130MB and 50MB, respectively. Needless to say, shipping a 15MB library in a web application for just a few functions is a non-starter, especially for developers needing to deploy web applications to devices with poor network connectivity or memory constraints.
Given the unique constraints of web application development, stdlib takes a bottom-up approach to its design, where every unit of functionality can be installed and consumed independently of unrelated and unused parts of the codebase. By embracing a decomposable software architecture and radical modularity, stdlib offers users the ability to install and use exactly what they need, with little-to-no excess code beyond a desired set of APIs and their explicit dependencies, thus ensuring smaller memory footprints, bundle sizes, and faster deployment.
As an example, suppose you are working with two stacks of matrices (i.e., two-dimensional slices of three-dimensional cubes), and you want to select every other slice and perform the common BLAS operation y = a*x + y, where x and y are ndarrays and a is a scalar constant. To do this with NumPy, you'd first install all of NumPy
pip install numpy
and then perform the various operations
# Import all of NumPy:
import numpy as np

# Define arrays:
x = np.asarray(...)
y = np.asarray(...)

# Perform operation:
y[::2,:,:] += 5.0 * x[::2,:,:]
With stdlib, in addition to having the ability to install the project as a monolithic library, you can install the various units of functionality as separate packages
npm install @stdlib/ndarray-fancy @stdlib/blas-daxpy
and then perform the various operations
// Individually import desired functionality:
import FancyArray from '@stdlib/ndarray-fancy';
import daxpy from '@stdlib/blas-daxpy';

// Define ndarray meta data:
const shape = [4, 4, 4];
const strides = [...];
const offset = 0;

// Define arrays using a "lower-level" fancy array constructor:
const x = new FancyArray('float64', [...], shape, strides, offset, 'row-major');
const y = new FancyArray('float64', [...], shape, strides, offset, 'row-major');

// Perform operation:
daxpy(5.0, x['::2,:,:'], y['::2,:,:']);
Importantly, not only can you independently install any one of stdlib's over 4,000 packages, but you can also fix, improve, and remix any one of those packages by forking an associated GitHub repository (e.g., see @stdlib/ndarray-fancy). By defining explicit layers of abstraction and dependency trees, stdlib offers you the freedom to choose the right layer of abstraction for your application. In some ways, it's a simple—and, if you're accustomed to conventional scientific software library design, perhaps unorthodox—idea, but, when tightly integrated with the web platform, it has powerful consequences and creates exciting new possibilities!
Okay, so maybe your interest has been piqued; stdlib seems intriguing. But what does this have to do with LAPACK in web browsers? Well, one of our goals this past summer was to apply the stdlib ethos—small, narrowly scoped packages which do one thing and do one thing well—in bringing LAPACK to the web.
But wait, you say! That is an extreme undertaking. LAPACK is vast, with approximately 1,700 routines, and implementing even 10% of them within a reasonable time frame is a significant challenge. Wouldn't it be better to just compile LAPACK to WebAssembly, a portable compilation target for programming languages such as C, Go, and Rust, which enables deployment on the web, and call it a day?
Unfortunately, there are several issues with this approach, including larger binaries and bundle sizes, more complex build toolchains, and the overhead of copying data into and out of WebAssembly module memory whenever that data lives outside of it.
To help illustrate the last point, the cost of data movement, let's return to the BLAS routine daxpy, which performs the operation y = a*x + y, where x and y are strided vectors and a is a scalar constant. If implemented in C, a basic implementation might look like the following code snippet.
void c_daxpy(const int N, const double alpha, const double *X, const int strideX, double *Y, const int strideY) {
    int ix;
    int iy;
    int i;
    if (N <= 0) {
        return;
    }
    if (alpha == 0.0) {
        return;
    }
    if (strideX < 0) {
        ix = (1-N) * strideX;
    } else {
        ix = 0;
    }
    if (strideY < 0) {
        iy = (1-N) * strideY;
    } else {
        iy = 0;
    }
    for (i = 0; i < N; i++) {
        Y[iy] += alpha * X[ix];
        ix += strideX;
        iy += strideY;
    }
    return;
}
After compilation to WebAssembly and loading the WebAssembly binary into our web application, we need to perform a series of steps before we can call the c_daxpy routine from JavaScript. First, we need to instantiate a new WebAssembly module.
const binary = new Uint8Array([...]);
const mod = new WebAssembly.Module(binary);
Next, we need to define module memory and create a new WebAssembly module instance.
// Initialize 10 pages of memory and allow growth to 100 pages:
const mem = new WebAssembly.Memory({
    'initial': 10, // 640KiB, where each page is 64KiB
    'maximum': 100 // 6.4MiB
});

// Create a new module instance:
const instance = new WebAssembly.Instance(mod, {
    'env': {
        'memory': mem
    }
});
After creating a module instance, we can now invoke the exported BLAS routine. However, if data is defined outside of module memory, we first need to copy that data to the memory instance and always do so in little-endian byte order.
// External data:
const xdata = new Float64Array([...]);
const ydata = new Float64Array([...]);

// Specify a vector length:
const N = 5;

// Specify vector strides (in units of elements):
const strideX = 2;
const strideY = 4;

// Define pointers (i.e., byte offsets) for storing two vectors:
const xptr = 0;
const yptr = N * 8; // 8 bytes per double

// Create a DataView over module memory:
const view = new DataView(mem.buffer);

// Resolve the first indexed elements in both `xdata` and `ydata`:
let offsetX = 0;
if (strideX < 0) {
    offsetX = (1-N) * strideX;
}
let offsetY = 0;
if (strideY < 0) {
    offsetY = (1-N) * strideY;
}

// Write data to the memory instance:
for (let i = 0; i < N; i++) {
    view.setFloat64(xptr+(i*8), xdata[offsetX+(i*strideX)], true);
    view.setFloat64(yptr+(i*8), ydata[offsetY+(i*strideY)], true);
}
Now that data is written to module memory, we can call the c_daxpy routine.
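Assuming the compiled module exports the routine under the name c_daxpy and reusing the pointers defined above, the call might look like the following sketch.

instance.exports.c_daxpy(N, 5.0, xptr, 1, yptr, 1);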
And, finally, if we need to pass the results to a downstream library without support for WebAssembly memory "pointers" (i.e., byte offsets), such as D3, for visualization or further analysis, we need to copy data from module memory back to the original output array.
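A sketch of that copy-back step, reusing the DataView and index offsets from above:

// Copy results from module memory back to the original output array:
for (let i = 0; i < N; i++) {
    ydata[offsetY+(i*strideY)] = view.getFloat64(yptr+(i*8), true);
}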
That's a lot of work just to compute y = a*x + y. In contrast, compare this to a plain JavaScript implementation, which might look like the following code snippet.
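A sketch mirroring the C implementation above (stdlib's actual implementation is similar in spirit, with additional handling):

function daxpy(N, alpha, x, strideX, y, strideY) {
    let ix;
    let iy;
    let i;
    if (N <= 0 || alpha === 0.0) {
        return y;
    }
    ix = (strideX < 0) ? (1-N) * strideX : 0;
    iy = (strideY < 0) ? (1-N) * strideY : 0;
    for (i = 0; i < N; i++) {
        y[iy] += alpha * x[ix];
        ix += strideX;
        iy += strideY;
    }
    return y;
}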
With the JavaScript implementation, we can then directly call daxpy with our externally defined data without the data movement required in the WebAssembly example above.
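For example, reusing the externally defined data and strides from the WebAssembly example above:

daxpy(N, 5.0, xdata, strideX, ydata, strideY);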
At least in this case, not only is the WebAssembly approach less ergonomic, but, as might be expected given the required data movement, there's a negative performance impact, as well, as demonstrated in the following figure.
Figure 1: Performance comparison of stdlib's C, JavaScript, and WebAssembly (Wasm) implementations for the BLAS routine daxpy for increasing array lengths (x-axis). In the Wasm (copy) benchmark, input and output data is copied to and from Wasm memory, leading to poorer performance.
In the figure above, I'm displaying a performance comparison of stdlib's C, JavaScript, and WebAssembly (Wasm) implementations for the BLAS routine daxpy for increasing array lengths, as enumerated along the x-axis. The y-axis shows a normalized rate relative to a baseline C implementation. In the Wasm benchmark, input and output data is allocated and manipulated directly in WebAssembly module memory, and, in the Wasm (copy) benchmark, input and output data is copied to and from WebAssembly module memory, as discussed above. From the chart, we may observe that, while operating directly on module memory can yield performance gains, those gains are eroded, and can even be reversed, once data must be copied to and from module memory.
Overall, WebAssembly can offer performance improvements; however, the technology is not a silver bullet and needs to be used carefully in order to realize desired gains. And even when offering superior performance, such gains must be balanced against the costs of increased complexity, potentially larger bundle sizes, and more complex toolchains. For many applications, a plain JavaScript implementation will do just fine.
Now that I've prosecuted the case against just compiling the entirety of LAPACK to WebAssembly and calling it a day, where does that leave us? Well, if we're going to embrace the stdlib ethos, it leaves us in need of radical modularity.
To embrace radical modularity is to recognize that what is best is highly contextual, and, depending on the needs and constraints of user applications, developers need the flexibility to pick the right abstraction. If a developer is writing a Node.js application, that may mean binding to hardware-optimized libraries, such as OpenBLAS, Intel MKL, or Apple Accelerate in order to achieve superior performance. If a developer is deploying a web application needing a small set of numerical routines, JavaScript is likely the right tool for the job. And if a developer is working on a large, resource intensive WebAssembly application (e.g., for image editing or a gaming engine), then being able to easily compile individual routines as part of the larger application will be paramount. In short, we want a radically modular LAPACK.
My mission was to lay the groundwork for such an endeavor, to work out the kinks and find the gaps, and to hopefully get us a few steps closer to high-performance linear algebra on the web. But what does radical modularity look like? It all begins with the fundamental unit of functionality, the package.
Every package in stdlib is its own standalone thing, containing co-localized tests, benchmarks, examples, documentation, build files, and associated meta data (including the enumeration of any dependencies) and defining a clear API surface with the outside world. In order to add LAPACK support to stdlib, that means creating a separate standalone package for each LAPACK routine with the following structure:
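The exact contents vary from package to package, but the layout looks roughly like the following sketch (shown here for a hypothetical routine).

lapack/base/<routine>/
├── benchmark/      # micro-benchmarks
├── docs/           # REPL help text and TypeScript declarations
├── examples/       # runnable examples
├── lib/            # JavaScript implementation(s)
├── src/            # C and/or Fortran implementations (when present)
├── test/           # unit tests
├── manifest.json   # native compilation meta data (when present)
├── package.json    # package meta data, including dependencies
└── README.md       # documentation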
Briefly, each entry in the structure above serves one of those roles: implementation code lives alongside its tests, benchmarks, examples, documentation, and build and package meta data.
Given stdlib's demanding documentation and testing requirements, adding support for each routine is a decent amount of work, but the end result is robust, high-quality, and, most importantly, modular code suitable for serving as the foundation for scientific computation on the modern web. But enough with the preliminaries! Let's get down to business!
Building on previous efforts which added BLAS support to stdlib, we decided to follow a similar multi-phase approach when adding LAPACK support in which we first prioritize JavaScript implementations and their associated testing and documentation and then, once tests and documentation are present, back fill C and Fortran implementations and any associated native bindings to hardware-optimized libraries. This approach allows us to put some early points on the board, so to speak, quickly getting APIs in front of users, establishing robust test procedures and benchmarks, and investigating potential avenues for tooling and automation before diving into the weeds of build toolchains and performance optimizations. But where to even begin?
To determine which LAPACK routines to target first, I parsed LAPACK's Fortran source code to generate a call graph. This allowed me to infer the dependency tree for each LAPACK routine. With the graph in hand, I then performed a topological sort, thus helping me identify routines which have no dependencies and which will inevitably serve as building blocks for other routines. While a depth-first approach in which I picked a particular high-level routine and worked backward would enable me to land a specific feature, such an approach might cause me to get bogged down trying to implement routines of increasing complexity. By focusing on the "leaves" of the graph, I could prioritize commonly used routines (i.e., routines with high indegrees) and thus maximize my impact by unlocking the ability to deliver multiple higher-level routines either later in my efforts or by other contributors.
With my plan in hand, I was excited to get to work. For my first routine, I chose dlaswp, which performs a series of row interchanges on a general rectangular matrix according to a provided list of pivot indices and which is a key building block for LAPACK's LU decomposition routines. And that is when my challenges began...
Prior to my Quansight Labs internship, I was (and still am!) a regular contributor to LFortran, a modern interactive Fortran compiler built on top of LLVM, and I was feeling fairly confident in my Fortran skills. However, one of my first challenges was simply understanding what is now considered "legacy" Fortran code. I highlight three initial hurdles below.
LAPACK was originally written in FORTRAN 77 (F77). While the library was moved to Fortran 90 in version 3.2 (2008), legacy conventions still persist in the reference implementation. One of the most visible of those conventions is formatting.
Developers writing F77 programs did so using a fixed form layout inherited from punched cards. This layout had strict requirements concerning the use of character columns: a C or * in column 1 marked a comment line, columns 1-5 held statement labels, column 6 marked continuation lines, statements themselves were confined to columns 7-72, and anything beyond column 72 was ignored.
Fortran 90 introduced the free form layout which removed column and line length restrictions and settled on ! as the comment character. The following code snippet shows the reference implementation for the LAPACK routine dlacpy:
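The listing below is abridged, with the documentation comments omitted.

      SUBROUTINE DLACPY( UPLO, M, N, A, LDA, B, LDB )
*     .. Scalar Arguments ..
      CHARACTER          UPLO
      INTEGER            LDA, LDB, M, N
*     ..
*     .. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), B( LDB, * )
*     ..
*     .. Local Scalars ..
      INTEGER            I, J
*     ..
*     .. External Functions ..
      LOGICAL            LSAME
      EXTERNAL           LSAME
*     ..
*     .. Executable Statements ..
*
      IF( LSAME( UPLO, 'U' ) ) THEN
         DO 20 J = 1, N
            DO 10 I = 1, MIN( J, M )
               B( I, J ) = A( I, J )
   10       CONTINUE
   20    CONTINUE
      ELSE IF( LSAME( UPLO, 'L' ) ) THEN
         DO 40 J = 1, N
            DO 30 I = J, M
               B( I, J ) = A( I, J )
   30       CONTINUE
   40    CONTINUE
      ELSE
         DO 60 J = 1, N
            DO 50 I = 1, M
               B( I, J ) = A( I, J )
   50       CONTINUE
   60    CONTINUE
      END IF
      RETURN
      END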
The next code snippet shows the same routine, but implemented using the free form layout introduced in Fortran 90.
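One possible free form rendering of the same routine:

subroutine dlacpy( uplo, m, n, a, lda, b, ldb )
  implicit none
  ! Scalar arguments:
  character :: uplo
  integer :: lda, ldb, m, n
  ! Array arguments:
  double precision :: a( lda, * ), b( ldb, * )
  ! Local scalars:
  integer :: i, j
  ! External functions:
  logical, external :: lsame

  if ( lsame( uplo, 'U' ) ) then
    do j = 1, n
      do i = 1, min( j, m )
        b( i, j ) = a( i, j )
      end do
    end do
  else if ( lsame( uplo, 'L' ) ) then
    do j = 1, n
      do i = j, m
        b( i, j ) = a( i, j )
      end do
    end do
  else
    do j = 1, n
      do i = 1, m
        b( i, j ) = a( i, j )
      end do
    end do
  end if
  return
end subroutine dlacpy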
As may be observed, by removing column restrictions and moving away from the F77 convention of writing specifiers in ALL CAPS, modern Fortran code is more visibly consistent and thus more readable.
Another common practice in LAPACK routines is the use of labeled control structures. For example, consider the following code snippet in which the label 10 must match a corresponding CONTINUE.
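A minimal fixed form illustration:

      DO 10 I = 1, N
         Y( I ) = Y( I ) + ALPHA*X( I )
   10 CONTINUE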
Fortran 90 obviated the need for this practice and improved code readability by allowing one to use end do to end a do loop. This change is shown in the free form version of dlacpy provided above.
To allow flexibility in handling arrays of varying sizes, LAPACK routines commonly operate on arrays having an assumed-size. In the dlacpy routine above, the input matrix A is declared to be a two-dimensional array having an assumed-size according to the expression A(LDA, *). This expression declares that A has LDA number of rows and uses * as a placeholder to indicate that the size of the second dimension is determined by the calling program.
One consequence of using assumed-size arrays is that compilers are unable to perform bounds checking on the unspecified dimension. Thus, current best practice is to use explicit interfaces and assumed-shape arrays (e.g., A(LDA,:)) in order to prevent out-of-bounds memory access. This stated, the use of assumed-shape arrays can be problematic when needing to pass sub-matrices to other functions, as doing so requires slicing which often results in compilers creating internal copies of array data.
Needless to say, it took me a while to adjust to LAPACK conventions and adopt a LAPACK mindset. However, being something of a purist, if I was going to be porting over routines anyway, I at least wanted to bring those routines I did manage to port into a more modern age in hopes of improving code readability and future maintenance. So, after discussing things with stdlib maintainers, I settled on migrating routines to Fortran 95, which, while not the latest and greatest Fortran version, seemed to strike the right balance between maintaining the look-and-feel of the original implementations, ensuring (good enough) backward compatibility, and taking advantage of newer syntactical features.
One of the problems with pursuing a bottom-up approach to adding LAPACK support is that explicit unit tests for lower-level utility routines are often non-existent in LAPACK. LAPACK's test suite largely employs a hierarchical testing philosophy in which testing higher-level routines is assumed to ensure that their dependent lower-level routines are functioning correctly as part of an overall workflow. While one can argue that focusing on integration testing over unit testing for lower-level routines is reasonable, as adding tests for every routine could potentially increase the maintenance burden and complexity of LAPACK's testing framework, it means that we couldn't readily rely on prior art for unit testing and would have to come up with comprehensive standalone unit tests for each lower-level routine on our own.
Along a similar vein to test coverage, outside of LAPACK itself, finding real-world documented examples showcasing the use of lower-level routines was challenging. While LAPACK routines are consistently preceded by a documentation comment providing descriptions of input arguments and possible return values, without code examples, visualizing and grokking expected input and output values can be challenging, especially when dealing with specialized matrices. And while neither the absence of unit tests nor documented examples is the end of the world, it meant that adding LAPACK support to stdlib would be more of a slog than I expected. Writing benchmarks, tests, examples, and documentation was simply going to require more time and effort, potentially limiting the number of routines I could implement during the internship.
When storing matrix elements in linear memory, one has two choices: either store columns contiguously or rows contiguously (see Figure 2). The former memory layout is referred to as column-major order and the latter as row-major order.
Figure 2: Schematic demonstrating storing matrix elements in linear memory in either (a) column-major (Fortran-style) or (b) row-major (C-style) order. The choice of which layout to use is largely a matter of convention.
The choice of which layout to use is largely a matter of convention. For example, Fortran stores elements in column-major order, and C stores elements in row-major order. Higher-level libraries, such as NumPy and stdlib, support both column- and row-major orders, allowing you to configure the layout of a multi-dimensional array during array creation.
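For example, sticking with the FancyArray constructor introduced earlier, the same four elements can be arranged in either layout by adjusting the strides and the order argument (a sketch).

// 2x2 matrix stored in row-major order:
const x = new FancyArray('float64', new Float64Array([1.0, 2.0, 3.0, 4.0]), [2, 2], [2, 1], 0, 'row-major');

// The same matrix stored in column-major order:
const y = new FancyArray('float64', new Float64Array([1.0, 3.0, 2.0, 4.0]), [2, 2], [1, 2], 0, 'column-major');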
While neither memory layout is inherently better than the other, arranging data to ensure sequential access in accordance with the conventions of the underlying storage model is critical in ensuring optimal performance. Modern CPUs are able to process sequential data more efficiently than non-sequential data, which is primarily due to CPU caching which, in turn, exploits spatial locality of reference.
To demonstrate the performance impact of sequential vs non-sequential element access, consider the following function which copies all the elements from an MxN matrix A to another MxN matrix B and which does so assuming that matrix elements are stored in column-major order.
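A sketch of such a function, with the function name and argument order illustrative and the increment variables named to match the discussion below:

function copy(M, N, A, strideA1, strideA2, B, strideB1, strideB2) {
    // Increment applied when moving to the next row within a column (inner loop):
    const da0 = strideA1;
    const db0 = strideB1;

    // Adjustment applied after finishing a column in order to move to the top of the next column (outer loop):
    const da1 = strideA2 - (M*strideA1);
    const db1 = strideB2 - (M*strideB1);

    let ia = 0;
    let ib = 0;
    for (let j = 0; j < N; j++) {
        for (let i = 0; i < M; i++) {
            B[ib] = A[ia];
            ia += da0;
            ib += db0;
        }
        ia += da1;
        ib += db1;
    }
    return B;
}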
Let A and B both be 3x2 matrices (i.e., M is 3 and N is 2).
When both A and B are stored in column-major order, we can call the copy routine as follows:
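For 3x2 column-major matrices, the stride between rows is 1 and the stride between columns is 3, so a call might look like:

copy(3, 2, A, 1, 3, B, 1, 3);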
If, however, A and B are both stored in row-major order, the call signature changes to
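For 3x2 row-major matrices, the stride between rows is 2 and the stride between columns is 1:

copy(3, 2, A, 2, 1, B, 2, 1);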
Notice that, in the latter scenario, we fail to access elements in sequential order within the innermost loop, as da0 is 2 and da1 is -5 and similarly for db0 and db1. Instead, the array index "pointers" repeatedly skip ahead before returning to earlier elements in linear memory, with ia = {0, 2, 4, 1, 3, 5} and ib the same. In Figure 3, we show the performance impact of non-sequential access.
Figure 3: Performance comparison when providing square column-major versus row-major matrices to copy when copy assumes sequential element access according to column-major order. The x-axis enumerates increasing matrix sizes (i.e., number of elements). All rates are normalized relative to column-major results for a corresponding matrix size.
From the figure, we may observe that column- and row-major performance is roughly equivalent until we operate on square matrices having more than 1e5 elements (M = N = ~316). For 1e6 elements (M = N = ~1000), providing a row-major matrix to copy results in a greater than 25% performance decrease. For 1e7 elements (M = N = ~3160), we observe a greater than 85% performance decrease. The significant performance impact may be attributed to decreased locality of reference when operating on row-major matrices having large row sizes.
Given that it is written in Fortran, LAPACK assumes column-major access order and implements its algorithms accordingly. This presents issues for libraries, such as stdlib, which not only support row-major order, but make it their default memory layout. Were we to simply port LAPACK's Fortran implementations to JavaScript, users providing row-major matrices would experience adverse performance impacts stemming from non-sequential access.
To mitigate adverse performance impacts, we borrowed an idea from BLIS, a BLAS-like library which supports both row- and column-major memory layouts in its BLAS routines. When porting routines from Fortran to JavaScript and C, we decided to create modified LAPACK implementations which explicitly accommodate both column- and row-major memory layouts through separate stride parameters for each dimension. For some implementations, such as dlacpy, which is similar to the copy function defined above, incorporating separate and independent strides is straightforward, often involving stride tricks and loop interchange; for others, the modifications turned out to be much less straightforward due to specialized matrix handling, varying access patterns, and combinatorial parameterization.
LAPACK routines primarily operate on matrices stored in linear memory and whose elements are accessed according to specified dimensions and the stride of the leading (i.e., first) dimension. Dimensions specify the number of elements in each row and column, respectively. The stride specifies how many elements in linear memory must be skipped in order to access the next element of a row. LAPACK assumes that elements belonging to the same column are always contiguous (i.e., adjacent in linear memory). Figure 4 provides a visual representation of LAPACK conventions (specifically, schematics (a) and (b)).
Figure 4: Schematics illustrating the generalization of LAPACK strided array conventions to non-contiguous strided arrays. a) A 5-by-5 contiguous matrix stored in column-major order. b) A 3-by-3 non-contiguous sub-matrix stored in column-major order. Sub-matrices can be operated on in LAPACK by providing a pointer to the first indexed element and specifying the stride of the leading (i.e., first) dimension. In this case, the stride of the leading dimension is five, even though there are only three elements per column, due to the non-contiguity of sub-matrix elements in linear memory when stored as part of a larger matrix. In LAPACK, the stride of the trailing (i.e., second) dimension is always assumed to be unity. c) A 3-by-3 non-contiguous sub-matrix stored in column-major order having non-unit strides and generalizing LAPACK stride conventions to both leading and trailing dimensions. This generalization underpins stdlib's multi-dimensional arrays (also referred to as "ndarrays").
Libraries, such as NumPy and stdlib, generalize LAPACK's strided array conventions to support non-unit strides in both the leading and trailing dimensions, negative strides, and index offsets which specify the location of the first indexed element in linear memory.
Support for non-unit strides in the last dimension ensures support for O(1) creation of non-contiguous views of linear memory without requiring explicit data movement. These views are often called "slices". As an example, consider the following code snippet which creates such views using APIs provided by stdlib.
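A sketch using the FancyArray constructor from earlier:

import FancyArray from '@stdlib/ndarray-fancy';

// Create a 4x4 matrix view over linear memory (row-major order):
const buf = new Float64Array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]);
const x = new FancyArray('float64', buf, [4, 4], [4, 1], 0, 'row-major');

// Create a view containing every other row and every other column:
const y = x['::2,::2'];
// => a 2x2 view sharing the same underlying linear memory as `x`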
Without support for non-unit strides in the last dimension, returning a view from the expression x['::2,::2'] would not be possible, as one would need to copy selected elements to a new linear memory buffer in order to ensure contiguity.
Figure 5: Schematics illustrating the use of stride manipulation to create flipped and rotated views of matrix elements stored in linear memory. For all sub-schematics, strides are listed as [trailing_dimension, leading_dimension]. Implicit for each schematic is an "offset", which indicates the index of the first indexed element such that, for a matrix A, the element Aij is resolved according to i⋅strides[1] + j⋅strides[0] + offset. a) Given a 3-by-3 matrix stored in column-major order, one can manipulate the strides of the leading and trailing dimensions to create views in which matrix elements along one or more axes are accessed in reverse order. b) Using similar stride manipulation, one can create rotated views of matrix elements relative to their arrangement within linear memory.
Support for negative strides enables O(1) reversal and rotation of elements along one or more dimensions (see Figure 5). For example, to flip a matrix top-to-bottom and left-to-right, one need only negate the strides. Building on the previous code snippet, the following code snippet demonstrates reversing elements about one or more axes.
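Continuing with the FancyArray view from the previous snippet:

// Reverse the order of the rows:
const r1 = x['::-1,:'];

// Reverse the order of the columns:
const r2 = x[':,::-1'];

// Flip the matrix both top-to-bottom and left-to-right by negating both strides:
const r3 = x['::-1,::-1'];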
Implicit in the discussion of negative strides is the need for an "offset" parameter which indicates the index of the first indexed element in linear memory. For a strided multi-dimensional array A and a list of strides s, the index idx corresponding to element A(i, j, ..., n) can be resolved according to the equation

idx = offset + i⋅s_0 + j⋅s_1 + ... + n⋅s_(N-1)

where N is the number of array dimensions and s_k corresponds to the stride of the k-th dimension.
In BLAS and LAPACK routines supporting negative strides—something which is only supported when operating on strided vectors (e.g., see daxpy above)—the index offset is computed using logic similar to the following code snippet:
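In JavaScript, that logic might look like the following, mirroring the stride handling in the daxpy implementations above:

let offset = 0;
if (stride < 0) {
    offset = (1-M) * stride;
}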
where M is the number of vector elements. This implicitly assumes that a provided data pointer points to the beginning of linear memory for a vector. In languages supporting pointers, such as C, in order to operate on a different region of linear memory, one typically adjusts a pointer using pointer arithmetic prior to function invocation, which is relatively cheap and straightforward, at least for the one-dimensional case.
For example, returning to c_daxpy as defined above, we can use pointer arithmetic to limit element access to five elements within linear memory beginning at the eleventh and sixteenth elements (note: zero-based indexing) of an input and output array, respectively, as shown in the following code snippet.
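A sketch, with X and Y allocated large enough to hold the accessed elements and assuming the c_daxpy definition shown earlier:

#include <stdlib.h>

int main( void ) {
    double *X = (double *)calloc( 100, sizeof(double) );
    double *Y = (double *)calloc( 100, sizeof(double) );

    // Operate on five elements, starting at the eleventh element of `X` and
    // the sixteenth element of `Y` (zero-based indices 10 and 15):
    c_daxpy( 5, 5.0, X+10, 1, Y+15, 1 );

    free( X );
    free( Y );
    return 0;
}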
However, in JavaScript, which does not support explicit pointer arithmetic for binary buffers, one must explicitly instantiate new typed array objects having a desired byte offset. In the following code snippet, in order to achieve the same results as the C example above, we must resolve a typed array constructor, compute a new byte offset, compute a new typed array length, and create a new typed array instance.
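A sketch of the equivalent JavaScript, assuming x and y are Float64Array instances and using the daxpy implementation from earlier:

// Create new typed array views over the same underlying buffers, beginning at
// the byte offsets corresponding to the eleventh and sixteenth elements:
const xview = new Float64Array(x.buffer, x.byteOffset+(10*8), 5);
const yview = new Float64Array(y.buffer, y.byteOffset+(15*8), 5);

// Perform the operation on the new typed array instances:
daxpy(5, 5.0, xview, 1, yview, 1);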
For large array sizes, the cost of typed array instantiation is negligible compared to the time spent accessing and operating on individual array elements; however, for smaller array sizes, object instantiation can significantly impact performance.
Accordingly, in order to avoid adverse object instantiation performance impacts, stdlib decouples an ndarray's data buffer from the location of the buffer element corresponding to the beginning of an ndarray view. This allows the slice expressions x[2:,3:] and x[3:,1:] to return new ndarray views without needing to instantiate new buffer instances, as demonstrated in the following code snippet.
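Continuing with the FancyArray view from earlier:

// Both expressions return new ndarray views over the same underlying data
// buffer, differing only in their shape, strides, and offset meta data:
const s1 = x['2:,3:'];
const s2 = x['3:,1:'];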
As a consequence of decoupling a data buffer from the beginning of an ndarray view, we similarly sought to avoid having to instantiate new typed array instances when calling into LAPACK routines with ndarray data. This meant creating modified LAPACK API signatures supporting explicit offset parameters for all strided vectors and matrices.
For simplicity, let's return to the JavaScript implementation of daxpy, which was previously defined above.
As demonstrated in the following code snippet, we can modify the above signature and implementation such that the responsibility for resolving the first indexed element is shifted to the API consumer.
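A sketch of such a modified implementation, in which an offset parameter accompanies each stride:

function daxpy_ndarray(N, alpha, x, strideX, offsetX, y, strideY, offsetY) {
    if (N <= 0 || alpha === 0.0) {
        return y;
    }
    let ix = offsetX;
    let iy = offsetY;
    for (let i = 0; i < N; i++) {
        y[iy] += alpha * x[ix];
        ix += strideX;
        iy += strideY;
    }
    return y;
}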
For ndarrays, resolution happens during ndarray instantiation, making the invocation of daxpy_ndarray with ndarray data a straightforward passing of associated ndarray meta data. This is demonstrated in the following code snippet.
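For example, for one-dimensional ndarrays x and y, a call might look like the following sketch (the property accessors shown here follow stdlib's ndarray meta data properties):

daxpy_ndarray(x.shape[0], 5.0, x.data, x.strides[0], x.offset, y.data, y.strides[0], y.offset);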
Similar to BLIS, we saw value in both conventional LAPACK API signatures (e.g., for backward compatibility) and modified API signatures (e.g., for minimizing adverse performance impacts), and thus, we settled on a plan to provide both conventional and modified APIs for each LAPACK routine. To minimize code duplication, we aimed to implement a common lower-level "base" implementation which could then be wrapped by higher-level APIs. While the changes for the BLAS routine daxpy shown above may appear relatively straightforward, the transformation of a conventional LAPACK routine and its expected behavior to a generalized implementation was often much less so.
Enough with the challenges! What does a final product look like?!
Let's come full circle and bring this back to dlaswp, a LAPACK routine for performing a series of row interchanges on an input matrix according to a list of pivot indices. The following code snippet shows the reference LAPACK Fortran implementation.
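An abridged sketch follows, with documentation comments omitted and, for brevity, without the blocking over 32-column panels performed by the reference implementation for cache efficiency.

      SUBROUTINE DLASWP( N, A, LDA, K1, K2, IPIV, INCX )
*     .. Scalar Arguments ..
      INTEGER            INCX, K1, K2, LDA, N
*     ..
*     .. Array Arguments ..
      INTEGER            IPIV( * )
      DOUBLE PRECISION   A( LDA, * )
*     ..
*     .. Local Scalars ..
      INTEGER            I, I1, I2, INC, IP, IX, IX0, K
      DOUBLE PRECISION   TEMP
*     ..
*     Interchange row I with row IPIV(I) for each of rows K1 through K2:
*
      IF( INCX.GT.0 ) THEN
         IX0 = K1
         I1 = K1
         I2 = K2
         INC = 1
      ELSE IF( INCX.LT.0 ) THEN
         IX0 = K1 + ( K1-K2 )*INCX
         I1 = K2
         I2 = K1
         INC = -1
      ELSE
         RETURN
      END IF
*
      IX = IX0
      DO 20 I = I1, I2, INC
         IP = IPIV( IX )
         IF( IP.NE.I ) THEN
            DO 10 K = 1, N
               TEMP = A( I, K )
               A( I, K ) = A( IP, K )
               A( IP, K ) = TEMP
   10       CONTINUE
         END IF
         IX = IX + INCX
   20 CONTINUE
*
      RETURN
      END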
To facilitate interfacing with the Fortran implementation from C, LAPACK provides a two-level C interface called LAPACKE, which wraps Fortran implementations and makes accommodations for both row- and column-major input and output matrices. The middle-level interface for dlaswp is the wrapper LAPACKE_dlaswp_work.
When called with a column-major matrix a, the wrapper LAPACKE_dlaswp_work simply passes along provided arguments to the Fortran implementation. However, when called with a row-major matrix a, the wrapper must allocate memory, explicitly transpose and copy a to a temporary matrix a_t, recompute the stride of the leading dimension, invoke dlaswp with a_t, transpose and copy the results stored in a_t to a, and finally free allocated memory. That is a fair amount of work and is common across most LAPACK routines.
The following code snippet shows the reference LAPACK implementation ported to JavaScript, with support for leading and trailing dimension strides, index offsets, and a strided vector containing pivot indices.
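A sketch of what such a generalized implementation might look like (the argument names, their ordering, and the use of zero-based indices here are illustrative rather than stdlib's exact API):

// Lower-level "base" implementation supporting leading and trailing dimension
// strides, index offsets, and a strided pivot index vector:
function dlaswpBase(N, A, strideA1, strideA2, offsetA, k1, k2, IPIV, strideIPIV, offsetIPIV) {
    let ii = offsetIPIV;
    for (let i = k1; i <= k2; i++) {
        const ip = IPIV[ii];
        if (ip !== i) {
            // Swap rows `i` and `ip` across all `N` columns:
            let ia1 = offsetA + (i*strideA1);
            let ia2 = offsetA + (ip*strideA1);
            for (let j = 0; j < N; j++) {
                const tmp = A[ia1];
                A[ia1] = A[ia2];
                A[ia2] = tmp;
                ia1 += strideA2;
                ia2 += strideA2;
            }
        }
        ii += strideIPIV;
    }
    return A;
}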
To provide an API having consistent behavior with conventional LAPACK, I then wrapped the above implementation and adapted input arguments to the "base" implementation, as shown in the following code snippet.
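A simplified sketch of such a wrapper (illustrative; it assumes a unit positive pivot increment, whereas the actual wrapper must also handle negative and non-unit increments):

function dlaswp(order, N, A, LDA, k1, k2, IPIV, incx) {
    // Note: `incx` handling is omitted in this sketch (assumed to be 1).
    if (order === 'column-major') {
        // Unit stride between rows within a column; `LDA` elements between columns:
        return dlaswpBase(N, A, 1, LDA, 0, k1, k2, IPIV, 1, 0);
    }
    // Row-major: `LDA` elements between rows; unit stride between columns:
    return dlaswpBase(N, A, LDA, 1, 0, k1, k2, IPIV, 1, 0);
}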
I subsequently wrote a separate but similar wrapper which provides an API mapping more directly to stdlib's multi-dimensional arrays and which performs some special handling when the direction in which to apply pivots is negative.
One point to note: for server-side applications hoping to leverage hardware-optimized libraries, such as OpenBLAS, we provide separate wrappers which adapt generalized signature arguments to their optimized API equivalents. In this context, at least for sufficiently large arrays, creating temporary copies can be worth the overhead.
Despite the challenges, unforeseen setbacks, and multiple design iterations, I am happy to report that, in addition to dlaswp above, I was able to open 35 PRs adding support for various LAPACK routines and associated utilities. Obviously not quite 1,700 routines, but a good start! :)
Nevertheless, the future is bright, and we are quite excited about this work. There's still plenty of room for improvement and additional research and development. In particular, we're keen to continue expanding routine coverage, to back fill C and Fortran implementations along with native bindings to hardware-optimized libraries, and to further explore tooling and automation for bringing the remaining routines to the web.
While my Quansight Labs internship has ended, my plan is to continue adding packages and pushing this effort along. Given the immense potential and LAPACK's fundamental importance, we'd love to see this initiative of bringing LAPACK to the web continue to grow, so, if you are interested in helping drive this forward, please don't hesitate to reach out! And if you are interested in sponsoring development, the folks at Quansight would be more than happy to chat.
And with that, I would like to thank Quansight for providing this internship opportunity. I feel incredibly fortunate to have learned so much. Being an intern at Quansight was long a dream of mine, and I am very grateful to have fulfilled it. I want to extend a special thanks to Athan Reines and to Melissa Mendonça, who is an amazing mentor and all around wonderful person! And thank you to all the stdlib core developers and everyone else at Quansight for helping me out in ways both big and small along the way.
Cheers!
stdlib is an open source software project dedicated to providing a comprehensive suite of robust, high-performance libraries to accelerate your project's development and give you peace of mind knowing that you're depending on expertly crafted, high-quality software.
If you've enjoyed this post, give us a star on GitHub and consider financially supporting the project. Your contributions and continued support help ensure the project's long-term success and are greatly appreciated!