Unrolling the Four Node Quad

The FourNodeQuad was one of the first, if not the first, solid elements in OpenSees.

Despite the element’s mediocre implementation, the coding style was copied by others into subsequent solid element implementations. Fortunately, before that copying happened, I revised the implementation to be more computationally efficient–and I was kind enough to leave the original author’s name in the source code.

The improved computational efficiency came from unrolling the matrix-vector operations involving the strain-displacement matrix, {\bf B}, which enters several important stages of the element state determination:

  • Compatibility of strains with nodal displacements
    {\boldsymbol \varepsilon}={\bf B}{\bf u}
  • Equilibrium of nodal forces and stresses
    {\displaystyle {\bf P} = \int_V {\bf B}^T {\boldsymbol \sigma} \: dV}
  • Tangent stiffness matrix
    {\displaystyle {\bf K} = \int_V {\bf B}^T {\bf D} {\bf B} \: dV}

For more details on these equations, see the “Cook book”.

These operations with the {\bf B} matrix occur at each Gauss point, in each quad element, at each Newton iteration within each analysis time step. So, it behooves us to trim as many flops as possible from the state determination.

The original implementation of FourNodeQuad used the overloaded matrix-vector operators:

// Compatibility
eps = B*u;
// Equilibrium
P = P + B^sig*dV;
// Stiffness
K = K + B^D*B*dV;

The first step to efficiency was to use the daxpy-like functions in order to avoid the temporary Matrix and Vector objects created by the overloaded operators:

// Compatibility
eps.addMatrixVector(0.0, B, u, 1.0);
// Equilibrium
P.addMatrixTransposeVector(1.0, B, sig, dV);
// Stiffness
K.addMatrixTripleProduct(1.0, B, D, dV);

But we can do better than that. Inspection of the {\bf B} matrix shows there are always zeros in known locations:

Taking the sparsity pattern into account, the final unrolled implementation does away with storage of the {\bf B} matrix and loops over the nodes for contributions to strain, forces, and stiffness.

// Compatibility
for (int beta = 0; beta < 4; beta++) {
   eps(0) += shp[0][beta]*u[0][beta];
   eps(1) += shp[1][beta]*u[1][beta];
   eps(2) += shp[0][beta]*u[1][beta] + shp[1][beta]*u[0][beta];

// Equilibrium
for (int alpha = 0, ia = 0; alpha < 4; alpha++, ia += 2) {
   P(ia) += dV*(shp[0][alpha]*sigma(0) +
   P(ia+1) += dV*(shp[1][alpha]*sigma(1) + 

// Stiffness
A lot of code that's not shown here

I omitted some important implementation details here, but you get the idea. For the full picture, take a look at FourNodeQuad.cpp. Also, FEAP programmers may recognize the unrolling style as I embarked on this efficiency quest after taking CE 233 in Fall 2000 from Prof. Taylor, who came out of quasi-retirement to help cover Prof. Armero’s sabbatical.

So, how much difference does all of this make?

I created a mesh of 20 elastic isotropic quad elements then loaded the model to a peak value in 100,000 time steps. Even though the model is linear, I used a Newton algorithm to mimic what goes on in a nonlinear analysis. The mesh size is small so the equation solution doesn’t matter. The 100,000 time steps wash out variance in timing individual analysis steps.

Everything else being equal, differences in implementation of the element state determination are apparent from timing the ops.analyze(100000) command.

ImplementationWall Time Relative to Unrolled Implementation
Overloaded operators1.322

Using the overloaded matrix-vector operators takes about 30% more wall time than the unrolled implementation. Similarly, the daxpy-like implementation takes about 20% more wall time than the unrolled implementation. And the {\bf B} matrix is not all that sparse.

So, going from the overloaded operators to daxpy-like implementation speeds up the analysis–it’s a win. But, if you want to go for the gold, unroll the calculations to account for zeros–especially if you are multiplying block diagonal matrices. Unrolling will make a BIG difference.

6 thoughts on “Unrolling the Four Node Quad

  1. Great post! Another big source of slow code I find all around is the dynamic memory allocation that comes with using the Matrix, Vector and ID classes that create temporaries within low-level function calls like this one.

    A while back I added move semantics for the Matrix and Vector classes that can be enabled with the USE_CXX11 preprocessor macro, which I don’t know anyone uses (I just realized I haven’t for a while). Back when I did it I also found a 5% to 10% improvement in runtime just by enabling this.

    I did not know the history of the development of the solid elements. I’ll add comments to the ones I’ve developed, saying that they are based off your original coding style. I know I took mine over from the Brick elements that are attributed to Ed Love.


    1. Thanks! I’m not looking for attribution of coding style in different element implementations. But, attribution and original authorship should be maintained when folks come along and change a few lines of code in the original implementation, then replace the author’s name with their own, e.g., when Ed’s name was removed from ShellMITC4 after some minor changes.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.