Clang 15.0 produces slow C++ applications

Hello,

I run macOS Ventura 13.6 and Command Line Tools 15.0 on an Intel i7 MacBook (post-2018).

After installing Clang 15.0, my C++ test programs run 4 to 5 times slower than with Clang 13.0.

Has anybody observed this slowdown?

The tests, which do a lot of mathematical computation, are compiled with the following command:

g++ -std=c++17 -march=native -funroll-loops -Ofast -DNDEBUG -o a atest.cpp

So I had to revert to Clang 13.0 to get a reasonable execution time.

What makes the C++ code so slow?

Hi,

Could you supply an example of what sort of programs are showing a slowdown? Have you logged a Feedback with any details?

Hi, no, I can't share all the code of my applications. To give an idea, they perform factor analysis, with and without the Eigen library. I wrote C and C++ versions. The tests compute the factors for the 64 questions of a survey, which gives a square matrix of 306 items. The method used to extract the factors is algorithmic.

On a 2020 Intel MacBook Pro, the total computation in C++ or C takes about 10 seconds with Clang 13.x but 54 seconds with Clang 15.

For C++, as I mentioned, the compiler options are -std=c++17 -Ofast -march=native -funroll-loops -flto -DNDEBUG

For the C version, the same:

gcc -Ofast -march=native -funroll-loops -flto -DNDEBUG -o a file1.c file2.c cpp1.o file3.c -lm -lstdc++

The computation time suffers a big regression. I do not see where it comes from.

djm44

I can understand not being able to provide all of your code! Is it possible to distill the algorithmic portion into a simple file that can be compiled and tested? Performance tuning (and regressions) can often be workload specific, so we're trying to narrow the space down a bit further. If it's easier, you can log a Feedback so the example is not on a public forum; just reference the number here. Thanks!

I'm afraid the algorithmic portion has nothing to do with the information I gave you. I can confirm that much. Moreover, on my Mac I have two virtual machines for simple use, VMware and VirtualBox. The same code I mentioned earlier runs in 12 seconds on VMware and 13 seconds on VirtualBox, even though these two virtual machines use far fewer resources than the host macOS. To be precise, the regression appears with Clang 15.0, not Clang 13.x. So g++ with Clang 15.0 is 5 times slower.

If you want a part of the computational code, here it is:

template < typename T >
int  OAnacorr<T>::compute_afc_for_burt()
{
   size_t n = 0, i = 0 ;
   T AKSI, ALAMBDA, PHI2, perc = 0 ;
   T CUMUL ;
   long TEST0, TEST1   ;

   T TOTA = M.sum() ;

   KT VL, VC ;
   VC.resize( M.getnc() );
   VL.resize( M.getnl() );

   Mtype M2( M.getnl(), M.getnc() ) ;

   KT PII ;
   M.peek_sum_rows( PII ) ;

   M.to_percent();
   KT PJ ;
   M.peek_sum_cols( PJ );

   Mtype K2( M ) ;

   KT KI( M.getnc() ) ;

   CUMUL =  0.0 ;

   PHI2 = K2.khi_deux_pond() ;

   if( Mtype::isnan(PHI2) || PHI2 <= 0  )
      return 1;

   K2.peek_sum_cols( KI ) ;


   KT VVI( KI ) ;
   KT SVI( KI ) ;
   Mtype TH = M.get_theoric() ;


   /*****************************************/

   KT VCSUP ;
   KT KSUP ;
   KT SKI ;
   KT A, D ;

   if( g_xsup > 0 )
   {
      VCSUP.resize( TSC.getnl());
      KSUP.resize( TSC.getnl() );
      A.resize ( VCSUP.size() );
      D.resize ( VCSUP.size() ) ;


      TSC.peek_sum_rows( KSUP );

      Mtype STHEO ( TSC.getnl(), TSC.getnc() ) ;

      for( size_t i = 0 ; i < TSC.getnl(); i++ )
         for( size_t j = 0; j < TSC.getnc(); j++ )
            STHEO(i,j) =  (PII[j] * KSUP[i]) / TOTA ;

      Mtype ECA  = TSC - STHEO;

      Mtype SKI2 = ( ECA * ECA ) / STHEO  ;
      SKI2 /= TOTA ;

      SKI2.peek_sum_rows( SKI );
      for( size_t i  = 0; i < VCSUP.size(); i++ )
         D[i] = KSUP[i] / TOTA ;

      TSC.to_percent();
      vect_to_percent( KSUP );

      inth( TSC,  KSUP, PJ, 1.0 );

   }

   os << "\nAnalyse des correspondances (AFC)" << std::endl << std::endl ;
   os << "Phi Deux          = " << std::setw(8) << std::setprecision(6) << std::fixed << PHI2 << std::endl;

   for ( n = 0 ; n < g_nbf ; n++ )
   {

      if( ( 100.00 - CUMUL ) < 0.0000001  ) goto tend ;

      vect_zero( M, VL ) ;
      i=0 ;

      do
      {
         TEST0 =  (VL[0] * 10000000.0) ;
         prod_by_cols( M, VC, VL ) ;
         AKSI = reduce_by_pond( PJ, VC) ;
         i++;

         if( i > 20000 )
         {
            return 1 ;
         }
         prod_by_rows( M, VL, VC ) ;
         AKSI = reduce_by_pond( PJ, VL ) ;

         TEST1 =  (VL[0] * 10000000.0)  ;
      }
      while ( TEST1 != TEST0 );


      if( g_xsup > 0 )
      {
         prod_by_rows( TSC, VCSUP, VC );
         T RX = reduce_by_pond( KSUP, VCSUP   );
         mul_vect( VCSUP, RX );
         for ( size_t i = 0 ; i < VCSUP.size(); i++ )
            if( Mtype::isnan(VCSUP[i])) VCSUP[i] = 0 ;

         WSUP.push_back( VCSUP );
      }


      mul_vect( VC, AKSI );
      WWC.push_back( VC ) ;
      ALAMBDA = ( n != 0 ) ? AKSI * AKSI : PHI2 ;

      rebuild_pond( M2, VC, PJ, AKSI ) ;

      M -= M2  ;

      if( n == 0 )
      {
         WWC.push_back( PJ );
         if( g_xsup > 0 )
         {
            WSUP.push_back(VCSUP);
         }
      }

      if( n != 0 )
      {
         mul_and_div( M2, TH );
         M2.peek_sum_cols( KI ) ;

         SVI = KI ;
         div_vect( SVI, VVI );
         WWC.push_back( SVI );

         perc = (ALAMBDA / PHI2) * 100 ;
         CUMUL += perc ;

         g_nbvectors += 1 ;

         if( g_xsup > 0 )
         {

            for( size_t i = 0 ; i < VCSUP.size(); i++ )
            {
               A[i] =  VCSUP[i] * VCSUP[i] * D[i]  /  SKI[i]    ;   // cos2
               if ( Mtype::isnan(A[i])) A[i] = 0 ;
               if( A[i] >= 1 ) A[i] = 0.999;


            }

            WSUP.push_back(A);

         }


         std::ostringstream ox ;
         ox << "F" << n ;
         os  <<  std::setw(5) << std::setfill(' ') << std::left <<  ox.str()
             << " Val Propre  = "
             << std::setw(8) << std::setprecision(6) << std::fixed << ALAMBDA
             << " Pourcent= " << std::setw(5) << std::setprecision(2) << std::right << std::fixed << perc
             << " Cumulé= "  << std::setw(6) << std::setprecision(2) << std::right << CUMUL
             << " Nb iter= "
             <<  std::setw(5) << std::right << ((n>0) ? i : i)  << std::endl ;

      }

      div_vect( KI, ALAMBDA );
      WWC.push_back( KI);


      if( g_xsup > 0 )
      {

         for( size_t i = 0 ; i < VCSUP.size(); i++ )
         {
            A[i] = VCSUP[i] * VCSUP[i] * D[i]  / ALAMBDA ;  // cpf

            if ( Mtype::isnan(A[i])) A[i] = 0 ;
            if( A[i] >= 1 ) A[i] = 0.999;

         }

         WSUP.push_back(A);
      }

   }

tend:
   g_nbf = n ;
   os << std::endl;
   return 0;
}

djm44

Have you tried other optimisation settings, e.g. -O3, rather than -Ofast? Just a guess, maybe the meaning of -Ofast has been changed?

What are the types T, KT etc.? Are they float/double, or are they something exotic?

KT is std::vector<double>. Mtype is std::vector<double> with a different subscript. T is double. There are no assembler instructions.

-O3 does better than -Ofast: 3 times slower instead of 5 times slower. You pointed out the right thing.

It seems the -Ofast option does not work any more. In my opinion there is a security add-on that blocks the code: something that blocks memory access, or that systematically verifies array accesses or array bounds? Or some new default options that make code slower?

Both versions of my code, C and C++, have been tested with Valgrind on Linux and give no errors (memory or array bounds).

You might like to ask on the clang mailing list / forum to see if anyone knows what changed. Or maybe it's in the release notes!

I believe that -Ofast is supposed to enable some "approximate maths" features. You might want to try turning them on explicitly.

I am not used to mailing lists. I read the release notes of Apple Clang 15.0. They are very long. I did not notice anything about changes to the -Ofast option. What are the specific flags added to -O3 in -Ofast? In my tests, -Ofast was compatible with the computations.

Hello, appreciate all the comments and feedback here.

We've tried to reproduce your results, but we're not able to compile the example source above. There are missing types, functions, and other bits that are important for understanding what the compiler is doing.

  • Ideally, a small reproducer source file could be shared.
  • If not, it may be possible to glean info from assembly of good and bad versions. Assembly can be generated by compiling with -S or disassembling the built binary with otool -tV <path-to-binary>.
  • Instruments.app can be used to profile the code and see where time is being spent.

I'm assuming that when you mention clang 15, you're running it from the Xcode 15 toolchain. Similarly, when you say clang 13, you're running it from an Xcode 13 toolchain. Have you given Xcode 14 a try?

What are the specific flags added to -O3 in -Ofast ?

To compare different compiler versions, or -O3 against -Ofast, running clang++ -### <rest-of-args> will show the options that the top-level driver passes on.

So at this point we notice that GNU C++ or clang++ is 2 times faster than Apple Clang on virtual machines like VMware and VirtualBox running Linux. The guests' compilers are running faster than the host's! A strange surprise.

I'm a little confused by the above. Are you measuring how long it takes to compile or runtime of the built products?

I assume clang++ here represents upstream LLVM clang++. Have you tried using that on macOS? Does it produce different results than Apple clang within Xcode?

Hi ,

Yes, I am comparing things that are comparable: Clang 15.0 with the previous version, using the Command Line Tools, not Xcode, on the same 2020 Intel MacBook Pro.

My previous toolchain was Command_Line_Tools_for_Xcode_14.3.1_Release_Candidate.

What I'm measuring is run time, not compile time. My code is meant to be fast; I do not care about compile time. By removing -march=native I get somewhat better performance. For now, after various modifications, it is 2 times slower than with the previous Clang.

Something has changed in this latest version of the Clang compiler that is not documented. I confirm that the same application, in C or C++, runs about 2 times faster on the Linux guests under VMware and VirtualBox, and likewise about 2 times faster with MinGW C++ on Windows under Boot Camp.

The only one that lacks performance is the latest Apple Clang 15.0, whether invoked as g++ or clang.

With the previous toolchain I used :

g++ -std=c++17 -Ofast -march=native -funroll-loops -flto -DNDEBUG -o a prog.cpp

With Clang 15.0, to get less bad performance:

g++ -std=c++17 -Ofast -funroll-loops -flto -DNDEBUG -o a prog.cpp

The performance:

On Linux guests: about 11 seconds

On the previous Clang: about 10 seconds

On Windows Boot Camp: about 11 seconds

On the latest Apple Clang 15.0: about 22 seconds

My previous toolchain was Command_Line_Tools_for_Xcode_14.3.1_Release_Candidate.

This has

 % clang -v
Apple clang version 14.0.3 (clang-1403.0.22.14.1)

Some things have changed in this last version of clang compiler that is not documented.

We're trying to narrow down what those changes could have been. Without being able to reproduce or see assembly, we can only really offer suggestions on what to look for or things to try.

  1. Compare compiler flags (clang -###)
  2. Compare disassembly of hotspots identified with Instruments.app. Are loops being unrolled? The same amount? Is one vectorized, but the other isn't?

For Linux VM, which version of LLVM clang are you using (clang -v)?

It's unlikely to be the cause, but since you're using -flto, have you tried the older linker with -Wl,-ld_classic?

Hi ,

I don't see what you mean with clang -###. I won't try all command line options.

I am not able to search through disassembly.

On Linux the Clang version is 15.0.7, but I use GNU gcc.

What is -ld_classic?

I notice an overall loss of run-time performance between

Apple clang version 14.0.3 (clang-1403.0.22.14.1)

and

Apple clang version 15.0.0 (clang-1500.0.40.1)

And Apple made it impossible to revert to 14.0.3 command line tools.

I don't see what you mean with clang -###. I won't try all command line options.

clang is the top-level driver that invokes other tools to build. One of those tools is the clang frontend, clang -cc1. Passing -### along with your other options to the clang driver will show the options that it is passing to the frontend. Comparing these between versions, -O3 vs -Ofast, with or without -march=native, can uncover whether certain options are contributing to the performance difference.

I am not able to search in disassemblies

Which method did you use to generate assembly (clang -S <other-args>) or disassembly (otool -tV <path-to-binary>)?

Also, have you tried running the poor performance code under Instruments.app's Time Profiler? That should show where time is spent and can help with further investigation.

What is -ld_classic?

Xcode 15 (and aligned command-line tools) ship with a new linker. The older linker is still available with the -ld_classic linker flag, or -Wl,-ld_classic if you're passing it to the clang driver. As mentioned, it's unlikely this is the cause, but it's an easy check so might be worth it.

So -O3 makes the code slower than -Ofast.

I tried -Wl,-ld_classic; it makes no difference.

I notice that -march=native makes the code very much slower.

I guess what changed with this latest version is -march=native.

Maybe there is less support for Intel processors.

With the previous version of Clang, -march=native made the code faster.

I don't know about assembly.

The fact is that my code and the compiler flags have not changed; the changes are in Clang 15.0 and are not documented.

The same code runs faster on Linux guests under VMware and VirtualBox with GNU gcc or g++, and on Windows Boot Camp with MinGW gcc and g++.

To be precise, my programs do not use a graphical UI. They only print results to the terminal.

I don't see what you mean with clang -###.

He means literally add -### to your invocation.
