# Profiling tips & tricks
Here are some findings and tricks that we have found useful at various times when profiling the Ouroboros Consensus layer.
## Time reported in GHC `-p` profiling is off
When a Haskell program is run, the time reported in the `-p` report is computed as

```
total_ticks * tick_interval / num_capabilities
```

`cardano-node` is run by default with two capabilities (`-N2`), which means that the reported total time will be half of what one would expect.
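As an illustrative example (with made-up numbers, assuming the default tick interval of 0.02 s): a run on two capabilities that accumulates 6000 ticks would be reported as

```
6000 * 0.02 s / 2 = 60 s
```

even though roughly 120 s of CPU time were actually consumed.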
## Time reported is lower than half of what one would expect
This can be because there are `safe` and `unsafe` foreign calls in different parts of the codebase.
**NOTE**: To explain this, we have to take a short detour into the FFI. The following table gives some context:
|                            | `safe` | `unsafe` |
|----------------------------|--------|----------|
| Can call into Haskell code | Yes    | No       |
| Releases the capability    | Yes    | No       |
| Shows up in time profiling | No     | Yes      |
In particular, for `safe` calls the capability is released just before the foreign function is called, so from the Haskell point of view said task is not running.
Imagine the following code in Haskell:
```haskell
foreign import ccall unsafe "foo" foo :: IO ()
foreign import ccall safe   "bar" bar :: IO ()

baz :: IO ()
baz = do
  foo
  bar
```
This will be compiled to Core (`-ddump-simpl`) as:
```
...
case {__pkg_ccall main State# RealWorld -> (# State# RealWorld #)}_d20O
       ds_X21f
of
...
case {__pkg_ccall_GC main State# RealWorld -> (# State# RealWorld #)}_d20S
       ipv4_X284
of
...
```
And to C-- (`-ddump-cmm`) as:
```
...
call "ccall" arg hints: [] result hints: [] foo();
...
(_u3jD::I64) = call "ccall" arg hints: [PtrHint,] result hints: [PtrHint] suspendThread(BaseReg, 0);
call "ccall" arg hints: [] result hints: [] bar();
(_u3jE::I64) = call "ccall" arg hints: [PtrHint] result hints: [PtrHint] resumeThread(_u3jD::I64);
...
```
This shows how a `safe` call indeed releases the capability. Imagine a super-simplified scenario with, initially, one OS thread, one Haskell task/thread, and only one capability, which performs the `baz` call:
```
= has the capability
- does not have the capability

        v call to `foo`           v call to `bar`
...===============================-------------------========= ...
                        ^ return                     ^ return
```
So in the diagram above, during the `safe` call the capability is IDLE, and thus that time does not count towards the time spent. What happens if we have more Haskell tasks waiting to be executed (note that this will require multiple OS threads)?
```
= has the capability
- does not have the capability

        v call to `foo`           v call to `bar`
...===============================----------------------========= ...
                        ^ return                ^ return
...-----------------------------========================-------- ...
```
Note that now the foreign call might finish before the capability is available again, in which case it has to wait. From Haskell's point of view, no waiting time happened in this last scenario, as the capability was running Haskell code the whole time. If no Haskell task can run while the C code is being executed, then that time will not appear in the profiling report, as no Haskell code is being executed (a case similar to the first diagram).
So the time reported by Haskell time profiling will be off, due to these `safe` foreign calls not being accounted for. Apart from that, `unsafe` calls sometimes lose ticks when running with the default tick interval. Therefore, it is advisable to run with `-V0.01` when doing time measurements.
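For instance, the `cardano-node` invocation shown later in this document could be adapted to use a finer tick interval together with time profiling (a sketch; only the `-p` and `-V0.01` RTS flags are the relevant part):

```
> cabal run exe:cardano-node -- run +RTS -p -V0.01 -RTS
```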
## Time profiling and eventlog
One of the few approaches that proved useful for investigating time consumption was using hs-speedscope to graph all the execution time registered by Haskell, combined with ghc-events-analyze to see the time elapsed between the different events we care about.
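ghc-events-analyze works with user events following the `START <label>`/`STOP <label>` convention. A minimal sketch of emitting such events with `Debug.Trace.traceEventIO` (the `withEvent` helper and the `validation` label are just illustrative names):

```haskell
import Debug.Trace (traceEventIO)

-- Bracket an action with START/STOP user events so that
-- ghc-events-analyze can report the time elapsed between them.
withEvent :: String -> IO a -> IO a
withEvent label act = do
  traceEventIO ("START " ++ label)
  r <- act
  traceEventIO ("STOP " ++ label)
  pure r

main :: IO ()
main = withEvent "validation" (pure ())
```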
In order to emit time profiling to the eventlog, the executable has to be linked with `-eventlog` to pull in the right RTS, and then run with `-l`. If the eventlog produced is too big, one can run with `-l-au` so that all events are omitted (`-a`) except the user events (`u`). One can also set the output name for the eventlog file with `-ol<name>`. An example from `cardano-node`:
```
> cat cardano-node.cabal
...
executable cardano-node
  ...
  ghc-options: -eventlog
               "-with-rtsopts=-T -I0 -A16m -N2 --disable-delayed-os-memory-return"
...

> cat cabal.project.local
...
profiling: true
profiling-detail: all-functions

package plutus-core
  ghc-options: -fexternal-interpreter
...

> cabal run exe:cardano-node -- run +RTS -l-au -p -olname.eventlog -RTS
...

> ls
...
name.eventlog
```
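The resulting eventlog can then be converted into a format that speedscope understands using the hs-speedscope executable (a sketch, assuming it has been installed, e.g. with `cabal install hs-speedscope`; it writes a JSON file next to the eventlog that can be loaded at https://www.speedscope.app):

```
> hs-speedscope name.eventlog
```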
It is also important to note that `-fno-prof-count-entries` can make concurrent profiled code run significantly faster (see this).
```
> cabal run --ghc-options="-fno-prof-count-entries" <exe> -- <args> +RTS -l-au -p -olname.eventlog -RTS
```
## External interpreter (`ByteCodeLink.lookupCE`)
Template Haskell has some strange interactions with profiling, which were reported to the GHC team in this ticket. The solution is simple: use `-fexternal-interpreter` for the packages that fail with that error.
However, it is important to note that, at least in our case, we cannot build the whole project with `-fexternal-interpreter`, as there is an error related to `malloc`ing memory:
```
> cabal build --ghc-options="-fexternal-interpreter" ...
...
[45 of 63] Compiling Cardano.Node.Tracing.Tracers.ChainDB ( src/Cardano/Node/Tracing/Tracers/ChainDB.hs, .../dist-newstyle/build/x86_64-linux/ghc-8.10.4/cardano-node-1.33.0/build/Cardano/Node/Tracing/Tracers/ChainDB.o, .../dist-newstyle/build/x86_64-linux/ghc-8.10.4/cardano-node-1.33.0/build/Cardano/Node/Tracing/Tracers/ChainDB.dyn_o ) [Data.Aeson.KeyMap changed]
ghc-iserv: mmap 131072 bytes at (nil): Cannot allocate memory
ghc-iserv: Try specifying an address with +RTS -xm<addr> -RTS
ghc-iserv: internal error: m32_allocator_init: Failed to map
    (GHC version 8.10.4 for x86_64_unknown_linux)
    Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
ghc: ghc-iserv terminated (-6)
cabal: Failed to build cardano-node-1.33.0 (which is required by
exe:cardano-node from cardano-node-1.33.0).
```
So the solution is, as shown in the example above, to use the external interpreter only on `plutus-core`, which is the package that currently fails:
```
> cat cabal.project.local
...
profiling: true
profiling-detail: all-functions

package plutus-core
  ghc-options: -fexternal-interpreter
...
```
## Entering a profiled Nix shell
We have cached profiled Nix shells built on CI, so you don't have to recompile all dependencies locally with profiling enabled.
To use them, enter the `.#ghc96-profiled` shell, e.g. via

```
nix develop .#ghc96-profiled
```

and create a `cabal.project.local` like this:
```
profiling: True
-- https://well-typed.com/blog/2023/03/prof-late/
profiling-detail: late
```
Now, any built executable will have profiling support.
For example, to find out where time is spent in some test, you could use

```
cabal run ouroboros-consensus:test:storage-test -- -p 'ChainDB q-s-m' +RTS -pj
```

and e.g. load the resulting `.prof` file into speedscope.