Part 3.4: Getting the most out of Burst

Unity Technologies

In this section of the DOTS Best Practices guide, you will:

Learn how to efficiently use the Unity.Mathematics library, Burst aliasing hints and SIMD optimizations to enable the Burst compiler to generate fast code

1. Use Unity.Mathematics

Use the Unity.Mathematics package for any mathematical operations you want to perform in DOTS code, rather than the traditional Mathf API. The same goes for the mathematical types: use float3 instead of Vector3, quaternion instead of Quaternion, float4x4 instead of Matrix4x4.

Why? Because the data types in Unity.Mathematics form the basis of the SIMD optimizations that Burst implements. You might not see much performance difference between the two math libraries in code that isn’t Burst-compiled, but once Burst is involved, code that uses Unity.Mathematics and its data types is considerably faster than equivalent code that uses UnityEngine.Mathf and the older OOP types.

Understand operators

You should be aware that many of the arithmetic operators which are defined for Unity.Mathematics types do not behave in the same way as the operators for OOP UnityEngine types. For SIMD types such as float3 or float4x4, almost all of the arithmetic operators are applied in a component-wise manner, which is not necessarily the case with the old UnityEngine types.

This is particularly important to remember when dealing with matrix types. If you use Matrix4x4.operator * to multiply two Matrix4x4, the result is a standard matrix multiplication, in which each element of the resulting matrix is the dot product of a row of the first matrix and a column of the second. However, float4x4.operator * is a component-wise operation, which gives a different result. For a standard matrix multiplication, you should use math.mul() instead.
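
As a minimal sketch of the difference, the snippet below multiplies the same two float4x4 values both ways. The class and method names are hypothetical and exist only for illustration; the calls it relies on (float4x4.Scale, float4x4.Translate, math.mul and math.transform) come from Unity.Mathematics.

```csharp
using Unity.Mathematics;

// Hypothetical helper class, for illustration only.
public static class MatrixMultiplyExample
{
    public static float3 Compare()
    {
        // Two simple transforms: a uniform scale and a translation.
        float4x4 scale = float4x4.Scale(2f);
        float4x4 translate = float4x4.Translate(new float3(1f, 2f, 3f));

        // Component-wise product: each element of 'scale' is multiplied by the
        // element at the same row and column of 'translate'. The translation
        // column is multiplied by zeros and lost, so this is NOT a combined transform.
        float4x4 componentWise = scale * translate;

        // Standard matrix product: applies 'translate' first, then 'scale'.
        float4x4 combined = math.mul(scale, translate);

        // Transforming the origin with the combined matrix gives (2, 4, 6):
        // the origin is moved to (1, 2, 3) and then scaled by 2.
        return math.transform(combined, float3.zero);
    }
}
```

If code ported from Matrix4x4 gives unexpected transforms, a stray * between two matrices is one of the first things to look for.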

Know what you can Burst compile, and how to do it

You should use Burst to compile as much of your DOD code as possible. You can instruct Burst to compile any code that conforms to the standards of High-Performance C# (HPC#), as described in the Burst package documentation on C# Language Support. Flag methods and structs to be compiled by Burst by adding the [BurstCompile] attribute. In ECS code, this generally means adding the attribute to:

Job structs (but not the Execute() method in those structs)
The OnCreate() / OnUpdate() / OnDestroy() methods in an ISystem (but not the ISystem struct itself)

This guide shows correct usage of the attribute in many of its code examples. For a more in-depth discussion of how to use [BurstCompile], see this post written by one of our engineers in the Unity Forums: When, where, and why to put [BurstCompile], with mild under-the-hood explanation.

Embed Random generators in components

Unity.Mathematics contains a Random struct, which can be used to efficiently generate pseudo-random numbers. Random must be initialized with a seed, which it stores in a UInt32 field called state. Every time your code calls one of the Next() methods, Random performs some mathematical operations using state as the input, in order to generate the new random number. The methods also store the newly-generated random number in state, ensuring a different input the next time another random number is required.

This method of generating random numbers is very fast, but requires care when you want to incorporate it into multithreaded code. It can be tempting to create a single instance of Random on the main thread and pass it to jobs that require a random number generator. However, this should be avoided. The data that every job instance requires is copied into the job in order to allow safe, multithreaded code. This means that every thread has its own copy of your original main thread Random struct, and all of them are seeded with the same value - a copy of the value of state in the original struct. This is not very random! To make matters worse, the value of state is never copied back from any of the job instances into the main thread instance, so unless random numbers are also requested on the main thread, the jobs will always generate the same “random” numbers every frame!

There are a variety of ways to avoid this. For example, you can create an array of Random instances, initialized using CreateFromIndex(), and use the thread or chunk index in the job to select one of them to use. However, the simplest way to generate good quality random numbers is to have one instance of Random per Entity, stored in a component. Jobs can use this instance to generate any random numbers which are required for data transformations on a particular entity, and the state will be unique and persist as long as its component exists.
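
The per-entity approach might look something like the sketch below. The component names (EntityRandom, WanderDirection) and the job name are hypothetical, and how you choose the index passed to CreateFromIndex() when the entity is created is left as an assumption.

```csharp
using Unity.Burst;
using Unity.Entities;
using Unity.Mathematics;

// Hypothetical component holding one generator per entity.
public struct EntityRandom : IComponentData
{
    public Unity.Mathematics.Random Value;

    // Assumption: 'index' is any value that differs per entity,
    // for example a spawn counter.
    public static EntityRandom Create(uint index)
    {
        return new EntityRandom
        {
            Value = Unity.Mathematics.Random.CreateFromIndex(index)
        };
    }
}

// Hypothetical component that receives the generated data.
public struct WanderDirection : IComponentData
{
    public float3 Value;
}

[BurstCompile]
public partial struct PickDirectionJob : IJobEntity
{
    // The generator is taken by 'ref' so that the updated state is written
    // back to the component and the next call produces a different number.
    void Execute(ref EntityRandom random, ref WanderDirection direction)
    {
        direction.Value = random.Value.NextFloat3Direction();
    }
}
```

Taking the component with in instead of ref would leave the job working on a copy, which reintroduces the repeated-number problem described above.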

2. Use Burst aliasing hints

Aliasing is a word to describe the situation in which code is manipulating the contents of two references or pointers which point to the same place in memory. Here’s a simple example:

int Foo(ref int a, ref int b)  
{  
    b = 13;  
    a = 42;  
    return b;  
}

What does this method return? If a and b refer to different areas of memory, the location that b refers to contains 13, so the method returns 13. However, if both references point to the same area of memory, that location contains 42, because that was the last value we set to that location, so the method returns 42. When a and b are backed by the same memory, they are said to alias each other.

The compiler doesn’t know at compile time whether these two references are aliased, so it must produce assembly code which is suboptimal but guarantees a correct result in every circumstance.

mov dword ptr [rdx], 13    // Store 13 into b  
mov dword ptr [rcx], 42    // Store 42 into a  
mov eax, dword ptr [rdx]   // Load the contents of b back into the register
ret                        // Return the contents of the register

If this code occurs inside a Burst-compiled job and you know that a and b definitely never alias, you can use the [NoAlias] attribute to tell the compiler, and it can produce more efficient assembly. In this example, it can avoid the need to load the contents of b back into the register, and instead simply returns 13.

The [NoAlias] page, in the Memory Aliasing section of the Burst documentation, describes various ways to use [NoAlias] to tell Burst when pointers or references do not alias in a variety of situations:

Method parameters
Method return values
Structs
Struct fields
Jobs

The article also shows how to use Unity.Burst.CompilerServices.Aliasing in order to add intrinsics to your code which perform compile-time checks to ensure that your assumptions about aliasing are correct.
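
As a rough sketch, here is the earlier Foo() method with the attribute applied to its parameters, plus one of the compile-time checks from Unity.Burst.CompilerServices.Aliasing. The containing class name is hypothetical, and wrapping the method in a Burst-compiled static class is an assumption made to keep the snippet self-contained.

```csharp
using Unity.Burst;
using Unity.Burst.CompilerServices;

// Hypothetical container class, for illustration only.
[BurstCompile]
public static class AliasingExample
{
    [BurstCompile]
    public static int Foo([NoAlias] ref int a, [NoAlias] ref int b)
    {
        // Compile-time check: Burst reports a compiler error if it cannot
        // prove that 'a' and 'b' refer to different memory.
        Aliasing.ExpectNotAliased(in a, in b);

        b = 13;
        a = 42;

        // Because the store to 'a' cannot affect 'b', Burst is free to
        // return the constant 13 instead of reloading 'b' from memory.
        return b;
    }
}
```

[NoAlias] is a promise to the compiler rather than something it enforces at runtime, so calling this method with two references to the same int would produce incorrect results.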

3. SIMD optimization

So, you’ve followed everything else in this Best Practices guide; you’ve understood DOD, planned out your data and transformations, and avoided the common performance pitfalls in the implementation - but you still have one or two jobs that really need a little bit more performance squeezed out of them. Perhaps the jobs are efficient enough in terms of memory access, but just perform a lot of mathematical operations on large data sets (such as custom culling, or manipulating large amounts of pixel or vertex data on the CPU for procedural content, visual effects or some custom physics simulation). If this is the case, maybe it’s time to consider SIMD.

SIMD stands for Single Instruction, Multiple Data, and in Unity.Mathematics terms it’s a way to take advantage of the fact that to a CPU, performing some mathematical operations on a float4 is as fast as performing those same operations on a single float.

Burst is pretty good at automatically vectorizing code in a lot of cases, so the first thing you should do is look in the Burst Inspector at the assembly that it produces for your job. It helps if you can read assembly, but even if you can’t, you should be able to count the SIMD and scalar instructions to get a sense of the complexity of the job. In x86 assembly, if you see a lot of instructions in the form xxxps (for example, addps, mulps; "ps" stands for "packed single-precision"), all feeding into each other, those are vectorized SIMD instructions, and that’s good. If you see lots of xxxss (for example, addss, mulss; "ss" stands for "scalar single-precision"), those are scalar instructions, and that’s not good. If you see a mix of the two instruction types, you’re seeing a mix of SIMD and scalar code, and that’s also not good. Your aim is to refactor the code to help the Burst compiler to remove as many scalar instructions as possible.

The Unite Copenhagen 2019 presentation Intrinsics: Low-level engine development with Burst is an excellent introduction to converting some of your code to SIMD, and to the benefits you can expect from doing so. Towards the end of the video, the presenter, Andreas Fredriksson, lists some general best practice advice for writing SIMD-friendly code with Burst:

Become familiar with the Burst inspector. Even if you’re not an assembly expert, viewing assembly can give you a good idea of how optimal your code is and gives you a baseline for comparison once you start making changes.
Eliminate branches. CPUs are faster at running through a linear set of instructions than branching conditional code.
Prefer wider batches of input data. DOD is about processing buffers of many things rather than processing things individually.
Use Unity.Mathematics vertically. Don’t use a float3 to represent an (x, y, z) vector when you can use three float4 (one for the x values, one for y values, one for z values) and use SIMD to process 4 vectors in the same amount of code that it would have taken to process one.
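
To illustrate the advice on eliminating branches, here is a small sketch that replaces a per-element if statement with math.select(), which chooses between two values component by component. The class, method and parameter names are hypothetical and describe an imaginary damage calculation, not code from the presentation.

```csharp
using Unity.Mathematics;

// Hypothetical helper class, for illustration only.
public static class BranchlessExample
{
    // Branching version: handles one value at a time.
    public static float ApplyDamage(float health, float damage, bool shielded)
    {
        if (shielded)
            return health;
        return health - damage;
    }

    // Branch-free version: handles four values at once.
    // math.select(falseValue, trueValue, test) returns, for each component,
    // trueValue where test is true and falseValue where it is false.
    public static float4 ApplyDamage(float4 health, float4 damage, bool4 shielded)
    {
        return math.select(health - damage, health, shielded);
    }
}
```

The branch-free version always computes health - damage, even for shielded entries. That extra arithmetic is usually a reasonable trade for straight-line code, but it is worth confirming in the Burst Inspector and the Profiler.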

For more information about SIMD, see Andreas Fredriksson’s 2015 GDC presentation SIMD at Insomniac Games: How We Do the Shuffle.

4. SIMD optimization example: simple frustum culling

Here’s an example of what SIMD optimization looks like in the imaginary game Beach Ball Simulator: Special Edition.

In the bonus round, when 100,000 beach balls are released, rendering performance becomes an issue and the team decides to implement a custom system to perform frustum culling on the beach balls’ bounding spheres. Each ball has a LocalToWorld component which includes its position in world space, a SphereRadius component defining the bounding sphere size, and a SphereVisible component which the culling system must set according to whether any part of the sphere is within the planes of a camera frustum. The Value in SphereVisible could be a bool, but we’re making it an int for now because it will make some things simpler later on.

public struct SphereRadius : IComponentData  
{  
    public float Value;  
}  
  
public struct SphereVisible : IComponentData  
{  
    //public bool Value;
    public int Value;  // 1 = visible, 0 = not visible
}

First, the culling system packs the data which defines each of the six camera planes into a float4 (six float4 values in total), in which the (x,y,z) components are the plane’s normal vector and the w component is the plane’s distance from the origin. We have a helper struct with a method to do that:

public struct FrustumCullHelper
{
    static Camera _camera;
    static Plane[] _planesOOP = new Plane[6];

    public static void UpdateFrustumPlanes(ref NativeArray<float4> planes)
    {
        if (_camera == null)
            _camera = Camera.main;
        
        GeometryUtility.CalculateFrustumPlanes(_camera, _planesOOP);
        
        for (int i = 0; i < 6; ++i)
            planes[i] = new float4(_planesOOP[i].normal, _planesOOP[i].distance);
    }
}

Here are the bare bones of our FrustumCullSystem. The interesting parts will happen inside the CullJob. Note that we’re unable to [BurstCompile] the OnUpdate() method as a whole because it references a managed type (the main Camera, inside FrustumCullHelper.UpdateFrustumPlanes()), but once that’s dealt with we can call another method, DoCulling() which is Burst-compiled.

public partial struct FrustumCullSystem : ISystem
{
    private NativeArray<float4> _planes;
    [BurstCompile] public void OnCreate(ref SystemState state)
    {
        _planes = new NativeArray<float4>(6, Allocator.Persistent);
    }
    
    [BurstCompile] public void OnDestroy(ref SystemState state)
    {
        _planes.Dispose();
    }
    
    public void OnUpdate(ref SystemState state)
    { 
        FrustumCullHelper.UpdateFrustumPlanes(ref _planes);
        DoCulling(ref state);
    }
    
    [BurstCompile] public void DoCulling(ref SystemState state)
    { 
        state.Dependency = new CullJob { Planes = _planes }.ScheduleParallel(state.Dependency);
        state.Dependency.Complete();
    }
}

To test whether a sphere is at least partly on the inner side of a camera frustum plane, you can perform a dot product between the plane normal and the sphere center, add the plane distance and sphere radius, and see whether the result is bigger than zero. Here’s a naive implementation which does just that; it iterates over all of the spheres, then loops through all the planes and performs the tests one by one.

[BurstCompile(OptimizeFor = OptimizeFor.Performance)]
partial struct CullJob : IJobEntity
{
   [ReadOnly] public NativeArray<float4> Planes;

   void Execute(ref SphereVisible visibility, in LocalToWorld localToWorld, in SphereRadius radius)
   {
       bool visible = true;
       for (int planeID = 0; planeID < 6; ++planeID)
       {
           if (math.dot(Planes[planeID].xyz, localToWorld.Position) +
               Planes[planeID].w + radius.Value <= 0)
           {
               visible = false;
               break;
           }
       }
       visibility.Value = visible ? 1 : 0;
   }
}

While this might seem like a nice efficient job, there is scope for performance improvement here. First, there’s the for loop that iterates over the planes. More specifically, there’s a break statement which exits the loop if a sphere is outside one of the frustum planes. This generates a branch in the assembly code, which is something that should be avoided.

Here’s a version that removes the plane loop and the branch.

[BurstCompile(OptimizeFor = OptimizeFor.Performance)]
partial struct CullJob : IJobEntity
{
   [ReadOnly] public NativeArray<float4> Planes;

   void Execute(ref SphereVisible visibility, in LocalToWorld localToWorld, in SphereRadius radius)
   {
       var pos = localToWorld.Position;
       visibility.Value =
           (math.dot(Planes[0].xyz, pos) + Planes[0].w + radius.Value > 0) &&
           (math.dot(Planes[1].xyz, pos) + Planes[1].w + radius.Value > 0) &&
           (math.dot(Planes[2].xyz, pos) + Planes[2].w + radius.Value > 0) &&
           (math.dot(Planes[3].xyz, pos) + Planes[3].w + radius.Value > 0) &&
           (math.dot(Planes[4].xyz, pos) + Planes[4].w + radius.Value > 0) &&
           (math.dot(Planes[5].xyz, pos) + Planes[5].w + radius.Value > 0) ? 1 : 0;
   }
}

This job performs more actual computation, but the performance gains from removing the branch mean that this version actually runs faster than the first version. But what about the verticality of the data? Although this job checks all the frustum planes in a single statement, the checks are all scalar. If the plane data were packed differently, SIMD instructions could check a sphere against 4 planes in the same number of math operations needed to check 1 plane.

Let’s add a new method to our FrustumCullHelper. This one calculates the six float4 frustum planes from the Camera as before, but then re-packs them into two PlanePacket4 structs. One of these structs can represent four frustum planes, so we only need two of them to represent the whole frustum (with some redundant data in the second packet). Importantly, the x, y, z and distance values of the planes are packed together, which will allow us to write a more efficient job.

public struct PlanePacket4
{
    public float4 Xs;
    public float4 Ys;
    public float4 Zs;
    public float4 Distances;
}

public static void CreatePlanePackets(ref NativeArray<PlanePacket4> planePackets)
{
    var planes = new NativeArray<float4>(6, Allocator.Temp);
    FrustumCullHelper.UpdateFrustumPlanes(ref planes);
    
    int cullingPlaneCount = planes.Length;
    int packetCount = (cullingPlaneCount + 3) >> 2;

    for (int i = 0; i < cullingPlaneCount; i++)
    {
        var p = planePackets[i >> 2];
        p.Xs[i & 3] = planes[i].x;
        p.Ys[i & 3] = planes[i].y;
        p.Zs[i & 3] = planes[i].z;
        p.Distances[i & 3] = planes[i].w;
        planePackets[i >> 2] = p;
    }

    // Populate the remaining planes with values that are always "in"
    for (int i = cullingPlaneCount; i < 4 * packetCount; ++i)
    {
        var p = planePackets[i >> 2];
        p.Xs[i & 3] = 1.0f;
        p.Ys[i & 3] = 0.0f;
        p.Zs[i & 3] = 0.0f;

        // We want to set these distances to a very large number, but one which
        // still allows us to add sphere radius values. Let's try 1 billion.
        p.Distances[i & 3] = 1e9f;

        planePackets[i >> 2] = p;
    }
}

Here’s how our FrustumCullSystem and its CullJob look now, after taking advantage of this shuffled data:

public partial struct FrustumCullSystem : ISystem
{
    private NativeArray<PlanePacket4> _planePackets;
    [BurstCompile] public void OnCreate(ref SystemState state)
    {
        _planePackets = new NativeArray<PlanePacket4>(2, Allocator.Persistent);
    }
    
    [BurstCompile] public void OnDestroy(ref SystemState state)
    {
        _planePackets.Dispose();
    }
    
    public void OnUpdate(ref SystemState state)
    { 
        FrustumCullHelper.CreatePlanePackets(ref _planePackets);
        DoCulling(ref state);
    }
    
    [BurstCompile] public void DoCulling(ref SystemState state)
    {
        state.Dependency = new CullJob { PlanePackets = _planePackets }.ScheduleParallel(state.Dependency);
        state.Dependency.Complete();
    }

    [BurstCompile(OptimizeFor = OptimizeFor.Performance)]
    partial struct CullJob : IJobEntity
    {
        [ReadOnly] public NativeArray<PlanePacket4> PlanePackets;

        void Execute(ref SphereVisible visibility, in LocalToWorld localToWorld, in SphereRadius radius)
        {
            var pos = localToWorld.Position;
            var p0 = PlanePackets[0];
            var p1 = PlanePackets[1];
            
            bool4 masks = (p0.Xs * pos.x + p0.Ys * pos.y + p0.Zs * pos.z + p0.Distances + radius.Value <= 0) |
                          (p1.Xs * pos.x + p1.Ys * pos.y + p1.Zs * pos.z + p1.Distances + radius.Value <= 0);
    
            visibility.Value = masks.Equals(new bool4(false)) ? 1 : 0;
        }
    }
}

This replaces the math.dot() method with explicit multiply and add operations, because we’re now performing a dot product between a single position vector (pos.x, pos.y, pos.z) and four plane normal vectors (for instance, (p0.Xs, p0.Ys, p0.Zs)) simultaneously. This solution can check a sphere against 2 PlanePacket4 in 33% of the mathematical operations needed to check against 6 planes.

It’s possible to optimize even further. Because 6 (the number of planes) isn’t a multiple of 4 (the number of possible simultaneous operations on a SIMD register), the second plane packet only contains two useful planes, as well as two dummy planes which always produce positive results. That’s wasted processing power.

So, what if we embrace the concept of preferring wider batches of input data and, instead of packing the frustum planes into SIMD-friendly packets of 4, we pack the spheres instead? This is certainly possible, and such an approach makes better use of the CPU, because it places meaningful data in as many operations as possible. However, the code must do some extra work in order to pack sets of 4 LocalToWorld positions and SphereRadius components into a vertical SIMD format before the cull test, and to unpack the bool4 results back into the SphereVisible components afterwards. In fact, this is the reason we chose to store visibility in an int rather than a bool - to allow for faster unpacking. To process 4 spheres at a time we need to use an IJobChunk, and we need to remember to process the last few spheres individually if the number of spheres in a chunk isn’t neatly divisible by 4.

[BurstCompile] public void DoCulling(ref SystemState state)
{
   var query = SystemAPI.QueryBuilder()
       .WithAll<LocalToWorld, SphereRadius>()
       .WithAllRW<SphereVisible>()
       .Build();
  
   state.Dependency = new CullJob
   {
       LocalToWorldTypeHandle = SystemAPI.GetComponentTypeHandle<LocalToWorld>(true),
       RadiusTypeHandle = SystemAPI.GetComponentTypeHandle<SphereRadius>(true),
       FP = _planes,
       VisibilityTypeHandle = SystemAPI.GetComponentTypeHandle<SphereVisible>()
   }.ScheduleParallel(query, state.Dependency);
   state.Dependency.Complete();
}

[BurstCompile(OptimizeFor = OptimizeFor.Performance)]
partial struct CullJob : IJobChunk
{
   [ReadOnly] public ComponentTypeHandle<LocalToWorld> LocalToWorldTypeHandle;
   [ReadOnly] public ComponentTypeHandle<SphereRadius> RadiusTypeHandle;
   [ReadOnly] public NativeArray<float4> FP;
  
   public ComponentTypeHandle<SphereVisible> VisibilityTypeHandle;
  
   public void Execute(in ArchetypeChunk chunk, int unfilteredChunkIndex, bool useEnabledMask, in v128 chunkEnabledMask)
   {
       // Get arrays of the components in this chunk that we're interested in.
       // Reinterpret the data as floats to make it easier to manipulate for packing
       var chunkTransforms = chunk.GetNativeArray(ref LocalToWorldTypeHandle).AsReadOnly();
       var chunkRadii = chunk.GetNativeArray(ref RadiusTypeHandle).Reinterpret<float>();
       var chunkVis = chunk.GetNativeArray(ref VisibilityTypeHandle);
      
       var p0 = FP[0];
       var p1 = FP[1];
       var p2 = FP[2];
       var p3 = FP[3];
       var p4 = FP[4];
       var p5 = FP[5];

       for (var i = 0; chunk.Count - i >= 4; i += 4)
       {
           // Load 4 float3 positions, then "shuffle" them into vertical Xs, Ys and Zs 
           var a = chunkTransforms[i].Position;
           var b = chunkTransforms[i+1].Position;
           var c = chunkTransforms[i+2].Position;
           var d = chunkTransforms[i+3].Position;
           var Xs = new float4(a.x, b.x, c.x, d.x);
           var Ys = new float4(a.y, b.y, c.y, d.y);
           var Zs = new float4(a.z, b.z, c.z, d.z);
          
           // Grab 4 radii in a single float4 
           var Radii = chunkRadii.ReinterpretLoad<float4>(i);
          
           // Test each of the 6 planes against the 4 shuffled spheres
           bool4 mask =
               p0.x * Xs + p0.y * Ys + p0.z * Zs + p0.w + Radii > 0.0f &
               p1.x * Xs + p1.y * Ys + p1.z * Zs + p1.w + Radii > 0.0f &
               p2.x * Xs + p2.y * Ys + p2.z * Zs + p2.w + Radii > 0.0f &
               p3.x * Xs + p3.y * Ys + p3.z * Zs + p3.w + Radii > 0.0f &
               p4.x * Xs + p4.y * Ys + p4.z * Zs + p4.w + Radii > 0.0f &
               p5.x * Xs + p5.y * Ys + p5.z * Zs + p5.w + Radii > 0.0f;

           chunkVis.ReinterpretStore(i, new int4(mask));
       }

       // In case the number of entities in this chunk isn't neatly divisible by 4, cull the last few spheres individually
       for (var i = (chunk.Count >> 2) << 2; i < chunk.Count; ++i)
       {
           var pos = chunkTransforms[i].Position;
           var radius = chunkRadii[i];

           int visible =
               (math.dot(p0.xyz, pos) + p0.w + radius > 0.0f &&
                math.dot(p1.xyz, pos) + p1.w + radius > 0.0f &&
                math.dot(p2.xyz, pos) + p2.w + radius > 0.0f &&
                math.dot(p3.xyz, pos) + p3.w + radius > 0.0f &&
                math.dot(p4.xyz, pos) + p4.w + radius > 0.0f &&
                math.dot(p5.xyz, pos) + p5.w + radius > 0.0f) ? 1 : 0;
          
           chunkVis[i] = new SphereVisible { Value = visible };
       }
   }
}

When working on optimizations like this, you should profile before and after each change to ensure that the changes are working as expected. However, in the interests of drama and suspense, we’ve saved the results of this experiment until the end.

A great way to check progress of SIMD mathematics optimizations is to use the Burst Inspector to examine assembled code. For brevity, we’ve omitted the Burst-generated code from this guide, but the table below shows an approximation of the complexity of each version, gained by manually counting the number of mathematical operations (multiply, add, bitwise and logical ANDs and ORs) that would be performed in a scene with 100,000 spheres. For the purpose of counting operations, we’re assuming that every mathematical operation has the same performance impact, regardless of whether it’s add or multiply, scalar or SIMD.

The table below displays these counts alongside median CPU times recorded on a Standalone Development build running on a 2019 MacBook Pro (2.4 GHz 8-Core Intel Core i9). The Optimize For project setting (Project Settings > Burst AOT Settings > Optimize For) was set to Performance to get the most out of Burst’s performance optimizations. The setup code set JobsUtility.JobWorkerCount = 1 in order to exclude most of the scheduling costs. CPU times are the execution times of the jobs across all threads.

The notable results:

In general, counting the number of mathematical operations in a version of the algorithm is a good predictor of its performance.
Version 2 is slightly faster than version 1 despite processing at least 500K more mathematical operations. This is likely due to the removal of the break statement. The main reason that the performance difference between V1 and V2 isn’t more noticeable is that Burst had already figured out how to unroll the for loop in V1. With more complex jobs, this is not guaranteed.
Version 4 is the fastest version of the culling job, because it performs the smallest number of operations overall, but it’s not four times the speed of V1 or V2. This is due to the extra work it has to do to shuffle the data from components into a packed SIMD format every frame. If the other aspects of the game’s data design made it possible to store sphere position and radius data as SIMD data all the time and avoid the conversion entirely, V4 could be considerably faster.

5. Know your compiler

For most purposes, Burst is extremely straightforward to use: adding Burst compilation to a job or function pointer is as simple as adding the [BurstCompile] attribute. In the right circumstances, Burst can automatically vectorize your code, giving you some of that SIMD shuffle magic without any effort on your part.

However, to squeeze the final bit of performance out of your target hardware, your high level code is only ever as good as the compiler. Sometimes you need to control your compiler’s behavior to get the best out of it, and for that you need to know about its capabilities and what hints you need to provide to allow it to generate the best possible machine code.

The Unity.Burst.Intrinsics namespace is where you’ll find the intrinsics for your specific target CPU architectures. The Intrinsics API reference shows you every available command to get right down to the metal when you really need it. The Unity blog post “Bursting into 2021 with Burst 1.5” shows some clear examples of how to use intrinsics to get the most out of a particular CPU architecture, using Arm Neon as an example.
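
To give a flavor of what intrinsics code looks like, here is a minimal sketch that adds two sets of four floats using an SSE instruction on x86 or a Neon instruction on Arm, with a scalar fallback. The class and method names are hypothetical; the intrinsics themselves (X86.Sse.add_ps, Arm.Neon.vaddq_f32 and the matching IsSseSupported / IsNeonSupported checks) are from Unity.Burst.Intrinsics.

```csharp
using Unity.Burst;
using Unity.Burst.Intrinsics;

// Hypothetical container class, for illustration only.
[BurstCompile]
public static class IntrinsicsExample
{
    [BurstCompile]
    public static void Add4(in v128 a, in v128 b, out v128 result)
    {
        if (X86.Sse.IsSseSupported)
        {
            // x86: add four packed single-precision floats (addps).
            result = X86.Sse.add_ps(a, b);
        }
        else if (Arm.Neon.IsNeonSupported)
        {
            // Arm: add four packed single-precision floats.
            result = Arm.Neon.vaddq_f32(a, b);
        }
        else
        {
            // Fallback for any other target: add the lanes one by one.
            result = new v128(
                a.Float0 + b.Float0,
                a.Float1 + b.Float1,
                a.Float2 + b.Float2,
                a.Float3 + b.Float3);
        }
    }
}
```

The IsXXXSupported properties are evaluated by Burst at compile time, so only the branch that matches the target architecture should end up in the generated code. For something as simple as an add, float4 from Unity.Mathematics is likely to produce the same instructions with far less effort; intrinsics tend to pay off for operations that the math library can’t express directly.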