Jan 9 2012

Casting just a little faster

When choosing to cast in C#, there are at least two obvious patterns.

Explicit Casting

var foo = (Foo)instance;

‘as’ Keyword Casting

var foo = instance as Foo;

These two casting options have very different performance characteristics which should be considered. In tight loops you may wish to choose ‘as’ casting just for the performance benefits, which I’ve measured as high as 50% over the alternative.

Although this may seem like just a syntactic shortcut, under the covers it’s actually two different sets of behaviors, which account for the different costs.

Explicit Casting

IL_0008:  castclass  Background_Activator_Test.Foo

‘as’ Keyword Casting

IL_000f:  isinst     Background_Activator_Test.Foo
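Keep in mind the two opcodes also fail differently: castclass throws an InvalidCastException while isinst simply yields null, so ‘as’ casts need a null check. A small sketch of both behaviors, using string as a stand-in target type:

```csharp
using System;

public static class CastDemo
{
    // 'as' returns null on failure instead of throwing; this is the
    // observable difference between the isinst and castclass opcodes.
    public static string TryAsCast(object instance) => instance as string;

    public static bool ExplicitCastThrows(object instance)
    {
        try
        {
            _ = (string)instance; // emits castclass
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }
}
```

If you switch a hot path to ‘as’ casting, make sure the subsequent null handling doesn’t silently swallow a genuine type mismatch.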

Go on... Give it a go.


Dec 18 2011

The Pre-emptively active Activator

In some high-performance scenarios you may need to create objects at a higher rate than conventional methods can support. This can also be true for scenarios where the object creation process involves a large number of steps before the object is considered fully constructed and ready for use. This can often be the case where surges of requests cause applications to breach Quality of Service (QoS) thresholds.

In cases such as this, a background object builder can help offset the expense of construction and give the host some additional capacity headroom during surges. Here is a sample implementation which will pre-build a queue of instances and refresh its internal queue when the size falls below a configurable threshold.

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq.Expressions;
using System.Reflection;
using System.Threading;

public sealed class BackgroundActivator<T> : IBackgroundActivator<T>
    where T : class
{
    private readonly Func<T> _callback;
    private readonly int _refreshThreshold;
    private readonly ConcurrentQueue<T> _queue;
    private readonly Func<T> _constructor;
    private readonly int _size;
    private volatile bool _runningRefresh;

    public BackgroundActivator(int size, int refreshThreshold)
        : this(size, refreshThreshold, null)
    {
    }

    public BackgroundActivator(int size, int refreshThreshold, Func<T> callback)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException("size");

        if (refreshThreshold >= size)
            throw new ArgumentOutOfRangeException("refreshThreshold");

        if (callback == null)
            _constructor = GetConstructor();

        _size = size;
        _refreshThreshold = refreshThreshold;

        _callback = callback;
        _queue = new ConcurrentQueue<T>();

        for (int i = 0; i < size; ++i)
            _queue.Enqueue(CreateInstanceInternal());
    }

    private static Func<T> GetConstructor()
    {
        ConstructorInfo constructorInfo = typeof(T).GetConstructor(new Type[0]);

        if (constructorInfo == null)
            throw new InvalidOperationException("The type '" + typeof(T) + "' does not have a default constructor");

        return Expression.Lambda<Func<T>>(Expression.New(constructorInfo)).Compile();
    }

    public T CreateInstance()
    {
        if (_queue.Count == 0)
            return CreateInstanceInternal();

        if (_queue.Count <= _refreshThreshold && _runningRefresh == false)
        {
            _runningRefresh = true;
            ThreadPool.QueueUserWorkItem(RefreshCallback, this);
        }

        T queuedItem;

        // Fall back to direct construction if another thread drained the
        // queue between the Count check and the dequeue
        return _queue.TryDequeue(out queuedItem) ? queuedItem : CreateInstanceInternal();
    }

    private void RefreshCallback(object o)
    {
        var list = new List<T>();

        for (int i = _queue.Count; i < _size; ++i)
            list.Add(CreateInstanceInternal());

        list.ForEach(t => _queue.Enqueue(t));
        _runningRefresh = false;
    }

    private T CreateInstanceInternal()
    {
        return _callback == null ? _constructor() : _callback();
    }
}

Let’s take a look at how we might use this. There are two basic usage scenarios that I can think of...

Activator Object Construction

First define an instance of the BackgroundActivator. In doing so you can specify the number of instances to queue, and the threshold at which the queue should be refilled.

var backgroundActivator = new BackgroundActivator<MyType>(200, 100);

Once we have an instance of the activator, we can just call the CreateInstance method any time we need an instance of the object.

var instance = backgroundActivator.CreateInstance();

Custom Object Construction

This scenario is ideal if your construction requirements are a bit more involved. In this case, a function is supplied which is called to perform the underlying construction.

var backgroundActivator = new BackgroundActivator<MyType>(200, 100, () =>
{
    var instance = new MyType();
    //TODO: Do more stuff
    return instance;
});

This is a pattern I use within my IoC Container and can really provide that little extra boost when you need it. Use in good health. :-)


Aug 16 2011

Log4Net... Friend or Foe?

Despite the title, I'm actually a big fan of Log4Net. While powerful, Log4Net can become the bottleneck for some applications so we need to explore these issues a little more closely so you can configure Log4Net in a way that’s optimal for your system. Identifying this as a potential performance bottleneck is relatively straightforward. If you have a repeatable performance test, profile the application with your normal production log configurations, and then again with all your logging disabled. If the change in performance is dramatic, then you may want to consider a new strategy.

Check logging levels before logging

To avoid repeating information from the authors of this component… as a general rule you should always test if the target log level is enabled before you actually log anything. This will offset some hidden costs which the vendor’s article details.
More information can be found on the vendor’s site (http://logging.apache.org/log4net/release/manual/internals.html).
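A self-contained sketch of the guard (using a stand-in logger type rather than log4net’s ILog, which exposes the same IsDebugEnabled-style flags) showing that the expensive message is never built when the level is disabled:

```csharp
// Stand-in for log4net's ILog, so the sketch is runnable on its own;
// with log4net you would test log.IsDebugEnabled in exactly the same way.
public sealed class DemoLog
{
    public bool IsDebugEnabled { get; set; }
    public int MessagesBuilt { get; private set; }

    public void Debug(string message) { /* forward to appenders */ }

    public string BuildExpensiveMessage()
    {
        MessagesBuilt++; // count the expensive work for demonstration
        return "state dump: ...";
    }

    public void GuardedDebug()
    {
        // The guard prevents the expensive message from ever being built
        // when the Debug level is disabled
        if (IsDebugEnabled)
            Debug(BuildExpensiveMessage());
    }
}
```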

Using the RollingFileAppender

As a file I/O-bound resource, this Appender can block your application from doing real work while the Appender flushes its buffer to disk. Profiling your application will uncover this and there are a couple options to mitigate this expense.

Mitigation strategies
  • Disable the immediate flush option. According to the Log4Net documentation, this can result in performance increases of 15-20% but has a risk that the final buffer may not be flushed to disk if the host process crashes
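As a sketch, the flush behavior is controlled in the appender configuration; the appender name, file path, and layout below are placeholders:

```xml
<appender name="RollingFile" type="log4net.Appender.RollingFileAppender">
  <file value="app.log" />
  <!-- Trade durability of the final buffer for throughput -->
  <immediateFlush value="false" />
  <layout type="log4net.Layout.PatternLayout">
    <conversionPattern value="%date %-5level %message%newline" />
  </layout>
</appender>
```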

Using the AdoNetAppender

As a database I/O-bound resource, this Appender can block your application from doing real work while the Appender flushes its buffer to the database. If you analyze this closely, you’ll see that the native Appender will execute a Stored Procedure or SQL command for each and every log entry to be sent to the database. Unlike the RollingFileAppender, buffering will never offset the cost of these database interactions. As a result, more invasive strategies will be required.

Mitigation strategies
  • Implement a custom Appender that inherits from AdoNetAppender and override the SendBuffer(LoggingEvent[]) method. At this point you can then change the implementation to suit your performance demands while preserving all the base class functionality such as managing connections, etc… Here are a few ideas on what you could implement:
    • Write to shared memory and offload to an external process to perform write-behind
    • Perform bulk inserts into the database
    • Batch multiple log entries into a single Stored Procedure call
    • Queue to an MSMQ and use an external process to perform write-behind
    • Etc…
  • Avoid using custom parameter formatters. Under the covers many of the underlying PatternConverters result in the creation of StringBuilders which can be quite expensive. If you’re building a custom implementation of this Appender, it may be more performant to handle this using different patterns
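As an illustrative sketch of the batching idea (the type and method names here are hypothetical, and a production implementation should prefer parameterized commands over string building), a buffer of entries can be collapsed into one multi-row INSERT instead of one command per entry, which is what a SendBuffer override would do:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: in a real custom appender this logic would live
// inside an override of AdoNetAppender.SendBuffer(LoggingEvent[]).
public static class LogBatcher
{
    public static string BuildBatchInsert(IEnumerable<(DateTime Time, string Level, string Message)> entries)
    {
        // One round-trip for the whole buffer instead of one per entry
        var sb = new StringBuilder("INSERT INTO Log (Date, Level, Message) VALUES ");
        sb.Append(string.Join(", ", entries.Select(e =>
            $"('{e.Time:yyyy-MM-dd HH:mm:ss}', '{e.Level}', '{e.Message.Replace("'", "''")}')")));
        return sb.ToString();
    }
}
```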

Hope this helps you find that little something extra and find performance that you need. :-)


Jul 11 2011

Being recursive

When authoring recursive routines there are special memory and performance considerations to be made. Firstly, authors must consider the overall size the call Stack could reach given the point in the Stack where the call starts and the number of Stack Frames that could be added as a result of the recursion. In doing so, you can estimate some key factors.

  1. What is the likelihood of a StackOverflowException?
  2. How much memory will be required to support all the Stack Frames added by the recursive call?

With just these two elements, we can assert a couple performance implications:

  • If your recursion is expected to create a large number of Stack Frames, then your product may suffer from the direct impact of managing a large chunk of memory for the Stack and the indirect expense of collapsing the recursion results when the recursion is completed. In this case, optimizations may be valuable and will very likely result in an easily measurable performance advantage
  • In cases where you’re creating very shallow recursions (e.g. few Stack Frames) this is rarely a concern unless the state on the stack is very large

Assuming you need to optimize, you can try and achieve a Tail call (a.k.a. tail-recursive call). Before we go on, let’s define more precisely what a Tail Call is.

“In computer science, a tail call is a subroutine call that happens inside another procedure and that produces a return value, which is then immediately returned by the calling procedure. The call site is then said to be in tail position, i.e. at the end of the calling procedure. If a subroutine performs a tail call to itself, it is called tail-recursive. This is a special case of recursion.

Tail calls are significant because they can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is not needed any more, and it can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls). The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called tail call elimination, or tail call optimization.” --- Wikipedia

As explained in the definition, tail calls will prevent your recursion from producing any more than two Stack Frames. This is because the second Frame will continually be reused for each point of recursion.

Unfortunately, achieving tail calls in .NET is somewhat difficult. There are two commonly used patterns for trying to achieve this:

  1. Modify your method structure to make it eligible for tail-elimination optimizations by the C# compiler. To determine if this has already happened for your method you simply need to review the IL and see if the following IL is present near the end of your method
    IL_001a: tail

    If the tail IL is already present, then you’re all done… well done! If not, then making your method eligible simply involves modifying the method so that the recursive call is the last statement in the method
    private int ARecursiveMethod(int arg)
    {
        //TODO: Do work
        return ARecursiveMethod(arg);
    }
  2. Force a tail call. Unfortunately, the compiler’s exception list for automatically applying this optimization is very long. Consequently there are a number of reasons this will not occur automatically. As it turns out, the list of reasons is largely a result of just-in-case scenarios where they want to avoid breaking your method. Although a more advanced set of steps is required, you can force tail calls to occur if your process requires it. Here is what you need to do:
  • Compile the parent assembly as normal
  • Extract the IL with ILDASM
  • Modify the IL to insert the tail instruction
  • Reassemble the assembly with ILASM
  • Test, test, test…

Note: This can all be automated but obviously carries some risk. In my experience, this should occur in CI between build time and Unit Test runs
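To make the eligibility point concrete, here is a small sketch contrasting a non-tail factorial with an accumulator form whose recursive call is the last operation. Note this only makes the method a candidate; the C# compiler and JIT may still decline to emit the tail prefix on some platforms:

```csharp
public static class TailDemo
{
    // Not in tail position: the multiply happens after the recursive call
    // returns, so every level of recursion must keep its frame alive.
    public static long Factorial(long n) =>
        n <= 1 ? 1 : n * Factorial(n - 1);

    // Accumulator form: the recursive call is the final operation, which
    // makes the method eligible for tail-call elimination.
    public static long FactorialTail(long n, long acc = 1) =>
        n <= 1 ? acc : FactorialTail(n - 1, acc * n);
}
```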

Have Fun!


Jun 9 2011

Lost Priorities

A common practice when developing real-time systems is to elevate the priority of the host process. A misconception that is often held is that raising a process’s priority will implicitly provide greater priority to threads spawned within the host process. In truth, this only applies to the main thread and has no impact on threads that were spawned during host execution. To ensure new threads are synchronized with the host process, you’ll need to perform the following:

Create a means of mapping between process and thread priorities. Below is an example of a simple mapping which could serve a number of application needs.

public static class ProcessExtensions
{
    public static ThreadPriority GetThreadPriority(this Process process)
    {
        switch (process.PriorityClass)
        {
            case ProcessPriorityClass.AboveNormal:
                return ThreadPriority.AboveNormal;
            case ProcessPriorityClass.BelowNormal:
                return ThreadPriority.BelowNormal;
            case ProcessPriorityClass.High:
                return ThreadPriority.Highest;
            case ProcessPriorityClass.Idle:
                return ThreadPriority.Normal;
            case ProcessPriorityClass.Normal:
                return ThreadPriority.Normal;
            case ProcessPriorityClass.RealTime:
                return ThreadPriority.Highest;
            default:
                return ThreadPriority.Normal;
        }
    }
}

Set the priority of each thread you expect to be mapped to the host process’s priority:

var _workerThread = new Thread(WorkerStart)
{
    IsBackground = true,
    Priority = Process.GetCurrentProcess().GetThreadPriority(),
    Name = "My Worker Thread"
};

That's it, use it in good health.


May 16 2011

Stay out of the JIT's way

The JIT compiler will, unless told otherwise, apply a number of optimizations at runtime to address both minor and major inefficiencies in the code we write. Examples include the following:

  • Constant Folding
  • Constant and Copy Propagation
  • Method Inlining
  • Code Hoisting and Dominators
  • Loop Unrolling
  • Common SubExpression Elimination
  • Enregistration
  • And others…

Generally we don’t need to think about these details too much, although where hyper-performance concerns exist we should. The reason is because our coding patterns can lead to situations where the JIT decides to completely skip optimizations to ensure it doesn’t break the code we’ve written. Let’s take Method Inlining for example. Here is a list of JIT rules that guide its decision to optimize:

… Method Inlining
There is a cost associated with method calls; arguments need to be pushed on the stack or stored in registers, the method prolog and epilog need to be executed and so on. The cost of these calls can be avoided for certain methods by simply moving the method body of the method being called into the body of the caller. This is called Method In-lining. The JIT uses a number of heuristics to decide whether a method should be in-lined. The following is a list of the more significant of those (note that this is not exhaustive):

  • Methods that are greater than 32 bytes of IL will not be inlined
  • Virtual functions are not inlined
  • Methods that have complex flow control will not be in-lined. Complex flow control is any flow control other than if/then/else; in this case, switch or while
  • Methods that contain exception-handling blocks are not inlined, though methods that throw exceptions are still candidates for inlining
  • If any of the method's formal arguments are structs, the method will not be inlined. …

As developers, there are a few basic behaviors we should follow to ensure we can take advantage of these optimizations:

  • Keep method size small (e.g. < 32 bytes of IL)
  • Keep the contents of a try-catch-finally block small by moving the code to be executed in the try and catch blocks to another method
  • Don’t make members virtual unless you have a clear need to do so
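A minimal sketch of the second behavior, splitting the guarded work out of the try block so the small hot method stays inline-friendly (the names and the overflow-handling policy are illustrative):

```csharp
using System;

public static class InlineFriendly
{
    // The hot method contains no exception-handling block, so it remains
    // a candidate for inlining; the try/catch lives one level up.
    public static int AddChecked(int a, int b)
    {
        try
        {
            return AddCore(a, b);
        }
        catch (OverflowException)
        {
            return int.MaxValue; // illustrative saturation policy
        }
    }

    // Tiny, no flow control, no EH block: a good inlining candidate
    private static int AddCore(int a, int b) => checked(a + b);
}
```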


Apr 16 2011

A few principles for writing blazing fast code in .NET

When authoring high performance applications, the following generalized rules can be considered valuable for creating highly performant capabilities:

Share nothing across threads, even at the expense of memory

  • Sharing across thread boundaries leads to increased preemptions, costly thread contention, and may introduce other less obvious expenses in L2 cache, and more
  • When working with shared state that is seldom or never updated, give each thread its own copy even at the expense of memory
  • Create thread affinity if the workload represents saga’s for object state, but keep in mind this may limit scalability within a single instance of a process
  • Where possible, isolating threads is ideal
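One way to sketch the give-each-thread-its-own-copy idea is ThreadLocal&lt;T&gt;; the lookup table below is illustrative:

```csharp
using System.Collections.Generic;
using System.Threading;

public static class ThreadLocalDemo
{
    // Each thread lazily builds its own copy of the lookup table, trading
    // memory for the removal of cross-thread contention on reads.
    private static readonly ThreadLocal<Dictionary<int, string>> Local =
        new ThreadLocal<Dictionary<int, string>>(() => new Dictionary<int, string>
        {
            { 1, "one" },
            { 2, "two" }
        });

    public static string Lookup(int key) =>
        Local.Value.TryGetValue(key, out var v) ? v : null;
}
```

This fits the seldom-updated case described above; if every thread’s copy must be refreshed, the duplication cost grows with the thread count.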

Embrace lock-free architectures

  • The fewer locks the better, which is obvious to most people
  • Understanding how to achieve thread-safety using lock-free patterns can be somewhat nuanced, so digging into the details of how the primitive/native locking semantics work and the concepts behind memory fencing can help ensure you have leaner execution paths
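As a small sketch, Interlocked covers many counter-style cases without taking a monitor lock at all:

```csharp
using System.Threading;

public static class LockFreeCounter
{
    private static long _count;

    // Interlocked compiles down to a single atomic CPU instruction
    // instead of acquiring and releasing a full monitor lock.
    public static void Increment() => Interlocked.Increment(ref _count);

    public static long Read() => Interlocked.Read(ref _count);
}
```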

# dedicated long-running Threads == Number of processing cores

  • It’s easy to just spin up another thread. Unfortunately, the more threads you create the more contention you are likely to create with them. Eventually, you may find your application is spending so much time jumping between threads that there is no time to do any real work. This is known as a ‘Live Lock’ scenario and is somewhat challenging to debug
  • Test the performance of your application using different threading patterns on hardware that is representative of the production environment to ensure the number of threads you’ve chosen is actually optimal
  • For background tasks that have more flexibility in how often they can be run, when, and how much work can be done at any given time, consider Continuation or Task Scheduler patterns and collapse them onto fewer (or a single) threads
  • Consider using patterns that utilize the ThreadPool instead of using dedicated long-running threads

Stay in-memory and avoid or batch I/O where possible

  • File, Database, and Network I/O can be costly
  • Consider batching updates when I/O is required. This includes buffering file writes, batch message transmissions, etc…
  • For database interactions, try using bulk inserts even if it’s only to temp tables. You can use Stored Procedures to signal submission of data, which can then perform ETL like functions in the database

Avoid the Heap

  • Objects placed on the Heap carry with them the burden of being garbage collected. If your application produces a large number of objects with very short lives, the burden of collection can be expensive to the overall performance of your application
  • Consider switching to Server GC (a.k.a multi-core GC)
  • Consider switching to Value Types maintained on the call stack and don’t box them
  • Consider reusing object instances. Samples of each of these can be found in the below Coding Guidelines

Use method-level variables during method execution and merge results with class-level variables after processing

  • Using shared variables that are frequently updated can create inefficiencies in how the call stack is managed and how L2 Cache’s behave
  • When working with relatively small variables, follow a pattern of copy-local, do work, merge changes
  • For shared state that is updated frequently by multiple threads, be aware of ‘False Sharing’ concerns and code accordingly
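The copy-local, do work, merge changes pattern might be sketched like this (the workload is illustrative):

```csharp
using System.Threading;

public static class CopyLocalMerge
{
    private static long _total;

    public static long Total => Interlocked.Read(ref _total);

    // Accumulate into a method-level variable and merge once at the end,
    // instead of contending on the shared field every iteration.
    public static void Work(int iterations)
    {
        long local = 0;
        for (int i = 0; i < iterations; i++)
            local += i;

        Interlocked.Add(ref _total, local);
    }
}
```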

Avoid polling patterns

  • Blind polling can lead to inefficient use of resources and reduce a system’s ability to scale, or reduce overall performance. Where possible, apply publish-subscribe patterns

Know what things cost

  • If you dig a little deeper into the Framework you may find some surprises with regard to what things cost. Take a look at Know What Things Cost.

Develop layers with at least two consumers in mind… Production & Tests

  • Developing highly performant systems requires a fair amount of testing. As such, each new layer/component needs to be testable in isolation so that we can better isolate performance bottlenecks, measure thresholds and capacity, and model/validate behavior under various load scenarios
  • Consider using Dependency Injection patterns to allow injection of alternate dependencies, mocks, etc…
  • Consider using Provider patterns to make selection of separate implementations easier. It’s not uncommon for automated test systems to configure alternate implementations to help suite various test cases
    • Ex. Layers that replicate network instability, layers that accommodate Bot-users that drive change in the system, layers that replicate external resources with predictable behaviours, etc…


Mar 6 2011

When going LowLatency doesn't actually help

The GC supports a variety of latency modes (e.g. System.Runtime.GCLatencyMode). These modes guide the behavior of the GC such that it manages its level of intrusiveness to the process it’s servicing. Of particular interest is the LowLatency mode, which can offer temporary performance enhancements. Setting this can be tricky and can actually degrade performance if not used correctly; however, under the correct circumstances this can be quite beneficial.

This setting is only meant to be used for very short periods of time when select processes need to run with minimal (but not zero) interruptions by the GC. Here are a few rules to guide its consideration:

  • Only consider this option if you’re using Workstation GC. If you’re using Server GC, this will provide you no value because LowLatency is not supported for Server GC
  • If your application causes relatively high rates of Generation 2 collection this could be a candidate. This will not reduce time in GC of Generation 0 or 1
  • If the system hosting your process is always under memory pressure, this will likely have little effect. This is because LowLatency constraints are bypassed if the OS signals low memory conditions
  • If you enable LowLatency, make sure any processes you have that would explicitly request collection are also disabled (especially gen 2), otherwise the effects will be lost
  • Only consider LowLatency if you have a clearly definable line-of-execution that you feel needs to run with little interruption
    • For Example: If your product follows a circuit-breaker pattern to guarantee a QoS, when the breaker is tripped you could enable Low Latency long enough to allow your process to recover itself, catch-up, etc. When the breaker is reset, revert LowLatency mode
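One way to scope the mode change (assuming Workstation GC, since LowLatency is not supported under Server GC) is a try/finally that always restores the previous mode:

```csharp
using System;
using System.Runtime;

public static class LowLatencyScope
{
    // Wrap only the critical line-of-execution, and always restore the
    // prior mode, even if the critical work throws.
    public static void Run(Action criticalSection)
    {
        GCLatencyMode previous = GCSettings.LatencyMode;
        try
        {
            GCSettings.LatencyMode = GCLatencyMode.LowLatency;
            criticalSection();
        }
        finally
        {
            GCSettings.LatencyMode = previous;
        }
    }
}
```

In the circuit-breaker example above, Run would wrap the recovery/catch-up work performed while the breaker is tripped.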


Feb 13 2011

Knowing what things cost

Category: Intellectual Pursuits. JoeGeeky @ 17:01

There is an old saying that knowing what things cost is half the battle, and in this case that is certainly true. Understanding the relative cost for various approaches can be important when you’re looking to get as many nanoseconds out of your process as possible. Here is a summary captured on relatively old equipment, but keep in mind we are looking to understand the relative costs for various choices. If you do a little profiling of your own you may find even more surprises.

Constructing Objects

new value type L1 2.6 ns
new value type L2 4.6 ns
new value type L3 6.4 ns
new value type L4 8.0 ns
new value type L5 22.9 ns
new ref type L1 20.3 ns
new ref type L2 23.9 ns
new ref type L3 27.5 ns
new ref type L4 30.8 ns
new ref type L5 34.4 ns

Note: times will go higher for ref types depending on ctor param arrangement

Arithmetic Operations

Addition 1.0 ns
Subtract 1.0 ns
Multiply 2.7 ns
Divide 35.7 ns
Shift 2.1 ns

Method Calls

static call 6.1 ns
instance call 6.8 ns
instance this call 6.2 ns
Inlined static call 0.2 ns
Inlined instance call 1.0 ns
Inlined instance this call 0.2 ns
virtual call 5.4 ns
this virtual call 5.4 ns
interface call 6.5 ns


Casting

cast up 1 0.4 ns
cast down 1 8.8 ns

Field and Property Access

get field 1.0 ns
get property 1.2 ns
set field 1.2 ns
set property 1.2 ns
get virtual property 6.3 ns
set virtual property 6.3 ns


Boxing

Box int 21.6 ns
Unbox int 3.0 ns


Delegates

delegate invocation 40.9 ns


Jan 1 2011

Dropping the locks

When you’re working on high performance products you’ll eventually find yourself creating some form of a Near-Cache. If things really need to go fast, you may start thinking about going lock-free. This may be perfectly reasonable, but let’s explore this a little to see where it makes sense.

For the purposes of this article, a Near-Cache is defined as any collection of state from durably persistent mediums that is maintained in-process and services requests concurrently. This may be a direct projection of state from a single medium or a custom View of state from many different mediums.

Typically, the implementation of Near-Cache patterns falls into a couple of broad categories.

Write-Once/Read-Many

  • As the name suggests, this is a cache that is seldom written to, and spends most of its time servicing read/lookup requests. A common example would be maintaining an in-memory View of static (or seldom updated) state in a database to offset the expense of calling back to the database for each individual lookup

Volatile Read-Write

  • This type of cache typically has data written-to or altered as frequently as its read. In many cases, caches of this type maintain state that is discovered through more dynamic processes, although it could also be backed against persistent or other semi-volatile mediums

The benefits and/or consequences of selecting either pattern varies depending on the needs of your application. Generally speaking, both approaches will save costs as compared to calling-out to more expensive external resources. In C# there are several patterns which could be used to create a Thread-Safe Near-Cache. Exploring two samples to demonstrate opposite ends of the spectrum will help reveal their relative merits. Along the way I'll identify some criteria that may be useful in determining which is the most appropriate for a given scenario.

The Full-lock Cache

This pattern is suitable for either Write-Once/Read-Many or Volatile Read-Write caches

using System.Collections.Generic;

public static class FullLockCache
{
    private static IDictionary<int, string> _cache;
    private static object _fullLock;

    static FullLockCache()
    {
        _cache = new Dictionary<int, string>();
        _fullLock = new object();
    }

    public static string GetData(int key)
    {
        lock (_fullLock)
        {
            string returnValue;
            _cache.TryGetValue(key, out returnValue);
            return returnValue;
        }
    }

    public static void AddData(int key, string data)
    {
        lock (_fullLock)
        {
            if (!_cache.ContainsKey(key))
                _cache.Add(key, data);
        }
    }
}

  • The locking and Thread-Safety semantics are easily understood
  • Unless you have high numbers of threads accessing the cache concurrently or have extremely high performance demands, the relative performance of this pattern can be quite reasonable
  • This can be relatively easy to test and could be made easier with subtle extensions
  • The consequences of this cache pattern would be the full lock placed on each call. This can start to manifest as higher contention rates for applications that have large numbers of threads or fewer threads with very high performance demands

The Lock-Free Cache

This pattern is only suitable for a Write-Once/Read-Many cache

using System.Collections.Generic;
using System.Threading;

public static class NoLockDictionaryCache
{
    private static IDictionary<int, string> _cache;

    static NoLockDictionaryCache()
    {
        _cache = new Dictionary<int, string>();
    }

    public static string GetData(int key)
    {
        string returnValue;
        _cache.TryGetValue(key, out returnValue);
        return returnValue;
    }

    public static void LoadCacheFromSource()
    {
        IDictionary<int, string> tempCache = new Dictionary<int, string>();

        //TODO: Load Cache

        Interlocked.Exchange(ref _cache, tempCache);
    }
}

  • No locking is required
  • Read performance of a cache is significantly improved
  • This can be relatively easy to test and could be made easier with subtle extensions
  • Writes can be relatively slow when compared to other patterns. This is a result of reloading the entire cache, or cloning and then updating the existing cache
  • The Hashtable and ConcurrentDictionary types work very well in place of the Dictionary. These perform almost the same as the Dictionary
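For a Volatile Read-Write cache, the ConcurrentDictionary mentioned above can stand in for both samples without any user-managed lock; a minimal sketch:

```csharp
using System.Collections.Concurrent;

public static class ConcurrentCache
{
    // ConcurrentDictionary provides thread-safe reads and writes at a
    // small cost over a plain Dictionary, with no lock object to manage.
    private static readonly ConcurrentDictionary<int, string> Cache =
        new ConcurrentDictionary<int, string>();

    public static string GetData(int key) =>
        Cache.TryGetValue(key, out var value) ? value : null;

    public static void AddData(int key, string data) =>
        Cache.TryAdd(key, data);
}
```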
