March 2015

Data Structures and Algorithms

By Ben Kester

ben-kester“In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful… I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important.” Linus Torvalds

The way we store information and the methods we use to solve problems are the main drivers for efficiency in a program. Languages and hardware evolve, and it is easy to cut corners with sloppy code. While this may not cause problems right now, what about when the project scope expands or when the amount of data increases tenfold?

 

First, Some Notation

How do we measure an algorithm’s efficiency? An easy way to do that is to calculate the number of times a line of code will be run. Say you have a list of N items. You have 100 lines of initialization and 10 lines of code in a loop which pairs each item against another. That loop will run N^2 times, so your code will run 10N^2 + 100 times. Let’s call this Algorithm A.

 

You have another algorithm (B) with 1,000 lines of initialization and 100 lines of code within the loop. However, the loop is just run N times, so your total number of lines is 100N + 1,000.

 

Which one is better? For N <= 15,="" a="" will="" run="" quicker.="" the="" difference="" will="" be="" minimal,="" though,="" because="" they="" will="" both="" run="" very="" quickly="" for="" very="" small="" n’s.="" if="" n="100," b="" will="" be="" 9.1="" times="" faster="" than="" a.="" if="" n="1,000," n="" will="" be="" 99.01="" times="" faster.="" at="" n="10,000," the="" factor="" is="" 999.="" consider="" this="" exponential="" growth="" in="" light="" of="" the="" very="" large="" datasets="" you="" may="" be="" working="" with.=""> </=>

 

 

 

 

 

 

As you can see, the constant and even the coefficient have little impact on algorithm efficiency. We don’t care about the number of lines within a loop nearly as much as the number of times the loop is called. Computer scientists use “Big O” notation to address this. Algorithm A would run at O(N^2) whereas B would be a speedy O(N).

 

Most algorithms aren’t clean enough to give a definitive Big O result. For instance, an algorithm could be O(N) in the best case, O(N log2N) in the average case, and O(N^2) in the worst case, depending on the particular data input. We will mostly look at the average case and worst case; your specific situation may come into play.

Keep in mind that when counting lines of code, you also have to consider the methods you are calling, including system libraries. If your O(N) method calls another O(N) method within the loop, it is actually an O(N^2) method.

Lists

We can’t dive into algorithms quite yet. We first must cover some basic data structures.

There are two primary ways to store data—arrays and linked lists. You are doubtless familiar with arrays. These are sets of data, indexed by an integer. [1]They are excellent if you know the number of items in the list ahead of time, don’t have to add/remove many items, and especially if you want to access the Mth item. Implementations of arrays will dynamically resize the array as more data is added. Arrays are generally the preferred data type because they perform much better in accessing data.

Linked lists work a little differently. [2]Each item (a node) on the linked list has three parts: the data, a pointer to the next item, and a pointer to the previous item. You can imagine how linked lists are better if you don’t know the number of items in the list or you have to add/remove an item. They suffer in performance if you have to reference items by index.

 

The following table summarizes the average efficiency of these two list types:

  Array  Linked List 
Add to end  0(1)  0(1) 
Add to beginning  0(N)  0(1) 
Add to middle  0(N)  0(1) 
Remove from end  0(1)  0(1) 
Remove from Middle  0(N)  0(1) 
Iterate through all  0(N)  0(N) 
Search by index  0(1)  0(N) 

Stacks and Queues

These objects are the next level of list types. They are based on either an array or a linked list and don’t add any additional functionality. Rather, they provide structure for how we use them.

Stacks are LIFO (last in, first out) lists with two main features. You can “push” a new item onto the top of the stack and “pop” an item off the top. One handy feature of a stack is that it can replicate any recursive code. We may delve into this at some point because recursive methods are so common to actuaries.

If you spent time in the United Kingdom, you know they refer to a line of people as a queue. This is a better technical term than we use and it describes the programming behavior as well. Queues have two features similar to stacks: “enqueue” adds an item to the end of the queue, while “dequeue”

A good exercise for you is to write a stack and a queue data structure. If you need help, there are plenty of online resources. Though most libraries provide the functionality, [3]it’s helpful for you to understand how they work. Email me your code (benjamin@passedtense.com) if you want feedback.

 

That’s all for this edition. Next time, we’ll survey sorting algorithms. If you want to see particular algorithms or have a difficult or interesting problem to solve, shoot me a line. We may well be able to cover it as a future topic.

[1] In almost every programming language (except VBA), arrays start with an index of zero. So an array of five items would go from array[0] to array[4]. For VBA, the default start index is one.

[2] Assume the common double linked list.

[3]Again, VBA is an exception. Can you tell which language to avoid if at all possible?

 

Ben Kester, FSA, MAAA, Technology and Research, Coaching Actuaries. He can be contacted at bkester@coachingactuaries.com.