Algorithms for People Who Value Their Time — part 1

Tom Palczewski
18 min read · Jun 28, 2021

It’s time for Algorithms ;) Somebody said that by studying algorithms you might become a better programmer and gain a better understanding of how to do ETL in an efficient way. Someday maybe you will even come up with your own new algorithm, but let’s start from the basics.

BTW, those little things are also useful if you want to look good at technical interviews.

And the last point: somebody also said that algorithms are good for your brain. I’m not sure that’s true, but it’s hard to reject this hypothesis with the data that we have.

First of all, what are algorithms? An algorithm is just a set of computations that takes some kind of input and produces a specific kind of output — so now data engineers think they already have this and it’s called ETL. What’s the difference? Rather small: with algorithms you try to do things in an efficient way (in some ETLs you don’t need to, depending on requirements), and in general algorithms are theoretical instructions for solving a problem. Ok, so let’s not waste time on those discussions and jump into algorithms.

But, I have a question about this efficiency. What does that mean?

In general, by efficient algorithms we mean algorithms that are fast and use a small amount of resources. So we have arrived at a point where one can see the connection between algorithms and technology. Total system performance is a combination of efficient instructions and efficient hardware.

How do we measure efficiency of algorithms?

Simple, use something called “Asymptotic Notation”. Ok, what is it? The big picture is that it’s an estimation based on your understanding of the instructions. Somebody highly skilled in algorithms said that you can just remember seven words:

“suppress constant factors and lower-order terms” [1]

Sounds cool, so we just need to focus on the important things, right? Yes, exactly, and here comes Big O notation — a mathematical notation that tells us the limiting behavior of a function when its arguments go towards infinity (or some high numbers). I hope that now we can see that Big O notation tells us what runtimes we should see when using our function/algorithm. Cool, so we have a tool to check efficiency and compare different algorithms.

What should I know about Big O?

I would say that it would be great if you knew that you can use it to check the efficiency of algorithms, and that’s probably it. However, if you are really into that, you should check the references below and do not forget to also look up Big-Omega and Big-Theta notations.

Ok, and now an important thing:

(Big O complexity chart taken from https://danielmiessler.com/study/big-o-notation/)

So depending on the algorithm, you might get your outputs almost instantly or wait for days, years, …

So we see O(1), O(logN), O(N), O(NlogN), O(N²), O(2^N), O(N!), from fastest to slowest.

O(1) — constant complexity — 1 item = 1 time unit, 10 items = 1 time unit, … always 1 time unit.

O(logN) — logarithmic complexity — 1 item = 1 time unit, 10 items = 2 time units, 100 items = 3 time units.

O(N) — linear complexity — 1 item = 1 time unit, 10 items = 10 time units.
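To make this a bit more concrete, here is a small sketch (the helper names and the search example are mine, just for illustration) that counts the steps of a linear scan, O(N), versus a binary search, O(logN), over the same sorted list:

def linear_search_steps(arr, target):
    # O(N): scan left to right, counting comparisons
    steps = 0
    for x in arr:
        steps += 1
        if x == target:
            break
    return steps

def binary_search_steps(arr, target):
    # O(logN): halve the search space on every comparison
    steps, lo, hi = 0, 0, len(arr) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if arr[mid] == target:
            break
        elif arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

data = list(range(1_000_000))
print(linear_search_steps(data, 999_999))  # -> 1000000 steps
print(binary_search_steps(data, 999_999))  # -> ~20 steps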

I think you got the point: we have a notation, we know what we need to drop, and we can evaluate algorithms and compare them.

Ok, so we know why, what, and how to evaluate. It’s time for our first algorithms.

Insertion Sort — yes, I always wanted to know how to sort things. This sorting technique is similar to sorting a hand of cards. We will see that this algorithm is only efficient for sorting a small number of elements but we need to start somewhere.

So the idea is simple. Let’s keep this example of cards: your left hand is empty at the beginning, you pick a card and put it into your left hand in the right position. To find the correct position, you need to compare the card you picked with all the other cards that you have in your left hand.

Ok, cool, it’s time for pseudocode. We have an input, let’s call it C, and as an output we want to have a sorted C. We already defined our instructions above, so we have our first algorithm, let’s call it A.

C → A → Sorted C. So we are done, right? Ok, not so fast: this is valid pseudocode for every sorting algorithm, so let’s go a little bit deeper.

Let’s define A = [1, 3, 2, 7, 6, 5, 4]

we want our output to look as follows: [1, 2, 3, 4, 5, 6, 7]

so we can do Insertion-Sort as follows:

[1] for element = 2 to the length of A
[2]     key = A[element]
[3]     i = element - 1
[4]     while i is larger than 0 and A[i] is larger than key
[5]         A[i+1] = A[i]
[6]         i = i - 1
[7]     A[i + 1] = key

looks complicated, so what exactly is happening here? You start with the second element, as we know the first one you just put into your left hand. Your element in our example is 3 (key = 3); now you want to check what you have in your hand, so you start with i = 2 - 1 = 1, your first card, which in our example is 1 → A[i] = A[1] = 1. We will not go into the while loop, as A[1] = 1 is smaller than our key 3. We should now check the other cards in the hand, but at this point we have just one, so that’s it. We are now in line 7 of the first iteration of the loop from line 1: we just set A[i+1] = key → A[2] = 3. After the first iteration we have [1, 3].

We continue this journey; let’s do one more iteration so we fully understand this approach. The next element is the third one, A[3] = 2. We are in line 2 again: key = 2; in line 3: i = 2, and we know that A[2] = 3 is larger than the key, so we go into the while loop. We are in line 5 now: A[3] = A[2], so we moved our card 3 to the third position. We then compare our card 2 with card 1, but we do not move anything; we hit the last line, set A[2] to be 2, and as a result we have [1, 2, 3].

So now, it’s time for python.

def insertionSort(arr):
    '''
    :param arr: our array of integers
    :output: the same array, sorted in place
    '''
    for el in range(1, len(arr)):
        # el goes from 1 to the end of the array; in our pseudocode we
        # started from 2, but in Python indices start from 0, so arr[1]
        # means the second element of the array
        key = arr[el]
        j = el - 1
        # note the j >= 0: with 0-based indices we still need to
        # compare against the first element, arr[0]
        while j >= 0 and key < arr[j]:
            arr[j + 1] = arr[j]
            j = j - 1
        arr[j + 1] = key
    return arr

We have it! We can copy this into a jupyter notebook (for example, or anything else you like) and test it.
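For example, with our array from above:

a = [1, 3, 2, 7, 6, 5, 4]
print(insertionSort(a))  # -> [1, 2, 3, 4, 5, 6, 7]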

Ok, so we did our first sorting algorithm. Is it efficient?

So we see that we need to go through all elements of the array (line 1, the for loop) — this alone is linear time complexity, O(N) — but then we need to compare the elements. If the array is already sorted, the while loop does nothing, so the best case is O(N); if we need to compare everything with everything, we get closer to O(N²). To be a little bit more precise, the time complexity in this case is O(n + f(n)), where f(n) is the count of inversions.

What’s next? I kinda like those algorithms. Let’s stay in sorting for a little bit longer and look at Merge Sort (btw, it was invented in 1945). It’s old, but it is still commonly used in many practical applications. So again we have our

A = [1, 3, 2, 7, 6, 5, 4]

we want our output to look as follows: [1, 2, 3, 4, 5, 6, 7]

what if, instead of going through all the elements from left to right, we try to divide this array into smaller chunks, sort them, and at the end merge?

Let’s try to do this and see what happens.

We start with [1, 3, 2, 7, 6, 5, 4]

we split it into two ~halves → [1, 3, 2, 7] and [6, 5, 4]

then we have [1,3] and [2,7] and [6,5] and [4]

on the next split we will have [[1], [3]], [[2], [7]], [[6], [5]], [[4]]

wait, and what now? Now you sort and merge back? So let’s look at those two-element arrays:

so [1] and [3] merge back into [1, 3] unchanged, [2] and [7] stay the same as [2, 7], then you switch 5 with 6 and get [5, 6], and the last one is just [4]

you merge that back, ok show me how?

So now we want an algorithm that will take as input two sorted arrays, let’s say C and D, both of size n/2, and output a sorted array B of size n. For simplicity let’s assume for a second that n is even, so we can do something like this:

i = 1
j = 1
for k = 1 to n do
    if C[i] < D[j] then
        B[k] = C[i]
        i = i + 1
    else
        B[k] = D[j]
        j = j + 1

cool, so we traverse our output with index k and our inputs with i and j (from left to right). We see which element is smaller, the one from C or from D, and we put the smaller one into B. So let’s go back to our example; I hope you remember that we were in this place — [1, 3] and [2, 7] and [6, 5] and [4]

so our first pair of C and D is [1,3] and [2,7]

we compare 1 with 2 and put 1 in the first position of B; i is incremented, so next we compare 3 with 2 and we put 2 in the second position of B; now j is incremented and we compare 7 with 3 and we put 3 in the third position of B, and then we have 7 → [1, 2, 3, 7]

we do the same for [6,5] and [4] → [4,5,6]

and we run merge again on [1,2,3,7] and [4,5,6] → [1,2,3,4,5,6,7] and we are done. It’s a lot to process, I know.

So let’s try the python code and maybe things will be clearer.

def mergeSort(arr):
    if len(arr) > 1:
        mid = len(arr) // 2  # we need to divide into ~halves
        Left = arr[:mid]
        Right = arr[mid:]
        # Sort the first half
        mergeSort(Left)
        # Sort the second half
        mergeSort(Right)

        i = j = k = 0  # we know that we will need 3 indices
        while i < len(Left) and j < len(Right):
            if Left[i] < Right[j]:
                arr[k] = Left[i]
                i += 1
            else:
                arr[k] = Right[j]
                j += 1
            k += 1
        # In the pseudocode above we assumed that n is even, but in
        # real life we need to check if some elements were left over
        # and put them into the right places
        while i < len(Left):
            arr[k] = Left[i]
            i += 1
            k += 1
        while j < len(Right):
            arr[k] = Right[j]
            j += 1
            k += 1

Again, let’s put this into a jupyter notebook and see if it works:
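For example (mergeSort sorts the list in place):

a = [1, 3, 2, 7, 6, 5, 4]
mergeSort(a)
print(a)  # -> [1, 2, 3, 4, 5, 6, 7]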

If you are still confused, try watching this video

https://www.youtube.com/watch?v=JSceec-wEyw from GeeksforGeeks.

Ok, that was cool and a little bit complicated but we made it. So is it better than insertion sort?

The time complexity of Merge Sort is O(n log n): intuitively, there are about log n levels of splitting, and each level does O(n) work while merging. You can find the full time complexity analysis in the resources listed in the references. So if somebody asks you how fast you can sort, you can now say O(n log n) is easy for me.

So that was our second algorithm, and this time we used a technique called “divide and conquer”: we were splitting our array and then merging it back while sorting. What’s next? Are there any other sorting algorithms that I should know?

Quick Sort — this time we will use a pivot point. The big picture is as follows: select your pivot point, put everything smaller than the pivot to the left and everything larger to the right, then recursively sort the first part and the second part. So you have probably started to think that efficiency will depend on how we select the pivot point, and you might be on the right track. In general, you can think about a simple selection like the first element, the last element, a random element, or maybe the median as the pivot point. Again, if you want to read a nice analysis of this problem, please have a look at the references below. Here I’ll focus only on a simple implementation.

Ok, so again we are talking about a divide and conquer approach. Instead of splitting into ~halves, we want to use a pivot point.

Let’s start with the pseudocode for sorting A[p, …, r]. We want to split A into two subarrays A[p, …, q-1] and A[q+1, …, r] such that all elements in the first array are smaller than or equal to A[q] and all elements of the second are larger than A[q]. After divide we want to conquer, so we sort those subarrays by recursively calling quicksort. The last step is of course to combine, but as the subarrays are already sorted in place, we do not need to do much here.

quicksort(A, p, r)
    if p < r
        q = partition(A, p, r)
        quicksort(A, p, q - 1)
        quicksort(A, q + 1, r)

Now we see that the most important part here is to have a good partition procedure. In this sorting approach we compare values on the left side of the pivot with the ones on the right side; if we see that a value on the left side is larger than the pivot, we swap, and that’s how we sort the left and right sides.

Here is a video showing this approach:

https://www.youtube.com/watch?v=SLauY6PpjW4

Now, let’s try to write the pseudocode for the partition function:

partition(A, p, r)
    x = A[r]  # our pivot point element
    i = p - 1
    for j = p to r - 1
        if A[j] <= x
            i = i + 1
            swap A[i] with A[j]
    swap A[i + 1] with A[r]
    return i + 1

I noticed that for some people it is confusing when they watch those videos or walk through the explanations for the first time: they somehow think that the pivot point should be swapped, or that some value is larger but is not swapped, and so on. So let’s do our example here:

A = [1, 3, 2, 7, 6, 5, 4]

we want our output to look as follows: [1, 2, 3, 4, 5, 6, 7]

let’s say that at first we select the pivot point to be the second element of A, which is 3

so we start one pointer from the left side and the other from the right side (or we can have both pointers on the left side, move the pivot to the end, and then put it back, but let’s not make things too complicated and just go through one option):

we are on 1, and from the other side we are on 4; 1 is smaller than 3 and 4 is larger than 3, so we are good. We move to the second position, 3, and compare this with our pivot value 3, so we need to swap (if A[j] <= x). We move the pointer from the right side to find an element that is smaller than the pivot, so we go through 5, 6, 7, up to 2, and now we can swap, so we have:

[1, 2, 3, 7, 6, 5, 4]

and in a natural way we have two subparts, [1, 2, 3] and [7, 6, 5, 4]

so we see that [1, 2, 3] is already sorted, so let’s focus on the other part

we select the pivot to be, for example, 6 (again the second position), so we start one pointer on the left side and one on the right side; 7 is larger and 4 is smaller, so we swap and we have [4, 6, 5, 7]. We move the pointer to the next position and see 6, so we need to swap (again, if A[j] <= x); we see that 5 is smaller, so we swap and we have [4, 5, 6, 7]

Now we just need to combine both to get [1, 2, 3, 4, 5, 6, 7]. You can now re-do this example with both pointers going from the left, and hopefully you can figure that out too. I wonder if you can figure out whether the code snippet below moves the pointers from the left or from opposite sides.

I think we are ready for the python code:

def partition(arr, p, r):
    x = arr[r]  # our pivot point element
    i = p - 1
    for j in range(p, r):
        if arr[j] <= x:
            i = i + 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[r] = arr[r], arr[i + 1]
    return i + 1

def quicksort(arr, p, r):
    if p < r:
        part = partition(arr, p, r)
        quicksort(arr, p, part - 1)
        quicksort(arr, part + 1, r)
    return arr

Let’s put this into our jupyter notebook and test:
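For example:

a = [1, 3, 2, 7, 6, 5, 4]
print(quicksort(a, 0, len(a) - 1))  # -> [1, 2, 3, 4, 5, 6, 7]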

Cool, we have three different sorting algorithms in place. Can you guess the time complexity of Quick Sort? Yes, O(n log n) on average (with unlucky pivot choices the worst case degrades to O(N²)).

Does python use QuickSort, MergeSort, or InsertionSort?

Timsort is used. It is a hybrid sorting algorithm, derived from merge sort and insertion sort, designed to perform well on many kinds of real-world data. Time complexity: O(n log n).

In python you have two quick options — sort and sorted

# the sorted built-in function
a = [1, 3, 2, 7, 6, 5, 4]
print(a)      # -> [1, 3, 2, 7, 6, 5, 4]
sorted(a)     # returns [1, 2, 3, 4, 5, 6, 7] but a stays the same
print(a)      # -> [1, 3, 2, 7, 6, 5, 4]
b = sorted(a)
print(b)      # -> [1, 2, 3, 4, 5, 6, 7]
# the list.sort method sorts in place
a = [1, 3, 2, 7, 6, 5, 4]
a.sort()
print(a)      # -> [1, 2, 3, 4, 5, 6, 7]

If you want to sort strings, check out the key parameter and other options — see for example: https://realpython.com/python-sort/
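For instance, the key parameter lets you sort strings case-insensitively:

words = ["banana", "Cherry", "apple"]
print(sorted(words))                 # -> ['Cherry', 'apple', 'banana'] (uppercase sorts first)
print(sorted(words, key=str.lower))  # -> ['apple', 'banana', 'Cherry']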

What’s next? So we covered Insertion Sort, Merge Sort, and Quick Sort (just the basics, without going into details about partition selection — check out the randomized Quick Sort algorithm, which runs in O(n log n) time on average, for more excitement). So, to wrap up the sorting algorithms, let’s cover probably one of the simplest of them all — Bubble Sort.

The key idea is very simple: swap adjacent elements if they are in the wrong order. It’s so simple that we do not really need pseudocode, so let’s do python:

def bubblesort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr
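A quick check with the same example array:

print(bubblesort([1, 3, 2, 7, 6, 5, 4]))  # -> [1, 2, 3, 4, 5, 6, 7]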

so the time complexity, as we have two nested for loops, is of course O(N²). This closes our sorting algorithm part.

We touched on some ideas behind divide and conquer approaches and went through a few sorting algorithms. It’s time to move to the selection problem. Let’s say I have an unsorted array and would like to find the ith-smallest element of this array. At this point we know that we can sort this array in O(n log n) and then select what we want, but can we do this in a more efficient way?

So, as I mentioned, our first idea should be very simple right now:

def getith(arr, ith):
    # sort in O(n log n) and pick the ith smallest element (1-indexed)
    b = sorted(arr)
    return b[ith - 1]
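For example:

print(getith([1, 3, 2, 7, 6, 5, 4], 3))  # -> 3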

can we do something better?

If you are familiar with the HEAP data structure, you might think about using a Max Heap with capacity i, and that’s a good idea to start with. So let’s have a look, and then maybe we will go back to Quick Sort and see if there is something there that can help us solve this problem in an efficient way.

What is the heap data structure? It is a data structure that helps us with fast minimum or maximum computations. Ok, what is it exactly? A picture is worth a thousand words, so here you go, min and max heap:
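Here is a quick sketch with our example numbers (both are valid heaps over the same seven values):

min heap:                max heap:

        1                        7
      /   \                    /   \
     3     2                  6     5
    / \   / \                / \   / \
   7   6 5   4              1   3 2   4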

Ok, so some kind of trees. I see that for the min heap the root node has the minimum value, and for the max heap it has the maximum value.

So let’s go back to our example

A = [1, 3, 2, 7, 6, 5, 4]

now we want to get the ith-smallest element. Let’s say that we want the third smallest element; we see that our algorithm should give us 3 as an output, as the sorted array is [1, 2, 3, 4, 5, 6, 7]. So let’s solve this with a heap. We want the third smallest element of the array, so we start to build a max heap but set the capacity of the heap to be equal to 3 (as we want the third smallest element). We go through the array: we see 1, we put it in the heap; we see 3, we put it in the heap; we see 2, we put it in the heap. Now we see 7, but our capacity is 3 and 7 is larger than the current root, so we reject it, and we continue doing this. As an output we will have a heap with three elements, 3, 2, 1, and the root node will be the third smallest element (in our case 3).

In python the heap is provided by heapq. There is no built-in pair of min and max heap implementations (heapq gives you a min heap; for a max heap you typically negate the values), but you can solve this problem simply like that:

import heapq
a = [1, 3, 2, 7, 6, 5, 4]
heapq.heapify(a)
print(heapq.nsmallest(3, a)[-1]) # -> 3
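By the way, if you want to mirror the max-heap-with-capacity-i idea described above instead of calling nsmallest, here is a minimal sketch (ith_smallest is my own name for it); since heapq only provides a min heap, we store negated values to simulate a max heap:

import heapq

def ith_smallest(arr, i):
    # keep a max heap (negated values) of the i smallest elements seen so far
    heap = []
    for x in arr:
        if len(heap) < i:
            heapq.heappush(heap, -x)
        elif x < -heap[0]:
            # x is smaller than the current ith smallest, so it replaces the root
            heapq.heapreplace(heap, -x)
    return -heap[0]  # root of the max heap = the ith smallest element

print(ith_smallest([1, 3, 2, 7, 6, 5, 4], 3))  # -> 3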

So with the heap we moved from a time complexity of O(n log n) to something faster, as we are only asking about the i-th smallest numbers: O(n log i).

I hope that you get the point. Here is a good video that goes over this approach, but for the ith-largest element (in that case you would use a min heap):

https://www.youtube.com/watch?v=hGK_5n81drs

you can watch the whole video or stop at around the 8:25 mark. If you watched everything, you will know where we are going now.

So now we need to remind ourselves of Quick Sort’s partition function:

def partition(arr, p, r):
    x = arr[r]
    i = p - 1
    for j in range(p, r):
        if arr[j] <= x:
            i = i + 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[r] = arr[r], arr[i + 1]
    return i + 1

and now we can use the pivot point to reduce the problem size. Tell me how! Ok, so let’s do our example again:

[1, 2, 3, 7, 6, 5, 4]

and in a natural way we have two subparts, [1, 2, 3] and [7, 6, 5, 4]

What you can notice is that the pivot point is in the right place (it’s already in its final sorted position); on the left and right we have unsorted elements, but we know that the left side holds smaller and the right side holds larger values. So when we know that after partition our randomly selected pivot is in the 5th position and we are looking for, say, the 3rd smallest element, we know that we need to look at the left side of the pivot! Wow, you are right! We can discard the right side of the pivot, and we are reducing the search space. This is RSelect!

Code please,

RSelect — select the ith smallest element of the array:
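Here is a minimal sketch (it reuses the partition function from above and picks the pivot at random; the 1-indexed rank convention matches the rest of this post):

import random

def rselect(arr, p, r, i):
    # find the ith smallest element (1-indexed) of arr[p..r]
    if p == r:
        return arr[p]
    # move a randomly chosen pivot into the last position, then partition
    q = random.randint(p, r)
    arr[q], arr[r] = arr[r], arr[q]
    part = partition(arr, p, r)
    j = part - p + 1  # rank of the pivot within arr[p..r]
    if j == i:
        return arr[part]
    elif i < j:
        return rselect(arr, p, part - 1, i)
    else:
        return rselect(arr, part + 1, r, i - j)

a = [1, 3, 2, 7, 6, 5, 4]
print(rselect(a, 0, len(a) - 1, 3))  # -> 3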

and we arrive at linear time complexity (on average), wow! So let’s write the pseudocode as well to see that we have this fully covered. If you have any doubts, always check both and try to run a few examples in your head, on paper, or on a computer.

rselect(A, i)
    n = length of A
    if n = 1 then
        return A[1]
    choose a pivot element p and partition A around it
    j = rank of p in the partitioned array
    if j = i then
        return p  # we have our element
    else if j > i then
        return rselect(first part of A, i)
    else
        return rselect(second part of A, i - j)

Ok, we have it all! What should we remember? That after partitioning, the pivot is in its rightful position, and we can use this information to shrink the search space. The time complexity of RSelect is linear, O(n), on average. We remember that pivot selection is important: you might be unlucky and pick bad pivots, like the min or max elements of the array. Randomization is a good option, but can we do something else? The answer is yes: here comes the “median-of-medians” approach, but we will not cover it here. Please check the references if you want more details (look for median-of-medians and/or DSelect).

In the next post, I’ll try to cover graphs with search problem on graphs, search trees, hash tables, greedy algorithms, and dynamic programming (some of those might go to part 3 but we will see). So stay tuned!

References:

  1. “Algorithms Illuminated” by Tim Roughgarden — 4 book series
  2. “Introduction to Algorithms” by T. H. Cormen et al.
  3. https://www.youtube.com/watch?v=JSceec-wEyw
  4. https://www.youtube.com/watch?v=SLauY6PpjW4
  5. https://www.youtube.com/watch?v=hGK_5n81drs
  6. https://realpython.com/python-sort/
  7. https://danielmiessler.com/study/big-o-notation/
Header picture taken from https://datafloq.com/read/12-algorithms-every-data-scientist-should-know/2024, a post about different types of algorithms ;) lol
