Genomes and object granularity I

Genomes
I enjoy dedicating a good share of my free time to reading and learning new skills. It keeps my sanity level high on a day-to-day basis and helps me take my mind off boring tasks at work. But when I want to learn a new language, I prefer having a decent and meaningful project that will keep me motivated.

This time it all started with a statement from one of my favourite theoretical physicists, Michio Kaku, when addressing the question of genetics as the key to immortality. The argument was that as the price of genomics goes down, the need to devise new gene analysis methods will go up. He concludes by saying that Biology will be reduced to Computer Science.

Now that got me thinking: What would be the challenges of modelling the entire genome sequence of an organism and interacting with it in real time? What would the architecture be? Does it already exist?

As I had imagined, these questions are not trivial at all. Without even going into scientific articles, and using Wikipedia alone, I found that a computer model of an organism has existed since 2012. That organism is Mycoplasma genitalium, often described as the smallest free-living organism. This is already remarkable work. I had to get some idea of the source and of the complexity involved.

Back to basics

If you’re not aware of the basics of genomics, don’t feel bad: I didn’t remember them either. I had to go Indiana Jones on my biology books to be sure I was understanding the problem correctly.

Living organisms consist of cells. Each cell contains one or more long DNA strings called chromosomes. These can be pictured as the “blueprints” for the organism. The chromosomes can then be divided into sequences of genes. Roughly speaking, genes encode proteins (sequences of amino acids) and can behave differently depending on their position in the DNA string. The complete collection of chromosomes is called the genome.

I used the word “sequence” on purpose to reiterate that it’s the many possible sequencing configurations of these “objects” that make us who we are and determine how our bodies react to other organisms. But let’s look at a small example:

Say you want to search for a specific sequence of amino acids (= protein). The average length is about 300 (source) but can go over 30000. There are 20 possible amino acids at each position of the chain, so a chain of length 300 alone has 20^300 (about 10^390) possible configurations, and a naive search for a particular protein faces an astronomically large space. There are of course many years of study and research in this area, and many breakthroughs have allowed scientists to find solutions to these kinds of problems.
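To get a feeling for the size of that search space: with 20 choices per position, a chain of length n has 20^n possible sequences. The small Rust sketch below (the function name is my own, purely for illustration) computes only the order of magnitude through logarithms, because the number itself overflows every built-in integer type.

```rust
/// Order of magnitude of `alphabet^length`, computed as
/// log10(alphabet^length) = length * log10(alphabet).
fn log10_sequence_space(alphabet: u32, length: u32) -> f64 {
    f64::from(length) * f64::from(alphabet).log10()
}

fn main() {
    // 20 amino acids; lengths of 300 (average) and 30000 (upper range)
    // are the figures quoted in the text.
    for length in [300, 30_000] {
        let exponent = log10_sequence_space(20, length).floor();
        println!("20^{} is on the order of 10^{}", length, exponent);
    }
}
```

Even the average-sized case is far beyond anything that could be enumerated, which is why brute force is not an option here.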

I then found this website as I was curious to know at least the order of magnitude when comparing a bacterial genome to the human genome. The answer is:

– Mycoplasma genitalium

  • Base pairs: 580073
  • Genes: 517

– Humans

  • Base pairs: 3.3 x 10^9
  • Genes: ~21000

– Humans have about 40 times more genes and about 5700 times more base pairs.
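The comparison can be checked in a few lines of Rust. This is only a sketch using the figures listed above; the helper name `ratio` is made up for the example. With these inputs the gene ratio comes out near 40 and the base-pair ratio closer to 5700.

```rust
/// Ratio of two counts as a floating-point number.
fn ratio(larger: f64, smaller: f64) -> f64 {
    larger / smaller
}

fn main() {
    // Figures quoted in the list above.
    let (mg_base_pairs, mg_genes) = (580_073.0, 517.0);
    let (human_base_pairs, human_genes) = (3.3e9, 21_000.0);

    println!("genes:      ~{:.0}x", ratio(human_genes, mg_genes));
    println!("base pairs: ~{:.0}x", ratio(human_base_pairs, mg_base_pairs));
}
```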

Of course I’m sure you cannot compare this data by magnitude alone, but just imagining it still makes me feel crushed by the complexity.

Finally I got to the official GitHub repo. I could finally take a look at what kind of programming we are talking about here.

Problem

First of all, it seems that it took a cluster of 128 machines to run all simulations of the organism in its natural habitat, with results coming very close to the real deal. That alone screams for effective and efficient distributed computing if we ever want to see this kind of work for larger-scale organisms.

The guys at Stanford used and abused MATLAB to achieve this great feat. I have a bit of MATLAB experience myself and I totally understand how it can save on testing and programming time. The choice of organism is explained as an important first step before proceeding to larger and more complex organisms, which is also fairly understandable.

With these points in mind, I couldn’t help imagining how we could process this data on a larger scale. Eventually, proofs of concept and scientific models are handed to software designers and engineers so they can work their magic and make them scalable. But I still didn’t know what the “Achilles heel” of computational genomics is, so I tried to find some expert opinions, and that’s where I stumbled upon a one-sentence-fits-all description (from Quora) of computational genomics problems:

Analysing seemingly disparate, wide-ranging–and ever expanding–phenotypic data in context of a static genetic profile in a diverse cohort that is dynamically growing.

Cryptic as it seems at first, I proceeded to find an analogy that best fits this statement. I came up with two simpler problem statements:

Analysing seemingly disparate, wide-ranging–and ever expanding–phenotypes
1 – The modelling of always-evolving individuals and figuring out the “family” tree.

Generating a profile from the diverse cohort that is dynamically growing
2 – How different families interact with each other in order to picture a larger entity (== profile)

My analogy makes sense to me, but I ask you, my dear reader, to correct me if I am wrong. If we go back and check the orders of magnitude presented above, one can imagine how these statements fit in.

So the problem is granularity?

If the above statements are true, this seems to be one of the main problems. How do we connect the dots for wide-ranging/ever-expanding types and how do they work together to form bigger entities?

One thing is certain as well, as first observed and as discussed in many forums: distributed computing architecture is also a big problem. I won’t address it any further in this article, as it deserves a whole article of its own.

Looking for answers

As a good soldier I had to turn to the actual literature on code architecture and design. There might be other (better suited?) examples; I will be very grateful if you point me to other references.

The broad direction

I needed a book which could provide a broad but rich analysis of software development practices, from the very basics to the most advanced techniques, and that could get me started on where to read up on the specifics. I normally refer to Code Complete 2 when looking for further reading on specific topics. One of the conclusions I took from it was clear:

“Patterns are the key to larger granularity discussions.”

If patterns are the key to larger granularity discussions, that brings us to…*drum roll*… one of the most celebrated books in the short life of computer science: Design Patterns by the Gang of Four.

The Gang of Four

If Code Complete is the “Joy of Cooking” for programmers, this one is probably a crash course on Mediterranean cuisine, with specifics on how to assemble tasty pasta dishes.

This book is famous for statements that still cause much ink to be spilled more than 20 years after it was first published, such as:

“Program to an interface, not an implementation.”

…or even:

“Favor object composition over class inheritance.”

It thoroughly catalogs some of the most common design patterns in the history of object-oriented software development.

At the time of writing I have just started digesting this book, but I jumped to one design pattern to get me started on our honourable quest of modelling complex and granular objects such as genomes. The answer that seems to come closest when handling many objects is Flyweight and its compadres Composite and Factory. Now the million dollar question seems to be: Does computational genomics fit the use cases for Flyweight?
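To make the question a bit more concrete, here is a minimal sketch of what Flyweight could look like for a DNA string. It is my own illustration and not taken from the whole-cell project: the types `Nucleotide` and `NucleotideFactory` and the function `build_sequence` are hypothetical. Each distinct base is created once by a factory (the shared, intrinsic state), while the position of every occurrence (the extrinsic state) lives in the sequence itself.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Intrinsic (shared) state: identical for every occurrence of a base.
struct Nucleotide {
    symbol: char,
    name: &'static str,
}

/// Flyweight factory: creates each distinct nucleotide once and hands out
/// shared references to it afterwards.
#[derive(Default)]
struct NucleotideFactory {
    pool: HashMap<char, Rc<Nucleotide>>,
}

impl NucleotideFactory {
    /// Returns the shared nucleotide for `symbol`, or `None` if it is not A, C, G or T.
    fn get(&mut self, symbol: char) -> Option<Rc<Nucleotide>> {
        if let Some(existing) = self.pool.get(&symbol) {
            return Some(Rc::clone(existing));
        }
        let name = match symbol {
            'A' => "adenine",
            'C' => "cytosine",
            'G' => "guanine",
            'T' => "thymine",
            _ => return None,
        };
        let created = Rc::new(Nucleotide { symbol, name });
        self.pool.insert(symbol, Rc::clone(&created));
        Some(created)
    }

    /// Number of distinct objects actually allocated so far.
    fn distinct(&self) -> usize {
        self.pool.len()
    }
}

/// Extrinsic state (the position) is the index in the returned vector,
/// not a field of the flyweight.
fn build_sequence(factory: &mut NucleotideFactory, dna: &str) -> Option<Vec<Rc<Nucleotide>>> {
    dna.chars().map(|c| factory.get(c)).collect()
}

fn main() {
    let mut factory = NucleotideFactory::default();
    let sequence = build_sequence(&mut factory, "GATTACA").expect("only A, C, G, T are allowed");
    for (position, base) in sequence.iter().enumerate() {
        println!("{}: {} ({})", position, base.symbol, base.name);
    }
    println!("{} positions share {} objects", sequence.len(), factory.distinct());
}
```

For four plain letters a simple enum or a byte per base would be cheaper than a pointer per position, so this only shows the shape of the pattern. The sharing starts to pay off when the shared state is heavy, for example if every occurrence of a gene pointed to one shared annotation record.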

“A journey of a thousand miles begins with a single step”

Now, now, now… Clearly my limited expertise in this area is a limiting factor when deciding which design patterns apply, what the key elements and their interactions are, and the small detail that…uhm…it’s an NP-hard problem! So I must be pragmatic. Please lay down the pitchforks and put out the torches, that’s not nice.

However, I couldn’t stop this train without producing something meaningful for myself and perhaps even useful for the community. That’s why I decided to implement an example of a genetic algorithm in a new language… say Rust! (More in the next part.)

I had formal training in genetic algorithms, though more introductory than comprehensive. I always wanted to get better at them, hence my resolution.

Finally, I will finish with a childhood quote, the old (pt_PT) Dragon Ball Z ending line:
Don’t miss the next episode, because we… won’t either!

Notes for clarity:

1 – My knowledge of Biology doesn’t go beyond that of a high school student. I didn’t set out to start a Biology war with this article (if there is any reason for one).

2 – Also this should be important for clarity

Computational genomics references:

1 – Genetics and computer science by Michio Kaku

2 – Whole genome sequencing by Wikipedia

3 – The first computer model of an organism, Stanford 2012

4 – The actual paper on whole-cell computational model

5 – Official Wholecell project git repo at GitHub

6 – Quora on computational genomics

Computer science books and references:

1 – Code Complete: A Practical Handbook of Software Construction, Second Edition

2 – Design Patterns: Elements of Reusable Object-Oriented Software

3 – Coding Horror Blog

4 – Flyweight by Wikipedia

5 – Composite by Wikipedia

6 – Factory by Wikipedia

Matjaz Trcek
SRE @ Magnolia CMS

Working as an SRE in Magnolia CMS. In my free time I work on many side projects some of which are covered in this blog.