domingo, 22 de agosto de 2010

Bitvector Genealogy


Introduction

Recently I have come across something that reminded me of a challenge
I tried some time ago, ITAs bitvector genealogy
challenge(1). Back in the day when I was first
learning scheme, I have tried to use this to build up my skills with
the help of a great tutorial(3).
Needless to say that I failed at that, both for my lack of scheme and
common lisp competence but also for my lack of aptitude as a
programmer.

The something was a thread in Hacker News where ITA was mentioned. I
became curious to see how much I had progressed since my last attempt,
so I grabbed DrRacket, Emacs and jumped into the fray one more time.

Again, most of the original hard work was done by Patrick May. What I
am showing here is a take on his work, translated to scheme
and modified to use different data structures and different points of
views on the same fundamental algorithm.

Ok, this is the description of the problem:

The BitVectors are an ancient and immortal race of 10,000, each with
a 10,000 bit genome. The race evolved from a single individual by
the following process: 9,999 times a BitVector chosen at random from
amongst the population was cloned using an error-prone process that
replicates each bit of the genome with 80% fidelity (any particular
bit is flipped from parent to child 20% of the time, independent of
other bits).

Write a program to guess the reproductive history of BitVectors from
their genetic material. The randomly-ordered file
bitvectors-genes.data.gz contains a 10,000 bit line for each
individual. Your program's output should be, for each input line,
the 0-based line number of that individual's parent, or -1 if it is
the progenitor. Balance performance against probability of mistakes
as you see fit.
(1)

The next step is really the crux of the problem. This is where a bit
of mathematics comes into play

The next step in developing a solution is to determine whether or not
a parent-child relationship exists between two bit vectors. According
to the problem definition, every bit vector other than the original is
created by randomly mutating the elements of an existing bit
vector. The probability of a particular bit changing is 20%. That
means that, on average, the children of a bit vector should be 80%
identical to their parent and the grandchildren should be 64%
similar. Of course, the random element means that some will have more
than 80% similarity and others less. Some children may be more
similar to each other than either is to their parent. The best result
any automated solution will be able to achieve is the most probable
relationship tree.

Assuming that a fair random number generator was used, the set
containing the number of bits different between a parent and each of
its children should follow a binomial distribution. The probability of
two vectors having a parent child relationship is related to how close
the difference between the two is to the expected value (20%, in this
case). In a binomial distribution (which approximates a continuous
normal distribution), 99.73% of the bit difference values will be
within three standard deviations of the expected value. The standard
deviation is computed by:


Body

First, I will define some globals that will be useful throughout the
source code, and the utilities to load the bitvectors data into a
vector of numbers:

(define ROOT-SYMBOL -1)

(define MUTATION-RATE 0.2)

;; http://mathworld.wolfram.com/BinomialDistribution.html
(define DEVIATION 3) ;; 3 = 99.73 percent


(define DNA-LENGTH 0)

(define (get-dna)
  DNA-LENGTH)

(define (set-dna! number)
  (set! DNA-LENGTH number))

;; filename->bitvector : input-port -> (vectorof binary-number)
;; read everything from a given file and returns a 
;; vector containing binary numbers. 
(define (filename->bitvector file-path (radix 2))
  (local [(define (read-accumulative bitvector)
            (let ([line-read (read-line file-path 'any)])
              (if (eof-object? line-read)
                  bitvector
                  (let ([possible-number (string->number line-read radix)])
                    (read-accumulative
                     (if (number? possible-number)
                         (vector-append bitvector (vector possible-number))
                         bitvector))))))]
    (cond
      [(equal? "" file-path)]
      [else
       (read-accumulative (vector))])))

The next part is really the crux of the problem. This is where a bit
of mathematics comes into play (2)
and the most complex part of the problem in my view:

The next step in developing a solution is to determine whether or not
a parent-child relationship exists between two bit vectors. According
to the problem definition, every bit vector other than the original is
created by randomly mutating the elements of an existing bit
vector. The probability of a particular bit changing is 20%. That
means that, on average, the children of a bit vector should be 80%
identical to their parent and the grandchildren should be 64%
similar. Of course, the random element means that some will have more
than 80% similarity and others less. Some children may be more
similar to each other than either is to their parent. The best result
any automated solution will be able to achieve is the most probable
relationship tree.

Assuming that a fair random number generator was used, the set
containing the number of bits different between a parent and each of
its children should follow a binomial distribution. The probability of
two vectors having a parent child relationship is related to how close
the difference between the two is to the expected value (20%, in this
case). In a binomial distribution (which approximates a continuous
normal distribution), 99.73% of the bit difference values will be
within three standard deviations of the expected value.
(3)

Taking that, we can establish the existence of a relationship (with
some degree of confidence) with the following code:

;;  binary-difference : binary-number binary-number -> integer
;; returns the number of bits that are different between 2 binary numbers.
(define (binary-difference bits1 bits2)
  (let ([differences 0])
    (for ((i (in-string (number->string (bitwise-xor bits1 bits2) 2))))
      (when (equal? i #\1)
        (set! differences (add1 differences))))
    differences))

;; binomial-standard-deviation : number -> number
;; Compute the expected binomial standard deviation for N events of probability P.
(define (binomial-standard-deviation n p)
  (sqrt (* n p (- 1 p))))

;; has-kinship? : integer integer -> boolean
;; verify if 2 individuals share a kinship. 
;; The relationship is assumed if the bit difference between the two is
;; less than DEVIATIONS standard deviations from the expected average for
;; the MUTATION-RATE.
(define (has-kinship? individual-1
                      individual-2
                      (mutation-rate MUTATION-RATE)
                      (deviations DEVIATION)
                      (dna-length (get-dna)))
  (let* ([expected-difference (* dna-length mutation-rate)]
         [standard-deviation
          (binomial-standard-deviation dna-length
                                       mutation-rate)]
         [max-difference (+ expected-difference
                            (* deviations standard-deviation))])
    (< (binary-difference individual-1
                          individual-2)
       max-difference)))



The code above shows in some ways where I deviated from Patrick's
example. I used a number instead of a bit array (which does not exist in
Racket, the language I'm using, as far as I know). In the following
code, this difference deepens, with the construction of a hash map
instead of a vector to get the relationship matrix:

;; create-relationship-hash : (vectorof number) -> hash
;; creates a hash mapping an individual to it's list of relationships.
;; Both parents and children are mapped to an individual.
(define (make-relationship-hash bitvector)
  (let ([relationship-hash (make-hash)]
        [dna-range (in-range 0 (vector-length bitvector))])
    (for* ((individual-1-index dna-range)
           (individual-2-index dna-range))
      (begin (unless (hash-has-key? relationship-hash individual-1-index)
               (hash-set! relationship-hash individual-1-index empty))
             (let ([individual-1 (vector-ref bitvector
                                             individual-1-index)]
                   [individual-2 (vector-ref bitvector
                                             individual-2-index)])
               (when (and (has-kinship? individual-1
                                        individual-2)
                          (not (= individual-1 individual-2)))
                 (hash-update! relationship-hash
                               individual-1-index
                               (λ (relationships)
                                 (cons individual-2-index relationships)))))))
    relationship-hash))

This almost closes the issue, but I must figure out the directions of
the relationships(who is who's father and son). To establish that I
should have to:

  • Find all the bit vectors with only a single parent-child relationship
  • Assume that those vectors are the child nodes
  • Record that information
  • Remove the child node from the parent's list of relationships
  • Repeat

(3)

That is accomplished in the longest and most complex code in the solution:

;; make-kinship-hash : hash -> hash
;; creates a hash mapping an individual only to it's parent.
(define (make-kinship-hash relationship-hash-initial)
  (local [(define (make-kinships kinship-hash
                                 relationship-hash)
            (local [(define (process-possible-kinship individual-index
                                                      relationships-indexes)
                      (let ([number-of-relationships
                             (length relationships-indexes)])
                        (unless (> number-of-relationships 1)
                          (begin
                            (hash-remove! relationship-hash
                                          individual-index)
                            (if (= number-of-relationships 1)
                                ; leaf node
                                (begin
                                  (hash-set! kinship-hash
                                             individual-index
                                             (first relationships-indexes))
                                  (hash-update! relationship-hash
                                                (first relationships-indexes)
                                                (λ (relationships)
                                                  (remove individual-index
                                                          relationships))))
                                ; root node
                                (begin (hash-remove! relationship-hash
                                                     individual-index)
                                       (hash-set! kinship-hash
                                                  individual-index
                                                  ROOT-SYMBOL)))))))]
              (if (false? (hash-iterate-first relationship-hash))
                  kinship-hash
                  (begin
                    (hash-for-each relationship-hash
                                   process-possible-kinship)
                    (make-kinships kinship-hash
                                   relationship-hash)))))]
    (make-kinships (make-hash) relationship-hash-initial)))

And that is the end of the solution! I am still facing the same issues
as Patrick, specially in regard to the root individual, but
reimplementing his solution and trying to achieve a better result by
the use of different data structures led me to understand the problem
more deeply, and it was loads of fun!

PS: There is more code in
this
repository, specially regarding testing and comparison.


Bibliography

1

ITA.

Bitvector genealogy.

URL
http://www.itasoftware.com/careers/work-at-ita/hiring-puzzles.html.




2

MathWorld.

Binomial distribution.

URL http://mathworld.wolfram.com/BinomialDistribution.html.




3

Patrick May.

Bitvector genealogy.

URL http://www.softwarematters.org/bitvector-genealogy.html.



Nenhum comentário:

Postar um comentário