2007-06-08

A simple re-write system using a Markov variant.

Last week I found Jeremy Wertheimer's master thesis via a news:comp.lange.scheme posting about the Rete algorithm. Jeremy is now the President and CEO of ITA Software and on their career page I found the Word Numbers puzzle:

"If the integers from 1 to 999,999,999 are written as words, sorted alphabetically, and concatenated, what is the 51 billionth letter?"

If you enjoy programming puzzles, there's even more in their puzzle archive.

I won't talk about the solution of the puzzle here (well, possibly parts of the solution, but not the tricky bits anyway) but rather take one piece of it out and show a minimal implementation to perform that particular task: to rewrite any given number (from 1 to 999,999,999) into it's text representation. In the long description of the puzzle they've added: "omitting spaces, 'and', and punctuation" which means that 1,709 translates to onethousandsevenhundrednine and 500,000,000 is written fivehundredmillion. It's easy to add 'and' and spaces, but fixing that is left as an exercise for the reader :-)

I've been looking into the Markov Algorithm quite a bit lately and have used this piece of code for various things the last weeks. The Markov Algorithm is really simple to implement but is not very fast, it's not necessarily very slow either but that depends on what you compare it with I guess.

There are a few differences from text-book Markov in my implementation. For example, there's no way to specify a terminal rule. Also, I'm using recursion to pass the result from one rule back into the system.

Let's have a look at some code (~30 lines of Python).

>>> import re
>>> def evaluate(fact):
... print "evaluate(%s)" % (fact)
... for lhs, rhs in rules:
... match_obj = re.match(lhs, fact)
... if match_obj is not None:
... return "".join(fire(rhs, match_obj))

>>> def fire(rhs, match_obj):
... print "fire(%s, %s)" % (rhs, match_obj.re.pattern)
... if isinstance(rhs, str):
... try:
... return match_obj.group(rhs) # 1a) substitute (pattern match)
... except IndexError:
... return rhs # 1b) substitute (constant)
... elif isinstance(rhs, tuple):
... partial_results = [fire(action, match_obj) for action in rhs]
... return "".join(partial_results) # 2) substitute and concatenate
... elif isinstance(rhs, list):
... partial_results = [fire(action, match_obj) for action in rhs]
... return evaluate("".join(partial_results)) # 3) substitute, concatenate and eval

>>> def convert(n):
... number = str(n)
... return evaluate(number)
The fire method can handle three different types of actions: 1) substitute, 2) substitute and concatenate, 3) substitute, concatenate and continue evaluation. Substitution can be done in two different ways, either using a constant or part of a pattern match. For completeness, I really should add a way to perform a real evaluation/execution of Python code but that's for another blog post.

Ok, so far we've got the implementation. Now, all we need are some rules.

Let's start with the easiest ones (1-digit numbers):
>>> rules = [("1", "one"), ("2", "two"), ("3", "three"), ("4", "four"),
... ("5", "five"), ("6", "six"), ("7", "seven"), ("8", "eight"), ("9", "nine"), ]
Each rule is expressed as a tuple containing 2 parts. A Left Hand Side (LHS) and a Right Hand Side (RHS). The LHS is used as a regular expression pattern by the evaluate function and the RHS is interpreted as a constant. So now we can do:
>>> convert(2)
'two'
which is not very interesting. Let's add some more rules.
>>> rules = [("(?P[digit]\\d{1})0", (["digit"], "ty")),
... ("(?P[digit1]\\d{1})(?P[digit2]\\d{1})", (["digit1", "0"], ["digit2"])),] + rules
Now, that's a bit more complicated. But, stay with me here. The LHS is blue and the RHS is green. The first rule basically says: [a]0 should be transformed into [a]ty (for example, 40 becomes four-ty). The second rule says: [ab] means [a]ty[b] (for example, 43 means four-ty-three).

We're nearly done with the 2-digit numbers. the problem is that the current rule base cannot handle numbers like 37.
>>> convert(37)
'threetyseven'
It's kind of close but still wrong. We'll have to add a few more constants.
>>> rules = [("20", "twenty"), ("30", "thirty"), ("50", "fifty"),
... ("80", "eighty")] + rules
>>> convert(37)
'thirtyseven'
That's better, but we still can't handle 17.
>>> convert(17)
'onetyseven'
>>> rules = [('10', "ten"), ('11', "eleven"), ('12', "twelve"),
... ('13', "thirteen"), ('15', "fifteen"), ('18', "eighteen"),
... ("1(?P[digit]\\d{1})", (["digit"], "teen")),] + rules
>>> convert(17)
'seventeen'
That's actually all that's needed to handle the tricky numbers (between 1 and 99). All 3-digit numbers ([abc]) are really just: [a]hundred[bc]. Similarly all 4 to 6-digit numbers ([a{1,3}bcd]) are [a]thousand[bcd] and 7 to 9-digit numbers ([a{1,3}bcdefg]) are [a]million[bcdefg].
>>> rules = [("(?P[digit]\\d{1})00", (["digit"], "hundred")),
... ("(?P[digit1]\\d{1})(?P[rest]\\d{2})", (["digit1", "00"], ["rest"]))] + rules
>>> convert(373)
'threehundredseventythree'
>>> convert(303)
Traceback (most recent call last):
...
TypeError: sequence item 0: expected string, NoneType found
Weren't we done? The missing piece is a rule (with the highest priority) that tells the implementation to ignore leading zeros. It's required because we re-write and call the evaluate function on a partial result. Let's look at the call trace:
evaluate(303)
fire((['digit1', '00'], ['rest']), (?P[digit1]\d{1})(?P[rest]\d{2}))
fire(['digit1', '00'], (?P[digit1]\d{1})(?P[rest]\d{2}))
fire(digit1, (?P[digit1]\d{1})(?P[rest]\d{2}))
fire(00, (?P[digit1]\d{1})(?P[rest]\d{2}))
evaluate(300)
fire((['digit'], 'hundred'), (?P[digit]\d{1})00)
fire(['digit'], (?P[digit]\d{1})00)
fire(digit, (?P[digit]\d{1})00)
evaluate(3)
fire(three, 3)
fire(hundred, (?P[digit]\d{1})00)
fire(['rest'], (?P[digit1]\d{1})(?P[rest]\d{2}))
fire(rest, (?P[digit1]\d{1})(?P[rest]\d{2}))
evaluate(03)
fire((['digit1', '0'], ['digit2']), (?P[digit1]\d{1})(?P[digit2]\d{1}))
fire(['digit1', '0'], (?P[digit1]\d{1})(?P[digit2]\d{1}))
fire(digit1, (?P[digit1]\d{1})(?P[digit2]\d{1}))
fire(0, (?P[digit1]\d{1})(?P[digit2]\d{1}))
evaluate(00)
fire((['digit'], 'ty'), (?P[digit]\d{1})0)
fire(['digit'], (?P[digit]\d{1})0)
fire(digit, (?P[digit]\d{1})0)
evaluate(0)
fire(ty, (?P[digit]\d{1})0)
There's no rule to handle something like '03'. We could of course add ("0", "zero") to the rule base but that would be wrong (we'd get 'threehundredzerotythree'). The right thing to do is (and we're also adding the rules for thousand and million):
>>> rules = [("0+(?P[number]\\d+)", ["number"]),
... ("(?P[number]\\d{1,3})000000", (["number"], "million")),
... ("(?P[number]\\d{1,3})(?P[rest]\\d{6})", (["number", "000000"], ["rest"])),
... ("(?P[number]\\d{1,3})000", (["number"], "thousand")),
... ("(?P[number]\\d{1,3})(?P[rest]\\d{3})", (["number", "000"], ["rest"])), ] + rules
>>> convert(303)
'threehundredthree'
>>> convert(911610034)
'ninehundredelevenmillionsixhundredtenthousandthirtyfour'
>>> convert(5000000)
'fivehundredmillion'
>>> convert(1709)
'onethousandsevenhundrednine'
The funny thing is that solving it this way feels more natural to me than any other way. I think I've become damaged ;-)

4 kommentarer:

woolfel sa...

I see you also found ITA :)

Johan Lindberg sa...

Yes, and when I found their puzzles, I just had to solve them. I'm glad they have an archive as well. It will probably keep me busy for quite some time :-)

woolfel sa...

I also found a patent by jeremy related to ITA software. I have a link to it in a recent post. interesting stuff that ITA is doing.

Johan Lindberg sa...

Yes, I read your post about it and ITA Software sounds interesting indeed. Even though the thesis wasn't quite what I expected it got me thinking about some new approaches. It was well worth the time I spent reading it.