A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from http://www.gtoal.com/wordgames/documents/arithmetic-encoding.mai below:

From gtoal@admin.vt.com Fri Jun 4 19:41:00 1999 Received: from admin.vt.com (root@admin.vt.com [204.117.188.10]) by gtoal.com (8.8.5/8.8.5) with ESMTP id TAA05938 for ; Fri, 4 Jun 1999 19:40:56 -0500 (CDT) Received: (from root@localhost) by admin.vt.com (8.8.5/8.7.3) id RAA29502; Fri, 4 Jun 1999 17:17:06 -0500 (CDT) Date: Fri, 4 Jun 1999 17:17:06 -0500 (CDT) From: Graham Toal Message-Id: <199906042217.RAA29502@admin.vt.com> To: word-game-programmers@mit.edu Subject: How to find a word in a dawg from a numeric word index efficiently Status: R Dear Word Gamers and Dawg Hackers, Another BFO (Blinding Flash of the Obvious) struck me today... Eric House wrote a program recently which allows you to scroll forwards and backwards through a dawg (a vocabulary browser for the palm pilot - shares the same dawg as his scrabble) and he was mentioning the difficulty that is caused by not being able to look up a word in a dawg by numeric index. I'll switch applications just to make it easier to define the problem and explain a possible solution: Let's say you want to send an arbitrary document to someone, and they have a copy of the same all-encompassing word list that you have, then you can write a code such that a=1, aa=2, aah=3, aahli=4 etc for all the words in your word list, then just send the code (ie the ordinal index of the word in that list). However with a dawg, there is no easy index and to expand those codes back to a list of words, you'd have to walk the dawg again for every word. Well, I realised that this is very similar to the problem of transmitting a code set in a regular compression program, such as lzw, and the solutions there apply here: you encode each word as a variable-length bitstream which denotes the choices you took going down the trie to find the end of the word. You can either do this in fixed bit fields (eg 5 bits, 4 bits, 3 bits, 3 bits, 1 bit in the example below) *or* ... (today's Blinding Flash) encode it arithmetically! For those of you not already familiar with Arithmetic Encoding, I'll work a full example here on a small data set that I can do by hand: (The only prior knowlege I assume here is a working knowlege of the DAWG data structure.) Let's take an arbitrary word 'super', which in some word dawg is stored as follows: The first letter s is branch #5 of 26 choices from the root node The next letter u is branch #11 of 14 choices from the 's' node The next letter p is branch #2 of 7 choices from the 'u' node The next letter e is branch #3 of 5 choices from the 'p' node The last letter r is branch #0 of 2 choices from the 'e' node (aside: branch numbers start at 0, so branch #0 would be the first choice of the two above) For reasons that will be obvious when we decode this, we build the value in reverse starting with 'r' and working back over 'repus'. This is trivially implemented by returning from the recursion of checking the word in the dawg, from the end node back. Check the word on the way down, do the arithmetic coding as you return back up. So the word /super/ (aka \repus\) would be arithmetically encoded as follows: Initial Total: 0 first data item is 'r': link #0 in a set of 2 multiply by 2 (or don't, the initial one is a no-op) Now, add 0 Total: 0 Next, data for 'e' is link #3 in a set of 5 multiply by 5 Total: 0 add 3 Total: 3 Next, data for 'p' is link #2 in a set of 7 multiply by 7 Total: 21 add 2 Total: 23 Next, data for 'u' is link #11 in a set of 14 multiply by 14 Total: 322 add 11 Total: 333 Next, data for 's' is link #5 in a set of 26 multiply by 26 Total: 8658 add 5 Total: 8663 (If anyone new to this coding thinks that this will rapidly generate numbers outside the range of a 32 bit integer, you're right - that's why there are so many 'bignum' and arbitrary precision bitfield packages out there. They're useful here as well as in crypto; though for many (most?) wordlists it will actually be possible to code every word in a 32 bit integer or less.) So, we now have a word encoded as '8663' (which I'm guesstimating would need about 15 bits of storage) To decode this word, we do the same in reverse, using the same dawg to guide us. We know the first branch has exactly 26 possibilities, so we divide 8663 by 26 and take the remainder - which is 333, remainder 5 Total: 333 We go down branch number 5 (which is labelled as 's'), and see that at this point in the trie we now have 14 branches. divide by 14, giving 23 remainder 11 So this is the 11th letter in this node which is 'u' Total: 23 We now traverse branch number 11 and find a node with 7 possible links divide 23 by 7 giving 3 remainder 2 This letter was therefore choice #2 of 7, ie 'p' Total: 3 This node has 5 links, so we divide by 5 giving 0 remainder 3 So this letter is choice #3 of 5 ('e') and we go to the next node, which has 2 choices. Total: 0 0 divided by 2 is 0 remainder 0, so our last letter is choice #0, aka 'r'. Consequently we have extracted the word 'super' from index 8663. (I leave it as an exercise to the reader to correct the above algorithm so that words can be extracted which are also prefixes of other words, such as a word list which contains both 'super' and 'superman' Clue: there's an important "+1" needed somewhere) I throw this out in case it is of any benefit to our readers. Clearly you would not use this coding for such a trivial use as sending a text document (there are easier ways!) but you may well find it of use in some future word game yet to be written. I can see it being very useful for storing in ram a long list of possibilities generated by some traversal of a dawg. Let me know if you code this up and/or put it to any real use. I'll archive a copy of this article on my web site in case it's ever needed. regards Graham

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4