Sunday, April 17, 2011

python: finding a missing letter in the alphabet from a list - least lines of code

I'm trying to find the missing letter in the alphabet from the list with the least lines of code.

If the list is sorted already (using list.sort()), what is the fastest or least lines of code to find the missing letter.

If I know there are only one missing letter.

(This is not any type of interview questions. I actually need to do this in my script where I want to put least amount of work in this process since it will be repeated over and over indeterministically)

From stackoverflow
  • Here's one way of doing it, assuming your "alphabets" is integers, and that the list has at least two items:

    for i in xrange(1, len(a)):
      if a[i] != a[i - 1] + 1:
        print a[i - 1] + 1, "is missing"
    
  • With sorted lists a binary search is usually the fastest alghorythm. Could you please provide an example list and an example "missing alphabet"?

    bLee : well, it's just a list of alphabets... like this: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'] and missing one would be J here.
  • Some questions:

    • Are all the letters upper or lower case? (a/A)
    • Is this the only alphabet you'll want to check?
    • Why are you doing this so often?

    Least lines of code:

    # do this once, outside the loop
    alphabet=set(string.ascii_lowercase)
    # inside the loop, just 1 line:
    missingletter=(alphabet-set(yourlist)).pop()
    

    The advantage of the above is that you can do it without having to sort the list first. If, however, the list is always sorted, you can use bisection to get there faster. On a simple 26-letter alphabet though, is there much point?

    Bisection (done in ~4 lookups):

    frompos, topos = 0, len(str)
    for i in range(1,100):  #never say forever with bisection...
        trypos = (frompos+topos+1)/2
        print "try:",frompos,trypos,topos
        if alphabet[trypos] != str[trypos]:
            topos = trypos
        else:
            frompos = trypos
        if topos-frompos==1:
            if alphabet[topos] != str[topos]:
                print alphabet[frompos]
            else:
                print alphabet[topos]
            break
    

    This code requires fewer lookups, so is by far the better scaling version O(log n), but may still be slower when executed via a python interpreter because it goes via python ifs instead of set operations written in C.

    (Thanks to J.F.Sebastian and Kylotan for their comments)

    monkut : I like this way, I always forget about set() :P
    bLee : Well.. Why am I doing this so often? Just curious. I was doing it on my way and wanted to know if there is a better way of doing this. I was going to apply this to some other situations. As far as alphabets go, I think your first solution is great :)
    Kylotan : I think this way would be best, if the alphabet was populated from string.ascii_letters. After all, if they change ASCII some time soon, you don't want your code to break. ;)
    Brian : I think there is a bug in your bisection approach when the missing letter is "a". topos will never be 0, so you'll return "b" instead.
    J.F. Sebastian : set(list(a_string)) == set(a_string) There is no need to use `list` here.
    Phil H : Brian, you are absolutely right. I was thinking it was all looking a bit too clean for bisection...
    Phil H : J.F.Sebastian, you are also right.
    J.F. Sebastian : Use string.ascii_lowercase instead of "abc...". It better communicates the code intent
    J.F. Sebastian : It is hard to write binary_search correctly the first time therefore use bisect.bisect or similar
    Phil H : JFS, I've changed it to use string.ascii_lowercase, although for the general case you'd just use some source list or set there.
  • Using a list comprehension:

    >>> import string
    >>> sourcelist = 'abcdefghijklmnopqrstuvwx'
    >>> [letter for letter in string.ascii_lowercase if letter not in sourcelist]
    ['y', 'z']
    >>>
    

    The string module has some predefined constants that are useful.

    >>> string.ascii_lowercase
    'abcdefghijklmnopqrstuvwxyz'
    
    >>> string.letters
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    
    >>> string.hexdigits
    '0123456789abcdefABCDEF'
    
    >>> string.octdigits
    '01234567'
    
    >>> string.digits
    '0123456789'
    
    >>> string.printable
    '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
    >>>
    
    bLee : string.ascii_letters[:26]?? That just gives a list of the alphabet? whoa. +1!
    monkut : yup, string.ascii_letters gives you the whole alphabet both lowercase and caps. :P I just realized string.ascii_lowercase gives you the same result as string.ascii_letters[:26]. Updated.
  • if you're talking about alphabet as letters:

    letterSet = set()
    for word in wordList:
      letterSet.update(set(word.lower()))
    
    import string
    alphabet = set(string.lowercase)
    missingLetters = alphabet.difference(letterSet)
    
    J.F. Sebastian : string.lowercase depends on locale. Use string.ascii_lowercase instead of.
    vartec : Well, at least it should depend on locale (actually it doesn't). Definition of alphabet depends on locale. string.ascii_lowercase is valid *only* for English.
    J.F. Sebastian : It *does* depend on locale e.g. if your system locale is not English: `python -c"import locale; locale.setlocale(locale.LC_ALL,''); import string; print string.lowercase"` then it prints your local alphabet
    J.F. Sebastian : s/local alphabet/lowercases letters in your locale encoding/
    vartec : It should, but in my case it doesn't >>> import locale, string >>> locale.setlocale(locale.LC_ALL,'pl_PL.UTF-8') 'pl_PL.UTF-8' >>> string.lowercase 'abcdefghijklmnopqrstuvwxyz' >>> locale.setlocale(locale.LC_ALL,'es_ES.UTF-8') 'es_ES.UTF-8' >>> string.lowercase 'abcdefghijklmnopqrstuvwxyz'
  • In the too clever for it's own good category, and assuming there is exactly one missing letter in a lowercase alphabet:

    print chr(2847 - sum(map(ord, theString)))
    

    [Edit] I've run some timings on the various solutions to see which is faster. Mine turned out to be fairly slow in practice (slightly faster if I use itertools.imap instead).

    Surprisingly, the listcomp solution by monkut turned out to be fastest - I'd have expected the set solutions to do better, as this must scan the list each time to find the missing letter. I tried first converting the test list to a set in advance of membership checking, expecting this to speed it up but in fact it made it slower. It looks like the constant factor delay in creating the set dwarfs the cost of using an O(n**2) algorithm for such a short string.

    That suggested than an even more basic approach, taking advantage of early exiting, could perform even better. The below is what I think currently performs best:

    def missing_letter_basic(s):
        for letter in string.ascii_lowercase:
            if letter not in s: return letter
        raise Exception("No missing letter")
    

    The bisection method is probably best when working with larger strings however. It is only just edged out by the listcomp here, and has much better asymptotic complexity, so for strings larger than an alphabet, it will clearly win.

    [Edit2]

    Actually, cheating a bit, I can get even better than that, abusing the fact that there are only 26 strings to check, behold the ultimate O(1) missing letter finder!

    find_missing_letter = dict((string.ascii_lowercase[:i]+string.ascii_lowercase[i+1:],
                                string.ascii_lowercase[i]) for i in range(26)).get
    
    >>> find_missing_letter('abcdefghijklmnoprstuvwxyz')
    'q'
    

    Here are my timings (500000 runs, tested with letters missing near the start, middle and end of the string (b, m and y)

                             "b"    "m"     "y"
                   bisect : 2.762   2.872   2.922  (Phil H)
                 find_gap : 3.388   4.533   5.642  (unwind)
                 listcomp : 2.832   2.858   2.822  (monkut)
             listcomp_set : 4.770   4.746   4.700  As above, with sourcelist=set(sourcelist) first
           set_difference : 2.924   2.945   2.880  (Phil H)
                      sum : 3.815   3.806   3.868
                 sum_imap : 3.284   3.280   3.260
                    basic : 0.544   1.379   2.359 
              dict_lookup : 0.135   0.133   0.134
    
    Christian Witts : Was your complete set only built once or did you build it on every iteration ?
    Brian : The set of all alphabetic chars (in Phil H's set_difference method) was built once. For listcomp_set, it was every iteration (we need to, as we're checking for membership in the string being tested)
    J.F. Sebastian : +1: for dict approach

0 comments:

Post a Comment

Note: Only a member of this blog may post a comment.