Python and CAPTCHA

Python August 4th, 2008

It has been a long time since my last post, because I have gotten much lazier. There are now only three things in my life: eating, sleeping and DOTA…
I use a third-party tool to get onto the VS platform to play DOTA, and my accounts often get banned. So I wrote an automated registration tool; the following is a step-by-step guide to building it :)

First of all, I used Live HTTP Headers to capture the data that is posted to the server when registering. The roadblock is that we need to parse the CAPTCHA image and send the correct numbers along with the form.
The VS CAPTCHA is very simple: the digits sit at fixed positions, with a lot of noise. See:
captha_orig.jpg
Split the image and find the regions that contain the numbers. See:
captha_split.jpg
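Because the digits sit at fixed positions, the four regions can be written down as crop boxes. As a small sketch (Python 3 syntax; the constant and function names are mine, not from the original script), the boxes used in the script later in this post follow a regular pattern: each is 12 pixels wide and 13 pixels tall, and the next one starts 13 pixels further right.

```python
# Crop boxes are (left, upper, right, lower), the convention PIL's crop() uses.
# The names below are illustrative; the numbers come from the script in this post.
FIRST_LEFT = 5      # x of the first digit's left edge
TOP, BOTTOM = 3, 16
DIGIT_WIDTH = 12
STEP = 13           # distance between the left edges of neighbouring digits

def digit_boxes(count=4):
    """Return one crop box per digit, left to right."""
    return [
        (FIRST_LEFT + k * STEP, TOP, FIRST_LEFT + k * STEP + DIGIT_WIDTH, BOTTOM)
        for k in range(count)
    ]

print(digit_boxes())
```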
Use the PIL library to convert the source image to black and white. See:
captha_converted.jpg
But this converted image has too much noise, so we need to apply a CONTOUR filter first. See:
captha_filted.jpg
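The filter and conversion steps might look like the sketch below. It assumes Pillow (the maintained fork of PIL, imported as `PIL`) and Python 3, and it draws a stand-in image in memory instead of downloading a real CAPTCHA; the image size and the drawn bar are made up for illustration.

```python
from PIL import Image, ImageFilter

# Stand-in for the downloaded CAPTCHA: a white greyscale image with a dark bar.
# (Hypothetical size and content, only so the sketch runs without a network.)
image = Image.new("L", (60, 20), 255)
for x in range(8, 14):
    for y in range(5, 14):
        image.putpixel((x, y), 0)

# Same order as the script in this post: CONTOUR filter first, then 1-bit mode.
bw = image.filter(ImageFilter.CONTOUR).convert("1")

# Crop the first digit region and flatten it to a list of pixel values.
block = list(bw.crop((5, 3, 17, 16)).getdata())
print(bw.mode, len(block))
```

Each value in `block` is either 0 or 255, which is the form of the sample data shown in this post.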
The next job is to extract the data from the regions where the numbers are located. We only need to extract some standard samples, for example the "2" in the previous image:
captha_sample.jpg
And the data is:

[255, 0, 255, 0, 0, 0, 255, 0, 255, 255, 255, 0, 0, 255, 0, 0, 0, 0, 0, 0, 0, 255, 255, 255, 255, 255, 0, 255, 255, 255, 255, 255, 0, 0, 255, 0, 255, 255, 255, 0, 255, 255, 255, 255, 0, 0, 255, 255, 255, 255, 255, 0, 255, 255, 255, 255, 0, 0, 0, 255, 255, 255, 255, 0, 255, 0, 255, 0, 0, 255, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 0, 0, 255, 255, 255, 0, 255, 255, 0, 0, 255, 255, 255, 0, 255, 0, 255, 255, 255, 0, 0, 255, 255, 0, 255, 255, 0, 0, 255, 255, 0, 0, 255, 0, 255, 0, 255, 255, 0, 255, 255, 0, 0, 255, 255, 255, 255, 255, 0, 255, 0, 255, 255, 0, 255, 0, 0, 0, 0, 0, 0, 255, 255, 0, 255, 0, 0, 0, 0, 0, 0, 0, 0, 255, 0]
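
A flat list like this is easier to check by eye when printed as a grid. The helper below is a small sketch of my own (Python 3), not part of the original script; it assumes the list is stored row by row, and the tiny 4x3 sample is made up.

```python
def render(block, width):
    """Return the flat pixel list as text rows: '#' for 0 (dark), '.' otherwise."""
    rows = []
    for start in range(0, len(block), width):
        row = block[start:start + width]
        rows.append("".join("#" if value == 0 else "." for value in row))
    return "\n".join(rows)

# Made-up 4x3 sample, not real CAPTCHA data.
sample = [255, 0, 0, 255,
          0, 255, 255, 0,
          255, 0, 0, 255]
print(render(sample, 4))
```

For the blocks cropped in the script in this post, the width would be 12.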

We need to do this many times, to get a standard sample for each digit from 0 to 9 and build a feature library.
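
One possible layout for the feature library, judging from how the script in this post uses it (the index of the best match in `FEATURES` is returned as the digit), is a list whose entry `d` is the sample for digit `d`. The sketch below (Python 3) is an assumption about that layout, not the contents of the real `features.py`; the helper names are hypothetical and the two-pixel samples are placeholders.

```python
def build_features(samples):
    """Turn {digit: pixel_list} into a list ordered by digit, 0 through 9."""
    missing = [d for d in range(10) if d not in samples]
    if missing:
        raise ValueError("no sample for digits: %s" % missing)
    return [list(samples[d]) for d in range(10)]

def features_source(features):
    """Return Python source text that could be saved as a features module."""
    return "FEATURES = %r\n" % (features,)

# Placeholder samples: real ones would hold one cropped block per digit.
placeholder = {d: ([255, 0] if d % 2 else [0, 255]) for d in range(10)}
features = build_features(placeholder)
print(features_source(features)[:40])
```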

So the complete CAPTCHA parsing process is: get the image, apply a CONTOUR filter and convert it, extract the data from the regions where the numbers are located, and look up the closest matching number in the feature library. I use the Levenshtein distance algorithm to do this matching.
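
To make the matching step concrete, here is a compact sketch in Python 3. It computes the same kind of edit distance as the script in this post but keeps only two rows of the table, and it picks the feature with the smallest distance. The function names are mine, and the three-entry `features` list is a toy stand-in for the real library.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, using two rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # substitute (0 if equal)
            ))
        previous = current
    return previous[-1]

def closest_index(block, features):
    """Index of the feature with the smallest distance to block."""
    distances = [levenshtein(block, feature) for feature in features]
    return distances.index(min(distances))

# Toy feature list: indexes 0, 1, 2 stand for three different "digits".
features = [
    [0, 0, 0, 0, 255, 255],
    [255, 255, 255, 255, 0, 0],
    [0, 255, 0, 255, 0, 255],
]
noisy = [255, 255, 0, 255, 0, 0]   # feature 1 with one pixel flipped
print(closest_index(noisy, features))
```

Ties go to the lowest index, the same as `distance.index(minimal)` in the script.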

The following is the example code:

#!/usr/bin/env python
# -*- coding: utf8 -*-
 
import cookielib, Image, ImageFilter, StringIO, urllib, urllib2
 
from features import FEATURES
 
CAPTHA = 'http://www.vsa.com.cn/user/center/code/image2.jsp'
 
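def _levenshtein_distance(m, n):
    # Edit distance between sequences m and n (dynamic programming table).
    len_plus = lambda x: len(x) + 1

    # c[i][j] holds the distance between m[:i] and n[:j].
    c = [[i] for i in range(0, len_plus(m))]
    c[0] = [j for j in range(0, len_plus(n))]

    for i in range(0, len(m)):
        for j in range(0, len(n)):
            c[i+1].append(
                min(
                    c[i][j+1] + 1,  # deletion
                    c[i+1][j] + 1,  # insertion
                    c[i][j] + (0 if m[i] == n[j] else 1)  # substitution
                )
            )

    return c[-1][-1]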
 
def _get_number(source):
 
    distance = [_levenshtein_distance(source, i) for i in FEATURES]
    minimal = min(distance)
    return distance.index(minimal)
 
# Set the cookie
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
 
# Get the captha image
img_file = opener.open(CAPTHA)
tmp = StringIO.StringIO(img_file.read())
image = Image.open(tmp)
 
# Show the image
image.show()
 
# Convert the image
i = image.filter(ImageFilter.CONTOUR).convert('1')
 
# Get four numbers regions' data
blocks = [list(i.crop(b).getdata()) for b in [(5, 3, 17, 16), (18, 3, 30, 16), (31, 3, 43, 16), (44, 3, 56, 16)]]
 
# Parse numbers
numbers = [_get_number(b) for b in blocks]
 
# Output numbers
print '%s%s%s%s' % tuple(numbers)

features.py is just the feature library file, so I won't post it here. To download the complete code example, click here :)

After parsing the CAPTCHA, the only remaining work is posting the data to the server, which should be easy for you.
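
For completeness, that last step could be sketched as below in Python 3. The URL and every form field name are placeholders I invented; the real ones are whatever Live HTTP Headers shows for the registration request. The sketch only builds the request and does not send it.

```python
import urllib.parse
import urllib.request

# Placeholder endpoint and field names, NOT the real registration form.
REGISTER_URL = "http://example.com/register"

def build_register_request(username, password, captcha_digits):
    """Build (but do not send) a form-encoded POST request."""
    form = {
        "username": username,
        "password": password,
        "code": "".join(str(d) for d in captcha_digits),
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    # Passing data makes urllib treat the request as a POST.
    return urllib.request.Request(REGISTER_URL, data=body)

request = build_register_request("player1", "secret", [4, 0, 7, 2])
print(request.get_method(), request.data)
```

To actually send it, the request would presumably go through the same cookie-aware opener that fetched the CAPTCHA, so that the server sees the session the image belongs to.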
