Optical Character Recognition for Cursive Handwriting

Nafiz Arica, Student Member, IEEE, and Fatos T. Yarman-Vural, Senior Member, IEEE

Abstract—In this paper, a new analytic scheme, which uses a sequence of segmentation and recognition algorithms, is proposed for the offline cursive handwriting recognition problem. First, some global parameters, such as slant angle, baselines, and stroke width and height, are estimated. Second, a segmentation method finds character segmentation paths by combining gray-scale and binary information. Third, a Hidden Markov Model (HMM) is employed for shape recognition to label and rank the character candidates.

For this purpose, a string of codes is extracted from each segment to represent the character candidates. The estimation of the feature space parameters is embedded in the HMM training stage, together with the estimation of the HMM model parameters. Finally, the lexicon information and the HMM ranks are combined in a graph optimization problem for word-level recognition. This method corrects most of the errors produced by the segmentation and HMM ranking stages by maximizing an information measure in an efficient graph search algorithm. The experiments indicate higher recognition rates compared to the available methods reported in the literature.

Index Terms—Handwritten word recognition, preprocessing, segmentation, optical character recognition, cursive handwriting, hidden Markov model, search, graph, lexicon matching.

The authors are with the Computer Engineering Department, Middle East Technical University, Ankara, Turkey. E-mail: {nafiz, vural}@ceng.metu.edu.tr.

1 INTRODUCTION

THE most difficult problem in the field of Optical Character Recognition (OCR) is the recognition of unconstrained cursive handwriting. The present tools for modeling the almost infinitely many variations of human handwriting are not yet sufficient. The similarities of distinct character shapes and the overlaps and interconnections of neighboring characters further complicate the problem.

Additionally, when observed in isolation, characters are often ambiguous and require context information to reduce the classification error. Thus, current research aims at developing constrained systems for limited-domain applications, such as postal address reading [21], check sorting [8], tax reading [20], and office automation for text entry [7]. A well-defined lexicon plus a well-constrained syntax help provide a feasible solution to the problem [11]. Handwritten word recognition techniques use either holistic or analytic strategies for the training and recognition stages.

Holistic strategies employ top-down approaches for recognizing the whole word, thus eliminating the segmentation problem [9]. In this strategy, global features, extracted from the entire word image, are used in the recognition of a limited-size lexicon. As the size of the lexicon gets larger, the complexity of the algorithms increases linearly due to the need for a larger search space and a more complex pattern representation. Additionally, the recognition rates decrease rapidly due to the decrease in between-class variances in the feature space.

The analytic strategies, on the other hand, employ bottom-up approaches, starting from the stroke or character level and going towards producing a meaningful text. Explicit [23] or implicit [16] segmentation of the word into characters or strokes is required for this strategy. With the cooperation of the segmentation stage, the problem is reduced to the recognition of simple isolated characters or strokes, which can be handled for an unlimited vocabulary. However, there is no segmentation algorithm available in the literature for correctly extracting the characters from a given word image.

The popular techniques are based on over-segmenting the words and applying a search algorithm for grouping the segments to make up characters [14], [10]. If a lexicon of limited size is given, dynamic programming is used to rank every word in the lexicon. The word with the highest rank is chosen as the recognition hypothesis. The complexity of the search process for this strategy also increases linearly with the lexicon size if a flat representation of the lexicon is used. More efficient representations, such as tries and hash tables, can be used in order to reduce the search space.
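A trie makes this concrete: lexicon words share common prefixes, so a search can abandon a grouping of segments as soon as its partial transcription is no longer a prefix of any lexicon word. A minimal Python sketch (the class layout and method names are illustrative, not taken from the paper):

```python
class TrieNode:
    """One node of a lexicon trie; children are indexed by character."""
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    """Prefix tree over the lexicon, built once before recognition."""
    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

    def has_prefix(self, prefix):
        """True if some lexicon word starts with `prefix`; used to prune
        dead candidate character sequences early in the search."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True

lexicon = Trie(["eight", "eighteen", "eighty"])
assert lexicon.has_prefix("eigh") and not lexicon.has_prefix("ein")
```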

Application of preprocessing techniques to a given image may introduce unexpected distortions (closing loops, breaking characters, spurious branches, etc.) into the data, which may cause unrecoverable errors in the recognition system. Most of the existing character recognition systems threshold the gray-level image and normalize the slant angle and baseline skew in the preprocessing stage. Then, they employ the normalized binary image in the segmentation and recognition stages [10], [16], [3]. However, in some cases, normalization may severely deform the writing, generating improper character shapes.

Furthermore, through the binarization of the gray-scale document image, useful information is lost. In order to avoid the limitations of the binary image, some recent methods use the gray-level image [13]. There, however, insignificant details may suppress important shape information. The scheme developed in this study employs an analytic approach on the gray-level image, which is supported by the binary image and a set of global features. The document image is not preprocessed for noise reduction and normalization. However, global parameters, such as the lower and upper baselines and the slant angle, are estimated and then incorporated to improve the accuracy of the segmentation and recognition stages. The scheme makes concurrent use of the binary and gray-level images in a mixed way to extract the maximum amount of information for both segmentation and recognition. The segmentation algorithm proposed in this study segments the whole word into strokes, each of which corresponds mostly to a character or rarely to a portion of a character.

Fig. 1. System overview.

Recognition of each segment is accomplished in three stages. In the first stage, characters are labeled in three classes: ascending, descending, and normal characters. In the second stage, a Hidden Markov Model (HMM) is employed for shape recognition. The features extracted from the strokes of each segment are fed to a left-right HMM. The parameters of the feature space are also estimated in the training stage of the HMM. Finally, an efficient word-level recognition algorithm resolves the handwritten string by combining the lexicon information and the HMM probabilities.

2 SYSTEM OVERVIEW

The proposed system receives the gray-level word image as input, assuming that the segmentation of the input image into individual words has already been performed. Although the system is designed for cursive handwriting, the methodologies used in the system are easily applicable to machine-printed or hand-printed characters. The system overview is summarized by the block diagram representation in Fig. 1. The global parameter estimation, segmentation, and feature extraction stages employ both gray-level and binary images. The parameters for the HMM and the feature space are estimated by using the correctly segmented character images in training.

These parameters are then used in the feature extraction and HMM ranking of the character segments. Finally, the word-level recognition algorithm maximizes an information measure, using the HMM probabilities and the lexicon information, resulting in an ASCII string. If the input image consists of isolated characters, the segmentation stage is omitted.

Global Parameter Estimation. The output of the global parameter estimation stage is the word-level features, such as average stroke width/height, baselines, and skew and slant angles (see Section 3).

First Level Character Classification. The baselines and the character size information estimated in the HMM training stage are used to decide on the ascending and descending character thresholds in a given word image. The character size information contains the height-to-width ratios of ascending, descending, and normal characters (see Section 4).

Segmentation. Initially, the word image is divided into segmentation regions, each of which contains a segmentation path. Then, a search process finds the segmentation path in each region in order to split the connected characters.

The algorithm performs the search process by combining the characteristics of the gray-scale and binary images. The proposed method slightly over-segments the word image (see Section 5).

Feature Extraction and HMM Training. Since the HMM is most successful in the recognition of one-dimensional strings of codes, it is critical to represent the two-dimensional information of the character images as one-dimensional strings. A feature extraction scheme proposed by the authors of this study [1] is employed in this stage, where a set of directional skeletons is extracted by scanning a fixed-size window in various directions (see Section 6.1). HMM training is performed on the selected output of the segmentation stage for the estimation of both the HMM parameters and the parameters of the feature space. These parameters are composed of the character window size, the number of scanning directions, and the number of regions in each scanning direction. The parameters which give the maximum recognition rate for the training set are then used to form the feature space of the recognition stage (see Section 6.2).

HMM Ranking. Each string of codes extracted from a character segment is fed to the HMM recognizer.

The output is a ranked list of character labels with the associated HMM probabilities for one or more sequential segments (see Section 6.3).

Word Level Recognition. The goal of this stage is to recognize the unknown word by using the candidate characters of the HMM recognizer and the lexicon information. For this purpose, the candidate characters and the associated HMM probabilities are represented in a word graph. Then, dynamic programming is invoked for finding the best path in the word graph, which corresponds to a valid word in the lexicon (see Section 7).
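To illustrate the flavor of this search, the toy sketch below scores each lexicon word by dynamic programming over groupings of consecutive segments, using hypothetical HMM log-probabilities; the paper's actual information measure and word-graph construction (Section 7) are richer than this simplification:

```python
import math

# Hypothetical HMM output: log-probabilities of character labels for
# groupings (first_segment, last_segment) of the over-segmented strokes.
span_scores = {
    (0, 0): {"c": math.log(0.7), "e": math.log(0.2)},
    (1, 1): {"l": math.log(0.6), "d": math.log(0.3)},
    (1, 2): {"d": math.log(0.5)},
    (2, 2): {"o": math.log(0.8)},
}
NUM_SEGMENTS = 3

def word_score(word, spans, num_segments):
    """Best log-probability of `word` over all ways of grouping the
    segments into characters (dynamic programming over a word graph)."""
    NEG = float("-inf")
    # best[s][c]: best score explaining word[:c] with segments [0, s)
    best = [[NEG] * (len(word) + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for s in range(num_segments):
        for c in range(len(word)):
            if best[s][c] == NEG:
                continue
            for e in range(s, num_segments):      # group segments s..e
                logp = spans.get((s, e), {}).get(word[c], NEG)
                if logp > NEG and best[s][c] + logp > best[e + 1][c + 1]:
                    best[e + 1][c + 1] = best[s][c] + logp
    return best[num_segments][len(word)]

lexicon = ["clo", "cdo", "co"]
print(max(lexicon, key=lambda w: word_score(w, span_scores, NUM_SEGMENTS)))
# -> "clo": the best grouping is (0,0)='c', (1,1)='l', (2,2)='o'
```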

Fig. 2. Slant angle detection of the word "eighteen." (a) Near-vertical contours. (b) Slant of each contour.

3 GLOBAL PARAMETER ESTIMATION

In this stage, first, the input image is binarized by using the optimal thresholding method in [22]. Then, the eight-neighbor boundary extraction algorithm [2] finds the closed contours in the binary word image, and an averaging mask smoothes the character contours. Second, average stroke width/height estimation, slant angle detection, and baseline extraction are performed on the binarized image. These parameters are essential in the segmentation algorithm.

Gray-scale and binary word images are employed together with the global parameters in the subsequent stages of the proposed scheme. Note that this stage does not perform any noise reduction or normalization on the gray-level word image.

3.1 Stroke Width/Height Estimation

A simple program is developed to estimate the average stroke width, which is then used to decide whether the character segmentation regions are spurious or not. It is a two-scan procedure. The first scan calculates the stroke width histogram by counting the black pixel runs in the horizontal direction on each row of the binary image. Then, the mean width, estimated over all of the rows, is taken as the upper bound (maximum width) for the run length of the strokes. The second scan over the stroke width histogram discards those strokes whose run length is greater than the maximum width. Finally, the stroke width of the input word image is estimated as the average width of the strokes retained by the second scan. In order to estimate the stroke height, which is assumed to be the average height of the vertical strokes in the writing, a similar algorithm is used; however, the scanning procedure is applied in the vertical direction, and a minimum height is estimated instead of a maximum width. In the second scan, those runs whose lengths are smaller than the minimum height are discarded. The estimated stroke height is used in slant angle detection and upper baseline extraction, assuming a small slant; it is slightly less than the actual height due to the slant.
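The two scans can be condensed into a short sketch; here `binary` is a 2D array with ones at ink pixels, and the helper and its run-length bookkeeping are illustrative assumptions rather than code from the paper:

```python
import numpy as np

def _run_lengths(lines):
    """Lengths of consecutive ink runs along each 1D line of pixels."""
    runs = []
    for line in lines:
        n = 0
        for px in line:
            if px:
                n += 1
            elif n:
                runs.append(n)
                n = 0
        if n:
            runs.append(n)
    return runs

def stroke_width(binary):
    """Two-scan width estimate: the mean horizontal run length is the
    upper bound (maximum width); longer runs are discarded."""
    runs = _run_lengths(np.asarray(binary))
    if not runs:
        return 0.0
    bound = np.mean(runs)
    return float(np.mean([r for r in runs if r <= bound]))

def stroke_height(binary):
    """Vertical analogue: the mean vertical run length is a lower bound
    (minimum height), so the short runs are discarded instead."""
    runs = _run_lengths(np.asarray(binary).T)
    if not runs:
        return 0.0
    bound = np.mean(runs)
    return float(np.mean([r for r in runs if r >= bound]))
```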

3.2 Slant Angle Detection

Slant is the deviation of the strokes from the vertical direction, depending on the writing style. Previous studies in the literature involve the Hough transform [12], the slanted histogram approach [9], and word contour analysis [15]. In this study, the slant of a word is estimated by finding the average angle of the near-vertical strokes, extracted by three-by-three masks which allow a one-pixel deviation from the vertical line. This is done by extracting chains of connected pixels representing the edges of strokes whose lengths are longer than the estimated stroke height. The mode orientation of those edges close to the vertical direction is used as the overall slant estimate for the word image (Fig. 2).
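A sketch of this estimate, assuming the contour chains have already been produced by the eight-neighbor boundary extraction step; the chord-based angle and one-degree binning are simplifications of this illustration:

```python
import math
from collections import Counter

def slant_estimate(edge_chains, stroke_height, tol_deg=30):
    """Mode orientation of near-vertical edges. `edge_chains` are lists of
    (x, y) contour pixels, assumed to come from the eight-neighbor boundary
    extraction step; chains shorter than the estimated stroke height are
    ignored, and the chord angle stands in for the edge orientation."""
    angles = []
    for chain in edge_chains:
        if len(chain) < stroke_height:
            continue
        (x0, y0), (x1, y1) = chain[0], chain[-1]
        theta = math.degrees(math.atan2(x1 - x0, y1 - y0))  # 0 = vertical
        if abs(theta) <= tol_deg:               # keep near-vertical edges only
            angles.append(round(theta))         # one-degree bins for the mode
    if not angles:
        return 0.0
    return float(Counter(angles).most_common(1)[0][0])

# Toy usage: two parallel edges slanted about 15 degrees from vertical.
chains = [[(i // 3, i) for i in range(12)], [(5 + i // 3, i) for i in range(12)]]
print(slant_estimate(chains, stroke_height=8))  # -> 15.0
```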

In many handwriting recognition studies, slant correction is applied before the segmentation and recognition stages. However, this correction produces serious deformation of the characters, which may cause important information loss. In this study, we did not make any slant correction; instead, we used the slant angle in the segmentation stage by moving from the top to the bottom of the word image in the slant direction.

3.3 Baseline Extraction

Proper computation of the baselines, which delimit the main body of the word, has significant meaning, especially in holistic approaches. The locations of the upper and lower baselines determine the existence of ascending and descending characters in a given word image.

Baseline information can also be used in segmentation in order to avoid the problems introduced by the ascending and descending portions of the characters. Most studies in the literature first detect and correct the skew angle and then estimate the baselines using some heuristics based on the horizontal projection profile [18], [9]. One commonly used method for skew detection is to fit a line through a set of points assumed to lie on the baseline by minimizing the sum of squared or absolute distances. The local minima of the word image are generally chosen as the restricted set of points for the detection of the lower baseline.

Therefore, some local minima need to be pruned to remove the ones that do not lie on the baseline, such as the minima on the descending parts of the characters. Most of the time, the minimum of a descending character is considerably lower than the minimum of a normal character. However, it is practically impossible to find a heuristic for correctly eliminating those minima, which spoil the baseline skew. Therefore, a pruning procedure does not work properly in many cases, resulting in an abnormal skew angle.

In order to prevent this undesirable effect, a method which gives a weight to each minimum with respect to the curvature of the stroke at the local minimum is proposed in [5], assuming a lower curvature on the minima of a descending character compared to that of a normal character. However, these curvatures depend highly on the writing style. In this study, we propose a new baseline extraction algorithm, which uses a weighting approach based on the angles between the local minima in a least-squares estimation scheme.

First, a preliminary center line for each word image is determined by finding the horizontal line with the highest number of black pixel runs. Then, the local minima below the preliminary center line are identified, eliminating the ones on the ascending parts (Fig. 3a). The goal is to find the best fit to the local minima with a high contribution from the normal characters and a low contribution from the descending characters. A weight is computed for each minimum by considering the average angle between that minimum and the rest of the minima.

This approach assumes relatively small average angles among the minima of normal characters, compared to the average angle between a descending minimum and the normal minima, independent of the writing style. Finally, a line-fitting algorithm is performed over the weighted local minima.

Fig. 3. Baseline extraction. (a) Local maxima and minima above and below the preliminary center line, respectively. (b) Angles between the minimum i and the other minima. (c) Distances of local maxima from the lower baseline. (d) Upper, center, and lower baselines.

After the detection of the lower baseline, the upper baseline is estimated using the local maxima above the lower baseline. The following algorithm gives the steps of the lower and upper baseline extraction method:

Baseline Extraction

1. Scan the binary word image horizontally and count the black pixel runs. Find the horizontal line, called the preliminary center line, which contains the highest number of black pixel runs (Fig. 3a).

Lower Baseline Extraction:

2. Identify the local minima below the preliminary center line. For each minimum i, construct the list of angles $\{\theta_j\}$ between that minimum and the other minima (Fig. 3b).

3. For each minimum i,
a) cluster the angles $\theta_j$, $1 \le j \le M$, $j \ne i$, into two classes $g_1$ and $g_2$ by using the C-Means algorithm, where M is the number of local minima;
b) find the class $g_k$ with the highest number of elements and its mean $\mu_k$;
c) set the weight $w_i = 1/\mu_k^2$, so that descending minima, which form large angles with the minima of the normal characters, receive small weights.

4. Find the lower baseline by fitting the line $y = ax + b$ that minimizes the sum of the weighted square distances,

$E = \sum_{j=1}^{M} w_j \,(y_j - a x_j - b)^2$,   (1)

where a and b are the parameters to be estimated for the lower baseline.

Upper Baseline Extraction:

5. Identify the local maxima above the lower baseline and calculate the vertical distances $\{d_i\}$ of each maximum from the lower baseline (Fig. 3c).

6. Prune the maxima whose distance is lower than the estimated stroke height, in order to eliminate the spurious local maxima which belong to the main body of the writing.

7. Cluster the remaining local maxima into two classes according to their distances $\{d_i\}$ from the lower baseline.

8. Take the mean value of the class which includes the local maxima with smaller distances and draw a line parallel to the lower baseline which passes through that mean.

Center Baseline Extraction:

9. Find the center baseline as the parallel line with equal distance to the upper and lower baselines.
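The sketch below implements steps 2-4 under the reconstruction above: the angles between minima are clustered into two classes, the weight of each minimum is derived from the mean of its largest class, and a weighted least-squares line is fitted. The weight formula $w_i = 1/\mu_k^2$ and the simple one-dimensional C-means loop are assumptions of this sketch:

```python
import numpy as np

def lower_baseline(minima):
    """Weighted least-squares fit y = a*x + b to the local minima.
    `minima` is a list of (x, y) points below the preliminary center line
    (y grows downward). The weight w_i = 1 / mu_k**2 and the simple
    one-dimensional two-class C-means loop are assumptions of this sketch."""
    pts = np.asarray(minima, dtype=float)
    M = len(pts)
    weights = np.empty(M)
    for i in range(M):
        others = np.arange(M) != i
        dx = np.abs(pts[others, 0] - pts[i, 0])
        dy = np.abs(pts[others, 1] - pts[i, 1])
        theta = np.degrees(np.arctan2(dy, dx))  # angles to the other minima
        lo, hi = theta.min(), theta.max()       # two-class C-means on the angles
        for _ in range(10):
            split = (lo + hi) / 2.0
            g1, g2 = theta[theta <= split], theta[theta > split]
            lo = g1.mean() if g1.size else lo
            hi = g2.mean() if g2.size else hi
        largest = g1 if g1.size >= g2.size else g2
        mu_k = max(largest.mean(), 1e-3)        # mean angle of the biggest class
        weights[i] = 1.0 / mu_k ** 2            # descending minima get small weights
    # np.polyfit squares its weight argument, so pass sqrt(w) to minimize (1)
    a, b = np.polyfit(pts[:, 0], pts[:, 1], 1, w=np.sqrt(weights))
    return a, b

# Toy usage: five minima on the baseline plus one descender at (50, 140).
print(lower_baseline([(0, 100), (20, 101), (40, 99), (60, 100), (80, 101), (50, 140)]))
```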

4 FIRST LEVEL CHARACTER CLASSIFICATION

In this stage, the proposed system decides on the existence of ascending and descending characters in a given word image. Although this information is not crucial for the system, it improves the accuracy of the segmentation and recognition algorithms a great deal. In most of the holistic approaches, ascending and descending characters are identified by using an empirical threshold as a percentage of the main body [9]. However, this threshold is subject to change for different data sets.

In this study, the thresholds are determined with respect to the character size parameters estimated in the HMM training stage. The distance between the upper and lower baselines is taken as the height of the normal characters. Since the character size parameters contain the height-to-width ratios of normal, ascending, and descending characters, we can easily calculate the heights of the ascending and descending characters, assuming equal width for all the characters in a given word. Finally, the ascending/descending characters are detected by using threshold values computed as halfway between the heights of the normal and the ascending/descending characters (Fig. 4).
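A minimal sketch of this threshold computation, assuming y grows downward and that the ratio table uses the illustrative keys shown here:

```python
def ascender_descender_thresholds(upper_y, lower_y, ratios):
    """Thresholds for detecting ascending/descending characters.
    `upper_y` and `lower_y` are the baseline rows (y grows downward);
    `ratios` holds the height-to-width ratios estimated in HMM training.
    The dictionary keys are illustrative, not taken from the paper."""
    normal_h = lower_y - upper_y                 # height of the main body
    width = normal_h / ratios["normal"]          # equal width assumed within a word
    asc_h = width * ratios["ascending"]          # expected ascender height
    desc_h = width * ratios["descending"]        # expected descender height
    # halfway between the normal height and the asc/desc heights
    return (normal_h + asc_h) / 2.0, (normal_h + desc_h) / 2.0

# Toy usage: a 30-pixel main body with 2:1 ascender/descender ratios.
print(ascender_descender_thresholds(50, 80, {"normal": 1.0, "ascending": 2.0, "descending": 2.0}))
```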

Fig. 4. First-level character classification. (a) Character size parameters estimated in HMM training. (b) Ascending and descending character detection.

Fig. 5. Extraction of segmentation regions. (a) From character peaks to top. (b) From character peaks to lower baseline. (c) From lower baseline to bottom. (d) The segmentation regions for the overall word.

5 SEGMENTATION

Segmentation is the most crucial part of the cursive handwriting recognition problem when an analytic approach is used for recognition [4].

Small modifications on the segmentation algorithm result in serious improvements in the recognition rates. The segmentation method proposed in this study is motivated by the work of Lee et al. [13], where the character segmentation problem is defined as the problem of finding the shortest path in a graph defined over a segmentation region, minimizing the accumulated intensity. In [13], the segmentation regions are identified from the peaks of the horizontal projection profile in the gray-level image, assuming machine-printed characters.

On the other hand, our aim is to develop a system for handwritten characters, which may be both slanted and skewed. For this reason, the proposed method performs the segmentation by combining the characteristics of the gray-scale and binary images. First, the segmentation regions are determined in the binary word image by using the contour of the writing. Then, an improved search process is applied to the segmentation regions on the gray-scale word image for determining the segmentation boundaries between characters.
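To make the minimum-intensity path formulation concrete, the toy sketch below finds a top-to-bottom path of minimum accumulated cost inside one segmentation region by dynamic programming; the actual search process combines gray-scale and binary information and follows the slant direction, which this sketch omits:

```python
import numpy as np

def min_intensity_path(region):
    """Top-to-bottom path of minimum accumulated cost through one
    segmentation region, in the spirit of the formulation in [13].
    `region` holds per-pixel costs (e.g., ink darkness), so the best
    path slips through the light gap between two characters.
    Allowed moves from one row to the next: down, down-left, down-right."""
    region = np.asarray(region, dtype=float)
    h, w = region.shape
    cost = region.copy()
    back = np.zeros((h, w), dtype=int)
    for y in range(1, h):
        for x in range(w):
            prev = [(cost[y - 1, xx], xx) for xx in (x - 1, x, x + 1) if 0 <= xx < w]
            best_cost, back[y, x] = min(prev)
            cost[y, x] += best_cost
    x = int(np.argmin(cost[-1]))
    path = [x]
    for y in range(h - 1, 0, -1):          # backtrack one column index per row
        x = back[y, x]
        path.append(x)
    return path[::-1]                      # column of the path at each row

gap = np.ones((5, 7)); gap[:, 3] = 0       # toy region: a zero-cost column at x = 3
print(min_intensity_path(gap))             # -> [3, 3, 3, 3, 3]
```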

5.1 Determination of Segmentation Regions

The segmentation regions carry the potential segmentation boundaries between the connected characters. Our first task is to partition each word image into stripes along the slant angle direction, each of which contains a potential segmentation boundary. Two rules are applied on the binary word image for identifying the segmentation regions:

Rule 1. A single maximum above the center baseline is an indication of a single character or a portion of a character.