Laserfiche®: Understanding Pattern Matching
Training Guide
January 2006

The information contained in this document represents the current view of Compulink Management Center, Inc. on the issues discussed as of the date of publication. Because Compulink must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Compulink, and Compulink cannot guarantee the accuracy of any information presented after the date of publication.

This chapter is for informational purposes only. COMPULINK MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.

Table of Contents

Laserfiche: Understanding Pattern Matching
Table of Contents
Introduction
Pattern Matching Basics
Common OCR Errors
    Zone OCR and Spaces
    Dates and Slashes
Multivariable Barcodes
Moving Data
Pattern Matching and OMR
Appendix A: Regular Expression Reference Guide
    Regular Expressions
    Abbreviations
Appendix B: Common Sample Patterns

Introduction

Pattern Matching is an add-on to Quick Fields that can help ensure that documents are processed correctly. Pattern Matching "reads" through text strings, looking to match patterns that are defined using regular expressions. Based on the regular expression you write, Pattern Matching processes can return all or part of the pattern that they search for.

This session is intended to be an introduction to pattern matching and uses common examples to show its usefulness in real world scenarios. Although working knowledge of Quick Fields or the Standard Scanning interface is required, there is no assumed knowledge of regular expressions or prerequisite experience with Pattern Matching.

Pattern Matching Basics

Pattern Matching can be used in two basic ways: for text validation, and for text extraction. When a user uses Pattern Matching to validate information, the pattern matching process compares a value with an expected pattern or value. For instance, you might wish to verify that a particular value is a valid social security number. You could configure a pattern match that checks that the value is of xxx-xx-xxxx format, where all characters except for the dashes are numbers. If the pattern is found, the string is validated; if it is not, the string is flagged as unvalidated.

Pattern Matching can also be used for text extraction. In this case, Pattern Matching will insert the information that matches the pattern into the desired document property as a token. For instance, you could insert the social security numbers found above into a particular template field in the document.
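The difference between validation and extraction can be sketched outside Quick Fields. The example below is only an illustration in Python's `re` module, whose syntax differs from the Quick Fields expression editor; the function names `validate_ssn` and `extract_ssns` are hypothetical and not part of any Laserfiche product.

```python
import re

# Python stand-in for the xxx-xx-xxxx check described above.
# This is an illustrative pattern, not Quick Fields syntax.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def validate_ssn(value: str) -> bool:
    """Validation: True only if the whole value has the xxx-xx-xxxx shape."""
    return SSN_PATTERN.fullmatch(value) is not None


def extract_ssns(text: str) -> list[str]:
    """Extraction: return every xxx-xx-xxxx value found in the text."""
    return SSN_PATTERN.findall(text)
```

Validation answers a yes/no question about one value, while extraction pulls matching values out of a longer string so they can be stored elsewhere, which is the role tokens play in Quick Fields.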
"Match groups" are a particularly important concept in pattern matching. A match group is a particular set of expressions that you want to match in the document; for instance, the set of expressions used to locate a social security number could be configured as a match group. In Laserfiche Quick Fields, match groups are defined by enclosing the expressions you wish to match in braces. Using match groups allows you to match a particular set of criteria, which you can then reference or use later, and to clearly define what values you wish to validate or store. For instance, you can use match group syntax to find social security numbers with dashes between the groups of numbers, but to only store the numbers themselves, not the dashes.You can also elect to locate only the first occurrence of a string that matches the match group, or all occurrences that match. For instance, the syntax { [0-9] }would look for digits between 0 and 9. If you configured your session to locate the first occurrence of a match only, this syntax would return 1 from the string 13579 in the 3 document. If you configured your session to locate all occurrences of a match, the syntax would return 13579. By default, a pattern matching process will attempt to match as much as possible. Pattern matching processes are 'greedy' -they will match as large a string as they can unless specifically constrained otherwise. For instance, you might configure a pattern matching process to match a string starting with the letter 'b' and ending with the letter 'a,' with a variable number of other letters in between the two. If the process located the word 'bananas,' it would return 'banana' rather than 'ba' or 'bang.' Furthermore, if it located the sentence 'Bananas are tasty,' it would match 'Bananas are ta.' You would need to more strictly constrain the syntax to return only a subset of these results. 
Finally, when configuring a pattern matching process, it is important to remember that there may be multiple ways to approach the same problem. If you are having trouble getting the results you want with the initial syntax you try, think more about the results you are trying to generate. Is there another way to approach the problem that might give you the results that you want?

Common OCR Errors

One of the most frequent applications of pattern matching is to correct the most common OCR errors that occur when Zone OCR is used to extract information from documents. Zone OCR can automate the processing and indexing of documents, and regular errors in the OCR can greatly diminish this capability because they require manual intervention to correct. Often, Pattern Matching can be used to account for and correct OCR errors, and therefore to eliminate the need for manual correction.

Zone OCR and Spaces

We have found that sometimes, due to character spacing, the OCR engine will include a space in the middle of the word or number being extracted through a Zone OCR process. This can be problematic for a number of reasons. If you're using the data to name, index and/or file documents, the spaces could cause problems with searching and folder creation because the system wouldn't know to eliminate them. Additionally, if you're using field constraints to enforce formatting rules, the space could cause invalid fields; these invalid fields would need to be corrected before the documents could be sent to Laserfiche.

The pattern you use to eliminate spaces will depend on the type of data you are extracting. If all you need to do is eliminate spaces and extract all recognized text in a specific region, you would simply use the abbreviation for alphanumeric character in the following pattern:

    {\a+}

This pattern would return all letters and numbers in the string of recognized text and ignore everything else, including spaces.
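As a rough sketch of what this pattern does when every occurrence is returned, the Python below collects each run of letters and digits and joins the runs together. The character class `[a-zA-Z0-9]` stands in for the \a abbreviation, and the function name `keep_alphanumerics` is hypothetical.

```python
import re


def keep_alphanumerics(ocr_text: str) -> str:
    """Join every run of letters and digits, dropping spaces and other characters."""
    # [a-zA-Z0-9]+ plays the role of \a+ in this sketch; findall returns
    # all occurrences, which are then concatenated.
    return "".join(re.findall(r"[a-zA-Z0-9]+", ocr_text))
```

For example, `keep_alphanumerics("INV 00 123")` gives `"INV00123"`. Narrowing the character class to letters only or digits only restricts the output in the same way.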
Note: This pattern and the patterns below will only work in this context if the Pattern Matching option Match Group is configured such that recurring match groups match "All occurrences."

If you need to be more specific about what type of data you want to extract, you can be more specific with your pattern. For example, if you were looking for an employee name, you would use a pattern that only returns letters:

    {\c+}

This pattern would return only the letters from the string of recognized text and ignore everything else, including numbers and spaces.

Similarly, if you were extracting an invoice number and only expecting numbers, you would use a pattern that only returns numbers:

    {\d+}

This pattern would return only numbers from the string of recognized text and ignore everything else, including letters and spaces.

Dates and Slashes

Another common OCR error is the misrecognition of the forward slashes in dates. It's fairly common for the OCR engine to return a 1 or an l (lowercase L) instead of a /. This is troublesome when using the date type because the values will cause Quick Fields to mark the field as invalid or, if it accepts the value, to input the wrong date. Sometimes this problem can be addressed by specifying a letter preference in the configuration settings for the Zone OCR process, but usually you will need to use pattern matching to extract the values for month, day and year and then put them back together using tokens.

There are two common date formats that you will most likely need to work with. The easiest to work with is the eight digit date, where day and month are described using two digits and year is described using four digits (e.g. 01/01/2006). The other type of date you will encounter has a variable format depending on day, month and sometimes year (e.g. 1/1/2006 vs. 11/11/2006).

Eight Digit Dates

The eight digit date is the easiest to work with because the string that the OCR engine returns will always contain 10 characters (don't forget the slashes). The first two characters will be the month, the fourth and fifth characters the day and the last four characters the year. To create a pattern that matches the entire string, you can simply use the abbreviation for any character ten times:

    ..........

This pattern will match every character in a date.

To extract the month, day and year, you would simply use the curly braces around the characters you want Pattern Matching to return. For month, you would put the braces around the first two characters:

    {..}........

This pattern will return only the first two characters of the string.

For day, you would put the braces around the fourth and fifth characters:

    ...{..}.....

This pattern will return only the fourth and fifth characters of the string.

For year, you would put the braces around the last four characters of the string:

    ......{....}

This pattern will return only the last four characters of the string.

Variable Dates

Dates that don't enforce an eight digit format are more difficult to work with because your pattern has to accommodate the potential variations. Your pattern must accommodate a one or two digit month or day and a two or four digit year. To do this, you can use the OR operator of the regular expression editor. For month and day you would use the following pattern:

    \d|(\d\d)

This pattern will match a one or two digit month or day.

For year, you would use the following pattern:

    (\d\d)|(\d\d\d\d)

This pattern will match a two or four digit year.

To match the entire string, you would put month, day and year together and separate them by the abbreviation for any character. It's also a good idea to use the special character for "end of input" to anchor the pattern. To match any date, you would use the following pattern:

    \d|(\d\d).\d|(\d\d).(\d\d)|(\d\d\d\d)$

This pattern will match any date regardless of format or delimiter.

Similar to working with eight digit dates, you can use the curly braces to extract month, day and year.
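The idea of matching a date whose delimiters may have been misread, then rebuilding it from its parts, can be sketched as follows. This is Python's `re` syntax rather than the Quick Fields expression editor: parentheses act as the match groups, `\d\d?` stands for a one or two digit month or day, and the function names `split_date` and `rebuild_date` are hypothetical.

```python
import re
from typing import Optional

# One or two digit month and day, two or four digit year, with ANY single
# character accepted as the delimiter, anchored at the end of the input.
# The four digit year is listed first so it is preferred when present.
DATE_PATTERN = re.compile(r"(\d\d?).(\d\d?).(\d\d\d\d|\d\d)$")


def split_date(ocr_text: str) -> Optional[tuple[str, str, str]]:
    """Return (month, day, year) from OCR output, or None if no date is found."""
    match = DATE_PATTERN.search(ocr_text)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


def rebuild_date(ocr_text: str) -> Optional[str]:
    """Put the extracted parts back together with real slashes."""
    parts = split_date(ocr_text)
    return "/".join(parts) if parts else None
```

For instance, `rebuild_date("01101l2006")`, where the first slash was read as 1 and the second as l, gives `"01/01/2006"`. Because the delimiter is "any character", the sketch never depends on the slashes being recognized correctly.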
For month, you would put the curly braces around the first set of digits:

    {\d|(\d\d)}.\d|(\d\d).(\d\d)|(\d\d\d\d)$

This pattern will return a one or two digit month.

For day, you would put the braces around the second set of digits:

    \d|(\d\d).{\d|(\d\d)}.(\d\d)|(\d\d\d\d)$

This pattern will return a one or two digit day.

For year, you would put the braces around the last set of digits:

    \d|(\d\d).\d|(\d\d).{(\d\d)|(\d\d\d\d)}$

This pattern will return a two or four digit year.

Multivariable Barcodes

Barcodes are an accurate and efficient way to extract data from documents. In fact, Quick Fields can run barcode processes faster than any other data extraction process. The only drawback to using barcodes is that they can mar the aesthetics of a form, and too many barcodes can crowd out the human-readable information. One solution to the problem of cluttering forms with multiple barcodes is to concatenate multiple fields within a single barcode that is placed in an area on the form where it won't detract from the contents of the form itself.

When creating multivariable barcodes, the most important consideration is choosing the delimiter you will use to separate the fields. The regular expression editor has a number of special characters, and choosing a delimiter that is also a regular expression character can cause problems. For example, a common delimiter used for exchanging data as a flat text file is the pipe character (|). From the previous example, you know that the pipe character is the OR operator, so using it as a field delimiter could cause conflicts. The other thing you want to avoid is choosing a field delimiter that could show up as part of a field value, because it could cause your pattern to fail or to return only parts of the fields.

For this example, you will use Pattern Matching to extract multiple fields from a single barcode on invoice forms. The barcodes will include Invoice Number, Customer Name, Invoice Date, P.O. Number and Total Due. Based on the contents of these fields, the "at" sign (@) was chosen as the delimiter. Looking at the sample document, you can see that the barcode includes the following information:

    00100@Smith Company, Inc.@07/31/2005@9876@$100.00

Since we're using a unique delimiter, it's pretty simple to write a pattern that will match the entire string. Since the invoice number is always a five digit number, it makes sense to specify five digits (you could do the same with the four digit P.O. number as well). For the rest of the information, we'll use the abbreviation for any character (.) and a wildcard that means it's repeated zero or more times (*). It's important to use this wildcard because it will allow for blank fields (like P.O.). To match the entire string, we'll use the following pattern:

    \d\d\d\d\d@.*@.*@.*@.*

To extract the individual fields, we use the curly braces like we've done before. To extract the invoice number, we'll use the following pattern:

    {\d\d\d\d\d}@.*@.*@.*@.*

This pattern will extract the five digit invoice number.

To extract the company name, we'll place the curly braces between the first two delimiters:

    \d\d\d\d\d@{.*}@.*@.*@.*

This pattern will extract the company name.

To extract the invoice date, we'll place the curly braces between the second and third delimiters:

    \d\d\d\d\d@.*@{.*}@.*@.*

This pattern will extract the invoice date.

To extract the P.O. number, we'll place the curly braces between the last two delimiters:

    \d\d\d\d\d@.*@.*@{.*}@.*

This pattern will extract the P.O. number.

To extract the total due, we'll place the curly braces after the last delimiter:

    \d\d\d\d\d@.*@.*@.*@{.*}

This pattern will extract the total due.

Moving Data

Sometimes the data you need to extract is printed on pre-printed stationery, and regular movement of the paper in the printer causes the data to shift on the form. Other times, scanned images become skewed as they go through the document viewer and the deskew process causes the data to be shifted vertically or horizontally.
If you're relying on Zone OCR to accurately read data from the forms, the shifting can cause significant problems - especially if the form has a lot of data and your zones can't accommodate any movement of data. To address this issue, you can enlarge your zones, retrieve multiple lines of data and use Pattern Matching to extract only the information you want.

Another thing to consider when using Zone OCR is that using multiple processes will slow down processing. If you have multiple data elements to extract in the same general area on the form, you can use larger zones, retrieve multiple lines of data and extract the individual data elements using Pattern Matching. This strategy will make document processing more efficient than using a Zone OCR process for each data element.

For this example, we'll use the same invoice forms as the Multivariable Barcode example, but we'll use Zone OCR to retrieve the Invoice Number and Invoice Date. Instead of using two separate Zone OCR processes, we'll use one large zone and configure it to retrieve multiple lines of data. The Zone OCR process returns the following data:

    IIII mNNr~I~1111111 Invoice #00100 Date: 07/31/2005

We need to write regular expressions that will return the invoice number as well as the invoice date. Since the invoice number is always a five digit number and is preceded by the pound sign (#), the pattern will be pretty simple. The only exceptional issue we need to account for is the data in the string that comes before the pound sign and the invoice number. To accommodate this information, we can use the abbreviation for any character and the wildcard that means it's repeated zero or more times. The following pattern would work:

    .*#{\d\d\d\d\d}

This pattern will extract the five digit invoice number preceded by the # symbol.

To extract the invoice date, we can take advantage of the fact that it's the last piece of information in the string and that it follows the eight digit date format. The pattern we will use will match the last ten characters at the end of the string:

    {..........}$

This pattern will match the invoice date at the end of the string.

Pattern Matching and OMR

Optical Mark Recognition can determine whether a mark has been made within a specific zone on a form. This technology is especially useful when preprinted slip sheets are used to automate document processing. With slip sheets that have check boxes, you can quickly provide information about documents by checking a couple of boxes on the slip sheet. In this scenario, OMR is not very useful on its own because it can only return true or false values. However, Pattern Matching can be used to add useful information to the output of OMR processes.

For this example, we'll use simple slip sheets with check boxes for document type. We'll use OMR and Pattern Matching to assign document type and file the documents accordingly. This could be part of a two step process where batches are broken into documents that are filed by type and then each document type is processed again in Quick Fields.

The sample slip sheet has four document types: Form 1040, Savings Account Activation, Form W-9 and IRA Application. We'll start with four OMR processes that will determine which of the boxes has been checked. We will know which box has been checked because its corresponding OMR process will return a value of True while the others return a value of False.

To get useful information from these processes, we'll use Pattern Matching to prefix the values they return with the document type they represent. We do this by adding the document type to the token for the OMR process in the Look for pattern in: input box and then using the abbreviation for any character and a wildcard to return the entire string. Because there are four OMR processes, we'll need to create four Pattern Matching processes.
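The labeling step can be pictured with a short Python sketch. It is not Quick Fields configuration: the dictionary of OMR results, the function names, and the use of Python's `re` module are all assumptions made for illustration. Each true/false result is prefixed with its document type, and the checked type is then read off by looking for a label that is immediately followed by TRUE.

```python
import re
from typing import Optional

# Labels are illustrative; they mirror the four check boxes on the slip sheet.
DOC_TYPES = ["Form 1040", "Account Activation", "Form W-9", "IRA Application"]


def label_omr_results(omr_results: dict[str, bool]) -> list[str]:
    """Prefix each OMR result with its document type, e.g. 'Form W-9TRUE'."""
    return [
        f"{doc_type}{'TRUE' if omr_results[doc_type] else 'FALSE'}"
        for doc_type in DOC_TYPES
    ]


def checked_document_type(omr_results: dict[str, bool]) -> Optional[str]:
    """Return the document type whose box was checked, or None if none was."""
    combined = "".join(label_omr_results(omr_results))
    alternatives = "|".join(re.escape(doc_type) for doc_type in DOC_TYPES)
    match = re.search(rf"({alternatives})TRUE", combined)
    return match.group(1) if match else None
```

With only the Form W-9 box checked, `label_omr_results` yields strings such as "Form 1040FALSE" and "Form W-9TRUE", and `checked_document_type` returns "Form W-9".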
The table below shows the processes from the example:

Look for pattern in:                          Pattern   Output
Form 1040$Form 1040 OMR$                      .*        Form 1040TRUE/FALSE
Account Activation$Account Activation OMR$    .*        Account ActivationTRUE/FALSE
Form W-9$Form W-9$                            .*        Form W-9TRUE/FALSE
IRA Application$IRA Application$              .*        IRA ApplicationTRUE/FALSE

These patterns will add document type to the results of the OMR processes and return document type and a value of TRUE or FALSE.

What we end up with are four pieces of information that give us the document type and whether its corresponding box has been checked (TRUE or FALSE). Because only one box should be checked, we can assume that only one of the patterns will include the value TRUE. To assign document type, we'll combine the output of all four patterns to determine which one includes the value TRUE and extract the associated document type. To do this, we'll use a fairly complex OR statement and the curly braces:

    ({Form 1040}TRUE)|({Account Activation}TRUE)|({Form W-9}TRUE)|({IRA Application}TRUE)

This pattern will read through the combined output of the four previous patterns to determine which box has been checked and return the associated document type.

While this may seem a little complicated, it's an efficient use of slip sheets to automatically break up batches of documents, assign document type and file them accordingly.

Appendix A: Regular Expression Reference Guide

Regular Expressions

The following table contains a description for each regular expression that can be used to create a pattern that will be searched for in the document.

Name                            Symbol  Description
Match Group                     { }     Indicates that only the data returned by the expression(s) contained within the braces will be stored as the token value.
Any Character                   .       Matches any single character.
Character in Range              []      Matches any character inside the brackets. For example, the expression [abc123] will find the characters "a", "b", "c", "1", "2", or "3".
Character Not in Range          [^]     Matches any character except for those inside the brackets. For example, the expression [^abc123] will find all characters except for "a", "b", "c", "1", "2" or "3".
Range Character                 -       Matches any characters contained within the specified range. For example, the expression [0-9] will find the numbers 0 through 9.
Beginning of Input              ^       Matches the beginning of the user-defined value. For example, the expression ^[abc123] will find a matching value when the user-defined value starts with either "a", "b", "c", "1", "2", or "3".
End of Input                    $       Matches the end of the user-defined value. For example, the expression [abc123]$ will find a matching value when the user-defined value ends with either "a", "b", "c", "1", "2", or "3".
Not                             !       Matches when the expression following the symbol (!) is not found in the user-defined value. For example, the expression a!b will find a matching value when "a" is found and it is not immediately followed by "b".
Or                              |       Matches when one of two expressions is found. For example, the expression (T|t)he will find a matching value when the user-defined value is set to either "The" or "the".
0 or More                       *       Indicates that the preceding expression matches zero or more times. For example, the expression ^[0-9]* will only find a set of consecutive digits if it is specified at the beginning of the user-defined value.
1 or More                       +       Indicates that the preceding expression matches one or more times. For example, the expression ^[0-9]+ will only find a set of consecutive digits if it is specified at the beginning of the user-defined value.
Previous Statement is Optional  ?       Indicates that the preceding expression is optional. It can match either once or not at all. For example, the expression [0-9][0-9]? will find either a single or double digit within a user-defined value.
Group                           ( )     Groups an expression together. For example, the expression (\d+,)*\d+ will find either a set of consecutive digits or comma-separated numbers. The group operator followed by the * is what allows Pattern Matching to find the comma-separated numbers. The \d+ matches the last number in the comma-separated list of numbers. For example, it could match "1", "12", or "1,23,456".
Escape Character                \       Indicates either an abbreviation (see table below) or that the next character should be interpreted literally. For example, [0-9]+ matches one or more digits, but [0-9]\+ matches a digit followed by a plus character.

Note: The Match Group expression is only necessary if you would like to return only a specific portion of the data being processed. For example, if you would only like to store the month and year from the value "12/31/2003", then you would specify the following expression: {\d+}{/}\d+/{\d+}. Notice that the digits in the month position, a single forward slash, and the digits in the year position are all specified within braces.

Abbreviations

The following table describes the various abbreviations that you may use when specifying your regular expression. An abbreviation is a simple and quick way to specify a regular expression. For example, instead of typing ([a-zA-Z]), you may simply type \c to find a single alphabetic character.

Name                    Symbol  Description
AlphaNumeric Character  \a      This symbol represents any single alphanumeric character. The corresponding syntax would be: ([a-zA-Z0-9])
WhiteSpace Character    \b      This symbol represents a single space character. The corresponding syntax would be: ([ \t])
Alphabetic Character    \c      This symbol represents any single alphabetic character. The corresponding syntax would be: ([a-zA-Z])
Decimal Digit           \d      This symbol represents any single decimal digit. The corresponding syntax would be: ([0-9])
Hexadecimal Digit       \h      This symbol represents any single hexadecimal digit. The corresponding syntax would be: ([0-9a-fA-F])
Integer Digit           \z      This symbol represents an integer. An integer consists of any set of consecutive decimal digits. The corresponding syntax would be: ([0-9]+)
Newline Character       \n      This symbol represents any single carriage return or line break.
Quoted String           \q      This symbol represents any single set of characters contained within a set of quotation marks.
Simple Word             \w      This symbol represents any simple word. A simple word consists of consecutive alphabetic characters.

Appendix B: Common Sample Patterns

This section contains a short list of common patterns that can be used to identify or format data. This list is solely provided as an introduction to the many uses of this process. These patterns, like any other pattern, can be modified to best suit the needs of your organization.

Note: These common patterns use both symbols and abbreviations. Abbreviations have been used to save space and to demonstrate their proper usage.

Type                                                  Pattern                       Example
Phone Number ((xxx) xxx-xxxx format)                  \(\d\d\d\) \d\d\d-?\d\d\d\d   (562) 988-1688
Short Date (single or double-digit month/day format)  \d?\d/\d?\d/\d\d\d\d          12/25/2003
Social Security Number (xxx-xx-xxxx format)           [0-9]+-[0-9][0-9]-[0-9]+      123-45-6789
Time (h:mm or hh:mm format)                           [0-9]?[0-9]:[0-9][0-9]        17:50
Zip Code (xxxxx or xxxxx-xxxx format)                 \d*-?\d*                      9080

The following table demonstrates the regular expressions that should be used to retrieve a single line of text from data containing multiple lines of text. An example of a process that can be configured to retrieve multiple lines of text is International Zone OCR.

Note: The First Line pattern will not return the desired results when the data being processed only contains a single line of text.

Type              Pattern       Sample Pattern
First Line        {.*?\n}       {.*?\n}
Second Line       \n\n{.*?\n}   \n\n{.*?\n}
Additional Lines  For each additional line, prepend the following to the Second Line pattern: \n\n.*?
Last Line         Follow the instructions for additional lines. You should then modify the expression within the braces to match the following: {.*}

Understanding Pattern Matching
January 2006

Author: Jereb Cheatham
Contributing author: Miruna Babatie
Editor: Constance Anderson

Compulink Management Center, Inc.
Global Headquarters
3545 Long Beach Blvd.
Long Beach, CA 90807 U.S.A.
Phone: +1.562.988.1688
www.laserfiche.com

Laserfiche is a trademark of Compulink Management Center, Inc. Various product and service names referenced herein may be trademarks of Compulink Management Center, Inc. All other products and service names mentioned may be trademarks of their respective owners.

Copyright ©2005 Compulink Management Center, Inc. All rights reserved.