Friday, February 28, 2014

Regular Expressions for Dummies:

How to use regular expressions in Stata

(created by Nadina Anderson)


First things first. What is a regular expression?
"In computing, a regular expression is a sequence of characters that forms a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. 'find and replace'-like operations" (-Wikipedia)
Many of us have done this on a basic level in Word. We realize that we misspelled something or made some other kind of error, and we know that we can go Edit --> Find --> Replace, and Word will give us the option to search for a word or set of words and replace it with a different word.

For example, if I were typing a paper on Ukraine and had continually referred to it as "the Ukraine," only to learn that "the Ukraine" implies that Ukraine is a region or subsection of some larger area (e.g. "the great plains," "the midwest") and is an insult to Ukrainian sovereignty, I could ask Word to replace "the Ukraine" with simply "Ukraine" everywhere it appears.

We can get a lot fancier than this in both Textpad and Stata. Textpad is very useful if you have a PC, and if you've done "find and replace" searches in Word, it is relatively easy to switch over. The options in Textpad resemble Word's basic 'find & replace' function, but Textpad also has a "regular expression" box you can check in the 'find' menu. This means we can use the fancy symbols (see further down the page) to look for parts of words and special patterns of characters.


With regular expressions, we can search for parts of words, or even just particular sequences of numbers and letters and have Textpad 'pick out' those strings. Textpad can tell us where they appear, when they appear, and, if we want, it can replace them with something we specify.

Strings vs. Numeric
When using regular expressions in Stata, we will need to use specific commands in place of what we normally do through the Find menu in Textpad. Also, instead of searching a document, we will now be searching the values of variables in a dataset. In Stata we use regular expressions only with "string" variables.

What is a string?

A string is a set of letters or characters (e.g. a word or set of words). It is also a datatype in Stata. A string variable will have words as its official 'values', whereas a numeric variable will always have numbers (although fyi strings can also include numbers). We always have to put quotes around a string in Stata so Stata knows what we are talking about.
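
For example, here is a quick illustration of why the quotes matter, using the make variable from the auto dataset that appears below:

        list if make == "Buick Opel"      // the quotes mark "Buick Opel" as a string value, not a variable name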

To see what variables in your data are strings vs. numeric, you can always type "describe" into your command bar, and you should get a long list of your variables.  The second column of this list should display the "storage type" of all your variables.


describe

              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------
make            str18   %-18s                 Make and Model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair Record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn Circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear Ratio
foreign         byte    %8.0g      origin     Car type

----------------------------------------------------------------

Numeric storage types: byte, int, long, float, double 
String storage: str1, str2, ..., strL
(--> number on end indicates how many characters are allowed in the string)

Many datasets you can download online in Stata format (GSS etc.) are all numeric, but the values of the variables will be labeled (and so you will need to specify "nolabel" to see the numeric values). 
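
For instance, here is a quick sketch of peeking at the codes behind a labeled numeric variable and then turning the labels into an actual string variable ("myvar" is just a placeholder name):

        tab myvar, nolabel              // show the underlying numeric codes instead of the labels
        decode myvar, gen(myvar_str)    // create a string variable from the value labels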

Symbols to use in Regular Expressions

Suppose we want to search a string variable for a specific value. Say the variable is "protests68" and it contains all of the newspaper headlines from the New York Times in 1968 about people protesting. Maybe we want to see how often the word "killed" appeared in the headlines. Stata lets us search for "killed" very easily and will tell us how many headlines include the word "killed".

But to actually search for patterns within strings we need to use regular expressions. A simple example: if we wanted all newspaper headlines with "violence" or "violent" in the title, we could ask Stata to find "violen." where the "." indicates "match any character."

A literal is a word that we "literally" find and replace. That is to say, we use the entire word to search, and we replace it with an entirely different word (e.g. "the Ukraine" to "Ukraine" in the example above). A regular expression, by contrast, does not have to be an entire word. A regular expression uses a set of symbols to look for patterns of characters, e.g. when we wanted to search for "violence" and "violent" above, we used one regular expression, "violen.", to find both.
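
As a rough sketch of what this would look like in Stata (the variable name "protests68" is hypothetical; the commands themselves are explained in detail below):

        gen killed  = regexm(protests68, "killed")      // 1 if the headline contains "killed", 0 otherwise
        gen violent = regexm(protests68, "violen.")     // catches both "violent" and "violence"
        tab killed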

Here is a summary of symbols to use in regular expressions in Stata:

Counting
  *    Asterisk means "match zero or more" of the preceding expression.
  +    Plus sign means "match one or more" of the preceding expression.
  ?    Question mark means "match either zero or one" of the preceding expression.

Characters
  a-z  The dash operator means "match a range of characters or numbers". The "a" and "z" are merely an example; it could also be 0-9, 5-8, F-M, etc.
  .    Period means "match any character".
  \    A backslash is used as an escape character to match characters that would otherwise be interpreted as a regular-expression operator.

Anchors
  ^    When placed at the beginning of a regular expression, the caret means "match expression at beginning of string". This character can be thought of as an "anchor" character since it does not directly match a character, only the location of the match.
  $    When the dollar sign is placed at the end of a regular expression, it means "match expression at end of string". This is the other anchor character.

Groups
  |    The pipe character signifies a logical "or" that is often used in character sets (see square brackets below).
  [ ]  Square brackets denote a set of allowable characters/expressions to use in matching, such as [a-zA-Z0-9] for all alphanumeric characters.
  ( )  Parentheses must match and denote a subexpression group.

This table is from the information on:
http://www.stata.com/support/faqs/data-management/regular-expressions/

There are three Stata function commands to accompany these symbols: 
1) regexm 
2) regexr
3) regexs

I think these are much less intuitive than Word and Textpad, but read below and hopefully it will start to make sense.

As a handy way to recall the commands, you can always remember that "reg" "ex" signifies you want to use a "regular" "expression." 

Remember--
  • M--stands for "match" so if you want Stata just to find and match a regular expression, use regexm (regular expression match). You can use this command to create a variable that is 0 if the expression is not present, and 1 if the expression is present
  • R--stands for "replace" so if you want Stata to find and replace the expression, remember to use regexr
  • S--stands for "subexpression" which I think is confusing. I prefer to think of S as standing for "spits" because Stata will extract a piece of an expression and spit out the part you tell it to. So for spitting out a chunk of a larger string, you want regexs
Lastly, these are functions rather than stand-alone commands, meaning they need to be part of a larger statement. Typically they appear inside a generate command. Note that the regex functions do not work with egen.
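
A quick way to get a feel for the symbols and these functions is to test them on literal strings with display; for example:

        display regexm("Buick Regal", "^Buick")        // 1: "Buick" appears at the start of the string
        display regexm("violent", "violen.")           // 1: the "." matches the final "t"
        display regexm("AMC Spirit", "[0-9]+")         // 0: there are no digits anywhere in the string
        display regexr("Buick Regal", "Buick", "GM")   // prints "GM Regal"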

REGEXM EXAMPLE

Our favorite string variable, the make and model of cars:

tab make

    Make and Model |      Freq.     Percent        Cum.
-------------------+-----------------------------------
       AMC Concord |          1        1.35        1.35
         AMC Pacer |          1        1.35        2.70
        AMC Spirit |          1        1.35        4.05
         Audi 5000 |          1        1.35        5.41
          Audi Fox |          1        1.35        6.76
          BMW 320i |          1        1.35        8.11
     Buick Century |          1        1.35        9.46
     Buick Electra |          1        1.35       10.81
     Buick LeSabre |          1        1.35       12.16
        Buick Opel |          1        1.35       13.51
       Buick Regal |          1        1.35       14.86
     Buick Riviera |          1        1.35       16.21
     Buick Skylark |          1        1.35       17.57

If this is our string variable, and we just want a variable that flags which cars are Buicks (and therefore tells us how many Buicks we have), we could use "match."

In general, we want:

        gen newvar = regexm(stringvar, "string-we-want-matched")

For the auto data:

        gen buick = regexm(make, "Buick")

tab buick

      buick |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         67       90.54       90.54
          1 |          7        9.46      100.00
------------+-----------------------------------

      Total |         74      100.00

Capitalization matters! If you don't want to worry about capitalization, you can use the function lower()--e.g.

        gen newvar = lower(stringvar)

and it will lowercase all the strings in your variable.

Matching is really useful if we have a huge unwieldy number of strings, but less so if the variable is small.
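
For instance, with a long, messy string variable you can count or list the matching observations directly:

        count if regexm(make, "Buick")          // how many observations mention Buick
        list make if regexm(make, "Buick")      // show just those observations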

REGEXR EXAMPLE

So, maybe we hate Buicks and we want our data to demonstrate our hatred explicitly. We could replace "Buick" with "piece-of-crap" if we wanted to. For this we need the "replace" function.

In general the replace function looks like this:

           gen newvar=regexr(stringvar, "string-in-data", "new-string-we-want-in-data")

For the auto data:


gen crap=regexr(make, "Buick", "piece-of-crap")        

And lo and behold, the Buicks are revealed in the data!

tab crap 

                                        ...(lots of values)
        Toyota Corona |          1        1.35       83.78
            VW Dasher |          1        1.35       85.14
            VW Diesel |          1        1.35       86.49
            VW Rabbit |          1        1.35       87.84
          VW Scirocco |          1        1.35       89.19
            Volvo 260 |          1        1.35       90.54
piece-of-crap Century |          1        1.35       91.89
piece-of-crap Electra |          1        1.35       93.24
piece-of-crap LeSabre |          1        1.35       94.59
   piece-of-crap Opel |          1        1.35       95.95
  piece-of-crap Regal |          1        1.35       97.30
piece-of-crap Riviera |          1        1.35       98.65
piece-of-crap Skylark |          1        1.35      100.00
----------------------+-----------------------------------

                Total |         74      100.00

We can also use regexr to homogenize our data. For instance, if in the above example someone had entered "bug" or "Volkswagen" instead of "VW", we could convert all of those to the string we want:

gen vw=regexr(make, "bug|Bug|volkswagen|Volkswagen", "VW")

The  |  symbol indicates "or"; note that the whole set of alternatives goes inside a single pair of quotes.

However, apart from the | operator, this replacement still uses "literals"--we are not using the fancy symbols in the chart above to find patterns within the strings.
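
To see regexr used with an actual pattern rather than a literal, here is a small sketch that strips the trailing model numbers off the make variable in the auto data (so "Audi 5000" becomes just "Audi"):

        gen make_nonum = regexr(make, " [0-9]+$", "")    // drop a space plus any run of digits at the very end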

REGEXS EXAMPLE

And finally, the spitting example. This is where regular expressions are the most useful, but it is also the function that is easiest to get wrong. It has to be used in combination with regexm.

In general we want:
gen newvar = regexs(#) if regexm(stringvar, "(first subexpression)(second subexpression)...(nth subexpression)")
Note that the whole pattern is a single quoted string; the parentheses inside the quotes mark off the subexpressions.
This gets a little tricky. This function lets you 'pick out' or extract a portion of a larger string. So if the string was today's date, "1march2014" but we only wanted the month we could write regexs(2) because we want the 2nd 'piece' of the larger string (here, the piece that comes after the day).

A handy way the Stata-folks explain it is the following:

Subexpression #     String returned
0                   1march2014
1                   1
2                   march
3                   2014

So if you type regexs(0), Stata will return the entire string that was matched.

If we had a variable called "date" that included a set of dates like:

1feb2014
14feb2014
1march2014
8march2014
6june2014

And we only wanted to know the month, we could write:

gen month=regexs(2) if regexm(date, "([0-9]+)([a-z]+)([0-9]+)")

The parentheses go around each subexpression, and the whole pattern sits inside one set of quotes. Here we have told Stata to look for a number, followed by a set of letters, followed by another number, and to return the set of letters (because we asked for the 2nd subexpression).

The "+" indicates that Stata needs to match to the given subexpression before moving on. The [ ] denote the set of allowable characters.

Our new variable month should look like this:

tab month


      month |      Freq.     Percent        Cum.
------------+-----------------------------------
        feb |          2       40.00       40.00
       june |          1       20.00       60.00
      march |          2       40.00      100.00
------------+-----------------------------------
      Total |          5      100.00

Now, if the dates were formatted less consistently, this command might grab the wrong pieces because we did not give Stata very specific instructions. We could specify that Stata look for the day at the beginning of the expression by using ^ and look for the year at the end of the expression by using $, like this:

gen month=regexs(2) if regexm(date, "(^[0-9]+)([a-z]+)([0-9]+$)")

We could also be more lenient and put a * after our subexpressions instead of +, which would allow Stata to still return an expression even if the day or year was missing.
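
A sketch of what that more lenient version might look like (the day and year pieces can now match nothing at all, but we still require at least one letter for the month):

        gen month2 = regexs(2) if regexm(date, "(^[0-9]*)([a-z]+)([0-9]*$)")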


COMBINATION EXAMPLE

When are these "regex" commands actually useful? When might we actually use them on real data? (besides expressing dislike for Buicks?)

Let's return to the example where we have data from the New York Times on protests in 1968. Maybe we have a "where" variable that records where the protest took place, according to the newspaper article.


The 'where' variable might look like the following:

. tab where

                                  where |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
            (North)Richmond, California |          1        0.12        0.12
                      NY PUBLIC LIBRARY |          1        0.12        0.25
                        South Bend, Ind |          1        0.12        0.37
 The auditorium of Robert A. Van Wyck.. |          1        0.12        0.50
. ST. JOSEPH COUNTY JAIL IN SOUTH BEN.. |          1        0.12        0.62
                  . UNIVERSITY OF MIAMI |          1        0.12        0.74
        . at their conference in Dallas |          1        0.12        0.87
                                      0 |          5        0.62        1.49
                                        ...(lots of values)

outside of the American Sociological .. |          1        0.12       92.80
  outside the Biltmore Hotel in chicago |          1        0.12       92.93
outside the Black Panthers headquarters |          1        0.12       93.05
                     outside the school |          1        0.12       93.18
persumably the event was at the high .. |          1        0.12       93.30
              philadelphia, and chester |          1        0.12       93.42
                       philadelphia, pa |          1        0.12       93.55
                          pikeville, ky |          1        0.12       93.67
                          pittsburg, pa |          1        0.12       93.80
                      portland , oregan |          1        0.12       93.92
               providence, rhode island |          1        0.12       94.04
                raleigh, north carolina |          1        0.12       94.17
                     san francisco , ca |          1        0.12       94.29
                      san francisco, ca |          2        0.25       94.54
                          san mateo, ca |          2        0.25       94.79
                      sanfrancisco , ca |          1        0.12       94.91
                     southbend, indiana |          1        0.12       95.04
                          st. louis, mi |          1        0.12       95.16
                           stamford, ct |          1        0.12       95.29
                    stanford univeristy |          1        0.12       95.41
                     staten island , ny |          1        0.12       95.53
                             storrs, cn |          1        0.12       95.66
                          storrs, conn. |          1        0.12       95.78
                      swan quarter, n.c |          1        0.12       95.91
                        tampa , florida |          1        0.12       96.03
       the Black Panther's headquarters |          1        0.12       96.15
the Criminal Courts Building as thier.. |          1        0.12       96.28
the Elizabeth Police Station, New Yor.. |          1        0.12       96.40
               the U.S Court of Appeals |          1        0.12       96.53
              the midway of Chicago, IL |          1        0.12       96.65
 the south field of Columbia Univeristy |          1        0.12       96.77
                        the u of oregon |          1        0.12       96.90
                they presented it in DC |          1        0.12       97.02
                    throughout New York |          1        0.12       97.15
 to City Hall and back to their schools |          1        0.12       97.27
             to Gov. Rockefeller Albany |          1        0.12       97.39
                       to Pres. Johnson |          1        0.12       97.52
to announce 100%police enforcement of.. |          1        0.12       97.64
to the White House, Congress and the .. |          1        0.12       97.77
toward the International Amphitheatre.. |          1        0.12       97.89
                                trenton |          1        0.12       98.01
                            trenton, NJ |          1        0.12       98.14
                            trenton, nj |          1        0.12       98.26
               u of michigan, ann arbor |          1        0.12       98.39
under Rockefeller Cente's statue of A.. |          1        0.12       98.51
                         united nations |          1        0.12       98.64
                   up 5th avenue in nyc |          1        0.12       98.76
                          washington dc |          2        0.25       99.01
                         washington, dc |          1        0.12       99.13
                         waterloo, iowa |          1        0.12       99.26
                          west 11th st. |          1        0.12       99.38
                  wilmington , delaware |          1        0.12       99.50
                   wilmington, deleware |          1        0.12       99.63
             wilmington, north carolina |          1        0.12       99.75
                 wilson, north carolina |          1        0.12       99.88
                         woodbourne, ny |          1        0.12      100.00
----------------------------------------+-----------------------------------

                                  Total |        806      100.00

These are a little confusing, because we have cities, specific buildings, states, neighborhoods, and events all jumbled together into one "where" variable. 

Let's assume our ideal "where" variable actually has the abbreviated state name at the end, e.g. "TUCSON, AZ". We can figure out how many of the values above already have the state name at the end with the following code:

gen where_upper=upper(where)

First we want to put all of the places into uppercase letters (assuming we like places to be uppercase). The function "upper" will automatically convert all of your strings into upper-case strings. Then we want to find the comma + state-abbreviation pattern:

gen find_abbr=regexm(where_upper, "(, )+([A-Z][A-Z]$)")

Here I have generated a new variable called "find_abbr" (for "find abbreviation"). I have asked Stata to find two subexpressions (contained within the parentheses): a coma and a space (, ), and two upper-case letters ([A-Z][A-Z]). The brackets [ ] indicate that the two characters can be any upper-case letters. 

I also have asked Stata to make sure it finds the comma: the "+" requires one or more occurrences of the comma-and-space group, so nothing matches unless the comma is there. Lastly, I have put in an anchor "$" that tells Stata to look for the two upper-case letters at the end of the expression (see the Stata symbol chart above).

Now Stata sorts through the upper-case 'where' values and "matches" the values that have a comma followed by a two-letter abbreviation, i.e. find_abbr is set to 1 where there is a match and 0 otherwise.

Tabulating our new variable (tab find_abbr), you can see that we have 137 cases where (hopefully!) the state abbreviation is at the end:

  find_abbr |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        669       83.00       83.00
          1 |        137       17.00      100.00
------------+-----------------------------------
      Total |        806      100.00

Now perhaps we remember that sometimes when we write down the state abbreviation, we put a period in between the letters, like A.Z. Stata would not catch this in the example above because we did not tell it to look for a period.

In order to get rid of the periods, we can use the simple find-and-replace function:

gen noper=regexr(where_upper, "\.", "")

Here I have generated a new variable called "noper" (no period) where I have hopefully erased all of the periods in the string. We need the backslash ( \ ) to tell Stata that we are actually looking for a real period--within regular expressions, "." means "match any character" (see chart above). 

Now if we run the match function again using "noper":

gen abbr_noper=regexm(noper, "(, )+([A-Z][A-Z]$)")

We should catch some of the states that previously had the format "A.Z." or "N.Y."

tab abbr_noper

 abbr_noper |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        659       81.76       81.76
          1 |        147       18.24      100.00
------------+-----------------------------------
      Total |        806      100.00

And yes! We have ten more cases (147 instead of 137) that have now been matched to "1" in our data.

Of course, if we want to see all 147 of these states that are already in the format we want, we need to extract the state from the larger "where" value. We can do this with the 'subexpression' (or spit-out) function, regexs.

gen state=regexs(2) if regexm(noper, "(, )+([A-Z][A-Z]$)")

So we are picking out the 2nd subexpression in our match function: the "[A-Z][A-Z]". Now, we could have some typos in the data, but overall we have a list of states:

. tab state

      state |      Freq.     Percent        Cum.
------------+-----------------------------------
         CA |         19       12.93       12.93
         CN |          2        1.36       14.29
         CO |          3        2.04       16.33
         CT |          3        2.04       18.37
         DC |          3        2.04       20.41
         FL |          5        3.40       23.81
         GA |          1        0.68       24.49
         IL |          1        0.68       25.17
         IN |          2        1.36       26.53
         KY |          1        0.68       27.21
         LA |          1        0.68       27.89
         MA |          5        3.40       31.29
         MD |          3        2.04       33.33
         MI |          2        1.36       34.69
         MO |          2        1.36       36.05
         MY |          2        1.36       37.41
         NC |          6        4.08       41.50
         NH |          1        0.68       42.18
         NJ |         18       12.24       54.42
         NY |         44       29.93       84.35
         OH |          3        2.04       86.39
         PA |          7        4.76       91.16
         RI |          2        1.36       92.52
         SC |          6        4.08       96.60
         TN |          3        2.04       98.64
         VT |          1        0.68       99.32
         WS |          1        0.68      100.00
------------+-----------------------------------
      Total |        147      100.00
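
From here we could clean up the obvious typos by hand. For example, assuming "CN" was meant to be Connecticut and "MY" was meant to be New York (you would want to check the original "where" values before recoding anything):

        replace state = "CT" if state == "CN"    // assumed typo for Connecticut
        replace state = "NY" if state == "MY"    // assumed typo for New York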


If this example was confusing, try reading the examples here:
http://www.stata.com/support/faqs/data-management/regular-expressions/
or here:
http://www.stata.com/meeting/wcsug07/medeiros_reg_ex.pdf

OTHER USEFUL THINGS IN STATA

Stata has a bunch of string functions that I think are really useful and more intuitive to use than the regex functions (a short sketch follows this list).
  • ltrim() and rtrim() let you trim blank spaces off the beginning and end of your strings, and itrim() collapses runs of internal spaces down to single spaces
  • word() returns whichever word of a string you ask for (e.g. the 1st or 2nd word)
  • abbrev() will abbreviate a string by cutting it down to the number of characters you want
  • proper() will capitalize the first letter of each word in your string
  • lower() will make all of your strings lower-cased
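
Here is a minimal sketch of a few of these on the auto data (the new variable names are just made up for the example):

        gen first_word = word(make, 1)         // "AMC Concord" -> "AMC"
        gen make_short = abbrev(make, 10)      // abbreviate the string to 10 characters
        gen make_lower = lower(make)           // "AMC Concord" -> "amc concord"
        gen make_prop  = proper(make_lower)    // "amc concord" -> "Amc Concord"
        gen make_trim  = trim(make_lower)      // strip any leading/trailing blanks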

For a full list, type "help string functions" in Stata.


FILEFILTER

Last but not least, there is a very handy command called "filefilter" which allows you to change an entire file and save the changed version as a new file. This is useful if, for example, you wanted to convert all of the spaces in a document into tabs, or if you wanted to insert the crucial left-hand single quote necessary for Stata macros.

The general form we want for filefilter is
        filefilter oldfile newfile, from("old pattern") to("new pattern")

We won't always need quotes, but they don't hurt, and you do need them to represent blank spaces. So, say we had a file "spaces.txt" that was all spaces where we wanted tabs; we could change all the spaces to tabs with the following command:

        filefilter spaces.txt tabs.txt, from(" ") to("\t")

The " " indicates a space, and the /t is the symbol for a tab. After we ran this command we would have a new file called "tabs.txt" that was all tabs instead of spaces (which could be very useful for making tables in Excel or Word).

Here is a short list of the code abbreviations used in filefilter:

   \BS    backslash
   \r     carriage return
   \n     newline
   \t     tab
   \M     Classic Mac EOL, or \r
   \W     Windows EOL, or \r\n
   \U     Unix or Mac EOL, or \n
   \LQ    left single quote, `
   \RQ    right single quote, '
   \Q     double quote, "
   \$     dollar sign, $
   \###d  3-digit [0-9] decimal ASCII
   \##h   2-digit [0-9,A-F] hexadecimal ASCII
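
For example, using the EOL codes above, here is a sketch that converts a Windows-formatted text file to Unix line endings (the file names are just placeholders):

        filefilter windows.txt unix.txt, from("\W") to("\U")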

For more information on filefilter visit:

http://www.stata.com/help.cgi?filefilter


This is a very introductory look at the regular expression functions in Stata; please explore the links above to see more complicated examples!