Python Automation Cookbook

There's more...

Swapping the case of a special character inverts its meaning. For example, the inverses of the ones we used are as follows:

  • \D: Marks any non-digit
  • \W: Marks any character that is not a letter, digit, or underscore
  • \B: Marks any position that's not at the start or end of a word
The most commonly used special characters are typically \d (digits) and \w (letters, digits, and underscores), as they mark common patterns to search for, together with the plus sign (+) for one or more repetitions.
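As a rough illustration of how each special character compares with its inverse, the following sketch runs them over two made-up strings. SAMPLE and WORDS are hypothetical inputs chosen for this example, not the text used earlier in the recipe.

```python
import re

# Hypothetical sample text, made up for this illustration
SAMPLE = 'Order 66: ship 3 crates to dock_7!'

# \d+ matches runs of one or more digits; \D+ matches runs of non-digits
digits = re.findall(r'\d+', SAMPLE)
non_digits = re.findall(r'\D+', SAMPLE)

# \w+ matches runs of letters, digits, and underscores; \W+ matches the rest
words = re.findall(r'\w+', SAMPLE)
non_words = re.findall(r'\W+', SAMPLE)

print(digits)      # ['66', '3', '7']
print(non_digits)  # ['Order ', ': ship ', ' crates to dock_', '!']
print(words)       # ['Order', '66', 'ship', '3', 'crates', 'to', 'dock_7']

# \b and \B match positions, not characters
WORDS = 'cat concat cats'
# 'cat' as a whole word: a boundary is required on both sides
whole_word_starts = [m.start() for m in re.finditer(r'\bcat\b', WORDS)]
# 'cat' that does NOT begin at a word boundary (the one inside 'concat')
inner_starts = [m.start() for m in re.finditer(r'\Bcat', WORDS)]

print(whole_word_starts)  # [0]
print(inner_starts)       # [7]
```

Note that the underscore in dock_7 counts as a word character, so it stays inside the \w+ match.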

Groups can also be given names. This makes them more explicit, at the cost of a more verbose pattern, using the form (?P<groupname>PATTERN). A named group can be retrieved by name with .group(groupname) or through .groupdict(), and it still keeps its numeric position.

For example, the pattern from step 4 can be written with named groups as follows:

>>> PATTERN = re.compile(r'(?P<city>[A-Z][\w\s]+?).(?P<state>TX|OR|OH|MN)')
>>> match = PATTERN.search(TEXT)
>>> match.groupdict()
{'city': 'Odessa', 'state': 'TX'}
>>> match.group('city')
'Odessa'
>>> match.group('state')
'TX'
>>> match.group(1), match.group(2)
('Odessa', 'TX')
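The session above relies on the TEXT defined earlier in the recipe. As a self-contained sketch, the same pattern can be run over a made-up string with re.finditer to collect every named match. SAMPLE_TEXT is a hypothetical input, and it assumes a single character separates each city from its state code.

```python
import re

# Same pattern as above, with named groups
PATTERN = re.compile(r'(?P<city>[A-Z][\w\s]+?).(?P<state>TX|OR|OH|MN)')

# Hypothetical input: one separator character between city and state code
SAMPLE_TEXT = 'shipped to Odessa TX, then to Portland OR'

# finditer yields one match object per non-overlapping match
places = [match.groupdict() for match in PATTERN.finditer(SAMPLE_TEXT)]
print(places)
# [{'city': 'Odessa', 'state': 'TX'}, {'city': 'Portland', 'state': 'OR'}]

# Named groups are still reachable by their numeric position
first = PATTERN.search(SAMPLE_TEXT)
same_city = first.group('city') == first.group(1)
same_state = first.group('state') == first.group(2)
print(same_city, same_state)  # True True
```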

Regular expressions are a very extensive topic. Whole technical books are devoted to them, and they can be notoriously deep. The Python documentation (https://docs.python.org/3/library/re.html) is a good reference and a good place to learn more.

If you feel a little intimidated at the start, that's perfectly natural. Analyze each pattern with care, dividing it into its parts, and it will start to make sense. Don't be afraid to run an interactive regex analyzer!
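One way to divide a pattern into its parts, sketched here as a suggestion and not as something the recipe itself uses, is the re.VERBOSE flag. It ignores whitespace in the pattern (outside character classes) and allows comments. The input string is made up for the example.

```python
import re

# The city/state pattern, split into commented parts.
# re.VERBOSE ignores whitespace in the pattern (outside character
# classes) and treats everything after # as a comment.
PATTERN = re.compile(r'''
    (?P<city>        # named group: city
        [A-Z]        # one uppercase letter
        [\w\s]+?     # word characters or whitespace, as few as possible
    )
    .                # any single separator character
    (?P<state>       # named group: state
        TX|OR|OH|MN  # one of the listed state codes
    )
''', re.VERBOSE)

# Hypothetical input string for illustration
match = PATTERN.search('moved to Odessa TX last year')
result = match.groupdict()
print(result)  # {'city': 'Odessa', 'state': 'TX'}
```

The compiled pattern behaves the same as the one-line version. Only the layout changes, which makes each piece easier to read and to test on its own.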

Regexes can be really powerful and generic, but they may not be the right tool for what you are trying to achieve. We've seen some caveats and patterns that have subtleties. As a rule of thumb, if a pattern starts to feel complicated, it's time to look for a different tool. Remember the previous recipes as well and the options they presented, such as the parse module.