About Rexpy
Rexpy is an open-source Python library for finding regular expressions
from a corpus of example strings.
It is part of the test-driven
data analysis (tdda) repository on GitHub.
You can read more about the motivation for Rexpy on the
TDDA Blog.
This web application is a thin wrapper around the Rexpy library.
It allows you to type (or paste) a number of example strings
into the left-hand box (one string per line).
When you click Find Patterns,
Rexpy will attempt to find one or more regular expressions which,
between them, match all the strings you provide. The goal is for
the regular expressions to be general enough to capture all the likely
variation in patterns if you have provided only a subset, while
being specific enough to be useful.
Example
Input Strings | Output |
---|---|
1-AB-987 | \d+\-[A-Z]{2}\-\d{3} |
1987-CD-321 | |
23-ZQ-422 | |
Here, the three input strings all
- start with some digits (`\d+`)
- followed by a hyphen (`\-`)
- followed by two capital letters (`[A-Z]{2}`)
- followed by another hyphen (`\-`)
- finishing with three digits (`\d{3}`).
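The pattern in the table can be checked against the three inputs with Python's standard `re` module. This is only a minimal sketch: it does not call Rexpy itself, and the variable names are illustrative rather than part of any Rexpy API.

```python
import re

# Pattern copied from the Output column of the example table.
pattern = re.compile(r'\d+\-[A-Z]{2}\-\d{3}')

examples = ['1-AB-987', '1987-CD-321', '23-ZQ-422']

# fullmatch requires the whole string to match, not just a prefix.
results = {s: pattern.fullmatch(s) is not None for s in examples}
print(results)

# A string with lowercase letters in the middle part is rejected,
# because [A-Z]{2} only accepts two capital letters.
rejected = pattern.fullmatch('1-ab-987') is None
print(rejected)
```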
Controls
There are two toggles that control the output. These can be set
before or after generating the regular expressions.
- group: when group is selected, the resulting regular expressions
will include capture groups where the content of a subpattern is
variable. In the example above, when this is checked, the output
will change to `(\d+)\-([A-Z]{2})\-(\d{3})`, with three capture groups (traditionally named `\1`, `\2` and `\3`) corresponding to the three parts of the regular expression that match variable substrings.
- anchor: when anchor is selected, the resulting regular expressions
will start with `^` and end with `$`. This forces the regular expressions to match only whole lines.
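The effect of the two toggles can be illustrated with Python's standard `re` module. The sketch below does not use Rexpy; it takes the patterns quoted above as given, and the sample string `noisy` is made up for illustration.

```python
import re

line = '1987-CD-321'

# With "group" selected: parentheses capture each variable part.
grouped = re.compile(r'(\d+)\-([A-Z]{2})\-(\d{3})')
parts = grouped.fullmatch(line).groups()
print(parts)  # ('1987', 'CD', '321')

# With "anchor" selected: ^ and $ tie the match to the start and end.
anchored = re.compile(r'^\d+\-[A-Z]{2}\-\d{3}$')
unanchored = re.compile(r'\d+\-[A-Z]{2}\-\d{3}')

# Hypothetical input with extra characters around the identifier.
noisy = 'id=1987-CD-321;'
found_unanchored = unanchored.search(noisy) is not None  # matches inside the text
found_anchored = anchored.search(noisy) is not None      # fails: extra characters
print(found_unanchored, found_anchored)
```

The unanchored pattern finds the identifier embedded in a longer string, while the anchored one accepts it only when nothing else is on the line.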
Known Limitations
This is a demonstrator for Rexpy, which is a very young project. Rexpy itself has many limitations, some of which we hope to relax over time. Some of the more notable limitations are:
- It is only intended for fairly regular, structured patterns, not free text.
- It uses the regular expression module re, from Python's standard library. This has various limitations including few facilities for handling non-ASCII characters and a hard limit of 100 capture groups.
- Little attempt is made to merge regular expressions or to use alternations at present. We expect this to improve over time.
- No more than 10,000 input lines;
- No more than 200,000 input characters.