Regex Substring Finder
Problem Statement
You are given a word and a pattern, both of which are strings. The characters available to the pattern is a subset of the characters usually available in Regex.
Here, you only need to account for two characters: *
and .
The *
character specifies that any string may have zero or more of the character that precedes the *
. For example, the pattern a*b
matches to a string with 0 or more a’s followed by a b. The .
character matches with any character. For example, the pattern a.b
matches with both acb and adb, but not ab.
Your job is to written an algorithm that determine whether a given pattern matches the whole given word. Please also provide runtime and complexity analysis.
Starter Code
def regex_match(word, pattern):
"""
inputs:
word: Provided word (string)
pattern: Pattern that needs to be checked against word (string)
output:
True or False (boolean) that indicates whether the pattern matches the string
"""
# Your solution here
Test Cases
Test Case 1
Input:
word: cooode
pattern: co*de
Output: True
Test Case 2
Input:
word: coooode
pattern: o*e
Output: False
Test Case 3
Input:
word: coooode
pattern: .*
Output: True
Test Case 4
Input:
word: cooode
pattern: ......
Output: True
Test Case 5
Input:
word: code
pattern: cor*de
Output: False
Hint
Hint 1 (Click to reveal!)
What string matching problem does this look like? How we can use the underlying approach used to solve that problem here?
Hint 2 (Click to reveal!)
Let’s break down the matching possibilities. If you encounter a .
in your pattern, what do you need to check at the corresponding index in the string? If you encounter a *
in your pattern, what are the two possible outcomes from comparison? If you encounter a alphanumeric character, what’s the only possible check you can make? If you answer these questions to yourself, you figured out all the different cases needed in your algorithm.
Click Below to see the Solution
Algorithm
The most natural solution to this problem requires dynamic programming. As hinted above, this problem is very similar to Edit Distance, where you conduct pairwise comparisons across strings. Here, you initialize a table where the number of rows equals the length of the word and the number of columns equals the length of the pattern. At the begginning of the algorithm, set your base case T[0][0]
= True, as two length 0’s strings always match. The recurrence relation breaks down into the following cases:
- If
T[i][j]
=T[i-1][j-1]
if the character at index i-1 of the word equals the character at index j-1 of the pattern or the character at index j -1 is a.
Since the characters match, we can move both strings up by 1, and set the value of this subprobblem equal to the one atT[i-1][j - 1]
, because this match won’t cause the whole matching process to fail. - If the last character of the pattern (j - 1) is a
*
we have two cases at hand. Either the*
can mean 0 of the preceding character, so we should ignore the*
, or the*
can mean use the preceding value, in which case we compare the last character of the word to the pattern. Case 1 (Ignore *) :T[i][j]
=T[i][j-1]
Case 2 (Use *):T[i][j]
=T[i-1][j]
In order to see how this lets us create arbitrary length sequences of the character preceding the*
in the pattern, see that if we pick Case 2, we keep pulling on old subproblems until we decide to ignore the*
. We or the results of these two subproblems, so that we pick whichever is most optimal currently: ` T[i][j]=
T[i][j-1]||
T[i-1][j]` - Finally, if the character at index i-1 of the word doesn’t match the character or either of the wildcards at index j-1 of the pattern, we set
T[i][j]
= False. We get the solution by returning the value at indexT[len(word)][len(pattern)]
from the table.
Code
def regex_match(word, pattern):
wordLen = len(word)
patternLen = len(pattern)
# Initialize Subproblem Table
# Need to loop in range(len + 1), because first index of the word/pattern occurs at index 1 (not index 0)
subproblems = [[False for i in range(patternLen + 1)] for j in range(wordLen + 1)]
#A pattern of length zero always matches a word of length 0
subproblems[0][0] = True
# Only '*' can match with empty strring
for j in range(1, wordLen + 1):
if (pattern[j - 1] == '*'):
subproblems[0][j] = subproblems[0][j - 1]
# fill the table in bottom-up fashion
for i in range(1, wordLen + 1):
for j in range(1, patternLen + 1):
# Two cases if we see a '*'
# Either we can ignore the '*' or it matches with the ith character in the word
if (pattern[j - 1] == '*'):
subproblems[i][j] = subproblems[i][j - 1] or subproblems[i - 1][j]
# Current characters match if current pattern character is '?' or the characters match
elif (pattern[j - 1] == '?' or word[i - 1] == pattern[j - 1]):
subproblems[i][j] = subproblems[i - 1][j - 1]
# If characters don't match
else:
subproblems[i][j] = False
return subproblems[wordLen][patternLen]
Runtime/Space Analysis
Like most dynamic programming algorithms, the runtime of this algorithm depends on the size of the table and the amount time to compute one entry in the table. If n is the length of the word and m is the length of the pattern, this algorithm takes since the size of the table is n * m and it takes time to compute a single entry.
Correspondingly, the space complexity of this algorithm is the size of the table, so it’s .
Walkthrough
Test Case #1 Walkthrough:
Input:
word: cooode
pattern: co*de
Output: True
subproblems[1][1]
= True, sincepattern[0] : c
matchesword[0] : c
subproblems[1][2:6]
= False, since none of the characters match c except for . andsubproblems[0][3]
= Falsesubproblems[1][1]
= False andsubproblems[2][2]
= True since the o’s match andsubproblems[1][1]
= True.subproblems[2][3]
= True, since the*
can evaluate to the empty sequence.subproblems[2][3:6]
= Falsesubproblems[3][1]
= False,subproblems[3][2]
= True,subrpoblems[3][3]
= True since the*
stands in for an O.subproblems[3][4:6]
= Falsesubproblems[4][1]
= False,subproblems[4][2]
= True,subproblems[4][3]
= Truesubproblems[4][4:6]
= Falsesubproblems[5][1]
= False,subproblems[5][2]
= False,subproblems[5][3]
= True,subproblems[5][4]
= True,subproblems[5][5]
= Falsesubproblems[6][1]
= False,subproblems[6][2]
= False,subproblems[6][3]
= True,subproblems[6][4]
= False,subproblems[6][5]
= Truesubproblems[6][5]
is the final value in the table. It evaluates to True, so we return True.
Follow Up Questions To Think About
- Suppose we also include the regex character + (1 or more of the preceding character). How will the algorithm change?
- Suppose we allow for one error in the matching (i.e. not a full match). How will the algorithm change?