Regex Substring Finder

Problem Statement

You are given a word and a pattern, both of which are strings. The characters available to the pattern is a subset of the characters usually available in Regex.

Here, you only need to account for two characters: * and . The * character specifies that any string may have zero or more of the character that precedes the * . For example, the pattern a*b matches to a string with 0 or more a’s followed by a b. The . character matches with any character. For example, the pattern a.b matches with both acb and adb, but not ab.

Your job is to written an algorithm that determine whether a given pattern matches the whole given word. Please also provide runtime and complexity analysis.

Starter Code


def regex_match(word, pattern):
"""
inputs:
    word: Provided word (string)
    pattern: Pattern that needs to be checked against word (string)
output:
    True or False (boolean) that indicates whether the pattern matches the string
"""
# Your solution here

Test Cases

Test Case 1

Input:
    word: cooode
    pattern: co*de
Output: True

Test Case 2

Input:
    word: coooode
    pattern: o*e
Output: False

Test Case 3

Input:
    word: coooode
    pattern: .*
Output: True

Test Case 4

Input:
    word: cooode
    pattern: ......
Output: True

Test Case 5

Input:
    word: code
    pattern: cor*de
Output: False

Hint

Hint 1 (Click to reveal!)

What string matching problem does this look like? How we can use the underlying approach used to solve that problem here?

Hint 2 (Click to reveal!)

Let’s break down the matching possibilities. If you encounter a . in your pattern, what do you need to check at the corresponding index in the string? If you encounter a * in your pattern, what are the two possible outcomes from comparison? If you encounter a alphanumeric character, what’s the only possible check you can make? If you answer these questions to yourself, you figured out all the different cases needed in your algorithm.

Click Below to see the Solution

Algorithm

The most natural solution to this problem requires dynamic programming. As hinted above, this problem is very similar to Edit Distance, where you conduct pairwise comparisons across strings. Here, you initialize a table where the number of rows equals the length of the word and the number of columns equals the length of the pattern. At the begginning of the algorithm, set your base case T[0][0] = True, as two length 0’s strings always match. The recurrence relation breaks down into the following cases:

If T[i][j] = T[i-1][j-1] if the character at index i-1 of the word equals the character at index j-1 of the pattern or the character at index j -1 is a . Since the characters match, we can move both strings up by 1, and set the value of this subprobblem equal to the one at T[i-1][j - 1], because this match won’t cause the whole matching process to fail.
If the last character of the pattern (j - 1) is a * we have two cases at hand. Either the * can mean 0 of the preceding character, so we should ignore the *, or the * can mean use the preceding value, in which case we compare the last character of the word to the pattern. Case 1 (Ignore *) : T[i][j] = T[i][j-1] Case 2 (Use *): T[i][j] = T[i-1][j] In order to see how this lets us create arbitrary length sequences of the character preceding the * in the pattern, see that if we pick Case 2, we keep pulling on old subproblems until we decide to ignore the *. We or the results of these two subproblems, so that we pick whichever is most optimal currently: ` T[i][j] = T[i][j-1] || T[i-1][j]`
Finally, if the character at index i-1 of the word doesn’t match the character or either of the wildcards at index j-1 of the pattern, we set T[i][j] = False. We get the solution by returning the value at index T[len(word)][len(pattern)] from the table.

Code

def regex_match(word, pattern):
    wordLen = len(word)
    patternLen = len(pattern)
          
    # Initialize Subproblem Table
    # Need to loop in range(len + 1), because first index of the word/pattern occurs at index 1 (not index 0)
    subproblems = [[False for i in range(patternLen + 1)] for j in range(wordLen + 1)] 
      
    #A pattern of length zero always matches a word of length 0 
    subproblems[0][0] = True
      
    # Only '*' can match with empty strring 
    for j in range(1, wordLen + 1): 
        if (pattern[j - 1] == '*'): 
            subproblems[0][j] = subproblems[0][j - 1] 
              
    # fill the table in bottom-up fashion 
    for i in range(1, wordLen + 1): 
        for j in range(1, patternLen + 1): 
              
            # Two cases if we see a '*'
            # Either we can ignore the '*' or it matches with the ith character in the word
            if (pattern[j - 1] == '*'): 
                subproblems[i][j] = subproblems[i][j - 1] or subproblems[i - 1][j] 
              
            # Current characters match if current pattern character is '?' or the characters match
            elif (pattern[j - 1] == '?' or word[i - 1] == pattern[j - 1]): 
                subproblems[i][j] = subproblems[i - 1][j - 1] 
                  
            # If characters don't match 
            else: 
                subproblems[i][j] = False
      
    return subproblems[wordLen][patternLen]

Runtime/Space Analysis

Like most dynamic programming algorithms, the runtime of this algorithm depends on the size of the table and the amount time to compute one entry in the table. If n is the length of the word and m is the length of the pattern, this algorithm takes $O(n*m)$ since the size of the table is n * m and it takes $O(1)$ time to compute a single entry.

Correspondingly, the space complexity of this algorithm is the size of the table, so it’s $O(n*m)$ .

Walkthrough

Test Case #1 Walkthrough:

Input:
    word: cooode
    pattern: co*de
Output: True

subproblems[1][1] = True, since pattern[0] : c matches word[0] : c
subproblems[1][2:6] = False, since none of the characters match c except for . and subproblems[0][3] = False
subproblems[1][1] = False and subproblems[2][2] = True since the o’s match and subproblems[1][1] = True. subproblems[2][3] = True, since the * can evaluate to the empty sequence.
subproblems[2][3:6] = False
subproblems[3][1] = False, subproblems[3][2] = True, subrpoblems[3][3] = True since the * stands in for an O.
subproblems[3][4:6] = False
subproblems[4][1] = False, subproblems[4][2] = True, subproblems[4][3] = True
subproblems[4][4:6] = False
subproblems[5][1] = False, subproblems[5][2] = False, subproblems[5][3] = True, subproblems[5][4] = True, subproblems[5][5] = False
subproblems[6][1] = False, subproblems[6][2] = False, subproblems[6][3] = True, subproblems[6][4] = False, subproblems[6][5] = True
subproblems[6][5] is the final value in the table. It evaluates to True, so we return True.

Follow Up Questions To Think About

Suppose we also include the regex character + (1 or more of the preceding character). How will the algorithm change?
Suppose we allow for one error in the matching (i.e. not a full match). How will the algorithm change?