How to effectively use RegEx

Modified on Mon, 30 Jun at 12:08 PM


Regular expressions (RegEx) are an incredibly powerful tool to have in your toolkit. They are extremely versatile and apply to many different use cases. This is demonstrated by the many places Regex is used: the Regex Well Accelerator, the Regex search module, the Regex Auto Extractor, and the Regex Router, Regex Timestamp, and Regex Extract ingest preprocessors. Several ingesters also support Regex-based configuration options. Because Regex is such a valuable skill, we put together this document to help you get started, walk through popular use cases, share some best practices, and point you to tools and resources for building and testing your Regex.




What Are Regular Expressions?


According to Wikipedia:


A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.


Essentially, Regular Expressions are an extremely powerful way of matching complex patterns and extracting enumerable fields from text. 






Using Regex in Gravwell



One of Gravwell's strengths is that it doesn't force you to apply any structure to your entries at ingest. This creates opportunities to do a lot with your data at query time, and it helps avoid corrupting the raw data when it doesn't fit your desired structure as expected. That does not negate the fact that having some structure is incredibly useful when you are trying to find patterns, run comparisons, or analyze your data at scale. Standards such as JSON, Key/Value, and CSV apply some structure at the data source. Many data sources, however, do not cleanly fit a standard and require you to apply the structure manually. The two most popular ways of doing so are the Regex search module and a Regex Auto Extractor.


Gravwell uses the RE2 regex syntax, and we provide a Regex Playground that you can use to build your regex string and test it against example data. We do not store or save any data entered in the playground on our server, so you can confidently copy cell data from your Gravwell instance into the Test String section to validate that your regex works as expected. The playground also accepts multiple test strings, so you can confirm how the regex performs across different entries or differently shaped data.


Regex can be used as a simple filter that matches a literal string, or as a more advanced filter that handles a variety of options or patterns within the filter string. It can also extract Enumerated Values from a string, which the following search modules can then reference and use.
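The three uses described above (literal filter, pattern filter, and extraction) can be sketched outside of Gravwell. This is a minimal illustration using Python's `re` module as a stand-in for RE2; the log lines and field names are invented for the example, and the patterns use only syntax that both engines accept.

```python
import re

# Hypothetical log entries, invented for this example.
entries = [
    "sshd[811]: Accepted password for alice from 10.0.0.5 port 50514",
    "sshd[811]: Failed password for bob from 10.0.0.9 port 50522",
    "cron[212]: session opened for user root",
]

# 1. Simple filter: keep entries containing the literal string "Failed".
failed = [e for e in entries if re.search(r"Failed", e)]

# 2. Advanced filter: alternation lets one pattern accept several variants.
auth = [e for e in entries if re.search(r"(Accepted|Failed) password", e)]

# 3. Extraction: named groups pull values out for later processing.
pattern = re.compile(r"(?P<result>Accepted|Failed) password for (?P<user>\S+)")
extracted = [m.groupdict() for e in entries if (m := pattern.search(e))]

print(failed)
print(len(auth))
print(extracted)
```

The dictionaries in `extracted` play the role that Enumerated Values play in a query: named values that later steps can refer to.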


The simplest regular expression is a single literal character. Except for the metacharacters like *+?()|, characters match themselves. To match a metacharacter, escape it with a backslash: \+ matches a literal plus character.
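As a small illustration of escaping, the sketch below compares an unescaped and an escaped plus sign. It uses Python's `re` module rather than Gravwell itself; the sample text is invented, and `re.escape` is a Python helper, not a Gravwell feature.

```python
import re

text = "1+1=2 and 11=2"  # invented sample text

# Unescaped, "+" is a quantifier: "1+1" means one or more "1" followed by "1".
unescaped = re.findall(r"1+1", text)

# Escaped, "\+" matches a literal plus sign.
escaped = re.findall(r"1\+1", text)

# re.escape adds the backslashes when the literal text comes from data.
auto_escaped = re.escape("a+b?")

print(unescaped)     # the unescaped pattern finds "11"
print(escaped)       # the escaped pattern finds "1+1"
print(auto_escaped)
```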


Below is a list of some of the most common RegEx syntax and tokens you may find yourself using as you start your RegEx journey:


^ (caret): Anchor to the start of the string
$ (dollar sign): Anchor to the end of the string
. (period): Wildcard; matches any single character (by default, any character except a newline)
? (question mark): Zero or one of the preceding character
* (asterisk): Zero or more of the preceding character
+ (plus): One or more of the preceding character
(?P<Group1>1234): Create an Enumerated Value named "Group1" that matches "1234". The EV name is whatever appears between the less-than and greater-than signs, and the capture can be any regular expression between the greater-than sign and the closing parenthesis. Spaces inside the group are matched literally, so do not pad the pattern with them.
{ } (curly brackets): Use with a number to specify how many of the preceding character to match. Examples:
{3} --- Match exactly 3 times
{3,6} --- Match 3 to 6 times
{3,} --- Match 3 or more times
{0,6} --- Match up to 6 times (RE2 treats {,6} as literal text, so write the lower bound explicitly)
[ ] (brackets): Match any single character or range of characters listed within the brackets. Case sensitive. Can be combined with ?, *, +, or curly brackets to match multiples. Examples:
[abc] --- Match a, b, or c
[a-f] --- Match any character between a and f
[ABC] --- Match A, B, or C
[A-Fa-f] --- Match a-f in either case
[0-6] --- Match any digit between 0 and 6
\ (backslash): Escape a special character. For example, to look for a literal question mark or plus, use \? or \+ in your RegEx string so it is not interpreted as a regex special character.
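The tokens listed above can be exercised with a few short checks. The sketch below uses Python's `re` module as a stand-in for RE2 and sticks to syntax both engines share; the sample strings and the `status` group name are invented for illustration.

```python
import re

# (pattern, text, should_match) triples; fullmatch requires the whole text to match.
cases = [
    (r"colou?r", "color", True),      # ? : zero or one "u"
    (r"ab*c", "ac", True),            # * : zero or more "b"
    (r"ab+c", "ac", False),           # + : at least one "b" is required
    (r"[0-9]{3,6}", "12345", True),   # {3,6} : three to six digits
    (r"[0-9]{3}", "12345", False),    # {3} : exactly three digits
    (r"[a-f]+", "BEEF", False),       # character classes are case sensitive
    (r"[A-Fa-f]+", "BEEF", True),     # list both cases to accept either
    (r"1\+1", "1+1", True),           # \+ : literal plus sign
]
results = [bool(re.fullmatch(p, t)) for p, t, _ in cases]

# Anchors: ^ pins the match to the start of the string.
starts_with_error = bool(re.search(r"^ERROR", "ERROR disk full"))
error_elsewhere = bool(re.search(r"^ERROR", "an ERROR occurred"))

# A named group exposes the captured text under the chosen name.
m = re.search(r"code=(?P<status>[0-9]{3})", "GET /index.html code=404")
status = m.group("status")

print(results, starts_with_error, error_elsewhere, status)
```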


Regex Auto Extractor


Regex may be the most common use for auto-extractors. Regular expressions are hard to get right, easy to mistype, and difficult to optimize. If you have a regular expression guru available, they can help you build a blazing fast regular expression that does all manner of efficient and flexible extractions. You can then deploy it in an auto-extractor and forget all about it.


By using Enumerated Value capture groups, you can easily cut apart your unstructured data and create a structure that is easy to reuse across multiple queries and automations. Auto Extractors are also shareable, so you can build your regex extraction once and share it, letting multiple users and analysts take advantage of a standardized structure for the data.


A Regex Playground is extremely valuable when building your auto-extractor: you can copy multiple individual entries into the playground and ensure your Regex string handles log variability correctly and predictably.
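The same idea, checking one pattern against several varied entries, can also be sketched as a script. This is only an illustration in Python's `re` module, not a replacement for the playground; the pattern, the sample entries, and the `check` helper are all hypothetical.

```python
import re

# Candidate extraction pattern (an assumption for this sketch).
pattern = re.compile(r"^(?P<host>\S+) (?P<level>INFO|WARN|ERROR) (?P<msg>.+)$")

# Invented sample entries, including variations that real logs often contain.
samples = [
    "web01 INFO request served",
    "web02 ERROR upstream timed out",
    "web03 warn disk at 91%",      # lowercase level
    "db01  INFO checkpoint done",  # two spaces after the host
]

def check(pattern, samples):
    """Return (matched, unmatched): parsed fields and the entries that failed."""
    matched, unmatched = [], []
    for entry in samples:
        m = pattern.search(entry)
        if m:
            matched.append(m.groupdict())
        else:
            unmatched.append(entry)
    return matched, unmatched

matched, unmatched = check(pattern, samples)
print(len(matched), "matched")
for entry in unmatched:
    print("no match:", entry)
```

Entries that fall into `unmatched` point at variability the pattern does not yet handle, such as inconsistent case or extra whitespace. Adjusting the pattern and rerunning the check shows whether the change fixes them without breaking the entries that already matched.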



Common RegEx Patterns


There are common RegEx patterns you may find yourself using repeatedly. To help you get started, here are a few examples and snippets you may find useful.


 

 

 

 Use Case  Regex Sample  Example Text 
 ---  ---  --- 
 Credit Card Number (Visa, Mastercard, Discover, Amex)  ^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9]{2})[0-9]{12}|3[47][0-9]{13})$  4111111111111111 
 Cross-Site Scripting (XSS) Detection  <script>|</script>|javascript:|onload|onclick|onmouseover  <script>alert('XSS')</script> 
 Data Exfiltration Attempts  \b(download|upload|ftp|sftp|scp)\b  File downloaded from FTP server 
 Date Validation (YYYY-MM-DD)  ^\d{4}-\d{2}-\d{2}$  2022-07-25 
 DNS Tunneling Attempts  \b(dns|dig|host|nslookup)\b  DNS query for example.com 
 Email Address  ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$  john.doe@example.com 
 IP Address (IPv4)  ^(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)){3}$  192.168.1.100 
 IPv4 Extraction  (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)  192.168.1.100 
 IPv6 Extraction  [a-fA-F0-9]{1,4}:[a-fA-F0-9]{1,4}:[a-fA-F0-9]{1,4}:[a-fA-F0-9]{1,4}:[a-fA-F0-9]{1,4}:[a-fA-F0-9]{1,4}:[a-fA-F0-9]{1,4}:[a-fA-F0-9]{1,4}  2001:0db8:85a3:0000:0000:8a2e:0370:7334 
 Malware-Related Strings  \b(virus|malware|trojan|spyware|adware|ransomware)\b  This file contains malware 
 Password Cracking Detection  \b(password|pwd|pass)\b  The password is P@ssw0rd 
 Password Validation (min 8 chars, at least one uppercase, one lowercase, and one digit; relies on lookaheads, which RE2 does not support)  ^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$  P@ssw0rd123 
 Password Validation (min 8 chars, at least one uppercase, one lowercase, one digit; relies on lookaheads, which RE2 does not support)  ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$  P@ssw0rd 
 Phone Number (US format)  ^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$  (123) 456-7890 
 SQL Injection and XSS Detection  (SELECT|INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE)|(<script>|</script>|javascript:|onload|onclick|onmouseover)  SELECT * FROM users WHERE id = 1; <script>alert('XSS')</script> 
 SQL Injection Detection  SELECT|INSERT|UPDATE|DELETE|DROP|CREATE|ALTER|TRUNCATE  SELECT * FROM users WHERE id = 1 
 SQL Injection Detection (parameterized)  \?|\*|\'|\"|;|--  SELECT * FROM users WHERE id = ? 
 Suspicious System Calls  \b(syscall|system|execve|fork|clone)\b  Syscall 59 (execve) made by process 1234 
 Suspicious User Agent Strings  Mozilla\/[0-9]\.[0-9] \(compatible; MSIE [0-9]\.[0-9]; Windows NT [0-9]\.[0-9]; Trident\/[0-9]\.[0-9]  Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/4.0) 
 Unauthorized Access Attempts  \b(login|auth|authenticate|password|username)\b  Login failed for user 'admin' 
 URL Extraction  https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}[a-zA-Z0-9._%+/-]*  https://www.example.com/path/to/resource 
 URL Extraction  https?:\/\/[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}[\/a-zA-Z0-9._%+-]*  https://www.example.com 
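Two of the patterns above are worth a closer look. The sketch below applies the unanchored IPv4 extraction pattern to an invented log line. It then shows one possible way to express the password rules without lookahead assertions such as `(?=...)`, which RE2 does not support: run one simple pattern per requirement and require all of them to match. Python's `re` module is used only for illustration, and the helper name `strong_password` is hypothetical.

```python
import re

# Unanchored IPv4 pattern: each octet is limited to 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
IPV4 = re.compile(OCTET + r"(?:\." + OCTET + r"){3}")

line = "Connection from 192.168.1.100 to 10.0.0.7 refused"  # invented sample
ips = IPV4.findall(line)

def strong_password(pw):
    """Lookahead-free check: every requirement is its own simple pattern."""
    checks = [
        r"[A-Z]",    # at least one uppercase letter
        r"[a-z]",    # at least one lowercase letter
        r"[0-9]",    # at least one digit
        r"^.{8,}$",  # at least eight characters in total
    ]
    return all(re.search(c, pw) for c in checks)

print(ips)
print(strong_password("P@ssw0rd"), strong_password("password"))
```

Splitting the check into several patterns trades one dense expression for a few readable ones, and each of them stays within the syntax RE2 accepts.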


 Field  Regex Sample 
 ---  --- 
 IPv4 Address  (?P<ipv4>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) 
 IPv6 Address  (?P<ipv6>(?:[0-9a-fA-F]{1,4}::?){1,7}[0-9a-fA-F]{1,4}) 
 MAC Address  (?P<mac>(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}) 
 Email Address  (?P<email>[a-zA-Z0-9._%+\'-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}) 
 Domain Name  (?P<domain>(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}) 
 URL (basic)  (?P<url_basic>https?:\/\/[^\s\/$.?#].[^\s]*) 
 Date (YYYY-MM-DD)  (?P<date>\d{4}-\d{2}-\d{2}) 
 Time (HH:MM:SS)  (?P<time>\d{2}:\d{2}:\d{2}) 
 UUID (v4)  (?P<UUID>[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}) 
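Named-capture snippets like these can be applied to a single entry to pull out several fields at once. The sketch below is a rough illustration using Python's `re` module in place of Gravwell; the sample entry and the `extract` helper are invented, and the snippets follow the style of the list above.

```python
import re

# Named-capture snippets in the style of the list above.
FIELDS = {
    "ipv4": r"(?P<ipv4>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})",
    "mac": r"(?P<mac>(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})",
    "date": r"(?P<date>\d{4}-\d{2}-\d{2})",
    "time": r"(?P<time>\d{2}:\d{2}:\d{2})",
}

# Invented DHCP-style entry used only for this illustration.
entry = "2022-07-25 14:03:59 lease 192.168.1.100 to 00:1a:2b:3c:4d:5e"

def extract(entry, fields):
    """Search for each snippet independently and collect whatever matches."""
    found = {}
    for name, snippet in fields.items():
        m = re.search(snippet, entry)
        if m:
            found[name] = m.group(name)
    return found

values = extract(entry, FIELDS)
print(values)
```

Note that the short IPv4 snippet accepts any one to three digits per octet, so it also matches out-of-range values such as 999.1.1.1. Use the stricter octet pattern from the table above when that matters.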
