Regex to detect the count of email addresses in an email header?
I have a regex that detects an email address. I'm trying to create a regex that looks in the header of an email message, counts the email addresses, and ignores email addresses from a specific domain (abc.com).
For example, it should count ten email addresses such as 1@test.com while ignoring an 11th address such as 2@abc.com.
Current regex:
^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,4}$
Consider the following PowerShell example of a universal regex.
To find the email addresses:
<(.*?)>
is handy if your server surrounds email addresses with angle brackets.
(?<!content-type(.|\n){0,10000000})([a-zA-Z0-9.!#$%&''*+-/=?\^_``{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)
is the one to use if you don't have brackets around the email addresses in the header. Note that this particular regex was copied from a community wiki answer on Stack Overflow (question 201323) and modified here to prevent matches on @abc.com.
There are edge cases this regex will not work for. On that same page there is a more complex regex that looks like it will match every email address, but I don't have time to modify that one to skip @abc.com.
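For the bracketed case, the counting step might be sketched as below. This is only a minimal illustration, not the approach used in the example that follows: the three-line `$header` value is a made-up sample, and the `-like` / `-notlike` filtering is an assumption swapped in for the lookahead so that the simple `<(.*?)>` regex can be reused unchanged.

```powershell
# Hypothetical three-line header, used only for illustration.
$header = 'from: taylor evans <remember@to.vote>
to: jon smith <example_to@mail.abc.com>
message-id: <4129f3ca.2020509@abc.com>'

# Pull out everything between angle brackets with the bracket regex from the text.
$candidates = [regex]::Matches($header, '<(.*?)>') | ForEach-Object { $_.Groups[1].Value }

# Keep values that contain an @ and whose domain is not exactly abc.com.
$addresses = @($candidates | Where-Object { $_ -like '*@*' -and $_ -notlike '*@abc.com' })

Write-Host "found $($addresses.Count) matching addresses"
```

Filtering after extraction keeps the regex trivial, at the cost of a second pass. As with the lookahead version, an address at a subdomain such as mail.abc.com is still counted.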
Example
$matches = @()
$string = 'return-path: <example_from@abc123.com>
x-spamcatcher-score: 1 [x]
received: [136.167.40.119] (helo abc.com) fe3.abc.com (communigate pro smtp 4.1.8) esmtp-tls id 61258719 example_to@mail.abc.com;
message-id: <4129f3ca.2020509@abc.com>
date: wed, 21 jan 2009 12:52:00 -0500 (est)
from: taylor evans <remember@to.vote>
user-agent: mozilla/5.0 (windows; u; windows nt 5.1; en-us; rv:1.0.1)
x-accept-language: en-us, en
mime-version: 1.0
to: jon smith <example_to@mail.abc.com>
subject: business development meeting
content-type: text/plain; charset=us-ascii; format=flowed
content-transfer-encoding: 7bit
content-type: multipart/alternative; boundary="------------060102080402030702040100"
multi-part message in mime format.
--------------060102080402030702040100
content-type: text/plain; charset=iso-8859-15; format=flowed
content-transfer-encoding: 7bit
hello, html mail, has *bold*, /italic /and _underlined_ text. , have table here: cell(1,1) cell(2,1) cell(1,2) cell(2,2) , put picture here: image alt text that''s it.
--------------060102080402030702040100
content-type: multipart/related; boundary="------------030904080004010009060206"
--------------030904080004010009060206
content-type: text/html; charset=iso-8859-15
content-transfer-encoding: 7bit
<!doctype html public "-//w3c//dtd html 4.01 transitional//en">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-15">
</head>
<body bgcolor="#ffffff" text="#000000">
hello,<br> <br>
html mail, has <b>bold</b>, <i>italic </i>and <u>underlined</u> text.<br>
, have table here:<br>
<table border="1" cellpadding="2" cellspacing="2" height="62" width="401">
<tbody>
<tr> <td valign="top">cell(1,1)<br> </td> <td valign="top">cell(2,1)</td> </tr>
<tr> <td valign="top">cell(1,2)</td> <td valign="top">cell(2,2)</td> </tr>
</tbody>
</table>
<br> , put picture here:<br> <br>
<img alt="image alt text" src="cid:part1.ffffffff.5555555@example.com" height="79" width="98"><br> <br>
that''s it. email me @ test@email.com<br>
subject: <br>
</body>
</html>'
# Write-Host start
# Write-Host $string
Write-Host
Write-Host found
[array]$found = ([regex]'(?<!content-type(.|\n){0,10000000})([a-zA-Z0-9.!#$%&''*+-/=?\^_`{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)').Matches($string)
# Report the whole match: group 1 sits inside the negative lookbehind and never captures.
$found | ForEach-Object {
    Write-Host "key @ $($_.Index) = '$($_.Value)'"
} # next match
Write-Host "found $($found.Count) matching addresses"
Yields
found
key @ 14 = 'example_from@abc123.com'
key @ 200 = 'example_to@mail.abc.com'
key @ 331 = 'remember@to.vote'
key @ 485 = 'example_to@mail.abc.com'
found 4 matching addresses
Summary
(?<!content-type(.|\n){0,10000000})
Prevents content-type from appearing within 10,000,000 characters before the email address. This has the effect of preventing email address matches in the body of the message. Because the requester is using Java, and Java doesn't support the use of * inside a lookbehind, I'm using {0,10000000} instead. (See "Regex look-behind without obvious maximum length in Java".) Be aware that this may introduce some edge cases that are not captured as expected.
<(.*?@(?!abc.com).*?)>
(
Start the return value.
[a-zA-Z0-9.!#$%&''*+-/=?\^_``{|}~-]+
Match one or more allowed characters. The doubled single quote escapes the single quote character for PowerShell, and the double tick escapes the backtick for Stack Overflow.
@
Include the first @ sign.
(?!abc.com)
Reject the find if what immediately follows the @ sign is abc.com.
[a-zA-Z0-9-]+
Continue matching the remaining characters up to the first dot or the end of the string. The dot is not in this character class, so the match stops there.
(?:\.[a-zA-Z0-9-]+)*)
Continue matching further chunks of characters, each preceded by a dot, then close the return value.
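To make the scope of the (?!abc.com) lookahead concrete, here is a small sketch that runs the address part of the regex (without the content-type lookbehind) against a few made-up addresses. The test addresses and the `$strict` variant with an escaped dot are illustrative assumptions, not part of the original answer.

```powershell
# Same address pattern as in the summary, minus the content-type lookbehind.
$pattern = '([a-zA-Z0-9.!#$%&''*+-/=?\^_`{|}~-]+@(?!abc.com)[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)'

# Hypothetical addresses chosen to probe the lookahead.
foreach ($address in 'user@abc.com', 'user@mail.abc.com', 'user@abc123.com', 'user@abcxcom.org') {
    $matched = [regex]::IsMatch($address, $pattern)
    Write-Host "$address -> $matched"
}
# user@abc.com -> False        (rejected: abc.com directly follows the @)
# user@mail.abc.com -> True    (the lookahead only inspects what directly follows the @)
# user@abc123.com -> True
# user@abcxcom.org -> False    (the unescaped dot in abc.com matches any character)

# Assumed variant: escaping the dot stops the lookahead from rejecting abcxcom.org.
$strict = $pattern.Replace('(?!abc.com)', '(?!abc\.com)')
Write-Host "strict: user@abcxcom.org -> $([regex]::IsMatch('user@abcxcom.org', $strict))"
```

This matches the example output above, where example_to@mail.abc.com is counted even though it belongs to a subdomain of abc.com. Whether subdomains should be excluded as well depends on what the requester actually needs.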