Les Hazlewood

Where Les is More…

Email Validation using Regular Expressions (the Right Way)

Filed under: Software, Java, General — Les at 2:53 pm on Saturday, February 4, 2006

UPDATE: This article was updated on February 1st, 2008 to account for domain literals and quoted strings such as “John Smith” <john.smith@somewhere.com>. It is now effectively the only complete and semantically correct email validator for Java.

PETTY REQUEST: The update required considerably more effort than the original as it now accounts for all valid RFC parsing conditions. Because of this, and that this page is easily my most visited, I’d appreciate it if you could show your appreciation by hooking a brother up and clicking on some ads. It helps pay for my hosting. Thanks!

In Object-Oriented design, I’m a firm believer in modeling things in the way they truly exist (inasmuch as is possible given abstraction and time restrictions). So, whenever I design a system’s domain model, I create Classes that represent entities as they exist in real life. That being said, I’ve accrued a nice library of Classes that I reuse in a lot of projects.

For example, I don’t save or reference an email address as a String: strings as objects don’t tell me anything about the email address itself, like whether it’s valid, whether it’s bouncing, whether it has been verified by the user with which it is associated, and so on. As such, I have created an EmailAddress class to represent this information. Doing this is a small example of the beauty of OO over functional programming.
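To make the idea concrete, here is a minimal sketch of an address modeled as an object that carries its own state. The class name, fields, and the isDeliverable() rule are hypothetical illustrations and are not taken from the actual EmailAddress class.

```java
import java.util.Objects;

/** Hypothetical sketch: an email address as a domain object instead of a bare String. */
final class EmailAddressSketch {
    private final String text;   // the raw address as entered
    private boolean verified;    // has the associated user confirmed it?
    private boolean bouncing;    // has mail sent to it bounced?

    EmailAddressSketch(String text) {
        this.text = Objects.requireNonNull(text, "text");
    }

    String getText() { return text; }
    boolean isVerified() { return verified; }
    boolean isBouncing() { return bouncing; }
    void markVerified() { this.verified = true; }
    void markBouncing() { this.bouncing = true; }

    /** Assumed rule for this sketch: deliverable means verified and not bouncing. */
    boolean isDeliverable() { return verified && !bouncing; }

    public static void main(String[] args) {
        EmailAddressSketch address = new EmailAddressSketch("john.smith@somewhere.com");
        address.markVerified();
        System.out.println(address.getText() + " deliverable? " + address.isDeliverable());
    }
}
```

A plain String could hold the text, but the flags and any rule built on them would have to live somewhere else.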

Anyway, I was a little lax in the past in my validation logic. This time on my last project, I was determined to get things right once and for all.

I googled quite a while for the Right Way to validate an email address. In my opinion, there is only one Right Way - the RFC 2822 way. This is the standard after all.

I never came across anything I was happy with. All the responses seemed to be Perl or PHP variants of regular expressions, or some horribly convoluted text string nearly impossible to decipher. I was disappointed to see so many interpretations of a standard. I mean, c’mon people, it’s written in pure black and white!!!

I guess the old adage “If you want something done right, you’ve got to do it yourself” resonated in my head this time. I actually took the time out to read the RFC (something I hadn’t done in a long while, probably since college).

After reading the RFC, I translated the grammar into usable, *readable* source code that now resides in my EmailAddress class, and I’ve included it below for the benefit of anyone that wishes to use it. It is written in Java, but the same code could be replicated in C# or PHP or whatever. Just keep it clean!

N.B.: Look at the first two constants, ALLOW_DOMAIN_LITERALS and ALLOW_QUOTED_IDENTIFIERS, and enable or disable them as you see fit for your application.

/*
 * Copyright 2008 Les Hazlewood
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.util.regex.Pattern;

// (the constants below live inside the EmailAddress class)

/**
 * This constant states that domain literals are allowed in the email address, e.g.:
 *
 * <p><tt>someone@[192.168.1.100]</tt> or <br/>
 * <tt>john.doe@[23:33:A2:22:16:1F]</tt> or <br/>
 * <tt>me@[my computer]</tt></p>
 *
 * <p>The RFC says these are valid email addresses, but most people don't like allowing them.
 * If you don't want to allow them, and only want to allow valid domain names
 * (<a href="http://www.ietf.org/rfc/rfc1035.txt">RFC 1035</a>, x.y.z.com, etc),
 * change this constant to <tt>false</tt>.
 *
 * <p>Its default value is <tt>true</tt> to remain RFC 2822 compliant, but
 * you should set it depending on what you need for your application.
 */
private static final boolean ALLOW_DOMAIN_LITERALS = true;

/**
 * This constant states that quoted identifiers
 * (using quotes and angle brackets around the raw address) are allowed, e.g.:
 *
 * <p><tt>"John Smith" &lt;john.smith@somewhere.com&gt;</tt>
 *
 * <p>The RFC says this is a valid mailbox. If you don't want to
 * allow this, because for example, you only want users to enter in
 * a raw address (<tt>john.smith@somewhere.com</tt> - no quotes or angle
 * brackets), then change this constant to <tt>false</tt>.
 *
 * <p>Its default value is <tt>true</tt> to remain RFC 2822 compliant, but
 * you should set it depending on what you need for your application.
 */
private static final boolean ALLOW_QUOTED_IDENTIFIERS = true;

// RFC 2822 2.2.2 Structured Header Field Bodies
private static final String wsp = "[ \\t]"; //space or tab
private static final String fwsp = wsp + "*";

//RFC 2822 3.2.1 Primitive tokens
private static final String dquote = "\\\"";
//ASCII Control characters excluding white space:
private static final String noWsCtl = "\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F";
//all ASCII characters except CR and LF:
private static final String asciiText = "[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";

// RFC 2822 3.2.2 Quoted characters:
//single backslash followed by a text char
private static final String quotedPair = "(\\\\" + asciiText + ")";

//RFC 2822 3.2.4 Atom:
private static final String atext = "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
private static final String atom = fwsp + atext + "+" + fwsp;
private static final String dotAtomText = atext + "+" + "(" + "\\." + atext + "+)*";
private static final String dotAtom = fwsp + "(" + dotAtomText + ")" + fwsp;

//RFC 2822 3.2.5 Quoted strings:
//noWsCtl and the rest of ASCII except the doublequote and backslash characters:
private static final String qtext = "[" + noWsCtl + "\\x21\\x23-\\x5B\\x5D-\\x7E]";
private static final String qcontent = "(" + qtext + "|" + quotedPair + ")";
private static final String quotedString = dquote + "(" + fwsp + qcontent + ")*" + fwsp + dquote;

//RFC 2822 3.2.6 Miscellaneous tokens
private static final String word = "((" + atom + ")|(" + quotedString + "))";
private static final String phrase = word + "+"; //one or more words.

//RFC 1035 tokens for domain names:
private static final String letter = "[a-zA-Z]";
private static final String letDig = "[a-zA-Z0-9]";
private static final String letDigHyp = "[a-zA-Z0-9-]";
private static final String rfcLabel = letDig + "(" + letDigHyp + "{0,61}" + letDig + ")?";
private static final String rfc1035DomainName = rfcLabel + "(\\." + rfcLabel + ")*\\." + letter + "{2,6}";

//RFC 2822 3.4 Address specification
//domain text - non white space controls and the rest of ASCII chars not including [, ], or \:
private static final String dtext = "[" + noWsCtl + "\\x21-\\x5A\\x5E-\\x7E]";
private static final String dcontent = dtext + "|" + quotedPair;
private static final String domainLiteral = "\\[" + "(" + fwsp + dcontent + "+)*" + fwsp + "\\]";
private static final String rfc2822Domain = "(" + dotAtom + "|" + domainLiteral + ")";

private static final String domain = ALLOW_DOMAIN_LITERALS ? rfc2822Domain : rfc1035DomainName;

private static final String localPart = "((" + dotAtom + ")|(" + quotedString + "))";
private static final String addrSpec = localPart + "@" + domain;
private static final String angleAddr = "<" + addrSpec + ">";
private static final String nameAddr = "(" + phrase + ")?" + fwsp + angleAddr;
private static final String mailbox = nameAddr + "|" + addrSpec;

//now compile a pattern for efficient re-use:
//if we're allowing quoted identifiers or not:
private static final String patternString = ALLOW_QUOTED_IDENTIFIERS ? mailbox : addrSpec;
public static final Pattern VALID_PATTERN = Pattern.compile(patternString);
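The building-block style is easier to see on a smaller scale. The sketch below reuses only the atext, dotAtomText, and RFC 1035 constants from the listing and deliberately leaves out folding whitespace, quoted strings, and domain literals, so it is a simplified illustration rather than the full validator. The class name AddrSpecSketch and the SIMPLE_ADDR_SPEC constant are made up for this example.

```java
import java.util.regex.Pattern;

/** Simplified sketch: dot-atom local part plus an RFC 1035 style domain only. */
final class AddrSpecSketch {
    // Same character class as the atext constant in the listing.
    private static final String atext =
            "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
    // One or more atext runs separated by single dots.
    private static final String dotAtomText = atext + "+" + "(" + "\\." + atext + "+)*";

    // RFC 1035 style labels, copied from the listing.
    private static final String letter = "[a-zA-Z]";
    private static final String letDig = "[a-zA-Z0-9]";
    private static final String letDigHyp = "[a-zA-Z0-9-]";
    private static final String rfcLabel = letDig + "(" + letDigHyp + "{0,61}" + letDig + ")?";
    private static final String rfc1035DomainName =
            rfcLabel + "(\\." + rfcLabel + ")*\\." + letter + "{2,6}";

    // Hypothetical reduced pattern: local part, "@", domain name.
    static final Pattern SIMPLE_ADDR_SPEC = Pattern.compile(dotAtomText + "@" + rfc1035DomainName);

    static boolean isValid(String candidate) {
        // matches() requires the whole input to match, not just a substring.
        return SIMPLE_ADDR_SPEC.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "john.smith@somewhere.com", "whomever@u.washington.edu",
            ".a@bbb.com", "first,last@site.com", "a@b"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValid(sample));
        }
    }
}
```

Because each constant is an ordinary string, any piece can be compiled and tested on its own before it is combined into a larger expression.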

Anyway, the above Java code allows you to do things like the following.

In the EmailAddress class, you can have a method:

public static boolean isValid( String userEnteredEmailString ) {
    return VALID_PATTERN.matcher( userEnteredEmailString ).matches();
}

Then you can write validation logic wherever you want (hopefully in a dedicated Validator ;) ):

if ( !EmailAddress.isValid( userEnteredEmailString ) ) {
    throw new InvalidFormatException( "Invalid e-mail format!" );
}

Better yet, if you want to see if any email address instance is valid, the EmailAddress class has the following method that you can use for ‘pure’ OO ‘messaging’ (i.e. a method invoked on an object is a ‘message’ from the calling object to the target object):

public boolean isValid() {
    //use static method call as helper w/ class attribute 'text'
    return isValid( getText() );
}

which enables you to do checks this way (this is ‘pure’ OO):

if ( anEmailAddressInstance.isValid() ) {
    //do something
} else {
    //do something else
}

Happy validating!

1 Comment »

Comment by Steven Elliott

April 4, 2006 @ 5:56 pm

Thanks Les for doing the hard work of implementing RFC 2822. I don’t know why there are so many personal interpretations of what a valid email address is or why so few actually bothered with the RFC standard.

Anyway thanks. I have just one minor correction and that is Pattern does not have:
boolean Pattern.matches(String s)

You need to create:
matcher = Pattern.matcher( CharSequence)
and then return matcher.matches().

Steven
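For reference, a minimal sketch of the two matching forms that java.util.regex does offer: an instance-level matcher(CharSequence) followed by matches(), and a static two-argument Pattern.matches(regex, input). The class name and the digits-only pattern are placeholders for illustration, not part of the validator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the two ways to test a whole input against a regular expression. */
final class MatcherSketch {
    // Illustrative pattern only; compiled once and reused.
    static final Pattern DIGITS = Pattern.compile("[0-9]+");

    static boolean viaMatcher(String input) {
        Matcher matcher = DIGITS.matcher(input);
        return matcher.matches(); // true only if the entire input matches
    }

    static boolean viaStaticHelper(String input) {
        // Static convenience method: compiles the regex on every call.
        return Pattern.matches("[0-9]+", input);
    }

    public static void main(String[] args) {
        System.out.println(viaMatcher("12345"));
        System.out.println(viaStaticHelper("12a45"));
    }
}
```

Keeping a precompiled Pattern and calling matcher() avoids recompiling the expression for every address checked.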

Comment by Les

April 4, 2006 @ 9:47 pm

Ah, yes, thanks very much for pointing that out ;) I’ve updated the blog entry accordingly.

Cheers,

Les

Comment by Bupesh

October 31, 2006 @ 2:47 am

I tried using this code, but it’s saying a@b is a valid email address. Is it?!

Comment by Les

November 6, 2006 @ 11:41 am

Hi Bupesh,

a@b is not a valid email address. But the code works as expected - I just used a@b through a simple test:

EmailAddress emailAddy = new EmailAddress( "a@b" );

if ( !emailAddy.isValid() ) {
    System.out.println( "Email is not valid!" );
} else {
    System.out.println( "Email is valid" );
}

When I ran that code block, my console printed: “Email is not valid!”.

So the code works as expected.

Cheers,

Les

Pingback by Les Hazlewood » EmailAddress Java class

November 14, 2006 @ 10:18 am

[…] Anyway, In this CMS, I’m using time-honored OO classes I’ve used on many many projects. One such is the EmailAddress class that I’ve referenced in earlier posts in this blog. I’ve gotten some good feedback on this class, so I thought I’d just post the whole thing in case anyone wants to benefit from it (instead of just using code chunks I’ve posted before). […]

Comment by nithya

December 24, 2006 @ 1:14 am

Will it work for “.a@bbb.com”? Actually it shouldn’t work, but it does!

Comment by Al Medeiros

January 15, 2007 @ 2:53 am

Thanks,

This code saved me a lot of time.

I am having one strange thing happen. This seems to accept first,last@site.com as a valid email address. I don’t see a comma in any of the patterns but yet it is accepting a comma as valid in localpart. Any ideas?

Thanks again,

Al Medeiros

Comment by Hans

February 2, 2007 @ 8:03 am

I think I found an error in your expression, as it allows an email address to start with a single quote ‘.

Which is surely not valid; JavaMail doesn’t accept it.

Comment by Les

February 10, 2007 @ 3:41 pm

@Nithya

.a@bbb.com does not show up as a valid email address.

My very simple test program tells me it is invalid, so the expression is correct.

For example, the following code does in fact print out “Invalid email.”:

String email = ".a@bbb.com";
if ( EmailAddress.isValidText( email ) ) {
    System.out.println("Valid email!");
} else {
    System.out.println("Invalid email.");
}

Comment by Les

February 10, 2007 @ 3:45 pm

@Al

The expression is correct. first,last@site.com is not a valid address, as you point out. This code chunk does print out “Invalid email.”:

String email = "first,last@site.com";
if ( EmailAddress.isValidText( email ) ) {
    System.out.println("Valid email!");
} else {
    System.out.println("Invalid email.");
}

Comment by Les

February 10, 2007 @ 3:49 pm

@Hans

The above regular expression is still correct. An email address, per the RFC 2822 spec, is allowed to start with a single quote, or any other character in the atext constant above.

Javamail doesn’t have any internal email address validation that I’m aware of, so Javamail isn’t denying the email per se - it is probably your underlying email server that javamail connects to that is saying the email is invalid. In this case, the email server is wrong - at least according to the RFC spec. The expression is still accurate.

Cheers,

Les

Comment by kumar

March 23, 2007 @ 1:16 am

thanks for the neatly written code.
but it does not validate the email address ending with an IP , such as don@[18.138.9.10]
Isn’t this a valid mail id???

Comment by Les

March 23, 2007 @ 3:26 am

@Kumar,

You’re absolutely correct. don@[18.138.9.10] is a valid email address. So is a quoted identifier, i.e. “Don Somebody” <don@[18.138.9.10]>, but the expression does not account for these 2 cases. I’ll add them in soon. Thanks!

Comment by Sateesh

April 5, 2007 @ 8:05 am

Les did you add these in already?

Comment by Les

April 5, 2007 @ 8:56 am

@Sateesh

Nope, not yet - I haven’t had the time :( (On a consulting engagement in Dublin, Ireland for the last 2 months). I hope to address these issues now that I’m back home in the States.

Cheers,

Les

Comment by cherouvim

April 30, 2007 @ 7:25 am

Great work.

thanks!

Comment by Thiago

May 13, 2007 @ 12:46 pm

Hi Les. I am Brazilian and I am creating a very simple framework to help with validations of specific Brazilian formats like social security number. Even though e-mail isn’t one of them, I am adding some other basic validation functions which include e-mail.
It is amazing that there isn’t a framework like that with minimal documentation already.

Anyway, I will publish that very simple framework at sourceforge.net and I was wondering if I could use your code above (regular expression part) in it and put the credits on the javadoc header. It would take me quite some time to do the same thing again myself, can I use yours?

There is no profit involved, just a simple framework that I did for myself and will publish since it will probably be of use for other people in my country.

Cheers,
Thiago.

Comment by Sean Sandquist

June 5, 2007 @ 3:40 pm

What about e-mail addresses such as:

whomever@u.washington.edu ?

The regex says that this is not valid. Yet u.washington.edu is a valid domain. (As is the similar “u.arizona.edu”.) The regex doesn’t like the lone “u”.

Sean

Comment by Les

June 5, 2007 @ 5:31 pm

@Sean

You’re absolutely right! Thanks for catching that. I’ve updated the blog entry accordingly (the rfcLabel definition specifically).

Comment by Zackery Sidsworth

June 14, 2007 @ 3:29 am

This one makes sence “One’s first step in wisdom is to kuesstion everything - and one’s last is to come to terms with everything.”

Comment by Simon Reinhardt

July 4, 2007 @ 1:02 am

JavaMail does have email address validation, see http://java.sun.com/products/javamail/javadocs/javax/mail/internet/InternetAddress.html#validate() and the source for that at https://glassfish.dev.java.net/source/browse/glassfish/mail/src/java/javax/mail/internet/InternetAddress.java?rev=1.6&view=markup . However, as they say, “The current implementation checks many, but not all, syntax rules.” So there’s still room for your implementation. :)

What do you guys think of the paragraph about the RFC on http://www.regular-expressions.info/email.html ?

Comment by Flavel

July 19, 2007 @ 1:46 pm

Wow, yeah, great job.

Comment by Samuel Arias

August 14, 2007 @ 2:12 pm

Thank you very mucho, great job. No more personal interpretation.

Comment by mike

August 22, 2007 @ 9:00 pm

Thank you very much. Saved me a lot of headaches.

Comment by Aditya

November 7, 2007 @ 1:58 pm

Thank you Les. You saved lot of time for everyone of us who is trying to validate emails :). Gr8 work. This works like a charm. Thanks again.

Comment by jimmy

December 15, 2007 @ 2:52 am

Les, thank you so much for this wonderfully simple, cleanly written and elegant email validator for Java. Now it will be easy to validate emails against the actual spec rather than some home-baked approximation. You rock! I think you have written THE canonical implementation for Java.

Comment by Gyanendra

January 11, 2008 @ 3:28 am

This is a great job. I was having trouble creating a regular expression for email validation, but this solves all my problems.

Thanks a lot.

Comment by Thorsten

February 24, 2008 @ 6:09 pm

What about e-mail addresses containing punycode; it’s a replacement for internationalized domain names (IDN) like so called “umlaut domains” (using ä, ö, ü, etc.)?

See RFC 3492 for details.
Example: mq@ยจฆฟคฏข.tld -> me@xn-22cdfh1b8fsa.tld (this is a valid punycode representation for an IDN)

Pingback by LocationAd » Blog Archive » E-Mail-Validierung 2.0 - DNS MX-Prüfung

February 24, 2008 @ 6:18 pm

[…] An RFC-compliant, regex-based e-mail validation for Java can be found here: http://www.leshazlewood.com/wp-trackback.php?p=5 […]

Comment by Casey

May 12, 2008 @ 3:19 pm

Hi there!

I wanted to let you know that I have taken your code and added a number of features to it. I post the link here in case it’s useful to you or anyone reading this. Essentially it adds a number of functions for extracting addresses (and parts of addresses), as well as verifying whole headers (including group tokens, etc.)

You can find it (along with documentation, etc) at:

http://boxbe.com/freebox.html

Modified/added: removed some functions, added support for CFWS token, corrected FWSP token, added some boolean flags, added getInternetAddress and extractHeaderAddresses and other methods, did some optimization of the regex.

Where Mr. Hazlewood’s version was more for ensuring certain forms that were passed in during registrations, etc, this handles more types of verifying as well as a few forms of extracting the data in predictable, cleaned-up chunks.

(I see that you removed my other rambling comments, which I was going to ask you to do anyway. :-) )

Thanks again,
-Casey

Comment by cherouvim

October 16, 2008 @ 2:52 am

Hello

This email shows valid: test;test@example.com

But is it really? When I try to send to this I get:
javax.mail.internet.AddressException: Illegal semicolon, not in group in string “test;test@example.com”

thanks
Ioannis

Comment by Giancarlo Angulo

November 13, 2008 @ 12:32 am

was about to start on this but you probably saved me a couple of hours, Thanks A Lot!

Comment by Jon Strayer

November 23, 2008 @ 12:15 pm

Wow, thank you. This is very helpful.

According to Wikipedia (for what it’s worth) this address is valid:
abc+mailbox/department=shipping@example.com

It seems to cause the pattern matcher to go into an endless loop.

A similar address:
abc+mailbox/department.shipping@example.com
takes just over 7.5 seconds to validate.

The combination of ‘+’ and ‘=’ seems to be what is causing the problems.

Comment by FloBa

February 2, 2009 @ 10:13 pm

Great work.

Just in short: setting ALLOW_DOMAIN_LITERALS will
validate a@b as valid.

Regards

Comment by Eric Silva

March 24, 2009 @ 8:52 am

Has this code been updated to comply with RFC 5322 (http://www.ietf.org/rfc/rfc5322.txt), which supersedes RFC 2822?

Comment by Richard Berger

July 16, 2009 @ 5:42 pm

Thanks so much!!! Clicked on some ads for you too :)

Comment by Leo

October 13, 2009 @ 4:57 pm

I should mention that if ALLOW_DOMAIN_LITERALS = true;
then a@b is valid, but if ALLOW_DOMAIN_LITERALS = false; then a@b is not valid
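This follows from the two domain definitions in the listing: the RFC 2822 dot-atom form accepts a single atom with no dot, while the RFC 1035 form requires a final dot followed by two to six letters. A reduced sketch of just the domain part (whitespace and bracketed literals omitted, class and constant names invented for this example):

```java
import java.util.regex.Pattern;

/** Sketch comparing the two domain definitions on a bare, dot-less domain. */
final class DomainFlagSketch {
    private static final String atext =
            "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]";
    private static final String dotAtomText = atext + "+" + "(" + "\\." + atext + "+)*";

    private static final String letter = "[a-zA-Z]";
    private static final String letDig = "[a-zA-Z0-9]";
    private static final String letDigHyp = "[a-zA-Z0-9-]";
    private static final String rfcLabel = letDig + "(" + letDigHyp + "{0,61}" + letDig + ")?";
    private static final String rfc1035DomainName =
            rfcLabel + "(\\." + rfcLabel + ")*\\." + letter + "{2,6}";

    // Dot-atom branch of the RFC 2822 domain: the dotted tail is optional.
    static final Pattern RFC2822_STYLE_DOMAIN = Pattern.compile(dotAtomText);
    // RFC 1035 style domain: a dot and a 2-6 letter suffix are mandatory.
    static final Pattern RFC1035_STYLE_DOMAIN = Pattern.compile(rfc1035DomainName);

    public static void main(String[] args) {
        for (String domain : new String[] {"b", "b.com"}) {
            System.out.println(domain
                    + " rfc2822-style=" + RFC2822_STYLE_DOMAIN.matcher(domain).matches()
                    + " rfc1035-style=" + RFC1035_STYLE_DOMAIN.matcher(domain).matches());
        }
    }
}
```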

Comment by Muhammad Khokhar

October 27, 2009 @ 7:51 am

Well, the best way to do it using Java is as follows:

————

String email = "muhdadeel@yahoo.com";
Pattern p = Pattern.compile(".+@.+\\.[a-z]+");
Matcher m = p.matcher(email);

boolean matchFound = m.matches();

// we have to make sure the user doesn't enter only a@b.c, since it should be at least a@b.cc

StringTokenizer st = new StringTokenizer(email, ".");
String lastToken = null;
while (st.hasMoreTokens()) {
    lastToken = st.nextToken();
}
if (matchFound && lastToken.length() >= 2) {
    out.println("Valid Email");
} else {
    out.println("sorry,invalid");
}

That’s the best way, pals…

Comment by Lance Lavandowska

November 10, 2009 @ 3:19 pm

Les, I had a requirement to allow non-ascii letters (acutes, umlauts, and such). I replaced any instance of a-zA-Z with \\p{L} and changed the final compile step to Pattern.compile(patternString, Pattern.UNICODE_CASE). I don’t know if this deviates from the spec (too lazy) but I thought I’d pass it on. Thanks for the great class!
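A rough sketch of that substitution, applied only to the atext character class, might look like this. The class and method names are hypothetical, and the check covers a single run of atext characters rather than a whole address. Note that RFC 2822 itself defines atext over ASCII only, so this is an extension of the spec rather than an implementation of it.

```java
import java.util.regex.Pattern;

/** Sketch: the atext class with a-zA-Z swapped for \p{L} (any Unicode letter). */
final class UnicodeAtextSketch {
    // ASCII-only run of atext characters, as in the original listing.
    private static final Pattern ASCII_ATEXT_RUN = Pattern.compile(
            "[a-zA-Z0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]+");

    // Same class with \p{L} in place of a-zA-Z, compiled with UNICODE_CASE as described.
    private static final Pattern UNICODE_ATEXT_RUN = Pattern.compile(
            "[\\p{L}0-9\\!\\#\\$\\%\\&\\'\\*\\+\\-\\/\\=\\?\\^\\_\\`\\{\\|\\}\\~]+",
            Pattern.UNICODE_CASE);

    static boolean asciiOnly(String s) { return ASCII_ATEXT_RUN.matcher(s).matches(); }
    static boolean withLetters(String s) { return UNICODE_ATEXT_RUN.matcher(s).matches(); }

    public static void main(String[] args) {
        String name = "jos\u00e9"; // "jose" with an acute accent on the e
        System.out.println("ascii-only: " + asciiOnly(name));
        System.out.println("with \\p{L}: " + withLetters(name));
    }
}
```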

Comment by PUK

December 23, 2009 @ 9:08 am

I incorporated this into my project 3 months ago. Today I give it the string “sdlkfjaklsdfjaskldfjaslkdjfflasda@sdffjfj” and it locks up! When breaking into the debugger, I’ve got a massive call-stack. Here’s just a small fraction of it:


java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$6.isSatisfiedBy(Pattern.java:4763)
java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
java.util.regex.Pattern$Curly.match0(Pattern.java:3760)
java.util.regex.Pattern$Curly.match(Pattern.java:3744)
java.util.regex.Pattern$Curly.match0(Pattern.java:3789)
java.util.regex.Pattern$Curly.match(Pattern.java:3744)
