Parsing a URL with a Regex

If that title didn’t strike you dead with fear, then you’ve never attempted this impossible task before. I consider it right up there with finding Shangri-la, Atlantis, Z, or El Dorado.

Lots of ink has been spilt (or rather, bytes have been wasted) seeking this mythical creature, but no silver bullet has been found. Mike Strobel, a friend on Twitter, pointed me to John Gruber’s attempt at the problem. You may know John for his co-authoring of “Markdown.” Needless to say, he knows a thing or two about parsing text.

John’s regex is nearly perfect and captures most of the nastiest test cases he or I could throw at it. He released this pattern into the public domain, but it doesn’t appear to be actively maintained anywhere that I could see. Either that, or my Googling skills are failing me. If you know of a project that is actively maintaining his pattern, please let me know in the comments below.

I wanted to get this thing on GitHub ASAP so that the world might begin maintaining it in the hopes of, together, developing the One, True, Perfect URL Regex Pattern (OTPURP; ok, so I need better marketing).

If you have interest in something like this, or if you have developed the OTPURP and wish to share or contribute it to the world, please check out my attempt to centralize it:

https://github.com/chadmyers/UrlRegex

If it takes off, I’m happy to move the repo to a neutral account and turn control over to someone (or several someones) else.

I’ve set up a test suite for it based on John’s test cases and some of my own. If you clone the source and open start.html, you should see something like this:

[Image: urlregex]


    About Chad Myers

    Chad Myers is the Director of Development for Dovetail Software, in Austin, TX, where he leads a premier software team building complex enterprise software products. Chad is a .NET software developer specializing in enterprise software designs and architectures. He has over 12 years of software development experience and a proven track record of Agile, test-driven project leadership using both Microsoft and open source tools. He is a community leader who speaks at the Austin .NET User's Group, the ADNUG Code Camp, and participates in various development communities and open source projects.
    This entry was posted in parsing, regex.
    • bob

      the regex for parsing a url is in the rfc http://tools.ietf.org/html/rfc3986#page-50 or am I missing something?

    • http://mastersband.com Aidan

      @bob The problem is identifying where a URL occurs within an arbitrary string of text. The RFC regex breaks a *known* URL into its constituent parts.
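
To illustrate the distinction drawn in this reply, here is a minimal Python sketch that uses the regular expression given in RFC 3986, Appendix B. The function name `split_uri` and the sample strings are made up for illustration. The pattern splits a string that is already known to be a URI into its components, but it accepts any input at all, so it is no help in locating a URL inside running text.

```python
import re

# Regular expression from RFC 3986, Appendix B. It decomposes a URI
# reference into components; it does not decide whether the input is a URL.
RFC3986_PARTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

def split_uri(uri):
    """Split a known URI into its five components (hypothetical helper)."""
    m = RFC3986_PARTS.match(uri)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),      # None when there is no "?" part
        "fragment": m.group(9),   # None when there is no "#" part
    }

# A known URL is broken down as expected.
known = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")

# Arbitrary prose also "matches": everything before the first colon is
# taken as the scheme, so the URL inside the sentence is not isolated.
prose = split_uri("see http://example.com for details")
```

In the second call the scheme comes back as `see http` and the authority as `example.com for details`, which is why a different kind of pattern is needed to find URLs within free text.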

    • bob

      @Aidan: ok thanks, got it ;)

    • http://www.speednet.biz/ Speednet

      For some reason, John Gruber tries to match balanced parens in the URL. It’s great that he correctly matches parens, but there is absolutely no requirement for parens to be “balanced” in a URL. e.g., this URL is fine: http://www.test.com/this(page.htm

      Also, there is totally no need to look for “www” or “www\d+”. Lots of sites have dropped the “www”.

      Better just to code it like this: (?:[-a-z0-9]{1,63}\.)+

      Or, if your regex engine supports lookbacks, you can avoid the invalid syntax of a hyphen right before each dot: (?:[-a-z0-9]{1,63}(?

      (Note: by placing the hyphen as the first char inside the brackets, we can avoid escaping it, making it more readable.)

      Another tip: top-level domains are currently up to 6 characters in length (e.g., “.museum”), so it should be changed to [a-z]{2,6}.

      -Todd
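
As a rough sketch of the suggestions in the last comment, the Python below combines the label fragment `(?:[-a-z0-9]{1,63}\.)+` with the `[a-z]{2,6}` top-level-domain fragment into a host-name matcher. The variable names are hypothetical. The lookbehind variant is an assumption: the commenter's own pattern is cut off in the text above, so `(?<!-)` is only one plausible way to reject a hyphen immediately before a dot, not a reconstruction of what was posted.

```python
import re

# Fragments quoted in the comment above.
LABELS = r"(?:[-a-z0-9]{1,63}\.)+"
TLD = r"[a-z]{2,6}"

# ASSUMPTION: a negative lookbehind that forbids a hyphen right before
# each dot. The original pattern is truncated, so this is a guess.
LABELS_STRICT = r"(?:[-a-z0-9]{1,63}(?<!-)\.)+"

HOST = re.compile(LABELS + TLD, re.IGNORECASE)
HOST_STRICT = re.compile(LABELS_STRICT + TLD, re.IGNORECASE)

def is_host(pattern, text):
    """Return True when the whole string matches the host pattern."""
    return pattern.fullmatch(text) is not None

# No "www" prefix is required by either pattern.
plain = is_host(HOST, "test.com")

# The loose pattern accepts a label ending in a hyphen; the strict one does not.
loose_hyphen = is_host(HOST, "bad-.com")
strict_hyphen = is_host(HOST_STRICT, "bad-.com")

# The {2,6} bound admits ".museum" but rejects a 7-letter suffix.
six_letters = is_host(HOST, "example.museum")
seven_letters = is_host(HOST, "example.travels")
```

Neither variant checks for a hyphen at the start of a label, and the six-character limit reflects the claim made in the comment rather than any fixed rule.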