<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Parsing a URL with a Regex</title>
	<atom:link href="http://lostechies.com/chadmyers/2010/11/20/parsing-a-url-with-a-regex/feed/" rel="self" type="application/rss+xml" />
	<link>http://lostechies.com/chadmyers/2010/11/20/parsing-a-url-with-a-regex/</link>
	<description>Software development, testing, and techie life</description>
	<lastBuildDate>Thu, 08 Mar 2012 22:19:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.2</generator>
	<item>
		<title>By: Speednet</title>
		<link>http://lostechies.com/chadmyers/2010/11/20/parsing-a-url-with-a-regex/#comment-1213</link>
		<dc:creator>Speednet</dc:creator>
		<pubDate>Sun, 21 Nov 2010 13:31:24 +0000</pubDate>
		<guid isPermaLink="false">/blogs/chad_myers/archive/2010/11/19/parsing-a-url-with-a-regex.aspx#comment-1213</guid>
		<description>For some reason, John Gruber tries to match balanced parens in the URL.  It&#039;s great that he correctly matches parens, but there is absolutely no requirement for parens to be &quot;balanced&quot; in a URL.  For example, this URL is fine:  http://www.test.com/this(page.htm

Also, there is totally no need to look for &quot;www&quot; or &quot;www\d+&quot;.   Lots of sites have dropped the &quot;www&quot;.

Better just to code it like this:  (?:[-a-z0-9]{1,63}\.)+

Or, if your regex engine supports lookbehind, you can reject a hyphen right before each dot, which is invalid in a hostname:  (?:[-a-z0-9]{1,63}(?&lt;!-)\.)+

(Note: by placing the hyphen as the first char inside the brackets, we can avoid escaping it, making it more readable.)

Another tip:  top-level domains are currently up to 6 characters in length (e.g., &quot;.museum&quot;), so it should be changed to [a-z]{2,6}.

-Todd
</description>
		<content:encoded><![CDATA[<p>For some reason, John Gruber tries to match balanced parens in the URL.  It&#8217;s great that he correctly matches parens, but there is absolutely no requirement for parens to be &#8220;balanced&#8221; in a URL.  For example, this URL is fine:  <a href="http://www.test.com/this(page.htm" rel="nofollow">http://www.test.com/this(page.htm</a></p>
<p>Also, there is totally no need to look for &#8220;www&#8221; or &#8220;www\d+&#8221;.   Lots of sites have dropped the &#8220;www&#8221;.</p>
<p>Better just to code it like this:  (?:[-a-z0-9]{1,63}\.)+</p>
<p>Or, if your regex engine supports lookbehind, you can reject a hyphen right before each dot, which is invalid in a hostname:  (?:[-a-z0-9]{1,63}(?<!-)\.)+</p>
<p>(Note: by placing the hyphen as the first char inside the brackets, we can avoid escaping it, making it more readable.)</p>
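<p>A minimal sketch of the lookbehind variant above, assuming Python's re module (which supports fixed-width lookbehind). The names LABELS and leading_labels are made up for illustration, and the case-insensitive flag is an assumption, since hostnames are not case-sensitive.</p>

```python
import re

# One hostname label: 1-63 characters from [-a-z0-9], not ending in a
# hyphen, followed by a dot. The (?<!-) lookbehind rejects "foo-." labels.
LABELS = re.compile(r"(?:[-a-z0-9]{1,63}(?<!-)\.)+", re.IGNORECASE)


def leading_labels(host: str) -> str:
    """Return the run of dot-terminated labels at the start of host ('' if none)."""
    m = LABELS.match(host)
    return m.group(0) if m else ""


if __name__ == "__main__":
    for host in ("www.example.com", "lostechies.com", "bad-.example.com"):
        print(repr(host), "->", repr(leading_labels(host)))
```

<p>Note that the pattern only covers the dot-terminated labels, so the final TLD still needs its own piece of the regex. It also does not reject a label that starts with a hyphen, which would need a separate check.</p>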
<p>Another tip:  top-level domains are currently up to 6 characters in length (e.g., &#8220;.museum&#8221;), so it should be changed to [a-z]{2,6}.</p>
<p>-Todd</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: bob</title>
		<link>http://lostechies.com/chadmyers/2010/11/20/parsing-a-url-with-a-regex/#comment-1212</link>
		<dc:creator>bob</dc:creator>
		<pubDate>Sat, 20 Nov 2010 10:41:18 +0000</pubDate>
		<guid isPermaLink="false">/blogs/chad_myers/archive/2010/11/19/parsing-a-url-with-a-regex.aspx#comment-1212</guid>
		<description>@Aidan : ok thanks, got it ;) </description>
		<content:encoded><![CDATA[<p>@Aidan : ok thanks, got it ;) </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Aidan</title>
		<link>http://lostechies.com/chadmyers/2010/11/20/parsing-a-url-with-a-regex/#comment-1211</link>
		<dc:creator>Aidan</dc:creator>
		<pubDate>Sat, 20 Nov 2010 02:27:46 +0000</pubDate>
		<guid isPermaLink="false">/blogs/chad_myers/archive/2010/11/19/parsing-a-url-with-a-regex.aspx#comment-1211</guid>
		<description>@bob The problem is identifying where a URL occurs within an arbitrary string of text. The RFC regex breaks a *known* URL into its constituent parts.</description>
		<content:encoded><![CDATA[<p>@bob The problem is identifying where a URL occurs within an arbitrary string of text. The RFC regex breaks a *known* URL into its constituent parts.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: bob</title>
		<link>http://lostechies.com/chadmyers/2010/11/20/parsing-a-url-with-a-regex/#comment-1210</link>
		<dc:creator>bob</dc:creator>
		<pubDate>Sat, 20 Nov 2010 01:18:21 +0000</pubDate>
		<guid isPermaLink="false">/blogs/chad_myers/archive/2010/11/19/parsing-a-url-with-a-regex.aspx#comment-1210</guid>
		<description>The regex for parsing a URL is in the RFC: http://tools.ietf.org/html/rfc3986#page-50 Or am I missing something?</description>
		<content:encoded><![CDATA[<p>The regex for parsing a URL is in the RFC: <a href="http://tools.ietf.org/html/rfc3986#page-50" rel="nofollow">http://tools.ietf.org/html/rfc3986#page-50</a> Or am I missing something?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
