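I am trying to use the Daring Fireball Regular Expression for matching URLs in Java, and I have found a URL that causes the evaluation to take forever. I modified the original regular expression to use Java syntax.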
private final static String pattern =
"\\b" +
"(" + // Capture 1: entire matched URL
"(?:" +
"[a-z][\\w-]+:" + // URL protocol and colon
"(?:" +
"/{1,3}" + // 1-3 slashes
"|" + // or
"[a-z0-9%]" + // Single letter or digit or '%'
// (Trying not to match e.g. "URI::Escape")
")" +
"|" + // or
"www\\d{0,3}[.]" + // "www.", "www1.", "www2." … "www999."
"|" + // or
"[a-z0-9.\\-]+[.][a-z]{2,4}/" + // looks like domain name followed by a slash
")" +
"(?:" + // One or more:
"[^\\s()<>]+" + // Run of non-space, non-()<>
"|" + // or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens, up to 2 levels
")+" +
"(?:" + // End with:
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens, up to 2 levels
"|" + // or
"[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" + // not a space or one of these punct chars (updated to add a 'dash')
")" +
")";
// @see //07/improved_regex_for_matching_urls
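private static final Pattern DARING_FIREBALL_PATTERN = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
If I try to run the following, it takes forever. I have narrowed it down to the matching of balanced parens (I think). If you shorten the text inside the parens, it works fine, but at around 15 characters it starts to slow down noticeably, and it gets worse with every character added.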
final Matcher matcher = DARING_FIREBALL_PATTERN.matcher("https://goo.gl/a(something_really_long_in_balanced_parens)");
boolean found = matcher.find();
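Is there a way to improve this regular expression so that matching such lines does not take forever? I have about 100 different URLs in a JUnit test class, and I need those to continue to work as well.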
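One direction that might be worth trying, given as a sketch only: the balanced-parens sub-pattern repeats `[^\s()<>]+` inside a `*` group, so when the overall match fails after the closing paren, the engine can retry every possible way of splitting the text inside the parens. Making those inner quantifiers possessive (`++` and `*+`, which Java's `java.util.regex` supports) removes that retrying. It should not change what the sub-pattern accepts, because the run can only be followed by `(` or `)`. The class name `UrlRegexSketch`, the `PARENS` constant, and the `findUrl` helper are hypothetical names introduced here for illustration, and the typographic quotes are written as Unicode escapes only to avoid source-encoding issues.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlRegexSketch {
    // Balanced parens, up to 2 levels. The possessive quantifiers (++ and *+)
    // stop the engine from re-splitting the text inside the parens on failure.
    private static final String PARENS =
        "\\((?:[^\\s()<>]++|(?:\\([^\\s()<>]++\\)))*+\\)";

    // Same structure as the pattern in the question, with PARENS swapped in.
    private static final String PATTERN =
        "\\b(" +
        "(?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)" +
        "(?:[^\\s()<>]+|" + PARENS + ")+" +
        "(?:" + PARENS +
        "|[^\\s`!\\-()\\[\\]{};:'\".,<>?\u00AB\u00BB\u201C\u201D\u2018\u2019])" +
        ")";

    private static final Pattern URL_PATTERN =
        Pattern.compile(PATTERN, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Hypothetical helper: returns the first URL found in text, or null if none.
    static String findUrl(String text) {
        Matcher m = URL_PATTERN.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(
            findUrl("https://goo.gl/a(something_really_long_in_balanced_parens)"));
    }
}
```

This only touches the balanced-parens part. The outer `(?:[^\s()<>]+|...)+` loop is still a quantifier nested inside a quantifier, so other inputs could in principle still backtrack heavily, and the change would need to be checked against the existing JUnit URLs before relying on it.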