Detect and extract URLs from a string?

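This is probably a simple question, but I can't figure it out. I want to detect URLs in a string and replace each one with a shortened URL. I found this expression on Stack Overflow, but the only thing it returns is http: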

Pattern p = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(str);
boolean result = m.find();
while (result) {
    for (int i = 1; i <= m.groupCount(); i++) {
        String url = m.group(i);
        str = str.replace(url, shorten(url));
    }
    result = m.find();
}
return str;

Any better ideas?

6 Answers

癸痊醒

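m.group(1) gives you the first matching group, that is, the contents of the first pair of capturing parentheses. Here that is (https?|ftp|file). Check what m.group(0) contains, or wrap the whole pattern in parentheses and use m.group(1) again. You also need to call find() again to move to the next match, and then read the groups of that new match.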
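
To make the difference concrete, here is a minimal sketch that runs the question's pattern unchanged and collects group(0) and group(1) for every match. The class and method names (GroupZeroDemo, wholeMatches, firstGroups) are made up for illustration and are not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex comes from the question.
class GroupZeroDemo {
    // The only capturing group in this pattern is the scheme.
    static final Pattern URL = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
            Pattern.CASE_INSENSITIVE);

    // Collects the whole match (group 0) for every URL found in the text.
    static List<String> wholeMatches(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group()); // group() is shorthand for group(0)
        }
        return urls;
    }

    // Collects group 1 (the scheme only) for every URL found in the text.
    static List<String> firstGroups(String text) {
        List<String> schemes = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            schemes.add(m.group(1));
        }
        return schemes;
    }
}
```

For the input "see http://example.com/a and ftp://host/file.txt now", wholeMatches should return the two full URLs, while firstGroups returns only http and ftp, which is the behaviour described in the question.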

届甸衬丝蚕

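Let me preface this by saying that I'm not a big advocate of regex for complex cases. Writing the perfect expression for something like this is very difficult. That said, I do happen to have one for detecting URLs, and it's backed by a 350-line unit test class that passes. Someone started with a simple regex, and over the years we've grown the expression and the test cases to handle the issues we've found. It's definitely not trivial: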

// Pattern for recognizing a URL, based off RFC 3986
private static final Pattern urlPattern = Pattern.compile(
        "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

Here's an example of using it:

Matcher matcher = urlPattern.matcher("foo bar http://example.com baz");
while (matcher.find()) {
    int matchStart = matcher.start(1);
    int matchEnd = matcher.end();
    // now you have the offsets of a URL match
}
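
As a rough sketch of what to do with those offsets, the snippet below turns them into substrings. The pattern is copied from the answer above; the class name UrlOffsetsDemo and the method extract are assumptions made for this example. Note that start(1) is used instead of start() because the overall match may begin with the non-word character consumed by (?:^|[\W]).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the regex comes from the answer.
class UrlOffsetsDemo {
    static final Pattern URL_PATTERN = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
                    + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
                    + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Returns the text between start(1) and end() for every match.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(text);
        while (matcher.find()) {
            // start(1) is where the scheme (or "www.") begins,
            // so the leading separator character is left out.
            urls.add(text.substring(matcher.start(1), matcher.end()));
        }
        return urls;
    }
}
```

Calling UrlOffsetsDemo.extract("foo bar http://example.com baz") should give a list containing only http://example.com.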

悍蕾驮苇袜

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Returns a list with all links contained in the input
 */
public static List<String> extractUrls(String text)
{
    List<String> containedUrls = new ArrayList<String>();
    // Note: inside the character class, \\+-= is a range from '+' to '='
    String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
    Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
    Matcher urlMatcher = pattern.matcher(text);

    while (urlMatcher.find())
    {
        containedUrls.add(text.substring(urlMatcher.start(0),
                urlMatcher.end(0)));
    }

    return containedUrls;
}

Example:

List<String> extractedUrls = extractUrls("Welcome to https://stackoverflow.com/ and here is another link http://www.google.com/ \n which is a great search engine");

for (String url : extractedUrls)
{
    System.out.println(url);
}

Prints:

https://stackoverflow.com/
http://www.google.com/

硕歌沙

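Put an extra pair of parentheses around the whole thing (except the word boundary at the start) and group 1 should match the whole domain name:

"\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"

I don't think that regex matches every possible URL, though.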
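
With the pattern wrapped like this, the replacement the question asks for could be sketched as follows. This is only one way to do it: it uses Matcher.appendReplacement instead of the question's str.replace loop, and the shortener parameter is a hypothetical stand-in for the question's shorten(url), whose implementation is not shown in the thread.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex comes from the answer.
class UrlReplaceDemo {
    // group(1) is the whole URL, group(2) is only the scheme.
    static final Pattern URL = Pattern.compile(
            "\\b((https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Replaces every URL in text with shortener.apply(url).
    static String replaceUrls(String text, Function<String, String> shortener) {
        Matcher m = URL.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = shortener.apply(m.group(1));
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }
}
```

Building the output in a separate buffer avoids modifying the string that the Matcher is still scanning, and each URL is replaced exactly where it was found rather than wherever the same text happens to occur.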

薄扩络拜

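Detecting URLs is not an easy task. If it's enough for you to get a string that starts with https?|ftp|file, then this could be fine. Your problem here is that you have a capturing group, the (), and it only surrounds the first part, http... I would make that part a non-capturing group using (?:) and put parentheses around the whole thing.

"\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])"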
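
A small sketch of the effect, assuming the pattern exactly as written above: with (?:...) around the scheme, the pattern has a single capturing group, so a loop from 1 to groupCount() like the one in the question would visit only the full URL. The class and method names here are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex comes from the answer.
class NonCapturingDemo {
    static final Pattern URL = Pattern.compile(
            "\\b((?:https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|])",
            Pattern.CASE_INSENSITIVE);

    // Number of capturing groups in the pattern; (?:...) is not counted.
    static int capturingGroups() {
        return URL.matcher("").groupCount();
    }

    // Returns group 1 of the first match, or null if there is no URL.
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

Here capturingGroups() should return 1. The final character class also leaves out trailing punctuation such as a comma, so firstUrl("x HTTPS://Example.com/path?q=1, y") should return the URL without the comma.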

艾食魄轻县

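This little code snippet / function will effectively extract URL strings from a string in Java. I started from a basic regex that I found for this and used it in a Java function. I expanded on the basic regex a bit with the "|www[.]" part, in order to catch links that do not start with "http://". Enough talk (it's cheap), here's the code: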

// Pull all links from the body for easy retrieval
private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<>();

    String regex = "\\(?\\b(http://|www[.])[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        // Strip a pair of parentheses that wraps the whole link
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}
