Extracting Anchor Tag From Html Using Java

November 28, 2023

I have several anchor tags in a text. Input: <a href="http://stackoverflow.com">Take me to StackOverflow</a> Output: http://stackoverflow.com How can I find all thos

Solution 1:

There are classes in the core API that you can use to get all href attributes from anchor tags (if present!):

import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String[] args) throws Exception {

       String html = "<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> " +
           "<!--                                                               " +
           "<a href=\"http://ignoreme.com\" >...</a>                           " +
           "-->                                                                " +
           "<a href=\"http://www.google.com\" >Take me to Google</a>           " +
           "<a>NOOOoooo!</a>                                                   ";

       Reader reader = new StringReader(html);
       HTMLEditorKit.Parser parser = new ParserDelegator();
       final List<String> links = new ArrayList<String>();

       parser.parse(reader, new HTMLEditorKit.ParserCallback() {
           public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
               if (t == HTML.Tag.A) {
                   Object link = a.getAttribute(HTML.Attribute.HREF);
                   if(link != null) {
                       links.add(String.valueOf(link));
                   }
               }
           }
       }, true);

       reader.close();
       System.out.println(links);
   }
}

which will print:

[http://stackoverflow.com, http://www.google.com]

Solution 2:

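import java.util.regex.Matcher;
import java.util.regex.Pattern;

// RegexDemo is a placeholder class name, added only so the snippet compiles.
public class RegexDemo {
    public static void main(String[] args) {
        String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd"
                + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf";

        // Group 1 captures the quoted href value, quotes included.
        String regex = "<a href=(\"[^\"]*\")[^<]*</a>";

        Pattern p = Pattern.compile(regex);

        Matcher m = p.matcher(test);
        // Replace each whole anchor element with its captured href.
        System.out.println(m.replaceAll("$1"));
    }
}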

NOTE: All of Andrzej Doyle's points are valid. If your input contains more than simple <a href="X">Y</a> tags, and you are sure it is parsable HTML, then you are better off with an HTML parser.

To summarize:

The regex I posted doesn't work if you have an <a> tag inside a comment (you can treat it as a special case).
It doesn't work if you have other attributes in the <a> tag (again, you can treat it as a special case).
There are many other cases where the regex won't work, and you cannot cover all of them with a regex, since HTML is not a regular language.

However, if your requirement is always to replace <a href="X">Y</a> with "X" without considering the context, then the code I've posted will work.
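
If the goal is to collect the URLs rather than rewrite the text, a find() loop is a natural variation. The following is only a sketch under stated assumptions: the class name RegexLinkExtractor and the method extractHrefs are made up for illustration, href values are assumed to be double-quoted, and the pattern is loosened to tolerate other attributes before href. It still reports links inside comments, which illustrates the first limitation listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class RegexLinkExtractor {
    // Matches an opening <a ...> tag and captures a double-quoted href value.
    // Assumptions: double quotes only, and no '>' inside earlier attribute values.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the URL without the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "qazwsx<a class=\"x\" href=\"http://stackoverflow.com\">Take me to StackOverflow</a>"
                + "<!-- <a href=\"http://ignoreme.com\">...</a> -->"
                + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>";
        // The commented-out link is reported too: the regex has no notion of comments.
        System.out.println(extractHrefs(html));
        // prints [http://stackoverflow.com, http://ignoreme.com, http://stackoverflow2.com]
    }
}
```

Anchors without an href are simply skipped, because the pattern cannot match past the closing > of the tag.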

Solution 3:

You can use JSoup

String html = "<p>An <a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String linkHref = link.attr("href"); // "http://stackoverflow.com"

Solution 4:

The above example works perfectly. If you want to parse an HTML document fetched from a URL instead of concatenated strings, you can complement the code above with something like this.

The existing code above appears below as HtmlParser.java (a modified HtmlParseDemo.java), complemented by HtmlPage.java, which fetches the page. The URL comes from the main.url property in the HtmlPage.properties file:

main.url=http://www.whatever.com/

That way you can just parse the URL that you're after. :-) Happy coding :-D

import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlParser
{
    public static void main(String[] args) throws Exception
    {
        String html = HtmlPage.getPage();

        Reader reader = new StringReader(html);
        HTMLEditorKit.Parser parser = new ParserDelegator();
        final List<String> links = new ArrayList<String>();

        parser.parse(reader, new HTMLEditorKit.ParserCallback()
        {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
            {
                if (t == HTML.Tag.A)
                {
                    Object link = a.getAttribute(HTML.Attribute.HREF);
                    if (link != null)
                    {
                        links.add(String.valueOf(link));
                    }
                }
            }
        }, true);

        reader.close();

        // create the header
        System.out.println("<html>\n<head>\n   <title>Link City</title>\n</head>\n<body>");

        // spit out the links and create href
        for (String l : links)
        {
            System.out.print("   <a href=\"" + l + "\">" + l + "</a>\n");
        }

        // create footer
        System.out.println("</body>\n</html>");
    }
}

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ResourceBundle;

public class HtmlPage
{
    public static String getPage()
    {
        StringWriter sw = new StringWriter();
        // Loads HtmlPage.properties from the classpath.
        ResourceBundle bundle = ResourceBundle.getBundle(HtmlPage.class.getName());

        try
        {
            URL url = new URL(bundle.getString("main.url"));

            // A plain GET sends no request body, so setDoOutput(true) is not needed.
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");

            InputStream content = connection.getInputStream();
            BufferedReader in = new BufferedReader(new InputStreamReader(content));

            String line;

            // Copy the response into the buffer line by line.
            while ((line = in.readLine()) != null)
            {
                sw.append(line).append("\n");
            }

            in.close();

        } catch (Exception e)
        {
            e.printStackTrace();
        }

        return sw.getBuffer().toString();
    }
}

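For example, with main.url set to http://ebay.com.au/, the program prints an HTML page of links that can be viewed in a browser. This is a subset, as there are a lot of links:

    <html>
    <head>
       <title>Link City</title>
    </head>
    <body>
       <a href="#mainContent">#mainContent</a>
       <a href="http://realestate.ebay.com.au/">http://realestate.ebay.com.au/</a>
    </body>
    </html>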
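
Note that #mainContent in the sample output is a fragment-only link, not a full URL. If absolute URLs are wanted, one possible approach is to resolve each href against the page URL with java.net.URI. This is a sketch only: LinkResolver and resolveAll are hypothetical names, and it assumes every href is a syntactically valid URI (URI.resolve throws IllegalArgumentException otherwise).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class LinkResolver {
    // Resolves each href against the URL of the page it was found on.
    static List<String> resolveAll(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            // Absolute hrefs come back unchanged; relative ones are joined to the base.
            absolute.add(base.resolve(href).toString());
        }
        return absolute;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of("#mainContent", "http://realestate.ebay.com.au/");
        System.out.println(resolveAll("http://ebay.com.au/", hrefs));
        // prints [http://ebay.com.au/#mainContent, http://realestate.ebay.com.au/]
    }
}
```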

Solution 5:

The most robust way (as has been suggested already) is to use regular expressions (java.util.regex), if you are required to build this without using third-party libraries.

The alternative is to parse the HTML as XML, either using a SAX parser to capture and handle each instance of an "a" element, or as a DOM Document that you then search using XPath (see http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/package-summary.html). This is problematic, though, since it requires the HTML page to be fully XML-compliant in its markup. That is a very dangerous assumption, and not an approach I would recommend, since most "real" HTML pages are not XML-compliant.
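
A minimal sketch of the DOM plus XPath variant, using only the JDK's javax.xml packages, might look like the following. The class name XPathLinkExtractor and the method extractHrefs are made up for illustration, and the input is assumed to be well-formed markup with no XML namespace declared. The second call in main shows the failure mode described above: ordinary HTML with an unclosed <br> is rejected by the XML parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original answer.
class XPathLinkExtractor {
    // Returns the href of every <a> element; throws IllegalArgumentException
    // when the markup is not well-formed XML.
    static List<String> extractHrefs(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (SAXException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        } catch (ParserConfigurationException | IOException | XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String wellFormed = "<p><a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>"
                + "<!-- <a href=\"http://ignoreme.com\">...</a> -->"
                + "<a>NOOOoooo!</a></p>";
        // Comments are not elements, and the anchor without href has no @href node.
        System.out.println(extractHrefs(wellFormed)); // prints [http://stackoverflow.com]

        try {
            extractHrefs("<p><a href=\"http://stackoverflow.com\">unclosed</a><br></p>");
        } catch (IllegalArgumentException e) {
            System.out.println("Parse failed: " + e.getMessage());
        }
    }
}
```

The default parser may also log the parse error to standard error before the exception is thrown.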

Still, I would recommend also looking at existing frameworks out there built for this purpose (like JSoup, also mentioned above). No need to reinvent the wheel.
