Sunday, November 8, 2015

Get all links from a website in Java with a time limit

I want to collect all the links from a web page into a list, but stop after 5 minutes and keep whatever the list contains at that point.

Collecting the links works, but I can't get it to stop.

            public void fillList(String path) throws Exception
            {
                list = new ArrayList<String>();
                Reader r = null;

                timer = new StoppingThread();
                timer.start();

                try
                {
                    URL u = new URL(path);
                    InputStream in = u.openStream();
                    r = new InputStreamReader(in);

                    ParserDelegator hp = new ParserDelegator();
                    hp.parse(r, new HTMLEditorKit.ParserCallback()
                    {
                        public void handleStartTag(HTML.Tag t, MutableAttributeSet a,
                                int pos)
                        {
                            if (t == HTML.Tag.A)
                            {
                                if (!timer.isActive())
                                    return;

                                @SuppressWarnings("rawtypes")
                                Enumeration attrNames = a.getAttributeNames();
                                while (attrNames.hasMoreElements())
                                {
                                    if (!timer.isActive())
                                        return;

                                    Object key = attrNames.nextElement();
                                    if ("href".equals(key.toString()))
                                    {
                                        if (!list.contains((String) a.getAttribute(key)))
                                        {
                                            if (a.getAttribute(key).toString().startsWith("https://"))
                                            {
                                                list.add((String) a.getAttribute(key));
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }, true);
                }
                finally
                {
                    if (r != null)
                    {
                        r.close();
                    }
                }
            }
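
One thing to keep in mind about the method above: the isActive() checks only run when the parser hands a tag to the callback, so a read that is stalled on a slow server never reaches them. Below is a minimal sketch, assuming the stream is opened through URLConnection instead of URL.openStream(), of putting timeouts on the connection. The class name, the helper openWithTimeouts, and the timeout values are made up for illustration. Note that the read timeout bounds each individual read, not the total download time.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

public class TimeoutConnectionDemo {

    // Hypothetical helper: prepares a connection with connect and read timeouts.
    // openConnection() does not contact the server yet; that happens on
    // connect() or getInputStream().
    static URLConnection openWithTimeouts(String path, int connectMillis, int readMillis) {
        try {
            URLConnection conn = new URL(path).openConnection();
            conn.setConnectTimeout(connectMillis); // max wait to establish the connection
            conn.setReadTimeout(readMillis);       // max wait for each blocking read
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL and values, chosen only for the example.
        URLConnection conn = openWithTimeouts("https://example.com/", 10_000, 30_000);
        System.out.println("connect timeout ms: " + conn.getConnectTimeout());
        System.out.println("read timeout ms: " + conn.getReadTimeout());
        // conn.getInputStream() would then replace u.openStream() in fillList.
    }
}
```

With this in place, a stalled read would fail with a SocketTimeoutException instead of blocking indefinitely, so the existing finally block still closes the reader.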

I tried a simple timer thread: when the time is up, a boolean flag becomes false. The return in the callback doesn't seem to have any effect, though, and with the timer running the method sometimes takes even longer.

This is the Thread:

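            public class StoppingThread extends Thread
            {
                // volatile so the parsing thread sees the update; starts out
                // true so isActive() never returns null before run() begins
                private volatile boolean active = true;

                public void run()
                {
                    try
                    {
                        sleep(1000 * 60 * 5); // 5 minutes, matching the stated limit
                    }
                    catch (InterruptedException e)
                    {
                        // interrupted early: fall through and mark as inactive
                    }
                    active = false;
                }

                public boolean isActive()
                {
                    return active;
                }
            }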
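
A return inside handleStartTag only ends that one callback; the parser then carries on with the next tag, which would explain why the flag seems to have no effect. One possible way to enforce the limit is to cut off the input instead. The sketch below is an illustration under assumptions, not a verified fix for the code above: DeadlineReader and readAll are hypothetical names, and the idea is simply to report end-of-stream once the deadline has passed, so that whatever consumes the reader finishes as if the page had ended. It cannot interrupt a read that is already blocked on the network.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class DeadlineReaderDemo {

    // Hypothetical wrapper: behaves like the wrapped reader until the
    // deadline passes, then reports end-of-stream (-1) on every read.
    static class DeadlineReader extends FilterReader {
        private final long deadlineNanos;

        DeadlineReader(Reader in, long timeoutMillis) {
            super(in);
            // nanoTime is monotonic, so wall-clock changes do not affect it
            this.deadlineNanos = System.nanoTime() + timeoutMillis * 1_000_000L;
        }

        private boolean expired() {
            return System.nanoTime() - deadlineNanos >= 0;
        }

        @Override
        public int read() throws IOException {
            return expired() ? -1 : super.read();
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            return expired() ? -1 : super.read(buf, off, len);
        }
    }

    // Small helper for the demo: drains a reader into a String.
    static String readAll(Reader r) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://example.com/\">link</a>";
        // Generous deadline: the whole input is read.
        System.out.println(readAll(new DeadlineReader(new StringReader(html), 60_000L)));
        // Deadline of zero: already expired, so nothing is read.
        System.out.println(readAll(new DeadlineReader(new StringReader(html), 0L)).isEmpty());
    }
}
```

In fillList this would amount to wrapping the reader, for example r = new DeadlineReader(new InputStreamReader(in), 5 * 60 * 1000L), which would make the separate timer thread unnecessary.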

I'm also using Apache's commons-io-2.4.jar for this. Can anyone tell me what I'm doing wrong, or how to do it properly?
