public class JsoupBasedHtmlParser extends HTMLParser
LagartoBasedHtmlParser and this one (adapter pattern)
Fields inherited from class HTMLParser: ATT_BACKGROUND, ATT_CODE, ATT_CODEBASE, ATT_DATA, ATT_HREF, ATT_IS_IMAGE, ATT_REL, ATT_SRC, ATT_STYLE, ATT_TYPE, DEFAULT_PARSER, ICON, IE_UA, IE_UA_PATTERN, PARSER_CLASSNAME, SHORTCUT_ICON, STYLESHEET, TAG_APPLET, TAG_BASE, TAG_BGSOUND, TAG_BODY, TAG_EMBED, TAG_FRAME, TAG_IFRAME, TAG_IMAGE, TAG_INPUT, TAG_LINK, TAG_OBJECT, TAG_SCRIPT
Constructor and Description |
---|
JsoupBasedHtmlParser() |
Modifier and Type | Method and Description |
---|---|
Iterator<URL> | getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding): Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, javascript files, applets, etc. |
Methods inherited from class HTMLParser: extractIEVersion, getEmbeddedResourceURLs, getEmbeddedResourceURLs, isEnableConditionalComments, normalizeUrlValue
getParser, isReusable
public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding) throws HTMLParseException
Description copied from class: HTMLParser
All URLs should be added to the Collection.
Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.
N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
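The contract described above (resolve each embedded reference against the base URL, collect the results, and keep the raw string when a reference is malformed) can be sketched with the JDK alone. This is only an illustration, not the parser's implementation: it does not use JMeter's URLCollection or URLString classes, and the names EmbeddedUrlContractSketch, Result and resolveAll are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names exist in JMeter.
final class EmbeddedUrlContractSketch {

    /** Resolved URLs, plus the raw strings that could not be turned into a URL. */
    static final class Result {
        final List<URL> resolved = new ArrayList<>();
        final List<String> malformed = new ArrayList<>();
    }

    /**
     * Resolves each raw attribute value (for example the value of a src or
     * href attribute) against the base URL. A value that cannot be parsed is
     * kept as a plain String so the caller can still see it.
     */
    static Result resolveAll(URL baseUrl, List<String> rawValues) {
        Result result = new Result();
        for (String raw : rawValues) {
            try {
                // new URL(context, spec) resolves relative references.
                result.resolved.add(new URL(baseUrl, raw.trim()));
            } catch (MalformedURLException e) {
                // Assumption: report the bad value instead of failing the whole parse.
                result.malformed.add(raw);
            }
        }
        return result;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://example.com/app/index.html");
        Result r = resolveAll(base,
                Arrays.asList("img/logo.png", "/css/site.css", "bogus://x/y.js"));
        // prints [http://example.com/app/img/logo.png, http://example.com/css/site.css]
        System.out.println(r.resolved);
        // prints [bogus://x/y.js]
        System.out.println(r.malformed);
    }
}
```

In this sketch a single bad reference does not abort processing. That mirrors the distinction the description draws between malformed URLs, which are reported individually, and overall parse failures, which are reported with an HTMLParseException.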
Specified by: getEmbeddedResourceURLs in class HTMLParser
Parameters:
userAgent - User Agent
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
coll - URLCollection
encoding - Charset
Throws:
HTMLParseException - when parsing the html fails
Copyright © 1998-2019 Apache Software Foundation. All Rights Reserved.