public abstract class HTMLParser extends BaseParser
HTMLParser subclasses can parse HTML content to obtain URLs.
Modifier and Type | Field and Description |
---|---|
protected static String | ATT_BACKGROUND |
protected static String | ATT_CODE |
protected static String | ATT_CODEBASE |
protected static String | ATT_DATA |
protected static String | ATT_HREF |
protected static String | ATT_IS_IMAGE |
protected static String | ATT_REL |
protected static String | ATT_SRC |
protected static String | ATT_STYLE |
protected static String | ATT_TYPE |
static String | DEFAULT_PARSER |
protected static String | ICON |
protected static String | IE_UA |
protected static Pattern | IE_UA_PATTERN |
static String | PARSER_CLASSNAME |
protected static String | SHORTCUT_ICON |
protected static String | STYLESHEET |
protected static String | TAG_APPLET |
protected static String | TAG_BASE |
protected static String | TAG_BGSOUND |
protected static String | TAG_BODY |
protected static String | TAG_EMBED |
protected static String | TAG_FRAME |
protected static String | TAG_IFRAME |
protected static String | TAG_IMAGE |
protected static String | TAG_INPUT |
protected static String | TAG_LINK |
protected static String | TAG_OBJECT |
protected static String | TAG_SCRIPT |
Modifier | Constructor and Description |
---|---|
protected | HTMLParser() Protected constructor to prevent instantiation except from within subclasses. |
Modifier and Type | Method and Description |
---|---|
protected Float | extractIEVersion(String userAgent) |
Iterator<URL> | getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, Collection<URLString> coll, String encoding) Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc. |
Iterator<URL> | getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, String encoding) Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc. |
abstract Iterator<URL> | getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding) Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc. |
protected boolean | isEnableConditionalComments(Float ieVersion) |
protected static String | normalizeUrlValue(CharSequence url) Normalizes URL as browsers do. |
Methods inherited from class BaseParser: getParser, isReusable
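The getEmbeddedResourceURLs overloads take a baseUrl, and the method details state that URLs should not appear twice in the returned iterator. The following is a minimal, standalone sketch of that idea only, not the HTMLParser implementation: the class EmbeddedUrlSketch and the helper resolveAll are hypothetical names, and the input is a plain list of attribute values rather than parsed HTML. It uses java.net.URI instead of URL because URL.equals and URL.hashCode may perform host name resolution, which makes URL a poor fit for set membership.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; none of these names exist in JMeter.
class EmbeddedUrlSketch {

    // Resolves each raw reference (for example an src or href value)
    // against the base URL, dropping duplicates and keeping document order.
    static Iterator<URI> resolveAll(URI baseUrl, List<String> rawRefs) {
        Set<URI> seen = new LinkedHashSet<>();
        for (String ref : rawRefs) {
            try {
                seen.add(baseUrl.resolve(ref.trim()));
            } catch (IllegalArgumentException e) {
                // Assumption for this sketch: malformed references are skipped.
                // The documented contract instead lets the iterator report them.
            }
        }
        return seen.iterator();
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/dir/page.html");
        List<String> refs = Arrays.asList(
                "img/a.png", "/css/site.css", "img/a.png", "http://cdn.example.com/x.js");
        Iterator<URI> it = resolveAll(base, refs);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

The LinkedHashSet keeps first-seen order, so resources would be listed in the order they appear in the page, with the repeated img/a.png reference reported once.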
protected static final String ATT_BACKGROUND
protected static final String ATT_CODE
protected static final String ATT_CODEBASE
protected static final String ATT_DATA
protected static final String ATT_HREF
protected static final String ATT_REL
protected static final String ATT_SRC
protected static final String ATT_STYLE
protected static final String ATT_TYPE
protected static final String ATT_IS_IMAGE
protected static final String TAG_APPLET
protected static final String TAG_BASE
protected static final String TAG_BGSOUND
protected static final String TAG_BODY
protected static final String TAG_EMBED
protected static final String TAG_FRAME
protected static final String TAG_IFRAME
protected static final String TAG_IMAGE
protected static final String TAG_INPUT
protected static final String TAG_LINK
protected static final String TAG_OBJECT
protected static final String TAG_SCRIPT
protected static final String STYLESHEET
protected static final String SHORTCUT_ICON
protected static final String ICON
protected static final String IE_UA
protected static final Pattern IE_UA_PATTERN
public static final String PARSER_CLASSNAME
public static final String DEFAULT_PARSER
protected HTMLParser()
public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, String encoding) throws HTMLParseException
URLs should not appear twice in the returned iterator.
Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.
Parameters:
userAgent - User Agent
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
encoding - Charset
Throws:
HTMLParseException - when parsing the HTML fails

public abstract Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding) throws HTMLParseException
All URLs should be added to the Collection.
Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.
N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
Parameters:
userAgent - User Agent
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
coll - URLCollection
encoding - Charset
Throws:
HTMLParseException - when parsing the HTML fails

public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, Collection<URLString> coll, String encoding) throws HTMLParseException
N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
Parameters:
userAgent - User Agent
html - HTML code
baseUrl - Base URL from which the HTML code was obtained
coll - Collection, which will contain URLString objects, not URLs
encoding - Charset
Throws:
HTMLParseException - when parsing the HTML fails

protected final boolean isEnableConditionalComments(Float ieVersion)
Parameters:
ieVersion - Float IE version

protected Float extractIEVersion(String userAgent)
Parameters:
userAgent - User Agent

protected static String normalizeUrlValue(CharSequence url)
Parameters:
url - CharSequence
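
This page gives no description for extractIEVersion or isEnableConditionalComments, so the following is only a rough sketch of how a user agent string might be mapped to an Internet Explorer version and then to a conditional-comment decision. Everything in it is an assumption: the class IeVersionSketch, both method names, the regular expression (the actual value of IE_UA_PATTERN is not shown here), and the version threshold, which is based on Internet Explorer having dropped conditional comments in version 10.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the HTMLParser implementation.
class IeVersionSketch {

    // Assumed pattern: the "MSIE <major>.<minor>" token of older IE user agents.
    private static final Pattern ASSUMED_IE_UA =
            Pattern.compile("MSIE ([0-9]+\\.[0-9]+)");

    // Returns the IE version found in the user agent, or null if there is none.
    static Float extractVersion(String userAgent) {
        if (userAgent == null) {
            return null;
        }
        Matcher m = ASSUMED_IE_UA.matcher(userAgent);
        return m.find() ? Float.valueOf(m.group(1)) : null;
    }

    // Assumption: conditional comments matter only for IE versions below 10.
    static boolean conditionalCommentsApply(Float ieVersion) {
        return ieVersion != null && ieVersion < 10.0f;
    }

    public static void main(String[] args) {
        String ie8 = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";
        Float version = extractVersion(ie8);
        System.out.println(version + " -> " + conditionalCommentsApply(version));
    }
}
```

Returning a nullable Float mirrors the Float types in the signatures above and lets a non-IE user agent be represented without a sentinel value.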
Copyright © 1998-2019 Apache Software Foundation. All Rights Reserved.