ParseManager


Inherits From:
LoadBroker
Declared In:
ParseManager.h


Class Description

This class manages downloads of pages for the consumer (see LoadBroker). It uses "parsing dictionaries", detailed specifications of the text format of a Web page, to break the page down into a structure of dictionaries and arrays representing its useful content. It is the workhorse of all Watson tools, and does all the dirty work so that the tool can focus on the user interface.

BE SURE to invoke detachConsumer on this object before deallocating this in your class's release method.

A parsing dictionary is normally loaded from a ".plist" file, which can be XML-based, but it may be easier to use the old NeXT-style plists because you only have to "escape" the quote marks, not the ubiquitous < and > characters found in HTML pages. You can load a parsing dictionary like this:

    
    NSString *path = [[NSBundle bundleForClass:[self class]]
        pathForResource:@"name" ofType:@"plist"];
    NSDictionary *parsingDict = [NSDictionary dictionaryWithContentsOfFile:path];
    

By using a URLManager as its "supplier", it manages an unlimited number of concurrent downloads. When a load finishes, it messages the consumer with the appropriate completion method, passing the dictionary that results from parsing the page (or any error message) and the "parameter" that was passed to the loading method. The parameter lets the consumer recognize the data that it has received, or recover something passed to the original query, for instance to display what was searched for along with what was found.

The Parsing Dictionary

The Parsing Dictionary contains all the information needed to connect to a page and scan its contents. It is generally passed to loadParsingDict:userInputs:options:parameter:.

The Parsing Dictionary has the following keys/values at the top level:
action: URL to connect to in order to generate the page; the inputs are appended to this if it's a GET. This key is required if the parsing dictionary is going to be recognized by the MultiParser (or, more specifically, by [NSBundle loadAllParsingDictionariesFromFolder:]).
method: POST or GET; how to retrieve the page from HTTP.
input: A dictionary of all the non-variable inputs for the request. The dictionary of variable inputs will be added to these to build up the full page request.
ordering: An array giving the order in which the input parameters should appear. This optional parameter is for Web sites that care what order their parameters appear in on the URL.
requestEncoding: A numeric value of the string encoding that the request is posted in. These numbers come from NSString.h; highlights appear below.
resultEncoding: A numeric value of the string encoding that the result page is expected to be in.
cache: If 1 or YES or TRUE, the page contents are cached.
sequence: An array (see below) indicating all of the patterns to look for, in order. The range of the entire page text is searched.

Returned from the parsing of a page is a dictionary with the given keys/values above, EXCEPT for sequence, along with keys found in the sequence with their accompanying values. (This allows any other keys/values to be stored in this dictionary and fetched from code; e.g. a "base" URL that doesn't belong in the code.)

If there is a key "mutable" for the sequence, then the returned dictionary will be an NSMutableDictionary; otherwise it will be a non-mutable NSDictionary.

Also, the dynamic parameters are added for the key "parameters" to the resulting dictionary. The keys above should be considered "reserved words" to avoid conflict with the keys placed in the resulting sequence dictionary.
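As an illustration of the top-level keys, here is a minimal sketch of a parsing dictionary written in old NeXT-style plist syntax and parsed with -[NSString propertyList]. The URL, the input name "lang", and the title pattern are invented for the example; only the key names come from the lists in this document.

```objc
#import <Foundation/Foundation.h>

// Hypothetical parsing dictionary in old NeXT-style plist syntax.
// The URL, the "lang" input, and the patterns are made up for illustration.
static NSString *ExampleParsingDictText(void)
{
    return @"{"
           @"  action = \"http://www.example.com/search\";"
           @"  method = GET;"
           @"  input = { lang = en; };"
           @"  ordering = ( lang, q );"
           @"  resultEncoding = 5;"
           @"  cache = YES;"
           @"  sequence = ("
           @"    { key = title; start = \"<title>\"; end = \"</title>\"; flatten = YES; trim = YES; }"
           @"  );"
           @"}";
}

static NSDictionary *ExampleParsingDict(void)
{
    // -propertyList accepts both old-style and XML property lists.
    return [ExampleParsingDictText() propertyList];
}
```

Note that only the quote marks needed escaping here; the < and > characters sit inside quoted strings unchanged, which is the advantage of the old-style format mentioned earlier.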

Each dictionary in a sequence contains the following keys/values:

constant: A constant string to put in place for the given key; no text is scanned.
key: Key under which the found text is stored in the result dictionary, assuming that the Parsing Dictionary above this level is a sequence. If this parameter is missing, the text will be parsed but not returned; this can be useful for skipping unneeded text. (If an array or sequence is specified here, as described below, then this will be the key to look up the resulting array or dictionary.)
start: String pattern to look for at the beginning of (or preceding) the desired text. If not specified, the pattern starts at the beginning of the text range.
end: String pattern to look for at the end of (or after) the desired text. If not specified, the text is scanned to the end of the text range.
optionalStart: If YES (or any value), the start doesn't have to be found; the scanner will start at the beginning of the block of text. This is useful if your start delimiter is the same as your end delimiter in a list when there isn't a good start delimiter, and you still need to get the first row.
optionalEnd: If YES (or any value), the end doesn't have to be found; the scanner will scan all the way to the end of the given block of text if the scan wasn't successful using the end parameter. This is useful if your end delimiter is the same as your start delimiter in a list when there isn't a good end delimiter, and you still need to get the last row.
include: If YES (or any value), the start and end strings should be included in the result.
ignoreCase: If YES (or any value), the case in the matching will be ignored.
optional: If YES (or any value), the expected pattern doesn't need to be found. If a non-optional pattern is not found, the missing pattern will be logged. If optional and the text is not found, the scan pointer is not incremented; if non-optional and the text is not found, the scan pointer is incremented. Moral: specify optional whenever something might not be there and you still want to function.
ignore: If YES (or any value), the text will be scanned, but not returned. This can be useful for skipping text.
mutable: If YES (or any value), the array or sequence specified will be mutable.
unique: If YES (or any value), the key will not be replaced in the dictionary. This is useful when you have multiple variations of something to parse, but you don't want to replace something in the dictionary if it was already found.
array: Another Parsing Dictionary indicating the pattern to find multiple times within the matched text. The returned result will be an array. If "mutable" is specified, it will be an NSMutableArray.
sequence: An array for finding a sequence of patterns within the range of the found text of this Parsing Dictionary. Each of those sub-dictionaries should have a "key" specified for storing in the resulting dictionary.
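To make the start, end, and include keys concrete, the following is a rough sketch of how one pattern might be matched within a range. TextBetween is a hypothetical helper written for this example; it is not part of ParseManager and does not reproduce its handling of optional, ignoreCase, or nested dictionaries.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of ParseManager: returns the text between
// inStart and inEnd, searching only within *ioRange, and advances *ioRange
// past the match. Returns nil if either delimiter is not found.
static NSString *TextBetween(NSString *inText, NSString *inStart, NSString *inEnd,
                             BOOL inInclude, NSRange *ioRange)
{
    NSRange startRange = [inText rangeOfString:inStart options:0 range:*ioRange];
    if (startRange.location == NSNotFound) return nil;

    NSUInteger bodyStart = NSMaxRange(startRange);
    NSRange rest = NSMakeRange(bodyStart, NSMaxRange(*ioRange) - bodyStart);
    NSRange endRange = [inText rangeOfString:inEnd options:0 range:rest];
    if (endRange.location == NSNotFound) return nil;

    // "include" keeps the delimiters; otherwise only the text between them.
    NSRange found = inInclude
        ? NSMakeRange(startRange.location, NSMaxRange(endRange) - startRange.location)
        : NSMakeRange(bodyStart, endRange.location - bodyStart);

    // Advance the remaining range so the next pattern continues after this one.
    NSUInteger next = NSMaxRange(endRange);
    *ioRange = NSMakeRange(next, NSMaxRange(*ioRange) - next);
    return [inText substringWithRange:found];
}
```

Calling the helper repeatedly with the same range pointer mimics how the patterns of a sequence are applied in order, each one picking up where the previous one stopped.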

Other keys affect how a piece of text is post-processed once it is found. The order of the list below is the order in which they will be processed, in case there are multiple post-processing directives.

flatten: If YES, the HTML text is "flattened": escaped characters like &lt; are converted to the characters they represent, and HTML tags are stripped away. The text should be real HTML; the result of flattening a fragment like "><font>Hello</fo" is undefined.
condense: The text is "condensed": multiple "white" spaces are coalesced into a single space. Leading/trailing white space is trimmed.
crunch: The text is "crunched": multiple "white" spaces are coalesced into a single one of that character.
trim: Leading/trailing white space is trimmed from the text.
noWhiteSpace: All white space is removed from the text.
url: The text, formatted like a URL, is URL-decoded, changing + into space, %nn into characters, etc. If the value specified for url is a number, the text will be URL-decoded that many times.
prefix: The given string is inserted in front of the text.
suffix: The given string is appended after the text.
substitute: The given string is substituted for the found text.
lowercase: The text is made lowercase.
uppercase: The text is made uppercase.
capitalize: The text is capitalized.
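The post-processing steps can be approximated with plain Foundation calls. The sketch below shows one possible reading of condense, and of url followed by prefix (the order given in the list above). Both function names are hypothetical, and treating newlines and tabs as white space is an assumption, not something this document confirms.

```objc
#import <Foundation/Foundation.h>

// Hypothetical equivalent of "condense": runs of white space become a single
// space, and leading/trailing white space is dropped.
// Assumption: newlines and tabs count as white space.
static NSString *CondensedString(NSString *inText)
{
    NSCharacterSet *white = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSArray *pieces = [inText componentsSeparatedByCharactersInSet:white];
    NSMutableArray *words = [NSMutableArray array];
    NSUInteger i;
    for (i = 0; i < [pieces count]; i++) {
        NSString *piece = [pieces objectAtIndex:i];
        if ([piece length] > 0) [words addObject:piece];   // skip empty runs
    }
    return [words componentsJoinedByString:@" "];
}

// Hypothetical equivalent of "url" followed by "prefix": "+" becomes a space,
// %nn escapes are decoded, then inPrefix is inserted in front.
static NSString *DecodedStringWithPrefix(NSString *inText, NSString *inPrefix)
{
    NSString *spaced = [inText stringByReplacingOccurrencesOfString:@"+" withString:@" "];
    NSString *decoded = [spaced stringByRemovingPercentEncoding];
    if (decoded == nil) decoded = spaced;   // malformed escapes: keep the text as-is
    return [inPrefix stringByAppendingString:decoded];
}
```

A relative link scraped from a page could, for instance, be decoded and given a base URL prefix in one step this way.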

You can also convert a post-processed string to a class like NSNumber or NSCalendarDate.

class: NSNumber or NSCalendarDate.

If NSNumber, format = int | float | double. If none of the above, it's not converted. If NSCalendarDate, format is as in +[NSCalendarDate dateWithString:calendarFormat:], or if unspecified, it must be in format of YYYY-MM-DD HH:MM:SS ±HHMM
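The NSNumber case can be pictured as follows. NumberFromText is a hypothetical function written for this example, not the class's actual conversion code; it only mirrors the rule that the three named formats are converted and anything else is left as a string.

```objc
#import <Foundation/Foundation.h>

// Hypothetical sketch of class = NSNumber: convert according to "format",
// returning the original string when the format is not recognized.
static id NumberFromText(NSString *inText, NSString *inFormat)
{
    if ([inFormat isEqualToString:@"int"])
        return [NSNumber numberWithInt:[inText intValue]];
    if ([inFormat isEqualToString:@"float"])
        return [NSNumber numberWithFloat:[inText floatValue]];
    if ([inFormat isEqualToString:@"double"])
        return [NSNumber numberWithDouble:[inText doubleValue]];
    return inText;   // none of the above: not converted
}
```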

There are some subtleties with all of this. For instance, in scanning through an array, the end delimiter is not skipped past, allowing you to parse text like "<li>one<li>two<li>three". The best way to get familiar with this is to study some existing parse dictionaries, and make heavy use of the Test Tool to see how your scan dictionary is interpreted.
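The "<li>one<li>two<li>three" case can be sketched as a loop in which the delimiter that ends one item is reused as the start of the next. ItemsDelimitedBy is a hypothetical function written for illustration; it shows the idea of not consuming the end delimiter and of letting the last item run to the end of the text, not ParseManager's actual scanner.

```objc
#import <Foundation/Foundation.h>

// Hypothetical sketch, not ParseManager's actual code: collect every item that
// begins with inDelimiter. The delimiter that ends one item is not consumed,
// so it can also start the next; the last item runs to the end of the text
// (the behavior that "optionalEnd" describes).
static NSArray *ItemsDelimitedBy(NSString *inText, NSString *inDelimiter)
{
    NSMutableArray *items = [NSMutableArray array];
    NSUInteger length = [inText length];
    NSRange start = [inText rangeOfString:inDelimiter];

    while (start.location != NSNotFound) {
        NSUInteger bodyStart = NSMaxRange(start);
        NSRange rest = NSMakeRange(bodyStart, length - bodyStart);
        NSRange end = [inText rangeOfString:inDelimiter options:0 range:rest];
        NSUInteger bodyEnd = (end.location == NSNotFound) ? length : end.location;

        [items addObject:[inText substringWithRange:NSMakeRange(bodyStart, bodyEnd - bodyStart)]];
        start = end;   // reuse the end delimiter as the next start
    }
    return items;
}
```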

Request and result encoding constants include the following values. A full list can be found in NSString.h.

1: 7-bit ASCII, 0-127 only. Default value. (NSASCIIStringEncoding)
4: UTF-8. (NSUTF8StringEncoding)
5: ISO Latin 1, AKA 8859-1. (NSISOLatin1StringEncoding)
12: Windows Latin 1. Used by Windows servers. (NSWindowsCP1252StringEncoding)
30: Mac OS Roman. Used by Macintosh servers. (NSMacOSRomanStringEncoding)
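Because the values are the raw NSStringEncoding numbers, a value read from a parsing dictionary can be cast directly. EncodingForKey below is a hypothetical helper, a sketch of how such a lookup might behave with the ASCII default; it is not the body of resultEncodingFromParsingDict:.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: read a numeric encoding from a parsing dictionary,
// falling back to 7-bit ASCII (value 1) when the key is absent.
static NSStringEncoding EncodingForKey(NSDictionary *inParsingDict, NSString *inKey)
{
    id value = [inParsingDict objectForKey:inKey];
    if (value == nil) return NSASCIIStringEncoding;
    // Both NSNumber and NSString respond to -intValue, so values loaded from
    // an old-style plist (which arrive as strings) work too.
    return (NSStringEncoding)[value intValue];
}
```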


Instance Variables

URLManager *mURLManager;
NSMutableDictionary *mParamStorage;

mURLManager: No description.
mParamStorage: No description.


Method Types

Initialization
- initWithDescriptor:consumer:progress:success:cancelled:error:
- initWithDescriptor:consumer:progress:
- initWithDescriptor:consumer:progress:success:cancelled:error:loaders:capacity:
- initWithDescriptor:consumer:progress:loaders:capacity:
Background Loading
- loadURL:parsingDict:userInputs:options:parameter:
- loadParsingDict:userInputs:options:parameter:
Foreground Loading (Deprecated)
- sequenceFromParsingDict:userInputs:progress:
- sequenceFromURL:parsingDict:userInputs:progress:
Status and control
- stopLoading
- stopLoadingParameter:
- loadersRemaining
- status
- isDone
Utility
- urlFromParsingDict:userInputs:options:
- queryStringFromParsingDict:userInputs:options:
- isPost:options:
- isCached:options:
- resultEncodingFromParsingDict:
- requestEncodingFromParsingDict:options:
- allInputsFromParsingDict:userInputs:
Text Scanning
- scanFromData:parsingDictionary:
- scanFromText:parsingDictionary:
- scanFromText:incrementingRange:parsingDictionary:
- postProcessText:parsingDict:

Instance Methods

allInputsFromParsingDict:userInputs:

- (NSDictionary *)allInputsFromParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs

Get the unchanging inputs and the given "userInputs" inputs and merge them together as one dictionary.


initWithDescriptor:consumer:progress:

- (id)initWithDescriptor:(NSString *)inDescriptor consumer:(id)inConsumer progress:(NSProgressIndicator *)inProgress

Initialize the ParseManager with the default methods.


initWithDescriptor:consumer:progress:loaders:capacity:

- (id)initWithDescriptor:(NSString *)inDescriptor consumer:(id)inConsumer progress:(NSProgressIndicator *)inProgress loaders:(int)inConcurrent capacity:(int)inCapacity

Initialize the parse manager to use a URLStackManager; specify inConcurrent to be the maximum number of concurrent loads, and inCapacity to be the maximum capacity of loaded items, after which items will be ignored.


initWithDescriptor:consumer:progress:success:cancelled:error:

- (id)initWithDescriptor:(NSString *)inDescriptor consumer:(id)inConsumer progress:(NSProgressIndicator *)inProgress success:(SEL)inSuccess cancelled:(SEL)inCancelled error:(SEL)inError

Initialize the ParseManager. It initializes as in [LoadBroker initWithDescriptor:consumer:success:cancelled:error:] and then sets up a URLManager to actually do the loading. You can also make use of initWithDescriptor:consumer:progress: to use the default methods.


initWithDescriptor:consumer:progress:success:cancelled:error:loaders:capacity:

- (id)initWithDescriptor:(NSString *)inDescriptor consumer:(id)inConsumer progress:(NSProgressIndicator *)inProgress success:(SEL)inSuccess cancelled:(SEL)inCancelled error:(SEL)inError loaders:(int)inConcurrent capacity:(int)inCapacity

Initialize the parse manager to use a URLStackManager as above, also specifying the selectors to run upon success, error, or cancellation.


isCached:options:

- (BOOL)isCached:(NSDictionary *)inParsingDict options:(NSDictionary *)inOptions

No method description.


isDone

- (BOOL)isDone

Returns true if done loading; that is, no loaders remain.


isPost:options:

- (BOOL)isPost:(NSDictionary *)inParsingDict options:(NSDictionary *)inOptions

No method description.


loadParsingDict:userInputs:options:parameter:

- (void)loadParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs options:(NSDictionary *)inOptions parameter:(id)inParam

Begin loading a page in the background. inParsingDict describes what to load and how to interpret it. inUserInputs is a dictionary of "dynamic" inputs to be added to the request. inOptions are passed to the URLLoader. inParam is any object that is passed back to the consumer when the load finishes so the consumer can recognize what finished loading; it might be a string or a dictionary, for example.


loadURL:parsingDict:userInputs:options:parameter:

- (void)loadURL:(NSString *)inOverrideURL parsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs options:(NSDictionary *)inOptions parameter:(id)inParam

Begins loading a page in the background using the parsing dictionary specified in inParsingDict, but loading the URL specified in inOverrideURL. Other parameters are as specified in loadParsingDict:userInputs:options:parameter:. This method is useful for loading a page when the actual URL is not known until runtime.


loadersRemaining

- (int)loadersRemaining

Return the number of concurrent loads remaining to be completed.


postProcessText:parsingDict:

- (id)postProcessText:(NSString *)inText parsingDict:(NSDictionary *)inParsingDict

Post-process a text chunk based on other indicators, like flatten, condense, url, prefix, suffix, and class.


queryStringFromParsingDict:userInputs:options:

- (NSString *)queryStringFromParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs options:(NSDictionary *)inOptions

No method description.


requestEncodingFromParsingDict:options:

- (NSStringEncoding)requestEncodingFromParsingDict:(NSDictionary *)inParsingDict options:(NSDictionary *)inOptions

Get request encoding, if specified

NOT ACTUALLY USED YET. IT OUGHT TO BE....


resultEncodingFromParsingDict:

- (NSStringEncoding)resultEncodingFromParsingDict:(NSDictionary *)inParsingDict

Get result encoding, if specified


scanFromData:parsingDictionary:

- (id)scanFromData:(NSData *)inData parsingDictionary:(NSDictionary *)inParsingDict

Scan from data representing a parsing dictionary, either text scraping or XML parsing.


scanFromText:incrementingRange:parsingDictionary:

- (id)scanFromText:(NSString *)inText incrementingRange:(NSRange *)ioRange parsingDictionary:(NSDictionary *)inParsingDict

Support function (though still public; perhaps useful) to scan within a range. This is a recursive function.

We can specify a value for "constant", or "start" and "end" strings to search between (or within, if a value for "include" is given). If start is unspecified, then the text from the beginning of the given range is used; if end is unspecified, then we scan to the end of the given range.

If the text is found, the next steps depend on the existence of "array", "sequence", or neither (meaning just a string).


scanFromText:parsingDictionary:

- (id)scanFromText:(NSString *)inText parsingDictionary:(NSDictionary *)inParsingDict

Scan an entire chunk of text. Called by sequenceFromParsingDict (the easiest way) or directly from tools that load the page themselves. (PERHAPS NOT THE IDEAL.) This is not for XML parsing; it is only for page scraping!


sequenceFromParsingDict:userInputs:progress:

- (NSDictionary *)sequenceFromParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs progress:(NSProgressIndicator *)inProgress

High-level, but blocking, function to load from a parsing dictionary then parse a page.


sequenceFromURL:parsingDict:userInputs:progress:

- (NSDictionary *)sequenceFromURL:(NSString *)inURL parsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs progress:(NSProgressIndicator *)inProgress

High-level, but blocking, function to load from a URL and parsing dictionary then parse a page.


status

- (NSString *)status

Returns a status string; useful for debugging or displaying to the user to show how many items remain to be completed.


stopLoading

- (void)stopLoading

Stop loading all pending URLs. This might be done in response to a "Cancel All" button or some such.


stopLoadingParameter:

- (void)stopLoadingParameter:(id)inParam

Stop loading the URL that was started with the given parameter.


urlFromParsingDict:userInputs:options:

- (NSString *)urlFromParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs options:(NSDictionary *)inOptions

Return a full URL for GET or POST encoding by appending the inputs to the URL.
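As a rough picture of what appending the inputs involves, the sketch below merges the fixed "input" dictionary with the user inputs, honors "ordering" when present, and appends the result to "action". GetURLFromDict is a hypothetical function; the precedence of user inputs over fixed inputs and the percent-encoding rules are assumptions made for the example, not documented behavior of this method.

```objc
#import <Foundation/Foundation.h>

// Hypothetical sketch of building a GET URL from a parsing dictionary.
// Assumptions: user inputs override fixed inputs that share a key, values are
// strings, and keys missing from "ordering" are simply left out.
static NSString *GetURLFromDict(NSDictionary *inParsingDict, NSDictionary *inUserInputs)
{
    NSMutableDictionary *inputs = [NSMutableDictionary dictionary];
    NSDictionary *fixed = [inParsingDict objectForKey:@"input"];
    if (fixed != nil) [inputs addEntriesFromDictionary:fixed];
    if (inUserInputs != nil) [inputs addEntriesFromDictionary:inUserInputs];

    // Honor "ordering" when present; otherwise sort keys for a stable result.
    NSArray *keys = [inParsingDict objectForKey:@"ordering"];
    if (keys == nil)
        keys = [[inputs allKeys] sortedArrayUsingSelector:@selector(compare:)];

    // Percent-encode everything except unreserved URL characters.
    NSCharacterSet *safe = [NSCharacterSet characterSetWithCharactersInString:
        @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"];
    NSMutableArray *pairs = [NSMutableArray array];
    NSUInteger i;
    for (i = 0; i < [keys count]; i++) {
        NSString *key = [keys objectAtIndex:i];
        NSString *value = [inputs objectForKey:key];
        if (value == nil) continue;   // listed in "ordering" but not supplied
        [pairs addObject:[NSString stringWithFormat:@"%@=%@", key,
            [value stringByAddingPercentEncodingWithAllowedCharacters:safe]]];
    }
    return [NSString stringWithFormat:@"%@?%@",
        [inParsingDict objectForKey:@"action"], [pairs componentsJoinedByString:@"&"]];
}
```

For a POST, the same joined pairs would presumably travel in the request body rather than on the URL.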


Version 1.1 Copyright ©2003 by Karelia Software, LLC. All Rights Reserved.