- Inherits From: LoadBroker
- Declared In: ParseManager.h
BE SURE to invoke detachConsumer on this object before deallocating it in your class's release method.
A parsing dictionary is normally loaded from a ".plist" file, which can be XML-based, but it may be easier to use the old NeXT-style plists because you only have to "escape" the quote marks, not the ubiquitous < and > characters found in HTML pages. You can load a parsing dictionary like this:
NSString *path = [[NSBundle bundleForClass:[self class]]
    pathForResource:@"name" ofType:@"plist"];
NSDictionary *parsingDict = [NSDictionary dictionaryWithContentsOfFile:path];
By using a URLManager as its "supplier", the ParseManager manages an unlimited number of concurrent downloads. When a load finishes, it messages the consumer with the appropriate completion method, passing the dictionary that results from parsing the page (or any error message) together with the "parameter" that was passed to the loading method. The parameter lets the consumer recognize the data that it has received, or recover something passed with the original query, for instance to display what was searched for along with what was found.
The Parsing Dictionary
The Parsing Dictionary contains all the information needed to connect to a page and scan its contents. It is generally accessed from loadParsingDict:userInputs:options:parameter:.
The Parsing Dictionary has the following keys/values at the top level:
| action | URL to connect to in order to generate the page; the inputs are appended to it if the method is GET. This key is required if the parsing dictionary is going to be recognized by the MultiParser (or, more specifically, by [NSBundle loadAllParsingDictionariesFromFolder:]). |
| method | POST or GET; how to retrieve the page from HTTP. |
| input | A dictionary of all the non-variable inputs for the request. The dictionary of variable inputs will be added to these to build up the full page request. |
| ordering | An array giving the order in which the input parameters should appear. This optional key is for Web sites that care about the order in which their parameters appear on the URL. |
| requestEncoding | A numeric value of the string encoding that the request is posted in. These numbers come from NSString.h; highlights appear below. |
| resultEncoding | A numeric value of the string encoding that the result page is expected to be in. |
| cache | If 1 or YES or TRUE, the page contents are cached. |
| sequence | An array (see below) indicating all of the patterns to look for, in order. The range of the entire page text is searched. |
Parsing a page returns a dictionary containing the keys/values above, EXCEPT for sequence, along with the keys found in the sequence and their accompanying values. (This allows any other keys/values to be stored in this dictionary and fetched from code; e.g. a "base" URL that doesn't belong in the code.)
If there is a key "mutable" for the sequence, then the returned dictionary will be an NSMutableDictionary; otherwise it will be a non-mutable NSDictionary.
Also, the dynamic parameters are added for the key "parameters" to the resulting dictionary. The keys above should be considered "reserved words" to avoid conflict with the keys placed in the resulting sequence dictionary.
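As an illustration of the top-level keys, here is a minimal sketch of a parsing dictionary built in code rather than loaded from a ".plist" file. The URL, the input names, and the HTML delimiters are all made up for this example; only the key names come from the table above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical parsing dictionary for an imaginary search page. The URL,
// input names, and delimiters are invented for illustration only.
static NSDictionary *ExampleParsingDict(void)
{
    return @{
        @"action":         @"http://www.example.com/search",
        @"method":         @"GET",
        @"input":          @{ @"lang": @"en" },   // non-variable inputs
        @"ordering":       @[ @"lang", @"q" ],    // parameter order on the URL
        @"resultEncoding": @4,                    // NSUTF8StringEncoding
        @"cache":          @"YES",
        @"sequence": @[
            // The scanned text is stored under "title" in the result.
            @{ @"key": @"title", @"start": @"<title>", @"end": @"</title>",
               @"condense": @"YES" },
            // No "key": the text is scanned past but not returned.
            @{ @"start": @"<div id=\"nav\">", @"end": @"</div>" },
        ],
    };
}
```

The variable input (here a hypothetical "q") would be supplied separately as the userInputs dictionary when the load is started.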
Each dictionary in a sequence contains the following keys/values:
| constant | A constant string to put in place for the given key; no text is scanned. |
| key | Key for the found text to be associated with in the result dictionary, assuming that the Parsing Dictionary above this level is a sequence. If this parameter is missing, the text will be parsed but not returned; this can be useful for skipping unneeded text. (If an array or sequence is specified here -- see below -- then this will be the key to look up the resulting array or dictionary.) |
| start | String pattern to look for at the beginning of (or preceding) the desired text. If not specified, the pattern starts at the beginning of the text range. |
| end | String pattern to look for at the end of (or after) the desired text. If not specified, the pattern ends at the end of the text range. |
| optionalStart | If YES (or any value), the start doesn't have to be found -- the scanner will start at the beginning of the block of text. This is useful if your start delimiter is the same as your end delimiter in a list when there isn't a good start delimiter, and you still need to get the first row. |
| optionalEnd | If YES (or any value), the end doesn't have to be found -- the scanner will scan all the way to the end of the given block of text if the scan wasn't successful using the end parameter. This is useful if your end delimiter is the same as your start delimiter in a list when there isn't a good end delimiter, and you still need to get the last row. |
| include | If YES (or any value), the start and end strings should be included in the result. |
| ignoreCase | If YES (or any other value), the case in the matching will be ignored. |
| optional | If YES (or any value), the expected pattern doesn't need to be found. If a pattern is not marked optional and is not found, the missing pattern will be logged. If optional and the text is not found, the scan pointer is not incremented; if non-optional and the text is not found, the scan pointer is incremented. Moral: specify optional whenever anything might not be there and you still want to function. |
| ignore | If YES (or any value), the text will be scanned, but not returned. This can be useful for skipping text. |
| mutable | If YES (or any value), the array or sequence specified will be mutable. |
| unique | If YES (or any value), the key will not be replaced in the dictionary. This is useful when you have multiple variations of something to parse, but you don't want to replace something in the dictionary if it was already found. |
| array | Another Parsing Dictionary indicating the pattern to find multiple times within the matched text. The returned result will be an array. If "mutable" is specified, it will be an NSMutableArray. |
| sequence | An array for finding a sequence of patterns within the range of the found text of this Parsing Dictionary. Each of those sub-dictionaries should have a "key" specified for storing in the resulting dictionary. |
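To make the start, end, include, and ignoreCase keys concrete, the following is a minimal sketch of a delimiter scan written with plain Foundation calls. TextBetween is a hypothetical helper and is not ParseManager's actual implementation; it assumes both delimiters are non-empty and does not model optional, optionalStart, or optionalEnd.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, NOT ParseManager's code: returns the text between
// `start` and `end`, or nil if either delimiter is missing. With `include`
// the delimiters themselves are kept in the result.
static NSString *TextBetween(NSString *text, NSString *start, NSString *end,
                             BOOL include, BOOL ignoreCase)
{
    NSStringCompareOptions opts = ignoreCase ? NSCaseInsensitiveSearch : 0;
    NSUInteger length = [text length];

    NSRange s = [text rangeOfString:start options:opts
                              range:NSMakeRange(0, length)];
    if (s.location == NSNotFound) return nil;

    // Search for the end delimiter only after the start delimiter.
    NSUInteger from = NSMaxRange(s);
    NSRange e = [text rangeOfString:end options:opts
                              range:NSMakeRange(from, length - from)];
    if (e.location == NSNotFound) return nil;

    NSRange found = include
        ? NSMakeRange(s.location, NSMaxRange(e) - s.location)
        : NSMakeRange(from, e.location - from);
    return [text substringWithRange:found];
}
```

For example, scanning "<b>Hello</b>" with start "<b>" and end "</b>" yields "Hello", or the whole string when include is set.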
Other keys affect how a piece of text is post-processed once it is found. The order of the list below is the order in which they will be processed, in case there are multiple post-processing directives.
| flatten | If YES, the HTML text is "flattened" -- escaped characters like &lt; are converted to the characters they represent, and HTML tags are stripped away. The text should be real HTML; it's undefined what happens to flatten a string like "><font>Hello</fo". |
| condense | The text is "condensed" -- multiple "white" spaces are coalesced into a single space. Leading/trailing white space is trimmed. |
| crunch | The text is "crunched" -- multiple "white" spaces are coalesced into a single one of that character. |
| trim | Leading/trailing white space is trimmed from the text. |
| noWhiteSpace | All white space is removed from the text. |
| url | The text, formatted like a URL, is URL-decoded, changing + into space, %nn into characters, etc. If the value specified for url is a number, the text will be URL-decoded that many times. |
| prefix | The given string is inserted in front of the text. |
| suffix | The given string is appended after the text. |
| substitute | The given string is substituted for the found text. |
| lowercase | The text is made lowercase. |
| uppercase | The text is made uppercase. |
| capitalize | The text is capitalized. |
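Two of the directives above, condense and url, can be sketched with Foundation alone. Condense and URLDecode are hypothetical helper names, and this is only an approximation of the behaviour the table describes, not the code ParseManager runs.

```objc
#import <Foundation/Foundation.h>

// "condense": coalesce runs of white space into a single space and trim
// leading/trailing white space.
static NSString *Condense(NSString *text)
{
    NSCharacterSet *ws = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray *words = [NSMutableArray array];
    for (NSString *piece in [text componentsSeparatedByCharactersInSet:ws]) {
        if ([piece length] > 0) [words addObject:piece];
    }
    return [words componentsJoinedByString:@" "];
}

// "url": change + into space and %nn into characters, `times` times over.
// Returns nil if the text contains an invalid percent sequence.
static NSString *URLDecode(NSString *text, int times)
{
    NSString *result = text;
    for (int i = 0; i < times && result != nil; i++) {
        result = [[result stringByReplacingOccurrencesOfString:@"+"
                                                    withString:@" "]
                  stringByRemovingPercentEncoding];
    }
    return result;
}
```

Decoding more than once is useful when a page embeds an already-encoded URL inside another encoded URL; "a%2520b" decoded twice becomes "a b".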
You can also convert a post-processed string to a class like NSNumber or NSCalendarDate.
| class | NSNumber or NSCalendarDate. |
If class is NSNumber, format is one of int, float, or double; with any other value, the text is not converted. If class is NSCalendarDate, format is as in +[NSCalendarDate dateWithString:calendarFormat:], or if unspecified, the text must be in the format YYYY-MM-DD HH:MM:SS ±HHMM.
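The NSNumber case can be sketched as follows. NumberFromText is a hypothetical helper that mirrors the rule stated above (an unrecognized format leaves the text unconverted); it is not taken from the ParseManager source, and the NSCalendarDate case is left out.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: convert post-processed text to an NSNumber according
// to `format` (int, float, or double). Any other format returns the text
// unchanged, matching "it's not converted".
static id NumberFromText(NSString *text, NSString *format)
{
    if ([format isEqualToString:@"int"]) {
        return [NSNumber numberWithInt:[text intValue]];
    }
    if ([format isEqualToString:@"float"]) {
        return [NSNumber numberWithFloat:[text floatValue]];
    }
    if ([format isEqualToString:@"double"]) {
        return [NSNumber numberWithDouble:[text doubleValue]];
    }
    return text;
}
```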
There are some subtleties with all of this; for instance, in scanning through an array, the end delimiter is not skipped past, allowing you to parse text like "<li>one<li>two<li>three". The best way to get familiar with this is to study some existing parse dictionaries, and make heavy use of the Test Tool to see how your scan dictionary is interpreted.
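The array subtlety mentioned above can be sketched like this. RepeatedItems is a hypothetical helper, not ParseManager's code: it leaves the scan position on the end delimiter so the same string can act as the next start delimiter, and it takes the remainder of the text for the last row, as optionalEnd would.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collect every piece of text found between `start`
// and `end`, without skipping past the end delimiter.
static NSArray *RepeatedItems(NSString *text, NSString *start, NSString *end)
{
    NSMutableArray *items = [NSMutableArray array];
    NSUInteger cursor = 0, length = [text length];
    while (cursor < length) {
        NSRange s = [text rangeOfString:start options:0
                                  range:NSMakeRange(cursor, length - cursor)];
        if (s.location == NSNotFound) break;

        NSUInteger from = NSMaxRange(s);
        NSRange e = [text rangeOfString:end options:0
                                  range:NSMakeRange(from, length - from)];
        // optionalEnd-style behaviour: no end delimiter left, take the rest.
        NSUInteger to = (e.location == NSNotFound) ? length : e.location;
        [items addObject:[text substringWithRange:NSMakeRange(from, to - from)]];

        // Stay ON the end delimiter so it can serve as the next start.
        cursor = to;
    }
    return items;
}
```

Calling it on "<li>one<li>two<li>three" with "<li>" as both delimiters returns the three rows "one", "two", and "three".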
Request and result encoding constants include the following values. A full list can be found in NSString.h.
| 1 | 7-bit ASCII, 0-127 only. Default value. NSASCIIStringEncoding |
| 4 | UTF-8. NSUTF8StringEncoding |
| 5 | ISO Latin 1, AKA 8859-1. NSISOLatin1StringEncoding |
| 12 | Windows Latin 1. Used by Windows servers. NSWindowsCP1252StringEncoding |
| 30 | Mac OS Roman. Used by Macintosh servers. NSMacOSRomanStringEncoding |
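The numeric values in this table are the raw NSStringEncoding constants, so a value read from the parsing dictionary can be used directly. The sketch below shows one way page bytes might be decoded from a resultEncoding value; DecodePage is a hypothetical helper, and falling back to ASCII follows the "Default value" note above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: decode page bytes using the numeric "resultEncoding"
// value from a parsing dictionary, defaulting to 7-bit ASCII (1).
// Assumes ARC; under manual reference counting, autorelease the result.
static NSString *DecodePage(NSData *data, NSDictionary *parsingDict)
{
    // The value may be an NSNumber or, from an old-style plist, an NSString;
    // both respond to -integerValue.
    id value = [parsingDict objectForKey:@"resultEncoding"];
    NSStringEncoding encoding = NSASCIIStringEncoding;
    if (value != nil) {
        encoding = (NSStringEncoding)[value integerValue];
    }
    return [[NSString alloc] initWithData:data encoding:encoding];
}
```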
URLManager *mURLManager;
NSMutableDictionary *mParamStorage;
mURLManager: No description.
mParamStorage: No description.
Initialization
- - initWithDescriptor:consumer:progress:success:cancelled:error:
- - initWithDescriptor:consumer:progress:
- - initWithDescriptor:consumer:progress:success:cancelled:error:loaders:capacity:
- - initWithDescriptor:consumer:progress:loaders:capacity:
Background Loading
- - loadURL:parsingDict:userInputs:options:parameter:
- - loadParsingDict:userInputs:options:parameter:
Foreground Loading (Deprecated)
- - sequenceFromParsingDict:userInputs:progress:
- - sequenceFromURL:parsingDict:userInputs:progress:
Status and control
- - stopLoading
- - stopLoadingParameter:
- - loadersRemaining
- - status
- - isDone
Utility
- - urlFromParsingDict:userInputs:options:
- - queryStringFromParsingDict:userInputs:options:
- - isPost:options:
- - isCached:options:
- - resultEncodingFromParsingDict:
- - requestEncodingFromParsingDict:options:
- - allInputsFromParsingDict:userInputs:
Text Scanning
- - scanFromData:parsingDictionary:
- - scanFromText:parsingDictionary:
- - scanFromText:incrementingRange:parsingDictionary:
- - postProcessText:parsingDict:
- (NSDictionary *)allInputsFromParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs
Get the unchanging inputs and the given "userInputs" inputs and merge them together as one dictionary.
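A minimal sketch of such a merge, using only Foundation, might look like the following. MergedInputs is a hypothetical helper, and letting a user input replace a fixed input with the same name is an assumption; the documentation does not say which one wins.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not ParseManager's code. Assumption: on a key
// collision, the user input replaces the fixed input.
static NSDictionary *MergedInputs(NSDictionary *parsingDict,
                                  NSDictionary *userInputs)
{
    NSMutableDictionary *all = [NSMutableDictionary dictionary];
    NSDictionary *fixed = [parsingDict objectForKey:@"input"];
    if (fixed != nil) [all addEntriesFromDictionary:fixed];
    if (userInputs != nil) [all addEntriesFromDictionary:userInputs];
    return all;
}
```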
- (id)initWithDescriptor:(NSString *)inDescriptor consumer:(id)inConsumer progress:(NSProgressIndicator *)inProgress
Initialize the ParseManager with the default methods.
- (id)initWithDescriptor:(NSString *)inDescriptor consumer:(id)inConsumer progress:(NSProgressIndicator *)inProgress loaders:(int)inConcurrent capacity:(int)inCapacity
Initialize the parse manager to use a URLStackManager; specify inConcurrent to be the maximum number of concurrent loads, and inCapacity to be the maximum capacity of loaded items, after which items will be ignored.
- (id)initWithDescriptor:(NSString *)inDescriptor consumer:(id)inConsumer progress:(NSProgressIndicator *)inProgress success:(SEL)inSuccess cancelled:(SEL)inCancelled error:(SEL)inError
Initialize the ParseManager. It initializes as in [LoadBroker initWithDescriptor:consumer:success:cancelled:error:] and then sets up a URLManager to actually do the loading. You can also make use of initWithDescriptor:consumer:progress: to use the default methods.
- (id)initWithDescriptor:(NSString *)inDescriptor consumer:(id)inConsumer progress:(NSProgressIndicator *)inProgress success:(SEL)inSuccess cancelled:(SEL)inCancelled error:(SEL)inError loaders:(int)inConcurrent capacity:(int)inCapacity
Initialize the parse manager to use a URLStackManager as above, also specifying the selectors to run upon success, error, or cancellation.
- (BOOL)isCached:(NSDictionary *)inParsingDict options:(NSDictionary *)inOptions
No method description.
- (BOOL)isDone
Returns true if done loading; that is, no loaders remain.
- (BOOL)isPost:(NSDictionary *)inParsingDict options:(NSDictionary *)inOptions
No method description.
- (void)loadParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs options:(NSDictionary *)inOptions parameter:(id)inParam
Begin loading a page in the background. inParsingDict describes what to load and how to interpret it. inUserInputs is a dictionary of "dynamic" inputs to be added to the request. inOptions are passed to the URLLoader. inParam is any object that is passed back to the consumer when the load finishes so the consumer can recognize what finished loading; it might be a string or a dictionary, for example.
- (void)loadURL:(NSString *)inOverrideURL parsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs options:(NSDictionary *)inOptions parameter:(id)inParam
Begins loading a page in the background using the parsing dictionary specified in inParsingDict, but loading the URL specified in inOverrideURL. Other parameters are as specified in loadParsingDict:userInputs:options:parameter:. This method is useful for loading a page when the actual URL is not known until runtime.
- (int)loadersRemaining
Return the number of concurrent loads remaining to be completed.
- (id)postProcessText:(NSString *)inText parsingDict:(NSDictionary *)inParsingDict
Post-process a text chunk based on other indicators, such as flatten, condense, url, prefix, suffix, and class.
- (NSString *)queryStringFromParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs options:(NSDictionary *)inOptions
No method description.
- (NSStringEncoding)requestEncodingFromParsingDict:(NSDictionary *)inParsingDict options:(NSDictionary *)inOptions
Get the request encoding, if specified.
Not actually used yet, though it ought to be.
- (NSStringEncoding)resultEncodingFromParsingDict:(NSDictionary *)inParsingDict
Get the result encoding, if specified.
- (id)scanFromData:(NSData *)inData parsingDictionary:(NSDictionary *)inParsingDict
Scan the given data according to a parsing dictionary, using either text scraping or XML parsing.
- (id)scanFromText:(NSString *)inText incrementingRange:(NSRange *)ioRange parsingDictionary:(NSDictionary *)inParsingDict
Support function (though still public; perhaps useful) to scan within a range. This is a recursive function.
We can specify a value for "constant", or "start" and "end" strings to search between (or within, if a value for "include" is given). If start is unspecified, then the text from the beginning of the given range is used; if end is unspecified, then we scan to the end of the given range.
If the text is found, the next steps depend on the existence of "array", "sequence", or neither (meaning just a string).
- (id)scanFromText:(NSString *)inText parsingDictionary:(NSDictionary *)inParsingDict
Scan an entire chunk of text. Called by sequenceFromParsingDict (the easiest way) or directly from tools that load the page themselves (perhaps not ideal). This is only for page scraping, not for XML parsing!
- (NSDictionary *)sequenceFromParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs progress:(NSProgressIndicator *)inProgress
High-level, but blocking, function to load from a parsing dictionary then parse a page.
- (NSDictionary *)sequenceFromURL:(NSString *)inURL parsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs progress:(NSProgressIndicator *)inProgress
High-level, but blocking, function to load from a URL and parsing dictionary then parse a page.
- (NSString *)status
Returns a status string; useful for debugging or displaying to the user to show how many items remain to be completed.
- (void)stopLoading
Stop loading all pending URLs. This might be done in response to a "Cancel All" button or some such.
- (void)stopLoadingParameter:(id)inParam
Stop loading the URL with the given parameter.
- (NSString *)urlFromParsingDict:(NSDictionary *)inParsingDict userInputs:(NSDictionary *)inUserInputs options:(NSDictionary *)inOptions
Return a full URL for GET or POST encoding by appending the inputs to the URL.
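As a rough illustration of appending inputs to a URL while honoring the "ordering" key, consider the sketch below. Escaped and GetURL are hypothetical helpers, not the method documented here; input values are assumed to be strings, and placing keys that are absent from "ordering" afterwards in alphabetical order is an assumption.

```objc
#import <Foundation/Foundation.h>

// Percent-encode everything except the RFC 3986 "unreserved" characters.
static NSString *Escaped(NSString *value)
{
    NSCharacterSet *unreserved = [NSCharacterSet characterSetWithCharactersInString:
        @"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"];
    return [value stringByAddingPercentEncodingWithAllowedCharacters:unreserved];
}

// Hypothetical helper: keys listed in `ordering` come first, in that order;
// any remaining keys follow alphabetically (an assumption).
static NSString *GetURL(NSString *action, NSDictionary *inputs, NSArray *ordering)
{
    NSMutableArray *keys = [NSMutableArray array];
    for (NSString *key in ordering) {
        if ([inputs objectForKey:key] != nil) [keys addObject:key];
    }
    NSArray *sorted = [[inputs allKeys] sortedArrayUsingSelector:@selector(compare:)];
    for (NSString *key in sorted) {
        if (![keys containsObject:key]) [keys addObject:key];
    }

    NSMutableArray *pairs = [NSMutableArray array];
    for (NSString *key in keys) {
        [pairs addObject:[NSString stringWithFormat:@"%@=%@",
                          Escaped(key), Escaped([inputs objectForKey:key])]];
    }
    if ([pairs count] == 0) return action;
    return [NSString stringWithFormat:@"%@?%@", action,
            [pairs componentsJoinedByString:@"&"]];
}
```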