RiTa
Name RiMarkov
Description Performs analysis and text generation via Markov chains (aka N-Grams) with options to process single characters, words, sentences, and arbitrary regular expressions. Provides a variety of methods specifically designed for text-generation. Example usage:
   RiMarkov rm = new RiMarkov(this, 3);
   rm.loadFile("war_peace.txt"); // in data dir.
   String[] sents = rm.generateSentences(10);
   for (int i = 0; i < sents.length; i++) {
     System.out.println(sents[i]);
   }
For large models, it is recommended to use this object in server-mode. See RiTaServer (http://www.rednoise.org/rita/documentation/ritaserver_class_ritaserver.htm).
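To make the idea of an n-gram Markov chain concrete, here is a minimal sketch in plain Java. It is not RiTa's implementation: TinyMarkovSketch and its methods are hypothetical names, the model is stored as a simple map rather than a tree, and sentence handling is left out entirely. It only shows how each run of n-1 tokens is mapped to the tokens observed after it, and how generation walks that map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical, simplified n-gram model (not the RiMarkov class).
class TinyMarkovSketch {
    private final int n;
    private final Random random;
    // Maps each context of n-1 tokens to every token seen directly after it.
    private final Map<String, List<String>> successors = new HashMap<>();

    TinyMarkovSketch(int n, long seed) {
        if (n < 2) throw new IllegalArgumentException("n must be at least 2");
        this.n = n;
        this.random = new Random(seed);
    }

    // Records, for every run of n-1 tokens, the token that follows it.
    void loadTokens(String[] tokens) {
        for (int i = 0; i + n - 1 < tokens.length; i++) {
            String key = String.join(" ", Arrays.copyOfRange(tokens, i, i + n - 1));
            successors.computeIfAbsent(key, k -> new ArrayList<>()).add(tokens[i + n - 1]);
        }
    }

    // Extends the starting context one token at a time, stopping when the
    // requested length is reached or the current context has no successor.
    List<String> generateTokens(String[] start, int length) {
        if (start.length < n - 1) throw new IllegalArgumentException("start needs n-1 tokens");
        List<String> out = new ArrayList<>(Arrays.asList(start));
        while (out.size() < length) {
            String key = String.join(" ", out.subList(out.size() - (n - 1), out.size()));
            List<String> options = successors.get(key);
            if (options == null) break;
            // Repeated successors appear more than once, so frequent ones are picked more often.
            out.add(options.get(random.nextInt(options.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        TinyMarkovSketch model = new TinyMarkovSketch(3, 42L);
        model.loadTokens("the cat sat on the mat and the cat ran".split(" "));
        System.out.println(model.generateTokens(new String[] {"the", "cat"}, 8));
    }
}
```

Storing repeated successors in a list keeps the sketch short; a production model would more likely store counts per node, which is closer to the probability tree that printTree() refers to.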

Note: use RiMarkov.setTokenizerRegex() to control how inputs are tokenized (or split-up). The default is to use the Penn word-tokenizing conventions (without splitting contractions). You may wish to simply use whitespace (or some other regular expression), which can be accomplished as follows:

   RiMarkov rm = new RiMarkov(this, 3);
   rm.setTokenizerRegex("\\s");
This creates a new model, with n=3, that tokenizes its input on the whitespace characters: [ \t\n\x0B\f\r].
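The effect of a whitespace pattern can be checked outside RiTa with the standard java.util.regex classes. The sketch below does not call RiMarkov or RegexTokenizer; whether RiTa's tokenizer treats empty tokens the same way as Pattern.split() is an assumption, and WhitespaceSplitSketch is a hypothetical name used only for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Plain-Java illustration of splitting text on a whitespace regex.
class WhitespaceSplitSketch {

    // Splits the text at every match of the given regular expression.
    static String[] tokenize(String text, String regex) {
        return Pattern.compile(regex).split(text);
    }

    public static void main(String[] args) {
        String text = "The red\tball  bounced.";
        // "\\s" matches a single whitespace character, so the double space
        // produces an empty token between "ball" and "bounced.".
        System.out.println(Arrays.toString(tokenize(text, "\\s")));
        // "\\s+" treats a run of whitespace as one separator.
        System.out.println(Arrays.toString(tokenize(text, "\\s+")));
    }
}
```

If the input may contain consecutive spaces or mixed tabs and newlines, a pattern such as "\\s+" is usually the safer choice.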


Note: use RiMarkov.setAllowDuplicates(false) to ensure that sentences that exist in the input text are not output by generateSentence() or generateSentences(). This method should be used with care, as certain sets of input texts (with allowDuplicates=false) may result in decreased performance and/or excessive memory use.

Constructors
RiMarkov(parent, nFactor);
RiMarkov(nFactor, ignoreCase);
RiMarkov(pApplet, nFactor, ignoreCase);
Methods
disableSentenceProcessing()   Tells the model to ignore (English-like) sentences in its input and treat all text tokens the same.

generateSentence()   Generates a sentence from the model.

Note: multiple sentences generated by this method WILL NOT follow the model across sentence boundaries; thus the following two calls are NOT equivalent:

     String[] results = markov.generateSentences(10);
               and
     for (int i = 0; i < 10; i++) {
       results[i] = markov.generateSentence();
     }
The latter will create 10 sentences with no explicit relationship between one and the next; while the former will follow probabilities from one sentence (across a boundary) to the next.

generateSentences()   Generates the specified number (one or more) of sentences from the model.

Note: multiple sentences generated by this method WILL follow the model across sentence boundaries; thus the following two calls are NOT equivalent:

     String[] results = markov.generateSentences(10);
               and
     for (int i = 0; i < 10; i++)  {
       results[i] = markov.generateSentence();
     }
The latter will create 10 sentences with no explicit relationship between one and the next; while the former will follow probabilities from one sentence (across a boundary) to the next.

generateTokens()   Generates a string of tokens from the model, where the number of tokens is given by the length argument.

getCompletions()   Returns an unordered list of possible words w that complete an n-gram consisting of: pre[0]...pre[k], w, post[k+1]...post[n]. As an example, the following call:
 getCompletions(new String[]{ "the" }, new String[]{ "ball" })
will return all the single words that occur between 'the' and 'ball' in the current model (assuming n > 2), e.g., ['red', 'big', 'bouncy'].

Note: For this operation to be valid, (pre.length + post.length) must be strictly less than the model's nFactor, otherwise an exception will be thrown.
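The lookup described for getCompletions() can be sketched in plain Java by scanning a token array for the pattern pre, w, post. This is a hypothetical stand-in, not RiMarkov's tree-based implementation: CompletionSketch and completions() are illustrative names, and the choice of IllegalArgumentException is an assumption, since the text above only says that an exception is thrown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the getCompletions() lookup; scans tokens directly.
class CompletionSketch {

    // Returns each distinct word w such that pre, w, post occur consecutively in tokens.
    static List<String> completions(String[] tokens, String[] pre, String[] post, int nFactor) {
        // Mirrors the documented constraint: pre.length + post.length < nFactor.
        if (pre.length + post.length >= nFactor) {
            throw new IllegalArgumentException("pre.length + post.length must be < nFactor");
        }
        Set<String> found = new LinkedHashSet<>();
        int window = pre.length + 1 + post.length;
        for (int start = 0; start + window <= tokens.length; start++) {
            int wordIndex = start + pre.length;
            boolean preMatches = Arrays.equals(
                    Arrays.copyOfRange(tokens, start, wordIndex), pre);
            boolean postMatches = Arrays.equals(
                    Arrays.copyOfRange(tokens, wordIndex + 1, wordIndex + 1 + post.length), post);
            if (preMatches && postMatches) {
                found.add(tokens[wordIndex]);
            }
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        String[] tokens = "the red ball hit the big ball near the bouncy ball".split(" ");
        System.out.println(completions(tokens, new String[] {"the"}, new String[] {"ball"}, 3));
    }
}
```

The sketch returns words in order of first appearance only because it uses a LinkedHashSet; the documented method makes no ordering guarantee.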

getMaxSentenceLength()   Returns the maximum # of words allowed in a generated sentence

getMinSentenceLength()   Returns the minimum # of words allowed in a generated sentence

getNFactor()   Returns the current n-value for the model

getProbabilities()   Returns the full set of possible next tokens (as a HashMap: String -> Float (probability)) given an array of tokens representing the path down the tree (with length less than n). If the input array length is not less than n, or the path cannot be found, or the end node has no children, null is returned.

Note: As the returned Map represents the full set of possible next tokens, the sum of its probabilities will always equal 1.
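The following plain-Java sketch illustrates why the probabilities sum to 1: each next token's count is divided by the total number of times the path was followed by any token. It is an approximation built on a flat token array, not RiMarkov's tree, and it ignores smoothing; NextTokenSketch and probabilities() are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of next-token probabilities for a given path.
class NextTokenSketch {

    // Returns next-token probabilities after path, or null when the path is
    // too long for the model (length >= nFactor) or is never followed by a token.
    static Map<String, Float> probabilities(String[] tokens, String[] path, int nFactor) {
        if (path.length >= nFactor) return null;
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (int i = 0; i + path.length < tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < path.length; j++) {
                if (!tokens[i + j].equals(path[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                counts.merge(tokens[i + path.length], 1, Integer::sum);
                total++;
            }
        }
        if (total == 0) return null;
        // Dividing every count by the same total makes the values sum to 1.
        Map<String, Float> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / (float) total);
        }
        return probs;
    }

    public static void main(String[] args) {
        String[] tokens = "the cat sat the cat ran the dog sat".split(" ");
        System.out.println(probabilities(tokens, new String[] {"the"}, 3));
    }
}
```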

getProbability()   Returns the probability of obtaining a sequence of k character tokens where k <= nFactor, e.g., if nFactor = 3, then valid lengths for the String tokens are 1, 2 & 3.

getRoot()  

getWordCount()   Returns the # of words loaded into the model

isPrintingIgnoredText()  

isRemovingQuotations()   Returns whether the model is ignoring quotations found in the input

isSmoothing()   Returns whether (add-1) smoothing is enabled for the model

loadFile()   Loads a text file into the model; if using Processing, the file should be in the sketch's data folder.

loadSentences()   Loads an array of sentences into the model; each element in the array must be a single sentence for proper parsing.

loadText()   Loads a String into the model, splitting the text first into sentences, then into words, according to the current regular expression.

loadTokens()   Loads an array of tokens (or words) into the model; each element in the array must be a single token for proper construction of the model.

printTree()   Outputs a String representing the model's probability tree using the supplied print stream (or System.out).

NOTE: this method will block for potentially long periods of time on large models.

setAllowDuplicates()   Determines whether calls to generateSentence(s) will return sentences that exist (character-for-character) in the input text(s).

Note: The trade-off here is between ensuring novel outputs and a potential slowdown due to rejected outputs (because they exist in the input text). Use with care, as setting this to false for large models may result in excessive memory use.
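The cost of rejecting duplicates can be seen in a small plain-Java sketch: the input sentences have to be remembered (memory), and each generated candidate that matches one of them is discarded and regenerated (time). This is an assumed mechanism for illustration only, not RiMarkov's actual code; NoDuplicatesSketch, generateNovel(), and the maxTries cap are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical illustration of rejecting generated sentences that exist in the input.
class NoDuplicatesSketch {

    // Draws candidates until one is absent from inputSentences.
    // Returns null if maxTries candidates were all rejected.
    static String generateNovel(Supplier<String> generator, Set<String> inputSentences, int maxTries) {
        for (int i = 0; i < maxTries; i++) {
            String candidate = generator.get();
            // Character-for-character comparison against the stored input.
            if (!inputSentences.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> input = new HashSet<>(Arrays.asList("A cat sat.", "A dog ran."));
        // Stand-in generator that replays a fixed list of candidates.
        Iterator<String> candidates =
                Arrays.asList("A cat sat.", "A dog ran.", "A cat ran.").iterator();
        System.out.println(generateNovel(candidates::next, input, 10));
    }
}
```

With a small input or a high n, most candidates may reproduce input sentences, so a cap on retries (or a lower n) keeps generation from stalling.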

setMaxSentenceLength()   Sets the maximum # of words allowed in a generated sentence (default=35)

setMinSentenceLength()   Sets the minimum # of words allowed in a generated sentence (default=6)

setPrintIgnoredText()  

setRecognizeSentences()   Sets whether the model will try to recognize (English-like) sentences in its input (default=true).

setRemoveQuotations()   Tells the model whether to ignore various quotation types in the input (default=true)

setTokenizerRegex()   Creates a new RegexTokenizer from the supplied regular expression and uses it when adding subsequent data to the model.

setUseSmoothing()   Toggles whether (add-1) smoothing is enabled for the model. Should be called before any data loading is done.
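For reference, add-1 (Laplace) smoothing adds one to every token count before normalizing, so tokens never seen after a given context still receive a small non-zero probability. The sketch below shows the standard formula in plain Java; how RiMarkov applies it internally is not described here, and AddOneSmoothingSketch is a hypothetical name.

```java
// Illustration of the standard add-1 (Laplace) smoothing formula.
class AddOneSmoothingSketch {

    // Unsmoothed estimate: count / total (0 when the context was never seen).
    static double unsmoothed(int count, int total) {
        return total == 0 ? 0.0 : (double) count / total;
    }

    // Add-1 estimate: (count + 1) / (total + vocabularySize).
    static double smoothed(int count, int total, int vocabularySize) {
        return (count + 1.0) / (total + vocabularySize);
    }

    public static void main(String[] args) {
        // A context seen 3 times, in a model with a 4-word vocabulary.
        int total = 3;
        int vocabularySize = 4;
        System.out.println(unsmoothed(0, total));               // unseen token, no smoothing
        System.out.println(smoothed(0, total, vocabularySize)); // unseen token, add-1
        System.out.println(smoothed(2, total, vocabularySize)); // token seen twice, add-1
    }
}
```

Because the denominator grows by the vocabulary size, the smoothed values over the whole vocabulary still sum to 1.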

Usage Web & Application