--- /dev/null
+# fss-0002
+
+About utf8 Documentation:
+ The utf8 program is intended to provide simple support for identifying UTF-8 character sequences with regard to their Unicode codepoint designations.
+ Unicode codepoints may also be converted into their binary UTF-8 character sequence equivalents.
+
+ Basic support for identifying and validating is provided.
+
+ This is not intended to be an advanced UTF-8 processing program.
+ While this is an open option for future development, it is beyond the scope, time, and knowledge of the author (Kevin Day).
+
+ There is basic support for identifying character widths and combining characters.
+
+Printing a Complete Binary String:
+ The +q/++quiet must be used to suppress message output (this does not suppress output of processed data).
+
+ The +n/++no_color must be used to suppress printing the color context that is used for invalid sequences.
+
+ The -B/--to_binary must be used to print the binary string.
+
+ The -F/--to_file may optionally be used (and is recommended) to send the output to a file rather than to the screen.
+
+ The -s/--strip_invalid must not be used as the binary string may contain invalid data that must be preserved for proper binary equivalence.
+
+ The -S/--separate must not be used, so that the data is not separated across multiple lines.
+
+ The -H/--headers must not be used, so that the processing headers are not printed.
+
+ The -v/--verify must not be used as this suppresses data output.
+
+ The -C/--to_codepoint must not be used as this prints Unicode codepoints rather than the binary string.
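
The parameter rules above can be restated as a small check. This is only a sketch in Python and is not part of the utf8 program: the function name `check_binary_string_parameters` and the two tables are hypothetical, and only the parameter names themselves are taken from the list above.

```python
# Hypothetical checker for the rules listed above; not the utf8 program's own parser.
# Short and long forms are copied from the documentation.
REQUIRED = {"+q": "++quiet", "+n": "++no_color", "-B": "--to_binary"}
FORBIDDEN = {
    "-s": "--strip_invalid",
    "-S": "--separate",
    "-H": "--headers",
    "-v": "--verify",
    "-C": "--to_codepoint",
}


def check_binary_string_parameters(parameters):
    """Return a list of problems; an empty list means the rules are satisfied."""
    given = set(parameters)
    problems = []

    # Every required parameter must appear in its short or its long form.
    for short, long in REQUIRED.items():
        if short not in given and long not in given:
            problems.append(f"missing {short}/{long}")

    # None of the forbidden parameters may appear in either form.
    for short, long in FORBIDDEN.items():
        if short in given or long in given:
            problems.append(f"must not use {short}/{long}")

    return problems


print(check_binary_string_parameters(["+q", "+n", "-B", "-F", "out.bin"]))
```

The optional -F/--to_file is deliberately left out of both tables, since the text allows it either way. The file name `out.bin` is a placeholder.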
--- /dev/null
+# fss-0002
+
+Output Documentation:
+ This program either outputs a binary string representing UTF-8 character sequences or a string representing Unicode codepoints.
+
+ When printing binary string output with color and output messages suppressed, the output should be only an exact representation of the data.
+ Ideally, this means that a binary program, such as /bin/bash, can be used as input and this program's binary string output (with the appropriate additional parameters) is identical to the original input binary.
+
+ In the case of Unicode codepoints, each codepoint is represented by the ASCII character 'U', followed by the ASCII character '+', followed by four to six hexadecimal digits.
+ Unicode number equivalents are not supported, and are never intended to be supported, to ensure simplicity in the design.
+ This makes the Unicode sequence output scriptable and usable as a data store.
+
+ Whenever processed data fails to represent a valid Unicode sequence, that sequence is printed as-is (whether printing codepoints or binary strings).
+ The invalid data is printed with a context, such as the error color context.
+ The color context may be suppressed by appropriate parameters.
+ The printing of invalid data may be suppressed by the appropriate parameters.
+ When printing from a binary string to a Unicode codepoint, invalid data is printed as '0x' followed by the hexadecimal digit representation (all lower case).
+
+ When printing combining or width, the private use area is considered unknown but is not considered an error.
+ Anything else that is unknown is considered an error.
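
The codepoint representation described above can be illustrated with a short Python sketch. It is not the utf8 program's implementation: the function name `to_codepoint_tokens` is hypothetical, validity is decided by Python's strict UTF-8 decoder, and printing each invalid byte as its own '0x' token is an assumption, since the text does not say how invalid bytes are grouped.

```python
def to_codepoint_tokens(data):
    """Convert raw bytes to 'U+XXXX' tokens, with invalid bytes shown as lower case '0x..'."""
    tokens = []
    i = 0

    while i < len(data):
        # A UTF-8 sequence is 1 to 4 bytes long; try the shortest length first.
        for length in (1, 2, 3, 4):
            chunk = data[i:i + length]
            try:
                character = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue

            # Upper case hexadecimal, at least four digits (at most six for U+10FFFF).
            tokens.append(f"U+{ord(character):04X}")
            i += len(chunk)
            break
        else:
            # No length produced a valid sequence: report this single byte as invalid.
            # Assumption: one token per invalid byte.
            tokens.append(f"0x{data[i]:02x}")
            i += 1

    return tokens


sample = "A\u20ac".encode("utf-8") + b"\xff" + "\U0001F600".encode("utf-8")
print(" ".join(to_codepoint_tokens(sample)))
```

Joining the tokens with a single space keeps the result easy to split again in a script, which is the kind of scriptable use the text describes.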
--- /dev/null
+# fss-0002
+
+Output Specification:
+ The following 'Output to' modes are supported\:
+ - binary: Print the binary character sequences.
+ - codepoint: Print the Unicode codepoints.
+ - combining: Print whether or not the character is a combining character.
+ - width: Print the width of the character.
+
+Output to Binary:
+ The output is printed as raw UTF-8 character sequences without any special formatting.
+ May contain error data representing invalid characters (or codepoints).
+ Invalid characters (or codepoints) may be designated as an error using a context, such as a color context of "error".
+ All character codes, valid or not, may be printed as-is.
+ May not be combined with any other 'Output to' modes.
+
+Output to Codepoint:
+ Only upper case ASCII characters 'U', '+', '0' through '9', and 'A' through 'F' are allowed.
+ Always begins with 'U' and then '+'.
+ Is always separated by a single space (ASCII character 0x20) (optionally except for the final codepoint in a set which may be terminated by a new line or the end of a file).
+ May not contain non-ASCII values for designating characters (or codepoints) (which includes not allowing non-ASCII digits).
+ May not be combined with any other 'Output to' modes.
+
+Output to Combining:
+ A single character is used to represent combining, non-combining, or unknown.
+ Is always separated by a single space (ASCII character 0x20) (optionally except for the final character in a set which may be terminated by a new line or the end of a file).
+ The upper case ASCII character 'C' is used to represent a combining character.
+ The upper case ASCII character 'N' is used to represent a non-combining character.
+ The ASCII '?' is used to represent either an invalid or an unknown character.
+ Invalid characters (or codepoints) may be designated as an error using a context, such as a color context of "error".
+ May not contain non-ASCII values for designating characters (or codepoints) (which includes not allowing non-ASCII digits).
+ May only be combined with 'Output to Width' mode.
+
+Output to Width:
+ A single character is used to represent width or unknown.
+ Is always separated by a single space (ASCII character 0x20) (optionally except for the final character in a set which may be terminated by a new line or the end of a file).
+ The ASCII character '0' is used to represent a 0-width character (generally this means a non-graph character).
+ The ASCII character '1' is used to represent a 1-width character.
+ The ASCII character '2' is used to represent a 2-width character.
+ The ASCII '?' is used to represent either an invalid or an unknown character.
+ Invalid characters (or codepoints) may be designated as an error using a context, such as a color context of "error".
+ May not contain non-ASCII values for designating characters (or codepoints) (which includes not allowing non-ASCII digits).
+ May only be combined with 'Output to Combining' mode.
+
+Output to Combining and Width:
+ When 'Output to Combining' is used with 'Output to Width', this operates exactly as 'Output to Width', except that when a valid combining character is detected, the ASCII 'C' character is used instead of the width.
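
The combining, width, and combined outputs above can be sketched in Python using the standard `unicodedata` module. This is an approximation under stated assumptions, not the utf8 program's own tables: the function names are hypothetical, private use and unassigned codepoints are both mapped to '?', control, format, and non-spacing or enclosing mark characters are treated as 0-width, and East Asian wide or fullwidth characters are treated as 2-width.

```python
import unicodedata

# Assumption: these general categories stand in for "unknown" and "0-width".
UNKNOWN_CATEGORIES = ("Co", "Cn")          # private use, unassigned
ZERO_WIDTH_CATEGORIES = ("Cc", "Cf", "Mn", "Me")


def combining_token(character):
    """Return 'C', 'N', or '?' for a single character."""
    if unicodedata.category(character) in UNKNOWN_CATEGORIES:
        return "?"
    # A non-zero canonical combining class marks a combining character.
    return "C" if unicodedata.combining(character) else "N"


def width_token(character):
    """Return '0', '1', '2', or '?' for a single character."""
    category = unicodedata.category(character)
    if category in UNKNOWN_CATEGORIES:
        return "?"
    if category in ZERO_WIDTH_CATEGORIES:
        return "0"
    if unicodedata.east_asian_width(character) in ("W", "F"):
        return "2"
    return "1"


def combining_and_width_token(character):
    """Behave like width_token(), but print 'C' for a valid combining character."""
    if combining_token(character) == "C":
        return "C"
    return width_token(character)


text = "A\u0301\u6f22\ue000"
print(" ".join(combining_and_width_token(c) for c in text))
```

The combined function first asks whether the character is combining and only then falls back to the width, which mirrors the rule that 'C' replaces the width for valid combining characters.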