From: Kevin Day
Date: Sat, 11 Dec 2021 22:31:08 +0000 (-0600)
Subject: Update: Add documentation.
X-Git-Tag: 0.5.7~46
X-Git-Url: https://git.kevux.org/?a=commitdiff_plain;h=c12ef1dafa5b6aae987597629b5a1b8495c2fb45;p=fll

Update: Add documentation.

This is yet another reminder to me to try to avoid accidental commits.
I should have already had the documentation written up and committed along with the initial commit.
Given that this project was accidentally committed before it was ready, this left the project in a less than ideal state.
As a reminder to myself to help encourage avoiding this mistake, I am constantly adding this oops notice to my commits.

With this documentation written, I once more believe that I have wrapped up everything that I need to consider this ready.
I previously thought this was the case, but as the recent previous commits show, it was not.

Going forward, I plan on investigating writing tests for this project and using this project as an example of writing tests for the entire FLL project.
This will hopefully allow me to find any remaining bugs and make this program production ready.
---

diff --git a/level_3/utf8/documents/about.txt b/level_3/utf8/documents/about.txt
new file mode 100644
index 0000000..aa1f774
--- /dev/null
+++ b/level_3/utf8/documents/about.txt
@@ -0,0 +1,31 @@
+# fss-0002
+
+About utf8 Documentation:
+ The utf8 program is intended to provide simple support for identifying UTF-8 character sequences with regard to their Unicode codepoint designations.
+ Unicode codepoints may also be converted into their binary UTF-8 character sequence equivalents.
+
+ Basic support for identifying and validating is provided.
+
+ This is not intended to be an advanced UTF-8 processing program.
+ While this is an open option for future development, it is beyond the scope, time, and knowledge of the author (Kevin Day).
+
+ There is basic support for identifying character widths and combining characters.
+
+Printing a Complete Binary String:
+ The +q/++quiet must be used to suppress message output (this does not suppress output of processed data).
+
+ The +n/++no_color must be used to suppress printing the color context that is used for invalid sequences.
+
+ The -B/--to_binary must be used to print the binary string.
+
+ The -F/--to_file may optionally be used (and is recommended) to send the output to a file rather than to the screen.
+
+ The -s/--strip_invalid must not be used as the binary string may contain invalid data that must be preserved for proper binary equivalence.
+
+ The -S/--separate must not be used, to ensure the data is not separated by multiple lines.
+
+ The -H/--headers must not be used, to ensure the processing headers are not printed.
+
+ The -v/--verify must not be used as this suppresses data output.
+
+ The -C/--to_codepoint must not be used as this prints Unicode codepoints rather than the binary string.
diff --git a/level_3/utf8/documents/output.txt b/level_3/utf8/documents/output.txt
new file mode 100644
index 0000000..ab55ef2
--- /dev/null
+++ b/level_3/utf8/documents/output.txt
@@ -0,0 +1,20 @@
+# fss-0002
+
+Output Documentation:
+ This program either outputs a binary string representing UTF-8 character sequences or a string representing Unicode codepoints.
+
+ When printing in binary string output, suppressing color, and suppressing any output messages, the output should only be an exact representation of the data.
+ Ideally, what this means is that a binary program, such as /bin/bash, can be used as input and this program's binary string output (with appropriate additional parameters) should be capable of printing output that is identical to the original input binary.
+
+ In the case of Unicode codepoints, each codepoint is represented with the ASCII character 'U' followed by the ASCII character '+' followed by four to six hexadecimal digits.
+ Unicode number equivalents are not supported, nor are they ever intended to be supported, to ensure simplicity in the design.
+ This makes the Unicode sequence output scriptable and usable as a data store.
+
+ Any time any processed data fails to properly represent a valid Unicode sequence, that sequence is printed exactly as is (when printing codepoints) or is printed as-is (when printing binary strings).
+ The invalid data is printed with a context, such as the error color context.
+ The color context may be suppressed by the appropriate parameters.
+ The printing of invalid data may be suppressed by the appropriate parameters.
+ When printing from a binary string to a Unicode codepoint, invalid data is printed as '0x' followed by the hexadecimal digit representation (all lower case).
+
+ When printing combining or width, the private use area is considered unknown but is not considered an error.
+ Anything else that is unknown is considered an error.
diff --git a/level_3/utf8/specifications/output.txt b/level_3/utf8/specifications/output.txt
new file mode 100644
index 0000000..3ed3db2
--- /dev/null
+++ b/level_3/utf8/specifications/output.txt
@@ -0,0 +1,46 @@
+# fss-0002
+
+Output Specification:
+ The following 'Output to' modes are supported\:
+ - binary: Print the binary character sequences.
+ - codepoint: Print the Unicode codepoints.
+ - combining: Print whether or not the character is a combining character.
+ - width: Print the width of the character.
+
+Output to Binary:
+ The output is printed as raw UTF-8 character sequences without any special formatting.
+ May contain error data representing invalid characters (or codepoints).
+ Invalid characters (or codepoints) may be designated as an error using a context, such as a color context of "error".
+ All character codes, valid or not, may be printed as-is.
+ May not be combined with any other 'Output to' modes.
+
+Output to Codepoint:
+ Only upper case ASCII characters 'U', '+', '0' through '9', and 'A' through 'F' are allowed.
+ Always begins with 'U' and then '+'.
+ Is always separated by a single space (ASCII character 0x20) (optionally except for the final codepoint in a set which may be terminated by a new line or the end of a file).
+ May not contain non-ASCII values for designating characters (or codepoints) (which includes not allowing non-ASCII digits).
+ May not be combined with any other 'Output to' modes.
+
+Output to Combining:
+ A single character is used to represent combining, non-combining, or unknown.
+ Is always separated by a single space (ASCII character 0x20) (optionally except for the final character in a set which may be terminated by a new line or the end of a file).
+ The upper case ASCII character 'C' is used to represent a combining character.
+ The upper case ASCII character 'N' is used to represent a non-combining character.
+ The ASCII '?' is used to represent either an invalid or an unknown character.
+ Invalid characters (or codepoints) may be designated as an error using a context, such as a color context of "error".
+ May not contain non-ASCII values for designating characters (or codepoints) (which includes not allowing non-ASCII digits).
+ May only be combined with 'Output to Width' mode.
+
+Output to Width:
+ A single character is used to represent width or unknown.
+ Is always separated by a single space (ASCII character 0x20) (optionally except for the final character in a set which may be terminated by a new line or the end of a file).
+ The ASCII character '0' is used to represent a 0-width character (generally this means a non-graph character).
+ The ASCII character '1' is used to represent a 1-width character.
+ The ASCII character '2' is used to represent a 2-width character.
+ The ASCII '?' is used to represent either an invalid or an unknown character.
+ Invalid characters (or codepoints) may be designated as an error using a context, such as a color context of "error".
+ May not contain non-ASCII values for designating characters (or codepoints) (which includes not allowing non-ASCII digits).
+ May only be combined with 'Output to Combining' mode.
+
+Output to Combining and Width:
+ When 'Output to Combining' is used with 'Output to Width', this operates exactly as 'Output to Width', except that when a valid combining character is detected, the ASCII 'C' character is used instead of the width.
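
The codepoint output rules and the binary round-trip goal described in the documents above can be illustrated with a small sketch. This is not the utf8 program's code (that program is part of FLL and written in C); it is a Python approximation under stated assumptions. The names `to_codepoint_output` and `to_binary` are hypothetical, and the choice of one two-digit '0x' token per invalid byte is an assumption, since the documents only say that invalid data is printed as '0x' followed by lower case hexadecimal digits.

```python
import re

# Token shapes taken from the documentation in this commit:
#   valid codepoint -> 'U+' followed by four to six upper case hexadecimal digits
#   invalid data    -> '0x' followed by lower case hexadecimal digits
# ASSUMPTION: one '0x' token per invalid byte (two digits); the documents do not say.
CODEPOINT = re.compile(r"U\+[0-9A-F]{4,6}")
INVALID = re.compile(r"0x[0-9a-f]{2}")


def to_codepoint_output(data: bytes) -> str:
    """Hypothetical helper: render bytes as space-separated codepoint tokens."""
    tokens = []
    # surrogateescape maps each undecodable byte b to U+DC00 + b (U+DC80..U+DCFF).
    for ch in data.decode("utf-8", errors="surrogateescape"):
        value = ord(ch)
        if 0xDC80 <= value <= 0xDCFF:
            # An escaped invalid byte: print it as a lower case '0x' token.
            tokens.append(f"0x{value - 0xDC00:02x}")
        else:
            # A valid character: print 'U+' with at least four hex digits.
            tokens.append(f"U+{value:04X}")
    return " ".join(tokens)


def to_binary(text: str) -> bytes:
    """Hypothetical helper: turn codepoint tokens back into raw bytes."""
    out = bytearray()
    for token in text.split():
        if CODEPOINT.fullmatch(token):
            out += chr(int(token[2:], 16)).encode("utf-8")
        elif INVALID.fullmatch(token):
            out.append(int(token[2:], 16))
        else:
            raise ValueError(f"unrecognized token: {token!r}")
    return bytes(out)


if __name__ == "__main__":
    # Two valid characters (U+00E9, U+20AC) followed by one invalid byte.
    sample = "\u00e9\u20ac".encode("utf-8") + b"\xff"
    printed = to_codepoint_output(sample)
    print(printed)
    # The round trip should reproduce the input bytes exactly.
    print(to_binary(printed) == sample)
```

Keeping the invalid bytes as tokens, rather than dropping them, is what lets the round trip reproduce the input byte for byte, which mirrors the reason the documents give for not using -s/--strip_invalid when printing a complete binary string. The sketch does not handle surrogate or out-of-range codepoint tokens; `to_binary` raises an exception for those.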