Update: Add documentation.

author Kevin Day <thekevinday@gmail.com>

Sat, 11 Dec 2021 22:31:08 +0000 (16:31 -0600)

committer Kevin Day <thekevinday@gmail.com>

Sat, 11 Dec 2021 22:31:08 +0000 (16:31 -0600)
diff --git a/level_3/utf8/documents/about.txt b/level_3/utf8/documents/about.txt

new file mode 100644 (file)

index 0000000..aa1f774
--- /dev/null
+++ b/level_3/utf8/documents/about.txt
@@ -0,0 +1,31 @@
+# fss-0002
+
+About utf8 Documentation:
+  The utf8 program is intended to provide simple support for identifying UTF-8 character sequences with regard to their Unicode codepoint designations.
+  Unicode codepoints may also be converted into their binary UTF-8 character sequence equivalents.
+
+  Basic support for identifying and validating is provided.
+
+  This is not intended to be an advanced UTF-8 processing program.
+  While this is an open option for future development, it is beyond the scope, time, and knowledge of the author (Kevin Day).
+
+  There is basic support for identifying character widths and combining characters.
+
+Printing a Complete Binary String:
+  The +q/++quiet must be used to suppress message output (this does not suppress output of processed data).
+
+  The +n/++no_color must be used to suppress printing the color context that is used for invalid sequences.
+
+  The -B/--to_binary must be used to print the binary string.
+
+  The -F/--to_file may optionally be used (and is recommended to be used) to send the output to a file rather than to the screen.
+
+  The -s/--strip_invalid must not be used as the binary string may contain invalid data that must be preserved for proper binary equivalence.
+
+  The -S/--separate must not be used, so that the data is not separated across multiple lines.
+
+  The -H/--headers must not be used, so that the processing headers are not printed.
+
+  The -v/--verify must not be used as this suppresses data output.
+
+  The -C/--to_codepoint must not be used as this prints Unicode codepoints rather than the binary string.
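
The parameter rules in "Printing a Complete Binary String" can be checked mechanically before the program is invoked. The following Python sketch is illustrative only: the function name `check_binary_string_args` and the idea of passing the parameters as a list are assumptions, and only the short and long parameter names come from the text above. The optional -F/--to_file parameter is neither required nor rejected.

```python
# Parameters the text above says must be present (short form, long form).
REQUIRED = [("+q", "++quiet"), ("+n", "++no_color"), ("-B", "--to_binary")]

# Parameters the text above says must not be present.
FORBIDDEN = [
    ("-s", "--strip_invalid"),
    ("-S", "--separate"),
    ("-H", "--headers"),
    ("-v", "--verify"),
    ("-C", "--to_codepoint"),
]


def check_binary_string_args(args):
    """Return a list of problems found in a hypothetical utf8 argument list."""
    present = set(args)
    problems = []
    for short, long_name in REQUIRED:
        if short not in present and long_name not in present:
            problems.append(f"missing {short}/{long_name}")
    for short, long_name in FORBIDDEN:
        if short in present or long_name in present:
            problems.append(f"must not use {short}/{long_name}")
    return problems
```

An empty list means the argument list satisfies every rule stated above; it says nothing about how the program itself parses its parameters.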
diff --git a/level_3/utf8/documents/output.txt b/level_3/utf8/documents/output.txt

new file mode 100644 (file)

index 0000000..ab55ef2
--- /dev/null
+++ b/level_3/utf8/documents/output.txt
@@ -0,0 +1,20 @@
+# fss-0002
+
+Output Documentation:
+  This program either outputs a binary string representing UTF-8 character sequences or a string representing Unicode codepoints.
+
+  When printing in binary string output, suppressing color, and suppressing any output messages, the output should only be an exact representation of the data.
+  Ideally, this means that a binary program, such as /bin/bash, can be used as input and this program's binary string output (with appropriate additional parameters) should be identical to the original input binary.
+
+  In the case of Unicode codepoints, each codepoint is represented with the ASCII character 'U' followed by the ASCII character '+' followed by four to six hexadecimal digits.
+  To keep the design simple, Unicode number equivalents are not supported and are not intended to ever be supported.
+  This makes the Unicode sequence output scriptable and usable as a data store.
+
+  Any time processed data fails to properly represent a valid Unicode sequence, that sequence is printed as-is, whether printing codepoints or binary strings.
+  The invalid data is printed with a context, such as the error color context.
+  The color context may be suppressed by appropriate parameters.
+  The printing of invalid data may be suppressed by the appropriate parameters.
+  When printing from a binary string to a Unicode codepoint, invalid data is printed as '0x' followed by the hexadecimal digit representation (all lower case).
+
+  When printing combining or width, the private use area is considered unknown but is not considered an error.
+  Anything else that is unknown is considered an error.
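
The two textual representations described in output.txt can be sketched in a few lines of Python. This is a minimal sketch, not the program's implementation: the function names are hypothetical, and printing a multi-byte invalid sequence as a single '0x' token is an assumption, since the text above only states the prefix and the lower case digits.

```python
def format_codepoint(value):
    """Format an integer codepoint as 'U+' plus four to six upper case hex digits."""
    if not 0 <= value <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    # Pad to at least four digits; larger values naturally use five or six.
    return f"U+{value:04X}"


def format_invalid(data):
    """Format invalid bytes as '0x' plus lower case hex digits (assumed: one token)."""
    return "0x" + data.hex()
```

For example, `format_codepoint(0x41)` gives `U+0041` and `format_invalid(b"\xff")` gives `0xff`.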
diff --git a/level_3/utf8/specifications/output.txt b/level_3/utf8/specifications/output.txt

new file mode 100644 (file)

index 0000000..3ed3db2
--- /dev/null
+++ b/level_3/utf8/specifications/output.txt
@@ -0,0 +1,46 @@
+# fss-0002
+
+Output Specification:
+  The following 'Output to' modes are supported\:
+  - binary:    Print the binary character sequences.
+  - codepoint: Print the Unicode codepoints.
+  - combining: Print whether or not the character is a combining character.
+  - width:     Print the width of the character.
+
+Output to Binary:
+  The output is printed as raw UTF-8 character sequences without any special formatting.
+  May contain error data representing invalid characters (or codepoints).
+  Invalid characters (or codepoints) may be designated as an error using a context, such as a color context of "error".
+  All character codes, valid or not, may be printed as-is.
+  May not be combined with any other 'Output to' modes.
+
+Output to Codepoint:
+  Only upper case ASCII characters 'U', '+', '0' through '9', and 'A' through 'F' are allowed.
+  Always begins with 'U' and then '+'.
+  Is always separated by a single space (ASCII character 0x20) (optionally except for the final codepoint in a set which may be terminated by a new line or the end of a file).
+  May not contain non-ASCII values for designating characters (or codepoints) (which includes not allowing non-ASCII digits).
+  May not be combined with any other 'Output to' modes.
+
+Output to Combining:
+  A single character is used to represent combining, non-combining, or unknown.
+  Is always separated by a single space (ASCII character 0x20) (optionally except for the final character in a set which may be terminated by a new line or the end of a file).
+  The upper case ASCII character 'C' is used to represent a combining character.
+  The upper case ASCII character 'N' is used to represent a non-combining character.
+  The ASCII '?' is used to represent either an invalid or an unknown character.
+  Invalid characters (or codepoints) may be designated as an error using a context, such as a color context of "error".
+  May not contain non-ASCII values for designating characters (or codepoints) (which includes not allowing non-ASCII digits).
+  May only be combined with 'Output to Width' mode.
+
+Output to Width:
+  A single character is used to represent width or unknown.
+  Is always separated by a single space (ASCII character 0x20) (optionally except for the final character in a set which may be terminated by a new line or the end of a file).
+  The ASCII character '0' is used to represent a 0-width character (generally this means a non-graph character).
+  The ASCII character '1' is used to represent a 1-width character.
+  The ASCII character '2' is used to represent a 2-width character.
+  The ASCII '?' is used to represent either an invalid or an unknown character.
+  Invalid characters (or codepoints) may be designated as an error using a context, such as a color context of "error".
+  May not contain non-ASCII values for designating characters (or codepoints) (which includes not allowing non-ASCII digits).
+  May only be combined with 'Output to Combining' mode.
+
+Output to Combining and Width:
+  When 'Output to Combining' is used with 'Output to Width', this operates exactly as 'Output to Width', except that when a valid combining character is detected, the ASCII 'C' character is used instead of the width.
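
Because the specification restricts each text mode to a small ASCII alphabet with single-space separators, the output is straightforward to validate in a script. The Python sketch below is one possible reading of the rules above, under stated assumptions: the names `TOKEN_PATTERNS` and `is_valid_output` are hypothetical, the four to six digit limit is taken from documents/output.txt rather than from the specification, and the color context is assumed to be suppressed so that no escape sequences appear in the text.

```python
import re

# Allowed token shapes per mode, as read from the specification text above.
TOKEN_PATTERNS = {
    "codepoint": re.compile(r"U\+[0-9A-F]{4,6}"),
    "combining": re.compile(r"[CN?]"),
    "width": re.compile(r"[012?]"),
    # Combined mode behaves like width, with 'C' replacing the width
    # of a valid combining character.
    "combining_width": re.compile(r"[C012?]"),
}


def is_valid_output(text, mode):
    """Check that text is single-space separated tokens valid for the given mode."""
    pattern = TOKEN_PATTERNS[mode]
    # The final token in a set may be terminated by a new line.
    if text.endswith("\n"):
        text = text[:-1]
    # Splitting on one space turns doubled spaces into empty tokens, which fail.
    return all(pattern.fullmatch(token) for token in text.split(" "))
```

With this reading, `is_valid_output("U+0041 U+1F600\n", "codepoint")` is accepted, while a lower case `u+0041` or a doubled space is rejected.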
level_3/utf8/documents/about.txt	[new file with mode: 0644]
level_3/utf8/documents/output.txt	[new file with mode: 0644]
level_3/utf8/specifications/output.txt	[new file with mode: 0644]