Kevux Git Server - fll/commit
Update: Follow up previous Unicode changes.
author Kevin Day <thekevinday@gmail.com>
Sat, 18 Jun 2022 22:28:32 +0000 (17:28 -0500)
committer Kevin Day <thekevinday@gmail.com>
Sat, 18 Jun 2022 22:41:04 +0000 (17:41 -0500)
commit d2a3ee563f6d1678886bce8f12056589ee427198
tree 91ed486baa2858a68ce9d8ee81141de782320194
parent d00a6090c1ab35ba44fc53d12abc45e97da31d93

The previous commit changed a significant amount of behavior.
That commit noted that follow up changes would be necessary.

First things first.
I noticed that when I simplified the is valid checks I oversimplified them.
There are several byte sequences that are not valid UTF-8 sequences and the checks must reject them.

I previously added surrogate support, but it turns out that UTF-8 specifically does not support Unicode surrogates.
Remove all surrogate related code.
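
As a rough illustration of the kinds of byte sequences a stricter is valid check has to reject, here is a minimal sketch in C. The function name example_utf8_is_valid is hypothetical and is not part of FLL; the sketch only assumes the standard UTF-8 rules (no stray continuation bytes, no overlong encodings, no surrogates, nothing above U+10FFFF) and does not reflect how the f_utf code is actually written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not an FLL function): returns true when the `length`
// bytes at `sequence` form exactly one valid UTF-8 byte sequence.
bool example_utf8_is_valid(const uint8_t *sequence, const size_t length) {
  assert(sequence != NULL);

  if (length < 1 || length > 4) return false;

  const uint8_t first = sequence[0];
  size_t width = 0;
  uint32_t code_point = 0;

  // Determine the expected width from the leading byte.
  if (first < 0x80) { width = 1; code_point = first; }
  else if ((first & 0xe0) == 0xc0) { width = 2; code_point = first & 0x1f; }
  else if ((first & 0xf0) == 0xe0) { width = 3; code_point = first & 0x0f; }
  else if ((first & 0xf8) == 0xf0) { width = 4; code_point = first & 0x07; }
  else return false; // Stray continuation byte, or 0xf8 through 0xff.

  if (width != length) return false;

  // Every following byte must be a continuation byte (10xxxxxx).
  for (size_t i = 1; i < width; ++i) {
    if ((sequence[i] & 0xc0) != 0x80) return false;

    code_point = (code_point << 6) | (uint32_t) (sequence[i] & 0x3f);
  }

  // Reject overlong encodings (a value stored in more bytes than needed).
  if (width == 2 && code_point < 0x80) return false;
  if (width == 3 && code_point < 0x800) return false;
  if (width == 4 && code_point < 0x10000) return false;

  // Reject surrogates, which UTF-8 does not permit.
  if (code_point >= 0xd800 && code_point <= 0xdfff) return false;

  // Reject values beyond the last Unicode code point.
  if (code_point > 0x10ffff) return false;

  return true;
}
```

With this sketch, the sequence 0xed 0xa0 0x80 (which would decode to the surrogate U+D800) and the overlong sequence 0xc0 0x80 are both rejected, while 0xc2 0x80 is accepted.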

The f_utf_char_t is supposed to be in big-endian format.
The macros are fixed to properly handle this.
This fix exposed problems in the conversion functions.
The conversion functions lack proper big-endian and little-endian support.
Introduce a new structure and parameters to support designating big-endian or little-endian byte order.
Support defaulting to the host byte order.
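
The idea of designating a byte order with a host default might look roughly like the sketch below. Every name here (example_endian_t, example_conversion_data_t, example_store_u32) is hypothetical and chosen only for illustration; the actual structure and parameters added to the conversion functions may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical byte order designation; "host" means use whatever the machine uses.
typedef enum {
  example_endian_host = 0,
  example_endian_big,
  example_endian_little
} example_endian_t;

// Hypothetical structure carrying conversion settings.
typedef struct {
  example_endian_t order;
} example_conversion_data_t;

// Resolve "host" into big or little by inspecting how a known value is stored.
example_endian_t example_endian_resolve(const example_endian_t order) {
  if (order != example_endian_host) return order;

  const uint16_t probe = 0x0102;

  return (*(const uint8_t *) &probe == 0x01) ? example_endian_big : example_endian_little;
}

// Store a 32-bit value into four bytes using the designated byte order.
void example_store_u32(const example_conversion_data_t data, const uint32_t value, uint8_t destination[4]) {
  assert(destination != NULL);

  const example_endian_t order = example_endian_resolve(data.order);

  for (size_t i = 0; i < 4; ++i) {
    // Big-endian places the most significant byte first, little-endian places it last.
    const unsigned shift = (unsigned) ((order == example_endian_big) ? 8 * (3 - i) : 8 * i);

    destination[i] = (uint8_t) (value >> shift);
  }
}
```

Passing the order explicitly keeps the caller in control, while leaving it at example_endian_host gives the default behavior without the caller needing to know the machine's byte order.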

The utf8 program needs to handle the endianness in a different way.
The bytes are read in left-to-right order and after conversion they are still in left-to-right order, but shifted to the right.
Swapping from little-endian to big-endian would be incorrect because the byte order is already correct.
The byte position is what is incorrect.
That is, 0x0000c280 should be shifted to 0xc2800000.
Swapping the endianness would instead yield 0x80c20000 (which is incorrect).
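
The difference between shifting and swapping can be checked with a small sketch in C. The function names example_align_left and example_swap_bytes are hypothetical and exist only to reproduce the numbers given above; they are not the functions used by the utf8 program.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper: move a right-aligned sequence of `width` bytes to the
// left side of a 32-bit value, keeping the bytes in the same order.
uint32_t example_align_left(const uint32_t value, const unsigned width) {
  assert(width >= 1 && width <= 4);

  return value << (8 * (4 - width));
}

// Hypothetical helper: reverse the order of all four bytes (an endianness swap).
uint32_t example_swap_bytes(const uint32_t value) {
  return (value >> 24)
    | ((value >> 8) & 0x0000ff00u)
    | ((value << 8) & 0x00ff0000u)
    | (value << 24);
}
```

For the two byte sequence 0xc2 0x80 stored as 0x0000c280, example_align_left(0x0000c280, 2) gives 0xc2800000, while example_swap_bytes(0x0000c280) gives 0x80c20000, which reverses the two bytes and is therefore not the same sequence.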

The use of the word "character" as a variable name and in documentation can be confusing.
I have recently defined a "byte sequence", a "code point", and a "unicode" as specific types.
Change the word "character" to the appropriate name to make the code less confusing and more specific.
There are also other words used in place of "character" that might not be the ones listed above.

Some of the tests, particularly the emoji tests, have incorrect data.
I discovered that many sources on the internet violate the standard and call code points emoji that are not officially recognized as emoji by the standard.
I'm going with Wikipedia for the new and updated emoji list.

The f_char_t is available, so update old code that still uses uint8_t to instead use f_char_t for character related data.

Changes to the is valid code resulted in identifying invalid byte sequences that were previously considered valid.
152 files changed:
build/level_0/settings
build/level_1/settings
build/monolithic/settings
level_0/f_conversion/c/conversion/common.c
level_0/f_conversion/c/conversion/common.h
level_0/f_conversion/c/private-conversion.c
level_0/f_status/c/status.h
level_0/f_status_string/c/status_string.c
level_0/f_status_string/c/status_string.h
level_0/f_utf/c/private-utf.c
level_0/f_utf/c/private-utf.h
level_0/f_utf/c/private-utf_alphabetic.c
level_0/f_utf/c/private-utf_alphabetic.h
level_0/f_utf/c/private-utf_combining.c
level_0/f_utf/c/private-utf_combining.h
level_0/f_utf/c/private-utf_control.c
level_0/f_utf/c/private-utf_control.h
level_0/f_utf/c/private-utf_digit.c
level_0/f_utf/c/private-utf_digit.h
level_0/f_utf/c/private-utf_emoji.c
level_0/f_utf/c/private-utf_emoji.h
level_0/f_utf/c/private-utf_numeric.c
level_0/f_utf/c/private-utf_numeric.h
level_0/f_utf/c/private-utf_phonetic.c
level_0/f_utf/c/private-utf_phonetic.h
level_0/f_utf/c/private-utf_private.c
level_0/f_utf/c/private-utf_private.h
level_0/f_utf/c/private-utf_punctuation.c
level_0/f_utf/c/private-utf_punctuation.h
level_0/f_utf/c/private-utf_subscript.c
level_0/f_utf/c/private-utf_subscript.h
level_0/f_utf/c/private-utf_superscript.c
level_0/f_utf/c/private-utf_superscript.h
level_0/f_utf/c/private-utf_surrogate.c [deleted file]
level_0/f_utf/c/private-utf_surrogate.h [deleted file]
level_0/f_utf/c/private-utf_symbol.c
level_0/f_utf/c/private-utf_symbol.h
level_0/f_utf/c/private-utf_valid.c
level_0/f_utf/c/private-utf_valid.h
level_0/f_utf/c/private-utf_whitespace.c
level_0/f_utf/c/private-utf_whitespace.h
level_0/f_utf/c/private-utf_wide.c
level_0/f_utf/c/private-utf_wide.h
level_0/f_utf/c/private-utf_word.c
level_0/f_utf/c/private-utf_word.h
level_0/f_utf/c/private-utf_zero_width.c
level_0/f_utf/c/private-utf_zero_width.h
level_0/f_utf/c/utf/common.c
level_0/f_utf/c/utf/common.h
level_0/f_utf/c/utf/convert.c
level_0/f_utf/c/utf/convert.h
level_0/f_utf/c/utf/is.c
level_0/f_utf/c/utf/is.h
level_0/f_utf/c/utf/is_character.c
level_0/f_utf/c/utf/is_character.h
level_0/f_utf/data/build/settings
level_0/f_utf/data/build/settings-tests
level_0/f_utf/tests/unit/c/data-utf.c
level_0/f_utf/tests/unit/c/data-utf.h
level_0/f_utf/tests/unit/c/test-utf-character_is_alphabetic.c
level_0/f_utf/tests/unit/c/test-utf-character_is_combining.c
level_0/f_utf/tests/unit/c/test-utf-character_is_control.c
level_0/f_utf/tests/unit/c/test-utf-character_is_digit.c
level_0/f_utf/tests/unit/c/test-utf-character_is_emoji.c
level_0/f_utf/tests/unit/c/test-utf-character_is_numeric.c
level_0/f_utf/tests/unit/c/test-utf-character_is_phonetic.c
level_0/f_utf/tests/unit/c/test-utf-character_is_private.c
level_0/f_utf/tests/unit/c/test-utf-character_is_punctuation.c
level_0/f_utf/tests/unit/c/test-utf-character_is_subscript.c
level_0/f_utf/tests/unit/c/test-utf-character_is_superscript.c
level_0/f_utf/tests/unit/c/test-utf-character_is_surrogate.c [deleted file]
level_0/f_utf/tests/unit/c/test-utf-character_is_surrogate.h [deleted file]
level_0/f_utf/tests/unit/c/test-utf-character_is_symbol.c
level_0/f_utf/tests/unit/c/test-utf-character_is_valid.c
level_0/f_utf/tests/unit/c/test-utf-character_is_whitespace.c
level_0/f_utf/tests/unit/c/test-utf-character_is_wide.c
level_0/f_utf/tests/unit/c/test-utf-character_is_word.c
level_0/f_utf/tests/unit/c/test-utf-character_is_zero_width.c
level_0/f_utf/tests/unit/c/test-utf-is_alphabetic.c
level_0/f_utf/tests/unit/c/test-utf-is_combining.c
level_0/f_utf/tests/unit/c/test-utf-is_control.c
level_0/f_utf/tests/unit/c/test-utf-is_digit.c
level_0/f_utf/tests/unit/c/test-utf-is_emoji.c
level_0/f_utf/tests/unit/c/test-utf-is_numeric.c
level_0/f_utf/tests/unit/c/test-utf-is_phonetic.c
level_0/f_utf/tests/unit/c/test-utf-is_private.c
level_0/f_utf/tests/unit/c/test-utf-is_punctuation.c
level_0/f_utf/tests/unit/c/test-utf-is_subscript.c
level_0/f_utf/tests/unit/c/test-utf-is_superscript.c
level_0/f_utf/tests/unit/c/test-utf-is_surrogate.c [deleted file]
level_0/f_utf/tests/unit/c/test-utf-is_surrogate.h [deleted file]
level_0/f_utf/tests/unit/c/test-utf-is_symbol.c
level_0/f_utf/tests/unit/c/test-utf-is_valid.c
level_0/f_utf/tests/unit/c/test-utf-is_whitespace.c
level_0/f_utf/tests/unit/c/test-utf-is_wide.c
level_0/f_utf/tests/unit/c/test-utf-is_word.c
level_0/f_utf/tests/unit/c/test-utf-is_zero_width.c
level_0/f_utf/tests/unit/c/test-utf.c
level_0/f_utf/tests/unit/c/test-utf.h
level_1/fl_conversion/c/conversion.c
level_1/fl_conversion/c/conversion.h
level_1/fl_conversion/c/conversion/common.c [new file with mode: 0644]
level_1/fl_conversion/c/conversion/common.h [new file with mode: 0644]
level_1/fl_conversion/c/private-conversion.c
level_1/fl_conversion/c/private-conversion.h
level_1/fl_conversion/data/build/settings
level_2/fll_status_string/c/status_string.c
level_3/byte_dump/c/byte_dump.c
level_3/control/c/private-control.c
level_3/control/c/private-control.h
level_3/controller/c/controller/private-controller.c
level_3/controller/c/controller/private-controller.h
level_3/controller/c/entry/private-entry.c
level_3/controller/c/rule/private-rule.c
level_3/controller/c/rule/private-rule.h
level_3/fake/c/private-make-operate.c
level_3/fake/c/private-make-operate_process.c
level_3/fake/c/private-make-operate_process_type.c
level_3/fake/c/private-make-operate_process_type.h
level_3/fake/c/private-make-operate_validate.c
level_3/fake/c/private-make.c
level_3/fss_basic_list_read/c/fss_basic_list_read.c
level_3/fss_basic_list_read/c/private-read.c
level_3/fss_basic_list_read/c/private-read.h
level_3/fss_basic_read/c/fss_basic_read.c
level_3/fss_basic_read/c/private-read.c
level_3/fss_basic_read/c/private-read.h
level_3/fss_embedded_list_read/c/fss_embedded_list_read.c
level_3/fss_embedded_list_read/c/private-read.c
level_3/fss_embedded_list_write/c/private-write.c
level_3/fss_extended_list_read/c/fss_extended_list_read.c
level_3/fss_extended_list_read/c/private-read.c
level_3/fss_extended_list_read/c/private-read.h
level_3/fss_extended_list_write/c/private-write.c
level_3/fss_extended_read/c/fss_extended_read.c
level_3/fss_extended_read/c/private-read.c
level_3/fss_extended_read/c/private-read.h
level_3/fss_identify/c/fss_identify.c
level_3/fss_identify/c/private-identify.c
level_3/fss_payload_read/c/fss_payload_read.c
level_3/fss_payload_read/c/private-read.c
level_3/fss_payload_read/c/private-read.h
level_3/fss_status_code/c/private-fss_status_code.c
level_3/iki_read/c/iki_read.c
level_3/status_code/c/private-status_code.c
level_3/utf8/c/private-print.c
level_3/utf8/c/private-print.h
level_3/utf8/c/private-utf8.c
level_3/utf8/c/private-utf8_bytesequence.c
level_3/utf8/c/private-utf8_bytesequence.h
level_3/utf8/c/private-utf8_codepoint.c
level_3/utf8/c/private-utf8_codepoint.h