Update: Follow up previous Unicode changes.
The previous commit changed a significant amount of behavior.
That commit noted that follow up changes would be necessary.
First things first.
I noticed that when I simplified the is valid checks I ended up over simplifying them.
There are several byte sequences that are not valid UTF-8 sequences.
I previously added surrogates and it turns out that UTF-8 specifically does not support Unicode surrogates.
Remove all related code.
The f_utf_char_t is supposed to be in big-endian format.
The macros are fixed to properly handle this.
This fix exposed problems in the conversion functions.
The conversion functions lack the proper big-endian and little-endian support.
Introduce a new structure and parameters to support designating the big-endian and little-endian.
Support a default order to host byte order.
The utf8 program needs to properly handle the endianness in a different way.
The bytes are in left-to-right format but when converted are converted in a left-to-right format but shifted to the right.
Swapping between little-endian to big-endian would be incorrect because the byte order is aleady correct.
The byte position is what is incorrect.
That is 0x0000c280 should be shifted to 0xc2800000.
Swapping the endianness would instead yield 0x80c20000 (which is incorrect).
The use of the word "character" as a variable name and in documentation can be confusing.
I have recently defined a "byte sequence", a "code point", and a "unicode" as specific types.
Change the word "character" to the appropriate name to make the code less confusing and more specific.
There are also other words used in place of "character" that might not be the ones listed above.
Some of the tests, particularly the emoji tests, have incorrect data.
I discovered that many sources out on the internet violate the standard and call code points an emoji that are not official recognized as an emoji by the standard.
I'm going with wikipedia on the new and updated emoji list.
The f_char_t is available so update old code that still uses uint8_t to instead use f_char_t for character related data.
Changes to the is valid code resulted in identifying invalid byte sequences that were previously considered valid.