Commit 966d22b

anson1014 and LinuxJedi authored
Ensure that source files contain only valid UTF8 encodings (#2188)
Modern software (including text editors, static analysis software, and web-based code review interfaces) often requires source code files to be interpretable via a consistent character encoding, with UTF-8 or ASCII (a strict subset of UTF-8) as the default. Several of the MariaDB source files contain bytes that are not valid in either the UTF-8 or ASCII encodings, but instead represent strings encoded in the ISO-8859-1/Latin-1 or ISO-8859-2/Latin-2 encodings. These inconsistent encodings may prevent software from correctly presenting or processing such files. Converting all source files to valid UTF8 characters will ensure correct handling.

Comments written in Czech were replaced with lightly-corrected translations from Google Translate. Additionally, comments describing the proper handling of special characters were changed so that the comments are now purely UTF8.

All new code of the whole pull request, including one or several files that are either new files or modified ones, are contributed under the BSD-new license. I am contributing on behalf of my employer Amazon Web Services, Inc.

Co-authored-by: Andrew Hutchings <andrew@linuxjedi.co.uk>
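As a rough illustration of the kind of check this commit message motivates, the following C sketch tests whether a byte buffer is well-formed UTF-8. It is not part of the commit: the function name `is_valid_utf8` is hypothetical, and the authors may have used entirely different tooling to find the offending files.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the commit.
   Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
  size_t i = 0;

  assert(buf != NULL || len == 0);

  while (i < len)
  {
    unsigned char c = buf[i];
    size_t need;        /* number of continuation bytes expected */
    unsigned long cp;   /* code point being decoded */
    unsigned long min;  /* smallest code point allowed for this length */

    if (c < 0x80) { i++; continue; }  /* plain ASCII */
    else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
    else return 0;  /* stray continuation byte or invalid lead byte */

    if (len - i - 1 < need)
      return 0;  /* sequence is truncated */

    for (size_t k = 1; k <= need; k++)
    {
      unsigned char cc = buf[i + k];
      if ((cc & 0xC0) != 0x80)
        return 0;  /* not a continuation byte */
      cp = (cp << 6) | (cc & 0x3F);
    }

    if (cp < min) return 0;                       /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */

    i += need + 1;
  }
  return 1;
}
```

A Latin-1 'é' is the single byte 0xE9. Followed by ASCII text it fails this check, because 0xE9 announces a three-byte sequence whose continuation bytes never arrive. The UTF-8 form of the same character, 0xC3 0xA9, passes.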
1 parent 0fbcb0a commit 966d22b

4 files changed: 34 additions and 60 deletions


mysys/my_win_popen.cc

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ extern "C" FILE *my_win_popen(const char *cmd, const char *mode)
  goto error;
  break;
  default:
- /* Unknown mode, éxpected "r", "rt", "w", "wt" */
+ /* Unknown mode, expected "r", "rt", "w", "wt" */
  abort();
  }
  if (!SetHandleInformation(parent_pipe_end, HANDLE_FLAG_INHERIT, 0))

storage/connect/domdoc.cpp

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -642,7 +642,6 @@ bool DOMNODELIST::DropItem(PGLOBAL g, int n)
  if (Listp == NULL || Listp->length < n)
  return true;

- //Listp->item[n] = NULL; La propriété n'a pas de méthode 'set'
  return false;
  } // end of DeleteItem

strings/ctype-czech.c

Lines changed: 25 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -23,13 +23,13 @@
  solution was needed than the one-to-one conversion table. To
  note a few, here is an example of a Czech sorting sequence:

- co < hlaska < hláska < hlava < chlapec < krtek
+ co < hlaska < hláska < hlava < chlapec < krtek

  It because some of the rules are: double char 'ch' is sorted
- between 'h' and 'i'. Accented character 'á' (a with acute) is
+ between 'h' and 'i'. Accented character 'á' (a with acute) is
  sorted after 'a' and before 'b', but only if the word is
  otherwise the same. However, because 's' is sorted before 'v'
- in hlava, the accentness of 'á' is overridden. There are many
+ in hlava, the accentness of 'á' is overridden. There are many
  more rules.

  This file defines functions my_strxfrm and my_strcoll for
@@ -42,8 +42,9 @@
  passes, that's why we need four times more space for expanded
  string.

- This file also contains the ISO-Latin-2 definitions of
- characters.
+ The non-ASCII literal strings in this file are encoded
+ in the iso-8859-2 / latin-2 character set
+ (https://en.wikipedia.org/wiki/ISO/IEC_8859-2)

  Author: (c) 1997--1998 Jan Pazdziora, adelton@fi.muni.cz
  Jan Pazdziora has a shared copyright for this code
@@ -112,7 +113,7 @@ static const struct wordvalue doubles[] = {
  };

  /*
- Unformal description of the algorithm:
+ Informal description of the algorithm:

  We walk the string left to right.
118119
@@ -127,7 +128,7 @@ static const struct wordvalue doubles[] = {

  End of pass is marked with value 1 on the output.

- For each character, we read it's value from the table.
+ For each character, we read its value from the table.

  If the value is ignore (0), we go straight to the next character.

@@ -139,31 +140,6 @@ static const struct wordvalue doubles[] = {
  exists behind it, find its value.

  We append 0 to the end.
- ---
- Neformální popis algoritmu:
-
- Procházíme øetìzec zleva doprava.
-
- Konec øetìzce je pøedán buï jako parametr, nebo je to *p == 0.
- Toto je o¹etøeno makrem IS_END.
-
- Pokud jsme do¹li na konec øetìzce pøi prùchodu 0, nejdeme na
- zaèátek, ale na ulo¾enou pozici, proto¾e první a druhý prùchod
- bì¾í souèasnì.
-
- Konec vstupu (prùchodu) oznaèíme na výstupu hodnotou 1.
-
- Pro ka¾dý znak øetìzce naèteme hodnotu z tøídící tabulky.
-
- Jde-li o hodnotu ignorovat (0), skoèíme ihned na dal¹í znak..
-
- Jde-li o hodnotu konec slova (2) a je to prùchod 0 nebo 1,
- pøeskoèíme v¹echny dal¹í 0 -- 2 a prohodíme prùchody.
-
- Jde-li o kompozitní znak (255), otestujeme, zda následuje
- správný do dvojice, dohledáme správnou hodnotu.
-
- Na konci pøipojíme znak 0
  */

  #define ADD_TO_RESULT(dest, len, totlen, value) \
@@ -336,24 +312,23 @@ my_strnxfrm_czech(CHARSET_INFO *cs __attribute__((unused)),


  /*
- Neformální popis algoritmu:
-
- procházíme øetìzec zleva doprava
- konec øetìzce poznáme podle *p == 0
- pokud jsme do¹li na konec øetìzce pøi prùchodu 0, nejdeme na
- zaèátek, ale na ulo¾enou pozici, proto¾e první a druhý
- prùchod bì¾í souèasnì
- konec vstupu (prùchodu) oznaèíme na výstupu hodnotou 1
-
- naèteme hodnotu z tøídící tabulky
- jde-li o hodnotu ignorovat (0), skoèíme na dal¹í prùchod
- jde-li o hodnotu konec slova (2) a je to prùchod 0 nebo 1,
- pøeskoèíme v¹echny dal¹í 0 -- 2 a prohodíme
- prùchody
- jde-li o kompozitní znak (255), otestujeme, zda následuje
- správný do dvojice, dohledáme správnou hodnotu
-
- na konci pøipojíme znak 0
+ Informal description of the algorithm:
+
+ we pass the chain from left to right
+ we know the end of the string by *p == 0
+ if we reached the end of the string on transition 0, then we don't go to
+ start, but to the saved position, because the first and second
+ the passage runs concurrently
+ we mark the end of the input (transition) with the value 1 on the output
+
+ then we load the value from the sorting table
+ if the value is ignore (0), we jump to the next pass
+ if the value is the end of the word (2) and it is a 0 or 1 transition,
+ we skip all the other 0 -- 2 and switch transitions
+ if it is a composite character (255), we test whether it follows
+ correct to the pair, we find the correct value
+
+ then we add the character 0 at the end
  */


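The sorting rules quoted in the header comment of strings/ctype-czech.c ('ch' sorts between 'h' and 'i', and an accent only breaks ties) can be sketched with a much simpler two-level comparison than the multi-pass algorithm in the file. The sketch below is an assumption-laden illustration, not the MariaDB implementation: the names `next_element` and `czech_like_compare` are hypothetical, input is assumed to be lowercase UTF-8, and only the digraph 'ch' and the letter 'á' are special-cased.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reads one collation element from *s and advances *s.
   primary   = base-letter weight (accents ignored)
   secondary = 1 if the letter carries an accent, else 0
   Returns 0 at end of string, 1 otherwise. */
static int next_element(const char **s, int *primary, int *secondary)
{
  const unsigned char *p = (const unsigned char *) *s;

  if (*p == 0)
    return 0;

  *secondary = 0;
  if (p[0] == 'c' && p[1] == 'h')
  {
    *primary = 2 * 'h' + 1;   /* 'ch' sorts between 'h' and 'i' */
    p += 2;
  }
  else if (p[0] == 0xC3 && p[1] == 0xA1)
  {
    *primary = 2 * 'a';       /* UTF-8 'á' has the same base weight as 'a' */
    *secondary = 1;
    p += 2;
  }
  else
  {
    *primary = 2 * p[0];
    p += 1;
  }
  *s = (const char *) p;
  return 1;
}

/* Returns <0, 0 or >0. Base letters decide first; the first accent
   difference is used only if all base letters are equal. */
int czech_like_compare(const char *a, const char *b)
{
  int accent_diff = 0;

  assert(a != NULL && b != NULL);

  for (;;)
  {
    int wa = 0, wb = 0, sa = 0, sb = 0;
    int ha = next_element(&a, &wa, &sa);
    int hb = next_element(&b, &wb, &sb);

    if (!ha || !hb)
    {
      if (ha != hb)
        return ha ? 1 : -1;   /* the shorter string sorts first */
      return accent_diff;     /* same base letters: accents decide */
    }
    if (wa != wb)
      return wa < wb ? -1 : 1;
    if (accent_diff == 0 && sa != sb)
      accent_diff = sa < sb ? -1 : 1;
  }
}
```

With these rules the example sequence from the comment comes out in the stated order: co, hlaska, hláska, hlava, chlapec, krtek. The 's' versus 'v' difference in "hláska" and "hlava" is found at the primary level, so the accent on 'á' never gets a say.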
strings/ctype-latin1.c

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -504,19 +504,19 @@ struct charset_info_st my_charset_latin1_nopad=
  *
  * The modern sort order is used, where:
  *
- * 'ä' -> "ae"
- * 'ö' -> "oe"
- * 'ü' -> "ue"
- * 'ß' -> "ss"
+ * 'ä' -> "ae"
+ * 'ö' -> "oe"
+ * 'ü' -> "ue"
+ * 'ß' -> "ss"
  */


  /*
  * This is a simple latin1 mapping table, which maps all accented
  * characters to their non-accented equivalents. Note: in this
- * table, 'ä' is mapped to 'A', 'ÿ' is mapped to 'Y', etc. - all
+ * table, 'ä' is mapped to 'A', 'ÿ' is mapped to 'Y', etc. - all
  * accented characters except the following are treated the same way.
- * Ü, ü, Ö, ö, Ä, ä
+ * Ü, ü, Ö, ö, Ä, ä
  */

  static const uchar sort_order_latin1_de[] = {
@@ -582,7 +582,7 @@ static const uchar combo2map[]={
  my_strnxfrm_latin_de() on both strings and compared the result strings.

  This means that:
- Ä must also matches ÁE and Aè, because my_strxn_frm_latin_de() will convert
+ Ä must also matches ÁE and Aè, because my_strxn_frm_latin_de() will convert
  both to AE.

  The other option would be to not do any accent removal in
@@ -708,7 +708,7 @@ void my_hash_sort_latin1_de(CHARSET_INFO *cs __attribute__((unused)),

  /*
  Remove end space. We have to do this to be able to compare
- 'AE' and 'Ä' as identical
+ 'AE' and 'Ä' as identical
  */
  end= skip_trailing_space(key, len);

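The expansion rules described in the strings/ctype-latin1.c comments ('ä' -> "ae", 'ö' -> "oe", 'ü' -> "ue", 'ß' -> "ss") can be illustrated with a small sort-key builder. This is only a sketch under stated assumptions: `german_sort_key` is a hypothetical name, the input is assumed to be Latin-1 bytes, and the real code works through lookup tables such as `sort_order_latin1_de` rather than a switch. Unlike those tables, the sketch leaves all other accented bytes unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the MariaDB implementation.
   Builds an uppercase sort key from Latin-1 input, expanding German
   umlauts and sharp s to two letters. Writes a NUL-terminated key into
   dst (capacity dstlen) and returns the key length. A capacity of
   2 * len + 1 is always sufficient. */
size_t german_sort_key(const unsigned char *src, size_t len,
                       char *dst, size_t dstlen)
{
  size_t n = 0;

  assert(src != NULL && dst != NULL);
  if (dstlen == 0)
    return 0;

  for (size_t i = 0; i < len; i++)
  {
    char single[2] = { 0, 0 };
    const char *exp;
    size_t el;

    switch (src[i])
    {
      case 0xC4: case 0xE4: exp = "AE"; break;  /* Ä, ä */
      case 0xD6: case 0xF6: exp = "OE"; break;  /* Ö, ö */
      case 0xDC: case 0xFC: exp = "UE"; break;  /* Ü, ü */
      case 0xDF:            exp = "SS"; break;  /* ß */
      default:
        /* ASCII letters are uppercased; other bytes pass through. */
        single[0] = (src[i] < 0x80) ? (char) toupper(src[i]) : (char) src[i];
        exp = single;
    }

    el = strlen(exp);
    if (n + el >= dstlen)
      break;  /* keep room for the terminating NUL */
    memcpy(dst + n, exp, el);
    n += el;
  }
  dst[n] = '\0';
  return n;
}
```

Under this sketch the Latin-1 byte 0xC4 ('Ä') and the two-letter string "AE" produce the same key, which is the equivalence the comment above `skip_trailing_space` refers to.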