DEV Community

Cover image for Reading UTF-8 char by char in C
Talles L
Talles L

Posted on

Reading UTF-8 char by char in C

Using wchar_t didn't quite worked out in my tests, so handling it on my own:

#include <stdint.h> #include <stdio.h> #include <stdlib.h>  // https://stackoverflow.com/a/44776334 int8_t utf8_length(char c) { // 4-byte character (11110XXX) if ((c & 0b11111000) == 0b11110000) return 4; // 3-byte character (1110XXXX) if ((c & 0b11110000) == 0b11100000) return 3; // 2-byte character (110XXXXX) if ((c & 0b11100000) == 0b11000000) return 2; // 1-byte ASCII character (0XXXXXXX) if ((c & 0b10000000) == 0b00000000) return 1; // Probably a 10XXXXXXX continuation byte return -1; } void main () { const char* filepath = "example.txt"; FILE* file = fopen(filepath, "r"); if (!file) { perror(filepath); exit(1); } char c; for(;;) { c = getc(file); if (c == EOF) break; putc(c, stdout); int8_t length = utf8_length(c); while (--length) { c = getc(file); putc(c, stdout); } getchar(); } fclose (file); } 
Enter fullscreen mode Exit fullscreen mode

And here's my test file:

Hello, World! ๐ŸŒ๐Ÿš€ Hello ยกHola! ร‡a va? ไฝ ๅฅฝ ใ“ใ‚“ใซใกใฏ ์•ˆ๋…•ํ•˜์„ธ์š” ยฉยฎโ„ขโœ“โœ— ๐Ÿ˜„๐Ÿ˜ข๐Ÿ˜Ž๐Ÿ”ฅโœจ โ‚ฌ๐ˆ๐’€ญ 
Enter fullscreen mode Exit fullscreen mode

Top comments (0)