DEV Community

Wesley

.NET: The ASCII Problem

Optimizing binaries for this might seem unnecessary, but it is at least worth asking: why does .NET behave in a way that makes these workarounds necessary in the first place?

For historical reasons, System.String uses the UCS-2 character encoding, that is, UTF-16 without surrogate pairs.
However, most strings in typical .NET applications consist solely of ASCII characters, leading to wasted space: half of the bytes in a string are likely to be null bytes!
Since strings are immutable, we can scan the character data when the string is constructed, then dynamically select an encoding, thereby saving 50% of string memory in most cases.

ASCII Mono | Mono
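
To put a number on the quoted claim, here is a minimal sketch that checks whether a string is pure ASCII and compares its size in UTF-16 against one byte per character. The class and method names (StringFootprint, IsAscii, Utf16Bytes, AsciiBytes) are made up for this illustration; only the Encoding calls are real .NET API.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of .NET or Mono.
static class StringFootprint
{
    // True when every character fits in 7-bit ASCII.
    public static bool IsAscii(string s)
    {
        foreach (char c in s)
            if (c > 0x7F) return false;
        return true;
    }

    // Bytes needed to store the characters as UTF-16 (2 per code unit).
    public static int Utf16Bytes(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }

    // Bytes needed to store the characters as ASCII (1 per character).
    public static int AsciiBytes(string s)
    {
        return Encoding.ASCII.GetByteCount(s);
    }

    static void Main()
    {
        string s = "Lorem ipsum dolor sit amet";
        Console.WriteLine(IsAscii(s));    // True
        Console.WriteLine(Utf16Bytes(s)); // 52
        Console.WriteLine(AsciiBytes(s)); // 26
    }
}
```

For an all-ASCII string, every second byte of the UTF-16 form is zero, which is exactly the 50% the quote refers to.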

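In compiled assemblies, this applies to every string literal, which is stored as UTF-16 in the #US heap. Enum member names and field names are the exception: as metadata identifiers, they are stored as UTF-8 in the #Strings heap.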

The fork linked on that page hasn't been active since 2016, and I know before I even try that if I clone it and run make, there will be a million errors. So now it's time to pull out my best gimmick in programming: hacks.

Tests

Here is a base executable with no code to execute:
3.0KB:

0000: 4D5A90000300000004000000FFFF0000 MZ..............
0010: B8000000000000004000000000000000 ........@.......
0020: 00000000000000000000000000000000 ................
0030: 00000000000000000000000080000000 ................
0040: 0E1FBA0E00B409CD21B8014CCD215468 ........!..L.!Th
0050: 69732070726F6772616D2063616E6E6F is program canno
0060: 742062652072756E20696E20444F5320 t be run in DOS 
0070: 6D6F64652E0D0D0A2400000000000000 mode....$.......
0080: 504500004C0103000000000000000000 PE..L...........
0090: 00000000E00002010B01080000040000 ................
00a0: 0006000000000000BE22000000200000 ........."... ..
00b0: 00400000000040000020000000020000 .@....@.. ......
00c0: 04000000000000000400000000000000 ................
00d0: 00800000000200000000000003004085 ..............@.
00e0: 00001000001000000000100000100000 ................
00f0: 00000000100000000000000000000000 ................
0100: 702200004B00000000400000D0020000 p"..K....@......
0110: 00000000000000000000000000000000 ................
0120: 006000000C0000000000000000000000 .`..............
0130: 00000000000000000000000000000000 ................
0140: 00000000000000000000000000000000 ................
0150: 00000000000000000020000008000000 ......... ......
0160: 00000000000000000820000048000000 ......... ..H...
0170: 00000000000000002E74657874000000 .........text...
0180: C4020000002000000004000000020000 ..... ..........
0190: 00000000000000000000000020000060 ............ ..`
01a0: 2E72737263000000D002000000400000 .rsrc........@..
01b0: 00040000000600000000000000000000 ................
01c0: 00000000400000402E72656C6F630000 ....@..@.reloc..
01d0: 0C0000000060000000020000000A0000 .....`..........
01e0: 00000000000000000000000040000042 ............@..B
01f0: 00000000000000000000000000000000 ................
0200: A0220000000000004800000002000500 ."......H.......
0210: 5C2000000C0200000100000002000006 \ ..............
0220: 00000000000000000000000000000000 ................
0230: 00000000000000000000000000000000 ................
0240: 00000000000000000000000000000000 ................
0250: 1E02280100000A2A062A000042534A42 ..(....*.*..BSJB
0260: 01000100000000000C00000076322E30 ............v2.0
0270: 2E35303732370000000005006C000000 .50727......l...
0280: D0000000237E00003C01000084000000 ....#~..<.......
0290: 23537472696E677300000000C0010000 #Strings........
02a0: 0800000023555300C801000010000000 ....#US.........
02b0: 2347554944000000D801000034000000 #GUID.......4...
02c0: 23426C6F620000000000000002000010 #Blob...........
02d0: 471500000900000000FA013300160000 G..........3....
02e0: 01000000020000000200000002000000 ................
02f0: 01000000020000000100000001000000 ................
0300: 0100000000007B000100000000000600 ......{.........
0310: 17001E00060034005200000000000100 ......4.R.......
0320: 0000000001000100000010000A000000 ................
0330: 05000100010050200000000086182500 ......P ......%.
0340: 0100010058200000000096002B000500 ....X ......+...
0350: 01000000010012000900250001001100 ..........%.....
0360: 250001002E0013000B00048000000000 %...............
0370: 00000000000000000000000030000000 ............0...
0380: 0200000000000000000000002A007200 ............*.r.
0390: 0000000000000000003C4D6F64756C65 .........<Module
03a0: 3E0050726F6772616D0061726773004F >.Program.args.O
03b0: 626A6563740053797374656D002E6374 bject.System..ct
03c0: 6F72004D61696E007768790052756E74 or.Main.why.Runt
03d0: 696D65436F6D7061746962696C697479 imeCompatibility
03e0: 4174747269627574650053797374656D Attribute.System
03f0: 2E52756E74696D652E436F6D70696C65 .Runtime.Compile
0400: 725365727669636573006D73636F726C rServices.mscorl
0410: 6962007768792E657865000000032000 ib.why.exe.... .
0420: 000000001B8EF60B53E43C4E97C8F435 ........S.<N...5
0430: 7DB1B9050003200001050001011D0E1E }..... .........
0440: 01000100540216577261704E6F6E4578 ....T..WrapNonEx
0450: 63657074696F6E5468726F77730108B7 ceptionThrows...
0460: 7A5C561934E089000000000000000000 z\V.4...........
0470: 982200000000000000000000AE220000 ."..........."..
0480: 00200000000000000000000000000000 . ..............
0490: 0000000000000000A022000000000000 ........."......
04a0: 00005F436F724578654D61696E006D73 .._CorExeMain.ms
04b0: 636F7265652E646C6C0000000000FF25 coree.dll......%
04c0: 00204000000000000000000000000000 . @.............
04d0: 00000000000000000000000000000000 ................

At a bare minimum, executables carry an extra 1.5KB of data for the .rsrc and .reloc sections, the former of which holds version info.
DEV is not helping me here with the little width I have to work with.

Plain String

Here's a plain call to WriteLine in Program.Main.

Console.WriteLine("Lorem ipsum dolor sit amet, consectetur... 

Compile command: mcs -debug- -nostdlib- -o+ -sdk:2 test.cs
Binary view: (xxd -u -g 1 -s +0x200 -l 0x400 test.exe)
4.0KB:

03a0: 2F0084000000000000000000003C4D6F /............<Mo
03b0: 64756C653E0050726F6772616D006172 dule>.Program.ar
03c0: 677300436F6E736F6C65005379737465 gs.Console.Syste
03d0: 6D0057726974654C696E65004F626A65 m.WriteLine.Obje
03e0: 6374002E63746F72004D61696E007768 ct..ctor.Main.wh
03f0: 790052756E74696D65436F6D70617469 y.RuntimeCompati
0400: 62696C69747941747472696275746500 bilityAttribute.
0410: 53797374656D2E52756E74696D652E43 System.Runtime.C
0420: 6F6D70696C6572536572766963657300 ompilerServices.
0430: 6D73636F726C6962007768792E657865 mscorlib.why.exe
0440: 000000000084054C006F00720065006D .......L.o.r.e.m
0450: 00200069007000730075006D00200064 . .i.p.s.u.m. .d
0460: 006F006C006F00720020007300690074 .o.l.o.r. .s.i.t
0470: 00200061006D00650074002C00200063 . .a.m.e.t.,. .c
0480: 006F006E007300650063007400650074 .o.n.s.e.c.t.e.t
0490: 00750072002000610064006900700069 .u.r. .a.d.i.p.i
04a0: 007300630069006E006700200065006C .s.c.i.n.g. .e.l
04b0: 00690074002E0020004E0075006E0063 .i.t... .N.u.n.c
04c0: 00200065006700650074002000610075 . .e.g.e.t. .a.u
04d0: 00630074006F0072002000740065006C .c.t.o.r. .t.e.l
04e0: 006C00750073002C0020007500740020 .l.u.s.,. .u.t. 
04f0: 0063006F006E00640069006D0065006E .c.o.n.d.i.m.e.n
0500: 00740075006D0020006500730074002E .t.u.m. .e.s.t..
0510: 0020004E0075006C006C006100200066 . .N.u.l.l.a. .f
0520: 00650072006D0065006E00740075006D .e.r.m.e.n.t.u.m
0530: 002C0020007400750072007000690073 .,. .t.u.r.p.i.s
0540: 002000730069007400200061006D0065 . .s.i.t. .a.m.e
0550: 0074002000680065006E006400720065 .t. .h.e.n.d.r.e
0560: 00720069007400200072007500740072 .r.i.t. .r.u.t.r
0570: 0075006D002C00200061006E00740065 .u.m.,. .a.n.t.e
0580: 00200064007500690020006C006F0062 . .d.u.i. .l.o.b
0590: 006F0072007400690073002000610075 .o.r.t.i.s. .a.u
05a0: 006700750065002C0020007300650064 .g.u.e.,. .s.e.d
05b0: 002000700075006C00760069006E0061 . .p.u.l.v.i.n.a
05c0: 007200200065007800200064006F006C .r. .e.x. .d.o.l
05d0: 006F00720020006E006F006E0020006C .o.r. .n.o.n. .l
05e0: 0061006300750073002E0020004E0061 .a.c.u.s... .N.a
05f0: 006D0020007300650064002000650073 .m. .s.e.d. .e.s

Stored as UTF-16 in the binary, this one string takes up about as much space as an entire EXE section (0x400 bytes / 1KB).
This is doomsday for large console apps.
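
The exact size can be read straight from the dump: the two bytes 84 05 right before the string data are the length prefix of the #US heap entry, stored in the compressed unsigned integer format from ECMA-335. The sketch below decodes such a prefix; the names UserStringHeap, ReadCompressedLength and CharCount are made up for this example, and the character count assumes the usual #US layout of 2 bytes per character plus 1 trailing flag byte.

```csharp
using System;

// Hypothetical helper for illustration; names are not from any library.
static class UserStringHeap
{
    // Decodes an ECMA-335 compressed unsigned integer starting at offset.
    // prefixSize receives the number of bytes the prefix itself occupies.
    public static int ReadCompressedLength(byte[] data, int offset, out int prefixSize)
    {
        byte b0 = data[offset];
        if ((b0 & 0x80) == 0)            // 0xxxxxxx: 1-byte form
        {
            prefixSize = 1;
            return b0;
        }
        if ((b0 & 0xC0) == 0x80)         // 10xxxxxx: 2-byte form
        {
            prefixSize = 2;
            return ((b0 & 0x3F) << 8) | data[offset + 1];
        }
        prefixSize = 4;                  // 110xxxxx: 4-byte form
        return ((b0 & 0x1F) << 24) | (data[offset + 1] << 16)
             | (data[offset + 2] << 8) | data[offset + 3];
    }

    // #US entries hold UTF-16 data plus one trailing flag byte.
    public static int CharCount(int blobLength)
    {
        return (blobLength - 1) / 2;
    }

    static void Main()
    {
        int size;
        // Prefix in front of the plain literal in the dump above.
        int plain = ReadCompressedLength(new byte[] { 0x84, 0x05 }, 0, out size);
        Console.WriteLine("{0} bytes, {1} chars", plain, CharCount(plain));
        // prints "1029 bytes, 514 chars"
    }
}
```

So the 514-character literal costs 1029 bytes in the heap, plus 2 bytes for the prefix.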

String Resource

Loading a string from a resource:

Console.WriteLine(Resources.ResourceManager.GetString("test")); 

Resource file:

<?xml version="1.0" encoding="utf-8"?>
<root>
  ...
  <data name="test" xml:space="preserve">
    <value>Lorem ipsum dolor sit amet, consectetur...</value>
  </data>
</root>

Command: resgen /useSourcePath /compile main.resx
Added to MCS args: -resource:main.resources
4.5KB:

0200: D0280000000000004800000002000500 .(......H.......
0210: 80230000200500000100000004000006 .#.. ...........
0220: AC200000D20200000000000000000000 . ..............
0230: 00000000000000000000000000000000 ................
0240: 00000000000000000000000000000000 ................
0250: 1E02280900000A2AD27E010000041428 ..(....*.~.....(
0260: 0500000A391E0000007201000070D002 ....9....r...p..
0270: 000002280600000A6F0700000A730800 ...(....o....s..
0280: 000A80010000047E010000042A1E0228 .......~....*..(
0290: 0900000A2A5628020000067209000070 ....*V(....r...p
02a0: 6F0A00000A280B00000A2A00CE020000 o....(....*.....
02b0: CECAEFBE01000000910000006C537973 ............lSys
02c0: 74656D2E5265736F75726365732E5265 tem.Resources.Re
02d0: 736F757263655265616465722C206D73 sourceReader, ms
02e0: 636F726C69622C2056657273696F6E3D corlib, Version=
02f0: 342E302E302E302C2043756C74757265 4.0.0.0, Culture
0300: 3D6E65757472616C2C205075626C6963 =neutral, Public
0310: 4B6579546F6B656E3D62373761356335 KeyToken=b77a5c5
0320: 3631393334653038392353797374656D 61934e089#System
0330: 2E5265736F75726365732E52756E7469 .Resources.Runti
0340: 6D655265736F75726365536574020000 meResourceSet...
0350: 00010000000000000050414450414450 .........PADPADP
0360: 33AF737C00000000C900000008740065 3.s|.........t.e
0370: 0073007400000000000182044C6F7265 .s.t........Lore
0380: 6D20697073756D20646F6C6F72207369 m ipsum dolor si
0390: 7420616D65742C20636F6E7365637465 t amet, consecte
03a0: 7475722061646970697363696E672065 tur adipiscing e
03b0: 6C69742E204E756E6320656765742061 lit. Nunc eget a
03c0: 7563746F722074656C6C75732C207574 uctor tellus, ut
03d0: 20636F6E64696D656E74756D20657374  condimentum est
07a0: 0000000000000000B700230100000000 ..........#.....
07b0: 0000000001000000E401000000000000 ................
07c0: 003C4D6F64756C653E00520050726F67 .<Module>.R.Prog
07d0: 72616D00724D005265736F757263654D ram.rM.ResourceM
07e0: 616E616765720053797374656D2E5265 anager.System.Re
07f0: 736F75726365730047656E6572617465 sources.Generate
0800: 64436F64654174747269627574650053 dCodeAttribute.S
0810: 797374656D2E436F6465446F6D2E436F ystem.CodeDom.Co
0820: 6D70696C6572002E63746F7200446562 mpiler..ctor.Deb
0830: 75676765724E6F6E55736572436F6465 uggerNonUserCode
0840: 4174747269627574650053797374656D Attribute.System
0850: 2E446961676E6F737469637300436F6D .Diagnostics.Com
0860: 70696C657247656E6572617465644174 pilerGeneratedAt
0870: 747269627574650053797374656D2E52 tribute.System.R
0880: 756E74696D652E436F6D70696C657253 untime.CompilerS
0890: 6572766963657300456469746F724272 ervices.EditorBr
08a0: 6F777361626C65417474726962757465 owsableAttribute
08b0: 0053797374656D2E436F6D706F6E656E .System.Componen
08c0: 744D6F64656C00456469746F7242726F tModel.EditorBro
08d0: 777361626C655374617465004F626A65 wsableState.Obje
08e0: 63740053797374656D00526566657265 ct.System.Refere
08f0: 6E6365457175616C7300547970650047 nceEquals.Type.G
0900: 65745479706546726F6D48616E646C65 etTypeFromHandle
0910: 0052756E74696D655479706548616E64 .RuntimeTypeHand
0920: 6C65006765745F417373656D626C7900 le.get_Assembly.
0930: 417373656D626C790053797374656D2E Assembly.System.
0940: 5265666C656374696F6E006172677300 Reflection.args.
0950: 476574537472696E6700436F6E736F6C GetString.Consol
0960: 650057726974654C696E65006765745F e.WriteLine.get_
0970: 4D004D61696E004D007768790052756E M.Main.M.why.Run
0980: 74696D65436F6D7061746962696C6974 timeCompatibilit
0990: 79417474726962757465006D73636F72 yAttribute.mscor
09a0: 6C6962007768792E7265736F75726365 lib.why.resource
09b0: 73007768792E65786500000000077700 s.why.exe.....w.
09c0: 68007900000974006500730074000000 h.y...t.e.s.t...
09d0: 7855E192F3A8DF4ABD3B8148F7C608B2 xU.....J.;.H....
09e0: 0003061205052002010E0E4101003353 ...... ....A..3S
09f0: 797374656D2E5265736F75726365732E ystem.Resources.
0a00: 546F6F6C732E5374726F6E676C795479 Tools.StronglyTy
0a10: 7065645265736F757263654275696C64 pedResourceBuild
0a20: 65720831372E302E302E300000032000 er.17.0.0.0... .

The string itself is now stored one byte per character, but the extra metadata at the bottom adds nearly as much bloat.

String as Byte[]

(Real world example)
(sigh):

Console.WriteLine(Encoding.ASCII.GetString(
    new byte[] {(byte)'L',(byte)'o',(byte)'r',(byte)'e',(byte)'m',(byte)' ',(byte)'i'...}
));

5.0KB (...):

0100: F02300004B00000000600000D0020000 .#..K....`......
0110: 00000000000000000000000000000000 ................
0120: 008000000C0000000000000000000000 ................
0130: 00000000000000000000000000000000 ................
0140: 00000000000000000000000000000000 ................
0150: 00000000000000000020000008000000 ......... ......
0160: 00000000000000000820000048000000 ......... ..H...
0170: 00000000000000002E74657874000000 .........text...
0180: 44040000002000000006000000040000 D.... ..........
0190: 00000000000000000000000020000060 ............ ..`
01a0: 2E736461746100000402000000400000 .sdata.......@..
01b0: 00040000000A00000000000000000000 ................
01c0: 00000000400000C02E72737263000000 ....@....rsrc...
01d0: D00200000060000000040000000E0000 .....`..........
01e0: 00000000000000000000000040000040 ............@..@
01f0: 2E72656C6F6300000C00000000800000 .reloc..........
0200: 00020000001200000000000000000000 ................
0630: 03000000003C4D6F64756C653E005072 .....<Module>.Pr
0640: 6F6772616D0061726773004279746500 ogram.args.Byte.
0650: 53797374656D003C5072697661746549 System.<PrivateI
0660: 6D706C656D656E746174696F6E446574 mplementationDet
0670: 61696C733E0024417272617954797065 ails>.$ArrayType
0680: 3D35313600246669656C642D31323333 =516.$field-1233
0690: 38454237413930433932343238313343 8EB7A90C9242813C
06a0: 41463436453833353941303343304242 AF46E8359A03C0BB
06b0: 444133440052756E74696D6548656C70 DA3D.RuntimeHelp
06c0: 6572730053797374656D2E52756E7469 ers.System.Runti
06d0: 6D652E436F6D70696C65725365727669 me.CompilerServi
06e0: 63657300496E697469616C697A654172 ces.InitializeAr
06f0: 7261790041727261790052756E74696D ray.Array.Runtim
0700: 654669656C6448616E646C6500436F6E eFieldHandle.Con
0710: 736F6C650057726974654C696E65004F sole.WriteLine.O
0a00: 4C6F72656D20697073756D20646F6C6F Lorem ipsum dolo
0a10: 722073697420616D65742C20636F6E73 r sit amet, cons
0a20: 65637465747572206164697069736369 ectetur adipisci
0a30: 6E6720656C69742E204E756E63206567 ng elit. Nunc eg
0a40: 657420617563746F722074656C6C7573 et auctor tellus

This adds an extra section named ".sdata" where the byte array is stored, and a stupidly long field name for the array.

Enum

Probably the stupidest way to work around this, and it only works in certain circumstances, like storing/using config keys.
(Real world example)
Rough version that doesn't use a proper array or support punctuation:

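public enum _
{
    Lorem, ipsum, dolor, sit, amet, consectetur, ... // dupes have underscore appended :(
}

// _._ is the enum member named _; presumably it is declared last,
// so its numeric value doubles as the word count.
string test = "";
for (int i = 0; i < (int)(_._); i++)
    test += ((_)i).ToString() + ' ';
Console.WriteLine(test);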

5.0KB:

03a0: 09000100010002010000120000001500 ................
03b0: 0100030006061400010056801C000400 ..........V.....
03c0: 56802200040056802800040056802E00 V."...V.(...V...
03d0: 04005680320004005680370004005680 ..V.2...V.7...V.
03e0: 4300040056804E000400568053000400 C...V.N...V.S...
03f0: 56805800040056805D00040056806400 V.X...V.]...V.d.
0400: 040056806B00040056806E0004005680 ..V.k...V.n...V.
0410: 7A00040056807E000400568084000400 z...V.~...V.....
0420: 56808E00040056809500040056809A00 V.....V.....V...
0430: 04005680A00004005680AA0004005680 ..V.....V.....V.
0440: B10004005680B60004005680BA000400 ....V.....V.....
0450: 5680C30004005680C90004005680CD00 V.....V.....V...
0770: 5201080014015701080018015C010800 R.....W.....\...
0780: 1C016101080020016601080024016B01 ..a... .f...$.k.
0790: 08002801700108002C01750108003001 ..(.p...,.u...0.
07a0: 7A01080034017F010800380184010800 z...4.....8.....
07b0: 3C018901080040018E01080044019301 <.....@.....D...
07c0: 2E003300BC01B5010480000000000000 ..3.............
07d0: 00000000000000000000780200000200 ..........x.....
07e0: 00000000000000000000DB01BA020000 ................
07f0: 0000030002000000003C4D6F64756C65 .........<Module
0800: 3E0050726F6772616D005F0076616C75 >.Program._.valu
0810: 655F5F004C6F72656D00697073756D00 e__.Lorem.ipsum.
0820: 646F6C6F720073697400616D65740063 dolor.sit.amet.c
0830: 6F6E7365637465747572006164697069 onsectetur.adipi
0840: 7363696E6700656C6974004E756E6300 scing.elit.Nunc.
0850: 6567657400617563746F720074656C6C eget.auctor.tell
0860: 757300757400636F6E64696D656E7475 us.ut.condimentu
0870: 6D00657374004E756C6C61006665726D m.est.Nulla.ferm
0880: 656E74756D0074757270697300736974 entum.turpis.sit

This takes up 0x440 bytes.
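
For comparison, a slightly less rough variant could lean on Enum.GetNames, which returns the member names ordered by their numeric value, so no sentinel member is needed. This is only a sketch of the same idea, not the code from the real-world example; the enum Words and the class EnumText are made up for this illustration and hold just the first five words.

```csharp
using System;

// Hypothetical enum for illustration; the real one holds the whole text.
enum Words { Lorem, ipsum, dolor, sit, amet }

// Hypothetical helper; not from the original post.
static class EnumText
{
    // Joins all member names of an enum with spaces, in value order.
    public static string Join(Type enumType)
    {
        return string.Join(" ", Enum.GetNames(enumType));
    }

    static void Main()
    {
        Console.WriteLine(Join(typeof(Words))); // Lorem ipsum dolor sit amet
    }
}
```

The limitations stay the same: duplicate words still need distinct member names, and punctuation cannot appear in an identifier.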

Raw ASCII string squished into UTF-16 LE C# String

The ASCII bytes are packed two per character into the literal, which is converted to a byte array with Encoding.Unicode.GetBytes and then back to a string with Encoding.ASCII.GetString.

Console.WriteLine(Encoding.ASCII.GetString(Encoding.Unicode.GetBytes( "潌敲灩畳潤潬⁲楳⁴浡瑥‬潣獮捥整畴⁲摡灩獩楣杮攠..." ))); 

3.5KB(!!):

03a0: 750017002100750017002E003B002100 u...!.u.....;.!.
03b0: 04800000000000000000000000000000 ................
03c0: 00008000000002000000000000000000 ................
03d0: 00004000C200000000000000003C4D6F ..@..........<Mo
03e0: 64756C653E0050726F6772616D006172 dule>.Program.ar
03f0: 677300456E636F64696E670053797374 gs.Encoding.Syst
0400: 656D2E54657874006765745F41534349 em.Text.get_ASCI
0410: 49006765745F556E69636F6465004765 I.get_Unicode.Ge
0420: 74427974657300476574537472696E67 tBytes.GetString
0430: 00436F6E736F6C650053797374656D00 .Console.System.
0440: 57726974654C696E65004F626A656374 WriteLine.Object
0450: 002E63746F72004D61696E0077687900 ..ctor.Main.why.
0460: 52756E74696D65436F6D706174696269 RuntimeCompatibi
0470: 6C697479417474726962757465005379 lityAttribute.Sy
0480: 7374656D2E52756E74696D652E436F6D stem.Runtime.Com
0490: 70696C65725365727669636573006D73 pilerServices.ms
04a0: 636F726C6962007768792E6578650000 corlib.why.exe..
04b0: 0082034C6F72656D20697073756D2064 ...Lorem ipsum d
04c0: 6F6C6F722073697420616D65742C2063 olor sit amet, c
04d0: 6F6E7365637465747572206164697069 onsectetur adipi
04e0: 7363696E6720656C69742E204E756E63 scing elit. Nunc
04f0: 206567657420617563746F722074656C  eget auctor tel
0500: 6C75732C20757420636F6E64696D656E lus, ut condimen
0510: 74756D206573742E204E756C6C612066 tum est. Nulla f
0520: 65726D656E74756D2C20747572706973 ermentum, turpis
0690: 612C20657420626962656E64756D2066 a, et bibendum f
06a0: 656C69732074656D7075732061636375 elis tempus accu
06b0: 6D73616E2E0100003EF604CCBDC0D640 msan....>......@
06c0: B4394801D5D5B2C90004000012050520 .9H............ 
06d0: 011D050E0520010E1D05040001010E03 ..... ..........
06e0: 200001050001011D0E1E010001005402  .............T.
06f0: 16577261704E6F6E457863657074696F .WrapNonExceptio
0700: 6E5468726F77730108B77A5C561934E0 nThrows...z\V.4.
0710: 89000000000000000000000000000000 ................
0720: 4825000000000000000000005E250000 H%..........^%..
0730: 00200000000000000000000000000000 . ..............
0740: 00000000000000005025000000000000 ........P%......
0750: 00005F436F724578654D61696E006D73 .._CorExeMain.ms
0760: 636F7265652E646C6C0000000000FF25 coree.dll......%
0770: 00204000000000000000000000000000 . @.............
0780: 00000000000000000000000000000000 ................

The string takes 0x200 bytes / 0.5 KB.
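
The packed literal has to come from somewhere, so here is a minimal sketch of how it could be generated and checked. The class AsciiPacker and its methods are made up for this illustration, and padding odd-length input with a trailing space is an assumption of this sketch, not something the original code does. The generated literal also has to survive being pasted into a source file, so the file must be saved in an encoding that preserves it (UTF-8, for instance).

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not from the original post.
static class AsciiPacker
{
    // Packs ASCII text two bytes per UTF-16 LE code unit.
    // Both bytes are at most 0x7F, so no code unit can land in the
    // surrogate range and the round trip is lossless.
    public static string Pack(string ascii)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(ascii);
        if (bytes.Length % 2 != 0)
        {
            // Assumption of this sketch: pad odd input with a space.
            Array.Resize(ref bytes, bytes.Length + 1);
            bytes[bytes.Length - 1] = (byte)' ';
        }
        return Encoding.Unicode.GetString(bytes);
    }

    // Same decoding step as in the WriteLine call above.
    public static string Unpack(string packed)
    {
        return Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(packed));
    }

    static void Main()
    {
        string packed = Pack("Lorem ipsum dolor sit amet");
        Console.WriteLine(packed.Length);  // 13
        Console.WriteLine(Unpack(packed)); // Lorem ipsum dolor sit amet
    }
}
```

Pack turns "Lo" into the single character U+6F4C, which is the first character of the literal shown above.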
