Our QA team are absolute monsters, they're monsters I tell you!
No matter how much I try to bribe them with drinks at our next company gathering to stop finding bugs in my code, they keep coming up with unreasonable requests and "expected behaviours"...ewwwww, gross!
Don't believe me?
Recently they asked for a list to be ordered alphanumerically.
Can you believe that? Monsters I tell you!
I mean, it isn't like an end user would expect a list to be sorted:
test 1 test 2 test 12
...why can't it just be like MySQL thinks it should be sorted:
test 1 test 12 test 2
Makes perfect sense right? Right?
Anyway lets not get too deep into who is right and who is wrong here (they're right FYI), I thought I better try and fix it.
I mean...how hard can it be? (foreshadowing)
The short answer
I know you might be busy, so the short answer is sort it in JS / PHP / whatever. Don't try and do it with MySQL.
Like seriously, just return your data set and then use Intl.Collator
if you are using Node, or natsort()
in PHP.
And you know what, that would have been fine. It would have solved the problem and I could have moved on with my day. QA would have been happy, list would be sorted, all good!
However, if you are like me then when your CTO reviews your PR and says:
I take that to heart!
I mean, he is so right!
Why should we have to sort things in code, MySQL should be able to do that for us!
What ensued was over 4 hours of fiddling and swearing to try and get MySQL to bend to my will.
Did I manage it? No...but kinda.
What I ended up with works well enough for user input in our specific scenario, and is much better than other solutions I found, but has it's limitations.
But we are getting ahead of ourselves, I want to show you what is it like trying to solve this in MySQL!
The Data for testing
First thing is first, let's start with the test data I used, just so you can see if I missed anything:
The Test Data
Test Data
INSERT INTO test_data (data_value) VALUES ('2 test'), ('12 test'), ('1 test'), ('20 test'), ('10 test'), ('1test'), ('2test'), ('test 1'), ('my data 2020-01-01'), ('my data 2020-02-01'), ('01-02-24 data'), ('12 test'), ('4a test'), ('my 2020-01-01 data 2020-01-01'), ('my 2020-02-01 data 2020-01-01'), ('my 2020-02-01 data 202-01-01'), ('my 2020-02-01 data 20-01-01'), ('my 2020-02-01 data 1-01-01'), ('my 2020-02-01 data 2-01-01'), ('my 2020-02-01 data 12-01-01'), ('my 2020-02-01 data 01-01-01'), ('my 2020-01-01 data 2020-02-01'), ('my 2020-01-01 data 2021-01-01'), ('my 2020-01-01 data 2120-01-01'), ('my 2120-01-01 data 2020-01-01'), ('4b test'), ('my test'), ('my 12 magic test'), ('my magic 12 test'), ('cheese and test 12'), ('42-a-1'), ('40-a-1'), ('40a'), ('FoClSy4727'), ('Pthw068bf'), ('6bfS'), ('HOFAp_Yx7920'), ('25hWTX'), ('dnjLlW1'), ('RHrIt72402eaLr'), ('cIhb42WFNQ'), ('9244uVCpGa'), ('yDKrkCp7960'), ('GeGIrPM-H86'), ('wrOae537LGCT'), ('WffSPaBA318'), ('kQ33596c'), ('3uEKHmHePf'), ('796h-eYWy'), ('833HufIZAS'), ('utjtV03Xns'), ('dlCSh87811'), ('13IUkOxEVl'), ('VHCok55901XYVk'), ('2RnSVwq'), ('AwtwQdn09'), ('gvSV6z'), ('uxWLO039hb'), ('vTg946');
Some nonsense with numbers in various places throughout the strings, seemed reasonable enough to me.
It is probably not a perfect data set, but good enough for what I was after!
So once I had some data I did what every developer does, asked the Google God for an answer:
Googling for natural sort in MySQL
Seems like the logical first step right?
MySQL has been around for 30 years, and alphanumeric sorting is something you would think people need to do often, so it must have some answers out there already right?
Short answer...it was exhausting. Every single article, Stack Overflow post etc. spitting out the same nonsense that doesn't work or is hyper specific to a particular data format.
I wanted something generic, so I could just sort alphanumerically everywhere!
So let me walk you through the pain I endured so you don't have to, and so you can decide if my answer is any better...or if I have just added to the pain!
Let's try some of the generic sort methods people suggested:
Test 1: the "+0" method
SELECT * FROM test_data ORDER BY `data_value`+0 ASC
The idea being that it tries to force MySQL to do numeric ordering over lexicographical ordering.
This would work fine if we had numbers or words in each row (the numbers would indeed sort correctly and then the words).
However because we have mixed data, it just spat out a load of nonsense:
Test 1 results
id | data_value |
---|---|
44 | GeGIrPM-H86 |
25 | my 2120-01-01 data 2020-01-01 |
27 | my test |
28 | my 12 magic test |
29 | my magic 12 test |
34 | FoClSy4727 |
Full Results for test 1
id data_value 44 GeGIrPM-H86 25 my 2120-01-01 data 2020-01-01 27 my test 28 my 12 magic test 29 my magic 12 test 34 FoClSy4727 35 Pthw068bf 37 HOFAp_Yx7920 39 dnjLlW1 40 RHrIt72402eaLr 41 cIhb42WFNQ 43 yDKrkCp7960 30 cheese and test 12 45 wrOae537LGCT 46 WffSPaBA318 47 kQ33596c 51 utjtV03Xns 52 dlCSh87811 54 VHCok55901XYVk 56 AwtwQdn09 57 gvSV6z 58 uxWLO039hb 59 vTg946 17 my 2020-02-01 data 20-01-01 8 test 1 9 my data 2020-01-01 10 my data 2020-02-01 14 my 2020-01-01 data 2020-01-01 15 my 2020-02-01 data 2020-01-01 16 my 2020-02-01 data 202-01-01 18 my 2020-02-01 data 1-01-01 19 my 2020-02-01 data 2-01-01 21 my 2020-02-01 data 01-01-01 23 my 2020-01-01 data 2021-01-01 22 my 2020-01-01 data 2020-02-01 20 my 2020-02-01 data 12-01-01 24 my 2020-01-01 data 2120-01-01 6 1test 11 01-02-24 data 3 1 test 55 2RnSVwq 1 2 test 7 2test 48 3uEKHmHePf 13 4a test 26 4b test 36 6bfS 5 10 test 2 12 test 12 12 test 53 13IUkOxEVl 4 20 test 38 25hWTX 33 40a 32 40-a-1 31 42-a-1 49 796h-eYWy 50 833HufIZAS 42 9244uVCpGa
Ok so that didn't work (and I didn't really expect it to).
Let's try something else:
Test 2: The "length" hack
SELECT alphanumeric, integer FROM sorting_test ORDER BY LENGTH(alphanumeric), alphanumeric
This is another interesting approach.
obviously "12" is longer than "1" so it would work there.
It also works if you have a standard format for data, like the "test 1, test 12, test 2" I mentioned at the beginning.
However, for our data it returns more nonsense:
Test 2 results
id | data_value |
---|---|
33 | 40a |
36 | 6bfS |
6 | 1test |
7 | 2test |
3 | 1 test |
1 | 2 test |
38 | 25hWTX |
32 | 40-a-1 |
31 | 42-a-1 |
57 | gvSV6z |
8 | test 1 |
59 | vTg946 |
5 | 10 test |
Full Results for test 2
id data_value 33 40a 36 6bfS 6 1test 7 2test 3 1 test 1 2 test 38 25hWTX 32 40-a-1 31 42-a-1 57 gvSV6z 8 test 1 59 vTg946 5 10 test 2 12 test 12 12 test 4 20 test 55 2RnSVwq 13 4a test 26 4b test 39 dnjLlW1 27 my test 47 kQ33596c 49 796h-eYWy 56 AwtwQdn09 35 Pthw068bf 53 13IUkOxEVl 48 3uEKHmHePf 50 833HufIZAS 42 9244uVCpGa 41 cIhb42WFNQ 52 dlCSh87811 34 FoClSy4727 51 utjtV03Xns 58 uxWLO039hb 44 GeGIrPM-H86 46 WffSPaBA318 43 yDKrkCp7960 37 HOFAp_Yx7920 45 wrOae537LGCT 11 01-02-24 data 40 RHrIt72402eaLr 54 VHCok55901XYVk 28 my 12 magic test 29 my magic 12 test 30 cheese and test 12 9 my data 2020-01-01 10 my data 2020-02-01 18 my 2020-02-01 data 1-01-01 19 my 2020-02-01 data 2-01-01 21 my 2020-02-01 data 01-01-01 20 my 2020-02-01 data 12-01-01 17 my 2020-02-01 data 20-01-01 16 my 2020-02-01 data 202-01-01 14 my 2020-01-01 data 2020-01-01 22 my 2020-01-01 data 2020-02-01 23 my 2020-01-01 data 2021-01-01 24 my 2020-01-01 data 2120-01-01 15 my 2020-02-01 data 2020-01-01 25 my 2120-01-01 data 2020-01-01
And that also makes sense, all it is doing is ordering on length first so:
"a1" would come before "1potato" by length.
Test 3: The Hardouken!
Just take a look at this beauty:
SELECT * FROM test_data ORDER BY REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( data_value, '0', 'Ā'), '1', 'ā'), '2', 'Ă'), '3', 'ă'), '4', 'Ą'), '5', 'ą'), '6', 'Ć'), '7', 'ć'), '8', 'Ĉ'), '9', 'ĉ') COLLATE utf8_bin;
The meme is real!
But despite that, this one pretty much works (albeit slightly differently to how I would have done it).
This one is smart in that it replaces each number with a character that will be sorted based on their binary representation in Unicode.
Test 3 results
id | data_value |
---|---|
56 | AwtwQdn09 |
34 | FoClSy4727 |
44 | GeGIrPM-H86 |
37 | HOFAp_Yx7920 |
35 | Pthw068bf |
40 | RHrIt72402eaLr |
54 | VHCok55901XYVk |
46 | WffSPaBA318 |
41 | cIhb42WFNQ |
30 | cheese and test 12 |
Full Results for test 3
id data_value 56 AwtwQdn09 34 FoClSy4727 44 GeGIrPM-H86 37 HOFAp_Yx7920 35 Pthw068bf 40 RHrIt72402eaLr 54 VHCok55901XYVk 46 WffSPaBA318 41 cIhb42WFNQ 30 cheese and test 12 52 dlCSh87811 39 dnjLlW1 57 gvSV6z 47 kQ33596c 9 my data 2020-01-01 10 my data 2020-02-01 29 my magic 12 test 27 my test 28 my 12 magic test 14 my 2020-01-01 data 2020-01-01 22 my 2020-01-01 data 2020-02-01 23 my 2020-01-01 data 2021-01-01 24 my 2020-01-01 data 2120-01-01 21 my 2020-02-01 data 01-01-01 18 my 2020-02-01 data 1-01-01 20 my 2020-02-01 data 12-01-01 19 my 2020-02-01 data 2-01-01 17 my 2020-02-01 data 20-01-01 16 my 2020-02-01 data 202-01-01 15 my 2020-02-01 data 2020-01-01 25 my 2120-01-01 data 2020-01-01 8 test 1 51 utjtV03Xns 58 uxWLO039hb 59 vTg946 45 wrOae537LGCT 43 yDKrkCp7960 11 01-02-24 data 3 1 test 6 1test 5 10 test 2 12 test 12 12 test 53 13IUkOxEVl 1 2 test 55 2RnSVwq 7 2test 4 20 test 38 25hWTX 48 3uEKHmHePf 13 4a test 26 4b test 32 40-a-1 33 40a 31 42-a-1 36 6bfS 49 796h-eYWy 50 833HufIZAS 42 9244uVCpGa
You know what, I would use this if it wasn't for one thing. It would mean that we could no longer accept 'Ā' etc as input.
Unfortunately our software will be used internationally, so we can't make that call.
Additionally if you do decide to use this, it is worth noting your database column must use COLLATE 'utf8_bin'
on it as well or you may get errors.
Chat GPT Time
When googling fails, chat GPT can always help right?
I am not going to show all the results, but let's just say the following are just a few of the "did not work" ones it produced!
Test 4: GPT fever dream
SELECT data_value, -- Replace numbers with 'a' and letters with 'b' dynamically ( SELECT GROUP_CONCAT( CASE WHEN c REGEXP '[0-9]' THEN 'a' -- Replace digits with 'a' WHEN c REGEXP '[a-zA-Z]' THEN 'b' -- Replace letters with 'b' ELSE c -- Leave other characters as-is END ORDER BY seq ) SEPARATOR '' FROM ( SELECT SUBSTRING(data_value, seq, 1) AS c, seq FROM ( SELECT data_value, seq FROM test_data JOIN ( SELECT n AS seq FROM (SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10) numbers WHERE n <= 255 -- Handle up to 255 characters in a string ) seq_table ON seq <= CHAR_LENGTH(data_value) ) char_table ) char_map ) AS sort_key FROM test_data ORDER BY sort_key;
It didn't run, and I couldn't quite work out why, but isn't it impressive looking!
What it seemed to be trying to do was replace all numbers with "a" and all letters with "b" to form strings like "aabaaa, aabba, abbaa" for comparison.
I would have probably spent more time trying to fix this if it wasn't flawed.
"11ab1" would produce "aabba"
"111ab1" would produce "aaaba".
So with the way MySQL sorts, 111ab1 would have been above 11ab1 anyway as "aaa" comes before "aab".
Nice idea, but sadly not quite it.
Test 5: GPT nearly nailed it!
This one surprised me.
I had seen it elsewhere already (as we know GPT is great at just spitting things out that it has already seen and isn't actually creative), but the explanation it gave made more sense!
SELECT * FROM test_data ORDER BY CAST(data_value AS UNSIGNED), data_value ASC;
Ahhhh nearly. I really thought this was the answer.
But sadly it fails on "Test 1, Test 12, Test 2". It is good for sorting if the number comes first though!
My Turn
I still couldn't get what I wanted after several attempts. But I was now much closer to an answer after seeing a few techniques.
After a bit of noodle scratching and crying I thought I had solved it!
It works...kinda!
Here is what I came up with:
CREATE TABLE test_data ( id INT AUTO_INCREMENT PRIMARY KEY, data_value VARCHAR(255) NOT NULL, transformed_column VARCHAR(255) AS ( CASE WHEN REGEXP_SUBSTR(data_value, '[0-9]+') IS NOT NULL THEN REGEXP_REPLACE( data_value, '[0-9]+', LPAD(REGEXP_SUBSTR(data_value, '[0-9]+'), 8, '0') ) ELSE data_value -- Or another default value, e.g., 'No Numbers' END ) STORED );
Then we can query it like so:
SELECT * FROM test_data ORDER BY transformed_column, data_value;
And this gives us nearly the results we would expect.
Test 6 results
data_value | transformed_column |
---|---|
1 test | 00000001 test |
01-02-24 data | 00000001-00000001-00000001 data |
1test | 00000001test |
2 test | 00000002 test |
2RnSVwq | 00000002RnSVwq |
2test | 00000002test |
3uEKHmHePf | 00000003uEKHmHePf |
4a test | 00000004a test |
Full Results for test 6
data_value transformed_column 1 test 00000001 test 01-02-24 data 00000001-00000001-00000001 data 1test 00000001test 2 test 00000002 test 2RnSVwq 00000002RnSVwq 2test 00000002test 3uEKHmHePf 00000003uEKHmHePf 4a test 00000004a test 4b test 00000004b test 6bfS 00000006bfS 10 test 00000010 test 12 test 00000012 test 12 test 00000012 test 13IUkOxEVl 00000013IUkOxEVl 20 test 00000020 test 25hWTX 00000025hWTX 40-a-1 00000040-a-00000040 40a 00000040a 42-a-1 00000042-a-00000042 796h-eYWy 00000796h-eYWy 833HufIZAS 00000833HufIZAS 9244uVCpGa 00009244uVCpGa AB-34Y67846 AB-00000034Y00000034 AwtwQdn09 AwtwQdn00000009 cheese and test 12 cheese and test 00000012 cIhb42WFNQ cIhb00000042WFNQ dlCSh87811 dlCSh00087811 dnjLlW1 dnjLlW00000001 E5-RMT893Y9 E00000005-RMT00000005Y00000005 EV-489RY3DA EV-00000489RY00000489DA FoClSy4727 FoClSy00004727 GeGIrPM-H86 GeGIrPM-H00000086 gvSV6z gvSV00000006z HOFAp_Yx7920 HOFAp_Yx00007920 kQ33596c kQ00033596c my 12 magic test my 00000012 magic test my 2020-01-01 data 2020-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-01-01 data 2020-02-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-01-01 data 2021-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-01-01 data 2120-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-02-01 data 01-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-02-01 data 1-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-02-01 data 12-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-02-01 data 2-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-02-01 data 20-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-02-01 data 202-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2020-02-01 data 2020-01-01 my 00002020-00002020-00002020 data 00002020-00002020-00002020 my 2120-01-01 data 2020-01-01 my 00002120-00002120-00002120 data 00002120-00002120-00002120 my data 2020-01-01 my data 00002020-00002020-00002020 my data 2020-02-01 my data 00002020-00002020-00002020 my magic 12 test my magic 00000012 test my test my test Pthw068bf Pthw00000068bf RHrIt72402eaLr RHrIt00072402eaLr test 1 test 00000001 utjtV03Xns utjtV00000003Xns uxWLO039hb uxWLO00000039hb VHCok55901XYVk VHCok00055901XYVk vTg946 vTg00000946 WffSPaBA318 WffSPaBA00000318 wrOae537LGCT wrOae00000537LGCT yDKrkCp7960 yDKrkCp00007960
It breaks on the rows:
data_value | transformed_column |
---|---|
my 2020-02-01 data 1-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 12-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
my 2020-02-01 data 2-01-01 | my 00002020-00002020-00002020 data 00002020-00002020-00002020 |
As you can see, but overall I think it will suit our needs for now!
This was based on a LPAD
example I saw. But it had the same flaw that it only worked on Strings with numbers at the beginning.
So I took that principle and applied it to all number series.
Also this now uses a generated column so that it is fast on reads (although we could soon make it part of the query if it didn't hurt read performance too much or we were more bothered about write performance).
The big issue is the fact that this works through "brute force" by working around the fact that MySQL sees "11" as smaller than "2" due to checking one character at a time (the dumbed down version of what happens) by making all numbers 8 digits long.
It also replaces all numbers in the transformed_column
with the first number it finds and as far as I can tell, other than nesting about 20 statements and using coalesce
or something, there is no way to avoid this behaviour I could find.
Conclusion
I am not really sure? MySQL should join the 21st century and provide an alphanumeric sort option like other DBs?
While we wish for things to happen I think the short answer in the scope of this article is that I still haven't solved this.
But with that being said, I feel like test 6 is much closer to the answer than anything I found out in the wild and may be useful to someone someday?
Obviously if you are expecting numbers longer than 8 digits, you may want to adjust the LPAD
value as that is one limitation.
Also it isn't perfect so you can't rely on it for scenarios other than sorting user input so they can find things easily.
I think at this point my best bet is to throw it out into the world and see if some smarter people than me can solve this thing properly! (I am probably needing some much deeper knowledge of sorting in MySQL to really solve it!)
If someone can solve the problem I have where I can't change every number to 8 digits long, then this would work even better! (so "my 2020-02-01 data 2-01-01" would become "my 00002020-00000002-00000001 data 00000002-00000001-00000001").
Or maybe there is a literal one liner that I missed in all my research that solves this? (I will only cry a little bit if there is, I promise!).
Anyway, that is the best I have, but would love to see something better!
Finally here is a DB Fiddle to play with
Here is a DB fiddle of the generated column version to start from if you think you can spot a quick win / improvement:
https://www.db-fiddle.com/f/o2ohcGVAgHZQg4teg1s9jW/1034
Thanks for reading, catch you all soon!
Top comments (9)
Now, that is a dedication to duty. Personally I'd have dropped a support call into Oracle and told the QA team that you are waiting on a reply, pretty sure you could string that along for a fair few months... ;)
Haha, might just do that! 🤣💗
Guarantee someone shows me how to do this in one line and I end up deleting the post or putting "here is the answer, ignore the rest of the article" at the top! haha
You said one line but hadn't specify how long can it be 🌝
Jokes aside it's not that "natural" as you might expect. I mean sire 10 is "greater" than "2" but it's the same order we use when ordering strings. E.g.
Imagine we accept as a society that when you combine
h
withi
it suddenly becomes "greater" in some dimension thanhk
hence you expectweird, isn't it?
I understand your struggle nevertheless and I can think of two ways of solving it; create a view pre-calculating the order (do you can consume it from different places) or just let it be and, as you said, sort it this way in JS/PHP or whatever 😅 I'd probably go for the latter.
Haha exactly, looks like that is the way for now, but I will have a see if I can hack together some impractical monstrosity that does it purely in MySQL and works perfectly...even if it is 200 times lower than JS / PHP / whatever! haha.
Now that you made me think about it... What about PL SQL ordering the substring of numbers dynamically with regex?
I've no computer RN to test
… and for readers who just want to do it in JS:
Exactly, way easier than the way I cam up with to actually do this in MySQL. 💗
Nice work as always Graham!!! Way to follow through while running into some elegant queries to keep in your toolbox!