DEV Community

datatoinfinity

ASCII in NLP - NLP

ASCII (American Standard Code for Information Interchange) is a character encoding standard that represents 128 characters using 7 bits. These 128 characters include uppercase and lowercase letters, numbers, punctuation marks, and control characters.

While this is the technical definition (source: Google), let’s understand why ASCII is important in Natural Language Processing (NLP).

The Problem

Think of two situations:

  • Converting a number to binary
  • Converting text to binary

Converting numbers to binary is pretty straightforward:

 5 in binary = 101
 100 in binary = 1100100
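The two conversions above can be checked in Python. This is a minimal sketch; `to_binary` is a hypothetical helper name used only for illustration, and it relies on the built-in `format()` with the `"b"` format code.

```python
def to_binary(n: int) -> str:
    """Return the binary digits of a non-negative integer, without the '0b' prefix."""
    return format(n, "b")

print(to_binary(5))    # 101
print(to_binary(100))  # 1100100
```

The built-in `bin()` would also work, but it adds a `0b` prefix to the result.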

But converting text to binary involves an extra step:

  • First, convert each character to a number (using encoding like ASCII)
  • Then, convert that number to binary
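The two steps can be sketched in a few lines of Python. The function name `text_to_binary` is made up for this illustration, and padding each value to 7 bits is an assumption chosen to match ASCII's 7-bit width.

```python
def text_to_binary(text: str) -> list:
    """Map each ASCII character in text to a 7-bit binary string."""
    bits = []
    for ch in text:
        code = ord(ch)  # step 1: character -> number
        if code > 127:
            raise ValueError(f"{ch!r} is not an ASCII character")
        bits.append(format(code, "07b"))  # step 2: number -> binary, padded to 7 bits
    return bits

print(text_to_binary("Hi"))  # ['1001000', '1101001']
```

Here 'H' maps to 72 and 'i' maps to 105 before either is written in binary.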

This is exactly what happens in NLP and programming: instead of assigning numbers ourselves, we use ASCII, a standardized encoding for characters.

Python Code Example:

 print(ord('A'))
 print(ord('a'))
 print(ord('1'))
 print(ord(' '))
 print(chr(65))

 Output:
 65
 97
 49
 32
 A

Here:

ord() gives the numeric code of a character (for ASCII characters, this is the ASCII value)
chr() gives the character for a given numeric code
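To see where ASCII ends, here is a small sketch that tests whether a character's code falls in the 0 to 127 range. The helper `is_ascii_char` is a hypothetical name, not a built-in. In Python 3, `ord()` also accepts characters outside ASCII and returns their Unicode code point.

```python
def is_ascii_char(ch: str) -> bool:
    """Return True if the single character fits in ASCII's 0-127 range."""
    return ord(ch) < 128

# "\u00e9" is the letter e with an acute accent, which is not part of ASCII
for ch in ["A", " ", "\u00e9"]:
    code = ord(ch)
    # chr() undoes ord(), whether or not the character is ASCII
    assert chr(code) == ch
    print(ascii(ch), code, is_ascii_char(ch))
```

For whole strings, Python 3.7 and later also provide the built-in `str.isascii()` method.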

Now try to explain what's happening here:

 name = "John"
 ascii_values = [ord(char) for char in name]
 print(ascii_values)

 Output:
 [74, 111, 104, 110]
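The conversion also works in reverse. A minimal sketch, reusing the values printed above (the variable name `decoded` is only for illustration):

```python
ascii_values = [74, 111, 104, 110]

# chr() turns each number back into its character; join() glues them into one string
decoded = "".join(chr(value) for value in ascii_values)
print(decoded)  # John
```

Because every character maps to exactly one number, no information is lost when moving between text and its numeric form.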
