Java remove non-printable non-ascii characters using regex

Java example that uses regular expressions to find and remove unwanted non-ASCII and non-printable characters from text file content.

Java Clean ASCII Text Files

Unwanted non-ASCII characters can end up in a file or string in a variety of ways, e.g. when text is copied and pasted from an MS Word document or a web browser, or produced by PDF-to-text or HTML-to-text conversion. We may want to remove these characters, along with non-printable ones, before using the file in an application, because they tend to cause problems once we start processing the file's content.

In this Java regex example, I am using regular expressions to search for and remove non-ASCII characters as well as non-printable characters.

1. Java remove non-printable characters

Java method to strip unwanted and non-printable characters from a string.

    private static String cleanTextContent(String text) {
        // strips off all non-ASCII characters
        text = text.replaceAll("[^\\x00-\\x7F]", "");

        // erases all the ASCII control characters except \r, \n and \t
        text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

        // removes remaining non-printable Unicode characters (category C);
        // \r, \n and \t are excluded because plain \p{C} would match them too
        text = text.replaceAll("[\\p{C}&&[^\r\n\t]]", "");

        return text.trim();
    }
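To see what each expression matches on its own, here is a small sketch that applies the patterns one at a time to a sample string. The class name RegexStepsDemo, the helpers strip() and visible(), and the sample text are all made up for this illustration. The last two calls compare plain \p{C} with a variant that excludes \r, \n and \t, since those three are control characters and therefore fall into category C as well.

```java
class RegexStepsDemo {

    // Hypothetical helper: removes everything matched by a single pattern.
    static String strip(String text, String regex) {
        return text.replaceAll(regex, "");
    }

    // Hypothetical helper: prints invisible and non-ASCII chars as hex escapes
    // so the effect of each pattern can be read on the console.
    static String visible(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 0x20 || c > 0x7E) {
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample (an assumption for this demo): two accented letters, BEL (U+0007),
        // a zero-width space (U+200B, a Unicode format character) and a tab.
        String sample = "fun\u00E7\u00E3o\u0007 a\u200Bb\tc";

        // Non-ASCII only: drops the accented letters and the zero-width space
        System.out.println(visible(strip(sample, "[^\\x00-\\x7F]")));

        // ASCII control characters except CR, LF and TAB: drops BEL only
        System.out.println(visible(strip(sample, "[\\p{Cntrl}&&[^\r\n\t]]")));

        // Unicode category C: drops BEL, the zero-width space and also the tab
        System.out.println(visible(strip(sample, "\\p{C}")));

        // Category C minus CR, LF and TAB: drops BEL and the zero-width space
        System.out.println(visible(strip(sample, "[\\p{C}&&[^\r\n\t]]")));
    }
}
```

Running the patterns separately like this makes it easier to decide which of them a given data set actually needs.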

2. Remove non-printable characters example

2.1. File content with non-ascii content

I will read a file with the following content and remove all non-ASCII characters as well as non-printable characters.

öäü how to do in java . com A função, Ãugent

2.2. Java program to clean ASCII text

    package com.howtodoinjava.demo;

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class CleanTextExample
    {
        public static void main(String[] args)
        {
            File file = new File("c:/temp/data.txt");

            String uncleanContent = readFileIntoString(file);
            System.out.println(uncleanContent);

            String cleanContent = cleanTextContent(uncleanContent);
            System.out.println(cleanContent);
        }

        // reads the file line by line and joins the lines with "\n"
        private static String readFileIntoString(File file)
        {
            StringBuilder contentBuilder = new StringBuilder();
            try (Stream<String> stream = Files.lines(Paths.get(file.toURI())))
            {
                stream.forEach(s -> contentBuilder.append(s).append("\n"));
            }
            catch (IOException e)
            {
                System.out.println("Error reading " + file.getAbsolutePath());
            }
            return contentBuilder.toString();
        }

        private static String cleanTextContent(String text)
        {
            // strips off all non-ASCII characters
            text = text.replaceAll("[^\\x00-\\x7F]", "");

            // erases all the ASCII control characters except \r, \n and \t
            text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

            // removes remaining non-printable Unicode characters (category C);
            // \r, \n and \t are excluded because plain \p{C} would match them too
            text = text.replaceAll("[\\p{C}&&[^\r\n\t]]", "");

            return text.trim();
        }
    }

Program Output.

öäü how to do in java . com A função, Ãugent

how to do in java . com A funo, ugent
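If the cleaned text should be saved rather than printed, the same idea can be wrapped in a small file-to-file helper. The following is only a sketch under stated assumptions: CleanFileSketch and cleanFile() are hypothetical names, the input file is assumed to be UTF-8 encoded, temporary files stand in for a real path, and the cleaner is a copy of the three regex steps in a form that keeps line breaks and tabs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CleanFileSketch {

    // Copy of the three cleaning steps; CR, LF and TAB are preserved.
    static String cleanTextContent(String text) {
        text = text.replaceAll("[^\\x00-\\x7F]", "");
        text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
        text = text.replaceAll("[\\p{C}&&[^\r\n\t]]", "");
        return text.trim();
    }

    // Hypothetical helper: reads 'source' as UTF-8 (an assumption about the input),
    // cleans it, and writes the pure-ASCII result to 'target'.
    static void cleanFile(Path source, Path target) throws IOException {
        String raw = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        String clean = cleanTextContent(raw);
        Files.write(target, clean.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        // Temporary files are used here so the sketch runs anywhere.
        Path source = Files.createTempFile("unclean", ".txt");
        Path target = Files.createTempFile("clean", ".txt");

        Files.write(source, "A fun\u00E7\u00E3o\u0007\nline two\n".getBytes(StandardCharsets.UTF_8));
        cleanFile(source, target);

        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.US_ASCII));
    }
}
```

Writing to a separate target file keeps the original data intact, which is useful while the regular expressions are still being tuned.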

Feel free to modify the cleanTextContent() method to suit your needs, adding or removing regular expressions as required.
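As one possible modification, accented letters can be reduced to their base letters before the non-ASCII step, so that a word such as "função" becomes "funcao" instead of losing characters. The sketch below is not part of the original program: TransliterateSketch and cleanKeepingBaseLetters() are hypothetical names, and it relies on java.text.Normalizer to decompose characters first. Letters without a decomposition (for example "ß" or "ø") would still be removed by the non-ASCII step.

```java
import java.text.Normalizer;

class TransliterateSketch {

    // Hypothetical variant of the cleaning method that keeps base letters.
    static String cleanKeepingBaseLetters(String text) {
        // NFD splits an accented letter into base letter + combining mark(s)
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);

        // drop the combining marks (Unicode category M)
        String withoutMarks = decomposed.replaceAll("\\p{M}", "");

        // drop whatever non-ASCII characters are still left
        String ascii = withoutMarks.replaceAll("[^\\x00-\\x7F]", "");

        // drop ASCII control characters except CR, LF and TAB
        return ascii.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "").trim();
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source file encoding does not matter.
        String sample = "\u00F6\u00E4\u00FC A fun\u00E7\u00E3o";
        System.out.println(cleanKeepingBaseLetters(sample)); // prints "oau A funcao"
    }
}
```

Whether this is preferable depends on the data: dropping diacritics can change the meaning of words in some languages, so plain removal may still be the safer choice for some inputs.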

Happy Learning !!
