
Removing non-ASCII characters

I was parsing XML files, but sometimes the parser would error out because there were some non-ASCII characters in the input. To deal with this, I could have written a simple C program to strip those characters. But there had to be an easier way. After some effort (more effort than it would have taken me to write the equivalent C program), I came up with a solution.

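$ ex -c '%s/[^[:alnum:][:punct:][:space:]]/ /g|wq' "$HOME/data.xml"

This command processes the file data.xml. First it replaces every character that is not alphanumeric, punctuation, or whitespace with a space. The % range applies the substitution to every line; without it, ex changes only the current line. Then it writes the changes back to the file and quits. It does all this in one step, and you don't need to resort to programming. The character classes can depend on the ex implementation and the current locale, so in some setups accented letters may count as alphanumeric and be left in place. Caution: ex loads the file into an edit buffer, so the command may be slow or fail outright on a very large file.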
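For files too large for ex, a streaming filter is one possible alternative. The sketch below swaps in tr instead of ex, so it is not the command above: it works byte by byte in the C locale and keeps only tab, newline, carriage return, and printable ASCII. The function name strip_non_ascii is made up for illustration.

```shell
# Hypothetical helper (not part of the original ex approach):
# replace every byte outside tab, LF, CR, and printable ASCII with a space.
# LC_ALL=C makes tr treat the input as single bytes.
strip_non_ascii() {
    LC_ALL=C tr -c '\11\12\15\40-\176' ' '
}

# Example: the two-byte UTF-8 sequence for "e with acute" becomes two spaces.
printf 'caf\303\251 <b>ok</b>\n' | strip_non_ascii
```

Unlike the ex version, this reads standard input and writes standard output, so redirect to a new file (for example strip_non_ascii < data.xml > clean.xml) rather than editing in place. A multi-byte character turns into one space per byte, not a single space.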

This was a good exercise because I've used this idea on a number of other occasions:

$ ex -c '%s/[[:space:]]*$//|wq' file # remove trailing whitespace on each line
$ ex -c 'g/^[[:space:]]*$/d|wq' file # delete blank lines from file
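
The same two cleanups can also be done as a stream, which may help when the input comes from a pipe or is too big to edit in place. This is a minimal sketch using sed in place of ex; the function name tidy_lines is hypothetical.

```shell
# Hypothetical helper: strip trailing whitespace from each line,
# then delete any line that is left empty.
tidy_lines() {
    sed -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Example input: a line with trailing spaces, an empty line,
# a whitespace-only line, and a clean line.
printf 'one  \n\n \t\ntwo\n' | tidy_lines
```

Because the trailing whitespace is removed first, whitespace-only lines become empty and are caught by the second expression.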
rishi.goel@alumni.usc.edu