Cleaning up the junk you get from the web these days.
Is there a normalized way to do it?
Back in the good old days, the Internet was simple, 7-bit ASCII, and
everything was good and proper. But those days are gone. Nowadays,
there is all this i18n glop in the strings we get from The Internet/The
Web. In particular, there seems to be about 9 different ways to
represent the simple "single quote" character (normally represented as
"\047").
So, it has become a normal part of my processing to get rid of this
glop. The tool that I've ended up using is "iconv", and I usually put
somewhere in my pipelines the command:

  iconv -c
This works reasonably well, but it just doesn't quite feel entirely
correct; hence my reason for posting this thread. Note that I don't
really understand the full logic or point of iconv, and I think there
are lots of other command line options and/or environment variables you
can set to control it, but it seems to work well enough for me just
using the "-c" option.
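For context, "-c" tells iconv to silently discard any character that
cannot be converted, rather than stopping with an error. A minimal
sketch of the kind of pipeline being described, assuming UTF-8 input
(the URL variable and file name here are hypothetical):

  # fetch web text and force it down to plain ASCII;
  # -c drops anything that won't convert instead of aborting
  curl -s "$url" | iconv -c -f utf-8 -t ascii > cleaned.txt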
To get the point of a Linux app, or at least a slightly better
understanding of it, I will try the man page for the app. Example:

  man iconv
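Along the same lines, the man page documents that iconv can list what
it supports. A small sketch, assuming GNU iconv (the exact output
varies by system):

  # list every encoding name iconv accepts, then pick out the ASCII ones
  iconv -l | grep -i ascii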
gazelle@shell.xmission.com (Kenny McCormack) writes:
> Back in the good old days, the Internet was simple, 7-bit ASCII, and
> everything was good and proper. But those days are gone. ...
>
> So, it has become a normal part of my processing to get rid of this
> glop. The tool that I've ended up using is "iconv", and I usually put
> somewhere in my pipelines the command: iconv -c
>
> This works reasonably well, but it just doesn't quite feel entirely
> correct; hence my reason for posting this thread. ... it seems to work
> well enough for me just using the "-c" option.
You are using iconv but don't feel it's correct, but you haven't shown
us any examples of what you are trying to do or how it's failing.

I'll have to guess you've asked it to convert utf8 to ascii.

Maybe you did something else. The man page shows an example making the
target "ASCII//TRANSLIT". The translit part sounds like it might help.

iconv sounds to me like the right tool. If there are other multi-byte
sequences you want to handle, something like sed can do the job.
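To make the sed suggestion concrete, here is a minimal sketch, assuming
GNU sed running in a UTF-8 locale; the curly-quote characters and file
names are only illustrative:

  # map the Unicode left/right single quotes down to a plain apostrophe
  sed "s/[‘’]/'/g" dirty.txt > clean.txt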
In article <t0g8o2$k88$1@dont-email.me>,
Dan Espen <dan1espen@gmail.com> wrote:
> gazelle@shell.xmission.com (Kenny McCormack) writes:
>
>> Back in the good old days, the Internet was simple, 7-bit ASCII, and
>> everything was good and proper. ... The tool that I've ended up using
>> is "iconv", and I usually put somewhere in my pipelines the command:
>> iconv -c ... it seems to work well enough for me just using the "-c"
>> option.
>
> You are using iconv but don't feel it's correct, but you haven't shown
> us any examples of what you are trying to do or how it's failing.
>
> I'll have to guess you've asked it to convert utf8 to ascii.
>
> Maybe you did something else. The man page shows an example making the
> target "ASCII//TRANSLIT". The translit part sounds like it might help.
>
> iconv sounds to me like the right tool. If there are other multi-byte
> sequences you want to handle, something like sed can do the job.
I get what you are saying. But I just want something that will remove
any/all high-ASCII junk. It *might* be as simple as writing a short
search-and-replace script in your-favorite-scripting-language (in my
case, that would be AWK) to remove any character with ASCII value > 127
(see the sketch below). But, my feeling is that there must be something
better.
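For what it's worth, a minimal sketch of that brute-force idea in AWK,
assuming the goal is to keep only printable ASCII plus tabs (the file
name is hypothetical):

  # LC_ALL=C makes the regex match bytes rather than multibyte chars;
  # delete every byte outside the printable-ASCII range (plus tab)
  LC_ALL=C awk '{ gsub(/[^\t -~]/, ""); print }' dirty.txt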
Also, my sense is that iconv doesn't do enough. Sometimes, even after
running it through iconv, you'll still see non-ASCII junk in the file.
And, as I mentioned, the main character that seems to have a problem is
the ' character. It'd be nice if some commonly accepted solution would
at least handle all the mis-codings of that character.

Anyway, I was curious to find out what other people use, and how they
have fared with this problem.
Computer Nerd Kev <not@telling.you.invalid> writes:

> ... But I just want something that will remove
> any/all high-ASCII junk.

I find iconv does the job perfectly, running it like this:

  iconv -f utf-8 -t ASCII//TRANSLIT

Compared to the tr command, the TRANSLIT functionality (search and
replace instead of search and delete) is very nice. Perhaps a way to
add custom rules for converting characters would be an improvement,
though not one that I frequently desire.

> Anyway, I was curious to find out what other people use, and how they
> have fared with this problem.

Sorry, just another iconv person, but faring quite well with it anyway.
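To illustrate the difference being drawn here (a sketch; the sample
string is illustrative, and the exact TRANSLIT output depends on the
iconv implementation and locale):

  # tr, search-and-delete: strip every byte with the high bit set
  $ echo "don’t" | LC_ALL=C tr -d '\200-\377'
  dont

  # iconv TRANSLIT, search-and-replace: approximate in plain ASCII
  $ echo "don’t" | iconv -f utf-8 -t ASCII//TRANSLIT
  don't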
In article <t0gf09$12jr$1@gioia.aioe.org>,
Computer Nerd Kev <not@telling.you.invalid> wrote:
...
> I find iconv does the job perfectly, running it like this:
>
>   iconv -f utf-8 -t ASCII//TRANSLIT
I have put this line into production and it seems to work well. Thanks again.
However, it still just doesn't feel right. I mean, look at all those
"funny constants". What is "utf-8"? What is "ASCII"? What is "//"?
What is "TRANSLIT"? Yes, these are all rhetorical questions, but the
point is that a novice (and for the purposes of this particular area of
discussion, you can consider me to be a novice) would have no idea what
these things mean or what will need to be changed as time goes on.

That was the point (the "reason for posting") of this thread: that
there should be a more "macro" type solution. More of a "You're the
computer; you figure it out" type solution.

But, apparently, there isn't.
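The nearest thing to a "you figure it out" solution might be to let a
detector guess the source encoding and hand that guess to iconv. A
sketch, assuming a file(1) that supports --mime-encoding; detection is
a heuristic and can guess wrong:

  # guess the input's encoding, then transliterate it to ASCII
  enc=$(file -b --mime-encoding dirty.txt)
  iconv -f "$enc" -t ASCII//TRANSLIT < dirty.txt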
> Compared to the tr command, the TRANSLIT functionality (search and
> replace instead of search and delete) is very nice. Perhaps a way
> to add custom rules for converting characters would be an
> improvement, though not one that I frequently desire.
As I said, it works. As long as you are going utf8 (whatever that is;
yes, I'm kidding) to ASCII (I know what that is). What if the next time
I get some data for this system, it is in utf-9 (or utf-10, or
whatever)?
>> Anyway, I was curious to find out what other people use, and how
>> they have fared with this problem.
>
> Sorry, just another iconv person, but faring quite well with it
> anyway.
Alas, that seems to be as far as it goes...