• Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

    From Kenny McCormack@21:1/5 to All on Fri Mar 11 15:37:52 2022
    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and
    everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my pipelines the command: iconv -c
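    For concreteness, the sort of pipeline I mean might look like this (the
    URL is just a placeholder, and this assumes the input really is UTF-8):

        curl -s 'https://example.com/some/page' | iconv -c -f utf-8 -t ascii

    With -c, any character that can't be represented in the target set is
    silently discarded instead of aborting the conversion.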

    This works reasonably well, but just doesn't quite feel entirely correct;
    hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of other command line options and/or environment variables that you can set to control it - but it seems to work well enough for me just using the "-c" option.

    --
    Donald Drumpf claims to be "the least racist person you'll ever meet".

    This would be true if the only other person you've ever met was David Duke.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bit Twister@21:1/5 to Kenny McCormack on Fri Mar 11 11:07:07 2022
    On Fri, 11 Mar 2022 15:37:52 -0000 (UTC), Kenny McCormack wrote:
    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct; hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of other command line options and/or environment variables that you can set to control it - but it seems to work well enough for me just using the "-c" option.

    To get the point of a linux app, and/or a slightly better understanding
    of a linux app, I will try the man page for the app. Example: man iconv

    After that there is a search feature in google.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to BitTwister@mouse-potato.com on Fri Mar 11 18:21:25 2022
    In article <slrnt2n0dr.u52e.BitTwister@wb.home.test>,
    Bit Twister <BitTwister@mouse-potato.com> wrote:
    ...
    To get the point of a linux app and/or a slightly better understanding
    of a linux app, I will try the man page for the app. Example: man iconv

    The point isn't to learn how iconv works. I know how iconv works.
    I just don't think it works very well when you just want to get rid of all
    the junk. I.e., iconv seems to be solving a different problem than the one
    I am talking about.

    Like I said, I know how to read man pages (like, duh...) and I know how to
    use iconv; I just wish there was a better and more "normal" solution to the problem. I hardly think I am alone in wanting this.

    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/Aspergers

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Espen@21:1/5 to Kenny McCormack on Fri Mar 11 14:42:58 2022
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct; hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of other command line options and/or environment variables that you can set to control it - but it seems to work well enough for me just using the "-c" option.

    You are using iconv but don't feel it's correct, yet you haven't shown us any examples of what you are trying to do or how it's failing.

    I'll have to guess you've asked it to convert utf8 to ascii.
    Maybe you did something else. The man page shows an example making the
    target "ASCII//TRANSLIT". The translit part sounds like it might help.

    iconv sounds to me like the right tool. If there are other multi-byte sequences you want to handle, something like sed can do the job.
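    For example, a sed pass that folds the common "smart" quotes down to
    plain ASCII might look something like this (a sketch, assuming GNU sed
    and a UTF-8 locale; the bracket expressions contain literal UTF-8
    characters):

        sed -e "s/[‘’]/'/g" -e 's/[“”]/"/g'

    Run ahead of a bare iconv -c, it rewrites those sequences instead of
    letting them be dropped.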

    --
    Dan Espen

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Kenny McCormack on Fri Mar 11 20:08:12 2022
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and everything was good and proper.

    Maybe you didn't look very hard. In the bad old days there were
    inconsistent 8-bit versions of just about everything and (though there
    were more) two common, widely used, incompatible character sets.

    And don't get me started on file types!

    But those days are gone.

    Now those bad days are largely gone. Almost everything is a stream of
    bytes, and there is one almost universally agreed character set. And
    almost all protocols get to announce the encoding they are using so you
    don't have to guess anymore.

    Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web.

    What on earth is i18n glop?

    In
    particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    "Good old ASCII" had two single quotes -- an opening one and a closing
    one -- though these were secondary meanings. The "closing quote" (also
    called acute accent) is more usually referred to as apostrophe. The
    modern rendering of it as vertical belies its original purpose.

    There are many ways to represent that character (&#39; for example), but
    iconv -c won't handle these different representations of code point 39
    (\047).
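    (Decoding an HTML entity like that would need a separate textual pass,
    e.g. something along the lines of

        sed "s/&#39;/'/g"

    since iconv converts between character encodings and knows nothing
    about HTML markup.)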

    But there are also other characters that better fit the description of
    "single quote". These used to be very common on the Web because, in a
    twist of fate, Windows software often uses a closing single quote as an
    apostrophe. I don't see that nearly as often these days. Maybe this is
    one of the 9 you see?

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct; hence my reason for posting this thread.

    It's not clear what you want and it's not clear what the source data
    looks like. Do you take into account any declared character set
    headers? If so, converting to UTF-8 would probably avoid the need to
    discard anything in the input.
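    For instance, when the source does declare its encoding, something like
    the following converts it losslessly (windows-1252 is just an example
    value taken from a hypothetical Content-Type header):

        iconv -f windows-1252 -t utf-8 < input.txt > output.txt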

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to dan1espen@gmail.com on Fri Mar 11 20:37:26 2022
    In article <t0g8o2$k88$1@dont-email.me>,
    Dan Espen <dan1espen@gmail.com> wrote:
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and
    everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In
    particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my
    pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct;
    hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of
    other command line options and/or environment variables that you can set to
    control it - but it seems to work well enough for me just using the "-c"
    option.

    You are using iconv but don't feel it's correct, yet you haven't shown us any examples of what you are trying to do or how it's failing.

    I'll have to guess you've asked it to convert utf8 to ascii.
    Maybe you did something else. The man page shows an example making the
    target "ASCII//TRANSLIT". The translit part sounds like it might help.

    iconv sounds to me like the right tool. If there are other multi-byte
    sequences you want to handle, something like sed can do the job.

    I get what you are saying. But I just want something that will remove
    any/all high-ASCII junk. It *might* be as simple as writing a
    search-and-replace script in your-favorite-scripting-language (in my
    case, that would be AWK) to remove any character with ASCII value > 127.
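    A minimal sketch of that AWK idea (assuming a typical awk; LC_ALL=C
    makes it operate on raw bytes rather than multi-byte characters):

        LC_ALL=C awk 'BEGIN { high = "[\200-\377]" }   # bytes 128..255
                      { gsub(high, ""); print }' < dirty.txt > clean.txt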

    But, my feeling is that there must be something better.

    Also, my sense is that iconv doesn't do enough. Sometimes, even after
    running it through iconv, you'll still see non-ASCII junk in the file.
    Also, as I mentioned, the main character that seems to have a problem is
    the ' character. It'd be nice if some commonly accepted solution would at least handle all the mis-codings of that character.

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    --
    "There's no chance that the iPhone is going to get any significant market share. No chance." - Steve Ballmer

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Kenny McCormack on Fri Mar 11 21:29:05 2022
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    ... But I just want something that will remove
    any/all high-ASCII junk.

    That's a clearer statement of what you want (minus the trolling "junk").
    Isn't

    tr -d '\200-\377'

    what you want?
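    A quick sanity check, where the two escapes are the UTF-8 bytes of an
    e-acute:

        printf 'caf\303\251\n' | tr -d '\200-\377'

    prints just "caf".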

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    I've never come across a need for throwing characters away. Very often
    the "junk" is there for a purpose.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to Computer Nerd Kev on Fri Mar 11 21:58:41 2022
    In article <t0gf09$12jr$1@gioia.aioe.org>,
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    ...
    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    Thanks. I'll try that at some point.

    (I think you still need -c, or else it will error out when it sees
    something unexpected - which, of course, can and does always happen in real life).
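    I.e., presumably the combined incantation would be:

        iconv -c -f utf-8 -t ASCII//TRANSLIT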

    --
    "If our country is going broke, let it be from feeding the poor and caring for the elderly. And not from pampering the rich and fighting wars for them."

    --Living Blue in a Red State--

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Computer Nerd Kev@21:1/5 to Kenny McCormack on Fri Mar 11 21:29:46 2022
    Kenny McCormack <gazelle@shell.xmission.com> wrote:
    In article <t0g8o2$k88$1@dont-email.me>,
    Dan Espen <dan1espen@gmail.com> wrote:
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and
    everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In
    particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my
    pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct;
    hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of
    other command line options and/or environment variables that you can set to
    control it - but it seems to work well enough for me just using the "-c"
    option.

    You are using iconv but don't feel it's correct, yet you haven't shown us any
    examples of what you are trying to do or how it's failing.

    I'll have to guess you've asked it to convert utf8 to ascii.
    Maybe you did something else. The man page shows an example making the
    target "ASCII//TRANSLIT". The translit part sounds like it might help.

    iconv sounds to me like the right tool. If there are other multi-byte
    sequences you want to handle, something like sed can do the job.

    I get what you are saying. But I just want something that will remove
    any/all high-ASCII junk. It *might* be as simple as writing a
    search-and-replace script in your-favorite-scripting-language (in my
    case, that would be AWK) to remove any character with ASCII value > 127.

    If you mean more like "search and delete", then I've seen this in
    the past and kept it in mind for any case where iconv isn't
    available:
    tr -cd "\11\12\15\40-\176"
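    That keeps tab, newline, carriage return, and the printable ASCII range,
    and deletes every other byte. For example:

        printf 'it\342\200\231s\n' | tr -cd '\11\12\15\40-\176'

    prints "its", the three bytes of the curly apostrophe having been
    deleted.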

    But, my feeling is that there must be something better.

    Also, my sense is that iconv doesn't do enough. Sometimes, even after running it through iconv, you'll still see non-ASCII junk in the file.
    Also, as I mentioned, the main character that seems to have a problem is
    the ' character. It'd be nice if some commonly accepted solution would at least handle all the mis-codings of that character.

    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    Compared to the tr command, the TRANSLIT functionality (search and
    replace instead of search and delete) is very nice. Perhaps a way
    to add custom rules for converting characters would be an
    improvement, though not one that I frequently desire.

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    Sorry, just another iconv person, but faring quite well with it
    anyway.

    --
    __ __
    #_ < |\| |< _#

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to Computer Nerd Kev on Mon Mar 14 13:44:36 2022
    In article <t0gf09$12jr$1@gioia.aioe.org>,
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    ...
    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    I have put this line into production and it seems to work well. Thanks again.

    However, it still just doesn't feel right. I mean, look at all those
    "funny constants". What is "utf-8"? What is "ASCII"? What is "//"? What
    is "TRANSLIT"? Yes, these are all rhetorical questions, but the point is
    that a novice (and for the purposes of this particular area of discussion,
    you can consider me to be a novice) would have no idea what these things
    mean or what will need to be changed as time goes on.

    That was the point (the "Reason for posting") of this thread. That there should be a more "macro" type solution. More of a "You're the computer;
    you figure it out" type solution.

    But, apparently, there isn't.

    Compared to the tr command, the TRANSLIT functionality (search and
    replace instead of search and delete) is very nice. Perhaps a way
    to add custom rules for converting characters would be an
    improvement, though not one that I frequently desire.

    As I said, it works. As long as you are going utf8 (whatever that is; yes,
    I'm kidding) to ASCII (I know what that is). What if the next time I get
    some data for this system, it is in utf-9 (or utf-10 or whatever) ?

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    Sorry, just another iconv person, but faring quite well with it
    anyway.

    Alas, that seems to be as far as it goes...

    --
    http://www.rollingstone.com/politics/news/the-10-dumbest-things-ever-said-about-global-warming-20130619

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Espen@21:1/5 to Kenny McCormack on Mon Mar 14 10:02:52 2022
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    In article <t0gf09$12jr$1@gioia.aioe.org>,
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    ...
    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    I have put this line into production and it seems to work well. Thanks again.

    However, it still just doesn't feel right. I mean, look at all those
    "funny constants". What is "utf-8"? What is "ASCII"? What is "//"? What is "TRANSLIT"? Yes, these are all rhetorical questions, but the point is that a novice (and for the purposes of this particular area of discussion, you can consider me to be a novice) would have no idea what these things
    mean or what will need to be changed as time goes on.

    That was the point (the "Reason for posting") of this thread. That there should be a more "macro" type solution. More of a "You're the computer;
    you figure it out" type solution.

    But, apparently, there isn't.

    Since the solution was staring you in the face in the man page,
    I disagree. All that may be lacking is a more detailed explanation
    of what the example does, but
    "The next example converts from UTF-8 to ASCII, transliterating when
    possible:"
    seems pretty clear to me.

    Compared to the tr command, the TRANSLIT functionality (search and
    replace instead of search and delete) is very nice. Perhaps a way
    to add custom rules for converting characters would be an
    improvement, though not one that I frequently desire.

    As I said, it works. As long as you are going utf8 (whatever that is; yes, I'm kidding) to ASCII (I know what that is). What if the next time I get some data for this system, it is in utf-9 (or utf-10 or whatever) ?

    If you knew what utf-8 was, it's hard to imagine why you would mention non-existent code pages.

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    Sorry, just another iconv person, but faring quite well with it
    anyway.

    Alas, that seems to be as far as it goes...

    Declaring a problem when none exists? Submit the man page correction if
    you think something is missing.

    --
    Dan Espen

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to All on Mon Mar 14 14:10:44 2022
    In article <t0nhuc$3jp$1@dont-email.me>,
    Dan Espen <dan1espen@gmail.com> demonstrates that he has totally missed
    the point of my subtle, but amusing little contribution:

    etc, etc
    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/Seriously

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Mon Mar 14 19:24:45 2022
    On 14.03.2022 14:44, Kenny McCormack wrote:
    In article <t0gf09$12jr$1@gioia.aioe.org>,
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    ...
    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    However, it still just doesn't feel right. I mean, look at all those
    "funny constants". What is "utf-8"? What is "ASCII"? What is "//"? What is "TRANSLIT"? Yes, these are all rhetorical questions, but the point is that a novice (and for the purposes of this particular area of discussion, you can consider me to be a novice) would have no idea what these things
    mean or what will need to be changed as time goes on.

    And in your original post you wrote: "Back in the good old days, the
    Internet was simple, 7 bit ASCII, and everything was good and proper."

    In my book that boils down to two observations.
    From an isolated point of view, a US-centric/US-only view, that may
    make sense. From a, say, EU view we can say we left the Stone Age and
    are now able to express our languages with Unicode and communicate
    all over the world and across borders. There's a universal character
    set defined, and a quasi-standard encoding based on (quasi-standard)
    units (octets, or "bytes" if someone prefers a less specific term).
    The second observation is the inherent complexity; the character-sets
    topic is not trivial (and still not every tool supports it correctly).
    And such a tool, like iconv, reflects that complexity (to a degree).
    While "code-page" mappings (say, Windows to ISO Latin) are simple[*],
    other "conversions", like transliterations that go beyond a purely
    technical mapping, are generally not that trivial. (And yes, the '//'
    delimiter is not common (and maybe there are better choices?). OTOH,
    Unix is full of inconsistent syntax, and this one is harmless compared
    to some other syntax variants, like the inconsistent option styles
    (-o, -opt, --opt, opt=, etc.) across many of the Unix tools.)

    [*] Remember those "good ol' days" when, for conversion, we had (only?)
    the 'dd' command, which was able to convert from/to EBCDIC.


    That was the point (the "Reason for posting") of this thread. That there should be a more "macro" type solution.

    What would a >>"macro" type solution<< look like? (I mean, we can hide
    ugly syntax issues for special-purpose applications in wrapper scripts
    or functions.)
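    (A sketch of such a wrapper, purely illustrative; the function name and
    the default source encoding are made up:

        # "deglop": best-effort plain-ASCII filter for stdin; the source
        # encoding defaults to utf-8 but can be given as the first argument
        deglop() {
            iconv -c -f "${1:-utf-8}" -t ASCII//TRANSLIT
        }

    Usage would then be, e.g., "deglop < infile" or "deglop latin1 < infile".)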

    More of a "You're the computer; you figure it out" type solution.

    There's not enough information in the data that allows "the computer"
    to determine the code page; you need meta-data for it; the from/to
    arguments in iconv, for example.


    Yes, "the world" (including "the Internet") was simpler before, yet not reflecting the global demands.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)