• Cleaning up the junk you get from the web these days. Is there a normalized way to do it?

    From Kenny McCormack@21:1/5 to All on Fri Mar 11 15:37:52 2022
    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and
    everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my pipelines the command: iconv -c
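    For concreteness, the sort of pipeline I mean might look like this (the
    URL is just a placeholder, and this assumes the input really is UTF-8):

        curl -s 'https://example.com/some/page' | iconv -c -f utf-8 -t ascii

    With -c, any character that can't be represented in the target set is
    silently discarded instead of aborting the conversion.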

    This works reasonably well, but just doesn't quite feel entirely correct;
    hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of other command line options and/or environment variables that you can set to control it - but it seems to work well enough for me just using the "-c" option.

    --
    Donald Drumpf claims to be "the least racist person you'll ever meet".

    This would be true if the only other person you've ever met was David Duke.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bit Twister@21:1/5 to Kenny McCormack on Fri Mar 11 11:07:07 2022
    On Fri, 11 Mar 2022 15:37:52 -0000 (UTC), Kenny McCormack wrote:
    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct; hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of other command line options and/or environment variables that you can set to control it - but it seems to work well enough for me just using the "-c" option.

    To get the point of a linux app, and/or a slightly better understanding
    of a linux app, I will try the man page for the app. Example: man iconv

    After that there is a search feature in google.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to BitTwister@mouse-potato.com on Fri Mar 11 18:21:25 2022
    In article <slrnt2n0dr.u52e.BitTwister@wb.home.test>,
    Bit Twister <BitTwister@mouse-potato.com> wrote:
    ...
    To get the point of a linux app and/or a slightly better understanding
    of a linux app, I will try the man page for the app. Example: man iconv

    The point isn't to learn how iconv works. I know how iconv works.
    I just don't think it works very well when you just want to get rid of all
    the junk. I.e., iconv seems to be solving a different problem than the one
    I am talking about.

    Like I said, I know how to read man pages (like, duh...) and I know how to
    use iconv; I just wish there was a better and more "normal" solution to the problem. I hardly think I am alone in wanting this.

    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/Aspergers

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Espen@21:1/5 to Kenny McCormack on Fri Mar 11 14:42:58 2022
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct; hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of other command line options and/or environment variables that you can set to control it - but it seems to work well enough for me just using the "-c" option.

    You are using iconv but don't feel it's correct, yet you haven't shown us any examples of what you are trying to do or how it's failing.

    I'll have to guess you've asked it to convert utf8 to ascii.
    Maybe you did something else. The man page shows an example making the
    target "ASCII//TRANSLIT". The translit part sounds like it might help.

    iconv sounds to me like the right tool. If there are other multi-byte sequences you want to handle, something like sed can do the job.
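    For example, a sed pass that folds the common "smart" quotes down to
    plain ASCII might look something like this (a sketch, assuming GNU sed
    and a UTF-8 locale; the bracket expressions contain literal UTF-8
    characters):

        sed -e "s/[‘’]/'/g" -e 's/[“”]/"/g'

    Run ahead of a bare iconv -c, it rewrites those sequences instead of
    letting them be dropped.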

    --
    Dan Espen

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Kenny McCormack on Fri Mar 11 20:08:12 2022
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and everything was good and proper.

    Maybe you didn't look very hard. In the bad old days there were
    inconsistent 8-bit versions of just about everything and (though there
    were more) two common, widely used, incompatible character sets.

    And don't get me started on file types!

    But those days are gone.

    Now those bad days are largely gone. Almost everything is a stream of
    bytes, and there is one almost universally agreed character set. And
    almost all protocols get to announce the encoding they are using so you
    don't have to guess anymore.

    Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web.

    What on earth is i18n glop?

    In
    particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    "Good old ASCII" had two single quotes -- an opening one and a closing
    one -- though these were secondary meanings. The "closing quote" (also
    called acute accent) is more usually referred to as apostrophe. The
    modern rendering of it as vertical belies its original purpose.

    There are many ways to represent that character (&#39; for example), but
    iconv -c won't handle these different representations of code point 39
    (\047).
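    (Decoding an HTML entity like that would need a separate textual pass,
    e.g. something along the lines of

        sed "s/&#39;/'/g"

    since iconv converts between character encodings and knows nothing
    about HTML markup.)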

    But there are also other characters that better fit the description of
    "single quote". These used to be very common on the Web because, in a
    twist of fate, Windows software often uses a closing single quote as an
    apostrophe. I don't see that nearly as often these days. Maybe this is
    one of the 9 you see?

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct; hence my reason for posting this thread.

    It's not clear what you want and it's not clear what the source data
    looks like. Do you take into account any declared character set
    headers? If so, converting to UTF-8 would probably avoid the need to
    discard anything in the input.
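    For instance, when the source does declare its encoding, something like
    the following converts it losslessly (windows-1252 is just an example
    value taken from a hypothetical Content-Type header):

        iconv -f windows-1252 -t utf-8 < input.txt > output.txt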

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to dan1espen@gmail.com on Fri Mar 11 20:37:26 2022
    In article <t0g8o2$k88$1@dont-email.me>,
    Dan Espen <dan1espen@gmail.com> wrote:
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and
    everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In
    particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my
    pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct;
    hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of
    other command line options and/or environment variables that you can set to
    control it - but it seems to work well enough for me just using the "-c"
    option.

    You are using iconv but don't feel it's correct, yet you haven't shown us any examples of what you are trying to do or how it's failing.

    I'll have to guess you've asked it to convert utf8 to ascii.
    Maybe you did something else. The man page shows an example making the
    target "ASCII//TRANSLIT". The translit part sounds like it might help.

    iconv sounds to me like the right tool. If there are other multi-byte
    sequences you want to handle, something like sed can do the job.

    I get what you are saying. But I just want something that will remove
    any/all high-ASCII junk. It *might* be as simple as writing a
    search-and-replace script in your-favorite-scripting-language (in my
    case, that would be AWK) to remove any character with ASCII value > 127.
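    A minimal sketch of that AWK idea (assuming a typical awk; LC_ALL=C
    makes it operate on raw bytes rather than multi-byte characters):

        LC_ALL=C awk 'BEGIN { high = "[\200-\377]" }   # bytes 128..255
                      { gsub(high, ""); print }' < dirty.txt > clean.txt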

    But, my feeling is that there must be something better.

    Also, my sense is that iconv doesn't do enough. Sometimes, even after
    running it through iconv, you'll still see non-ASCII junk in the file.
    Also, as I mentioned, the main character that seems to have a problem is
    the ' character. It'd be nice if some commonly accepted solution would at least handle all the mis-codings of that character.

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    --
    "There's no chance that the iPhone is going to get any significant market share. No chance." - Steve Ballmer

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Ben Bacarisse@21:1/5 to Kenny McCormack on Fri Mar 11 21:29:05 2022
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    ... But I just want something that will remove
    any/all high-ASCII junk.

    That's a clearer statement of what you want (minus the trolling "junk").
    Isn't

    tr -d '\200-\377'

    what you want?
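    A quick sanity check, where the two escapes are the UTF-8 bytes of an
    e-acute:

        printf 'caf\303\251\n' | tr -d '\200-\377'

    prints just "caf".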

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    I've never come across a need for throwing characters away. Very often
    the "junk" is there for a purpose.

    --
    Ben.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to Computer Nerd Kev on Fri Mar 11 21:58:41 2022
    In article <t0gf09$12jr$1@gioia.aioe.org>,
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    ...
    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    Thanks. I'll try that at some point.

    (I think you still need -c, or else it will error out when it sees
    something unexpected - which, of course, can and does always happen in real life).
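    I.e., presumably the combined incantation would be:

        iconv -c -f utf-8 -t ASCII//TRANSLIT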

    --
    "If our country is going broke, let it be from feeding the poor and caring for the elderly. And not from pampering the rich and fighting wars for them."

    --Living Blue in a Red State--

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Computer Nerd Kev@21:1/5 to Kenny McCormack on Fri Mar 11 21:29:46 2022
    Kenny McCormack <gazelle@shell.xmission.com> wrote:
    In article <t0g8o2$k88$1@dont-email.me>,
    Dan Espen <dan1espen@gmail.com> wrote:
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    Cleaning up the junk you get from the web these days.
    Is there a normalized way to do it?

    Back in the good old days, the Internet was simple, 7 bit ASCII, and
    everything was good and proper. But those days are gone. Nowadays, there
    is all this i18n glop in the strings we get from The Internet/The Web. In
    particular, there seems to be about 9 different ways to represent the
    simple "single quote" character (normally represented as "\047").

    So, it becomes a normal part of my processing to get rid of this glop. The
    tool that I've ended up using is "iconv", and I usually put somewhere in my
    pipelines the command: iconv -c

    This works reasonably well, but just doesn't quite feel entirely correct;
    hence my reason for posting this thread. Note that I don't really
    understand the full logic or point of iconv, and I think there are lots of
    other command line options and/or environment variables that you can set to
    control it - but it seems to work well enough for me just using the "-c"
    option.

    You are using iconv but don't feel it's correct, yet you haven't shown us any
    examples of what you are trying to do or how it's failing.

    I'll have to guess you've asked it to convert utf8 to ascii.
    Maybe you did something else. The man page shows an example making the
    target "ASCII//TRANSLIT". The translit part sounds like it might help.

    iconv sounds to me like the right tool. If there are other multi-byte
    sequences you want to handle, something like sed can do the job.

    I get what you are saying. But I just want something that will remove
    any/all high-ASCII junk. It *might* be as simple as writing a
    search-and-replace script in your-favorite-scripting-language (in my
    case, that would be AWK) to remove any character with ASCII value > 127.

    If you mean more like "search and delete", then I've seen this in
    the past and kept it in mind for any case where iconv isn't
    available:
    tr -cd "\11\12\15\40-\176"
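    That keeps tab, newline, carriage return, and the printable ASCII range,
    and deletes every other byte. For example:

        printf 'it\342\200\231s\n' | tr -cd '\11\12\15\40-\176'

    prints "its", the three bytes of the curly apostrophe having been
    deleted.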

    But, my feeling is that there must be something better.

    Also, my sense is that iconv doesn't do enough. Sometimes, even after running it through iconv, you'll still see non-ASCII junk in the file.
    Also, as I mentioned, the main character that seems to have a problem is
    the ' character. It'd be nice if some commonly accepted solution would at least handle all the mis-codings of that character.

    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    Compared to the tr command, the TRANSLIT functionality (search and
    replace instead of search and delete) is very nice. Perhaps a way
    to add custom rules for converting characters would be an
    improvement, though not one that I frequently desire.

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    Sorry, just another iconv person, but faring quite well with it
    anyway.

    --
    __ __
    #_ < |\| |< _#

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to Computer Nerd Kev on Mon Mar 14 13:44:36 2022
    In article <t0gf09$12jr$1@gioia.aioe.org>,
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    ...
    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    I have put this line into production and it seems to work well. Thanks again.

    However, it still just doesn't feel right. I mean, look at all those
    "funny constants". What is "utf-8"? What is "ASCII"? What is "//"? What
    is "TRANSLIT"? Yes, these are all rhetorical questions, but the point is
    that a novice (and for the purposes of this particular area of discussion,
    you can consider me to be a novice) would have no idea what these things
    mean or what will need to be changed as time goes on.

    That was the point (the "Reason for posting") of this thread. That there should be a more "macro" type solution. More of a "You're the computer;
    you figure it out" type solution.

    But, apparently, there isn't.

    Compared to the tr command, the TRANSLIT functionality (search and
    replace instead of search and delete) is very nice. Perhaps a way
    to add custom rules for converting characters would be an
    improvement, though not one that I frequently desire.

    As I said, it works. As long as you are going utf8 (whatever that is; yes,
    I'm kidding) to ASCII (I know what that is). What if the next time I get
    some data for this system, it is in utf-9 (or utf-10 or whatever) ?

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    Sorry, just another iconv person, but faring quite well with it
    anyway.

    Alas, that seems to be as far as it goes...

    --
    http://www.rollingstone.com/politics/news/the-10-dumbest-things-ever-said-about-global-warming-20130619

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Dan Espen@21:1/5 to Kenny McCormack on Mon Mar 14 10:02:52 2022
    gazelle@shell.xmission.com (Kenny McCormack) writes:

    In article <t0gf09$12jr$1@gioia.aioe.org>,
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    ...
    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    I have put this line into production and it seems to work well. Thanks again.

    However, it still just doesn't feel right. I mean, look at all those
    "funny constants". What is "utf-8"? What is "ASCII"? What is "//"? What is "TRANSLIT"? Yes, these are all rhetorical questions, but the point is that a novice (and for the purposes of this particular area of discussion, you can consider me to be a novice) would have no idea what these things
    mean or what will need to be changed as time goes on.

    That was the point (the "Reason for posting") of this thread. That there should be a more "macro" type solution. More of a "You're the computer;
    you figure it out" type solution.

    But, apparently, there isn't.

    Since the solution was staring you in the face in the man page,
    I disagree. All that may be lacking is a more detailed explanation
    of what the example does, but
    "The next example converts from UTF-8 to ASCII, transliterating when
    possible:"
    seems pretty clear to me.

    Compared to the tr command, the TRANSLIT functionality (search and
    replace instead of search and delete) is very nice. Perhaps a way
    to add custom rules for converting characters would be an
    improvement, though not one that I frequently desire.

    As I said, it works. As long as you are going utf8 (whatever that is; yes, I'm kidding) to ASCII (I know what that is). What if the next time I get some data for this system, it is in utf-9 (or utf-10 or whatever) ?

    If you knew what utf-8 was, it's hard to imagine why you would mention non-existent code pages.

    Anyway, I was curious to find out what other people use, and how they have fared with this problem.

    Sorry, just another iconv person, but faring quite well with it
    anyway.

    Alas, that seems to be as far as it goes...

    Declaring a problem when none exists? Submit the man page correction if
    you think something is missing.

    --
    Dan Espen

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kenny McCormack@21:1/5 to All on Mon Mar 14 14:10:44 2022
    In article <t0nhuc$3jp$1@dont-email.me>,
    Dan Espen <dan1espen@gmail.com> demonstrates that he has totally missed
    the point of my subtle, but amusing little contribution:

    etc, etc
    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/Seriously

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to Kenny McCormack on Mon Mar 14 19:24:45 2022
    On 14.03.2022 14:44, Kenny McCormack wrote:
    In article <t0gf09$12jr$1@gioia.aioe.org>,
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    ...
    I find iconv does the job perfectly, running it like this:
    iconv -f utf-8 -t ASCII//TRANSLIT

    However, it still just doesn't feel right. I mean, look at all those
    "funny constants". What is "utf-8"? What is "ASCII"? What is "//"? What is "TRANSLIT"? Yes, these are all rhetorical questions, but the point is that a novice (and for the purposes of this particular area of discussion, you can consider me to be a novice) would have no idea what these things
    mean or what will need to be changed as time goes on.

    And in your original post you wrote: "Back in the good old days, the
    Internet was simple, 7 bit ASCII, and everything was good and proper."

    In my book that boils down to two observations.
    From an isolated point of view, a US-centric/US-only view, that may
    make sense. From a, say, EU view we can say we left the Stone Age and
    are now able to express our languages with Unicode and communicate
    all over the world and across borders. There's a universal character
    set defined, and a quasi-standard encoding based on (quasi-standard)
    units (octets, or "bytes" if someone prefers a less specific term).
    The second observation is the inherent complexity; the character-sets
    topic is not trivial (and still not every tool supports it correctly).
    And such a tool, like iconv, reflects that complexity (to a degree).
    While "code-page" mappings (say, Windows to ISO Latin) are simple[*],
    other "conversions", like transliterations that go beyond a purely
    technical mapping, are generally not that trivial. (And yes, the '//'
    delimiter is not common (and maybe there are better choices?). OTOH,
    Unix is full of inconsistent syntax, and this one is harmless compared
    to some other syntax variants, like the inconsistent option styles
    (-o, -opt, --opt, opt=, etc.) across many of the Unix tools.)

    [*] Remember those "good ol' days" when, for conversion, we had (only?)
    the 'dd' command, which was able to convert from/to EBCDIC.


    That was the point (the "Reason for posting") of this thread. That there should be a more "macro" type solution.

    What would a >>"macro" type solution<< look like? (I mean, we can hide
    ugly syntax issues for special-purpose applications in wrapper scripts
    or functions.)
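    (A sketch of such a wrapper, purely illustrative; the function name and
    the default source encoding are made up:

        # "deglop": best-effort plain-ASCII filter for stdin; the source
        # encoding defaults to utf-8 but can be given as the first argument
        deglop() {
            iconv -c -f "${1:-utf-8}" -t ASCII//TRANSLIT
        }

    Usage would then be, e.g., "deglop < infile" or "deglop latin1 < infile".)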

    More of a "You're the computer; you figure it out" type solution.

    There's not enough information in the data that allows "the computer"
    to determine the code page; you need meta-data for it; the from/to
    arguments in iconv, for example.


    Yes, "the world" (including "the Internet") was simpler before, yet not reflecting the global demands.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)