Forum: >>> Magnum BBS <<<

Code improvement question

From Mike Dewhirst@21:1/5 to All on Wed Nov 15 10:14:10 2023

I'd like to improve the code below, which works. It feels clunky to me.

I need to clean up user-uploaded files the size of which I don't know in advance.

After cleaning they might be as big as 1 MB but that would be super rare. Perhaps only for testing.

I'm extracting CAS numbers and here is the pattern: xx-xx-x up to
xxxxxxx-xx-x, e.g. 1012300-77-4

import re

def remove_alpha(txt):
    r""" r'[^0-9\- ]':
    [^...]: Match any character that is not in the specified set.
    0-9: Match any digit.
    \: Escape character.
    -: Match a hyphen.
    Space: Match a space.
    """
    cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
    bits = cleaned_txt.split()
    pieces = []
    for bit in bits:
        # minimum size of a CAS number is 7 so drop smaller clumps of digits
        pieces.append(bit if len(bit) > 6 else "")
    return " ".join(pieces)

Many thanks for any hints

Cheers

Mike

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MRAB@21:1/5 to Mike Dewhirst via Python-list on Tue Nov 14 23:25:10 2023

Why don't you use re.findall?

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
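A minimal sketch of the re.findall approach suggested here. One caveat: the pattern as posted ends in [0-9]{2}, whereas the format described in the original question (xx-xx-x) ends in a single digit, so as written it would not match the example 1012300-77-4. The sketch below therefore assumes one trailing digit. The names CAS_PATTERN and extract_cas_numbers and the sample text are made up for illustration.

```python
import re

# Assumption: 2-7 digits, 2 digits, then ONE final digit, matching the
# xx-xx-x .. xxxxxxx-xx-x shape given in the original question.
CAS_PATTERN = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b')

def extract_cas_numbers(txt):
    """Return every CAS-shaped string found in txt, in order of appearance."""
    return CAS_PATTERN.findall(txt)

# Hypothetical sample text, not taken from the thread.
sample = "Water 7732-18-5, formaldehyde 50-00-0, other 1012300-77-4; ref 12-3."
print(extract_cas_numbers(sample))
```

Unlike remove_alpha, this returns a list of matches rather than a cleaned string, so there is no need to strip letters first or to filter short clumps afterwards.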

From MRAB@21:1/5 to Mike Dewhirst via Python-list on Wed Nov 15 04:08:29 2023

On 2023-11-15 03:41, Mike Dewhirst via Python-list wrote:

I think I can see what you did there but it won't make sense to me - or whoever looks at the code - in future.

That answers your specific question. However, I am in awe of people who
can just "do" regular expressions and I thank you very much for what
would have been a monumental effort had I tried it.

That little re.sub() came from ChatGPT and I can understand it without
too much effort because it came documented

I suppose ChatGPT is the answer to this thread. Or everything. Or will be.

\b          Word boundary
[0-9]{2,7}  2..7 digits
-           "-"
[0-9]{2}    2 digits
-           "-"
[0-9]{2}    2 digits
\b          Word boundary

The "word boundary" thing is to stop it matching where there are letters
or digits right next to the digits.

For example, if the text contained, say, "123456789-12-1234", you
wouldn't want it to match because there are more than 7 digits at the
start and more than 2 digits at the end.
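To see the word-boundary effect described above, here is a small sketch that runs the posted pattern with and without \b on that same example string. The variable names are illustrative only.

```python
import re

pattern = r'[0-9]{2,7}-[0-9]{2}-[0-9]{2}'
text = "123456789-12-1234"

# Without \b the pattern matches a slice cut out of the longer digit runs.
bare = re.findall(pattern, text)

# With \b on both ends there is no acceptable place to start or stop.
bounded = re.findall(r'\b' + pattern + r'\b', text)

print(bare)     # ['3456789-12-12']
print(bounded)  # []
```

Note that \b only guards against letters, digits and underscores next to the match. A hyphen is not a word character, so a string such as "12-34-56-78" would still yield "12-34-56" with the bounded pattern.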

From Mike Dewhirst@21:1/5 to MRAB via Python-list on Wed Nov 15 14:41:20 2023

I think I can see what you did there but it won't make sense to me - or
whoever looks at the code - in future.

That answers your specific question. However, I am in awe of people who
can just "do" regular expressions and I thank you very much for what
would have been a monumental effort had I tried it.

That little re.sub() came from ChatGPT and I can understand it without
too much effort because it came documented

I suppose ChatGPT is the answer to this thread. Or everything. Or will be.

Thanks

Mike

From Rimu Atkinson@21:1/5 to All on Thu Nov 16 11:34:16 2023

Why don't you use re.findall?

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

I think I can see what you did there but it won't make sense to me - or whoever looks at the code - in future.

That answers your specific question. However, I am in awe of people who
can just "do" regular expressions and I thank you very much for what
would have been a monumental effort had I tried it.

I feel the same way about regex. If I can find a way to write something
without regex I very much prefer to as regex usually adds complexity and
hurts readability.

You might find https://regex101.com/ to be useful for testing your
regex. You can enter in sample data and see if it matches.

If I understood what your regex was trying to do I might be able to
suggest some Python to do the same thing. Is it just removing numbers
from text?

The for loop, "for bit in bits" etc, could be written as a list
comprehension.

pieces = [bit if len(bit) > 6 else "" for bit in bits]

For devs familiar with other languages but new to Python this will look
like gibberish so arguably the original for loop is clearer, depending
on your team.

It's worth making the effort to get into list comprehensions though
because they're awesome.
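As a small illustration of the comprehension above, the sketch below runs it on made-up input and also shows a filtering variant. The sample string and the name kept are assumptions for the example, not part of the original code.

```python
# Hypothetical cleaned text, already stripped of letters.
bits = "50-00-0 12 7732-18-5 3-4".split()

# Same result as the original for loop: short clumps become empty strings.
pieces = [bit if len(bit) > 6 else "" for bit in bits]

# Variant: drop short clumps entirely instead of replacing them with "".
kept = [bit for bit in bits if len(bit) > 6]

print(" ".join(pieces))  # note the doubled and trailing spaces
print(" ".join(kept))
```

The first form reproduces the original behaviour exactly, including the extra spaces left behind by the empty strings. The second is probably closer to what was intended, but that is a guess about the requirements.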

That little re.sub() came from ChatGPT and I can understand it without
too much effort because it came documented

I suppose ChatGPT is the answer to this thread. Or everything. Or will be.

I am doubtful. We'll see!

R

From MRAB@21:1/5 to Mike Dewhirst via Python-list on Fri Nov 17 01:22:46 2023

On 2023-11-17 01:15, Mike Dewhirst via Python-list wrote:

Thanks

I know I should invest some brainspace in re. Many years ago at a Perl
conference I did buy a coffee mug completely covered with a regex cheat
sheet. It currently holds pens and pencils on my desk. And spiders, now I
look closely!

Then I took up Python and re is different.

Maybe I'll have another look ...

The patterns themselves aren't that different; Perl's just has more
features than the re module's.

From Mike Dewhirst@21:1/5 to All on Fri Nov 17 15:56:19 2023

On 16/11/2023 9:34 am, Rimu Atkinson via Python-list wrote:

It's worth making the effort to get into list comprehensions though
because they're awesome.

I agree qualitatively 100% but quantitively perhaps I agree 80% where
readability is easy.

I think that's what you are saying anyway.

From jak@21:1/5 to All on Fri Nov 17 10:38:14 2023

Mike Dewhirst wrote:

That little re.sub() came from ChatGPT and I can understand it without
too much effort because it came documented

I suppose ChatGPT is the answer to this thread. Or everything. Or will be.

Thanks

Mike

I respect your opinion, but from the point of view of many Usenet users,
asking ChatGPT to solve your problem is truly overkill. The computer
world overflows with people who know regex. If you had not already
received an answer using 're', I would have sent you my suggestion,
which, as you can see, is practically identical. I am quite sure that
the same solution came to the mind of many people in this newsgroup.

with open(file) as fp:
    try: ret = re.findall(r'\b\d{2,7}\-\d{2}\-\d{1}\b', fp.read())
    except: ret = []

The only difference is '\d' instead of '[0-9]' but they are equivalent.
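A short sketch to check the equivalence claim. Two things are worth noting: this pattern ends in \d{1} where the earlier one ends in [0-9]{2}, and for Python 3 str patterns \d matches any Unicode decimal digit unless the re.ASCII flag is given. The Arabic-Indic sample string and the variable names are assumptions chosen for illustration.

```python
import re

# "12-34-5" written with Arabic-Indic digits (U+0661 .. U+0665).
arabic_indic = "\u0661\u0662-\u0663\u0664-\u0665"

unicode_digits = re.compile(r'\b\d{2,7}-\d{2}-\d\b')
ascii_class = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b')
ascii_flag = re.compile(r'\b\d{2,7}-\d{2}-\d\b', re.ASCII)

print(unicode_digits.findall(arabic_indic))  # matches the whole string
print(ascii_class.findall(arabic_indic))     # []
print(ascii_flag.findall(arabic_indic))      # []

# On plain ASCII input all three behave the same.
print(unicode_digits.findall("50-00-0"), ascii_class.findall("50-00-0"))
```

For files that only ever contain ASCII digits the two spellings give the same results. Where uploaded text may contain other scripts, [0-9] or re.ASCII is the more predictable choice.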

From Peter J. Holzer@21:1/5 to Rimu Atkinson via Python-list on Fri Nov 17 12:17:44 2023

On 2023-11-16 11:34:16 +1300, Rimu Atkinson via Python-list wrote:

Why don't you use re.findall?

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

I think I can see what you did there but it won't make sense to me - or whoever looks at the code - in future.

That answers your specific question. However, I am in awe of people who
can just "do" regular expressions and I thank you very much for what
would have been a monumental effort had I tried it.

I feel the same way about regex. If I can find a way to write something without regex I very much prefer to as regex usually adds complexity and hurts readability.

I find "straight" regexps very easy to write. There are only a handful
of constructs which are all very simple and you just string them
together. But then I've used regexps for 30+ years, so of course they
feel natural to me.

(Reading regexps may be a bit harder, exactly because they are too
simple: There is no abstraction, so a complicated pattern results in a
long regexp.)

There are some extensions to regexps which are conceptually harder, like lookahead and lookbehind or nested contexts in Perl. I may need the
manual for those (especially because they are new(ish) and every
language uses a different syntax for them) or avoid them altogether.

Oh, and Python (just like Perl) allows you to embed whitespace and
comments into Regexps, which helps readability a lot if you have to
write long regexps.

You might find https://regex101.com/ to be useful for testing your regex.
You can enter in sample data and see if it matches.

If I understood what your regex was trying to do I might be able to suggest some python to do the same thing. Is it just removing numbers from text?

Not "removing" them (as I understood it), but extracting them (i.e. find
and collect them).

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

\b          - a word boundary.
[0-9]{2,7}  - 2 to 7 digits
-           - a hyphen-minus
[0-9]{2}    - exactly 2 digits
-           - a hyphen-minus
[0-9]{2}    - exactly 2 digits
\b          - a word boundary.

Seems quite straightforward to me. I'll be impressed if you can write
that in Python in a way which is easier to read.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

From Thomas Passin@21:1/5 to Peter J. Holzer via Python-list on Fri Nov 17 07:48:41 2023

And the re.VERBOSE (also re.X) flag can always be used so the entire
expression can be written line-by-line with comments, nearly the same as
the example above.
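A minimal sketch of what that could look like, using the pattern exactly as posted in the thread (including its two-digit final group). The name CAS_RE and the sample string are made up for the example.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and '#' starts
# a comment, so each piece can be documented in place.
CAS_RE = re.compile(r"""
    \b           # a word boundary
    [0-9]{2,7}   # 2 to 7 digits
    -            # a hyphen-minus
    [0-9]{2}     # exactly 2 digits
    -            # a hyphen-minus
    [0-9]{2}     # exactly 2 digits
    \b           # a word boundary
""", re.VERBOSE)

text = "lot 1234-56-78 and 12-34-56"
print(CAS_RE.findall(text))  # ['1234-56-78', '12-34-56']
```

The compiled pattern behaves the same as the one-line version. Only the layout changes, which addresses the concern about whoever reads the code later.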

From Peter J. Holzer@21:1/5 to Thomas Passin via Python-list on Fri Nov 17 15:46:06 2023

On 2023-11-17 07:48:41 -0500, Thomas Passin via Python-list wrote:

And the re.VERBOSE (also re.X) flag can always be used so the entire expression can be written line-by-line with comments nearly the same
as the example above

Yes. That's what I alluded to above.

hp

From Thomas Passin@21:1/5 to Peter J. Holzer via Python-list on Fri Nov 17 10:17:37 2023

On 11/17/2023 9:46 AM, Peter J. Holzer via Python-list wrote:

Yes. That's what I alluded to above.

I know, and I just wanted to make it explicit for people who didn't know
much about Python regexes.

From Stefan Ram@21:1/5 to Thomas Passin on Fri Nov 17 18:20:14 2023

Thomas Passin <list1@tompassin.net> writes:

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

...

I know, and I just wanted to make it explicit for people who didn't know
much about Python regexes.

Or,

def repeat_preceding( min=None, max=None, count=None ):
    return '{' + str( count )+ '}' if count else \
           '{' + str( min )+ ',' + str( max )+ '}'

digit = '[0-9]'
word_boundary = r'\b'
hyphen = '-'

my_regexp = word_boundary + \
    digit + repeat_preceding( min=2, max=7 ) + hyphen + \
    digit + repeat_preceding( count=2 ) + hyphen + \
    digit + repeat_preceding( count=2 ) + word_boundary

.

From MRAB@21:1/5 to jak via Python-list on Fri Nov 17 18:56:54 2023

On 2023-11-17 09:38, jak via Python-list wrote:

with open(file) as fp:
    try: ret = re.findall(r'\b\d{2,7}\-\d{2}\-\d{1}\b', fp.read())
    except: ret = []

The only difference is '\d' instead of '[0-9]' but they are equivalent.

Bare excepts are a very bad idea.
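For context, a bare except also catches KeyboardInterrupt and SystemExit, and it hides programming mistakes such as a misspelled name. One possible way to narrow it is sketched below. The function name find_cas_in_file, the UTF-8 encoding and the choice of which exceptions to catch are all assumptions for illustration, not something stated in the thread.

```python
import re

def find_cas_in_file(path):
    """Return CAS-shaped strings found in the file at path, or [] if it cannot be read."""
    try:
        # Assumption: the uploads are UTF-8 text.
        with open(path, encoding="utf-8") as fp:
            text = fp.read()
    except (OSError, UnicodeDecodeError):
        # Only the failures expected from opening or decoding a file.
        return []
    # The pattern from the quoted snippet, unchanged.
    return re.findall(r'\b\d{2,7}\-\d{2}\-\d{1}\b', text)
```

Keeping re.findall outside the try block means an error in the pattern itself would surface immediately instead of silently producing an empty list.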

From jak@21:1/5 to All on Sat Nov 18 00:53:21 2023

MRAB wrote:

Bare excepts are a very bad idea.

I know, you're right, but for my test the CAS numbers were inside a string
(txt) and instead of 'open(file)' there was 'io.StringIO(txt)', so the
risk was almost nil. When I copied it here I didn't think about it.
Sorry.

From avi.e.gross@gmail.com@21:1/5 to Rimu Atkinson via Python-list on Sat Nov 18 01:55:13 2023

Many features like regular expressions can be mini languages that are designed to be very powerful while also a tad cryptic to anyone not familiar.

But consider an alternative in some languages that may use some complex set of nested function calls that each have names like match_white_space(2, 5) and even if some are set up to be sort of readable, they can be a pain. Quite a few problems can be solved nicely with a single regular expression or several in a row with each one being fairly simple. Sometimes you can do parts using some of the usual text manipulation functions built-in or in a module for either speed or to simplify things so that the RE part is simpler and easier to follow.
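One way to read that last suggestion, sketched for the CAS case discussed in this thread: let str.split and str.strip do the tokenising so the regular expression only has to describe a single token. The names TOKEN_RE and cas_tokens, the sample text and the single-digit final group (taken from the xx-xx-x format in the original question) are assumptions for illustration.

```python
import re
import string

# One token at a time, so the pattern needs no word boundaries.
TOKEN_RE = re.compile(r'[0-9]{2,7}-[0-9]{2}-[0-9]')

def cas_tokens(txt):
    """Split on whitespace, trim surrounding punctuation, keep CAS-shaped tokens."""
    found = []
    for token in txt.split():
        token = token.strip(string.punctuation)
        if TOKEN_RE.fullmatch(token):
            found.append(token)
    return found

print(cas_tokens("Contains (50-00-0), 7732-18-5; not abc123-45-6 or 12-3."))
```

Whether this is easier to follow than a single findall call is a matter of taste. It trades a shorter pattern for a few more lines of ordinary Python.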

And, as noted, Python allows ways to include comments in RE or ways to specify extensions such as Perl-style and so on. Adding enough comments above or within the code can help remind people or point to a reference and just explaining in English (or the language of your choice that hopefully others later can understand) can be helpful. You can spell out in whatever level of detail what you expect your data to look like and what you want to match or extract and then the RE may be easier to follow.

Of course the endless extensions added due to things like supporting Unicode have made some RE much harder to create or understand, and sometimes the result may not even be what you expected if something strange happens, like the symbols ①❹⓸.

The above might match digits and maybe be interpreted at some point as 12 dozen, which may even be appropriate but a bit of a surprise perhaps.

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmail.com@python.org> On Behalf Of Peter J. Holzer via Python-list
Sent: Friday, November 17, 2023 6:18 AM
To: python-list@python.org
Subject: Re: Code improvement question

On 2023-11-16 11:34:16 +1300, Rimu Atkinson via Python-list wrote:

Why don't you use re.findall?

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

I think I can see what you did there but it won't make sense to me - or whoever looks at the code - in future.

That answers your specific question. However, I am in awe of people who
can just "do" regular expressions and I thank you very much for what
would have been a monumental effort had I tried it.

I feel the same way about regex. If I can find a way to write something without regex I very much prefer to as regex usually adds complexity and hurts readability.

I find "straight" regexps very easy to write. There are only a handful
of constructs which are all very simple and you just string them
together. But then I've used regexps for 30+ years, so of course they
feel natural to me.

(Reading regexps may be a bit harder, exactly because they are to
simple: There is no abstraction, so a complicated pattern results in a
long regexp.)

There are some extensions to regexps which are conceptually harder, like lookahead and lookbehind or nested contexts in Perl. I may need the
manual for those (especially because they are new(ish) and every
language uses a different syntax for them) or avoid them altogether.

Oh, and Python (just like Perl) allows you to embed whitespace and
comments into Regexps, which helps readability a lot if you have to
write long regexps.
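
As a sketch of that last point, here is the pattern from this thread written with Python's re.VERBOSE flag, which ignores whitespace inside the pattern and allows # comments. The name CAS_LIKE and the sample text are made up for illustration; the pattern itself is the one quoted in this thread, unchanged.

```python
import re

# CAS_LIKE is a hypothetical name; the pattern is the one quoted in the thread.
CAS_LIKE = re.compile(r"""
    \b            # a word boundary
    [0-9]{2,7}    # 2 to 7 digits
    -             # a hyphen-minus
    [0-9]{2}      # exactly 2 digits
    -             # a hyphen-minus
    [0-9]{2}      # exactly 2 digits
    \b            # a word boundary
""", re.VERBOSE)

# Made-up test data: the middle item has too few leading digits.
sample = "ids: 12345-67-89, 1-22-33, and 1234567-12-34"
matches = CAS_LIKE.findall(sample)
print(matches)
```

The compiled pattern behaves the same as the one-line version; only the layout differs.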

You might find https://regex101.com/ to be useful for testing your regex.
You can enter in sample data and see if it matches.

If I understood what your regex was trying to do I might be able to suggest some Python to do the same thing. Is it just removing numbers from text?

Not "removing" them (as I understood it), but extracting them (i.e. find
and collect them).

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

\b         - a word boundary.
[0-9]{2,7} - 2 to 7 digits
-          - a hyphen-minus
[0-9]{2}   - exactly 2 digits
-          - a hyphen-minus
[0-9]{2}   - exactly 2 digits
\b         - a word boundary.

Seems quite straightforward to me. I'll be impressed if you can write
that in Python in a way which is easier to read.
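
One thing worth checking before relying on it: the opening post gives the format as xx-xx-x with the example 1012300-77-4, i.e. a single digit at the end, whereas the pattern as quoted here asks for two. A small test, assuming the opening post's description is the correct one (the variant pattern, its name, and the sample text are illustrative, not something from the thread):

```python
import re

# Pattern exactly as quoted in this thread (two trailing digits).
quoted = r"\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b"

# Assumed variant: a single trailing digit, following the xx-xx-x
# format described in the opening post.
single_check_digit = r"\b[0-9]{2,7}-[0-9]{2}-[0-9]\b"

# Made-up sample text containing the example from the opening post.
txt = "found 1012300-77-4 and 7732-18-5 but not 12-34-567"

quoted_hits = re.findall(quoted, txt)
variant_hits = re.findall(single_check_digit, txt)
print(quoted_hits)
print(variant_hits)
```

With this sample the quoted pattern finds nothing, while the variant finds both of the xx-xx-x style numbers and still rejects 12-34-567.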

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


From Stefan Ram@21:1/5 to Stefan Ram on Sun Nov 19 10:04:06 2023

ram@zedat.fu-berlin.de (Stefan Ram) writes:

Thomas Passin <list1@tompassin.net> writes:

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

Or,

def repeat_preceding( min=None, max=None, count=None ):
    ''' require that the preceding regexp is repeated
    a certain number of times, use either min and max
    or count '''
    return '{' + str( count )+ '}' if count else \
           '{' + str( min )+ ',' + str( max )+ '}'

digit = '[0-9]' # match a decimal digit
word_boundary = r'\b' # match a word boundary
a_hyphen = '-' # match a literal hyphen character

def digits( **kwargs ):
    ''' A certain number of digits. See 'repeat_preceding' for
    the possible kwargs. '''
    return digit + repeat_preceding( **kwargs )

def word( regexp: str ):
    ''' something that starts and ends with a word boundary '''
    return word_boundary + regexp + word_boundary

my_regexp = \
word \
( digits( min=2, max=7 ) + a_hyphen +
  digits( count=2 ) + a_hyphen +
  digits( count=2 ))


From Paul Rubin@21:1/5 to Stefan Ram on Sun Nov 19 03:03:10 2023

ram@zedat.fu-berlin.de (Stefan Ram) writes:

return '{' + str( count )+ '}' if count else \
'{' + str( min )+ ',' + str( max )+ '}'

return f'{{{count}}}' if count else f'{{{min},{max}}}'
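
For anyone puzzled by the triple braces: in an f-string, {{ and }} produce literal braces, so {{{count}}} is a literal {, then the value of count, then a literal }. Below is a self-contained sketch that plugs the one-liner into a trimmed-down version of the helpers from the previous message. The parameter names lo and hi are my substitution for min and max, to avoid shadowing the builtins; the sample text is made up.

```python
import re

def repeat_preceding(lo=None, hi=None, count=None):
    """Return a regexp quantifier: {count}, or {lo,hi} if count is not given."""
    # {{ and }} are literal braces; the inner {...} is the substitution.
    return f'{{{count}}}' if count else f'{{{lo},{hi}}}'

def digits(**kwargs):
    """A run of ASCII digits with the requested repetition."""
    return '[0-9]' + repeat_preceding(**kwargs)

# Rebuild the pattern from the thread out of the pieces.
my_regexp = (r'\b' + digits(lo=2, hi=7) + '-' +
             digits(count=2) + '-' + digits(count=2) + r'\b')

print(my_regexp)
found = re.findall(my_regexp, 'made-up sample 12345-67-89')
print(found)
```

The assembled string is identical to the one-line pattern quoted throughout the thread, so either form can be passed to re.findall.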


From Rimu Atkinson@21:1/5 to All on Tue Nov 21 09:48:49 2023

re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

\b         - a word boundary.
[0-9]{2,7} - 2 to 7 digits
-          - a hyphen-minus
[0-9]{2}   - exactly 2 digits
-          - a hyphen-minus
[0-9]{2}   - exactly 2 digits
\b         - a word boundary.

Seems quite straightforward to me. I'll be impressed if you can write
that in Python in a way which is easier to read.

Now that I know what {} does, you're right, that IS straightforward!
Maybe 2023 will be the year I finally get off my arse and learn regex.

Thanks :)

