I'd like to improve the code below, which works. It feels clunky to me.
I need to clean up user-uploaded files whose size I don't know in advance.
After cleaning they might be as big as 1 MB, but that would be super rare. Perhaps only for testing.
I'm extracting CAS numbers, and the pattern runs from xx-xx-x up to
xxxxxxx-xx-x, e.g. 1012300-77-4
import re

def remove_alpha(txt):
    r"""  r'[^0-9\- ]':
    [^...]: Match any character that is not in the specified set.
    0-9: Match any digit.
    \: Escape character.
    -: Match a hyphen.
    Space: Match a space.
    """
    cleaned_txt = re.sub(r'[^0-9\- ]', '', txt)
    bits = cleaned_txt.split()
    pieces = []
    for bit in bits:
        # minimum size of a CAS number is 7 so drop smaller clumps of digits
        pieces.append(bit if len(bit) > 6 else "")
    return " ".join(pieces)
Many thanks for any hints
On 15/11/2023 10:25 am, MRAB via Python-list wrote:
> Why don't you use re.findall?
>
> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)
I think I can see what you did there but it won't make sense to me - or whoever looks at the code - in future.
That answers your specific question. However, I am in awe of people who
can just "do" regular expressions and I thank you very much for what
would have been a monumental effort had I tried it.
That little re.sub() came from ChatGPT and I can understand it without
too much effort because it came documented
I suppose ChatGPT is the answer to this thread. Or everything. Or will be.
On 15/11/2023 3:08 pm, MRAB via Python-list wrote:
> \b          Word boundary
> [0-9]{2,7}  2..7 digits
> -           "-"
> [0-9]{2}    2 digits
> -           "-"
> [0-9]{2}    2 digits
> \b          Word boundary
>
> The "word boundary" thing is to stop it matching where there are
> letters or digits right next to the digits.
>
> For example, if the text contained, say, "123456789-12-1234", you
> wouldn't want it to match because there are more than 7 digits at the
> start and more than 2 digits at the end.
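
A minimal sketch of the word-boundary behaviour described above, using only Python's standard re module. The sample text and the names POSTED and SINGLE_CHECK_DIGIT are made up for illustration. Note that the pattern as posted ends in [0-9]{2}, while the xx-xx-x shape and the example 1012300-77-4 from the original question end in a single digit, so the second pattern is an assumed variant, not something posted in the thread.

```python
import re

# Pattern exactly as posted in the thread (two trailing digits).
POSTED = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b')

# Assumed variant with one trailing digit, following the xx-xx-x shape.
SINGLE_CHECK_DIGIT = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]\b')

# Hypothetical sample text: two real-looking numbers, one over-long run
# of digits, and one candidate glued to a letter.
sample = "water 7732-18-5, test 1012300-77-4, junk 123456789-12-1234, x12-34-5"

posted_hits = POSTED.findall(sample)
single_hits = SINGLE_CHECK_DIGIT.findall(sample)

print(posted_hits)
print(single_hits)
```

With this sample, POSTED finds nothing because every candidate ends in one digit, while SINGLE_CHECK_DIGIT returns ['7732-18-5', '1012300-77-4']. Neither pattern accepts 123456789-12-1234 or x12-34-5, which is the effect of the \b anchors combined with the digit counts.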
I know I should invest some brainspace in re. Many years ago at a Perl
conference I did buy a coffee mug completely covered with a regex cheat
sheet. It currently holds pens and pencils on my desk. And spiders now I
look closely!
Then I took up Python and re is different.
Maybe I'll have another look ...
On 2023-11-16 11:34:16 +1300, Rimu Atkinson via Python-list wrote:
> I feel the same way about regex. If I can find a way to write something
> without regex I very much prefer to as regex usually adds complexity and
> hurts readability.
I find "straight" regexps very easy to write. There are only a handful
of constructs which are all very simple and you just string them
together. But then I've used regexps for 30+ years, so of course they
feel natural to me.
(Reading regexps may be a bit harder, exactly because they are too
simple: There is no abstraction, so a complicated pattern results in a
long regexp.)
There are some extensions to regexps which are conceptually harder, like
lookahead and lookbehind or nested contexts in Perl. I may need the
manual for those (especially because they are new(ish) and every
language uses a different syntax for them) or avoid them altogether.
Oh, and Python (just like Perl) allows you to embed whitespace and
comments into Regexps, which helps readability a lot if you have to
write long regexps.
> You might find https://regex101.com/ to be useful for testing your regex.
> You can enter in sample data and see if it matches.
>
> If I understood what your regex was trying to do I might be able to suggest
> some python to do the same thing. Is it just removing numbers from text?
Not "removing" them (as I understood it), but extracting them (i.e. find
and collect them).
> re.findall(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b', txt)

\b         - a word boundary.
[0-9]{2,7} - 2 to 7 digits
-          - a hyphen-minus
[0-9]{2}   - exactly 2 digits
-          - a hyphen-minus
[0-9]{2}   - exactly 2 digits
\b         - a word boundary.
Seems quite straightforward to me. I'll be impressed if you can write
that in Python in a way which is easier to read.
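
For comparison, here is one rough attempt at the same check in plain Python without re. The function names are invented for this sketch, and the tokenising (split on whitespace, strip some surrounding punctuation) is an assumption that only approximates the \b word boundaries, so it is not an exact equivalent of the regex.

```python
def looks_like_cas(token):
    """True if token is digits-digits-digits with 2..7, 2 and 2 digits,
    mirroring the pattern as posted in the thread."""
    parts = token.split("-")
    if len(parts) != 3:
        return False
    # isascii() keeps this in line with [0-9]; isdigit() alone would
    # also accept non-ASCII digit characters.
    if not all(part.isascii() and part.isdigit() for part in parts):
        return False
    first, second, third = parts
    return 2 <= len(first) <= 7 and len(second) == 2 and len(third) == 2


def find_cas_like(txt):
    # Split on whitespace and trim common surrounding punctuation.
    # This is cruder than the regex's word boundaries.
    tokens = (word.strip(".,;:()[]") for word in txt.split())
    return [token for token in tokens if looks_like_cas(token)]


print(find_cas_like("see 1012300-77-42, and 123456789-12-1234."))
```

The result is noticeably longer than the one-line pattern, which is arguably the point being made above.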
On 11/17/2023 6:17 AM, Peter J. Holzer via Python-list wrote:
> Oh, and Python (just like Perl) allows you to embed whitespace and
> comments into Regexps, which helps readability a lot if you have to
> write long regexps.
> [...]
And the re.VERBOSE (also re.X) flag can always be used so the entire
expression can be written line-by-line with comments nearly the same
as the example above.
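
A small sketch of what that could look like, using the pattern as posted in the thread. The name CAS_VERBOSE and the sample string are assumptions for illustration. With re.VERBOSE, whitespace in the pattern is ignored and everything from # to the end of the line is a comment.

```python
import re

# The posted pattern, written one piece per line with comments.
CAS_VERBOSE = re.compile(r"""
    \b            # a word boundary
    [0-9]{2,7}    # 2 to 7 digits
    -             # a hyphen-minus
    [0-9]{2}      # exactly 2 digits
    -             # a hyphen-minus
    [0-9]{2}      # exactly 2 digits
    \b            # a word boundary
""", re.VERBOSE)

# The compact form from the thread, for comparison.
CAS_COMPACT = re.compile(r'\b[0-9]{2,7}-[0-9]{2}-[0-9]{2}\b')

sample = "a 1012300-77-42 b 123456789-12-1234 c"
verbose_hits = CAS_VERBOSE.findall(sample)
compact_hits = CAS_COMPACT.findall(sample)
print(verbose_hits == compact_hits)
```

Both forms should find the same matches; the verbose one just carries its own documentation.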
On 2023-11-17 07:48:41 -0500, Thomas Passin via Python-list wrote:
> And the re.VERBOSE (also re.X) flag can always be used so the entire
> expression can be written line-by-line with comments nearly the same
> as the example above
Yes. That's what I alluded to above.
I know, and I just wanted to make it explicit for people who didn't know
much about Python regexes.
Mike Dewhirst wrote:
> That little re.sub() came from ChatGPT and I can understand it without
> too much effort because it came documented
>
> I suppose ChatGPT is the answer to this thread. Or everything. Or will be.
I respect your opinion, but from the point of view of many Usenet users,
asking ChatGPT to solve your problem is truly overkill.
The computer world overflows with people who know regex. If you had not
already had the answer using 're', I would have sent you my suggestion,
which as you can see is practically identical. I am quite sure that the
same solution came to the mind of many people in this newsgroup.
with open(file) as fp:
    try: ret = re.findall(r'\b\d{2,7}\-\d{2}\-\d{1}\b', fp.read())
    except: ret = []
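The main difference is '\d' instead of '[0-9]', which are equivalent for ASCII digits (for str patterns in Python 3, '\d' also matches other Unicode decimal digits unless re.ASCII is set). This pattern also ends in a single digit, '\d{1}', where the one above has '[0-9]{2}'.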
Bare excepts are a very bad idea.
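
One way the same file scan might look with specific exceptions instead of a bare except. This is only a sketch: the function name find_cas_in_file, the UTF-8 encoding, the re.ASCII flag and the decision to return an empty list on failure are all assumptions, not something stated in the thread.

```python
import re

# Same shape as the snippet above; re.ASCII makes \d behave like [0-9].
CAS_PATTERN = re.compile(r'\b\d{2,7}-\d{2}-\d\b', re.ASCII)


def find_cas_in_file(path):
    """Return all CAS-like strings in the file, or [] if it cannot be read."""
    try:
        with open(path, encoding="utf-8") as fp:
            return CAS_PATTERN.findall(fp.read())
    except (OSError, UnicodeDecodeError):
        # Only I/O and decoding problems are swallowed here; anything
        # else (KeyboardInterrupt, a typo raising NameError) propagates.
        return []
```

A bare except would also hide programming errors and interrupts, which is presumably the objection being raised.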
Thomas Passin <list1@tompassin.net> writes:
Or,

    return '{' + str( count )+ '}' if count else \
           '{' + str( min )+ ',' + str( max )+ '}'
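
The fragment above is incomplete, so the following is not a reconstruction of it. It is a separate, hypothetical sketch of the general idea of assembling the posted pattern from named pieces in Python; the helper digits() and the constants are invented for illustration.

```python
import re


def digits(minimum, maximum=None):
    """Return a regex fragment matching minimum..maximum ASCII digits."""
    if maximum is None or maximum == minimum:
        return "[0-9]{%d}" % minimum
    return "[0-9]{%d,%d}" % (minimum, maximum)


WORD_BOUNDARY = r"\b"
HYPHEN = "-"

# Assemble the same pattern that was posted in the thread.
pattern = "".join([
    WORD_BOUNDARY,
    digits(2, 7), HYPHEN,
    digits(2), HYPHEN,
    digits(2),
    WORD_BOUNDARY,
])

print(pattern)
print(re.findall(pattern, "x 1012300-77-42 y"))
```

Whether this reads better than the one-line regex with a commented breakdown is a matter of taste.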