Forum: >>> Magnum BBS <<<

Re: Alternative to Debian Repository - extract CSV formatted data from

From John Hasler@21:1/5 to All on Thu Feb 20 18:20:01 2025

Try pdftotext.
--
John Hasler
john@sugarbit.com
Elmwood, WI USA

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From debian-user@howorth.org.uk@21:1/5 to Richard Owlett on Thu Feb 20 18:30:01 2025

Richard Owlett <rowlett@access.net> wrote:

I wish to extract CSV formatted data from a PDF document. [1]
Page ES-7 has a weekly grocery list for males grouped by age.
I need only the first and last columns.

Can someone point me in a suitable direction?

TIA

[1] https://www.fns.usda.gov/cnpp/thrifty-food-plan-2006
Table ES-1. Thrifty Food Plan market baskets, quantities of food
purchased for a week, by age-gender group, 2006

If you look at
https://www.fns.usda.gov/cnpp/thrifty-food-plan-2021 instead, you can
find the underlying data in spreadsheet form (.xlsx). Perhaps that will
be an adequate substitute?


From Hans@21:1/5 to All on Thu Feb 20 18:40:01 2025


On Thursday, 20 February 2025, 15:08:27 CET, Richard Owlett wrote:

I wish to extract CSV formatted data from a PDF document. [1]
Page ES-7 has a weekly grocery list for males grouped by age.
I need only the first and last columns.

Can someone point me in a suitable direction?

TIA

[1] https://www.fns.usda.gov/cnpp/thrifty-food-plan-2006
Table ES-1. Thrifty Food Plan market baskets, quantities of food
purchased for a week, by age-gender group, 2006

Without knowing the content of your PDF file: maybe you can convert it to a text file, for
example by using "pdftotext".

pdftotext [options] <PDF-file> [<text-file>]

Then you could read every line in this text file, keep only the lines that contain a particular
word (e.g. "sold"), and write those lines to a new file. For example:

grep "sold" ~/my_file.txt > my_new_file.txt

Now that you have this file, you can use cut to keep only the words you need (see the manual of cut).

The syntax is something like:

cut -d ' ' -f 3,5,7 my_new_file.txt > my_target_file.txt

This reads the source file line by line and prints only the 3rd, 5th and 7th space-separated
field of each line.

See the manual of cut for the options you need.
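
The two steps can be tried on a small stand-in file first. This is only a sketch: the sample lines, the keyword "sold" and the file names are placeholders, not taken from the real PDF, and the field numbers are chosen to suit the sample rather than the 3, 5, 7 above.

```shell
#!/bin/sh
# Sketch only: my_file.txt stands in for pdftotext output. The sample lines,
# the keyword and all file names are made up for illustration.
workdir=$(mktemp -d)
printf '%s\n' \
  'item apples sold 12 kg at 3 eur' \
  'header line to be skipped' \
  'item pears sold 7 kg at 4 eur' > "$workdir/my_file.txt"

# Step 1: keep only the lines that contain the keyword.
grep 'sold' "$workdir/my_file.txt" > "$workdir/my_new_file.txt"

# Step 2: keep the 2nd, 4th and 5th space-separated fields of each line.
# Note that cut treats every single space as a delimiter.
cut -d ' ' -f 2,4,5 "$workdir/my_new_file.txt" > "$workdir/my_target_file.txt"

cat "$workdir/my_target_file.txt"
```

Because cut counts every single space as a separator, text with runs of spaces (as pdftotext often produces) will give empty fields and shifted numbering.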

Finally, you can edit my_target_file.txt with any editor and replace each space between the
words with a separator character. A space is a character like any other and can be replaced
like any other.

Then you would have a csv file!
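
The separator step can also be done without an editor. A minimal sketch, assuming the fields are separated only by spaces and no field contains a space itself (a multi-word food name would be split apart); the sample rows and file names are made up:

```shell
#!/bin/sh
# Sketch only: the rows and file names are invented for illustration.
workdir=$(mktemp -d)
printf '%s\n' \
  'apples 12 kg' \
  'pears  7 kg' > "$workdir/my_target_file.txt"

# Replace every run of one or more spaces with a single comma.
sed 's/  */,/g' "$workdir/my_target_file.txt" > "$workdir/my_target_file.csv"

cat "$workdir/my_target_file.csv"
```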

Once you are familiar with these commands, you can write a shell script that does it all in one
go in future. That is also useful for very, very big files.

Please note: the above might not be the correct syntax!!! My goal was more to show which way
you could go, and it may not be useful for your particular PDF file.

Maybe, if my suggestion is useful, someone more experienced than me can tell you the correct
commands.

Please take a look at the manuals of pdftotext, grep and cut. Hope this helps.

Best regards

Hans


From Richard Owlett@21:1/5 to debian-user@howorth.org.uk on Thu Feb 20 21:00:01 2025

On 2/20/25 11:20 AM, debian-user@howorth.org.uk wrote:

Richard Owlett <rowlett@access.net> wrote:

I wish to extract CSV formatted data from a PDF document. [1]
Page ES-7 has a weekly grocery list for males grouped by age.
I need only the first and last columns.

Can someone point me in a suitable direction?

TIA

[1] https://www.fns.usda.gov/cnpp/thrifty-food-plan-2006
Table ES-1. Thrifty Food Plan market baskets, quantities of food
purchased for a week, by age-gender group, 2006

If you look at
https://www.fns.usda.gov/cnpp/thrifty-food-plan-2021 instead, you can
find the underlying data in spreadsheet form (.xlsx). Perhaps that will
be an adequate substitute?

You just demonstrated that "Murphy's Law" holds ;<

I click on the link you quoted in my default browser and a PDF is
displayed [actually my original starting point months ago].

If I use my alternate browser {Firefox instead of SeaMonkey} I get to
choose which of several files to view. {one of them is an .xlsx file}

Murphy gets a second jab in.
The 2006 version has the data I want in a slightly different layout than
the 2021 version. The first is a better match for how I do things ;/

Also the PDF structure of the two links reacts slightly differently when
selecting with mouse movements/clicks. The 2006 version seems to allow
me to select only what I want. [ 2021 version grabs everything between
first and last click. 2006 appears to select only the columns of interest]

Can't spend time right now to verify first impression. Will know more
this weekend.

*THANK YOU*


From debian-user@howorth.org.uk@21:1/5 to Greg on Fri Feb 21 22:30:01 2025

Greg <curtyshoo@gmail.com> wrote:

On 2025-02-21, David Wright <deblis@lionunicorn.co.uk> wrote:

[1] https://www.fns.usda.gov/cnpp/thrifty-food-plan-2006
Table ES-1. Thrifty Food Plan market baskets, quantities
of food purchased for a week, by age-gender group, 2006

I don't read PDFs /in/ the browser: it downloads it instead.
So while held captive at home by the weather, I dragged the mouse
across the Males table and dumped it in a file.

I get:

Access Denied
You don't have permission to access "http://www.fns.usda.gov/cnpp/thrifty-food-plan-2006" on this server. Reference #18.dd831002.1740148075.35e89c97

https://errors.edgesuite.net/18.dd831002.1740148075.35e89c97

Wacky!

For me, FF opens a normal web page and tries to download a PDF file as
well. Cheeky thing! For both the 2006 and 2021 pages. I can't be
bothered trying to find what particular combination of plugins and
preferences cause all these different behaviours.


From David Wright@21:1/5 to Max Nikulin on Fri Feb 21 23:10:02 2025

On Fri 21 Feb 2025 at 09:53:46 (+0700), Max Nikulin wrote:

On 21/02/2025 08:00, David Wright wrote:

I dragged the mouse
across the Males table and dumped it in a file.

David, I recall you mentioned xpdf in your messages. It allows selecting
rectangular regions. Sometimes that is convenient, since this
strategy does not depend on the order of objects inside PDF files.

Yes, xpdf is my goto PDF viewer, and I should have mentioned that
in the post.

Other PDF viewers allow convenient selection of contiguous spans of
text, e.g. the end of one line and the beginning of the next. Unfortunately,
plenty of PDF files have pieces of text placed in almost random order. At
least in Firefox, selection may work in a quite peculiar way, skipping
some fragments and adding visually unrelated ones.

Yes, I scrape web pages from FF fairly frequently (new mostly),
and am familiar with the particular structures that result with
different organisations. And ^A^C is a useful tool that can
scrape off-screen text which gets blotted out if you try to scroll
to it, ie requiring login or whatever to view the page.

So selection of text in PDF files may strongly depend on viewer.

Yes, most of the others I have will paste text that's as jumbled
as raw pdftotext, eg evince, zathura. With mupdf, I don't even
know how to copy, as the mouse just drags the page around.

P.S. "pdftotext -layout" in some cases is better than without
"-layout".

I think the results are roughly comparable with my scrapings,
for this document at least. Perhaps both pdftotext and xpdf
rely on poppler to do the work.

When a text file has properly aligned columns, instead of
"quoting" some spaces, it may be better to add TAB characters at
certain positions on each line. Perhaps LibreOffice Calc even has a GUI
to select column widths when importing text files.

Yes, gnumeric has that too. But I would hate to have a lot of
mousework if I were repeating this frequently. And for a
postprandial one-off, I just took a no-tools approach
(barring an editor, of course).
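
For anyone repeating this often, the TAB idea can be scripted instead. This is only a sketch: it assumes "pdftotext -layout" leaves at least two spaces between columns and only single spaces inside a cell, and the rows below are invented sample data, not figures from the USDA table.

```shell
#!/bin/sh
# Sketch only. A real run would start from something like
#   pdftotext -layout -f N -l N input.pdf table.txt
# where N and the file names are placeholders for the page of interest.
workdir=$(mktemp -d)
printf '%s\n' \
  'Food category         Age 1     Age 2     Age 3' \
  'Whole grains            0.10      0.20      0.30' \
  'Dark green vegetables   1.00      2.00      3.00' > "$workdir/table.txt"

# Split on runs of two or more spaces, then print the first and last
# columns separated by a TAB. Lines with only one column are skipped.
awk -F '  +' -v OFS='\t' 'NF > 1 { print $1, $NF }' \
  "$workdir/table.txt" > "$workdir/table.tsv"

cat "$workdir/table.tsv"
```

Lines with leading or trailing spaces would produce an empty first or last field, so real output may need trimming before this step.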

Cheers,
David.


From fxkl47BF@protonmail.com@21:1/5 to All on Fri Feb 21 23:40:01 2025

in discussions about pdf utilities i don't recall atril being mentioned
it's become my goto viewer


From tomas@tuxteam.de@21:1/5 to David Wright on Sat Feb 22 07:50:01 2025

On Fri, Feb 21, 2025 at 03:59:55PM -0600, David Wright wrote:

On Fri 21 Feb 2025 at 21:20:45 (+0000), debian-user@howorth.org.uk wrote:

[...]

I get:

Access Denied
You don't have permission to access "http://www.fns.usda.gov/cnpp/thrifty-food-plan-2006" on this server. Reference #18.dd831002.1740148075.35e89c97

https://errors.edgesuite.net/18.dd831002.1740148075.35e89c97

Perhaps it depends on browser settings (and which browser),
or perhaps on where you are (your timezone is unknown), or
perhaps on your ISP.

...or of whatever part of US administration The Musk and his child
warriors happened to apply their wrecking ball to yesterday.

SCNR
--
t


From songbird@21:1/5 to fxkl47BF@protonmail.com on Sun Feb 23 00:50:01 2025

fxkl47BF@protonmail.com wrote:

in discussions about pdf utilities i don't recall atril being mentioned it's become my goto viewer

perhaps because it is normally a part of the MATE
desktop?

i've been using it for years and so far no major issues
that i've noticed, but i'm also not doing very complicated
things with it.

songbird
