• Re: Alternative to Debian Repository - extract CSV formatted data from

    From John Hasler@21:1/5 to All on Thu Feb 20 18:20:01 2025
    Try pdftotext.
    --
    John Hasler
    john@sugarbit.com
    Elmwood, WI USA

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From debian-user@howorth.org.uk@21:1/5 to Richard Owlett on Thu Feb 20 18:30:01 2025
    Richard Owlett <rowlett@access.net> wrote:
    I wish to extract CSV formatted data from a PDF document. [1]
    Page ES-7 has a weekly grocery list for males grouped by age.
    I need only the first and last columns.

    Can someone point me in a suitable direction?

    TIA

    [1] https://www.fns.usda.gov/cnpp/thrifty-food-plan-2006
    Table ES-1. Thrifty Food Plan market baskets, quantities of food
    purchased for a week, by age-gender group, 2006

    If you look at
    https://www.fns.usda.gov/cnpp/thrifty-food-plan-2021 instead, you can
    find the underlying data in spreadsheet form (.xlsx). Perhaps that will
    be an adequate substitute?

  • From Hans@21:1/5 to All on Thu Feb 20 18:40:01 2025

    On Thursday, 20 February 2025, 15:08:27 CET, Richard Owlett wrote:
    I wish to extract CSV formatted data from a PDF document. [1]
    Page ES-7 has a weekly grocery list for males grouped by age.
    I need only the first and last columns.

    Can someone point me in a suitable direction?

    TIA

    [1] https://www.fns.usda.gov/cnpp/thrifty-food-plan-2006
    Table ES-1. Thrifty Food Plan market baskets, quantities of food
    purchased for a week, by age-gender group, 2006

    Without knowing the content of your PDF file: maybe you can convert it to a text file, for
    example by using "pdftotext".

    pdftotext [options] <PDF-file> [<text-file>]

    Then you could read every line in this text file, keep only the lines that contain a
    particular word (e.g. "sold"), and write them to a new file. For example:

    grep "sold" ~/my_file.txt > my_new_file.txt

    Once you have this file, you can use cut to keep only the words you need (see the manual of cut).

    The syntax is something like:

    cut -d ' ' -f 3,5,7 my_new_file.txt > my_target_file.txt

    This reads the file line by line and prints only the 3rd, 5th and 7th word of each line,
    provided the words are separated by single spaces.

    See the manual of cut for the options you need.

    Finally, you can open my_target_file.txt in any editor and replace each space between the
    words with a separator character. A space is a character like any other and can be replaced
    like any other.

    Then you would have a csv file!
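
    A minimal sketch of the steps above, assuming the words on each line are separated by
    whitespace. The function name pick_words and the sample lines are made up for illustration,
    and awk stands in for cut here because it copes with runs of spaces:

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines that contain a keyword, then
# print the 3rd, 5th and 7th whitespace-separated word joined by commas.
# Usage: pick_words KEYWORD < input.txt
pick_words() {
    keyword=$1
    grep -- "$keyword" |
        awk -v OFS=',' '{ print $3, $5, $7 }'
}

# Demo on made-up lines instead of real pdftotext output.
printf '%s\n' \
    'a b c d e f g sold' \
    'this line is skipped' \
    'one two three four five six seven sold' |
    pick_words sold
```

    With a real document the input would come from something like
    "pdftotext -layout file.pdf - | pick_words sold", where file.pdf is a placeholder and the
    trailing "-" sends the text to standard output.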

    If you are familiar with these commands, you can write a shell script that does it all in
    one go in future. That is also useful for very, very big files.

    Please note: the above might not be the correct syntax! My goal was rather to show which way
    you could go, and it may not be useful for your particular PDF file.


    Maybe, if my suggestion is useful, someone more experienced than me can tell you the correct
    commands.

    Please take a look at the manuals of pdftotext, cat and cut. I hope this helps.

    Best regards

    Hans






  • From Richard Owlett@21:1/5 to debian-user@howorth.org.uk on Thu Feb 20 21:00:01 2025
    On 2/20/25 11:20 AM, debian-user@howorth.org.uk wrote:
    Richard Owlett <rowlett@access.net> wrote:
    I wish to extract CSV formatted data from a PDF document. [1]
    Page ES-7 has a weekly grocery list for males grouped by age.
    I need only the first and last columns.

    Can someone point me in a suitable direction?

    TIA

    [1] https://www.fns.usda.gov/cnpp/thrifty-food-plan-2006
    Table ES-1. Thrifty Food Plan market baskets, quantities of food
    purchased for a week, by age-gender group, 2006

    If you look at
    https://www.fns.usda.gov/cnpp/thrifty-food-plan-2021 instead, you can
    find the underlying data in spreadsheet form (.xlsx). Perhaps that will
    be an adequate substitute?



    You just demonstrated that "Murphy's Law" holds ;<

    I click on the link you quoted in my default browser and a PDF is
    displayed [actually my original starting point months ago].

    If I use my alternate browser {Firefox instead of SeaMonkey} I get to
    choose which of several files to view. {one of them is an .xlsx file}

    Murphy gets a second jab in.
    The 2006 version has the data I want in a slightly different layout than
    the 2021 version. The first is a better match for how I do things ;/

    Also the PDF structure of the two links reacts slightly differently when
    selecting with mouse movements/clicks. The 2006 version seems to allow
    me to select only what I want. [ 2021 version grabs everything between
    first and last click. 2006 appears to select only the columns of interest]

    Can't spend time right now to verify first impression. Will know more
    this weekend.

    *THANK YOU*

  • From debian-user@howorth.org.uk@21:1/5 to Greg on Fri Feb 21 22:30:01 2025
    Greg <curtyshoo@gmail.com> wrote:
    On 2025-02-21, David Wright <deblis@lionunicorn.co.uk> wrote:

    [1] https://www.fns.usda.gov/cnpp/thrifty-food-plan-2006
    Table ES-1. Thrifty Food Plan market baskets, quantities
    of food purchased for a week, by age-gender group, 2006

    I don't read PDFs /in/ the browser: it downloads it instead.
    So while held captive at home by the weather, I dragged the mouse
    across the Males table and dumped it in a file.


    I get:

    Access Denied
    You don't have permission to access "http://www.fns.usda.gov/cnpp/thrifty-food-plan-2006" on this server. Reference #18.dd831002.1740148075.35e89c97

    https://errors.edgesuite.net/18.dd831002.1740148075.35e89c97

    Wacky!

    For me, FF opens a normal web page and tries to download a PDF file as
    well. Cheeky thing! For both the 2006 and 2021 pages. I can't be
    bothered trying to find what particular combination of plugins and
    preferences cause all these different behaviours.

  • From David Wright@21:1/5 to Max Nikulin on Fri Feb 21 23:10:02 2025
    On Fri 21 Feb 2025 at 09:53:46 (+0700), Max Nikulin wrote:
    On 21/02/2025 08:00, David Wright wrote:
    I dragged the mouse
    across the Males table and dumped it in a file.

    David, I recall you mentioned xpdf in your messages. It allows
    selecting rectangular regions. Sometimes that is convenient, since this
    strategy does not depend on the order of objects inside PDF files.

    Yes, xpdf is my goto PDF viewer, and I should have mentioned that
    in the post.

    Other PDF viewers allow you to conveniently select contiguous spans of
    text, e.g. the end of one line and the beginning of the next. Unfortunately,
    quite a few PDF files have pieces of text stored in almost random order. At
    least in Firefox, selection may work in a quite peculiar way, skipping
    some fragments and adding visually unrelated ones.

    Yes, I scrape web pages from FF fairly frequently (news mostly),
    and am familiar with the particular structures that result with
    different organisations. And ^A^C is a useful tool that can
    scrape off-screen text which gets blotted out if you try to scroll
    to it, i.e. requiring login or whatever to view the page.

    So selection of text in PDF files may strongly depend on viewer.

    Yes, most of the others I have will paste text that's as jumbled
    as raw pdftotext, eg evince, zathura. With mupdf, I don't even
    know how to copy, as the mouse just drags the page around.

    P.S. "pdftotext -layout" in some cases is better than without
    "-layout".

    I think the results are roughly comparable with my scrapings,
    for this document at least. Perhaps both pdftotext and xpdf
    rely on poppler to do the work.

    When a text file has properly aligned columns, instead of
    "quoting" some spaces, it may be better to add TAB characters at
    certain positions on each line. Perhaps LibreOffice Calc even has a GUI
    to select column widths when importing text files.

    Yes, gnumeric has that too. But I would hate to have a lot of
    mousework if I were repeating this frequently. And for a
    postprandial one-off, I just took a no-tools approach
    (barring an editor, of course).
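
    For a repeatable run without mousework, the column-splitting idea quoted above can be
    sketched in awk. This assumes "pdftotext -layout" style output where columns are separated
    by two or more spaces and single spaces occur only inside a cell; the function name
    first_last_csv and the sample rows are made up and are not taken from the USDA table:

```shell
#!/bin/sh
# Hypothetical helper: treat every run of two or more spaces as a column
# break and print the first and last columns as CSV. The first column is
# wrapped in double quotes because labels may contain commas; embedded
# double quotes are not escaped in this sketch.
first_last_csv() {
    awk -F '  +' '
        { sub(/^ +/, ""); sub(/ +$/, "") }   # trim padding so it is not counted as a column
        NF >= 2 { printf "\"%s\",%s\n", $1, $NF }
    '
}

# Demo on made-up aligned rows.
printf '%s\n' \
    'Food group        Age 1   Age 2-3   Age 51-70' \
    'Whole grains      0.10    0.20      1.50' \
    '  Dark green veg  0.05    0.07      1.20' |
    first_last_csv
```

    Splitting on runs of spaces rather than on fixed character positions keeps the sketch
    independent of the exact column widths, at the cost of breaking if a cell is empty.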

    Cheers,
    David.

  • From fxkl47BF@protonmail.com@21:1/5 to All on Fri Feb 21 23:40:01 2025
    in discussions about pdf utilities i don't recall atril being mentioned
    it's become my goto viewer

  • From tomas@tuxteam.de@21:1/5 to David Wright on Sat Feb 22 07:50:01 2025
    On Fri, Feb 21, 2025 at 03:59:55PM -0600, David Wright wrote:
    On Fri 21 Feb 2025 at 21:20:45 (+0000), debian-user@howorth.org.uk wrote:

    [...]

    I get:

    Access Denied
    You don't have permission to access "http://www.fns.usda.gov/cnpp/thrifty-food-plan-2006" on this server. Reference #18.dd831002.1740148075.35e89c97

    https://errors.edgesuite.net/18.dd831002.1740148075.35e89c97

    Perhaps it depends on browser settings (and which browser),
    or perhaps on where you are (your timezone is unknown), or
    perhaps on your ISP.

    ...or of whatever part of US administration The Musk and his child
    warriors happened to apply their wrecking ball to yesterday.

    SCNR
    --
    t


  • From songbird@21:1/5 to fxkl47BF@protonmail.com on Sun Feb 23 00:50:01 2025
    fxkl47BF@protonmail.com wrote:
    in discussions about pdf utilities i don't recall atril being mentioned
    it's become my goto viewer

    perhaps because it is normally a part of the MATE
    desktop?

    i've been using it for years and so far no major issues
    that i've noticed, but i'm also not doing very complicated
    things with it.


    songbird
