• [gentoo-user] Emails are not indexable

    From Vitaly Zdanevich@21:1/5 to All on Mon Jul 8 17:10:01 2024
    Hi, I tried to google a few sentences from this email list as an
    "exact match" - and found nothing. For example, this mirror:
    https://marc.info/?l=gentoo-user&m=171984189706185&w=2 - nothing in
    Google. Is it excluded from search? This is bad, because people google
    problems that are already solved in these emails :(

    Is it a known issue?


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael@21:1/5 to All on Mon Jul 8 18:41:40 2024
    On Monday, 8 July 2024 16:07:59 BST Vitaly Zdanevich wrote:
    > Hi, I tried to google in "exact match" a few sentences from this email
    > list - and nothing found. For example this mirroring
    > https://marc.info/?l=gentoo-user&m=171984189706185&w=2 - and nothing in
    > Google. Is it excluded from search? This is bad, because people google
    > problems that are already solved in these emails :(
    >
    > Is it a known issue?

    It depends on what Google and other web crawlers have decided to list in
    their search results and what to exclude. You should be able to search
    within the content of a single website, but only if Google has already
    indexed it. For example, say you want to find posts about blurred fonts in
    a Gentoo M/L, but not Debian, hosted on marc.info. You can search Google
    like so:

    "blurred fonts" +gentoo -debian site:marc.info

    This should include Gentoo M/L posts about blurred fonts found in the
    marc.info domain, but exclude any Debian-related search results.
    Unfortunately, this only works for pages Google has already indexed;
    anything it has not indexed will not show up.
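
The query above can also be assembled programmatically. A minimal sketch in Python, assuming Google's public `https://www.google.com/search?q=` URL form. The helper name `google_search_url` and its parameters are hypothetical, and the `+gentoo` term is kept exactly as written in the message, without any claim about how Google currently treats a leading `+`.

```python
from urllib.parse import urlencode


def google_search_url(phrase, site, include=(), exclude=()):
    """Build a Google search URL for an exact phrase restricted to one site."""
    terms = [f'"{phrase}"']                      # exact-match phrase
    terms += [f"+{word}" for word in include]    # terms written with a leading '+'
    terms += [f"-{word}" for word in exclude]    # terms to exclude
    terms.append(f"site:{site}")                 # restrict results to one domain
    return "https://www.google.com/search?" + urlencode({"q": " ".join(terms)})


url = google_search_url("blurred fonts", "marc.info",
                        include=["gentoo"], exclude=["debian"])
print(url)
```

As the message notes, a well-formed query only helps if the pages are in Google's index to begin with.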


  • From Vitaly Zdanevich@21:1/5 to Michael on Tue Jul 9 12:10:01 2024
    In https://marc.info/robots.txt I see

    User-agent: *
    Disallow: /

    It looks bad.


  • From Hank Leininger@21:1/5 to Vitaly Zdanevich on Fri Jul 26 00:30:01 2024
    [ Originally sent on 2024-07-09 but it never made it to the list,
    probably because I am not subscribed. ]

    On 2024-07-09, Vitaly Zdanevich wrote:
    > In https://marc.info/robots.txt I see
    >
    > User-agent: *
    > Disallow: /
    >
    > It looks bad.

    You had to scroll down quite a bit to get there. The very top of the
    file is:

    User-agent: Googlebot
    Allow: /
    Disallow: /?*s=*
    Disallow: /?*a=*

    [ Followed by similar stanzas for some specifically enumerated bots. ]

    Meaning: Google can index everything on MARC, except for searches
    (because of both load and transient nature of the results, even though
    we do those as GETs so the _browser_ feels free to cache them and for
    link color goodness, etc.) and lists of messages by author (because we
    want some kind of throttling on MARC's value for OSINT and spam
    harvesting).
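
The effect of the two stanzas can be illustrated with a small matcher. This is a sketch, not MARC's or Google's actual implementation: it assumes the longest-match rule and the `*` wildcard described in RFC 9309, ignores the `$` anchor and percent-encoding, and looks up the agent by exact name. The search and author-list URLs in the usage lines are made-up examples of the `s=` and `a=` patterns.

```python
import re

# Rules as quoted in the thread: a Googlebot stanza at the top of the file
# and a catch-all stanza further down.
ROBOTS = {
    "googlebot": [("allow", "/"), ("disallow", "/?*s=*"), ("disallow", "/?*a=*")],
    "*": [("disallow", "/")],
}


def matches(pattern: str, target: str) -> bool:
    """True if the robots.txt pattern matches the start of path-plus-query."""
    # '*' stands for any run of characters; all other characters are literal.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, target) is not None


def is_allowed(agent: str, target: str) -> bool:
    """Apply the stanza for `agent`, falling back to the '*' stanza."""
    rules = ROBOTS.get(agent.lower(), ROBOTS["*"])
    best_len, allowed = -1, True  # no matching rule means the URL is allowed
    for verdict, pattern in rules:
        if not matches(pattern, target):
            continue
        is_allow = verdict == "allow"
        # The longest matching pattern wins; on a tie, allow is preferred.
        if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
            best_len, allowed = len(pattern), is_allow
    return allowed


message = "/?l=gentoo-user&m=171984189706185&w=2"
print(is_allowed("Googlebot", message))                    # True
print(is_allowed("SomeOtherBot", message))                 # False
print(is_allowed("Googlebot", "/?l=gentoo-user&s=fonts"))  # False (hypothetical search URL)
print(is_allowed("Googlebot", "/?a=12345"))                # False (hypothetical author URL)
```

Under these assumptions an individual message URL is crawlable by Googlebot, so robots.txt alone does not explain why a given message is missing from the index.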

    The problem is, Google _won't_ index everything. It thinks the number of
    unique pages on MARC is unreasonable (over 100 million messages, to say
    nothing of links to individual MIME attachments, list-by-date views,
    messages-in-thread, etc.). Google has only crawled a small percentage of
    that, and only indexed a portion of the pages it has crawled. There's no
    explanation of why, and you're actively discouraged from resubmitting
    "crawled but not indexed" pages (not practical for millions of URLs
    anyway).

    I used to generate sitemap XML files and feed them to googlebot so that
    it would be encouraged to come and get it. But it would/could never keep
    up with the volume of new data (~300k messages/month?), meanwhile the
    existing data it did have would get evicted from indexes with no
    explanation. It probably wouldn't hurt to try uploading fresh ones
    (other than the time it would cost me) but I don't have any confidence
    it would help, either.
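
For readers unfamiliar with sitemap files, here is a minimal sketch of generating them in Python. It is not the script MARC used. It assumes only the public sitemaps.org format (a `urlset` of `url`/`loc` entries, at most 50,000 URLs per file), and the function name and the message IDs in the usage lines are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemaps.org protocol


def build_sitemaps(urls):
    """Return a list of sitemap XML documents (bytes), each within the URL limit."""
    documents = []
    for start in range(0, len(urls), MAX_URLS_PER_FILE):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + MAX_URLS_PER_FILE]:
            # ElementTree escapes '&' in the URL as '&amp;' when serializing.
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        documents.append(
            ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
        )
    return documents


# Hypothetical message IDs, only to show the chunking.
urls = [f"https://marc.info/?l=gentoo-user&m={n}&w=2" for n in range(100_001)]
docs = build_sitemaps(urls)
print(len(docs))  # 3
```

At roughly 300k new messages a month, the 50,000-URL limit alone means at least six new sitemap files every month, before counting the existing archive.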

    _Maybe_ it would help convince Google to keep Gentoo content in MARC
    indexed if each list we archive was individually linked in their entries
    at https://www.gentoo.org/get-involved/mailing-lists/all-lists.html ,
    but I have no actual evidence or indication that that's the case (nor am
    I indirectly asking for such a change to be made, because again I don't
    know that it would do any good).

    Also:

    On Monday, 8 July 2024 16:07:59 BST Vitaly Zdanevich wrote:
    > list - and nothing found. For example this mirroring
    > https://marc.info/?l=gentoo-user&m=171984189706185&w=2 - and
    > nothing in Google. Is it excluded from search? This is bad, because
    > people google problems that are already solved in these emails :(

    Any given page might, in fact, be excluded by Google on purpose, and I'm
    not supposed to be able to find out if it is[1].

    Google seems to be quick to act on GDPR requests and the like, which is
    nice overall. They do so by excluding certain contested search results
    when the search comes from a covered country. So if someone in the EU
    comments in a public email thread and later decides they want their name
    to disappear, they can cause their message and any that quote them to be
    suppressed - when searches originate from the EU (simplified, I am not a
    lawyer, etc.).

    Google used to report which URLs were being removed from searches, but
    determined that that was itself an information leak they could not
    abide, so for years now when they send a webmaster a "Notice of European
    data protection law removal from Google Search" it says "to comply with
    developments in European law, which seek to prevent the identification
    of the requester, we are no longer disclosing the affected URLs".

    I see the rationales and don't object to them, but the result still kind
    of sucks.

    [1] Of course it should be possible to, say, use VPNs to evaluate the
    results of searches coming from different sources, but I'm not gonna.

    [ No comment on the other message in the thread by Michael/confabulate@
    other than, yes, 100% all of that. ]

    Thanks,

    --

    Hank Leininger <hlein@marc.info>
    CDFC 40DD 6B1D E176 8E84 A243 8FC6 9C04 40FD 2D11

