• Tilde search engine?

    From Andrew Singleton@singletona082@ctrl-c.club to tilde.meta on Sat Feb 22 12:18:34 2025
    Is that a thing? Like, google, except 'here we're only searching THESE
    places and only the places that are explicitly OK to be searched.'

    --- Synchronet 3.20a-Linux NewsLink 1.2
  • From say@say@tilde.club to tilde.meta on Sun Feb 23 02:08:33 2025
    On 2025-02-22, Andrew Singleton <singletona082@ctrl-c.club> wrote:
    Is that a thing? Like, google, except 'here we're only searching THESE
    places and only the places that are explicitly OK to be searched.'


    It might be cool. It's something I'm interested in.
  • From yeti@yeti@tilde.institute to tilde.meta on Sun Feb 23 10:27:30 2025
    say@tilde.club wrote:

    It might be cool. It's something I'm interested in.

    Maybe this one can be adapted:

    Internet search engine for text-oriented websites.
    Indexing the small, old and weird web.

    <https://github.com/MarginaliaSearch/MarginaliaSearch>
    <https://about.marginalia-search.com/>
    <https://marginalia-search.com/>

    So maybe Marginalia can add crawling the Tildeverse and someone can add
    gopher: and friends to their crawler?


    The owner of the (now closed) <https://emacs.ch> Mastodon instance ran a
    Gopher and Gemini crawler feeding his own search engine. I have no idea
    how to reach him now and don't even remember the name of that search
    engine. The domain name <https://emacs.ch> still exists, so maybe digging
    deeper from that starting point can help, or one of the remaining
    users in <ircs://irc.tilde.chat/emacs.ch> has his email address?
    --
    When dictatorship is a fact, revolution becomes a right. -- Victor Hugo
  • From xwindows@xwindows@tilde.club to tilde.meta on Mon Feb 24 20:58:26 2025
    On Sat, 22 Feb 2025, Andrew Singleton wrote:

    only the places that are explicitly OK to be searched

    Here are two close matches off the top of my head:

    - https://wiby.me/
    ^ Coverage stems from individual URLs nominated by users.

    - https://searchmysite.net/
    ^ Coverage stems from sites nominated by users, as well as sites
    paying to use it as their site-internal search engine.

    Tilde search engine?

    - I remember ~deepend has an AltaVista clone running at:
    https://notaltavista.com/
    ^ WWW/Gopher, coverage stems from tildes, a bit jank
    but might fit your use.

    - As ~yeti mentioned, there is also Marginalia.
    Marginalia has a tilde filter, but it seems to be based on
    just having `~` as the first symbol in the path section of the URL.

    I'll link to the classic non JavaS'creep-infected version
    of Marginalia here:
    https://old-search.marginalia.nu/
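
    The path-based heuristic described above can be sketched roughly as
    follows. This is only an illustration of "`~` as the first symbol in the
    path"; the function name `looks_like_tilde_page` is hypothetical, and
    nothing here is taken from Marginalia's actual code.

```python
from urllib.parse import urlsplit


def looks_like_tilde_page(url: str) -> bool:
    """Guess whether a URL is a tilde user page.

    Assumption: a page counts as "tilde" when its path starts with
    "/~" followed by at least one non-slash character.
    """
    path = urlsplit(url).path
    return path.startswith("/~") and len(path) > 2 and path[2] != "/"


if __name__ == "__main__":
    for url in (
        "https://tilde.club/~xwindows/",
        "https://wiby.me/",
        "gopher://tilde.club/1/~xwindows/",
    ):
        print(url, looks_like_tilde_page(url))
```

    Note that a check like this misses Gopher URLs, where the item type
    comes before the `~` in the path, and it says nothing about whether the
    host is actually a tilde server.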

    And as a tangent, for a general mid-sized search engine which has
    its own index [1], is free for public use [2], does not play shady ploys
    with `robots.txt`, is purely algorithm-based and untainted
    by machine-learning black-box techniques, and is not involved
    in the prose-laundering cartel, there is:

    https://www.mojeek.com/
    ^ English only.

    Which I currently use as my main search engine.
    (And you cannot search Reddit with it of course [3];
    but that wasn't the point of the original question)

    Regards
    ~xwindows


    [1] DuckDuckGo doesn't qualify, because they use results from Micro$oft Bing.
    Some people ask how I know for sure: it is because my WWW sites
    used to be searchable from DuckDuckGo several years ago.
    Then Micro$oft went all-in with their Open(washing)AI partnership
    and used Bing crawling data to feed their prose-laundering business;
    so I banned Bing from my sites (in both `robots.txt` and CIDR blacklists),
    while explicitly allowing DuckDuckGo (in both `robots.txt`
    and an IP whitelist which has greater precedence than the blacklist).
    And sure enough, within a month, my sites had all but disappeared
    from DuckDuckGo's search coverage.
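
    A setup of the kind described in [1] might look roughly like the sketch
    below. It is an assumption-laden illustration, not the actual
    configuration: the `robots.txt` body is made up, the user-agent tokens
    `bingbot` and `DuckDuckBot` are assumed to be the ones targeted, and the
    CIDR ranges are reserved documentation addresses rather than any
    crawler's real ranges. The helper names `robots_allows` and `ip_allowed`
    are hypothetical.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one crawler, explicitly allow another.
ROBOTS_TXT = """\
User-agent: bingbot
Disallow: /

User-agent: DuckDuckBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def robots_allows(agent: str, url: str = "https://example.org/index.html") -> bool:
    """Ask the parsed robots.txt whether `agent` may fetch `url`."""
    return parser.can_fetch(agent, url)


# Placeholder ranges (TEST-NET-2), NOT real crawler addresses.
ALLOW = [ipaddress.ip_network("198.51.100.16/28")]
DENY = [ipaddress.ip_network("198.51.100.0/24")]


def ip_allowed(addr: str) -> bool:
    """Whitelist takes precedence over blacklist; unlisted addresses pass."""
    ip = ipaddress.ip_address(addr)
    if any(ip in net for net in ALLOW):
        return True
    return not any(ip in net for net in DENY)


if __name__ == "__main__":
    print(robots_allows("bingbot"), robots_allows("DuckDuckBot"))
    print(ip_allowed("198.51.100.20"), ip_allowed("198.51.100.200"))
```

    The point of the footnote still stands with a setup like this: letting
    DuckDuckGo's own crawler in does not help if the results shown come from
    an index built by the crawler that was banned.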

    [2] Meaning Kagi doesn't qualify for my use; as a registration-required
    service, prolonged usage can result in a bubble effect one can't verify;
    and as a paid service, there is a huge risk of data association
    between search terms and the user's real-life identity as well.

    [3] https://archive.ph/GS2I0
    --
    xwindows' gallery of freely-licensed artworks
    https://tilde.club/~xwindows/ http://tilde.club/~xwindows/ gopher://tilde.club/1/~xwindows/
  • From Andrew Singleton@singletona082@ctrl-c.club to tilde.meta on Mon Feb 24 09:45:27 2025
    ...I am now going to have to update my bookmarks both on computer and
    what's shared on my websites. Thank you.
