Search Improvements

SaintMagoo December 28, 2010, at 3:24 AM: Thanks, Peter. At the moment we would hate to have to usher-on a database. One of the reasons why we love PmWiki is that we did not have to mess with that.

Thanks anyway,

-Rn :)


If you're interested, I had a pretty good start towards replacing .pageindex with an SQLite database. The problem I ran into was simply a question of optimization: every search I threw at it worked better with a simple full scan of the text file (.pageindex) than with the indexed database... Let me know if you'd be interested in the code. Peter Bowers December 26, 2010, at 04:38 PM


SaintMagoo December 26, 2010, at 10:44 AM: Thanks, Peter. Impressive - most impressive :)

Christmas idea: like .pageinfo, how about creating another two files? The first, containing only file names, would be an index that registers each page that has been processed. The second, .pageinfo2, could then use the registered page's index number in a sorted word-list dictionary. Kept sorted and binary searched, imo such a search system could easily replace and hyper-speed up the present searching subsystem?

The only drawback here is the relative speed of growing the dictionary file: inserting new words could be slow. However, if the dictionary is also indexed, then the main dictionary need not be sorted per se; only the dictionary index file needs to be sorted. Inserting a mere integer + offset into the dictionary's index when a new word is added would then speed up dictionary growth a lot?
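A minimal sketch of the page registry plus sorted, binary-searched word dictionary described above, kept in memory rather than in files. The class name `WordIndex` and its methods are hypothetical and are not part of PmWiki; splitting on whitespace stands in for real tokenizing.

```python
import bisect


class WordIndex:
    """Toy in-memory version of a page registry and a sorted word dictionary."""

    def __init__(self):
        self.pages = []      # registry: page number -> page name
        self.page_ids = {}   # page name -> page number
        self.words = []      # dictionary, always kept sorted
        self.postings = []   # postings[i] = page numbers containing words[i]

    def register(self, name):
        """Give each page a stable index number the first time it is seen."""
        if name not in self.page_ids:
            self.page_ids[name] = len(self.pages)
            self.pages.append(name)
        return self.page_ids[name]

    def add_page(self, name, text):
        """Index every distinct word of a page under its registry number."""
        page_id = self.register(name)
        for word in set(text.lower().split()):
            i = bisect.bisect_left(self.words, word)  # binary search
            if i == len(self.words) or self.words[i] != word:
                # New word: insert at the sorted position.
                self.words.insert(i, word)
                self.postings.insert(i, set())
            self.postings[i].add(page_id)

    def search(self, word):
        """Return the names of pages containing the word, found by binary search."""
        word = word.lower()
        i = bisect.bisect_left(self.words, word)
        if i < len(self.words) and self.words[i] == word:
            return sorted(self.pages[p] for p in self.postings[i])
        return []
```

Inserting into the sorted list is the slow step the post worries about. This sketch also never removes stale postings when a page is re-indexed with different text, which a real implementation would need to handle.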

I am considering writing this beastie. Even (worst case) for Cron use, when a page-file mtime is greater than the registry-file mtime, it is a sign that a page or three might need to be re-indexed. Should be fun.
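The mtime comparison mentioned for cron use could be sketched as follows. The function name `stale_pages` and the idea of a single registry file are assumptions made for illustration; the check only tells you which page files changed after the registry was last written.

```python
import os


def stale_pages(page_dir, registry_path):
    """Return page files whose mtime is newer than the registry file's mtime."""
    if os.path.exists(registry_path):
        registry_mtime = os.path.getmtime(registry_path)
    else:
        # No registry yet: every page counts as needing indexing.
        registry_mtime = float("-inf")
    stale = []
    for name in sorted(os.listdir(page_dir)):
        path = os.path.join(page_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) > registry_mtime:
            stale.append(name)
    return stale
```

A cron job could call this periodically, re-index only the returned pages, and then rewrite the registry file so its mtime moves forward.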

.02 ends


Try GoogleSearch. Peter Bowers December 25, 2010, at 03:02 PM


SaintMagoo December 24, 2010, at 09:30 AM: We just uploaded 200,000 pages to our PmWiki. Not too surprisingly, searching is now glacial. After taking a peek at .pageinfo we see that it is a listing of pages, with keywords. Are there any plans afoot to do something a little more 'Googlie'?

In the meantime, aside from deleting .pageinfo, is there an easy way to turn the search feature off? Found it: just remove it from the forms in the .tmpl. One might also want to set $EnablePageIndex = 0;

Thanks,

-Rn


lordmundi November 06, 2007, at 10:54 AM: Just to add to the discussion below, I thought I would put a link to a sample search result on Renato's site with Sphider integrated:

lordmundi November 05, 2007, at 09:05 AM: Wow... I really like the Sphider integration you did Renato!! It looks great. Looking at the sphider site, this looks like it could be a great cookbook recipe for pmwiki. I'm wondering how you or someone else might do the following:

  • How do you implement permissions in the search? For example, my search results don't list pages that people aren't authorized to read. This would be even more important since these search results actually show text from the page. I'm guessing each search result would just need to get wrapped in a (:if auth read:) or something more customizable. Did you do this?
    • I've edited the config.php file, excluding certain categories (with lines such as $SearchPatterns['default'][] = '!^Site\.!';) as the PmWiki documentation says. Also, Sphider can avoid URLs-including-any-string-you-want and I've used that too.
    • lordmundi November 06, 2007, at 10:51 AM: That's cool. That lets you ignore certain areas based on location. But what I'm wondering is how you list only the pages that the currently logged-in user has permission to read. For example, let's say PM adds some permission to a special page in the cookbook he doesn't want users looking at. When another user is logged in and searches, how does the page show up in the search results only if they are authorized to read it? All that is needed is a way to insert user code around each result listing. If this feature is there, then we could just have a template that wraps each result from the search with a (:if auth read:) call. Hopefully this makes some sense.
  • Looking through the Sphider documentation, I didn't see how the software "re-crawls" the site. Is there supposed to be a cron job or something to make the robot crawl all the pages? Or did you add this to your pmwiki recipe so that it re-crawls after some sort of timeout, kind of like the notify emails that pmwiki uses?
    • I guess anyone could easily code a quick bot to re-crawl your site after a few hours/days/weeks (as you wish), but I haven't done that. Not yet, at least, since college has been taking a lot of my time and I'm focused on other coding on my site, right now. So, for now, you have to re-crawl it yourself, which takes about two or three clicks any time you want.
    • Just to make this clear, I'm pretty lame at handling the PmWiki code. I understand close to nothing so far, just what's necessary to make it work. As I've stated below (lol), I MAY have done a few more steps, but I can't really remember. I've used the HttpVariables recipe (so I could get the query string from the URL inserted), Sphider, and a little JavaScript (so I can shrink the results page (iframe) when there are few results). I'm sure there's someone more skilled than me to do the job, but I'll contribute with anything I can. :)
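The per-result permission check asked about in the list above amounts to filtering the hits from the external engine before they are displayed. A rough sketch, in which the `READ_ACL` table, `can_read`, and `visible_results` are all hypothetical stand-ins for whatever authorization check the wiki really performs:

```python
# Hypothetical permission table: page name -> users allowed to read it.
# Pages not listed here are treated as public.
READ_ACL = {"Cookbook.Secret": {"pm"}}


def can_read(user, page):
    """Stand-in for the wiki's own read-authorization check."""
    allowed = READ_ACL.get(page)
    return allowed is None or user in allowed


def visible_results(results, user, check=can_read):
    """Keep only the search hits that the given user may read."""
    return [hit for hit in results if check(user, hit["page"])]


hits = [
    {"page": "Cookbook.Sphider", "excerpt": "Sphider integration notes"},
    {"page": "Cookbook.Secret", "excerpt": "restricted draft"},
]
```

Because the excerpt text comes from the page itself, the filter has to run before any excerpt is rendered, not after.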

All in all, it looks really nice. -- FG
Thanks! And I'm sorry I've messed with your original post, but I guess this was going to be clearer than doing a new "post" above that one. So: my comments are in purple so as not to mess it ALL up. lol Renato

08/31/07 - Renato - Okay, six months later, I think I've got an idea. I've been playing with Sphider since yesterday. I could implement the search feature in one night (I had some difficulty figuring out how to get the results INSIDE the main "window" on PmWiki; most of the PmWiki code is Greek to me)... I'm having only a small bug with the "Did you Mean" feature (it gets weird when I use capital letters), but other than that (or if you disable it), the search engine is running ok. I'll try to solve that by tonight, but I can't make any promises. And I've messed with a lot of code, randomly, so I'll have to check it all to discover what I've done. :P (yeah, I should start writing a change history...)

Oh, anyone can take a look at it on my site, if you don't mind reading Portuguese. :P Good keyphrases are "guitarra elétrica", "symphony x", "steve howe". They will give you the idea. If you want to know what the bug is, search for "Stevee". I've changed the code so as to take everything in lowercase. If that's ok with PmWiki, it will be an option. :)

Henning July 18, 2007, at 12:38 PM: It just occurred to me that it would be nice to have a search engine that on request excludes pages older than a certain date from the result (in order to concentrate on recent content). Just brainstorming ...
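The date cutoff suggested here can be sketched as a simple filter over results that carry a last-modified time. The function name `recent_only` and the `modified` field are assumptions for illustration only.

```python
from datetime import datetime


def recent_only(results, cutoff):
    """Drop results whose last-modified time is older than the cutoff."""
    return [r for r in results if r["modified"] >= cutoff]


results = [
    {"page": "Main.OldNews", "modified": datetime(2005, 1, 1)},
    {"page": "Main.Fresh", "modified": datetime(2007, 7, 1)},
]
recent = recent_only(results, datetime(2007, 1, 1))
```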

Henning February 22, 2007, at 10:40 AM: I'd be interested in a solution for multiple buttons, too. I've seen multiple search buttons used in a non-wiki CMS, and it looks like an efficient user interface device I'd like to copy.

02/07/07 - Renato - The tips on this thread (PmWikiUsers:2006-October/034807.html) are great for those who want to search only page titles. Is it possible to have two buttons (Go/Search, as in MediaWiki, for instance)? One for searching titles and the other one for searching content?

12/14/06 - (:searchresults:) can be customized by editing page Site.Search, see also Search for pages.

6/5/06 - I totally understand the frustration with PmWiki's search results... But perhaps the issue has come to enough of a head that it's time for me to go ahead and implement a valid way to excerpt (and possibly rank) search results, even if it's very suboptimal in a number of respects. Most notably, it will be suboptimal in terms of speed -- every task and option we add to searching/page lists makes it run even slower than it does now.

I think I need to remind the group that PmWiki is not a search engine, has never been designed to be a search engine, and I have no intent to make it one. My stance on searching continues to be that if a site wants fast searches with relevance ranking of results and excerpted text outputs, then get a "real" search engine that is designed for such tasks and let it index the PmWiki site. (Bonus: such an engine can index and search things that aren't wiki pages, such as attachments or other static pages on the site.)

I should also point out that any author can create a custom search page on pmwiki.org, it doesn't require me to do it. For example, to have a search page that defaults to fmt=#title for its output, just create a page that looks something like:

(:searchbox:)

(:searchresults fmt=#title order=title:)

See, for example, http://www.pmwiki.org/wiki/Test/SearchByTitle . Then use that custom search page for searching instead of the PmWiki default.

Still, I'll see if I can write up a page variable in the very near future, as well as an order=rank option.

Pm
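For readers wondering what the excerpting and ranking discussed in the 6/5/06 post might involve, here is a rough sketch. It is not PmWiki's implementation: the function name `rank_and_excerpt`, the occurrence-count score, and the fixed excerpt width are all assumptions.

```python
import re


def rank_and_excerpt(pages, term, width=30):
    """Rank pages by how often the term occurs and cut an excerpt around the first hit."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    results = []
    for name, text in pages.items():
        matches = list(pattern.finditer(text))
        if not matches:
            continue
        first = matches[0]
        start = max(0, first.start() - width)
        end = min(len(text), first.end() + width)
        results.append({
            "page": name,
            "score": len(matches),         # crude relevance: number of occurrences
            "excerpt": text[start:end],    # text surrounding the first occurrence
        })
    # Highest score first; ties broken by page name for a stable order.
    results.sort(key=lambda r: (-r["score"], r["page"]))
    return results
```

Even this simple version reads every page's full text on every query, which illustrates the speed concern raised in the post.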


Also visit PmWiki.Search for a documented custom search page.


2/1/06 - A lot of people continue to ask for improvements to PmWiki's search capabilities. In the past I've essentially taken the position that "PmWiki is not a search engine", and that using another search engine package (one that is optimized for performing searches) would be much better than me trying to build one of my own.

The pmwiki.org site is starting to become so heavily used that I probably need to set up a search engine there, if only to help keep the server load down. Does anyone have any suggestions for a good, easy-to-install search engine package?

The two I've looked at in detail in the past include:

ht://Dig -- I've used this several times in the past for other
   projects, but it doesn't appear to be actively maintained
   anymore, and integrating it to PmWiki would be slightly kludgey.

swish-e -- I did a few experiments with this and concluded that
   it could be made to work, but curiously it seems to lack any
   sort of convenient "excerpting" capability.  (I could probably
   live without this.)

I also briefly looked at mnoGoSearch, but for some reason I didn't think it was a good fit with what I'm trying to do.

Any suggestions?


6/13/05 - PmWiki's search engine scans the markup text directly, not the page's rendered output.

However, I have been playing with a notion for page caching that might make it possible for PmWiki's search to also scan the rendered version of the text, so maybe we could go that way... :-)


4/15/05 - I've always maintained that PmWiki *isn't* a search engine, and for advanced searches a site is much better off integrating an existing search engine package rather than us trying to reinvent that particular wheel.

Still, there are times when it may be useful to provide teasers to things that aren't "searches". Most search engines have no clue of PmWiki's structures such as groups, trails, or categories, and so being able to provide teaser information in the context of those structures still makes a lot of sense.


6/13/04 - But your point is well taken. I never really thought of searching for markup sequences. :-)

> I actually do that now and then, so I think we actually have to implement
> our own search engine.

Well, I wasn't planning to eliminate the search engine, either. I've just felt that once a basic search capability is available that meets the needs of most PmWiki users, my time and effort is better spent on other aspects of PmWiki and not reinventing search engines that already exist.


(Old content added to this page before Pm ever got a chance to write anything.)

After Pm made this empty entry, I shamelessly hijacked it to think aloud, maybe spark ideas in others :) I'll presently move these scribbles to PITS entries.

  • If not included in the core, maybe we could at least have a recipe for excluding the current page from searches (and pagelists, for that matter). I (unsuccessfully) tried adding $SearchPatterns['normal'][] = "!\.$Name$!";
    Try:
   $SearchPatterns['normal'][] = "!^$FullName\$!";
  • add (different) format arguments that would
    • add sorting to searchresults and pagelist (by file date or even other values definable in local/config; something like the fields in PITS, Field: value, one field per line, anywhere in the file)
      Already on the design books -- just haven't implemented it yet. --Pm
    • suppress the 'search report' at the top, e.g. 'Results of search for group=GName fmt=simple list=normal :'
      Use (:pagelist:) instead of (:searchresults:) for this.
    • suppress the 'search result' at the bottom, e.g. '3 pages found out of 27 pages searched'
      Use (:pagelist:) instead of (:searchresults:) for this.

-Radu March 11, 2005, at 01:43 PM


Radu March 14, 2005, at 11:25 PM
But (:pagelist:) does not allow SearchString ... or does it? It's definitely not documented under Page Directives

Pico March 27, 2006, at 03:51 PM
Apparently it does. Take a look at PageLists for more about pagelist (and searchresults). As for Page Directives, that needs work and is categorized for Documentation To Do.

This page may have a more recent version on pmwiki.org: PmWiki:SearchImprovements, and a talk page: PmWiki:SearchImprovements-Talk.