NNTP Crawler for Secure Enterprise Search
=========================================

This is a sample crawler for crawling NNTP-Protocol sources, such as Usenet
news servers.

Files in this distribution:

  readme.txt           - this file
  build.bat            - Windows batch file to compile sources
  build.sh             - Unix/Linux shell script to compile sources
  nntpcrawler.jar      - precompiled jar file
  META-INF/MANIFEST.MF - manifest file, specifies location of a needed jar file
  app/crawler/nntp/NNTPCrawlerPluginManager.java - source for plugin manager
  app/crawler/nntp/NNTPCrawlerPlugin.java        - source for main plugin
  javadoc/*            - javadoc for the crawler

Conventions
===========

This document uses Windows-style folder paths. For UNIX platforms, substitute
"/" for "\" in paths.

Prerequisites
=============

This crawler is designed to work with SES 10.1.8. It will NOT work with
earlier versions.

Installing the crawler
======================

From compiled jar file
----------------------

Copy the supplied jar file to the \search\lib\plugins folder beneath the
Oracle SES installation folder.

From source
-----------

Edit the file build.bat (Windows) or build.sh (UNIX) and set the correct
ORACLE_HOME. The Oracle home is the location where Oracle SES is installed.

On UNIX, make the build.sh file executable:

  chmod +x build.sh

Run build.bat or build.sh from a command window, making sure that your
current working directory is the location of the .bat/.sh file. This will
compile the source and create the jar file in the correct directory.

If the code compiles cleanly, the output will look like this:

  added manifest
  adding: app/crawler/nntp/NNTPCrawlerPlugin.class(in = 5851) (out= 2867)(deflated 50%)
  adding: app/crawler/nntp/NNTPCrawlerPluginManager.class(in = 3039) (out= 1421)(deflated 53%)
  adding: app/crawler/nntp/NNTPHelper.class(in = 3899) (out= 1930)(deflated 50%)

You can use an IDE, such as Oracle JDeveloper, instead of the script files.

Installing the redirection JSP
------------------------------

Normally, when you click on a hitlist link in Oracle SES, a JSP file is
invoked which forces a redirect (HTTP code 302) to the Display URL. If the
display URL uses NNTP as the "scheme" part of the URL, this redirection does
not work in Internet Explorer (though it does in other browsers). Therefore
we provide a special JSP file for the NNTP crawler, which redirects via a
"Meta Refresh" instruction instead.

You must copy the JSP file to the correct directory.

On Windows:

  copy nntpredirect.jsp \oc4j\j2ee\OC4J_SEARCH\applications\search_query\query\nntpredirect.jsp

On UNIX:

  cp nntpredirect.jsp /oc4j/j2ee/OC4J_SEARCH/applications/search_query/query/nntpredirect.jsp

Note that the client browser must support the nntp:// protocol and launch an
appropriate "helper" application when such a link is clicked. For Internet
Explorer on Windows, this will normally be Microsoft Outlook or Outlook
Express.

Registering the crawler
-----------------------

In the SES admin screens, go to the Home -> Global Settings -> Sources /
Source Types page and click "Create". Enter the following information:

  Name:        Newsgroups Crawler (or your own source type name)
  Description: Crawler for NNTP Protocol (or any description)
  Plug-in Manager Java Class Name: app.crawler.nntp.NNTPCrawlerPluginManager
  Plug-in Jar File Name:           nntpplugin.jar

Click "Next", and then "Finish".

On the Home -> Sources page, you will see a new source type "Newsgroups
Crawler" added to the drop-down list next to the "Create" button. You can use
this to create new newsgroup crawls.

Restricting the list of Groups to be Indexed
============================================

The crawler parameters accept regular expressions for included or excluded
groups. Note that these are full Java regular expressions, matched against
the group name: to include all groups with "oracle" in the group name, you
must enter ".*oracle.*" (without the quotes), not just "oracle".

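The difference between "oracle" and ".*oracle.*" can be seen in a few lines
of Java. This is only a sketch: the class and method names are hypothetical
and are not taken from the crawler source, and it assumes the crawler tests
the pattern against the whole group name, as java.util.regex.Pattern.matches
does.

```java
import java.util.regex.Pattern;

public class GroupPatternSketch {

    // Hypothetical helper (not part of the crawler): returns true only when
    // the regular expression matches the ENTIRE group name.
    public static boolean matchesWholeName(String regex, String groupName) {
        return Pattern.matches(regex, groupName);
    }

    public static void main(String[] args) {
        String group = "comp.databases.oracle.server";

        // A bare word must equal the whole name, so this does not match.
        System.out.println(matchesWholeName("oracle", group));      // false

        // Leading and trailing ".*" allow any text around "oracle".
        System.out.println(matchesWholeName(".*oracle.*", group));  // true
    }
}
```
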
An empty include string indicates that all groups should be indexed unless
they match the exclude list. If a group matches both the include and the
exclude strings, it will be excluded.

Suggestions for further Enhancements
====================================

The crawler does not accept a username and password for accessing protected
NNTP sources. Adding this support would be a useful enhancement.

------------------------------------------------------------------------------
Last edited 20 February 2007 by Roger Ford roger.ford@oracle.com