Plugin for processing XML files in Secure Enterprise Search
-----------------------------------------------------------

[If reading this file in Notepad, please switch on Word Wrap
from the Format menu.]


For more information about this plugin, please read the Javadoc for XMLCrawlerPluginManager.
(Open javadoc/index.html in a web browser).

Files/directories in this distribution
--------------------------------------

javadoc:        directory containing Javadoc documentation. Open index.html in your browser.
README:         this file
rss20.xsl:      sample XSLT file for processing RSS feeds
src.zip:        source code for plugin
xmlplugin.jar:  compiled version of plugin

How to Deploy or Build
----------------------

To deploy from the jar file:

Copy xmlplugin.jar to the $ORACLE_HOME/search/lib/plugins directory, where ORACLE_HOME is the installation directory for Secure Enterprise Search.

To build from source on Linux:
(You only need to build from source if you wish to make changes to the plugin.)

Extract the source files from the zip file:

unzip src.zip

Compile the source:

$ORACLE_HOME/jdk/bin/javac -classpath $ORACLE_HOME/lib/xmlparserv2.jar:$ORACLE_HOME/search/lib/search.jar app/crawler/xml/*.java

Create the jar file:

$ORACLE_HOME/jdk/bin/jar cvf $ORACLE_HOME/search/lib/plugins/xmlplugin.jar app/crawler/xml/*.class

Commands on other systems may vary. You can also build the jar file in your favorite IDE, such as JDeveloper.

To create a Source Type for the plugin
--------------------------------------

Go to the SES admin pages:

  Global Settings -> Sources -> Source Types -> Create

Name:                             A name of your choice (e.g. "XML Source Type")
Description:                      A comment
Plug-in Manager Java Class Name:  app.crawler.xml.XMLCrawlerPluginManager
Plug-in Jar File Name:            xmlplugin.jar

Then click on "Finish"

An example - Indexing CNN News via an RSS feed
----------------------------------------------

An XSLT file is provided with this package. It can be used to index XML files which are part of an RSS (Really Simple Syndication) feed.
These are available from many sources, but we will use CNN News Top Stories as an example.

Please refer to CNN's Terms of Use before following these steps:
http://www.cnn.com/services/rss/#terms

1/ Ensure that you have created the Source Type "XML Source Type" as described above

2/ Ensure that any needed HTTP proxies are set (Admin screens: Global Settings -> Proxy Settings)

3/ Make sure that the supplied rss20.xsl file is available from a file: URL on the SES server (or from an HTTP URL on any server if you prefer)

4/ Go to the Admin screen, choose Home -> Sources, select the Source Type "XML Source Type", and click [Create]

5/ Complete the following fields:
     Source Name:          CNN Top Stories
     XML Root:             http://rss.cnn.com/rss/cnn_topstories.rss
     XSLT Stylesheet File: [File URL for rss20.xsl on your server], e.g. file:///home/user/dir/XMLPlugin2/rss20.xsl or file:///C:/XMLPlugin/rss20.xsl

6/ Create the source, then monitor its progress from Home -> Schedule.

7/ That's it! You should now be able to search on CNN's top news stories.
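
Steps 3 and 5 above need a file: URL for rss20.xsl. As a small sketch (not part of the plugin), the following Java program prints the file: URL for a local path. The class name FileUrl and the default path are made up for illustration. It uses java.nio.file, which requires Java 7 or later and may therefore need a newer JDK than the one bundled with SES.

```java
import java.nio.file.Paths;

// Hypothetical helper, not part of the plugin distribution.
class FileUrl {

    // Resolve the path against the current directory if it is relative,
    // then render it as a file: URL.
    static String toFileUrl(String path) {
        return Paths.get(path).toAbsolutePath().normalize().toUri().toString();
    }

    public static void main(String[] args) {
        // Default to rss20.xsl in the current directory (an assumption).
        String path = args.length > 0 ? args[0] : "rss20.xsl";
        System.out.println(toFileUrl(path));
    }
}
```

Run it from the directory holding rss20.xsl, or pass the path as an argument, and paste the printed URL into the XSLT Stylesheet File field.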

Notes
-----

1/ To enable the logger.debug() output, edit the file $ORACLE_HOME/search/data/config/crawler.dat and, in SYSTEM_PROPERTIES, change:
-Doracle.search.logLevel=4
to
-Doracle.search.logLevel=2
Alternatively, use logger.info(), which will always write to the log file.

2/ If your XSLT file is not correct, you will likely not get useful output from the plugin. You can check your XSLT using a tool such as XMLSpy from Altova, or a command-line tool. PHP 5 can perform XSLT transformations (when built with the xsl extension) using a suitable script such as this one:
http://alanstorm.com/2005/projects/xslt
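
Another option for checking a stylesheet is a few lines of Java using the standard javax.xml.transform API. The following is a minimal sketch, not part of the plugin. The class name XsltCheck, the sample feed, and the sample stylesheet are all made up for illustration. The sample stylesheet is NOT the supplied rss20.xsl and its output is not the format the plugin expects.

```java
import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of the plugin distribution.
class XsltCheck {

    // Made-up two-item RSS 2.0 feed, used when no files are given.
    static final String SAMPLE_RSS =
        "<rss version=\"2.0\"><channel><title>Sample</title>"
      + "<item><title>First story</title><link>http://example.com/1</link></item>"
      + "<item><title>Second story</title><link>http://example.com/2</link></item>"
      + "</channel></rss>";

    // Made-up stylesheet: prints "title|link" on one line per item.
    static final String SAMPLE_XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:for-each select=\"rss/channel/item\">"
      + "<xsl:value-of select=\"title\"/><xsl:text>|</xsl:text>"
      + "<xsl:value-of select=\"link\"/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template></xsl:stylesheet>";

    // Apply the stylesheet text to the XML text and return the result.
    // A stylesheet that cannot be compiled raises a TransformerException.
    static String transform(String xml, String xsl) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Usage: java XsltCheck <xml file> <xsl file>
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File(args[1])));
            t.transform(new StreamSource(new File(args[0])),
                        new StreamResult(System.out));
        } else {
            System.out.print(transform(SAMPLE_RSS, SAMPLE_XSL));
        }
    }
}
```

To check a real stylesheet, save a copy of the feed locally and pass both file names as arguments, for example a saved copy of the RSS feed followed by rss20.xsl. If the output is empty or an exception is reported, the stylesheet probably needs attention before it is used with the plugin.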


Last Modified by Roger Ford (roger.ford@oracle.com) on 16-May-2006