Data Provider Documentation
Overview
The jOAI data provider allows XML files from a file system to be exposed as items in an OAI data repository and made available for harvesting by others using the OAI-PMH. After pointing the software to one or more file directories, the software monitors the XML files inside, adding, updating or deleting them from the OAI repository as files are added, updated or deleted from the directories. Remote harvesters that monitor the OAI data repository can effectively mirror the files or harvest them as needed. jOAI can provide any XML format as long as the XML in the file is well formed.
The jOAI data provider implements protocol version 2.0. It uses resumption tokens for flow control in the ListIdentifiers and ListRecords responses, supports selective harvesting by date or set, and provides gzip response compression and other protocol features.
See the Data Provider FAQ for additional information.
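From the harvester's side, resumption tokens and selective harvesting look roughly like the following Python sketch. It is illustrative only: the function names are hypothetical, and the `fetch` callable is an assumption introduced so that the loop can be tried without a live server. The request arguments (`verb`, `metadataPrefix`, `from`, `until`, `set`, `resumptionToken`) are those defined by OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_request(base_url, metadata_prefix=None, from_date=None,
                  until_date=None, set_spec=None, resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH 2.0, resumptionToken is an exclusive argument: when it is
    present, only the verb accompanies it.
    """
    params = {"verb": "ListRecords"}
    if resumption_token is not None:
        params["resumptionToken"] = resumption_token
    else:
        if metadata_prefix is None:
            raise ValueError("metadataPrefix is required on the first request")
        params["metadataPrefix"] = metadata_prefix
        if from_date:
            params["from"] = from_date      # selective harvesting by date
        if until_date:
            params["until"] = until_date
        if set_spec:
            params["set"] = set_spec        # selective harvesting by set
    return base_url + "?" + urlencode(params)


def harvest(base_url, fetch, **selectors):
    """Yield record identifiers, following resumption tokens until none remain.

    fetch is a hypothetical callable that takes a URL and returns the
    response XML as a string.
    """
    url = build_request(base_url, **selectors)
    while True:
        root = ET.fromstring(fetch(url))
        for header in root.iter(OAI_NS + "header"):
            yield header.findtext(OAI_NS + "identifier")
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty resumptionToken marks the end of the list.
        if token is None or not (token.text or "").strip():
            break
        url = build_request(base_url, resumption_token=token.text.strip())
```

In practice `fetch` could wrap `urllib.request.urlopen`. The base URL of a particular installation is not assumed here and must be supplied by the caller.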
Data provider setup
There are five steps necessary to make metadata files available through the jOAI data provider:
1. Install the jOAI software on a system in a servlet container such as Apache Tomcat.
See INSTALL.md for installation instructions. If you are reading this page, this step has most likely been completed.
2. Complete the Repository Information by clicking 'Edit repository info' in the Repository Information and Administration page and then:
- Enter a repository name (required)
- Include an administrator's e-mail address (required)
- Provide a namespace identifier (optional but strongly recommended)
- Provide a description (optional)
The namespace identifier is similar to an Internet domain name, for example "dlese.org" or "project.dlese.org". If specified, the namespace identifier is used to compose the OAI Identifier for items in the repository. See the OAI Identifier Format guidelines for more information.
Leave the description blank if it is not needed.
3. Complete the Metadata Files Configuration in the Metadata Files Configuration page by clicking "Add metadata directory" to add one or more metadata directories to the repository. For each directory:
- Enter an appropriate nickname for the directory of files (required)
- Provide the metadata format (metadataPrefix) of the files (required)
- Enter the complete directory path to the metadata files (required)
- Enter the metadata namespace and schema for the format (optional but recommended)
The directory of files must contain XML files that conform to the rules described below under Preparing files for serving.
The metadata format may be any metadata (or data) format.
In the OAI protocol, the format specifier is known as the metadataPrefix.
The metadataPrefix may be any combination of URI unreserved characters, such as
letters, numbers, underscores and dashes.
Examples:
oai_dc | Dublin Core format
adn | ADN format
dlese_anno | DLESE annotation format
dlese_collect | DLESE collection format
news_opps | DLESE news and opportunities format
In general, the metadata namespace and schema can be found near the top of an XML file for the given format.
If the format is recognized by the software these fields will be filled in automatically.
Tip: To test your jOAI installation, you may configure your data provider to serve the enclosed sample records:
- For the path to the directory, enter:
{TOMCAT_HOME}/webapps/oai/WEB-INF/sample_metadata
(replace {TOMCAT_HOME} with the absolute path to your Tomcat installation).
- For the metadataPrefix, enter
adn
4. After step 3 is complete, the software automatically indexes the metadata files, which may take several minutes. Once the files are indexed, the metadata is available for harvesting, for browsing using the OAI protocol via the Explorer page and for textual searching using the Search or Admin search pages. The Metadata Files Configuration shows information about the status of the files and indexing process. If metadata files are added, modified or deleted at a later time, the software automatically detects these changes and adds, deletes or re-indexes them every 8 hours.
The index can also be updated manually at any time from the Files index administration area.
5. Complete Sets Configuration. This step is optional. Define a new set and then:
- Enter a set name (required)
- Enter a setSpec (required)
- Provide a set description (optional)
- Add records to the newly created set by defining which records to include in the set (required)
The set name is a descriptive name for a group of metadata files that are a subgroup of all metadata files in the repository, for example "DLESE Community Collection".
The setSpec is a unique name or label that identifies the subgroup of metadata files; harvesters may use a setSpec to identify and get the correct set of metadata files from providers. A setSpec example is "dcc."
Limit the number of files in a set by specifying certain directories, metadata formats or search criteria.
The optional description field contains information about the content, purpose, rights or history of the provider. Leave the description field blank if no information is available.
After completing the steps above, metadata files are available for harvesting by others.
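The values entered in steps 2, 3 and 5 follow syntax rules from the OAI-PMH 2.0 schema and the OAI Identifier Format guidelines. The Python sketch below checks a metadataPrefix and a setSpec and composes an OAI Identifier from a namespace identifier and a local identifier. It is illustrative only: the function names are hypothetical and not part of jOAI, and the namespace check is a simplified domain-name test, assumed here for illustration rather than copied from the guidelines.

```python
import re

# URI unreserved characters: letters, digits and - _ . ! ~ * ' ( )
_UNRESERVED = r"[A-Za-z0-9\-_.!~*'()]"
_METADATA_PREFIX = re.compile(_UNRESERVED + r"+")
# A setSpec is one or more unreserved-character segments separated by colons.
_SET_SPEC = re.compile(r"%s+(:%s+)*" % (_UNRESERVED, _UNRESERVED))
# Simplified domain-name-style check (an assumption for illustration).
_NAMESPACE = re.compile(r"[A-Za-z][A-Za-z0-9\-]*(\.[A-Za-z][A-Za-z0-9\-]*)+")


def is_valid_metadata_prefix(prefix):
    """True if the prefix uses only URI unreserved characters."""
    return _METADATA_PREFIX.fullmatch(prefix) is not None


def is_valid_set_spec(set_spec):
    """True if the setSpec is a colon-separated list of non-empty segments."""
    return _SET_SPEC.fullmatch(set_spec) is not None


def compose_oai_identifier(namespace_identifier, local_identifier):
    """Compose an identifier of the form oai:<namespace>:<local identifier>."""
    if _NAMESPACE.fullmatch(namespace_identifier) is None:
        raise ValueError("namespace identifier should look like a domain name")
    return "oai:%s:%s" % (namespace_identifier, local_identifier)


print(is_valid_metadata_prefix("dlese_anno"))          # True
print(is_valid_set_spec("dcc"))                        # True
print(compose_oai_identifier("dlese.org", "abc-123"))  # oai:dlese.org:abc-123
```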
This software supports the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), version 2.0. Detailed information about the protocol is outside the scope of this documentation. For background information on the OAI-PMH, please refer to the official OAI-PMH documentation and to additional information and tutorials available through the Open Archives Initiative.
Preparing files for serving
jOAI monitors each directory of files that is configured in the system and automatically adds, updates or deletes items from the OAI repository as files are added, updated or deleted from the directories.
After the initial configuration, synchronization between the files and the OAI repository occurs automatically every 8 hours and can also be triggered manually at any time.
To ensure proper operation, files must follow these conventions:
- Each file must be a well-formed XML instance document.
- Each file must contain a single record, which corresponds to a single item in the OAI repository.
- All files within a given directory must be of the same metadata (XML) format.
- Each file name must end with a .xml file extension.
- The file name up to the .xml file extension must indicate a unique identifier for the record*. For example, if the file is named abc-123.xml, the identifier for the record will be abc-123 (the identifier is used as the local identifier portion of the OAI Identifier Format in the OAI protocol). Identifiers in jOAI are not case sensitive. *Note that for files in the adn, dlese_anno, dlese_collect and news_opps formats, the file name is not used and instead the identifier must be indicated in the proper location in the file's XML.
- Identifiers must be unique across all files configured in the data provider. It is an error to have two or more files, regardless of format, with the same identifier.
- Reserved characters must be encoded with hex substitutes. For example, to indicate a slash / in the identifier, use %2F in the file name. Reserved characters and their hex substitutes are:
"/", "%2F"
"?", "%3F"
"#", "%23"
"=", "%3D"
"&", "%26"
":", "%3A"
";", "%3B"
" ", "%20"
"+", "%2B"
- A change in the file modification date will update the OAI datestamp and initiate a transfer of the record from the data provider to the harvester. For network efficiency, the file modification date should change only when the content of the file is modified.
- The XML files must be encoded using the UTF-8 representation of Unicode. Character references, rather than entity references, must be used. See the XML response format specification for the OAI protocol.
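Several of the conventions above can be checked before a directory is added to the repository. The following Python sketch is a hypothetical pre-flight check, not part of jOAI. It derives an identifier from each file name by reversing the hex substitutes in the table above, then reports files that lack the .xml extension, are not well-formed, or share an identifier (compared without regard to case). It assumes the file name carries the identifier, so it does not apply to the adn, dlese_anno, dlese_collect and news_opps formats. It checks one directory at a time, whereas identifiers must be unique across all configured directories. How jOAI itself decodes file names is not described here.

```python
import os
import xml.etree.ElementTree as ET

# Hex substitutes from the table above, keyed by the encoded form.
HEX_SUBSTITUTES = {
    "%2F": "/", "%3F": "?", "%23": "#", "%3D": "=", "%26": "&",
    "%3A": ":", "%3B": ";", "%20": " ", "%2B": "+",
}


def identifier_from_filename(filename):
    """Strip the .xml extension and decode the hex substitutes."""
    stem = filename[:-len(".xml")]
    for encoded, char in HEX_SUBSTITUTES.items():
        stem = stem.replace(encoded, char)
    return stem


def check_directory(path):
    """Return a list of human-readable problems found in one directory."""
    problems = []
    seen = {}  # lower-cased identifier -> first file name that used it
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue
        if not name.endswith(".xml"):
            problems.append("%s: file name does not end with .xml" % name)
            continue
        try:
            ET.parse(full)  # raises ParseError if the XML is not well formed
        except ET.ParseError as err:
            problems.append("%s: not well-formed XML (%s)" % (name, err))
            continue
        key = identifier_from_filename(name).lower()
        if key in seen:
            problems.append("%s: identifier duplicates %s" % (name, seen[key]))
        else:
            seen[key] = name
    return problems
```

Calling `check_directory` with the path to a metadata directory returns an empty list when none of these problems is found.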
Provide test records
To test your jOAI installation, configure your data provider to serve the enclosed sample records.
Providing files in multiple formats
jOAI can disseminate any given metadata file in multiple formats. For example, a file that resides in the adn format can also be disseminated to harvesters in the oai_dc format. This is done using metadata format converters. Several metadata format converters come pre-configured in the software as detailed below. New converters can be configured and implemented using an XSL stylesheet or a custom Java class that converts metadata from its native XML format to another XML format. Once a format converter is configured, all files in the native format will be disseminated in either the native or converted format depending on which format is requested by the harvester.
The software comes pre-configured with the following metadata format converters (plus others):
Native XML format | Converted XML format
nsdl_dc | oai_dc
adn | oai_dc, nsdl_dc, briefmeta
dlese_anno | oai_dc
dlese_collect | oai_dc
news_opps | oai_dc
To configure additional metadata format converters, do the following:
1. Create or obtain an XSL stylesheet or Java class that performs the desired format conversion from one XML format to another. The converter takes XML in the native format as its input and must generate XML in the converted format as its output. For Java converters, the class must implement the XMLFormatConverter Interface.
2. If using an XSL stylesheet to perform the conversion, place it in the "xsl_files" directory located in the "WEB-INF" directory of the OAI software context. If using a Java class to perform the conversion, place the class binary anywhere within the classpath of the servlet container.
3. Edit the "web.xml" file located in the "WEB-INF" directory and add a context-param element to configure each format converter (see the existing ones for examples). When configuring an XSL converter, the param-name element must start with the string "xslconverter", followed by additional descriptive text. When configuring a Java class converter, the param-name element must start with the string "javaconverter", followed by additional descriptive text. Each param-name must be unique; otherwise it will not be recognized.
For the param-value field, supply a string of the form
[convertername] | [from format] | [to format]
where convertername is either the name of an XSL file or a fully qualified Java class and the "to" and "from" formats are metadataPrefixes for the given formats.
For example, for an XSL stylesheet named myDCConverter.xsl that converts from ADN to Dublin Core, the param-value would be
"myDCConverter.xsl|adn|oai_dc" (quotes omitted).
For a Java class converter with the fully qualified name org.institution.converter.MyDCConverter, the param-value would be
"org.institution.converter.MyDCConverter|adn|oai_dc" (quotes omitted).
An example of a complete context-param configuration for a format converter looks like the following:
<context-param>
<param-name>xslconverter - adn to oai_dc converter</param-name>
<param-value>adn-v0.6.50-to-oai_dc.xsl|adn|oai_dc</param-value>
</context-param>
4. After configuring the web.xml file and placing the converter in the appropriate location, start or restart the software. The software automatically recognizes the converter, adds the new format to its list of available formats and exposes it in response to the OAI ListMetadataFormats request.
Tip: The format converter module caches the converted metadata to disk for increased performance. These converted files may be accessed locally. Accessed files are cached in the "WEB-INF/repository_data/converted_xml_cache" directory.
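The context-param convention from step 3 can be generated or checked with a short script. The Python sketch below is illustrative only and its function names are hypothetical. It builds the param-value string, parses one back into its three parts and renders a complete context-param element. Stripping whitespace around the "|" separators when parsing is an assumption. The examples above show the value written without spaces.

```python
import xml.etree.ElementTree as ET

CONVERTER_KINDS = ("xslconverter", "javaconverter")


def converter_param_value(converter_name, from_format, to_format):
    """Join the three parts in the order converter | from | to."""
    return "|".join([converter_name, from_format, to_format])


def parse_param_value(value):
    """Split a param-value into (converter name, from format, to format)."""
    parts = [part.strip() for part in value.split("|")]
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected converter|from|to, got %r" % value)
    return tuple(parts)


def context_param(kind, description, converter_name, from_format, to_format):
    """Render a context-param element as a string.

    kind must be "xslconverter" or "javaconverter"; the description is
    appended so that each param-name can be made unique.
    """
    if kind not in CONVERTER_KINDS:
        raise ValueError("kind must be one of %s" % (CONVERTER_KINDS,))
    element = ET.Element("context-param")
    ET.SubElement(element, "param-name").text = "%s - %s" % (kind, description)
    ET.SubElement(element, "param-value").text = converter_param_value(
        converter_name, from_format, to_format)
    return ET.tostring(element, encoding="unicode")


print(context_param("xslconverter", "adn to oai_dc converter",
                    "adn-v0.6.50-to-oai_dc.xsl", "adn", "oai_dc"))
```

The printed element contains the same param-name and param-value as the example configuration above, on a single line.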
Register your data provider
After you have set up your data provider you may wish to register it with the Open Archives Initiative.
Doing so will add your data provider to the list of OAI conforming repositories.
To register, see the Data Provider Validation and Registration page.