Can Berkeley DB Java Edition use a NFS, SAN, or other remote/shared/network filesystem for an environment?
There are two caveats with NFS based storage, although the motivation for them in Java Edition (JE) is different from that of Berkeley DB. First, JE requires that the underlying storage system reliably persist data to the operating system level when write()
is called and durably when fsync()
is called. However, some remote file system server implementations will cache writes on the server side (as a performance optimization) and return to the client (in this case JE) before making the data durable. While this is not a problem when the environment directory's disk is local, this can present issues in a remote file system configuration because the protocols are generally stateless. The problem scenario can occur if (1) JE performs a write()
call, (2) the server accepts the data but does not make it durable by writing it to the server's disk, (3) the server returns from the write()
call to the client, and then (4) the server crashes. If the client (JE) does not know that the server has crashed (the protocol is stateless), and JE later successfully calls write()
for data further along in the log file, it is possible for the JE log file to have holes in it, causing data corruption.
In JE 3.2.65 and later releases, a new parameter has been added, called je.log.useODSYNC
, which causes the JE environment log files to be opened with the O_DSYNC
flag. This flag causes all writes to be written durably to the disk. In the case of a remote file system it tells the server not to return from the write()
call until the data has been made durable on the server's local disk. The flag should never be used in a local environment configuration since it incurs a performance penalty. Conversely, this flag should always be used in a remote file system configuration or data corruption may result.
When using JE in a remote file system configuration, the system should never be configured with multiple file system clients (i.e. multiple hosts accessing the file system server). In this configuration it is possible for client side caching to occur which will allow the log files to become out of sync on the clients (JE) and therefore corrupt. The only solution we know of for this is to open the environment log files with the O_DIRECT
flag, but this is not available using the Java VM.
Second, Java Edition (JE) uses the file locking functionality provided through java.nio.channels.FileChannel.lock()
. Java does not specify the underlying implementation, but presumably in many cases it is based on the flock()
system call. Whether flock()
works across NFS is platform dependent. A web search shows several bug reports about its behavior on Linux where flock()
is purported to incorrectly return a positive status.
JE uses file locking for two reasons: (1) to ensure that only one read-write process has the environment open at any given time, and (2) to coordinate log cleaning, so that log files are not deleted while reader processes still need them.
Of course the simplest way of dealing with flock()
vs NFS is to only use a single process to access a JE environment. If that is not possible, and if you cannot rely on flock()
across NFS on your systems, you could handle (1) by taking responsibility in your application to ensure that there is a single writer process attached. Having two writer processes in a single environment could result in database corruption. (Note that the issue is with processes, and not threads.)
Handling the issue of log cleaning (2) in your application is also possible, but more cumbersome. To do so, you must disable the log cleaner (by setting the je.env.runCleaner
property to false
) whenever there are multiple processes accessing an Environment
. If file deletion is not locked out properly, the reader processes might periodically see a com.sleepycat.je.log.LogFileNotFoundException
, and would have to close and reopen to get a fresh snapshot. Such an exception might happen very sporadically, or might happen frequently enough to make the setup unworkable. To perform a log cleaning, the application should first ensure that all reader processes have closed the Environment
(i.e. all read-only processes have closed all Environment
handles). Once closed, the writer process should perform log cleaning by calling Environment.cleanLog()
and Environment.checkpoint()
. Following the completion of the checkpoint, the reader processes can re-open the environment.
Can a Berkeley DB database be used by Berkeley DB Java Edition?
We've had a few questions about whether data files can be shared between Berkeley DB and Berkeley DB Java Edition. The answer is that the on disk format is different for the two products, and data files cannot be shared between the two. Both products do share the same format for the data dump and load utilities (com.sleepycat.je.util.DbDump, com.sleepycat.je.util.DbLoad
), so you can import and export data between the two products.
Also, JE data files are platform independent, and can be moved from one machine to another. Lastly, both products support the Direct Persistence Layer API, the persistent Java Collections API, and a similar byte array based API.
Does JE support high performance LOBs (large objects)?
JE supports get() and put() operations with partial data. However, this feature is not fully optimized, since the entire record is always read or written to the database, and the entire record is cached.
So the only advantage (currently) to using partial get() and put() operations is that only a portion of the record is copied to or from the buffer supplied by the application. In the future we may provide optimizations of this feature, but until then we cannot claim that JE has high performance LOB support.
For more information on partial get() and put() operations please see our documentation.
Does JE support key prefixing or key compression?
Key prefixing is a database storage technique which reduces the space used to store B-Tree keys. It is useful for applications with large keys that have similar prefixes. JE supports key prefixing as of version 3.3.62. See DatabaseConfig.setKeyPrefixing
.
JE does not currently support key compression. While we have thought about it for both the DB and JE products, there are issues with respect to the algorithm that is used, the size of the key, and the actual values of the key. For example, LZW-style compression works well, but needs a large amount of input data to be effective. If you're compressing individual keys, and they're relatively small, LZW-style compression is likely to make the key bigger, not smaller.
How can I set JE configuration properties?
JE configuration properties can be programmatically specified through Base API (com.sleepycat.je
) classes such as EnvironmentConfig
, DatabaseConfig
, StatsConfig
, TransactionConfig
, and CheckpointConfig
. When using the replication (com.sleepycat.je.rep
) package, ReplicatedEnvironment
properties may be set using ReplicationConfig
. When using the DPL (com.sleepycat.persist
) package, EntityStore
configuration properties may be set using StoreConfig
. The application instantiates one of these configuration classes and sets the desired values.
For Environment
and ReplicatedEnvironment
configuration properties, there's a second configuration option, which is the je.properties
file. Any property set through the get/set methods in the EnvironmentConfig
and ReplicationConfig
classes can also be specified by creating a je.properties
file in the environment home directory. Properties set through je.properties
take precedence, and give the application the option of changing configurations without recompiling the application. All properties that can be specified in je.properties
can also be set through EnvironmentConfig.setConfigParam
or ReplicationConfig.setConfigParam
.
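For example, a minimal sketch of both approaches (the particular parameters shown, je.maxMemory and je.cleaner.minUtilization, are used here only for illustration):

// Programmatic configuration.
EnvironmentConfig config = new EnvironmentConfig();
config.setAllowCreate(true);
config.setCacheSize(64 * 1024 * 1024);                     // typed setter
config.setConfigParam("je.cleaner.minUtilization", "60");  // generic setter
Environment env = new Environment(new File(envHome), config);

// Equivalent settings in <envHome>/je.properties (these take precedence):
//   je.maxMemory=67108864
//   je.cleaner.minUtilization=60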
The complete set of Environment
and ReplicatedEnvironment
properties are documented in the EnvironmentConfig
and ReplicationConfig
classes. The javadoc for each property describes the allowed values, the default value, and whether the property is mutable. Mutable properties can be changed after the environment is opened. Properties not documented in these classes are experimental and some may be phased out over time, while others may be promoted and documented.
How can insertion-ordered records or sequences be used?
The general capability for assigning IDs is a "sequence", and has the same functionality as a SQL SEQUENCE. The idea of a sequence is that it allocates values efficiently (without causing a performance bottleneck), and guarantees that the same value won't be used twice.
When using the DPL, the @PrimaryKey(sequence="...")
annotation may be used to define a sequence. When using the Base API, the Sequence
class provides a lower level form of sequence functionality, and an example case is in <jeHome>/examples/je/SequenceExample.java
.
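For illustration, a minimal Base API sketch (the database handle db and the sequence key name are hypothetical; error handling is omitted):

// Store the sequence under a well-known key in an existing database.
SequenceConfig seqConfig = new SequenceConfig();
seqConfig.setAllowCreate(true);
DatabaseEntry seqKey = new DatabaseEntry("my.sequence".getBytes());
Sequence seq = db.openSequence(null, seqKey, seqConfig);
long nextId = seq.get(null, 1);   // atomically allocates the next value
seq.close();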
How do I add fields to an existing tuple format when using the Java bindings?
If you are currently storing objects using a TupleBinding, it is possible to add fields to the tuple without converting your existing databases and without creating a data incompatibility. Please note also that class evolution is supported without any application level coding through the Direct Persistence Layer API.
This excerpt from the Collections Overview section of the Javadoc describes what changes may be made to tuple bindings:
The tuple binding uses less space and executes faster than the serial binding. But once a tuple is written to a database, the order of fields in the tuple may not be changed and fields may not be deleted. The only type evolution allowed is the addition of fields at the end of the tuple, and this must be explicitly supported by the custom binding implementation.
Specifically, if your type changes are limited to adding new fields then you can use the TupleInput.available()
method to check whether more fields are available for reading. The available()
method is the implementation of java.io.InputStream.available()
. It returns the number of bytes remaining to be read. If the return value is greater than zero, then there is at least one more field to be read.
When you add a field to your database record definition, in your TupleBinding.objectToEntry
method you should unconditionally write all fields including the additional field.
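For example, a sketch of such an objectToEntry method (the entity class and field names are hypothetical):

public void objectToEntry(Object object, TupleOutput output) {
    MyRecord rec = (MyRecord) object;
    // Original fields, written in their original order.
    output.writeString(rec.getName());
    output.writeInt(rec.getCount());
    // New field, always written at the end of the tuple.
    output.writeString(rec.getComment());
}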
In your TupleBinding.entryToObject
method you should call available()
after reading all the original fixed fields. If it returns a value greater than zero, you know that the record contains the new field and you can read it. If it returns zero, the record does not contain the new field.
For example:
public Object entryToObject(TupleInput input) {
    // Read all original fields first, unconditionally.
    if (input.available() > 0) {
        // Read additional field #1
    }
    if (input.available() > 0) {
        // Read additional field #2
    }
    // etc.
    // Construct and return the object from the fields read above.
}
How do I build a simple Servlet using Berkeley DB Java Edition?
Below is a simple Servlet example that uses JE. It opens a JE Environment in the init method and then reads all the data out of it in the doGet()
method.
import java.io.*;
import java.text.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;
import com.sleepycat.je.Cursor;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
/**
* The simplest possible servlet.
*/
public class HelloWorldExample extends HttpServlet {
private Environment env = null;
private Database db = null;
public void init(ServletConfig config)
throws ServletException {
super.init(config);
try {
openEnv("c:/temp");
} catch (DatabaseException DBE) {
DBE.printStackTrace(System.out);
throw new UnavailableException(this, DBE.toString());
}
}
public void doGet(HttpServletRequest request,
HttpServletResponse response)
throws IOException, ServletException {
ResourceBundle rb =
ResourceBundle.getBundle("LocalStrings",request.getLocale());
response.setContentType("text/html");
PrintWriter out = response.getWriter();
out.println("<html>");
out.println("<head>");
String title = rb.getString("helloworld.title");
out.println("<title>" + title + "</title>");
out.println("</head>");
out.println("<body bgcolor=\"white\">");
out.println("<a href=\"../helloworld.html\">");
out.println("<img src=\"../images/code.gif\" height=24 " +
"width=24 align=right border=0 alt=\"view code\"></a>");
out.println("<a href=\"../index.html\">");
out.println("<img src=\"../images/return.gif\" height=24 " +
"width=24 align=right border=0 alt=\"return\"></a>");
out.println("<h1>" + title + "</h1>");
dumpData(out);
out.println("</body>");
out.println("</html>");
}
public void destroy() {
closeEnv();
}
private void dumpData(PrintWriter out) {
try {
long startTime = System.currentTimeMillis();
out.println("<pre>");
Cursor cursor = db.openCursor(null, null);
try {
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry data = new DatabaseEntry();
while (cursor.getNext(key, data, LockMode.DEFAULT) ==
OperationStatus.SUCCESS) {
out.println(new String(key.getData()) + "/" +
new String(data.getData()));
}
} finally {
cursor.close();
}
long endTime = System.currentTimeMillis();
out.println("Time: " + (endTime - startTime));
out.println("</pre>");
} catch (DatabaseException DBE) {
out.println("Caught exception: ");
DBE.printStackTrace(out);
}
}
private void openEnv(String envHome)
throws DatabaseException {
EnvironmentConfig envConf = new EnvironmentConfig();
env = new Environment(new File(envHome), envConf);
DatabaseConfig dbConfig = new DatabaseConfig();
dbConfig.setReadOnly(true);
db = env.openDatabase(null, "testdb", dbConfig);
}
private void closeEnv() {
try {
db.close();
env.close();
} catch (DatabaseException DBE) {
// Ignore exceptions during shutdown.
}
}
}
How do I verify that the configuration settings that I made in my je.properties file have taken effect?
You can use the Environment.getConfig()
API to retrieve configuration information after the Environment has been created. For example:
import java.io.File;
import com.sleepycat.je.*;
public class GetParams {
static public void main(String argv[])
throws Exception {
EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setTransactional(true);
envConfig.setAllowCreate(true);
Environment env = new Environment(new File("/temp"), envConfig);
EnvironmentConfig newConfig = env.getConfig();
System.out.println(newConfig.getCacheSize());
env.close();
}
}
will display
> java GetParams
7331512
>
Note that you have to call getConfig()
, rather than query the EnvironmentConfig
that was used to create the Environment
.
How does JE Concurrent Data Store (CDS) differ from JE Transactional Data Store (TDS)?
Berkeley DB Java Edition comes in two flavors, Concurrent Data Store (CDS) and Transactional Data Store (TDS). The difference between the two products lies in whether you use transactions or not. Literally speaking, you are using TDS if you call the public API method, EnvironmentConfig.setTransactional(true)
.
Both products support multiple concurrent reader and writer threads, and both create durable, recoverable databases. We're using "durability" in the database sense, which means that the data is persisted to disk and will reappear if the application comes back up after a crash. What transactions provide is the ability to group multiple operations into a single, atomic element, the ability to undo operations, and control over the granularity of durability.
For example, suppose your application has two databases, Person and Company. To insert new data, your application issues two operations, one to insert into Person, and another to insert into Company. You need transactions if your application would like to group those operations together so that the inserts only take effect if both operations are successful.
Note that an additional issue is whether you use secondary indices in JE. Suppose you have a secondary index on the address field in Person. Although it only takes one method call into JE to update both the Person database and its secondary index Address, the application needs to use transactions to make the update atomic. Otherwise, it's possible that if the system crashed at a given point, Person could be updated but not Address.
Transactions also let you explicitly undo a set of operations by calling Transaction.abort()
. Without transactions, all modifications are final after they return from the API call.
Lastly, transactions give you finer grain durability. After calling Transaction.commit
, the modification is guaranteed to be durable and recoverable. In CDS, without transactions, the database is guaranteed to be durable and recoverable back to the last Environment.sync()
call, which can be an expensive operation.
Note that there are different flavors of Transaction.commit
that let you trade off levels of durability and performance, as explained in a separate FAQ entry.
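For illustration, a minimal sketch of choosing a commit flavor (assuming env is a transactional Environment; error handling is simplified):

Transaction txn = env.beginTransaction(null, null);
try {
    // ... get/put operations using txn ...
    txn.commitWriteNoSync(); // write to the OS, but skip the fsync
    // Alternatives:
    //   txn.commitSync();    - fsync to disk, most durable
    //   txn.commitNoSync();  - buffered in JE, fastest, least durable
} catch (DatabaseException e) {
    txn.abort();
    throw e;
}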
So in summary, choose CDS when you do not need to group multiple operations into a single atomic unit, do not need to undo (abort) modifications, and can rely on periodic Environment.sync() calls for durability. Choose TDS when you need to group operations atomically (including updates to secondary indices), need the ability to abort, or need fine-grained control over the durability of each commit.
There is a single download and jar file for both products. Which one you use is a licensing issue, and has no installation impact.
Is a Berkeley DB database the same as a SQL "table"?
Yes; "tables" are databases, "rows" are key/data pairs, and "columns" are application-encapsulated fields. The application must provide its own methods for accessing a specific field, or "column" within the data value.
Is it considered best practice to keep databases closed when not in use?
The memory overhead for keeping a database open is quite small. In general, it is expected that applications will keep databases open for as long as the environment is open. The exception may be an application that has a very large number of databases and only needs to access a small subset of them at any one time.
If you notice that your application is short on memory because you have too many databases open, then consider only opening those you are using at any one time. Pooling open database handles could be considered at that point, if the overhead of opening databases each time they are used has a noticeable performance impact.
What is the smallest cache size I can set with JE?
The smallest cache size is 96KB (96 * 1024)
. You can set this by either calling EnvironmentConfig.setCacheSize(96 * 1024)
on the EnvironmentConfig
instance that you use to create your environment, or by setting the je.maxMemory property
in your je.properties
file.
Why don't Berkeley DB and Berkeley DB Java Edition both implement a shared set of Java interfaces for the API? Why are these two similar APIs in different Java packages?
In the past, we've discussed whether it makes sense to provide a set of interfaces that are implemented by the Berkeley DB JE API and the Java API for Berkeley DB. We looked into this during the design of JE and decided against it because in general it would complicate things for "ordinary" users of both JE and DB.
Concrete classes (e.g. DatabaseEntry) are also problematic: we could have common interfaces and a factory in a common package, but that doesn't allow for subclassing and presents problems for callbacks like SecondaryKeyCreator. Exceptions are a further problem: unless we moved DatabaseException into the common package (breaking applications in the process), applications using the common interfaces would need to explicitly catch exceptions from both packages. Otherwise, we would need to unify what exceptions are thrown from DB and JE, and given that DB exceptions are generated based on C error codes, there is no way we would ever get that right.
Does Berkeley DB Java Edition run within J2ME?
JE requires Java SE 1.5 or later. There are no plans to support J2ME at this time.
Where does the je.jar file belong when loading within an application server?
It is important that je.jar and your application jar files (in particular, the classes that are being serialized by SerialBinding) are loaded under the same class loader. For running in a servlet, this typically means that you would place je.jar and your application jars in the same directory.
Additionally, it is important to not place je.jar in the extensions directory for your JVM. Instead place je.jar in the same location as your application jars. The extensions directory is reserved for privileged library code.
One user with a WebSphere Studio (WSAD) application had a classloading problem because the je.jar was in both the WEB-INF/lib and the ear project. Removing the je.jar from the ear project resolved the problem.
How should I set directory permissions on the JE environment directory?
If you want to read and write to the JE environment, then you should provide r/w permission on the directory, je.lck, and *.jdb files for JE.
If you want read-only access to JE, then you should either:
If JE finds that the JE environment directory is writable, it will attempt to write to the je.lck file. If it finds that the JE environment directory is not writable, it will verify that the Environment is opened for read-only.
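For example, a minimal sketch of opening an environment and database read-only (the database name "testdb" and envHome are placeholders):

EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setReadOnly(true);              // open the environment read-only
Environment env = new Environment(new File(envHome), envConfig);
DatabaseConfig dbConfig = new DatabaseConfig();
dbConfig.setReadOnly(true);               // open databases read-only as well
Database db = env.openDatabase(null, "testdb", dbConfig);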
How do I deploy BDB JE Maven using the m2eclipse plugin in Eclipse?
We assume that you have installed the m2eclipse Maven plugin in Eclipse, and created a Maven project in Eclipse for your JE based project. To use the BDB JE Maven artifact, add a BDB JE dependency and repository to the pom.xml of your project.
<dependencies>
<dependency>
<groupId>com.sleepycat</groupId>
<artifactId>je</artifactId>
<version>4.0.103</version>
</dependency>
</dependencies>
Please confirm that the desired version of JE is available in the Maven repository. You can find the up-to-date JE version at: https://www.oracle.com/database/technologies/related/berkeleydb.html
<repositories>
<repository>
<id>oracleReleases</id>
<name>Oracle Released Java Packages</name>
<url>http://download.oracle.com/maven</url>
<layout>default</layout>
</repository>
</repositories>
Save the modified pom.xml
. Now m2eclipse will automatically download and configure JE's jar and javadoc (assuming that you have selected "Download Artifact Sources" and "Download Artifact JavaDoc" in the maven preference section of Eclipse) for your project.
How do I debug a lock timeout?
The common cause of a com.sleepycat.je.LockConflictException
is the situation where 2 or more transactions are deadlocked because they're waiting on locks that the other holds. For example: transaction A locks record 1 and then waits for a lock on record 2, while transaction B has locked record 2 and is waiting for a lock on record 1; neither can make progress until the other releases its lock.
The lock timeout message may give you insight into the nature of the contention. Besides the default timeout message, which lists the contending lockers, their transactions, and other waiters, it's also possible to enable tracing that will display the stacktraces of where locks were acquired.
Stacktraces can be added to a deadlock message by setting the je.txn.deadlockStackTrace
property through your je.properties
file or EnvironmentConfig
. This should only be set during debugging because of the added memory and processing cost.
Enabling stacktraces gives you more information about the target of contention, but it may be necessary to also examine what locks the offending transactions hold. That can be done through your application's knowledge of current activity, or by setting the je.txn.dumpLocks
property. Setting je.txn.dumpLocks
will make the deadlock exception message include a dump of the entire lock table, for debugging. The output of the entire lock table can be large, but is useful for determining the locking relationships between records.
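For example, a je.properties sketch that enables both debugging aids (for debugging only, because of the extra memory and processing cost):

# Enable stacktraces showing where contended locks were acquired.
je.txn.deadlockStackTrace=true
# Dump the entire lock table in lock conflict exception messages.
je.txn.dumpLocks=true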
Another note, which doesn't impact the deadlock itself, is that the default setting for lock timeouts (specified by je.lock.timeout
or EnvironmentConfig.setLockTimeout())
can be too long for some applications with contention, and throughput improves when this value is decreased. However, this issue only affects performance, not true deadlocks.
In JE 4.0 and later releases on the 4.x.y line, and JE 3.3.92 and later releases on the 3.3.x line, the NIO parameters je.log.useNIO
, je.log.directNIO
, and je.log.chunkedNIO
are deprecated. Setting them has no effect.
In JE 3.3.91 and earlier, the NIO parameters are functional, but should never be used since they are now known to cause data corrupting bugs in JE.
What is a safe way to stop threads in a JE application?
Calling Thread.interrupt()
is not recommended for an active JE thread if the goal is to stop the thread or do thread coordination. If you interrupt a thread which is executing a JE operation, the state of the database will be undefined. That's because JE might have been in the middle of I/O activity when the operation was aborted midstream, and it becomes very difficult to detect and handle all possible outcomes.
If JE can detect the interrupt, it will mark the environment as unusable and will throw a RunRecoveryException
. This tells you that you must close the environment and re-open it before using it again. If JE doesn't throw RunRecoveryException
, it is very likely that you would get some other exception that is less meaningful, or simply see corrupted data.
Instead, applications should use other mechanisms like Object.notify()
and wait()
to coordinate threads. For example, use a "keepRunning
" variable of some kind in each thread. Check this variable in your threads, and return from the thread when it is false. Set it to false when you want to stop the thread. If this thread is waiting to be woken up to do another unit of work, use Object.notify
to wake it up. This is the recommended technique.
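A minimal sketch of this pattern (class, field, and method names are only for illustration):

class Worker implements Runnable {
    private volatile boolean keepRunning = true;

    public void run() {
        while (keepRunning) {
            // ... perform one unit of JE work (get/put operations) ...
            synchronized (this) {
                while (keepRunning && !workAvailable()) {
                    try {
                        wait();   // sleep until shutdown() or new work arrives
                    } catch (InterruptedException e) {
                        return;   // defensive; interrupts are not the normal path
                    }
                }
            }
        }
    }

    // Called by another thread to stop this worker.
    public synchronized void shutdown() {
        keepRunning = false;
        notifyAll();   // wake the worker so it can observe keepRunning == false
    }

    private boolean workAvailable() {
        return true;   // placeholder; replace with a real work-queue check
    }
}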
If you absolutely must interrupt threads for some reason, you should expect that you will see RunRecoveryException
. Each thread should treat this exception as an indication that it should stop.
Why does my application sometimes get a checksum exception when running on Windows 7?
We believe that Windows 7 (as of build 7600) has an IO bug which is triggered by JE 3.3.91 and earlier under certain conditions. Unfortunately, this bug causes file corruption which is eventually detected by JE's checksumming mechanism. JE 4.0 (and later) has a "write queue" mechanism built into it which prevents this bug from being triggered. Because JE 3.3.91 (and earlier releases) were shipped before Windows 7, we were not aware of the bug and therefore not able to include preventive code. JE 3.3.92 detects if it is running on Windows 7 and prevents triggering the bug. We have reported this bug to Microsoft.
JE 2.0 has support for XA transactions in a J2EE app server environment. Can I use XA transactions (2 phase commit) in a non-J2EE environment?
Yes. The com.sleepycat.je.XAEnvironment
class implements the javax.transaction.xa.XAResource interface, which can be used to perform 2 phase commit transactions. The relevant methods in this interface are start()
, end(), prepare()
, commit()
, and rollback()
. The XAEnvironment.setXATransaction()
is an internal entrypoint that is only public for the unit tests.
The XA Specification has the concept of implicit transactions (a transaction that is associated with a thread and does not have to be passed to the JE API); this is supported in JE 2.0. You can use the XAResource.start()
method to create a JE transaction and join it to the calling thread. To disassociate a transaction from a thread, use the end()
method. When you use thread-implied transactions, you do not have to pass in a Transaction argument to the JE API (e.g. through methods such as get()
and put()
). Instead, passing null in a thread-implied transaction environment tells JE to use the implied transaction.
Here's a small example of how to use XAEnvironment and 2 Phase Commit:
XAEnvironment env = new XAEnvironment(home, null);
Xid xid = [something...];
env.start(xid, 0); // creates a thread-implied transaction for you
... calls to get/put, etc. with null transaction arg will use the implicit transaction...
env.end(xid, 0); // disassociate this thread from the implied transaction
env.prepare(xid);
if (commit) {
env.commit(xid, false);
} else {
env.rollback(xid);
}
How can I perform wildcard queries or non-key queries?
Berkeley DB does not have a query language. It has API methods for performing queries that can be implemented as lookups in a primary or secondary index. Wildcard queries and non-key queries must be performed by scanning an entire index and examining each key and/or value.
In an SQL database or another database product with a query language, a full index scan is executed when you perform a wildcard query or a non-key query. In Berkeley DB you write a loop that scans the index. While you have to write the loop, you'll see better performance than in an SQL database because there is no SQL processing involved.
Berkeley DB supports simple key lookups as well as prefix or range queries. Range queries allow you to search for all keys in a range of keys or for all keys starting with a given prefix value. For more information on range queries in the Base API, see:
For more information on range queries in the DPL API, see:
Can I perform an efficient "keys only" query?
As of JE 4.0, key-only queries may be performed and I/O is significantly reduced if ReadUncommitted
isolation is configured. Because JE data records are stored separately, the I/O to read the data record is avoided when the data record is not already in cache. Note that if other isolation levels are used, then the I/O cannot be avoided because the data record must be read in order to lock the record.
To perform a ReadUncommitted key-only query using the Base API, use any Database or Cursor method to perform the query and specify the following:
- LockMode.READ_UNCOMMITTED or CursorConfig.READ_UNCOMMITTED.
- DatabaseEntry.setPartial(0, 0, true) on the data DatabaseEntry, so that JE will not fetch and return the record data.
To perform a ReadUncommitted key-only query using the DPL, use any EntityIndex to perform the query and specify the following:
- ReadUncommitted isolation using LockMode.READ_UNCOMMITTED or CursorConfig.READ_UNCOMMITTED.
- EntityIndex.keys to obtain a key-only cursor, so that JE will not fetch and return the record data.
How can I join two or more primary databases or indexes?
Berkeley DB has direct support only for intersection (AND) joins across the secondary keys of a single primary database. You cannot join more than one primary database. If you are using the DPL, the same is true: you can only join across the secondary keys of a single primary index.
For example, imagine a primary database (index) called Person that has two secondary keys: birthdate and favorite color. Using the Join API, you can find all Person records that have a given birthdate AND a given favorite color.
When using the Base API, see the Database.join method (http://download.oracle.com/docs/cd/E17277_02/html/java/com/sleepycat/je/Database.html). When using the DPL, see the EntityJoin class.
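A minimal Base API sketch of such a join (personDb, birthdateDb, colorDb, birthdateKey, and colorKey are hypothetical; birthdateDb and colorDb are SecondaryDatabase indexes of personDb):

SecondaryCursor birthdateCursor = birthdateDb.openSecondaryCursor(null, null);
SecondaryCursor colorCursor = colorDb.openSecondaryCursor(null, null);
JoinCursor joinCursor = null;
try {
    DatabaseEntry data = new DatabaseEntry();
    // Position each secondary cursor on the key value to match.
    if (birthdateCursor.getSearchKey(new DatabaseEntry(birthdateKey), data,
                                     LockMode.DEFAULT) == OperationStatus.SUCCESS &&
        colorCursor.getSearchKey(new DatabaseEntry(colorKey), data,
                                 LockMode.DEFAULT) == OperationStatus.SUCCESS) {
        joinCursor = personDb.join(
            new Cursor[] { birthdateCursor, colorCursor }, null);
        DatabaseEntry primaryKey = new DatabaseEntry();
        DatabaseEntry primaryData = new DatabaseEntry();
        while (joinCursor.getNext(primaryKey, primaryData, LockMode.DEFAULT) ==
               OperationStatus.SUCCESS) {
            // ... each result matches both the given birthdate and color ...
        }
    }
} finally {
    if (joinCursor != null) { joinCursor.close(); }
    colorCursor.close();
    birthdateCursor.close();
}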
To perform a join across more than one primary database (index), you must write a loop that iterates over the records of one database (index) and does a lookup in one or more additional databases (indexes).
In an SQL database or another database product with a query language, similar iterative processing takes place when the join is executed. In Berkeley DB while you must write the code that iteratively performs the join, you'll see better performance than in an SQL database because there is no SQL processing involved.
How can a join be combined with a range query?
Imagine an application where a single primary employee database has three fields that are indexed by secondary databases: status, department, and salary. The user wishes to query for a specific status, a specific department, and a range of salary values. Berkeley DB supports joins, and the join API can be used to select the AND (intersection) of a specific status and a specific department. However, the join API cannot be used to select a range of salaries. Berkeley DB also supports range searches, making it possible to iterate over a range of values using a secondary index such as a salary index. However, there is no way to automatically combine a range search and a join.
To combine a range search and a join you'll need to first perform one of the two using a Berkeley DB API, and then perform the other manually as a "filter" on the results of the first. So you have two choices: (1) perform the salary range search using the salary index, and filter each result manually by status and department; or (2) perform the join on status and department using the join API, and filter each result manually by the salary range.
Which option performs best depends on whether the join or the range query will produce a smaller result set, on average. If the join produces a smaller result set, use option 2; otherwise, use option 1. There is a third option to consider if this particular query is performance critical. You could create a secondary index on preassigned ranges of salary. For example, assign the secondary key 1 for salaries between $10,000 and $19,999, key 2 for salaries between $20,000 and $29,999, etc.
If a query specifies only one such salary range, you can perform a join using all three of your secondary indices, with no filtering after the join. If the query spans ranges, you'll have to do multiple joins and then union the results. If the query specifies a partial range, you'll have to filter out the non-matching results. This may be quite complex, but it can be done if necessary. Before performing any such optimization, be sure to measure performance of your queries to make sure the optimization is worthwhile.
If you can limit the specified ranges to only those that you've predefined, that will actually simplify things rather than make them more complex, and will perform very well also. In this case, you can always perform a single join with no filtering. Whether this is practical depends on whether you can constrain the queries to use predefined ranges.
On range searches in general, they can be done with Cursor.getSearchKeyRange
or with the SortedSet.subSet
and SortedMap.subMap
methods, depending on whether you are using the base API or the Collections API. It is up to you which to use.
If you use Cursor.getSearchKeyRange
you'll need to call getNext
to iterate through the results. You'll have to watch for the end range yourself by checking the key returned by getNext
. This API does not have a way to enforce range end values automatically.
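For example, a minimal sketch of a Base API range scan that enforces the end of the range itself (assuming db is an open Database whose keys are stored as strings in the platform default encoding, and beginKey/endKey are hypothetical String bounds):

Cursor cursor = db.openCursor(null, null);
try {
    DatabaseEntry key = new DatabaseEntry(beginKey.getBytes());
    DatabaseEntry data = new DatabaseEntry();
    OperationStatus status = cursor.getSearchKeyRange(key, data, LockMode.DEFAULT);
    while (status == OperationStatus.SUCCESS) {
        String k = new String(key.getData());
        if (k.compareTo(endKey) >= 0) {
            break;   // past the end of the range; stop here
        }
        // ... process this key/data pair ...
        status = cursor.getNext(key, data, LockMode.DEFAULT);
    }
} finally {
    cursor.close();
}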
If you use the Collections API you can call subMap or subSet and get an Iterator on the resulting collection. That iterator will enforce both the beginning and the end of the range automatically.
How do I perform a custom sort of secondary duplicates?
If you have a secondary database with sorted duplicates configured, you may wish to sort the duplicates according to some other field in the primary record. Let's say your secondary key is F1 and you have another field in your primary record, F2, that you wish to use for ordering duplicates. You would like to use F1 as your secondary key, with duplicates ordered by F2.
In Berkeley DB, the "data" for a secondary database is the primary key. When duplicates are allowed in a secondary, the duplicate comparison function simply compares those primary key values. Therefore, a duplicate comparison function cannot be used to sort by F2, since the primary record is not available to the comparison function.
The purpose of key and duplicate comparison functions in Berkeley DB is to allow sorting values in some way other than simple byte-by-byte comparison. In general it is not intended to provide a way to order keys or duplicates using record data that is not present in the key or duplicate entry. Note that the comparison functions are called very often - whenever any Btree operation is performed - so it is important that the comparison be fast.
There are two ways you can accomplish sorting by F2: (1) make the secondary key a composite of F1 plus F2, so that duplicates are effectively ordered by F2 and range queries on F1 remain possible (e.g. using Cursor.getSearchKeyRange); or (2) keep F1 alone as the secondary key and sort the duplicates by F2 in your application after retrieving them.
Option #1 has the advantage of automatically sorting by F2. However, you will never be able to do a join (via the Database.join
method) on the F1 key alone. You will be able to do a join on the F1+F2 value, but it seems unlikely that will be useful. Secondaries are often used for joins. Therefore, we recommend option #2 unless you are quite sure that you won't need to do a join on F1. The trade-offs are:
What is the best way to access duplicate records when not using collections?
Duplicate records are records that are in a single database and have the same key. Since there is more than one record per key, a simple lookup by key is not sufficient to find all duplicate records for that key.
When using the DPL, the SecondaryIndex.subIndex
method is the simplest way to access duplicate records.
When using the Base API, you need to position a cursor at the desired key, and then retrieve all the subsequent duplicate records. The Getting Started Guide has a good section on how to position your cursor: Search For Records and then how to retrieve the rest of the duplicates: Working with Duplicate Records.
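A minimal Base API sketch of reading all duplicates for a given key (assuming db is a Database opened with sorted duplicates and searchKey is a hypothetical byte array):

Cursor cursor = db.openCursor(null, null);
try {
    DatabaseEntry key = new DatabaseEntry(searchKey);
    DatabaseEntry data = new DatabaseEntry();
    // Position on the first record for the key, then walk its duplicates.
    OperationStatus status = cursor.getSearchKey(key, data, LockMode.DEFAULT);
    while (status == OperationStatus.SUCCESS) {
        // ... process this duplicate's data ...
        status = cursor.getNextDup(key, data, LockMode.DEFAULT);
    }
} finally {
    cursor.close();
}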
What's the best way to get a count of either all objects in a database, or all objects that match a join criteria?
As with most Btree based data stores, Berkeley DB Java Edition does not store record counts for non-duplicate records, so some form of internal or application based traversal is required to get the size of the result set. This is in general true of relational databases too; it's just that the count is done for you internally when the SQL count statement is executed. Berkeley DB Java Edition version 3.1.0 introduced a Database.count()
method, which returns the number of all key/data pairs in the database. This method does an optimized, internal traversal, does not impact the working set in the cache, but may not be accurate in the face of concurrent modifications in the database.
To get a transactionally current count, or to count the result of a join, do this:
cursor = db.openCursor(...);   // or: cursor = db.join(someCursors);
count = 0;
while (cursor.getNext(...) == OperationStatus.SUCCESS) {
    count++;
}
There are a few ways to optimize an application-implemented count:
- Use Cursor.count(), which returns the number of records that share the current record's key value. Suppose you want to count all the records in a database that supports duplicates and contains 3000 records, but only 3 distinct keys. In that case, it would be far more efficient to do:
count = 0;
while (cursor.getNextNoDup(...) == OperationStatus.SUCCESS) {
count += cursor.count();
}
because you will only look up 3 records (one for each key value), not 3000 records.
- Call DatabaseEntry.setPartial(0, 0, true) on the key and data DatabaseEntry to reduce the overhead of returning large records.
- Use LockMode.READ_UNCOMMITTED, which will be faster, especially in combination with calling DatabaseEntry.setPartial(0, 0, true) for the data entry. When read-uncommitted is used and no data is returned, only the key is read and much less I/O is performed. When using the DPL, the same thing can be accomplished by using LockMode.READ_UNCOMMITTED in combination with EntityIndex.keys(). The DPL keys method calls DatabaseEntry.setPartial(0, 0, true) for you.
- Use Btree.getLeafNodeCount(), obtained from Database.getStats(), under certain circumstances. This returns a valid count of the number of records in the database, but because it is obtained without locks or transactions the count is only correct when the database is quiescent. In addition, although stats generation takes advantage of some internal code paths, it may consume more memory when analyzing large databases.
How do I prevent "phantoms" when not using transactions?
Phantoms are records that can appear in the course of performing operations in one thread when records are inserted by other threads. For example, if you perform a key lookup and the record is not found, and then you later perform a lookup with the same key and the record is found, then the record was inserted by another thread and is called a phantom.
Phantoms and how to prevent them in transactional applications are described in Writing Transactional Applications under Configuring Serializable Isolation.
However, you may wish to prevent phantoms but you cannot use transactions. For example, if you are using Deferred Write, then you cannot use transactions. For phantoms that appear after a search by key, another technique for preventing them is to use a loop that tries to insert with putNoOverwrite, and if the insert fails then does a search by key.
Here is a code sketch using the base API:
Cursor cursor = ...;
DatabaseEntry key = ...;
DatabaseEntry insertData = ...;
DatabaseEntry foundData = ...;
boolean exists = false;
boolean done = false;
while (!done) {
OperationStatus status = cursor.putNoOverwrite(key, insertData);
if (status == OperationStatus.SUCCESS) {
/* A new record is inserted */
exists = false;
done = true;
} else {
status = cursor.getSearchKey(key, foundData, LockMode.RMW);
if (status == OperationStatus.SUCCESS) {
/* An existing record is found */
exists = true;
done = true;
}
/* else continue loop */
}
}
If the putNoOverwrite succeeds, the cursor holds the write lock on the inserted record and no other thread can change that record. If the putNoOverwrite fails, then the record must exist so we search by key to lock it. If the search succeeds, then the cursor holds the write lock on the existing record and no other thread can modify that record. If the search fails, then another thread must have deleted the record and we loop again.
This technique relies on a property of JE cursors called "cursor stability". When a cursor is positioned on a record, the cursor maintains its position regardless of the actions of other threads. The record at the cursor position is locked and no other thread may modify it. This is true whether transactions are used or not, and when deferred write is used.
With this technique it is necessary to use a cursor in order to hold the lock. Database.get, Database.putNoOverwrite and other Database methods do not hold a lock when used without an explicit transaction. The same is true of the corresponding DPL methods: EntityIndex.get, PrimaryIndex.put, etc.
Using this technique is recommended rather than using your own locking (with synchronized or java.util.concurrent). Custom locking is error prone and almost always unnecessary.
As an aside, JE cursor stability is slightly different when "dirty read" (LockMode.READ_UNCOMMITTED) is used. In this case, the cursor will remain positioned on the record, but another thread may change the record or even delete it.
Which are better: Private vs Shared Database instances?
Using a single Database instance for multiple threads is supported and, as of JE 4.0, has no performance drawbacks.
In JE 3.3 and earlier, using a single Database instance for multiple threads presented a minor bottleneck. The issue is that the Database object maintains a set of Cursors open against it. This set is used to check if all Cursors are closed against the Database when close() is called, but to do that JE has to synchronize against it before updating it. So if multiple threads are sharing the same Database handle it makes for a synchronization bottleneck. In a multi-threaded case, unless there's a good reason to share a Database handle, it's probably better to use separate handles for each thread.
Are there any locking configuration options?
JE 2.1.30 introduced two new performance motivated locking options.
No-locking mode is on one end of the spectrum. When EnvironmentConfig.setLocking(false)
is specified, all locking is disabled, which relieves the application of locking overhead. No-locking should be used with care. It's only valid in a non-transactional environment and the application must ensure that there is no concurrent activity on the database. Concurrent activity while in no-locking mode can lead to database corruption. In addition, log cleaning is disabled in no-locking mode, so the application is responsible for managing log cleaning through explicit calls to the Environment.cleanLog()
method.
On the other end of the spectrum is the je.lock.nLockTables
property, which can specify the number of lock tables. While the default is 1, increasing this number can improve multithreaded concurrency. The value of this property should be a prime number, and should ideally be the nearest prime that is not greater than the number of concurrent threads.
How can I estimate my application's optimal cache size?
A good starting point is to invoke DbCacheSize with the parameters:
-records <count> # Total records (key/data pairs); required
-key <bytes> # Average key bytes per record; required
[-data <bytes>] # Average data bytes per record; if omitted no leaf
# node sizes are included in the output
See the DbCacheSize javadoc for more information.
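For example, a command-line sketch (the record count and sizes are hypothetical, and the je.jar path depends on your installation):

java -cp je.jar com.sleepycat.je.util.DbCacheSize \
    -records 5000000 -key 16 -data 100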
Note that DbPrintLog -S gives the average record size under Log statistics, in the LN (leaf node) row, at the avg bytes column.
To measure the cache size for a 64-bit JVM, DbCacheSize needs to be run on the 64-bit JVM.
To take full advantage of JE cache memory, it is strongly recommended that compressed oops (-XX:+UseCompressedOops) is specified when a 64-bit JVM is used and the maximum heap size is less than 32 GB. As described in the referenced documentation, compressed oops is sometimes the default JVM mode even when it is not explicitly specified in the Java command. However, if compressed oops is desired then it must be explicitly specified in the Java command when running DbCacheSize or a JE application. If it is not explicitly specified then JE will not be aware of it, even if it is the JVM default setting, and will not take it into account when calculating cache memory sizes.
Why should the JE cache be large enough to hold the Btree internal nodes?
For read-write applications, we strongly recommend that the JE cache size be large enough to hold all Btree internal nodes (INs) for the records in the active data set. DbCacheSize can be used to estimate the cache size to hold all internal nodes for a given data set. For some applications the active data set may be a subset of the entire data set, for example, if there are hot spots in the access pattern. For truly random key access and other access patterns where there are no significant hot spots, the cache should be sized to hold all internal nodes.
JE, like most database products, performs best when the metadata (in this case the Btree internal nodes) needed to execute a read or write operation is present in its cache. For example, if the internal nodes at the bottom level of the Btree (BINs) are not in the cache, then for each operation a BIN must be fetched from the file system. This may often result in a random read I/O, although no storage device I/O will occur if the BIN happens to be present in the file system cache.
In addition, for write operations, the BIN will be dirtied. If the cache is not large enough to hold the BINs, the dirty BIN will quickly be evicted from the cache, and when it is evicted it will be written. The write of the BIN may be buffered, and the buffer will not be flushed to the file system until the write buffer fills, the log file fills, or another operation causes the buffer to be written.
The net effect is that additional reading and writing will be necessary when not all BINs are present in the JE cache. When all BINs are in cache, they will only be read when first accessed, and will only be written by a checkpoint. The checkpoint interval can be selected to trade off the cost of the writing, for reduced recovery time in the event of a crash.
The description of the performance trade-offs in the preceding paragraph is probably applicable, in a very rough sense, to many database products. For JE in particular, its log structured storage system adds another dimension to the picture. Each time a BIN is dirtied and written to the log via cache eviction, at least some redundant information is written to the log, because a BIN (like anything else in JE's append-only log) cannot be overwritten. The redundant BIN entries in the log must be garbage collected by the JE log cleaner, which adds an additional cost.
In general, the JE log cleaner thread acts like an application thread that is performing record updates. When it cleans a log file, each active record or internal node must be copied to the end of the log, which is very much like a record update (although nothing in the record is changed). These updates are just like application-initiated updates in the sense that when the cache does not hold a BIN that is needed, then the BIN must be fetched into cache, the BIN will be dirtied by the update, and the dirty BIN may soon be flushed to the log by cache eviction. If the cache is small enough and the application write rate is high enough, this can create a negative feedback cycle of eviction and log cleaning which has a large performance impact. This is the probably the most important reason that the JE cache should be sized to hold all internal nodes in the active data set.
It is worth noting that leaf nodes (LNs) are not affected by the issues discussed in the preceding two paragraphs. Leaf nodes hold record data and are logged at the time of the operation (insert, update or delete), as opposed to internal nodes (metadata) which are dirtied by each operation and later logged by checkpoints and eviction. Therefore, the preceding issue does not need to be taken into account when deciding whether to size the cache large enough to hold leaf nodes. Leaf nodes may be kept in the JE cache, the file system cache, or a combination of the two. In fact, it is often advantageous to evict leaf nodes immediately from the JE cache after an operation is completed (and rely instead on the file system cache to hold leaf nodes) because this reduces JVM GC cost -- see CacheMode.EVICT_LN for more information. DbCacheSize reports the amount of memory needed for internal nodes only, and for internal nodes plus leaf nodes.
Read-only applications are also not affected by the preceding issue. If only read operations are performed and the cache does not hold all internal nodes, then extra read I/Os may be necessary, but the append-only structured storage system and log cleaner do not come into play.
How can I tune JE's cache management policy for my application's access pattern?
JE, like most databases, performs best when database objects are found in its cache. The cache eviction algorithm is the way in which JE decides to remove objects from the cache and can be a useful policy to tune. The default cache eviction policy is LRU (least recently used) based. Database objects that are accessed most recently are kept within cache, while older database objects are evicted when the cache is full. LRU suits applications where the working set can stay in cache and/or some data records are used more frequently than others.
An alternative cache eviction policy was added in JE 2.0.83 that is instead primarily based on the level of the node in the Btree. This level based algorithm can improve performance for some applications with both of the following characteristics: (1) the data set is too large for its Btree nodes to fit entirely in the cache, and (2) records are accessed randomly, with little locality of reference.
The alternative cache eviction policy is specified by setting this configuration parameter in your je.properties file or EnvironmentConfig object: je.evictor.lruOnly=false
The level based algorithm works by evicting the lowest level nodes of the Btree first, even if higher level nodes are less recently used. In addition, dirty nodes are evicted after non-dirty nodes. This algorithm can benefit random access applications because it keeps higher level Btree nodes in the tree for as long as possible, which for a random key, can increase the likelihood that the relevant Btree internal nodes will be in the cache.
When using je.evictor.lruOnly=false, you may also consider changing the default value for je.evictor.nodesPerScan
to a value larger than the default of 10, to perhaps 100. This setting controls the number of Btree nodes that are considered, or sampled, each time a node is evicted. We have found in our tests that a setting of 100 produces good results when the system is I/O bound. The larger the nodesPerScan, the more accurate the algorithm.
However, don't set it too high. When considering larger numbers of nodes for each eviction, the evictor may delay the completion of a given database operation, which impacts the response time of the application thread. In JE 4.1 and later, setting this value too high in an application that is largely CPU bound can reduce the effectiveness of cache eviction. It's best to start with the default value, and increase it gradually to see if it is beneficial for your application.
The cache management policies described above apply to all operations within an environment. In JE 4.0.103, a new com.sleepycat.je.CacheMode
class was introduced, which lets the application indicate caching policy at the per-operation, per-database, or per environment level. CacheMode is best used when the application has specific knowledge about the access pattern of a particular set of data. For example, if the application knows a given database will be accessed only once and will not be needed again, it might be appropriate to use CacheMode.MAKE_COLD
.
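For example, a minimal sketch of setting a cache mode on a single cursor (assuming a JE release that includes CacheMode, 4.0.103 or later, and an open Database db):

Cursor cursor = db.openCursor(null, null);
try {
    // Leaf nodes read through this cursor are evicted from the JE cache
    // as soon as each operation completes, so a one-pass scan does not
    // displace hotter data.
    cursor.setCacheMode(CacheMode.EVICT_LN);
    DatabaseEntry key = new DatabaseEntry();
    DatabaseEntry data = new DatabaseEntry();
    while (cursor.getNext(key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
        // ... process data that will not be needed again soon ...
    }
} finally {
    cursor.close();
}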
When in doubt, it is best to avoid specific CacheMode directives, or at least to wait until application development is finished, and you are doing performance tuning with a holistic view.
How do I begin tuning performance?
Gathering environment statistics is a useful first step in JE performance tuning. Execute the following code snippet periodically to display statistics for the past period and to reset statistics counters for the next display.
StatsConfig config = new StatsConfig();
config.setClear(true);
System.err.println(env.getStats(config));
The Javadoc for com.sleepycat.je.EnvironmentStats describes each field. Cache behavior can have a major effect on performance, and nCacheMiss is an indicator of how hot the cache is. You may want to adjust the cache size, data access pattern, or cache eviction policy and monitor nCacheMiss.
Applications which use transactions may want to check nFSyncs to see how many of these costly system calls have been issued. Experimenting with other flavors of commit durability, like TxnWriteNoSync and TxnNoSync can improve performance.
nCleanerRuns and cleanerBacklog are indicators of log cleaning activity. Adjusting the property je.cleaner.minUtilization can increase or decrease log cleaning. The user may also elect to do batch log cleaning, as described in the Javadoc for Environment.cleanLog(), to control when log cleaning occurs.
High values for nRepeatFaultReads and nRepeatIteratorReads may indicate non-optimal read buffer sizes. See the FAQ entry on configuring read buffers.
In JE, what are the performance tradeoffs when storing to more than one database?
A user posted a question about the pros and cons of using multiple databases. The question was: We are designing an application where each system could handle at least 100 accounts. We currently need about 10 databases. We have three options we are considering: (1) one environment shared by all accounts, with a separate set of databases per account; (2) a separate environment per account; or (3) one environment and one shared set of databases, with the account identifier used as a key prefix.
All three options are practical solutions using JE. Which option is best depends on a number of trade-offs.
Option 1, a separate set of databases per account in a single environment: The data for each account is kept logically separate and easy to manage. Databases can be efficiently renamed, truncated and removed (see Environment.renameDatabase
, truncateDatabase
and removeDatabase
), although this is not as efficient as directly managing the log files, as with a separate environment (option 2). Copying a database can be done with the DbDump
and DbLoad
utilities, or with a custom utility.
With this option a single transaction can be used for records in multiple accounts, since transactions may span databases and a single environment is used for all accounts. Secondary indices cannot span accounts, since secondaries cannot span databases.
The cost of opening and closing accounts is mid-way between option 2 and 3. Opening and closing databases is less expensive than opening and closing environments.
The per-account overhead is lower than option 2, but higher than option 3. The per-database disk overhead is about 3 to 5 KB. The per-database memory overhead is about the same but is only incurred for open databases, so it can be minimized by closing databases that are not in active use. Note that prior to JE 3.3.62 this memory overhead was not reclaimed when a database was closed. For this reason, if large numbers of databases are used then option 1 is not recommended with releases prior to JE 3.3.62.
The checkpoint overhead is higher than option 3, because the number of databases is larger. How much this overhead matters depends on the data access pattern and checkpoint frequency. This tradeoff is described following option 3 below.
If database names are used to identify accounts, another issue is that Database.getDatabaseName
does a linear search of the records in the mapping tree and is slow. A workaround is to store the name in your own application, with the reference to the Database
.
Option 2, a separate environment per account: The data for each account is kept physically separate and easy to manage, since a separate environment directory exists for each account. Deleting and copying accounts can be performed as file system operations.
With this option a single transaction cannot be used for records in multiple accounts, since transactions may not span environments. Secondary indices cannot span accounts, since secondaries cannot span databases or environments.
The cost of opening and closing accounts is highest with this option, because opening and closing an environment is more expensive than opening and closing a database. Be sure to close environments cleanly to minimize recovery time.
This option has the highest overhead per account, because of the per-environment memory and disk space overhead, as well as the background threads for each environment. In JE 3.3.62 and later, a shared cache may be used for all environments. With this option it is important to configure a shared cache (see EnvironmentConfig.setSharedCache) to avoid a multiplying effect on the cache size of all open environments. For this reason, this option is not recommended with releases prior to JE 3.3.62.
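As a sketch, each per-account environment might be opened with a shared cache as follows (the directory path and cache size are only illustrative):

EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setAllowCreate(true);
envConfig.setTransactional(true);
envConfig.setSharedCache(true);              // one cache shared by all open environments
envConfig.setCacheSize(100 * 1024 * 1024);   // total size of the shared cache
Environment accountEnv =
    new Environment(new File("/accounts/account-1"), envConfig);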
The checkpoint overhead is higher than option 3, because the number of environments and databases is larger. How much this overhead matters depends on the data access pattern and checkpoint frequency. This tradeoff is described following option 3 below.
The data for each account is kept logically separate using the key prefix, but accounts must be managed using custom utilities that take this prefix into account. The DPL (see com.sleepycat.persist) may be useful for this option, since it makes it easy to use key ranges that are based on a key prefix.
With this option a single transaction can be used for records in multiple accounts, since a single environment is used for all accounts. Secondary indices can also span accounts, since the same database(s) are used for all accounts.
The cost of opening and closing accounts is lowest (close to zero), since neither databases nor environments are opened or closed.
Because the number of databases and environments is smallest, the per-account memory and disk overhead is lowest with this option. Key prefixing should normally be configured to avoid redundant storage of the account key prefix (see DatabaseConfig.setKeyPrefixing).
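If the base API is used, a minimal sketch of enabling key prefixing when opening the shared database might look like this (the database name is illustrative):

DatabaseConfig dbConfig = new DatabaseConfig();
dbConfig.setAllowCreate(true);
dbConfig.setTransactional(true);
dbConfig.setKeyPrefixing(true);   // avoid storing the account prefix redundantly
Database accountsDb = env.openDatabase(null, "accounts", dbConfig);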
The checkpoint overhead is lowest with this option, because the number of environments and databases is smallest. How much this overhead matters depends on the data access pattern and checkpoint frequency. This tradeoff is described below.
It costs more to checkpoint the root of a database than other portions. Whether this matters depends on how the application accesses the database. For example, it costs marginally less to insert 1000 records into 1 database than to insert 1 record into 1000 databases. However, it costs much less to checkpoint the former rather than the latter. Suppose we update 1 record in 1000 databases. In a small test program, the checkpoint takes around 730 ms. Suppose we update 1000 records in 1 database. In the same test, the checkpoint takes around 15 ms.
In addition, each environment has information that must be checkpointed, which will make the total checkpoint overhead somewhat larger in option 2, since each environment must be checkpointed separately.
Is a larger cache always better for JE?
In general, JE performs best when its working set fits within cache. But due to the interaction of Java garbage collection and JE, there can be scenarios when JE actually performs better with a smaller cache.
JE caches items by keeping references to database objects. To keep within the memory budget mandated by the cache size, JE will release references to those objects and they will be garbage collected by the JVM. Many JVMs use an approach called generational garbage collection. Objects are categorized by age in order to apply different collection heuristics. Garbage collecting items from the younger space is cheaper and is done with a "partial GC" pass while longer-lived items require a more expensive "Full GC".
If the application tends to access data records that are rarely re-used, and the JE cache has excessive capacity, the JE cache will become populated with data records that are no longer needed by the application. These data records will eventually age and the JVM will re-categorize them as older objects, which then provokes more Full GC. If the JE cache is smaller, JE itself will tend to dereference, or evict these once-used records more frequently and the JVM will have younger objects to garbage collect.
Garbage collection is really only an issue when the application is CPU bound. To find this point of equilibrium, the user can monitor EnvironmentStats.nCacheMiss and the application's throughput. Reducing the cache to the smallest size where nCacheMiss is 0 will show the optimal performance. Enabling GC statistics in the JVM can help too. (In the Java SE 5 JVM this is enabled with "-verbose:gc", "-XX:+PrintGCDetails", and "-XX:+PrintGCTimeStamps".)
What are JE read buffers and when should I change their size?
JE follows two patterns when reading items from disk. In one mode a single database object, which might be a Btree node or a single data record, is faulted in because the application is executing a Database or Cursor operation and cannot find the item in cache. In a second mode, JE will read large sequential portions of the log on behalf of activities like environment startup or log cleaning, and will read in one or multiple objects.
Single object reads use temporary buffers of a size specified by je.log.faultReadSize while sequential reads use temporary buffers of a size specified by je.log.iteratorReadSize. The defaults for these properties are listed in <jeHome>/example.properties, and are currently 2K and 8K respectively.
The ideal read buffer size is as small as possible to reduce memory consumption but is also large enough to adequately fit in most database objects. Because JE must fit the whole database object into a buffer when doing a single object read, a too-small read buffer for single object reads can result in wasted, repeated read calls. When doing sequential reading, JE can piece together parts of a database object, but a too-small read buffer for sequential reads may result in excessive copying of data. The nRepeatFaultReads and nRepeatIteratorReads fields in EnvironmentStats show the number of wasted reads for single and sequential object reading.
If nRepeatFaultReads is greater than 0, the application may try increasing the value of je.log.faultReadSize. If nRepeatIteratorReads is greater than 0, the application may want to adjust je.log.iteratorReadSize and je.log.iteratorMaxSize.
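For example, a sketch of adjusting these properties before opening the environment (the values are only illustrative starting points):

EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setConfigParam("je.log.faultReadSize", "4096");      // single object reads
envConfig.setConfigParam("je.log.iteratorReadSize", "16384");  // sequential reads
Environment env = new Environment(new File("/path/to/env"), envConfig);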
What are JE write buffers and when should I change their size?
JE log files are append only, and all record insertions, deletions, and modifications are added to the end of the current log file.
New data is buffered in write log buffers before being flushed to disk. As each log buffer is filled, a write system call is issued. As each .jdb file reaches its maximum size, an fsync system call is issued and a new .jdb file is created.
Increasing the write log buffer size and the JE log file size can improve write performance by decreasing the number of write and fsync calls. However, write log buffer size has to be balanced against the total JE memory budget, which is represented by the je.maxMemory property, or EnvironmentConfig.getCacheSize(). It may be more productive to use available memory to cache database objects rather than write log buffers. Likewise, increasing the JE log file size can make it harder for the log cleaner to effectively compress the log.
The number and size of the write log buffers is determined by je.log.bufferSize, je.log.numBuffers, and je.log.totalBufferBytes. By default, there are 3 write log buffers and they consume 7% of je.maxMemory. The nLogBuffers and bufferBytes fields in EnvironmentStats will show what the current settings are.
An application can experiment with the impact of changing the number and size of write log buffers. A non-transactional system may benefit by reducing the number of buffers to 2. Any write intensive application may benefit by increasing the log buffer sizes. That's done by setting je.log.totalBufferBytes to the desired value and setting je.log.bufferSize to the total buffer size/number of buffers. Note that JE will restrict write buffers to half of je.maxMemory, so it may be necessary to increase the cache size to grow the write buffers to the desired degree.
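As a sketch, again with illustrative values, the write log buffers can be grown like this (note that je.log.bufferSize is set to the total divided by the number of buffers):

EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setCacheSize(64 * 1024 * 1024);                        // je.maxMemory
envConfig.setConfigParam("je.log.numBuffers", "3");
envConfig.setConfigParam("je.log.totalBufferBytes", "6291456");  // 6 MB total
envConfig.setConfigParam("je.log.bufferSize", "2097152");        // 6 MB / 3 buffers
Environment env = new Environment(new File("/path/to/env"), envConfig);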
Why is my application performance slower with transactions?
Many users see a large performance difference when they enable or disable transactions in their application, without doing any tuning or special configuration.
The performance difference is the result of the durability (the D in ACID) of transactions. When transactions are configured, the default configuration is full durability: at each transaction commit, the transaction data is flushed to disk. This guarantees that the data is recoverable in the event of an application crash or an OS crash; however, it comes with a large performance penalty because the data is written physically to disk.
If you need transactions (for atomicity, for example, the A in ACID) but you don't need full durability, you can relax the durability requirement. When using transactions there are three durability options:
Transaction.commitSync() — changes are written and fsync'd to disk before the commit returns (full durability, the default).
Transaction.commitWriteNoSync() — changes are written to the file system but not fsync'd; data may be lost if the operating system crashes.
Transaction.commitNoSync() — changes remain in the JE log buffer; data may be lost if the application or the operating system crashes.
You can call these specific Transaction methods, or you can call commit and change the default using an environment parameter. Without transactions, JE provides the equivalent of commitNoSync durability.
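For example, a minimal sketch contrasting the commit flavors, assuming a transactional Environment (env), an open Database (db), and DatabaseEntry key and data values:

Transaction txn = env.beginTransaction(null, null);
db.put(txn, key, data);
txn.commitSync();              // full durability: flushed and fsync'd to disk

// Alternatively, relax durability for better throughput:
// txn.commitWriteNoSync();    // written to the file system, but not fsync'd
// txn.commitNoSync();         // left in the JE log buffer only

// Or change the default durability for plain commit() calls environment-wide:
EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setTransactional(true);
envConfig.setTxnWriteNoSync(true);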
Note that the performance of commitSync can vary widely by OS/disk/driver combination. Some systems are configured to buffer writes rather than flush all the way to disk, even when the application requests an fsync. If you need full durability guarantees, you must use an OS/disk/driver configuration that supports this.
The different durability options raise the question: How can changes be explicitly flushed to disk at a specific point in time, in a non-transactional application or when commitNoSync or commitWriteNoSync is used?
In a transactional application, you have three options for forcing changes to disk:
In a non-transactional application, you have two options for forcing changes to disk:
How can I improve performance of a cursor walk over a database?
Berkeley DB Java Edition (JE) appends records to the log, so they are stored in the order they are written, that is, in time or "temporal" order. But if the records are written in a non-sequential key order, that is, the "spatial" ordering is different from the "temporal" order, then reading them in key order will not read them in log (disk) order. Reading a disk in sequential order is faster than reading in random order. When key order is not equal to disk order, and neither the operating system's file system cache nor the JE cache is "hot", a database scan will generally be slower because the disk head may have to move on every read.
One way to improve read performance when key (spatial) order is not the same as disk (temporal) order is to preload the database into the JE cache using the Database.preload() method. preload() is optimized to read records in disk order, not key order. See the documentation.
The JE cache should be sized large enough to hold the pre-loaded data or you may actually see a negative performance impact.
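A minimal sketch of preloading, assuming an open Database named db; the byte limit is illustrative and should fit within the JE cache (depending on the release, preload may instead take a byte limit directly rather than a PreloadConfig):

PreloadConfig preloadConfig = new PreloadConfig();
preloadConfig.setMaxBytes(50 * 1024 * 1024);   // stop after roughly 50 MB has been loaded
preloadConfig.setLoadLNs(true);                // load the data records, not just the Btree internals
db.preload(preloadConfig);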
Another alternative is to change the application that writes the records to write them in key order. If records are rewritten in key order then a cursor scan will cause the records to be read in a disk-sorted order. The disk head will be moved a minimum number of times during the scan, and when it does, it will always move in the same direction.
There are different ways to reorder the records. If the application can be taken off-line DbDump/DbLoad can be used to reorder the records. See the DbDump and DbLoad documentation.
If the application cannot be taken off-line, the reordering can be accomplished by reading the records via a Cursor in either of the following ways:
Use Cursor.putCurrent() to re-write each one. When finished, although the total log size will be doubled, the log cleaner will eventually clean and delete the old (obsolete) records and hence some of the log files. If the rewrite (using a Cursor) is done gradually during the normal operation of the application, this will give the log cleaner a chance to delete the old files, having less impact on disk space consumption.
Use Database.put() to write the records to a different environment. When the rewriting is finished, switch to using the new environment, close the old environment, and remove its entire directory (delete all its log files).
What JVM parameters should I consider when tuning an application with a large cache size?
If your application has a large cache size, tuning the Java GC may be necessary. You will almost certainly be using a 64-bit JVM (i.e., -d64), the -server option, and setting your heap and stack sizes with -Xmx and -Xms. Be sure that you don't set the cache size too close to the heap size, so that your application has plenty of room for its data and to avoid excessive full GCs. We have found that the Concurrent Mark Sweep GC is generally the best in this environment since it yields more predictable GC results. This can be enabled with -XX:+UseConcMarkSweepGC.
Best practice also dictates that you disable explicit System.gc() calls with -XX:+DisableExplicitGC.
Other JVM options which may prove useful are -XX:NewSize (start with 512m or 1024m as a value), -XX:MaxNewSize (try 1024m as a value), and -XX:CMSInitiatingOccupancyFraction=55. NewSize is typically tuned in relationship to the overall heap size, so if you specify this parameter you will also need to provide a -Xmx value. A convenient way of specifying this in relative terms is to use -XX:NewRatio. The values we've suggested are only starting points; the actual values will vary depending on the runtime characteristics of the application.
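Putting these options together, a starting-point command line might look like the following (the heap sizes and the main class are only placeholders and must be tuned for your application):

java -d64 -server -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:CMSInitiatingOccupancyFraction=55 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -cp myapp.jar com.example.MyApp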
You may also want to refer to the following articles:
What is so different about JE log files?
See Six Things Everyone Should Know about JE Log Files.
You'll find more about log files in the Getting Started Guide.
How can I check the disk space utilization of my log files?
The previous FAQ explained the basics of JE log files and the concept of obsolete data. The JE package includes a utility that can be used to measure the utilization level of your log files. DbSpace gives you information about how packed your log files are. DbSpace can be used this way:
$ java -jar je.jar DbSpace
usage: java { com.sleepycat.je.util.DbSpace | -jar je.jar DbSpace }
-h <dir> # environment home directory
[-q] # quiet, print grand totals only
[-u] # sort by utilization
[-d] # dump file summary details
[-V] # print JE version number
For example, you might see this output:
$ java -jar je.jar DbSpace -h <environment directory> -q
File Size (KB) % Used
-------- --------- ------
TOTALS 18167 60
which says that in this 18 MB environment, 60% of the disk space used is taken by active JE log entries, and 40% is not utilized.
How can I find the location of checkpoints within the log files?
Checkpoints serve to limit the time it takes to re-open a JE environment, and also enable log cleaning, as described in this FAQ. By default, JE executes checkpoints transparently to the application, but it can be useful when troubleshooting to find the location of the checkpoints. For example, a lack of checkpointing could explain a lack of log file cleaning.
The following unadvertised option to the DbPrintLog utility provides a summary of checkpoint locations, at the end of the utility's output. For example, this command:
java com.sleepycat.je.util.DbPrintLog -h <environment directory> -S
will result in output of this type:
<..snip..>
Per checkpoint interval info:
lnTxn ln mapLNTxn mapLN end-end end-start start-end maxLNReplay ckptEnd
<..snip..>
16,911 0 0 2 2,155,691 2,152,649 3,042 16,911 0x2/0x87ed74
83,089 0 0 2 10,586,792 10,581,048 5,744 83,090 0x3/0x90e19c
0 0 0 0 0 0 0 0 0x3/0x90e19c
The last column indicates the location of the last checkpoint. The first number is the .jdb file and the second number is the offset in the file. In this case, the last checkpoint is in file 00000003.jdb at offset 0x90e19c. The log cleaner can not reclaim space in any .jdb files that follow that location.
Note that DbPrintLog is simply looking at the CkptStart and CkptEnd entries in the log and attempting to derive the checkpoint intervals from those entries. CkptStart can be missing because it was part of a file that was cleaned and deleted. CkptEnd can be missing for the same reason, or because the checkpoint never finished because of an abnormal shutdown.
Note that DbPrintLog -S can consume significant resources. If desired, the -s option can be used to restrict the amount of log analyzed by the utility.
Earlier versions of the Java collections API required that iterators be explicitly closed. How can they be used with other components that do not close iterators?
As of Berkeley DB Java Edition 3.0, the com.sleepycat.collections package is fully compatible with the Java Collections framework. In previous releases, Collections.size() was not supported, and collection iterators had to be explicitly closed. These incompatibilities have been addressed to provide full interoperability with other Java libraries that use the Java Collections Framework interfaces.
In earlier versions, the user had to consider the context any time a StoredCollection was passed to a component that would call its iterator() method. If that component is not aware of the StoredIterator class, it naturally will not call the StoredIterator.close() method. This causes a leak of unclosed Cursors, which can cause performance problems.
If the component cannot be modified to call StoredIterator.close(), the only solution is to copy the elements from the StoredCollection to a standard Java collection, and pass the resulting copy to the component. The simplest solution is to call the StoredCollection.toList() method to copy the StoredCollection to a standard Java ArrayList.
If the StoredCollection is large and only some of the elements need to be passed to the component, it may be undesirable to call the toList() method on the entire collection since that will copy all elements from the database into memory. There are two ways to create a standard Java collection containing only the needed elements:
How do I access duplicates using the Java collections API?
Let's say you have two databases: a primary database with the key {ID, attribute} and a secondary database with the key {ID}. How can you access all the attributes for a given ID? In other words, how can you access all the duplicate records in the secondary database for a given key?
We must admit at this point that the example given is partially a trick question. In this particular case you can get the attributes for a given ID without the overhead of creating and maintaining a secondary database! If you call SortedMap.subMap(fromKey,toKey) for a given ID in the primary database the resulting map will contain only the attributes you're interested in. For the fromKey pass the ID you're interested in and a zero (or lowest valued) attribute. For the toKey pass an ID one greater than the ID you're interested in (the attribute doesn't matter). Also see the extension method StoredSortedMap.subMap(Object,boolean,Object,boolean) if you would like more control over the subMap operation.
Note that this technique works only if the {ID, attribute} key is ordered. Because the ID is the first field in the key, the map is sorted primarily by ID and secondarily by attribute within ID. This type of sorting works with tuple keys, but not with serial keys. In general serial keys do not provide a deterministic sort order. To create a tuple key, use a TupleBinding or another tuple binding class in the com.sleepycat.bind.tuple package.
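As an illustrative sketch, assuming the primary database is exposed as a SortedMap whose keys are {ID, attribute} tuples, and where AccountKey is a hypothetical comparable key class created with a tuple binding and a numeric attribute:

// Returns a view containing only the attribute records for one ID.
SortedMap<AccountKey, String> attributesFor(SortedMap<AccountKey, String> primaryMap,
                                            long id) {
    AccountKey fromKey = new AccountKey(id, 0);     // the ID of interest with the lowest attribute value
    AccountKey toKey = new AccountKey(id + 1, 0);   // the next ID; the attribute does not matter
    return primaryMap.subMap(fromKey, toKey);       // fromKey inclusive, toKey exclusive
}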
But getting back to the general question of how to access duplicates, if you have a database with duplicates you can access the duplicates in a number of ways using the collections API: by creating a subMap or subSet for the key in question, as described above, or by using the extension method StoredMap.duplicates(). This is described in the StoredMap.duplicates() Javadoc (http://download.oracle.com/docs/cd/E17277_02/html/java/com/sleepycat/collections/StoredMap.html#duplicates%28java.lang.Object%29), in the retrieving by index key tutorial (http://download.oracle.com/docs/cd/E17277_02/html/collections/tutorial/retrievingbyindexkey.html), and in the using stored collections tutorial.
In earlier versions of the Java collections API, why did iterators need to be explicitly closed by the caller?
As of Berkeley DB Java Edition 3.0, the com.sleepycat.collections package is fully compatible with the Java Collections framework. In previous releases, Collections.size() was not supported, and collection iterators had to be explicitly closed. These incompatibilities have been addressed to provide full interoperability with other Java libraries that use the Java Collections Framework interfaces.
Using earlier releases, if you obtain an Iterator from a StoredCollection, it is always a StoredIterator and must be explicitly closed. Closing the iterator is necessary to release the locks held by the underlying cursor. To avoid performance problems, it is important to close the cursor as soon as it is no longer needed.
Since the Java Iterator interface has no close() method, the close() method on the StoredIterator class must be used. Alternatively, to avoid casting the Iterator to a StoredIterator, you can call the StoredIterator.close(Iterator) static method; this method will do nothing if the argument given is not a StoredIterator.
To ensure that an Iterator is always closed, even if an exception is thrown, use a finally clause. For example:
Iterator i = myStoredCollection.iterator();
try {
    while (i.hasNext()) {
        Object o = i.next();
        // do some work
    }
} finally {
    StoredIterator.close(i);
}
What is the complete definition of object persistence?
Persistence is defined in the documentation for the Entity annotation and the following topics are discussed:
How do I define primary keys, secondary keys and composite keys?
Primary keys and sequences are discussed in the documentation for the PrimaryKey annotation (http://download.oracle.com/docs/cd/E17277_02/html/java/com/sleepycat/persist/model/PrimaryKey.html) and the following general topics about keys are also discussed:
For information about secondary keys and composite keys, see the SecondaryKey annotation and the KeyField annotation.
How do I store and query data in an EntityStore?
Entities are stored by primary key in a PrimaryIndex and how to store entities is described in the PrimaryIndex class documentation.
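For example, a minimal sketch of storing and fetching entities by primary key, assuming an open transactional Environment named env and a Person entity class with a long primary key (the store name and the Person constructor are illustrative):

StoreConfig storeConfig = new StoreConfig();
storeConfig.setAllowCreate(true);
storeConfig.setTransactional(true);
EntityStore store = new EntityStore(env, "PersonStore", storeConfig);

PrimaryIndex<Long, Person> personById =
    store.getPrimaryIndex(Long.class, Person.class);

personById.put(new Person(1L, "Pat"));   // insert or update by primary key
Person pat = personById.get(1L);         // fetch by primary key
store.close();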
Entities may also be queried and deleted by secondary key using a SecondaryIndex and this is described in the SecondaryIndex class documentation. The SecondaryIndex documentation also discusses the four mappings provided by primary and secondary indexes:
Both PrimaryIndex and SecondaryIndex implement the EntityIndex interface. The EntityIndex documentation has more information about the four mappings along with example data. It also discusses the following general data access topics:
How are relationships defined and accessed?
Relationships in the DPL are defined using secondary keys. A secondary key provides an alternate way to lookup entities, and also defines a relationship between the entity and the key. For example, a Person entity with a secondary key field consisting of a set of email addresses defines a One-to-Many relationship between the Person entity and the email addresses. The Person entity has multiple email addresses and can be accessed by any of them.
Optionally a secondary key can define a relationship with another entity. For example, an Employee entity may have a secondary key called departmentName that is the primary key of a Department entity. This defines a Many-to-One relationship between the Employee and the Department entities. In this case we call the departmentName a foreign key and foreign key constraints are used to ensure that the departmentName key is valid.
Both simple key relationships and relationships between entities are defined using the SecondaryKey annotation. This annotation has properties for defining the type of relationship, the related entity (if any), and what action to take when the related entity is deleted.
For more information about defining relationships and to understand how relationships are accessed, see the SecondaryIndex documentation which includes the following topics:
What is the difference between embedding a persistent object and a relationship with another entity object?
There are two ways that an entity object may refer to another object. In the first approach, called embedding, the referenced object is simply defined as a field of the entity class. The only requirement is that the class of the embedded object be annotated with @Persistent. For example:
@Entity
class Person {
    @PrimaryKey
    long id;
    String name;
    Address address;
    private Person() {}
}

@Persistent
class Address {
    String street;
    String city;
    String state;
    String country;
    int postalCode;
    private Address() {}
}
The embedded object is stored in the same record as the entity, in this case in the Person PrimaryIndex. There is no way to access the Address object except by looking up the Person object in its PrimaryIndex and examining the Person.address field. The Address object cannot be accessed independently. If the same Address object is stored in more than one Person entity, a separate copy of the address will be stored in each Person record.
You may wish to access the address independently or to share the address information among multiple Person objects. To do that you must define the Address class as an @Entity and define a relationship between the Person and the Address entities using a secondary key. For example:
@Entity
class Person {
    @PrimaryKey
    long id;
    String name;
    @SecondaryKey(relate=MANY_TO_ONE, relatedEntity=Address.class)
    long addressId;
    private Person() {}
}

@Entity
class Address {
    @PrimaryKey
    long id;
    String street;
    String city;
    String state;
    String country;
    int postalCode;
    private Address() {}
}
With this second approach, the Address is an entity with a primary key. It is stored separately from the Person entity and can be accessed independently. After getting a Person entity, you must look up the Address by the Person.addressId value in the Address PrimaryIndex. If more than one Person entity has the same addressId value, the referenced Address is effectively shared.
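A sketch of traversing this relationship, assuming the entity classes above and an open EntityStore named store (fields are accessed directly here for brevity):

PrimaryIndex<Long, Person> personById =
    store.getPrimaryIndex(Long.class, Person.class);
PrimaryIndex<Long, Address> addressById =
    store.getPrimaryIndex(Long.class, Address.class);
SecondaryIndex<Long, Long, Person> personByAddressId =
    store.getSecondaryIndex(personById, Long.class, "addressId");

Person person = personById.get(1L);
Address address = addressById.get(person.addressId);   // the many-to-one lookup

// All Person entities that share this Address:
EntityCursor<Person> sharers = personByAddressId.subIndex(address.id).entities();
try {
    for (Person p : sharers) {
        // process each person at this address
    }
} finally {
    sharers.close();
}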
Why must all persistent classes (superclasses, subclasses and embedded classes) be annotated?
The requirement to make all classes persistent (subclasses and superclasses as well as embedded classes) is documented under "Complex and Proxy Types" in the documentation for the Entity annotation.
There are a couple of reasons for requiring that all persistent classes be annotated with @Persistent when using the annotation entity model (the default entity model).
First, bytecode annotation of persistent classes, to optimize marshaling and un-marshaling, is performed based on the presence of the annotation when the class is loaded. In order for this to work efficiently and correctly, all classes must be annotated.
Second, the annotation is used for version numbering when classes are evolved. For example, if a superclass's instance fields change and that source code is not under your control, you will have difficulty controlling the evolution of its instance fields. It is for this reason that a PersistentProxy is required when using external classes that cannot be annotated or controlled.
If the source code for a superclass or embedded class is not under your control, you should consider copying the data into your persistent class, or referencing it and using a PersistentProxy to translate the external class to a persistent class that is under your control.
If using JE annotations is undesirable then you may wish to subclass EntityModel and implement a different source of metadata. While annotations are very unobtrusive compared to other techniques such as implementing interfaces or subclassing from a common base class, we understand that they may be undesirable. Note, however, that if you don't use annotations then you won't get the performance benefits of bytecode annotation.
Why doesn't the DPL use the standard annotations defined by the Java Persistence API?
The EJB3 Java Persistence API is SQL-oriented, and could not be fully implemented with Berkeley DB except by adding SQL or some other query language first. There was a public discussion on TheServerSide of this issue during the design phase for the DPL.
Note that it is not just the annotation syntax that is different between DPL and the EJB3 Java Persistence API:
In general, Berkeley DB's advantages are that it provides better performance and a simpler usage model than is possible with an SQL database. To implement the EJB3 Persistence API for Berkeley DB would negate both of these advantages to some degree.
How do I dump and load the data in an EntityStore?
The DbDump and DbLoad utilities can be used to dump and load all Databases in an EntityStore. They can be used to copy, backup or restore the data in a store when you don't wish to copy the entire set of log files in the environment.
The Database Names section of the EntityStore documentation describes how to determine which database names to dump or load.
In addition, it is useful to understand the process for dumping and loading databases in general. Two points are important to keep in mind:
Here is an example of the steps to dump a store from one environment and load the dumped output in a second environment.
1. Obtain the list of database names for the store, using the DbDump -l command line option or the Environment.getDatabaseNames method. Write this list of database names to a file. See the Database Names section mentioned above for information on how to identify all databases for a store.
2. Dump each database, using the DbDump command line program or the DbDump class.
3. In the target environment, remove any existing databases for the store by calling Environment.removeDatabase for each existing database.
4. Load each dumped database, using the DbLoad command line program or the DbLoad class. Do not open the target EntityStore until all databases are loaded.
What is Carbonado and when should I use it instead of the Direct Persistence Layer (DPL)?
Carbonado is an open source Java persistence framework. It is published by Amazon on SourceForge:
Carbonado allows using Berkeley DB C Edition, Berkeley DB Java Edition, or an SQL database as an underlying repository. This is extremely useful when an abstraction that supports an SQL-based backend is a requirement, or when there is a need to synchronize data between Berkeley DB and SQL databases.
Because it supports SQL databases and the relational model, Carbonado provides a different set of features than the DPL. The following feature set comparison may be useful in deciding which API to use.
Feature | Carbonado | Direct Persistence Layer (DPL) |
---|---|---|
SQL and relational database support | Supports the relational model. The underlying repository may be an SQL database accessed via JDBC, or a Berkeley DB repository, both accessed using the same Carbonado API. Queries are performed by a simple query language that is not SQL but follows the relational model. | Does not support SQL databases or provide a query language. The relational model can be used, but is not enforced by the API. Queries are performed by directly accessing primary and secondary indexes. |
Synchronization of two repositories. | Supported for any two repositories, including Berkeley DB and SQL repositories. This allows using Berkeley DB as a front-end cache for an SQL database, for example. | Not supported. |
CRUD Operations | CRUD operations are performed using the ActiveRecord design pattern. CRUD methods are on the persistent object. | CRUD operations are performed by directly accessing primary and secondary indexes as in the Data Access Object design pattern, not by methods on the persistent objects. |
Relationships | Supports 1:1, 1:many, many:1, many:many. Relationships are traversed using methods on the persistent object. For X:1 relationships the method fetches the object. For X:many relationships the method returns a query that can be executed or composed with other queries. | Supports 1:1, 1:many, many:1, many:many. Relationships are traversed by directly accessing primary and secondary indexes. |
Transactions | Supports transactional use and all four ANSI isolation levels. Transactions are implicit per thread. | Supports transactional or non-transactional use and all four ANSI isolation levels. Transactions are explicitly passed as parameters and M:N transactions:threads are allowed. |
Persistent Object Model | JavaBean properties are persistent. Interfaces or abstract classes are defined to represent relational entities. The Storable interface must be extended by the interface or implemented by the abstract class. Instances are created using a Storage factory class. To conform to the relational model, nested objects are not supported and object graphs are not preserved. | Provides POJO persistence. All non-transient instance fields are persistent. Implementing interfaces or extending a base class is not required. Instances are created using ordinary constructors. Arbitrarily nested objects and arrays are supported and object graphs are preserved. |
Metadata description | Annotations are used to describe keys and other metadata. | Annotations are used to describe keys and other metadata. Or, metadata may be supplied externally to avoid the use of annotations. |
Class evolution | Adding and dropping fields and indexes is supported. | Adding, deleting, renaming, widening and converting fields and classes is supported, as is adding and dropping indexes. |
Optimistic locking | Supported via a Version property. | Not supported. |
Triggers | Supported for all write (create, update, delete) operations. | Not supported. |
Large objects | Supported but implemented for Berkeley DB using partial get/put operations, which do not provide the same level of performance as true LOB support. | Not supported. |
Sequences | Supported | Supported |
Support | Provided by the Carbonado open source community. | Provided by Oracle and the Berkeley DB open source community. |
JE HA replicated applications are by their nature distributed applications that communicate over the network. Attention to proper configuration will help prevent problems that may prove hard to diagnose and rectify in a running distributed environment. The checklist below is intended to help ensure that key configuration issues have been considered from the outset. It's also a good list to revisit in case you do encounter unusual problems after your application has been deployed, in case there were any inadvertent configuration changes.
Also, if multiple machines are in use, check that there is a network path between any two machines. Use a network utility like ping to verify that a network path exists.
Check the ports configured for use by the nodes in the replication group; a utility like netstat can be used to list the services running on each port and verify that the port is not already in use. Also, check that the port is reachable from every other host configured for use by the other nodes in the replication group. In particular, ensure that a firewall is not blocking use of that port. A tool like telnet can be used for this purpose.
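For example (the host name and port are placeholders):

$ ping node2.example.com
$ netstat -an | grep 5001
$ telnet node2.example.com 5001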
In large or even medium sized production networks, it's not unusual to have a person or group of network administrators who are responsible for the allocation of ports, to help minimize port conflicts, and to ensure that they are used in a consistent way across the enterprise. In such situations, it's best to consult with the networking group while making these network level configuration decisions.
Make sure the clocks on all the machines are synchronized, for example by running a daemon like ntpd. Use the time synchronization mechanism that's best suited to your particular platform. See Time Synchronization for further details.
DbDumpGroup can be used to display the nodes that are currently members of the replication group. For example, the following invocation will print out all the nodes present in the replicated environment that's stored in the directory /tmp/env:
java -jar je.jar DbGroupAdmin -groupName <group name> -helperHosts <host:port> -dumpGroup