[jdom-interest] JDOM and memory

Sat Jan 28 16:46:52 PST 2012

Hi Joe.

Thanks for that. I have run into the problem before with the backing 
array not being the same as the actual String content. In the StringBin 
code I specifically account for that: 
https://github.com/hunterhacker/jdom/blob/master/core/src/java/org/jdom2/util/StringBin.java#L371

In essence, it ensures the String is as compact as possible.
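
As a rough illustration of that kind of compaction (a minimal sketch only,
not the actual StringBin code; the class and method names here are
hypothetical), a String can be rebuilt from an exactly-sized character
array so that it cannot share storage with a larger parent String:

```java
/** Hypothetical helper for illustration; not the actual StringBin code. */
final class StringCompactor {

    private StringCompactor() {
        // static helper only
    }

    /**
     * Returns a String with the same characters as the input, built from a
     * freshly allocated, exactly-sized character array.
     */
    static String compact(String input) {
        // toCharArray() allocates a new array of exactly input.length()
        // chars, and the String(char[]) constructor copies its contents, so
        // the result cannot share storage with any larger parent String.
        return new String(input.toCharArray());
    }

    public static void main(String[] args) {
        String document = "<root><car/><car/><car/></root>";
        // On JVMs where substring() shares the parent's array, 'name' would
        // keep the whole document text reachable.
        String name = document.substring(7, 10);
        String compacted = compact(name);
        System.out.println(compacted);          // prints "car"
        System.out.println(compacted != name);  // prints "true"
    }
}
```

Whether the extra copy is worth it depends on the JVM: it only helps where
substring() (or the parser) hands out Strings backed by a larger array.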

Rolf

On 28/01/2012 7:10 PM, Joe Bowbeer wrote:
> A per-document string pool is a feature of binary xml formats.
>
> A potential problem with per-factory string pooling is the possibility
> of retaining large character arrays.  Android's String class description
> explains the problem:
>
>     This class is implemented using a char[]. The length of the array
>     may exceed the length of the string. For example, the string "Hello"
>     may be backed by the array |['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r',
>     'l', 'd']| with offset 0 and length 5.
>     Multiple strings can share the same char[] because strings are
>     immutable. The |substring(int)
>     <http://developer.android.com/reference/java/lang/String.html#substring(int)>| method
>     *always* returns a string that shares the backing array of its
>     source string. Generally this is an optimization: fewer character
>     arrays need to be allocated, and less copying is necessary. But this
>     can also lead to unwanted heap retention. Taking a short substring
>     of long string means that the long shared char[] won't be garbage
>     until both strings are garbage. This typically happens when parsing
>     small substrings out of a large input. To avoid this where
>     necessary, call |new String(longString.substring(...))|. The string
>     copy constructor always ensures that the backing array is no larger
>     than necessary.
>
>
> ...from http://developer.android.com/reference/java/lang/String.html
>
> If xml parsers create new strings, is it to avoid retaining the entire
> source document?
>
> I suggest choosing a name for the Slim factory that is more descriptive
> of what it does, as "slim" may depend on taste and application.
>
> Joe
>
> On Sat, Jan 28, 2012 at 8:38 AM, Rolf Lear wrote:
>
>     Hi All ... An update...
>
>     I have played with a number of options, and have not had significant
>     success with any.
>
>     Merging Content-list into Element has a number of problems:
>     1. Document and Element end up duplicating a lot of code
>     2. It changes the API of Document and Element with it implementing
>     List<Content>
>
>     Document and Element almost always contain content... it is seldom
>     that you have empty Elements (there is normally some text at least).
>     As a result, the savings of not having to have a content array are
>     limited.
>
>     There can be some saving in not having a separate object as the
>     list, but it does not amount to much. Given the issues with the API
>     this approach does not make sense.
>
>     Michael Kay suggested keeping the ContentList independent of the
>     Element, and creating an instance when it was referenced in
>     getContent(). The problem with this is that the management of
>     ConcurrentModification becomes very complicated, and, as far as I
>     can tell, essentially impossible if there are multiple different
>     instances of the ContentList class for any particular Element. Given
>     that almost all Element instances have content, it is not worth the
>     effort to lose the ConcurrentModification control, and not actually
>     save any memory in a typical use case.
>
>     So, neither option for changing the ContentList system is very
>     successful.
>
>     On the other hand, it is relatively common to have no Attributes on
>     an Element. With some careful changes to the Element class (adding a
>     hasAttributes() method and making the AttributeList variable a
>     lazily initialised field), in ideal cases we never need to actually
>     create an AttributeList instance for the Element.
>     This has a significant impact on the 'hamlet' test, where there are
>     essentially no attributes. It has no 'negative' impact on memory in
>     the worst case either, and it has positive (small but significant)
>     impact on performance.
>
>     So, the lazy initialization of AttributeList is a 'win'.
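
The lazy-initialisation idea described above can be sketched roughly as
follows. This is a simplified stand-in, not JDOM's Element or AttributeList:
the LazyElement class, its name/value pair storage, and the
isAttributeListAllocated() probe are all hypothetical and exist only for
illustration; only the hasAttributes() method name comes from the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an element; not JDOM's Element class. */
final class LazyElement {

    private final String name;

    // Stays null until the first attribute is set, so attribute-free
    // elements never pay for a list object.
    private List<String[]> attributes;

    LazyElement(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    /** Answers without forcing the list to be created. */
    boolean hasAttributes() {
        return attributes != null && !attributes.isEmpty();
    }

    void setAttribute(String attName, String value) {
        if (attributes == null) {
            // First write: allocate the list only now.
            attributes = new ArrayList<>(2);
        }
        for (String[] pair : attributes) {
            if (pair[0].equals(attName)) {
                pair[1] = value; // replace an existing value
                return;
            }
        }
        attributes.add(new String[] {attName, value});
    }

    /** Read access must tolerate the list never having been created. */
    String getAttributeValue(String attName) {
        if (attributes == null) {
            return null;
        }
        for (String[] pair : attributes) {
            if (pair[0].equals(attName)) {
                return pair[1];
            }
        }
        return null;
    }

    /** Illustration-only probe to show whether the list exists yet. */
    boolean isAttributeListAllocated() {
        return attributes != null;
    }
}
```

The cost of this design is that every code path touching the field has to
handle null, which is also why subclasses that read the field directly could
be affected.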
>
>     Finally, I have in the past had some success with the concept of
>     'reusing' String values. XML Parsers (like SAX, etc.) typically
>     create a new String instance for all the variables they pass. For
>     example, the Element names, prefixes, etc. are all new instances of
>     String. Thus, if you have hundreds of Elements called 'car' in your
>     input XML, you will get hundreds of different String Element names
>     with the value 'car'. I have built a class that does something
>     similar to String.intern() in order to rationalize the hundreds of
>     different-but-equals() values that are passed in by the parsers.
>
>     I have incorporated this 'caching' class into a new JDOMFactory
>     called 'SlimJDOMFactory'. This factory 'normalizes' all String
>     values to a single instance of each unique String value. This
>     significantly reduces the amount of memory used in the JDOM tree,
>     especially if there are lots of similarly named attributes and
>     elements, or white-space padding in otherwise empty elements or
>     between elements. This process is significantly slower though...
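
The 'normalizing' step described above can be illustrated with a very small
pool. This is a sketch under simplifying assumptions, not the class used by
SlimJDOMFactory: SimpleStringPool and reuse() are hypothetical names, and a
plain HashMap stands in for whatever structure the real code uses.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pool for illustration; not the SlimJDOMFactory internals. */
final class SimpleStringPool {

    // Maps each distinct value to the first instance seen with that value.
    private final Map<String, String> pool = new HashMap<>();

    /**
     * Returns the pooled instance that equals() the given value, adding the
     * value to the pool if it has not been seen before.
     */
    String reuse(String value) {
        if (value == null) {
            return null;
        }
        String cached = pool.get(value);
        if (cached == null) {
            pool.put(value, value);
            return value;
        }
        return cached;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        SimpleStringPool pool = new SimpleStringPool();
        // Two different-but-equals() instances, as a parser might deliver.
        String first = new String("car");
        String second = new String("car");
        System.out.println(first == second);                          // prints "false"
        System.out.println(pool.reuse(first) == pool.reuse(second));  // prints "true"
    }
}
```

The lookup on every value is where the extra parse time comes from: each
String has to be hashed and compared before it can be shared.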
>
>     For example, with the 'hamlet' test case, the 'baseline' memory
>     footprint for hamlet in JDOM is 2.27MB, parsed in 4.75ms.
>     With the SlimJDOMFactory it is 1.77MB in 8.5ms.
>     With the lazy AttributeList it is 2.06MB in 4.55ms.
>     With both it is 1.57MB in 8.3ms.
>
>     I am pushing both of these changes to github. The AttributeList
>     change is an easy one to justify: it is fully compatible with prior
>     code, and it has positive memory and performance impacts.
>
>     The SlimJDOMFactory is also justifiable when you consider:
>     1. the user has to decide to use it specifically.
>     2. The memory saving can be very significant.
>     3. Even though the parse time is slower, the GC time savings can be
>     significant if the document 'hangs around' for a long time - the
>     quicker GC time can add up fast.
>     4. When you have lots of code doing comparisons it is much faster to
>     do equals() calls on Strings that are == as well. The identity
>     check succeeds immediately, which saves the character-by-character
>     scan otherwise needed to prove equals().
>
>     Rolf
>
>
>     On 02/01/2012 3:27 PM, Rolf wrote:
>
>         Hi all.
>
>         Memory optimization has never been a top priority for JDOM. At the
>         same time, for what it does, JDOM is not a 'terrible' memory user.
>         Still, I have done some analysis, and I believe I can trim about a
>         quarter to a half of 'JDOM Overhead' memory usage by making two
>         'simple' changes....
>
>         The first is to merge the ContentList class into the Element class
>         (and also into Document). This will reduce the number of Java
>         objects by about half, and that will save about 32 bytes per
>         Element at a minimum in a 64-bit JRE. Additionally, by
>         lazy-initialization of the Content array, we can save memory on
>         otherwise 'empty' Elements.
>
>         This can be done by making the Element (and perhaps Document) class
>         implement 'List'. It can all be done in a 'backward compatible'
>         way, but also leads to some interesting possibilities, like:
>
>         for (Content c : element) {
>             // ... do something
>         }
>
>         (for backward compatibility, Element.getContent() will return
>         'this').
>
>
>         The second change is to make the AttributeList instance in Element
>         a lazily initialized field. This would save memory on all Elements
>         that have no attributes, but would have an impact for people who
>         sub-class the Element class and may expect the attributes field to
>         be non-null.
>
>
>         I am trying to get a feel for how important this sort of
>         optimization may be. If there is interest then I will make some
>         changes, and test the impact. I may make a separate branch in
>         github to test it out....
>
>         If the above changes are unrealistic then I don't think it makes
>         sense to even try....
>
>         Rolf
>
>
>