[jdom-interest] JDOM and memory
Rolf Lear
jdom at tuis.net
Sat Jan 28 16:46:52 PST 2012
Hi Joe.
Thanks for that. I have run into this problem before, where the backing
array is larger than the actual String content. In the StringBin
code I specifically account for that:
https://github.com/hunterhacker/jdom/blob/master/core/src/java/org/jdom2/util/StringBin.java#L371
In essence, it ensures the String is as compact as possible.
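
As a rough illustration of the compacting idea (a minimal sketch, not the
actual StringBin code; the class and method names here are hypothetical),
copying the characters into a fresh String guarantees that the result does
not hold on to a larger shared array:

```java
class CompactDemo {

    // Hypothetical helper: returns an equal String backed by its own,
    // exactly-sized copy of the characters.
    static String compact(String value) {
        // new String(char[]) always copies the array it is given.
        return new String(value.toCharArray());
    }

    public static void main(String[] args) {
        String document = "<car><wheel/><wheel/></car>";
        // On the String implementations discussed in this thread, this
        // substring may share the backing array of the whole document.
        String name = document.substring(1, 4);
        String compacted = compact(name);
        System.out.println(compacted.equals(name)); // same content
        System.out.println(compacted != name);      // but a separate instance
    }
}
```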
Rolf
On 28/01/2012 7:10 PM, Joe Bowbeer wrote:
> A per-document string pool is a feature of binary xml formats.
>
> A potential problem with per-factory string pooling is the possibility
> of retaining large character arrays. Android's String class description
> explains the problem:
>
> This class is implemented using a char[]. The length of the array
> may exceed the length of the string. For example, the string "Hello"
> may be backed by the array |['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r',
> 'l', 'd']| with offset 0 and length 5.
> Multiple strings can share the same char[] because strings are
> immutable. The |substring(int)
> <http://developer.android.com/reference/java/lang/String.html#substring(int)>| method
> *always* returns a string that shares the backing array of its
> source string. Generally this is an optimization: fewer character
> arrays need to be allocated, and less copying is necessary. But this
> can also lead to unwanted heap retention. Taking a short substring
> of long string means that the long shared char[] won't be garbage
> until both strings are garbage. This typically happens when parsing
> small substrings out of a large input. To avoid this where
> necessary, call |new String(longString.substring(...))|. The string
> copy constructor always ensures that the backing array is no larger
> than necessary.
>
>
> ...from http://developer.android.com/reference/java/lang/String.html
>
> If xml parsers create new strings, is it to avoid retaining the entire
> source document?
>
> I suggest choosing a name for the Slim factory that is more descriptive
> of what it does, as "slim" may depend on taste and application.
>
> Joe
>
> On Sat, Jan 28, 2012 at 8:38 AM, Rolf Lear wrote:
>
> Hi All ... An update...
>
> I have played with a number of options, and have not had significant
> success with any.
>
> Merging Content-list into Element has a number of problems:
> 1. Document and Element end up duplicating a lot of code
> 2. It changes the API of Document and Element, because they would
> implement List<Content>
>
> Document and Element almost always contain content... it is seldom
> that you have empty Elements (there is normally some text at least).
> As a result, the savings of not having to have a content array are
> limited.
>
> There can be some saving in not having a separate object as the
> list, but it does not amount to much. Given the issues with the API
> this approach does not make sense.
>
> Michael Kay suggested keeping the ContentList independent of the
> Element, and creating an instance when it was referenced in
> getContent(). The problem with this is that the management of
> ConcurrentModification becomes very complicated, and, as far as I
> can tell, essentially impossible if there are multiple different
> instances of the ContentList class for any particular Element. Given
> that almost all Element instances have content, it is not worth the
> effort to lose the ConcurrentModification control, and not actually
> save any memory in a typical use case.
>
> So, neither option for changing the ContentList system is very
> successful.
>
> On the other hand, it is relatively common to have no Attributes on
> an Element. With some careful changes to the Element class (adding a
> hasAttributes() method and making the AttributeList variable a
> lazily initialised field), in ideal cases we never
> need to actually create an AttributeList instance for the Element.
> This has a significant impact on the 'hamlet' test, where there are
> essentially no attributes. It has no 'negative' impact on memory in
> the worst case either, and it has positive (small but significant)
> impact on performance.
>
> So, the lazy initialization of AttributeList is a 'win'.
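
The lazy-initialization pattern described above can be sketched roughly as
follows. This is only an illustration under simplifying assumptions, not
the JDOM source: LazyElement is a hypothetical class, and attributes are
reduced to plain Strings.

```java
import java.util.ArrayList;
import java.util.List;

class LazyElement {

    // Stays null until something actually needs the list.
    private List<String> attributes;

    // Lets callers check for attributes without forcing the allocation.
    boolean hasAttributes() {
        return attributes != null && !attributes.isEmpty();
    }

    // Creates the list on first access only.
    List<String> getAttributes() {
        if (attributes == null) {
            attributes = new ArrayList<>();
        }
        return attributes;
    }

    // Exposed only so the sketch can show when the allocation happens.
    boolean isAttributeListAllocated() {
        return attributes != null;
    }
}
```

Code that only asks hasAttributes() on attribute-free elements never pays
for the list object.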
>
> Finally, I have in the past had some success with the concept of
> 'reusing' String values. XML Parsers (like SAX, etc.) typically
> create a new String instance for all the variables they pass. For
> example, the Element names, prefixes, etc. are all new instances of
> String. Thus, if you have hundreds of Elements called 'car' in your
> input XML, you will get hundreds of different String Element names
> with the value 'car'. I have built a class that does something
> similar to String.intern() in order to rationalize the hundreds of
> different-but-equals() values that are passed in by the parsers.
>
> I have incorporated this 'caching' class in to a new JDOMFactory
> called 'SlimJDOMFactory'. This factory 'normalizes' all String
> values to a single instance of each unique String value. This
> significantly reduces the amount of memory used in the JDOM tree
> especially if there are lots of: similarly named attributes,
> elements, white-space-padding in otherwise empty elements, or
> between elements. This process is significantly slower though...
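
A minimal sketch of the String-reuse idea, assuming a plain HashMap as the
cache (the real StringBin is a purpose-built structure; SimpleStringPool
and reuse() are hypothetical names used only for illustration):

```java
import java.util.HashMap;
import java.util.Map;

class SimpleStringPool {

    // Maps each distinct value to the single instance kept for it.
    private final Map<String, String> pool = new HashMap<>();

    // Returns the pooled instance that equals() the given value,
    // adding a compact copy to the pool the first time it is seen.
    String reuse(String value) {
        if (value == null) {
            return null;
        }
        String existing = pool.get(value);
        if (existing != null) {
            return existing;
        }
        // Copy the characters so the pool never retains a larger array.
        String compact = new String(value.toCharArray());
        pool.put(compact, compact);
        return compact;
    }

    int size() {
        return pool.size();
    }
}
```

The extra map lookup on every String is where the additional parse time
would come from in a sketch like this.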
>
> For example, with the 'hamlet' test case, the 'baseline' memory
> footprint for hamlet in JDOM is 2.27MB in 4.75ms.
> With the SlimJDOMFactory it is: 1.77MB in 8.5ms
> With Lazy AttributeList it is: 2.06MB in 4.55ms
> With both it is: 1.57MB in 8.3ms
>
> I am pushing both of these changes in to github. The AttributeList
> is an easy one to justify. It is fully compatible with prior code,
> and it has positive memory and performance impacts.
>
> The SlimJDOMFactory is also justifiable when you consider:
> 1. The user has to decide to use it specifically.
> 2. The memory saving can be very significant.
> 3. Even though the parse time is slower, the GC time savings can be
> significant if the document 'hangs around' for a long time - the
> quicker GC time can add up fast.
> 4. When you have lots of code doing comparisons it is much faster to
> do equals() calls on Strings that are == as well. It saves a
> hashCode calculation as well as a string character scan to prove
> equals().
>
> Rolf
>
>
> On 02/01/2012 3:27 PM, Rolf wrote:
>
> Hi all.
>
> Memory optimization has never been a top priority for JDOM. At the same
> time, for what it does, JDOM is not a 'terrible' memory user. Still, I
> have done some analysis, and I believe I can trim about a quarter to a
> half of 'JDOM Overhead' memory usage by making two 'simple' changes....
>
> The first is to merge the ContentList class into the Element class (and
> also into Document). This will reduce the number of Java objects by
> about half, and that will save about 32 bytes per Element at a minimum
> in a 64-bit JRE. Additionally, by lazy-initialization of the Content
> array, we can save memory on otherwise 'empty' Elements.
>
> This can be done by making the Element (and perhaps Document) class
> implement 'List'. It can all be done in a 'backward compatible' way, but
> it also leads to some interesting possibilities, like:
>
> for (Content c : element) {
>     // ... do something
> }
>
> (for backward compatibility, Element.getContent() will return 'this').
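
A rough sketch of what an Element that is itself a List could look like,
assuming String content and a hypothetical ListElement class built on
java.util.AbstractList (this is not how JDOM implements Element):

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

class ListElement extends AbstractList<String> {

    // Lazily created, so an empty element carries no backing list.
    private List<String> content;

    @Override
    public String get(int index) {
        if (content == null) {
            throw new IndexOutOfBoundsException("Index: " + index);
        }
        return content.get(index);
    }

    @Override
    public int size() {
        return content == null ? 0 : content.size();
    }

    @Override
    public void add(int index, String child) {
        if (content == null) {
            content = new ArrayList<>();
        }
        content.add(index, child);
        modCount++; // keeps fail-fast iteration working
    }

    // Backward-compatible accessor: the element is its own content list.
    List<String> getContent() {
        return this;
    }
}
```

Because AbstractList supplies iterator(), an instance can be used directly
in a for-each loop, as in the snippet above.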
>
>
> The second change is to make the AttributeList instance in Element
> lazily initialized. This would save memory on all Elements that have no
> attributes, but would have an impact for people who sub-class the
> Element class and may expect the attributes field to be non-null.
>
>
> I am trying to get a feel for how important this sort of optimization
> may be. If there is interest then I will make some changes, and test the
> impact. I may make a separate branch in github to test it out....
>
> If the above changes are unrealistic then I don't think it makes sense
> to even try....
>
> Rolf