I have a large number of name - value pairs (approx 100k) that I need to store in some sort of cache (say a hash map) where the value is a string with an average of about 30k bytes in size.
Now I know for a fact that a large number of the values have exactly the same string data. In order to avoid having to allocate the identical string data several times, I would like to somehow reuse a previously allocated string thus consuming less memory. In addition this needs to be reasonably fast. i.e. scanning through all the previously allocated values one-by-one is not an option.
Any recommendations on how I could solve this problem?
-
String.intern() will help you here (most likely). It will resolve multiple instances of the same string down to one copy.
EDIT: I suggested this would 'most likely' help. In what scenarios will it not ? Interning strings will have the effect of storing those interned string representations permanently. If the problem domain is a one-shot process, this may not be an issue. If it's a long running process (such as a web app) then you may well have a problem.
I would hesitate to say never use interning (I would hesistate to say never do anything). However there are scenarios where it's not ideal.
Tom Hawtin - tackline : String.intern can be quite slow. It also places the String into the permanent generation, which could well cause GC performance problems.Brian Agnew : The permanent generation is an issue, granted. The question doesn't have the context in which this is to be used. If it's a standalone app, then it may well be ok. Otherwise (say a ong running web app), then not. As ever, solutions need to be evaluated in the context of where they'll be used.Stu Thompson : @Brian Agnew: My I suggest you edit and expand your answer then to include context? Comments don't count, if you get my drift. -
String.internis the obvious choice as Brian says. But if you don't want to intern across all the String in memory, you can use a Set to first see if the value is present. Here's untested code. You will have to work out removing from reverse map when removing from mainclass Map2<K, V> implements Map<K, V> { Map<K, V> _map = Maps.newHashMap(); Set<V, V> _rev = Maps.newHashMap(); V put(K k, V v) { if (_rev.containsKey(v)) { V prev = _rev.get(v); return _map.put(k, prev); } else { _rev.put(v, v); return _map.put(k,v); } }Tom Hawtin - tackline : ConcurrentMap has putIfAbsent, which might be useful.Ingo : I like this solution, it's no overkill with weak references etc. To optimize even more on storage, one could just search the existing values in the Map, given that the total number is small (say <10000). Upvote! -
Do not use String.intern (there have been various memory issues related to this through the years). instead, create your own cache, similar to String.intern. basically, you want a Map, where each key maps to itself. then, before caching any string, you "intern" it:
private Map<String,WeakReference<String>> myInternMap = new WeakHashMap<String,,WeakReference<String>>(); public String intern(String value) { synchronized(myInternMap) { WeakReference<String> curRef = myInternMap.get(value); String curValue = ((curRef != null) ? curRef.get() : null); if(curValue != null) { return curValue; } myInternMap.put(value, new WeakReference<String>(value)); return value; } }note, you use weakreferences for the keys and values so that you don't keep references for strings which you are no longer using.
kdgregory : james? as in JT?james : yep, tis JT. too funny that i wrote your code for you.StaxMan : No, this is very BAD advice. Most such comments refer to rather old issues for now obsolete JVMs. There is absolutely nothing wrong with String.intern() for long-living shared Strings. Much less than issues with roll-your-own replacements. -
It somewhat depends upon how you are creating the
String.One possible way is to use
TreeSetthat uses aComparatorthat can compare existingStrings and the source of your newString. UseSortedSet.tailSetand anIteratorto find an existingString. Or alternativelyNavigableSet.ceiling/flooror aTreeMapwith a similar setup.I wrote a weblog entry about another technique to cache immutable objects (in particular Strings), but that is more suitable for smaller objects.
String.internhas performance problems. -
Agreed with others on not using String.intern(): once you've put a string there, it will never go away. Look to early revisions of Xerces for why this is a bad idea.
A better solution is to use a WeakHashMap, wrapping the value in a WeakReference:
private Map<String,WeakReference<String>> _map = new WeakHashMap<String,WeakReference<String>>(); public synchronized String intern(String str) { WeakReference<String> ref = _map.get(str); String s2 = (ref != null) ? ref.get() : null; if (s2 != null) return s2; str = new String(str); _map.put(str, new WeakReference(str)); return str; }This code is from an article that I wrote on the Java reference objects. You'll find the explanation there.
EDIT: need to create a new string here (and I'll update the article) because the original might be a substring of a far larger character array. I thought that was fixed around JDK 1.3, but apparently not (at least not in 1.5).
PintSizedCat : Interning a string will not mean that it will 'never go away', you can garbage collect the perm gen though it may not be so efficient it can and will get garbage collected if there are no strong references to it.kdgregory : The permgen, at least in the Sun JVM, is managed separately from the rest of the heap. If you can point to code that removes strings from the intern table, then I'm willing to retract my statement. -
You could compress the strings. A 30K string should get a good compression ratio. I wrote a hack to compress large String as an exercise, but you could use a byte[] of the compressed data to store the String.
A 30K character string will use about 60KB (2 bytes per character) so even using getBytes() is likely to be an improvement.
-
Do you actually need Strings, or do you just need any old CharSequence? If not, then consider implementing a "compact" CharSequence such as the one I suggest in the link.
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.