Class Lexicon
- java.lang.Object
-
- com.gengoai.hermes.lexicon.Lexicon
-
- All Implemented Interfaces:
Extractor, PrefixSearchable, WordList, Serializable, Iterable<String>, Predicate<HString>
- Direct Known Subclasses:
PersistentLexicon, TrieLexicon
public abstract class Lexicon extends Object implements Predicate<HString>, WordList, Extractor, PrefixSearchable, Serializable
A traditional approach to information extraction incorporates the use of lexicons, also called gazetteers, for finding specific lexical items in text. Hermes's Lexicon classes provide the ability to match lexical items using a greedy longest match first or maximum span probability strategy. Both matching strategies allow for case-sensitive or case-insensitive matching and the use of constraints (using the Lyre expression language), such as part-of-speech, on the match.
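The greedy longest-match-first strategy described above can be sketched as follows. This is a minimal, self-contained illustration and not Hermes's implementation: the class `GreedyMatchSketch`, its constructor, and its `match` method are hypothetical names, tokens are plain strings instead of HString spans, and constraints and probabilities are left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a lexicon; not part of the Hermes API.
class GreedyMatchSketch {
    private final Set<String> lemmas = new HashSet<>();
    private final boolean caseSensitive;
    private int maxTokenLength = 0; // longest entry, measured in tokens

    GreedyMatchSketch(boolean caseSensitive, String... entries) {
        this.caseSensitive = caseSensitive;
        for (String entry : entries) {
            String[] tokens = entry.trim().split("\\s+");
            // Store entries with single spaces so they compare equal to joined token spans.
            lemmas.add(normalize(String.join(" ", tokens)));
            maxTokenLength = Math.max(maxTokenLength, tokens.length);
        }
    }

    // Lower-cases the text only when the lexicon is case-insensitive.
    private String normalize(String s) {
        return caseSensitive ? s : s.toLowerCase(Locale.ROOT);
    }

    // Scans left to right; at each position the longest candidate span is tried first.
    List<String> match(List<String> tokens) {
        List<String> matches = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int longest = Math.min(maxTokenLength, tokens.size() - i);
            boolean found = false;
            for (int len = longest; len >= 1; len--) {
                String span = String.join(" ", tokens.subList(i, i + len));
                if (lemmas.contains(normalize(span))) {
                    matches.add(span);
                    i += len; // consume the matched tokens so matches never overlap
                    found = true;
                    break;
                }
            }
            if (!found) {
                i++; // no entry starts at this token; move on
            }
        }
        return matches;
    }
}
```

With the entries "New York", "New York City", and "York" in a case-insensitive sketch lexicon, the tokens I, love, new, york, city yield the single match "new york city", because the three-token span is tried before the shorter ones.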
Lexicons are managed using the LexiconManager, which acts as a cache associating lexicons with a name and a language. This allows for lexicons to be defined via configuration and then to be loaded and retrieved by their name (this is particularly useful for annotators that use lexicons).
Lexicons are defined using a LexiconSpecification in the following format:
lexicon:(mem|disk):name(:(csv|json))*::RESOURCE(;ARG=VALUE)*
The scheme of the specification is "lexicon" and the currently supported protocols are:
mem: An in-memory Trie-based lexicon.
disk: A persistent on-disk lexicon.
The name of the lexicon is used during annotation to mark the provider. Additionally, a format (csv or json) can be specified to set the lexicon format when creating in-memory lexicons; json is the default if none is provided. Finally, a number of query parameters (ARG=VALUE) can be given from the following choices:
caseSensitive=(true|false): Is the lexicon case-sensitive (true) or case-insensitive (false) (default false).
defaultTag=TAG: The default tag value for an entry when one is not defined (default null).
language=LANGUAGE: The default language of entries in the lexicon (default Hermes.defaultLanguage()).
and the following for CSV lexicons:
lemma=INDEX: The index in the csv row containing the lemma (default 0).
tag=INDEX: The index in the csv row containing the tag (default 1).
probability=INDEX: The index in the csv row containing the probability (default 2).
constraint=INDEX: The index in the csv row containing the constraint (default 3).
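To make the specification format concrete, the sketch below splits a specification string into its parts. It is an illustration only: Hermes provides its own LexiconSpecification parsing, and the class `LexiconSpecSketch`, the `parse` method, the map keys, and the example resource paths are all hypothetical. Only the first format segment is read, and the documented defaults for the format and for caseSensitive are applied.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the Hermes LexiconSpecification class.
class LexiconSpecSketch {

    // Splits "lexicon:(mem|disk):name(:(csv|json))*::RESOURCE(;ARG=VALUE)*" into its parts.
    static Map<String, String> parse(String spec) {
        int sep = spec.indexOf("::");
        if (sep < 0) {
            throw new IllegalArgumentException("Missing '::' before the resource: " + spec);
        }
        String[] head = spec.substring(0, sep).split(":");
        if (head.length < 3 || !head[0].equals("lexicon")) {
            throw new IllegalArgumentException("Expected lexicon:PROTOCOL:NAME, got: " + spec);
        }
        if (!head[1].equals("mem") && !head[1].equals("disk")) {
            throw new IllegalArgumentException("Unsupported protocol: " + head[1]);
        }

        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("protocol", head[1]);
        parts.put("name", head[2]);
        // json is the documented default format when none is given
        parts.put("format", head.length > 3 ? head[3] : "json");
        // documented default for the caseSensitive argument
        parts.put("caseSensitive", "false");

        // Everything after "::" is the resource followed by optional ARG=VALUE pairs.
        String[] tail = spec.substring(sep + 2).split(";");
        parts.put("resource", tail[0]);
        for (int i = 1; i < tail.length; i++) {
            int eq = tail[i].indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Expected ARG=VALUE, got: " + tail[i]);
            }
            parts.put(tail[i].substring(0, eq), tail[i].substring(eq + 1));
        }
        return parts;
    }
}
```

For example, the hypothetical specification lexicon:mem:person:csv::/data/person.csv;caseSensitive=true;tag=1 would describe an in-memory lexicon named person, read from a CSV resource, matched case-sensitively, with the tag taken from column 1.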
- Author:
- David B. Bracewell
- See Also:
- Serialized Form
-
Constructor Summary
Constructors
Lexicon()
-
Method Summary
Modifier and Type, Method: Description
abstract void add(LexiconEntry lexiconEntry): Adds an entry to the lexicon.
void addAll(@NonNull Iterable<LexiconEntry> lexiconEntries): Adds all lexicon entries in the given iterable to the lexicon.
abstract Set<LexiconEntry> entries()
Extraction extract(@NonNull HString source): Generate an Extraction from the given HString.
abstract Set<LexiconEntry> get(@NonNull String word): Returns the LexiconEntry associated with a given word in the Lexicon or an empty set if there are none.
abstract int getMaxLemmaLength()
abstract int getMaxTokenLength()
abstract String getName()
double getProbability(@NonNull HString hString): Gets the maximum probability for matching the given HString.
double getProbability(@NonNull HString hString, @NonNull Tag tag): Gets the maximum probability for matching the given HString with the given Tag.
double getProbability(@NonNull String lemma): Gets the maximum probability for matching the given String.
double getProbability(@NonNull String string, @NonNull Tag tag): Gets the maximum probability for matching the given String with the given tag.
Optional<String> getTag(@NonNull HString hString): Gets the first matched tag, if one, for the given HString.
Optional<String> getTag(@NonNull String lemma): Gets the first matched tag, if one, for the given String.
abstract boolean isCaseSensitive(): Is the Lexicon case sensitive or not.
abstract boolean isProbabilistic(): Is the Lexicon probabilistic or not.
abstract List<LexiconEntry> match(@NonNull HString string): Gets the matched entries for a given HString.
abstract List<LexiconEntry> match(@NonNull String term): Returns the LexiconEntry items matching the given term, or an empty list if there are none.
protected String normalize(CharSequence sequence): Normalizes the string based on whether the lexicon is case sensitive or not.
abstract int size(): The number of lexical items in the lexicon.
boolean test(@NonNull HString hString)
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface java.lang.Iterable
forEach, iterator, spliterator
-
Methods inherited from interface com.gengoai.hermes.lexicon.PrefixSearchable
isPrefixMatch, isPrefixMatch, prefixes
-
Method Detail
-
add
public abstract void add(LexiconEntry lexiconEntry)
Adds an entry to the lexicon.
- Parameters:
lexiconEntry - the lexicon entry to add
-
addAll
public void addAll(@NonNull Iterable<LexiconEntry> lexiconEntries)
Adds all lexicon entries in the given iterable to the lexicon.
- Parameters:
lexiconEntries - the lexicon entries to add
-
entries
public abstract Set<LexiconEntry> entries()
- Returns:
- the set of lexicon entries in the lexicon
-
extract
public Extraction extract(@NonNull HString source)
Description copied from interface: Extractor
Generate an Extraction from the given HString.
-
get
public abstract Set<LexiconEntry> get(@NonNull String word)
Returns the LexiconEntry associated with a given word in the Lexicon or an empty set if there are none.
- Parameters:
word - the word in the lexicon whose entries we want
- Returns:
- the LexiconEntry associated with a given word in the Lexicon or an empty set if there are none.
-
getMaxLemmaLength
public abstract int getMaxLemmaLength()
- Returns:
- the max lemma length
-
getMaxTokenLength
public abstract int getMaxTokenLength()
- Returns:
- the max token length
-
getName
public abstract String getName()
- Returns:
- the name of the lexicon
-
getProbability
public final double getProbability(@NonNull HString hString)
Gets the maximum probability for matching the given HString.
-
getProbability
public final double getProbability(@NonNull String lemma)
Gets the maximum probability for matching the given String.
- Parameters:
lemma - the String to match against
- Returns:
- the maximum probability for the String
-
getProbability
public final double getProbability(@NonNull HString hString, @NonNull Tag tag)
Gets the maximum probability for matching the given HString with the given Tag.
-
getProbability
public final double getProbability(@NonNull String string, @NonNull Tag tag)
Gets the maximum probability for matching the given String with the given tag.
- Parameters:
string - the String to match against
tag - the tag that must be present for the match
- Returns:
- the maximum probability for the String with the given tag
-
getTag
public final Optional<String> getTag(@NonNull String lemma)
Gets the first matched tag, if one, for the given String.
- Parameters:
lemma - the String to match against
- Returns:
- the first matched tag for the String
-
getTag
public final Optional<String> getTag(@NonNull HString hString)
Gets the first matched tag, if one, for the given HString.
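The getProbability and getTag methods above are described as reductions over the matched entries: a maximum over probabilities, optionally restricted to a tag, and the first available tag. A minimal sketch of that reading follows. The `ProbabilitySketch` class and its nested `Entry` type are hypothetical stand-ins for Hermes's LexiconEntry and Tag, tags are plain strings, and returning 0 when nothing matches is an assumption that this page does not confirm.

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Hypothetical stand-ins; the real LexiconEntry and Tag types belong to Hermes.
class ProbabilitySketch {

    static final class Entry {
        final String lemma;
        final String tag;         // may be null when the entry is untagged
        final double probability;

        Entry(String lemma, String tag, double probability) {
            this.lemma = lemma;
            this.tag = tag;
            this.probability = probability;
        }
    }

    // Highest probability among the matched entries (assumed 0 when nothing matched).
    static double maxProbability(List<Entry> matched) {
        return matched.stream()
                .mapToDouble(e -> e.probability)
                .max()
                .orElse(0d);
    }

    // Same reduction, restricted to entries carrying the requested tag.
    static double maxProbability(List<Entry> matched, String tag) {
        return matched.stream()
                .filter(e -> tag.equals(e.tag))
                .mapToDouble(e -> e.probability)
                .max()
                .orElse(0d);
    }

    // Tag of the first matched entry that has one, if any.
    static Optional<String> firstTag(List<Entry> matched) {
        return matched.stream()
                .map(e -> e.tag)
                .filter(Objects::nonNull)
                .findFirst();
    }
}
```

Returning Optional from the tag lookup mirrors the Optional<String> return type of getTag and lets callers handle an unmatched or untagged term without a null check.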
-
isCaseSensitive
public abstract boolean isCaseSensitive()
Is the Lexicon case sensitive or not.
- Returns:
- True if the lexicon is case sensitive, False if not
-
isProbabilistic
public abstract boolean isProbabilistic()
Is the Lexicon probabilistic or not.
- Returns:
- True if the lexicon is probabilistic, False if not
-
match
public abstract List<LexiconEntry> match(@NonNull HString string)
Gets the matched entries for a given HString.
-
match
public abstract List<LexiconEntry> match(@NonNull String term)
Returns the LexiconEntry items matching the given term, or an empty list if there are none.
- Parameters:
term - the term whose matching entries we want
- Returns:
- the LexiconEntry items matching the given term, or an empty list if there are none.
-
normalize
protected String normalize(CharSequence sequence)
Normalizes the string based on whether the lexicon is case sensitive or not.
- Parameters:
sequence - the sequence
- Returns:
- the string
-
size
public abstract int size()
The number of lexical items in the lexicon