Package com.gengoai.hermes.corpus
Interface Corpus
All Superinterfaces:
AutoCloseable, DocumentCollection, Iterable<Document>
public interface Corpus extends DocumentCollection
A persistent collection of documents, each having a unique document ID. In addition to the functionality provided by DocumentCollection, corpora allow:
- Access to individual documents via the getDocument(String), remove(String), remove(Document), and update(Document) methods.
- Ability to add new documents via the add(Document), addAll(Iterable), and importDocuments(String) methods.
- Aggregation of document-level metadata via getAttributeValueCount(AttributeType).
- Retrieval of the AnnotatableTypes completed at the corpus level via getCompleted().
- Aggregation of the document ids in the corpus via getIds().
Corpora are opened using the open(String) or open(Resource) methods, which load the appropriate corpus implementation based on the resource type.
Author:
David B. Bracewell
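
The per-document operations described above can be pictured with a small in-memory model. This is only a sketch and not the Hermes implementation: `ToyCorpus` and `Doc` are hypothetical stand-ins, and the conditions under which `add` and `update` return false (duplicate id, unknown id) are assumptions rather than documented behavior. It also shows why extending AutoCloseable is convenient: a corpus can be used in a try-with-resources statement.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory model of the per-document operations; NOT Hermes code.
class ToyCorpus implements AutoCloseable {

    // Hypothetical stand-in for a document: an id plus some text.
    static final class Doc {
        final String id;
        final String text;

        Doc(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    private final Map<String, Doc> docs = new LinkedHashMap<>();
    private boolean closed = false;

    // Mirrors the shape of the static open(String) factory; the argument is ignored here.
    static ToyCorpus open(String resource) {
        return new ToyCorpus();
    }

    // Assumption: adding fails when the id is already taken, because ids are unique.
    boolean add(Doc d) {
        return docs.putIfAbsent(d.id, d) == null;
    }

    // Returns the document, or null if no document has that id.
    Doc getDocument(String id) {
        return docs.get(id);
    }

    // Assumption: only a document that is already stored can be updated.
    boolean update(Doc d) {
        return docs.replace(d.id, d) != null;
    }

    // True if a document with that id was stored and has now been removed.
    boolean remove(String id) {
        return docs.remove(id) != null;
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true;
    }

    public static void main(String[] args) {
        // try-with-resources closes the corpus when the block exits.
        try (ToyCorpus corpus = ToyCorpus.open("unused")) {
            System.out.println(corpus.add(new Doc("d1", "first draft")));   // true
            System.out.println(corpus.add(new Doc("d1", "duplicate id")));  // false
            System.out.println(corpus.update(new Doc("d1", "final text"))); // true
            System.out.println(corpus.getDocument("d1").text);              // final text
            System.out.println(corpus.remove("d1"));                        // true
            System.out.println(corpus.getDocument("d1"));                   // null
        }
    }
}
```

With the real interface the same pattern would start from Corpus.open(...), and callers should check the boolean results and the null return of getDocument(String) instead of assuming success.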

Field Summary
Fields inherited from interface com.gengoai.hermes.corpus.DocumentCollection
REPORT_INTERVAL, REPORT_LEVEL

Method Summary
- boolean add(Document document): Adds a document to the corpus.
- default void addAll(@NonNull Iterable<Document> documents): Adds multiple documents to the corpus.
- default Corpus annotate(@NonNull AnnotatableType... annotatableTypes): Annotates this corpus with the given annotation types and returns a new corpus with the given annotation types present.
- default Corpus apply(TokenRegex pattern, SerializableConsumer<TokenMatch> onMatch): Applies a token regular expression to the corpus, creating annotations of the given type for matches.
- default Corpus apply(Lexicon lexicon, SerializableConsumer<HString> onMatch): Applies a lexicon to the corpus, creating annotations of the given type for matches.
- default void assignRandomSplit(double pct)
- default Corpus compact(): Compacts the storage used for the corpus.
- Set<AttributeType<?>> getAttributes()
- <T> Counter<T> getAttributeValueCount(@NonNull AttributeType<T> type): Gets a count of the values for the given attribute across documents in the corpus.
- Set<AnnotatableType> getCompleted()
- default Document getDocument(String id): Gets the document with the given document id.
- default List<String> getIds()
- default Corpus importDocuments(@NonNull String specification): Imports documents from the given document collection specification.
- static Corpus open(@NonNull Resource resource): Opens the corpus at the given resource.
- static Corpus open(@NonNull String resource): Opens the corpus at the given resource.
- default Corpus process(@NonNull SequentialWorkflow processor): Processes the corpus using the given SequentialWorkflow.
- boolean remove(Document document): Removes a document from the corpus.
- boolean remove(String id): Removes a document by its id.
- default Corpus repartition(int numPartitions): Repartitions the corpus.
- default Corpus update(@NonNull CaduceusProgram program): Updates all documents in the corpus using the given CaduceusProgram.
- Corpus update(@NonNull String operation, @NonNull SerializableConsumer<Document> documentProcessor): Updates all documents in the corpus using the given document processor.
- boolean update(Document document): Updates the given document.

Methods inherited from interface java.lang.AutoCloseable
close

Methods inherited from interface com.gengoai.hermes.corpus.DocumentCollection
apply, asDataSet, cache, documentCount, export, filter, getStreamingContext, groupBy, isEmpty, iterator, nGramCount, parallelStream, query, query, sample, sample, significantBigrams, significantBigrams, size, stream, termCount

Methods inherited from interface java.lang.Iterable
forEach, spliterator

Method Detail

open
static Corpus open(@NonNull Resource resource)
Opens the corpus at the given resource.
Parameters:
resource - the resource pertaining to the corpus
Returns:
the corpus

open
static Corpus open(@NonNull String resource)
Opens the corpus at the given resource.
Parameters:
resource - the resource pertaining to the corpus
Returns:
the corpus

add
boolean add(Document document)
Adds a document to the corpus.
Parameters:
document - the document to add
Returns:
true if the document was added, false otherwise

addAll
default void addAll(@NonNull Iterable<Document> documents)
Adds multiple documents to the corpus.
Parameters:
documents - the documents to add

annotate
default Corpus annotate(@NonNull AnnotatableType... annotatableTypes)
Description copied from interface: DocumentCollection
Annotates this corpus with the given annotation types and returns a new corpus with the given annotation types present.
Specified by:
annotate in interface DocumentCollection
Parameters:
annotatableTypes - the annotation types to annotate
Returns:
a new corpus with the given annotation types present

apply
default Corpus apply(Lexicon lexicon, SerializableConsumer<HString> onMatch)
Description copied from interface: DocumentCollection
Applies a lexicon to the corpus, creating annotations of the given type for matches.
Specified by:
apply in interface DocumentCollection
Parameters:
lexicon - the lexicon to match
onMatch - the consumer invoked on each match
Returns:
the corpus

apply
default Corpus apply(TokenRegex pattern, SerializableConsumer<TokenMatch> onMatch)
Description copied from interface: DocumentCollection
Applies a token regular expression to the corpus, creating annotations of the given type for matches.
Specified by:
apply in interface DocumentCollection
Parameters:
pattern - the pattern
onMatch - the consumer invoked on each match
Returns:
the corpus

assignRandomSplit
default void assignRandomSplit(double pct)

compact
default Corpus compact()
Compacts the storage used for the corpus.
Returns:
this corpus

getAttributeValueCount
<T> Counter<T> getAttributeValueCount(@NonNull AttributeType<T> type)
Gets a count of the values for the given attribute across documents in the corpus.
Type Parameters:
T - the attribute value type parameter
Parameters:
type - the AttributeType we want to count
Returns:
a Counter over the attribute values
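
To make "a count of the values for the given attribute across documents" concrete, here is a minimal sketch that uses only the JDK. It does not use the Hermes Counter or AttributeType classes: each document is modelled as a plain map from attribute name to value, the attribute names are purely illustrative, and skipping documents that lack the attribute is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AttributeCountSketch {

    // Counts how often each value of `attribute` occurs across the documents.
    // Each document is modelled as a map from attribute name to value (hypothetical).
    static Map<String, Integer> valueCount(List<Map<String, String>> docs, String attribute) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            String value = doc.get(attribute);
            if (value != null) {
                // Assumption: documents without the attribute contribute nothing.
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
                Map.of("LANGUAGE", "en"),
                Map.of("LANGUAGE", "en", "SOURCE", "web"),
                Map.of("LANGUAGE", "ja"),
                Map.of("SOURCE", "news"));
        System.out.println(valueCount(docs, "LANGUAGE")); // prints {en=2, ja=1}
    }
}
```

The real method returns a Counter<T> typed by the attribute's value type; the map here only plays that role for illustration.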

getAttributes
Set<AttributeType<?>> getAttributes()
Returns:
the set of attribute types found across the documents in the corpus

getCompleted
Set<AnnotatableType> getCompleted()
Returns:
the set of completed AnnotatableTypes, where "completed" means completed by every document in the corpus
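
Since "completed" here means completed by every document, the corpus-level set behaves like the intersection of the per-document sets. The sketch below illustrates that reading with plain strings standing in for AnnotatableType values. The names are illustrative, the empty-corpus result is an assumption, and nothing here claims to be how Hermes computes or stores the set.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class CompletedSketch {

    // Intersects the per-document "completed" sets: a type survives only if
    // every document has completed it.
    static Set<String> completedByAll(List<Set<String>> perDocument) {
        if (perDocument.isEmpty()) {
            // Assumption: an empty corpus reports nothing as completed.
            return new TreeSet<>();
        }
        Set<String> result = new TreeSet<>(perDocument.get(0));
        for (Set<String> completed : perDocument) {
            result.retainAll(completed);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> perDocument = List.of(
                Set.of("TOKEN", "SENTENCE", "PART_OF_SPEECH"),
                Set.of("TOKEN", "SENTENCE"),
                Set.of("TOKEN", "SENTENCE", "ENTITY"));
        System.out.println(completedByAll(perDocument)); // prints [SENTENCE, TOKEN]
    }
}
```

One consequence of this reading is that a single document missing an annotation type is enough to drop that type from the corpus-level result.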

getDocument
default Document getDocument(String id)
Gets the document with the given document id.
Parameters:
id - the id of the document
Returns:
the document, or null if it doesn't exist

importDocuments
default Corpus importDocuments(@NonNull String specification) throws IOException
Imports documents from the given document collection specification.
Parameters:
specification - the document format specification with the path to the documents
Returns:
the corpus
Throws:
IOException - if something goes wrong while loading the documents

process
default Corpus process(@NonNull SequentialWorkflow processor) throws Exception
Processes the corpus using the given SequentialWorkflow.
Parameters:
processor - the processor
Returns:
this Corpus
Throws:
Exception - if an error occurs while processing

remove
boolean remove(Document document)
Removes a document from the corpus.
Parameters:
document - the document to remove
Returns:
true if the document was removed, false otherwise

remove
boolean remove(String id)
Removes a document by its id.
Parameters:
id - the id of the document to remove
Returns:
true if the document was removed, false otherwise

repartition
default Corpus repartition(int numPartitions)
Description copied from interface: DocumentCollection
Repartitions the corpus.
Specified by:
repartition in interface DocumentCollection
Parameters:
numPartitions - the number of partitions
Returns:
the corpus

update
Corpus update(@NonNull String operation, @NonNull SerializableConsumer<Document> documentProcessor)
Description copied from interface: DocumentCollection
Updates all documents in the corpus using the given document processor.
Specified by:
update in interface DocumentCollection
Parameters:
operation - the name of the update operation being performed
documentProcessor - the document processor
Returns:
this corpus with updates

update
default Corpus update(@NonNull CaduceusProgram program)
Description copied from interface: DocumentCollection
Updates all documents in the corpus using the given CaduceusProgram.
Specified by:
update in interface DocumentCollection
Parameters:
program - the CaduceusProgram to execute on each document
Returns:
this corpus with updates

update
boolean update(Document document)
Updates the given document.
Parameters:
document - the document to update
Returns:
true if the document was updated, false otherwise