Package com.gengoai.hermes.corpus
Interface DocumentCollection
All Superinterfaces: AutoCloseable, Iterable<Document>
All Known Subinterfaces: Corpus, SearchResults
public interface DocumentCollection extends Iterable<Document>, AutoCloseable
A document collection represents a temporary collection of documents, often used for ad-hoc analytics or to import documents into a corpus.
Hermes provides a straightforward way of reading and writing document collections in a number of formats, including plain text, CSV, and JSON. In addition, many formats can be used in a "one-per-line" corpus where each line represents a single document in the given format. For example, a JSON one-per-line corpus has a single JSON object representing a document on each line of the file. Each document format has an associated set of _DocFormatParameters_ that define the various options for reading and writing in the format.
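To make the "one-per-line" idea concrete, here is a minimal sketch in plain Java. It does not use the Hermes API: the class name, the method name, and the JSON field names in the sample input are all hypothetical, and each line is kept as a raw string rather than parsed into a Document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hermes.
class OnePerLineSketch {
    // Treats every non-blank line of the input as the raw content of one document.
    static List<String> splitIntoDocuments(String fileContents) {
        List<String> documents = new ArrayList<>();
        for (String line : fileContents.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                documents.add(trimmed);
            }
        }
        return documents;
    }

    public static void main(String[] args) {
        // Assumed sample input: one JSON object per line (field names are made up).
        String jsonOnePerLine = "{\"id\":\"1\",\"content\":\"First document.\"}\n"
                + "{\"id\":\"2\",\"content\":\"Second document.\"}\n";
        List<String> docs = splitIntoDocuments(jsonOnePerLine);
        System.out.println(docs.size() + " documents");
        docs.forEach(System.out::println);
    }
}
```

In a real one-per-line corpus, each line would then be handed to the parser for the chosen document format.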
-
Field Summary
static String REPORT_INTERVAL
Configuration option for setting the reporting interval used when updating a DocumentCollection or Corpus.
static String REPORT_LEVEL
Configuration option for setting the reporting log level used when updating a DocumentCollection or Corpus.
-
Method Summary
default DocumentCollection annotate(@NonNull AnnotatableType... annotatableTypes)
Annotates this corpus with the given annotation types and returns a new corpus with the given annotation types present.
default DocumentCollection apply(@NonNull SerializableFunction<HString,HString> function)
default DocumentCollection apply(@NonNull TokenRegex pattern, @NonNull SerializableConsumer<TokenMatch> onMatch)
Applies a token regular expression to the corpus and passes each match to the given consumer.
default DocumentCollection apply(@NonNull Lexicon lexicon, @NonNull SerializableConsumer<HString> onMatch)
Applies a lexicon to the corpus and passes each match to the given consumer.
default DataSet asDataSet(@NonNull HStringDataSetGenerator HStringDataSetGenerator)
Converts this collection into a DataSet using the given HStringDataSetGenerator.
default DocumentCollection cache()
Caches any actions performed on this collection.
static DocumentCollection create(@NonNull Document... documents)
Creates a document collection for one or more documents.
static DocumentCollection create(@NonNull Specification specification)
Creates a document collection from a specification detailing the document format and path of the documents.
static DocumentCollection create(@NonNull MStream<Document> documents)
Creates a document collection for a stream of documents.
static DocumentCollection create(@NonNull Iterable<Document> documents)
Creates a document collection for one or more documents.
static DocumentCollection create(@NonNull String specification)
Creates a document collection from a specification detailing the document format and path of the documents.
static DocumentCollection create(@NonNull Stream<Document> documents)
Creates a document collection for a stream of documents.
default Counter<String> documentCount(@NonNull Extractor extractor)
Calculates the document frequency of annotations of the given annotation type in the corpus.
default void export(String specification)
default DocumentCollection filter(@NonNull SerializablePredicate<Document> predicate)
Filters the documents in the collection using the given predicate.
StreamingContext getStreamingContext()
Gets the streaming context associated with this collection.
default <K> Multimap<K,Document> groupBy(@NonNull SerializableFunction<? super Document,K> keyFunction)
Groups the documents in the collection using the given function.
default boolean isEmpty()
Checks if the collection is empty.
default Iterator<Document> iterator()
default Counter<Tuple> nGramCount(@NonNull NGramExtractor nGramExtractor)
Calculates the total corpus frequencies for n-grams extracted using the given extractor.
MStream<Document> parallelStream()
Gets a parallel stream over the documents in the collection.
SearchResults query(@NonNull Query query)
Generates a new Corpus from the results of querying this corpus.
default SearchResults query(@NonNull String query)
Generates a new Corpus from the results of querying this corpus.
default DocumentCollection repartition(int numPartitions)
Repartitions the corpus.
default DocumentCollection sample(int size)
Creates a sample of this corpus using reservoir sampling.
default DocumentCollection sample(int count, @NonNull Random random)
Creates a sample of this corpus using reservoir sampling.
default Counter<Tuple> significantBigrams(@NonNull NGramExtractor nGramExtractor, int minCount, double minScore)
Calculates the bigrams with a significant co-occurrence using the Mikolov association measure.
default Counter<Tuple> significantBigrams(@NonNull NGramExtractor nGramExtractor, int minCount, double minScore, @NonNull ContingencyTableCalculator calculator)
Calculates the bigrams with a significant co-occurrence using the given association measure.
default long size()
Returns the number of documents in the corpus.
MStream<Document> stream()
Gets a stream over the documents in the collection.
default Counter<String> termCount(@NonNull Extractor extractor)
Calculates the total corpus frequency of terms extracted using the given extractor.
default DocumentCollection update(@NonNull CaduceusProgram program)
Updates all documents in the corpus using the given CaduceusProgram.
DocumentCollection update(String operationName, @NonNull SerializableConsumer<Document> documentProcessor)
Updates all documents in the corpus using the given document processor.
-
Methods inherited from interface java.lang.AutoCloseable
close
-
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Field Detail
-
REPORT_INTERVAL
static final String REPORT_INTERVAL
Configuration option for setting the reporting interval used when updating a DocumentCollection or Corpus.
- See Also:
Constant Field Values
-
REPORT_LEVEL
static final String REPORT_LEVEL
Configuration option for setting the reporting log level used when updating a DocumentCollection or Corpus.
- See Also:
Constant Field Values
-
Method Detail
-
create
static DocumentCollection create(@NonNull Document... documents)
Creates a document collection for one or more documents.
- Parameters:
documents - the documents
- Returns:
the document collection
-
create
static DocumentCollection create(@NonNull Iterable<Document> documents)
Creates a document collection for one or more documents.
- Parameters:
documents - the documents
- Returns:
the document collection
-
create
static DocumentCollection create(@NonNull Stream<Document> documents)
Creates a document collection for a stream of documents.
- Parameters:
documents - the documents
- Returns:
the document collection
-
create
static DocumentCollection create(@NonNull MStream<Document> documents)
Creates a document collection for a stream of documents.
- Parameters:
documents - the documents
- Returns:
the document collection
-
create
static DocumentCollection create(@NonNull String specification)
Creates a document collection from a specification detailing the document format and path of the documents. The specification should have the document format as the schema, e.g. FORMAT::PATH;OPTIONS
- Parameters:
specification - the specification
- Returns:
the document collection
-
create
static DocumentCollection create(@NonNull Specification specification)
Creates a document collection from a specification detailing the document format and path of the documents. The specification should have the document format as the schema, e.g. FORMAT::PATH;OPTIONS
- Parameters:
specification - the specification
- Returns:
the document collection
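The FORMAT::PATH;OPTIONS shape described for the two specification-based create methods can be illustrated with a small stand-alone parser. This is only a sketch under stated assumptions: it is not Hermes' Specification class, the source does not say how OPTIONS is encoded (the sketch assumes key=value pairs separated by ';'), and the format name and option name in main are made-up placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; Hermes' own Specification class may parse differently.
class SpecificationSketch {
    final String format;
    final String path;
    final Map<String, String> options = new LinkedHashMap<>();

    SpecificationSketch(String spec) {
        int schemaEnd = spec.indexOf("::");
        if (schemaEnd < 0) {
            throw new IllegalArgumentException("Expected FORMAT::PATH;OPTIONS but got: " + spec);
        }
        // Everything before "::" is the document format (the schema).
        format = spec.substring(0, schemaEnd);
        String[] parts = spec.substring(schemaEnd + 2).split(";");
        // The first segment after "::" is the path.
        path = parts[0];
        // Assumption: remaining segments are key=value options separated by ';'.
        for (int i = 1; i < parts.length; i++) {
            String[] keyValue = parts[i].split("=", 2);
            options.put(keyValue[0], keyValue.length > 1 ? keyValue[1] : "");
        }
    }

    public static void main(String[] args) {
        // "someformat" and "someOption" are placeholders, not real Hermes names.
        SpecificationSketch s = new SpecificationSketch("someformat::/data/docs.txt;someOption=value");
        System.out.println(s.format + " | " + s.path + " | " + s.options);
    }
}
```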
-
annotate
default DocumentCollection annotate(@NonNull AnnotatableType... annotatableTypes)
Annotates this corpus with the given annotation types and returns a new corpus with the given annotation types present.
- Parameters:
annotatableTypes - The annotation types to annotate
- Returns:
A new corpus with the given annotation types present.
-
apply
default DocumentCollection apply(@NonNull Lexicon lexicon, @NonNull SerializableConsumer<HString> onMatch)
Applies a lexicon to the corpus and passes each match to the given consumer.
- Parameters:
lexicon - the lexicon to match
onMatch - the consumer to call on each match
- Returns:
the corpus
-
apply
default DocumentCollection apply(@NonNull SerializableFunction<HString,HString> function)
-
apply
default DocumentCollection apply(@NonNull TokenRegex pattern, @NonNull SerializableConsumer<TokenMatch> onMatch)
Applies a token regular expression to the corpus and passes each match to the given consumer.
- Parameters:
pattern - the pattern
onMatch - the consumer to call on each match
- Returns:
the corpus
-
asDataSet
default DataSet asDataSet(@NonNull HStringDataSetGenerator HStringDataSetGenerator)
Converts this collection into a DataSet using the given HStringDataSetGenerator.
- Parameters:
HStringDataSetGenerator - the example generator
- Returns:
the data set
-
cache
default DocumentCollection cache()
Caches any actions performed on this collection.
- Returns:
the document collection
-
documentCount
default Counter<String> documentCount(@NonNull Extractor extractor)
Calculates the document frequency of annotations of the given annotation type in the corpus. Annotations are transformed into strings using the given toString function.
- Parameters:
extractor - the LyreExpression to use for extracting terms
- Returns:
A counter containing document frequencies of the given annotation type
-
export
default void export(String specification) throws IOException
- Throws:
IOException
-
filter
default DocumentCollection filter(@NonNull SerializablePredicate<Document> predicate)
Filters the documents in the collection using the given predicate.
- Parameters:
predicate - the predicate
- Returns:
the filtered document collection
-
getStreamingContext
StreamingContext getStreamingContext()
Gets the streaming context associated with this collection.
- Returns:
the streaming context
-
groupBy
default <K> Multimap<K,Document> groupBy(@NonNull SerializableFunction<? super Document,K> keyFunction)
Groups the documents in the collection using the given function.
- Type Parameters:
K - The key type
- Parameters:
keyFunction - Converts the document into a key to group the documents by
- Returns:
A Multimap of key-document pairs.
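The grouping semantics can be pictured as a map from each key to every document that produced it. The following plain-Java sketch is not the Hermes implementation: it uses a Map of Lists in place of Multimap and plain strings in place of Document, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of key-function grouping; strings stand in for documents.
class GroupBySketch {
    // Maps each key to the list of documents for which keyFunction returned that key.
    static <K, D> Map<K, List<D>> groupBy(Iterable<D> documents, Function<? super D, K> keyFunction) {
        Map<K, List<D>> groups = new LinkedHashMap<>();
        for (D document : documents) {
            groups.computeIfAbsent(keyFunction.apply(document), k -> new ArrayList<>()).add(document);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("apple pie", "avocado toast", "banana bread");
        // Group by the first character of each document's text.
        Map<Character, List<String>> byInitial = groupBy(docs, d -> d.charAt(0));
        System.out.println(byInitial);
    }
}
```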
-
isEmpty
default boolean isEmpty()
Checks if the collection is empty.
- Returns:
True if this document collection has no documents.
-
nGramCount
default Counter<Tuple> nGramCount(@NonNull NGramExtractor nGramExtractor)
Calculates the total corpus frequencies for n-grams extracted using the given extractor. Note that all n-grams are returned in their string form as Tuples.
- Parameters:
nGramExtractor - the extractor
- Returns:
the counter of string tuples representing the n-grams
-
parallelStream
MStream<Document> parallelStream()
Gets a parallel stream over the documents in the collection.
- Returns:
the stream of documents
-
query
default SearchResults query(@NonNull String query) throws ParseException
Generates a new Corpus from the results of querying this corpus.
- Parameters:
query - the query
- Returns:
the SearchResults containing documents matching the query
- Throws:
ParseException - if the query string cannot be parsed
-
query
SearchResults query(@NonNull Query query)
Generates a new Corpus from the results of querying this corpus.
- Parameters:
query - the query
- Returns:
the SearchResults containing documents matching the query
-
repartition
default DocumentCollection repartition(int numPartitions)
Repartitions the corpus.
- Parameters:
numPartitions - the number of partitions
- Returns:
the corpus
-
sample
default DocumentCollection sample(int size)
Creates a sample of this corpus using reservoir sampling.
- Parameters:
size - the number of documents to include in the sample
- Returns:
the sampled corpus
-
sample
default DocumentCollection sample(int count, @NonNull Random random)
Creates a sample of this corpus using reservoir sampling.
- Parameters:
count - the number of documents to include in the sample
random - Random number generator to use for selection
- Returns:
the sampled corpus
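Reservoir sampling draws a fixed-size uniform sample in a single pass without knowing the input size in advance. The sketch below shows the classic variant (often called Algorithm R) in plain Java. It is an illustration of the general technique, not the Hermes implementation, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of reservoir sampling (Algorithm R).
class ReservoirSamplingSketch {
    // Returns a uniform random sample of at most `count` items using one pass over the input.
    static <T> List<T> sample(Iterable<T> items, int count, Random random) {
        List<T> reservoir = new ArrayList<>(count);
        long seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < count) {
                // Fill the reservoir with the first `count` items.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); the item is kept with probability count / seen.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < count) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            ids.add(i);
        }
        // A fixed seed makes the sample reproducible across runs.
        System.out.println(sample(ids, 5, new Random(42)));
    }
}
```

Passing an explicit Random, as the two-argument sample method allows, is what makes a sample repeatable.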
-
significantBigrams
default Counter<Tuple> significantBigrams(@NonNull NGramExtractor nGramExtractor, int minCount, double minScore)
Calculates the bigrams with a significant co-occurrence using the Mikolov association measure.
- Parameters:
nGramExtractor - the extractor to use for extracting n-grams
minCount - the minimum co-occurrence count for a bigram to be considered
minScore - the minimum score for a bigram to be significant
- Returns:
the counter of bigrams and their scores
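As a rough illustration of how minCount and minScore interact, the sketch below scores bigrams in plain Java. It rests on assumptions the source does not confirm: that the "Mikolov association measure" is the phrase score from the word2vec phrase-detection work, score(a, b) = (count(a b) - delta) / (count(a) * count(b)), and that the discount delta is a separate parameter. The class, the method signature, and the delta parameter are hypothetical and are not the Hermes API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; the exact measure used by Hermes is an assumption here.
class BigramScoreSketch {
    // Scores adjacent token pairs with (count(ab) - delta) / (count(a) * count(b)).
    // Tokens must not contain spaces because bigram keys are joined with a single space.
    static Map<String, Double> significantBigrams(List<String> tokens, int minCount,
                                                  double minScore, double delta) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            unigrams.merge(tokens.get(i), 1, Integer::sum);
            if (i + 1 < tokens.size()) {
                bigrams.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
            }
        }
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : bigrams.entrySet()) {
            if (entry.getValue() < minCount) {
                continue; // too rare to be considered at all
            }
            String[] words = entry.getKey().split(" ");
            double denominator = (double) unigrams.get(words[0]) * unigrams.get(words[1]);
            double score = (entry.getValue() - delta) / denominator;
            if (score >= minScore) {
                scores.put(entry.getKey(), score);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "new york is big new york is far new jersey".split(" "));
        System.out.println(significantBigrams(tokens, 2, 0.1, 1.0));
    }
}
```

The count threshold removes rare pairs before scoring, while the score threshold removes pairs whose parts are individually frequent enough to co-occur by chance.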
-
significantBigrams
default Counter<Tuple> significantBigrams(@NonNull NGramExtractor nGramExtractor, int minCount, double minScore, @NonNull ContingencyTableCalculator calculator)
Calculates the bigrams with a significant co-occurrence using the given association measure.
- Parameters:
nGramExtractor - the extractor to use for extracting n-grams
minCount - the minimum co-occurrence count for a bigram to be considered
minScore - the minimum score for a bigram to be significant
calculator - the association measure to use for determining significance
- Returns:
the counter of bigrams and their scores
-
size
default long size()
Returns the number of documents in the corpus.
- Returns:
the number of documents in the corpus
-
stream
MStream<Document> stream()
Gets a stream over the documents in the collection.
- Returns:
the stream of documents
-
termCount
default Counter<String> termCount(@NonNull Extractor extractor)
Calculates the total corpus frequency of terms extracted using the given extractor.
- Parameters:
extractor - the extractor to use for generating terms
- Returns:
the counter of terms with frequencies
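The difference between termCount (total corpus frequency) and documentCount (document frequency) is easy to blur, so here is a small plain-Java sketch of the two notions. It does not use Hermes: pre-tokenized lists of strings stand in for documents and an Extractor, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of total frequency versus document frequency.
class FrequencySketch {
    // Total corpus frequency: every occurrence of a term is counted.
    static Map<String, Integer> termCount(List<List<String>> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> document : documents) {
            for (String term : document) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Document frequency: a term is counted at most once per document.
    static Map<String, Integer> documentCount(List<List<String>> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> document : documents) {
            for (String term : new HashSet<>(document)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat", "the"),
                Arrays.asList("the", "dog"));
        System.out.println(termCount(docs).get("the"));     // prints 3
        System.out.println(documentCount(docs).get("the")); // prints 2
    }
}
```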
-
update
DocumentCollection update(String operationName, @NonNull SerializableConsumer<Document> documentProcessor)
Updates all documents in the corpus using the given document processor.
- Parameters:
operationName - the name of the update operation being performed
documentProcessor - the document processor
- Returns:
this corpus with updates
-
update
default DocumentCollection update(@NonNull CaduceusProgram program)
Updates all documents in the corpus using the given CaduceusProgram.
- Parameters:
program - the CaduceusProgram to execute on each document.
- Returns:
this corpus with updates
-
-