All Superinterfaces:

AutoCloseable, Iterable<Document>

All Known Subinterfaces:

Corpus, SearchResults
```
public interface DocumentCollection
extends Iterable<Document>, AutoCloseable
```
A document collection represents a temporary collection of documents, often used for ad-hoc analytics or to import documents into a corpus.

Hermes provides a straightforward way of reading and writing document collections in a number of formats, including plain text, CSV, and JSON. In addition, many formats can be used in a "one-per-line" corpus, where each line represents a single document in the given format. For example, a JSON one-per-line corpus has a single JSON object representing a document on each line of the file. Each document format has an associated set of _DocFormatParameters_ that define the various options for reading and writing in the format.
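The "one-per-line" layout can be illustrated with a minimal, library-independent sketch. It only splits file content into one raw string per document and does not use any Hermes class; the class name `OnePerLineSketch` and the JSON field names `id` and `content` are hypothetical and chosen for illustration only.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Hermes: shows the one-per-line layout only.
final class OnePerLineSketch {

    // Returns one raw document string per non-blank line.
    // Parsing each string in its own format (e.g. JSON) is deliberately left out.
    static List<String> splitDocuments(String fileContent) {
        return fileContent.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Two JSON objects, each on its own line; the field names are assumptions.
        String content = "{\"id\":\"1\",\"content\":\"First document.\"}\n"
                + "{\"id\":\"2\",\"content\":\"Second document.\"}\n";
        splitDocuments(content).forEach(System.out::println);
    }
}
```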

Field Summary

Fields
Modifier and Type	Field	Description
`static String`	`REPORT_INTERVAL`	Configuration option for setting the reporting interval when updating a DocumentCollection or Corpus
`static String`	`REPORT_LEVEL`	Configuration option for setting the reporting log level when updating a DocumentCollection or Corpus

Method Summary

Modifier and Type	Method	Description
`default DocumentCollection`	`annotate(@NonNull AnnotatableType... annotatableTypes)`	Annotates this corpus with the given annotation types and returns a new corpus with the given annotation types present
`default DocumentCollection`	`apply(@NonNull SerializableFunction<HString,HString> function)`
`default DocumentCollection`	`apply(@NonNull TokenRegex pattern, @NonNull SerializableConsumer<TokenMatch> onMatch)`	Applies token regular expression to the corpus creating annotations of the given type for matches.
`default DocumentCollection`	`apply(@NonNull Lexicon lexicon, @NonNull SerializableConsumer<HString> onMatch)`	Applies a lexicon to the corpus creating annotations of the given type for matches.
`default DataSet`	`asDataSet(@NonNull HStringDataSetGenerator HStringDataSetGenerator)`	Converts this collection into a `DataSet` using the given generator.
`default DocumentCollection`	`cache()`	Caches any actions performed on this collection.
`static DocumentCollection`	`create(@NonNull Document... documents)`	Creates a document collection for one or more documents.
`static DocumentCollection`	`create(@NonNull Specification specification)`	Creates a document collection from a specification detailing the document format and path of the documents.
`static DocumentCollection`	`create(@NonNull MStream<Document> documents)`	Creates a document collection for a stream of documents.
`static DocumentCollection`	`create(@NonNull Iterable<Document> documents)`	Creates a document collection for one or more documents.
`static DocumentCollection`	`create(@NonNull String specification)`	Creates a document collection from a specification detailing the document format and path of the documents.
`static DocumentCollection`	`create(@NonNull Stream<Document> documents)`	Creates a document collection for a stream of documents.
`default Counter<String>`	`documentCount(@NonNull Extractor extractor)`	Calculates the document frequency of annotations of the given annotation type in the corpus.
`default void`	`export(String specification)`
`default DocumentCollection`	`filter(@NonNull SerializablePredicate<Document> predicate)`	Filters the documents in the collection using the given predicate
`StreamingContext`	`getStreamingContext()`	Gets the streaming context associated with this stream
`default <K> Multimap<K,Document>`	`groupBy(@NonNull SerializableFunction<? super Document,K> keyFunction)`	Groups documents in the document store using the given function.
`default boolean`	`isEmpty()`	Checks if the collection is empty
`default Iterator<Document>`	`iterator()`
`default Counter<Tuple>`	`nGramCount(@NonNull NGramExtractor nGramExtractor)`	Calculates the total corpus frequencies for NGrams extracted using the given extractor.
`MStream<Document>`	`parallelStream()`	Gets a parallel stream over the documents in the collection
`SearchResults`	`query(@NonNull Query query)`	Queries this corpus and returns the matching documents as `SearchResults`.
`default SearchResults`	`query(@NonNull String query)`	Queries this corpus and returns the matching documents as `SearchResults`.
`default DocumentCollection`	`repartition(int numPartitions)`	Repartitions the corpus.
`default DocumentCollection`	`sample(int size)`	Creates a sample of this corpus using reservoir sampling.
`default DocumentCollection`	`sample(int count, @NonNull Random random)`	Creates a sample of this corpus using reservoir sampling.
`default Counter<Tuple>`	`significantBigrams(@NonNull NGramExtractor nGramExtractor, int minCount, double minScore)`	Calculates the bigrams with a significant co-occurrence using the Mikolov association measure.
`default Counter<Tuple>`	`significantBigrams(@NonNull NGramExtractor nGramExtractor, int minCount, double minScore, @NonNull ContingencyTableCalculator calculator)`	Calculates the bigrams with a significant co-occurrence using the given association measure.
`default long`	`size()`	The number of documents in the corpus
`MStream<Document>`	`stream()`	Gets a stream over the documents in the collection
`default Counter<String>`	`termCount(@NonNull Extractor extractor)`	Calculates the total corpus frequency of terms extracted using the given extractor.
`default DocumentCollection`	`update(@NonNull CaduceusProgram program)`	Updates all documents in the corpus using the given `CaduceusProgram`
`DocumentCollection`	`update(String operationName, @NonNull SerializableConsumer<Document> documentProcessor)`	Updates all documents in the corpus using the given document processor

Methods inherited from interface java.lang.AutoCloseable
close

Methods inherited from interface java.lang.Iterable
forEach, spliterator

- Field Detail
  - REPORT_INTERVAL
```
static final String REPORT_INTERVAL
```
    Configuration option for setting the reporting interval when updating a DocumentCollection or Corpus
    
    See Also:
    
    Constant Field Values
  - REPORT_LEVEL
```
static final String REPORT_LEVEL
```
    Configuration option for setting the reporting log level when updating a DocumentCollection or Corpus
    
    See Also:
    
    Constant Field Values
- Method Detail
  - create
```
static DocumentCollection create(@NonNull Document... documents)
```
    Creates a document collection for one or more documents.
    
    Parameters:
    
    documents - the documents
    
    Returns:
    
    the document collection
  - create
```
static DocumentCollection create(@NonNull Iterable<Document> documents)
```
    Creates a document collection for one or more documents.
    
    Parameters:
    
    documents - the documents
    
    Returns:
    
    the document collection
  - create
```
static DocumentCollection create(@NonNull Stream<Document> documents)
```
    Creates a document collection for a stream of documents.
    
    Parameters:
    
    documents - the documents
    
    Returns:
    
    the document collection
  - create
```
static DocumentCollection create(@NonNull MStream<Document> documents)
```
    Creates a document collection for a stream of documents.
    
    Parameters:
    
    documents - the documents
    
    Returns:
    
    the document collection
  - create
```
static DocumentCollection create(@NonNull String specification)
```
    Creates a document collection from a specification detailing the document format and path of the documents. The specification should have the document format as the schema, e.g. FORMAT::PATH;OPTIONS
    
    Parameters:
    
    specification - the specification
    
    Returns:
    
    the document collection
  - create
```
static DocumentCollection create(@NonNull Specification specification)
```
    Creates a document collection from a specification detailing the document format and path of the documents. The specification should have the document format as the schema, e.g. FORMAT::PATH;OPTIONS
    
    Parameters:
    
    specification - the specification
    
    Returns:
    
    the document collection
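To make the `FORMAT::PATH;OPTIONS` shape concrete, the following is a minimal standalone sketch of splitting such a string into its parts. It is not the Hermes `Specification` class: the class name `SpecSketch`, the assumption that options are `key=value` pairs separated by `;`, and the example format and path are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser, not the Hermes Specification class.
final class SpecSketch {
    final String format;
    final String path;
    final Map<String, String> options;

    private SpecSketch(String format, String path, Map<String, String> options) {
        this.format = format;
        this.path = path;
        this.options = options;
    }

    // Splits "FORMAT::PATH;OPTIONS" into format, path, and options.
    // Assumption: options are ';'-separated key=value pairs.
    static SpecSketch parse(String spec) {
        int sep = spec.indexOf("::");
        if (sep < 0) {
            throw new IllegalArgumentException("Missing FORMAT:: prefix: " + spec);
        }
        String format = spec.substring(0, sep);
        String[] parts = spec.substring(sep + 2).split(";");
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                continue; // tolerate empty segments such as ";;"
            }
            String[] kv = parts[i].split("=", 2);
            options.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        return new SpecSketch(format, parts[0], options);
    }

    public static void main(String[] args) {
        // The format name, path, and option below are made up for illustration.
        SpecSketch spec = parse("json::/tmp/docs.json;exampleOption=true");
        System.out.println(spec.format + " | " + spec.path + " | " + spec.options);
    }
}
```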
  - annotate
```
default DocumentCollection annotate(@NonNull AnnotatableType... annotatableTypes)
```
    Annotates this corpus with the given annotation types and returns a new corpus with the given annotation types present
    
    Parameters:
    
    annotatableTypes - The annotation types to annotate
    
    Returns:
    
    A new corpus with the given annotation types present.
  - apply
```
default DocumentCollection apply(@NonNull Lexicon lexicon,
                                 @NonNull SerializableConsumer<HString> onMatch)
```
    Applies a lexicon to the corpus creating annotations of the given type for matches.
    
    Parameters:
    
    lexicon - the lexicon to match
    
    onMatch - the consumer invoked for each match
    
    Returns:
    
    the corpus
  - apply
```
default DocumentCollection apply(@NonNull SerializableFunction<HString,HString> function)
```
  - apply
```
default DocumentCollection apply(@NonNull TokenRegex pattern,
                                 @NonNull SerializableConsumer<TokenMatch> onMatch)
```
    Applies token regular expression to the corpus creating annotations of the given type for matches.
    
    Parameters:
    
    pattern - the pattern
    
    onMatch - the consumer invoked for each match
    
    Returns:
    
    the corpus
  - asDataSet
```
default DataSet asDataSet(@NonNull HStringDataSetGenerator HStringDataSetGenerator)
```
    Converts this collection into a DataSet using the given generator.
    
    Parameters:
    
    HStringDataSetGenerator - the example generator
    
    Returns:
    
    the data set
  - cache
```
default DocumentCollection cache()
```
    Caches any actions performed on this collection.
    
    Returns:
    
    the document collection
  - documentCount
```
default Counter<String> documentCount(@NonNull Extractor extractor)
```
    Calculates the document frequency of the terms extracted from the corpus using the given extractor. Extracted values are counted in their string form.
    
    Parameters:
    
    extractor - the LyreExpression to use for extracting terms
    
    Returns:
    
    A counter containing document frequencies of the given annotation type
  - export
```
default void export(String specification)
             throws IOException
```
    Throws:
    
    IOException
  - filter
```
default DocumentCollection filter(@NonNull SerializablePredicate<Document> predicate)
```
    Filters the documents in the collection using the given predicate
    
    Parameters:
    
    predicate - the predicate
    
    Returns:
    
    the filtered document collection
  - getStreamingContext
```
StreamingContext getStreamingContext()
```
    Gets the streaming context associated with this stream
    
    Returns:
    
    the streaming context
  - groupBy
```
default <K> Multimap<K,Document> groupBy(@NonNull SerializableFunction<? super Document,K> keyFunction)
```
    Groups documents in the document store using the given function.
    
    Type Parameters:
    
    K - The key type
    
    Parameters:
    
    keyFunction - Converts the document into a key to group the documents by
    
    Returns:
    
    A Multimap of key - document pairs.
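The grouping behaviour can be pictured with a small generic sketch in plain Java. It stands in for the `Multimap` result with a `Map<K, List<T>>` and uses strings instead of `Document` objects; the class name `GroupBySketch` is hypothetical and nothing here is taken from the Hermes implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for groupBy, using Map<K, List<T>> in place of a Multimap.
final class GroupBySketch {

    // Applies keyFunction to every item and collects items that share a key.
    static <T, K> Map<K, List<T>> groupBy(Iterable<T> items,
                                          Function<? super T, K> keyFunction) {
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T item : items) {
            groups.computeIfAbsent(keyFunction.apply(item), k -> new ArrayList<>())
                  .add(item);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Group words by their first character.
        Map<Character, List<String>> byInitial =
                groupBy(List.of("apple", "avocado", "banana"), s -> s.charAt(0));
        System.out.println(byInitial);
    }
}
```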
  - isEmpty
```
default boolean isEmpty()
```
    Checks if the collection is empty
    
    Returns:
    
    True if this document collection has no documents.
  - iterator
```
default Iterator<Document> iterator()
```
    Specified by:
    
    iterator in interface Iterable<Document>
  - nGramCount
```
default Counter<Tuple> nGramCount(@NonNull NGramExtractor nGramExtractor)
```
    Calculates the total corpus frequencies for n-grams extracted using the given extractor. Note that all n-grams are returned in their string form as Tuples.
    
    Parameters:
    
    nGramExtractor - the extractor
    
    Returns:
    
    the counter of string tuples representing the ngrams
  - parallelStream
```
MStream<Document> parallelStream()
```
    Gets a parallel stream over the documents in the collection
    
    Returns:
    
    the stream of documents
  - query
```
default SearchResults query(@NonNull String query)
                     throws ParseException
```
    Queries this corpus and returns the matching documents as SearchResults.
    
    Parameters:
    
    query - the query
    
    Returns:
    
    the SearchResults containing documents matching the query
    
    Throws:
    
    ParseException - the parse exception
  - query
```
SearchResults query(@NonNull Query query)
```
    Queries this corpus and returns the matching documents as SearchResults.
    
    Parameters:
    
    query - the query
    
    Returns:
    
    the SearchResults containing documents matching the query
  - repartition
```
default DocumentCollection repartition(int numPartitions)
```
    Repartitions the corpus.
    
    Parameters:
    
    numPartitions - the number of partitions
    
    Returns:
    
    the corpus
  - sample
```
default DocumentCollection sample(int size)
```
    Creates a sample of this corpus using reservoir sampling.
    
    Parameters:
    
    size - the number of documents to include in the sample
    
    Returns:
    
    the sampled corpus
  - sample
```
default DocumentCollection sample(int count,
                                  @NonNull Random random)
```
    Creates a sample of this corpus using reservoir sampling.
    
    Parameters:
    
    count - the number of documents to include in the sample
    
    random - Random number generator to use for selection
    
    Returns:
    
    the sampled corpus
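For readers unfamiliar with reservoir sampling, here is a minimal sketch of the classic single-pass variant (often called Algorithm R) over an arbitrary `Iterable`. It illustrates the general technique only; whether Hermes uses this exact variant is not stated above, and the class name `ReservoirSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of reservoir sampling (Algorithm R), not Hermes code.
final class ReservoirSketch {

    // Returns up to 'count' items chosen uniformly at random in a single pass,
    // without needing to know the total number of items in advance.
    static <T> List<T> sample(Iterable<T> items, int count, Random random) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(count);
        long seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < count) {
                // Fill the reservoir with the first 'count' items.
                reservoir.add(item);
            } else if (count > 0) {
                // Pick a slot in [0, seen); the new item replaces an old one
                // only when the slot falls inside the reservoir.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < count) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> source = List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(sample(source, 3, new Random()));
    }
}
```

Passing an explicit `Random`, as the two-argument `sample` overload does, lets a caller supply a seeded generator so that a sample can be reproduced.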
  - significantBigrams
```
default Counter<Tuple> significantBigrams(@NonNull NGramExtractor nGramExtractor,
                                          int minCount,
                                          double minScore)
```
    Calculates the bigrams with a significant co-occurrence using the Mikolov association measure.
    
    Parameters:
    
    nGramExtractor - the extractor to use for extracting NGrams
    
    minCount - the minimum co-occurrence count for a bigram to be considered
    
    minScore - the minimum score for a bigram to be significant
    
    Returns:
    
    the counter of bigrams and their scores
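As a rough illustration of the scoring idea, the sketch below applies the phrase score from Mikolov et al. (2013), score(a, b) = (count(a b) - δ) / (count(a) × count(b)), to pre-tokenized documents. It is an independent sketch, not the Hermes implementation: using `minCount` as the discount δ is an assumption, tokens are assumed to contain no spaces, and the class name `BigramSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of Mikolov-style bigram scoring, not Hermes code.
final class BigramSketch {

    // Scores every adjacent token pair and keeps those that occur at least
    // minCount times and score at least minScore.
    // Assumption: minCount doubles as the discount term in the score.
    static Map<String, Double> significantBigrams(List<List<String>> docs,
                                                  int minCount,
                                                  double minScore) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (List<String> tokens : docs) {
            for (int i = 0; i < tokens.size(); i++) {
                unigrams.merge(tokens.get(i), 1, Integer::sum);
                if (i + 1 < tokens.size()) {
                    bigrams.merge(tokens.get(i) + " " + tokens.get(i + 1), 1, Integer::sum);
                }
            }
        }
        Map<String, Double> scored = new TreeMap<>();
        for (Map.Entry<String, Integer> e : bigrams.entrySet()) {
            int n = e.getValue();
            if (n < minCount) {
                continue;
            }
            String[] w = e.getKey().split(" ", 2);
            double denominator = (double) unigrams.get(w[0]) * unigrams.get(w[1]);
            double score = (n - minCount) / denominator;
            if (score >= minScore) {
                scored.put(e.getKey(), score);
            }
        }
        return scored;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("new", "york", "city"),
                List.of("new", "york", "times"),
                List.of("the", "new", "car"));
        System.out.println(significantBigrams(docs, 1, 0.1));
    }
}
```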
  - significantBigrams
```
default Counter<Tuple> significantBigrams(@NonNull NGramExtractor nGramExtractor,
                                          int minCount,
                                          double minScore,
                                          @NonNull ContingencyTableCalculator calculator)
```
    Calculates the bigrams with a significant co-occurrence using the given association measure.
    
    Parameters:
    
    nGramExtractor - the extractor to use for extracting NGrams
    
    minCount - the minimum co-occurrence count for a bigram to be considered
    
    minScore - the minimum score for a bigram to be significant
    
    calculator - the association measure to use for determining significance
    
    Returns:
    
    the counter of bigrams and their scores
  - size
```
default long size()
```
    The number of documents in the corpus
    
    Returns:
    
    the number of documents in the corpus
  - stream
```
MStream<Document> stream()
```
    Gets a stream over the documents in the collection
    
    Returns:
    
    the stream of documents
  - termCount
```
default Counter<String> termCount(@NonNull Extractor extractor)
```
    Calculates the total corpus frequency of terms extracted using the given extractor.
    
    Parameters:
    
    extractor - the extractor to use for generating terms
    
    Returns:
    
    the counter of terms with frequencies
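The difference between the total corpus frequency computed by `termCount` and the document frequency computed by `documentCount` can be shown with a small sketch over pre-tokenized documents. It uses plain maps instead of the `Counter` and `Extractor` types, and the class name `CountSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of term frequency vs. document frequency, not Hermes code.
final class CountSketch {

    // Total corpus frequency: every occurrence of a term is counted.
    static Map<String, Integer> termCount(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Document frequency: a term is counted at most once per document.
    static Map<String, Integer> documentCount(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(List.of("a", "a", "b"), List.of("a", "c"));
        System.out.println("term counts:     " + termCount(docs));
        System.out.println("document counts: " + documentCount(docs));
    }
}
```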
  - update
```
DocumentCollection update(String operationName,
                          @NonNull SerializableConsumer<Document> documentProcessor)
```
    Updates all documents in the corpus using the given document processor
    
    Parameters:
    
    operationName - the name of the update operation being performed
    
    documentProcessor - the document processor
    
    Returns:
    
    this corpus with updates
  - update
```
default DocumentCollection update(@NonNull CaduceusProgram program)
```
    Updates all documents in the corpus using the given CaduceusProgram
    
    Parameters:
    
    program - the CaduceusProgram to execute on each document.
    
    Returns:
    
    this corpus with updates
