Interface DocumentCollection

  • All Superinterfaces:
    AutoCloseable, Iterable<Document>
    All Known Subinterfaces:
    Corpus, SearchResults

    public interface DocumentCollection
    extends Iterable<Document>, AutoCloseable

    A document collection represents a temporary collection of documents, often used for ad-hoc analytics or for importing documents into a corpus.

    Hermes provides a straightforward way of reading and writing document collections in a number of formats, including plain text, CSV, and JSON. In addition, many formats can be used in a "one-per-line" corpus, where each line represents a single document in the given format. For example, a JSON one-per-line corpus has a single JSON object representing a document on each line of the file. Each document format has an associated set of DocFormatParameters that define the various options for reading and writing in the format.

    • Field Detail

      • REPORT_INTERVAL

        static final String REPORT_INTERVAL
        Configuration option for setting the reporting interval used when updating a DocumentCollection or Corpus.
        See Also:
        Constant Field Values
      • REPORT_LEVEL

        static final String REPORT_LEVEL
        Configuration option for setting the reporting log level used when updating a DocumentCollection or Corpus.
        See Also:
        Constant Field Values
    • Method Detail

      • create

        static DocumentCollection create(@NonNull Document... documents)
        Creates a document collection for one or more documents.
        Parameters:
        documents - the documents
        Returns:
        the document collection
      • create

        static DocumentCollection create(@NonNull Iterable<Document> documents)
        Creates a document collection for one or more documents.
        Parameters:
        documents - the documents
        Returns:
        the document collection
      • create

        static DocumentCollection create(@NonNull Stream<Document> documents)
        Creates a document collection for a stream of documents.
        Parameters:
        documents - the documents
        Returns:
        the document collection
      • create

        static DocumentCollection create(@NonNull MStream<Document> documents)
        Creates a document collection for a stream of documents.
        Parameters:
        documents - the documents
        Returns:
        the document collection
      • create

        static DocumentCollection create(@NonNull String specification)
        Creates a document collection from a specification detailing the document format and the path of the documents. The specification should have the document format as the schema, e.g. FORMAT::PATH;OPTIONS.
        Parameters:
        specification - the specification
        Returns:
        the document collection
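        As a rough illustration of the FORMAT::PATH;OPTIONS shape described above, the sketch below splits such a string into its three parts. It is not Hermes's own parser: the class SpecSketch, the example path, and the assumption that OPTIONS is a semicolon-separated list of key=value pairs are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hermes: splits "FORMAT::PATH;OPTIONS".
final class SpecSketch {
    final String format;
    final String path;
    final Map<String, String> options;

    private SpecSketch(String format, String path, Map<String, String> options) {
        this.format = format;
        this.path = path;
        this.options = options;
    }

    static SpecSketch parse(String spec) {
        int sep = spec.indexOf("::");
        if (sep < 0) {
            throw new IllegalArgumentException("Missing '::' in specification: " + spec);
        }
        String format = spec.substring(0, sep);
        // Keep trailing empty strings so that parts[0] always exists.
        String[] parts = spec.substring(sep + 2).split(";", -1);
        String path = parts[0];
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                continue; // ignore empty segments such as a trailing ';'
            }
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                options.put(parts[i], ""); // assumption: a bare key has an empty value
            } else {
                options.put(parts[i].substring(0, eq), parts[i].substring(eq + 1));
            }
        }
        return new SpecSketch(format, path, options);
    }

    public static void main(String[] args) {
        // "opt=value" and the path are made-up placeholders.
        SpecSketch s = parse("json::/tmp/docs.json;opt=value");
        System.out.println(s.format + " | " + s.path + " | " + s.options);
    }
}
```

        The real option names depend on the DocFormatParameters of the chosen format and are not shown here.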
      • create

        static DocumentCollection create(@NonNull Specification specification)
        Creates a document collection from a specification detailing the document format and the path of the documents. The specification should have the document format as the schema, e.g. FORMAT::PATH;OPTIONS.
        Parameters:
        specification - the specification
        Returns:
        the document collection
      • annotate

        default DocumentCollection annotate(@NonNull AnnotatableType... annotatableTypes)
        Annotates this corpus with the given annotation types and returns a new corpus in which those annotation types are present.
        Parameters:
        annotatableTypes - The annotation types to annotate
        Returns:
        A new corpus with the given annotation types present.
      • apply

        default DocumentCollection apply(@NonNull Lexicon lexicon,
                                         @NonNull SerializableConsumer<HString> onMatch)
        Applies a lexicon to the corpus, passing each match to the given consumer (for example, to create annotations).
        Parameters:
        lexicon - the lexicon to match
        onMatch - the consumer invoked for each match
        Returns:
        the corpus
      • apply

        default DocumentCollection apply(@NonNull TokenRegex pattern,
                                         @NonNull SerializableConsumer<TokenMatch> onMatch)
        Applies a token regular expression to the corpus, passing each match to the given consumer (for example, to create annotations).
        Parameters:
        pattern - the pattern
        onMatch - the consumer invoked for each match
        Returns:
        the corpus
      • asDataSet

        default DataSet asDataSet(@NonNull HStringDataSetGenerator HStringDataSetGenerator)
        Converts this collection into a DataSet using the given generator.
        Parameters:
        HStringDataSetGenerator - the example generator
        Returns:
        the data set
      • cache

        default DocumentCollection cache()
        Caches any actions performed on this collection.
        Returns:
        the document collection
      • documentCount

        default Counter<String> documentCount(@NonNull Extractor extractor)
        Calculates the document frequency of the terms extracted from the corpus using the given extractor. Terms are counted in their string form.
        Parameters:
        extractor - the LyreExpression to use for extracting terms
        Returns:
        A counter containing the document frequencies of the extracted terms
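        Document frequency, in its usual sense, is the number of documents in which a term occurs at least once, regardless of how often it occurs inside each document. The sketch below shows that idea with plain Java collections. It does not use Hermes: token lists stand in for documents, a Map stands in for Counter, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not Hermes code: each inner list is one "document".
final class DocumentFrequencySketch {
    static Map<String, Integer> documentFrequency(List<List<String>> documents) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : documents) {
            // De-duplicate within the document so each term counts once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("a", "a", "b"),
                List.of("a", "c"));
        // "a" appears three times in total but in only two documents.
        System.out.println(documentFrequency(docs));
    }
}
```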
      • filter

        default DocumentCollection filter(@NonNull SerializablePredicate<Document> predicate)
        Filters the documents in the collection using the given predicate.
        Parameters:
        predicate - the predicate
        Returns:
        the filtered document collection
      • getStreamingContext

        StreamingContext getStreamingContext()
        Gets the streaming context associated with this collection.
        Returns:
        the streaming context
      • groupBy

        default <K> Multimap<K,Document> groupBy(@NonNull SerializableFunction<? super Document,K> keyFunction)
        Groups the documents in the collection using the given function.
        Type Parameters:
        K - The key type
        Parameters:
        keyFunction - Converts the document into a key to group the documents by
        Returns:
        A Multimap of key - document pairs.
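        The grouping behaviour described above can be sketched with the standard Java stream collectors. This is only an analogy, not Hermes code: strings stand in for documents, a Map of lists stands in for the Multimap, and the class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in, not Hermes code: strings play the role of documents
// and Map<K, List<String>> plays the role of Multimap<K, Document>.
final class GroupBySketch {
    static <K> Map<K, List<String>> groupBy(List<String> documents,
                                            Function<? super String, K> keyFunction) {
        // Every document is placed under the key computed by keyFunction.
        return documents.stream()
                .collect(Collectors.groupingBy(keyFunction, LinkedHashMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> docs = List.of("apple pie", "avocado toast", "banana bread");
        // Group by first character.
        System.out.println(groupBy(docs, d -> d.charAt(0)));
    }
}
```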
      • isEmpty

        default boolean isEmpty()
        Checks if the collection is empty.
        Returns:
        True if this document collection has no documents.
      • nGramCount

        default Counter<Tuple> nGramCount(@NonNull NGramExtractor nGramExtractor)
        Calculates the total corpus frequencies for n-grams extracted using the given extractor. Note that all n-grams are returned in their string form as Tuples.
        Parameters:
        nGramExtractor - the extractor
        Returns:
        the counter of string tuples representing the ngrams
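        To make the idea of corpus-wide n-gram frequencies concrete, here is a minimal sketch that counts n-grams with a sliding window. It is independent of Hermes: token lists stand in for documents, an immutable List of strings stands in for a Tuple, and the assumption that n-grams do not span document boundaries is this sketch's own.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not Hermes code.
final class NGramCountSketch {
    static Map<List<String>, Integer> nGramCount(List<List<String>> documents, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> tokens : documents) {
            // Slide a window of n tokens across each document separately.
            for (int i = 0; i + n <= tokens.size(); i++) {
                List<String> gram = List.copyOf(tokens.subList(i, i + n));
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(List.of("a", "b", "a", "b"));
        System.out.println(nGramCount(docs, 2));
    }
}
```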
      • parallelStream

        MStream<Document> parallelStream()
        Gets a parallel stream over the documents in the collection.
        Returns:
        the stream of documents
      • query

        default SearchResults query(@NonNull String query)
                             throws ParseException
        Queries this corpus and returns the matching documents as SearchResults.
        Parameters:
        query - the query
        Returns:
        the SearchResults containing documents matching the query
        Throws:
        ParseException - if the query string cannot be parsed
      • query

        SearchResults query(@NonNull Query query)
        Queries this corpus and returns the matching documents as SearchResults.
        Parameters:
        query - the query
        Returns:
        the SearchResults containing documents matching the query
      • repartition

        default DocumentCollection repartition​(int numPartitions)
        Repartitions the corpus.
        Parameters:
        numPartitions - the number of partitions
        Returns:
        the corpus
      • sample

        default DocumentCollection sample​(int size)
        Creates a sample of this corpus using reservoir sampling.
        Parameters:
        size - the number of documents to include in the sample
        Returns:
        the sampled corpus
      • sample

        default DocumentCollection sample(int count,
                                          @NonNull Random random)
        Creates a sample of this corpus using reservoir sampling.
        Parameters:
        count - the number of documents to include in the sample
        random - Random number generator to use for selection
        Returns:
        the sampled corpus
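        Reservoir sampling draws a fixed-size uniform sample in a single pass without knowing the number of items in advance. The sketch below is the textbook "Algorithm R" over a generic Iterable. It is shown only to explain the technique and is not taken from the Hermes source; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in, not Hermes code: classic reservoir sampling (Algorithm R).
final class ReservoirSketch {
    static <T> List<T> sample(Iterable<T> items, int count, Random random) {
        List<T> reservoir = new ArrayList<>(count);
        long seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < count) {
                // Fill the reservoir with the first 'count' items.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); the item survives with probability count / seen.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < count) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            items.add(i);
        }
        // A fixed seed makes the sample repeatable across runs.
        System.out.println(sample(items, 5, new Random(42)));
    }
}
```

        Passing a seeded Random, as the second overload allows, is the usual way to make a sample reproducible.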
      • significantBigrams

        default Counter<Tuple> significantBigrams(@NonNull NGramExtractor nGramExtractor,
                                                  int minCount,
                                                  double minScore)
        Calculates the bigrams with a significant co-occurrence using the Mikolov association measure.
        Parameters:
        nGramExtractor - the extractor to use for extracting NGrams
        minCount - the minimum co-occurrence count for a bigram to be considered
        minScore - the minimum score for a bigram to be significant
        Returns:
        the counter of bigrams and their scores
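        The Mikolov measure is commonly written as score(a, b) = (count(a b) - delta) / (count(a) * count(b)), where delta discounts bigrams made of rare words. The sketch below applies that formula together with minimum-count and minimum-score thresholds. It is an illustration under assumptions, not the Hermes implementation: the class name, the map-based inputs, and the explicit delta parameter are hypothetical, and how Hermes chooses delta is not stated in this documentation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not Hermes code.
final class BigramScoreSketch {
    // score(a, b) = (count(a b) - delta) / (count(a) * count(b))
    static double score(long bigramCount, long firstCount, long secondCount, double delta) {
        return (bigramCount - delta) / ((double) firstCount * (double) secondCount);
    }

    static Map<List<String>, Double> significantBigrams(Map<String, Long> unigrams,
                                                        Map<List<String>, Long> bigrams,
                                                        long minCount,
                                                        double minScore,
                                                        double delta) {
        Map<List<String>, Double> result = new HashMap<>();
        for (Map.Entry<List<String>, Long> e : bigrams.entrySet()) {
            long count = e.getValue();
            if (count < minCount) {
                continue; // too rare to be considered
            }
            long first = unigrams.getOrDefault(e.getKey().get(0), 0L);
            long second = unigrams.getOrDefault(e.getKey().get(1), 0L);
            if (first == 0 || second == 0) {
                continue; // avoid dividing by zero when unigram counts are missing
            }
            double s = score(count, first, second, delta);
            if (s >= minScore) {
                result.put(e.getKey(), s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> unigrams = Map.of("new", 20L, "york", 25L, "the", 1000L);
        Map<List<String>, Long> bigrams = Map.of(
                List.of("new", "york"), 10L,
                List.of("the", "new"), 10L);
        // Both bigrams occur 10 times, but "the" is so frequent that its bigram scores low.
        System.out.println(significantBigrams(unigrams, bigrams, 5, 0.005, 5.0));
    }
}
```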
      • significantBigrams

        default Counter<Tuple> significantBigrams(@NonNull NGramExtractor nGramExtractor,
                                                  int minCount,
                                                  double minScore,
                                                  @NonNull ContingencyTableCalculator calculator)
        Calculates the bigrams with a significant co-occurrence using the given association measure.
        Parameters:
        nGramExtractor - the extractor to use for extracting NGrams
        minCount - the minimum co-occurrence count for a bigram to be considered
        minScore - the minimum score for a bigram to be significant
        calculator - the association measure to use for determining significance
        Returns:
        the counter of bigrams and their scores
      • size

        default long size()
        Gets the number of documents in the corpus.
        Returns:
        the number of documents in the corpus
      • stream

        MStream<Document> stream()
        Gets a stream over the documents in the collection.
        Returns:
        the stream of documents
      • termCount

        default Counter<String> termCount(@NonNull Extractor extractor)
        Calculates the total corpus frequency of terms extracted using the given extractor.
        Parameters:
        extractor - the extractor to use for generating terms
        Returns:
        the counter of terms with frequencies
      • update

        DocumentCollection update(String operationName,
                                  @NonNull SerializableConsumer<Document> documentProcessor)
        Updates all documents in the corpus using the given document processor.
        Parameters:
        operationName - the name of the update operation being performed
        documentProcessor - the document processor
        Returns:
        this corpus with updates
      • update

        default DocumentCollection update(@NonNull CaduceusProgram program)
        Updates all documents in the corpus using the given CaduceusProgram.
        Parameters:
        program - the CaduceusProgram to execute on each document.
        Returns:
        this corpus with updates