Interface Corpus

    • Method Detail

      • open

        static Corpus open(@NonNull Resource resource)
        Opens the corpus at the given resource.
        Parameters:
        resource - the resource at which the corpus is located
        Returns:
        the corpus
      • open

        static Corpus open(@NonNull String resource)
        Opens the corpus at the given resource.
        Parameters:
        resource - the resource at which the corpus is located, given as a string
        Returns:
        the corpus
      • add

        boolean add(Document document)
        Adds a document to the corpus.
        Parameters:
        document - the document to add
        Returns:
        true if the document was added, false otherwise
      • addAll

        default void addAll(@NonNull Iterable<Document> documents)
        Adds multiple documents to the corpus.
        Parameters:
        documents - the documents to add
      • annotate

        default Corpus annotate(@NonNull AnnotatableType... annotatableTypes)
        Description copied from interface: DocumentCollection
        Annotates this corpus with the given annotation types and returns a new corpus in which those annotation types are present.
        Specified by:
        annotate in interface DocumentCollection
        Parameters:
        annotatableTypes - the annotation types to annotate the corpus with
        Returns:
        A new corpus with the given annotation types present.
      • assignRandomSplit

        default void assignRandomSplit(double pct)
      • compact

        default Corpus compact()
        Compacts the storage used for the corpus.
        Returns:
        This corpus
      • getAttributeValueCount

        <T> Counter<T> getAttributeValueCount(@NonNull AttributeType<T> type)
        Gets a count of the values for the given attribute across documents in the corpus.
        Type Parameters:
        T - the attribute value type parameter
        Parameters:
        type - the AttributeType whose values are counted
        Returns:
        A Counter over the attribute values.
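
        The counting described above can be sketched in plain Java. This is a minimal illustration, not the library's implementation: the class AttributeValueCountSketch, the method countValues, the use of a Map of attribute names to values as a stand-in for Document, and a Map<T, Integer> as a stand-in for Counter<T> are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types: a document is modeled as a map from attribute
// name to attribute value, and the result as a plain map instead of a Counter.
final class AttributeValueCountSketch {

    // Counts how often each value of the named attribute occurs across documents.
    static <T> Map<T, Integer> countValues(List<Map<String, Object>> documents,
                                           String attributeName,
                                           Class<T> valueType) {
        Map<T, Integer> counts = new HashMap<>();
        for (Map<String, Object> document : documents) {
            Object value = document.get(attributeName);
            // Documents that lack the attribute (value == null) or hold a value
            // of a different type are skipped.
            if (valueType.isInstance(value)) {
                counts.merge(valueType.cast(value), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

        In this sketch a document that does not define the attribute contributes nothing to the counts; whether the real method behaves the same way is not stated in this documentation.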
      • getAttributes

        Set<AttributeType<?>> getAttributes()
        Returns:
        the set of attribute types found across the documents in the corpus
      • getCompleted

        Set<AnnotatableType> getCompleted()
        Returns:
        the set of AnnotatableType that have been completed by every document in the corpus
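
        "Completed by every document" amounts to a set intersection over the per-document completed sets. The following is a minimal sketch of that idea only: the class CompletedSketch, the method completedByAll, and the use of strings in place of AnnotatableType are hypothetical, and returning an empty set for an empty corpus is an assumption.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in: annotatable types are represented by their names.
final class CompletedSketch {

    // Returns the types that appear in the completed set of every document.
    static Set<String> completedByAll(List<Set<String>> completedPerDocument) {
        if (completedPerDocument.isEmpty()) {
            // Assumption: an empty corpus has no completed types.
            return new HashSet<>();
        }
        // Start from the first document's set and intersect with the rest.
        Set<String> result = new HashSet<>(completedPerDocument.get(0));
        for (Set<String> completed : completedPerDocument.subList(1, completedPerDocument.size())) {
            result.retainAll(completed);
        }
        return result;
    }
}
```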
      • getDocument

        default Document getDocument(String id)
        Gets the document with the given document id.
        Parameters:
        id - the id of the document
        Returns:
        the document, or null if no document with the given id exists
      • getIds

        default List<String> getIds()
        Returns:
        the document ids of all documents in the corpus
      • importDocuments

        default Corpus importDocuments(@NonNull String specification)
                                throws IOException
        Imports documents from the given document collection specification.
        Parameters:
        specification - the document format specification with path to documents.
        Returns:
        the corpus
        Throws:
        IOException - if something goes wrong while loading the documents
      • remove

        boolean remove(Document document)
        Removes a document from the corpus.
        Parameters:
        document - the document to remove
        Returns:
        true if the document was removed, false otherwise
      • remove

        boolean remove(String id)
        Removes a document by its id.
        Parameters:
        id - the id of the document to remove
        Returns:
        true if the document was removed, false otherwise
      • repartition

        default Corpus repartition(int numPartitions)
        Description copied from interface: DocumentCollection
        Repartitions the corpus.
        Specified by:
        repartition in interface DocumentCollection
        Parameters:
        numPartitions - the number of partitions
        Returns:
        the corpus
      • update

        Corpus update(@NonNull String operation,
                      @NonNull SerializableConsumer<Document> documentProcessor)
        Description copied from interface: DocumentCollection
        Updates all documents in the corpus using the given document processor.
        Specified by:
        update in interface DocumentCollection
        Parameters:
        operation - the name of the update operation being performed
        documentProcessor - the document processor
        Returns:
        this corpus with updates
      • update

        boolean update(Document document)
        Updates the given document.
        Parameters:
        document - the document to update
        Returns:
        true if the document was updated, false otherwise
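
        The boolean and null return values documented for add, getDocument, remove(String) and update(Document) can be illustrated with a small in-memory stand-in. This is a sketch under stated assumptions, not the library's storage: the class InMemoryCorpusSketch is hypothetical, documents are reduced to an id and a text string, and the choices that add returns false for an id that is already present and that update returns false for an unknown id are assumptions that this documentation does not confirm.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a corpus; a document is just id -> text.
final class InMemoryCorpusSketch {
    private final Map<String, String> documentsById = new LinkedHashMap<>();

    // Assumption: adding an id that already exists is rejected and returns false.
    boolean add(String id, String text) {
        return documentsById.putIfAbsent(id, text) == null;
    }

    // Returns the stored text, or null if no document has the given id.
    String getDocument(String id) {
        return documentsById.get(id);
    }

    // Returns true only if a document with the given id was present and removed.
    boolean remove(String id) {
        return documentsById.remove(id) != null;
    }

    // Assumption: updating an unknown id changes nothing and returns false.
    boolean update(String id, String text) {
        return documentsById.replace(id, text) != null;
    }
}
```

        Callers of the real interface would check the returned boolean in the same way, for example to detect that a remove call found nothing to delete.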