Interface DocFormat

  • All Known Subinterfaces:
    OneDocPerFileFormat
    All Known Implementing Classes:
    CoNLLFormat, CsvFormat, HermesJsonFormat, PennTreebankFormat, POSFormat, TaggedFormat, TwitterSearchFormat, TxtFormat, WholeFileTextFormat

    public interface DocFormat

    A DocFormat defines how to read and write documents in a given format. Each document format has an associated set of DocFormatParameters that define the options for reading and writing in that format. By default, the following parameters can be set:

    • defaultLanguage - The default language for new documents. (default calls Hermes.defaultLanguage())
    • normalizers - The class names of the text normalizers to use when constructing documents. (default calls TextNormalization.configuredInstance().getPreprocessors())
    • distributed - Creates a distributed document collection when the value is set to true (default false).
    • saveMode - Whether to overwrite, ignore, or throw an error when writing a corpus to an existing file or directory (default ERROR).
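
    The three saveMode behaviors listed above can be illustrated with a small, self-contained sketch built only on java.nio. This is not Hermes code: the class SaveModeSketch, its write helper, and the constant names OVERWRITE and IGNORE are assumptions made for illustration. Only ERROR is named in the documentation above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveModeSketch {
    // Hypothetical stand-in for the saveMode parameter.
    // Only ERROR is named in the documentation; OVERWRITE and IGNORE are assumed names.
    enum SaveMode { OVERWRITE, IGNORE, ERROR }

    /** Writes content to target according to mode; returns true if data was written. */
    static boolean write(Path target, String content, SaveMode mode) throws IOException {
        if (Files.exists(target)) {
            switch (mode) {
                case IGNORE:
                    // Leave the existing file untouched and report that nothing was written.
                    return false;
                case ERROR:
                    // Refuse to touch an existing target.
                    throw new FileAlreadyExistsException(target.toString());
                case OVERWRITE:
                    // Fall through to the write below, which truncates the file.
                    break;
            }
        }
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("savemode");
        Path target = dir.resolve("corpus.txt");
        // Target is absent, so even ERROR mode writes.
        System.out.println(write(target, "first", SaveMode.ERROR));
        // Target now exists, so IGNORE skips the write.
        System.out.println(write(target, "second", SaveMode.IGNORE));
        // OVERWRITE replaces the existing content.
        System.out.println(write(target, "third", SaveMode.OVERWRITE));
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

    The actual parameter is set through DocFormatParameters; how a given format applies it to directories or distributed collections is not covered by this sketch.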

    • Method Detail

      • read

        MStream<Document> read(Resource inputResource)
        Reads documents in this format from the given input resource.
        Parameters:
        inputResource - the input resource
        Returns:
        the stream of documents read
      • write

        void write(DocumentCollection documentCollection,
                   Resource outputResource)
            throws IOException
        Writes a corpus of documents in this format to the given output resource.
        Parameters:
        documentCollection - the corpus
        outputResource - the output resource
        Throws:
        IOException - if something goes wrong while writing the corpus
      • write

        void write(Document document,
                   Resource outputResource)
            throws IOException
        Writes the given document in this format to the given output resource.
        Parameters:
        document - the document
        outputResource - the output resource
        Throws:
        IOException - if something goes wrong while writing the document
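
    To make the shape of the three methods concrete, the following is a minimal sketch of a format that stores one document per line. It deliberately avoids the Hermes types. String stands in for Document, List<String> for DocumentCollection, java.nio.file.Path for Resource, and java.util.stream.Stream for MStream. The names LineFormatSketch, SketchFormat, OnePerLine, and roundTrip are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LineFormatSketch {
    // Simplified stand-in mirroring the three method signatures documented above.
    interface SketchFormat {
        Stream<String> read(Path inputResource);
        void write(List<String> documents, Path outputResource) throws IOException;
        void write(String document, Path outputResource) throws IOException;
    }

    // Hypothetical format: one document per line (documents must not contain line breaks).
    static class OnePerLine implements SketchFormat {
        @Override
        public Stream<String> read(Path inputResource) {
            try {
                return Files.readAllLines(inputResource, StandardCharsets.UTF_8).stream();
            } catch (IOException e) {
                // read declares no checked exception, so wrap the failure.
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public void write(List<String> documents, Path outputResource) throws IOException {
            Files.write(outputResource, documents, StandardCharsets.UTF_8);
        }

        @Override
        public void write(String document, Path outputResource) throws IOException {
            // A single document is written as a one-element collection.
            write(Collections.singletonList(document), outputResource);
        }
    }

    /** Writes the documents to a temporary file and reads them back. */
    static List<String> roundTrip(List<String> documents) throws IOException {
        Path file = Files.createTempFile("docs", ".txt");
        try {
            SketchFormat format = new OnePerLine();
            format.write(documents, file);
            return format.read(file).collect(Collectors.toList());
        } finally {
            Files.delete(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(Arrays.asList("First document.", "Second document.")));
    }
}
```

    Delegating the single-document write to the collection write keeps the two overloads consistent. Whether the implementing classes listed above do the same is not stated in this documentation.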