Package com.gengoai.hermes.format
Interface DocFormat
- All Known Subinterfaces:
OneDocPerFileFormat
- All Known Implementing Classes:
CoNLLFormat, CsvFormat, HermesJsonFormat, PennTreebankFormat, POSFormat, TaggedFormat, TwitterSearchFormat, TxtFormat, WholeFileTextFormat
public interface DocFormat
A DocFormat defines how to read and write documents in a given format. Each document format has an associated set of DocFormatParameters that define the options for reading and writing in the format. By default the following parameters can be set:
- defaultLanguage - The default language for new documents (default: the value of Hermes.defaultLanguage()).
- normalizers - The class names of the text normalizers to use when constructing documents (default: the value of TextNormalization.configuredInstance().getPreprocessors()).
- distributed - Creates a distributed document collection when set to true (default: false).
- saveMode - Whether to overwrite, ignore, or throw an error when writing a corpus to an existing file or directory (default: ERROR).
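The three saveMode behaviours described above can be illustrated with a small self-contained sketch. It does not use the Hermes types: `SaveModeSketch`, `SaveModeDemo`, and `shouldWrite` are hypothetical names introduced here, and the logic is only an assumption about what "overwrite, ignore, or throw an error" could look like for a single target file.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the saveMode parameter; not the Hermes type.
enum SaveModeSketch { OVERWRITE, IGNORE, ERROR }

class SaveModeDemo {
    /**
     * Decides whether a write to the target should proceed.
     * Returns true if the caller should write, false if the write is skipped.
     */
    static boolean shouldWrite(Path target, SaveModeSketch mode) throws IOException {
        if (!Files.exists(target)) {
            return true; // nothing exists yet, so every mode may write
        }
        switch (mode) {
            case OVERWRITE:
                // Handles a plain file only; a directory would need a recursive delete.
                Files.delete(target);
                return true;
            case IGNORE:
                return false; // leave the existing output untouched
            default:
                // ERROR: refuse to clobber existing output
                throw new FileAlreadyExistsException(target.toString());
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("corpus", ".txt");
        System.out.println(shouldWrite(target, SaveModeSketch.IGNORE));
        System.out.println(shouldWrite(target, SaveModeSketch.OVERWRITE));
    }
}
```

Making ERROR the fallback branch mirrors the documented default: unless a caller opts in to overwriting or skipping, existing output is never silently replaced.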
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| DocFormatParameters | getParameters() | |
| MStream<Document> | read(Resource inputResource) | Reads documents in this format from the given input resource. |
| void | write(DocumentCollection documentCollection, Resource outputResource) | Writes a corpus of documents in this format to the given output resource. |
| void | write(Document document, Resource outputResource) | Writes the given document in this format to the given output resource. |
Method Detail
-
getParameters
DocFormatParameters getParameters()
- Returns:
- the DocFormatParameters set for this instance of the format
-
read
MStream<Document> read(Resource inputResource)
Reads documents in this format from the given input resource.
- Parameters:
inputResource - the input resource
- Returns:
- the stream of documents read
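As a rough illustration of the read contract (one input resource in, a stream of documents out), the sketch below uses standard-library stand-ins rather than the Hermes types: `java.nio.file.Path` replaces Resource, `Stream<String>` replaces MStream<Document>, and the one-document-per-line layout is an assumption made for the example. `ReadSketch` is a hypothetical name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ReadSketch {
    // Stand-in for read(Resource): yields one "document" per non-blank line.
    // Lines are read lazily, so large inputs are not loaded into memory at once.
    static Stream<String> read(Path input) throws IOException {
        return Files.lines(input, StandardCharsets.UTF_8)
                .map(String::trim)
                .filter(line -> !line.isEmpty());
    }

    public static void main(String[] args) throws IOException {
        Path input = Files.createTempFile("docs", ".txt");
        Files.write(input, List.of("First document.", "", "Second document."));
        // The stream holds an open file handle, so close it when done.
        try (Stream<String> docs = read(input)) {
            System.out.println(docs.collect(Collectors.toList()));
        }
    }
}
```

Returning a stream instead of a list lets the caller decide how much of the input to consume.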
-
write
void write(DocumentCollection documentCollection, Resource outputResource) throws IOException
Writes a corpus of documents in this format to the given output resource.
- Parameters:
documentCollection - the corpus
outputResource - the output resource
- Throws:
IOException - if something goes wrong while writing the corpus
-
write
void write(Document document, Resource outputResource) throws IOException
Writes the given document in this format to the given output resource.
- Parameters:
document - the document
outputResource - the output resource
- Throws:
IOException - if something goes wrong while writing the document
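The two write overloads can be pictured with a minimal stand-in, again using only the standard library: `List<String>` takes the place of DocumentCollection, `String` the place of Document, and `Path` the place of Resource. `WriteSketch` and the one-document-per-line layout are assumptions for illustration, not how any Hermes format is implemented.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class WriteSketch {
    // Stand-in for write(DocumentCollection, Resource): one document per line.
    // Documents containing line breaks would need escaping in a real format.
    static void write(List<String> corpus, Path output) throws IOException {
        Files.write(output, corpus); // UTF-8; creates the file or truncates it
    }

    // Stand-in for write(Document, Resource): delegates to the corpus overload.
    static void write(String document, Path output) throws IOException {
        write(List.of(document), output);
    }

    public static void main(String[] args) throws IOException {
        Path output = Files.createTempFile("corpus", ".txt");
        write(List.of("First document.", "Second document."), output);
        System.out.println(Files.readAllLines(output));
        write("Only document.", output);
        System.out.println(Files.readAllLines(output));
    }
}
```

Both overloads declare IOException, matching the signatures above, so a caller handles I/O failure the same way whether it writes one document or a whole corpus.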