tika.tika module
The Tika Python module provides a Python API client for the Apache Tika Server.
Example usage:
import tika
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])
Visit https://github.com/chrismattmann/tika-python to learn more about it.
Detect IANA MIME Type:
from tika import detector
print(detector.from_file('/path/to/file'))
Detect Language:
from tika import language
print(language.from_file('/path/to/file'))
Use Tika Translate:
from tika import translate
print(translate.from_file('/path/to/file', 'srcLang', 'destLang'))
# Use auto Language detection feature
print(translate.from_file('/path/to/file', 'destLang'))
*Tika-Python Configuration*
You can use custom configuration files. See https://tika.apache.org/1.18/configuring.html for details on writing configuration files. Configuration is set the first time the server is started. To use a configuration file with a parser or detector:
parsed = parser.from_file('/path/to/file', config_path='/path/to/configfile')
or:
detected = detector.from_file('/path/to/file', config_path='/path/to/configfile')
or:
detected = detector.from_buffer('some buffered content', config_path='/path/to/configfile')
- tika.tika.callServer(verb, serverEndpoint, service, data, headers, verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', httpVerbs={'get': <function get>, 'post': <function post>, 'put': <function put>}, classpath=None, rawResponse=False, config_path=None, requestOptions={})
Call the Tika Server, do some error checking, and return the response. :param verb: :param serverEndpoint: :param service: :param data: :param headers: :param verbose: :param tikaServerJar: :param httpVerbs: :param classpath: :return:
- tika.tika.checkJarSig(tikaServerJar, jarPath)
Checks the signature of the JAR. :param tikaServerJar: :param jarPath: :return: True if the signature of the JAR matches
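As a rough illustration of what a signature check of this kind can look like, the sketch below compares a file's MD5 digest with a digest stored in a sidecar file next to it. The helper name `jar_matches_digest`, the `.md5` sidecar layout, and the choice of MD5 are assumptions made for this example; none of them is stated in the description above.

```python
import hashlib
import os
import tempfile


def jar_matches_digest(jar_path, digest_path):
    """Return True if the MD5 hex digest of jar_path equals the text in digest_path.

    Hypothetical helper: the sidecar-file layout and MD5 are assumptions.
    """
    md5 = hashlib.md5()
    with open(jar_path, "rb") as jar:
        # Hash in chunks so a large JAR is never read into memory at once.
        for chunk in iter(lambda: jar.read(65536), b""):
            md5.update(chunk)
    with open(digest_path, "r") as digest_file:
        expected = digest_file.read().strip()
    return expected == md5.hexdigest()


# Demo with a throwaway file standing in for the JAR.
with tempfile.TemporaryDirectory() as tmp:
    jar = os.path.join(tmp, "demo.jar")
    with open(jar, "wb") as f:
        f.write(b"not a real jar")
    with open(jar + ".md5", "w") as f:
        f.write(hashlib.md5(b"not a real jar").hexdigest() + "\n")
    print(jar_matches_digest(jar, jar + ".md5"))
```

A mismatch usually means the download was truncated or the file was changed after the digest was written, so a caller would typically fetch the JAR again.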
- tika.tika.checkPortIsOpen(remoteServerHost='localhost', port='9998')
Checks if the specified port is open. :param remoteServerHost: the host address :param port: port which needs to be checked :return: True if port is open, False otherwise
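A port check like this can be sketched with the standard library alone. The helper `port_is_open` and its `timeout` parameter are hypothetical and are not part of the module documented here; the sketch only shows the general technique of attempting a TCP connection and treating failure as "closed".

```python
import socket


def port_is_open(host="localhost", port=9998, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper; the timeout parameter is an assumption.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


# Demo: listen on an ephemeral local port, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
print(port_is_open("127.0.0.1", demo_port))
listener.close()
```

The `int(port)` conversion lets the helper accept the port as a string, matching the `'9998'` default shown in the signature above.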
- tika.tika.checkTikaServer(scheme='http', serverHost='localhost', port='9998', tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', classpath=None, config_path=None)
Check that tika-server is running. If not, download the JAR file and start it up. :param scheme: e.g. http or https :param serverHost: :param port: :param tikaServerJar: :param classpath: :return:
- tika.tika.detectLang(option, urlOrPaths, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'file': '/language/stream'})
Detect the language of the provided stream and return its two-character code as text/plain. :param option: :param urlOrPaths: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:
- tika.tika.detectLang1(option, urlOrPath, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'file': '/language/stream'}, requestOptions={})
Detect the language of the provided stream and return its two-character code as text/plain. :param option: :param urlOrPath: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:
- tika.tika.detectType(option, urlOrPaths, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'type': '/detect/stream'})
Detect the MIME/media type of the stream and return it in text/plain. :param option: :param urlOrPaths: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:
- tika.tika.detectType1(option, urlOrPath, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'type': '/detect/stream'}, config_path=None, requestOptions={})
Detect the MIME/media type of the stream and return it in text/plain. :param option: :param urlOrPath: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:
- tika.tika.doTranslate(option, urlOrPaths, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'all': '/translate/all'})
Translate the file from source language to destination language. :param option: :param urlOrPaths: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:
- tika.tika.doTranslate1(option, urlOrPath, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'all': '/translate/all'}, requestOptions={})
- Parameters:
option
urlOrPath
serverEndpoint
verbose
tikaServerJar
responseMimeType
services
- Returns:
- tika.tika.getConfig(option, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='application/json', services={'detectors': '/detectors', 'mime-types': '/mime-types', 'parsers': '/parsers/details'}, requestOptions={})
Get the configuration of the Tika Server (parsers, detectors, etc.) and return it in JSON format. :param option: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:
- tika.tika.getPaths(urlOrPaths)
Determines whether each entry in urlOrPaths is a URL, a file, or a directory. If it’s a directory, it walks the directory, finds all file paths in it, and adds them too. If it’s a file, it adds it to the paths. If it’s a URL, it just adds it to the paths. :param urlOrPaths: the url or path to be scanned :return: list of paths
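The expansion described above can be sketched as follows. `collect_paths` is a hypothetical stand-in written for this page, not the library's own code: it walks directories recursively and keeps every other entry (file path or URL) unchanged, which is one plausible reading of the description.

```python
import os
import tempfile


def collect_paths(url_or_paths):
    """Expand a mix of URLs, files, and directories into a flat list.

    Hypothetical helper: directories are walked recursively; anything that
    is not a directory (a file path or a URL) is kept as-is.
    """
    if isinstance(url_or_paths, str):
        url_or_paths = [url_or_paths]
    paths = []
    for item in url_or_paths:
        if os.path.isdir(item):
            for root, _dirs, files in os.walk(item):
                for name in files:
                    paths.append(os.path.join(root, name))
        else:
            paths.append(item)
    return paths


# Demo: one directory with a nested file, plus a URL that is passed through.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("x")
    for path in collect_paths([tmp, "http://example.com/doc.pdf"]):
        print(path)
```

Accepting either a single string or a list keeps call sites simple when only one file is being processed.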
- tika.tika.getRemoteFile(urlOrPath, destPath)
Fetches URL to local path or just returns absolute path. :param urlOrPath: resource locator, generally URL or path :param destPath: path to store the resource, usually a path on file system :return: tuple having (path, 'local'/'remote'/'binary')
- tika.tika.getRemoteJar(urlOrPath, destPath)
Fetches URL to local path or just returns absolute path. :param urlOrPath: remote resource locator :param destPath: path to store the resource, usually a path on file system :return: tuple having (path, 'local'/'remote')
- tika.tika.parse(option, urlOrPaths, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='application/json', services={'all': '/rmeta', 'meta': '/meta', 'text': '/tika'}, rawResponse=False)
Parse the objects and return extracted metadata and/or text in JSON format. :param option: :param urlOrPaths: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:
- tika.tika.parse1(option, urlOrPath, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='application/json', services={'all': '/rmeta/text', 'meta': '/meta', 'text': '/tika'}, rawResponse=False, headers=None, config_path=None, requestOptions={})
Parse the object and return extracted metadata and/or text in JSON format. :param option: :param urlOrPath: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :param rawResponse: :param headers: :return:
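To make "metadata and/or text in JSON format" more concrete, here is a sketch of how a JSON response holding a list of per-document metadata objects might be folded into the `metadata`/`content` dictionary used in the example at the top of this page. The sample payload is made up, the helper `split_rmeta` is hypothetical, and the assumption that extracted text sits under an `X-TIKA:content` key is mine rather than something stated above.

```python
import json

# Made-up payload shaped like a list of per-document metadata objects.
sample_response = json.dumps([
    {"Content-Type": "application/pdf", "X-TIKA:content": "Hello from page one."},
    {"Content-Type": "image/png", "X-TIKA:content": None},
])


def split_rmeta(raw_json, content_key="X-TIKA:content"):
    """Split a JSON list of metadata objects into metadata and joined text."""
    documents = json.loads(raw_json)
    texts = []
    metadata = []
    for doc in documents:
        # Pull the text out so the remaining keys are pure metadata.
        text = doc.pop(content_key, None)
        if text:
            texts.append(text)
        metadata.append(doc)
    return {
        # In this sketch the first object describes the container document.
        "metadata": metadata[0] if metadata else {},
        "embedded": metadata[1:],
        "content": "\n".join(texts) if texts else None,
    }


parsed = split_rmeta(sample_response)
print(parsed["metadata"])
print(parsed["content"])
```

Keeping embedded documents in a separate list is a design choice of this sketch; a caller that only needs the top-level file can ignore it.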
- tika.tika.parseAndSave(option, urlOrPaths, outDir=None, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='application/json', metaExtension='_meta.json', services={'all': '/rmeta', 'meta': '/meta', 'text': '/tika'})
Parse the objects and write extracted metadata and/or text in JSON format to matching filename with an extension of '_meta.json'. :param option: :param urlOrPaths: :param outDir: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param metaExtension: :param services: :return:
- tika.tika.runCommand(cmd, option, urlOrPaths, port, outDir=None, serverHost='localhost', tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', verbose=0, encode=0)
Run the Tika command by calling the Tika server and return results in JSON format (or plain text). :param cmd: a command from set {'parse', 'detect', 'language', 'translate', 'config'} :param option: :param urlOrPaths: :param port: :param outDir: :param serverHost: :param tikaServerJar: :param verbose: :param encode: :return: response for the command, usually a dict
- tika.tika.startServer(tikaServerJar, java_path='java', java_args='', serverHost='localhost', port='9998', classpath=None, config_path=None)
Starts the Tika Server. :param tikaServerJar: path to tika server jar :param serverHost: the host interface address to be used for binding the service :param port: the host port to be used for binding the service :param classpath: Class path value to pass to JVM :return: None
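For orientation, the sketch below shows how parameters like these could be turned into a Java command line. It only builds the argument list and launches nothing. The helper `build_server_command` is hypothetical, and the `--host`, `--port`, and `--config` flags are assumptions about the server's command line that should be checked against the Tika Server documentation for the version in use; the `classpath` parameter is left out because it changes how the JVM is invoked.

```python
import shlex


def build_server_command(jar_path, java_path="java", java_args="",
                         server_host="localhost", port="9998", config_path=None):
    """Assemble an argv list for launching a server JAR (nothing is executed).

    Hypothetical helper: the --host/--port/--config flags are assumptions
    about the server's command line, not taken from the text above.
    """
    command = [java_path]
    # Split extra JVM options the way a shell would, e.g. "-Xmx1g -Xms256m".
    command += shlex.split(java_args)
    command += ["-jar", jar_path, "--host", server_host, "--port", str(port)]
    if config_path:
        command += ["--config", config_path]
    return command


print(build_server_command("/path/to/tika-server.jar", java_args="-Xmx1g"))
```

Building a list instead of a single shell string avoids quoting problems when a path contains spaces; the list can be passed to `subprocess.Popen` directly.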