tika.tika module

The Tika Python module provides a Python API client for the Apache Tika Server.

Example usage:

import tika
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])

Visit https://github.com/chrismattmann/tika-python to learn more about it.
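As a minimal sketch of working with the result shown in the example usage, the helper below summarizes a parse result. It assumes only what the example shows: a dict with "metadata" and "content" keys. The helper name `summarize`, the `preview_chars` parameter, the "Content-Type" metadata key, and the stand-in dict are illustrative assumptions, not part of the module.

```python
def summarize(parsed, preview_chars=80):
    """Return a short, printable summary of a parse-result dict."""
    # Treat a missing or None value as empty so the helper never raises.
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content") or ""
    # Collapse runs of whitespace so the preview fits on one line.
    text = " ".join(content.split())
    return {
        "content_type": metadata.get("Content-Type"),  # assumed metadata key
        "n_metadata_keys": len(metadata),
        "preview": text[:preview_chars],
    }


# Hypothetical stand-in for the dict returned by parser.from_file(...).
fake_parsed = {
    "metadata": {"Content-Type": "text/plain"},
    "content": "\n\nHello   Tika\n",
}
summary = summarize(fake_parsed)
print(summary)
```

Guarding against empty content keeps the helper usable on files from which no text was extracted.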

Detect IANA MIME Type:

from tika import detector
print(detector.from_file('/path/to/file'))

Detect Language:

from tika import language
print(language.from_file('/path/to/file'))

Use Tika Translate:

from tika import translate
print(translate.from_file('/path/to/file', 'srcLang', 'destLang'))
# Use the automatic language detection feature
print(translate.from_file('/path/to/file', 'destLang'))

Tika-Python Configuration:

You can use custom configuration files. See https://tika.apache.org/1.18/configuring.html for details on writing configuration files. Configuration is set the first time the server is started. To use a configuration file with a parser or detector:

parsed = parser.from_file('/path/to/file', config_path='/path/to/configfile')

or:

detected = detector.from_file('/path/to/file', config_path='/path/to/configfile')

or:

detected = detector.from_buffer('some buffered content', config_path='/path/to/configfile')

exception tika.tika.TikaException[source]

Bases: Exception

tika.tika.callServer(verb, serverEndpoint, service, data, headers, verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', httpVerbs={'get': <function get>, 'post': <function post>, 'put': <function put>}, classpath=None, rawResponse=False, config_path=None, requestOptions={})[source]

Call the Tika Server, do some error checking, and return the response. :param verb: :param serverEndpoint: :param service: :param data: :param headers: :param verbose: :param tikaServerJar: :param httpVerbs: :param classpath: :return:
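To make the roles of verb, serverEndpoint, service, data, and headers concrete, here is a rough sketch that assembles (but does not send) an HTTP request from those pieces. It uses the standard library's urllib rather than whatever HTTP client the module itself relies on, and the function name `build_request` is hypothetical. The endpoint and the "/tika" service path are taken from the defaults listed on this page.

```python
import urllib.request


def build_request(verb, server_endpoint, service, data=None, headers=None):
    """Assemble an HTTP request for a Tika Server service without sending it."""
    # Join endpoint and service path, avoiding a doubled slash.
    url = server_endpoint.rstrip("/") + service
    return urllib.request.Request(
        url,
        data=data,
        headers=dict(headers or {}),
        method=verb.upper(),
    )


req = build_request(
    "put", "http://localhost:9998", "/tika", b"hello", {"Accept": "text/plain"}
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) requires a running server, so the sketch stops at construction.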

tika.tika.checkJarSig(tikaServerJar, jarPath)[source]

Checks the signature of the JAR file. :param tikaServerJar: :param jarPath: :return: True if the signature of the jar matches
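The description does not say how the signature is computed. As an illustration only, the sketch below assumes the signature is an MD5 hex digest stored in a sidecar file next to the JAR. The helper names `file_md5` and `matches_checksum` and the ".md5" file layout are assumptions made for this example.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(jar_path, checksum_path):
    """Return True if the file's digest equals the one stored in checksum_path."""
    with open(checksum_path, "r") as f:
        expected = f.read().strip().lower()
    return file_md5(jar_path) == expected


# Demo with a throwaway file standing in for the JAR.
with tempfile.TemporaryDirectory() as tmp:
    jar = os.path.join(tmp, "demo.jar")
    md5_file = jar + ".md5"  # assumed sidecar layout
    with open(jar, "wb") as f:
        f.write(b"not a real jar")
    with open(md5_file, "w") as f:
        f.write(hashlib.md5(b"not a real jar").hexdigest() + "\n")
    good = matches_checksum(jar, md5_file)
    # Modify the file so its digest no longer matches the stored one.
    with open(jar, "ab") as f:
        f.write(b"tampered")
    bad = matches_checksum(jar, md5_file)

print(good, bad)
```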

tika.tika.checkPortIsOpen(remoteServerHost='localhost', port='9998')[source]

Checks if the specified port is open :param remoteServerHost: the host address :param port: port which needs to be checked :return: True if port is open, False otherwise
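One common way to implement such a check is to attempt a TCP connection and report whether it succeeds. The sketch below shows that technique with the standard library; it is not necessarily how the module does it, and the name `port_is_open` and the `timeout` parameter are assumptions.

```python
import socket


def port_is_open(host="localhost", port="9998", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        # The port may be given as a string, as in the signature above.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


# Demo: listen on an ephemeral local port, probe it, then close and probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
free_port = listener.getsockname()[1]

open_result = port_is_open("127.0.0.1", free_port)
listener.close()
closed_result = port_is_open("127.0.0.1", free_port)

print(open_result, closed_result)
```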

tika.tika.checkTikaServer(scheme='http', serverHost='localhost', port='9998', tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', classpath=None, config_path=None)[source]

Check that tika-server is running. If not, download JAR file and start it up. :param scheme: e.g. http or https :param serverHost: :param port: :param tikaServerJar: :param classpath: :return:

tika.tika.detectLang(option, urlOrPaths, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'file': '/language/stream'})[source]

Detect the language of the provided stream and return its two-character code as text/plain. :param option: :param urlOrPaths: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:

tika.tika.detectLang1(option, urlOrPath, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'file': '/language/stream'}, requestOptions={})[source]

Detect the language of the provided stream and return its two-character code as text/plain. :param option: :param urlOrPath: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:

tika.tika.detectType(option, urlOrPaths, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'type': '/detect/stream'})[source]

Detect the MIME/media type of the stream and return it in text/plain. :param option: :param urlOrPaths: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:

tika.tika.detectType1(option, urlOrPath, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'type': '/detect/stream'}, config_path=None, requestOptions={})[source]

Detect the MIME/media type of the stream and return it in text/plain. :param option: :param urlOrPath: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:

tika.tika.die(*s)[source]
tika.tika.doTranslate(option, urlOrPaths, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'all': '/translate/all'})[source]

Translate the file from source language to destination language. :param option: :param urlOrPaths: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:

tika.tika.doTranslate1(option, urlOrPath, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='text/plain', services={'all': '/translate/all'}, requestOptions={})[source]
Parameters:
  • option
  • urlOrPath
  • serverEndpoint
  • verbose
  • tikaServerJar
  • responseMimeType
  • services

Returns:

tika.tika.echo2(*s)[source]
tika.tika.getConfig(option, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='application/json', services={'detectors': '/detectors', 'mime-types': '/mime-types', 'parsers': '/parsers/details'}, requestOptions={})[source]

Get the configuration of the Tika Server (parsers, detectors, etc.) and return it in JSON format. :param option: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:

tika.tika.getPaths(urlOrPaths)[source]

Determines whether each entry in urlOrPaths is a URL, a file, or a directory. If it’s a directory, it walks the directory, finds all file paths in it, and adds them too. If it’s a file, it adds it to the paths. If it’s a URL, it just adds it to the paths. :param urlOrPaths: the url or path to be scanned :return: list of paths
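The classification described above can be sketched with os.walk and urllib.parse. This is an approximation written for illustration: the name `collect_paths` is hypothetical, only http and https are treated as URLs, and entries that are neither a URL nor an existing path are skipped, which the description does not specify.

```python
import os
import tempfile
from urllib.parse import urlparse


def collect_paths(url_or_paths):
    """Expand directories into file paths; keep files and URLs as given."""
    if isinstance(url_or_paths, str):
        url_or_paths = [url_or_paths]
    paths = []
    for item in url_or_paths:
        if os.path.isdir(item):
            # Walk the directory tree and add every file found in it.
            for root, _dirs, files in os.walk(item):
                paths.extend(os.path.join(root, name) for name in sorted(files))
        elif os.path.isfile(item) or urlparse(item).scheme in ("http", "https"):
            paths.append(item)
        # Anything else is skipped (an assumption of this sketch).
    return paths


# Demo on a temporary directory tree plus one URL.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("x")
    found = collect_paths([tmp, "https://example.com/doc.pdf"])
    names = sorted(os.path.basename(p) for p in found)

print(names)
```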

tika.tika.getRemoteFile(urlOrPath, destPath)[source]

Fetches URL to local path or just returns absolute path. :param urlOrPath: resource locator, generally URL or path :param destPath: path to store the resource, usually a path on file system :return: tuple having (path, 'local'/'remote'/'binary')

tika.tika.getRemoteJar(urlOrPath, destPath)[source]

Fetches URL to local path or just returns absolute path. :param urlOrPath: remote resource locator :param destPath: Path to store the resource, usually a path on file system :return: tuple having (path, 'local'/'remote')

tika.tika.killServer()[source]

Kills the Tika server started by the current execution instance.

tika.tika.main(argv=None)[source]

Run Tika from command line according to USAGE.

tika.tika.make_content_disposition_header(fn)[source]
tika.tika.parse(option, urlOrPaths, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='application/json', services={'all': '/rmeta', 'meta': '/meta', 'text': '/tika'}, rawResponse=False)[source]

Parse the objects and return extracted metadata and/or text in JSON format. :param option: :param urlOrPaths: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :return:

tika.tika.parse1(option, urlOrPath, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='application/json', services={'all': '/rmeta/text', 'meta': '/meta', 'text': '/tika'}, rawResponse=False, headers=None, config_path=None, requestOptions={})[source]

Parse the object and return extracted metadata and/or text in JSON format. :param option: :param urlOrPath: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param services: :param rawResponse: :param headers: :return:

tika.tika.parseAndSave(option, urlOrPaths, outDir=None, serverEndpoint='http://localhost:9998', verbose=0, tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', responseMimeType='application/json', metaExtension='_meta.json', services={'all': '/rmeta', 'meta': '/meta', 'text': '/tika'})[source]

Parse the objects and write extracted metadata and/or text in JSON format to matching filename with an extension of ‘_meta.json’. :param option: :param urlOrPaths: :param outDir: :param serverEndpoint: :param verbose: :param tikaServerJar: :param responseMimeType: :param metaExtension: :param services: :return:

tika.tika.runCommand(cmd, option, urlOrPaths, port, outDir=None, serverHost='localhost', tikaServerJar='http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server-standard/3.1.0/tika-server-standard-3.1.0.jar', verbose=0, encode=0)[source]

Run the Tika command by calling the Tika server and return results in JSON format (or plain text). :param cmd: a command from set {'parse', 'detect', 'language', 'translate', 'config'} :param option: :param urlOrPaths: :param port: :param outDir: :param serverHost: :param tikaServerJar: :param verbose: :param encode: :return: response for the command, usually a dict

tika.tika.startServer(tikaServerJar, java_path='java', java_args='', serverHost='localhost', port='9998', classpath=None, config_path=None)[source]

Starts Tika Server :param tikaServerJar: path to tika server jar :param serverHost: the host interface address to be used for binding the service :param port: the host port to be used for binding the service :param classpath: Class path value to pass to JVM :return: None
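To show how these parameters could map onto a launch command, the sketch below builds (but does not run) an argument list for the JVM. The `--host`, `--port`, and `--config` flags are assumed to be the Tika Server command-line options; the function name `build_server_command` is hypothetical, and classpath handling is left out because the description does not say how it is combined with the JAR.

```python
import shlex


def build_server_command(tika_server_jar, java_path="java", java_args="",
                         server_host="localhost", port="9998", config_path=None):
    """Build an argument list for launching the server; nothing is executed."""
    cmd = [java_path]
    # Split extra JVM arguments the way a shell would, e.g. "-Xmx1g -Dfoo=bar".
    cmd.extend(shlex.split(java_args))
    # Flag names below are assumptions about the server's CLI.
    cmd.extend(["-jar", tika_server_jar, "--host", server_host, "--port", str(port)])
    if config_path:
        cmd.extend(["--config", config_path])
    return cmd


cmd = build_server_command(
    "/tmp/tika-server.jar", java_args="-Xmx1g", config_path="/path/to/configfile"
)
print(cmd)
```

A list like this could be handed to `subprocess.Popen`; building it separately makes the command easy to inspect before anything is started.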

tika.tika.toFilename(url)[source]

Gets a URL and returns a filename.
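The exact naming rules are not stated here. As a rough illustration of deriving a filename from a URL, the sketch below takes the last path segment and falls back to a default when the path is empty. The name `url_to_filename` and the `default` parameter are hypothetical, and the module may sanitize or truncate names differently.

```python
import posixpath
from urllib.parse import unquote, urlparse


def url_to_filename(url, default="download"):
    """Return the last path segment of a URL, or a default if there is none."""
    # Drop the query string and fragment, then decode percent-escapes.
    path = unquote(urlparse(url).path)
    name = posixpath.basename(path.rstrip("/"))
    return name or default


example_name = url_to_filename("https://example.com/docs/report.pdf?x=1")
fallback_name = url_to_filename("https://example.com/")
print(example_name, fallback_name)
```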

tika.tika.warn(*s)[source]