Building an Apache Tika Project Using Eclipse

25 Jan 2019 | CPOL | 4 min read
A basic walkthrough of how to write and use code against the Apache Tika facade and debug it in Eclipse

Introduction

This tutorial shows how to write and use code against the Apache Tika facade and how to debug it in Eclipse.

Background

Apache Tika is a library that is used for document type detection and content extraction from various file formats. Reference: https://www.tutorialspoint.com/tika/tika_overview.htm

Using the Code

In this article, I show how to create a new Java project in Eclipse and run an example that detects a file's type using the Apache Tika library.

Steps

  1. I am using Apache Tika version 1.20, which can be downloaded from http://tika.apache.org/download.html. Download the tika-app jar file and save it on your machine.
  2. Open Eclipse and create a new Java project like this:

    Image 1

  3. Give the project a name, say "DetectType", and select the JRE version that you are using. If there is no compatible JRE in the list, install one.

    Image 2

  4. Right click on 'src' and select New->Class. Give the class a name, say 'DetectType'. Refresh the project and you should see the new file added under src.
  5. Add body to the newly added file:
    Java
    public class DetectType
    {
        public static void main(String[] args) throws Exception
        {
            // The Tika code is added in step 10
        }
    }
  6. Create a folder named 'lib' inside the DetectType project folder in your workspace and copy the jar file into it.
  7. Add the jar file to your DetectType project. Right click on your project and select Properties -> Java Build Path -> Libraries -> Add JARs.
  8. Select the newly copied jar file in your project. If you do not see the jar file, refresh your project and try again. Your properties window should now look like this:

    Image 3

  9. Refresh your project. In the Project Explorer, you should now see the jar file listed.
  10. Update your code body to include the Tika class and to detect the file type.
    Java
    import org.apache.tika.Tika;
    
    public class DetectType 
    { 
        public static void main(String[] args) throws Exception
        { 
            // Create a Tika instance with the default configuration
            Tika tika = new Tika();
            // Detect the media type of each file named on the command line.
            // Note: detect(String) decides from the file name alone.
            for (String file : args) {
                String fileType = tika.detect(file);
                System.out.println("File type of '" + file + "' is : " + fileType);
            }
        } 
    }
  11. The project hierarchy should look like this. (You can leave the class in the default package. I have placed it in 'org.apache.tika' because the next section imports the entire Tika source code into that package, which helps when debugging. If you do the same, add the line 'package org.apache.tika;' at the top of the class.)

    Image 4

  12. The program expects one or more file names as input. Pass them as program arguments in the run configuration, like this:

    Image 5

  13. Now run the program. You should see the result in the console, something like this:

    File type of 'format\1.vsd' is : application/vnd.visio

    The above example is a small one that only detects the type of a file. Tika exposes many more APIs that can be used to extract metadata and even the content of a file. For the complete list, see https://tika.apache.org/1.20/api/.
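
One thing to watch for: tika.detect(String) works from the file name alone, so a mistyped path still produces a result instead of an error. Below is a minimal sketch of a guard that could run before the loop in step 10. It uses only the JDK. The names ArgCheck and existingFiles are hypothetical and introduced here for illustration; they are not part of Tika.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: keeps only those command-line
// arguments that point at an existing regular file.
class ArgCheck {
    static List<String> existingFiles(String[] args) {
        List<String> found = new ArrayList<>();
        for (String arg : args) {
            if (Files.isRegularFile(Paths.get(arg))) {
                found.add(arg);
            } else {
                // Report the problem instead of silently "detecting" a type
                System.err.println("Skipping '" + arg + "': no such file");
            }
        }
        return found;
    }
}
```

In the main method from step 10, the loop could then iterate over ArgCheck.existingFiles(args) instead of args.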

Tika supports the following functionality:

  • Document type detection
  • Content extraction
  • Metadata extraction
  • Language detection
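
To give a feel for what document type detection involves, here is a toy sketch of content-based detection: it looks at the first few bytes of a file (its "magic bytes") rather than at the file name. This is an illustration of the general idea only, not Tika's implementation. The class MagicSniffer and its three-entry signature table are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a toy detector, NOT how Tika is implemented.
class MagicSniffer {
    // Well-known leading byte signatures for three common formats
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    // Returns a media type guessed from the leading bytes of a file,
    // or the generic "application/octet-stream" when nothing matches.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        if (startsWith(head, ZIP)) return "application/zip";
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A real library such as Tika knows far more signatures than this and can fall back on the file name when the bytes alone are not conclusive.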

Debugging the Apache Tika Facade

If you wish to add the entire Apache Tika source code to your Eclipse project and debug your facade class/function, follow these steps.

  1. Create a new package 'org.apache.tika' in your src (as shown in step 11 of the section above).
  2. Create a new class under 'org.apache.tika'. Right click 'org.apache.tika'->New->Class. Give it a name of your choice, say 'DetectType'.
  3. Download the source code from http://tika.apache.org/download.html using the 'Mirrors for tika-1.20-src.zip' link.
  4. Unzipping the archive gives you the source packages, which we can use to debug the facade classes used in the code above.

    Image 6

  5. Go into tika-core from above and copy the contents of the folder 'tika-core\src\main\java\org\apache\tika' into the folder 'DetectType\src\org\apache\tika' in your workspace. Refresh your project in Eclipse and you should see all of these as packages. The screenshot shows a few of them, not all:

    Image 7

  6. If the project shows errors, the likely cause is 'package-info.java'. Remove this file; its sole purpose is to provide a home for package-level documentation and package-level annotations.
  7. Start debugging. If at any level you cannot find the source code, go back to the file structure from step 4 and copy the missing source into the appropriate workspace structure within org/apache/tika.

If you get errors relating to 'org.osgi.framework' or 'org.osgi.util', go to http://www.java2s.com/Code/Jar/o/Downloadorgosgicore500jar.htm and download the jar file. Add it to your project in the same way as you added the tika-app jar in step 8.

Similarly, you can find on the same site a few more packages that might cause trouble, such as the one containing 'org.sqlite.SQLiteConfig'.
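
To find out which of these extra jars are still missing, a small check like the following may help. It is a sketch using only the JDK: it tries to load each named class without initializing it and reports the ones that cannot be found. ClasspathCheck and missing are hypothetical names made up for this example. 'org.sqlite.SQLiteConfig' is the class mentioned above, while 'org.osgi.framework.BundleActivator' is simply one class from the org.osgi.framework package chosen as a probe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: reports which of the given
// fully qualified class names cannot be loaded from the classpath.
class ClasspathCheck {
    static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            try {
                // 'false' means: load the class but do not run its static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }
}

// Example call from any main method in the project:
//   System.out.println(ClasspathCheck.missing(
//       "org.osgi.framework.BundleActivator", "org.sqlite.SQLiteConfig"));
```

An empty list means every probed class was found on the project's build path. Any name that is printed points to a jar that still needs to be added as in step 8.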

Points of Interest

This is the first time I have tried to debug the Tika facade class, and these are the steps I found to do so. If you feel anything is missing, please give feedback and I will improve this article.

History

  • 25th January, 2019: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
United Kingdom