Hello,
I have a project where I need to extract text and images from PDF pages and build a documentation database.
I am able to do this using VB.NET, IKVM and PDFBox. However, I still cannot get the x, y position of the text and images I am extracting.
Are there any solutions out there (other than going full Java; I am not a Java developer :-)?
Here is the piece of code I am using to extract images (adapted from some examples in the PDFBox documentation). The problem is that ImageX and ImageY always return 0. The other image properties (height and width) are set correctly.
Private PDF As PDDocument = Nothing
Private PDFPage As PDPage = Nothing
Private PDFPageResources As PDResources = Nothing
Private PDFPageStream As COSStream = Nothing
Private PDFDocumentPages As java.util.ArrayList = Nothing
Private ImageItem As PDXObjectImage = Nothing
Private ImageMap As java.util.Map = Nothing
Private ImageMapIterator As java.util.Iterator = Nothing
Dim PDFEngine = New PDFStreamEngine
PDFDocumentPages = PDF.getDocumentCatalog.getAllPages()
PDFPage = PDFDocumentPages.get(0)
PDFEngine.processStream(PDFPage, PDFPage.findResources, PDFPage.getContents.getStream)
ImageMap = PDFPage.getResources.getImages()
If ImageMap IsNot Nothing Then
    Dim ImageNumber As Integer = 1
    ImageMapIterator = ImageMap.keySet.iterator
    While ImageMapIterator.hasNext()
        Dim key As String
        key = CType(ImageMapIterator.next(), String)
        ImageItem = ImageMap.get(key)
        ' CTM as reported by the engine at this point (processStream has already returned)
        Dim CTM As org.apache.pdfbox.util.Matrix
        CTM = PDFEngine.getGraphicsState.getCurrentTransformationMatrix()
        ' Build the page rotation and its inverse
        Dim rotationInRadians As Double = (PDFPage.findRotation * Math.PI) / 180
        Dim rotation As New java.awt.geom.AffineTransform
        rotation.setToRotation(rotationInRadians)
        Dim rotationInverse As java.awt.geom.AffineTransform = rotation.createInverse
        Dim rotationInverseMatrix As New org.apache.pdfbox.util.Matrix
        rotationInverseMatrix.setFromAffineTransform(rotationInverse)
        Dim rotationMatrix As New org.apache.pdfbox.util.Matrix
        rotationMatrix.setFromAffineTransform(rotation)
        ' Remove the page rotation from the CTM
        Dim unrotatedCTM As org.apache.pdfbox.util.Matrix = CTM.multiply(rotationInverseMatrix)
        ' Scale and position taken from the unrotated CTM
        Dim xScale As Single = unrotatedCTM.getXScale()
        Dim yScale As Single = unrotatedCTM.getYScale()
        Dim ImageX As Single = unrotatedCTM.getXPosition()
        Dim imageY As Single = unrotatedCTM.getYPosition()
        Dim ImageH As Single = yScale / 100.0F * ImageItem.getHeight()
        Dim ImageW As Single = xScale / 100.0F * ImageItem.getWidth()
        ' ...... code to save the image, etc
        ImageNumber += 1
    End While
End If
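
To make clear what I expect from ImageX and ImageY: as far as I understand the PDF model, an image is painted into a unit square that the current transformation matrix (CTM) maps onto the page, so the translation part of the CTM should be the image's lower-left corner and the scale part its size in page units. Below is a minimal sketch of that arithmetic in plain VB.NET. It uses System.Numerics.Matrix3x2 only as a stand-in for the PDFBox Matrix, and the module name CtmSketch and the numbers (72, 500, 200, 100) are made up for illustration; none of this is PDFBox API.

```vbnet
Imports System
Imports System.Numerics

Module CtmSketch
    Sub Main()
        ' Hypothetical CTM in effect while an image is drawn:
        ' scale the unit square to 200 x 100 units, then move it to (72, 500).
        Dim drawCTM As Matrix3x2 =
            Matrix3x2.CreateScale(200.0F, 100.0F) * Matrix3x2.CreateTranslation(72.0F, 500.0F)

        ' An identity matrix, i.e. a CTM with no scaling and no translation.
        Dim identityCTM As Matrix3x2 = Matrix3x2.Identity

        ' M31/M32 hold the translation (position), M11/M22 the x/y scale.
        Console.WriteLine("Draw-time CTM: x={0}, y={1}, w={2}, h={3}",
                          drawCTM.M31, drawCTM.M32, drawCTM.M11, drawCTM.M22)
        Console.WriteLine("Identity CTM:  x={0}, y={1}",
                          identityCTM.M31, identityCTM.M32)

        ' The unit square's lower-left corner (0, 0) lands on the image position.
        Dim corner As Vector2 = Vector2.Transform(Vector2.Zero, drawCTM)
        Console.WriteLine("Lower-left corner: {0}", corner)
    End Sub
End Module
```

If the CTM I read in my loop is still an identity matrix, because I read it after processStream has finished rather than while the image is being drawn, that would explain the zeros. This is only my guess and I have not confirmed it.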