I have been using pdftotext.exe to extract text from PDF files, and the accuracy of the extracted text is good. The problem is that I can't tell which text is bold or italic. How can I identify whether a piece of extracted text was bold or italic in the PDF?

I have tried some other tools, such as CSWTestingReflow and PDF parser, but I went with pdftotext.exe for its better text accuracy.

Code:

' Run pdftotext; with no output name given, it writes <input>.txt next to the input file
objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " -layout " & """" & sReadPDF & "_Text.pdf" & """"
''objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " " & """" & sReadPDF & "_Text.pdf" & """"
If fso.FileExists(sReadPDF & "_Text.txt") = True Then
    'Read the text file
    Set adoStreamOut = New ADODB.Stream
    'adoStreamOut.Charset = "utf-8"
    adoStreamOut.Charset = "us-ascii"
    If adoStreamOut.State Then adoStreamOut.Close
    adoStreamOut.Open
    ' Load the same file name that was checked above
    adoStreamOut.LoadFromFile sReadPDF & "_Text.txt"
    sText = adoStreamOut.ReadText
    adoStreamOut.Close
End If

 

DoEvents
sText = Trim(sText)
' Remove form feeds (page breaks)
sText = Trim(Replace(sText, Chr(12), ""))
' Mark the line breaks to keep or treat specially with placeholders
sText = Trim(Replace(sText, "." & vbCrLf, ".|||"))
sText = Trim(Replace(sText, "?" & vbCrLf, "?|||"))
sText = Trim(Replace(sText, "--" & vbCrLf, "||||||"))
sText = Trim(Replace(sText, "-" & vbCrLf, "-|||"))
' Every remaining line break is a soft wrap: join the lines
sText = Trim(Replace(sText, vbCrLf, " "))
' Restore sentence breaks, rejoin hyphenated words, restore dashes
sText = Trim(Replace(sText, ".|||", "." & vbCrLf))
sText = Trim(Replace(sText, "?|||", "?" & vbCrLf))
sText = Trim(Replace(sText, "-|||", ""))
sText = Trim(Replace(sText, "||||||", "--"))
sText = Trim(Replace(sText, "--", "—"))
' Collapse runs of spaces into a single space
Do
    sText = Trim(Replace(sText, "  ", " "))
Loop Until InStr(sText, "  ") = 0
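To show what the cleanup above is meant to do, here is a small sketch of the same placeholder idea on a sample string. The names JoinWrappedLines and DemoJoinWrappedLines are hypothetical and are not part of my project, and the sketch leaves out the "--" handling to stay short.

```vb
' Hypothetical helper (not in the original code): joins lines that were
' wrapped mid-sentence, keeping line breaks that follow "." or "?".
Public Function JoinWrappedLines(ByVal sText As String) As String
    ' Protect line breaks that end a sentence with a placeholder.
    sText = Replace(sText, "." & vbCrLf, ".|||")
    sText = Replace(sText, "?" & vbCrLf, "?|||")
    ' A trailing hyphen is assumed to mark a word split across two lines.
    sText = Replace(sText, "-" & vbCrLf, "-|||")
    ' Every remaining line break is treated as a soft wrap.
    sText = Replace(sText, vbCrLf, " ")
    ' Restore the sentence breaks and rejoin the split words.
    sText = Replace(sText, ".|||", "." & vbCrLf)
    sText = Replace(sText, "?|||", "?" & vbCrLf)
    sText = Replace(sText, "-|||", "")
    ' Collapse runs of spaces into a single space.
    Do While InStr(sText, "  ") > 0
        sText = Replace(sText, "  ", " ")
    Loop
    JoinWrappedLines = Trim$(sText)
End Function

' Hypothetical demo: prints the cleaned sample to the Immediate window.
Public Sub DemoJoinWrappedLines()
    Dim sSample As String
    sSample = "The quick brown fox jum-" & vbCrLf & _
              "ped over the lazy" & vbCrLf & _
              "dog." & vbCrLf & _
              "Is that  all?" & vbCrLf & _
              "Yes"
    Debug.Print JoinWrappedLines(sSample)
End Sub
```

For this sample the result has three lines: "The quick brown fox jumped over the lazy dog.", "Is that all?" and "Yes". One limitation of this approach is that a real hyphenated compound that happens to fall at the end of a line (for example "well-" followed by "known") also loses its hyphen.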

 
