I have been using pdftotext.exe to extract text from PDF files, and the text accuracy is good. The problem is that I can't identify bold and italic text. How can I tell whether a piece of extracted text was bold or italic?


I tried some other tools, such as CSWTestingReflow and PDF parser, but I went with pdftotext.exe for its better text accuracy.








'Run pdftotext.exe with -layout to extract the text
objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " -layout " & """" & sReadPDF & "_Text.pdf" & """"
''objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " " & """" & sReadPDF & "_Text.pdf" & """"

If fso.FileExists(sReadPDF & "_Text.txt") = True Then
    'Read the text file
    Set adoStreamOut = New ADODB.Stream
    'adoStreamOut.Charset = "utf-8"
    adoStreamOut.Charset = "us-ascii"
    If adoStreamOut.State Then adoStreamOut.Close
    'The stream must be open before LoadFromFile can be called
    adoStreamOut.Open
    adoStreamOut.LoadFromFile Replace(sReadPDF, ".pdf", "") & "_Text.txt"
    sText = adoStreamOut.ReadText
    adoStreamOut.Close
End If



sText = Trim(sText)
'Remove form feeds (page breaks)
sText = Trim(Replace(sText, Chr(12), ""))

'Mark line breaks that end a sentence or follow a hyphen with placeholders
sText = Trim(Replace(sText, "." & vbCrLf, ".|||"))
sText = Trim(Replace(sText, "?" & vbCrLf, "?|||"))
sText = Trim(Replace(sText, "--" & vbCrLf, "||||||"))
sText = Trim(Replace(sText, "-" & vbCrLf, "-|||"))

'Join the remaining wrapped lines with a space
sText = Trim(Replace(sText, vbCrLf, " "))

'Restore sentence-ending line breaks, rejoin words hyphenated across lines,
'and turn "--" into an em dash
sText = Trim(Replace(sText, ".|||", "." & vbCrLf))
sText = Trim(Replace(sText, "?|||", "?" & vbCrLf))
sText = Trim(Replace(sText, "-|||", ""))
sText = Trim(Replace(sText, "||||||", "--"))
sText = Trim(Replace(sText, "--", "—"))

'Collapse runs of spaces into a single space
Do
    sText = Trim(Replace(sText, "  ", " "))
Loop Until InStr(sText, "  ") = 0

