Posted May 7, 2014

I have been using pdftotext.exe to extract text from PDF files, and the text accuracy is good. The problem is that I can't identify bold and italic text. How can I tell whether a piece of extracted text was bold or italic? I have tried other tools such as CSWTestingReflow and PDF parser, but I went with pdftotext.exe for its better text accuracy.

Code:

'Run pdftotext with -layout on the PDF
objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " -layout " & """" & sReadPDF & "_Text.pdf" & """"
''objdos.ExecuteCommand """" & App.Path & "\pdftotext.exe" & """" & " " & """" & sReadPDF & "_Text.pdf" & """"

If fso.FileExists(sReadPDF & "_Text.txt") = True Then
    'Read the text file
    Set adoStreamOut = New ADODB.Stream
    'adoStreamOut.Charset = "utf-8"
    adoStreamOut.Charset = "us-ascii"
    If adoStreamOut.State Then adoStreamOut.Close
    adoStreamOut.Open
    adoStreamOut.LoadFromFile Replace(sReadPDF, ".pdf", "") & "_Text.txt"
    sText = adoStreamOut.ReadText
End If
DoEvents

sText = Trim(sText)
'Remove form feeds (page breaks)
sText = Trim(Replace(sText, Chr(12), ""))
'Mark line breaks that follow ".", "?", "--" or "-" with placeholders
sText = Trim(Replace(sText, "." & vbCrLf, ".|||"))
sText = Trim(Replace(sText, "?" & vbCrLf, "?|||"))
sText = Trim(Replace(sText, "--" & vbCrLf, "||||||"))
sText = Trim(Replace(sText, "-" & vbCrLf, "-|||"))
'Join all remaining line breaks into spaces
sText = Trim(Replace(sText, vbCrLf, " "))
'Restore the line breaks after "." and "?"
sText = Trim(Replace(sText, ".|||", "." & vbCrLf))
sText = Trim(Replace(sText, "?|||", "?" & vbCrLf))
'Rejoin words hyphenated across lines, then turn "--" into a dash
sText = Trim(Replace(sText, "-|||", ""))
sText = Trim(Replace(sText, "||||||", "--"))
sText = Trim(Replace(sText, "--", "—"))
'Collapse runs of spaces into a single space
Do
    sText = Trim(Replace(sText, "  ", " "))
Loop Until InStr(sText, "  ") = 0