C# PDF to Text Converter
C# Console Application using PDFBox 0.7.3 Converter DLL
This article covers using the PDFBox 0.7.3 DLLs, available on SourceForge, to convert PDF files to text. The DLL only works with files at PDF 1.4 (Acrobat 5.x) or lower. For newer files, a freeware PDF printer such as Bullzip Free PDF Printer can reprint the PDF at a version the DLL can read.
Creating the C# Console Application PDF Converter
Download the DLL files from SourceForge here: PDFBox 0.7.3
The required files are in the bin folder of the zip file:
IKVM.GNU.Classpath.dll
IKVM.Runtime.dll
FontBox-0.1.0-dev.dll
PDFBox-0.7.3.dll
Create a new Console application in Visual Studio
Copy the 4 PDFBox DLL files to the Debug folder of your new project.
Right-click References, choose Add Reference, and go to the Browse tab. Select and add the IKVM.GNU.Classpath.dll and PDFBox-0.7.3.dll files.
Add the following statements to the top of the file.
using System.IO;
using org.pdfbox.pdmodel;
using org.pdfbox.util;
Create the code to utilize the classes and methods of the PDFBox DLL with a StreamWriter.
class Program
{
/// <summary>
/// PDF parser to extract text only from a document.
/// </summary>
[STAThread]
static void Main(string[] args)
{
try
{
DateTime start = DateTime.Now;
if (args.Length < 2)
{
Console.WriteLine("Please use: PDF_Parser <PDF input filename> <Text output filename>");
return;
}
parsePDF(args[0], args[1]);
Console.WriteLine("Complete. " + (DateTime.Now - start));
Console.ReadLine();
}
catch (Exception ex)
{
Console.WriteLine("There were errors: {0}", ex.Message);
}
}
Add code to create the function parsePDF which uses the classes PDDocument and PDFTextStripper to extract the text from the PDF file.
public static void parsePDF(string pdfIn, string txtOut)
{
// The using block closes and disposes the writer, even if an exception is thrown.
using (StreamWriter sw = new StreamWriter(txtOut, false))
{
PDDocument doc = null;
try
{
sw.WriteLine("Begin Parsing.....");
sw.WriteLine(DateTime.Now.ToString());
doc = PDDocument.load(pdfIn);
PDFTextStripper stripper = new PDFTextStripper();
sw.Write(stripper.getText(doc));
}
catch (Exception ex)
{
Console.WriteLine(ex.Message + ": " + ex.StackTrace);
}
finally
{
// Release the resources held by the loaded PDF document.
if (doc != null)
{
doc.close();
}
}
}
}
} // end of class Program
Build the application and test it in a command window (this example uses Federal Tax Form i8863.pdf). Note that the PDF cannot be a scanned image; a scanned file must first be processed with OCR in the full version of Adobe Acrobat or other software.
The program executes without errors.
The text file is generated without issues.
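A scanned PDF has no text layer, so extraction usually yields little or no text rather than an error. The following is a minimal sketch of a check for that case, assuming that empty or whitespace-only output indicates an image-only document. That is a heuristic, not a guarantee, and the class and method names are hypothetical.

```csharp
using System;

public static class ExtractionCheck
{
    // Returns true when the extracted text contains no visible characters.
    // Assumption: empty output suggests a scanned (image-only) PDF that needs OCR.
    public static bool IsEffectivelyEmpty(string extractedText)
    {
        return extractedText == null || extractedText.Trim().Length == 0;
    }
}
```

Inside parsePDF, the string returned by stripper.getText(doc) could be passed to IsEffectivelyEmpty before it is written, with a console warning printed when the result is true.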
C# PDF Converter Issue with Acrobat Version
Some PDF files do not work with the PDFBox DLL because they use a PDF version newer than PDF 1.4 (Acrobat 5.x), the latest version available when this DLL was released.
This example uses an incompatible PDF file of PDF 1.7 (Acrobat 8.x) format and attempts to convert it to text.
An error is thrown and the file is not converted. A look at the Document Properties in Acrobat shows that the file is PDF version 1.7 (Acrobat 8.x).
If you have the full edition of Adobe Acrobat, you can print to PDF or Save As an Optimized PDF and change the PDF version in the process. If not, an excellent freeware PDF printer is available at Bullzip.com.
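The version can also be checked in code before attempting a conversion, since a PDF file begins with a header of the form %PDF-x.y. The sketch below reads that header and compares it with the 1.4 limit described in this article. The class and method names are hypothetical, and it assumes the header is authoritative; a document can also declare a different version in its catalog, which this check does not read.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

public static class PdfVersionCheck
{
    // Reads the "%PDF-x.y" header at the start of the file and returns "x.y",
    // or null when the file does not begin with a PDF header.
    public static string ReadPdfVersion(string path)
    {
        byte[] buffer = new byte[8]; // "%PDF-1.4" is 8 bytes long
        int read = 0;
        using (FileStream fs = File.OpenRead(path))
        {
            // Read may return fewer bytes than requested, so loop until full or EOF.
            while (read < buffer.Length)
            {
                int n = fs.Read(buffer, read, buffer.Length - read);
                if (n == 0) break;
                read += n;
            }
        }
        string header = Encoding.ASCII.GetString(buffer, 0, read);
        if (read < buffer.Length || !header.StartsWith("%PDF-", StringComparison.Ordinal))
        {
            return null;
        }
        return header.Substring(5, 3);
    }

    // True when the version parses as a number at or below 1.4.
    public static bool IsSupportedByPdfBox073(string version)
    {
        decimal v;
        return decimal.TryParse(version, NumberStyles.AllowDecimalPoint,
                                CultureInfo.InvariantCulture, out v) && v <= 1.4m;
    }
}
```

In Main, calling ReadPdfVersion(args[0]) before parsePDF would allow the program to print a hint about reprinting the file at a lower compatibility level instead of failing during extraction.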
Print from the existing PDF File and select the Bullzip printer (after installing it, of course).
After clicking OK, another dialog box will come up requesting information about saving the file: name of the file, location to save and the compatibility level, which can be changed to a level below 1.7.
A look at the Document Properties from within the Adobe Acrobat Reader shows that the new file created has the proper PDF version for conversion.
Running the console application again, this time on the newly converted PDF file, succeeds.
With the DLL files, a C# console application, and a PDF printer, converting PDF files to text is an easy task.