C# PDF to Text Converter

C# Console Application using PDFBox 0.7.3 Converter DLL

This article shows how to use the PDFBox 0.7.3 DLLs, distributed on SourceForge, to convert PDF files to text. The DLLs only work with files at PDF 1.4 (Acrobat 5.x) or lower. For newer files, a freeware PDF printer such as Bullzip Free PDF Printer can reprint the PDF at a level the DLLs can use.

Creating the C# Console Application PDF Converter

Download the DLL files from SourceForge here: PDFBox 0.7.3

The required files are in the bin folder of the zip file:
IKVM.GNU.Classpath.dll
IKVM.Runtime.dll
FontBox-0.1.0-dev.dll
PDFBox-0.7.3.dll

Create a new Console Application project in Visual Studio.

VS New Project

Copy the 4 PDFBox DLL files to the Debug folder of your new project.

Right-click References, choose Add Reference, and go to the Browse tab. Select and add the IKVM.GNU.Classpath.dll and PDFBox-0.7.3.dll files.

Set References

Add the following statements to the top of the file.

using System.IO;
using org.pdfbox.pdmodel;
using org.pdfbox.util;

Create the code to utilize the classes and methods of the PDFBox DLL with a StreamWriter.

class Program
{
  /// <summary>
  /// PDF parser to extract text only from a document.
  /// </summary>

  [STAThread]
  static void Main(string[] args)
  {
    try
    {
      DateTime start = DateTime.Now;
      if (args.Length < 2)
      {
        Console.WriteLine("Please use: PDF_Parser <PDF input filename> <Text output filename>");
        return;
      }
      parsePDF(args[0], args[1]);
      Console.WriteLine("Complete. " + (DateTime.Now - start));
      Console.ReadLine();
    }
    catch (Exception ex)
    {
      Console.WriteLine("There were errors: {0}", ex.Message);
    }
  }

Add code to create the function parsePDF which uses the classes PDDocument and PDFTextStripper to extract the text from the PDF file.

  public static void parsePDF(string pdfIn, string txtOut)
  {
    PDDocument doc = null;
    // The using block flushes and disposes the writer even if parsing fails.
    using (StreamWriter sw = new StreamWriter(txtOut, false))
    {
      try
      {
        sw.WriteLine("Begin Parsing.....");
        sw.WriteLine(DateTime.Now.ToString());
        doc = PDDocument.load(pdfIn);
        PDFTextStripper stripper = new PDFTextStripper();
        sw.Write(stripper.getText(doc));
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message + ": " + ex.StackTrace);
      }
      finally
      {
        // Release the resources held by the loaded PDF document.
        if (doc != null)
        {
          doc.close();
        }
      }
    }
  }
} // closes class Program, opened above Main

Build the application and test it in a command window (this example uses Federal Tax Form i8863.pdf). Note that the PDF cannot be a scanned image. A scanned document must first be run through OCR, in the full version of Adobe Acrobat or other software, before its text can be extracted.

i8863

The program executes without errors.

Command Window

The text file is generated without issues.

i8863.txt
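A quick way to confirm that text really was extracted is to look past the two header lines that parsePDF writes ("Begin Parsing....." and the timestamp). The sketch below is an illustration only: the class name OutputCheck and the method HasExtractedText are hypothetical, and an empty result is just a hint that the PDF may be a scanned image, since extraction could come back empty for other reasons too.

```csharp
using System;
using System.IO;

static class OutputCheck
{
    // parsePDF writes two header lines before the extracted text,
    // so those lines are skipped when looking for content.
    private const int HeaderLines = 2;

    // Returns true when at least one non-blank line follows the header.
    public static bool HasExtractedText(string[] lines)
    {
        for (int i = HeaderLines; i < lines.Length; i++)
        {
            if (lines[i].Trim().Length > 0)
            {
                return true;
            }
        }
        return false;
    }

    // Convenience overload that reads the generated text file from disk.
    public static bool HasExtractedText(string txtPath)
    {
        return HasExtractedText(File.ReadAllLines(txtPath));
    }
}
```

Calling OutputCheck.HasExtractedText with the output filename after parsePDF returns would flag files that are worth opening by hand.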

C# PDF Converter Issue with Acrobat Version

Some PDF files do not work with the PDFBox DLL because they were created in a PDF version newer than PDF 1.4 (Acrobat 5.x), the latest version in use when this DLL was released.

This example takes an incompatible file in PDF 1.7 (Acrobat 8.x) format and attempts to convert it to text.

PDF 1.7 Format

An error is thrown and the file is not converted. The Document Properties dialog in Acrobat shows that the file is PDF version 1.7 (Acrobat 8.x).

IRS Form f8863 Acrobat 8.x
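The version can also be checked in code before a conversion is attempted, because a PDF file normally starts with a header such as %PDF-1.7. The following is a minimal sketch under that assumption. The class name PdfVersionReader and its methods are hypothetical, and the 1.4 limit is the one cited in this article. The header is only a quick check, since the document catalog of a PDF can declare a different version.

```csharp
using System;
using System.IO;
using System.Text;

static class PdfVersionReader
{
    // Extracts "1.7" from a header such as "%PDF-1.7"; returns null if no header is found.
    public static string ParseHeader(string header)
    {
        const string marker = "%PDF-";
        if (header == null) return null;
        int pos = header.IndexOf(marker, StringComparison.Ordinal);
        if (pos < 0) return null;
        int start = pos + marker.Length;
        int end = start;
        while (end < header.Length && (char.IsDigit(header[end]) || header[end] == '.'))
        {
            end++;
        }
        return end > start ? header.Substring(start, end - start) : null;
    }

    // Reads only the first bytes of the file, where the header normally sits.
    public static string ReadVersion(string pdfPath)
    {
        byte[] buffer = new byte[32];
        using (FileStream fs = File.OpenRead(pdfPath))
        {
            int read = fs.Read(buffer, 0, buffer.Length);
            return ParseHeader(Encoding.ASCII.GetString(buffer, 0, read));
        }
    }

    // True when the version is at or below the PDF 1.4 limit cited in this article.
    public static bool IsSupported(string version)
    {
        Version parsed;
        return version != null
            && Version.TryParse(version, out parsed)
            && parsed <= new Version(1, 4);
    }
}
```

In Main, a call such as PdfVersionReader.IsSupported(PdfVersionReader.ReadVersion(args[0])) could be used to print a clearer message before parsePDF is called.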

If you have the full edition of Adobe Acrobat, you can print to PDF or save as an optimized PDF and lower the PDF version in the process. If not, an excellent freeware PDF printer is available at Bullzip.com.

Print from the existing PDF File and select the Bullzip printer (after installing it, of course).

Bullzip PDF Printer

After clicking OK, another dialog box will come up requesting information about saving the file: name of the file, location to save and the compatibility level, which can be changed to a level below 1.7.

Bullzip PDF Format Selector

A look at the Document Properties from within the Adobe Acrobat Reader shows that the new file created has the proper PDF version for conversion.

Converted File to PDF 1.4

This time, running the console application against the newly converted PDF file succeeds.

Command Window Successfully Converts File
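Once single files convert cleanly, the same method could be applied to a whole folder. The sketch below is one possible approach, not part of the original application. The class name BatchConverter and the method ConvertFolder are hypothetical, and the converter is passed in as a delegate so that parsePDF (or any other method taking an input and an output path) can be plugged in.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class BatchConverter
{
    // Calls convert(pdfPath, txtPath) for every .pdf file in inputDir and
    // returns the text file paths it asked the converter to produce.
    public static List<string> ConvertFolder(string inputDir, string outputDir,
                                             Action<string, string> convert)
    {
        Directory.CreateDirectory(outputDir); // does nothing if the folder already exists
        List<string> outputs = new List<string>();
        foreach (string path in Directory.GetFiles(inputDir))
        {
            // Compare the extension without regard to case so .PDF files are included.
            if (!string.Equals(Path.GetExtension(path), ".pdf",
                               StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }
            string txtPath = Path.Combine(outputDir,
                Path.GetFileNameWithoutExtension(path) + ".txt");
            convert(path, txtPath);
            outputs.Add(txtPath);
        }
        return outputs;
    }
}
```

With the console application from this article, a call might look like BatchConverter.ConvertFolder(inputFolder, outputFolder, Program.parsePDF), where inputFolder and outputFolder are placeholders for your own paths.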

With the DLL files, a C# console application, and a PDF printer, converting PDF files to text is an easy task.