Extract From Pdf Files


Bonnie Bilski shares a tip on extracting image data from a PDF to use in another file type.
'Sometimes clients or vendors provide you with PDF files instead of DWG files, and you need to get the data out of the PDF and into your drawing or report. The following steps explain how to extract an image from a PDF document so you can place it in an AutoCAD drawing file, a Word document, a presentation, etc.
1. You must have Adobe Acrobat installed on your hard drive.
2. Once you have opened the PDF, choose Tools > Select & Zoom > Snapshot Tool.
3. Click the upper-left corner of the image or figure you want to copy, and hold the mouse button down while you drag a rectangle around it. Release the mouse button and Acrobat will copy the selected area to the clipboard.
4. Paste the image or figure into an AutoCAD drawing file or other type of file.
You can manipulate the sharpness of the pasted image by increasing or reducing the magnification of the image.'
Notes from Cadalyst Tip Patrol: PDF is a common file format used throughout the design world. PDF files aren't that easy to work with, however, if you need to use the data contained in them.
Using proprietary software that can edit PDF files, like Adobe Acrobat, is a great way to go. Not only can you select portions of the PDF file and copy them, you can also extract image files from the PDF. You can even convert PDF files into image types such as JPG or TIF. If you lack access to Adobe Acrobat, AutoCAD can reference PDF files directly into your model or drawing as an overlay. If the PDF was made from another AutoCAD file with AutoCAD's print-to-PDF tool, the linework will retain its layering and you can snap to the objects.
Plotting AutoCAD files to PDF using AutoCAD's tools is a great way to protect your data when you send it to outside groups; it provides the data in a form others can work with, but protects it at the same time. You might also want to try DWF files.

Split a PDF file into pieces or pick just a few pages. To extract images from a PDF, first upload the document to PDF Candy: hit the “Add file” button to select the file on your device, or drag and drop the PDF into the browser window. As soon as the file has finished loading, image extraction starts automatically.

This article shows how to extract data from PDF files, including text, images, audio, and video, using C#. PDF has become the standard format for document exchange, and PDF documents are well suited to reliable viewing and printing of business documents. Almost all office software, such as Microsoft Office, LibreOffice, and OpenOffice.org, has integrated the PDF format and implemented the very useful “Export to PDF” feature. So exporting to a PDF file is now very easy, but what about the inverse process?
Let’s say you’ve received a document in PDF format and want to extract some information from it. At first glance, the task seems easy: copy from the source document and paste somewhere else. But things become complicated when you’re dealing with a lot of data, and doing this by hand quickly becomes tedious. In that case, it’s appropriate to use dedicated tools or specialized frameworks to automate the whole job. Not only will they improve your productivity, they will also save you time. This article has three main sections:

Extract PDF data from tables

1. Extract data from PDF tables with Adobe Acrobat Pro DC
2. Extract data manually with Adobe Reader
3. Extracting data from PDF tables using C#

4. Export PDF table to CSV format with C#
5. Extract PDF table column with C#
6. PDF table To JSON using C#
7. Extract PDF table to XML using C#

Extract data from scanned documents / OCR

1. Extract data with Adobe Acrobat DC
2. Extract data from scanned document with poor quality of printing and handwriting note
3. Extract data with OCR from scanned documents using C# and PDF Extractor SDK

Extract rich media contents

1. Extract rich media contents with Adobe Acrobat DC
2. Extract rich media contents from PDF with PDF Extractor SDK and C#
3. Extract audio file mp3 from PDF document with PDF extractor SDK and C#
4. Extract video file from PDF document with PDF extractor SDK and C#
5. Extract images from PDF file using Adobe Acrobat DC
6. Extract images from PDF file using C#
7. Extract embedded documents in PDF file

Extract PDF data from tables

Extract data from pdf tables with Adobe Acrobat Pro DC

As its name implies, Adobe Acrobat is a commercial application made by Adobe; it is the original and official software for working with PDF files. You can download the 7-day trial version at https://acrobat.adobe.com/us/en/free-trial-download.html. At the time of writing, the released version is Adobe Acrobat Pro DC 2015 Release.

You also need to download our case study file (sample1). Its content looks like this:

The table contains daily historical Microsoft and Facebook stock prices and volumes from the Nasdaq public website.

We need to manually extract the table’s content and export it to different formats such as CSV and TXT.

Step 1: Open the PDF file

In Adobe Acrobat Pro DC > File > Open

Step 2: Locate the table from which you want to extract data and drag a selection over the table as shown below

Step 3: Right click and select “Export Selection As…”

Step 4: Choose the export type


Adobe Acrobat Pro DC can handle up to 8 different formats:

  • Word Document (*.docx)
  • Word 97-2003 Document (*.doc)
  • Excel Workbook (*.xlsx)
  • PowerPoint Presentation (*.pptx)
  • Rich Text Format (*.rtf)
  • XML Spreadsheet 2003 (*.xml)
  • HTML (*.html, *.htm)
  • Comma Separated Values (*.csv)

The exported CSV file looks like


'Date','Open','High','Low','Close / Last','Volume'
2017-01-04T00:00:00.000,62.48,62.75,62.12,62.3,21325140
2017-01-03T00:00:00.000,62.79,62.84,62.125,62.58,20655190
2016-12-30T00:00:00.000,62.96,62.99,62.03,62.14,25575720
2016-12-29T00:00:00.000,62.86,63.2,62.73,62.9,10248460
2016-12-28T00:00:00.000,63.4,63.4,62.83,62.99,14348340
2016-12-27T00:00:00.000,63.21,64.07,63.21,63.28,11743650
2016-12-23T00:00:00.000,63.45,63.54,62.8,63.24,12399540
2016-12-22T00:00:00.000,63.84,64.1,63.405,63.55,22175270
2016-12-21T00:00:00.000,63.43,63.7,63.12,63.54,17084370
2016-12-20T00:00:00.000,63.69,63.8,63.025,63.54,26017470
2016-12-19T00:00:00.000,62.56,63.77,62.42,63.62,34318500
2016-12-16T00:00:00.000,62.95,62.95,62.115,62.3,42452660

Adobe Acrobat Pro is the most powerful tool for manipulating PDF files. In short, you can do almost anything you want with a PDF file, apart from some limitations that we will see in the section on rich media content.

Extract data manually with Adobe Reader

Adobe Reader is simple software for reading PDF files. It has some limitations compared to its counterpart, Adobe Acrobat Pro. However, you can do basic things like copying a table’s contents and pasting them into your favorite spreadsheet app.

Step 1: Open the file with Adobe Reader

Step 2: Select the table’s content by dragging any desired rows and columns

Step 3: Open your favorite spreadsheet app and paste the selection into it. We’re using LibreOffice Calc in this article.

As seen in the figure below, we have to define the column delimiter in order to display the content correctly.

Step 4: Click OK

Using our spreadsheet software, we can then export to many other formats. In our case, LibreOffice gives us 15 available formats.

Adobe Reader is not as flexible as Adobe Acrobat Pro; it has no export features. Its main uses are viewing, printing, and filling out PDF documents.

The two previous sections show two ways to manually extract data from tables. Both work well and are very useful for small workloads. The next section shows how to extract data from PDF tables using programming tools. We will focus mainly on PDF Extractor SDK.

Extracting data from PDF tables using C#

Prerequisites

In order to run all the following programs, you have to install the PDF Extractor SDK. You can download it at https://bytescout.com/products/developer/pdfsdk/index.html

PDF Extractor SDK (https://bytescout.com/products/developer/pdfextractorsdk/index.html) is one of Bytescout’s products. It allows developers to extract data from PDF files and convert it to other formats. It is important to know that this requires no additional software, unlike the Adobe SDK, which requires Adobe Acrobat to be installed.

After installing PDF Extractor SDK, all required DLLs can be found in the folder C:\Program Files\Bytescout PDF Extractor SDK

.NET Compatibility

PDF Extractor SDK supports the following .NET Frameworks:

  • .NET Framework 2.0
  • .NET Framework 3.5 / .NET Framework 3.5 Client Profile
  • .NET Framework 4.0 / .NET Framework 4.0 Client Profile

You then need to add a reference to the Bytescout.PDFExtractor.dll library.

How does PDF Extractor SDK work?

Prior to any data extraction, we need to locate the target table among all the tables in the PDF document. This is done by the Bytescout.PDFExtractor.TableDetector object, which can loop over the existing tables in the document.

The program below shows how to locate the N-th table (targetTableNumber variable) on the P-th page (targetPageNumber variable) of the whole PDF document.

Filters:

The TableDetector class offers some useful properties to filter the search:

  • DetectionMinNumberOfColumns
  • DetectionMinNumberOfRows

After locating the right table, we want to gather some data from it. This is achieved with an instance of an extractor class: CSVExtractor, TextExtractor, JSONExtractor, XLSExtractor, and so on.

Export PDF table to CSV format with C#

We need to export the first PDF table of our case study document to CSV format. The previous program is updated as follows:

Once the table is located, we create an extractor object and define the area from which we want to extract data. The final CSV looks like this:


'Date';'Open';'High';'Low';'Close / Last';'Volume';
'01/04/2017';'117.55';'119.66';'117.29';'118.69';'19,594,560';
'01/03/2017';'116.03';'117.84';'115.51';'116.86';'20,635,600';
'12/30/2016';'116.595';'116.83';'114.7739';'115.05';'18,668,290';
'12/29/2016';'117';'117.531';'116.06';'116.35';'9,925,082';
'12/28/2016';'118.19';'118.25';'116.65';'116.92';'11,985,740';
'12/27/2016';'116.96';'118.68';'116.864';'118.01';'12,034,590';
'12/23/2016';'117';'117.56';'116.3';'117.27';'10,885,030';
'12/22/2016';'118.86';'118.99';'116.93';'117.4';'16,226,770';
'12/21/2016';'118.92';'119.2';'118.48';'119.04';'10,747,610';
'12/20/2016';'119.5';'119.77';'118.8';'119.09';'13,673,570';
'12/19/2016';'119.85';'120.36';'118.51';'119.24';'15,871,360';
'12/16/2016';'120.9';'121.5';'119.27';'119.87';'25,316,220';
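
Once the CSV is on disk, the values usually have to be converted to typed data before they can be used. The sketch below is one possible way to do that for a line in the format shown above. It uses only the .NET base class library, not PDF Extractor SDK. The StockRow and CsvRowParser names are ours, and the code assumes US-style dates (MM/dd/yyyy), a dot as decimal separator, and a comma as thousands separator, as in the sample output.

```csharp
using System;
using System.Globalization;

// One row of the stock table: date, closing price and traded volume.
// (Hypothetical type introduced for this example.)
public class StockRow
{
    public DateTime Date;
    public decimal Close;
    public long Volume;
}

public static class CsvRowParser
{
    // Parses one data line such as
    // '01/04/2017';'117.55';'119.66';'117.29';'118.69';'19,594,560';
    public static StockRow ParseLine(string line)
    {
        // Split on the semicolon delimiter and drop the empty field
        // produced by the trailing semicolon. Note that this also drops
        // empty cells, so a row with a missing value fails the check below.
        string[] fields = line.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        if (fields.Length != 6)
            throw new FormatException("Expected 6 fields but found " + fields.Length);

        // Strip the surrounding quote characters (single or double).
        for (int i = 0; i < fields.Length; i++)
            fields[i] = fields[i].Trim('\'', '"', ' ');

        StockRow row = new StockRow();
        row.Date = DateTime.ParseExact(fields[0], "MM/dd/yyyy", CultureInfo.InvariantCulture);
        row.Close = decimal.Parse(fields[4], NumberStyles.Number, CultureInfo.InvariantCulture);
        // AllowThousands accepts the commas in values like 19,594,560.
        row.Volume = long.Parse(fields[5], NumberStyles.AllowThousands, CultureInfo.InvariantCulture);
        return row;
    }
}

public static class Program
{
    public static void Main()
    {
        StockRow row = CsvRowParser.ParseLine(
            "'01/04/2017';'117.55';'119.66';'117.29';'118.69';'19,594,560';");
        // Prints: 2017-01-04 close=118.69 volume=19594560
        Console.WriteLine(row.Date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)
            + " close=" + row.Close.ToString(CultureInfo.InvariantCulture)
            + " volume=" + row.Volume.ToString(CultureInfo.InvariantCulture));
    }
}
```

Parsing with CultureInfo.InvariantCulture keeps the result independent of the regional settings of the machine that runs the program, which matters because many locales use a comma as the decimal separator.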

Extract PDF table column with C#

The next program shows how to extract a specific column from a given table.

The class Bytescout.PDFExtractor.TextExtractor is used to locate a specific text pattern in the PDF document. Then we define the extraction area and finally save the column content in a text file.

The Bytescout.PDFExtractor.TextExtractor class is not limited to PDF files; it can also locate and extract text from PNG, JPEG, BMP, and TIFF files.

We also need to add the System.Drawing assembly to our project, because we’re using the RectangleF class.

The content of the result file looks like:

Close / Last
62.3
62.58
62.14
62.9
62.99
63.28
63.24
63.55
63.54
63.54
63.62
62.3

More generally, the class Bytescout.PDFExtractor.TextExtractor uses a rectangular region called the extraction area, which is defined by four parameters:

  • the left and top coordinates locate the top-left corner of the extraction area
  • the width sets the width of the extraction area
  • the height specifies the height of the extraction area

Only text inside the extraction area is gathered during the extraction phase.
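
To make the four parameters concrete, here is a minimal sketch that uses only System.Drawing.RectangleF, without the SDK. The coordinates are invented for illustration and are not measured from the sample PDF, and the two fragment rectangles are hypothetical bounding boxes of pieces of text. The containment test is our illustration of the rule above, not necessarily how the SDK decides internally.

```csharp
using System;
using System.Drawing;

public static class Program
{
    public static void Main()
    {
        // Extraction area: left = 400, top = 100, width = 80, height = 300.
        // These numbers are illustrative only.
        RectangleF area = new RectangleF(400f, 100f, 80f, 300f);

        // Hypothetical bounding boxes of two text fragments on the page.
        RectangleF insideFragment = new RectangleF(410f, 120f, 40f, 12f);
        RectangleF outsideFragment = new RectangleF(50f, 120f, 60f, 12f);

        // Contains(RectangleF) is true only when the whole fragment
        // lies within the area.
        Console.WriteLine(area.Contains(insideFragment));   // True
        Console.WriteLine(area.Contains(outsideFragment));  // False
    }
}
```

A narrow, tall rectangle like this one is the typical shape for picking a single column out of a table.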

Parsing PDF table cell by cell with C# PDF API

With PDF Extractor SDK, we can navigate through the table’s cells using the Bytescout.PDFExtractor.StructuredExtractor class, enumerating them like the elements of a matrix. The following program shows how to do that:

The program output is

PDF table To JSON using C#

The following program shows how to extract data from a PDF table and save it as a JSON file using the Bytescout.PDFExtractor.JSONExtractor class. In addition to the actual cell content, we can also retrieve some metadata (such as font name, font size, font style, and position).

The result looks like

Extract PDF table to XML using C#

The same process as exporting to JSON applies here. Instead of the JSONExtractor class, we use the XMLExtractor class. To save the XML to the file system, we call the method XMLExtractor.SaveXMLToFile.

Extract data from scanned documents / OCR

The rest of this article is about extracting data from scanned documents and OCR capabilities. We’ll see how to extract data with Adobe Acrobat DC and we’ll also see how to handle the data extraction process using C# and PDF Extractor SDK.

You can find below five scanned files that we’re going to use.

– scan_sample1_600dpi_normal.pdf : document scanned at 600 dpi

– scan_sample1_600dpi_handwritingnote.pdf : document scanned at 600 dpi with a handwritten note at the bottom of the page

– scan_sample1_70dpi_handwritingnote.pdf : document scanned at 70 dpi with a handwritten note

– scan_sample1_600dpi_badqualityprinting.pdf : printed with a very poor quality and scanned at 600dpi

– scan_sample1_600dpi_badorientation.pdf : scanned at 600dpi with bad orientation during the scan process

Extract data with Adobe Acrobat DC

When we open the scanned PDF file with Adobe Acrobat DC, it automatically tries to convert the page to editable content and displays the message below.

Once the pattern recognition is done, each cell of our table becomes editable. More generally, Adobe Acrobat DC has a powerful built-in OCR engine that automatically detects characters and text inside the scanned page.

We can then export the modified document to many other formats.

Go to File > Export To >

You can download the “Text (Plain)” version of our document here.

Extract data from scanned document with poor quality of printing and handwriting note

The corresponding demo file is scan_sample1_70dpi_handwritingnote.pdf. This file was scanned at a low resolution of 70 dpi and has a handwritten note at the bottom of the table.

Despite the low accuracy of OCR at 70 dpi, most of the data was reconstructed well. However, the OCR had trouble detecting all cell borders; the result file is available here. Pattern recognition on the handwritten note also failed.

The same process applied to the same document scanned at 600 dpi is very accurate. The extraction performed well and all of the cell data was gathered successfully (here).

The file scan_sample1_600dpi_badqualityprinting.pdf (scanned at 600 dpi with poor ink quality) failed the extraction process completely. Adobe Acrobat DC didn’t recognize any patterns and treated it as a blank page.

The last file we try to extract is scan_sample1_600dpi_badorientation.pdf. What makes this file special is its bad orientation during scanning.

However, Adobe Acrobat automatically adjusts the page orientation during the pattern recognition process, as seen in the figure below.

The data extraction result is available here. We can see that almost all of the data was retrieved correctly.

Extract data with OCR from scanned documents using C# and PDF Extractor SDK

The following program extracts data from the pdf document file scanned at 600dpi under normal conditions

For each pattern, the OCR engine sets a property named @OCRConfidence, which indicates how good or bad the recognition was: the higher the value, the more accurate the result. The OCR engine also returns the predicted font name, the size, the coordinates of the data, its width and height in the PDF document, and of course the text value.
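
A confidence value is most useful as a filter: anything below a chosen threshold gets checked by hand. The sketch below illustrates that idea with plain C# objects. It does not read the SDK's output; the OcrItem and OcrReview types, the sample values, and the 0 to 100 scale with a threshold of 80 are all assumptions made for this example, so check the SDK documentation for the real scale of @OCRConfidence.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one recognized item; the SDK's own output
// format is not reproduced here.
public class OcrItem
{
    public string Text;
    public double Confidence;

    public OcrItem(string text, double confidence)
    {
        Text = text;
        Confidence = confidence;
    }
}

public static class OcrReview
{
    // Returns the items whose confidence is below the threshold so that
    // they can be checked by hand.
    public static List<OcrItem> NeedsReview(IEnumerable<OcrItem> items, double threshold)
    {
        List<OcrItem> flagged = new List<OcrItem>();
        foreach (OcrItem item in items)
        {
            if (item.Confidence < threshold)
                flagged.Add(item);
        }
        return flagged;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up recognition results, assuming a 0-100 confidence scale.
        List<OcrItem> items = new List<OcrItem>();
        items.Add(new OcrItem("62.58", 96.0));
        items.Add(new OcrItem("6Z.14", 41.5));
        items.Add(new OcrItem("Volume", 88.0));

        foreach (OcrItem item in OcrReview.NeedsReview(items, 80.0))
            Console.WriteLine(item.Text);   // prints 6Z.14
    }
}
```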

The figure below shows a partial output

The OCR uses a set of language libraries (located at C:\Program Files\Bytescout PDF Extractor SDK\net4.00\tessdata); the default installation contains four languages: English, German, French, and Spanish. The property JSONExtractor.OCRLanguageDataFolder is set to match the actual language of the document. The OCR process is active when the property JSONExtractor.OCRMode is set to anything other than OCRMode.Off. We can also add preprocessing filters through the OCRImagePreprocessingFilters property of the extractor object to help the OCR engine recognize patterns better. The following methods can be combined: AddContrast(), AddDeskew(), AddDilate(), AddGammaCorrection(), AddHorizontalLinesRemover(), AddMedian(), and AddVerticalLinesRemover(). Depending on the case, processing time may vary from a few seconds to a minute or more per page.

Extract rich media contents

Extract rich media contents with Adobe Acrobat DC

In the Adobe Acrobat glossary, rich media are audio and video content. This can be a 3D animation, an audio file, a Flash SWF animation, or a video file in an H.264-compliant format. At the time of writing, extracting rich media content isn’t a supported feature; this is one of the major gaps in Adobe Acrobat, and you have to use a third-party tool to do it. Fortunately, PDF Extractor SDK can do the job with only a few lines of code, as the next section shows.

Extract rich media contents from PDF with PDF Extractor SDK and C#

PDF content is not limited to text. Most of the time, PDF documents contain pictures or documents, and even more complex objects such as audio or video files may be embedded in them.
The following example shows how to extract such objects using PDF Extractor SDK.
The basic steps are:
1- Create the extractor object. Its type depends on the kind of object we’re going to extract.
2- Locate the object in the document. We can loop over the existing objects to find the index of the target object.
3- Call the appropriate method of the extractor object from step 1 to extract the data. The extractor object also exposes some useful properties of the data, such as file type and data size.
4- Save the extracted data to the file system.
For more details about supported rich media content, please visit the official Adobe Acrobat help page https://helpx.adobe.com/acrobat/using/rich-media.html

Extract audio file mp3 from PDF document with PDF extractor SDK and C#

The Bytescout.PDFExtractor.MultimediaExtractor is the most suitable component to extract an embedded audio file from a PDF document.
The file Pdf_with_mp3.pdf contains an audio mp3 object. We can extract the audio file using the following lines of code


Note: The file extension was originally “.mp3”; however, the method MultimediaExtractor.GetCurrentAudioExtension() returns the “.mpa” file extension.

Some other useful methods are MultimediaExtractor.GetCurrentAudioBytesSize(), which returns the actual file size; MultimediaExtractor.GetDocumentAudioCount(), which returns the number of embedded audio files in the document; and MultimediaExtractor.GetNextAudio(), which moves to the next audio file.
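
Because of the extension mismatch described in the note, it can be convenient to map “.mpa” back to “.mp3” when building the output file name. The helper below is a small sketch that uses only System.IO; AudioFileNames and BuildName are names we made up, and the mapping only makes sense when you know the embedded audio really is MP3, as with our sample file.

```csharp
using System;
using System.IO;

public static class AudioFileNames
{
    // Builds an output file name, mapping the ".mpa" extension reported for
    // the sample back to ".mp3". Other extensions are kept unchanged.
    public static string BuildName(string baseName, string reportedExtension)
    {
        string extension = reportedExtension;
        if (string.Equals(extension, ".mpa", StringComparison.OrdinalIgnoreCase))
            extension = ".mp3";
        return Path.ChangeExtension(baseName, extension);
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(AudioFileNames.BuildName("audio1", ".mpa")); // audio1.mp3
        Console.WriteLine(AudioFileNames.BuildName("audio2", ".wav")); // audio2.wav
    }
}
```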

Extract video file from PDF document with PDF extractor SDK and C#

You can extract any embedded video files using the following steps.
1- Create an instance of the Bytescout.PDFExtractor.MultimediaExtractor class to grab the video file
2- Save the file to disk using the method MultimediaExtractor.SaveCurrentVideoToFile.

The following program shows how to extract the embedded video file in Pdf_with_video.pdf (it must be H.264 compliant, otherwise extraction will fail) and save it to the file system.

The method MultimediaExtractor.GetFirstVideo() is where we locate the target video. Some useful methods are:
– MultimediaExtractor.GetDocumentVideoCount() to get the total number of video objects in the current file.
– MultimediaExtractor.GetNextVideo() to navigate to the next video file.

Extract images from PDF file using Adobe Acrobat DC

Adobe Acrobat DC can extract embedded images in the PDF document.

The following steps show you how to do that.

Step 1: Open the document in Adobe Acrobat DC

Step 2: Tools > Export PDF

Step 3: The last screen allows you to configure the image type

Extract images from PDF file using C#

PDF Extractor SDK can extract any embedded images in the PDF document. It fully supports formats such as GIF, TIFF, and JPG. You can achieve that in four steps:
1- Create an instance of the Bytescout.PDFExtractor.ImageExtractor class
2- Load the PDF document with the method Bytescout.PDFExtractor.ImageExtractor.LoadDocumentFromFile
3- Locate the image in the page
4- Save the image with the method Bytescout.PDFExtractor.ImageExtractor.SaveCurrentImageToFile

Note: We need the System.Drawing.Imaging namespace in order to use the ImageFormat class.

You can download the document here

One interesting feature is the ability to choose the export format regardless of the image’s original format. Depending on your needs, you can pass Png, Jpeg, Ico, gif, bmp, or Exif here; they are all handled well by PDF Extractor SDK.

Extract embedded documents in PDF file

PDF Extractor SDK can extract any embedded documents from a PDF. This process involves four steps:
1- Create an instance of the Bytescout.PDFExtractor.AttachementExtractor class
2- Load the PDF file using the Bytescout.PDFExtractor.AttachementExtractor.LoadDocumentFromFile method
3- Locate the file with Bytescout.PDFExtractor.AttachementExtractor.GetFileName, passing the file’s index as the input parameter.
4- Call the Save method to write to disk.

In this article we’ve seen several ways to extract data from PDF documents. If you want to do it manually, Adobe Acrobat DC is definitely the best choice. However, this product is not free, and you have to pay for a commercial license. An alternative is Adobe Reader, but it has some limitations. For automated extraction, we’ve seen that PDF Extractor SDK is a simple, complete, and reliable tool for PDF data extraction. It supports many commonly used formats (XML, CSV, JSON, HTML, text, and so on). For more complex OCR tasks, Adobe Acrobat is very reliable software: its pattern recognition error rate is quite low, and it also supports many export formats. PDF Extractor SDK offers a powerful OCR engine with many features that let developers optimize the character recognition process. PDF Extractor SDK is also well placed when your business needs to deal with rich media content. We’ve seen that it fully supports extracting audio and video from PDF files and is compatible with CLR-compliant programming languages such as C# and VB.NET.


About the Author

ByteScout Team of Writers

ByteScout has a team of professional writers specialized in different technical topics. We select the best writers to cover interesting and trending topics for our readers. We love developers and we hope our articles help you learn about programming and programmers.