Wednesday, February 11, 2015

.NET APIs to Convert a PDF Document to an HTML Document

If you want to build a WCF service in C# or VB.NET that accepts a PDF document as a byte array and returns the converted HTML as a byte array, you need either your own .NET conversion code or a ready-made API. Writing the conversion code yourself is difficult and time-consuming unless you are developing a conversion product. You can also try the open-source iTextSharp API, but it is not fully suitable for graphics objects, images, table borders, logos, font styles, formatting and so on.
  
Several ready-made APIs are available that can convert PDF to multiple formats, such as PDF to HTML, PDF to DOC/DOCX, PDF to RTF and PDF to JPG.
All of the options below require a license.

Below is C#.NET code that uses these APIs to build a WCF service. I have done a couple of PDF to HTML conversions using the trial versions of these APIs, and the conversion results look very promising. If you have this kind of requirement, try these options; the C#.NET code below should help with quick development.
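The WCF service described above could expose a contract along these lines. This is only a minimal sketch: the interface name IPdfConversionService is my own assumption for illustration, and the operation signatures simply mirror the conversion functions shown in the sections that follow.

```csharp
using System.ServiceModel;

// Hypothetical service contract; the interface name is illustrative only.
[ServiceContract]
public interface IPdfConversionService
{
    // Accepts the PDF content as bytes and returns the converted HTML as bytes.
    [OperationContract]
    byte[] PDFToHTML(byte[] inputFilePDF);

    // Accepts the PDF content as bytes and returns the converted Word document as bytes.
    [OperationContract]
    byte[] PDFToDoc(byte[] inputFilePDF);
}
```

The service class would implement this interface using one of the APIs below.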

Using SautinSoft APIs

using SautinSoft;
using System.IO;

public byte[] PDFToHTML(byte[] inputFilePDF)
{
        PdfFocus pdfFocusObject = new PdfFocus();
        pdfFocusObject.OpenPdf(inputFilePDF);
        byte[] outPutHtmlByte = null;

        if (pdfFocusObject.PageCount > 0)
        {
            pdfFocusObject.HtmlOptions.IncludeImageInHtml = true;
            pdfFocusObject.HtmlOptions.Title = "Simple text";
            string html = pdfFocusObject.ToHtml();
            outPutHtmlByte = GetBytes(html);
        }
        return outPutHtmlByte;
}

public byte[] PDFToDoc(byte[] inputFilePDF)
{
        PdfFocus pdfFocusObject = new PdfFocus();
        pdfFocusObject.OpenPdf(inputFilePDF);
        byte[] outPutDocByte = null;

        if (pdfFocusObject.PageCount > 0)
        {
            string word = pdfFocusObject.ToWord();
            outPutDocByte = GetBytes(word);
        }
        return outPutDocByte;
}

private byte[] GetBytes(string str)
{
        // Encode the string as UTF-8 bytes (a raw UTF-16 char copy has no BOM
        // and may be misread when the result is saved as an HTML file).
        return System.Text.Encoding.UTF8.GetBytes(str);
}

Using Aspose APIs

using Aspose.Pdf;
using System.IO;

// Holds the HTML captured by the custom saving strategy below.
byte[] resultHtmlAsBytes;

public byte[] PDFToHTML(byte[] inputFileByteArray)
{
        Document doc = new Document(new MemoryStream(inputFileByteArray));
        HtmlSaveOptions newOptions = new HtmlSaveOptions();
        // Embed images, fonts and other parts into the HTML itself.
        newOptions.RasterImagesSavingMode = HtmlSaveOptions.RasterImagesSavingModes.AsEmbeddedPartsOfPngPageBackground;
        newOptions.FontSavingMode = HtmlSaveOptions.FontSavingModes.SaveInAllFormats;
        newOptions.PartsEmbeddingMode = HtmlSaveOptions.PartsEmbeddingModes.EmbedAllIntoHtml;
        newOptions.LettersPositioningMethod = HtmlSaveOptions.LettersPositioningMethods.UseEmUnitsAndCompensationOfRoundingErrorsInCss;
        newOptions.PagesFlowTypeDependsOnViewersScreenSize = false;
        newOptions.RemoveEmptyAreasOnTopAndBottom = true;
        newOptions.SplitCssIntoPages = false;
        // Hand the generated markup to SavingToStream so it can be returned as bytes.
        newOptions.CustomHtmlSavingStrategy = new HtmlSaveOptions.HtmlPageMarkupSavingStrategy(SavingToStream);
        // Save still expects a file name; the markup itself is passed to the callback.
        string outHtmlfile = "Test.html";
        doc.Save(outHtmlfile, newOptions);
        return resultHtmlAsBytes;
}

private void SavingToStream(HtmlSaveOptions.HtmlPageMarkupSavingInfo htmlSavingInfo)
{
        // Copy the generated HTML from the content stream into the result buffer.
        resultHtmlAsBytes = new byte[htmlSavingInfo.ContentStream.Length];
        htmlSavingInfo.ContentStream.Read(resultHtmlAsBytes, 0, resultHtmlAsBytes.Length);
}

Using RasterEdge APIs:



Calling the above functions from a client

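using System;
using System.IO;

// Verbatim strings (@"...") keep the backslashes in the Windows paths intact.
private string inputFile = @"C:\Test.pdf";
private string outputFile = @"C:\Test.html";

// Send the PDF bytes to the service and write the returned HTML bytes to disk.
byte[] outputFileByteArray = client.PDFToHTML(File.ReadAllBytes(inputFile));
File.WriteAllBytes(outputFile, outputFileByteArray);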

Here, client is an instance of the WCF proxy object; create it before making the call.
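One way to create the proxy is with ChannelFactory instead of a generated service reference. The sketch below is only an illustration under assumptions: the contract interface IPdfConversionService, the BasicHttpBinding choice, the 20 MB size limit and the endpoint address are all hypothetical and must be replaced with the values from your own service.

```csharp
using System.IO;
using System.ServiceModel;

// Hypothetical contract, mirroring the PDFToHTML function shown earlier.
[ServiceContract]
public interface IPdfConversionService
{
    [OperationContract]
    byte[] PDFToHTML(byte[] inputFilePDF);
}

public static class ClientDemo
{
    public static void Main()
    {
        // Assumed binding; the default MaxReceivedMessageSize is 64 KB,
        // so larger converted documents need a higher limit.
        var binding = new BasicHttpBinding();
        binding.MaxReceivedMessageSize = 20 * 1024 * 1024;

        // Assumed address; use the endpoint your service is actually hosted on.
        var address = new EndpointAddress("http://localhost:8080/PdfConversionService");

        var factory = new ChannelFactory<IPdfConversionService>(binding, address);
        IPdfConversionService client = factory.CreateChannel();

        // Same call as above: send the PDF bytes, save the returned HTML bytes.
        byte[] html = client.PDFToHTML(File.ReadAllBytes(@"C:\Test.pdf"));
        File.WriteAllBytes(@"C:\Test.html", html);

        // Close the channel and the factory once the conversion is done.
        ((IClientChannel)client).Close();
        factory.Close();
    }
}
```

The service side needs matching size limits in its own binding configuration, and reader quotas such as MaxArrayLength may also need raising for large files.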
