C# - How to extract text from PDF, Word and Excel documents?
I need a .NET library that I can use to extract text data from PDF, Excel and Word files.
Ideally, a free tool!
Would you recommend any?
Many thanks,
As someone who has spent many days looking for free solutions to (nearly) this exact problem, I can tell you that I did not find a free library able to extract text from all of these formats well. The only library I'm aware of that does a great job on all of these formats (and more) is a commercial library, and it's not native .NET: it's a C++/COM library with a C++/CLI .NET wrapper.
What are your options?
iTextSharp -- This one is absolutely fantastic at extracting text from PDFs. While earlier versions of the library were commercial-friendly (LGPL), the authors have since decided that they want to charge for the software, so they now release it under the AGPL. Unless you want to release all of your source code, you don't want to use one of those versions. However, the last version licensed under the LGPL (4.1.6) can still be found on the internet. This question has a link to the version under the LGPL.
PDFBox -- Another PDF library. This one, IMO, is better because it's under the Apache 2.0 license. There are a few issues with it, and sometimes (perhaps rarely) it does not do as good a job as iTextSharp. I attribute that more to the fact that it's a newer library than to anything else. However, my experience with the library was months ago. The project is actively developed, and in the last month 52 issues have been resolved. Keep an eye on this one. Please note that this is a Java library. (Keep reading below for more information on why I've included it.)
POI or NPOI -- These are libraries written for Microsoft Office documents, particularly the pre-2007 formats, i.e. the OLE binary file formats. They also support the newer OpenXML formats, though I'm not sure how mature that part of the library is. POI is the Java version (keep reading below for more information on why I've included it), and NPOI is the native .NET version. However, NPOI only supports Excel documents, while POI can do text extraction on many more types.
Open XML SDK 2.0 -- A library for reading/modifying Office 2007+ (unencrypted OpenXML) documents, created by Microsoft themselves! It's an amazing library for working with these kinds of documents. However, it is a lower-level library and therefore doesn't (as far as I know) have an "it does everything" text extraction class. There's an example of text extraction from a Word document at this answer (I'm not sure it covers all cases, such as text in tables).
Tika -- Once again, a Java library (I'm not telling you about Java libraries for no reason. Keep on reading! :)), and as close to "one library" for text extraction as you can get. Tika can extract metadata and structured text content from many different kinds of files, using existing parsing libraries. It uses POI and PDFBox under the hood for Office and PDF documents.
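To make the "lower-level library" remark about the Open XML SDK concrete, here is a minimal sketch of pulling paragraph text out of a .docx file with it. The class name `DocxText` and the method name `Extract` are made up for illustration; the sketch only reads the main document body, so headers, footers, footnotes and comments are not covered, and text inside text boxes may show up twice.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK itself.
public static class DocxText
{
    // Returns the text of every paragraph in the main document body,
    // one paragraph per line. Paragraphs inside table cells are included
    // because Descendants<Paragraph>() walks the whole element tree.
    public static string Extract(Stream docx)
    {
        // Second argument 'false' opens the package read-only.
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docx, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // InnerText concatenates the text of all runs in a paragraph.
            IEnumerable<string> paragraphs =
                body.Descendants<Paragraph>().Select(p => p.InnerText);

            return string.Join(Environment.NewLine, paragraphs);
        }
    }
}
```

Calling it would look something like `string text = DocxText.Extract(File.OpenRead("report.docx"));`, where the file name is just a placeholder. Excel files need a different walk (`SpreadsheetDocument`, shared strings, cell values), which is part of why this SDK is not a one-call text extractor.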
Commercial
- dtSearch -- This is the library I'm familiar with. It does a fantastic job and can parse a ridiculous number of file formats. However, it costs money and may be overkill for what you need. It does exactly what you need, though we're trying to get rid of it ourselves, because we only use it for parsing (it's really a full-text search engine), and there are plenty of parsing libraries out there that we can use or modify to suit our needs. Even so, it blows these other libraries out of the water. As mentioned before, it is not native .NET code: a C++/CLI wrapper is used to interop between the DLL and the .NET runtime.
IFilters can also be used, as mentioned in several other answers on different questions, but the text they return is unstructured. It's bad... unreadable to humans, at least. I also believe IFilters are deprecated, and depending on license issues, you might not be able to redistribute them.
Why did I mention all of those Java libraries? Well, two reasons. First, there are no free .NET equivalents that come close to the quality of these Java libraries. Secondly, you can use these libraries in .NET using IKVM (I've done it myself with these libraries, so I can at least vouch for that). It's an implementation of Java that runs inside of .NET. Here is an example of using IKVM to convert Tika into a .NET assembly that can be used in your project. Perhaps the scariest thing about IKVM is that it works!
Edit: I forgot that the author of the blog had posted the code and the converted libraries on a GitHub project. So, if you want to check it out, you can do so there. However, it's an older version of Tika and over a year old. If the results aren't what you expected, I suggest trying the latest version.
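For a rough idea of what calling Tika from C# looks like once it has been converted with IKVM, here is a small sketch. It assumes the Tika jar has already been compiled to a .NET assembly and that this assembly and the IKVM runtime DLLs are referenced by the project; the wrapper class `TikaText` is invented for illustration. The Java package names (`org.apache.tika`, `java.io`) show up as .NET namespaces.

```csharp
using System;

// Hypothetical wrapper around the Tika facade class exposed through IKVM.
public static class TikaText
{
    // Detects the file type and returns the extracted plain text.
    // Works for any format the referenced Tika build has parsers for
    // (PDF via PDFBox, Office documents via POI, and so on).
    public static string Extract(string path)
    {
        var tika = new org.apache.tika.Tika();

        // By default parseToString stops after a fixed number of characters;
        // -1 removes that limit so long documents are not silently truncated.
        tika.setMaxStringLength(-1);

        // Java exceptions (IOException, TikaException) surface as .NET exceptions.
        return tika.parseToString(new java.io.File(path));
    }
}
```

Usage would be along the lines of `Console.WriteLine(TikaText.Extract(@"C:\docs\sample.pdf"));`, with the path being a placeholder. The same call handles PDF, Word and Excel files, which is what makes Tika the closest thing to "one library" here.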