The Computer Oracle

How to extract the text from MS Office documents in Linux?

Full question
https://superuser.com/questions/165978/h...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#linux #pdf #extract




ANSWER 1

Score 16


catdoc can convert .doc, .xls and .ppt files to text. A second option would be wvWare.

For more utilities, see http://www.linux.com/archive/articles/52385, which covers Word-to-text converters.
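
As a rough sketch of how this could be scripted: the catdoc package normally ships three separate commands, catdoc for Word files, xls2csv for Excel files and catppt for PowerPoint files, all of which write to standard output. The helper name office_to_text below is made up for illustration, and the match on file extension is case-sensitive. These tools target the older binary formats, so .docx and similar files are not covered here.

```shell
#!/bin/sh
# Sketch only: choose the catdoc-suite tool by file extension.
# office_to_text is a hypothetical helper name, not part of catdoc.
office_to_text() {
    case "$1" in
        *.doc) catdoc "$1" ;;    # Word document -> plain text on stdout
        *.xls) xls2csv "$1" ;;   # Excel workbook -> CSV on stdout
        *.ppt) catppt "$1" ;;    # PowerPoint slides -> plain text on stdout
        *)
            echo "unsupported file type: $1" >&2
            return 1
            ;;
    esac
}
```

Usage would then look like: office_to_text report.doc > report.txt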




ACCEPTED ANSWER

Score 9


I finally found the perfect tool for scripting document parsing: Apache Tika. It can parse a gazillion non-text formats into text, which is very cool!

Get Apache Tika here:

http://tika.apache.org/

(Mac Homebrew users: brew install tika)

The command-line interface works like this:

tika --text something.docx > something.txt
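
For more than one file, the same command can be wrapped in a loop. This is a minimal sketch that assumes a tika wrapper command is on the PATH, as in the line above; where only the tika-app jar is installed, the same --text option would be passed to java -jar with the path to that jar instead. The helper name tika_batch is invented for this example.

```shell
#!/bin/sh
# Sketch only: convert each given file to a .txt file next to the original.
# Assumes a `tika` wrapper command on PATH; tika_batch is a hypothetical name.
tika_batch() {
    for f in "$@"; do
        out="${f%.*}.txt"          # replace the last extension with .txt
        if [ "$out" = "$f" ]; then
            # Input is already a .txt file; redirecting would truncate it.
            echo "skipping $f" >&2
            continue
        fi
        if tika --text "$f" > "$out"; then
            echo "wrote $out"
        else
            echo "failed: $f" >&2
            rm -f "$out"           # do not leave a partial output file behind
        fi
    done
}
```

Usage would then look like: tika_batch *.docx *.pdf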




ANSWER 3

Score 8


AbiWord can convert between any file formats it knows, directly from the command line.

Convert from Word to plain text:

abiword --to=txt myfile.doc

Make a pdf from a Word file:

abiword --to=pdf myfile.doc

And so on. The results in these cases would be myfile.txt or myfile.pdf. If you want to specify the output name you can do that too:

abiword --to=txt --to-name=output.txt myfile.doc

Convert ODT to Word:

abiword --to=doc myfile.odt

Convert Word to ODT:

abiword --to=odt myfile.doc

In fairness to other answers, it should be noted that AbiWord uses wvWare to handle Word documents, but even the wvWare homepage recommends using AbiWord instead for most conversions.

I hate word processors. This is the main reason I have AbiWord installed.

You might also be interested in unoconv, which is a similar tool supporting formats OpenOffice knows (which would include spreadsheets and the like), but I have no experience with it personally.
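
To convert a whole directory with the AbiWord options shown above, something like the following sketch may work. It uses only the --to and --to-name flags from this answer; the helper name doc_dir_to_txt and the directory arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch only: convert every .doc file in one directory to plain text in another.
# doc_dir_to_txt is a hypothetical helper name; the abiword flags are the
# --to and --to-name options shown in the answer above.
doc_dir_to_txt() {
    src_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.doc; do
        [ -e "$f" ] || continue            # the glob matched nothing
        name=$(basename "$f" .doc)         # strip directory and .doc suffix
        abiword --to=txt --to-name="$out_dir/$name.txt" "$f" \
            || echo "failed: $f" >&2
    done
}
```

Usage would then look like: doc_dir_to_txt ~/Documents ~/Documents/text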




ANSWER 4

Score 0


You could set up a CUPS virtual printer and print the document to it with lp.