I have a script that takes input from `wget` or similar and searches through it for keywords using `grep`. (I promise I am not trying to parse HTML with regular expressions; it is just a convenient way to emulate the content-detection behaviour we have in another, much more complex product.) This works great as long as the HTML content isn't too severely minified. When it is, the lines can become very long (over 50 kB in some cases I've seen), and `grep` chokes on them.
To remedy this, I would like to be able to fold or re-indent the HTML so that it is spread out over more lines. However, in order for the script to give accurate results, I need to be able to do this without otherwise altering the content. This means it can't correct invalid or unclosed tags, and it must fold only between elements, not inside them.
These two requirements seem to rule out all of the HTML-tidying or prettifying utilities I've found.
Are there any UNIX-based shell utilities, Perl/Python/Ruby modules, or similar that can do this for me?
Alternatively, since all I need is to add some newlines in between tags, is there a way that I can semi-reliably do this myself?
Run the file through HTML Tidy:

    curl http://superuser.com | tidy -i | less

The `-i` option makes Tidy indent its output.
Ok, for anyone else in need of this, I'm recording the suggestions made in this awesome thread (in case that link goes down, as per StackExchange guidelines):
- HTB 2.0 - DOS based - http://www.digital-mines.com/htb/
- HTML-Kit - a full-featured free HTML editor running on Windows. You need to configure the Tidy options [Tools /Check code using Tidy /Add new config], uncheck all switches except "Output only the body content" and "Convert non-breaking space to entities", then go to Actions /Tools /HTML Tidy /Indent Tags or beautify - http://www.chami.com/html-kit/
- SCREEM - only for Linux -
- NetBeans - " After opening an html file with NetBeans, click Source then select Format. That's it. " -
- WebmasterGate's HTML / XHTML Beautifier - Online tool - http://www.webmastergate.com/html-beautifier/
- Aptana Studio (Version 2.0.4) - "Select Edit > Format or press Ctrl-Shift F to format the html code. The format function can be configured from Windows > Preferences, then select Aptana > Editors > HTML > Formatting, click Edit to add tags which should not take a new line then save it as a new preference." -
- UniversalIndentGUI - Uses HTB Beautifier internally - While running Notepad++, go to Plugins > Plugin Manager > Show Plugin Manager, select UniversalIndentGUI from the available list to install it.
- tidy with these options:
[HTML, XHTML, XML Options]
[Pretty Print Options]
I have yet to try out these options (the `input-xml: yes` and `force-output: yes` config suggestions for HTML Tidy mentioned at http://stackoverflow.com/questions/7151180/use-html-tidy-to-just-indent-html-code work for my immediate purpose); I will update this answer if I do.
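For scripting, the two config suggestions above can also be passed on the command line, since Tidy accepts config options in the form `--name value`. The following is only a rough sketch of how that might be wrapped in Python; the function names `tidy_command` and `indent_with_tidy` are made up for illustration, and whether Tidy leaves a particular broken document untouched with these options still has to be checked by hand.

```python
import shutil
import subprocess


def tidy_command(indent: bool = True) -> list[str]:
    """Build a tidy command line (hypothetical helper, not part of Tidy)."""
    cmd = ["tidy", "-q"]          # -q: suppress non-essential output
    if indent:
        cmd.append("-i")          # -i: indent element content
    # Config-file options can be given on the command line as --name value.
    cmd += ["--input-xml", "yes", "--force-output", "yes"]
    return cmd


def indent_with_tidy(html: str) -> str:
    """Pipe html through tidy and return whatever it writes to stdout."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy is not installed or not on PATH")
    # tidy exits with 1 on warnings and 2 on errors, so check=True is avoided.
    result = subprocess.run(
        tidy_command(), input=html, capture_output=True, text=True
    )
    return result.stdout
```

Comparing the keyword hits before and after on a few known documents would be a reasonable way to confirm that nothing was rewritten.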
The simplest way to do this without parsing/fixing the document is to look for a closing tag, followed by an open-angle-bracket or whitespace, and insert a newline. Search for:
and replace with
You will still need to manually check over each output document and verify that it didn't break anything, but this should work for most cases. It won't be pretty output, but it should kill 50KB lines.
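As a rough illustration of the idea described above, here is a minimal Python sketch. The regular expression is my own assumption about what such a search pattern could look like, not necessarily the pattern this answer had in mind, and the function name `fold_after_closing_tags` is made up for the example. It inserts a newline after a closing tag only when the next character is `<` or whitespace, and changes nothing else.

```python
import re

# A closing tag such as </li> or </div >, but only when it is immediately
# followed by "<" or whitespace (the lookahead does not consume that character).
# This pattern is an assumption made for illustration.
CLOSING_TAG = re.compile(r"(</[A-Za-z][^>]*>)(?=[<\s])")


def fold_after_closing_tags(html: str) -> str:
    """Insert a newline after each matching closing tag; nothing is removed."""
    return CLOSING_TAG.sub(r"\1\n", html)


sample = "<ul><li>a</li><li>b</li></ul><p>x</p>"
folded = fold_after_closing_tags(sample)
# folded now has a line break after </li>, </li> and </ul>, but not after
# the final </p>, because nothing follows it.
```

Because the only change is added newlines, no tags are corrected or closed. The added newlines are still new characters, though, so content where whitespace matters (for example inside `<pre>`) deserves a manual look, as does anything with `</` inside scripts or comments. To use it as a filter in front of `grep`, the function can be applied to `sys.stdin.read()` and the result written to `sys.stdout`.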