Strip Text From HTML Document Using Ruby
There are lots of examples of how to strip HTML tags from a document using Ruby. Hpricot and Nokogiri have inner_text methods that remove all HTML for you easily and quickly. What
Solution 1:
This works too:
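require 'nokogiri'
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").remove  # detach every text node, leaving only the markup
doc.to_html                   # serialize the remaining tags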
Solution 2:
You can scan the string to create an array of "tokens", and then select only those that are HTML tags:
>> some_html
=> "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
>> some_html.scan(/<\/?[^>]+>|[\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+/).select { |t| t =~ /<\/?[^>]+>/ }.join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
==Edit==
Or even better, just scan for HTML tags ;)
>> some_html.scan(/<\/?[^>]+>/).join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
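The tag-only scan can be wrapped in a small helper. This is a minimal sketch using only Ruby's core String methods; the names `TAG_PATTERN` and `strip_text_keep_tags` are made up for illustration, and the regex is the same one used above. It assumes reasonably tidy markup: a literal `>` inside an attribute value, or a comment containing `>`, would be cut short by this pattern.

```ruby
# Hypothetical helper: keep only the markup of an HTML string, dropping all text.
# The pattern matches opening and closing tags, as in the one-liner above.
TAG_PATTERN = /<\/?[^>]+>/

def strip_text_keep_tags(html)
  # String#scan returns every non-overlapping match as an array of strings;
  # joining them yields the document with no text content between the tags.
  html.scan(TAG_PATTERN).join
end

some_html = "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
puts strip_text_keep_tags(some_html)
```

For anything beyond quick one-off cleanup, a real parser such as Nokogiri is likely the safer choice, since a regex does not understand HTML structure.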
Solution 3:
To grab everything not in a tag, you can use Nokogiri like this:
doc.search('//text()').text
Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:
blacklist = ['title', 'script', 'style']
nodelist = doc.search('//text()')
blacklist.each do |tag|
  nodelist -= doc.search('//' + tag + '/text()')
end
nodelist.text
You could also whitelist if you preferred, but that's probably going to be more time-intensive:
whitelist = ['p', 'span', 'strong', 'i', 'b'] # The list goes on and on...
nodelist = Nokogiri::XML::NodeSet.new(doc)
whitelist.each do |tag|
  nodelist += doc.search('//' + tag + '/text()')
end
nodelist.text
You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.
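As a rough sketch of the single-expression idea, the XPath string can be assembled from the same tag lists. The helper names `whitelist_xpath` and `blacklist_xpath` are hypothetical, and the code only builds strings; it does not measure which approach is faster. The blacklist version uses `parent::` so that it mirrors the `//tag/text()` searches in the loops above.

```ruby
# Hypothetical helpers that build one XPath expression instead of looping.

# Whitelist: union of the text children of each allowed tag,
# e.g. "//p/text() | //span/text()". Assumes at least one tag is given.
def whitelist_xpath(tags)
  tags.map { |tag| "//#{tag}/text()" }.join(' | ')
end

# Blacklist: every text node whose parent is not one of the given tags,
# e.g. "//text()[not(parent::script or parent::style)]".
def blacklist_xpath(tags)
  return '//text()' if tags.empty?

  conditions = tags.map { |tag| "parent::#{tag}" }.join(' or ')
  "//text()[not(#{conditions})]"
end

puts whitelist_xpath(%w[p span strong i b])
puts blacklist_xpath(%w[title script style])
```

Either string can then be passed to `doc.search` (or `doc.xpath`) and followed by `.text`. The order of the collected text may differ from the whitelist loop, since a union is typically returned in document order rather than grouped by tag.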
Solution 4:
I just came up with this, but @andre-r's solution is so much better!
#!/usr/bin/env ruby
require 'nokogiri'
# Parse the markup, blank out every text node, and serialize what is left.
def strip_text(html)
  Nokogiri(html).tap { |doc|
    doc.traverse do |node|
      node.content = nil if node.text?
    end
  }.to_s
end
require 'test/unit'
require 'yaml'
class TestHTMLStripping < Test::Unit::TestCase
  def test_that_all_text_gets_stripped_from_the_document
    dirty, clean = YAML.load DATA
    assert_equal clean, strip_text(dirty)
  end
end
__END__
---
- |
  <!DOCTYPE html><html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'><head><meta http-equiv='Content-type' content='text/html; charset=UTF-8' /><title>Test HTML Document</title><meta http-equiv='content-language' content='en' /></head><body><h1>Test <abbr title='Hypertext Markup Language'>HTML</abbr> Document</h1><div class='main'><p><strong>Test</strong><abbr title='Hypertext Markup Language'>HTML</abbr><em>Document</em></p></div></body></html>
- |
  <!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><title></title><meta http-equiv="content-language" content="en"></head><body><h1><abbr title="Hypertext Markup Language"></abbr></h1><div class="main"><p><strong></strong><abbr title="Hypertext Markup Language"></abbr><em></em></p></div></body></html>