a little ferret’ing
I was trying to find some information tonight in an RFC and realized I didn’t have any good way of searching the RFC documents. I still don’t have a very good solution, but I had a little fun trying to make one…
First thing I did was grab all the RFCs (275M):
wget --passive-ftp -r -l 1 ftp://ftp.isi.edu/in-notes/
Meanwhile, as that downloaded, I started reading up on a great little Ruby project, Ferret. It’s a really impressive indexer, sporting a really easy-to-use interface and, from the looks of it, pretty extensible. Anyways… looking around the RFC documents, I noticed rfc-index.xml. Using ruby-libxml and its SaxParser, I was able to easily extract the fields that describe each RFC: title, authors, date, etc…
Finally, once the index was built, a few lines of Ruby and searching is lightning fast, and the results aren’t too bad either…
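To give a feel for the extraction approach before the full source, here's a minimal sketch of table-driven handling of SAX-style events. It's a cut-down illustration, not the real indexer: the `TagDispatcher` class and its method names are made up for this example, it only handles two tags, and the events are fed in by hand instead of coming from libxml's SaxParser.

```ruby
# Table-driven handling of SAX-style events.
# TagDispatcher is a hypothetical, trimmed-down stand-in for the real parser class.
class TagDispatcher
  # Maps a tag name to the methods run when that tag opens and closes.
  TAGS = {
    "rfc-entry" => { :start => :start_entry, :end => :finish_entry },
    "title"     => { :start => :collect,     :end => :store_title }
  }

  attr_reader :entries

  def initialize
    @entries = []
    @entry = nil
    @buffer = String.new
  end

  # Look the tag up in the table; unknown tags are silently ignored.
  def on_start(tag)
    action = TAGS[tag]
    send(action[:start]) if action
  end

  def on_end(tag)
    action = TAGS[tag]
    send(action[:end]) if action
  end

  # Text can arrive in several chunks, so it is accumulated.
  def on_chars(chars)
    @buffer << chars
  end

  private

  def start_entry
    @entry = {}
  end

  def finish_entry
    @entries << @entry if @entry
    @entry = nil
  end

  # Reset the buffer so only this element's text is captured.
  def collect
    @buffer = String.new
  end

  def store_title
    @entry[:title] = @buffer.strip if @entry
  end
end

# Simulate the callbacks a SAX parser would make for
# <rfc-entry><title>Internet Protocol</title></rfc-entry>
dispatcher = TagDispatcher.new
dispatcher.on_start("rfc-entry")
dispatcher.on_start("title")
dispatcher.on_chars("Internet ")
dispatcher.on_chars("Protocol")
dispatcher.on_end("title")
dispatcher.on_end("rfc-entry")
p dispatcher.entries
```

The nice thing about the table is that picking up another field only means adding one row and one small store method; the start/end callbacks never change.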
Now to provide a web interface and some sensible ordering…
Anyways, for those interested, here are the source files.
The best part is probably that the indexer is 109 lines and the search is 11!
index.rb:
#!/usr/bin/env ruby
require 'rubygems'
require 'xml/libxml'
require 'ferret'
include Ferret

RFC_PATH="ftp.isi.edu/in-notes"

class RfcEntry
  attr_accessor :doc_id, :title, :authors, :month, :year, :file

  def initialize
    self.authors = []
  end

  def update( index )
    # call the indexer here to add data to index
    file = doc_id.gsub(/RFC0*/,"rfc")
    file += ".txt"
    path = "#{RFC_PATH}/#{file}"
    if( File.exist?( path ) )
      puts "record: #{doc_id}, '#{title}' by #{authors} - #{month}, #{year} => #{file}"
      index << {:id => doc_id,
                :title => title,
                :content => File.read(path),
                :authors => authors,
                :month => month,
                :year => year }
    end
  end
end

class RfcIndexParser
  # tag table
  TAGS = {
    "rfc-entry" => { :start => :start_entry, :end => :add_entry },
    "doc-id" => { :start => :collect, :end => :store_doc_id },
    "title" => { :start => :collect, :end => :store_title },
    "name" => { :start => :collect, :end => :store_author },
    "month" => { :start => :collect, :end => :store_month },
    "year" => { :start => :collect, :end => :store_year }
  }

  # always have one entry
  def initialize( index ) # pass in the index
    @entry = nil
    @buffer = ""
    @index = index
  end

  def parse( rfc_index )
    parser = XML::SaxParser.new
    parser.filename = rfc_index
    parser.on_start_element {|name,attrs| self.on_start(name,attrs) }
    parser.on_end_element {|name| self.on_end(name) }
    parser.on_characters {|chars| self.on_chars(chars) }
    parser.parse
  end

  # when we find a new start tag check the table,
  # if it's in the table call the start method
  def on_start( tag, attrs )
    action = TAGS[tag]
    self.send( action[:start], tag, attrs ) if( action )
  end

  # when we find a new end tag check the table,
  # if it's in the table call the end method
  def on_end( tag )
    action = TAGS[tag]
    self.send( action[:end], tag ) if( action )
  end

  def on_chars( char )
    @buffer << char
  end

  def start_entry( tag, attrs )
    @entry = RfcEntry.new
  end

  def add_entry( tag )
    @entry.update( @index ) if @entry
    @entry = nil
  end

  def collect( tag, attrs )
    @buffer = "" # reset the buffer
  end

  def store_doc_id( tag )
    @entry.doc_id = @buffer.squeeze(" ") if @entry and @entry.doc_id.nil?
  end

  def store_author( tag )
    @entry.authors << @buffer.squeeze(" ") if @entry
  end

  def store_title( tag )
    @entry.title = @buffer.squeeze(" ") if @entry
  end

  def store_month( tag )
    @entry.month = @buffer.squeeze(" ") if @entry
  end

  def store_year( tag )
    @entry.year = @buffer.squeeze(" ") if @entry
  end
end

# parse and create the index
RfcIndexParser.new(Index::Index.new(:path => 'index')).parse( "#{RFC_PATH}/rfc-index.xml" )
search.rb:
#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'
include Ferret

index = Index::Index.new(:path => 'index')
index.search_each('title|content:"URL"') do |id, score|
  doc = index[id]
  puts "#{doc[:id]} '#{doc[:title]}' #{score}"
end
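The search term is hard-coded to "URL" above. As a rough sketch of how it could come from the command line instead, here's a small helper that builds a query string in the same `title|content:"..."` shape. `build_query` is a hypothetical name, not part of Ferret, and dropping double quotes from the term is just a simple assumption on my part to keep the phrase well-formed, not Ferret's own escaping.

```ruby
# Hypothetical helper: builds a query string shaped like the one search.rb uses.
# Double quotes are removed from the term (an assumption, not Ferret's escaping rules).
def build_query(term, fields = %w[title content])
  cleaned = term.to_s.delete('"').strip
  raise ArgumentError, "empty search term" if cleaned.empty?
  %(#{fields.join('|')}:"#{cleaned}")
end

# Fall back to the original hard-coded term when no argument is given.
query = build_query(ARGV.first || "URL")
puts query
```

The resulting string would then be passed to `index.search_each` in place of the literal.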
