Posts Tagged ‘libxml’

a little ferret’ing

September 28th, 2006

I was trying to find some information tonight in an RFC and realized I didn’t have any good way of searching the RFC documents. I still don’t have a very good solution, but I had a little fun trying to make one…

The first thing I did was grab all the RFCs (275M):

wget --passive-ftp -r -l 1 ftp://ftp.isi.edu/in-notes/

Meanwhile, as that downloaded, I started reading up on a great little Ruby project, Ferret. It’s a really impressive indexer, with an easy-to-use interface and, from the looks of it, pretty extensible. Anyway… looking around the RFC documents I noticed rfc-index.xml. Using ruby-libxml and its SaxParser, I was able to easily extract the fields that describe each RFC: title, author, date, etc…
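To make the SAX idea concrete, here is a minimal sketch of table-driven field extraction. It does not use ruby-libxml at all: a hand-written list of events stands in for the callbacks a SAX parser would fire, and the `FieldCollector` class, the `EVENTS` list and the sample values are all made up for illustration. The real parser is in index.rb further down.

```ruby
# Hypothetical stand-in for a SAX handler: maps element names to the
# field they fill in and collects the character data between the tags.
class FieldCollector
  FIELDS = { "doc-id" => :doc_id, "title" => :title, "year" => :year }

  attr_reader :record

  def initialize
    @record = {}
    @buffer = String.new
  end

  # a start tag we care about resets the text buffer
  def on_start(tag)
    @buffer = String.new if FIELDS.key?(tag)
  end

  # character data may arrive in several chunks, so append
  def on_chars(text)
    @buffer << text
  end

  # an end tag we care about stores whatever was collected
  def on_end(tag)
    field = FIELDS[tag]
    @record[field] = @buffer.strip if field
  end
end

# Made-up events, in the order a SAX parser would report them.
EVENTS = [
  [:on_start, "doc-id"], [:on_chars, "RFC0793"], [:on_end, "doc-id"],
  [:on_start, "title"], [:on_chars, "Transmission "],
  [:on_chars, "Control Protocol"], [:on_end, "title"],
  [:on_start, "year"], [:on_chars, "1981"], [:on_end, "year"]
]

collector = FieldCollector.new
EVENTS.each { |callback, arg| collector.send(callback, arg) }
p collector.record
```

The point of the buffer is that a parser is free to deliver one element’s text in pieces, so the value is only stored once the end tag shows up.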

Finally, once the index was built, a few lines of Ruby were all it took to search it. Searching is lightning fast, and the results are not too bad either…

Now to provide a web interface and some sensible ordering…

Anyway, for those interested, here are the source files.

The best part is probably that the indexer is 109 lines and the search is 11!

index.rb:

#!/usr/bin/env ruby
 
require 'rubygems'
require 'xml/libxml'
require 'ferret'
 
include Ferret
 
RFC_PATH="ftp.isi.edu/in-notes"
 
class RfcEntry
  attr_accessor :doc_id, :title, :authors, :month, :year, :file
 
  def initialize
    self.authors = []
  end
 
  def update( index )
    # call the indexer here to add data to index
    file = doc_id.gsub(/RFC0*/,"rfc")
 
    file += ".txt"
    path = "#{RFC_PATH}/#{file}"
    if File.exist?( path )
      puts "record: #{doc_id}, '#{title}' by #{authors.join(', ')} - #{month}, #{year} => #{file}"
      index << { :id => doc_id, :title => title, :content => File.read(path),
                 :authors => authors, :month => month, :year => year }
    end
  end
end
 
class RfcIndexParser
  # tag table
  TAGS = { "rfc-entry" => { :start => :start_entry, :end => :add_entry },
           "doc-id"    => { :start => :collect, :end => :store_doc_id },
           "title"     => { :start => :collect, :end => :store_title },
           "name"      => { :start => :collect, :end => :store_author },
           "month"     => { :start => :collect, :end => :store_month },
           "year"      => { :start => :collect, :end => :store_year } }
 
  # no entry is open until an rfc-entry start tag is seen
  def initialize( index ) # pass in the index
    @entry = nil
    @buffer = ""
    @index = index
  end
 
  def parse( rfc_index )
    parser = XML::SaxParser.new
    parser.filename = rfc_index
    parser.on_start_element {|name,attrs| self.on_start(name,attrs) }
    parser.on_end_element {|name| self.on_end(name) }
    parser.on_characters {|chars| self.on_chars(chars) }
    parser.parse
  end
 
  # when we find a new start tag check the table,
  # if it's in the table call the start method
  def on_start( tag, attrs )
    action = TAGS[tag]
    self.send( action[:start], tag, attrs ) if action
  end

  # when we find a new end tag check the table,
  # if it's in the table call the end method
  def on_end( tag )
    action = TAGS[tag]
    self.send( action[:end], tag ) if action
  end

  def on_chars( char )
    @buffer << char
  end

  def start_entry( tag, attrs )
    @entry = RfcEntry.new
  end

  def add_entry( tag )
    @entry.update( @index ) if @entry
    @entry = nil
  end

  def collect( tag, attrs )
    @buffer = "" # reset the buffer
  end

  # only the first doc-id inside an entry is kept
  def store_doc_id( tag )
    @entry.doc_id = @buffer.squeeze(" ") if @entry && @entry.doc_id.nil?
  end

  def store_author( tag )
    @entry.authors << @buffer.squeeze(" ") if @entry
  end

  def store_title( tag )
    @entry.title = @buffer.squeeze(" ") if @entry
  end

  def store_month( tag )
    @entry.month = @buffer.squeeze(" ") if @entry
  end

  def store_year( tag )
    @entry.year = @buffer.squeeze(" ") if @entry
  end
end
 
# parse and create the index
RfcIndexParser.new(Index::Index.new(:path => 'index')).parse( "#{RFC_PATH}/rfc-index.xml" )
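One detail in the indexer that is easy to miss is how a doc-id from the index gets turned into a file name on disk. Here is that one step on its own, as a small sketch; the `rfc_file_name` helper is a name I made up, and the sample ids are typed in by hand rather than read from rfc-index.xml.

```ruby
# Hypothetical helper wrapping the same gsub used in RfcEntry#update:
# drop the "RFC" prefix and any leading zeros, then add the extension.
def rfc_file_name(doc_id)
  doc_id.gsub(/RFC0*/, "rfc") + ".txt"
end

# Illustrative ids only.
%w[RFC0793 RFC2616 RFC0010].each do |doc_id|
  puts "#{doc_id} => #{rfc_file_name(doc_id)}"
end
```

Anything whose computed file name does not exist under RFC_PATH is simply skipped by the File.exist? check in update.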

search.rb:

#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'
include Ferret
 
index = Index::Index.new(:path => 'index')
 
index.search_each('title|content:"URL"') do |id, score|
  doc = index[id]
  puts "#{doc[:id]} '#{doc[:title]}' #{score}"
end


svn merge hell!

August 30th, 2006

svn: REPORT request failed on '/svn/rhg/!svn/vcc/default'
svn: Working copy path 'path/to/afile/in/your/project/a_file_that_is_broken' does not exist in repository

If you’ve ever seen this error, you’ve probably resorted to ‘rm -rf’.

Thanks to some of the great minds of revolution, we have a simple fix: edit the .svn/entries file and locate the incorrect attribute revision="0".

To automate the hunt for those entries, I wrote a little Ruby script. It uses libxml-ruby because I wanted to get familiar with the API, which thankfully is very similar to the C API.

Note: This only applies to the Subversion 1.3 client; the newer 1.4 client does not generate XML property files.

#!/usr/bin/env ruby
require 'find'
require 'pathname'
require 'rubygems'
require 'xml/libxml'

# going to search through all the folders in the current project
# and locate all .svn/entries files.  Parse each file looking for
# bad entries
# a bad entry is defined as
# any entry with a revision="0"
# that is not scheduled="add"
 
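# empty placeholder, not registered with the parser
def start_doc
end

# report any entry that has revision="0" and is not scheduled for addition
def start_element(name, attrs, entry_path)
  if name == "entry" && attrs["revision"] == "0" && attrs["schedule"] != "add"
    puts "Potential Error in #{entry_path}"
  end
end

def end_element(name)
end

# empty placeholders, not registered with the parser
def chars
end

def comments
end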
 
 
subversion_folder = /\.svn$/i

root_path = Pathname.new(".").realpath
Find.find(root_path) do |file_name|
  if subversion_folder.match(file_name)
    # look inside each .svn folder for its entries file
    Find.find(file_name) do |sub_file|
      entry_file = File.basename(sub_file)
      if entry_file == "entries"
        entry_path = "#{file_name}/#{entry_file}"
        parser = XML::SaxParser.new
        parser.on_start_element {|name,attrs| start_element( name, attrs, entry_path ) }
        parser.on_end_element {|name| end_element(name) }
        parser.filename = entry_path
        parser.parse
        break
      end
    end
  end
end
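The rule the script applies is small enough to try without a working copy or libxml. This is just a sketch: the `bad_entry?` helper and the sample attribute hashes are invented for illustration, and stand in for what the start-element callback would receive for each entry.

```ruby
# Hypothetical helper: true when an entry's attributes match the
# "bad entry" rule (revision="0" and not scheduled for addition).
def bad_entry?(attrs)
  attrs["revision"] == "0" && attrs["schedule"] != "add"
end

# Made-up attribute hashes, one per entry element.
samples = [
  { "name" => "a.rb", "revision" => "0" },
  { "name" => "b.rb", "revision" => "0", "schedule" => "add" },
  { "name" => "c.rb", "revision" => "412" }
]

samples.each do |attrs|
  status = bad_entry?(attrs) ? "suspect" : "ok"
  puts "#{attrs['name']}: #{status}"
end
```

A freshly added file legitimately has revision 0 until it is committed, which is why the schedule check is there.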
