TextMate XHTML Entity Encoding

22 August 2009

For those of us who use TextMate, the built-in "Convert Character / Selection to Entities" and "Convert Character / Selection to Entities Excl. Tags" commands can be a real timesaver. However, if you are working with XHTML, these commands do not produce valid markup, as they use named entities. With two simple edits and a TextMate environment variable, you can have your version use hex-encoded entities instead.

Step One: TM_XHTML

The first step is to tell TextMate that we are working with XHTML. Under Preferences > Advanced > Shell Variables, add a new variable called TM_XHTML with the value ` /` (note the leading space; it is important). This has the added benefit of making all your snippet commands insert self-closing tags where applicable.
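
To see why the leading space matters, here is a minimal sketch of how a tag ends up looking when the variable's value is dropped in just before the closing bracket. The `empty_tag` helper is hypothetical (it is not part of TextMate or its bundles) and only imitates the substitution.

```ruby
# Hypothetical helper (not part of TextMate): builds an empty element by
# inserting the value of TM_XHTML, if any, right before the closing ">".
def empty_tag(name, env = ENV)
  "<#{name}#{env['TM_XHTML']}>"
end

puts empty_tag('br', {})                      # variable unset: <br>
puts empty_tag('br', { 'TM_XHTML' => ' /' })  # variable set:   <br />
```

Without the space the second tag would come out as `<br/>`, which is still well-formed but not the conventional form for markup that also has to be served as HTML.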

Step Two: Edit the Commands

Now click Bundles > Bundle Editor > Edit Commands. Click the triangle next to HTML in the left column to show all the available commands. Now choose "Convert Character / Selection to Entities" and replace its contents with:

#!/usr/bin/env ruby
$KCODE = 'U'
$char_to_entity = { }
File.open("#{ENV['TM_BUNDLE_SUPPORT']}/entities.txt").read.scan(/^(\d+)\t(.+)$/) do |key, value|
  $char_to_entity[[key.to_i].pack('U')] = value
end
def encode (text)
  text.gsub(/[^\x00-\x7F]|["'<>&]/) do |ch|
    if ENV['TM_XHTML'] then
      ent = sprintf("&#x%02X;", ch.unpack("U")[0])
    else
      ent = $char_to_entity[ch]
      ent ? "&#{ent};" : sprintf("&#x%02X;", ch.unpack("U")[0])
    end
  end
end
print encode(STDIN.read)
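
If you want to try the two modes outside TextMate, the sketch below follows the same logic with two assumptions: the bundle's entities.txt is replaced by a tiny hard-coded table (`CHAR_TO_ENTITY`), and the XHTML switch is passed as an argument instead of being read from the environment.

```ruby
# Assumption: a tiny stand-in for the bundle's entities.txt lookup table.
CHAR_TO_ENTITY = {
  '"' => 'quot', '&' => 'amp', '<' => 'lt', '>' => 'gt',
  "\u00E9" => 'eacute', "\u00A9" => 'copy'
}

# Encode non-ASCII characters and the HTML special characters.
# xhtml = true  -> always emit hex references
# xhtml = false -> prefer a named entity, fall back to hex
def encode(text, xhtml)
  text.gsub(/[^\x00-\x7F]|["'<>&]/) do |ch|
    hex = format('&#x%02X;', ch.unpack('U')[0])
    next hex if xhtml
    name = CHAR_TO_ENTITY[ch]
    name ? "&#{name};" : hex
  end
end

puts encode("caf\u00E9 & <b>", true)   # caf&#xE9; &#x26; &#x3C;b&#x3E;
puts encode("caf\u00E9 & <b>", false)  # caf&eacute; &amp; &lt;b&gt;
```

Note that an apostrophe has no entry in this stand-in table, so it comes out as `&#x27;` in both modes.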

Then choose "Convert Character / Selection to Entities Excl. Tags" and replace its contents with:

#!/usr/bin/env ruby
$KCODE = 'U'
$char_to_entity = { }
File.open("#{ENV['TM_BUNDLE_SUPPORT']}/entities.txt").read.scan(/^(\d+)\t(.+)$/) do |key, value|
  $char_to_entity[[key.to_i].pack('U')] = value
end
def encode (text)
  text.gsub(/[^\x00-\x7F]|["'<>&]/) do |ch|
    if ENV['TM_XHTML'] then
      ent = sprintf("&#x%02X;", ch.unpack("U")[0])
    else
      ent = $char_to_entity[ch]
      ent ? "&#{ent};" : sprintf("&#x%02X;", ch.unpack("U")[0])
    end
  end
end
STDIN.read.scan(/(?x)
    ( <\?(?:[^?]*|\?(?!>))*\?>
    | <!-- (?m:.*?) -->
    | <\/? (?i:a|abbr|acronym|address|applet|area|b|base|basefont|bdo|big|blockquote|body|br|button|caption|center|cite|code|col|colgroup|dd|del|dfn|dir|div|dl|dt|em|fieldset|font|form|frame|frameset|h1|h2|h3|h4|h5|h6|head|hr|html|i|iframe|img|input|ins|isindex|kbd|label|legend|li|link|map|menu|meta|noframes|noscript|object|ol|optgroup|option|p|param|pre|q|s|samp|script|select|small|span|strike|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|u|ul|var)\b
      (?:[^>"']|"[^"]*"|'[^']*')*
      >
    | &(?:[a-zA-Z0-9]+|\#[0-9]+|\#x[0-9a-fA-F]+);
    )
    |([^<&]+|[<&])
  /x) do |tag, text|
  print tag.to_s, encode(text.to_s)
end
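
The idea behind the "Excl. Tags" variant is easier to see with a cut-down tokenizer. The sketch below is a simplification, not the command itself: `TOKEN` treats any `<...>` run as markup, whereas the real pattern only accepts known tag names, comments, and processing instructions. The `hex_encode` and `encode_excluding_tags` names are made up for the example.

```ruby
# Simplified stand-in for the command's pattern. Group 1 is markup or an
# existing entity (passed through untouched); group 2 is text to encode.
TOKEN = /(<[^>]*>|&(?:[a-zA-Z0-9]+|#[0-9]+|#x[0-9a-fA-F]+);)|([^<&]+|[<&])/

# Always emit hex references, as the command does when TM_XHTML is set.
def hex_encode(text)
  text.gsub(/[^\x00-\x7F]|["'<>&]/) { |ch| format('&#x%02X;', ch.unpack('U')[0]) }
end

# Walk the input token by token; only the text half of each match is encoded.
def encode_excluding_tags(source)
  source.scan(TOKEN).map { |markup, text| markup.to_s + hex_encode(text.to_s) }.join
end

puts encode_excluding_tags("<p class=\"x\">caf\u00E9 &amp; \"tea\"</p>")
# <p class="x">caf&#xE9; &amp; &#x22;tea&#x22;</p>
```

The quotes inside the tag survive, the ones in the text do not, and the existing `&amp;` is left alone. A simplified pattern like this one would mistake `1 < 2 and 3 > 2` for a tag, which is presumably why the real command spells out the tag names.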

Close the Bundle Editor, which will save the changes automatically, and you're all set.

Extra Credit: Transcode Named to Hex

Now, if you want to be able to transcode all of a file or selection from named entities to their hex equivalents, you can add a new command to the HTML bundle to handle just that.

Go to Bundles > Bundle Editor again. Click on HTML in the left column then click the Plus sign at the bottom left and choose "New Command". Give it a name like "Convert Named Entities to Hex". Paste the following into the Command(s) area on the right:

#!/usr/bin/env ruby
$KCODE = 'U'
$entity_to_char = { }
$char_to_entity = { }
File.open("#{ENV['TM_BUNDLE_SUPPORT']}/entities.txt").read.scan(/^(\d+)\t(.+)$/) do |key, value|
  char = [key.to_i].pack('U')
  $entity_to_char[value] = char
  $char_to_entity[char] = value
end
def decode ( text )
  text.gsub(/&(?:([a-z0-9]+)|#([0-9]+)|#x([0-9A-F]+));/i) do |m|
    if $1 then
      $entity_to_char[$1] || m
    else
      [$2 ? $2.to_i : $3.hex].pack("U")
    end
  end
end
def encode( text )
  text.gsub(/[^\x00-\x7F]|["'<>&]/) do |ch|
    if ENV['TM_XHTML'] then
      ent = sprintf("&#x%02X;", ch.unpack("U")[0])
    else
      ent = $char_to_entity[ch]
      ent ? "&#{ent};" : sprintf("&#x%02X;", ch.unpack("U")[0])
    end
  end
end
def transcode( text )
  encode( decode( text ) )
end
STDIN.read.scan(/(?x)
    ( <\?(?:[^?]*|\?(?!>))*\?>
    | <!-- (?m:.*?) -->
    | <\/? (?i:a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdo|big|blockquote|body|br|button|caption|center|cite|code|col|colgroup|command|datagrid|datalist|datatemplate|dd|del|details|dfn|dialog|dir|div|dl|dt|em|fieldset|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6|head|header|hr|html|i|iframe|img|input|ins|isindex|kbd|label|legend|li|link|map|menu|meta|nav|noframes|noscript|object|ol|optgroup|option|p|param|pre|q|s|samp|script|section|select|small|span|strike|strong|style|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|u|ul|var|video)\b
      (?:[^>"']|"[^"]*"|'[^']*')*
      >
    )
    |(&(?:[a-zA-Z0-9]+|\#[0-9]+|\#x[0-9a-fA-F]+);|[^<&]+|[<&])
  /x) do |tag, text|
  # Entities are captured as text rather than as markup, so transcode sees them.
  print tag.to_s, transcode(text.to_s)
end
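
Here is a minimal sketch of the decode-then-encode round trip on its own, so you can check what it does to a given string. As before, `ENTITY_TO_CHAR` is an assumed stand-in for entities.txt with only a few entries, and `to_hex` always emits hex references, as the command does when TM_XHTML is set.

```ruby
# Assumption: a tiny stand-in for the bundle's entities.txt lookup table.
ENTITY_TO_CHAR = { 'eacute' => "\u00E9", 'copy' => "\u00A9", 'amp' => '&' }

# Turn named, decimal, and hex references back into characters.
# Unknown names are left exactly as they were found.
def decode(text)
  text.gsub(/&(?:([a-z0-9]+)|#([0-9]+)|#x([0-9a-f]+));/i) do |m|
    if $1
      ENTITY_TO_CHAR[$1] || m
    else
      [$2 ? $2.to_i : $3.hex].pack('U')
    end
  end
end

# Re-encode non-ASCII and special characters as hex references.
def to_hex(text)
  text.gsub(/[^\x00-\x7F]|["'<>&]/) { |ch| format('&#x%02X;', ch.unpack('U')[0]) }
end

def transcode(text)
  to_hex(decode(text))
end

puts transcode('caf&eacute; &copy; 2009')  # caf&#xE9; &#xA9; 2009
puts transcode('&#233; and &#xe9;')        # &#xE9; and &#xE9;
```

Decimal and lowercase hex references are normalized along the way, so everything ends up in the same uppercase hex form.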

In the Scope section, put text.html. If you want it to be available with the same keyboard command as the other Convert to Entities functions, click inside the Activation area and press Command+& (Command+Shift+7). Close the Bundle Editor and you're done.

Conclusion

It would be nice for the "Convert Character / Selection to Entities" and "Convert Character / Selection to Entities Excl. Tags" commands to be able to detect the doctype like Tidy does instead of having to rely on TM_XHTML. If anyone wants to contribute those changes, I'd be happy to post them. All three of these commands are also available on GitHub.
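
As a starting point for anyone who wants to take that on, here is a rough sketch of what the doctype check might look like. Both helper names are hypothetical, and the policy (an explicit TM_XHTML wins, otherwise look at the doctype) is only one possible choice.

```ruby
# Hypothetical helper: true when the given document text declares one of
# the W3C XHTML doctypes (XHTML 1.0 Strict, Transitional, 1.1, and so on).
def xhtml_doctype?(source)
  !!(source =~ /<!DOCTYPE\s+html\s+PUBLIC\s+"-\/\/W3C\/\/DTD\s+XHTML\b/i)
end

# Hypothetical policy: an explicit TM_XHTML setting wins; otherwise fall
# back to whatever the doctype says.
def use_hex_entities?(source, env = ENV)
  env['TM_XHTML'] ? true : xhtml_doctype?(source)
end

strict = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" ' \
         '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
puts use_hex_entities?(strict, {})  # true
```

One catch: a command only receives the selection on STDIN, which usually will not include the doctype, so a real version would likely have to read the whole file (TextMate exposes its path as TM_FILEPATH) before deciding.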