
File updated 22/03/05 10:41

What is htmltag?
System requirements
Htmltag philosophies
Conditions of use
Setting up the htmltag macro ready for use
Converting Word files with htmltag
Style conversion information
Heading 3 - set by default to generate top of file hyperlinks
Htmltag and hyperlinks
Converting tables with htmltag
Setting up a .css file for use with htmltag
Extracting images from Word files
Troubleshooting Information
Contact details

Clean, automated html production from simple Word files

Jos Kingston

What is htmltag?

When you save as a Web page from Word, Microsoft assumes that what you want is to replicate as precisely as possible what the original Word document looks like. This applies whether you select the "full" or the "filtered" html option. But loyalty to individualised formatting is at loggerheads with the objective of producing clean, flexible html for a website where all pages take their text formatting specifications from a standard .css stylesheet. If that is what you want, you need a different approach.

System requirements

Htmltag philosophies

1. Most authors prefer working in Word

Even where documents are being produced with a view to use as web pages, it often makes sense to work in Word, and circulate documents in Word format for editing and proofing.

But difficulties can arise in situations where the final document is handed over to a separate Web team for html conversion. After they have spent time hacking around to produce a decent html version from the Word document, they have no inclination to do that work all over again every time the author makes changes to the document. Consequently, the html version becomes the definitive version, and the author no longer has the option to update and edit the definitive version in Word.

The solution to this is for the author to have the capacity to submit their material in squeaky-clean html ready to receive all its formatting from the website's standard css, and to be able to regenerate the html on the spot from within Word whenever changes are required to the document.

2. Best practice in formatting documents is essentially the same in Word and html

Consistent application of styles is the key, and htmltag requires this.

3. Learning to write and use macros in Word is empowering

It's partly in the hope of demystifying macros that htmltag is being distributed as code rather than a compiled program. There will be many occasions where customising the macro (for example to handle different stylenames) will make it a more useful tool.

4. Htmltag isn't designed for casual users

Htmltag isn't foolproof, and the process of learning to work within its limitations may take the user some time. It's probably not worth the effort unless you need to convert Word documents to html on a regular basis, and definitely not worth the effort unless you're prepared to apply styles consistently to Word documents.

Conditions of use

The necessary components can be downloaded direct from the link in the next section.

The macro code is entirely open source. However, Jos Kingston hereby asserts her moral rights of authorship as laid down in the Copyright, Designs and Patents Act 1988. These rights include the right to be identified as the author of htmltag and the right not to have this work "subjected to derogatory treatment" - for example "addition, deletion or alteration prejudicial to the honour or reputation of the author."

This assertion is not intended to discourage customisation of the macro for personal use or to restrict its distribution. I have decided to make the macro freely available at this stage because I was diagnosed in December 2004 with a terminal cancer, and thought that a few people would find it useful or at least interesting.

There are some bits of code which I would like to have the energy to clean up a bit further, but unfortunately I don't! Details of known problems are included in Troubleshooting Information. If you do any work to improve the code, I would like to receive your revised versions of the macro. If you find the macro useful, please go to https://www.bmycharity.com/V2/joskingston and make a donation.

Please remember: it isn't the case that all Word files will produce validating html when you use htmltag. You must check your files with a reputable validating utility. The one at http://validator.w3.org/ is reputable, quick and easy to use, and lets you validate files stored on your local hard disk. Even where documents have been appropriately formatted using htmltag-recognised style names, there may be formatting quirks which I haven't encountered and which therefore aren't taken into account by the macro. If you use htmltag regularly, you should be able to sort workarounds in how you format your documents; or, if you have some understanding of VBA, customise the macro accordingly.

Setting up the htmltag macro ready for use

The macro, plus all the required htmltag style names, need to be available in a Word template (.dot) file. It can then be run from any Word document which is attached to the template. Before first use, you need to assemble the template file. This isn't being distributed ready-to-use for security reasons. Macros can be written to infect your PC with a virus, so encouraging people to download .dot files is generally a bad idea. Htmltag is intended for those who have a good level of computer knowhow, so if the instructions below are daunting you are recommended to decide now that it's not for you!

  1. Download the three required files from here.
  2. In Word, open the file htmltaguser.doc
  3. From the Word File menu, select Save As. Set Save As Type to Document Template (.dot). Word will automatically set the Save As folder to its default template location. You can change this if you want, but in future you will find it quicker to attach files to the template if you leave it at the default.
  4. Save this file with the name htmltag.dot. As long as you keep the .dot extension, you can call it something different if you want, but these instructions assume that htmltag.dot is the name of the htmltag template file.
  5. Select all the text in htmltag.dot and delete it. This leaves you with a template file containing just the style definitions. Change these style definitions if you want your Word documents to reflect your own formatting decisions - it's the style names which are important to htmltag, not how the styles are formatted. If you want headers or footers included in your Word template, just set them up as usual. They will then appear in all Word documents based on the template, but htmltag will ignore them when you convert to html.
  6. Check your Word configuration settings. If your Word configuration has been left at the default settings, new styles constantly get created "on the fly". Styles with names such as "Heading 1 char" are a symptom of this. It makes the whole process of using styles thoroughly confusing, and could also prevent htmltag working correctly.

    To prevent this, make the following changes in Word:
    Tools | Autocorrect | Autoformat
    - under "Automatically as you type", switch off "Define styles based on your formatting"

    Additionally in Word 2002 and later: Tools | Options | Edit
    - under "Editing options", switch off "keep track of formatting"

    You may find that these changes don't take effect until you close Word and reload.
  7. With htmltag.dot still your active file, from the Word Tools Menu, select Macro | Macros. Set Macros In to htmltag.dot. Now click the Create button. You must supply a name for a macro at this stage. It doesn't matter what this is - just x will do. In the following steps, the macro "shell" which Word sets up for you automatically will be replaced with the htmltag macro code.
  8. When the Word Visual Basic window opens, delete all its contents so you have an empty window ready to paste into.
  9. Open the file htmltagcode.txt, either in Word or in a text editor like Notepad. Select all the contents of the file, and copy it to the clipboard. Before going any further, general good practice when working with macros from "unknown" sources dictates that you should look through the macro code to satisfy yourself that there's nothing dangerous about it.
  10. Return to the Visual Basic window in htmltag.dot, paste the htmltag macro text, save the file again, and close the Visual Basic window.
  11. Htmltag can't run if you have Word macro security set to High. With htmltag.dot still open, from Tools | Options | Security, click the Macro Security button and select Medium or Low. If you select Medium, you will be prompted to OK whenever you open a file containing, or attached to a template containing, htmltag or any other macro.
  12. The third file which you should have downloaded is htmltag.css - see Setting up a .css file for use with htmltag for further information.

Converting Word files with htmltag

You can use htmltag on existing files by attaching to the htmltag.dot template as described below. New files can be created based on the htmltag.dot template - this way, you will have all the htmltag-recognised style names available to you as you work and can thus avoid the pre-conversion preparation which is otherwise likely to be required.

Preparing your own files for conversion with Htmltag

If you want to convert multiple documents for a website, before you convert:

Style conversion information

The following are htmltag-recognised styles. They will all be available when a document has been attached to the htmltag template with "Update Styles" selected. In Word, they can be formatted however the user wants to format them. In html, their formatting will be dictated by the settings in your .css file, which you can change as you wish.

Normal style + all unrecognised stylenames become paragraph text <p>.

Heading 1 - 5 styles become <h1>, <h2> etc.

Heading 1

Heading 2

Heading 3 - set by default to generate top of file hyperlinks

If you have a basic understanding of VBA, you will be able to tweak the htmltag macro to change the heading level from which top of file hyperlinks are generated. With more advanced VBA knowhow, you could edit the macro to generate top of file hyperlinks from more than one heading level.
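
To show what generating top of file hyperlinks involves, here is a minimal sketch in Python rather than the macro's VBA. It is not htmltag's actual code: the function name, the anchor naming scheme (sec1, sec2 and so on) and the representation of a document as (style name, text) pairs are all assumptions made for illustration. Each non-blank paragraph at the chosen heading level gets an anchor name and an entry in a block of links for the top of the file. Unlike htmltag, the sketch skips blank headings.

```python
from html import escape


def top_of_file_links(paragraphs, link_style="Heading 3"):
    """Return (links_html, anchors) for the chosen heading level.

    paragraphs: list of (style_name, text) tuples, a hypothetical
    stand-in for the paragraphs of a Word document.
    anchors maps the index of each linked paragraph to its anchor name.
    """
    links = []
    anchors = {}
    for index, (style, text) in enumerate(paragraphs):
        if style != link_style or not text.strip():
            continue  # other styles and blank headings produce no link
        name = f"sec{len(anchors) + 1}"  # assumed naming scheme
        anchors[index] = name
        links.append(f'<a href="#{name}">{escape(text)}</a><br />')
    return "\n".join(links), anchors
```

Changing link_style to "Heading 2" is the equivalent, in this sketch, of changing the heading level from which the links are generated.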

Heading 4

Heading 5

The macro can be tweaked to add more heading levels if required. To be effective, all Heading levels which are included must be matched by style definitions in your .css file.

Indent style becomes <blockquote>.

Here's what indent style looks like.

You're especially likely to want to use it for quotations. Note that if you have defined a style for quotations in a Word document, you can switch them all in one go to htmltag Indent style using Word's ability to find and replace styles. In the Find and Replace window, click Format and select Styles. Set it to replace (say) Quotation with Indent, with the Find and Replace text boxes left blank.

Bullet style becomes <ul> (unordered list).

Numlist style becomes <ol> (ordered list).

Bulleted and numbered lists set to htmltag stylenames will maintain hanging indents after conversion.

  1. Here's Numlist style...

    Numbered lists will all start at 1 after conversion. They can be quickly amended in (say) Dreamweaver.

Numlist2 style becomes <blockquote><ol>

  1. Here's Numlist2. Note that it's a good idea to be sparing in your use of Numlist2 if you want your page to be usable at very narrow browser window widths - desirable if, for example, your page consists of software help.

Bulleted and numbered lists set from the Toolbar, without applying the Numlist or Numlist2 style, won't be honoured by htmltag. You can define formatting for Numlist and Numlist2 exactly as you want in the Word file, but Word (especially 2002) is very unwilling to forsake control over indenting of numbered lists and will probably cause you some grief whatever the formatting you specify.

The good news is that Htmltag will simply ignore how Word has indented your numbered lists - it just tags them according to which of the two numbered list styles you have applied. If you link your html file to a cascading style sheet, this can define indent positioning etc. for the <ol>, <ul> and <blockquote> tags.

Numbered lists converted by htmltag will always restart at 1 after a break. You need to tweak in Dreamweaver after conversion to reset. It would be possible to improve the macro to resolve this.
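
The style-to-tag rules described above can be summarised in a short Python sketch. This is an illustration only, not a translation of the macro: the function and dictionary names are made up, a document is assumed to be a list of (style name, text) pairs, and Numlist2 and character styles are left out to keep it short. Because a list is closed as soon as a paragraph in a different style appears, two runs of Numlist separated by ordinary text come out as two separate <ol> elements, which is why numbering restarts at 1.

```python
from html import escape

# Paragraph styles and the tags the text above says they become.
BLOCK_TAGS = {
    "Heading 1": "h1", "Heading 2": "h2", "Heading 3": "h3",
    "Heading 4": "h4", "Heading 5": "h5", "Indent": "blockquote",
}
LIST_TAGS = {"Bullet": "ul", "Numlist": "ol"}


def tag_paragraphs(paragraphs):
    """Convert (style_name, text) tuples to a list of html lines."""
    out = []
    open_list = None  # tag of the list currently open, if any
    for style, text in paragraphs:
        list_tag = LIST_TAGS.get(style)
        if open_list and list_tag != open_list:
            out.append(f"</{open_list}>")  # the run of list items has ended
            open_list = None
        if list_tag:
            if not open_list:
                out.append(f"<{list_tag}>")
                open_list = list_tag
            out.append(f"<li>{escape(text)}</li>")
        else:
            # Normal style and all unrecognised stylenames fall back to <p>
            tag = BLOCK_TAGS.get(style, "p")
            out.append(f"<{tag}>{escape(text)}</{tag}>")
    if open_list:
        out.append(f"</{open_list}>")
    return out
```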

Word uses two different types of style - paragraph styles, which apply to a paragraph in its entirety, and character styles, which apply to selected characters within a paragraph. (Any string of text after which you have pressed the Enter key is a paragraph - heading styles are paragraph styles.) In addition to the paragraph styles described above, htmltag can convert the following character styles.

Htmltag and hyperlinks

Converting tables with htmltag

Here's an example table - blank cells are honoured as in the Word version.

| Species              | Date and location        | 2004          | 2003           | 2002           | 2001          |
|----------------------|--------------------------|---------------|----------------|----------------|---------------|
| Red-throated Diver   | Rousay, Orkney, 7/01     |               |                |                | Orkney, 7     |
| Great Northern Diver | Rousay, Orkney, 12/92    |               |                |                |               |
| Little Grebe         | Thyburgh CP, 1/02        | Bakewell, 2   | Birchington, 4 | Thyburgh CP, 1 |               |
| Great Crested Grebe  | Norfolk, 7/01            |               | Stodmarsh, 4   | Norfolk, 7     | Norfolk, 7    |
| Red Necked Grebe     | Carsington, Derbys, 1/98 |               |                |                |               |
| Fulmar               | Orkney, 7/01             |               | Birchington, 4 | Bamburgh, 4    | Orkney, 7     |
| Sooty Shearwater     | Shetland, 6/78           |               |                |                |               |
| Manx Shearwater      | Hoy, Orkney, 7/01        |               |                |                | Orkney, 7     |
| Gannet               | Orkney, 7/01             |               |                |                | Orkney, 7     |
| Cormorant            | Orkney, 7/01             | Wells, 3      | Birchington, 4 | Norfolk, 7     | Norfolk, 7    |
| Shag                 | Orkney, 7/01             |               |                |                | Orkney, 7     |
| Little Egret         | Breydon, Norfolk, 7/02   | Wells, 3      | Cuckmere, 1    | Norfolk, 7     | Stodmarsh, 12 |
| Grey Heron           | Rousay, Orkney, 7/01     | Hathersage, 3 | Stodmarsh, 4   | Norfolk, 7     | Norfolk, 7    |
| Cattle Egret         | Brighton                 |               |                |                |               |
| Mute Swan            | Rousay, Orkney, 7/01     | Wells, 3      | Birchington, 4 | Norfolk, 7     | Norfolk, 7    |
| Whooper Swan         | Rousay, Orkney, 7/01     |               |                |                | Orkney, 7     |
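
As an illustration of what honouring blank cells means in the html, here is a small Python sketch. It is not taken from the macro: the function name is hypothetical, the table is assumed to arrive as a list of rows of cell strings, and treating the first row as a header row and filling blank cells with a non-breaking space are choices made for this example only.

```python
from html import escape


def table_to_html(rows):
    """Render a list of rows (lists of cell strings) as an html table.

    The first row is treated as the header row (an assumption of this
    sketch). Blank cells are written as cells holding a non-breaking
    space, so later cells stay in their original columns.
    """
    lines = ["<table>"]
    for row_number, row in enumerate(rows):
        cell_tag = "th" if row_number == 0 else "td"
        cells = "".join(
            f"<{cell_tag}>{escape(cell) or '&nbsp;'}</{cell_tag}>" for cell in row
        )
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

The point to note is that every row keeps the same number of cells, blank or not, so a 2001 sighting never slides left into the 2004 column.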

Setting up a .css file for use with htmltag

It is assumed that htmltag users will understand the basics of using .css files to define styles. Your CSS file must include specifications for all the heading levels used by htmltag, and for lists, blockquotes etc. You can download a sample htmltag.css by right-clicking here. CSS files are simple text files and can be edited in Notepad - you can redefine settings however you want.

I have noted that the same .css style sheet gives different results in Firefox and Internet Explorer with regard to indentations for lists and blockquotes. Any illumination on this point will be welcome.

Tweaking the macro to call your own stylesheet

You can get straight into the macro code from any document which has been attached to htmltag.dot. You are advised to ensure that you have an unedited copy of htmltag to return to if necessary.

Extracting images from Word files

Htmltag simply ignores images when converting. If you want to include images in an htmltag-converted file, you'll need to quickly tweak the html in Dreamweaver or however. Htmltag does honour links to image files. Your work will be much easier if you bear in mind that your working folder structure should exactly replicate the folder structure of your destination website.

In order to extract images at optimum quality from Word documents, the following knowhow is useful:

Troubleshooting Information

Under Windows 2000, Htmltag terminates at the end when the text file is saved as html

On PCs at work (Sheffield Hallam University) with a standard Windows 2000 image, htmltag ran with no problems during the academic year 2003/4. Since the PCs were reimaged for 2004/5, again with Win2K but including some later Windows security patches etc, it has fallen over at the end of the macro when the temporary text file generated by htmltag is saved with the .html extension. This has also been reported on a tester's Win2K setup. A simple workaround in the macro may well be all that is needed, but unfortunately it isn't possible for me to sort this out. Later Win 2K/Word updates may have resolved the problem, which isn't present in XP.

If the macro terminates this way, it's probably best to give up on it unless you're a VBA whizzo. You may find that you can retrieve the html code from the temporary file temptext.txt which is generated by htmltag, and should be available in the same folder as the file you have converted. (If htmltag runs correctly, this file is automatically deleted right at the end.) You can quickly check that this is likely to be good html by whether there's anything after the closing </html> tag - if there is, the conversion process didn't complete before the program termination.
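
The check described above is easy to automate. The following Python sketch is one way of doing it and is not part of htmltag: the function name is made up, and treating a file with no closing </html> tag at all as incomplete is an assumption of the sketch.

```python
import re


def conversion_completed(html_text):
    """Return True if only whitespace follows the first closing </html> tag."""
    match = re.search(r"</html\s*>", html_text, flags=re.IGNORECASE)
    if match is None:
        return False  # assumption: no closing tag means an incomplete file
    return html_text[match.end():].strip() == ""
```

To use it, read the contents of temptext.txt into a string and pass that string to the function. A result of False suggests the conversion stopped before it had finished.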

Program terminates: Requested member of the Collection does not exist

Program terminates and window opens with this message. This will happen if not all the htmltag styles are available in your document. This is likely to be the case if you run htmltag on existing documents not initially based on the template htmltag.dot, and you didn't check the "Update styles" box when you attached the htmltag template to your file.

Solution: end the program, close the file tempdoc.doc, and return to the file which you were running htmltag from. This will not have been changed. From Tools | Templates and Add-ins, check the Update Styles box and OK. You can now run htmltag again, and all the required styles will be found.

This behaviour has deliberately not been eliminated so that users are forced to work with the htmltag macro as the author intended.

Program terminates with "String Parameter too long" message

This happens if you set up a doclink to the last of the hyperlink-generating headings in your file (by default, the last heading 3). The cause is that the doclinks part of the macro works in a silly way which needs changing. (See later in this troubleshooting section.)

Solution: end the program, close the file tempdoc.doc, and return to the file which you were running htmltag from. Workaround by avoiding internal links to the last section for now, and re-run htmltag.

Bulleted and numbered lists aren't appearing as hanging indents

You must apply htmltag's stylenames to all lists in order for them to be tagged as lists during html conversion.

Solution: edit your Word document to set lists to recognised htmltag styles. (Bullet, Numlist, or Numlist2.) Note: This is a much quicker process than the sorting out you're likely to need on lists in Dreamweaver if you currently generate Word html then import into DW to clean. Further note: numbered lists not holding their formatting are a common cause of grief in Word (especially if you transfer between different PCs - apparently this is because some Word numbered list settings are stupidly held in the PC's Registry). Unfortunately you may encounter irritations when you set numbered lists to htmltag styles. But however stupid Word makes the indentation, formatting of Bullet and Numlist styles will be consistent (as set in the .css) in the htmltag version.

All new lists restart at number 1

This is a limitation of htmltag which can be quickly corrected where required by editing the html in Dreamweaver or however.

Numbers aren't being carried through from outline numbered headings

In Word, the numbers as such aren't part of the file's text if you have used outline numbering - they are auto-generated fields in the file.

Solution: It's useful knowhow that headings set to outline numbering are converted into hard numbers if you save the file in Word 2 format. I have written an additional pre-conversion macro making use of this Word "feature" to run pre-htmltag. Outline numbering in your original Word document is maintained. But this needs customising to suit different usages of outline numbered styles. It isn't anticipated that people without macro knowhow will be able to make use of it. Click here for more information.

End paragraph tag missing prior to a table, hence subsequent code doesn't validate

This may happen when there isn't an empty line formatted in normal style between a table and the text preceding it. The missing </p> can lead to a host of validation errors in subsequent html. I hope that I have fixed the problem in most instances, but watch out for it.

Solution: edit your Word documents to add a carriage return before all tables.

First line of text directly following a table gets included in the table, and code consequently doesn't validate.

This may happen when there isn't an empty line formatted in normal style between a table and the following text, but I hope that I have fixed the problem.

Solution: edit your Word documents to add a carriage return after all tables.

Blank lines in top-of-file hyperlinks

This happens if a blank line has been formatted as a heading style at whichever level is set in htmltag to generate the file hyperlinks. (By default, Heading 3 level.) Note that you will similarly get unwanted results if you generate a Table of Contents in Word from a file with blank lines formatted as heading styles.

In general, blank lines set to any style other than Normal are bad news as far as htmltag is concerned and can cause formatting failures in conversion.

Solution: edit your Word documents to remove or set to normal style, any blank lines set to a heading style.
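
A pre-conversion check for this problem could look something like the Python sketch below. It only illustrates the rule; it does not read Word files. The function name is hypothetical, and the document is assumed to have been reduced already to a list of (style name, text) pairs.

```python
def blank_styled_paragraphs(paragraphs):
    """Return 1-based positions of blank paragraphs not set to Normal style."""
    return [
        number
        for number, (style, text) in enumerate(paragraphs, start=1)
        if not text.strip() and style != "Normal"
    ]
```

Any position it reports is a blank line that should be deleted or reset to Normal style before running htmltag.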

Incorrect nesting of bold and italic tags causing validation errors.

This was happening in this troubleshooting section of the document, where I had the space following "Solution:" set to italic.

Solution: In this case, with the colon and space after "Solution" set to not bold, not italic, correct nesting was achieved.

The "Doclink" style is unreliable and cumbersome as a way of setting internal document links.

Solution: Doclink is indisputably basically naff. Inevitably if you use the Doclink style, at some point you'll change a heading without remembering to change all doclink instances which link to that heading. And then things will go haywire - you'll get a chunk of the wrong text plonked in instead of a doclink. Check the html version carefully wherever you have linked to an internal heading. Other points about htmltag doclink style: it usually creates internal links correctly in the middle of a paragraph but sometimes doesn't work correctly if the link is on a line of its own.

A competent VBA programmer could almost certainly add the capability to convert bookmarks to internal html links, hence eliminating any temptation to use the awful doclink.

Hyperlinks go "out of sync" - appear in the wrong places and to the wrong links

This has been encountered where the problem was caused by a graphic in the Word document containing a "behind the scenes" internal hyperlink. If the document was set to view Field Codes (Tools | Options | View), the "rogue" graphic was no longer displayed but appeared as a field code {SHAPE ..\*.MERGEFORMAT}.

Htmltag has not been tested with Word documents containing pictures other than inserted bitmaps, which are ignored. "Rogue" behaviour as described above has been replicated where graphic objects are nested within one another (e.g. a picture fill in a drawing shape), and these objects are cut and pasted. On the basis of this experience, similar results may occur where a document contains complex graphic objects (charts pasted from Excel, drawing canvas objects etc.)

Solution in this particular case was to delete the field, then re-insert the graphic from file and check it no longer showed as a field. (Paste Special to a bitmap format didn't stop the graphic behaving as a field.) Htmltag conversion then worked correctly. Specific "shape - mergeformat" behaviour could be addressed by simple additions to the macro if htmltag was being used in contexts where this limitation was a regular nuisance. However htmltag is only intended for use with simple Word files which contain no more in the way of graphics than inserted bitmap images.

My html doesn't validate - complains about "&"

Htmltag converts all &s in the text to &amp;. The problem will arise if, when htmltag prompts you for the page title, you add a title which includes & - validation will fail. Things will be fine if you type &amp; in the title prompt, not just &.
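
If you would rather repair a title after the event, the Python sketch below shows one way to escape a bare & while leaving an existing &amp; or numeric reference alone. It is an illustration and not something htmltag does: the names are hypothetical, and the regular expression is a simplification that accepts any name-like sequence ending in a semicolon as an existing entity.

```python
import re

# A "&" that does not already start a character reference such as
# &amp; or &#169; or &#xA9;
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_title(title):
    """Replace each bare ampersand in a page title with &amp;."""
    return BARE_AMPERSAND.sub("&amp;", title)
```

Running a title through the function twice gives the same result as running it once, so it is safe to apply to titles that were typed correctly in the first place.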

My html validated straight from htmltag, but not after I'd tweaked it in Dreamweaver

Watch out with Dreamweaver for default tagging which doesn't validate to xhtml 1.0 Transitional. In particular, note <br> as opposed to <br />, and absolute height specs in tables. Dreamweaver will also strip out the / which is required for validation as xhtml 1.0 Transitional from any <head> items you edit. For example, this happens if you attach to your .css within Dreamweaver (which you may wish to do for the sake of greater "wysiwyg" in Dreamweaver). There may be ways to avoid this in Dreamweaver versions later than 4 - this hasn't been tested.
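
Putting back the missing / can be done mechanically. The Python sketch below is one possible approach, not a tested cure for Dreamweaver's output: the names are hypothetical, the list of empty elements is deliberately short, and a regular expression like this will be fooled by a > inside an attribute value. It does nothing about height specs in tables.

```python
import re

# Empty elements that must be written as <tag ... /> in xhtml.
VOID_TAGS = ("br", "hr", "img", "link", "meta", "input")
UNCLOSED = re.compile(
    r"<(" + "|".join(VOID_TAGS) + r")\b([^<>]*?)\s*/?>", re.IGNORECASE
)


def close_void_tags(html_text):
    """Rewrite <br>, <link ...> and similar tags in the <tag ... /> form."""
    return UNCLOSED.sub(
        lambda m: f"<{m.group(1).lower()}{m.group(2)} />", html_text
    )
```

Tags which are already closed correctly come out unchanged. You should still revalidate the file afterwards.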

Contact details

Jos@Joskingston.org

Please get in touch if you find htmltag useful or have made any improvements to the macro - I may not be dead yet! You can quickly check at http://www.joskingston.org/Terminal/index.html. As at December 2005, I can see from my web statistics that a steady trickle of people are downloading the macro and it would be very interesting to know how they get on with it. However, I don't intend to undertake any further programming work on htmltag myself, and I'm unlikely to be able to provide troubleshooting support. The biggest frustration to my mind is that as it stands, htmltag is unusable with versions of Windows 2000 where post-2003 security patches have been installed, despite being perfectly fine with XP. I have a feeling that a quick VBA tweak could provide a workaround and it's really most frustrating that my brain simply can't bend itself round such things any more!