Java API for XML Processing (JAXP) – DOM

What is DOM

DOM (Document Object Model) is an object representation of an XML, HTML or XHTML document. In this tutorial we deal only with XML. DOM represents the XML as a document tree. JAXP provides an API for working with DOM in Java. It also provides a parsing interface into which different parsers can be plugged (the JDK provides a default implementation).
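
To make the idea of a document tree concrete, here is a minimal sketch that parses a small XML string with the default parser and prints one line per node, indented by depth. The class name DomTreeSketch, its helper methods and the sample XML are hypothetical and only serve as illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class (not part of JAXP): parses a string and renders the tree.
final class DomTreeSketch {

	// Parses an XML string with the default JAXP DOM parser.
	static Document parse(String xml) {
		try {
			DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
			return builder.parse(new InputSource(new StringReader(xml)));
		} catch (Exception e) {
			throw new IllegalStateException("Could not parse XML", e);
		}
	}

	// Appends the node name, indented by depth, then recurses into the children.
	static void describe(Node node, int depth, StringBuilder out) {
		for (int i = 0; i < depth; i++) {
			out.append("  ");
		}
		out.append(node.getNodeName()).append('\n');
		for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
			describe(child, depth + 1, out);
		}
	}

	// Convenience wrapper: parse the XML and return the rendered tree.
	static String render(String xml) {
		StringBuilder out = new StringBuilder();
		describe(parse(xml), 0, out);
		return out.toString();
	}

	public static void main(String[] args) {
		System.out.print(render("<book><title>DOM</title></book>"));
	}
}
```

For this input the sketch prints #document, book, title and #text on successive lines, each indented one level deeper than the previous one. The Document node sits at the root, and the text "DOM" is a separate Text node under the title element.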

Classes

  • org.w3c.dom – Contains the interfaces that form the DOM representation of an XML document and its components. They include:
    • Document – Represents the entire XML or HTML document. It is the root of the document tree.
    • Element – Represents an element in an XML or HTML document. It has methods to access the element's attributes.
    • Attr – Represents an attribute of an Element.
    • CDATASection – Represents a CDATA section, a block of text that can contain characters that would otherwise be treated as markup.
    • Text – Represents the textual content of an Element or an Attr. If the content contains no markup, all of the text is held in a single Text node that is the only child of the element. If it does contain markup, the content is split into Text nodes and other nodes (elements, comments and so on) that are siblings under the same parent. A Text node itself never has children.
    • ProcessingInstruction – Represents a processing instruction in an XML document.
    • Comment – Represents a comment in an XML document. It holds the comment text.
  • javax.xml.parsers – Contains the classes used to obtain and configure DOM and SAX parsers:
    • DocumentBuilderFactory – An abstract factory class used to obtain DOM parsers (DocumentBuilder instances).
    • DocumentBuilder – An abstract class whose parse methods build a DOM Document tree from an XML document.
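
The node kinds listed above can be told apart at runtime with Node.getNodeType(). The following is a minimal sketch, assuming a made-up sample document. The class DomNodeTypesSketch and its methods are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class: maps DOM node type constants to interface names.
final class DomNodeTypesSketch {

	// Made-up sample: one attribute, a comment, a CDATA section and plain text.
	static final String SAMPLE = "<note id=\"1\"><!-- reminder --><![CDATA[a < b]]>plain</note>";

	// Returns the name of the org.w3c.dom interface that matches the node type.
	static String typeName(Node node) {
		switch (node.getNodeType()) {
		case Node.DOCUMENT_NODE:
			return "Document";
		case Node.ELEMENT_NODE:
			return "Element";
		case Node.ATTRIBUTE_NODE:
			return "Attr";
		case Node.TEXT_NODE:
			return "Text";
		case Node.CDATA_SECTION_NODE:
			return "CDATASection";
		case Node.COMMENT_NODE:
			return "Comment";
		case Node.PROCESSING_INSTRUCTION_NODE:
			return "ProcessingInstruction";
		default:
			return "Other";
		}
	}

	// Parses the XML and returns the root element.
	static Element root(String xml) {
		try {
			Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
					.parse(new InputSource(new StringReader(xml)));
			return doc.getDocumentElement();
		} catch (Exception e) {
			throw new IllegalStateException("Could not parse XML", e);
		}
	}

	// Lists the type names of the direct children of the root element.
	static List<String> childTypes(String xml) {
		List<String> types = new ArrayList<String>();
		for (Node n = root(xml).getFirstChild(); n != null; n = n.getNextSibling()) {
			types.add(typeName(n));
		}
		return types;
	}

	public static void main(String[] args) {
		System.out.println(childTypes(SAMPLE));
		System.out.println(typeName(root(SAMPLE).getAttributeNode("id")));
	}
}
```

With the factory's default settings (comments kept, CDATA sections not coalesced into text) the children of note are reported as Comment, CDATASection and Text, and the id attribute as Attr. Attributes are not children of the element, which is why they are reached through getAttributeNode rather than through the child list.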

JAXP DOM in action

Let's now see an example of a DOM representation of an XML document. In this example we look at the following:

  • Parsing the XML using the default DOM parser
  • Obtaining the root element
  • Obtaining all elements with a specific name
  • Obtaining all elements with a specific name and in a specific namespace
  • Iterating through all child nodes and reading them

package com.studytrails.xml.jaxp;

import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class JaxpDOMExample1 {

	private static String xmlSource = "http://feeds.bbci.co.uk/news/technology/rss.xml?edition=int";

	public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {
		JaxpDOMExample1 example = new JaxpDOMExample1();
		example.startParsing();

	}

	void startParsing() throws ParserConfigurationException, SAXException, IOException {

		// create the factory for the DocumentBuilder. The JDK ships with a
		// Xerces-based implementation as the default DOM parser.
		DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
		System.out.println(factory.getClass());
		// prints class
		// com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
		// we want the factory to be namespace aware (it is not by default).
		// This is important if the XML declares and uses additional namespaces
		factory.setNamespaceAware(true);
		// the actual builder or parser
		DocumentBuilder builder = factory.newDocumentBuilder();

		// the Document that represents the XML
		Document bbcDoc = builder.parse(xmlSource);

		// the root element.
		Element rootElement = bbcDoc.getDocumentElement();
		System.out.println(rootElement.getNodeName());
		// prints rss

		// search for an element using the name
		NodeList list = rootElement.getElementsByTagName("channel");
		// get the first item in the list
		Node channel = list.item(0);
		// get the child nodes
		NodeList channelChildren = channel.getChildNodes();
		int length = channelChildren.getLength();
		for (int i = 0; i < length; i++) {
			Node node = channelChildren.item(i);
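			// node type 1 (Node.ELEMENT_NODE) is an element. This check skips
			// the whitespace text nodes between the child elements
			if (Node.ELEMENT_NODE == node.getNodeType()) {
				if ("title".equals(node.getNodeName())) {
					// the text is held in a Text node that is the child of the element
					System.out.println(node.getFirstChild().getTextContent());
				}
			}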
		}

		// get all elements with the name 'link'. We just print the first link
		NodeList linkList = rootElement.getElementsByTagName("link");
		System.out.println(linkList.item(0).getFirstChild().getTextContent());

		// the feed also contains a link element in the Atom namespace:
		// <atom:link href="http://feeds.bbci.co.uk/news/technology/rss.xml"
		// rel="self" type="application/rss+xml"/>

		// get all elements with the name 'link' and in a specific namespace
		NodeList linkList2 = rootElement.getElementsByTagNameNS("http://www.w3.org/2005/Atom", "link");
		Node atomLink = linkList2.item(0);
		System.out.println(atomLink.hasAttributes()); // prints true
		NamedNodeMap atomLinkAttributes = atomLink.getAttributes();
		for (int i = 0; i < atomLinkAttributes.getLength(); i++) {
			Node atomLinkAttribute = atomLinkAttributes.item(i);
			System.out.println(atomLinkAttribute.getNodeName());
			System.out.println(atomLinkAttribute.getNodeValue());
			/*prints 
			 * href 
			 * http://feeds.bbci.co.uk/news/technology/rss.xml
			 * rel
			 * self
			 */
		}

		Node firstChildOfRoot = rootElement.getFirstChild();
		System.out.println(firstChildOfRoot.getNodeName());
		// prints #text

		Node siblingOfFirstChild = firstChildOfRoot.getNextSibling();
		System.out.println(siblingOfFirstChild.getNodeName());
		// prints channel

	}

}
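
The example above depends on a live feed, so its output changes over time. As a rough offline sketch of the same lookups, the following parses a small made-up feed from a string. The class OfflineFeedSketch, its methods and the feed content are hypothetical and only mirror the structure used above.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical offline variant of the example. The feed content is made up.
final class OfflineFeedSketch {

	static final String ATOM_NS = "http://www.w3.org/2005/Atom";

	static final String FEED =
			"<rss xmlns:atom=\"http://www.w3.org/2005/Atom\" version=\"2.0\">\n"
			+ "  <channel>\n"
			+ "    <title>Sample feed</title>\n"
			+ "    <link>http://example.com/</link>\n"
			+ "    <atom:link href=\"http://example.com/rss.xml\" rel=\"self\"/>\n"
			+ "  </channel>\n"
			+ "</rss>";

	// Parses the XML string with a namespace-aware DocumentBuilder.
	static Document parse(String xml) {
		try {
			DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
			factory.setNamespaceAware(true);
			DocumentBuilder builder = factory.newDocumentBuilder();
			return builder.parse(new InputSource(new StringReader(xml)));
		} catch (Exception e) {
			throw new IllegalStateException("Could not parse XML", e);
		}
	}

	// Returns the text of the first direct child element with the given
	// qualified name, or null. Whitespace text nodes are skipped.
	static String childText(Element parent, String name) {
		for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
			if (n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName())) {
				return n.getTextContent();
			}
		}
		return null;
	}

	// Returns the href of the first link element in the Atom namespace, or null.
	static String atomLinkHref(Document doc) {
		NodeList links = doc.getElementsByTagNameNS(ATOM_NS, "link");
		if (links.getLength() == 0) {
			return null;
		}
		return ((Element) links.item(0)).getAttribute("href");
	}

	public static void main(String[] args) {
		Document doc = parse(FEED);
		Element channel = (Element) doc.getElementsByTagName("channel").item(0);
		System.out.println(childText(channel, "title"));
		System.out.println(childText(channel, "link"));
		System.out.println(atomLinkHref(doc));
	}
}
```

Matching on the node name "link" finds only the plain link element, because the Atom element's node name is "atom:link". The namespace-based lookup finds the Atom element regardless of which prefix the document binds to that namespace.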
