Python - XML Processing

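XML is a portable, openly standardized markup language that lets programmers develop applications whose data can be read by other applications, regardless of operating system or development language.

What is XML?

The Extensible Markup Language (XML) is a markup language much like HTML or SGML. It is recommended by the World Wide Web Consortium and is available as an open standard.

XML is extremely useful for keeping track of small to medium amounts of data without requiring an SQL-based backbone.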

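XML Parser Architectures and APIs

The Python standard library provides a minimal but useful set of interfaces for working with XML.

The two most basic and widely used APIs for XML data are the SAX and DOM interfaces.

Simple API for XML (SAX) − Here, you register callbacks for events of interest and then let the parser proceed through the document. This is useful when your documents are large or you have memory limitations: the parser processes the file as it reads it from disk, and the entire file is never stored in memory.
Document Object Model (DOM) API − This is a World Wide Web Consortium recommendation in which the entire file is read into memory and stored in a hierarchical (tree-based) form to represent all the features of an XML document.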

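Because SAX processes a document sequentially and keeps no tree in memory, it cannot offer the random access that DOM provides, so tasks that need to revisit earlier parts of a document are harder to express. On the other hand, using DOM exclusively can consume a great deal of memory, especially on large files or when many files are processed at once.

SAX is read-only, while DOM allows changes to the XML file. Since these two APIs complement each other, there is no reason why you cannot use both of them in a large project.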
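
As a rough illustration of how the two APIs can complement each other, the sketch below uses SAX for a read-only counting pass and DOM for an in-memory modification. The sample string and the names MovieCounter, count_movies, and rename_shelf are hypothetical and exist only for this example.

```python
import xml.sax
import xml.dom.minidom

# Hypothetical in-memory sample; not the movies.xml file used later.
SAMPLE = '<collection shelf="New Arrivals"><movie title="A"/><movie title="B"/></collection>'


class MovieCounter(xml.sax.ContentHandler):
    """Counts <movie> elements as the parser streams past them."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, tag, attributes):
        if tag == "movie":
            self.count += 1


def count_movies(xml_text):
    # SAX: a read-only, event-driven pass over the document.
    handler = MovieCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.count


def rename_shelf(xml_text, new_shelf):
    # DOM: the whole tree is in memory, so it can be changed and re-serialized.
    tree = xml.dom.minidom.parseString(xml_text)
    tree.documentElement.setAttribute("shelf", new_shelf)
    return tree.documentElement.toxml()


if __name__ == "__main__":
    print(count_movies(SAMPLE))
    print(rename_shelf(SAMPLE, "Classics"))
```

The counting pass never builds a tree, while the rename needs one; that split is the usual reason to reach for one API or the other.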

For all our XML code examples, let us use a simple XML file, movies.xml, as input:

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

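Parsing XML with SAX APIs

SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX generally requires you to create your own ContentHandler by subclassing xml.sax.ContentHandler.

Your ContentHandler handles the particular tags and attributes of your flavor(s) of XML. A ContentHandler object provides methods to handle various parsing events. Its owning parser calls the ContentHandler methods as it parses the XML file.

The methods startDocument and endDocument are called at the start and the end of the XML file. The method characters(text) is passed the character data of the XML file through the parameter text.

The ContentHandler is called at the start and end of each element. If the parser is not in namespace mode, the methods startElement(tag, attributes) and endElement(tag) are called; otherwise, the corresponding methods startElementNS and endElementNS are called. Here, tag is the element tag, and attributes is an Attributes object.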
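
To make the callback order concrete, here is a minimal sketch that records every event the parser reports. The class name EventLogger and the helper log_events are hypothetical names chosen for this illustration; the overridden methods are the ContentHandler callbacks described above.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each callback so the order of events can be inspected."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, tag, attributes):
        # attributes behaves like a read-only mapping of names to values.
        self.events.append(("startElement", tag, dict(attributes.items())))

    def endElement(self, tag):
        self.events.append(("endElement", tag))

    def characters(self, content):
        # Skip whitespace-only chunks to keep the log readable.
        if content.strip():
            self.events.append(("characters", content))


def log_events(xml_bytes):
    handler = EventLogger()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


if __name__ == "__main__":
    for event in log_events(b'<movie title="Ishtar"><format>VHS</format></movie>'):
        print(event)
```

Running it on a small document shows the document-level events wrapping the element events, with character data reported between each start and end tag.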

Here are other important methods to understand before proceeding:

The make_parser Method

The following method creates a new parser object and returns it. The parser object created will be of the first parser type the system finds.

xml.sax.make_parser( [parser_list] )

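Here are the details of the parameters:

parser_list − An optional argument consisting of a list of parser module names to try; each named module must implement a create_parser() function.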
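
A minimal sketch of using make_parser directly, assuming the default parser list is acceptable. TagCollector and collect_tags are hypothetical names for this example; the input is wrapped in an in-memory byte stream so the sketch does not depend on any file.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces


class TagCollector(xml.sax.ContentHandler):
    """Remembers the name of every element in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, tag, attributes):
        self.tags.append(tag)


def collect_tags(xml_text):
    parser = xml.sax.make_parser()                 # first parser the system finds
    parser.setFeature(feature_namespaces, False)   # use startElement, not startElementNS
    handler = TagCollector()
    parser.setContentHandler(handler)
    # parse() accepts a file name or a file-like object.
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.tags


if __name__ == "__main__":
    print(collect_tags("<a><b/><c/></a>"))
```

Creating the parser yourself is mainly useful when you want to set features or reuse one configured parser; for a one-off parse, the module-level helpers described next are shorter.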

The parse Method

The following method creates a SAX parser and uses it to parse a document.

xml.sax.parse( xmlfile, contenthandler[, errorhandler])

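Here are the details of the parameters:

xmlfile − The name of the XML file to read from (a file-like object is also accepted).
contenthandler − This must be a ContentHandler object.
errorhandler − If specified, errorhandler must be a SAX ErrorHandler object.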
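
The sketch below shows one way the optional errorhandler argument might be used: a handler that records fatal errors and then re-raises them so that parsing stops. RecordingErrorHandler and is_well_formed are hypothetical names, and the input is passed as an in-memory byte stream rather than a file name.

```python
import io
import xml.sax


class RecordingErrorHandler(xml.sax.ErrorHandler):
    """Records fatal errors, then re-raises so that parsing stops."""

    def __init__(self):
        self.messages = []

    def fatalError(self, exception):
        self.messages.append(exception.getMessage())
        raise exception


def is_well_formed(xml_bytes):
    """Return (ok, messages) for the given XML bytes."""
    errors = RecordingErrorHandler()
    try:
        # A plain ContentHandler ignores every event; only errors matter here.
        xml.sax.parse(io.BytesIO(xml_bytes), xml.sax.ContentHandler(), errors)
    except xml.sax.SAXParseException:
        return False, errors.messages
    return True, errors.messages


if __name__ == "__main__":
    print(is_well_formed(b"<a><b/></a>"))
    print(is_well_formed(b"<a><b></a>"))
```

Re-raising inside fatalError is a deliberate choice in this sketch: a document that is not well formed cannot be parsed meaningfully past the error.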

The parseString Method

There is one more method that creates a SAX parser and parses the specified XML string.

xml.sax.parseString(xmlstring, contenthandler[, errorhandler])

Here are the details of the parameters:

xmlstring − The XML string to read from.
contenthandler − This must be a ContentHandler object.
errorhandler − If specified, errorhandler must be a SAX ErrorHandler object.
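
As a small illustration of parseString, the following sketch collects movie titles from XML held in memory. TitleCollector and movie_titles are hypothetical names, and the text is encoded to bytes before parsing as a conservative choice.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the title attribute of every <movie> element."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, tag, attributes):
        if tag == "movie":
            # get() returns the fallback when the attribute is absent.
            self.titles.append(attributes.get("title", "(untitled)"))


def movie_titles(xml_text):
    handler = TitleCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles


if __name__ == "__main__":
    print(movie_titles('<collection><movie title="Trigun"/><movie/></collection>'))
```

This form is convenient when the XML arrives from a network response or a test fixture rather than from a file on disk.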

Example

#!/usr/bin/python3

import xml.sax

class MovieHandler(xml.sax.ContentHandler):
   def __init__(self):
      super().__init__()
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # Called when an element starts
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print("*****Movie*****")
         title = attributes["title"]
         print("Title:", title)

   # Called when an element ends
   def endElement(self, tag):
      if self.CurrentData == "type":
         print("Type:", self.type)
      elif self.CurrentData == "format":
         print("Format:", self.format)
      elif self.CurrentData == "year":
         print("Year:", self.year)
      elif self.CurrentData == "rating":
         print("Rating:", self.rating)
      elif self.CurrentData == "stars":
         print("Stars:", self.stars)
      elif self.CurrentData == "description":
         print("Description:", self.description)
      self.CurrentData = ""

   # Called when character data is read. The parser may deliver the text of
   # one element in several chunks; this simple handler keeps the latest one.
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
         self.description = content

if __name__ == "__main__":

   # create an XMLReader
   parser = xml.sax.make_parser()
   # turn off namespaces
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)

   # override the default ContentHandler
   Handler = MovieHandler()
   parser.setContentHandler(Handler)

   parser.parse("movies.xml")

This would produce the following result:

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom

For complete details on the SAX API, please refer to the standard Python SAX API documentation.
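
The MovieHandler example stores whatever the most recent characters() call delivered. Since a SAX parser is allowed to report the text of a single element in several calls, a somewhat more defensive variant might accumulate the chunks and join them when the element ends. The following is a sketch under that assumption; TextAccumulator and parse_movies are hypothetical names, and the result is returned as a list of dictionaries instead of being printed.

```python
import xml.sax


class TextAccumulator(xml.sax.ContentHandler):
    """Collects the full text of each child element, even if it arrives in chunks."""

    def __init__(self):
        super().__init__()
        self.movies = []     # one dict per <movie>
        self._buffer = []    # text chunks of the element being read

    def startElement(self, tag, attributes):
        self._buffer = []
        if tag == "movie":
            self.movies.append({"title": attributes.get("title")})

    def characters(self, content):
        self._buffer.append(content)

    def endElement(self, tag):
        text = "".join(self._buffer).strip()
        # Container elements carry no text of their own in this format.
        if tag not in ("movie", "collection") and self.movies:
            self.movies[-1][tag] = text
        self._buffer = []


def parse_movies(xml_bytes):
    handler = TextAccumulator()
    xml.sax.parseString(xml_bytes, handler)
    return handler.movies


if __name__ == "__main__":
    sample = (b'<collection><movie title="Ishtar">'
              b'<type>Comedy &amp; Farce</type><stars>2</stars>'
              b'</movie></collection>')
    print(parse_movies(sample))
```

Text containing an entity such as &amp; is a typical case where the parser may split the data, which is why the sample includes one.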

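Parsing XML with DOM APIs

The Document Object Model ("DOM") is a cross-language API from the World Wide Web Consortium (W3C) for accessing and modifying XML documents.

The DOM is extremely useful for random-access applications. SAX only lets you view one piece of the document at a time. If you are looking at one SAX element, you have no access to another.

Here is the easiest way to quickly load an XML document and create a DOM document object using the xml.dom.minidom module. The minidom module provides a simple parser function that quickly creates a DOM tree from the XML file.

The sample below calls the parse(file[, parser]) function of the minidom module to parse the XML file designated by file into a DOM tree object.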

#!/usr/bin/python3

import xml.dom.minidom

# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
   print("Root element : %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print detail of each movie.
for movie in movies:
   print("*****Movie*****")
   if movie.hasAttribute("title"):
      print("Title: %s" % movie.getAttribute("title"))

   # Each lookup returns a list of matching elements; take the first one
   # and read the data of its text child node.
   type_node = movie.getElementsByTagName('type')[0]
   print("Type: %s" % type_node.childNodes[0].data)
   format_node = movie.getElementsByTagName('format')[0]
   print("Format: %s" % format_node.childNodes[0].data)
   rating_node = movie.getElementsByTagName('rating')[0]
   print("Rating: %s" % rating_node.childNodes[0].data)
   description_node = movie.getElementsByTagName('description')[0]
   print("Description: %s" % description_node.childNodes[0].data)

This would produce the following result:

Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom

For complete details on the DOM API, please refer to the standard Python DOM API documentation.
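
The DOM example indexes [0] into each lookup, which would raise an IndexError for an element that a movie lacks (Trigun, for instance, has no year element in movies.xml). The sketch below shows one possible way to read optional child elements safely. The helper names child_text and years_by_title and the shortened sample document are hypothetical.

```python
import xml.dom.minidom

# Hypothetical two-movie sample mirroring the structure of movies.xml.
SAMPLE = """<collection shelf="New Arrivals">
<movie title="Enemy Behind"><year>2003</year></movie>
<movie title="Trigun"><episodes>4</episodes></movie>
</collection>"""


def child_text(element, tag, default=None):
    """Return the text of the first <tag> descendant, or default if absent or empty."""
    nodes = element.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.data
    return default


def years_by_title(xml_text):
    tree = xml.dom.minidom.parseString(xml_text)
    movies = tree.documentElement.getElementsByTagName("movie")
    return {m.getAttribute("title"): child_text(m, "year", "unknown") for m in movies}


if __name__ == "__main__":
    print(years_by_title(SAMPLE))
```

Because the whole tree is in memory, the helper can be called for any element in any order, which is the random access that SAX does not offer.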
