DDI-Codebook Processing

The ddicodebook subpackage provides functionality for reading and processing DDI-Codebook 2.6 XML documents in Python. It is backward compatible with DDI-Codebook 2.5 and earlier versions.

The subpackage is designed to be flexible and to accommodate various versions of DDI-Codebook, including the slightly invalid DDI documents that are sometimes found in practice. It is primarily intended for reading and processing existing DDI documents, not for creating new DDI-XML or validating documents.

Overview

DDI-Codebook is the lightweight version of the DDI standard, intended primarily to document simple survey data. This specification has been widely adopted around the globe by statistical agencies, data producers, archives, research centers, and international organizations.

Basic Usage

Load a DDI-Codebook document:

from dartfx.ddi import ddicodebook

# Load from file
my_codebook = ddicodebook.loadxml('mycodebook.xml')

# Load from XML string
my_codebook = ddicodebook.loadxmlstring(xml_content)

Accessing Study Metadata

# Access study description
study = my_codebook.studyDscr

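# Get title (stays None if any element along the path is missing)
title = None
if study and study.citation and study.citation.titlStmt and study.citation.titlStmt.titl:
    title = study.citation.titlStmt.titl.content

# Get abstract (stays None if missing)
abstract = None
if study and study.stdyInfo and study.stdyInfo.abstract:
    abstract = study.stdyInfo.abstract.content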

Working with Variables

# Access data description
if my_codebook.dataDscr:
    for var in my_codebook.dataDscr.var:
        print(f"Variable: {var.name}")
        print(f"Label: {var.labl.content if var.labl else 'No label'}")
        print(f"Format: {var.varFormat.type if var.varFormat else 'Unknown'}")

        # Access categories/codes
        if var.catgry:
            print("Categories:")
            for cat in var.catgry:
                value = cat.catValu.content if cat.catValu else "No value"
                label = cat.labl.content if cat.labl else "No label"
                print(f"  {value}: {label}")
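When the categories are needed as a lookup table rather than as printed output, the loop above can be folded into a small helper. The sketch below is illustrative only: `category_labels` is a hypothetical helper, not part of ddicodebook, and the `SimpleNamespace` objects are stand-ins that mimic the `catgry`, `catValu`, `labl`, and `content` attributes used above so that the snippet runs without a DDI file.

```python
from types import SimpleNamespace


def category_labels(var):
    """Return a {code: label} dict for a variable's categories.

    Hypothetical helper (not part of ddicodebook). It relies only on the
    attributes used in the loop above: catgry, catValu, labl, and content.
    """
    labels = {}
    for cat in (var.catgry or []):
        if not cat.catValu:
            continue  # Skip categories that carry no code value
        labels[cat.catValu.content] = cat.labl.content if cat.labl else None
    return labels


def _text(value):
    """Build a stand-in element exposing a .content attribute."""
    return SimpleNamespace(content=value)


# Stand-in variable; with a real codebook, pass an item taken from
# my_codebook.dataDscr.var instead.
demo_var = SimpleNamespace(
    name="SEX",
    catgry=[
        SimpleNamespace(catValu=_text("1"), labl=_text("Male")),
        SimpleNamespace(catValu=_text("2"), labl=_text("Female")),
        SimpleNamespace(catValu=_text("9"), labl=None),
    ],
)

print(category_labels(demo_var))
```

Categories without a label map to None here, so callers can decide how to display them.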

Working with Files

# Access file descriptions
if my_codebook.fileDscr:
    for file_desc in my_codebook.fileDscr:
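        file_info = file_desc.fileTxt
        if not file_info:
            continue  # No file-level description for this entry
        print(f"File: {file_info.fileName}")
        print(f"Format: {file_info.format}")

        # Access file statistics if available. In the DDI-Codebook schema,
        # dimensns (which holds caseQnty) is a child of fileTxt.
        dimensns = getattr(file_info, 'dimensns', None)
        if dimensns and dimensns.caseQnty:
            print(f"Records: {dimensns.caseQnty}")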

Error Handling

The module is designed to be robust when dealing with incomplete or slightly malformed DDI documents:

try:
    codebook = ddicodebook.loadxml('problematic_file.xml')

    # Safely access potentially missing elements
    title = "No title"
    if (codebook.studyDscr and
        codebook.studyDscr.citation and
        codebook.studyDscr.citation.titlStmt and
        codebook.studyDscr.citation.titlStmt.titl):
        title = codebook.studyDscr.citation.titlStmt.titl.content

except Exception as e:
    print(f"Error loading codebook: {e}")
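Chained checks like the one above grow quickly as paths get deeper. One possible way to shorten them is a small path-walking helper. This is a sketch under stated assumptions: `get_path` is a hypothetical name, not a ddicodebook function, and the `SimpleNamespace` objects stand in for a loaded codebook so that the snippet runs on its own.

```python
from types import SimpleNamespace


def get_path(obj, path, default=None):
    """Follow a dotted attribute path; return default if any step is missing.

    Hypothetical helper (not part of ddicodebook).
    """
    current = obj
    for name in path.split("."):
        current = getattr(current, name, None)
        if current is None:
            return default
    return current


# Stand-in for a loaded codebook that has a title but no stdyInfo element.
codebook = SimpleNamespace(
    studyDscr=SimpleNamespace(
        citation=SimpleNamespace(
            titlStmt=SimpleNamespace(
                titl=SimpleNamespace(content="Demo Survey 2020")
            )
        ),
        stdyInfo=None,
    )
)

title = get_path(codebook, "studyDscr.citation.titlStmt.titl.content", "No title")
abstract = get_path(codebook, "studyDscr.stdyInfo.abstract.content", "No abstract")
print(title)
print(abstract)
```

The helper treats a missing attribute and an attribute set to None the same way, which matches the intent of the explicit checks above.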

Implementation Notes

Based on DDI-Codebook version 2.6 schema (backward compatible with 2.5)
Models are located in the ddicodebook.model subpackage
Class names match the complex types defined in DDI-Codebook
Property names match the DDI-Codebook element names
Type annotations are used to determine DDI property types
All classes inherit from the baseElementType base class
The subpackage handles XML namespace issues automatically
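To illustrate how type annotations can drive property handling, the sketch below inspects class-level annotations to find each property's element class and whether it repeats. It is not the subpackage's actual code: the classes are simplified stand-ins with schema-style names, and `describe_properties` is a hypothetical function introduced only for this example.

```python
from typing import List, get_args, get_origin, get_type_hints


class baseElementType:
    """Stand-in base class; the real one lives in ddicodebook.model."""

    def __init__(self, content=None):
        self.content = content


class simpleTextType(baseElementType):
    """Stand-in for a simple text element."""


class titlStmtType(baseElementType):
    """Stand-in for a title statement with one single and one repeatable child."""

    titl: simpleTextType
    altTitl: List[simpleTextType]


def describe_properties(cls):
    """Map each annotated property to (element class, is_repeatable)."""
    properties = {}
    for name, hint in get_type_hints(cls).items():
        if get_origin(hint) is list:
            # List[X] marks a repeatable element of class X
            properties[name] = (get_args(hint)[0], True)
        else:
            properties[name] = (hint, False)
    return properties


for name, (element_class, repeatable) in describe_properties(titlStmtType).items():
    print(f"{name}: {element_class.__name__}, repeatable={repeatable}")
```

A loader built along these lines could use such a table to decide which class to instantiate for each child element and whether to collect the results in a list.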

Performance Considerations

For large DDI-Codebook documents:

The entire document is loaded into memory
For very large files, consider a streaming XML parser instead of loading the full document, if needed
Consider processing variables in batches for memory efficiency
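As a rough illustration of the last two points, the sketch below reads variable names with the standard library's `xml.etree.ElementTree.iterparse` and groups them into batches. It bypasses ddicodebook entirely, so it yields raw XML attributes rather than model objects. The helper names (`local_name`, `iter_variable_names`, `in_batches`) and the sample document are assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET
from itertools import islice


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_variable_names(source):
    """Yield the name attribute of each <var> element, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "var":
            yield elem.get("name")
            elem.clear()  # Drop the element's children and text once read


def in_batches(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


# Tiny in-memory sample; with a real file, pass its path or an open
# binary file object to iter_variable_names instead.
sample = b"""<codeBook xmlns="ddi:codebook:2_5">
  <dataDscr>
    <var name="AGE"><labl>Age in years</labl></var>
    <var name="SEX"><labl>Sex</labl></var>
    <var name="REGION"><labl>Region</labl></var>
  </dataDscr>
</codeBook>"""

batches = list(in_batches(iter_variable_names(io.BytesIO(sample)), 2))
print(batches)
```

Matching on the local element name keeps the sketch independent of the namespace a given DDI-Codebook version declares.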