Chapter 17: Organizing the Scrolls – Data Management

Our heroes have been recruited, and the data scrolls (questionnaires) are filled out. This chapter is about Data Management—the process of converting these physical (or digital) scrolls into a clean, usable, and organized digital database. Good data management is essential for reliable analysis.

1. Data Management: The Key Steps

Data management is a systematic process encompassing:

Defining Variables and creating the Data Dictionary (the master list).
Creating the Study Database (the entry structure).
Entering Data and correcting errors (Data Cleaning).
Creating the final Data Set for analysis.
Backing up and Archiving the data set securely.

2. Understanding Data Structure

A database is structured like a table:

Records: The horizontal rows, each holding all the information collected from one particular individual (one child).
Variables: The vertical columns, each holding one specific piece of information collected (e.g., Age, Gender, SuperPaste Use, Caries Status).

A. Unique Identifier (ID)

This is the most critical variable.

The ID must be unique for every record.
Its uniqueness should be enforced by a quality assurance procedure (e.g., checking for duplicate IDs before analysis).
Composite IDs: The ID can be designed to carry meaning (e.g., first 2 digits = Village, next 2 = Street, last 2 = Person’s sequence number).
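
A composite ID of this kind can be split back into its parts, and uniqueness can be checked before analysis. The following is a minimal Python sketch, assuming a six-digit ID laid out as in the example (village, street, person). The function names parse_composite_id and find_duplicate_ids and the sample IDs are hypothetical.

```python
from collections import Counter


def parse_composite_id(record_id: str) -> dict:
    """Split a 6-digit composite ID into village, street, and person parts.

    Assumed layout (taken from the example in the text): VVSSPP.
    """
    if len(record_id) != 6 or not record_id.isdigit():
        raise ValueError(f"ID must be exactly 6 digits, got {record_id!r}")
    return {
        "village": record_id[0:2],
        "street": record_id[2:4],
        "person": record_id[4:6],
    }


def find_duplicate_ids(ids: list[str]) -> list[str]:
    """Return every ID that appears more than once, in sorted order."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Illustrative IDs only, not real study data.
sample_ids = ["010203", "010204", "020101", "010203"]
parts = parse_composite_id(sample_ids[0])
duplicates = find_duplicate_ids(sample_ids)
```

The IDs are stored as text rather than numbers so that leading zeros (village 01) are not lost.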

B. Variable Specifications

Before entry, all variables must be defined:

Type: Digits (Numeric), Text (Alphanumeric), or Date.
Format: Length of text, number of decimals, and specific date format (e.g., DD/MM/YYYY vs. MM/DD/YYYY).
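Consistency: Convert all text entries to capitals so that the same answer typed in different cases (e.g., "Yes", "yes", "YES") is not treated as several different values during analysis.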
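
The type, format, and consistency rules can be written down once and applied to every raw entry. This is a minimal Python sketch under stated assumptions: the variable names AGE, VILLAGE, and VISITDATE, the SPECS table, and the helper clean_value are all hypothetical, and the date format assumed is DD/MM/YYYY.

```python
from datetime import datetime

# Hypothetical specifications for three variables; names and formats are
# assumptions chosen for illustration.
SPECS = {
    "AGE": {"type": "numeric", "decimals": 0},
    "VILLAGE": {"type": "text", "max_length": 20},
    "VISITDATE": {"type": "date", "format": "%d/%m/%Y"},  # DD/MM/YYYY
}


def clean_value(name: str, raw: str):
    """Convert a raw entry to the type and format declared in SPECS."""
    spec = SPECS[name]
    raw = raw.strip()
    if spec["type"] == "numeric":
        return round(float(raw), spec["decimals"])
    if spec["type"] == "text":
        if len(raw) > spec["max_length"]:
            raise ValueError(f"{name} exceeds {spec['max_length']} characters")
        return raw.upper()  # capitals throughout, so "Yes" and "yes" match
    if spec["type"] == "date":
        # strptime raises ValueError if the entry does not fit the format.
        return datetime.strptime(raw, spec["format"]).date()
    raise ValueError(f"Unknown type for {name}")


village = clean_value("VILLAGE", " rampur ")
visit = clean_value("VISITDATE", "03/04/2024")  # read as 3 April, not 4 March
```

Fixing the date format in one place matters because an entry such as 03/04/2024 is a valid date under both DD/MM/YYYY and MM/DD/YYYY, so the software cannot detect the mix-up on its own.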

C. Variable Naming Conventions

Variable names must be:

Clear and Self-Explanatory: Should refer directly to an item in the data collection instrument (e.g., EXERDAILY for “Do you exercise daily?”).
Short: Many statistical packages, particularly older ones, restrict the length of variable names, so keep names under 10 characters with no spaces.
Consistent: Maintain a consistent pattern (e.g., use EXERPAST and EXERCURR for exercise history and current status).
Denoting Dichotomization: Clearly rename variables when they are recoded from crude numerical values into categories (e.g., change EXERCISE to EXERCISE_12 to denote a dichotomized (Yes/No) outcome).
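
The recoding step behind a name such as EXERCISE_12 might look like the sketch below. It assumes, purely for illustration, that EXERCISE holds days of exercise per week, that the cutoff is 3 days, and that the suffix _12 refers to the codes 1 (Yes) and 2 (No). The helper dichotomize is hypothetical.

```python
def dichotomize(value, cutoff):
    """Recode a numeric value to 1 (Yes) if it reaches the cutoff, else 2 (No).

    Missing input (None) stays missing.
    """
    if value is None:
        return None
    return 1 if value >= cutoff else 2


# EXERCISE is assumed here to hold days of exercise per week.
records = [
    {"ID": "010101", "EXERCISE": 5},
    {"ID": "010102", "EXERCISE": 0},
    {"ID": "010103", "EXERCISE": None},
]

for rec in records:
    # Keep the original variable and add the recoded one under a new name.
    rec["EXERCISE_12"] = dichotomize(rec["EXERCISE"], cutoff=3)
```

Writing the result to a new variable, rather than overwriting EXERCISE, keeps the original values available if a different cutoff is needed later.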

3. Data Entry and Coding

A. Data Collection Instrument Design

The questionnaire should be designed to facilitate easy data entry:

Sections: Divide the instrument into broad sections (Identifiers, Demographics, Outcome, Exposure) that correspond to database sections.
Auto-Coding: The instrument should have codes written directly next to the response options (e.g., Yes [1], No [2]) so the entry person enters the code, not the text. This is called auto-coding.

B. Numerical Coding and Missing Values

Prefer Numerical Coding: Use numbers (digits) whenever possible.
Missing Values: Code missing values consistently (e.g., using a dot . or a numerical code like 999), and never use a code that could be a real value for that variable (999 is safe for Age in years, but not for a variable such as monthly income).
Dichotomous Variables: Be consistent in coding binary variables. Pick one scheme (e.g., 1 for Present/Yes and 2 for Absent/No, or 1 and 0) and use it for every binary variable in the database.
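
The reason a missing-value code must be handled explicitly is easiest to see with a small calculation. This is a minimal Python sketch, assuming 999 is the agreed missing code. The names MISSING_CODE, to_analysis_value, and mean_ignoring_missing and the sample ages are hypothetical.

```python
MISSING_CODE = 999  # assumed missing-value code; must never be a valid answer


def to_analysis_value(code):
    """Translate the stored missing-value code to None so it is not analysed."""
    return None if code == MISSING_CODE else code


def mean_ignoring_missing(codes):
    """Average the real values only; return None if nothing was recorded."""
    values = [v for v in map(to_analysis_value, codes) if v is not None]
    return sum(values) / len(values) if values else None


# Illustrative ages in years; 999 marks a child whose age was not recorded.
ages = [3, 5, 999, 4]
naive_mean = sum(ages) / len(ages)          # wrongly treats 999 as an age
correct_mean = mean_ignoring_missing(ages)  # uses 3, 5, and 4 only
```

Here naive_mean comes out at 252.75 while correct_mean is 4.0, which shows how a single unhandled code can distort a summary statistic.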

C. Data Dictionary (Variable Catalogue)

The Data Dictionary is the master key to the database.

It documents every variable: the original Question Item, the Variable Name, the Type, the Format, the numerical Values assigned, and the Meaning of those values (e.g., 1 = Yes, 2 = No).
Importance: It allows anyone (including a future analyst or the original investigator months later) to understand and use the database correctly.
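
A data dictionary can itself be kept in a machine-readable form, so that stored codes are translated back to their meanings by lookup rather than from memory. The sketch below is one possible layout in Python. The entries, the code 9 for Missing, and the helper label_for are illustrative assumptions, not the study's actual instrument.

```python
# One dictionary entry per variable. The variables and wording below are
# illustrative assumptions.
DATA_DICTIONARY = {
    "EXERDAILY": {
        "question": "Do you exercise daily?",
        "type": "numeric",
        "format": "1 digit",
        "values": {1: "Yes", 2: "No", 9: "Missing"},
    },
    "AGE": {
        "question": "Age in completed years",
        "type": "numeric",
        "format": "2 digits, no decimals",
        "values": None,  # continuous variable, no coded categories
    },
}


def label_for(variable: str, code):
    """Look up the meaning of a stored code; return the code itself if uncoded."""
    values = DATA_DICTIONARY[variable]["values"]
    if values is None:
        return code
    return values[code]  # KeyError signals a code the dictionary does not define


answer = label_for("EXERDAILY", 2)
```

A code that is not listed raises a KeyError instead of passing silently, which doubles as a simple check that the data and the dictionary agree.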

4. Checks and Balances: Quality Assurance

Quality checks must be implemented before and during data entry:

Check/Balance	Purpose	Example
Range Checks	Specifies minimum and maximum acceptable values for a field.	If the study is only on children up to 5 years, the database will not accept an entry greater than 5 in the ‘Age’ column.
Skip Patterns	Automatically skips irrelevant fields based on a preceding answer.	If the child’s mother answers “No” to having diabetes, the database skips all questions about her diabetes medication.
Automatic Calculation	The database performs complex calculations upon entry.	Enter Height and Weight, and the database automatically calculates Body Mass Index (BMI).

Cleaning: Data entry serves as the first step of data cleaning. Entry personnel note ambiguous responses and refer them back to the investigator for clarification.
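
The three checks in the table can each be expressed as a small function. This is a minimal Python sketch rather than the behaviour of any particular data-entry package. The function names, the follow-up variables DIABMED and DIABYEARS, and the coding 1 = Yes are assumptions.

```python
def validate_age(age: float, minimum: float = 0, maximum: float = 5) -> float:
    """Range check: reject ages outside the study's limits."""
    if not minimum <= age <= maximum:
        raise ValueError(f"Age {age} is outside {minimum} to {maximum}")
    return age


def fields_to_ask(mother_has_diabetes: int) -> list[str]:
    """Skip pattern: medication questions apply only if the answer is Yes (1)."""
    if mother_has_diabetes == 1:
        return ["DIABMED", "DIABYEARS"]  # hypothetical follow-up variables
    return []


def bmi(weight_kg: float, height_m: float) -> float:
    """Automatic calculation: BMI = weight (kg) / height (m) squared."""
    return round(weight_kg / height_m ** 2, 1)


skipped = fields_to_ask(2)     # mother answered No, so nothing further is asked
child_bmi = bmi(16.0, 1.0)     # illustrative weight and height
```

Calling validate_age(6) raises a ValueError, which mirrors the database refusing an entry greater than 5 in the Age column.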

5. Database Aggregation and Linking

A. Individual vs. Aggregated Databases

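Individual (Normalized): Each horizontal record represents one observation (one child). This is ideal, because counts and summaries can always be derived from individual records, but individual records cannot be recovered from counts.
Aggregated: Records contain counts or summaries (e.g., one record showing the total count of caries cases by village).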
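
Moving from the individual form to the aggregated form is a simple counting step, sketched below in Python. The records are invented for illustration, and CARIES is assumed to be coded 1 = Present, 2 = Absent.

```python
from collections import Counter

# Individual-level records: one row per child (illustrative values only).
# CARIES is assumed to be coded 1 = Present, 2 = Absent.
children = [
    {"ID": "010101", "VILLAGE": "01", "CARIES": 1},
    {"ID": "010102", "VILLAGE": "01", "CARIES": 2},
    {"ID": "020101", "VILLAGE": "02", "CARIES": 1},
    {"ID": "020102", "VILLAGE": "02", "CARIES": 1},
]

# Aggregated form: one count of caries cases per village.
cases_by_village = Counter(
    child["VILLAGE"] for child in children if child["CARIES"] == 1
)
```

With these sample rows, village 01 has one case and village 02 has two.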

B. Mother and Daughter Databases (Linking)

When information is collected at different levels (Village, Household, Individual), each level should be kept in its own linked database, rather than repeating higher-level information in every lower-level record.

Mother Database (e.g., Household Level): Contains information common to everyone in the household (e.g., House ID, Income, Community Status).
Daughter Database (e.g., Individual Level): Contains information specific to the person (e.g., Person ID, Disease Status, Exposure).
Linking: Files can be linked or merged during analysis using a common identifier (e.g., the House ID) that appears in both databases. This avoids unnecessary redundancy and confusion.
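
The linking step can be sketched in a few lines of Python. The variable names HOUSEID, PERSONID, INCOME, and DISEASE, the sample rows, and the helper link are hypothetical. In practice a statistical package's merge command would do the same job.

```python
# Mother database: one row per household, keyed by HOUSEID (illustrative data).
households = {
    "H01": {"INCOME": 1200},
    "H02": {"INCOME": 800},
}

# Daughter database: one row per person, each carrying the household's HOUSEID.
persons = [
    {"PERSONID": "P001", "HOUSEID": "H01", "DISEASE": 1},
    {"PERSONID": "P002", "HOUSEID": "H01", "DISEASE": 2},
    {"PERSONID": "P003", "HOUSEID": "H02", "DISEASE": 1},
]


def link(persons: list[dict], households: dict) -> list[dict]:
    """Attach household-level fields to each person via the shared HOUSEID."""
    merged = []
    for person in persons:
        house = households[person["HOUSEID"]]  # KeyError flags an unmatched ID
        merged.append({**person, **house})
    return merged


analysis_set = link(persons, households)
```

Household income is stored once per household and copied onto each person only at analysis time, so a correction to a household's record needs to be made in one place.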
