Our heroes have been recruited, and the data scrolls (questionnaires) are filled out. This chapter is about Data Management—the process of converting these physical (or digital) scrolls into a clean, usable, and organized digital database. Good data management is essential for reliable analysis.
1. Data Management: The Key Steps
Data management is a systematic process encompassing:
- Defining Variables and creating the Data Dictionary (the master list).
- Creating the Study Database (the entry structure).
- Entering Data and correcting errors (Data Cleaning).
- Creating the final Data Set for analysis.
- Backing up and Archiving the data set securely.
2. Understanding Data Structure
A database is structured like a table:
- Records: The horizontal lines, representing all the information collected from one particular individual (one child).
- Variables: The vertical columns, representing a specific piece of information collected (e.g., Age, Gender, SuperPaste Use, Caries Status).
A. Unique Identifier (ID)
This is the most critical variable.
- The ID must be unique for every record.
- Its integrity should be protected by a quality assurance procedure, such as a duplicate check at the time of entry.
- Composite IDs: The ID can be designed to carry meaning (e.g., first 2 digits = Village, next 2 = Street, last 2 = Person’s sequence number).
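The two ID rules above (uniqueness and a composite layout) can be sketched in a few lines of Python. This is only an illustration: it assumes the 6-digit Village/Street/Person layout from the example, and the function names and sample IDs are hypothetical.

```python
from collections import Counter


def parse_composite_id(record_id: str) -> dict:
    """Split a 6-digit composite ID into village, street, and person parts."""
    if len(record_id) != 6 or not record_id.isdigit():
        raise ValueError(f"ID must be exactly 6 digits: {record_id!r}")
    return {
        "village": record_id[0:2],  # first 2 digits
        "street": record_id[2:4],   # next 2 digits
        "person": record_id[4:6],   # last 2 digits (sequence number)
    }


def find_duplicate_ids(ids: list) -> list:
    """Return every ID that appears more than once, sorted."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical IDs for illustration only; "010203" was entered twice.
ids = ["010203", "010204", "020101", "010203"]
parts = parse_composite_id("010203")
duplicates = find_duplicate_ids(ids)
```

Running a duplicate check like this before analysis is one simple form the quality assurance procedure could take.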
B. Variable Specifications
Before entry, all variables must be defined:
- Type: Digits (Numeric), Text (Alphanumeric), or Date.
- Format: Length of text, number of decimals, and specific date format (e.g., DD/MM/YYYY vs. MM/DD/YYYY).
- Consistency: Turn all textual entries into capitals so that the same answer typed in different cases (e.g., "Male" and "MALE") is not treated as two different values during analysis.
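As a rough sketch of how such specifications might be applied at entry, the snippet below converts raw text to the declared type and format. The variable names, lengths, and the choice of DD/MM/YYYY are assumptions made for the example, not part of any particular database package.

```python
from datetime import date, datetime

# Hypothetical specifications: one entry per variable.
SPECS = {
    "AGE": {"type": "numeric", "decimals": 0},
    "WEIGHT": {"type": "numeric", "decimals": 1},
    "VILLAGE": {"type": "text", "length": 20},
    "VISITDATE": {"type": "date", "format": "%d/%m/%Y"},  # DD/MM/YYYY
}


def convert(name: str, raw: str):
    """Convert a raw entry to the type and format declared for the variable."""
    spec = SPECS[name]
    raw = raw.strip()
    if spec["type"] == "numeric":
        return round(float(raw), spec["decimals"])
    if spec["type"] == "date":
        return datetime.strptime(raw, spec["format"]).date()
    # Text: enforce the declared length and store in capitals.
    if len(raw) > spec["length"]:
        raise ValueError(f"{name} is longer than {spec['length']} characters")
    return raw.upper()


village = convert("VILLAGE", " riverside ")
weight = convert("WEIGHT", "12.34")
visit = convert("VISITDATE", "03/04/2021")  # read as 3 April, not 4 March
```

Fixing the date format in one place is what keeps "03/04/2021" from being read as 3 April by one person and 4 March by another.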
C. Variable Naming Conventions
Variable names must be:
- Clear and Self-Explanatory: Should refer directly to an item in the data collection instrument (e.g., `EXERDAILY` for “Do you exercise daily?”).
- Short: Most software requires names to be less than 10 characters (no spaces).
- Consistent: Maintain a consistent pattern (e.g., use `EXERPAST` and `EXERCURR` for exercise history and current status).
- Denoting Dichotomization: Clearly rename variables when they are recoded from crude numerical values into categories (e.g., change `EXERCISE` to `EXERCISE_12` to denote a dichotomized (Yes/No) outcome).
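To make the dichotomization rule concrete, here is a minimal sketch that derives `EXERCISE_12` from `EXERCISE` while leaving the original variable untouched. It assumes, purely for illustration, that `EXERCISE` holds days of exercise per week, that the cutoff is 1 day, and that the recoded values are 1 = Yes and 2 = No.

```python
def dichotomize(value, cutoff):
    """Return 1 (Yes) if value >= cutoff, else 2 (No); keep missing as None."""
    if value is None:
        return None
    return 1 if value >= cutoff else 2


# Hypothetical records: EXERCISE = days of exercise per week (an assumption).
records = [
    {"ID": "010101", "EXERCISE": 0},
    {"ID": "010102", "EXERCISE": 3},
    {"ID": "010103", "EXERCISE": None},  # missing stays missing
]

for record in records:
    # The recoded variable gets a new name; the crude value is not overwritten.
    record["EXERCISE_12"] = dichotomize(record["EXERCISE"], cutoff=1)

recoded = [r["EXERCISE_12"] for r in records]
```

Keeping both columns means the recoding can be checked, or redone with a different cutoff, at any later point.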
3. Data Entry and Coding
A. Data Collection Instrument Design
The questionnaire should be designed to facilitate easy data entry:
- Sections: Divide the instrument into broad sections (Identifiers, Demographics, Outcome, Exposure) that correspond to database sections.
- Auto-Coding: The instrument should have codes written directly next to the response options (e.g., Yes [1], No [2]) so the entry person enters the code, not the text. This is called auto-coding.
B. Numerical Coding and Missing Values
- Prefer Numerical Coding: Use numbers (digits) whenever possible.
- Missing Values: Code missing values consistently (e.g., using a dot `.` or a numerical code like `999`, but be careful not to use numbers that could represent a real value).
- Dichotomous Variables: Be consistent in coding binary variables (e.g., always use 1 for Present/Yes and 0 or 2 for Absent/No).
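The short sketch below shows why missing-value codes have to be handled explicitly before analysis. It assumes a hypothetical Age column where both `.` and `999` were used as missing codes; the names and data are illustrative only.

```python
from statistics import mean

MISSING_CODES = {".", "999"}  # hypothetical codes chosen for this study


def decode(raw: str):
    """Turn a raw entry into an int, or None when it is a missing-value code."""
    raw = raw.strip()
    if raw == "" or raw in MISSING_CODES:
        return None
    return int(raw)


raw_ages = ["4", "999", "3", ".", "5"]
ages = [decode(v) for v in raw_ages]

# Treating 999 as a real age badly distorts the summary.
naive_mean = mean(int(v) for v in raw_ages if v != ".")
# Excluding decoded missing values gives the mean of the observed ages.
clean_mean = mean(a for a in ages if a is not None)
```

Here `naive_mean` comes out at 252.75, while `clean_mean` is 4, which is the kind of error a consistent missing-value convention is meant to prevent.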
C. Data Dictionary (Variable Catalogue)
The Data Dictionary is the master key to the database.
- It documents every variable: the original Question Item, the Variable Name, the Type, the Format, the numerical Values assigned, and the Meaning of those values (e.g., 1 = Yes, 2 = No).
- Importance: It allows anyone (including a future analyst or the original investigator months later) to understand and use the database correctly.
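A data dictionary can be kept in a spreadsheet, but it can also be stored in a form that software can read. The sketch below is one possible layout, not a standard: the field names and the `AGE` entry are assumptions, while `EXERDAILY` reuses the example from the naming section.

```python
# Hypothetical data dictionary: one entry per variable in the database.
DATA_DICTIONARY = {
    "EXERDAILY": {
        "question": "Do you exercise daily?",
        "type": "numeric",
        "format": "1 digit",
        "values": {1: "Yes", 2: "No"},
    },
    "AGE": {
        "question": "Age of the child in completed years",  # assumed wording
        "type": "numeric",
        "format": "1 digit, no decimals",
        "values": None,  # a measured quantity, so no value labels
    },
}


def label(variable: str, code):
    """Translate a stored code into its documented meaning."""
    values = DATA_DICTIONARY[variable]["values"]
    if values is None:
        return str(code)
    if code not in values:
        raise ValueError(f"{code!r} is not a documented value of {variable}")
    return values[code]


answer = label("EXERDAILY", 1)
```

With this in place, an analyst who opens the data months later can look up what a stored `1` means instead of guessing.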
4. Checks and Balances: Quality Assurance
Quality checks must be implemented before and during data entry:
| Check/Balance | Purpose | Example |
| --- | --- | --- |
| Range Checks | Specifies minimum and maximum acceptable values for a field. | If the study is only on children up to 5 years, the database will not accept an entry greater than 5 in the ‘Age’ column. |
| Skip Patterns | Automatically skips irrelevant fields based on a preceding answer. | If the child’s mother answers “No” to having diabetes, the database skips all questions about her diabetes medication. |
| Automatic Calculation | The database performs complex calculations upon entry. | Enter Height and Weight, and the database automatically calculates Body Mass Index (BMI). |
| Cleaning | Data entry serves as the first step of data cleaning. | Entry personnel note ambiguous responses and refer them back to the investigator for clarification. |
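Data entry packages usually provide these checks through their own settings, but the logic behind the first three rows can be sketched in plain Python. The variable names (`DIABETES`, `DIABMED`, and so on), the 1 = Yes / 2 = No coding, and the sample values are assumptions for illustration.

```python
def check_range(name, value, minimum, maximum):
    """Range check: reject any value outside the accepted limits."""
    if not minimum <= value <= maximum:
        raise ValueError(f"{name}={value} is outside {minimum} to {maximum}")
    return value


def apply_skip(record):
    """Skip pattern: no medication answer is kept when diabetes is 'No'."""
    if record["DIABETES"] == 2:  # 2 = No (assumed coding)
        record["DIABMED"] = None
    return record


def bmi(weight_kg, height_m):
    """Automatic calculation: BMI = weight (kg) / height (m) squared."""
    return round(weight_kg / height_m ** 2, 1)


# Hypothetical entry for one child.
record = {"AGE": 4, "DIABETES": 2, "DIABMED": 1, "WEIGHT": 18.0, "HEIGHT": 1.05}
check_range("AGE", record["AGE"], 0, 5)
apply_skip(record)
record["BMI"] = bmi(record["WEIGHT"], record["HEIGHT"])
```

An age of 6 would raise an error at the range check, so the mistake is caught at entry rather than discovered during analysis.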
5. Database Aggregation and Linking
A. Individual vs. Aggregated Databases
- Individual (Normalized): Each horizontal record represents one observation (one child). This is ideal.
- Aggregated: Records contain counts or summaries (e.g., one record showing the total count of caries cases by village).
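One reason the individual format is preferred is that an aggregated table can always be derived from it, while the reverse is not possible. The sketch below builds village-level counts from hypothetical individual records, assuming `CARIES` is coded 1 = Yes and 2 = No.

```python
# Hypothetical individual database: one record per child.
individual = [
    {"ID": "010101", "VILLAGE": "01", "CARIES": 1},
    {"ID": "010102", "VILLAGE": "01", "CARIES": 2},
    {"ID": "020101", "VILLAGE": "02", "CARIES": 1},
    {"ID": "020102", "VILLAGE": "02", "CARIES": 1},
    {"ID": "030101", "VILLAGE": "03", "CARIES": 2},
]

# Aggregated database: one record per village with counts.
aggregated = {}
for child in individual:
    row = aggregated.setdefault(child["VILLAGE"], {"CHILDREN": 0, "CASES": 0})
    row["CHILDREN"] += 1
    if child["CARIES"] == 1:
        row["CASES"] += 1
```

The aggregated rows no longer say which child had caries, so any analysis that needs individual-level detail has to go back to the original records.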
B. Mother and Daughter Databases (Linking)
When information is collected at different levels (Village, Household, Individual), each level should be kept in its own linked database, rather than repeating the same information across all levels.
- Mother Database (e.g., Household Level): Contains information common to everyone in the household (e.g., House ID, Income, Community Status).
- Daughter Database (e.g., Individual Level): Contains information specific to the person (e.g., Person ID, Disease Status, Exposure).
- Linking: Files can be linked or merged during analysis using a common identifier (e.g., the House ID) that appears in both databases. This avoids unnecessary redundancy and confusion.
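A minimal sketch of the linking step is shown below. It assumes a mother database keyed by a hypothetical `HOUSEID` and a daughter database in which every person carries the same `HOUSEID`; the field names and values are made up for the example.

```python
# Mother database (household level), keyed by the common identifier.
households = {
    "H01": {"INCOME": 1200},
    "H02": {"INCOME": 800},
}

# Daughter database (individual level); each person carries the House ID.
individuals = [
    {"PERSONID": "P001", "HOUSEID": "H01", "DISEASE": 1},
    {"PERSONID": "P002", "HOUSEID": "H01", "DISEASE": 2},
    {"PERSONID": "P003", "HOUSEID": "H02", "DISEASE": 2},
    {"PERSONID": "P004", "HOUSEID": "H09", "DISEASE": 1},  # no such household
]


def link(individuals, households):
    """Attach household fields to each person; report IDs that do not match."""
    merged, unmatched = [], []
    for person in individuals:
        house = households.get(person["HOUSEID"])
        if house is None:
            unmatched.append(person["PERSONID"])
            continue
        merged.append({**person, **house})
    return merged, unmatched


merged, unmatched = link(individuals, households)
```

Household income is stored once per household and only copied onto each person at analysis time. Listing the unmatched records is a useful side effect, since a person whose House ID matches no household usually points to an entry error.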
