Department of Public Health and Primary Care, Charing Cross and Westminster Medical School, Chelsea and Westminster Hospital, London SW10 9NH
Tel: 0181-746 8160, Fax: 0181-746 8151, email: f.tyrer@cxwms.ac.uk
Freya Tyrer BSc Data Manager
Ian Hambleton BA MSc Statistician
Ross Lawrenson MRCGP FAFPHM Senior Lecturer in Public Health
Mary Pierce LMSSA MRCGP Senior Lecturer in General Practice
Over 80% of general practices have computers[1] and 8% of practices are 'paperless', using their computers to record all their clinical data[2]. General practitioners have a number of objectives when purchasing a computer system. These usually include improved practice management, easily retrievable prescribing and clinical information, the ability to produce scripts (including repeat prescribing), letters and accounts. Additional advantages that may be sought include the ability to transfer information about individual patients to and from hospitals, and the ability to audit practice.
98% of people in the UK are registered with a general practitioner and on average each patient has just over 3 consultations per year[3]. Patient records move between practices with the patient. As well as standard prescribing and clinical information, the records include details of specialist referrals and hospital admissions.
Pooled general practice data can provide a large sample that is representative of the general population. This makes it attractive to epidemiologists.
The unique morbidity data available from computerised practices, and their value in health needs assessment, were recognised by family health service authorities (FHSAs), who invested considerable sums to help GPs buy computer hardware and software. One objective was that the FHSAs would then have available all the information required for planning services. Another advantage of GP computer records is that they link prescribing and clinical data - something PACT (prescription analysis and costing) is unable to do[4]. This was not lost on the FHSAs, which may have envisaged more relevant economic analyses from the data.
The reality is that the aims of GPs and epidemiologists differ, and the use of GP computer systems is not always compatible with the information needs of epidemiological studies. However, the RCGP Research Unit has produced useful information in its morbidity surveys based on information from general practices[5]. OPCS now own the VAMP database, although the principal researcher is American[6]. Aggregated data from Meditel practices have also been used and are available through IMS (Intercontinental Medical Systems). Only by aggregating data from a number of practices can a potentially representative sample of the general population be collected. All these aggregated databases are open to possible bias in that they are drawn from the 8% of practices that are fully computerised. The doctors who work in these practices are not representative of all general practitioners (more are fundholders, more work in group practices)[3]. It may also be true that the patients who attend these practices are not truly representative of the UK population.
Having said this, certain validation checks can be carried out comparing age, gender, and mortality distributions against the general population. Other demographic data are rarely available. For example, although the feasibility of recording ethnicity has been demonstrated[7], this is not recorded by general practitioners and so it cannot be shown that the ethnic mix within the sample reflects the ethnic mix of the UK population.
A second issue when building a representative database is the variety of software systems available. The Department of Health's survey in 1991 revealed 119 different types[8]. As many as twenty of these are in current usage, the three largest currently being VAMP, Meditel and EMIS. Each of these systems uses a different coding system. The non-standardised nature of drug and disease coding from these systems causes problems in the pooling and comparison of data from practices running different systems[9]. Attempts to overcome this problem have been made through the MIQUEST project[10], but most large databases collect a national sample from one software company (IMS and Meditel, OPCS and VAMP, etc.).
Despite these limitations, the use of GP data provides an efficient and easily accessible source of information that can be used for retrospective studies of rare events. These data are thus well suited to case-control analyses. The Department of Public Health and Primary Care, Charing Cross and Westminster Medical School were interested in conducting a study looking at events related to the use of antidepressants (such as suicides and cardiovascular side effects). A number of participating general practices had provided data from differing computer systems. As a prerequisite to analysis, these data had to be pooled. This exercise is described below.
Computerised practices from the whole of England and Wales were recruited. Their data had to meet several minimal criteria:
Initially 36 practices were telephoned and were visited by one of three doctors if they met these criteria. (Two of the practices did not in fact meet the last criterion [i.e. they had fewer than 7,000 registered patients] but were recruited because they had good quality data.) The two years of complete data were specified to be between 1st August 1992 and 31st July 1994, removing time bias between practices. Thirty-two of the practices were prepared to supply anonymous data; they were subsequently recruited for the study. The general practitioners were paid a fee for access to the information. The database received Ethics Committee approval provided that access was limited to the Department of Public Health and Primary Care, and that the data were anonymised.
The VAMP, EMIS and Meditel data arrived on data cartridges from each practice. Each patient's day of birth was altered and their names and addresses deleted to retain anonymity. For each practice there were four files:
The registration file contained registration data on the patient and consisted of a unique practice identifier and patient number (for each practice), gender (M, F and U [unknown]), date of birth (to the first of the month and year), the first four characters of the patient's postcode, NHS number, date of registering and leaving the practice, and the observed number of patient years. The file contained one record per patient. All the registration files for the practices were joined and indexed by practice ID and patient number, so that each patient could be uniquely identified.
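The joining step can be illustrated with a minimal sketch. It is not the procedure actually used in the study: the field names (`practice_id`, `patient_no`, `gender`) and the in-memory list-of-dicts layout are assumptions made for illustration. The point it shows is that a patient number is only unique within a practice, so the pooled index needs the composite key of practice ID and patient number.

```python
from typing import Dict, List, Tuple

# Composite key: a patient number is only unique within one practice.
Key = Tuple[str, int]  # (practice ID, patient number)


def pool_registration(files: List[List[dict]]) -> Dict[Key, dict]:
    """Join per-practice registration files into one index.

    Field names are hypothetical. A repeated key is treated as an error,
    because the registration file holds one record per patient.
    """
    pooled: Dict[Key, dict] = {}
    for records in files:
        for rec in records:
            key = (rec["practice_id"], rec["patient_no"])
            if key in pooled:
                raise ValueError(f"duplicate registration record for {key}")
            pooled[key] = rec
    return pooled


# Two practices that both use patient number 1 (made-up records).
practice_a = [{"practice_id": "A", "patient_no": 1, "gender": "F"}]
practice_b = [{"practice_id": "B", "patient_no": 1, "gender": "M"}]
pooled = pool_registration([practice_a, practice_b])
```

Both records survive pooling because the practice ID disambiguates the shared patient number.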
The clinical file contained clinical data on the patient and included a unique practice ID and patient number which could be mapped onto the registration file, to join patient registration data to clinical data. The file also contained information on the date the diagnosis was entered, the doctor seen on that date, and a numeric value for any biometric test, such as blood pressure readings or height and weight measurements. The diagnosis was entered in the form of a unique classification code, which was either in Read, Oxford Medical Computer System (OXMIS) or EMIS format. A list of the codes and associated diagnoses was provided by CAMS (Computer Aided Medical Systems). The file contained multiple records per patient.
The prescription file contained patient prescription data and again included a unique practice and patient field which could be mapped onto both the registration and the clinical files. In addition, the file contained the date on which the prescription was given, the doctor who was seen, the dosage (e.g. "take one 3 times/day") and the quantity supplied (e.g. "60 tabs"). A description of the drug (e.g. "Ibuprofen|Tablets|400mg") was provided along with a unique drug classification code which was either in Read, EMIS or OXMIS code format. Again the file contained multiple records per patient. Each practice used their computer system to print out patient prescriptions, so the data contained in this file were expected to be fairly accurate.
The diagnosis and drug files were interim files to allow conversion to a common format (see Figure 1). Read was chosen as the standard coding system. The diagnosis file consisted of all the unique clinical codes (Read, OXMIS or EMIS) from the clinical file and a text description of the code using the CAMS database. The drug file contained the unique drug codes from the prescription file and a text description of the prescription given, e.g. "Ibuprofen|400mg|Tablets". For both diagnosis and drug files, a 'Read' field was derived. All clinical and drug codes from the VAMP and EMIS practices were mapped to Read codes via EMIS and VAMP conversion files. Any codes which could not be converted were coded manually, by finding the closest match between the text field and the text from the Read file, for example:
"alcohol intake" -> "alcohol consumption" (diagnosis file)
"Ibuprofen|Tablets|400mg" -> "Ibuprofen|400mg|Tablets" (drugs file)
Figure 1: File relationships in the development of a pooled general practice database
This mapping from the prescription and clinical codes to Read classification codes proved to be a long and difficult process, because EMIS allowed general practitioners to develop their own codes. For these practices about 20% of codes had to be manually coded.
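The two-stage mapping described above (conversion table first, closest text match as a fallback) can be sketched as follows. This is an illustration under stated assumptions, not the study's method: the manual coding was done by hand, whereas the sketch uses Python's `difflib` to propose a candidate that a coder would still need to confirm. The codes `R001`, `R002` and `X99` are made-up placeholders, not real Read, OXMIS or EMIS codes, and the rule that a strength is any part beginning with a digit is also an assumption.

```python
import difflib
from typing import Dict, Optional


def normalise_drug_text(text: str) -> str:
    """Reorder 'Name|Form|Strength' into 'Name|Strength|Form'.

    Assumption: a part is a strength if it starts with a digit.
    """
    parts = [p.strip() for p in text.split("|")]
    name, rest = parts[0], parts[1:]
    strengths = [p for p in rest if p[:1].isdigit()]
    forms = [p for p in rest if not p[:1].isdigit()]
    return "|".join([name] + strengths + forms)


def to_read(code: str, text: str,
            conversion: Dict[str, str],
            read_terms: Dict[str, str],
            cutoff: float = 0.5) -> Optional[str]:
    """Map a source code to a Read code.

    conversion: source code -> Read code (the conversion file).
    read_terms: lower-case Read description -> Read code.
    Falls back to the closest description; returns None if nothing is
    similar enough, leaving the code for fully manual review.
    """
    if code in conversion:
        return conversion[code]
    candidates = difflib.get_close_matches(
        text.lower(), list(read_terms), n=1, cutoff=cutoff)
    return read_terms[candidates[0]] if candidates else None


# Placeholder codes only; these are not real Read codes.
read_terms = {"alcohol consumption": "R001", "asthma": "R002"}
suggested = to_read("X99", "alcohol intake", {}, read_terms)
reordered = normalise_drug_text("Ibuprofen|Tablets|400mg")
```

A similarity cutoff is a design choice: set too low it suggests wrong matches, set too high it pushes more codes to manual review.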
The clinical and prescription data were useful in their own right. Once duplicate records and patients registered outside the study period were discarded, the files could be linked to the registration file and to each other, by practice ID and patient number. Daily dosage could be calculated by multiplying the number of doses per day (derived from the dosage text) by the strength (derived from the drug description), so that the number of tablets taken per day, and hence the daily dose, could be stored for certain prescriptions; for example:
"Take one 3 times a day" (3) * "Ibuprofen|400mg|Tablets"(400mg)
=> 3*400mg => 1200mg (Daily dosage)
Again this was a long coding process as there was no consistency in the coding: "Take one 3 times/day" can be written as "Take one 3 times daily", "one t.i.d" and so on.
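A minimal sketch of this calculation is given below. It is illustrative only: the regular expressions cover just the dosage variants quoted above, the abbreviation table is a small assumed subset of common prescribing shorthand, and real free-text dosage fields would need many more patterns.

```python
import re
from typing import Optional

# Illustrative lookups only; real dosage text is far more varied.
_NUMBER_WORDS = {"one": 1.0, "two": 2.0, "three": 3.0, "four": 4.0}
_LATIN_FREQ = {"o.d": 1.0, "b.d": 2.0, "b.i.d": 2.0,
               "t.d.s": 3.0, "t.i.d": 3.0, "q.d.s": 4.0, "q.i.d": 4.0}


def _to_number(token: str) -> Optional[float]:
    """Convert 'one' or '3' to a float; return None if unrecognised."""
    if token in _NUMBER_WORDS:
        return _NUMBER_WORDS[token]
    return float(token) if re.fullmatch(r"\d+(\.\d+)?", token) else None


def units_per_day(dosage: str) -> Optional[float]:
    """Tablets (or other units) per day from free-text dosage."""
    text = dosage.strip().lower()
    # e.g. "take one 3 times/day", "take one 3 times daily"
    m = re.fullmatch(
        r"(?:take\s+)?(\w+)\s+(\w+)\s+times\s*(?:/|a\s+|per\s+)?\s*(?:day|daily)",
        text)
    if m:
        per_dose, freq = _to_number(m.group(1)), _to_number(m.group(2))
    else:
        # e.g. "one t.i.d"
        m = re.fullmatch(r"(?:take\s+)?(\w+)\s+([a-z.]+?)\.?", text)
        if not m:
            return None
        per_dose, freq = _to_number(m.group(1)), _LATIN_FREQ.get(m.group(2))
    if per_dose is None or freq is None:
        return None
    return per_dose * freq


def strength_mg(description: str) -> Optional[float]:
    """Strength in mg from a drug description such as 'Ibuprofen|400mg|Tablets'."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*mg\b", description.lower())
    return float(m.group(1)) if m else None


def daily_dose_mg(dosage: str, description: str) -> Optional[float]:
    """Daily dose = units per day x strength; None if either part is unparsed."""
    units, strength = units_per_day(dosage), strength_mg(description)
    if units is None or strength is None:
        return None
    return units * strength


dose = daily_dose_mg("Take one 3 times a day", "Ibuprofen|400mg|Tablets")
```

Returning `None` for unrecognised text, rather than guessing, keeps unparsed prescriptions visible so they can be coded by hand.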
From this information, however, a report could be created of a patient's personal history on the database (containing all details of diagnoses and prescriptions and some details of the patient's medical history before joining the practice). Although this is not practical for looking at large data sets, it is useful for smaller samples, such as patients who had attempted suicide in any practice.
For the majority of statistical software it is convenient if data are in a 'flat file' format - one record per patient. Each record contained registration data and a binary indicator for each clinical condition and prescription of interest. Indicators were initially coded for current studies relating to mental health, diabetes and oral contraception. A few common conditions were also included. The indicator was set to one if the patient had the condition/prescription and to zero otherwise. The flat file creation is summarised in Figure 2. The prescription indicators were only set to one if the prescriptions were given within the two-year period. Most of the clinical indicators had no time dimension, as the clinical conditions in which we were interested were chronic diseases.
Figure 2: Producing flat file: example for clinical data [Sorry - figure not available.]
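The flat file construction can be sketched as follows. The study dates are those given in the text; everything else (field names, the in-memory layout, and the codes `DX1` and `RX1`, which are placeholders rather than real Read codes) is assumed for illustration and is not the software actually used.

```python
from datetime import date
from typing import Dict, List, Set, Tuple

Key = Tuple[str, int]  # (practice ID, patient number)

# Study window as stated in the text.
STUDY_START = date(1992, 8, 1)
STUDY_END = date(1994, 7, 31)


def _set_flags(row: dict, code: str, flags: Dict[str, Set[str]]) -> None:
    """Set every indicator whose code set contains this code."""
    for name, codes in flags.items():
        if code in codes:
            row[name] = 1


def build_flat_file(registration: Dict[Key, dict],
                    clinical: List[dict],
                    prescriptions: List[dict],
                    clinical_flags: Dict[str, Set[str]],
                    drug_flags: Dict[str, Set[str]]) -> Dict[Key, dict]:
    """One record per patient: registration data plus 0/1 indicators."""
    flat: Dict[Key, dict] = {}
    for key, reg in registration.items():
        row = dict(reg)
        for name in list(clinical_flags) + list(drug_flags):
            row[name] = 0  # default: condition/prescription absent
        flat[key] = row
    # Clinical indicators have no time dimension (chronic conditions).
    for rec in clinical:
        row = flat.get((rec["practice_id"], rec["patient_no"]))
        if row is not None:
            _set_flags(row, rec["read_code"], clinical_flags)
    # Prescription indicators count only within the two-year window.
    for rec in prescriptions:
        row = flat.get((rec["practice_id"], rec["patient_no"]))
        if row is not None and STUDY_START <= rec["date"] <= STUDY_END:
            _set_flags(row, rec["read_code"], drug_flags)
    return flat


# Made-up example: one diagnosis, one prescription before the window.
registration = {("A", 1): {"gender": "F"}}
clinical = [{"practice_id": "A", "patient_no": 1, "read_code": "DX1"}]
prescriptions = [{"practice_id": "A", "patient_no": 1,
                  "read_code": "RX1", "date": date(1991, 5, 1)}]
flat = build_flat_file(registration, clinical, prescriptions,
                       {"diabetes": {"DX1"}}, {"antidepressant": {"RX1"}})
```

In this example the diabetes indicator is set, while the antidepressant indicator stays at zero because the prescription predates the study period.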
Data quality was assessed by requiring each practice to be within 10% of the age-standardised expected birth and death rates. Two practices did not fit into this range and were excluded from the study, leaving 30 practices.
The registration file was robust in practice. There were no duplicate records and most of the information was self-explanatory and easy to interpret.
Our practices were obtained from selected areas in England and Wales. The pooled database, in its flat file format, should be validated against the general population that it is attempting to approximate. Our sample population in the middle of 1993 should be similar to the England and Wales population statistics in that year (provided by OPCS and based on the 1991 census[11]).
Figure 3: Number of people in database by gender and 5-year age groups (31st July 1993) compared with the 1993 population in England and Wales [Sorry - figure not available.]
Figure 3 shows the sample population by age compared with the England and Wales OPCS population. The graph shows that the sample population is lower between 5 and 19 years and higher between 20 and 39 years, particularly for females. It is, however, a reasonably good fit and implies that our sample population is representative of the general population. This gives more weight to the generalisation of any conclusions drawn.
The age-adjusted standardised mortality ratios were calculated for each year in the two-year period, giving a value of 95 for males and 97 for females. Both are low, but are within 5% of the expected OPCS mortality rates. Birth rates were more difficult to calculate. There was little information on pregnancy and delivery in the clinical file. Birth rates were calculated by counting the number of children who were registered with each practice under the age of three months. It was assumed that there was a 100% registration coverage for infants of this age. The age-standardised birth ratio was 90. This was not adjusted for gender as too many of the babies had an 'unknown gender' recorded.
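A standardised ratio of this kind is commonly obtained by indirect standardisation: the observed count is divided by the count expected if reference (here, national) age-specific rates applied to the sample, and the result is multiplied by 100. The sketch below assumes that approach, which the text does not spell out. The age bands, person-years and rates are invented numbers for illustration only, and the tolerance check mirrors the 10% rule used to assess practice data quality.

```python
from typing import Dict


def standardised_ratio(observed: float,
                       person_years: Dict[str, float],
                       reference_rates: Dict[str, float]) -> float:
    """Indirectly standardised ratio (100 = same as reference population).

    expected = sum over age bands of person-years x reference rate.
    """
    expected = sum(person_years[band] * reference_rates[band]
                   for band in person_years)
    return 100.0 * observed / expected


def within_tolerance(ratio: float, tolerance: float = 0.10) -> bool:
    """True if the ratio lies within the given fraction of 100."""
    return abs(ratio - 100.0) <= 100.0 * tolerance


# Invented figures: two age bands and one year of follow-up.
person_years = {"0-64": 9000.0, "65+": 1000.0}
reference_rates = {"0-64": 0.002, "65+": 0.05}  # deaths per person-year
smr = standardised_ratio(65, person_years, reference_rates)
acceptable = within_tolerance(smr)
```

With these invented figures 68 deaths are expected, so 65 observed deaths give a ratio a little below 100, inside the 10% band.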
Data on occupations, ethnicity and marital status were rarely recorded, with only 2.8%, 0.4%, and 2.2% respectively of patients with any recordings in these fields. Postcodes, on the other hand, were generally well-recorded (94.5%) - although we only looked at the first 4 characters.
This paper illustrates the value of general practice data for epidemiological research. Our work seems to indicate that the sample's approximation to the general population is good, using age and death statistics as markers for comparability. The low standardised birth ratio is slightly worrying, as it is believed that most babies are registered with their general practitioners so that they can attend their initial six-week examination. Validation studies have shown that general practice data are accurate, with over a 90% correlation with written notes. This is particularly true of chronic diseases such as schizophrenia and diabetes[12,13] although there may be ambiguity in doctors' reporting of diseases where diagnostic behaviour varies, such as depression[13]. This would need to be investigated further. Similarly, prescribing data are valid, although this does not necessarily mean that the prescribed medication was taken.
Once constructed, the database is a significant resource for epidemiological analysis. There are limitations which should be recognised:
From the general practice perspective, the general practitioners rarely enter all the patient information into the computer, as this would prove too time-consuming for their purposes, so there are often missing and ambiguous data. Some variables are more affected than others, and this is an analytical weakness if a poorly recorded variable is an anticipated confounder to the condition under investigation. The completeness of recording postcodes as opposed to ethnicity, for example, suggests that correspondence is important within practices, but general practitioners are less interested in personal details that are not relevant to their prescribing. We also chose larger practices so that fewer had to be recruited. Such practices, which can afford such computerised systems, may not be typical of smaller practices.
We have shown that computerised general practice data can be converted for use in epidemiological studies. If a large enough sample is obtained, conclusions may be drawn through large-sample statistical analysis. On a more general scale, however, as more practices adopt computerised systems the sample will become more representative, and a standardised system of coding will be needed. If general practice data are to be used for epidemiological research, then standard demographic data need to be recorded for all patients. The extent of these data and how they are recorded needs to be agreed with participating practices and would be entered into the database by receptionists or clerical staff within the practice. Data entry personnel can be employed to enter more detailed information on patients, but this requires considerable financial investment from general practitioners which may not be cost-effective. With more incentives, however, there may be a move in the future towards standardised computer systems and general practice-based epidemiological studies.
Too many unsubstantiated, unreferenced assertions:
Method:
Results:
Overall:
Our reviewers had some reservations about this paper. However, I have published the paper, together with some of the reviewers' comments, because I would like to start a debate on the use of GP data for purposes other than the original purpose for which it was collected. This is likely to happen increasingly across the country, as Departments of Public Health in our new unified Health Authorities realise the 'data mine' [Richards J. GP Data Goldmine. Jnl Informatics in Primary Care 1993 (Jan):6-8] they have potentially available to them. My own concerns are for the confidentiality of patient data: notwithstanding Ethics Committee approval, this study retained the patients' NHS Numbers, although the names and addresses were deleted from the study database. I cannot see the need for this, as each patient was given a unique study number. I was also unclear about why the researchers excluded two practices whose data were not within 10% of the age standardised expected birth and death rates. There may be perfectly valid reasons for this deviation from the norms, which have no bearing on the quality of the data in the computer system.
I hope that our correspondence columns are full in the next issue!