Korea University
SAS DATA Step
This page provides brief, essential information on the SAS DATA step. The two pages for the INPUT statement and for the INFILE statement and the IMPORT/EXPORT procedures were separated from this DATA step document in April 2003.

This web document may not be used for any commercial purposes. This page may contain mistakes and errors. If you have any questions or suggestions, please leave a message on the SAS bulletin board.

    Overview | Examples | Data Sources | Library | DATA Statement | Data Sets | Selecting
    Appending | Merging | Manipulation | Recoding | Renaming | Multiple Response | References

DATA STEP OVERVIEW

A SAS program is a collection of SAS statements that may include keywords, various names (e.g., data sets, and variables), special characters, and operators. A SAS statement may be used in a DATA step, PROC (procedure) steps, or anywhere in a SAS program.

A SAS program consists of DATA steps and PROC (procedure) steps. DATA steps handle data sets, while PROC steps actually conduct analyses.

A DATA step is used to create or modify data sets by creating and modifying variables; checking and correcting errors in data sets; and writing programs (for simulations).

SAS has the following basic rules.

  • A statement may begin and end anywhere on a line, and may span several lines.
  • A statement ends with a semicolon (;). A line can hold more than one statement.
  • SAS is not case-sensitive.
  • Arithmetic operators (+, -, *, and /) return a missing value when any operand is missing, while functions such as SUM and MEAN ignore missing values.
  • A comment is enclosed by /* and */.
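The rule about missing values can be checked with a small sketch. The variable names below (a, b, total_op, total_fn) are made up for illustration; the step writes its result to the SAS log rather than to a data set.

```sas
/* Minimal sketch: operators vs. functions with a missing value. */
/* All variable names here are hypothetical.                     */
DATA _NULL_;
   a = 10;
   b = .;                   /* numeric missing value                   */
   total_op = a + b;        /* operator: the result is missing         */
   total_fn = SUM(a, b);    /* function: missing is ignored, result 10 */
   PUT total_op= total_fn=; /* write both values to the log            */
RUN;
```

The log should show total_op=. and total_fn=10, along with a note that missing values were generated.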

SAS statements used in a DATA step are either executable (e.g., DO, INPUT, INFILE, OUTPUT) or declarative (e.g., ARRAY, DATALINES, DROP, RETAIN).

SAS has arithmetic, relational, logical, and concatenation (||) operators. SAS does not have a modulus operator; the MOD function is used instead.
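As a brief illustration of the MOD function and the concatenation operator, consider the sketch below. The variable names (remainder, phrase) are hypothetical.

```sas
/* Minimal sketch: MOD function and the || operator. */
DATA _NULL_;
   remainder = MOD(17, 5);             /* 17 divided by 5 leaves 2  */
   phrase = 'DATA' || ' ' || 'step';   /* joins three strings       */
   PUT remainder= phrase=;             /* write values to the log   */
RUN;
```

The log should show remainder=2 and phrase=DATA step.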

SAS has various functions for mathematics, statistics, string, date/time, probability, and randomization.

The OPTIONS statement changes the value of SAS system options that affect SAS system initialization, hardware and software interfacing, and the input, processing, and output of jobs and SAS files. See Chapter 8 of SAS Language Reference: Dictionary (1451-1647).

OPTIONS PAGESIZE=60 LINESIZE=100 PAGENO=1 NODATE NOCENTER;
  • PAGESIZE (PS) sets the number of lines per output page.
  • LINESIZE (LS) specifies the maximum width of a line.
  • NODATE suppresses the display of the date and time in output pages.
  • NOCENTER left-aligns output.
  • PAGENO resets the first page number to be printed.


TYPICAL EXAMPLES

Let us consider a typical DATA Step example that reads an ASCII text file "tiger.dat".

DATA student;
INFILE 'A:\tiger.dat';
INPUT id name $ math stat;
RUN;
  • The DATA statement specifies a data set "student" in which the output of the DATA step is stored. Because no library is given, "student" is a temporary data set in the WORK library and is deleted when the SAS session ends. If you want to save a data set into a permanent file, use a SAS data library (LIBNAME).
  • INFILE 'A:\tiger.dat'; retrieves data from the 'tiger.dat' stored in the A: drive. The data file should be in the ASCII text format.
  • The INPUT statement specifies the arrangement of data to be read. The example reads a variable "id" as numeric, "name" as string, and numeric "math" and "stat." See the INPUT statement for details.
  • RUN; gives a cue to execute the preceding DATA step or PROC step.

The following example reads three variables directly from the data stream in the DATA step.

OPTIONS PAGESIZE=60 LINESIZE=100 PAGENO=1 NODATE NOCENTER;

DATA growth;
INPUT year y x;

DATALINES;
2000 100 400
2002 120 380
2004 351 684
RUN;
  • The DATALINES (or CARDS) statement indicates that data lines follow. So the INFILE statement is not necessary.
  • Notice that ';' should not be put at the end of each data line.

The next example illustrates the DATALINES4 statement, which is needed when data lines contain semicolons (;).

DATA reference;
INPUT no ref $30.;

DATALINES4;
1 Perry 1999; Good 2000
2 Dimaggio and Powell 1983
;;;;
RUN;
  • $30. means that the variable "ref" is 30 characters long. The period (.) may not be omitted.
  • Note that there should be four semicolons at the end of data entry.


DATA SOURCES

What is a data set in SAS? A SAS data set is a group of data values that SAS creates and processes. It contains a table of observations (rows) and variables (columns) as well as descriptor information (e.g., variable names and formats). A SAS data set is often referred to as a SAS data file. A SAS data view is a virtual data set of descriptor information that points to data from other sources.

SAS has a powerful feature of data manipulation that can handle various data sources such as ASCII text, database, and spreadsheet. You may type in data and directly read them using the DATALINES (or CARDS) statement.

SAS can read ASCII text files delimited with spaces, commas (CSV), tabs, and other characters using the INPUT/INFILE/DATALINES statements in a DATA step. The INFILE statement can also read remote data files when a FILENAME statement assigns a file reference with an access method such as FTP, URL, or SOCKET (TCP/IP).
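Reading a comma-delimited file might look like the sketch below. The file path and its layout (a header line followed by id, name, math, and stat) are assumptions made only for illustration.

```sas
/* Minimal sketch: reading a CSV file in a DATA step.            */
/* 'c:\temp\tiger.csv' is a hypothetical file with a header row. */
DATA student;
   INFILE 'c:\temp\tiger.csv' DLM=',' DSD FIRSTOBS=2;
   /* DLM=','    : fields are separated by commas                */
   /* DSD        : consecutive commas mean a missing value,      */
   /*              and quoted strings may contain commas         */
   /* FIRSTOBS=2 : skip the header line                          */
   INPUT id name $ math stat;
RUN;
```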

The IMPORT procedure can read these ASCII text files, and it can also import database (dBASE III, FoxPro, Access) and spreadsheet (Excel and Lotus 1-2-3) files. SAS/SQL (PROC SQL) allows you to connect to those database and spreadsheet files through ODBC (Open Database Connectivity).

SAS data sets may be generated by PROC steps. For example, the MEANS procedure can produce a data set with aggregate statistics and matrices may be transformed into data sets in SAS/IML. The following PROC REG saves the residuals and predicted values to "pew_work" that includes original variables in "jeeshim.pew2004" as well.

PROC REG DATA=jeeshim.pew2004;
   MODEL engagement = interest knowledge income egov /R P;
   OUTPUT OUT=pew_work R=residual P=predict;
RUN;
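The MEANS procedure mentioned above can be sketched in the same way. The output data set name (pew_stats) and the statistic names (avg_income, sd_income) are hypothetical; "income" is taken from the MODEL statement above.

```sas
/* Minimal sketch: saving aggregate statistics into a data set. */
PROC MEANS DATA=jeeshim.pew2004 NOPRINT;   /* NOPRINT suppresses the listing */
   VAR income;
   OUTPUT OUT=pew_stats MEAN=avg_income STD=sd_income;
RUN;
```

The resulting "WORK.pew_stats" holds one observation with the mean and standard deviation of income, plus the automatic variables _TYPE_ and _FREQ_.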

Finally, you can generate data using functions, in particular, random number generators in a DATA step.

DATA dgp;
   DO i=1 TO 10 BY 1;
      random=RANNOR(7654321);
      OUTPUT dgp;
   END;
RUN;
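The same idea can be written with the RAND function, which supports many distributions. The sketch below is an alternative to the RANNOR example, not a replacement for it; the data set and variable names (dgp2, z, u) are hypothetical.

```sas
/* Minimal sketch: random numbers with RAND and a fixed seed. */
DATA dgp2;
   CALL STREAMINIT(7654321);   /* set the seed for RAND         */
   DO i = 1 TO 10;
      z = RAND('NORMAL');      /* standard normal draw          */
      u = RAND('UNIFORM');     /* uniform draw between 0 and 1  */
      OUTPUT;                  /* write one observation per loop */
   END;
RUN;
```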


USING SAS DATA LIBRARY

A SAS data library is an alias for a collection of data sets, which makes data management more convenient and efficient. Like a directory or folder, a library tells SAS the place where data sets exist. Unlike a directory or folder, a library is not physical but logical, in the sense that the library itself does not exist in any secondary memory unit.

Every data set is referred to through a library in SAS, although the default library, WORK, is often omitted. If you want to retrieve a data set by a point-and-click method, use the SAS Enterprise Guide.

The LIBNAME statement associates a library reference (libref) with a SAS data library, typically a specific directory or folder. It declares which directory the specified library refers to. Libraries should be declared before the DATA steps and PROC steps that use them.

If you want to use the default WORK library, you do not need to declare any library. However, you should know that data sets in the WORK library are temporary: they are kept in a scratch location that SAS clears when the session ends, not in a permanent file. If you want to store data sets in permanent physical files, you must use your own libraries.

The following LIBNAME statement declares a library "jeeshim" that is associated with c:\temp. A specific SAS data file is referred using a library name and a file name divided by a period. The "jeeshim.nes2004" indicates the "nes2004.sas7bdat" in the "jeeshim" library (c:\temp).

LIBNAME jeeshim 'c:\temp\';

DATA jeeshim.nes2004;
...;
RUN;

PROC REG DATA=jeeshim.nes2004;
...;
RUN;

How do you know which data sets are included in a library? Use the CONTENTS procedure with the keyword _ALL_, or the DATASETS procedure. DATASETS can also manipulate (e.g., copy and delete) data sets.

PROC CONTENTS DATA= jeeshim._ALL_;
RUN;

PROC DATASETS LIBRARY=jeeshim DETAILS;
QUIT;
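Copying and deleting with the DATASETS procedure might look like the sketch below. The member name "scratch" is hypothetical, and DELETE removes a data set permanently, so the example should be adapted with care.

```sas
/* Minimal sketch: copy one data set and delete another.       */
PROC DATASETS LIBRARY=jeeshim NOLIST;   /* NOLIST hides the directory listing */
   COPY OUT=WORK;        /* copy from jeeshim into WORK        */
      SELECT nes2004;    /* copy only this member              */
RUN;
   DELETE scratch;       /* hypothetical member of jeeshim     */
QUIT;                    /* DATASETS keeps running until QUIT  */
```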

If you need to use specific libraries frequently, declare them in the autoexec.sas, an ASCII text file in the SAS root directory. SAS automatically executes statements in the file immediately after SAS is launched. Consider the following example.

OPTIONS PAGESIZE=55 LINESIZE=80 NOCENTER;
LIBNAME jeeshim 'c:\data\sas';
FILENAME nes 'c:\data\sas\nes2004.txt';
  • The FILENAME statement specifies a file name that refers to a physical file in a secondary memory unit.
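Once a file reference is declared, it can replace the quoted path in an INFILE statement. In the sketch below the variable list is an assumption, since the layout of nes2004.txt is not described here.

```sas
/* Minimal sketch: using a fileref in the INFILE statement. */
FILENAME nes 'c:\data\sas\nes2004.txt';

DATA nes_raw;
   INFILE nes;             /* the fileref stands for the physical file */
   INPUT id age income;    /* hypothetical variables                   */
RUN;
```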

You may specify a SAS engine name such as EXCEL. If you want to deassign a library, add CLEAR after the library reference, without a path.

LIBNAME xls EXCEL 'c:\data\excel\airline.xls';
LIBNAME xls CLEAR;


DATA STATEMENT

The DATA statement begins a DATA step and provides data set names. The output of a DATA step is stored into the data set specified.

LIBNAME jeeshim 'c:\temp\';

DATA jeeshim.egov;
...;
RUN;

A SAS DATA step can create more than one data set. The following example creates two data sets "WORK.egov1" and "WORK.egov2" from "jeeshim.egov." The "egov1" and "egov2" in the WORK library are identical except that "egov2" does not include the variables "state" and "msa," and has a variable "id" whose name is changed from "respid."

DATA egov1
   egov2 (DROP=state msa RENAME=(respid=id));
SET jeeshim.egov;
...;
RUN;

If a data set name is omitted, SAS will automatically name each successive data set WORK.data1, WORK.data2, WORK.data3, and so on. These data sets, however, may consume computing resources and slow down the access and response speed.

DATA;
...
RUN;

If you want to use a DATA step only for transactions, you may use _NULL_ in the DATA statement to enhance memory management efficiency. The _NULL_ tells SAS not to create any data set when it executes the DATA step.

DATA _NULL_;
...
RUN;
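A typical use of _NULL_ is to report something in the log without creating a data set. The sketch below counts the observations in "jeeshim.egov"; the variable name "last" is hypothetical.

```sas
/* Minimal sketch: a DATA step that only writes to the log. */
DATA _NULL_;
   SET jeeshim.egov END=last;   /* last=1 on the final observation */
   IF last THEN PUT 'Observations read: ' _N_;
RUN;
```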


SELECTING OBSERVATIONS

How do you select and delete some observations in a data set? The IF statement can do that for you.

The following example retrieves observations from a data set "jeeshim.pew2004"; selects only male observations (male=1) and discards female observations; and stores the result into a data set "WORK.pew_work." An IF statement without THEN is a subsetting IF: only observations that meet the condition are kept. An explicit OUTPUT statement gives the identical result (IF male EQ 1 THEN OUTPUT;).

DATA pew_work;
SET jeeshim.pew2004;
IF male = 1;
RUN;

You may use the DELETE statement that works in the reverse way. This statement removes observations that meet the conditions provided.

DATA pew_work;
SET jeeshim.pew2004;
IF male = 0 THEN DELETE;
RUN;

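The REMOVE statement following the MODIFY statement in a DATA step also deletes observations. Because MODIFY changes a data set in place, the data set named in the MODIFY statement must also be named in the DATA statement, and the observations are removed from "jeeshim.pew2004" itself.

DATA jeeshim.pew2004;
MODIFY jeeshim.pew2004;
IF male EQ 0 THEN REMOVE;
RUN;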

You may also select observations by specifying a range of record numbers. Use _N_, an automatic SAS variable that counts the iterations of the DATA step and thus matches the record number when one observation is read per iteration.

DATA pew_user pew_nonuser;
SET jeeshim.pew2004;
IF _N_ <= 500 THEN OUTPUT pew_user;
   ELSE OUTPUT pew_nonuser;
RUN;

The first 500 observations are saved into "WORK.pew_user," while remaining observations are put into "WORK.pew_nonuser."
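A range of records can also be picked with the FIRSTOBS= and OBS= data set options. This is a sketch of an alternative to the _N_ approach; the output name "pew_slice" is hypothetical.

```sas
/* Minimal sketch: read observations 101 through 200 only.  */
DATA pew_slice;
   SET jeeshim.pew2004 (FIRSTOBS=101 OBS=200);
   /* FIRSTOBS= is the first observation to read;           */
   /* OBS= is the number of the LAST observation to read,   */
   /* so this step keeps 100 observations.                  */
RUN;
```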

You may try the WHERE statement, which selects observations from an existing data set without physically removing observations that do not meet a condition.

DATA pew_female;
   SET jeeshim.pew2004;
   WHERE male EQ 0;
RUN;

The above DATA step reads only female observations (male=0) from jeeshim.pew2004 and then stores them into pew_female. Note that SAS checks whether observations meet the condition when executing SET, MERGE, MODIFY, and UPDATE statements.
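The same selection can be written as a WHERE= data set option attached to the input data set. The sketch below should give the same result as the WHERE statement above; the output name "pew_female2" is hypothetical.

```sas
/* Minimal sketch: WHERE= as a data set option on SET. */
DATA pew_female2;
   SET jeeshim.pew2004 (WHERE=(male EQ 0));   /* filter while reading */
RUN;
```

The option form is handy when several input data sets need different conditions in one step.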

In a DATA step, WHERE cannot be used together with INFILE and DATALINES. In a procedure step, this statement limits the observations used in analysis.

PROC REG DATA=pew_female;
   MODEL money = income;
   WHERE it_use AND age >= 20;
RUN;

"WHERE it_use" selects observations whose value of it_use is neither missing nor zero.


SELECTING VARIABLES

You can select variables using the KEEP and DROP statements. The following example reads observations from "jeeshim.gss2004"; selects only four variables; and then stores them into "WORK.gss_work1."

DATA gss_work1;
SET jeeshim.gss2004;
KEEP income education male egov;
RUN;

Alternatively, you may add the KEEP option in the DATA statement to make it simple.

DATA gss_work1 (KEEP=income education male egov);
SET jeeshim.gss2004;
RUN;

The following two examples exclude three variables "state", "msa" and "vote" from "WORK.gss_work2."

DATA gss_work2;
SET jeeshim.gss2004;
DROP state msa vote;
RUN;

DATA gss_work2 (DROP=state msa vote);
SET jeeshim.gss2004;
RUN;

Keep in mind that the KEEP and DROP statements should not be used together in the same DATA step. However, you may use both KEEP and DROP options in a DATA statement.

DATA gss_work1 (KEEP=income education male)
     gss_work2 (DROP=state msa vote);
SET jeeshim.gss2004;
RUN;
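KEEP= and DROP= can also be attached to the input data set in the SET statement, so that unwanted variables are never read into the DATA step. In the sketch below the names "gss_small" and "log_inc" are hypothetical.

```sas
/* Minimal sketch: KEEP= on the input data set. */
DATA gss_small;
   SET jeeshim.gss2004 (KEEP=income education male);  /* read three variables only */
   log_inc = LOG(income);   /* new variables can still be created */
RUN;
```

On the DATA statement the option controls what is written; on the SET statement it controls what is read.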


APPENDING OBSERVATIONS

If you want to append observations, use the SET statement to add observations in secondary data sets (jeeshim.nes2002) to the master data set (jeeshim.nes2004).

DATA nes_work;
   SET jeeshim.nes2004 jeeshim.nes2002;
RUN;

The APPEND procedure and the DATASETS procedure also append the observations from one SAS data set to the end of the master data set. These procedures are useful when the master data set is huge.

PROC APPEND BASE=jeeshim.nes2004 DATA=jeeshim.nes2002;
RUN;

If master and secondary data sets have different data structures, the FORCE option is necessary. This option, however, does not append the variables that exist only in secondary data set.

PROC DATASETS;
   APPEND BASE=jeeshim.nes2004 DATA=jeeshim.nes2002 FORCE;
RUN;
QUIT;


MERGING DATA SETS

SAS MERGE and UPDATE statements can merge SAS data sets. There are two types of merging: one-to-one merging and match-merging.

The one-to-one merging mechanically puts data sets together without distinguishing one observation from others. It looks like putting a new sheet of paper over an existing paper.

DATA merge_work1;
   MERGE math stat;
RUN;

The match-merging distinguishes individual observations using identification variables (e.g., id and name). Thus, it requires the BY statement that specifies the common key variables, and the data sets must be sorted (or indexed) by those variables.

DATA merge_work2;
   MERGE jeeshim.pew jeeshim.egov;
   BY year state;
RUN;
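A fuller match-merge usually sorts both inputs first and may use the IN= option to control which observations survive. The sketch below assumes the second data set is stored as "jeeshim.egov"; the names pew_sorted, egov_sorted, merge_both, in_pew, and in_egov are hypothetical.

```sas
/* Minimal sketch: sort, match-merge, and keep only matched rows. */
PROC SORT DATA=jeeshim.pew OUT=pew_sorted;
   BY year state;
RUN;

PROC SORT DATA=jeeshim.egov OUT=egov_sorted;
   BY year state;
RUN;

DATA merge_both;
   MERGE pew_sorted (IN=in_pew) egov_sorted (IN=in_egov);
   BY year state;
   IF in_pew AND in_egov;   /* keep BY groups found in both inputs */
RUN;
```

Dropping the subsetting IF keeps every BY group from either data set, with missing values where one side has no match.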

You may also use the UPDATE statement with the UPDATEMODE=NOMISSINGCHECK option, which lets missing values in the second (transaction) data set overwrite values in the master data set. Since this statement supports only match-merging, the BY statement is required.

DATA merge_work3;
   UPDATE jeeshim.pew jeeshim.egov UPDATEMODE=NOMISSINGCHECK;
   BY year state;
RUN;

See merge.pdf for actual examples of the MERGE statement and the UPDATE statement. For complicated merging, use SAS/SQL to take advantage of SQL statements.
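A join in PROC SQL might be sketched as follows. It assumes the second data set is stored as "jeeshim.egov" and contains a variable named "egov"; both the table name "merge_sql" and that variable are assumptions for illustration.

```sas
/* Minimal sketch: an inner join on year and state. */
PROC SQL;
   CREATE TABLE merge_sql AS
   SELECT p.*, e.egov                  /* all of pew plus one egov column */
   FROM jeeshim.pew AS p
        INNER JOIN jeeshim.egov AS e
        ON p.year = e.year AND p.state = e.state;
QUIT;
```

Unlike the MERGE statement, PROC SQL does not require the inputs to be sorted beforehand.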


MANIPULATING VARIABLES

Variables are created, modified, recoded, and/or deleted in DATA steps.

DATA nes_work;
SET jeeshim.nes2004;

interest = principal*(1+r)**year;
id =_N_;

luck = RANNOR(9876543);
log_inc = LOG(income);
lag_inc = LAG(income);
dif_inc = DIF(income);
RUN;
  • "interest" is created by an expression of a formula.
  • "id" is created by a system variable _N_.
  • "luck", "log_inc", "lag_inc", and "dif_inc" are created from functions.

The RETAIN statement is useful for various tasks. For example, you can compute the cumulative sum of a variable.

DATA egov_work;
SET jeeshim.egov;
RETAIN sum 0 lag_inc;

sum = sum + income;

dif_inc = income - lag_inc;
OUTPUT;
lag_inc = income;
RUN;
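A cumulative sum can also be written with the sum statement, which retains its variable automatically. This is a sketch of an alternative to the RETAIN approach; the names "egov_cum" and "cum_income" are hypothetical.

```sas
/* Minimal sketch: cumulative sum with the sum statement. */
DATA egov_cum;
   SET jeeshim.egov;
   cum_income + income;   /* retained, starts at 0, and skips missing income */
RUN;
```

With "sum = sum + income;" a single missing income makes every later total missing, whereas the sum statement carries the running total forward.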

Recoding: You may recode a variable using the IF statement.

IF area IN (1, 3) THEN region=1;
   ELSE IF area=2 THEN region=2;
   ELSE region=3;

The following usage is very convenient despite its complexity.

new_var = (var >0) + (var >5) + (var >10);

This usage is equivalent to the following.

IF var <=0 THEN new_var=0;
IF var > 0 THEN new_var=1;
IF var > 5 THEN new_var=2;
IF var >10 THEN new_var=3;
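The equivalence rests on the fact that a comparison evaluates to 1 when true and 0 when false. The sketch below checks the expression for a few made-up values and writes the results to the log.

```sas
/* Minimal sketch: each true comparison adds 1 to new_var. */
DATA _NULL_;
   DO var = -2, 3, 7, 12;   /* hypothetical test values */
      new_var = (var > 0) + (var > 5) + (var > 10);
      PUT var= new_var=;
   END;
RUN;
```

The log should show new_var equal to 0, 1, 2, and 3 for the four values in turn.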

You may recode a variable in a reverse order using an array.

ARRAY a_var trust;
DO OVER a_var;
   a_var=6-a_var;
END;

SAS arrays can also conduct more complicated tasks as follows.

ARRAY a_var(3) q19-q21;

DO i=1 TO 3;
   a_var(i)=6-a_var(i);
END;

Renaming: You may change variable names using the RENAME statement.

RENAME old=new;
RENAME old1=new1 old2=new2 old3=new3;
RENAME sex=male old1-old3=new1-new3;


MULTIPLE RESPONSE QUESTIONS

If you need to handle multiple response questions, stack up the data set using the OUTPUT statement.

Suppose that respondents are asked to pick three choices out of ten regardless of order in choices (equal weight). The choices are coded into three variables x1 through x3.

DATA fruit;
INPUT age grade x1 x2 x3;
choice = x1; OUTPUT;
choice = x2; OUTPUT;
choice = x3; OUTPUT;

DATALINES;
...
RUN;

The OUTPUT statement is executed three times to generate three observations per subject.
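A complete version of this idea, followed by a frequency table of the stacked choices, might look like the sketch below. The two data lines are made up purely for illustration.

```sas
/* Minimal sketch: stack three responses per subject, then tabulate. */
DATA fruit;
   INPUT age grade x1 x2 x3;
   choice = x1; OUTPUT;
   choice = x2; OUTPUT;
   choice = x3; OUTPUT;
   DROP x1-x3;            /* the original columns are no longer needed */
DATALINES;
21 3 1 4 7
19 1 2 4 9
;
RUN;

PROC FREQ DATA=fruit;
   TABLES choice;         /* counts every choice regardless of position */
RUN;
```

With these two subjects the stacked data set has six observations, and choice 4 is counted twice.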


REFERENCES
  • SAS Institute. 2005. SAS Language Reference: Concepts, 2nd ed., Version 9. Cary, NC: SAS Institute.
  • SAS Institute. 2005. SAS Language Reference: Dictionary, 2nd ed., Version 9. Cary, NC: SAS Institute.
  • Korea University Computer Center. 1980s. SAS Workshop Manual.
  • Korea University Computer Club. 1980s. SAS User's Guide.
  • Kim, Choong Ryun. 1993. The Statistics Package Called SAS: Focusing on the Statistics Analysis and Marketing Research Methods. Seoul: Data Research.