B. Sc. Part I Semester I I.I Introduction to Statistics :Nature of Data, Sampling, Classification and Tabulation

 


B. Sc. Part 1 Semester 1

DSC-7A-STATISTICS-1

(DESCRIPTIVE STATISTICS-1)

Theory: 30 hrs. Mark 50 (Credits: 02)

 

‘Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write’ - H.G. Wells

 

1.1 Introduction to Statistics:

Meaning of the word Statistics

Statistics is as old as language. It has been in use since man began to count & measure. The word ‘Statistics’ has several meanings.

Statistics (plural): figures, numbers, data (for example, the statistics of a batsman)

Statistics (singular): methods, science, the subject itself

In ancient days kings used to maintain records of manpower, wealth, population, income, taxes, military strength etc. The word 'Statistics' seems to be derived from the Latin word 'status', the Italian word 'statista', the German word 'Statistik' or the French word 'statistique', all related to the idea of a political state.

Some definitions of statistics are given below

Statistics as numerical data

It is the science of kings. It is the science of numbers. It is the body of knowledge.

It is the science of applied mathematics.  It is the science of probability.

It is the science of gamblers. It is the science of sampling.

Statistics as statistical methods

It is the science of counting. It is the science of averages.

The best of the definitions is given by Croxton & Cowden: "Statistics may be defined as the collection, presentation, analysis & interpretation of numerical data."

Importance of statistics

1. Statistical methods make it possible to condense data, and they serve several purposes beyond summarization.

2. Statistical methods give tools of comparison.

3. Estimation and prediction are also possible using statistical tools.

4. One can get an idea about the shape, spread and symmetry of the data.

5. Inter relation between two or more variables can be measured using statistical techniques.  

6. Statistical methods help in planning, controlling, decision making etc.

7. The use of statistical methods is important because a considerable amount of time, money and manpower can be saved.

8. Uncertainties can be reduced to get reliable results.

9. Statistical methods give systematic methods of data collection and investigation.

 Various fields where statistics is used (Scope of Statistics):

The application of statistical techniques is widespread. From a common man in his everyday life to an expert in his special field, everybody uses statistical methods.

i) Statistics & agriculture- The success of our 'Green Revolution' is due to the use of statistical techniques (Which type of sugarcane yields the maximum amount of sugar? Which fertilizer is the most suitable for a particular soil?)

ii) Statistics & Medical sciences- Statistical methods are used in medical & pharmaceutical research (effectiveness of a new drug, cause or causes of a disease). 'Smoking causes cancer' is a statistical result.

iii) Statistics & economics- The conclusions of economists in production, taxation, import, export, etc. are all based upon the statistical data & techniques.

 iv) Statistics in industry (Engineering) - In industry to control the quality of the product, to design & test the new products before they are marketed.

v) Statistics in Biological Sciences - Analysis of agriculture experiments makes use of statistical method known as design of experiment. The regression analysis was used by Sir F. Galton in the field of genetics. Pearson pioneered the study of correlation analysis. Various methods of cultivation, irrigation are tested and compared using statistical tests. Estimation of number of trees in jungle, forest density, number of animals in a jungle, fish in a lake etc. can be done using various Statistical techniques. Demography uses statistical methods for forecasting population, measuring death rates, birth rates, growth rates.

vi) Use of Statistics in Social Sciences- In social sciences we need to test the association between two variables such as education and criminality, education and marriage adjustment score, sex and education, richness and criminality etc. Statistical methods are used for this purpose.

vii) Use of statistics in the business world- i) Use in economics. ii) Use in commerce: Statistics provides methods of forecasting production, demand & supply. iii) Business management: Statistical methods help the manager to analyze the problem & take decisions.

viii) Statistics & other sciences- The methods of statistics are also helpful in Geology, Biology, Psychology, Sociology, and Meteorology.

It is difficult, in fact, to find any scientific activity where statistical methods are not serviceable.

Limitations of Statistics

i) Statistics does not study individuals as it deals with aggregates.

ii) It does not study qualitative phenomena.

iii) Its results are true only on an average.

iv) Its results are subject to bias.

v) It can be misused.

Statistical Organizations in India and their functions:

1. Central Statistical Organization (CSO):

It was established by the Central Government in May 1951 and works under the Cabinet Secretariat. It is the main coordinating agency for the various statistical organizations at the center as well as in the states of our country. It looks after various activities such as collection & compilation of data. CSO brings out a number of publications such as:

i) Monthly Abstract Statistics

ii) Monthly Statistics of the Production of Selected Industries of India

iii) Statistical Pocket Book of the Indian Union.

iv) Statistical Abstract of India

v) Annual Survey of Industries

 2) National Sample Survey (N.S.S.)

NSS was set up in 1950 under the guidance of P.C. Mahalanobis and it was reorganized in 1970 under the name National Sample Survey Organization (N.S.S.O.). Its main functions are:

i) Collection of socio-economic and demographic data for the estimation of national income and for the activities of the Planning Commission and various Ministries.

ii) NSSO collects data regarding prices, wages, consumption, production, agriculture etc.

iii) NSSO conducts sample surveys in the registered industrial sectors.

iv) It provides guidance to the various states & supervises the conduct of various surveys.

 3) The Indian Statistical Institute (I.S.I.):

 It is a non-Government organization of highest importance in India. It was established in 1932 & has done considerable work in developing Indian Statistical System. 

Functions-

i) To produce learned & expert Statisticians.

ii) To provide training & research facilities at various levels.

iii) To conduct large scale statistical projects.

It gives technical assistance to NSSO. Since 1960, the institute started its own courses B. Stat & M. Stat. It also confers Ph. D. & D.Sc. degrees on those who have done research work of excellent order.

It also brings out a journal named “Sankhya”.

Bureau of Economics and Statistics

The statistical system varies from state to state. In Maharashtra, the State Statistical Bureau carries out activities such as:

i) Statistical coordination

ii) State income

iii) Socio-economic survey.

The specific functions of the Bureau of Economics and Statistics, Bombay are as follows:

i) Co-ordinate statistics collected by various departments of the state government.

ii) Provide guidance regarding statistics to various departments.

iii) Collect statistical information, conduct statistical enquiries and provide statistical services.

iv) Provide liaison between State and CSO.

v) Conduct economic and statistical research.

vi) Provide statistical assistance to state planning agencies.

vii) Compile economic indicators and give state income estimates.

viii) Publish the Annual State Statistical Abstract and Quarterly District Statistical Abstracts.

International Institute for Population Sciences (IIPS)

In 1956, the United Nations, the Government of India and Sir Dorabji Tata trust jointly established the institute to serve as a regional center for teaching, training and conducting research in the area of population studies. The IIPS has helped in building a nucleus of professionals in the field of population and health in governments of various countries. During the past 50 years students from 42 different countries of Asia and the Pacific region, Africa and North America have been trained at the Institute.

Indian Agricultural Statistics Research Institute (IASRI)

IASRI is a pioneer institute of the Indian Council of Agricultural Research (ICAR) undertaking research, teaching and training in Agricultural Statistics, Computer Application and Bioinformatics. Since its inception in 1930 as a small Statistical Section of the then Imperial Council of Agricultural Research, the Institute has grown in stature and made its presence felt both nationally and internationally. ICAR-IASRI has been mainly responsible for conducting research in Agricultural Statistics and Informatics to bridge the gaps in the existing knowledge. It has also been providing education/training in Agricultural Statistics and Informatics to develop trained human resources in the country. The research and education are used for improving the quality and meeting the challenges of agricultural research in newer emerging areas.

  1.2    Population and Sample

Sampling is quite often used in our day to day practical life. For example, in a shop we assess the quality of sugar, wheat or any other commodity by taking a handful of it from the bag and then decide whether to purchase it or not. A housewife normally tastes the cooked food to find out whether it is properly cooked and contains the proper quantity of salt.

In sampling theory we first define the following terms.

i)  Population (Universe)

The group of individuals under study (the totality of the objects of study) is called the population or universe. For example, if we are going to study the economic conditions of primary teachers in Maharashtra state, then the total of all the primary teachers in Maharashtra state is the population or universe for the study. In short, the totality of the members of study is called the population. It may be a group of men, animals, trees, electric bulbs, cars etc.

ii) Sample

A finite subset of individuals in a population is called a sample and the number of individuals in a sample is called the sample size. (A part of the population is called a sample.)

iii) Census method

The method of collecting data from entire population is called the census method. If the census method is to be followed in the above example then we have to collect data about the economic conditions of every primary teacher in Maharashtra state.

iv) Sampling method

If instead of studying the entire population, a part of it is studied it is called the sampling method. Thus if the sampling method is to be used in the above example, we would study the economic conditions of a few properly selected primary teachers and then estimate the results for all the teachers. In short, if the data is collected from a selected few it is called sampling method.

Advantages of sampling method over census method

i) Time

If the population is large (generally it is), then the study of the entire population, covering both the collection and the analysis of the data, will require a lot of time. As against this, collecting and analyzing a sample largely reduces the time required. In cases where the results are required quickly, the census method is not used.

ii) Cost

It is also obvious that the study of the entire population will be very costly. Since in a sample survey only a part of the population is to be studied, the cost involved will be proportionately less. The sampling method is much more economical than the census method.

 iii) Reliability (Accuracy) 

Since in a sample only a part is to be studied, a number of precautions can be taken and a very careful investigation can be made. On the other hand, information may be lost in the census method on account of the large size of the population. Due to the small size of the sample, it is possible to check the information as well as the results during analysis. All this leads to increased reliability of the sampling method.

iv) Details of information

Again, since the size of the sample is small, every member of the sample can be studied rigorously and detailed information can be obtained about it.

v) In some cases sampling is the only possible method

In certain investigations the census method cannot be used, typically because testing destroys the unit, and only the sampling method is possible. For example: examining the blood of a human body, inspection of crackers or explosive materials, measuring the life time of electric components etc. Thus the sampling method is found to be much superior to the census method.

 Note:

1. As sample is selected to study the population, it should be such that it will represent all important characteristics of the population. Thus sample is miniature of population.

2. Sampling units should be independent.

3. It should be evenly spread over the population. It can be achieved by dividing population in homogeneous subgroups and selecting samples from each subgroup.  

[Figure: a sample drawn from a population]

Samples can be selected in two ways.

Random sampling

In this method, the sample is selected impartially. Personal or any other kind of bias in selection is avoided and a purely statistical approach is used. Since these methods are least affected by personal bias, they are widely used in practice. Random sampling is also referred to as probability sampling, since the laws of probability can be applied to it.

Non random sampling

It is a process of sampling without randomization. A non random sample is selected on the basis of judgment or convenience and not on probability considerations. The investigator selects elements in any manner that suits him. For example, he may select elements on a first come, first served basis.

 To select candidates for debate competition, deliberate selection of suitable candidates will be done. It is purposive sampling (non random). In the advertisement campaign for cosmetics, certainly a sample of youngsters will be taken. This method is unscientific and produces unreliable results.

Methods of sampling

There are various methods used to select the sample from the population. We shall study the following methods of sampling.

Simple random sampling (SRS)

             In this method, each item in the population has an equal and independent chance of being selected in the sample.

Suppose we take a sample of size n (without replacement) from a finite population of size N; then there are NCn = N!/[n!(N−n)!] possible samples. A sampling method in which each of the NCn samples has an equal chance of being selected is known as random sampling and the sample obtained by this method is called a random sample. The following methods are commonly used for selecting a simple random sample.

Lottery method

In this method, the numbers or the names of all the members of the population are written on separate pieces of paper of the same size, shape and color.  The pieces are folded in the same manner, mixed up thoroughly in a drum and the required numbers of pieces are drawn blindly.  All this ensures that, each member of the population has equal opportunity of being included in the sample. The method is used for drawing the prizes of a lottery and hence the name.

Table of random numbers

If the population is large, the lottery method is tedious to follow. An alternative is the method of random numbers. In this method, all the items are given numbers. Then a book (table) of random numbers is taken. The book is opened at random and, starting from any row and any column, numbers are read off. The items bearing these numbers are included in the sample.
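In practice a computer's pseudo-random number generator often plays the role of the random number table. The following is a minimal sketch of that idea, not a prescribed procedure; the population of 100 numbered units, the sample size of 10 and the seed are all assumptions made for illustration.

```python
import random

# Hypothetical population: units labelled 1 to 100 (not taken from the text).
population = list(range(1, 101))

# A seeded generator is used only so that the sketch gives repeatable output.
rng = random.Random(2024)

# Draw 10 distinct labels; every subset of size 10 is equally likely.
sample = rng.sample(population, 10)
print(sorted(sample))
```

The units whose labels appear in `sample` would then be included in the study, exactly as with numbers read from a printed table.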

  SRSWR and SRSWOR

If the units are selected one by one in such a way that a selected unit is replaced in the population before the next draw (selection), the method is known as simple random sampling with replacement (SRSWR). If a unit selected once is not replaced in the population before the next draw (selection), it is known as simple random sampling without replacement (SRSWOR).

For example: if a population of size N = 4 contains the items 1, 2, 3 & 4, then the possible samples of size 2 under SRSWOR and SRSWR are,

SRSWOR (Total = 06): (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)

SRSWR (Total = 16): (1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4), (4,1), (4,2), (4,3), (4,4)
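The two lists can be reproduced mechanically. The sketch below uses Python's standard `itertools` module: `combinations` gives the unordered samples without replacement, and `product` gives the ordered samples with replacement. The variable names are chosen here for illustration only.

```python
from itertools import combinations, product
from math import comb

population = [1, 2, 3, 4]  # the four units from the example
n = 2                      # sample size

# Without replacement: no unit repeats and order is ignored.
srswor = list(combinations(population, n))

# With replacement: every draw can return any unit, so order matters.
srswr = list(product(population, repeat=n))

print(len(srswor), srswor)
print(len(srswr), srswr)

# The SRSWOR count agrees with NCn, and the SRSWR count with N ** n.
print(comb(len(population), n), len(population) ** n)
```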

Stratified random sampling

When data are heterogeneous and are composed of different strata (classes or subgroups), a sample drawn by the SRS method may not represent the population properly, because it does not ensure that each class gets proper representation in the sample. Stratified sampling is a method of random sampling where researchers first divide a population into smaller subgroups, or strata, based on shared characteristics of the members and then randomly select units from each of these groups to form the final sample. These shared characteristics can include age, sex, education level or income. Suppose a population of size N is divided into k strata having sizes N1, N2, . . . , Nk, and a sample of size n is to be selected from the entire population. Then, under proportional allocation, the sample size to be selected from the ith stratum is given by ni = Ni × (n/N).
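As a rough illustration of proportional allocation, the sketch below applies ni = Ni × (n/N) to three strata. The stratum names and sizes are hypothetical and were chosen so that the results are whole numbers; in general the values have to be rounded, and the rounded sizes should still add up to n.

```python
def proportional_allocation(strata_sizes, n):
    """Return the sample size n_i = N_i * n / N for every stratum."""
    N = sum(strata_sizes.values())  # total population size
    return {name: N_i * n / N for name, N_i in strata_sizes.items()}

# Hypothetical strata (not from the text): students by faculty.
strata = {"Arts": 500, "Commerce": 300, "Science": 200}

allocation = proportional_allocation(strata, n=100)
print(allocation)
```

Here N = 1000 and n = 100, so each stratum contributes one tenth of its members to the sample.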

Systematic sampling

In the systematic sampling method, sample members are selected from a larger population starting at a random point and then at a fixed, periodic interval. This interval, called the sampling interval, is calculated by dividing the population size N by the desired sample size n. Suppose we have a population of size N whose units are labeled 1, 2, 3, . . . , N. If a sample of size n is to be selected, then under the systematic random sampling method the first unit is selected at random from the units 1 to k, where k = N/n. After selection of the first unit, every kth unit is selected successively from the population. If the label of the first randomly selected unit is ‘x’, then the units in a systematic sample of size n are x, (x+k), (x+2k), . . . , x+(n−1)k.
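The selection rule can be sketched in a few lines. This is only an illustration under the assumption that N is an exact multiple of n, so that k = N/n is a whole number; the function name, the values N = 100 and n = 10, and the seed are all hypothetical.

```python
import random

def systematic_sample(N, n, rng):
    """Select labels x, x+k, ..., x+(n-1)k with a random start x in 1..k."""
    k = N // n             # sampling interval; assumes N is a multiple of n
    x = rng.randint(1, k)  # random start, both ends inclusive
    return [x + j * k for j in range(n)]

rng = random.Random(7)     # seeded only to make the sketch repeatable
sample = systematic_sample(N=100, n=10, rng=rng)
print(sample)
```

Only the first label is random; once x is fixed, the rest of the sample is completely determined.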

1.3 Nature of Data :

Time series data:

Time series data is a sequence of data points collected or recorded at regular time intervals, typically measured in seconds, minutes, hours, days, weeks, months, quarters, or years. This type of data is used to track changes, patterns, or trends over time.

Here are some examples of time series data:

 1. Stock prices: Daily closing prices of a stock over a year.

2. Economic data: Quarterly GDP growth rate, inflation rate, or unemployment rate.

3. Energy consumption: Daily or hourly energy usage in a building or household.

4. Medical data: Daily blood pressure, heart rate, or blood glucose levels of a patient.

These examples illustrate how time series data can be used in various domains, such as finance, environmental monitoring, marketing, industrial operations, economics, transportation, energy management, healthcare, and social media analytics.

***

 1.4 Presentation of Data 


***

Tabulation

Tabulation is the next step after classification. The systematic arrangement of data into rows and columns is called tabulation. The main parts of a good statistical table are,

Table No.:
Title:
Head note:
                        | Caption (headings of columns)
Stub (headings of rows) | Body (numerical data)
Foot note:
Source note:

1. Table number:  Each table should be given a number for our reference.

2. Title: Every table should be given a suitable title, which is short, clear and precise.

3. Head note: It is placed at the top of the table and mentions the units of the data.

4. Caption: It refers to the column headings.

5. Stub: It refers to the row headings.

6. Body: It is the main part of the table and in which the data in numerical form is entered.

7. Foot note: It is written at the bottom of the table. Its purpose is to indicate the special features of data, if any.

8. Source note: Here the source of the data is mentioned, if it is known.

Characteristics (Requirements) of a good table:

1. Each and every statistical table should be given the most appropriate title.

2. The table should suit the size of the paper usually with more rows than columns.

3. The unit of measurement is given in the head note.

4. Give clear headings to rows and columns.

5. Show totals and subtotals whenever necessary.

6. Figures should be rounded to avoid unnecessary detail in the table, and a foot note to this effect should be given.

7. Do not use ditto marks. If a figure is repeated, show it each time. A ditto mark may be mistaken for ‘11’ (eleven).

8. Abbreviations should be avoided, especially in titles and headings. For example, ‘yr’ should not be used for year.

Types of tables

There are two types of tables, simple and complex tables: 

Simple table : In this table the data are classified with respect to a single characteristic and accordingly it is also called as One-Way or Simple table. 

One-Way Table: In this type of table only one characteristic is shown. This is the simplest type of table. The following is the illustration of such a table.

Table No. 1

Title: Number of employees in a bank according to age group

Age in Years | No. of employees
Below 25     |
25-35        |
35-45        |
45-55        |
Above 55     |
Total        |

 

Complex table: If the data are grouped into different classes  w.r.t. two or more characteristics or criteria simultaneously, then we get a Complex or Manifold table.

Two-Way Table: In this type of table two characteristics are considered simultaneously. The stub and caption are subdivided to include the two characteristics under consideration. One characteristic is taken in the stub and the other in the caption. The following is the illustration of such a table.

Table No. 2

Title: Number of employees in a bank in different  age groups according to sex.

Age in Years | No. of employees | Total
             | Male   | Female  |
Below 25     |        |         |
25-35        |        |         |
35-45        |        |         |
45-55        |        |         |
Above 55     |        |         |
Total        |        |         |
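To see how the cells of such a two-way table get filled, the following sketch counts a handful of raw records by age group and sex and prints the table with row and column totals. The eight employee records are invented purely for illustration; only the age groups and the column headings come from the table above.

```python
from collections import Counter

# Hypothetical raw records (age group, sex); not real data.
records = [
    ("Below 25", "Male"), ("Below 25", "Female"), ("25-35", "Male"),
    ("25-35", "Male"), ("25-35", "Female"), ("35-45", "Female"),
    ("45-55", "Male"), ("Above 55", "Female"),
]
age_groups = ["Below 25", "25-35", "35-45", "45-55", "Above 55"]  # stub
sexes = ["Male", "Female"]                                        # caption

counts = Counter(records)  # frequency of each (age group, sex) pair

print(f"{'Age in Years':<14}{'Male':>6}{'Female':>8}{'Total':>7}")
for age in age_groups:
    row = [counts[(age, s)] for s in sexes]
    print(f"{age:<14}{row[0]:>6}{row[1]:>8}{sum(row):>7}")

# Column totals form the last row of the table.
col_totals = [sum(counts[(a, s)] for a in age_groups) for s in sexes]
print(f"{'Total':<14}{col_totals[0]:>6}{col_totals[1]:>8}{sum(col_totals):>7}")
```

The grand total in the bottom right corner equals the number of records, which is a quick check that no record was lost during tabulation.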

 

 

 

Three-Way Table: If the data are classified simultaneously with respect to three characteristics, we get three way table. Thus a three-way table gives us  information regarding three interrelated characteristics of a particular phenomenon.  The following is the illustration of such a table.

Table No. 3

Title: Distribution of students according to sex, faculty and university.

Headnote:

 

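Faculty  | Pune University       | Shivaji University    | Total
         | Male | Female | Total | Male | Female | Total | Male | Female | Total
Art      |      |        |       |      |        |       |      |        |
Commerce |      |        |       |      |        |       |      |        |
Science  |      |        |       |      |        |       |      |        |
Total    |      |        |       |      |        |       |      |        |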

 

 

 

 

 

 

 

 

 

Foot note:

Source note:

The table can also be prepared as follows.

 

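University | Arts                  | Commerce              | Science               | Total
           | Male | Female | Total | Male | Female | Total | Male | Female | Total | Male | Female | Total
Pune       |      |        |       |      |        |       |      |        |       |      |        |
Shivaji    |      |        |       |      |        |       |      |        |       |      |        |
Total      |      |        |       |      |        |       |      |        |       |      |        |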

 

 

 

 

 

 

 

 

 

 

 

 




Objects of Classification and Tabulation:

 1. It condenses and simplifies complex data:

Raw data are complex and bulky. After classification and tabulation the complexities are removed and the data become simpler.

2. It enables comparison:

In a table, data are arranged according to different characteristics (properties), which makes it easy to compare different parts of the table.

3. It gives prominence to important figures.

4. It enables further analysis of the data.

 Principles of Classification (Frequency distribution):

        While preparing a frequency distribution, the following points should be taken into account.

1. The number of class intervals:  It should neither be very small nor very large. Generally it should be between 6 and 15.

2. The size of class intervals:  It should neither be very small nor very large. It should be 5, 10 or multiple of 5 and should be equal for all classes.

3. The lower limit of first class:  That is the starting point of the table will be 0, 5, 10 or multiple of 5.

4. The upper limit of the class: It depends upon the length of class and the method used. If inclusive method is used, then upper limit is different from the lower limit of the next class and if exclusive method is used, then upper limit is equal to the lower limit of the next class. Generally, use exclusive method of classification.

5. Open end classes are generally avoided.
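The principles above can be tried out on a small data set. The sketch below builds a frequency distribution by the exclusive method, where each class contains its lower limit but not its upper limit, so a value of 10 falls in 10-20 and not in 0-10. The marks, the starting point 0, the class width 10 and the six classes are all hypothetical choices that follow the guidelines in the list.

```python
def frequency_distribution(values, start, width, num_classes):
    """Count values into equal classes [lower, upper) (exclusive method)."""
    upper_end = start + width * num_classes
    freq = [0] * num_classes
    for v in values:
        if not (start <= v < upper_end):
            raise ValueError(f"{v} lies outside the range {start} to {upper_end}")
        freq[int((v - start) // width)] += 1  # index of the class holding v
    return [(start + i * width, start + (i + 1) * width, f)
            for i, f in enumerate(freq)]

# Hypothetical marks of 14 students (not from the text).
marks = [12, 25, 30, 7, 19, 41, 38, 20, 10, 49, 33, 27, 55, 52]

table = frequency_distribution(marks, start=0, width=10, num_classes=6)
for lower, upper, f in table:
    print(f"{lower}-{upper}: {f}")
```

The frequencies add up to the number of observations, which is worth checking after any classification.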

 

 ***

Diagrammatic Presentation

By the method of classification and tabulation, we condense the mass of data. Statistical data may also be represented by diagrams and graphs to make them simple, attractive, intelligible, appealing and more meaningful. The representation of data with the help of diagrams is called diagrammatic representation of data.

We shall study the following two diagrams.

1. Line or bar diagram

2. Pie diagram or Pie- chart.

 Line or bar diagram

To draw a line or bar diagram, bars are drawn whose length is proportional to the value to be represented. There are different types of line diagrams.

1. Simple bar diagram

To get a simple bar diagram, vertical bars with their heights proportional to the numerical data are drawn. The width of all the bars must be the same & the space between the bars must also be the same. The bar of maximum height is drawn first and the bar of lowest height is drawn last.

Ex: Population of a city in 1970, 1975 and 1980 is given. Draw a simple bar diagram.                      

Year | Population
1970 | 10,000
1975 | 15,000
1980 | 20,000




                                                      



Box Plots

Box plots are a graphical representation of data (easy to visualize descriptive statistics); they are also known as box-and-whisker diagrams.  A box plot provides more information about the data than does a bar graph.

 Things to know about box plots

1.    Sample is presented as a box. 

2.   The spacing between the different parts of the box helps to indicate the degree of dispersion (spread) and skewness in the data, and to identify outliers.

3.   A box plot shows a 5-number data summary: minimum, first (lower) quartile, median, third (upper)      quartile, maximum.

4.   The box is divided at the median.

5.   The length of the box is the interquartile range (IQR).

6.   The 1st quartile is the bottom line.

7.  The 3rd quartile is the top line.

 Example 


Quartiles divide frequency distributions:

1.  Q1 :1st or lower quartile: cuts off lowest 25% of the data

2.  Q2 :2nd quartile or median: 50% point, cuts data set in half

3.   Q3 :3rd quartile or upper quartile: cuts off lowest 75% of the data (or highest 25%)

Q1 is the median of the first half of the data set. Q3 is the median of the second half of the data set. The difference between the upper and lower quartiles is called the interquartile range. The interquartile range spans the middle 50% of a data set, and eliminates the influence of outliers because the highest and lowest quarters are removed.

 Example:

A biologist samples 12 red oak trees in a forest plot and counts the number of caterpillars on each tree.

The following is a list of the number of caterpillars on each tree: 34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37

Calculate the median, 1st and 3rd quartile.

Step 1: Arrange the values in ascending order: 1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57

Step 2: Calculate the median: Median = (24 + 28) ÷ 2 = 26

Step 3: Determine Q1

Lower quartile = value of the middle of the first half of the data

Q1 = the median of 1, 11, 15, 19, 20, 24 = (3rd + 4th observations) ÷ 2 = (15 + 19) ÷ 2 = 17

Step 4: Determine Q3

Upper quartile = value of the middle of the second half of the data

Q3 = the median of 28, 34, 37, 47, 50, 57 = (3rd + 4th observations) ÷ 2 = (37 + 47) ÷ 2 = 42
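The same four steps can be written as a short program. This is a minimal sketch of the "median of each half" rule used in the example. For a data set with an odd number of values it leaves the middle value out of both halves, which is one common convention and is an assumption here, since the example only covers an even count. Statistical software may use other quartile rules and give slightly different answers.

```python
def median(sorted_vals):
    """Median of a list that is already sorted."""
    m = len(sorted_vals)
    mid = m // 2
    if m % 2 == 1:
        return sorted_vals[mid]
    return (sorted_vals[mid - 1] + sorted_vals[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    data = sorted(values)                      # Step 1: ascending order
    half = len(data) // 2
    lower = data[:half]                        # first half of the data
    # Second half; the middle value is skipped when the count is odd.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

caterpillars = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
q1, q2, q3 = quartiles(caterpillars)
print(q1, q2, q3)  # 17.0 26.0 42.0
```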


 Outliers

Observations that lie more than 1.5 x IQR above Q3 or more than 1.5 x IQR below Q1 are called outliers and are distinguished by a different mark, e.g., an asterisk. In the figure below the arrows are pointing to the outliers.


Do not remove outliers from the dataset unless there is good reason to do so. A good reason could be that the outlier is a typo, for example a student records a caterpillar mass of 10 grams instead of 0.10 grams, or that the equipment used to measure the observation failed, for example the balance wrongly measures a caterpillar as 10 grams. Don’t remove outliers just because you want to make your dataset look “prettier”. Outliers can point us to interesting patterns or let us know that we may need to increase the sample size.

 

How do you determine if there are any outliers in your sample?

1.   Calculate IQR x 1.5

2.    Add this value to Q3. Are there any values greater than Q3 + (IQR x 1.5)? If so, then these values are outliers. 

3.    Subtract this value from Q1. Are there any values smaller than Q1 – (IQR x 1.5)? If so, then these values are outliers.
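Applied to the caterpillar counts, with Q1 = 17 and Q3 = 42 from the worked example, the three steps look like this. The function and variable names are illustrative only.

```python
def outlier_fences(q1, q3):
    """Return (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR)."""
    iqr = q3 - q1
    step = 1.5 * iqr           # Step 1: IQR x 1.5
    return q1 - step, q3 + step

caterpillars = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
q1, q3 = 17, 42                # quartiles from the worked example

lower_fence, upper_fence = outlier_fences(q1, q3)

# Steps 2 and 3: anything beyond either fence is an outlier.
outliers = [v for v in caterpillars if v < lower_fence or v > upper_fence]
print(lower_fence, upper_fence, outliers)  # -20.5 79.5 []
```

Here IQR = 25 and IQR x 1.5 = 37.5, so the limits are −20.5 and 79.5. Every count lies between them, so this sample has no outliers.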

Whiskers

     The two vertical lines (called whiskers) outside the box extend to the smallest and largest observations within 1.5 x IQR (inter quartile range) of the quartiles. If there are no outliers, then the whiskers extend to the min and max values.
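The whisker ends can be found in the same way, as the smallest and largest observations that are not outliers. The sketch below again uses the caterpillar counts with Q1 = 17 and Q3 = 42; the function name is hypothetical.

```python
def whisker_ends(values, q1, q3):
    """Smallest and largest observations within 1.5 x IQR of the quartiles."""
    step = 1.5 * (q3 - q1)
    inside = [v for v in values if q1 - step <= v <= q3 + step]
    return min(inside), max(inside)

caterpillars = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
low_whisker, high_whisker = whisker_ends(caterpillars, q1=17, q3=42)
print(low_whisker, high_whisker)  # 1 57
```

Because this sample has no outliers, the whiskers reach the minimum (1) and the maximum (57) of the data.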

 

 








