B. Sc. Part I Semester I: 1.1 Introduction to Statistics: Nature of Data, Sampling, Classification and Tabulation
B. Sc. Part 1 Semester 1
DSC-7A-STATISTICS-1 (DESCRIPTIVE STATISTICS-1)
Theory: 30 hrs. Marks 50 (Credits: 02)
‘Statistical thinking will one day be as necessary for effective citizenship as the ability to read and write’ - H.G. Wells
1.1 Introduction to Statistics:
Meaning of the word Statistics
Statistics is as old as language. It has been in use from the time when man began to count & measure. The word ‘Statistics’ has several meanings.
Plural: figures, numbers, data (for example, the statistics of a batsman).
Singular: methods, science, the subject itself.
In ancient days kings used to maintain records of manpower, wealth, population, income, taxes, military strength etc. The word ‘Statistics’ seems to be derived from the Latin word ‘status’, the Italian word ‘statista’, the German word ‘Statistik’ and the French word ‘statistique’, each of which refers to a political state.
Some definitions of statistics are given below.
Statistics as numerical data
It is the science of kings. It is the science of numbers. It is the body of knowledge. It is the science of applied mathematics. It is the science of probability. It is the science of gamblers. It is the science of sampling.
Statistics as statistical methods
It is the science of counting. It is the science of averages.
The best of the definitions is given by Croxton & Cowden: "Statistics may be defined as the collection, presentation, analysis & interpretation of numerical data."
Importance of statistics
1. Statistical methods make it possible to condense data. Apart from summarization, they facilitate several other functions.
2. Statistical methods give tools of comparison.
3. Estimation and prediction are also possible using statistical tools.
4. One can get an idea about the shape, spread and symmetry of the data.
5. The interrelation between two or more variables can be measured using statistical techniques.
6. Statistical methods help in planning, controlling, decision making etc.
7. The use of statistical methods is important because a considerable amount of time, money and manpower can be saved.
8. Uncertainty can be reduced to get reliable results.
9. Statistical methods give systematic methods of data collection and investigation.
Various fields where statistics is used (Scope of Statistics):
The application of statistical techniques is widespread. From a common man in his everyday life to an expert in his special field, everybody uses statistical methods.
i) Statistics & agriculture- The success of our ‘Green Revolution’ is due to the use of statistical techniques. (Which type of sugarcane yields the maximum amount of sugar? Which fertilizer is the most suitable for a particular soil?)
ii) Statistics & medical sciences- Statistical methods are used in medical & pharmaceutical research (effectiveness of a new drug, cause or causes of a disease). ‘Smoking causes cancer’ is a statistical result.
iii) Statistics & economics- The conclusions of economists in production, taxation, import, export, etc. are all based upon statistical data & techniques.
iv) Statistics in industry (Engineering)- In industry, statistics is used to control the quality of the product and to design & test new products before they are marketed.
v) Statistics in biological sciences- Analysis of agricultural experiments makes use of the statistical method known as design of experiments. Regression analysis was used by Sir F. Galton in the field of genetics. Pearson pioneered the study of correlation analysis. Various methods of cultivation and irrigation are tested and compared using statistical tests. Estimation of the number of trees in a jungle, forest density, the number of animals in a jungle, fish in a lake etc. can be done using various statistical techniques. Demography uses statistical methods for forecasting population and for measuring death rates, birth rates and growth rates.
vi) Use of statistics in social sciences- In social sciences we need to test association between two variables such as education and criminality, education and marriage adjustment score, sex and education, richness and criminality etc. Statistical methods are used for this.
vii) Use of statistics in business world- i) Use in economics. ii) Use in commerce: statistics provides methods of forecasting production, demand & supply. iii) Business management: statistical methods help the manager to analyze the problem & take decisions.
viii) Statistics & other sciences- The methods of statistics are also helpful in Geology, Biology, Psychology, Sociology, and Meteorology.
It is difficult, in fact, to find any scientific activity where statistical methods are not serviceable.
Limitations of Statistics
i) Statistics does not study individuals, as it deals with aggregates.
ii) It does not study qualitative phenomena.
iii) Its results are true only on an average.
iv) Its results are subject to bias.
v) It can be misused.
Statistical Organizations in India and their functions:
1. Central Statistical Organization (CSO):
It was established by the Central Government in May 1951 and it works under the Cabinet Secretariat. It is the main coordinating agency of the various statistical organizations at the center as well as in the states of our country. It looks after various activities such as collection & compilation of data. CSO brings out a number of publications such as:
i) Monthly Abstract of Statistics
ii) Monthly Statistics of the Production of Selected Industries of India
iii) Statistical Pocket Book of the Indian Union
iv) Statistical Abstract of India
v) Annual Survey of Industries
2) National Sample Survey (N.S.S.)
NSS was set up in 1950 under the guidance of P.C. Mahalanobis and it was reorganized in 1970 under the name National Sample Survey Organization (N.S.S.O.). Its main functions are:
i) Collection of data for the estimation of national income, for the activities of the Planning Commission and of various Ministries, and collection of socio-economic and demographic data.
ii) NSSO collects data regarding prices, wages, consumption, production, agriculture etc.
iii) NSSO conducts sample surveys in the registered industrial sectors.
iv) It provides guidance to the various states & supervises the conduct of various surveys.
3) The Indian Statistical Institute (I.S.I.):
It is a non-Government organization of the highest importance in India. It was established in 1932 & has done considerable work in developing the Indian Statistical System.
Functions-
i) To produce learned & expert statisticians.
ii) To provide training & research facilities at various levels.
iii) To conduct large scale statistical projects.
It gives technical assistance to NSSO. Since 1960, the institute has run its own courses, B. Stat & M. Stat. It also confers Ph. D. & D.Sc. degrees on those who have done research work of excellent order.
It also brings out a journal named “Sankhya”.
Bureau of Economics and Statistics
The statistical system in states varies from state to state. In Maharashtra, the State Statistical Bureau functions for various activities such as,
i) Statistical coordination
ii) State income
iii) Socio-economic survey.
The specific functions of the Bureau of Economics and Statistics, Bombay are described as follows:
i) Co-ordinate statistics collected by various departments of the state government.
ii) Provide guidance regarding statistics to various departments.
iii) Collect statistical information, conduct statistical enquiries and provide statistical services.
iv) Provide liaison between the State and CSO.
v) Conduct economic and statistical research.
vi) Provide statistical assistance to state planning agencies.
vii) Compile economic indicators and give state income estimates.
viii) Publish the Annual State Statistical Abstract and Quarterly District Statistical Abstracts.
International Institute for Population Sciences (IIPS)
In 1956, the United Nations, the Government of India and the Sir Dorabji Tata Trust jointly established the institute to serve as a regional center for teaching, training and conducting research in the area of population studies. The IIPS has helped in building a nucleus of professionals in the field of population and health in the governments of various countries. During the past 50 years students from 42 different countries of Asia and the Pacific region, Africa and North America have been trained at the Institute.
Indian Agricultural Statistics Research Institute (IASRI)
IASRI is a pioneer institute of the Indian Council of Agricultural Research (ICAR) undertaking research, teaching and training in Agricultural Statistics, Computer Application and Bioinformatics. Ever since its inception in 1930 as a small Statistical Section of the then Imperial Council of Agricultural Research, the Institute has grown in stature and made its presence felt both nationally and internationally. ICAR-IASRI has been mainly responsible for conducting research in Agricultural Statistics and Informatics to bridge the gaps in the existing knowledge. It has also been providing education and training in Agricultural Statistics and Informatics to develop trained human resources in the country. The research and education are used for improving the quality of agricultural research and meeting its challenges in newer emerging areas.
Sampling is quite often used in our day to day practical life. For example, in a shop we assess the quality of sugar, wheat or any other commodity by taking a handful of it from the bag and then decide whether to purchase it or not. A housewife normally tests the cooked products to find if they are properly cooked and contain the proper quantity of salt.
In sampling theory we first define the following terms.
i) Population (Universe)
The group of individuals under study is called the population or universe (the totality of the objects of study). For example, if we are going to study the economic conditions of primary teachers in Maharashtra state, then the total of all the primary teachers in Maharashtra state is the population or universe for the study. In short, the totality of the members of study is called the population. It may be a group of men, animals, trees, electric bulbs, cars etc.
ii) Sample
A finite subset of individuals in a population is called a sample and the number of individuals in a sample is called the sample size. (A part of the population is called a sample.)
iii) Census method
The method of collecting data from the entire population is called the census method. If the census method is to be followed in the above example, then we have to collect data about the economic conditions of every primary teacher in Maharashtra state.
iv) Sampling method
If, instead of studying the entire population, a part of it is studied, it is called the sampling method. Thus if the sampling method is to be used in the above example, we would study the economic conditions of a few properly selected primary teachers and then estimate the results for all the teachers. In short, if the data are collected from a selected few, it is called the sampling method.
Advantages of sampling method over census method
i) Time
If the population is large (generally it is), then the study of the entire population will require a lot of time, not only for collecting but also for analyzing the data. As against this, collecting and analyzing a sample will largely reduce the time required. In some cases where the results are required quickly, the census method is not used.
ii) Cost
It is also obvious that the study of the entire population will be very costly. Since in a sample survey only a part of it is to be studied, the cost involved will be proportionately less. The sampling method is much more economical than the census method.
iii) Reliability (Accuracy)
Since in a sample only a part is to be studied, a number of precautions can be taken and a very careful investigation can be made. On the other hand, information may be lost in the census method on account of the large size of the population. Due to the small size of the sample, it is possible to check the information and also to check the results during analysis. All this leads to increased reliability of the sampling method.
iv) Details of information
Again, since the size of the sample is small, every member of the sample can be studied rigorously and detailed information can be obtained about it.
v) In some cases sampling is the only possible method
In certain investigations the census method cannot be used, because testing destroys the unit or cannot be applied to the whole. For example: examining the blood of a human body, inspection of crackers or explosive materials, measuring the life time of electric components etc. In such cases sampling is the only possible method. Thus the sampling method is found to be much superior to the census method.
Note:
1. As the sample is selected to study the population, it should represent all the important characteristics of the population. Thus a sample is a miniature of the population.
2. Sampling units should be independent.
3. The sample should be evenly spread over the population. This can be achieved by dividing the population into homogeneous subgroups and selecting samples from each subgroup.
Samples can be selected in two ways.
Random sampling
In this method, the sample is selected impartially. Personal or any other kind of bias in selection is avoided and a purely statistical approach is used. These methods are least affected by personal bias, so they are widely used in practice. Random sampling is also referred to as probabilistic sampling, since the laws of probability can be applied to it.
Non random sampling
It is a process of sampling without randomization. A non random sample is selected on the basis of judgment or convenience and not under probability considerations. The investigator selects elements in any manner suitable to him. For example, he may select elements on a first come, first served basis.
To select candidates for a debate competition, deliberate selection of suitable candidates will be done. This is purposive (non random) sampling. In an advertisement campaign for cosmetics, a sample of youngsters will certainly be taken. This method is unscientific and can produce unreliable results.
Methods of sampling
There are various methods used to select the sample from the population. We shall study the following methods of sampling.
Simple random sampling (SRS)
In this method, each item in the population has an equal and independent chance of being selected in the sample. Suppose we take a sample of size n from a finite population of size N; then there are NCn = N! / (n! (N - n)!) possible samples. A sampling method in which each of the NCn samples has an equal chance of being selected is known as simple random sampling and the sample obtained by this method is called a simple random sample. The following methods are commonly used for selecting a simple random sample.
Lottery method
In this method, the numbers or the names of all the members of the population are written on separate pieces of paper of the same size, shape and color. The pieces are folded in the same manner, mixed up thoroughly in a drum and the required number of pieces is drawn blindly. All this ensures that each member of the population has an equal opportunity of being included in the sample. The method is used for drawing the prizes of a lottery and hence the name.
Table of random numbers
If the population is large, the lottery method is tedious to follow. An alternative is the method of random numbers. In this method, all the items are given numbers. Then a book of random numbers is taken. The book is opened at random and, starting from any row and any column, numbers are read off. The items bearing these numbers are included in the sample.
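On a computer, a pseudo-random number generator can play the role of the drum or the book of random numbers. The following is a minimal sketch, assuming the population units are numbered 1 to 100 and a sample of 10 is wanted; the function name `simple_random_sample` and the seed value are illustrative choices, not part of the notes.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every unit has the same chance of selection."""
    rng = random.Random(seed)           # seed only makes the draw repeatable
    return rng.sample(list(population), n)

# Assumed population: units numbered 1..100 (like numbered slips in a drum)
units = range(1, 101)
sample = simple_random_sample(units, 10, seed=1)
print(sample)
```

Here `random.sample` draws without replacement, so no unit can appear twice in the sample.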
SRSWR and SRSWOR
If the units are selected one by one in such a way that a unit selected is replaced back in the population before the next draw (selection), it is known as simple random sampling with replacement (SRSWR). If a unit selected once is not replaced back in the population before the next draw, it is known as simple random sampling without replacement (SRSWOR).
For example, if a population of size N = 4 contains the items 1, 2, 3 & 4, then the possible samples of size 2 under SRSWOR and SRSWR are:
SRSWOR (Total = 06):
(1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
SRSWR (Total = 16):
(1,1), (1,2), (1,3), (1,4),
(2,1), (2,2), (2,3), (2,4),
(3,1), (3,2), (3,3), (3,4),
(4,1), (4,2), (4,3), (4,4)
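The two lists above can be checked by enumeration. This is a small sketch using Python's standard `itertools` module: `combinations` gives the unordered samples without replacement, and `product` gives the ordered samples with replacement, which matches the way the samples are listed in the example.

```python
from itertools import combinations, product
from math import comb

population = [1, 2, 3, 4]   # N = 4
n = 2                       # sample size

# SRSWOR: a unit cannot repeat and order is ignored -> NCn samples
srswor = list(combinations(population, n))

# SRSWR: a unit may repeat and order of draws is kept -> N**n samples
srswr = list(product(population, repeat=n))

print(len(srswor), srswor)
print(len(srswr), srswr)
print(comb(len(population), n), len(population) ** n)
```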
Stratified random sampling
When data are heterogeneous and are composed of different strata (classes or subgroups), a sample selected by the SRS method does not give a proper representation of the population, because it does not ensure that each class is properly represented in the sample selected. Stratified sampling is a method of random sampling where researchers first divide a population into smaller subgroups, or strata, based on shared characteristics of the members and then randomly select among these groups to form the final sample. These shared characteristics can include gender, age, education level or income. Suppose a population of size N is divided into k strata having sizes N1, N2, . . . , Nk, and a sample of size n is to be selected from the entire population. Then under proportional allocation, the sample size to be selected from the ith stratum is given by ni = Ni × (n/N).
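The proportional allocation rule ni = Ni × (n/N) can be sketched as follows. The stratum sizes 500, 300 and 200 and the sample size 100 are made-up values for illustration, and the function name is hypothetical.

```python
def proportional_allocation(stratum_sizes, n):
    """Return the sample size n_i = N_i * n / N for each stratum."""
    N = sum(stratum_sizes)                 # total population size
    # round() is used because sample sizes must be whole numbers;
    # in general the rounded values may not add up to exactly n
    return [round(N_i * n / N) for N_i in stratum_sizes]

# Assumed example: three strata of sizes 500, 300 and 200, sample of 100
allocation = proportional_allocation([500, 300, 200], n=100)
print(allocation)   # [50, 30, 20]
```

Each stratum contributes to the sample in the same proportion as it contributes to the population, here 50%, 30% and 20%.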
Systematic sampling
In the systematic sampling method, sample members from a larger population are selected according to a random starting point but with a fixed, periodic interval. This interval, called the sampling interval, is calculated by dividing the population size N by the desired sample size n. Suppose we have a population of size N whose units are labeled 1, 2, 3, . . . , N. If a sample of size n is to be selected from the entire population, then under the systematic random sampling method the first unit is selected at random from 1 to k = N/n, and every kth unit thereafter is selected.
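A short sketch of systematic selection, assuming N is an exact multiple of n so that the sampling interval k = N/n is a whole number, and assuming (as the fixed periodic interval implies) that every kth unit after the random start is taken. The function name and the values N = 100, n = 10 are illustrative only.

```python
import random

def systematic_sample(N, n, seed=None):
    """Pick a random start in 1..k, then take every k-th unit label."""
    k = N // n                          # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)           # random start, both ends included
    return [start + j * k for j in range(n)]

sample = systematic_sample(N=100, n=10, seed=3)
print(sample)
```

Only the first unit is random; once it is fixed, the remaining n - 1 units are determined by the interval.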
Time series data:
Time series data is a sequence of data points collected or recorded at regular time intervals, typically measured in seconds, minutes, hours, days, weeks, months, quarters, or years. This type of data is used to track changes, patterns, or trends over time.
Here are some examples of time series data:
1. Stock prices: Daily closing prices of a stock over a year.
2. Economic data: Quarterly GDP growth rate, inflation rate, or unemployment rate.
3. Energy consumption: Daily or hourly energy usage in a building or household.
4. Medical data: Daily blood pressure, heart rate, or blood glucose levels of a patient.
These examples illustrate how time series data can be used in various domains, such as finance, environmental monitoring, marketing, industrial operations, economics, transportation, energy management, healthcare, and social media analytics.
***
1.4 Presentation of Data
***
Tabulation
Tabulation is the next step after classification. The systematic arrangement of data into rows and columns is called tabulation. The main parts of a good statistical table are,
Table No. :
Title:
Head note:
|  | Caption (Headings of columns) |
| --- | --- |
| Stub (Headings of rows) | Body (Numerical data) |
Foot note:
Source note:
1. Table number: Each table should be given a number for reference.
2. Title: Every table should be given a suitable title, which is short, clear and precise.
3. Head note: It is at the top of the table and mentions the units of the data.
4. Caption: It refers to the column headings.
5. Stub: It refers to the row headings.
6. Body: It is the main part of the table, in which the data in numerical form are entered.
7. Foot note: It is written at the bottom of the table. Its purpose is to indicate the special features of the data, if any.
8. Source note: Here the source of the data is mentioned, if it is known.
Characteristics (Requirements) of a good table:
1. Each and every statistical table should be given the most appropriate title.
2. The table should suit the size of the paper, usually with more rows than columns.
3. The unit of measurement is given in the head note.
4. Give clear headings to rows and columns.
5. Show totals and subtotals whenever necessary.
6. Figures should be rounded to avoid unnecessary details in the table and a foot note to this effect should be given.
7. Do not use ditto marks. If a figure is repeated, show it each time. A ditto mark may be mistaken for ‘11’ (eleven).
8. Abbreviations should be avoided, especially in titles and headings. For example, ‘yr’ should not be used for year.
Types of tables
There are two types of tables, simple and complex tables:
Simple table : In this table the data are classified with respect to a single characteristic and accordingly it is also called as One-Way or Simple table.
One-Way Table: In this type of table only one characteristic is shown. This is the simplest type of table. The following is the illustration of such a table.
Table No. 1
Title: Number of employees in a bank according to age group
| Age in Years | No. of employees |
| --- | --- |
| Below 25 |  |
| 25-35 |  |
| 35-45 |  |
| 45-55 |  |
| Above 55 |  |
| Total |  |
Complex table: If the data are grouped into different classes w.r.t. two or more characteristics or criteria simultaneously, then we get a Complex or Manifold table.
Two-Way Table: In this type of table two characteristics are considered simultaneously. Stub and caption are subdivided to include the two characteristics under consideration. One characteristic is taken in the stub and the other in the caption. The following is the illustration of such a table.
Table No. 2
Title: Number of employees in a bank in different age groups according to sex.
| Age in Years | No. of employees: Male | No. of employees: Female | Total |
| --- | --- | --- | --- |
| Below 25 |  |  |  |
| 25-35 |  |  |  |
| 35-45 |  |  |  |
| 45-55 |  |  |  |
| Above 55 |  |  |  |
| Total |  |  |  |
Three-Way Table: If the data are classified simultaneously with respect to three characteristics, we get three way table. Thus a three-way table gives us information regarding three interrelated characteristics of a particular phenomenon. The following is the illustration of such a table.
Table No. 3
Title: Distribution of students according to sex, faculty and university.
Head note:
| Faculty | Pune University: Male | Pune University: Female | Pune University: Total | Shivaji University: Male | Shivaji University: Female | Shivaji University: Total | Total: Male | Total: Female | Total: Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arts |  |  |  |  |  |  |  |  |  |
| Commerce |  |  |  |  |  |  |  |  |  |
| Science |  |  |  |  |  |  |  |  |  |
| Total |  |  |  |  |  |  |  |  |  |
Foot note:
Source note:
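The table can also be prepared as follows.
| University | Arts: Male | Arts: Female | Arts: Total | Commerce: Male | Commerce: Female | Commerce: Total | Science: Male | Science: Female | Science: Total | Total: Male | Total: Female | Total: Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pune |  |  |  |  |  |  |  |  |  |  |  |  |
| Shivaji |  |  |  |  |  |  |  |  |  |  |  |  |
| Total |  |  |  |  |  |  |  |  |  |  |  |  |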
Objects of Classification and Tabulation:
1. It condenses and simplifies complex data: Raw data are complex and bulky. After classification and tabulation the complexities are removed and the data become simplified.
2. It enables comparison: In a table, data are arranged according to different characteristics (properties), so it is easy to compare different parts of the table.
3. It gives prominence to important figures.
4. It enables further analysis of the data.
Principles of Classification (Frequency distribution):
While preparing a frequency distribution, the following points should be taken into account.
1. The number of class intervals: It should be neither very small nor very large. Generally it should be between 6 and 15.
2. The size of class intervals: It should be neither very small nor very large. It should preferably be 5, 10 or a multiple of 5 and should be equal for all classes.
3. The lower limit of the first class: The starting point of the table should preferably be 0, 5, 10 or a multiple of 5.
4. The upper limit of the class: It depends upon the length of the class and the method used. If the inclusive method is used, the upper limit of a class is different from the lower limit of the next class; if the exclusive method is used, the upper limit of a class is equal to the lower limit of the next class. Generally, the exclusive method of classification is used.
5. Open end classes are generally avoided.
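The principles above can be tried out on a small data set. This is a sketch only: the twelve marks are invented for illustration, the classes start at 0 with an equal width of 10, and the exclusive method is assumed, so a value equal to an upper limit (such as 20) is counted in the next class (20-30).

```python
def frequency_distribution(values, lower, width, num_classes):
    """Count values in equal-width classes using the exclusive method."""
    counts = [0] * num_classes
    for v in values:
        idx = int((v - lower) // width)     # class number, starting from 0
        if 0 <= idx < num_classes:          # ignore values outside the table
            counts[idx] += 1
    return [(lower + i * width, lower + (i + 1) * width, counts[i])
            for i in range(num_classes)]

# Hypothetical marks of 12 students
marks = [5, 12, 20, 27, 33, 38, 40, 41, 49, 50, 58, 59]
table = frequency_distribution(marks, lower=0, width=10, num_classes=6)
for low, high, freq in table:
    print(f"{low}-{high}: {freq}")
```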
***
Diagrammatic Presentation
By the method of classification and tabulation, we condense the mass of data. Statistical data may also be represented by diagrams and graphs to make them simple, attractive, intelligible, appealing and more meaningful. The representation of data with the help of diagrams is called diagrammatic representation of data.
We shall study the following two diagrams.
1. Line or bar diagram
2. Pie diagram or pie chart.
Line or bar diagram
To draw a line or bar diagram, bars are drawn whose length is proportional to the value to be represented. There are different types of bar diagram.
1. Simple bar diagram
To get a simple bar diagram, vertical bars with their heights proportional to the numerical data are drawn. The width of all the bars must be the same & the space between the bars must also be the same. The bar of maximum height is drawn first and the bar of lowest height is drawn last.
Ex: The population of a city in 1970, 1975 and 1980 is given. Draw a simple bar diagram.
| Year | Population |
| --- | --- |
| 1970 | 10,000 |
| 1975 | 15,000 |
| 1980 | 20,000 |
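The idea that bar length is proportional to the value can be sketched with a plain text chart. The bars here are drawn horizontally, and the scale of one `#` per 1,000 people is an arbitrary choice made for this illustration.

```python
# Population figures from the example above
population = {1970: 10_000, 1975: 15_000, 1980: 20_000}

unit = 1_000   # assumed scale: one '#' stands for 1,000 people

# Bar length is proportional to the value being represented
bars = {year: "#" * (value // unit) for year, value in population.items()}

for year, bar in bars.items():
    print(f"{year} | {bar} {population[year]:,}")
```

The 1980 bar comes out twice as long as the 1970 bar, just as the population is twice as large.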
Box Plots
Box plots are a graphical representation of data (easy to visualize descriptive statistics); they are also known as box-and-whisker diagrams. A box plot provides more information about the data than does a bar graph.
Things to know about box plots
1. Sample is presented as a box.
2. The spacing between the different parts of the box helps to indicate the degree of dispersion (spread) and skewness in the data, and to identify outliers.
3. A box plot shows a 5-number data summary: minimum, first (lower) quartile, median, third (upper) quartile, maximum.
4. The box is divided at the median.
5. The length of the box is the interquartile
range (IQR).
6. The 1st quartile is the bottom line.
7. The 3rd quartile is the top line.
Example
Quartiles divide frequency distributions:
1. Q1: 1st or lower quartile: cuts off the lowest 25% of the data
2. Q2: 2nd quartile or median: the 50% point, cuts the data set in half
3. Q3: 3rd or upper quartile: cuts off the lowest 75% of the data (or the highest 25%)
Q1 is the median of the first half of the data set. Q3 is the median of the second half of the data set. The difference between the upper and lower quartiles is called the interquartile range. The interquartile range spans the middle 50% of a data set, and is not influenced by outliers because the highest and lowest quarters are removed.
A biologist samples 12 red oak trees in a forest plot and counts the number of caterpillars on each tree.
The following is a list of the number of caterpillars on each tree: 34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37
Calculate the median, 1st and 3rd quartile.
Step 1: Arrange the values in ascending order
1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57
Step 2: Calculate the median
Median = (6th + 7th observations) ÷ 2 = (24 + 28) ÷ 2 = 26
Step 3: Determine Q1
Lower quartile = value of the middle of the first half of the data
Q1 = the median of 1, 11, 15, 19, 20, 24
= (3rd + 4th observations) ÷ 2 = (15 + 19) ÷ 2 = 17
Step 4: Determine Q3
Upper quartile = value of the middle of the second half of the data
Q3 = the median of 28, 34, 37, 47, 50, 57
= (3rd + 4th observations) ÷ 2 = (37 + 47) ÷ 2 = 42
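The steps above can be reproduced with a short sketch that follows the same "median of each half" rule. The function names are illustrative. For an odd number of observations the sketch leaves the middle value out of both halves, which is an assumption, since the notes only show an even-sized example. Library routines such as `statistics.quantiles` use different interpolation rules and may give slightly different quartiles.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # Assumption: with an odd count the middle value belongs to neither half
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

caterpillars = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
q1, q2, q3 = quartiles(caterpillars)
print(q1, q2, q3)   # 17.0 26.0 42.0
```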
Outliers
Do not remove outliers from the dataset unless there is good reason to do so. A good reason could be that the outlier is a typo, for example a student records a caterpillar mass of 10 grams instead of 0.10 grams. Or if the equipment used to measure the observation failed, for example if the balance measures a caterpillar as 10 grams. Don’t remove outliers just because you want to make your dataset look “prettier”. Outliers can point us to interesting patterns or let us know that we may need to increase the sample size.
How do you determine if there are any outliers in your sample?
1. Calculate IQR x 1.5
2. Add this value to Q3. Are there any values greater than Q3 + (IQR x 1.5)? If so, then these values are outliers.
3. Subtract this value from Q1. Are there any values smaller than Q1 – (IQR x 1.5)? If so, then these values are outliers.
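Applied to the caterpillar data, the three steps can be sketched as below. The quartiles Q1 = 17 and Q3 = 42 are taken from the worked example; the function name `find_outliers` is hypothetical.

```python
def find_outliers(values, q1, q3):
    """Return the lower fence, upper fence and any values outside them."""
    iqr = q3 - q1                   # interquartile range
    step = 1.5 * iqr                # step 1: IQR x 1.5
    upper_fence = q3 + step         # step 2: Q3 + (IQR x 1.5)
    lower_fence = q1 - step         # step 3: Q1 - (IQR x 1.5)
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

caterpillars = [34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37]
low, high, outliers = find_outliers(caterpillars, q1=17, q3=42)
print(low, high, outliers)   # -20.5 79.5 []
```

With IQR = 42 - 17 = 25, the limits are -20.5 and 79.5. Every count lies between them, so this sample has no outliers.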
Whiskers
The two vertical lines (called whiskers) outside the box extend to the smallest and largest observations within 1.5 x IQR (inter quartile range) of the quartiles. If there are no outliers, then the whiskers extend to the min and max values.