Statistical sampling. Interval estimation of the general share

It is often necessary to analyze a specific social phenomenon and obtain information about it. Such tasks arise constantly in statistics and statistical research. Examining a social phenomenon in full is frequently impossible. For example, how can we find out the opinion of the whole population, or of all residents of a certain city, on some issue? Asking absolutely everyone is extremely laborious and, in practice, almost impossible. In such cases we need a sample. This is the concept on which almost all research and analysis is based.

What is sampling

When analyzing a specific social phenomenon, we need to obtain information about it. In almost any study, not every unit of the population under study is actually examined; only a certain part of the whole set is taken into account. This process is sampling: only certain units from the set are examined.

Of course, a lot depends on the type of sample, but there are also basic rules. The main one is that selection from the population must be completely random: units must not be chosen according to any criterion. Roughly speaking, if a sample is to be drawn from the population of a certain city and only men are selected, the study will contain an error, because the units were selected by gender rather than by chance. Almost all sampling methods are based on this rule.

Sampling rules

For the sample to reflect the basic qualities of the entire phenomenon, it must be built according to specific rules, with the main attention paid to the following concepts:

  • sample (sample population);
  • general population;
  • representativeness;
  • error of representativeness;
  • unit of the population;
  • sampling methods.

The features of sampling are as follows:

  1. All results are based on mathematical laws and rules; that is, if the research and the calculations are carried out correctly, the results will not be distorted by subjective factors.
  2. It yields a result much faster and with fewer resources, because only a part of the entire array of events is studied rather than all of it.
  3. It can be used to study a wide range of objects: from specific characteristics, such as the age or gender of the group of interest, to public opinion or the level of material security of the population.

Selective observation

Selective (sample) observation is a statistical observation in which not the entire population under study is examined, but only a certain part of it, selected in a certain way, and the results for this part are extended to the entire population. This part is called the sample population. It is the only way to study a large array of research objects.

Selective observation is justified only when examining every unit is impractical. For example, a study of the ratio of men to women in the world has to rely on sample observation: for obvious reasons, it is impossible to count every inhabitant of our planet.

If the same question is asked not about all inhabitants of the earth but about a particular class, say class 2 "A" in a particular school, we can do without selective observation. It is quite possible to examine the entire population: count the boys and the girls in the class, and that gives the ratio.

Sample and general population

In fact, everything is not as complicated as it sounds. Any study deals with two sets: the general population and the sample population. The general population comprises all units of the object of study; the sample comprises those units of the general population that were selected for examination. If everything is done correctly, the selected part is a reduced model of the entire (general) population.

The general population comes in two varieties, definite and indefinite, depending on whether the total number of its units is known. If the population is definite, sampling is easier, because we know what percentage of the total number of units will be sampled.

This point matters in research. Suppose we want to investigate the percentage of poor-quality confectionery products at a particular plant, and the population is definite: it is known for certain that the enterprise produces 1000 confectionery products a year. If we draw a random sample of 100 products from this thousand and send them for examination, the error will be small. Roughly speaking, 10% of all products were examined, and from the results we can, taking into account the error of representativeness, judge the quality of all products.

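And if we draw 100 confectionery products from an indefinite general population that actually contains, say, 1 million units, the sample covers only 0.01% of the output, and we cannot even say what share of the products was examined. The error of representativeness depends mainly on the number of units in the sample, so the result does not automatically become worthless; but without knowing the size of the population we cannot account for the share already examined when estimating the error, and it is harder to guarantee that every unit had an equal chance of selection. Do you feel the difference? Therefore, knowing the size of the general population is in most cases very important and affects the result of the study.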
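
The confectionery example can be made concrete with a small calculation. The sketch below uses the standard formula for the mean error of a share under non-repeated selection, where a known population size N enters through the factor (1 - n/N). The function name and the 10% share of poor-quality products are assumptions made for illustration; they do not come from the text.

```python
import math

def standard_error_of_share(w, n, N=None):
    """Mean sampling error of a share w estimated from n units.

    If the population size N is known, the finite population
    correction (1 - n/N) for non-repeated selection is applied.
    """
    variance = w * (1 - w) / n
    if N is not None:
        variance *= 1 - n / N
    return math.sqrt(variance)

w = 0.1  # assumed share of poor-quality products found in the sample

se_small = standard_error_of_share(w, n=100, N=1_000)      # definite, 1000 units
se_large = standard_error_of_share(w, n=100, N=1_000_000)  # definite, 1 million units
se_unknown = standard_error_of_share(w, n=100)             # indefinite population

print(se_small, se_large, se_unknown)
```

Under these assumptions the error is smallest when the 100 units make up a tenth of a known population, and the correction has almost no effect once the population is very large.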

Representativeness of the population

So, now one of the most important questions: what should the sample be? This is the key point of the study. At this stage we need to calculate the sample size and select units from the general population. The sample has been selected correctly if it retains the relevant features and characteristics of the general population. This property is called representativeness.

In other words, if the selected part retains the same tendencies and characteristics as the entire population under study, the sample is called representative. But not every sample turns out to be representative, and for some objects of research a sample simply cannot be representative. This is where the concept of the error of representativeness arises; it is discussed in more detail a little later.

Sampling types:

  • properly random;
  • mechanical;
  • typical;
  • serial;
  • combined.

Properly random sampling consists in selecting units from the general population by lot or at random, without any systematic elements. Before carrying out properly random selection, however, it is necessary to make sure that all units of the general population, without exception, have equal chances of being included in the sample, and that the list or lists contain no omissions or ignored units. Clear boundaries of the population should also be established, so that the inclusion or exclusion of individual units is unambiguous. For example, when surveying students, it is necessary to specify whether persons on academic leave, students of non-state universities, military schools, etc. are included; likewise, it is important to determine whether the population will include trade pavilions, commercial tents and other similar objects. Properly random selection can be either repeated or non-repeated. In non-repeated selection by drawing lots, the drawn lots are not returned to the original set and do not take part in further selection. When tables of random numbers are used, non-repeated selection is achieved by skipping numbers that have already appeared in the selected column or columns.
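
The difference between repeated and non-repeated selection can be illustrated with a pseudo-random generator in place of lots or a table of random numbers. This is only a sketch: the population of 1000 numbered units, the sample size of 100 and the seed are hypothetical values chosen for the example.

```python
import random

rng = random.Random(42)            # fixed seed so the draw can be reproduced
population = list(range(1, 1001))  # hypothetical list of 1000 numbered units

# Non-repeated selection: a drawn unit is not returned to the population,
# so no unit can appear in the sample twice.
non_repeated = rng.sample(population, k=100)

# Repeated selection: each drawn unit is returned before the next draw,
# so the same unit may be selected more than once.
repeated = rng.choices(population, k=100)

print(len(set(non_repeated)), len(set(repeated)))
```

The first printed number is always 100, while the second may be smaller whenever some unit was drawn more than once.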

Mechanical sampling is applied when the general population is ordered in some way, i.e. there is a certain sequence in the arrangement of units (personnel numbers of employees, voter lists, telephone numbers of respondents, numbers of houses and apartments, etc.).

In mechanical selection, the general population can be ranked by the value of the characteristic under study or of a characteristic correlated with it, which increases the representativeness of the sample. In this case, however, the danger of a systematic error increases: the characteristic is underestimated if the first value of each interval is recorded, and overestimated if the last value is recorded. Therefore, it is advisable to start selection from the middle of the first interval.

Typical selection. This method is used when all units of the general population can be divided into several typical groups. When surveying the population, such groups can be, for example, districts or social, age or educational groups; when surveying enterprises, an industry or sub-industry, a form of ownership, etc. Typical selection involves sampling units from each typical group by random or mechanical selection. Since the sample necessarily includes representatives of all groups in one proportion or another, the typification of the general population makes it possible to exclude the influence of intergroup variance on the mean sampling error, which in this case is determined only by intragroup variation.

The selection of units in a typical sample can be organized either in proportion to the volume of typical groups, or in proportion to the intragroup differentiation of the trait.
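
The first of these options, allocation in proportion to the volume of the typical groups, can be sketched as follows. The group names and sizes are invented for the example, and rounding by largest remainders is one possible way to obtain whole numbers of units rather than a rule taken from the text.

```python
def proportional_allocation(group_sizes, n):
    """Split a sample of n units among groups in proportion to their size."""
    total = sum(group_sizes.values())
    quotas = {g: n * size / total for g, size in group_sizes.items()}
    # Start from the whole part of each quota.
    allocation = {g: int(q) for g, q in quotas.items()}
    # Hand out the units lost to rounding, largest fractional part first.
    remainder = n - sum(allocation.values())
    by_fraction = sorted(quotas, key=lambda g: quotas[g] - allocation[g], reverse=True)
    for g in by_fraction[:remainder]:
        allocation[g] += 1
    return allocation

# Hypothetical typical groups: three districts of a city.
groups = {"district A": 5000, "district B": 3000, "district C": 2000}
allocation = proportional_allocation(groups, n=200)
print(allocation)  # {'district A': 100, 'district B': 60, 'district C': 40}
```

Within each group, the allotted number of units would then be drawn by random or mechanical selection.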

Serial selection. This method is convenient when the population units are combined into small groups or series. Such series may be packages containing a certain quantity of finished products, consignments of goods, student groups, brigades and other associations. The essence of serial sampling is the properly random or mechanical selection of series, within which a complete survey of units is carried out.
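
A minimal sketch of serial selection, assuming the series are student groups stored in a dictionary. The function name, the group labels and the number of series are hypothetical; the point is only that series are drawn at random and every unit inside a drawn series is examined.

```python
import random

def serial_sample(series, n_series, seed=None):
    """Randomly select whole series and return every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(series), k=n_series)  # non-repeated draw of series
    units = [unit for name in chosen for unit in series[name]]
    return chosen, units

# Hypothetical population: 10 student groups of 25 students each.
student_groups = {
    f"group {i}": [f"student {i}-{j}" for j in range(1, 26)]
    for i in range(1, 11)
}

chosen, units = serial_sample(student_groups, n_series=3, seed=1)
print(chosen, len(units))
```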

How to make a sample

So, in order to maximize representativeness, there are three main sampling rules:

  1. The most universal indicator of sample size is considered to be 20%. A statistical sample of 20% will almost always give a result as close to reality as possible, and there is no need to include a larger part of the general population in the sample. The 20% figure has emerged from many studies. A little more theory: the larger the sample, the smaller the error of representativeness and the more accurate the result. The closer the sample is to the general population in number of units, the more accurate the results will be. If the entire population is examined, the result is 100% accurate, but that is no longer sampling: it is complete observation, in which all units are examined, so it is not of interest here.
  2. If processing 20% of the general population is impractical, it is acceptable to study at least 1001 units. This is another rule of thumb that has developed over time. Of course, it will not give exact results for very large populations, but it brings the study as close as possible to the attainable sampling accuracy.
  3. Statistics offers many formulas and summary tables. Depending on the object of research and the sampling criterion, one formula or another is appropriate. This approach, however, is used in complex and multi-stage research.
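
As an illustration of the third rule, here is one widely used formula for the sample size needed to estimate a share under non-repeated random selection: n = t²·w(1-w)·N / (Δ²·N + t²·w(1-w)). The text does not say which formulas it has in mind, so the choice of this formula is an assumption, as are the function name and the defaults (t = 2, and w = 0.5 as the most cautious guess for an unknown share).

```python
import math

def required_sample_size(N, delta, t=2.0, w=0.5):
    """Sample size for estimating a share with marginal error delta.

    N     - size of the general population
    delta - permissible marginal error, as a fraction (0.05 = 5%)
    t     - confidence coefficient (t = 2 corresponds to about 95.4%)
    w     - expected share; 0.5 gives the largest, most cautious size
    """
    numerator = t**2 * w * (1 - w) * N
    denominator = delta**2 * N + t**2 * w * (1 - w)
    return math.ceil(numerator / denominator)

print(required_sample_size(N=10_000, delta=0.05))     # 385
print(required_sample_size(N=1_000_000, delta=0.05))  # 400
```

Under these assumptions the required size barely grows when the population becomes a hundred times larger, which is why fixed percentages of the population are best treated as rough guides.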

Error of representativeness

The main characteristic of the quality of a sample is the "representativeness error": the discrepancy between the indicators obtained by selective observation and those that continuous (complete) observation would give. By the size of the error, representativeness is classed as reliable, normal or approximate, which corresponds to permissible deviations of up to 3%, from 3 to 10% and from 10 to 20%, respectively. In statistics it is nevertheless desirable that the error not exceed 5-6%; otherwise there is reason to speak of insufficient representativeness of the sample. Several factors are taken into account when calculating the error of representativeness and assessing how it affects the conclusions about the general population:

  1. The probability with which you want to get an accurate result.
  2. The number of units in the sample. As mentioned earlier, the fewer units the sample contains, the larger the representativeness error will be, and vice versa.
  3. The homogeneity of the studied population. The more heterogeneous the population is, the greater the error in representativeness will be. The ability of an aggregate to be representative depends on the homogeneity of all its constituent units.
  4. The method of selecting units for the sample.

In specific studies, the permissible error of the mean is usually set by the researcher on the basis of the observation program and previous studies. As a rule, an acceptable marginal sampling error (representativeness error) is considered to be within 3-5%.
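
To show how a marginal error turns into an interval estimate of the general share, the sketch below uses the simplest case, the formula for repeated selection: Δ = t·sqrt(w(1-w)/n), with the interval w ± Δ. The function name and the figures (120 units with the characteristic out of 400 surveyed) are invented for the example.

```python
import math

def share_interval(m, n, t=2.0):
    """Interval estimate of the general share from a sample.

    m - number of sampled units that have the characteristic
    n - sample size
    t - confidence coefficient (t = 2 corresponds to about 95.4%)
    """
    w = m / n                               # sample share
    delta = t * math.sqrt(w * (1 - w) / n)  # marginal sampling error
    return w - delta, w + delta

low, high = share_interval(m=120, n=400)
print(round(low, 4), round(high, 4))
```

Here the sample share is 30% and the marginal error is about 4.6 percentage points, which falls inside the 3-5% range mentioned above.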


Bigger is not always better

It is also worth remembering that the main thing in organizing selective observation is to bring its volume down to an acceptable minimum. One should not strive for an excessively narrow margin of sampling error, as this leads to an unjustified increase in sample size and, consequently, in the cost of the survey.

Nor should the permitted error of representativeness be made too large. Although this reduces the size of the sample, it also reduces the reliability of the results.

What questions the researcher usually faces

Any research is carried out for some purpose and to obtain some results. In a sample study, as a rule, two initial questions are posed:

  1. Determining the required number of sampling units, that is, how many units will be studied. In addition, the sample must be representative for the research to be accurate.
  2. Calculating the error of representativeness at a specified level of probability. It should be noted right away that no sample study has a 100% probability level. If the organization that studied a certain segment claims that its results are accurate with a probability of 100%, this is untrue. Long-term practice has established the probability level customarily used for a correctly conducted sample study: 95.4%.
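
The 95.4% figure matches the probability that a normally distributed estimate lies within two mean errors of the true value. The text does not say where the figure comes from, so treating it as the t = 2 case of the normal distribution is an assumption. A short check with the standard library:

```python
import math

def confidence_probability(t):
    """Probability that a normal estimate deviates by less than t mean errors."""
    return math.erf(t / math.sqrt(2))

for t in (1, 2, 3):
    print(t, round(confidence_probability(t), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

No finite value of t gives a probability of exactly 1, which is consistent with the remark that a sample study cannot be 100% certain.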

Methods for selecting research units in the sample

Not every sample is representative. Sometimes the same characteristic is expressed differently in the whole and in its part. To meet the requirements of representativeness, various sampling techniques are used, and the choice of technique depends on the specific circumstances. These sampling techniques include:

  • random selection;
  • mechanical selection;
  • typical selection;
  • serial (nested) selection.

Random sampling is a system of measures aimed at the random selection of units, such that every unit of the general population has an equal probability of getting into the sample. It is advisable to use this technique only when the population is homogeneous and has a small number of inherent features; otherwise, some of the characteristics run the risk of not being reflected in the sample. Random sampling is at the heart of all other sampling methods.

In mechanical selection, units are taken at a certain interval. If it is necessary to form a sample of specific crimes, every 5th, 10th or 15th card can be drawn from the statistical records of registered crimes, depending on their total number and the planned sample size. The disadvantage of this method is that a complete list of the units of the population is needed before sampling; the list must then be ranked, and only after that can units be selected at a certain interval. The method is time-consuming and is therefore not often used.


Typical (zoned) selection is a type of sampling in which the general population is divided into homogeneous groups according to a certain characteristic. Instead of "groups", researchers sometimes use the terms "areas" and "zones". A certain number of units is then randomly selected from each group in proportion to the share of the group in the general population. Typical selection is often carried out in several steps.

Serial sampling is a method in which units are selected in groups (series) and all units of a selected group (series) are examined. The advantage of this method is that it is sometimes easier to select series than individual units, for example when studying persons serving a sentence: within the selected series, all units without exception are studied, for example all persons serving a sentence in a particular institution.


Plan

  • Introduction
  • 1. The role of sampling
  • Conclusion
  • Bibliography

Introduction

Statistics is an analytical science that every modern specialist needs. A modern specialist cannot be considered competent without knowing statistical methodology. Statistics is one of the most important tools for communication between an enterprise and society. It is also one of the most important disciplines in the curriculum of all specialties, because statistical literacy is an integral part of higher education, and in the number of hours allotted in the curriculum it takes one of the first places. Working with numbers, each specialist must know how the data were obtained, how they were calculated, and how complete and reliable they are.

1. The role of sampling

The set of all units of the population, possessing a certain characteristic and subject to study, is called the general population in statistics.

In practice, for one reason or another, it is not always possible or practical to examine the entire general population. In that case only a certain part of it is studied, with the ultimate goal of extending the results to the entire general population; that is, the sampling method is applied.

For this purpose, some of the elements, the so-called sample, are selected from the general population in a special way, and the results of processing the sample data (for example, arithmetic means) are generalized to the entire population.

The theoretical basis of the sampling method is the law of large numbers. By virtue of this law, given a limited variance of the characteristic in the general population and a sufficiently large sample, the sample mean will, with a probability close to certainty, be arbitrarily close to the general mean. This law, which comprises a group of theorems, has been proved rigorously. Thus, the arithmetic mean calculated for the sample can reasonably be regarded as an indicator characterizing the general population as a whole.
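
The law of large numbers can be observed in a small simulation. The sketch below draws values uniformly from the interval [0, 1), whose general mean is 0.5, and measures how far the sample mean lies from it for samples of different sizes. The distribution, the sample sizes and the seed are arbitrary choices made for illustration.

```python
import random

def sample_mean_error(n, seed=0):
    """Absolute gap between the mean of n uniform draws and the general mean."""
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(n)]  # uniform on [0, 1), general mean 0.5
    return abs(sum(draws) / n - 0.5)

errors = {n: sample_mean_error(n) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(n, err)
```

Individual runs fluctuate, but the gap for the largest sample is expected to be on the order of a thousandth, far below what a sample of ten units typically gives.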

2. Methods of probabilistic selection to ensure representativeness

In order to draw conclusions about the properties of the general population from the sample, the sample must be representative, i.e. it must fully and adequately reflect the properties of the general population. Representativeness can be ensured only if the data are selected objectively.

The sample is formed according to the principle of mass probabilistic processes, without any exceptions to the adopted selection scheme; it is necessary to ensure the relative homogeneity of the sample population or to divide it into homogeneous groups of units. When the sampling frame is formed, the sampling unit should be clearly defined. Sampling units of approximately the same size are desirable, and the smaller the sampling unit, the more accurate the results.

There are three possible selection methods: random selection, selection of units according to a certain scheme, and a combination of the two.

If selection according to the adopted scheme is made from a general population previously divided into types (layers, or strata), the sample is called typical (or stratified, or zoned). Another classification of samples depends on the sampling unit: a unit of observation or a series of units (the term "nest" is sometimes used). In the latter case, the sample is called serial, or nested. In practice, a combination of typical and serial sampling is often used. In mathematical statistics, the discussion of data selection necessarily distinguishes repeated and non-repeated samples. The first corresponds to drawing with replacement, the second to drawing without replacement (when data selection is illustrated by drawing balls of different colors from an urn). In socio-economic statistics it makes no sense to apply repeated sampling, so, as a rule, non-repeated sampling is meant.

Since socio-economic objects have a complex structure, a sample can be quite difficult to organize. For example, to select households when studying consumption by the population of a big city, it is easier first to select territorial cells and residential buildings, then apartments or households, and then the respondent. Such a sample is called multistage. Each stage uses different selection units: larger ones at the initial stages, while at the last stage the selection unit coincides with the observation unit.

Another type of sampling is multiphase sampling. Such a sample includes a certain number of phases, each of which differs in the level of detail of the observation program. For example, 25% of the general population is surveyed according to a short program, every 4th unit from this sample is surveyed according to a more complete program, etc.

For any type of sampling, the units are selected in one of the three ways mentioned above. Consider the random selection procedure. First, a list of the units of the population is drawn up, in which each unit is assigned a digital code (a number or label). Then lots are drawn: balls with the corresponding numbers are put into a drum, mixed, and drawn. The drawn numbers identify the units included in the sample; the number of balls drawn is equal to the planned sample size.

Selection by lot may be biased by technical deficiencies (the quality of the balls or the drum) and other causes. Selection using a table of random numbers is more reliable in terms of objectivity. Such a table contains a series of digits that alternate at random, generated from electronic signals. Since we use the decimal system with the digits 0, 1, 2, ..., 9, the probability of any digit appearing is 1/10. Therefore, in a table of random numbers containing 500 digits, about 50 of them would be 0, about the same number would be 1, etc.
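The claim about digit frequencies can be checked with a short simulation. This is only a sketch: the function name `random_digit_table` and the seed are assumptions made for illustration, and a pseudo-random generator stands in for a printed table of random numbers.

```python
import random
from collections import Counter


def random_digit_table(size, seed=None):
    """Return `size` decimal digits, each drawn with probability 1/10."""
    rng = random.Random(seed)
    return [rng.randrange(10) for _ in range(size)]


# A table of 500 digits; the seed is an arbitrary choice for reproducibility.
digits = random_digit_table(500, seed=1)
counts = Counter(digits)

for digit in range(10):
    # Each count should be close to 500 / 10 = 50, but not exactly 50.
    print(digit, counts[digit])
```

The counts fluctuate around 50 from run to run, which is what "about 50" means in the text.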

Selection according to some scheme (so-called directed sampling) is often used. The selection scheme is chosen so as to reflect the basic properties and proportions of the general population. The simplest way is mechanical selection: from a list of units of the general population, drawn up so that the ordering of units is not related to the properties under study, units are selected with a step equal to N : n. Usually, selection begins not with the first unit but half a step in, to reduce the possibility of sampling bias. The frequency with which units with certain characteristics appear, for example, students with a particular level of academic performance, students living in a hostel, etc., will be determined by the structure that has developed in the general population.

To be more confident that the sample reflects the structure of the general population, the latter is subdivided into types (strata or regions), and random or mechanical selection is made from each type. The total number of units selected from the different types should equal the sample size.

Particular difficulties arise when there is no list of units and the selection must be made either on the spot or from product samples at a finished goods warehouse. In these cases, it is important to develop the route and the selection scheme in detail and to follow them without deviation. For example, the interviewer is instructed to move north from a certain bus stop along the even-numbered side of the street and, counting two houses from the first corner, to enter the third and conduct a survey in every 5th dwelling. Strict adherence to the adopted scheme ensures the main condition for forming a representative sample: the objectivity of the selection of units.

Quota sampling should be distinguished from random sampling. In quota sampling, the sample is constructed from units of certain categories (quotas), which must be represented in specified proportions. For example, in a survey of department store buyers, it may be planned to select 150 respondents, including 90 women, of whom 25 are girls, 20 are young women with small children, 35 are middle-aged women dressed in business suits, and 10 are women aged 50 and older; in addition, a survey of 70 men is planned, of whom 25 are adolescents and boys, 20 are young men with children, 15 are men dressed in suits, and 10 are men dressed in sportswear. Such a sample may be good for determining consumer orientations and preferences, but if we use it to establish the average amount of purchases and their structure, we will get unrepresentative results, because quota sampling aims to select certain categories.

A sample may be unrepresentative even if it is formed in accordance with the known proportions of the general population, if the selection is carried out without any scheme: units are recruited arbitrarily, just to keep the ratio of their categories the same as in the general population (for example, the ratio of men and women, or of respondents below, of, and above working age, etc.).

These remarks should caution you against such sampling approaches and re-emphasize the need for objective sampling.

3. Organizational and methodological features of random, mechanical, typical and serial sampling

Depending on how the elements of the population are selected in the sample, several types of sample surveys are distinguished. The selection can be random, mechanical, typical and serial.

Random selection is such a selection in which all elements of the general population have an equal opportunity to be selected. In other words, for each element of the general population, an equal probability of being included in the sample is ensured.

In practice, the requirement of random selection is met by drawing lots or by using a table of random numbers.

When selecting by lot, all elements of the general population are numbered in advance and their numbers are written on cards. After careful shuffling, the required number of cards, equal to the sample size, is drawn from the pack in any way (in a row or in any other order). The drawn cards can either be put aside (so-called non-repeated selection) or, after the number is written down, returned to the pack, which gives the card the opportunity to appear in the sample again (repeated selection). In repeated selection, the pack must be carefully reshuffled each time a card is returned.
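The two ways of drawing cards can be sketched in a few lines. The function names and the pack of 100 numbered cards are assumptions chosen for illustration; the pseudo-random generator replaces physical shuffling.

```python
import random


def draw_non_repeated(cards, n, rng):
    """Drawn cards are put aside, so no card can appear twice."""
    return rng.sample(cards, n)


def draw_repeated(cards, n, rng):
    """Each card is returned to the pack before the next draw."""
    return [rng.choice(cards) for _ in range(n)]


rng = random.Random(0)        # fixed seed only to make the run repeatable
cards = list(range(1, 101))   # hypothetical pack: elements numbered 1..100

non_repeated = draw_non_repeated(cards, 10, rng)
repeated = draw_repeated(cards, 10, rng)

print(non_repeated)  # ten distinct numbers
print(repeated)      # ten numbers, duplicates are possible
```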

The method of drawing lots is used in cases where the number of elements of the entire studied population is small. With a large population, random selection by lot becomes difficult. More reliable and less laborious in the case of a large amount of processed data is the method of using a table of random numbers.

Mechanical selection is carried out as follows. If a 10% sample is formed, i.e. one element must be selected out of every ten, the whole set is conditionally divided into equal parts of 10 elements. Then one element is selected at random from the first ten. Suppose, for example, that the draw gives the ninth element. The remaining elements of the sample are then completely determined by the specified selection proportion and by the number of the first selected element. In this case, the sample will consist of elements 9, 19, 29, etc.
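A minimal sketch of this procedure follows. The function name `mechanical_selection` and the population of 100 numbered elements are assumptions; the starting element is passed in explicitly so that the example from the text (start at 9, step 10) can be reproduced.

```python
def mechanical_selection(population, step, start):
    """Select every `step`-th element, beginning with the `start`-th one.

    `start` is the 1-based position of the first selected element and
    must lie within the first group of `step` elements.
    """
    if not 1 <= start <= step:
        raise ValueError("start must be between 1 and step")
    return population[start - 1::step]


elements = list(range(1, 101))  # hypothetical population numbered 1..100
sample = mechanical_selection(elements, step=10, start=9)
print(sample)  # begins 9, 19, 29, as in the text
```

In practice `start` would itself be drawn at random, for example with `random.randint(1, step)`.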

Mechanical selection should be used with caution, as there is a real risk of so-called systematic errors. Therefore, before making a mechanical sample, it is necessary to analyze the population under study. If its elements are arranged randomly, then the sample obtained mechanically will be random. Often, however, the elements of the original set are partially or even completely ordered. An ordering with a regular periodicity is highly undesirable for mechanical selection, since its period may coincide with the period of the selection.

Often, the elements of a set are ordered by the magnitude of the trait under study in decreasing or increasing order and do not have periodicity. Mechanical selection from such a population takes on the character of directed selection, since individual parts of the population are represented in the sample in proportion to their number in the entire population, i.e. selection is aimed at making the sample representative.

Another type of directed selection is typical selection. Typical selection should be distinguished from the selection of typical objects. The selection of typical objects was used in zemstvo statistics, as well as in budget surveys. There, "typical villages" or "typical farms" were selected according to some economic characteristics, for example, the size of land ownership per household, the occupation of residents, and so on. Selection of this kind cannot serve as the basis for applying the sampling method, since its main requirement, the randomness of selection, is not fulfilled.

In the case of a typical selection in the sampling method, the population is divided into groups that are qualitatively homogeneous, and then random selection is made within each group. Typical selection is more difficult to organize than random selection itself, since certain knowledge about the composition and properties of the general population is required, but it gives more accurate results.
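Typical selection can be sketched as random selection inside each homogeneous group. The text does not say how many units to take from each group, so proportional allocation (the same fraction from every group) is an assumption here, as are the function name and the two hypothetical groups.

```python
import random


def typical_selection(strata, fraction, rng):
    """Draw the same fraction of units at random from every group."""
    sample = {}
    for name, units in strata.items():
        size = max(1, round(len(units) * fraction))  # at least one unit per group
        sample[name] = rng.sample(units, size)
    return sample


# Hypothetical population already divided into two homogeneous groups.
strata = {
    "group_a": list(range(60)),
    "group_b": list(range(100, 140)),
}
sample = typical_selection(strata, fraction=0.1, rng=random.Random(0))

for name, units in sample.items():
    print(name, sorted(units))
```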

In serial selection, the entire population is divided into groups (series). Then, by random or mechanical selection, a certain part of these series is isolated and their continuous processing is carried out. In fact, serial selection is a random or mechanical selection carried out for the enlarged elements of the original population.
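Serial selection differs from the previous schemes in that whole groups are drawn and then processed in full. The sketch below assumes a hypothetical population of five series with four units each; the names are illustrative only.

```python
import random


def serial_selection(series, n_series, rng):
    """Pick whole series at random; every unit of a chosen series is kept."""
    chosen = rng.sample(sorted(series), n_series)
    return {name: list(series[name]) for name in chosen}


# Hypothetical population: 5 series of 4 units each.
series = {
    f"series_{i}": [f"unit_{i}_{j}" for j in range(4)]
    for i in range(5)
}
sample = serial_selection(series, n_series=2, rng=random.Random(0))

for name, units in sample.items():
    print(name, units)
```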

In theoretical terms, the serial sample is the most imperfect of those considered. As a rule, it is not used for processing material, but it offers certain convenience when organizing a survey, especially in the study of agriculture. For example, annual sample surveys of peasant farms in the years preceding collectivization were carried out by serial selection. It is useful for the historian to be aware of serial sampling, since he may come across the results of such surveys.

In addition to the classical methods of selection described above, other methods are used in the practice of the sampling method. Let's consider two of them.

The studied population can have a multistage structure, it can consist of units of the first level, which, in turn, consist of units of the second level, etc. For example, provinces include counties, counties can be viewed as a set of volosts, volosts consist of villages, and villages - of courtyards.

Multistage selection can be applied to such populations, i.e. sequentially carry out the selection at each stage. So, from the aggregate of provinces, by a mechanical, typical or random method, you can select counties (first step), then select volosts in one of the indicated ways (second step), then select villages (third step) and, finally, households (fourth step).

An example of a two-stage mechanical selection is the long-practiced selection of workers' budgets. At the first stage, enterprises are selected mechanically, at the second, workers whose budgets are being examined.

The variability of the features of the objects under study can differ. For example, the provision of peasant farms with their own labor force fluctuates less than, say, the size of their crops. Therefore, a smaller sample for labor force availability will be as representative as a larger sample for the size of crops. In this case, from the sample used to determine the size of crops, one can draw a subsample that is sufficiently representative to determine the supply of labor, thereby carrying out a two-phase selection. In the general case, further phases can be added, i.e. another subsample can be drawn from the resulting subsample, and so on. The same selection method is used when the research objectives require different accuracy in calculating different indicators.

Task 1. Descriptive statistics

On the exam, 20 students received the following marks (on a 100 point scale):

1) Construct a series of frequency distributions, relative and accumulated frequencies for 5 intervals;

2) Build polygon, histogram and cumulative polygon;

3) Find the arithmetic mean, mode, median, first and third quartiles, interquartile range, standard deviation and coefficient of variation. Analyze the data using these characteristics and indicate an interval that includes the central 50% of the values.

1) x (min) = 53, x (max) = 98

R = x (max) - x (min) = 98-53 = 45

h = R / (1 + 3.32 lg n), where n is the sample size, n = 20

h = 45 / (1 + 3.32 lg 20) ≈ 8.46, which is rounded to h = 9

a (i) - lower limit of the interval, b (i) - upper limit of the interval.

a (1) = x (min) - h / 2, b (1) = a (1) + h; each subsequent interval starts where the previous one ends, a (i + 1) = b (i), so b (2) = a (2) + h, b (3) = a (3) + h, etc. Intervals are constructed until the start of the next interval is equal to or greater than x (max).

a (1) = 47.5 b (1) = 56.5

a (2) = 56.5 b (2) = 65.5

a (3) = 65.5 b (3) = 74.5

a (4) = 74.5 b (4) = 83.5

a (5) = 83.5 b (5) = 92.5

a (6) = 92.5 b (6) = 101.5

Table columns: intervals a (i) - b (i); frequency counting; frequency n (i); accumulated frequency n (hi).

2) To construct graphs, we write down the variation series of the distribution (interval and discrete) of the relative frequencies W (i) = n (i) / n, the accumulated relative frequencies W (hi) and find the ratio W (i) / h by filling in the table.

x (i) = (a (i) + b (i)) / 2; W (hi) = n (hi) / n
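The columns of the grouped table can be computed directly from the marks. The sketch below uses the 20 marks listed further on in the solution and the interval boundaries found in step 1; assigning a value to the interval with a (i) <= x < b (i) is an assumption, although no mark falls on a boundary here.

```python
marks = [53, 58, 59, 59, 63, 67, 68, 69, 71, 73,
         78, 79, 85, 86, 87, 89, 91, 91, 98, 98]
bounds = [47.5, 56.5, 65.5, 74.5, 83.5, 92.5, 101.5]

n = len(marks)
accumulated = 0
rows = []
for a, b in zip(bounds, bounds[1:]):
    freq = sum(a <= x < b for x in marks)  # n(i)
    accumulated += freq                    # n(hi)
    rows.append({
        "interval": (a, b),
        "midpoint": (a + b) / 2,           # x(i)
        "n": freq,
        "W": freq / n,                     # relative frequency W(i)
        "n_acc": accumulated,
        "W_acc": accumulated / n,          # accumulated relative frequency W(hi)
        "height": freq / n / (b - a),      # W(i) / h for the histogram
    })

for row in rows:
    print(row)
```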

Statistical series of the distribution of estimates:

Intervals, a (i) - b (i)

To build a histogram of relative frequencies, we mark off the partial intervals along the abscissa and on each of them build a rectangle whose area is equal to the relative frequency W (i) of the i-th interval. The height of each rectangle should then be equal to W (i) / h.

From the histogram, a polygon of the same distribution can be obtained by connecting the midpoints of the upper bases of the rectangles with straight line segments.

To construct the cumulative curve of a discrete series, we plot the values of the feature along the abscissa and the accumulated relative frequencies W (hi) along the ordinate, and connect the resulting points with straight line segments. For an interval series, we mark off the upper boundaries of the grouping intervals along the abscissa.

3) We find the arithmetic mean by the formula:

The mode is calculated by the formula:

The lower border of the modal interval; h is the width of the grouping interval; - the frequency of the modal interval; - the frequency of the interval preceding the modal; is the frequency of the interval following the modal. = 23.125.

Find the median:

n = 20: 53, 58, 59, 59, 63, 67, 68, 69, 71, 73, 78, 79, 85, 86, 87, 89, 91, 91, 98, 98

Substituting the values, we get: Q1 = 65;

The value of the second quartile coincides with the value of the median, therefore Q2 = 75.5; Q3 = 88.
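The quartiles quoted here can be checked against the ordered marks. The sketch assumes that Q1 and Q3 are taken as the medians of the lower and upper halves of the ordered series; that convention is not stated in the text, but it reproduces the values given.

```python
from statistics import mean, median

marks = [53, 58, 59, 59, 63, 67, 68, 69, 71, 73,
         78, 79, 85, 86, 87, 89, 91, 91, 98, 98]

ordered = sorted(marks)
half = len(ordered) // 2

q1 = median(ordered[:half])    # median of the lower half
q2 = median(ordered)           # the median itself
q3 = median(ordered[-half:])   # median of the upper half
iqr = q3 - q1                  # interquartile range

# Q1 = 65, Q2 = 75.5 and Q3 = 88 agree with the values in the text.
print(mean(ordered), q1, q2, q3, iqr)
```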

The interquartile range is:

The root-mean-square (standard) deviation is found by the formula:

The coefficient of variation:

It can be seen from these calculations that 50% of the central values ​​of the indicated values ​​include the interval 74.5 - 83.5.

Task 2. Statistical hypothesis testing.

The sports preferences for men, women and adolescents are as follows:

Test the hypothesis that preference is independent of gender and age at the significance level α = 0.05.

1) Testing the hypothesis about the independence of preferences in sports.

Pearson's chi-square criterion:

The tabular value of the chi-square criterion with 4 degrees of freedom at α = 0.05 is χ² (tabl) = 9.488.

Since the calculated value exceeds the tabular one, the hypothesis of independence is rejected. The differences in preference are significant.
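The test can be sketched for a 3 x 3 table, which gives the 4 degrees of freedom used above. The table of preferences is not available here, so the counts below are purely hypothetical; only the critical value 9.488 is taken from the text.

```python
def chi_square_statistic(table):
    """Pearson's chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are men, women, adolescents; columns are three sports.
observed = [
    [30, 20, 10],
    [15, 25, 20],
    [10, 15, 35],
]
stat = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_VALUE = 9.488  # tabular value for 4 degrees of freedom at 0.05

print(round(stat, 3), df, stat > CRITICAL_VALUE)
```

If the statistic exceeds the tabular value, the hypothesis of independence is rejected, as in the solution.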

2. Conformity hypothesis.

Volleyball as a sport is closest to basketball. Let's check the correspondence in preferences for men, women and adolescents.

Ф 2 = 0.1896 + 0.1531 + 0.1624 + 0.1786 + 0.1415 + 0.1533 = 0.979.

With a significance level of α = 0.05 and k = 2 degrees of freedom, the tabular value is χ² (tabl) = 9.210.

Since Ф 2>, the differences in preferences are significant.

Task 3. Correlation-regression analysis.

An analysis of road traffic accidents yielded the following statistics for the percentage of drivers under the age of 21 and the number of serious accidents per 1000 drivers:

Conduct graphical and correlation-regression analysis of data, predict the number of accidents with severe consequences for a city in which the number of drivers under the age of 21 is equal to 20% of the total number of drivers.

We get a sample of size n = 10.

x is the percentage of drivers under the age of 21,

y is the number of accidents per 1000 drivers.

The linear regression equation is:

We calculate sequentially:

Similarly, we find

Sample Regression Coefficient

The connection between x, y is strong.

The linear regression equation takes the form:

The figure shows the scatter plot and the graph of the linear regression. We make a forecast for x = 20.

We get y = 0.29 * 20 - 1.46 = 4.34.

The forecast value turned out to be greater than all the values in the original table. This is a consequence of the fact that the correlation dependence is direct and the coefficient, equal to 0.29, is fairly large. For each unit increment Δx it gives an increment Δy ≈ 0.3.
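The forecast step, and the least-squares fit that produces such coefficients, can be sketched as follows. The function names are assumptions, and since the original data table is not available, the fit is demonstrated on a small made-up data set; only the coefficients 0.29 and -1.46 come from the text.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Value of the fitted line at x."""
    return slope * x + intercept


# Forecast with the coefficients reported in the text.
forecast = predict(0.29, -1.46, 20)
print(round(forecast, 2))

# Hypothetical data lying exactly on y = 2x + 1, to illustrate the fit.
slope, intercept = fit_line([1, 2, 3], [3, 5, 7])
print(slope, intercept)
```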

Task 4. Time series analysis and forecasting.

Predict index values ​​for the next week using:

a) the moving average method, choosing three-week data for its calculation;

b) exponential smoothing (an exponentially weighted average) with α = 0.1.
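Both forecasting methods can be sketched briefly. The weekly series below is hypothetical, since the original table is not available, and starting the exponential smoothing from the first observation is an assumption about the initialization.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` values."""
    return sum(series[-window:]) / window


def exponential_smoothing_forecast(series, alpha=0.1):
    """Forecast the next value by simple exponential smoothing.

    The smoothed level starts at the first observation (an assumption).
    """
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


weekly_index = [10.0, 12.0, 11.0, 13.0, 12.0]  # hypothetical data

print(moving_average_forecast(weekly_index, window=3))
print(exponential_smoothing_forecast(weekly_index, alpha=0.1))
```

A small smoothing constant such as 0.1 gives most of the weight to the accumulated past, so the forecast reacts slowly to recent changes.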

From the table of random numbers we find numbers 41, 51, 69, 135, 124, 93, 91, 144, 10, 24.

We arrange them in ascending order: 10, 24, 41, 51, 69, 91, 93, 124, 135, 144.

We carry out a new numbering from 1 to 10. We receive the initial data for ten weeks:

Exponential smoothing at α = 0.1 gives only one value.

For the middle of the entire period, we get three forecasts: 12.855; 1309; 12.895.

There is agreement between these forecasts.

Task 5. Index analysis.

The company is engaged in the transportation of goods. There are data for a number of years on the volume of transportation of 4 types of cargo and the cost of transportation of a unit of cargo.

Determine simple price, quantity and value indices for each type of cargo, as well as the Laspeyres and Paasche indices and the value index. Comment on the results in a meaningful way.

Solution. Let's calculate simple indices:

Laspeyres Index:

Paasche Index:

Value index:
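The three aggregate indices named in the solution can be sketched in their standard price-index form. The function names and the two-item data set are hypothetical; p denotes the cost of transporting a unit of cargo and q the volume, with 0 for the base period and 1 for the current period.

```python
def laspeyres_price_index(p0, p1, q0):
    """Current prices against base prices, weighted by base quantities."""
    return sum(p * q for p, q in zip(p1, q0)) / sum(p * q for p, q in zip(p0, q0))


def paasche_price_index(p0, p1, q1):
    """Current prices against base prices, weighted by current quantities."""
    return sum(p * q for p, q in zip(p1, q1)) / sum(p * q for p, q in zip(p0, q1))


def value_index(p0, q0, p1, q1):
    """Total current value divided by total base value."""
    return sum(p * q for p, q in zip(p1, q1)) / sum(p * q for p, q in zip(p0, q0))


# Hypothetical data for two types of cargo.
p0, q0 = [10, 20], [5, 2]
p1, q1 = [12, 20], [4, 3]

print(laspeyres_price_index(p0, p1, q0))
print(paasche_price_index(p0, p1, q1))
print(value_index(p0, q0, p1, q1))
```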

Individual indices indicate a discrepancy in the change in prices and quantities for goods A, B, C, D. Aggregate indices indicate general trends of change. In general, the cost of transported goods decreased by 13%. The reason is that the most expensive cargo decreased by 42% in terms of quantity, and its tariff remained almost unchanged.

Years 16-20 are numbered in order from 1 to 5. The initial data takes the form:

First, we investigate the dynamics of the amount of cargo A.

Table columns: indicator; absolute increase; growth rate, %; rate of increase, %.

In this case, the growth rates are averaged by the formulas:

For the rate of increase, in any case T (inc) = T (gr) - 1, where T (gr) is the growth rate and T (inc) is the rate of increase.
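The dynamics indicators can be sketched as follows. The averaging formulas themselves are not available here, so the geometric mean of the chain growth rates, which is the usual choice, is assumed; the function names and the series of levels are hypothetical.

```python
def chain_growth_rates(levels):
    """Growth rate of each level relative to the previous one."""
    return [curr / prev for prev, curr in zip(levels, levels[1:])]


def average_growth_rate(levels):
    """Geometric mean of the chain growth rates (assumed averaging rule)."""
    return (levels[-1] / levels[0]) ** (1 / (len(levels) - 1))


def rate_of_increase(growth_rate):
    """Rate of increase is the growth rate minus one."""
    return growth_rate - 1


levels = [100.0, 110.0, 121.0]  # hypothetical volumes for three years

rates = chain_growth_rates(levels)
avg = average_growth_rate(levels)
print(rates, avg, rate_of_increase(avg))
```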

Now consider cargo D.

Table columns: indicator; absolute increase; growth rate, %; rate of increase, %.

Conclusion

Average values ​​and their varieties play an important role in statistics. Average indicators are widely used in analysis, since it is in them that the patterns of mass phenomena and processes both in time and in space find their manifestation. So, for example, the regularity of an increase in labor productivity is expressed in the statistical indicators of the growth of the average output per worker in industry, the regularity of a steady growth in the level of well-being of the population is manifested in the statistical indicators of an increase in the average income of workers and employees, etc.

Descriptive characteristics of the distribution of a variable attribute such as the mode and the median are also widely used. They are specific characteristics: each takes the value of a particular variant in the variation series.

The mode characterizes the most common value of the attribute, and the median shows the value of the attribute that half of the members of the population have reached.

Thus, averages help to study the patterns of development of industry, a specific industry, society and the country as a whole.

Plan:

1. Problems of mathematical statistics.

2. Types of samples.

3. Selection methods.

4. The statistical distribution of the sample.

5. Empirical distribution function.

6. Polygon and histogram.

7. Numerical characteristics of the variation series.

8. Statistical estimates of distribution parameters.

9. Interval estimates of distribution parameters.

1. Problems and Methods of Mathematical Statistics

Mathematical statistics is a branch of mathematics devoted to methods of collecting, analyzing and processing statistical observation data for scientific and practical purposes.

Suppose we need to study a set of homogeneous objects with respect to some qualitative or quantitative attribute that characterizes these objects. For example, for a batch of parts, whether a part meets the standard can serve as a qualitative attribute, and the controlled size of the part can serve as a quantitative attribute.

Sometimes a complete survey is carried out, i.e. each object is examined for the attribute of interest. In practice, a complete survey is rarely used. For example, if a population contains a very large number of objects, it is physically impossible to conduct a complete survey. If examining an object destroys it or requires large material costs, a complete survey makes no sense. In such cases, a limited number of objects (the sample population) are randomly selected from the entire population and studied.

The main task of mathematical statistics is to study the entire population using sample data, depending on the goal, i.e. study of the probabilistic properties of the population: the distribution law, numerical characteristics, etc. for making management decisions in conditions of uncertainty.

2. Sample types

The general population is the collection of objects from which a sample is drawn.

The sample population (sample) is a collection of randomly selected objects.

The size of a population is the number of objects in it. The size of the general population is denoted by N, and the size of the sample by n.

Example:

If out of 1000 parts 100 parts are selected for inspection, then the volume of the general population is N = 1000, and the sample size n = 100.

Selection can be done in two ways: after an object has been selected and observed, it may or may not be returned to the general population. Accordingly, samples are divided into repeated and non-repeated.

A repeated sample is one in which the selected object is returned to the general population before the next one is selected.

A non-repeated sample is one in which the selected object is not returned to the general population.

In practice, non-repeated random sampling is usually used.

For the sample data to allow sufficiently confident judgments about the attribute of interest in the general population, the objects in the sample must represent it correctly: the sample must correctly reproduce the proportions of the general population. In other words, the sample should be representative.

By virtue of the law of large numbers, it can be argued that the sample will be representative if taken randomly.

If the size of the general population is large enough, and the sample is only an insignificant part of this population, then the distinction between repeated and non-repeated samples is erased; in the limiting case, when an infinite general population is considered, and the sample has a finite size, this difference disappears.

Example:

The American magazine "Literary Digest" used statistical methods to forecast the outcome of the 1936 presidential election in the United States. The contenders were F.D. Roosevelt and A.M. Landon. Telephone directories were taken as the source for the general population of surveyed Americans. From these, 4 million addresses were randomly selected, to which the editorial staff of the magazine sent postcards asking recipients to express their attitude towards the presidential candidates. After processing the results of the poll, the magazine published a forecast that Landon would win the election by a large margin. The forecast was wrong: Roosevelt won.
This is an example of a non-representative sample. In the United States in the first half of the twentieth century, only the wealthier part of the population, which supported Landon, had telephones.

3. Selection methods

In practice, various selection methods are used, which can be divided into 2 types:

1. Selection that does not require dividing the general population into parts: a) simple random non-repeated selection; b) simple random repeated selection.

2. Selection in which the general population is divided into parts: a) typical selection; b) mechanical selection; c) serial selection.

Simple random selection is selection in which objects are drawn one at a time, at random, from the entire general population.

Typical selection is selection in which objects are selected not from the entire general population, but from each of its "typical" parts. For example, if a part is made on several machines, the selection is made not from the entire set of parts produced by all machines, but from the output of each machine separately. Such selection is used when the attribute being examined fluctuates noticeably across the different "typical" parts of the general population.

Mechanical selection is selection in which the general population is "mechanically" divided into as many groups as there are objects to be included in the sample, and one object is selected from each group. For example, if 20% of machine-made parts must be selected, every 5th part is selected; if 5% of the parts must be selected, every 20th, etc. Sometimes such selection does not ensure a representative sample (if every 20th turned roller is selected, and the cutter is replaced immediately after each selection, then all the selected rollers will have been turned with blunt cutters).

Serial selection is selection in which objects are selected from the general population not one at a time, but in "series", which are then surveyed in full. For example, if products are manufactured by a large group of automatic machines, the output of only a few machines is inspected in full.

In practice, combined selection is often used, in which the above methods are combined.

4. Statistical distribution of the sample

Let a sample be drawn from the general population, in which the value x_1 was observed n_1 times, x_2 was observed n_2 times, ..., x_k was observed n_k times. Then n = n_1 + n_2 + ... + n_k is the sample size. The observed values are called variants, and the sequence of variants written in ascending order is called a variation series. The numbers of observations are called frequencies (absolute frequencies), and their ratios to the sample size are called relative frequencies, or statistical probabilities.

If the number of variants is large or the sample is drawn from a continuous general population, the variation series is compiled not from individual point values but from intervals of values of the general population. Such a variation series is called an interval series. In this case, the lengths of the intervals must be equal.

The statistical distribution of the sample is the list of variants and their corresponding frequencies or relative frequencies.

A statistical distribution can also be specified as a sequence of intervals and their corresponding frequencies (the sum of the frequencies of the values that fall within each interval).

The point variation series of frequencies can be represented by the table:

x_i    x_1    x_2    ...    x_k
n_i    n_1    n_2    ...    n_k

Similarly, you can represent the point variation series of relative frequencies.

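Moreover, the frequencies add up to the sample size, and the relative frequencies $w_i = n_i / n$ add up to one:

$$\sum_{i=1}^{k} n_i = n, \qquad \sum_{i=1}^{k} w_i = 1.$$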

Example:

A text X contains 1000 letters. The first letter encountered was the Russian letter "я", the second was "и", the third was "а", the fourth was "ю". Then came the letters "о", "е", "у", "э", "ы".

Writing out the positions these letters occupy in the Russian alphabet, we have, respectively: 33, 10, 1, 32, 16, 6, 21, 31, 29.

After arranging these numbers in ascending order, we get the variation series: 1, 6, 10, 16, 21, 29, 31, 32, 33.

Frequencies of the letters in the text: "а" - 75, "е" - 87, "и" - 75, "о" - 110, "у" - 25, "ы" - 8, "э" - 3, "ю" - 7, "я" - 22.

Let's compose the point variation series of frequencies:

$x_i$ | 1 | 6 | 10 | 16 | 21 | 29 | 31 | 32 | 33
$n_i$ | 75 | 87 | 75 | 110 | 25 | 8 | 3 | 7 | 22

Example:

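The frequency distribution of a sample of size n = 20 is given.

Construct the point variation series of relative frequencies.

$x_i$ | 2 | 6 | 12
$n_i$ | 3 | 10 | 7

Solution:

Let's find the relative frequencies:

$$w_1 = \frac{3}{20} = 0.15, \qquad w_2 = \frac{10}{20} = 0.5, \qquad w_3 = \frac{7}{20} = 0.35.$$

$x_i$ | 2 | 6 | 12
$w_i$ | 0.15 | 0.5 | 0.35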
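The same calculation can be sketched in Python. This is only an illustration: the function name `relative_frequencies` is an assumption introduced here, and exact fractions are used so that the relative frequencies can be checked to sum to one.

```python
from fractions import Fraction


def relative_frequencies(variants, frequencies):
    """Return (x_i, w_i) pairs, where w_i = n_i / n and n is the sample size."""
    n = sum(frequencies)  # n = n_1 + ... + n_k
    return [(x, Fraction(n_i, n)) for x, n_i in zip(variants, frequencies)]


# Data of the example: x_i = 2, 6, 12 with frequencies n_i = 3, 10, 7.
series = relative_frequencies([2, 6, 12], [3, 10, 7])
for x, w in series:
    print(x, float(w))

# The relative frequencies of a sample always add up to one.
print(sum(w for _, w in series) == 1)
```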

When constructing an interval distribution, there are rules for choosing the number of intervals or the width of each interval. The criterion is a trade-off: as the number of intervals grows, representativeness improves, but so do the volume of data and the time needed to process them. The difference $x_{\max} - x_{\min}$ between the largest and smallest variants is called the sample range.

To determine the number of intervals $k$, the empirical Sturges formula is usually used (with rounding to the nearest convenient integer): $k = 1 + 3.322 \lg n$.

Accordingly, the width of each interval $h$ can be calculated by the formula:

$$h = \frac{x_{\max} - x_{\min}}{k} = \frac{x_{\max} - x_{\min}}{1 + 3.322 \lg n}.$$
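A minimal Python sketch of this rule follows. It assumes that the interval width is the sample range divided by the rounded number of intervals. The function names and the sample figures (20 observations between 212 and 232) are illustrative.

```python
import math


def sturges_intervals(n):
    """Number of intervals k = 1 + 3.322 * lg(n), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))


def interval_width(x_min, x_max, k):
    """Interval width h: the sample range divided by the number of intervals."""
    return (x_max - x_min) / k


# Illustrative sample: n = 20 observations ranging from 212 to 232.
k = sturges_intervals(20)
h = interval_width(212, 232, k)
print(k, h)
```

Rounding to the nearest integer is only one reading of "the nearest convenient integer". In practice, k is often adjusted so that the interval boundaries come out as round numbers.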

5. Empirical distribution function

Consider a sample from the general population. Let the statistical distribution of the frequencies of the quantitative attribute X be known. We introduce the notation: $n_x$ is the number of observations in which the observed value of the attribute is less than x; n is the total number of observations (the sample size). The relative frequency of the event X < x is $n_x / n$. If x changes, the relative frequency changes as well, that is, the relative frequency $n_x / n$ is a function of x. Since it is found empirically, it is called empirical.

The empirical distribution function (sample distribution function) is the function $F^*(x)$ that gives, for each x, the relative frequency of the event X < x:

$$F^*(x) = \frac{n_x}{n},$$

where $n_x$ is the number of variants less than x,

n is the sample size.

In contrast to the empirical distribution function of the sample, the distribution function $F(x)$ of the general population is called the theoretical distribution function.

The difference between the empirical and theoretical distribution functions is that the theoretical function $F(x)$ determines the probability of the event X < x, while the empirical function $F^*(x)$ determines the relative frequency of the same event. By Bernoulli's theorem, the relative frequency of the event X < x, that is $F^*(x)$, tends in probability to the probability $F(x)$ of this event. That is, for large n, $F^*(x)$ and $F(x)$ differ little from each other.

Thus, it is advisable to use the empirical distribution function of the sample as an approximation of the theoretical (integral) distribution function of the general population.

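$F^*(x)$ has all the properties of $F(x)$:

1. The values of $F^*(x)$ belong to the interval $[0; 1]$.

2. $F^*(x)$ is a non-decreasing function.

3. If $x_1$ is the smallest variant, then $F^*(x) = 0$ for $x \le x_1$; if $x_k$ is the largest variant, then $F^*(x) = 1$ for $x > x_k$.

That is, $F^*(x)$ serves to estimate $F(x)$.

If the sample is given by a variation series, the empirical function has the form:

$$F^*(x) = \begin{cases} 0, & x \le x_1, \\ \dfrac{n_1}{n}, & x_1 < x \le x_2, \\ \dfrac{n_1 + n_2}{n}, & x_2 < x \le x_3, \\ \dots \\ 1, & x > x_k. \end{cases}$$

The graph of the empirical function is called the cumulative curve (cumulate).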

Example:

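Plot the empirical function for the given distribution of the sample:

$x_i$ | 2 | 6 | 10
$n_i$ | 12 | 18 | 30

Solution:

Sample size n = 12 + 18 + 30 = 60. The smallest variant is 2, so $F^*(x) = 0$ for $x \le 2$. The event X < 6 (that is, $x_1 = 2$) was observed 12 times, so $F^*(x) = 12/60 = 0.2$ for $2 < x \le 6$. The event X < 10 (that is, $x_1 = 2$, $x_2 = 6$) was observed 12 + 18 = 30 times, so $F^*(x) = 30/60 = 0.5$ for $6 < x \le 10$. Since x = 10 is the largest variant, $F^*(x) = 1$ for x > 10. The desired empirical function has the form:

$$F^*(x) = \begin{cases} 0, & x \le 2, \\ 0.2, & 2 < x \le 6, \\ 0.5, & 6 < x \le 10, \\ 1, & x > 10. \end{cases}$$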

Cumulative curve: a step graph of $F^*(x)$ with the levels 0, 0.2, 0.5 and 1, changing at x = 2, x = 6 and x = 10.

The cumulative curve makes it possible to read information off the graph, for example to determine the number of observations in which the value of the feature was less than 6, or not less than 6. Since $F^*(6) = 0.2$, the number of observations in which the observed value was less than 6 is 0.2 · n = 0.2 · 60 = 12. The number of observations in which the observed value was at least 6 is (1 - 0.2) · n = 0.8 · 60 = 48.
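The empirical function of this example can be sketched in Python as follows. The sketch uses the strict inequality X < x from the definition; the function name `empirical_cdf` is illustrative.

```python
def empirical_cdf(variants, frequencies):
    """Build F*(x) = n_x / n, where n_x counts observations strictly less than x."""
    n = sum(frequencies)

    def f_star(x):
        # Sum the frequencies of all variants that are strictly below x.
        n_x = sum(n_i for x_i, n_i in zip(variants, frequencies) if x_i < x)
        return n_x / n

    return f_star


# Distribution of the example: variants 2, 6, 10 with frequencies 12, 18, 30.
f_star = empirical_cdf([2, 6, 10], [12, 18, 30])
for x in (2, 6, 10, 11):
    print(x, f_star(x))

# Number of observations with a value less than 6, and with a value of at least 6.
less_than_6 = round(f_star(6) * 60)
print(less_than_6, 60 - less_than_6)
```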

If an interval variation series is specified, then to compose an empirical distribution function, the midpoints of the intervals are found and from them an empirical distribution function is obtained similar to a point variation series.

6. Polygon and histogram

For clarity, various graphs of the statistical distribution are built: polygons and histograms.

A frequency polygon is a broken line whose segments connect the points $(x_1; n_1), (x_2; n_2), \dots, (x_k; n_k)$, where $x_i$ are the variants and $n_i$ are the corresponding frequencies.

A polygon of relative frequencies is a broken line whose segments connect the points $(x_1; w_1), (x_2; w_2), \dots, (x_k; w_k)$, where $x_i$ are the variants and $w_i$ are the corresponding relative frequencies.

Example:

Plot the polygon of relative frequencies for the given sample distribution:

Solution:

In the case of a continuous feature, it is advisable to build a histogram: the interval containing all the observed values of the feature is divided into several partial intervals of length h, and for each partial interval $n_i$ is found, the sum of the frequencies of the variants that fall into the i-th interval. (For example, a person's height or weight is a continuous feature.)

A frequency histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio $n_i / h$ (the frequency density).

The area of the i-th partial rectangle is $h \cdot n_i / h = n_i$, the sum of the frequencies of the variants in the i-th interval. Hence the area of the frequency histogram is equal to the sum of all frequencies, that is, to the sample size.

Example:

The results of voltage measurements (in volts) in the power grid are given. Compose a variation series and plot the polygon and histogram of frequencies if the voltage values are as follows: 227, 215, 230, 232, 223, 220, 228, 222, 221, 226, 226, 215, 218, 220, 216, 220, 225, 212, 217, 220.

Solution:

Let's compose a variation series. We have n = 20, $x_{\min} = 212$, $x_{\max} = 232$.

Let's apply the Sturges formula to determine the number of intervals: $k = 1 + 3.322 \lg 20 \approx 5.32 \approx 5$, so the width of each interval is $h = (232 - 212)/5 = 4$.

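The interval variation series of frequencies (interval width h = 4; each interval includes its left end, and the last one also includes 232) is as follows:

Interval | Frequency $n_i$ | Frequency density $n_i / h$
212-216 | 3 | 0.75
216-220 | 3 | 0.75
220-224 | 7 | 1.75
224-228 | 4 | 1
228-232 | 3 | 0.75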
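The interval series for the voltage data can be reproduced with a short Python sketch. Two things are assumptions made here: the choice of k = 5 intervals, and the convention for boundary values (each interval closed on the left, with the maximum assigned to the last interval). Both were chosen so that the densities agree with the values listed above. The function name `interval_series` is illustrative.

```python
voltages = [227, 215, 230, 232, 223, 220, 228, 222, 221, 226,
            226, 215, 218, 220, 216, 220, 225, 212, 217, 220]


def interval_series(data, k):
    """Split [min, max] into k equal intervals; return (left, right, n_i, n_i / h) rows."""
    x_min, x_max = min(data), max(data)
    h = (x_max - x_min) / k
    counts = [0] * k
    for x in data:
        # Assumed convention: intervals are closed on the left,
        # and the largest value is counted in the last interval.
        i = min(int((x - x_min) / h), k - 1)
        counts[i] += 1
    return [(x_min + i * h, x_min + (i + 1) * h, c, c / h)
            for i, c in enumerate(counts)]


rows = interval_series(voltages, 5)
for left, right, n_i, density in rows:
    midpoint = (left + right) / 2  # used as the x-coordinate of the frequency polygon
    print(f"{left:.0f}-{right:.0f}  n_i={n_i}  n_i/h={density}  midpoint={midpoint:.0f}")
```

The midpoints printed in the last column (214, 218, 222, 226, 230) are the abscissas used when a frequency polygon is drawn for an interval series.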

Let's build a histogram of frequencies:

Let's construct a frequency polygon by first finding the midpoints of the intervals:


A histogram of relative frequencies is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio $w_i / h$ (the relative frequency density).

The area of the i-th partial rectangle is equal to the relative frequency of the variants falling into the i-th interval. That is, the area of the histogram of relative frequencies is equal to the sum of all relative frequencies, that is, to one.

7. Numerical characteristics of the variation series

Let's consider the main characteristics of the general and sample populations.

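The general mean $\bar{x}_g$ is the arithmetic mean of the values of the attribute of the general population.

For distinct values $x_1, x_2, x_3, \dots, x_N$ of the attribute in a general population of size N we have:

$$\bar{x}_g = \frac{1}{N} \sum_{i=1}^{N} x_i.$$

If the attribute values $x_1, \dots, x_k$ have the corresponding frequencies $N_1, \dots, N_k$, where $N_1 + N_2 + \dots + N_k = N$, then

$$\bar{x}_g = \frac{1}{N} \sum_{i=1}^{k} x_i N_i.$$

The sample mean $\bar{x}_s$ is the arithmetic mean of the values of the attribute in the sample population.

If the attribute values $x_1, \dots, x_k$ have the corresponding frequencies $n_1, \dots, n_k$, where $n_1 + n_2 + \dots + n_k = n$, then

$$\bar{x}_s = \frac{1}{n} \sum_{i=1}^{k} x_i n_i.$$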

Example:

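Calculate the sample mean for the sample: $x_1 = 51.12$; $x_2 = 51.07$; $x_3 = 52.95$; $x_4 = 52.93$; $x_5 = 51.1$; $x_6 = 52.98$; $x_7 = 52.29$; $x_8 = 51.23$; $x_9 = 51.07$; $x_{10} = 51.04$.

Solution:

$$\bar{x}_s = \frac{51.12 + 51.07 + 52.95 + 52.93 + 51.1 + 52.98 + 52.29 + 51.23 + 51.07 + 51.04}{10} = \frac{517.78}{10} = 51.778.$$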

The general variance $D_g$ is the arithmetic mean of the squared deviations of the values of the attribute X of the general population from the general mean $\bar{x}_g$.

For distinct values $x_1, x_2, x_3, \dots, x_N$ of the attribute in a general population of size N, we have:

$$D_g = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x}_g)^2.$$

If the attribute values $x_1, \dots, x_k$ have the corresponding frequencies $N_1, \dots, N_k$, where $N_1 + N_2 + \dots + N_k = N$, then

$$D_g = \frac{1}{N} \sum_{i=1}^{k} N_i (x_i - \bar{x}_g)^2.$$

The general standard (root-mean-square) deviation is the square root of the general variance: $\sigma_g = \sqrt{D_g}$.

The sample variance $D_s$ is the arithmetic mean of the squared deviations of the observed values of the feature from the sample mean $\bar{x}_s$.

For distinct values $x_1, x_2, x_3, \dots, x_n$ of the attribute in a sample population of size n, we have:

$$D_s = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_s)^2.$$

If the attribute values $x_1, \dots, x_k$ have the corresponding frequencies $n_1, \dots, n_k$, where $n_1 + n_2 + \dots + n_k = n$, then

$$D_s = \frac{1}{n} \sum_{i=1}^{k} n_i (x_i - \bar{x}_s)^2.$$

The sample standard deviation is the square root of the sample variance: $\sigma_s = \sqrt{D_s}$.


Example:

The sample population is specified by the distribution table. Find the sample variance.


Solution:

Theorem: The variance is equal to the difference between the mean of the squares of the feature values and the square of the mean: $D = \overline{x^2} - (\bar{x})^2$.
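The theorem can be checked numerically. The sketch below computes the variance of a frequency distribution twice, once by definition and once as the mean of the squares minus the square of the mean. The distribution (values 1, 2, 3, 4 with frequencies 20, 15, 10, 5) is hypothetical and chosen only for illustration, as are the function names.

```python
def weighted_mean(values, freqs):
    """Arithmetic mean of values taken with the given frequencies."""
    n = sum(freqs)
    return sum(x * f for x, f in zip(values, freqs)) / n


def variance_by_definition(values, freqs):
    """Mean of the squared deviations from the mean."""
    m = weighted_mean(values, freqs)
    n = sum(freqs)
    return sum(f * (x - m) ** 2 for x, f in zip(values, freqs)) / n


def variance_by_theorem(values, freqs):
    """Mean of the squares minus the square of the mean."""
    mean_of_squares = weighted_mean([x * x for x in values], freqs)
    return mean_of_squares - weighted_mean(values, freqs) ** 2


# Hypothetical distribution, not taken from the text.
xs, ns = [1, 2, 3, 4], [20, 15, 10, 5]
d_def = variance_by_definition(xs, ns)
d_thm = variance_by_theorem(xs, ns)
print(d_def, d_thm)
```

The second form is convenient for hand calculation because it avoids computing each deviation separately.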

Example:

Find the variance for the given distribution.



Solution:

8. Statistical estimates of distribution parameters

Let the general population be studied by means of a sample. In this case, only an approximate value of the unknown parameter Q can be obtained, which serves as its estimate. Obviously, estimates can change from one sample to another.

A statistical estimate $Q^*$ of an unknown parameter of the theoretical distribution is a function f of the observed sample values. The task of statistical estimation of unknown parameters from a sample is to construct, from the available observation data, a function that gives the most accurate approximations of the true values of these parameters, which are unknown to the researcher.

Statistical estimates are divided into point and interval estimates, depending on how they are presented (as a number or as an interval).

A point estimate is a statistical estimate of the parameter Q of the theoretical distribution determined by a single value $Q^* = f(x_1, x_2, \dots, x_n)$, where $x_1, x_2, \dots, x_n$ are the results of empirical observations of the quantitative attribute X in a certain sample.

Such parameter estimates obtained from different samples usually differ from each other. The absolute difference $|Q^* - Q|$ is called the sampling (estimation) error.

For statistical estimates to give reliable results about the parameters being estimated, they must be unbiased, efficient and consistent.

A point estimate whose mathematical expectation is equal (not equal) to the estimated parameter is called unbiased (biased). For an unbiased estimate, $M(Q^*) = Q$.

The difference $M(Q^*) - Q$ is called the bias, or systematic error. For unbiased estimates, the bias is 0.

An efficient estimate is an estimate $Q^*$ that, for a given sample size n, has the smallest possible variance: $D(Q^*) = D_{\min}$ (n = const). An efficient estimate has the smallest variance among the unbiased and consistent estimates.

A consistent estimate is a statistical estimate $Q^*$ that, as $n \to \infty$, tends in probability to the estimated parameter Q; that is, as the sample size n grows, the estimate tends in probability to the true value of the parameter Q.

The requirement of consistency agrees with the law of large numbers: the more initial information there is about the object under study, the more accurate the result. If the sample size is small, a point estimate of the parameter can lead to serious errors.

Any sample (of size n) can be thought of as an ordered set $x_1, x_2, \dots, x_n$ of independent identically distributed random variables.

Sample means for different samples of size n from the same general population will differ. That is, the sample mean can be regarded as a random variable, which means that we can speak of the distribution of the sample mean and of its numerical characteristics.

The sample mean satisfies all the requirements imposed on statistical estimates, that is, it gives an unbiased, efficient and consistent estimate of the general mean.

It can be proved that $M(D_s) = \frac{n-1}{n} D_g$. Thus, the sample variance $D_s$ is a biased estimate of the general variance $D_g$ and underestimates it. That is, with a small sample size it gives a systematic error. To obtain an unbiased, consistent estimate, it suffices to take the quantity $s^2 = \frac{n}{n-1} D_s$, which is called the corrected variance. That is,

$$s^2 = \frac{n}{n-1} D_s = \frac{1}{n-1} \sum_{i=1}^{k} n_i (x_i - \bar{x}_s)^2.$$

In practice, the corrected variance is used to estimate the general variance when n < 30. In other cases (n > 30) the difference between $D_s$ and $s^2$ is hardly noticeable. Therefore, for large values of n the bias error is negligible.
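The relation between the biased and the corrected variance can be illustrated with Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1. The sketch assumes the usual definition of the corrected variance as n/(n - 1) times the sample variance. The sample itself is hypothetical.

```python
import statistics

# Hypothetical sample, used only to illustrate the two estimates.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

d_s = statistics.pvariance(sample)  # biased sample variance: divides by n
s2 = statistics.variance(sample)    # corrected variance: divides by n - 1

print(d_s, s2)
# The corrected variance equals the biased one multiplied by n / (n - 1).
print(abs(s2 - n / (n - 1) * d_s) < 1e-12)
```

The factor n/(n - 1) is 8/7 here, but only about 1.03 for n = 30, which is why the correction matters mainly for small samples.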

It can also be proved that the relative frequency $n_i / n$ is an unbiased and consistent estimate of the probability $P(X = x_i)$. The empirical distribution function $F^*(x)$ is an unbiased and consistent estimate of the theoretical distribution function $F(x) = P(X < x)$.

Example:

Find the unbiased estimates of the mean and variance from the sample table.

x i
n i

Solution:

Sample size n = 20.

The unbiased estimate of the mathematical expectation is the sample mean.


To compute the unbiased variance estimate, we first find the sample variance:

Now let's find the unbiased estimate:

9. Interval estimates of distribution parameters

An interval estimate is a statistical estimate determined by two numbers, the ends of the interval.

The number $\delta > 0$ for which $|Q - Q^*| < \delta$ characterizes the accuracy of the interval estimate.

A confidence interval is an interval $(Q^* - \delta;\ Q^* + \delta)$ that covers the unknown value of the parameter Q with a given probability $\gamma$. The complement of the confidence interval with respect to the set of all possible values of the parameter Q is called the critical region. If the critical region lies on only one side of the confidence interval, the confidence interval is called one-sided: left-sided if the critical region exists only on the left, and right-sided if only on the right. Otherwise, the confidence interval is called two-sided.

The reliability, or confidence level, of the estimate of Q (by means of $Q^*$) is the probability $\gamma$ with which the following inequality holds: $|Q - Q^*| < \delta$.

Most often, the confidence level is set in advance (0.95; 0.99; 0.999) and is required to be close to one.

The probability $\alpha = 1 - \gamma$ is called the probability of error, or the significance level.

Let $|Q - Q^*| < \delta$; then $Q^* - \delta < Q < Q^* + \delta$. This means that with probability $\gamma$ it can be asserted that the true value of the parameter Q belongs to the interval $(Q^* - \delta;\ Q^* + \delta)$. The smaller the deviation $\delta$, the more accurate the estimate.

The boundaries (ends) of the confidence interval are called confidence limits, or critical limits.

The values of the boundaries of the confidence interval depend on the distribution law of the estimate $Q^*$.

The deviation $\delta$, equal to half the width of the confidence interval, is called the accuracy of the estimate.

Methods for constructing confidence intervals were first developed by the American statistician J. Neyman. The accuracy of the estimate $\delta$, the confidence level $\gamma$ and the sample size n are related. Therefore, knowing the values of two of these quantities, one can always calculate the third.

Finding the confidence interval for estimating the mathematical expectation of a normal distribution, if the standard deviation is known.

Let the sample be drawn from a general population that follows the normal distribution law. Let the general standard deviation $\sigma$ be known, while the mathematical expectation a of the theoretical distribution is unknown.

The following formula is valid:

$$P\left(|\bar{x}_s - a| < \delta\right) = 2\Phi\left(\frac{\delta \sqrt{n}}{\sigma}\right) = 2\Phi(t) = \gamma, \qquad \delta = \frac{t \sigma}{\sqrt{n}},$$

where $\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_0^t e^{-z^2/2}\,dz$ is the Laplace function and t is found from the equation $2\Phi(t) = \gamma$.

That is, for a given deviation $\delta$ one can find the probability with which the unknown general mean belongs to the interval $(\bar{x}_s - \delta;\ \bar{x}_s + \delta)$, and vice versa. The formula shows that as the sample size grows with the confidence level fixed, $\delta$ decreases, that is, the accuracy of the estimate increases. As the reliability (confidence level) increases, $\delta$ increases, that is, the accuracy of the estimate decreases.

Example:

As a result of the tests, the following values ​​were obtained -25, 34, -20, 10, 21. It is known that they obey the law of normal distribution with a standard deviation of 2. Find the estimate a * for the mathematical expectation a. Plot a 90% confidence interval for it.

Solution:

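Find the unbiased estimate of the mathematical expectation:

$$a^* = \bar{x}_s = \frac{-25 + 34 - 20 + 10 + 21}{5} = 4.$$

Then, from $2\Phi(t) = 0.9$ we get $\Phi(t) = 0.45$ and, from the table of the Laplace function, $t = 1.645$, so

$$\delta = \frac{t \sigma}{\sqrt{n}} = \frac{1.645 \cdot 2}{\sqrt{5}} \approx 1.47.$$

The confidence interval for a is: 4 - 1.47 < a < 4 + 1.47, or 2.53 < a < 5.47.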
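This calculation can be repeated in Python with the standard library. Instead of a table lookup, the sketch obtains t from the inverse of the standard normal distribution function at (1 + γ)/2, which is equivalent to solving 2Φ(t) = γ for the Laplace function. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist, fmean


def mean_interval_known_sigma(sample, sigma, gamma):
    """Return (a_star, delta) for the interval a_star - delta < a < a_star + delta."""
    n = len(sample)
    a_star = fmean(sample)  # unbiased estimate of the expectation
    # 2 * Phi(t) = gamma is equivalent to a standard normal CDF value of (1 + gamma) / 2.
    t = NormalDist().inv_cdf((1 + gamma) / 2)
    delta = t * sigma / sqrt(n)
    return a_star, delta


a_star, delta = mean_interval_known_sigma([-25, 34, -20, 10, 21], sigma=2, gamma=0.9)
print(round(a_star - delta, 2), round(a_star + delta, 2))
```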

Finding the confidence interval for estimating the mathematical expectation of a normal distribution if the standard deviation is unknown.

Let it be known that the general population follows the normal distribution law, where a and $\sigma$ are unknown. The accuracy $\delta$ of the confidence interval that covers the true value of the parameter a with reliability $\gamma$ is in this case calculated by the formula:

$$\delta = \frac{t_\gamma s}{\sqrt{n}},$$

where n is the sample size, s is the corrected sample standard deviation, and $t_\gamma$ is Student's coefficient (it should be found from the given values of n and $\gamma$ in the table "Critical points of the Student's distribution").

Example:

As a result of the tests, the following values ​​were obtained -35, -32, -26, -35, -30, -17. It is known that they obey the law of normal distribution. Find the confidence interval for the mathematical expectation in the general population with a confidence level of 0.9.

Solution:

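Find the unbiased estimate of the mean:

$$\bar{x}_s = \frac{-35 - 32 - 26 - 35 - 30 - 17}{6} = -\frac{175}{6} \approx -29.2.$$

Find the corrected standard deviation:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_s)^2} = \sqrt{\frac{234.83}{5}} \approx 6.85.$$

Then, with $t_\gamma \approx 2.01$ from the table for $\gamma = 0.9$ and n - 1 = 5 degrees of freedom,

$$\delta = \frac{t_\gamma s}{\sqrt{n}} = \frac{2.01 \cdot 6.85}{\sqrt{6}} \approx 5.62.$$

The confidence interval becomes (-29.2 - 5.62; -29.2 + 5.62), or (-34.82; -23.58).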
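A sketch of the same calculation in Python is given below. The standard library has no Student quantile function, so the coefficient is passed in as a table value. The figure 2.01 (two-sided, significance level 0.10, 5 degrees of freedom) is an assumed table lookup, and the function name is illustrative.

```python
from math import sqrt
from statistics import fmean, stdev


def mean_interval_unknown_sigma(sample, t_gamma):
    """Return (x_bar, s, delta) for the interval x_bar - delta < a < x_bar + delta."""
    n = len(sample)
    x_bar = fmean(sample)
    s = stdev(sample)  # corrected standard deviation (divides by n - 1)
    delta = t_gamma * s / sqrt(n)
    return x_bar, s, delta


# t_gamma = 2.01 is an assumed table value for gamma = 0.9 and 5 degrees of freedom.
x_bar, s, delta = mean_interval_unknown_sigma([-35, -32, -26, -35, -30, -17], t_gamma=2.01)
print(round(x_bar, 2), round(s, 2), round(delta, 2))
```

Because the code keeps the unrounded mean -29.17 rather than -29.2, its interval ends differ from the hand calculation in the second decimal place.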

Finding the Confidence Interval for the Variance and Standard Deviation of the Normal Distribution

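Let a random sample of size n < 30 be taken from a general population of values distributed according to the normal law, and let the sample variances be calculated for it: the biased $D_s = \sigma_s^2$ and the corrected $s^2$. Then, to find interval estimates with a given reliability $\gamma$ for the general variance $D_g$ and the general standard deviation $\sigma_g$, the following formulas are used:

$$\frac{\sqrt{n}\,\sigma_s}{\chi_2} < \sigma_g < \frac{\sqrt{n}\,\sigma_s}{\chi_1},$$

or,

$$\frac{\sqrt{n-1}\,s}{\chi_2} < \sigma_g < \frac{\sqrt{n-1}\,s}{\chi_1}.$$

The values $\chi_1^2$ and $\chi_2^2$ are found from the table of critical points of the $\chi^2$ (Pearson) distribution with k = n - 1 degrees of freedom, for the probabilities $(1 + \gamma)/2$ and $(1 - \gamma)/2$ respectively.

The confidence interval for the variance is found from these inequalities by squaring all parts of the inequality.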
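The sketch below shows how such an interval could be computed in Python. It assumes the standard chi-square form of the interval, $(n-1)s^2/\chi_2^2 < D_g < (n-1)s^2/\chi_1^2$. The two critical values are assumed lookups from a chi-square table for 14 degrees of freedom (upper-tail probabilities 0.975 and 0.025). The inputs (n = 15, biased standard deviation 5, reliability 0.95) and the function name are illustrative.

```python
from math import sqrt


def sigma_interval(n, s2, chi2_lower, chi2_upper):
    """Interval for the standard deviation from the corrected variance s2.

    chi2_lower and chi2_upper are chi-square critical values with n - 1
    degrees of freedom, taken from a table.
    """
    k = n - 1
    return sqrt(k * s2 / chi2_upper), sqrt(k * s2 / chi2_lower)


n = 15
d_s = 5 ** 2             # biased sample variance (illustrative input)
s2 = n / (n - 1) * d_s   # corrected variance

# Assumed table values for 14 degrees of freedom and reliability 0.95.
CHI2_LOWER, CHI2_UPPER = 5.629, 26.119

low, high = sigma_interval(n, s2, CHI2_LOWER, CHI2_UPPER)
print(round(low, 2), round(high, 2))
```

Squaring both ends gives the corresponding interval for the variance.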

Example:

The quality of 15 bolts was checked. Assuming that the error in their manufacture follows the normal distribution law and that the sample standard deviation $\sigma_s$ is 5 mm, determine with reliability $\gamma = 0.95$ a confidence interval for the unknown parameter $\sigma$.

We represent the boundaries of the interval as a double inequality:

The ends of the two-sided confidence interval for variance can be determined without performing arithmetic operations on a given confidence level and sample size using the appropriate table (Boundaries of confidence intervals for variance depending on the number of degrees of freedom and reliability). To do this, the ends of the interval obtained from the table are multiplied by the corrected variance s 2.

Example:

Let's solve the previous problem in a different way.

Solution:

Let's find the corrected variance:

$$s^2 = \frac{n}{n-1} D_s = \frac{15}{14} \cdot 5^2 \approx 26.79.$$

Using the table "Boundaries of confidence intervals for variance depending on the number of degrees of freedom and reliability", we find the boundaries of the confidence interval for the variance at k = 14 and $\gamma = 0.95$: the lower limit is 0.513 and the upper limit is 2.354.

Multiply the resulting boundaries by $s^2$ and extract the root (since we need the confidence interval not for the variance, but for the standard deviation):

$$\sqrt{0.513 \cdot 26.79} < \sigma < \sqrt{2.354 \cdot 26.79}, \quad \text{that is,} \quad 3.71 < \sigma < 7.94.$$

As can be seen from the examples, the value of the confidence interval depends on the method of its construction and gives similar, but different results.

With sufficiently large samples (n > 30), the boundaries of the confidence interval for the general standard deviation can be determined by the formula $s(1 - q) < \sigma < s(1 + q)$, where q is a number that is tabulated as a function of $\gamma$ and n and given in the corresponding reference table.

If q > 1, the formula takes the form: $0 < \sigma < s(1 + q)$.

Example:

Let's solve the previous problem in the third way.

Solution:

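Previously we found s = 5.17. From the table, q(0.95; 15) = 0.46.

Then:

$$5.17 \cdot (1 - 0.46) < \sigma < 5.17 \cdot (1 + 0.46), \quad \text{that is,} \quad 2.79 < \sigma < 7.55.$$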
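For completeness, the third method can be sketched in Python as well. It assumes the form s(1 - q) < σ < s(1 + q), with the lower end replaced by zero when q exceeds one. The value of q still has to come from a table, and the function name is illustrative.

```python
def sigma_interval_q(s, q):
    """Interval s(1 - q) < sigma < s(1 + q); the lower end is 0 when q > 1."""
    lower = max(0.0, s * (1 - q))
    upper = s * (1 + q)
    return lower, upper


# s = 5.17 and q(0.95; 15) = 0.46 are the values quoted in the example.
low_q, high_q = sigma_interval_q(5.17, 0.46)
print(round(low_q, 2), round(high_q, 2))
```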