# Data mining: optimizing goods orders in a pharmacy

A small pharmacy needs a flexible ordering system for medicines and parapharmaceutical products, one that responds to constant market fluctuations. Today, single-outlet pharmacies do not have sufficient storage space (stock rooms), so the person responsible for orders has to place them daily from a consolidated price list covering several suppliers, avoiding duplicates, choosing the lowest prices, and excluding products with an unsuitable shelf life. On top of that, the full product range runs to several tens of thousands of items.

We live in a modern world where computers perform routine operations for us. So you might say: "Let's use a computer, and it will do all the dirty work for us! You have a database with sales statistics for the various drugs, don't you? So why not use those statistics to forecast sales and generate the order for the required drugs automatically?"

Yes, to a first approximation you would be right. Such solutions exist in the software systems that automate pharmacies in Russia. But there is one very big "but": none of these solutions works correctly until you create product groups.

Let me explain. There is a drug listed as "Donormil 15mg Tab. X30", manufactured by Upsa Laboratoires (France), and another listed as "Donormil Tab. 15 mg No. 30", Aventis / Bristol-Myers Squibb (France). In the database these are two completely different drugs with different identifiers and different names, but they are one and the same product. If you compute statistics for them as two separate products, you will get the wrong result.

To obtain reliable information on product movement and on what needs to be ordered, you have to create groups of identical goods. As a rule, the groups are created manually, which means processing a large number of records. If there is a dedicated employee who can spend a long time filling out the "Product Groups" directory, this is more or less realistic (keeping in mind that the directory must be updated constantly as new products arrive). In a small pharmacy, where usually two or three people work behind the counter, it is simply not possible.

The question arises: which algorithms can find identical drugs? The first thing that comes to mind is the Levenshtein distance. But here we run into a limitation of that algorithm: the Levenshtein distance between the different products "Linex caps. x16" and "Linex caps. x32" (a distance of two) is smaller than the distance between the identical products "Linex caps. x16" and "Linex N16 caps" (a distance of nine). The problem is that suppliers can swap words and use different abbreviations for quantity (some write No., some N, some X), volume, and so on. Barcodes cannot be used to group identical products either: the same drug produced by different factories has different barcodes, and even the same drug produced at one factory may get a different barcode after re-registration.
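The limitation is easy to reproduce with a minimal sketch of the classic dynamic-programming edit distance. The function name `levenshtein` is only illustrative, and the English spellings of the product names are used, so the exact distance for the reordered name may differ from the figure quoted for the original Russian strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


# Different pack sizes, i.e. different products:
different = levenshtein("Linex caps. x16", "Linex caps. x32")
# Same product, words reordered by the supplier:
same = levenshtein("Linex caps. x16", "Linex N16 caps")
print(different, same)
```

The different products end up closer to each other than the two spellings of the same product, which is exactly why a plain edit distance is a poor grouping criterion here.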

After a long search, I arrived at the following algorithm for finding identical drugs:

1. For the first approximation (searching for "similar" drugs), the N-gram algorithm is used. It splits each string into all of its substrings of a given length N (I chose N = 3) and counts how many of them also occur in the other string. The number of matches divided by the total number of substrings is taken as the similarity coefficient of the two strings and returned as the result of the function.

For example, the names Linex caps N16 and Linex N16 caps (transliterated from the Russian spelling as `Lineks kaps N16` and `Lineks N16 kaps`, 13 trigrams each) are split into 3-grams and compared in both directions. The symbol `␣` marks a space.

| Trigram of string 1 | Found in string 2 | Trigram of string 2 | Found in string 1 |
|---|---|---|---|
| `Lin` | Yes | `Lin` | Yes |
| `ine` | Yes | `ine` | Yes |
| `nek` | Yes | `nek` | Yes |
| `eks` | Yes | `eks` | Yes |
| `ks␣` | Yes | `ks␣` | Yes |
| `s␣k` | No | `s␣N` | Yes |
| `␣ka` | Yes | `␣N1` | Yes |
| `kap` | Yes | `N16` | Yes |
| `aps` | Yes | `16␣` | Yes |
| `ps␣` | No | `6␣k` | No |
| `s␣N` | Yes | `␣ka` | Yes |
| `␣N1` | Yes | `kap` | Yes |
| `N16` | Yes | `aps` | Yes |
| Matches | 11 of 13 | Matches | 12 of 13 |

Similarity coefficient: (11 + 12) / (13 + 13) = 0.88.

In this way we group drugs whose names contain the same keywords in a different order. However, this algorithm has two drawbacks:

a) The algorithm also groups drugs that differ only in "x10" versus "x20", "g" versus "mg", and so on;

b) The load on the database increases dramatically, because the n-gram dictionary is very large. For example, for an inventory directory of approximately 30 thousand records, the 3-gram dictionary contains 900 thousand records. For a price list of 47 thousand entries (a combined price list from several suppliers), the dictionary already contains 1.7 million entries.
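The trigram comparison from step 1 can be sketched in a few lines. This is an in-memory illustration with hypothetical function names (`ngrams`, `ngram_similarity`), not the author's database implementation. It uses a strict substring comparison, so for the transliterated example it yields 22/26 ≈ 0.85 rather than 0.88, because the article's table counts the trigram `16 ` as a match.

```python
def ngrams(s: str, n: int = 3) -> list[str]:
    """All overlapping substrings of length n, in order of appearance."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """(matches of a in b + matches of b in a) / (total number of n-grams)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    set_a, set_b = set(grams_a), set(grams_b)
    hits_a = sum(g in set_b for g in grams_a)  # n-grams of a found in b
    hits_b = sum(g in set_a for g in grams_b)  # n-grams of b found in a
    total = len(grams_a) + len(grams_b)
    return (hits_a + hits_b) / total if total else 0.0


# Same product, keywords swapped: high similarity, as intended.
print(round(ngram_similarity("Lineks kaps N16", "Lineks N16 kaps"), 2))
# Drawback (a): different pack sizes still look almost identical.
print(round(ngram_similarity("Linex caps x10", "Linex caps x20"), 2))
```

Each name of about 15 characters contributes roughly a dozen trigrams, which gives a feel for why the dictionary grows to hundreds of thousands of rows for a full price list.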

2. For the final filling of the "Product Groups" directory (after which human review is still required), the original inventory directory is split into a dictionary of words. A meta-directory of the "key" properties of drugs is created:

a) Volume;

b) Quantity;

c) The amount of active substance;

d) Drug substance;

e) Color, taste;

etc.

The meta-directory contains synonyms, equivalences (for example, 1 g = 1000 mg), and rules for finding each property in the dictionary. By excluding from the "similar" products (the result of the first algorithm) those whose "key" properties differ, we obtain the "Product Groups" directory.

This algorithm fills in the "Product Groups" directory automatically; the user then only has to review and edit it.
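As a rough illustration of the second step, the sketch below extracts two "key" properties (pack quantity and dose) with regular expressions and rejects candidate pairs whose properties differ. The patterns, the unit table, and the names `key_properties` and `same_group` are assumptions made for this example; the real meta-directory described above is stored in the database and covers more properties and synonyms.

```python
import re

# Hypothetical synonym rules: "No. 30", "N30" and "X30" all mean a pack of 30.
QUANTITY_RE = re.compile(r"\b(?:no\.?|n|x)\s*(\d+)\b", re.IGNORECASE)
# Dose with unit; decimal comma or point allowed.
DOSE_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*(mg|g)\b", re.IGNORECASE)
# Equivalences from the meta-directory, e.g. 1 g = 1000 mg.
TO_MG = {"mg": 1.0, "g": 1000.0}


def key_properties(name: str) -> dict:
    """Extract normalized key properties found in a product name."""
    props = {}
    quantity = QUANTITY_RE.search(name)
    if quantity:
        props["quantity"] = int(quantity.group(1))
    dose = DOSE_RE.search(name)
    if dose:
        value = float(dose.group(1).replace(",", "."))
        props["dose_mg"] = value * TO_MG[dose.group(2).lower()]
    return props


def same_group(a: str, b: str) -> bool:
    """True unless a property present in both names has different values."""
    pa, pb = key_properties(a), key_properties(b)
    return all(pa[k] == pb[k] for k in pa.keys() & pb.keys())


print(key_properties("Donormil 15mg Tab. X30"))
print(same_group("Donormil 15mg Tab. X30", "Donormil Tab. 15 mg No. 30"))
print(same_group("Linex caps. x16", "Linex caps. x32"))
```

In this sketch a pair passes when no shared property contradicts, so names with no recognizable properties are left for the user to check, in line with the manual review the article mentions.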

The next question I had to solve was which forecasting algorithm to use. Since I did not want to use complex, resource-intensive algorithms and, moreover, the pharmacy was only in its first year of operation (so there was no seasonality to model yet), I chose the Double Exponential Smoothing algorithm.

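In the standard form of double exponential smoothing, the formulas look like this:

$$S_t = \alpha\, y_t + (1 - \alpha)\,(S_{t-1} + b_{t-1})$$

$$b_t = \beta\,(S_t - S_{t-1}) + (1 - \beta)\, b_{t-1}$$

where the smoothing parameters $\alpha$ and $\beta$ take values from the range [0; 1];

$y_t$ is the actual number of sales in period $t$, $S_t$ is the smoothed level, and $b_t$ is the trend estimate.

To predict the next value, use the formula:

$$F_{t+1} = S_t + b_t$$

To predict several values ahead ($m$ periods):

$$F_{t+m} = S_t + m\, b_t$$

As we can see, to calculate the forecast you need to know the values of two parameters, $\alpha$ and $\beta$. Their optimal values are chosen so as to minimize the squared prediction error (the sum of the squares of the differences between the actual number of goods sold and the forecast). Thus, we face the classical problem of finding the minimum of a function of several variables with linear constraints.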
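The fitting problem can be illustrated with a small sketch. The article solves the minimization with MINPACK; here a coarse grid search over the two smoothing parameters is swapped in purely to keep the example self-contained. The initialization (first observation as the level, first difference as the trend), the function names, and the sample sales figures are all assumptions.

```python
def run_smoothing(y, alpha, beta):
    """Run double exponential smoothing over y.

    Returns (sum of squared one-step-ahead errors, final level, final trend).
    Assumed initialization: level = y[0], trend = y[1] - y[0].
    """
    level, trend = float(y[0]), float(y[1] - y[0])
    sse = 0.0
    for obs in y[1:]:
        prediction = level + trend  # one-step-ahead forecast
        sse += (obs - prediction) ** 2
        new_level = alpha * obs + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return sse, level, trend


def fit_parameters(y, steps=20):
    """Grid search for the (alpha, beta) pair in [0, 1] x [0, 1] with minimal SSE."""
    grid = [i / steps for i in range(steps + 1)]
    _, alpha, beta = min((run_smoothing(y, a, b)[0], a, b)
                         for a in grid for b in grid)
    return alpha, beta


def forecast(y, alpha, beta, horizon):
    """Forecasts for 1..horizon periods ahead: level + m * trend."""
    _, level, trend = run_smoothing(y, alpha, beta)
    return [level + m * trend for m in range(1, horizon + 1)]


weekly_sales = [12, 15, 14, 18, 21, 20, 24]  # made-up data for illustration
alpha, beta = fit_parameters(weekly_sales)
print(alpha, beta, forecast(weekly_sales, alpha, beta, 3))
```

A grid of 21 × 21 points is cheap for one product, but for tens of thousands of products a proper minimizer such as the one the article uses is the more sensible choice.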

At school, when I was just learning Fortran, my dad bought the book Computer Methods for Mathematical Computations by G. Forsythe, M. Malcolm, and C. Moler. I remember my surprise when I first encountered floating-point arithmetic and the concept of "machine epsilon". Recalling this book, I found in it an algorithm for finding a minimum, but only for a function of one variable. For functions of several variables, the authors referred the reader to the MINPACK package from Argonne National Laboratory, which was still unfinished in 1977, when the book was written. Imagine my surprise when I found this package written in Fortran, then found a C/C++ version of MINPACK, and even talked with the author of the port from Fortran to C.

To date, I have implemented a software package for forecasting drug sales that consists of:

a) Libraries for MS SQL Server (DLLs), implemented as extended stored procedures written in C++, which calculate the forecast;

b) An MS SQL Server database containing dictionaries, metadata, and stored procedures: forecast calculation, matching of the inventory against the price list, and others;

c) A client application in which the user runs the forecast and works with directories and price lists.

Introducing the software package has made ordering goods several times faster and increased the efficiency of the pharmacy. And, more importantly, it has returned my wife to the family!

The software package that I implemented is tightly tied to one vendor that automates the pharmacy business in Russia. Because the approach is general, a similar solution can be implemented for other software products. The algorithm can be applied not only in the pharmacy business but in any other business with a large range of similar products, where the inventory directory has to be consolidated and sales forecast in the face of fierce competition and constant market changes.