Chebyshevskii Sbornik

The construction and analysis of the Russian language models for a cryptographic algorithm research

https://doi.org/10.22405/2226-8383-2022-23-2-151-160

Abstract

The article presents a statistical analysis of the properties of lexical and n-gram models of the Russian language, based on a corpus of news texts. A specialized corpus of recent political news articles was created, reflecting a narrow domain of language use. Token and n-gram dictionaries are compiled, and their coverage and entropy values are computed. The original text corpus is lemmatized, and the dictionary sizes are extrapolated.
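
As a rough illustration of the quantities named in the abstract, the following minimal sketch builds an n-gram dictionary from a token sequence and computes its coverage and Shannon entropy. It assumes a corpus that is already tokenized; the function names and the toy token sequence are hypothetical and are not taken from the article, whose exact procedure is not described on this page.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token sequence (the n-gram dictionary)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def coverage(counts, top_k):
    """Share of all n-gram occurrences covered by the top_k most frequent entries."""
    total = sum(counts.values())
    covered = sum(c for _, c in counts.most_common(top_k))
    return covered / total


def entropy_bits(counts):
    """Shannon entropy, in bits, of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy token sequence standing in for a real corpus (illustrative assumption).
tokens = "a b a b a c".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(coverage(unigrams, 1))   # fraction of tokens covered by the most frequent token
print(entropy_bits(bigrams))   # entropy of the bigram distribution
```

On a real corpus the same functions could be applied to lemmatized tokens to compare dictionary sizes and coverage before and after lemmatization.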

About the Authors

Anastasia Gennad’evna Malashina
National Research University «Higher School of Economics»
Russian Federation


Alexey Borisovich Los
National Research University «Higher School of Economics»
Russian Federation

Candidate of Technical Sciences, associate professor



For citations:


Malashina A.G., Los A.B. The construction and analysis of the Russian language models for a cryptographic algorithm research. Chebyshevskii Sbornik. 2022;23(2):151-160. (In Russ.) https://doi.org/10.22405/2226-8383-2022-23-2-151-160


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2226-8383 (Print)