THESIS APPROVAL FORM

Academic year: 2024

This diploma thesis was submitted in partial fulfillment of the requirements for the master's degree in Eurasian studies. The thesis investigates the status quo of the Kazakh language from the perspective of corpus linguistics. The aim of the project is to review the existing corpora of the Kazakh language and to contribute a more flexible and more automated way of building and annotating a corpus.

A review of the field showed that efforts to digitize the Kazakh language are still developing. Therefore, the aim of this project was to build a journalistic Kazakh language corpus using neural networks for part-of-speech tagging. The final stage of the study involves using neural networks to assign words to their parts of speech.

Neural networks provide an automated way to perform part-of-speech tagging that is faster than manual annotation, with accuracy that can come close to that of human annotators.

Background

Some of the voice assistive technology applications rely heavily on speech synthesis – using the vocabulary according to the situation. Therefore, there is a ready base of the devices that can be used for the deployment of voice assistants. My research project will focus on adapting the existing corpus analysis methodology to the Kazakh language and attempt to bring neural.

There are several reasons for this, but the most important one is that although the Kazakh language seems to be largely on track in terms of being digitized, it is still somewhat understudied in terms of corpus methodology: little corpus-based work has been done for Kazakh so far. The authors of the corpus claim to have collected millions of words across various texts.

Furthermore, even if there were a corpus of the Kazakh language extensive enough to work well with voice-support software, there is very little that can be done about it.

Theoretical foundation

To change the state of the door from closed to open (from A to B), someone must pull the door – the action of pulling is transition 1, and the door changes state, as reflected by the finite state machine.

The “World Atlas of Language Structures” can give a good general starting idea of the language family to which the Kazakh language belongs. However, some aspects of the Kazakh language are overlooked in this book: for example, the superlative degree via doubling of the initial syllable is not mentioned as a possible option.
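The door example from the start of this section can be written down as a very small finite state machine. The sketch below is only an illustration: the state and action names, and the reverse "push" transition, are my own assumptions and do not come from the thesis.

```python
# Hypothetical two-state machine for the door example.
# State A = "closed", state B = "open"; "pull" is transition 1 from the text.
TRANSITIONS = {
    ("closed", "pull"): "open",
    ("open", "push"): "closed",  # assumed reverse transition, not in the source
}


def step(state, action):
    """Return the next state; an action that does not apply leaves the state unchanged."""
    return TRANSITIONS.get((state, action), state)


def run(state, actions):
    """Apply a sequence of actions and return the final state."""
    for action in actions:
        state = step(state, action)
    return state
```

Calling `run("closed", ["pull"])` moves the machine from A to B; any action without an entry in the table is simply ignored.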

Muhamedowa emphasizes the fact that Kazakh is an agglutinative language; this same fact was emphasized by Bekbulatov et al. as one of the features that make machine translation problematic. One of the more practical considerations in terms of morphological analysis is the extensive inventory of "parts of speech". The morphological features of the Kazakh language in relation to machine translation have been addressed before, although articles on the topic are rare.

One of the articles about the Kazakh language and its challenges for machine interpretation is called “A Study of sure.

Methodology

I chose Internet news articles as my research medium because they are one of the most useful and diverse sources available. Overall, the book explores the potential of the Web as a linguistic medium for quantitative analysis. However, one issue that is not addressed in the book is the problem of a language having more than one writing system, which is the case with the Kazakh language, especially online.
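One way to handle the multiple writing systems when collecting text online is to check which script dominates a given passage before adding it to the corpus. The following is a minimal sketch under my own assumptions: the function names are hypothetical, and the three scripts checked (Cyrillic, Latin, Arabic) are chosen for illustration rather than taken from the thesis.

```python
import unicodedata
from collections import Counter

SCRIPTS = ("CYRILLIC", "LATIN", "ARABIC")  # assumed set of scripts of interest


def script_counts(text):
    """Count alphabetic characters per script, based on their Unicode names."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
        else:
            counts["OTHER"] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script in the text, or None if it has no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None
```

A passage such as "Қазақ тілі" is reported as Cyrillic and "Qazaq tili" as Latin, so articles in different scripts could be routed to separate processing steps.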

However, they are often highly specialized in the data they can work with. Ullman (1986) is one of the best-known books covering the basics and practices of compiler creation. There are several ways to split an input stream (in our case, text) and assign categories to the individual units, but one of the most common methods is pattern matching.
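Splitting text by pattern matching, in the way a compiler's lexer splits source code, can be sketched with regular expressions. The token categories and patterns below are illustrative assumptions, not the rules used in the project.

```python
import re

# Hypothetical token categories; order matters because the first match wins.
TOKEN_SPEC = [
    ("WORD", r"[^\W\d_]+(?:-[^\W\d_]+)*"),  # letters in any script, optional hyphens
    ("NUMBER", r"\d+(?:[.,]\d+)?"),
    ("PUNCT", r"[^\w\s]"),
    ("SPACE", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return (category, text) pairs; whitespace is matched but not returned."""
    tokens = []
    # Characters that match no pattern (for example "_") are skipped by finditer.
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind != "SPACE":
            tokens.append((kind, match.group()))
    return tokens
```

Each unit of the input is assigned the category of the first pattern it matches, which is the same principle a lexer applies to a programming language.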

In particular, one of the big differences is probably expressed in the methodology of reading words due to the features of Kazakh as an. While there are certainly differences between natural languages and formal, artificial languages such as programming languages, ambiguity is one of the biggest. Thus, with the help of a set of rules, the machine can process the text.

One of the methods of corpus analysis is pattern recognition; these patterns are much broader than the patterns described in the section on software. There is a large body of scholarly literature showing the ways in which corpus linguistics can be used for the benefit of society. The authors first show the shortcomings of the current legal system, and offer ways in which other systems can be introduced.

As mentioned earlier, at this point the neural network will not consider the derived suffixes as part of the morphology. The flexible structure of the corpus will allow the words with derived suffixes present in the database to be further analyzed into the derived morphemes and roots.

Practical approach

In addition, the selected websites generally display little or no advertising in the body of the news, making them good candidates for the research. These can also appear at any point in the HTML code, sometimes in the middle of the article, as shown in Appendix A. One of the prerequisites for the project is the fixed word order of Kazakh.
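Because unwanted fragments can sit anywhere in the HTML, including the middle of an article, the article text has to be filtered rather than copied wholesale. The sketch below uses Python's standard `html.parser` module and assumes, purely for illustration, that article text lives in `<p>` elements and that unwanted content is wrapped in `<script>`, `<style>`, or `<aside>` elements; the real sites may be structured differently.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect paragraph text while ignoring the content of skipped elements."""

    SKIP_TAGS = {"script", "style", "aside"}  # assumed wrappers of unwanted content

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Normalize whitespace and keep only non-empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph texts found in an HTML string."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Inline markup such as `<b>` is kept as plain text, so a paragraph interrupted by a script still comes out as one continuous string.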

The index for the word "dərıger" in the first sentence is (3), but (1) in the other two sentences. As is the case with many SOV languages, Kazakh is a pro-drop language – the subject is likely to be dropped, as speakers can generally infer the subject of the sentence from the context. Although there is no subject in the sentence "Кітап оқыдық" ([We] read a book), the personal suffix at the end of the verb indicates the missing element of the sentence.

The diagram above is a rather simplified representation of how I wanted to store the data in the machine. The tests were performed using the worst-case scenario: the very last element of the array, “я”, was repeated n times. The second character retrieved and counted as the second part of the first symbol (supposedly a space) is the first character of M.
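The misalignment described above is what happens when a reader assumes every symbol has the same byte width. The sketch below assumes the text is encoded in UTF-8, which the excerpt does not state explicitly; in that encoding an ASCII space takes one byte and a Cyrillic letter takes two.

```python
# Assumption: the input is UTF-8. The sample is a space followed by Cyrillic "М" (U+041C).
text = " М"
raw = text.encode("utf-8")

# Byte width of each character: the space takes 1 byte, the Cyrillic letter 2.
widths = [len(ch.encode("utf-8")) for ch in text]

# A naive reader that treats every symbol as exactly 2 bytes pairs the space
# with the first byte of "М" and leaves the second byte of "М" on its own.
naive_chunks = [raw[i:i + 2] for i in range(0, len(raw), 2)]

# Decoding the whole byte string at once handles mixed widths correctly.
decoded = list(raw.decode("utf-8"))
```

Here `naive_chunks` holds two pieces that are not valid characters on their own, while `decoded` recovers the space and the letter as separate symbols.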

One possible way to solve the problem is to run the characters through two separate functions. The main idea of the neural network will be to recognize the patterns present in the language in terms of syntax, analyze the input (the sentence to be parsed), and output separate words with their classifier (noun, verb, etc.). One of the most suitable approaches is to implement a recurrent neural network.
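To make the recurrent idea concrete, the sketch below shows the forward pass of a very small Elman-style recurrent network that reads one word vector at a time and outputs a probability for each tag. It is an untrained illustration written in plain Python: the class name, layer sizes, random initialization, and the absence of bias terms are all my own simplifying assumptions, not the architecture used in the project.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRNNTagger:
    """Hypothetical untrained recurrent tagger: forward pass only, no biases."""

    def __init__(self, input_size, hidden_size, num_tags, seed=0):
        rng = random.Random(seed)
        self.w_xh = make_matrix(hidden_size, input_size, rng)   # input -> hidden
        self.w_hh = make_matrix(hidden_size, hidden_size, rng)  # hidden -> hidden
        self.w_hy = make_matrix(num_tags, hidden_size, rng)     # hidden -> tags
        self.hidden_size = hidden_size

    def forward(self, inputs):
        """Return one tag distribution per input vector, in sentence order."""
        hidden = [0.0] * self.hidden_size
        outputs = []
        for x in inputs:
            pre = [a + b for a, b in zip(matvec(self.w_xh, x), matvec(self.w_hh, hidden))]
            hidden = [math.tanh(v) for v in pre]  # the state carries earlier words forward
            outputs.append(softmax(matvec(self.w_hy, hidden)))
        return outputs
```

The hidden state is what lets the prediction for a word depend on the words that came before it, which is why a recurrent structure suits sentence-level tagging.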

In terms of this project's natural language processing framework, "occurrence time" is most closely aligned with word position in a sentence. Depending on the position of the word in the sentence and the context, the word in question can change the meaning greatly. The word in the first percentile is likely to be the subject of the sentence according to the model's prediction.
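Word position can be turned into a numeric feature by scaling each word's index to the range from 0 to 1, so that sentences of different lengths are comparable. This is a minimal sketch of that idea; the function name and the exact scaling are assumptions, and the thesis may compute position differently.

```python
def relative_positions(sentence):
    """Pair each word with its position scaled to [0, 1] (0 = first, 1 = last)."""
    words = sentence.split()
    if not words:
        return []
    last = max(len(words) - 1, 1)  # avoid dividing by zero for one-word sentences
    return [(word, index / last) for index, word in enumerate(words)]
```

For the three-word sentence "Мен кітап оқыдым" this gives 0.0 for the subject, 0.5 for the object, and 1.0 for the verb, matching the SOV skeleton.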

Some of the failures can be attributed to the fact that while the skeletal structure of a Kazakh sentence is SOV, there is no guarantee that a subject will come first in the sentence.

Conclusion

For example, assignment of adjectives was not a problem for this iteration of the neural network, since adjectives generally have predictable morphological patterns, such as suffixes and initial reduplication, which distinguish them unambiguously in most cases. Software that is expected to work with the Kazakh language must be able to handle its complex morphology. Therefore, programs and engines that work well for a language like English are unlikely to work well for the Kazakh language.

However, Kazakh is also a pro-drop language, meaning that it can drop the subject or pronoun if the contextual clues indicate what the subject of the sentence is. In general, the lack of a unified annotation system for the Kazakh corpus, combined with the challenging morphology of the Kazakh language, contributes to its poor representation in computational terms. Overall, this project highlights the study of the Kazakh language through the quantitative methodology of corpus linguistics, along with its challenges and approaches.

The method involves merging the sample corpus with the lexc file used by Apertium, a free and open-source software project, to contribute to the Kazakh language digitization effort. I have also identified a particular utility of a uniform, publicly available system of notations that can correctly reflect the nuances of the Kazakh language. To address the problem, I used the available descriptions of the Turkic language family in general, and of the Kazakh language specifically.

The literature suggests that the Internet can be a representative corpus for written Kazakh; at this point it is the best data collection environment. The data collection involved several stages: first, I created a list of URLs of news sites that would serve as the sources of the data. The final version of the neural network was able to classify the words with 88.32% accuracy. While modern software can produce more accurate results, I would consider this neural network a successfully implemented model given the limited amount of training data.
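Classification accuracy of this kind is usually the share of words whose predicted tag matches the reference tag. The sketch below shows that calculation on made-up tags; the function name and the sample data are illustrative only and are not the evaluation code or data behind the reported figure.

```python
def accuracy(predicted, gold):
    """Return the fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / len(gold)
```

With three of four toy tags correct, `accuracy(["N", "V", "ADJ", "N"], ["N", "V", "N", "N"])` evaluates to 0.75.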

The present project does not take into account the multilingualism of the region, as this in itself requires separate research. The data source for the corpus was chosen in an effort to remain as close to monolingual as possible – spoken or Internet communication is likely to be multilingual.
