
III. Experimental results and future work

After assembling the interface hardware and establishing its connection with the software, we ran tests to observe the behavior of the system. Despite several errors, which were subsequently corrected, the haptic interface simulation appears to give good results in terms of overall performance.

This includes the response of the hardware to the software and vice versa.

Fig. 4. Model of Desktop Haptic Interface

The Desktop Haptic Interface (Fig. 4) was able to read the signals sent from the potentiometers and respond within a short amount of time. The virtual model of the STAUBLI robot and a box representing the obstacle can then be added. Both systems and models make correct movements while being operated by the user. To test whether the output of the haptic system can be used as the input for the actual robot, the data should be recorded in a separate file and sent to the input system of the STAUBLI. As a result, the robot responds accordingly, taking into account the vibrations and warnings related to the obstacle.


UDC 004


EMOTION RECOGNITION BASED ON IMAGE PROCESSING

(Kazakh-British Technical University, Almaty, Kazakhstan)

Abstract

Facial expressions are an extremely important part of human life, and basic research on emotions over the past few decades has produced several discoveries that have led to important real-world applications. People can adopt a facial expression voluntarily or involuntarily, and the neural mechanisms responsible for controlling the expression differ in each case. However, reliable expression recognition by machine is still a challenge. This paper presents the application of the machine learning technique of support vector machines (SVM) to the recognition and classification of human emotions based on image processing.

Keywords: Expression recognition, Face detection, SVM

Introduction

Accurate recognition and classification of facial expressions turns out to be an extremely difficult task. Despite immense efforts in computer hardware and software development, including the improvement of sophisticated machine learning algorithms, no computational system exists today that approximates the performance of humans. Traditionally, facial expressions have been studied by clinical and social psychologists, medical practitioners, actors and artists. However, in the last quarter of the 20th century, with advances in artificial intelligence, computer vision, computer graphics and 3D modeling, computer scientists started showing interest in the study of facial expressions.

Face detection and feature extraction

Automatic emotion recognition systems are divided into three modules:

1. Face Detection

2. Feature Extraction

3. Expression Classification

First, I detected the face in the image, loaded as a matrix, using the Viola-Jones object detection framework. Proposed by Paul Viola and Michael Jones in 2001, it was the first object detection framework to provide competitive detection rates. Although it can be trained to detect a variety of object classes, it was motivated primarily by the problem of face detection. The algorithm is implemented in the OpenCV library, which was included in my project. In OpenCV, a cascade of boosted classifiers working with Haar-like features is trained with a few hundred sample views of a particular object (positive examples) and with arbitrary images of the same size (negative examples). Once cascades pretrained for face detection have been loaded, they can be applied to a region of interest in an input image. To search for the object in the whole image, one moves the search window across the image and checks every location with the classifier. To find an object of unknown size, the scan has to be repeated several times at different scales.
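The scan procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions: `multiscale_scan`, `bright_square`, the toy image and the scaling rule are hypothetical, and the `is_object` callback merely stands in for a trained cascade. It is not the OpenCV implementation.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]      # grayscale image as rows of pixel values
Box = Tuple[int, int, int]   # (x, y, size) of a square search window


def multiscale_scan(
    image: Image,
    is_object: Callable[[Image, int, int, int], bool],
    min_size: int = 2,
    scale_factor: float = 2.0,
    step: int = 1,
) -> List[Box]:
    """Slide a square window over the image at several window sizes.

    `is_object(image, x, y, size)` is a hypothetical stand-in for a trained
    cascade: it returns True when the window at (x, y) looks like the object.
    """
    height, width = len(image), len(image[0])
    hits: List[Box] = []
    size = min_size
    while size <= min(height, width):
        # Check every window position at the current scale.
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                if is_object(image, x, y, size):
                    hits.append((x, y, size))
        # Grow the window to look for larger instances of the object.
        size = max(size + 1, int(size * scale_factor))
    return hits


def bright_square(image: Image, x: int, y: int, size: int) -> bool:
    """Toy classifier: accept a window only if every pixel in it is bright."""
    return all(
        image[r][c] > 128
        for r in range(y, y + size)
        for c in range(x, x + size)
    )


toy = [
    [0, 0, 0, 0, 0],
    [0, 255, 255, 0, 0],
    [0, 255, 255, 0, 0],
    [0, 0, 0, 0, 0],
]
detections = multiscale_scan(toy, bright_square)
print(detections)
```

In this sketch the window is enlarged between passes; rescaling the image while keeping the window fixed would be an equally valid way to cover different object sizes.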

Figure 1 - The current algorithm uses the following Haar-like features.

Figure 2 - Example of face detection using Haar-like features.
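Haar-like features such as those in Figure 1 are sums of pixel intensities over adjacent rectangles, and the Viola-Jones framework evaluates them quickly with an integral image. The sketch below shows one two-rectangle feature in plain Python; the function names, the sign convention (left half minus right half) and the toy patch are assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]


def integral_image(image: Image) -> List[List[int]]:
    """Return an (h+1) x (w+1) table where ii[y][x] sums image rows < y, cols < x."""
    h, w = len(image), len(image[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: List[List[int]], x: int, y: int, w: int, h: int) -> int:
    """Sum of the pixels in the rectangle with top-left (x, y), using 4 lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii: List[List[int]], x: int, y: int, w: int, h: int) -> int:
    """Two-rectangle Haar-like feature: left half minus right half of the window."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


# Toy patch with a dark left half and a bright right half (a vertical edge).
patch = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
ii = integral_image(patch)
edge_response = two_rect_feature(ii, 0, 0, 4, 2)
print(edge_response)
```

A large absolute value indicates a strong vertical edge inside the window; a cascade combines many such features at different positions and sizes.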

The second step is extracting features such as the mouth, eyes and nose. The principle remains the same; only the Haar cascades used change according to the required region of interest. In addition, the images of the detected eye pair and mouth from each test picture are saved for later use in training the classifier.

Support Vector Machine

Support Vector Machines are a maximal margin hyperplane classification method that relies on results from statistical learning theory to guarantee high generalization performance. Kernel functions are used to efficiently map input data, which may not be linearly separable, to a high-dimensional feature space where linear methods can then be applied. SVMs exhibit good classification accuracy even when only a modest amount of training data is available, making them particularly suitable for a dynamic, interactive approach to expression recognition.
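To make the role of the kernel concrete, the following sketch evaluates the decision function of a kernel SVM in its standard dual form, f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, with a radial basis function kernel. The support vectors, labels, multipliers, bias and `gamma` are hypothetical values chosen by hand, not the result of training on any expression data, and the choice of an RBF kernel is an assumption.

```python
import math
from typing import Sequence


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float = 0.5) -> float:
    """Radial basis function kernel: exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def svm_decision(
    x: Sequence[float],
    support_vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    alphas: Sequence[float],
    bias: float,
    gamma: float = 0.5,
) -> float:
    """Dual-form decision value; its sign gives the predicted class."""
    return bias + sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )


# Hypothetical, hand-picked model with one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [+1, -1]
alphas = [1.0, 1.0]
bias = 0.0

near_positive = svm_decision((0.1, 0.1), support_vectors, labels, alphas, bias)
near_negative = svm_decision((2.0, 1.9), support_vectors, labels, alphas, bias)
print(near_positive > 0, near_negative < 0)
```

A point equidistant from both support vectors, such as (1.0, 1.0), lies on the decision boundary of this toy model.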

Figure 3 - Detection of the face, eyes, mouth and nose in my project, compared with a non-face picture.

The often subtle differences distinguishing separate expressions, for example «anger» or «disgust», in our displacement-based data, as well as the wide range of possible variations in a particular expression when performed by different subjects, led us to adopt SVMs as the classifier of choice. Selecting an appropriate kernel function allowed further customization and improvement of the SVM classifier for our specific domain of facial expression recognition. Support vector machines have already been used successfully in a variety of classification applications, including character and text recognition as well as DNA microarray data analysis.

Used Database, Evaluation and Conclusion

All images used for training my classifier were taken from the Cohn-Kanade AU-Coded Expression Database. The Cohn-Kanade Database is intended for research in automatic facial image analysis and synthesis and for perceptual studies. Most researchers define seven universal facial expressions: Happiness, Sadness, Surprise, Disgust, Fear, Anger and Neutral. My initial implementation correctly recognized expressions in 70% of trials; subsequent improvements, including the selection of a kernel function customized to the training data, raised recognition accuracy to 85%. Incorporating further possible improvements, such as increasing the amount of training data or performing automatic SVM model selection, is likely to yield even better performance and further increase the suitability of SVM-based expression recognition for building emotionally and socially intelligent human-computer interfaces.

Future work

Emotion recognition has significant applications in different spheres of life. First of all, such technology will help in building human-like robots and human-computer interaction in general. It can also be used in the social sphere, for example to monitor the mood of visitors to various institutions, cafes, restaurants and any other place where a company's profit depends on customers’ responses. Collecting such information can improve customer service. The technology can also be used in developing applications or systems that lift a person's mood by playing appropriate music or films. Moreover, recognition of microexpressions in live video streams can improve the accuracy of lie detectors and may find application in law enforcement.



UDC 004.8



(National Laboratory Astana, Nazarbayev University, Astana, Kazakhstan)

Abstract

In this paper we describe our attempt to build a baseline system for Kazakh broadcast news transcription. A neural-network-based acoustic model for the system was trained on the Kaldi platform using the previously available KazSpeechDB speech corpus and the KazMedia speech corpus, a collection of broadcast news from three different TV channels created specifically for this task. A language model was trained with the IRSTLM toolkit using mass media news available online. The best word error rate of 4.06% was obtained for the Khabar channel.


Nowadays, many media agencies publish a large amount of audio and video material on the Internet, often without an accompanying text description of the contents.

One of the main problems caused by the lack of texts or transcriptions for audio and video materials is the need to engage linguists or operators to recover the text, which entails additional costs and time for the organization. The lack of text content also leads to poor quality of news search results. This results in a weak online presence and lower ratings of media agencies compared with foreign media, and therefore in the dissatisfaction of Internet users. Furthermore, the absence of transcriptions prevents people with hearing impairments from accessing such content.

Our research aims to address the problem of transcribing news in the Kazakh language using modern speech recognition technologies. Although the problem of broadcast news transcription is well studied for foreign languages such as Arabic [1], English [2, 4], Chinese [3] and others, it remains a very challenging problem due to the high variability of acoustic events in the data.

A common news track may contain acoustic segments with different speakers (male/female), languages (Kazakh, Russian, English, etc.), channels (broadband/telephone), acoustic environments (studio/outdoor) and noises (music, cars, etc.) which dramatically affect the accuracy of speech recognition systems. Another challenge with respect to Kazakh is that many speakers are bilinguals and mix Russian and Kazakh during the conversation.

In this work we present a baseline system for Kazakh broadcast news transcription. Here we do not deal with standard tasks such as segmentation and clustering of speech, language and speaker identification and others; instead, we show our preliminary results on building and testing this system on real data.

The following sections describe the speech corpus used for acoustic modelling, experiment setup and the results of broadcast news transcription.

Speech corpus

The main data for our acoustic modeling and speech recognition experiments is the KazBNT acoustic corpus (database). The KazBNT corpus consists of two independent sub-corpora: KazSpeechDB and KazMedia.

The KazSpeechDB corpus as part of Kazakh Language Corpus [5] is a body of utterances consisting of 12675 Kazakh sentences recorded in a sound recording studio, uttered by speakers of different age and gender, from different regions of Kazakhstan. The corpus contains 22 hours of speech; its sampling rate is 16 kHz. The total number of speakers is 169, 73 of which are men and 96 are women. Each speaker uttered 75 sentences. Every audio file is supplied with a text file that contains transcription text of the utterance.

The KazMedia corpus is a body of text and audio data collected from the official websites of the broadcast news channels “Khabar” [6], “Astana TV” [7] and “Channel 31” [8]. The text data is a collection of all Kazakh news in plain text published on the official websites of these 3 media channels from 2013 to 2015. The audio data consists of 518 wav-files, which are audio tracks extracted from a number of video news items in Kazakh. The total duration of these audio files is 11 hours of speech; the sampling rate is 16 kHz. Every wav-file is supplied with a txt-file that contains the detailed transcription of the news and a time-aligned annotation file with labels for speaker gender, language and noise.

It is worth mentioning that the KazMedia corpus in fact contains more than 400 hours of audio news in Kazakh published from 2013 to 2015. However, this data initially had no orthographic transcriptions or other accompanying annotations. Therefore, we have preprocessed only a certain number of video news items for our preliminary experiments: as stated above, this amounts to 11 hours of Kazakh speech in total.

Preparations of the experiment

The dictionary and the language model of the KazBNT system were formed on the basis of cumulative text data of both KazSpeechDB and KazMedia sub-corpora. We used the IRSTLM Toolkit [9] for language modeling.

A train set, a validation set and 3 independent test sets of the KazBNT system were formed on the basis of audio data from the KazSpeechDB and KazMedia sub-corpora, as described in Table 1. A series of interdependent acoustic modeling experiments was then carried out with this audio data and the accompanying txt-files. We used the Kaldi speech recognition toolkit [10] for acoustic modeling. The experiments started with training a simple monophone model and ended with training a deep neural network. Each experiment builds on the result of the previous one and generally refines it.

Table 1. List and characteristics of experiment sets

| Set type | Set name | Number and source of files in the set | Total duration of audio data |
|---|---|---|---|
| Train set | kazbnt.train | 11175 wav-files from KazSpeechDB + 406 wav-files from KazMedia | 29 hours |
| Validation set | kazbnt.dev | 750 wav-files from KazSpeechDB + 49 wav-files from KazMedia | 2.4 hours |
| Test set 1: Khabar | kazbnt.test_khabar | 30 wav-files of audio news from the «Khabar» channel | 20 minutes |
| Test set 2: Astana TV | kazbnt.test_astanatv | 14 wav-files of audio news from the «Astana TV» channel | 20 minutes |
| Test set 3: Channel 31 | kazbnt.test_channel31 | 19 wav-files of audio news from the «Channel 31» channel | 20 minutes |

Experimental results

Experiment 1 (Monophones: Delta-Deltas) – a monophone model using delta-delta features and cepstral mean and variance normalization on a per-speaker basis.

Experiment 2 (Triphones: LDA + MLLT + SAT) – a triphone model using linear discriminant analysis, maximum likelihood linear transform, and speaker adaptive training.

Experiment 3 (DNN1) – a deep neural network with 2 hidden layers each having 300 neurons.

Experiment 4 (DNN2) – a deep neural network with 4 hidden layers each having 2000 neurons.
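Experiment 1 applies cepstral mean and variance normalization (CMVN) on a per-speaker basis. The general idea can be sketched as follows: for each speaker, every feature dimension is shifted to zero mean and scaled to unit variance over that speaker's frames. This is a generic illustration with hypothetical names and toy numbers; it is not Kaldi's implementation.

```python
import math
from typing import Dict, List

Frame = List[float]  # one feature vector, e.g. cepstral coefficients


def cmvn_per_speaker(
    frames_by_speaker: Dict[str, List[Frame]], eps: float = 1e-8
) -> Dict[str, List[Frame]]:
    """Normalize each feature dimension to zero mean, unit variance per speaker."""
    normalized: Dict[str, List[Frame]] = {}
    for speaker, frames in frames_by_speaker.items():
        n = len(frames)
        dims = len(frames[0])
        # Statistics are pooled over all frames of this speaker only.
        means = [sum(f[d] for f in frames) / n for d in range(dims)]
        stds = [
            math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dims)
        ]
        normalized[speaker] = [
            [(f[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
            for f in frames
        ]
    return normalized


# Toy features: two speakers with very different offsets and ranges.
toy_features = {
    "spk1": [[1.0, 10.0], [3.0, 30.0]],
    "spk2": [[100.0, 0.0], [102.0, 4.0]],
}
normalized = cmvn_per_speaker(toy_features)
print(normalized["spk1"])
```

After normalization both toy speakers occupy the same range in every dimension, which is the effect that makes the features less dependent on speaker and channel.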


A common metric to evaluate the performance of speech recognition models is WER (word error rate), which is computed as the ratio of erroneously recognized words to the total number of words in the reference text. The lower the WER, the better the accuracy of the recognition system is.
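As a minimal sketch of how WER is commonly computed, the function below counts the minimum number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference (word-level edit distance) and divides it by the number of reference words. This is the conventional definition and an assumption about the exact scoring used here; the function name and the English example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") in 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")
```

Because insertions are counted as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.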

A summary of the experimental results for all available sets is shown in Table 2.

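Table 2. Minimum value of WER on the train, validation and test sets

| Experiment \ Set | kazbnt.train | kazbnt.dev | kazbnt.test_khabar | kazbnt.test_astanatv | kazbnt.test_channel31 |
|---|---|---|---|---|---|
| Monophones | 9.50 % | 9.84 % | 14.56 % | 18.75 % | 29.71 % |
| Triphones | 5.70 % | 6.32 % | 6.36 % | 9.88 % | 17.13 % |
| DNN1 | 5.15 % | 5.38 % | 5.44 % | 8.68 % | 17.25 % |
| DNN2 | 3.86 % | 4.54 % | 4.06 % | 7.52 % | 14.54 % |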

In all cases the best WER was achieved with the DNN2 acoustic model. There is a marked difference between the results for the “Khabar” channel (WER 4.06%), “Astana TV” (WER 7.52%) and “Channel 31” (WER 14.54%). This can probably be explained by the different quality of the audio data in terms of background noise and interfering sounds.

The obtained results are comparable with those reported for other languages. For example, for Arabic the WER is 8.61% (KACST v1.10 [1]), for English the WER is 11.6% (CU-HTK 2006 [2]), and for Mandarin Chinese the CER is 15.9% (based on LIMSI [3]).

It should be mentioned that although the DNN2 model shows the best results, it is still very slow in operation. This is due to its large size and high resource intensity: the model requires a great deal of time to load into RAM and initialize. To solve this problem, we will need to optimize model loading at the system level.

Conclusion and Future Work

In this work we presented a baseline Kazakh broadcast news transcription system built on the Kaldi platform, which demonstrates acceptable recognition accuracy when using deep neural networks. We have also collected and prepared speech data containing real TV news, which we used for acoustic modelling.

Although the results are promising, the system can be improved in several directions. In terms of recognition accuracy, these include segmentation and clustering of speech data into homogeneous intervals. Another important issue to address is the speed of speech recognition.


References:

1. Mansour Alghamdi, Moustafa Elshafei, Husni Al-Muhtaseb. “Arabic broadcast news transcription system”. International Journal of Speech Technology, Volume 10, Issue 4, pp. 183–195.

2. M. J. F. Gales, Do Yeong Kim, P. C. Woodland, Ho Yin Chan, D. Mrva, R. Sinha, S. E. Tranter. “Progress in the CU-HTK broadcast news transcription system”. In the IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, September 2006, pp. 1513–1525.

3. R. Sinha, M.J.F. Gales, D.Y. Kim, X.A. Liu, K.C. Sim, P.C. Woodland. “The CU-HTK Mandarin broadcast news transcription system”. In the Proc. ICASSP, 2006. IV 1280.

4. Jean-luc Gauvain, Lori Lamel, Gilles Adda. “The LIMSI Broadcast News Transcription System”. Speech Communication, vol. 37, iss. 1–2, pp. 89–108.

5. O. Makhambetov, A. Makazhanov, Zh. Yessenbayev, B. Matkarimov, I. Sabyrgaliyev, and A. Sharafudinov. 2013. “Assembling the Kazakh Language Corpus”. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1022–1031. Association for Computational Linguistics.

6. “Khabar” TV channel, official site. URL: khabar.kz [Access date: 18.04.2016].

7. “Astana TV” channel, official site. URL: astanatv.kz [Access date: 18.04.2016].

8. “Channel 31” TV channel, official site. URL: 31.kz [Access date: 18.04.2016].

9. IRSTLM Toolkit version 5.80.08. URL: https://sourceforge.net/projects/irstlm/ [Access date:


10. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely. “The Kaldi Speech Recognition Toolkit”. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.

UDC 681.5



SYSTEMS SYNTHESIS WITH INACCURATE DATA

(Institute of Information and Computational Technologies, Al-Farabi Kazakh National University, Almaty, Kazakhstan)

The article presents a procedure for solving the problem of parametric control synthesis, which reduces to the solvability of interval algebraic equations. A solution of the resulting system is found in the class of "controlled" solutions.

It has been noted repeatedly that real technical objects operate under conditions of parametric uncertainty. Such uncertainty results from uncontrolled disturbances affecting the control objects, from lack of knowledge of the true parameter values of the control objects due to the complexity of the process, and sometimes from their unpredictable variation in time. In almost all cases, this parametric uncertainty is characterized by the real parameter values of the technical object belonging to some intervals whose limits are known a priori. Their mathematical models can be represented by systems of integral differential and difference equations using the rules and notation of interval analysis [1], and the class of such control objects is commonly known as interval-based.

Thus, we face the problem of controlling not a single object but a family, or set, of objects.

It has been shown that the formulated problem reduces to the solvability of a system of linear interval algebraic inclusions of the following form [2]:

$P K \subseteq H$, (1)

The meaning of the term "solution" of an interval system of inclusions of type (1) requires special clarification, as the interval uncertainty of the system data can be interpreted in two ways, in accordance with the dual understanding of intervals themselves. In the first case, the interval $[\underline{x}, \overline{x}]$ is the set of all real numbers from $\underline{x}$ to $\overline{x}$, and in the second case it stands for just a single value between $\underline{x}$ and $\overline{x}$. In mathematical terms, this difference is expressed by the use of the universal quantifier $\forall$ and the existential quantifier $\exists$: in the first case one writes $\forall x \in [\underline{x}, \overline{x}]$, and in the second case $\exists x \in [\underline{x}, \overline{x}]$. As for the parameters $p_{ij}$ of the system of linear interval equations, which are known only to belong to some intervals, the essential difference between the two types of interval uncertainty manifests itself as the difference between parameters that can change within the indicated intervals as a result of unpredictable external disturbances and parameters that we can vary at will within the set intervals, i.e., control.
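The difference between the two readings can be illustrated with a scalar toy version of inclusion (1). Assume a single uncertain parameter p in an interval P, a scalar gain k, and a target interval H; all three, and the function names, are hypothetical and chosen only for illustration. Under the universal reading the product p*k must fall in H for every p in P, while under the existential reading it is enough that some p in P gives a product in H.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def scaled(interval: Interval, k: float) -> Interval:
    """Range of p * k for p in the interval (endpoints swap when k < 0)."""
    lo, hi = interval
    a, b = lo * k, hi * k
    return (min(a, b), max(a, b))


def holds_for_all(p: Interval, k: float, h: Interval) -> bool:
    """Universal reading: p * k lies in h for every p in the interval."""
    lo, hi = scaled(p, k)
    return h[0] <= lo and hi <= h[1]


def holds_for_some(p: Interval, k: float, h: Interval) -> bool:
    """Existential reading: p * k lies in h for at least one p in the interval."""
    lo, hi = scaled(p, k)
    return lo <= h[1] and h[0] <= hi


P = (1.0, 2.0)  # hypothetical uncertain parameter
H = (2.0, 6.0)  # hypothetical target interval

for k in (2.5, 4.0, 10.0):
    print(k, holds_for_all(P, k, H), holds_for_some(P, k, H))
```

For k = 2.5 the product range [2.5, 5] lies inside H, so both readings hold; for k = 4 the range [4, 8] only overlaps H, so only the existential reading holds; for k = 10 neither holds.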