Emotion Detection Completed by Artificial Intelligence

Detecting emotions is an ability that has long set humans apart. We are social animals, and we interpret each other’s emotions at a remarkably complex level. Many people, however, including some with autism, Asperger’s syndrome, or other conditions, find this difficult. As a result, these individuals, especially those who live with autism, are often misjudged and perceived in a negative light because of factors they can’t control. I set out to address this injustice by developing technology, specifically a Natural Language Processing (NLP) algorithm, to help these people detect emotions.

To start this experiment, I formed a hypothesis about how effectively my algorithm would recognize emotions. I predicted an emotion recognition rate of 50% or higher. I chose this number simply because I wanted to see whether the algorithm could get the emotion right more often than wrong. The goal of my experiment was to create an algorithm that can accurately determine the emotion expressed in a voice file. I tested it on voice files from a diverse pool of 8 participants: four males and four females, with ages ranging from 6 to 45. The first step in my procedure was to take an audio file from a participant and convert it to a .wav file using the website CloudConvert. The next step was to convert the .wav files from stereo audio to mono audio.
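The stereo-to-mono step can also be done in code instead of with an online tool. Below is a minimal sketch that uses only Python’s standard library and assumes 16-bit PCM .wav input. The function names downmix_interleaved and stereo_to_mono are hypothetical, and averaging the two channels is just one common way to downmix, not necessarily what the online tool did.

```python
import array
import sys
import wave


def downmix_interleaved(samples):
    """Average interleaved stereo samples (L0, R0, L1, R1, ...) into mono."""
    left, right = samples[0::2], samples[1::2]
    return array.array("h", ((l + r) // 2 for l, r in zip(left, right)))


def stereo_to_mono(src_path, dst_path):
    """Write a mono copy of a 16-bit PCM .wav file (assumed input format)."""
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()
        rate = src.getframerate()
        raw = src.readframes(src.getnframes())
    if width != 2 or channels not in (1, 2):
        raise ValueError("this sketch handles only 16-bit mono or stereo audio")

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # .wav sample data is little-endian

    # Files that are already mono are copied through unchanged.
    mono = downmix_interleaved(samples) if channels == 2 else samples
    if sys.byteorder == "big":
        mono.byteswap()  # back to little-endian before writing

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

Calling stereo_to_mono("participant1.wav", "participant1_mono.wav") (both file names are placeholders) would produce a single-channel file with the same sample rate as the original.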


My algorithm could only interpret mono audio, not the default stereo audio, so the voice files were split using an online tool. To develop my own software, I first built an NLP algorithm based on an online tutorial from DataFlair. An NLP algorithm is trained on data so that it can read and understand human language in a variety of contexts, including the intent of the writer or speaker. NLP is currently used in voice assistants like Siri and Alexa, email filters that detect spam or junk email, and predictive text programs like autocorrect and autocomplete. For the sake of simplicity, I will refer to the original algorithm from DataFlair as algorithm 1 and my own algorithm as algorithm 2.

First, algorithm 1 imports the Python libraries librosa, soundfile, os, glob, pickle, and NumPy, along with train_test_split, MLPClassifier, and accuracy_score from scikit-learn. The next step is to define the function extract_feature, which algorithm 2 will also need in order to detect the emotion of a new voice file. The extract_feature function extracts three features from the voice file: MFCC, Chroma, and Mel. Any sound that humans make is shaped by their vocal tract and by parts of the body such as the tongue and teeth. If you can represent this shape accurately, you can represent the sound. This is what the MFCC (Mel-frequency cepstral coefficients) feature does: it captures the shape of the vocal tract, which eases identification of the emotion. Chroma is a feature related to the pitch of a voice. Mel is a spectrogram on the mel scale, which represents the short-term power spectrum of a sound and also aids in measuring pitch. Together, these features give extract_feature the information it needs to support a good prediction of the voice file’s emotion. Algorithm 1 then lists the emotions from the RAVDESS dataset in a Python dictionary, where each key is a number that signifies a specific emotion, e.g. 01 for neutral and 02 for calm. These same numbers are used in the names of all the files in the dataset. I used four emotions from this list, which are stored in observed_emotions: calm, happy, fearful, and disgust.

The algorithm then defines another function called load_data, which loads data from the dataset. The emotion number in each file’s name matches a key in the emotion dictionary, so the program can look up the correct emotion label for every file and use it to check its results during training. The load_data function goes through all the files and loads in 20% of the total data files of these four emotions for training algorithm 1. If a file’s emotion isn’t in observed_emotions, the algorithm skips it. The remaining voice files are run through extract_feature, and the extracted features are returned together with their emotion labels. The algorithm proceeds to split the dataset and get the shape of the training and testing datasets. It also gets the number of features extracted from the voice files. After this, the program uses an MLPClassifier to create a model.
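The loading, filtering, and splitting steps above can be sketched like this, assuming NumPy and scikit-learn are installed. Several parts are assumptions made for illustration: the featurize parameter (which lets a fake feature function stand in for real audio), the file names, the held-out fraction test_size=0.2, and the MLPClassifier settings. The emotion codes follow the RAVDESS naming scheme, where the third field of the file name identifies the emotion.

```python
import os

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# RAVDESS emotion codes (third field of each file name).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
OBSERVED_EMOTIONS = {"calm", "happy", "fearful", "disgust"}


def emotion_from_filename(path):
    """Map e.g. '03-01-06-01-02-01-12.wav' to 'fearful'."""
    code = os.path.basename(path).split("-")[2]
    return EMOTIONS[code]


def load_data(paths, featurize, test_size=0.2):
    """Featurize files with an observed emotion, then split train/test."""
    x, y = [], []
    for path in paths:
        emotion = emotion_from_filename(path)
        if emotion not in OBSERVED_EMOTIONS:
            continue  # skip emotions outside the four under study
        x.append(featurize(path))
        y.append(emotion)
    return train_test_split(np.array(x), y, test_size=test_size, random_state=9)


# Stand-in for real audio: hypothetical file names and a fake featurizer
# that returns a noisy 5-value vector centred on the emotion code.
rng = np.random.default_rng(0)
paths = [f"03-01-{code}-01-01-01-{actor:02d}.wav"
         for code in EMOTIONS for actor in range(1, 11)]


def fake_features(path):
    return float(path.split("-")[2]) + rng.normal(0.0, 0.1, size=5)


x_train, x_test, y_train, y_test = load_data(paths, fake_features)
print(x_train.shape, x_test.shape)  # (files, features per file)

# Assumed settings, chosen only to keep the demo small.
model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=9)
model.fit(x_train, y_train)
```

In the real pipeline, featurize would be the extract_feature function and paths would come from the RAVDESS folder on disk.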

MLPClassifier is scikit-learn’s multi-layer perceptron, a feedforward neural network that is trained on the extracted features to classify the voice files. The trained model is then used to predict the emotions of the test set, and the accuracy of algorithm 1 is calculated from how many test set files it got correct. Finally, the algorithm prints the accuracy, the number of features extracted, and information about the model.

The next step of the experiment is using algorithm 1 to help algorithm 2 give me the exact emotion of a mono voice file from one of my participants. First, algorithm 2 loads in a .sav file, which is where the model created in algorithm 1 is stored. It then loads in the voice file from my computer and uses the extract_feature function of algorithm 1 to extract its features. The data is reshaped so that the model can process it, and the predicted emotion of the voice file is printed out. Finally, all the emotion identifications were collected and used to calculate the emotion recognition rates of the external voice files. Any voice file can be uploaded to the algorithm to have its emotion detected.
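The hand-off between the two algorithms can be sketched as below, assuming NumPy and scikit-learn are installed. Synthetic feature vectors stand in for real voice features, the model settings are assumptions, and the model is pickled to bytes in memory rather than to an actual .sav file so the example runs anywhere. The helper make_data is a hypothetical name.

```python
import pickle

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)


def make_data(n):
    """Return n fake 'calm' vectors near 0 and n fake 'fearful' vectors near 3."""
    x = np.vstack([rng.normal(0.0, 0.1, (n, 5)), rng.normal(3.0, 0.1, (n, 5))])
    return x, ["calm"] * n + ["fearful"] * n


x_train, y_train = make_data(20)
x_test, y_test = make_data(5)

# "Algorithm 1": train the model, score it on the test set, and save it.
model = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=1)
model.fit(x_train, y_train)
accuracy = accuracy_score(y_test, model.predict(x_test))
saved = pickle.dumps(model)  # a .sav file would hold these bytes on disk

# "Algorithm 2": load the saved model and classify one new feature vector.
loaded = pickle.loads(saved)
new_features = rng.normal(3.0, 0.1, 5)
# predict() expects a 2-D array (samples x features), so a single
# vector is reshaped into one row.
prediction = loaded.predict(new_features.reshape(1, -1))
print(f"accuracy: {accuracy:.2%}, predicted emotion: {prediction[0]}")
```

The reshape(1, -1) call is the "reshaping" step: it turns one flat feature vector into a table with a single row, which is the input shape scikit-learn models require.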

Results and observations

The quantitative data collected in this experiment relates directly to the emotion detection rate. Fear was highly detectable, with an emotion recognition accuracy of 87.5%. This is most likely because fear had the most distinctive sound features in the entire database. The timing was similar in every voice file: the rate of speaking stayed the same throughout, with slight increases in speed to signify urgency. The way the emotion was expressed was also nearly identical from file to file.

Secondly, calm was also highly detectable, with a recognition accuracy of 75%. For the most part, participants followed the instructions and maintained a monotone voice and a steady speaking rate. Calm was confused with happiness 12.5% of the time and with fear 12.5% of the time. This can most likely be attributed to a participant speeding up at certain points instead of maintaining the same speed throughout. Both happiness and fear evoke feelings of excitement or urgency, which cause a faster rate of talking.

Thirdly, happiness was harder to detect than the previous two emotions, with a recognition accuracy of 62.5%. It was confused with fear 25% of the time and with calm 12.5% of the time. This could be attributed to people speaking quickly when they are happy: the excitement results in a quickened pace, so the quickness of a happy voice can be confused with the quickness of fear. If the voice file’s rate of speaking was too slow, it was instead mistaken for calm.

Finally, disgust was the hardest emotion for the algorithm to detect, with a recognition rate of only 50%. Disgust was confused with fear the other 50% of the time. The algorithm may have confused the two because disgust and fear share similar voice qualities and are often spoken with a similar pitch and volume.
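The reported percentages can be organized as a small confusion table to make the pattern of mistakes easier to see. This is a sketch in plain Python; the function name summarize and the "unreported" field are illustrative additions, and the table contains only the figures stated above.

```python
# Rows: emotion actually spoken. Inner keys: emotion the algorithm predicted.
# Values are the reported percentages.
confusion = {
    "fearful": {"fearful": 87.5},
    "calm": {"calm": 75.0, "happy": 12.5, "fearful": 12.5},
    "happy": {"happy": 62.5, "fearful": 25.0, "calm": 12.5},
    "disgust": {"disgust": 50.0, "fearful": 50.0},
}


def summarize(actual, row):
    """Summarize one row: recognition rate and the most common mistakes."""
    mistakes = {pred: pct for pred, pct in row.items() if pred != actual}
    worst = max(mistakes.values(), default=0.0)
    return {
        "recognized": row[actual],
        "confused_with": [p for p, pct in mistakes.items() if pct == worst],
        "unreported": 100.0 - sum(row.values()),
    }


summary = {actual: summarize(actual, row) for actual, row in confusion.items()}
for actual, info in summary.items():
    print(actual, info)
```

Laid out this way, fear stands out as the most frequent wrong answer for happiness and disgust and ties with happiness for calm, while the remaining 12.5% of fear files shows up as unreported because the text does not say which emotion they were mistaken for.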


My algorithm satisfied the hypothesis by reaching a recognition rate of 50% or higher on every emotion. Several trends emerged from the participants’ results. First, Participant 7 and Participant 8, both of elementary school age, had a 100% recognition rate: every single one of their voice files was identified correctly by the algorithm. Secondly, both genders had the same overall accuracy. The average accuracy across all emotions was 68.75% for males and 68.75% for females. It is interesting that the accuracy did not differ between males and females. Finally, the results were consistent across trials. The test was run three times, and the results were identical each time. Although the accuracy ranged from 50% to 87.5%, the repeatability of the results stood at 100%. With an expanded dataset and more training, this tool should continue to improve.
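The hypothesis check and the overall average can be verified with a few lines of arithmetic. This sketch uses only the four recognition rates reported in the results and assumes every emotion was tested on the same number of voice files, so that a plain average of the four rates equals the overall accuracy.

```python
from statistics import mean

# Reported recognition rate (percent) for each emotion.
rates = {"fearful": 87.5, "calm": 75.0, "happy": 62.5, "disgust": 50.0}
THRESHOLD = 50.0  # the hypothesis: at least 50% for every emotion

hypothesis_met = all(rate >= THRESHOLD for rate in rates.values())
overall = mean(rates.values())
print(hypothesis_met, overall)
```

The average of the four rates comes out to 68.75%, which is consistent with the 68.75% reported for both the male and the female participants.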

For Further Reading

(My projectboard presentation for the Canada Wide Science Fair)

Hi! I'm Angad, a Grade 9 student from the University of Toronto Schools who loves learning anything and everything!