A Frank Look at Simon: Where To Go From Here

As I had previously announced, I am resigning my active positions in Simon and KDE Speech.

As part of handing over the project to an eventual successor, I had announced a day-long workshop on speech recognition basics for anyone who’s interested. Mario Fux of Randa fame took me up on that offer. In a long and intense Jitsi meeting we discussed basic theory, went through all the processes involved in creating and adapting language and acoustic models, and looked at the Simon codebase. But maybe most importantly of all, we talked about what I also want to outline in this blog post: What Simon is, what it could and should be, and how to get there.


As some of you may know, Simon started as a speech control solution for people with speech impediments. This use case required immense flexibility when it comes to speech model processing. But flexibility doesn’t come cheap – it eliminates many simplifying assumptions. This increases the complexity of the codebase substantially, makes adding new features more difficult and ultimately also leads to a more confusing user interface.
Just to give you an example of how deep this problem runs, I remember fretting over what to call the smallest lexical element of the recognition vocabulary. Can we even call it a “word”? It may actually not be in Simon – this is up to the user’s configuration.

Now, everyone reading this would be forgiven for asking why, almost 9 years in, Simon hasn’t simply been streamlined yet. The answer is that removing what makes Simon difficult to understand and difficult to maintain would necessarily also entail removing most of what makes Simon a great tool for what it was engineered to be: an extremely flexible speech control platform, allowing even people with speech impediments to control all kinds of systems in (almost) all kinds of environments.


Over the years, it became clear that Simon’s core functionality could also be useful to a much wider audience.
Eventually, this led to the decision to shoehorn a “simple” speech recognition system for end-users into the Simon concept.

The logic was simple: Putting both use cases in the same application allowed for easier branding, and additional developers, who would hopefully be attracted by the prospect of working on a speech recognition system for a larger audience, would automatically improve the system for both use-cases. Moreover, the shared codebase could ease maintenance and further development.

In hindsight, however, this was a mistake.

Ostensibly making Simon easier to use meant that all the complexity, purposefully retained to support the core use case, needed to be wrapped in another layer of “simplification”, which in practice only further complicated the codebase. For the end-user, this was problematic as well: the low-level settings were merely hidden under a thin veil of convenience, leaving a power-user system underneath.


In my opinion, it’s time to treat Simon’s two personalities as two separate projects that simply share common libraries for common tasks.

Simon itself should stay the tool for power-users, allowing them to fiddle with vocabulary transcription, grammar rules and adaptation configurations. It’s really quite good at this, and there is a genuine need, as the adoption of Simon shows.
The new project should be a straightforward dictation-enabled command-and-control system for the Plasma desktop: Plasma’s answer to the speech recognition built into Windows and OS X, so to speak. This project’s task would be vastly simpler than Simon’s, which allows for a substantially leaner codebase. Let’s look at a small list of simplifying assumptions that could never hold in Simon, but which would be appropriate for this new project:

  • As the system will be dictation enabled, it will necessarily only work for languages where a dictation-capable acoustic model already exists. Therefore, the capability to create acoustic models from scratch is not required.
  • As dictation-capable speech models would need to be built anyway, a common model architecture can be enforced, removing the need to support HTK / Julius.
  • As generic speech models (base models) will be used, the pronunciations of words can be assumed to be known (for example, following the “rules” for “US English”). Therefore, users would not need to transcribe their words, as this can be done automatically through grapheme-to-phoneme conversion (the g2p model would be part of the speech model distribution). This, together with the switch from grammars to n-grams, would eliminate the need for what were the entire “Vocabulary” and “Grammar” sections in the Simon UI.
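To make the last assumption concrete, here is a minimal sketch of automatic transcription: lexicon lookup first, with a naive letter-to-sound fallback standing in for a real g2p model. The tiny lexicon and the one-letter-one-phoneme rules are invented for illustration; an actual g2p model shipped with the speech model distribution would be far richer.

```python
# Sketch: automatic word transcription, so users never have to do it by hand.
# Lexicon entries and fallback rules below are made up for illustration.

LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Hypothetical single-letter fallback rules (a real g2p model would use
# context-dependent rules or a trained statistical model).
LETTER_TO_SOUND = {
    "a": "AH", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "k": "K", "l": "L", "m": "M",
    "n": "N", "o": "OW", "p": "P", "r": "R", "s": "S", "t": "T",
    "u": "UH", "v": "V", "w": "W", "y": "Y", "z": "Z",
}

def transcribe(word):
    """Return a phoneme sequence: lexicon hit if available, else fallback."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return [LETTER_TO_SOUND[ch] for ch in word if ch in LETTER_TO_SOUND]

print(transcribe("Hello"))  # lexicon hit
print(transcribe("kde"))    # letter-to-sound fallback
```

The point is not the quality of the fallback, but that the user never sees a “Vocabulary” screen at all: transcription happens behind the scenes.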

But talk is cheap. Let’s look at a prototype. Let’s look at Lera.

Lera's Main UI

Lera’s main user interface is a simple app indicator that gets out of the way. Clicking on it opens the configuration dialog.

Lera's Configuration

Lera’s configuration dialog (mockup, non-functional) is an exercise in austerity. A drop-down lets the user choose the speech model to use, which should default to the system’s language if a matching speech model is available. A list of scenarios, which should be auto-enabled based on installed applications, shows what can be controlled and how. The user should be able to improve performance by going through training (in the second tab) and to configure when Lera should be listening (in the third tab).

Here’s the best part: Lera is a working prototype. Only the core functionality, the actual decoding, is implemented, but it works out of the box, powered by an improved version of the speech model I presented on this blog in 2013, enabling continuous “dictation” in English (the model is available in Lera’s git repository). So far, the only output produced is a small popup showing the recognition result.

Lera in Action

I implemented this prototype mostly to show off what I think the future of open-source speech recognition should look like, and how to get started on getting there. Lera’s whole codebase has 1099 lines, 821 of which are responsible for recording audio. The actual integration of the SPHINX speech recognizer takes only a handful of lines. The model, too, is built with absolute simplicity in mind. There’s no “secret sauce”, just a basic SPHINX acoustic model built from open corpora (see the readme in the model folder).


If anything, Lera is a starting point. The next steps would be to move Simon’s “eventsimulation” library into a separate framework, to be shared between Lera and Simon. Lera could then use this to type out the recognition results (see Simon’s Dictation plugin). Then, I would suggest porting a simplified notion of “Scenarios” to Lera, which should only really contain a set of commands and maybe context information (vocabulary and “grammar” can be synthesized automatically from the command triggers). The implementation of training (acoustic model adaptation) would then complete a very sensible, very usable version 1.0.
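Synthesizing the vocabulary and “grammar” from command triggers could be as simple as collecting the trigger words and counting bigrams over them. A minimal sketch, with invented trigger phrases from a hypothetical media-player scenario (a real implementation would then emit this in the language-model format the recognizer expects):

```python
from collections import Counter

def synthesize_model(triggers):
    """Build a vocabulary and bigram counts from a scenario's command
    triggers. Sentence boundaries are marked with <s> / </s>, as is
    conventional for n-gram language models."""
    vocabulary = set()
    bigrams = Counter()
    for phrase in triggers:
        words = phrase.lower().split()
        vocabulary.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            bigrams[(a, b)] += 1
    return vocabulary, bigrams

# Hypothetical triggers; in Lera these would come from a scenario's commands.
vocab, bigrams = synthesize_model(["play music", "stop music", "next song"])
print(sorted(vocab))              # every word the user can say
print(bigrams[("<s>", "play")])   # how often "play" starts an utterance
```

With this, a scenario author only ever writes commands; vocabulary and language model fall out automatically, which is exactly what lets the “Vocabulary” and “Grammar” UI disappear.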

Sadly, as I mentioned before, I will not be able to work on this any longer. I do, however, consider open-source speech recognition to be an important project, and would love to see it continued. If Lera kindled your interest, feel free to clone the repo and hack on it a little. It’s fun. I promise.

Passing the Torch

This fall, the Simon project will turn 9 years old. What started with a team of ambitious 17-year-olds and a school project evolved, over the course of more than 3000 commits and 4 major releases, into a sophisticated speech-recognition platform.

Like most open-source projects, Simon saw periods of stagnation and periods of explosive growth. It saw GSoCs and Open Academies; Research projects and commercial deployments. And through it all, I was proud to be Simon’s maintainer. Almost 9 years in, however, it is time for me to take a step back, pass on the torch, and focus on a new adventure.

At the end of this month, I will start an exciting new career at Apple’s Siri team in Paris. I will therefore sadly no longer be able to serve as Simon’s maintainer, starting on the 20th of August.
I hereby want to announce an open call to find a new maintainer for KDE’s speech recognition efforts.

To help my successor get started, I will hold an online workshop on Tuesday, the 4th of August 2015 (starting at around 11am, Vienna time), where I will introduce speech-recognition basics, outline Simon’s codebase, and answer any questions that arise. I am also planning on discussing my sketches and ideas on where I personally think the project should be going, and how to get there. After the workshop, I will stay available to answer any questions, especially when it comes to the workings of the actual speech recognition, via email until the 20th, to make the transition as smooth as possible.

I encourage everyone interested in potentially leading the Simon team to introduce themselves on the kde-speech@kde.org mailing list. Prior knowledge about speech recognition is not required.

At this point, I also want to extend my sincere gratitude to everyone involved in turning this project into what it has become. Franz and Mathias Stieger, Alexander Breznik, Frederik Gladhorn, Yash Shah, Vladislav Sitalo, Adam Nash, Patrick von Reth, Manfred Scheucher, Phillip Goriup, Martin Gigerl, Bettina Sturmann, Susanne Tschernegg, Jos Poortvliet, Lydia Pintscher, the KDE e.V. board, and the countless others: thank you!

Speech-Based, Natural Language Conversational Recommender Systems

A while ago, I published a post about ReComment, a speech-based recommender system. Well, today I want to talk about SpeechRec: ReComment’s bigger, better, faster, stronger brother.


With ReComment we showed that a speech-based interface can enable users to specify their preferences more quickly and more accurately. Specifically, ReComment exploited information from qualifiers such as “a little” or “a lot”, which naturally occur in spoken input, to refine its user model.
With SpeechRec, I wanted to build on the idea that spoken language carries more semantic information than traditional user interfaces typically allow to express, by integrating paralingual features into the recommendation strategy, giving proportionally more weight to requirements that are said in a more forceful manner. For example, this allows the system to assign more weight to the constraint “Cheaper!” than to the statement “Cheaper..?”. In other words: SpeechRec doesn’t just listen to what you are saying and how you are phrasing it, but also to how you are pronouncing it.
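The weighting idea can be sketched in a few lines. This is not SpeechRec’s actual formula, just an illustrative linear scaling, assuming the arousal value (as estimated by something like OpenEAR) is normalized to [0, 1] with 0.5 meaning a neutral delivery:

```python
def critique_weight(base_weight, arousal, gain=1.0):
    """Scale a critique's weight by the arousal estimated from the
    speaker's voice. arousal in [0, 1]; 0.5 is neutral delivery.
    The linear scaling and the gain parameter are illustrative."""
    return base_weight * (1.0 + gain * (arousal - 0.5))

# A forceful "Cheaper!" counts for more than a tentative "Cheaper..?":
print(critique_weight(1.0, 0.9))  # forceful delivery boosts the weight
print(critique_weight(1.0, 0.2))  # tentative delivery dampens it
```

A neutral delivery (arousal 0.5) leaves the weight unchanged, so the paralingual channel only ever refines, never replaces, what was actually said.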

Moreover, I wanted to up the ante when it came to task complexity. The ReComment prototype recommended compact digital cameras, which turned out to be a problem domain where a user’s requirements are fairly predictable, and finding a fitting product is arguably easy for the majority of users. To provoke conflicting requirements, and therefore better highlight the strengths and weaknesses of the recommender system under test, SpeechRec was evaluated in the domain of laptops. And let me tell you: recommending laptops to what were primarily students at a technical university is quite a challenge :).

Another problem we observed in the evaluation of ReComment was that some people did not seem to “trust” the system to understand complex input, and instead defaulted to very simple, command-like sentences, which, in turn, carried less information about the user’s true, hidden preferences. SpeechRec was therefore engineered to provoke natural user interaction by engaging users in a human-like, mixed-initiative, spoken sales dialog with an animated avatar.


The developed prototype was written in C++, using Qt4. The speech recognition system was realized with the open-source speech recognition solution Simon, using a custom, domain-specific speech model that was especially adapted to the pervasive Styrian dialect. Simon was modified to integrate OpenEAR, which was used to evaluate a statement’s “arousal” value, to realize the paralingual weighting discussed above (this modification can be found in Simon’s “emotion” branch).

The avatar was designed in Blender 3D. MARY TTS was used as the text-to-speech service.

A comprehensive product database of more than 600 notebooks was collected from various sources. Each product was annotated with information on 40 attributes, ranging from simple ones like price to the display panel technology used. As early pilots showed that users, facing a “natural” sales dialog, did not hesitate to also talk about subjective attributes (e.g., “I want a device that looks good”), SpeechRec’s database additionally included sentiment towards 41 distinct aspects, sourced through automatic sentiment analysis of thousands of customer reviews. MongoDB was used as the DBMS. Optimality criteria additionally allowed SpeechRec to decode statements such as “Show me one with a better graphics card” or “I want one with good battery life”.

SpeechRec’s recommendation strategy was based on the incremental critiquing approach as described by Reilly et al., with a fuzzy satisfaction function.
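To give a flavor of what that means, here is a minimal sketch, not SpeechRec’s actual implementation: each critique (“cheaper”, “bigger screen”, …) accumulates as a weighted constraint, and products are scored by a fuzzy satisfaction function instead of a hard pass/fail filter. The linear membership function, the example laptops, and the constraint tuples are all invented for illustration:

```python
def satisfaction(value, target, tolerance):
    """Fuzzy satisfaction of a 'less than target' constraint: 1.0 at or
    below the target, falling linearly to 0.0 at target + tolerance.
    A hard filter would instead reject anything above the target."""
    if value <= target:
        return 1.0
    if value >= target + tolerance:
        return 0.0
    return 1.0 - (value - target) / tolerance

def score(product, constraints):
    """Weighted average of fuzzy satisfaction over all critiques gathered
    so far (the 'incremental' part: constraints accumulate across the
    dialog, with weights e.g. boosted by forceful delivery)."""
    total_weight = sum(w for _, _, _, w in constraints)
    return sum(w * satisfaction(product[attr], target, tol)
               for attr, target, tol, w in constraints) / total_weight

# Hypothetical laptops and accumulated critiques:
# (attribute, target, tolerance, weight); lower is better for both.
laptops = [{"price": 1250, "weight": 2.0}, {"price": 999, "weight": 2.9}]
constraints = [("price", 1200, 200, 2.0), ("weight", 2.5, 1.0, 1.0)]
print([round(score(p, constraints), 3) for p in laptops])
```

Note how the first laptop stays in the running despite costing slightly more than the requested 1200 euros: this kind of soft constraint handling is what lets a recommender “bend” a price limit in favor of a better overall compromise, as in the video below.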


Check out the video demonstration of a typical interaction session below.

Note how SpeechRec takes the initiative in the beginning, asking for the user’s primary use case. It does this because it determined that it did not yet have sufficient information to make a sensible recommendation (mixed-initiative dialog strategy).
Also note how the system corrects its mistake of favoring the “price” attribute over the “screen size” attribute when the user complains about it (towards the end). SpeechRec instead comes up with a different, more favorable compromise, which even includes slightly bending the user’s request for something that costs at most 1200 euros (“It’s just a little more, and it’s worth it for you.”).

Direct link: Watch Video on Youtube
(The experiment was conducted in German to find more native speaking testers in Austria; be sure to turn on subtitles!)

Results and Further Information

The empirical study we conducted showed that the nuanced user input extracted through natural language processing and paralingual analysis enabled SpeechRec to find better-fitting products significantly more quickly than a traditional, knowledge-based recommender system. A further comparison showed that the full version of SpeechRec described above also substantially outperformed a restricted version of itself, configured not to act on the extracted lexical and paralingual nuances, confirming that the richer user model facilitated by spoken natural language interaction is a major contributor to SpeechRec’s increased recommendation performance.

More information about SpeechRec and related research can be found in the journal paper P. Grasch and A. Felfernig. On the Importance of Subtext in Recommender Systems, icom, 14(1):41-52, 2015, and in my Master’s thesis.

SpeechRec was published under the terms of the GPLv2, and is available on Github. All dependencies and external components are free software and available at their respective websites.

Conf.KDE.In 2014: A New Generation

About three weeks ago, I was lucky enough to attend Conf.KDE.In at the DA-IICT in Ghandinagar, India. Thank you, KDE e.V. for making this possible.
Much has already been said about the excellent conference in the dot article and personal reports from fellow KDE hackers so I will skip repeating how eager and engaged the students were, how the event was impeccably organized, and even how good the food was.

What I want to talk about instead is something else: The evolution of our community. Continue Reading

Open Academy

Back in 2012, Facebook and Stanford University introduced their “Open Academy” program. The aim was and still is simple: Give University students an opportunity to work on real open source projects in exchange for University credit – and a ton of valuable experience.
This year, KDE has joined as a mentoring organization with a total of 11 students assigned to work on 3 different projects. One of those projects is Simon’s upcoming natural language dialog manager: A system building on the current “Dialog” plugin to enable the creation of advanced spoken dialogs like the ones made popular by Apple’s Siri. Continue Reading

Launching the Open Speech Initiative

Over the course of the summer, I have been working on bringing dictation capabilities to Simon. Now, I’m trying to build up a network of developers and researchers that work together to build high accuracy, large vocabulary speech recognition systems for a variety of domains (desktop dictation being just one of them).

Building such systems using free software and free resources requires a lot of work in many different areas (software development, signal processing, linguistics, etc.). In order to facilitate collaboration and to establish a sustainable community between volunteers of such diverse backgrounds, I am convinced that the right organizational structure is crucial to ensuring continued long-term success.

With this in mind, I am pleased to introduce the new Open Speech Initiative under the KDE umbrella: A team of developers looking to bring first class speech processing to the world of free software. Continue Reading