“Natural” language programming is coming sooner than you think
Sometimes major changes happen virtually unnoticed. On May 5, IBM announced Project CodeNet to very little media or academic attention.
CodeNet follows on from ImageNet, a large-scale dataset of images and their descriptions that is free for non-commercial use. ImageNet is now central to progress in deep learning for computer vision.
CodeNet is an attempt to do for artificial intelligence (AI) coding what ImageNet did for computer vision: it is a dataset of over 14 million code samples, spanning 50 programming languages, written to solve 4,000 coding problems. The dataset also contains a lot of additional data, such as the amount of memory required to run the software and logs of the output of running code.
Accelerating machine learning
IBM’s own stated rationale for CodeNet is that it is designed to speed up the updating of legacy systems programmed in outdated code, a development long awaited since the Y2K panic over 20 years ago, when many believed that undocumented legacy systems could fail with dire consequences.
However, as security researchers, we believe that the most important implication of CodeNet, and of similar projects, is the potential to lower the barriers to coding and the possibility of natural language coding (NLC).
In recent years, companies such as OpenAI and Google have rapidly improved natural language processing (NLP) technologies. These are machine learning-based programs designed to better understand and mimic natural human language and to translate between different languages. Training machine learning systems requires access to a large dataset of texts written in the desired human languages. NLC applies all of this to coding too.
Coding is a difficult skill to learn, let alone master, and an experienced coder is expected to be proficient in several programming languages. NLC, on the other hand, leverages NLP technologies and a large database such as CodeNet to allow anyone to use English, or ultimately French or Chinese or any other natural language, to code. This could make tasks like designing a website as simple as typing “create a red background with an image of an airplane on it, my company logo in the middle and a contact me button below”, and that exact website would come into being, the result of the automatic translation of natural language into code.
It is clear that IBM was not the only one thinking along these lines. GPT-3, OpenAI’s industry-leading NLP model, has been used to let people code a website or app by writing a description of what they want. Shortly after IBM’s announcement, Microsoft announced that it had obtained exclusive rights to GPT-3.
Microsoft also owns GitHub, the largest collection of open source code on the Internet, which it acquired in 2018. The company has built on GitHub’s potential with GitHub Copilot, an AI assistant. When programmers enter the action they want to code, Copilot generates a code sample that could achieve what they specified. The programmer can then accept the AI-generated sample, modify it or reject it, greatly simplifying the coding process. Copilot is a big step towards NLC, but it’s not there yet.
Consequences of natural language coding
While NLC is not yet fully achievable, we are quickly heading into a future where coding is much more accessible to the average person. The implications are enormous.
First, there are implications for research and development. It is argued that the higher the number of potential innovators, the higher the innovation rate. By removing barriers to coding, the potential for innovation through programming expands.
In addition, academic disciplines as varied as computational physics and statistical sociology increasingly rely on custom computer programs to process data. Decreasing the skills required to create these programs would increase the ability of researchers in specialized fields outside of computer science to deploy such methods and make new discoveries.
However, there are also dangers. Ironically, one is the de-democratization of coding. Currently, many coding platforms exist. Some of these platforms offer a variety of features favored by different programmers, but none offers a decisive competitive advantage. A new programmer could easily use a free, bare-bones coding terminal and be at little disadvantage.
However, AI at the level required for NLC is not cheap to develop or deploy and is likely to be monopolized by large platform companies such as Microsoft, Google, or IBM. The service might be offered for a fee or, like most social media services, free of charge but under unfavorable or exploitative conditions of use.
There is also reason to believe that these technologies will be dominated by platform companies because of how machine learning works. Theoretically, programs like Copilot improve when they are exposed to new data: the more they are used, the better they become. This makes it harder for new competitors to catch up, even if they have a stronger or more ethical product.
Barring a serious counter-effort, it seems likely that the big capitalist conglomerates will be the gatekeepers of the next coding revolution.