Thread: if data is the source code of AI, what tools could we steal from the Software 1.0 world?

Here are 10 examples.
I'll mostly be using an image classification example, but most of these should apply to the dataset of any supervised learning task.
Versioning (git): new data and labels are committed in batches, not individually. Can reproduce dataset at any commit in history.

You can experiment in a branch and merge back later. Or, create a pull request to get people to review the new labels. (1/10)
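A minimal sketch of what "commit in batches, reproduce at any commit" could look like. Everything here is hypothetical (the `DatasetRepo` class, the file names, the in-memory storage); a real tool would store snapshots on disk and support branches.

```python
import hashlib
import json


class DatasetRepo:
    """Toy in-memory store: each commit is a full snapshot of {item_id: label}."""

    def __init__(self):
        self.commits = {}  # commit id -> {"parent": id or None, "labels": dict}
        self.head = None

    def commit(self, batch):
        """Apply a batch of label changes on top of HEAD; return the new commit id."""
        parent_labels = self.commits[self.head]["labels"] if self.head else {}
        labels = {**parent_labels, **batch}
        # The id is a hash of the parent id plus the full snapshot, git-style.
        payload = json.dumps({"parent": self.head, "labels": labels}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
        self.commits[commit_id] = {"parent": self.head, "labels": labels}
        self.head = commit_id
        return commit_id

    def checkout(self, commit_id):
        """Return a copy of the dataset exactly as it was at the given commit."""
        return dict(self.commits[commit_id]["labels"])


repo = DatasetRepo()
first = repo.commit({"img_001.jpg": "cat", "img_002.jpg": "dog"})
second = repo.commit({"img_002.jpg": "wolf", "img_003.jpg": "cat"})  # relabel + add
```

`repo.checkout(first)` still returns the original two labels, even though `img_002.jpg` was relabelled in the second commit.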
Interface for collaborating on versioned datasets (GitHub): a nice, web-based interface for browsing data points and labels, viewing historical versions, collaborating, issue tracking, and reviewing PRs.

A robust marketplace for add-ons.

Reasonable free tier and <20USD/mo pricing. (2/10)
CI/CD: YAML-based configuration of how to compile the dataset into a trained model, run automated tests, and how and where to deploy it after that.

Related: http://github.com/uber/ludwig  (3/10)
Environments (dev, staging, prod, ...): deploy any branch/commit to play around with model there, and understand how downstream models/software behave with this version. (4/10)
Model dependency management: train on top of public pre-trained models.

Dataset dependency management: add to training set any version of any public dataset, or a combination thereof. (5/10)
Automatic "unit tests" for a trained model: make sure it performs on specific corner cases. (6/10) https://twitter.com/taivopungas/status/1293124438079287296
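A rough sketch of what such "unit tests" might look like. The `predict` function is a hypothetical stand-in for a trained classifier, and the corner cases are made up for illustration; in practice each case would be a real image the model must not get wrong.

```python
def predict(image):
    """Hypothetical stand-in for a trained model; 'image' is a dict of toy features."""
    if image.get("has_whiskers") and not image.get("barks"):
        return "cat"
    return "dog"


# Each corner case: (human-readable name, input, expected label).
CORNER_CASES = [
    ("cat photographed at night", {"has_whiskers": True, "barks": False, "dark": True}, "cat"),
    ("dog wearing a cat costume", {"has_whiskers": True, "barks": True}, "dog"),
]


def run_corner_case_tests(model, cases):
    """Return the names of corner cases on which the model's prediction is wrong."""
    return [name for name, image, expected in cases if model(image) != expected]


failures = run_corner_case_tests(predict, CORNER_CASES)
```

A CI job could fail the build whenever `failures` is non-empty, the same way a failing unit test blocks a code merge.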
Automatic integration testing: make sure the whole system (including APIs, preprocessing, postprocessing) produces the desired result.

On a statistically meaningful number of examples, i.e. 1000s or more. (7/10)
Code auto-complete: when looking at an unlabelled image, show a preview of the suggested label; pressing Tab applies it.

Alternatively, having labelled one image, show 5 most similar ones with suggested labels; pressing Tab adds & labels them all. (8/10)
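One way the "most similar ones" suggestion could work, sketched under the assumption that every image already has an embedding vector (e.g. from a pre-trained model). The function names and the tiny 2-d vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_similar(labelled_vec, label, unlabelled, k=5):
    """Return up to k (item_id, suggested_label) pairs, most similar first."""
    ranked = sorted(
        unlabelled,
        key=lambda item_id: cosine(labelled_vec, unlabelled[item_id]),
        reverse=True,
    )
    return [(item_id, label) for item_id in ranked[:k]]


# Toy embeddings for three unlabelled images.
unlabelled = {
    "img_a": [0.9, 0.1],
    "img_b": [0.1, 0.9],
    "img_c": [0.8, 0.3],
}

# The user has just labelled an image with embedding [1.0, 0.0] as "cat".
suggestions = suggest_similar([1.0, 0.0], "cat", unlabelled, k=2)
```

Pressing Tab would then accept everything in `suggestions` at once; here that is `img_a` and `img_c`, both proposed as "cat".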
Linter: automatically detect and warn about...
* obviously mistaken labels (high training loss)
* extreme class imbalance
* extreme violation of priors (tracked object suddenly jumps off path, bounding box too small/large, etc) (9/10)
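The three checks above could be sketched roughly like this. The record fields (`loss`, `label`, `box_size`) and all thresholds are assumptions picked for illustration, not recommended values.

```python
from collections import Counter


def lint_dataset(examples, loss_threshold=2.0, imbalance_ratio=10.0,
                 min_box_area=0.001, max_box_area=0.9):
    """Return a list of (target, message) warnings for a labelled dataset.

    Each example is a dict with 'id', 'label', 'loss' (per-example training
    loss) and 'box_size' (width, height as fractions of the image).
    """
    warnings = []
    for ex in examples:
        # High training loss often means the label disagrees with the model.
        if ex["loss"] > loss_threshold:
            warnings.append((ex["id"], "suspicious label (high training loss)"))
        # Prior: a bounding box should be neither tiny nor nearly the full frame.
        width, height = ex["box_size"]
        if not (min_box_area <= width * height <= max_box_area):
            warnings.append((ex["id"], "bounding box size violates prior"))
    counts = Counter(ex["label"] for ex in examples)
    if counts and max(counts.values()) / min(counts.values()) > imbalance_ratio:
        warnings.append(("dataset", "extreme class imbalance"))
    return warnings


examples = [
    {"id": f"cat_{i}", "label": "cat", "loss": 0.1, "box_size": (0.5, 0.5)}
    for i in range(12)
]
examples.append({"id": "cat_tiny", "label": "cat", "loss": 0.2, "box_size": (0.01, 0.01)})
examples.append({"id": "dog_0", "label": "dog", "loss": 3.5, "box_size": (0.5, 0.5)})

warnings = lint_dataset(examples)
```

On this toy data the linter flags `dog_0` for high loss, `cat_tiny` for its box size, and the dataset as a whole for a 13:1 class ratio.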
IDE: combine all the tools above (and more) into an opinionated and/or highly configurable power-user interface.

Could have a broad range of options: compare how Python people use IDLE, PyCharm, Jupyter notebooks, vim, etc. (10/10)
Some tools don't transfer easily. How about code formatting? Syntax highlighting?

I'm sure there are other common software tools that could be transferred. What am I missing?
You can follow @taivopungas.