🌅

Crowdsourcing photos for an AI classifier - Airtable, Google Colab, YOLO/TensorFlow!

image

Images and a fun article explaining what these magical creatures are, from https://redditblog.com/2016/01/13/wtf-is-a-nudibranch-and-why-is-it-so-cute/

I was informally asked by an acquaintance "Hi Cesar, how are you doing? I want to start putting nudibranch in a machine learning system, can you give me a pointer to freeware or something cheap?"

I was lucky to have lunch with Pratham Goradia, and we jotted down some ideas on how to build that for free or cheaply, ideally in a low-code or no-code environment.

This is what we came up with:

image
image

  1. Airtable Form - with all the fields below
  2. Airtable Database Structure (Columns)
    1. Image
    2. Link to the image file
    3. Date the photo was taken
    4. Location (GPS coordinates)
    5. Taxonomy (domain, kingdom, phylum, class, order, family, genus, species)
    6. Confidence in Taxonomic classification (0-100, 0 is not identifiable, 100 is 100% confident about the accuracy at the species level)
    7. Image Label (a .txt file, or raw text in a .csv-like syntax used for training AI). Labels can be created with a free and robust tool such as Roboflow, Makesense or IBM Cloud Annotation.
    8. Author name
    9. Author email
    10. Author Attribution and license
    11. Comments
These are the fields visible in the form. Of course, Airtable would also create default fields in the process, such as the submission timestamp and the IP of the poster.

Additional fields would be visible only to the admins, to help track whether each record was used for training or testing and, if so, with which tools and by which admin. These fields would become increasingly important as the database grows.
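To make the "Image Label" field more concrete, here is a minimal sketch of what one line of a YOLO-style label file looks like: a class number followed by the box center, width and height, all divided by the image size. The function name `yolo_label_line`, the class number 0 and the pixel values are made up for illustration. The free labeling tools listed above normally export this kind of file for you, so this is only to show what ends up in the .txt.

```python
def yolo_label_line(class_id, box, image_size):
    """Convert a pixel bounding box into one YOLO-style label line.

    box is (x_min, y_min, x_max, y_max) in pixels.
    image_size is (width, height) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    if not (0 <= x_min < x_max <= img_w and 0 <= y_min < y_max <= img_h):
        raise ValueError("box must lie inside the image and have a positive area")

    # YOLO labels store the box center and size as fractions of the image size.
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical example: one nudibranch (class 0) in a 400 x 500 pixel photo.
line = yolo_label_line(0, (100, 50, 300, 250), (400, 500))
print(line)
```

A photo with several animals simply gets one such line per animal in the same .txt file.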

Large Dataset

We also talked about the cost of hosting a large dataset. Airtable is great for starting at a small scale:

  • 1 GB with a single admin is free! So you can test this idea for free!
  • 5 GB with a single admin on Airtable is 5 USD / month.
  • 20 GB with a single admin on Airtable is 20 USD / month.

For a database larger than 20 GB it would make sense to host "cold storage" on a cheaper server, but that would complicate the architecture. Provisionally, the form could use the "a. Image" field for images hosted on Airtable, and the "b. Link to the image file" field for images hosted somewhere else. This would add complexity and raise the error rate: the link could be broken, the image could be in the wrong format, or the file it points to could be outright malicious...
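As a rough sketch of how an admin could catch the first two problems, the snippet below checks that a submitted link is a plain web link and that downloaded bytes start with the signature of a JPEG or PNG file. The function names are hypothetical and the download step is left out on purpose. This only shows that a file looks like an image; it does not prove that the file is safe.

```python
from urllib.parse import urlsplit

# Well-known file signatures ("magic bytes") at the start of image files.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def is_http_url(url):
    """Return True if the link is an http(s) URL with a host name."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def sniff_image_format(data):
    """Return "jpeg" or "png" based on the first bytes, else None."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None
```

A link that fails either check could be flagged in an admin-only field instead of being silently dropped, so that someone can follow up with the author.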

Naming of image & label file

By convention, for AI training an image and its label file have "the same name", and both sit in the same folder, ideally.

nudibranch1001.jpg → nudibranch1001.txt

Again, keeping some label files in Airtable and others outside of it might add complexity, or multiply the number of fields and workflows you have.

Downloading a subset of the Airtable database while making sure the names of the images and label files match will most likely need a small workflow from the admins, but it seems doable.
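That small workflow could include a check like the one sketched below, which looks at a download folder and reports which images have a matching label file and which do not. The function name and the list of image extensions are assumptions for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to what the form accepts.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def pair_images_and_labels(folder):
    """Match images and .txt label files in one folder by base name.

    Returns three sorted lists of base names:
    (paired, images_without_label, labels_without_image).
    """
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    images = {p.stem for p in files if p.suffix.lower() in IMAGE_SUFFIXES}
    labels = {p.stem for p in files if p.suffix.lower() == ".txt"}
    return (
        sorted(images & labels),
        sorted(images - labels),
        sorted(labels - images),
    )
```

For example, a folder holding nudibranch1001.jpg and nudibranch1001.txt gives one paired name, while a lone nudibranch1002.jpg shows up in the second list, so the admins know which photos still need a label.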

Advantages of Airtable

  1. Messy data: Airtable's structure offers a convenient way to collect heterogeneous data from a large group of people.
  2. Parsing data for AI training: the Airtable API should make it easy to pull only the part of the data to be used for training.
  3. Scalable and publishable: this Airtable database could easily be turned into a citizen-science type of website, and continue to collect data over time.
  4. Speed, uptime, admin, price: as the project grows, Airtable enables non-coders to help manage the data, with good uptime, a fast data rate and a relatively affordable price.
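To illustrate point 2, here is a hedged sketch in plain Python of pulling the records and keeping only those confident enough for training. It assumes the Airtable REST API as I understand it (a bearer token, a `records` list and an `offset` value for the next page), so check the current Airtable documentation before relying on it. The field names "Confidence" and "Image" and the threshold of 80 are hypothetical and must match your own table.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.airtable.com/v0"


def fetch_records(base_id, table_name, token):
    """Fetch every record of one table, following the `offset` pagination."""
    records, offset = [], None
    while True:
        url = f"{API_ROOT}/{base_id}/{urllib.parse.quote(table_name)}"
        if offset:
            url += "?" + urllib.parse.urlencode({"offset": offset})
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(request) as response:
            page = json.load(response)
        records.extend(page.get("records", []))
        offset = page.get("offset")
        if not offset:
            return records


def select_for_training(records, min_confidence=80):
    """Keep records that have an image and a high enough confidence score."""
    selected = []
    for record in records:
        fields = record.get("fields", {})
        confidence = fields.get("Confidence")  # hypothetical field name
        has_image = bool(fields.get("Image"))  # hypothetical field name
        if isinstance(confidence, (int, float)) and confidence >= min_confidence and has_image:
            selected.append(record)
    return selected
```

Keeping the filter separate from the download means the same selection can be rerun with a different threshold without calling the API again.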

Quick search: it seems that some people use that technique already - and even turned this idea into a startup!

Training of AI

A model could then be trained on the collected data for free on Google Colab, using a free AI project such as YOLO or TensorFlow. I found a YouTuber who explains how to parse data from Airtable with simple Python.
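Before training, the dataset is usually split so that some photos are held back to check how well the model does on images it has never seen. Below is a minimal sketch of such a split in plain Python. The function name, the 20% validation share and the seed are assumptions, not something required by YOLO or TensorFlow.

```python
import random


def split_train_val(names, val_fraction=0.2, seed=42):
    """Split base names into (train, val) lists in a repeatable way."""
    names = sorted(names)
    # A fixed seed gives the same split on every run, which helps when
    # recording in the admin fields which photos were used for what.
    random.Random(seed).shuffle(names)
    n_val = round(len(names) * val_fraction)
    if len(names) > 1:
        n_val = max(1, n_val)  # always hold back at least one photo
    return names[n_val:], names[:n_val]


# Hypothetical file names following the naming convention of this post.
train, val = split_train_val([f"nudibranch{i}" for i in range(1001, 1011)])
print(len(train), "for training,", len(val), "for validation")
```

The two lists can then be written back to the admin-only fields, so the database keeps track of which photos were used for training and which for testing.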

There are a ton of videos that explain how to take a dataset and train a model on it with Google Colab:

This video looks quite clear:

If all of this seems a pain, we used

This idea comes from knowledge that's about 6 months old, and this space is moving really fast, so there might be ready-made solutions I am not aware of. Of course, you can also do it all on Google Cloud or Amazon SageMaker... but that seems to require more computer science / engineering abilities.

I want to check out nudies too!!! I can't wait to have the time to learn to dive and check them out !!!

I hope this helps!