Our design patterns highlight key design opportunities for AI products. They’re organized around common questions in the product development process to help you find what you need.
How do I responsibly build my dataset?
Invest early in good data practices
The better your data planning and collection processes, the higher the quality of your end output.
AI-powered products can suffer when data doesn’t get the right level of focus and resources early on. We call these downstream effects “data cascades,” and they can be hard to detect and diagnose until your product experience is impacted.
Good planning and scrutiny of your dataset can help you avoid issues downstream. Some actions that you can take include:
The real world is messy! Expect the same from the data that you gather.
As you develop your training dataset, don’t strive for something perfectly curated. Instead, allow some “noise” to make the data as similar as possible to the real-world data you expect to get from your users. This can help head off errors and poor-quality recommendations once you release your model into the real world.
To do this, think about the types of data that you expect to get from your users, and then ensure that data is represented in your training set.
For example, for an image recognition system, consider the data you might get from your users. If they’re unlikely to take the time to capture high-quality photographs and your model will have to work with blurry smartphone images, include blurry images in your training data.
Allow for less-than-perfect results in your dataset because they will show up in real-world situations too.
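As a concrete sketch of what this can look like, the snippet below uses Python and torchvision to add blur and lighting variation to clean training images so they better match real-world user photos. The library choice, parameters, and probabilities are illustrative assumptions, not part of this guidance.

```python
import torchvision.transforms as T

# A minimal sketch: transform a "clean" image dataset so it better
# resembles the blurry, unevenly lit photos users are likely to submit.
# The specific parameters and probabilities are illustrative assumptions.
realworld_noise = T.Compose([
    # Blur roughly half the images, mimicking hurried smartphone shots.
    T.RandomApply([T.GaussianBlur(kernel_size=9, sigma=(0.5, 3.0))], p=0.5),
    # Vary brightness and contrast to mimic poor lighting.
    T.ColorJitter(brightness=0.4, contrast=0.4),
    T.ToTensor(),
])

# Applied per sample at training time, e.g.:
# from torchvision.datasets import ImageFolder
# dataset = ImageFolder("plants/", transform=realworld_noise)
```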
Make sure that data labelers have well-designed tools and workflows.
For supervised learning, accurate data labels are a crucial ingredient for relevant ML output. Labels can be added through automated processes or by people known as labelers.
Labeling tools range from in-product prompts to specialized software. If you’re working with labelers, it’s worth investing time upfront in selecting or designing the tools, workflows, and instructions. The best way to do this is often in collaboration with the labelers themselves.
When labelers understand what you’re asking them to label, and why, and they have the tools to do so effectively, they’re more likely to label the data correctly. And as partners in the process, they can also help you improve your labeling tasks overall.
For example: provide shortcuts to optimize key flows, make labels easy to access, let raters change their minds, and auto-detect and display errors.
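As a minimal, hypothetical sketch of the last two points, the Python below shows a toy workflow where invalid labels are flagged the moment they’re entered and raters can simply relabel an item. The label set and class are invented for illustration and don’t correspond to any particular labeling tool.

```python
from dataclasses import dataclass, field

# Hypothetical label set for a plant-health task; not from any real tool.
ALLOWED_LABELS = {"healthy", "unhealthy", "unknown"}

@dataclass
class LabelingSession:
    """Toy labeling workflow: raters can revise their answers, and obvious
    errors are surfaced at input time rather than after export."""
    labels: dict = field(default_factory=dict)   # item_id -> current label
    errors: list = field(default_factory=list)

    def label(self, item_id: str, label: str) -> None:
        if label not in ALLOWED_LABELS:
            # Auto-detect and display the error immediately.
            self.errors.append(f"{item_id}: {label!r} is not a valid label")
            return
        # Letting raters change their minds is a simple overwrite here;
        # a real tool would also keep the revision history.
        self.labels[item_id] = label

session = LabelingSession()
session.label("img_001", "helthy")   # typo is caught right away
session.label("img_001", "healthy")  # the rater corrects themselves
print(session.errors)  # ["img_001: 'helthy' is not a valid label"]
```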
Get input from domain experts as you build your dataset
Building partnerships with domain experts early can help reduce iterations on your dataset later on.
When creating your own dataset, make time early on to observe a domain expert your product aims to serve — for example, watch an accountant analyze financial data, or a botanist classify plants. This can give you valuable insights about the types of data that they use to solve the problem your product is addressing.
To identify the right partners, work with user researchers, or others with experience in identifying and interviewing domain experts.
Aim for sustained relationships with domain experts throughout the project lifecycle (rather than one-off consultations), whenever possible.
Do you have domain experts who can help highlight data issues throughout the development lifecycle?
Understand differences in how labelers interpret and apply labels to prevent problems later on.
When you encounter labels that are “messy,” unexpected, or hard to reconcile, don’t categorically discard them as “noisy.” Take time to investigate whether issues with labeler tools, workflows, instructions, or overall data strategy are causing these label problems.
For example, say you’re training a model to flag toxic comments. Your labelers might apply different toxicity labels based on their personal experience, which can lead to discrepancies.
These disagreements in labels offer an opportunity to identify deeper data and/or labeling issues that you may need to address to ensure data quality.
For instance, one labeler might label the plant without additional descriptors, while another labels that the plant looks unhealthy. Dismissing the second label as noise misses opportunities to improve how your system can perform.
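One way to spot such disagreements systematically is to measure inter-rater agreement. The sketch below uses Python and scikit-learn, with made-up ratings, to compute Cohen’s kappa for two raters’ toxicity labels and to surface the specific comments they disagree on.

```python
from sklearn.metrics import cohen_kappa_score

# Two raters' toxicity labels (1 = toxic, 0 = not) for the same ten
# comments; the data here is made up for illustration.
rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
rater_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Cohen's kappa corrects raw agreement for chance. A low score is a
# signal to investigate tools, instructions, or the task definition
# before treating the disagreements as noise.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Inter-rater agreement (kappa): {kappa:.2f}")

# Surface the specific items raters disagree on for closer review,
# rather than discarding them.
disagreements = [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]
print("Items to review:", disagreements)
```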