Overall, your training data should be representative of each of your classes, and the set of samples for a class should capture the variability within that class. Each individual training sample should be homogeneous and should not overlap any other class. Avoid "fuzzy" boundaries as well: don't draw a sample right up to the edge of a class, where pixels are likely to be mixed. Each training sample should also contain a sufficient number of pixels, especially if you are using the maximum likelihood classifier. For example, a 10 by 10 block of pixels gives 100 pixels, which is a reasonable size for a training polygon and is large enough to be statistically meaningful.
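To make the size and homogeneity points concrete, here is a minimal sketch in plain Python. It is not an ArcGIS Pro tool or API: the function name `summarize_sample`, the sample values, and the use of 100 pixels as a threshold (borrowed from the 10 by 10 example) are all assumptions for illustration. It reports the pixel count and the spread of values for one single-band training sample, so a very small or very mixed sample stands out.

```python
from statistics import mean, pstdev

# Assumed threshold, taken from the 10 by 10 example; adjust for your data.
MIN_PIXELS = 100


def summarize_sample(pixel_values, min_pixels=MIN_PIXELS):
    """Return size and spread statistics for one single-band training sample."""
    count = len(pixel_values)
    return {
        "pixel_count": count,
        "mean": mean(pixel_values),
        # A low standard deviation suggests a homogeneous sample.
        "std_dev": pstdev(pixel_values),
        "large_enough": count >= min_pixels,
    }


# Hypothetical values for a 10 by 10 training polygon (100 pixels).
forest_sample = [52, 54, 53, 55, 51] * 20
summary = summarize_sample(forest_sample)
print(summary)
```

A sample with a much higher standard deviation than the other samples of the same class may be straddling a class boundary and is worth redrawing.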
Yes, you should have multiple samples of each class to show variability. Rather than aiming for a specific number of training areas, focus on how good and representative those areas are. That being said, the documentation below states that "Parametric classifiers, such as the maximum likelihood classifier, need a statistically significant number of samples to produce a meaningful probability density function. To achieve statistically significant samples, you should have 20 or more samples per class."
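If you want a quick way to check your samples against that guideline, the sketch below counts training samples per class and lists the classes that fall short. It is plain Python rather than anything from ArcGIS Pro. The function name `classes_below_minimum` and the class labels are hypothetical, and it assumes you have already exported one class label per training polygon.

```python
from collections import Counter

# Figure quoted from the documentation: 20 or more samples per class.
MIN_SAMPLES_PER_CLASS = 20


def classes_below_minimum(sample_labels, minimum=MIN_SAMPLES_PER_CLASS):
    """Return {class_name: sample_count} for classes with fewer than `minimum` samples."""
    counts = Counter(sample_labels)
    return {name: n for name, n in counts.items() if n < minimum}


# Hypothetical labels, one entry per training polygon.
labels = ["water"] * 25 + ["forest"] * 20 + ["urban"] * 8
print(classes_below_minimum(labels))
```

With these made-up labels only "urban" is flagged, since it has 8 samples. Passing the count check says nothing about quality, so it complements rather than replaces a visual review of the samples.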
It sounds like you may be using a pixel-based classification approach. I would also look into using Image Segmentation to classify objects. Instead of classifying individual pixels, that process classifies segments, which can be thought of as super pixels. Each segment, or super pixel, is represented by a set of attributes that the classifier tools use to produce the classified image.
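To illustrate the idea of a segment being described by attributes, here is a small sketch in plain Python. It does not reproduce how ArcGIS Pro segments an image or which attributes it computes. The function `segment_attributes`, the tiny rasters, and the choice of pixel count and mean value as attributes are all assumptions. It takes a grid of segment IDs and a matching grid of single-band pixel values and summarizes each segment.

```python
from collections import defaultdict
from statistics import mean


def segment_attributes(segment_ids, pixel_values):
    """Group pixels by segment ID and return simple per-segment attributes."""
    grouped = defaultdict(list)
    for seg_row, val_row in zip(segment_ids, pixel_values):
        for seg_id, value in zip(seg_row, val_row):
            grouped[seg_id].append(value)
    return {
        seg_id: {"pixel_count": len(vals), "mean": mean(vals)}
        for seg_id, vals in grouped.items()
    }


# Hypothetical 3 by 4 raster: segment IDs and single-band pixel values.
segments = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 3, 2],
]
values = [
    [10, 12, 40, 42],
    [11, 11, 41, 43],
    [90, 92, 91, 44],
]
attrs = segment_attributes(segments, values)
print(attrs)
```

A classifier working on segments would then be trained on rows like these, one per segment, instead of on one row per pixel.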
Resources
Use Training Samples Manager: https://pro.arcgis.com/en/pro-app/latest/help/analysis/image-analyst/training-samples-manager.htm#:~....
Understanding segmentation and classification: https://pro.arcgis.com/en/pro-app/latest/tool-reference/spatial-analyst/understanding-segmentation-a...
Segmentation: https://pro.arcgis.com/en/pro-app/latest/help/analysis/image-analyst/segmentation.htm